CN111814006A

CN111814006A - Analysis method and device of graph network structure and computer equipment

Info

Publication number: CN111814006A
Application number: CN202010733106.8A
Authority: CN
Inventors: 刘利
Original assignee: OneConnect Financial Technology Co Ltd Shanghai
Current assignee: OneConnect Smart Technology Co Ltd; OneConnect Financial Technology Co Ltd Shanghai
Priority date: 2020-07-27
Filing date: 2020-07-27
Publication date: 2020-10-23

Abstract

The application relates to the field of artificial intelligence and discloses a method for analyzing a graph network structure. A graph network structure to be analyzed is acquired according to an execution script of a graph analysis task, wherein the graph network structure comprises a vertex set and an edge set, each vertex in the vertex set carries a first type label and a first attribute corresponding to that vertex, each edge in the edge set is formed by connecting two vertices that have an association relation, and each edge carries a second type label and a second attribute corresponding to that edge; each vertex and each edge in the graph network structure is mapped to a specified data set; a graph operator corresponding to the graph analysis task is acquired according to the execution script; and the specified data set corresponding to the graph network structure is analyzed and calculated by using the graph operator to obtain an analysis result. By establishing a graph network structure with extended attributes and converting graph operations into data set operations, the method enables the analysis of a single graph or a graph set and provides a new function for graph analysis.

Description

Analysis method and device of graph network structure and computer equipment

Technical Field

The present application relates to the field of artificial intelligence, and in particular, to a method, an apparatus, and a computer device for analyzing a graph network structure.

Background

A graph expresses objects and the connections between them intuitively, but for a computer to process a graph, the graph must first be modeled so that the information in it can be identified and the relationships between its objects can be calculated. Graph analysis plays an important role in research and industry; for example, a graph can represent the community relationships in a social network. Complex graph analysis problems, such as ranking websites or analyzing social networks, require the integration of multiple analysis operations, and a graph data model is a prerequisite for executing graph algorithms. At present, large-scale graph computation is implemented as a library on top of a data flow framework, such as GraphX on Apache Spark or Gelly on Apache Flink, and performs specific graph computations by combining the general data transformation operators provided by the underlying framework; however, such libraries do not support data sets and cannot solve complex analysis tasks.

Disclosure of Invention

The main object of the application is to provide a method for analyzing a graph network structure, so as to solve the technical problem that existing graph computation does not support data sets and cannot solve complex analysis tasks.

The application provides an analysis method of a graph network structure, which comprises the following steps:

acquiring a graph network structure to be analyzed according to an execution script of a graph analysis task, wherein the graph network structure comprises a vertex set and an edge set, each vertex in the vertex set carries a first type label and a first attribute corresponding to the vertex respectively, each edge in the edge set is formed by connecting two vertexes with an association relation, and each edge carries a second type label and a second attribute corresponding to the edge respectively;

respectively mapping each vertex and each edge in the graph network structure to a specified data set;

acquiring a graph operator corresponding to the graph analysis task according to the execution script;

and analyzing and calculating the designated data set corresponding to the graph network structure by utilizing the graph operator to obtain an analysis result.

Preferably, the step of obtaining the graph network structure to be analyzed according to the execution script of the graph analysis task includes:

acquiring graph elements of the graph network structure to be analyzed, wherein the graph elements are carried in the execution script of the graph analysis task and comprise a graph head;

extracting a graph head data set corresponding to the graph head from a specified database according to the graph head;

determining whether the graph head data set includes a plurality of objects;

if not, determining that the graph network structure to be analyzed is a logical graph with a single object as its graph head;

acquiring vertex data and edge data associated with the logical graph from the data set, and forming a vertex set and an edge set respectively;

and representing the logical graph through the vertex set and the edge set to obtain the graph network structure to be analyzed.

Preferably, the specified data set comprises a Flink data set, and after the step of acquiring a graph operator corresponding to the graph analysis task according to the execution script, the method comprises:

acquiring the execution order of the graph operators;

linking the graph operators into operation logic in the Java programming language according to the execution order;

and associating the operation logic with the trigger of a Flink trigger program to form a Flink operation program.

Preferably, the graph operator includes an exclusion operator, and the step of performing analysis calculation on the specified data set corresponding to the graph network structure by using the graph operator to obtain an analysis result includes:

acquiring a first vertex set and a first edge set corresponding to a first logical graph, and acquiring a second vertex set and a second edge set corresponding to a second logical graph;

extracting an identifier of the second logical graph from the second logical graph through the function ID_ONLY to form a data set corresponding to a new id object;

screening specified vertices and specified edges through the exclusion operator to obtain an exclusion result data set corresponding to the exclusion operator, wherein the specified vertices are contained in the first vertex set and not contained in the second vertex set, and the specified edges are properly contained in the first edge set;

and sending the exclusion result data set to all nodes of the Apache Flink framework cluster through Flink broadcasting.

Preferably, the graph network structure is a social network graph composed of a plurality of logical graphs, the graph operators include relational operators provided by Flink, and the step of analyzing and calculating the specified data set corresponding to the graph network structure by using the graph operators to obtain an analysis result includes:

identifying whether the relational operator selected by a user is to find the maximum common subgraph between logical graphs in the social network graph;

if yes, aggregating the number of vertices of each logical graph in the social network graph;

selecting, according to the number of vertices of each logical graph, specified logical graphs whose number of vertices is higher than a minimum vertex count;

and obtaining the maximum common subgraph between the logical graphs in the social network graph through a label propagation algorithm based on the specified logical graphs.

Preferably, there are at least two graph operators, and the step of analyzing and calculating the specified data set corresponding to the graph network structure by using the graph operators to obtain an analysis result includes:

acquiring a specified Flink program corresponding to the plurality of graph operators;

acquiring the current running node of the specified Flink program, and predicting the data sets on which operations can be executed concurrently;

distributing the data sets on which operations are executed concurrently to specified machines in the Apache Flink framework cluster;

receiving the calculation result obtained by each specified machine from executing its distributed data set;

summarizing the calculation results to obtain the analysis results of the plurality of graph operators;

and sending the analysis results of the plurality of graph operators to all nodes of the Apache Flink framework cluster through Flink broadcasting.

Preferably, the specified Flink program includes an aggregation operation, a transformation operation and a pattern matching operation on a specific logical graph, and the step of distributing the data sets on which operations are executed concurrently to specified machines in the Apache Flink framework cluster includes:

acquiring the data set corresponding to the specific logical graph, an aggregation function corresponding to the aggregation operation, a conversion function corresponding to the transformation operation and a matching function corresponding to the pattern matching operation;

distributing the data set corresponding to the specific logical graph and the aggregation function corresponding to the aggregation operation to a first machine in the cluster, distributing the data set corresponding to the specific logical graph and the conversion function corresponding to the transformation operation to a second machine in the cluster, and distributing the data set corresponding to the specific logical graph and the matching function corresponding to the pattern matching operation to a third machine in the cluster;

and controlling the first machine, the second machine and the third machine to run concurrently.

The present application also provides an analysis apparatus for a graph network structure, including:

a first acquisition module, configured to acquire a graph network structure to be analyzed according to an execution script of a graph analysis task, wherein the graph network structure comprises a vertex set and an edge set, each vertex in the vertex set carries a first type label and a first attribute corresponding to that vertex, each edge in the edge set is formed by connecting two vertices that have an association relation, and each edge carries a second type label and a second attribute corresponding to that edge;

a mapping module, configured to map each vertex and each edge in the graph network structure to a specified data set respectively;

a second acquisition module, configured to acquire a graph operator corresponding to the graph analysis task according to the execution script;

and a calculation module, configured to analyze and calculate the specified data set corresponding to the graph network structure by using the graph operator to obtain an analysis result.

The present application further provides a computer device comprising a memory and a processor, wherein the memory stores a computer program, and the processor implements the steps of the above method when executing the computer program.

The present application also provides a computer-readable storage medium having stored thereon a computer program which, when being executed by a processor, carries out the steps of the method as described above.

The application establishes a graph network structure with extended attributes, which gives graphs richer semantics and supports a plurality of graphs with different attributes. It further provides declarative and composable graph operators based on this graph network structure, implements the graph network structure with extended attributes on Apache Flink, and converts graph operations into data set operations, thereby enabling the analysis of a single graph or a graph set, providing a new function for graph analysis, and making it possible to process large-scale data and complex graphs.

Drawings

Fig. 1 is a schematic flow chart illustrating an analysis method of a graph network structure according to an embodiment of the present application;

Fig. 2 is a schematic diagram of an analysis apparatus for a graph network structure according to an embodiment of the present application;

Fig. 3 is a schematic diagram of an internal structure of a computer device according to an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.

Referring to Fig. 1, a method for analyzing a graph network structure according to an embodiment of the present application includes:

S1: acquiring a graph network structure to be analyzed according to an execution script of a graph analysis task, wherein the graph network structure comprises a vertex set and an edge set, each vertex in the vertex set carries a first type label and a first attribute corresponding to that vertex, each edge in the edge set is formed by connecting two vertices that have an association relation, and each edge carries a second type label and a second attribute corresponding to that edge;

S2: mapping each vertex and each edge in the graph network structure to a specified data set respectively;

S3: acquiring a graph operator corresponding to the graph analysis task according to the execution script;

S4: analyzing and calculating the specified data set corresponding to the graph network structure by using the graph operator to obtain an analysis result.

The graph network structure in the application has extended attributes and consists of a vertex set V and an edge set E. The vertex set V comprises vertices of various types, and each vertex is described by its first type label and first attribute. For example, the graph network structure with extended attributes is a social network graph, represented as a directed graph $G = \langle V, E \rangle$ consisting of a vertex set $V$ and an edge set $E$; for instance, the graph consists of the vertex set $V = \{v_0, \ldots, v_9\}$ and the edge set $E = \{e_0, \ldots, e_{19}\}$. The vertices can be of various types, including vertices representing a person (Person), vertices representing a forum (Forum) and vertices representing an interest (Tag). Vertices of different types are described by their respective first type labels and first attributes. The edges describe the relationships between vertices and likewise carry a second type label (e.g., knows) and second attributes. Vertices with the same type label may have different attributes; for example, of the vertices $v_0$ and $v_1$, $v_1$ may have an age attribute while $v_0$ does not, and which attributes are present also depends on the type of the vertex. The type label and attributes of each vertex and of each edge can be configured through a configuration file. This yields a graph network structure with extended attributes whose semantics are richer, which supports a plurality of graphs with different attributes, carries more information and can express more complex relationships.

The application maps each vertex and each edge in the graph network structure to a specified data set of a specified data type, such as a Flink data set, so that a complex graph network structure can be analyzed and calculated by analyzing the relationships between the specified data sets and computing the analysis result corresponding to a graph operator, which opens up new functional areas for graph analysis.

The application thus establishes a graph network structure with extended attributes, which gives graphs richer semantics and supports a plurality of graphs with different attributes. It further provides declarative and composable graph operators based on this graph network structure, implements the graph network structure with extended attributes on Apache Flink, and converts graph operations into data set operations, thereby enabling the analysis of a single graph or a graph set, providing a new function for graph analysis, and making it possible to process large-scale data and complex graphs.
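The data model described above can be illustrated with a minimal Python sketch. It is not the Flink implementation of the application: the class names, the `to_datasets` helper and the age value are assumptions made for illustration, and plain Python lists stand in for the specified data sets.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Vertex:
    """A vertex with a type label and free-form attributes (hypothetical class)."""
    id: int
    label: str                                   # first type label, e.g. "Person"
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Edge:
    """A directed edge with its own type label and attributes (hypothetical class)."""
    id: int
    label: str                                   # second type label, e.g. "knows"
    source: int
    target: int
    properties: Dict[str, Any] = field(default_factory=dict)


def to_datasets(vertices, edges):
    """Map vertices and edges to two flat record lists (stand-ins for data sets)."""
    vertex_rows = [(v.id, v.label, v.properties) for v in vertices]
    edge_rows = [(e.id, e.label, e.source, e.target, e.properties) for e in edges]
    return vertex_rows, edge_rows


vertices = [
    Vertex(0, "Person", {"name": "Alice"}),            # no age attribute
    Vertex(1, "Person", {"name": "Bob", "age": 30}),   # same label, extra attribute (made-up value)
    Vertex(2, "Tag", {"name": "Databases"}),
]
edges = [Edge(0, "knows", 0, 1, {"since": 2014})]

vertex_rows, edge_rows = to_datasets(vertices, edges)
```

Keeping the attributes in a dictionary is what lets two vertices with the same label carry different attribute sets.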

Further, the step S1 of obtaining the graph network structure to be analyzed according to the execution script of the graph analysis task includes:

S10: acquiring graph elements of the graph network structure to be analyzed, wherein the graph elements are carried in the execution script of the graph analysis task and comprise a graph head;

S11: extracting a graph head data set corresponding to the graph head from a specified database according to the graph head;

S12: determining whether the graph head data set includes a plurality of objects;

S13: if not, determining that the graph network structure to be analyzed is a logical graph with a single object as its graph head;

S14: acquiring vertex data and edge data associated with the logical graph from the data set, and forming a vertex set and an edge set respectively;

S15: representing the logical graph through the vertex set and the edge set to obtain the graph network structure to be analyzed.

This embodiment takes the case in which the graph network structure to be analyzed is a logical graph as an example, and details the process of obtaining the data set of the graph network structure from a database and representing the graph network structure. A logical graph is a special case of a graph set in which the graph head data set contains only a single object and all data in the graph is connected. A logical graph also has a type label (e.g., Community) and properties, which can describe specific metrics (e.g., vertexCount: 3) or general information about the logical graph (e.g., interest: Databases). The type label and properties of a logical graph can be declared explicitly, and can also be the output of a graph algorithm or serve as input to a subsequent operator, for example in community detection or graph pattern matching.

In the application, the predefined database $DB = \langle V, E, L \rangle$ consists of a vertex set $V = \{v_i\}$, an edge set $E = \{e_k\}$ and a set of logical graphs $L = \{G_m\}$. A logical graph $G_m = \langle V_m, E_m \rangle$ is an ordered pair of a subset of the vertex set, $V_m \subseteq V$, and a subset of the edge set, $E_m \subseteq E$. The graph head represents the relationship between data and a particular logical graph: if the graph head of the graph network structure to be analyzed refers to only one object, the whole graph network structure to be analyzed is a logical graph whose data is connected or associated.

The logical graph of the application has extended attributes. The extended attributes of the graph head comprise an identifier, a label and data properties, expressed as GraphHead := <Id, Label, Properties>. The extended attributes of a vertex comprise an identifier, a label, the vertex data properties and the identifiers of the graphs to which the vertex belongs, expressed as Vertex := <Id, Label, Properties, GraphIds>. The extended attributes of an edge comprise an identifier, a label, the identifier of the source vertex, the identifier of the target vertex, the edge data properties and the identifiers of the graphs to which the edge belongs, expressed as Edge := <Id, Label, SrcId, TrgtId, Properties, GraphIds>. For example, a logical graph g0 is represented as:

g0 = {
  DataSet<GraphHead> graphHead = {<0, 'Community', {'interest': 'Databases', ...}>},
  DataSet<Vertex> vertices = {<0, 'Person', {'name': 'Alice', ...}, {0, 2}>, <1, 'Person', {'name': 'Bob', ...}, {0, 2}>},
  DataSet<Edge> edges = {<0, 'knows', 0, 1, {'since': 2014}, {0, 2}>, <1, 'Person', 1, 0, {'since': 2014}, {0, 2}>}
}
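Steps S10 to S15 and the three record types can be sketched in Python as follows. This is only an illustration under stated assumptions: the function `load_logical_graph`, the field names and the third vertex ("Carol") are hypothetical, and in-memory lists replace the specified database and the Flink data sets.

```python
from typing import Any, Dict, FrozenSet, NamedTuple


class GraphHead(NamedTuple):          # GraphHead := <Id, Label, Properties>
    id: int
    label: str
    properties: Dict[str, Any]


class Vertex(NamedTuple):             # Vertex := <Id, Label, Properties, GraphIds>
    id: int
    label: str
    properties: Dict[str, Any]
    graph_ids: FrozenSet[int]


class Edge(NamedTuple):               # Edge := <Id, Label, SrcId, TrgtId, Properties, GraphIds>
    id: int
    label: str
    src_id: int
    trgt_id: int
    properties: Dict[str, Any]
    graph_ids: FrozenSet[int]


def load_logical_graph(heads, all_vertices, all_edges):
    """Return (head, vertices, edges) if `heads` holds exactly one object, else None."""
    if len(heads) != 1:
        return None                   # several heads: a graph set, not a single logical graph
    head = heads[0]
    # Keep only the vertices and edges whose GraphIds reference this graph head.
    vertices = [v for v in all_vertices if head.id in v.graph_ids]
    edges = [e for e in all_edges if head.id in e.graph_ids]
    return head, vertices, edges


heads = [GraphHead(0, "Community", {"interest": "Databases"})]
all_vertices = [
    Vertex(0, "Person", {"name": "Alice"}, frozenset({0, 2})),
    Vertex(1, "Person", {"name": "Bob"}, frozenset({0, 2})),
    Vertex(2, "Person", {"name": "Carol"}, frozenset({1})),   # hypothetical vertex outside graph 0
]
all_edges = [Edge(0, "knows", 0, 1, {"since": 2014}, frozenset({0, 2}))]

result = load_logical_graph(heads, all_vertices, all_edges)
```

Storing the graph identifiers on each vertex and edge lets one element belong to several overlapping logical graphs without being duplicated.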

Further, the step S3 of obtaining a graph operator corresponding to the graph analysis task according to the execution script includes:

S31: acquiring the execution order of the graph operators;

S32: linking the graph operators into operation logic in the Java programming language according to the execution order;

S33: associating the operation logic with the trigger of a Flink trigger program to form a Flink operation program.

The graph network structure of this embodiment is represented on the Apache Flink framework, and three object types are used to represent the graph elements of the graph network structure with extended attributes: the graph head, vertices and edges. To represent the data of the graph network structure, a dedicated Flink data set is used for each type of graph element, and the data of each graph element is converted into its dedicated Flink data set through a data mapping transformation. In the Apache Flink framework, the execution of a program has to be triggered explicitly, e.g., by writing results to a file or a database. Since the processes implemented by the graph operators in this embodiment do not themselves contain such a trigger, the operators are linked into operation logic according to the execution order in the execution script, and the operation logic is then associated with the trigger of a Flink trigger program to form a Flink operation program, so that the entire operation process is executed as one Flink program.
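The idea of linking operators first and triggering execution explicitly afterwards can be illustrated with a small Python sketch. The class `LazyProgram` and its methods are hypothetical and only mimic the behaviour described above; they are not part of Flink or of the application.

```python
from typing import Callable, List


class LazyProgram:
    """Records operators in execution order; nothing runs until execute() is called."""

    def __init__(self, source):
        self._source = source
        self._operators: List[Callable] = []

    def then(self, operator):
        # Only remember the operator; do not run it yet.
        self._operators.append(operator)
        return self                       # returning self allows chaining

    def execute(self, sink):
        """Explicit trigger: run the whole chain once and hand the result to a sink."""
        data = self._source
        for operator in self._operators:
            data = operator(data)
        sink(data)
        return data


written = []                              # stands in for a file or database sink
program = LazyProgram([3, 1, 2]).then(sorted).then(lambda xs: [x * 2 for x in xs])
nothing_ran_yet = (written == [])         # building the chain produced no output
program.execute(written.append)           # the trigger runs the chain as one program
```

Because the chain is only a description until `execute` is called, the whole sequence of operators can be treated as a single program, which is the property the embodiment relies on.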

Further, the graph operator includes an exclusion operator, and the step S4 of performing analysis calculation on the specified data set corresponding to the graph network structure by using the graph operator to obtain an analysis result includes:

S41: acquiring a first vertex set and a first edge set corresponding to a first logical graph, and acquiring a second vertex set and a second edge set corresponding to a second logical graph;

S42: extracting an identifier of the second logical graph from the second logical graph through the function ID_ONLY to form a data set corresponding to a new id object;

S43: screening specified vertices and specified edges through the exclusion operator to obtain an exclusion result data set corresponding to the exclusion operator, wherein the specified vertices are contained in the first vertex set and not contained in the second vertex set, and the specified edges are properly contained in the first edge set;

S44: sending the exclusion result data set to all nodes of the Apache Flink framework cluster through Flink broadcasting.

This embodiment describes how the exclusion operator of the graph computation produces a transformed data set. The logical graph generated by the exclusion operator consists of the vertices and edges that are contained in the first logical graph but not in the second logical graph. The implementation is based on filtering the vertices and edges of the logical graphs. First, the identifier of the second logical graph is extracted by a map transformation on the graph head data set of the second logical graph; the transformation is parameterized with the user-defined function ID_ONLY, which extracts the identifier from the graph head data set. The resulting data set contains a new id object, which is used to screen out the vertices contained in the second logical graph. The filter transformation takes the user-defined function NOT_IN_GRAPH_FILTER as a parameter, and this function is called for each vertex in the data sets of the first logical graph and the second logical graph. To make the new id object available to the filter function, the data set obtained after the operation is sent to all nodes of the Apache Flink framework cluster using Flink's broadcast mechanism. The resulting vertex data set then contains only the vertices for which the filter function returns true. The same procedure is applied to the edges, but filtering edges additionally requires detecting the two half-connected states, i.e., the cases in which only the source vertex or only the target vertex of an edge is in the generated data set; such edges must be excluded so that both the source vertex and the target vertex of every remaining edge are contained in the new vertex set.

A new logical graph is then created from the data sets generated by this operation; its graph head is created by a constructor, which generates a new id. A half-connected state arises when one end of an edge lies in the other logical graph, that is, when part of an edge of one logical graph is in the other logical graph; such overlapping regions have to be excluded.
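The filtering logic of the exclusion operator can be sketched in plain Python. The function names `id_only` and `not_in_graph_filter` mirror ID_ONLY and NOT_IN_GRAPH_FILTER from the text, but their signatures, the dictionary layout and the sample data are assumptions; a local variable stands in for the broadcast id object.

```python
def id_only(graph_head):
    """Extract only the identifier from a graph head record."""
    return graph_head["id"]


def not_in_graph_filter(element, excluded_graph_id):
    """True if the vertex or edge does not belong to the excluded graph."""
    return excluded_graph_id not in element["graph_ids"]


def exclude(first, second):
    """Keep the elements of `first` that are not in `second`, dropping dangling edges."""
    second_id = id_only(second["head"])   # in Flink this value would be broadcast to all nodes
    vertices = [v for v in first["vertices"] if not_in_graph_filter(v, second_id)]
    kept = {v["id"] for v in vertices}
    edges = [
        e for e in first["edges"]
        if not_in_graph_filter(e, second_id)
        and e["src"] in kept and e["trg"] in kept   # reject half-connected edges
    ]
    return {"vertices": vertices, "edges": edges}


# Hypothetical sample data: graph 0 is the first logical graph, graph 2 the second.
first = {
    "head": {"id": 0},
    "vertices": [
        {"id": 0, "graph_ids": {0, 2}},
        {"id": 1, "graph_ids": {0, 2}},
        {"id": 2, "graph_ids": {0}},
        {"id": 3, "graph_ids": {0}},
    ],
    "edges": [
        {"id": 0, "src": 0, "trg": 1, "graph_ids": {0, 2}},   # lies in both graphs
        {"id": 1, "src": 1, "trg": 2, "graph_ids": {0}},      # source vertex gets removed
        {"id": 2, "src": 2, "trg": 3, "graph_ids": {0}},      # fully inside the result
    ],
}
second = {"head": {"id": 2}}

result = exclude(first, second)
```

In this sample, edge 1 passes the graph-membership filter but is still dropped, because only its target vertex survives; this is the half-connected case the text describes.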

Further, the graph network structure is a social network graph composed of a plurality of logical graphs, the graph operators include relational operators provided by Flink, and the step S4 of analyzing and calculating the specified data set corresponding to the graph network structure by using the graph operators to obtain an analysis result includes:

S401: identifying whether the relational operator selected by a user is to find the maximum common subgraph between logical graphs in the social network graph;

S402: if yes, aggregating the number of vertices of each logical graph in the social network graph;

S403: selecting, according to the number of vertices of each logical graph, specified logical graphs whose number of vertices is higher than a minimum vertex count;

S404: obtaining the maximum common subgraph between the logical graphs in the social network graph through a label propagation algorithm based on the specified logical graphs.

The application uses Flink data sets to represent graphs, so graph analysis is not limited to predefined operators but can also use all data operations provided by Flink, such as relational operators, machine learning or graph processing. The analysis and calculation process is described in detail below, taking as an example the search, by means of relational operators, for the maximum common subgraph among the larger communities in a social network. The objects represented by a graph network structure are typically heterogeneous; for example, the vertices of a social network graph may represent users, groups or bands, while the relationships may represent friendship, membership or interest. Entity objects of the same type may also have different attributes, since different users may provide different attribute information. The social network graph comprises a plurality of logical graphs, for example a set of logical graphs $L = \{G_0, G_1, G_2\}$, each of which represents a community in the social network graph, such as a particular interest group (e.g., Databases). Each logical graph has a corresponding vertex subset and edge subset, e.g., $V(G_0) = \{v_0, v_1\}$ and $E(G_0) = \{e_0, e_1\}$. Analyzing the relationship between the logical graphs $G_0$ and $G_2$ in the social network graph with extended attributes shows that the vertices and edges of $G_0$ are shared with $G_2$; that is, the vertices and edges of $G_0$ and $G_2$ overlap, because $V(G_0) \cap V(G_2) = \{v_0, v_1\}$ and $E(G_0) \cap E(G_2) = \{e_0, e_1\}$. This embodiment identifies the communities, aggregates the number of vertices of each community, and selects the communities whose number of vertices is higher than the minimum vertex count, i.e., the larger communities, where the minimum vertex count is preset by the user. Finally, a label propagation algorithm is applied to find the maximum common subgraph.

The label propagation algorithm is a graph-based semi-supervised learning method that predicts the label information of unlabeled nodes from the label information of labeled nodes, using the relationships between samples to build a complete graph model. The label of each node is propagated to its adjacent nodes according to similarity; at each propagation step, every node updates its own label according to the labels of its adjacent nodes. The more similar an adjacent node is to the node, the greater its influence weight on the node's label, so the labels of similar nodes tend to become consistent and labels propagate more easily between them. During label propagation, the labels of the labeled data remain unchanged, so that the labels are passed on to the unlabeled data. When the iteration ends, the probability distributions of similar nodes tend to be similar, similar nodes can be grouped into one class, and the maximum independent set corresponding to the maximum common subgraph is found.
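The selection of larger communities and a label propagation pass can be sketched in Python as follows. This is a simplified, unweighted and unsupervised variant chosen for illustration (every node starts with its own label and adopts the most frequent neighbour label, with ties broken by the smallest label); it groups similar nodes but does not by itself compute a maximum common subgraph. All function names and sample data are assumptions.

```python
from collections import Counter


def vertex_counts(vertices):
    """Aggregate the number of vertices per logical graph (community) id."""
    counts = Counter()
    for v in vertices:
        for graph_id in v["graph_ids"]:
            counts[graph_id] += 1
    return counts


def select_large(counts, min_vertex_count):
    """Keep the communities whose vertex count is higher than the preset minimum."""
    return sorted(g for g, n in counts.items() if n > min_vertex_count)


def propagate_labels(adjacency, max_rounds=10):
    """Each node repeatedly adopts the most frequent label among its neighbours."""
    labels = {v: v for v in adjacency}            # start with one label per node
    for _ in range(max_rounds):
        changed = False
        for v in sorted(adjacency):               # fixed order keeps the run deterministic
            neighbour_labels = Counter(labels[n] for n in adjacency[v])
            if not neighbour_labels:
                continue
            best = max(neighbour_labels.values())
            new_label = min(l for l, c in neighbour_labels.items() if c == best)
            if new_label != labels[v]:
                labels[v] = new_label
                changed = True
        if not changed:                           # converged
            break
    return labels


# Hypothetical data: three vertices in communities 0 and 2, one vertex in community 1.
vertices = [
    {"id": 0, "graph_ids": {0, 2}},
    {"id": 1, "graph_ids": {0, 2}},
    {"id": 2, "graph_ids": {0, 2}},
    {"id": 3, "graph_ids": {1}},
]
large = select_large(vertex_counts(vertices), min_vertex_count=2)

# Two disconnected triangles; nodes in the same triangle should end up sharing a label.
adjacency = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}
labels = propagate_labels(adjacency)
```

A weighted or semi-supervised version, as described in the text, would additionally keep the labels of pre-labeled nodes fixed and weight neighbour votes by similarity.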

Further, the step S4 of analyzing and calculating the specified data set corresponding to the graph network structure by using the graph operator to obtain an analysis result includes:

S4001: acquiring a specified Flink program corresponding to the plurality of graph operators;

S4002: acquiring the current running node of the specified Flink program, and predicting the data sets on which operations can be executed concurrently;

S4003: distributing the data sets on which operations are executed concurrently to specified machines in the Apache Flink framework cluster;

S4004: receiving the calculation result obtained by each specified machine from executing its distributed data set;

S4005: summarizing the calculation results to obtain the analysis results of the plurality of graph operators;

S4006: sending the analysis results of the plurality of graph operators to all nodes of the Apache Flink framework cluster through Flink broadcasting.

The Apache Flink framework used in the application supports the declarative definition and distributed execution of analyses on both streaming and batch data. The basic abstractions of a Flink program are data sets and transformations: a task is defined and executed as data sets and the transformations applied to them. A data set is a collection of arbitrary data objects, and a transformation describes the conversion of one data set into another. For example, if $X$ and $Y$ are data sets, a transformation can be viewed as a function $t: X \to Y$. One such transformation is map, for which every input data object $x_i \in X$ has exactly one output data object $y_i \in Y$. Another is reduce, which aggregates all input data objects into a single object. Transformations known from relational databases include join, group-by, project, union and distinct. To describe the application logic of a Flink program, the transformations are parameterized with user-defined functions. A Flink program can be expressed as a chain of transformations; when the program is executed, the Flink framework is responsible for program optimization, data distribution and parallel execution across the machines of the cluster. That is, a Flink script task may include several chained data transformations, and when the task is executed, the program is optimized, the data is distributed, and the task is executed concurrently on the cluster across machines.
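The notion of transformations as functions $t: X \to Y$ that are parameterized with user-defined functions and then chained can be illustrated in plain Python. The helpers `map_transformation`, `reduce_transformation` and `chain` are hypothetical names, and lists stand in for data sets; no Flink API is used.

```python
from functools import reduce


def map_transformation(udf):
    """Build a transformation that yields exactly one output object per input object."""
    return lambda dataset: [udf(x) for x in dataset]


def reduce_transformation(udf):
    """Build a transformation that aggregates all input objects into a single object."""
    return lambda dataset: [reduce(udf, dataset)] if dataset else []


def chain(*transformations):
    """Compose transformations into one program that runs them in order."""
    def program(dataset):
        for transformation in transformations:
            dataset = transformation(dataset)
        return dataset
    return program


square = map_transformation(lambda x: x * x)           # user-defined function as parameter
total = reduce_transformation(lambda a, b: a + b)

X = [1, 2, 3, 4]
Y = square(X)                    # one output per input
result = chain(square, total)(X) # chained program: square every element, then sum
```

Separating the generic transformation from the user-defined function is what allows a framework to optimize and distribute the chain without knowing the application logic.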

Further, the step S4003 of distributing the dataset for concurrently executing operations to the designated machines in the Apache Flink framework cluster comprises:

s40031: acquiring a data set corresponding to the specific logic diagram, an aggregation function corresponding to aggregation operation, a conversion function corresponding to transformation operation and a matching function corresponding to pattern matching operation;

s40032: distributing the data set corresponding to the specific logic diagram and the aggregation function corresponding to the aggregation operation to a first machine in the cluster, distributing the data set corresponding to the specific logic diagram and the conversion function corresponding to the transformation operation to a second machine in the cluster, and distributing the data set corresponding to the specific logic diagram and the matching function corresponding to the pattern matching operation to a third machine in the cluster;

s40033: controlling the first machine, the second machine, and the third machine to run concurrently.
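Steps s40031 to s40033 can be sketched as follows. This is only an illustration under stated assumptions: local threads stand in for the first, second, and third machines, and the three functions (`aggregation_fn`, `conversion_fn`, `matching_fn`) are hypothetical placeholders rather than functions defined by the present application or by Flink.

```python
from concurrent.futures import ThreadPoolExecutor


def aggregation_fn(vertices):
    """Hypothetical aggregation function: count the vertices."""
    return len(vertices)


def conversion_fn(vertices):
    """Hypothetical conversion function: upper-case every vertex name."""
    return [v.upper() for v in vertices]


def matching_fn(vertices):
    """Hypothetical matching function: keep names starting with 'a'."""
    return [v for v in vertices if v.startswith("a")]


def run_concurrently(dataset, functions):
    """Hand the same data set plus one function to each worker, then collect results."""
    with ThreadPoolExecutor(max_workers=len(functions)) as pool:
        futures = {name: pool.submit(fn, dataset) for name, fn in functions.items()}
        return {name: future.result() for name, future in futures.items()}


results = run_concurrently(
    ["alice", "bob"],
    {"aggregate": aggregation_fn, "transform": conversion_fn, "match": matching_fn},
)
```

Each worker receives the data set of the specific logic diagram together with exactly one function, which mirrors the pairing of data set and function per machine in step s40032.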

In this embodiment, the process of concurrent execution by machines in the Apache Flink framework cluster is described in detail, taking as an example the case in which the graph operators simultaneously include an aggregation operator, a transformation operator, and a pattern matching operator. The aggregation operator maps an input specific logic diagram G to an output graph G'. It executes an aggregation function α: L → A defined in advance by the user, where L is the set of logic diagrams and A is a value range. The resulting output graph G' is a modified version of the input specific logic diagram G with a new attribute k, such that κ(G', k) ↦ α(G), where κ is the mapping that maps the attribute k of a logic diagram to a value. For example, for the vertex count: alpha = (g => g.count()); aggregate('vertexCount', alpha). The transformation operator applies user-defined conversion functions γ: L → L, ν: V → V and ε: E → E to the input specific logic diagram G to obtain an output graph G' = γ(G), where V(G') = {ν(v) : v ∈ V(G)} and E(G') = {ε(e) : e ∈ E(G)}. The pattern matching operator takes two parameters, namely a pattern graph G* and a predicate. The pattern matching operation is applied to a specific logic diagram G and returns a graph set containing all matches.
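The aggregation and transformation operators described above can be sketched in plain Python as follows. The dictionary layout of the graph and the function names `aggregate` and `transform` are assumptions made for illustration; pattern matching is omitted because the predicate is not spelled out in the text.

```python
from copy import deepcopy


def aggregate(graph, key, alpha):
    """Return a copy G' of G whose attribute `key` holds the value alpha(G)."""
    out = deepcopy(graph)
    out["properties"][key] = alpha(graph)
    return out


def transform(graph, gamma, nu, epsilon):
    """Apply gamma to the graph attributes, nu to every vertex, epsilon to every edge."""
    return {
        "properties": gamma(graph["properties"]),
        "vertices": [nu(v) for v in graph["vertices"]],
        "edges": [epsilon(e) for e in graph["edges"]],
    }


# Hypothetical input logic diagram G.
g = {
    "properties": {"interest": "Databases"},
    "vertices": [{"id": 0, "name": "Alice"}, {"id": 1, "name": "Bob"}],
    "edges": [{"id": 0, "src": 0, "trg": 1}],
}

# Aggregation: the new attribute 'vertexCount' is mapped to alpha(G).
counted = aggregate(g, "vertexCount", lambda graph: len(graph["vertices"]))

# Transformation: V(G') = {nu(v)}, E(G') = {epsilon(e)}; here nu drops the name.
anonymized = transform(
    g,
    gamma=lambda props: dict(props),
    nu=lambda v: {"id": v["id"]},
    epsilon=lambda e: dict(e),
)
```

Both operators return a new graph and leave the input unchanged, which is what allows them to be run independently on the same data set.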

Referring to fig. 2, an apparatus for analyzing a graph network structure according to an embodiment of the present application includes:

the first obtaining module 1 is configured to obtain a graph network structure to be analyzed according to an execution script of a graph analysis task, where the graph network structure includes a vertex set and an edge set, each vertex in the vertex set carries a first type label and a first attribute corresponding to the vertex, each edge in the edge set is formed by connecting two vertices having an association relationship, and each edge carries a second type label and a second attribute corresponding to the edge;

a mapping module 2, configured to map each vertex and each edge in the graph network structure to a specified data set respectively;

the second obtaining module 3 is used for obtaining a graph operator corresponding to the graph analysis task according to the execution script;

and the calculation module 4 is used for analyzing and calculating the specified data set corresponding to the graph network structure by using the graph operator to obtain an analysis result.

The graph network structure of the present application is composed of a vertex set and an edge set. For example, the graph network consists of a vertex set V = {v₀, …, v₉} and an edge set E = {e₀, …, e₁₉}. The vertices may be of various types, including vertices representing a person (Person), vertices representing a forum (Forum), and vertices representing an interest (Tag). The different types of vertices are described by their respective first type labels and first attributes. The edges describe the relationships between the vertices and likewise have corresponding second type labels (e.g., knows) and second attributes. Vertices with the same type label may have different attributes; for example, of the vertices v0 and v1, v1 may include an age attribute while v0 does not. The attributes depend on the type of the vertex, and the type label and attributes of each vertex and of each edge may be configured through a configuration file. In this way a graph network structure with extended attributes is realized: the semantics of the network structure are richer, a plurality of graphs with different attributes can be supported, more information can be carried, and more complex relationships can be expressed. According to the present application, each vertex and each edge in the graph network structure are respectively mapped to a specified data set of a specified data type, such as a Flink data set, so that the analysis and calculation of a complex graph network structure are realized by analyzing the relationships between the specified data sets and calculating the analysis result corresponding to the graph operator, which opens up new functional fields for graph analysis.
The present application not only establishes a graph network structure with extended attributes, which gives the graph representation richer semantics and supports a plurality of graphs with different attributes, but also provides declarative and composable graph operators based on that structure. The graph network structure with extended attributes is implemented on Apache Flink and graph operations are converted into operations on data sets, so that a single graph or a graph set can be analyzed, a new function is provided for graph analysis, and large-scale data and complex graphs can be processed.

Further, the first obtaining module 1 includes:

a first acquisition unit, configured to acquire a graph element of the graph network structure to be analyzed, where the graph element is carried in the execution script of the graph analysis task, and the graph element includes a graph header;

an extraction unit, configured to extract a graph header data set corresponding to the graph header from a specified database according to the graph header;

a judging unit, configured to judge whether the graph header data set includes a plurality of objects;

a determining unit, configured to determine, if the graph header data set does not include a plurality of objects, that the graph network structure to be analyzed is a logic diagram whose graph header is a single object;

a second acquisition unit, configured to acquire vertex data and edge data associated with the logic diagram from the data set, and to form a vertex set and an edge set respectively;

and a first obtaining unit, configured to represent the logic diagram through the vertex set and the edge set to obtain the graph network structure to be analyzed.
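The sequence of units listed above can be sketched as a small plain-Python routine. The record layout (`id`, `label`, `graph_ids`) and the function names are assumptions introduced for illustration; the database access itself is replaced by in-memory lists.

```python
def is_single_logic_diagram(header_dataset):
    """A graph header data set without a plurality of objects denotes a logic diagram."""
    return len(header_dataset) <= 1


def build_logic_diagram(header, vertices, edges):
    """Collect the vertex data and edge data associated with the header's graph."""
    gid = header["id"]
    return {
        "header": header,
        "vertices": [v for v in vertices if gid in v["graph_ids"]],
        "edges": [e for e in edges if gid in e["graph_ids"]],
    }


# Hypothetical contents of the specified database.
headers = [{"id": 0, "label": "Community"}]
vertices = [
    {"id": 0, "label": "Person", "graph_ids": {0, 2}},
    {"id": 1, "label": "Person", "graph_ids": {0, 1, 2}},
    {"id": 2, "label": "Person", "graph_ids": {1}},
]
edges = [
    {"id": 0, "label": "knows", "src": 0, "trg": 1, "graph_ids": {0, 2}},
    {"id": 1, "label": "knows", "src": 1, "trg": 2, "graph_ids": {1}},
]

diagram = (
    build_logic_diagram(headers[0], vertices, edges)
    if is_single_logic_diagram(headers)
    else None  # a graph set would be handled separately
)
```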

The embodiment of the present application takes the case in which the graph network structure to be analyzed is a logic diagram as an example, and details the process of obtaining the data sets of the graph network structure from a database and representing the graph network structure. A logic diagram is a special case of a graph set in which the graph header data set contains only a single object and all data in the graph is connected. A logic diagram also has a type label (e.g., Community) and attributes, which can describe specific metrics (e.g., vertexCount: 3) or general information about the logic diagram (e.g., interest: Databases). The type label and attributes of a logic diagram can be declared explicitly in the logic diagram, or can be the output of a graph algorithm or the input of a subsequent operator, for example in community detection or graph pattern matching applications. In the present application, the predefined database DB = <V, E, L> consists of a vertex set V = {v_i}, an edge set E = {e_k}, and a set of logic diagrams L = {G_m}. A logic diagram G_m = <V_m, E_m> is an ordered pair consisting of a subset V_m ⊆ V of the vertex set and a subset E_m ⊆ E of the edge set. The graph header represents the relationship between data and a particular logic diagram: if the graph header of the graph network structure to be analyzed refers to only one object, the whole graph network structure to be analyzed is a logic diagram whose data is connected or associated. The logic diagram of the present application has extended attributes. The extended attributes of the graph header include an identifier, a label, and data attributes, expressed as GraphHead: <Id, Label, Properties>. The extended attributes of a vertex include an identifier, a label, vertex data attributes, and the identifiers of the graphs to which the vertex belongs, expressed as Vertex: <Id, Label, Properties, GraphIds>. The extended attributes of an edge include an identifier, a label, the identifier of the start vertex of the edge, the identifier of the end vertex of the edge, edge data attributes, and the identifiers of the graphs to which the edge belongs, expressed as Edge: <Id, Label, SrcId, TrgtId, Properties, GraphIds>. For example, the logic diagram g0 is represented as g0 = { DataSet<GraphHead> graphHead = { <0, 'Community', {'interest': 'Databases', ...}> }, DataSet<Vertex> vertices = { <0, 'Person', {'name': 'Alice', ...}, {0, 2}>, <1, 'Person', {'name': 'Bob', ...}, {0, 2}> }, DataSet<Edge> edges = { <0, 'knows', 0, 1, {'since': 2014}, {0, 2}>, <1, 'knows', 1, 0, {'since': 2014}, {0, 2}> } }.
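The three record types named above can be modeled with a minimal sketch such as the following. The field names follow the tuples GraphHead, Vertex, and Edge from the text; the use of Python dataclasses and plain lists in place of Flink data sets is an assumption for illustration, and for brevity only the first edge of g0 is shown and the elided properties are left out.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Set


@dataclass
class GraphHead:
    id: int
    label: str
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Vertex:
    id: int
    label: str
    properties: Dict[str, Any] = field(default_factory=dict)
    graph_ids: Set[int] = field(default_factory=set)


@dataclass
class Edge:
    id: int
    label: str
    src_id: int
    trgt_id: int
    properties: Dict[str, Any] = field(default_factory=dict)
    graph_ids: Set[int] = field(default_factory=set)


# Logic diagram g0 as three data sets (lists stand in for DataSet<...>).
g0 = {
    "graphHead": [GraphHead(0, "Community", {"interest": "Databases"})],
    "vertices": [
        Vertex(0, "Person", {"name": "Alice"}, {0, 2}),
        Vertex(1, "Person", {"name": "Bob"}, {0, 2}),
    ],
    "edges": [Edge(0, "knows", 0, 1, {"since": 2014}, {0, 2})],
}
```

Storing the graph identifiers on every vertex and edge is what lets one element belong to several logic diagrams at once, here to graphs 0 and 2.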

Further, the specified data set includes a Flink data set, and the apparatus for analyzing a graph network structure further includes:

the third acquisition module is used for acquiring the execution sequence of each graph operator;

the linkage module is used for linking the graph operators into operation logic through a Java programming language according to the execution sequence;

and the association module is used for associating the operation logic with a trigger of a Flink trigger program to form the Flink operation program.
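The linking of graph operators in their execution order can be sketched as function composition. The present application performs this linking in Java on Flink; the Python below is only an illustration, and the names `link_operators`, `on_trigger`, `keep_persons`, and `count` are hypothetical.

```python
from functools import reduce


def link_operators(operators_in_order):
    """Link graph operators, given in execution order, into one operation logic."""
    def operation_logic(dataset):
        return reduce(lambda data, op: op(data), operators_in_order, dataset)
    return operation_logic


def on_trigger(operation_logic, dataset, fired):
    """Run the linked operation logic only when the (hypothetical) trigger fires."""
    return operation_logic(dataset) if fired else None


def keep_persons(vertices):
    """Hypothetical operator: keep vertices labelled 'Person'."""
    return [v for v in vertices if v["label"] == "Person"]


def count(vertices):
    """Hypothetical operator: count the remaining vertices."""
    return len(vertices)


program = link_operators([keep_persons, count])
sample = [{"label": "Person"}, {"label": "Forum"}, {"label": "Person"}]
result = on_trigger(program, sample, fired=True)
```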

Further, the graph operator includes an exclusion operator, and the calculation module 4 includes:

a third obtaining unit, configured to obtain a first vertex set and a first edge set corresponding to the first logic diagram, and obtain a second vertex set and a second edge set corresponding to the second logic diagram;

an extracting unit, configured to extract the identifier of the second logic diagram from the second logic diagram through the function ID_ONLY to form a data set corresponding to the new ID object;

a screening unit, configured to screen out designated vertices and designated edges through the exclusion operator to obtain an exclusion result data set corresponding to the exclusion operator, wherein a designated vertex is included in the first vertex set and is not included in the second vertex set, and a designated edge is contained in the first edge set;

and a first sending unit, configured to send the exclusion result data set to all nodes of the Apache Flink framework cluster through Flink broadcasting.
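The exclusion operator described above can be sketched as follows. The helper `id_only` is a stand-in for the function ID_ONLY named in the text; the additional rule that a retained edge must have both end vertices among the retained vertices is an assumption of this sketch, not something the text states.

```python
def id_only(elements):
    """Stand-in for ID_ONLY: reduce elements to a set of identifiers."""
    return {e["id"] for e in elements}


def exclude(first, second):
    """Keep vertices of `first` that are not in `second`, plus edges of `first`
    whose two end vertices were both kept (assumption of this sketch)."""
    second_vertex_ids = id_only(second["vertices"])
    vertices = [v for v in first["vertices"] if v["id"] not in second_vertex_ids]
    kept = id_only(vertices)
    edges = [e for e in first["edges"] if e["src"] in kept and e["trg"] in kept]
    return {"vertices": vertices, "edges": edges}


# Hypothetical first and second logic diagrams.
first = {
    "vertices": [{"id": 0}, {"id": 1}, {"id": 2}],
    "edges": [
        {"id": 0, "src": 0, "trg": 1},
        {"id": 1, "src": 1, "trg": 2},
        {"id": 2, "src": 0, "trg": 2},
    ],
}
second = {"vertices": [{"id": 1}], "edges": []}

exclusion_result = exclude(first, second)
```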

Further, the graph network structure is a social network graph composed of a plurality of logic graphs, the graph operators include a relationship operator provided by Flink, and the calculation module 4 includes:

an identifying unit, configured to identify whether the relation operator selected by the user is for finding the largest common subgraph between the logic diagrams in the social network graph;

an aggregation unit, configured to aggregate, if so, the number of vertices of each logic diagram in the social network graph;

a selection unit, configured to select, according to the number of vertices of each logic diagram, the designated logic diagrams whose number of vertices is higher than the minimum vertex count;

and a second obtaining unit, configured to obtain the largest common subgraph between the logic diagrams in the social network graph through a label propagation algorithm based on the designated logic diagrams.
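The aggregation and selection performed by the aggregation unit and the selection unit can be sketched as follows. The vertex records and the function names are hypothetical; "higher than the minimum vertex count" is read here as a strict comparison.

```python
def vertex_counts(vertices):
    """Aggregate the number of vertices per logic diagram from the GraphIds."""
    counts = {}
    for v in vertices:
        for gid in v["graph_ids"]:
            counts[gid] = counts.get(gid, 0) + 1
    return counts


def select_larger_communities(vertices, min_vertex_count):
    """Return the ids of logic diagrams whose vertex number exceeds the minimum."""
    counts = vertex_counts(vertices)
    return sorted(gid for gid, n in counts.items() if n > min_vertex_count)


# Hypothetical social network: each vertex lists the communities it belongs to.
social_vertices = [
    {"id": 0, "graph_ids": {0, 2}},
    {"id": 1, "graph_ids": {0, 2}},
    {"id": 2, "graph_ids": {1, 2}},
    {"id": 3, "graph_ids": {1}},
]

counts = vertex_counts(social_vertices)
larger = select_larger_communities(social_vertices, min_vertex_count=2)
```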

The present application uses Flink data sets to represent graphs, so graph analysis is not limited to the predefined operators but can also use all data operations provided by Flink, such as relational operators, machine learning libraries, or graph processing libraries. The analysis and calculation process is described in detail by taking as an example the search, through a relation operator, for the largest common subgraph among the larger communities in a social network. The objects represented by a graph network structure are typically heterogeneous; for example, the vertices of a social network graph may represent users, groups, or bands, while relationships may represent friendships, memberships, or interests. Entity objects of the same type may also have different attributes, for example because different users provide different attribute information. The social network graph described above contains a set of logic diagrams L = {G₀, G₁, G₂}, each representing a community in the social network graph, such as a particular interest group (e.g., Databases). Each logic diagram has a corresponding vertex subset and edge subset, e.g., V(G0) = {v0, v1} and E(G0) = {e0, e1}. Analyzing the relationship between logic diagrams G0 and G2 in the social network graph with extended attributes shows that the vertices and edges associated with logic diagram G0 are shared with logic diagram G2; that is, the vertex sets and edge sets of G0 and G2 may overlap. In this embodiment, communities are identified, the number of vertices of each community is aggregated, and the communities whose number of vertices is higher than a minimum vertex count preset by the user are selected as the larger communities. Finally, a label propagation algorithm is applied to find the largest common subgraph.
The label propagation algorithm is a graph-based semi-supervised learning method. It predicts the label information of unlabeled nodes from the label information of labeled nodes, using the relationships among samples to build a complete graph model. Each node propagates its label to its adjacent nodes according to their similarity. In each propagation step, every node updates its own label according to the labels of its adjacent nodes: the greater the similarity between a node and an adjacent node, the greater the weight with which that adjacent node influences the label, so the labels of similar nodes tend to become consistent and labels propagate more easily between them. During label propagation, the labels of the labeled data are kept unchanged, so that labels are passed on to the unlabeled data. When the iteration finishes, the probability distributions of similar nodes have become similar, the similar nodes can be assigned to one class, and the maximum independent set corresponding to the largest common subgraph is found.
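The propagation step described above can be illustrated with a small synchronous variant. This sketch is an assumption-laden simplification: it assigns each unlabeled node the label with the highest summed similarity weight among its neighbours, keeps the seed labels fixed, and does not itself compute the largest common subgraph. The function name `propagate_labels` and the sample graph are hypothetical.

```python
def propagate_labels(neighbors, seed_labels, max_iters=10):
    """neighbors: node -> {adjacent node: similarity weight}; seed labels stay fixed."""
    labels = dict(seed_labels)
    for _ in range(max_iters):
        updated = dict(labels)
        for node, adjacent in neighbors.items():
            if node in seed_labels:
                continue  # labels of labeled data are kept unchanged
            scores = {}
            for other, weight in adjacent.items():
                if other in labels:
                    scores[labels[other]] = scores.get(labels[other], 0.0) + weight
            if scores:
                # sorted() makes tie-breaking deterministic
                updated[node] = max(sorted(scores), key=scores.get)
        if updated == labels:
            break  # converged: no label changed in this step
        labels = updated
    return labels


# Chain a - b - c - d; a and d are labeled, b and c are not.
chain = {
    "a": {"b": 1.0},
    "b": {"a": 1.0, "c": 0.2},
    "c": {"b": 0.2, "d": 1.0},
    "d": {"c": 1.0},
}
final_labels = propagate_labels(chain, {"a": "X", "d": "Y"})
```

Because b is far more similar to a than to c, it adopts the label of a, and c correspondingly adopts the label of d.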

Further, there are at least two graph operators, and the calculation module 4 includes:

a fourth acquiring unit, configured to acquire the specified Flink program corresponding to the plurality of graph operators;

a fifth obtaining unit, configured to obtain the currently running node of the specified Flink program, and to predict the data sets on which operations can be executed concurrently;

a distribution unit, configured to distribute the data sets on which operations are executed concurrently to designated machines in the Apache Flink framework cluster;

a receiving unit, configured to receive the calculation result obtained by each designated machine executing its distributed data set;

a summarizing unit, configured to summarize the calculation results to obtain the analysis results of the plurality of graph operators;

and a second sending unit, configured to send the analysis results of the plurality of graph operators to all nodes of the Apache Flink framework cluster through Flink broadcasting.

Further, the specified Flink program includes an aggregation operation, a transformation operation, and a pattern matching operation for a specific logic diagram, and the distribution unit includes:

the acquisition subunit is used for acquiring a data set corresponding to the specific logic diagram, an aggregation function corresponding to aggregation operation, a conversion function corresponding to transformation operation and a matching function corresponding to pattern matching operation;

the distribution subunit is configured to distribute the data set corresponding to the specific logic diagram and the aggregation function corresponding to the aggregation operation to a first machine in the cluster, distribute the data set corresponding to the specific logic diagram and the conversion function corresponding to the transformation operation to a second machine in the cluster, and distribute the data set corresponding to the specific logic diagram and the matching function corresponding to the pattern matching operation to a third machine in the cluster;

and the control subunit is used for controlling the first machine, the second machine and the third machine to run concurrently.

In this embodiment, the process of concurrent execution by machines in the Apache Flink framework cluster is described in detail, taking as an example the case in which the graph operators simultaneously include an aggregation operator, a transformation operator, and a pattern matching operator. The aggregation operator maps an input specific logic diagram G to an output graph G'. It executes an aggregation function α: L → A defined in advance by the user, where L is the set of logic diagrams and A is a value range. The resulting output graph G' is a modified version of the input specific logic diagram G with a new attribute k, such that κ(G', k) ↦ α(G), where κ is the mapping that maps the attribute k of a logic diagram to a value. For example, for the vertex count: alpha = (g => g.count()); aggregate('vertexCount', alpha). The transformation operator applies user-defined conversion functions γ: L → L, ν: V → V and ε: E → E to the input specific logic diagram G to obtain an output graph G' = γ(G), where V(G') = {ν(v) : v ∈ V(G)} and E(G') = {ε(e) : e ∈ E(G)}. The pattern matching operator takes two parameters, namely a pattern graph G* and a predicate. The pattern matching operation is applied to a specific logic diagram G and returns a graph set containing all matches.

Referring to fig. 3, a computer device is also provided in the embodiment of the present application. The computer device may be a server, and its internal structure may be as shown in fig. 3. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database of the computer device is used to store all data required for the analysis process of the graph network structure. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements a method of analyzing a graph network structure.

The processor executes the analysis method of the graph network structure, and obtains the graph network structure to be analyzed according to an execution script of a graph analysis task, wherein the graph network structure comprises a vertex set and an edge set, each vertex in the vertex set carries a first type label and a first attribute corresponding to the vertex respectively, each edge in the edge set is formed by connecting two vertexes with an incidence relation, and each edge carries a second type label and a second attribute corresponding to the edge respectively; respectively mapping each vertex and each edge in the graph network structure to a specified data set; acquiring a graph operator corresponding to the graph analysis task according to the execution script; and analyzing and calculating the designated data set corresponding to the graph network structure by utilizing the graph operator to obtain an analysis result.

The computer device not only establishes a graph network structure with extended attributes, which gives the graph representation richer semantics and supports a plurality of graphs with different attributes, but also provides declarative and composable graph operators based on that structure. The graph network structure with extended attributes is implemented on Apache Flink and graph operations are converted into operations on data sets, so that a single graph or a graph set can be analyzed, a new function is provided for graph analysis, and large-scale data and complex graphs can be processed.

In an embodiment, the step of acquiring, by the processor, a graph network structure to be analyzed according to an execution script of the graph analysis task includes: acquiring graph elements of the graph network structure to be analyzed, wherein the graph elements are carried in the execution script of the graph analysis task, and the graph elements comprise a graph header; extracting a graph header data set corresponding to the graph header from a specified database according to the graph header; determining whether the graph header data set includes a plurality of objects; if not, determining that the graph network structure to be analyzed is a logic diagram whose graph header is a single object; acquiring vertex data and edge data associated with the logic diagram from the data set, and forming a vertex set and an edge set respectively; and representing the logic diagram through the vertex set and the edge set to obtain the graph network structure to be analyzed.

In one embodiment, the specified data set includes a Flink data set, and after the step of the processor obtaining graph operators corresponding to the graph analysis task according to the execution script, the method includes: acquiring the execution sequence of each graph operator; linking the graph operators into operation logic through the Java programming language according to the execution sequence; and associating the operation logic with a trigger of a Flink trigger program to form a Flink operation program.

In one embodiment, the graph operator includes an exclusion operator, and the step of the processor performing analysis calculation on the specified data set corresponding to the graph network structure by using the graph operator to obtain an analysis result includes: acquiring a first vertex set and a first edge set corresponding to the first logic diagram, and acquiring a second vertex set and a second edge set corresponding to the second logic diagram; extracting the identifier of the second logic diagram from the second logic diagram through the function ID_ONLY to form a data set corresponding to the new ID object; screening out designated vertices and designated edges through the exclusion operator to obtain an exclusion result data set corresponding to the exclusion operator, wherein a designated vertex is included in the first vertex set and is not included in the second vertex set, and a designated edge is contained in the first edge set; and sending the exclusion result data set to all nodes of the Apache Flink framework cluster through Flink broadcasting.

In one embodiment, the graph network structure is a social network graph composed of a plurality of logic graphs, the graph operator includes a relationship operator provided by Flink, and the processor performs analysis calculation on a specified data set corresponding to the graph network structure by using the graph operator to obtain an analysis result, including: identifying whether a user-selected relationship operator is finding a largest common subgraph between logical graphs in the social networking graph; if yes, aggregating the number of vertexes of each logic diagram in the social network diagram; selecting a designated logic diagram higher than the minimum vertex count according to the vertex number of each logic diagram; and obtaining the maximum common subgraph among the logic graphs in the social network graph through a label propagation algorithm based on the designated logic graphs.

In one embodiment, there are at least two graph operators, and the step of the processor performing analysis calculation on the specified data set corresponding to the graph network structure by using the graph operators to obtain an analysis result includes: acquiring the specified Flink program corresponding to the plurality of graph operators; acquiring the currently running node of the specified Flink program, and predicting the data sets on which operations can be executed concurrently; distributing the data sets on which operations are executed concurrently to designated machines in the Apache Flink framework cluster; receiving the calculation result obtained by each designated machine executing its distributed data set; summarizing the calculation results to obtain the analysis results of the plurality of graph operators; and sending the analysis results of the plurality of graph operators to all nodes of the Apache Flink framework cluster through Flink broadcasting.

In one embodiment, the specified Flink program includes an aggregation operation, a transformation operation, and a pattern matching operation for a specific logic diagram, and the step of the processor distributing the data sets on which operations are executed concurrently to designated machines in the Apache Flink framework cluster includes: acquiring a data set corresponding to the specific logic diagram, an aggregation function corresponding to the aggregation operation, a conversion function corresponding to the transformation operation, and a matching function corresponding to the pattern matching operation; distributing the data set corresponding to the specific logic diagram and the aggregation function corresponding to the aggregation operation to a first machine in the cluster, distributing the data set corresponding to the specific logic diagram and the conversion function corresponding to the transformation operation to a second machine in the cluster, and distributing the data set corresponding to the specific logic diagram and the matching function corresponding to the pattern matching operation to a third machine in the cluster; and controlling the first machine, the second machine, and the third machine to run concurrently.

Those skilled in the art will appreciate that the architecture shown in fig. 3 is only a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects may be applied.

An embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored thereon, and when the computer program is executed by a processor, the method for analyzing a graph network structure is implemented, and a graph network structure to be analyzed is obtained according to an execution script of a graph analysis task, where the graph network structure includes a vertex set and an edge set, each vertex in the vertex set carries a first type label and a first attribute corresponding to the vertex, each edge in the edge set is formed by connecting two vertices having an association relationship, and each edge carries a second type label and a second attribute corresponding to the edge; respectively mapping each vertex and each edge in the graph network structure to a specified data set; acquiring a graph operator corresponding to the graph analysis task according to the execution script; and analyzing and calculating the designated data set corresponding to the graph network structure by utilizing the graph operator to obtain an analysis result.

The computer-readable storage medium not only establishes a graph network structure with extended attributes, which gives the graph representation richer semantics and supports a plurality of graphs with different attributes, but also provides declarative and composable graph operators based on that structure. The graph network structure with extended attributes is implemented on Apache Flink and graph operations are converted into operations on data sets, so that a single graph or a graph set can be analyzed, a new function is provided for graph analysis, and large-scale data and complex graphs can be processed.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, a database, or another medium provided herein and used in the embodiments may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, apparatus, article, or method that includes the element.

The above description is only a preferred embodiment of the present application, and not intended to limit the scope of the present application, and all modifications of equivalent structures and equivalent processes, which are made by the contents of the specification and the drawings of the present application, or which are directly or indirectly applied to other related technical fields, are also included in the scope of the present application.

Claims

1. A method for analyzing a graph network structure, comprising:

2. The method for analyzing a graph network structure according to claim 1, wherein the step of obtaining the graph network structure to be analyzed according to the execution script of the graph analysis task comprises:

determining whether the header data set includes a plurality of objects;

3. The method for analyzing graph network structure according to claim 1, wherein the specified dataset comprises a Flink dataset, and said step of obtaining graph operators corresponding to the graph analysis task according to the execution script is followed by:

acquiring the execution sequence of each graph operator;

4. The method according to claim 1, wherein the graph operators include an exclusion operator, and the step of performing analysis calculation on the specified data set corresponding to the graph network structure by using the graph operators to obtain an analysis result includes:

5. The method according to claim 1, wherein the graph network structure is a social network graph composed of a plurality of logic graphs, the graph operators include relationship operators provided by Flink, and the step of performing analysis calculation on the designated data set corresponding to the graph network structure by using the graph operators to obtain an analysis result includes:

6. The method according to claim 1, wherein there are at least two graph operators, and the step of performing analysis calculation on the specified data set corresponding to the graph network structure by using the graph operators to obtain an analysis result comprises:

7. The method for analyzing a graph network structure according to claim 6, wherein the specified Flink program comprises an aggregation operation, a transformation operation and a pattern matching operation for a specific logic diagram, and the step of distributing the data sets on which operations are executed concurrently to designated machines in the Apache Flink framework cluster comprises:

8. An apparatus for analyzing a graph network structure, comprising:

9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 7 when executing the computer program.

10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.