CN113656411B - Method and device for storing graph data - Google Patents


Info

Publication number
CN113656411B
Authority
CN
China
Prior art keywords
vertex
data
edge
unique
value
Prior art date
Legal status
Active
Application number
CN202110963045.9A
Other languages
Chinese (zh)
Other versions
CN113656411A (en)
Inventor
张国庆
Current Assignee
Beijing Zhongjing Huizhong Technology Co ltd
Original Assignee
Beijing Zhongjing Huizhong Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Zhongjing Huizhong Technology Co ltd
Priority to CN202110963045.9A
Publication of CN113656411A
Application granted
Publication of CN113656411B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/22 Indexing; Data structures therefor; Storage structures
    • G06F 16/2228 Indexing structures
    • G06F 16/2237 Vectors, bitmaps or matrices
    • G06F 16/2255 Hash tables
    • G06F 16/2282 Tablespace storage structures; Management thereof

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method and device for storing graph data. The method comprises the following steps: generating a unique vertex ID for each vertex data according to a preset vertex ID generation rule; entering each vertex data into a graph database according to the corresponding unique vertex ID; for each edge data in the edge data table, replacing the two vertex values in the edge data with the corresponding unique vertex IDs; generating a unique edge ID for the edge data based on at least one edge primary key in the edge data according to a preset edge ID generation rule; and entering the edge data into the graph database according to the unique edge ID and the two unique vertex IDs of the edge data. The method of the present invention sets specific ID generation rules so that each vertex data and edge data has a specific unique vertex ID and unique edge ID. Therefore, even if the graph data is entered repeatedly, the same ID is generated each time for the same vertex data or edge data, so that data is overwritten rather than duplicated upon repeated entry.

Description

Method and device for storing graph data
Technical Field
The present disclosure relates to the field of computers, in particular to graph database technologies, and more particularly to a method and an apparatus for graph data storage, a computer device, a computer-readable storage medium, and a computer program product.
Background
A graph is an important data structure consisting of a collection of vertex (Vertex) data and edge (Edge) data, each of which may typically have several attributes. A graph database (Graph Database) is a NoSQL-type data management system whose main function is to store graph-structured data and provide graph-semantic query services to the outside.
Data warehousing refers to the entry of vertex data and edge data into a graph database. Generally, vertex data and edge data are entered into the graph database sequentially, in a certain order. However, the warehousing operation may be interrupted, for example by a hardware failure, in which case the entire graph data may need to be warehoused again. Because part of the data was already entered into the graph database during the first warehousing, re-entering the data is likely to duplicate part of it. A scheme is therefore needed that prevents repeated entry of vertex data or edge data when multiple warehousing operations are performed on the graph data.
The approaches described in this section are not necessarily approaches that have been previously conceived or pursued. Unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section. Similarly, unless otherwise indicated, the problems mentioned in this section should not be considered as having been acknowledged in any prior art.
Disclosure of Invention
It would be advantageous to provide a mechanism that alleviates, mitigates or even eliminates one or more of the above-mentioned problems.
The present disclosure provides a method of graph data warehousing, wherein the graph data includes at least one vertex data table having a plurality of vertex data and an edge data table having a plurality of edge data, each vertex data including a vertex value and each edge data including two associated vertex values and at least one edge primary key, the method comprising: generating a unique vertex ID for each vertex data in the at least one vertex data table according to a preset vertex ID generation rule; entering each vertex data in the vertex data table into a graph database according to the corresponding unique vertex ID; for each edge data in the edge data table, replacing the two vertex values in the edge data with the corresponding unique vertex IDs; generating a unique edge ID for the edge data based on at least one edge primary key in the edge data according to a preset edge ID generation rule; and, for each edge data in the edge data table, entering the edge data into the graph database according to the unique edge ID and the two unique vertex IDs of the edge data.
According to an aspect of the present disclosure, there is provided an apparatus for graph data warehousing, wherein the graph data includes at least one vertex data table having a plurality of vertex data and an edge data table having a plurality of edge data, respectively, each vertex data including one vertex value, each edge data including associated two vertex values and at least one edge primary key, the apparatus comprising: a vertex ID generation unit configured to generate a unique vertex ID for each vertex data in the vertex data table according to a preset vertex ID generation rule; a vertex data entry unit configured to enter each vertex data in the vertex data table into the graph database according to the corresponding unique vertex ID; a replacement unit configured to replace, for each edge data in the edge data table, two vertex values in the edge data with corresponding unique vertex IDs, respectively; an edge ID generation unit configured to generate a unique edge ID for the edge data based on at least one edge primary key in the edge data according to a preset edge ID generation rule; and an edge data entry unit configured to, for each edge data in the edge data table, enter the edge data into the graph database according to the unique edge ID and the two unique vertex IDs of the edge data.
According to another aspect of the present disclosure, there is also provided a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the above method.
According to another aspect of the present disclosure, there is also provided a computer program product comprising a computer program, wherein the computer program realizes the above method when executed by a processor.
According to one or more embodiments of the present disclosure, a specific ID generation rule is set: a vertex ID is generated corresponding to the vertex value in the vertex data, and an edge ID is generated based on at least one edge primary key in the edge data, so each vertex and edge can have a specific unique vertex ID and unique edge ID. Therefore, even if the graph data is entered repeatedly, the same vertex data or edge data is entered with the same ID each time, so that data overwriting can be realized upon repeated entry.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description. These and other aspects of the disclosure will be apparent from and elucidated with reference to the embodiments described hereinafter.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments of the embodiments and, together with the description, serve to explain the exemplary implementations of the embodiments. The illustrated embodiments are for purposes of example only and do not limit the scope of the claims. Throughout the drawings, identical reference numbers designate similar, but not necessarily identical, elements.
FIG. 1 shows a schematic diagram of a data model for HBase according to one embodiment of the present disclosure;
FIG. 2 shows a schematic diagram of a data model of JanusGraph according to one embodiment of the present disclosure;
FIG. 3 shows a schematic diagram of a method for graph data warehousing, according to one embodiment of the present disclosure;
FIG. 4 shows a schematic diagram of a vertex data table, according to an embodiment of the present disclosure;
FIG. 5 shows a schematic diagram of another vertex data table, according to an embodiment of the present disclosure;
FIG. 6 illustrates a schematic diagram of an edge data table associated with the vertex data tables illustrated in FIGS. 4 and 5, according to one embodiment of the present disclosure;
FIG. 7 is a schematic diagram of the edge data table after replacing the two vertex values of FIG. 6;
FIG. 8 illustrates a flow diagram of a method of generating unique vertex IDs from vertex ID generation rules, according to one embodiment of the present disclosure;
FIG. 9 illustrates a flow diagram of a method of generating a unique edge ID from an edge ID generation rule according to one embodiment of the present disclosure;
FIG. 10 illustrates a flow diagram of a method of entering edge data according to one embodiment of the present disclosure;
FIG. 11 shows a schematic diagram of an apparatus for graph data warehousing, according to one embodiment of the present disclosure;
FIG. 12 shows a schematic diagram of an apparatus for graph data warehousing, according to another embodiment of the present disclosure;
FIG. 13 is a block diagram illustrating an exemplary computer device that can be applied to the exemplary embodiments.
Detailed Description
In the present disclosure, unless otherwise specified, the use of the terms "first", "second", etc. to describe various elements is not intended to limit the positional relationship, the timing relationship, or the importance relationship of the elements, and such terms are used only to distinguish one element from another. In some examples, a first element and a second element may refer to the same instance of the element, and in some cases, based on the context, they may also refer to different instances.
The terminology used in the description of the various examples in this disclosure is for the purpose of describing particular examples only and is not intended to be limiting. Unless the context clearly indicates otherwise, if the number of elements is not specifically limited, the elements may be one or more. As used herein, the term "plurality" means two or more, and the term "based on" should be interpreted as "based, at least in part, on". Further, the terms "and/or" and "at least one of ..." encompass any and all possible combinations of the listed items.
Before describing the method of the embodiment of the present disclosure in detail, a brief description is first made of the data structures of the storage engine HBase and the query engine JanusGraph. Fig. 1 shows a schematic diagram of a data model of HBase according to an exemplary embodiment of the present disclosure. As shown in fig. 1, under the HBase data model, the data table consists of rows (Row). Each row of data is identified by a key and is made up of a number of data cells (Cells). Each data cell is composed of a column (Column) and a value (Value), and is identified by its column within a given row.
Fig. 2 shows a schematic diagram of a data model of JanusGraph according to an exemplary embodiment of the present disclosure. As shown in fig. 2, similar to HBase, JanusGraph stores each piece of data as a row in the storage backend. The vertex ID (the ID assigned by JanusGraph to each vertex of the graph data) is the key identifying a row. Each edge and each attribute of the vertex is stored as a separate data cell in that row, which allows insertion and deletion. The query process for target data is therefore actually the process of finding the data cells of the edges that meet the requirements. The maximum number of data cells allowed per row by a particular storage backend is thus also the maximum vertex degree that JanusGraph can support with that backend. If the storage backend supports ordering of keys, the data will be sorted by vertex ID, and JanusGraph can assign vertex IDs so as to partition the graph efficiently.
It should be noted that the columns and values are formed by concatenating multiple elements. For example, the column of an edge's data cell is formed from the edge's label ID, direction, sort key, and so on, and the value part is formed from the edge's signature key and other properties (Properties). As can be seen from FIGS. 1 and 2, the data model of JanusGraph corresponds to the data model of HBase, and when data is written to the storage layer, JanusGraph can store the column and value of each of its data cells in the column and value of the corresponding HBase data cell, respectively.
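The row-oriented correspondence described above can be illustrated with a minimal sketch using plain Java collections; the class and method names below are illustrative stand-ins, not actual HBase or JanusGraph types.

```java
import java.util.Map;
import java.util.TreeMap;

// Conceptual sketch only: one row per vertex, keyed by the vertex ID; each data
// cell maps a column (e.g. a concatenation of the edge's label ID, direction and
// sort key) to a value (e.g. the edge's signature key and other properties).
public class RowModelSketch {
    // row key (vertex ID) -> ordered data cells (column -> value)
    static final Map<Long, TreeMap<String, String>> store = new TreeMap<>();

    static void putCell(long vertexId, String column, String value) {
        store.computeIfAbsent(vertexId, k -> new TreeMap<>()).put(column, value);
    }

    public static void main(String[] args) {
        putCell(113L, "transfer|OUT|2020.01.01", "amount=100");
        putCell(113L, "age", "30");
        System.out.println(store); // {113={age=30, transfer|OUT|2020.01.01=amount=100}}
    }
}
```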
In the related art, when graph data is put in storage, the system randomly allocates a vertex ID or an edge ID to each vertex or edge in JanusGraph, and then stores each piece of data in the graph database according to that vertex ID or edge ID. If the warehousing operation is interrupted, the graph data needs to be re-entered. During re-warehousing, the system again randomly allocates a vertex ID or an edge ID to each vertex or edge, and then stores the data in the graph database according to the newly allocated IDs. However, a newly assigned ID may differ from the first assigned ID, so the same data is stored at a different location in the graph database when re-warehoused, resulting in repeated entry of data.
Exemplary embodiments of the present disclosure are described in detail below with reference to the accompanying drawings.
FIG. 3 shows a schematic diagram of a method 300 for graph data warehousing according to one embodiment of the present disclosure, where the graph data includes two parts: a vertex data table and an edge data table. The vertex data table has a plurality of vertex data and the edge data table has a plurality of edge data; each vertex data comprises a vertex value, and each edge data comprises two associated vertex values and at least one edge primary key.
Before describing the method 300 shown in FIG. 3, a brief description of the above-mentioned vertex data tables and edge data table is provided. In some embodiments, the graph data may include two vertex data tables and an edge data table associated with the two vertex data tables. FIG. 4 shows a schematic diagram of a vertex data table, according to an embodiment of the present disclosure. FIG. 5 shows a schematic diagram of another vertex data table, according to an embodiment of the present disclosure. FIG. 6 shows a schematic diagram of an edge data table associated with the vertex data tables shown in FIGS. 4 and 5, according to an embodiment of the present disclosure. FIG. 7 shows a schematic diagram of the edge data table after the two vertex values in FIG. 6 have been replaced. The vertex data table shown in FIG. 4 is the source vertex data table in the graph data for a bank transfer service, and it contains data for multiple transfer subjects. As shown in FIG. 4, each vertex data includes a vertex value, which may be a vertex primary key uniquely identifying the vertex data, such as an account name, and a plurality of attributes, which may include gender, age, and so on. The vertex data table shown in FIG. 5 is the target vertex data table in the graph data for the bank transfer service, and it contains data for a plurality of transfer objects. As shown in FIG. 5, each vertex data includes a vertex value, which may be a vertex primary key uniquely identifying the vertex data, such as a transfer card number, and a plurality of attributes, which may include the institution to which the card number belongs, and so on. FIG. 6 is an edge data table associated with the vertex data tables shown in FIGS. 4 and 5, and includes source vertex values, target vertex values, and a plurality of attributes, where the source vertex values appear in the source vertex table of FIG. 4, the target vertex values appear in the target vertex table of FIG. 5, and the attributes include time, amount, and so on. Each edge data records an interaction between a source vertex and a target vertex; for example, the first row of edge data in FIG. 6 indicates that "Zhang San made a transfer to card number 2 on 2020.01.01".
The method 300 of graph data warehousing shown in FIG. 3 includes:
step 301, generating a unique vertex ID for each vertex data in two vertex data tables according to a preset vertex ID generation rule;
step 302, recording each vertex data in the vertex data table into a graph database according to the corresponding unique vertex ID;
step 303, for each piece of edge data in the edge data table, replacing two vertex values in the piece of edge data with corresponding unique vertex IDs respectively;
step 304, generating a unique edge ID for the edge data based on at least one edge primary key in the edge data according to a preset edge ID generation rule; and
step 305, for each edge data in the edge data table, entering the edge data into the graph database according to the unique edge ID and the two unique vertex IDs of the edge data.
The method of the present embodiment sets specific ID generation rules: the generated vertex ID corresponds to the vertex value in the vertex data, and the edge ID is generated based on at least one edge primary key in the edge data, so each vertex and edge can have a specific unique vertex ID and unique edge ID. Therefore, even if the graph data is entered repeatedly, the ID generated each time is the same for the same vertex data or edge data, so that data overwriting can be realized upon repeated entry.
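For orientation, the following Java sketch strings steps 301 to 305 together. The two generate... methods are deliberately simplified placeholders (the actual vertex ID and edge ID generation rules are described below with reference to FIGS. 8 and 9), and all class and method names are illustrative assumptions rather than any real graph-database API.

```java
import java.util.*;

// Illustrative outline of steps 301-305; the ID rules here are placeholders only.
public class WarehouseSketch {
    static long generateVertexId(String vertexValue) {              // step 301 (placeholder rule)
        return Math.abs((long) vertexValue.hashCode());
    }
    static long generateEdgeId(String edgePrimaryKey, long srcId) { // step 304 (placeholder rule)
        return Math.abs((long) edgePrimaryKey.hashCode());
    }

    public static void main(String[] args) {
        // vertex table: vertex value -> attributes; edge table rows: [source value, target value, edge primary key]
        Map<String, String> vertexTable = Map.of("Zhang San", "gender=M;age=30", "card number 2", "institution=bank A");
        List<String[]> edgeTable = List.of(new String[]{"Zhang San", "card number 2", "serial-0001"});

        Map<String, Long> vertexIds = new HashMap<>();
        vertexTable.forEach((value, attrs) -> {
            long id = generateVertexId(value);                       // step 301
            vertexIds.put(value, id);
            System.out.println("addVertex " + id + " " + attrs);     // step 302: entry keyed by unique vertex ID
        });
        for (String[] e : edgeTable) {
            long src = vertexIds.get(e[0]);                          // step 303: replace vertex values with IDs
            long dst = vertexIds.get(e[1]);
            long edgeId = generateEdgeId(e[2], src);                 // step 304
            System.out.println("addEdge " + edgeId + " " + src + "->" + dst); // step 305
        }
    }
}
```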
The specific operation of steps 301 to 305 will be described in detail below with reference to FIGS. 4 to 7. When the source vertex table is put in storage, a unique vertex ID is generated for each vertex data according to the vertex ID generation rule. As shown in FIG. 4, for example, a unique vertex ID 113 is generated for the vertex data whose vertex value is Zhang San, and a unique vertex ID 114 is generated for the vertex data whose vertex value is Li Si. The vertex value may be a vertex primary key uniquely identifying the vertex data, and the generated unique vertex IDs correspond one to one with the vertex values, so that different vertex values correspond to different unique vertex IDs. If the warehousing operation is interrupted and the graph data is entered again, each vertex data is still identified by the unique vertex ID generated according to step 301, so the unique vertex ID during the second warehousing is the same as the unique vertex ID during the first warehousing. For simplicity, the unique vertex IDs are shown as 3-digit numbers in the embodiments of FIGS. 4 to 7, but it will be appreciated that in other embodiments the unique vertex IDs may have more than 3 digits; in general, the unique vertex IDs and the unique edge IDs mentioned later are both 64-bit long integers. Unique vertex IDs may be generated for each vertex data in the target vertex table shown in FIG. 5 in the same way, and the detailed operations are not repeated here.
In step 302, each vertex data in the vertex data table is entered into the graph database based on the unique vertex ID obtained in step 301. Since the unique vertex ID of the same vertex data remains unchanged on every warehousing, the same vertex data is always stored at the same specific location in the graph database across multiple warehousing operations, and is therefore simply overwritten. As described above, each vertex data includes a plurality of attributes, and each attribute includes an attribute ID and an attribute value. When each vertex data in the vertex data table is entered into the graph database, the attribute value of each attribute is entered according to its attribute ID, so that each attribute of the same vertex data can be accurately overwritten upon repeated entry.
In step 303, a join-in operation is performed on the edge data table: the two vertex values in each edge data (a source vertex value and a target vertex value) are replaced with the unique vertex IDs that correspond to those vertex values in the vertex data tables, so that the edge data is associated with the unique vertex IDs of the two vertex data tables. As shown in FIG. 7, Zhang San in FIG. 6 is replaced with the corresponding unique vertex ID 113 from FIG. 4, Li Si is replaced with the corresponding unique vertex ID 114 from FIG. 4, and so on; card number 2 in FIG. 6 is replaced with the corresponding unique vertex ID 222 from FIG. 5, card number 1 is replaced with the corresponding unique vertex ID 221 from FIG. 5, and so on.
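A minimal sketch of this join-in step, assuming the vertex-value-to-ID mappings produced in step 301 are available as in-memory maps (the data layout below is illustrative only):

```java
import java.util.*;

// Replace each edge's source/target vertex values with the unique vertex IDs
// generated from the corresponding vertex data tables (step 303).
public class JoinInSketch {
    public static void main(String[] args) {
        Map<String, Long> sourceVertexIds = Map.of("Zhang San", 113L, "Li Si", 114L);
        Map<String, Long> targetVertexIds = Map.of("card number 1", 221L, "card number 2", 222L);

        // each edge row: [source vertex value, target vertex value, time, amount]
        List<String[]> edgeTable = List.of(
                new String[]{"Zhang San", "card number 2", "2020.01.01", "100"});

        for (String[] edge : edgeTable) {
            edge[0] = String.valueOf(sourceVertexIds.get(edge[0])); // source value -> unique vertex ID
            edge[1] = String.valueOf(targetVertexIds.get(edge[1])); // target value -> unique vertex ID
        }
        System.out.println(Arrays.toString(edgeTable.get(0))); // [113, 222, 2020.01.01, 100]
    }
}
```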
In step 304, a unique edge ID is generated for the edge data based on at least one edge primary key in the edge data according to the preset edge ID generation rule. The edge primary key uniquely identifies the edge data, the edge primary keys of different edge data differ from each other, and therefore the edge ID generated based on at least one edge primary key of the edge data is also unique. Consequently, entering the edge data according to the unique edge ID in step 305 prevents repeated entry of edge data across multiple warehousing operations. If the warehousing operation is interrupted and the graph data is entered again, a unique edge ID is generated for each edge data according to the edge ID generation rule of step 304, so the unique edge ID generated during the second warehousing is the same as that generated during the first warehousing. To prevent erroneous entry of data, the two unique vertex IDs of the edge data must also be referenced at the same time when the edge data is entered. Moreover, in the rare case where two edge data happen to have identical unique edge IDs, referencing the two unique vertex IDs simultaneously further ensures the uniqueness of each edge data, because even if the unique edge IDs of two edge data are identical, their respective unique vertex IDs differ, which further prevents duplicate entry of data.
FIG. 8 illustrates a flow diagram of a method 800 for generating unique vertex IDs from vertex ID generation rules, according to one embodiment of the present disclosure. To increase the speed of assigning unique vertex IDs, the method 800 groups a plurality of vertex data in a vertex data table into a plurality of groups, each group containing at least one vertex data, and generates a unique vertex ID for each vertex data in parallel for each group of vertex data. For each set of vertex data, generating a unique vertex ID for each vertex data in parallel comprises:
step 801, randomly assigning a value, different for each vertex data in the group, as the vertex ID partition portion of the unique vertex ID of the vertex data;
step 802, determining, among the already generated vertex IDs, the most recently generated vertex ID whose vertex ID partition portion has the same value as that of the vertex data;
step 803, obtaining a first value of the vertex ID count portion of that most recently generated unique vertex ID; and
step 804, adding a preset value to the first value to obtain a second value, which serves as the vertex ID count portion of the unique vertex ID of the vertex data.
In the present embodiment, the unique vertex ID has a format such as [0 | count | partition | 000], which mainly comprises a vertex ID header (the leading 0), a vertex ID count portion (count in the above format), a vertex ID partition portion (partition in the above format), and a vertex type portion (the last three bits). partition represents the virtual partition ID, whose number of bits N is determined by the size of the backend HBase cluster. Each partition maintains a count pool, and each count pool ranges over [0, 2^(64-1-3-N) - 1].
In step 801, unique vertex IDs are generated for the vertex data group by group, and a value is randomly selected from the range [0, 2^N - 1] as the vertex ID partition portion. For example, a certain group (hereinafter referred to as group A) has 5 vertex data; these 5 vertex data are randomly assigned mutually different vertex ID partition portions (partition values), and their unique vertex IDs are generated in parallel in the different partitions.
In steps 802 and 803, when the vertex ID count portion is assigned to a vertex data in group A (hereinafter referred to as vertex data 1), the count value of the unique vertex ID of vertex data 1 may be obtained by determining the count value of the most recently generated unique vertex ID having the same partition value and then applying a +1 operation to that count value. In other words, for all unique vertex IDs assigned the same partition value, the count values increase sequentially in the order in which the unique vertex IDs are generated. Although in the above embodiment the count value of the last unique vertex ID is incremented by 1 to obtain the new count value, it is understood that in other embodiments other increments, such as 2, 3, etc., may be used to obtain the new count value.
The finally generated vertex ID may be expressed as vertexId = ((count << N) + partitionId) << 3 | 0L. The method of this embodiment is particularly suitable for generating vertex IDs for vertex data in batches: different partition values are assigned to the vertex data of the same batch (i.e., the same group in this embodiment), which guarantees that the unique vertex IDs of the vertex data within each group differ from one another; and for vertex data of different batches, even if multiple unique vertex IDs share the same partition value, those unique vertex IDs still differ from one another because the count value is allocated in a steadily increasing manner.
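A compact sketch of this vertex ID layout under the stated assumptions (N partition bits, three trailing type bits, and one incrementing count pool per partition); the counter map below is an illustrative stand-in for the per-partition count pools, not JanusGraph's actual ID allocator.

```java
import java.util.HashMap;
import java.util.Map;

// Vertex ID layout [0 | count | partition | 000]: per-partition counts increase
// monotonically, so repeated generation in the same order yields the same IDs.
public class VertexIdSketch {
    static final int N = 5;                                        // partition bits (depends on HBase cluster size)
    static final Map<Long, Long> countPools = new HashMap<>();     // partition value -> last assigned count

    static long nextVertexId(long partitionId) {
        long count = countPools.merge(partitionId, 1L, Long::sum); // previous count + 1
        return ((count << N) + partitionId) << 3 | 0L;             // vertexId = ((count << N) + partition) << 3 | type
    }

    public static void main(String[] args) {
        System.out.println(nextVertexId(7));  // first ID in partition 7 -> 312
        System.out.println(nextVertexId(7));  // count increments within the same partition
        System.out.println(nextVertexId(2));  // a different partition keeps its own count pool
    }
}
```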
FIG. 9 illustrates a flow diagram of a method 900 of generating a unique edge ID according to an edge ID generation rule, according to one embodiment of the present disclosure. In the present embodiment, the unique edge ID is similar to the unique vertex ID and has a format like [0 | count | partition], which mainly comprises an edge ID header 0, an edge ID count portion (count in the above format), and an edge ID partition portion (partition in the above format).
As shown in fig. 9, the method 900 includes:
step 901, acquiring corresponding vertex data in a vertex data table based on the source vertex value of the edge data, wherein a vertex ID partition part of a unique vertex ID of the vertex data comprises a third numerical value;
step 902, using the third value as the edge ID partition part of the unique edge ID of the edge data;
step 903, hashing at least one edge primary key of the edge data to obtain a hash value;
step 904, generating an edge ID count portion of the unique edge ID from the hash value.
As shown in fig. 6, each edge data contains a source vertex value and a target vertex value associated with the edge data. In step 901, the partition value of the unique vertex ID corresponding to the source vertex value may be obtained, and in step 902, the partition value of the unique vertex ID may be used as the partition value of the unique edge ID of the edge data, that is, the partition value of the unique edge ID of each edge data is the same as the partition value of the unique vertex ID of the source vertex included in the edge data. Taking the first piece of data in fig. 7 as an example, the partition value of the unique edge ID 001 of the edge data is the same as the partition value of the unique vertex ID 113 of the source vertex.
In steps 903 and 904, at least one edge primary key of the edge data is hashed, and the resulting hash value is used as the edge ID count portion (count value portion) of the unique edge ID of the edge data. The edge primary key may be some attribute that uniquely identifies the edge data, such as a transfer summary, a serial number, etc. Hashing (Hash) is the transformation of an input of arbitrary length (also called a pre-image) into a fixed-length output, the hash value, by a hashing algorithm. This transformation is a compression mapping: the space of hash values is usually much smaller than the space of inputs, and different inputs may hash to the same output, so it is not possible to determine a unique input value from a hash value. In short, a hash function compresses a message of arbitrary length into a message digest of a fixed length. In this embodiment, the edge primary key (which may be in text format or numeric format) is converted by a hash function into a number with a fixed number of bits, and this number is used as the count value portion of the unique edge ID representing the edge data. It will be appreciated that the collision probability of the hash determines the repetition rate of the edge IDs. In the entire graph data, the probability that a unique edge ID generated by the hash method is repeated is P = (1/(V × (V - 1))) × P(hash collision), where V is the number of vertex data. It can be seen that this probability is approximately inversely proportional to V², so the probability of edge ID duplication becomes even lower as the number of vertices in the graph data increases.
The finally generated unique edge ID may be expressed as: edgeId = (((Long.MAX_VALUE >> N) & ekey.hashCode()) << N) + srcPartitionId, where ekey.hashCode() is the hash value of the edge primary key, (Long.MAX_VALUE >> N) masks the hash value to the bits reserved for the count portion, and srcPartitionId is the partition value of the unique vertex ID corresponding to the source vertex of the edge data. Combining the two parts yields the complete unique edge ID.
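A sketch of this edge ID rule under the same assumptions as the vertex ID sketch above (N = 5 partition bits); recovering the source partition bits from the vertex ID layout [0 | count | partition | 000] is shown for illustration, and the variable names are assumptions.

```java
// Edge ID: the count portion is a masked hash of the edge primary key(s), and the
// partition portion is copied from the source vertex's unique vertex ID.
public class EdgeIdSketch {
    static final int N = 5; // partition bits, same N as for vertex IDs

    static long edgeId(String edgePrimaryKey, long srcPartitionId) {
        long countPart = (Long.MAX_VALUE >> N) & edgePrimaryKey.hashCode(); // hash -> edge ID count portion
        return (countPart << N) + srcPartitionId;                           // append edge ID partition portion
    }

    public static void main(String[] args) {
        long srcVertexId = 312L;                                    // e.g. the first ID in partition 7 from the vertex ID sketch
        long srcPartitionId = (srcVertexId >> 3) & ((1L << N) - 1); // partition bits of [0|count|partition|000] -> 7
        System.out.println(edgeId("serial-0001", srcPartitionId));
    }
}
```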
In this embodiment, the edge primary key uniquely identifies the edge data, and the edge primary keys of different edge data differ from each other, so the edge ID generated by hashing the edge primary key is also unique. Entering the edge data according to the unique edge ID prevents repeated entry of edge data across multiple warehousing operations.
It is understood that steps 901, 902, 903 and 904 above may be executed in a different order, i.e., the edge ID count portion may be generated first and the edge ID partition portion generated afterwards, which does not affect the implementation of the present embodiment.
FIG. 10 shows a flowchart of a method 1000 of entering edge data according to an embodiment of the present disclosure. As shown in FIG. 10, the method 1000 adds a decision step to the method shown in FIG. 3. Specifically, the method 1000 includes:
step 1001, generating a unique edge ID for the edge data based on at least one edge primary key in the edge data according to a preset edge ID generation rule;
step 1002, judging whether two vertex data respectively corresponding to two vertex IDs of the edge data exist in a graph database;
step 1003, if the judgment result of the step 1002 is yes, recording the edge data into a graph database according to the unique edge ID and the two unique vertex IDs of the edge data;
and step 1004, if the judgment result of step 1002 is negative, not entering the edge data.
Step 1001 is similar to step 304 of the method 300 and is not described again here. Before the edge data is entered into the graph database, it is first determined in step 1002 whether the two vertex data corresponding to the two vertex IDs of the edge data are already present in the graph database. If so, the edge data is considered reliable and is entered into the graph database in step 1003. If the vertex data corresponding to one of the vertex IDs, or to both vertex IDs, does not exist in the graph database, the edge data is erroneous or redundant, and in step 1004 the edge data is not entered. Taking FIG. 7 as an example, when the first piece of edge data is entered, it is first checked whether the source vertex ID 113 and the target vertex ID 222 are present in the vertex data tables shown in FIGS. 4 and 5 (since the vertex data tables have already been put in storage by the time the edge data table is entered, it is possible to query whether both vertex IDs are present in the graph database). When it is determined that the source vertex ID 113 corresponds to the vertex data whose vertex value is Zhang San and the target vertex ID 222 corresponds to the vertex data whose vertex value is card number 2, the edge data is entered into the graph database. If at least one of the two vertex IDs is not present in the vertex data tables shown in FIGS. 4 and 5, entry of the edge data is prohibited.
When entering vertex data and edge data, the operations may be performed using interfaces of the graph database. For example, when vertex data is entered, a vertex-adding interface (AddVertex interface) may be used and the unique vertex ID may be directly specified for data insertion; when edge data is entered, an edge-adding interface (AddEdge interface) may be used and the unique edge ID may be directly specified for data insertion.
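The following sketch combines the method-1000 check with such interfaces; the GraphDb interface and its in-memory stand-in are illustrative assumptions and do not correspond to an actual JanusGraph API.

```java
import java.util.*;

// An edge is entered only if both of its unique vertex IDs already exist in the
// graph database (steps 1002-1004); otherwise it is treated as erroneous/redundant.
public class EdgeEntrySketch {
    interface GraphDb {
        void addVertex(long vertexId, Map<String, String> attrs);                      // "AddVertex" interface stand-in
        boolean containsVertex(long vertexId);
        void addEdge(long edgeId, long srcId, long dstId, Map<String, String> attrs);  // "AddEdge" interface stand-in
    }

    static class InMemoryDb implements GraphDb {                                       // minimal stand-in used to exercise the check
        final Map<Long, Map<String, String>> vertices = new HashMap<>();
        final Map<Long, String> edges = new HashMap<>();
        public void addVertex(long id, Map<String, String> attrs) { vertices.put(id, attrs); }
        public boolean containsVertex(long id) { return vertices.containsKey(id); }
        public void addEdge(long id, long src, long dst, Map<String, String> attrs) { edges.put(id, src + "->" + dst); }
    }

    static void enterEdge(GraphDb db, long edgeId, long srcId, long dstId, Map<String, String> attrs) {
        if (db.containsVertex(srcId) && db.containsVertex(dstId)) {
            db.addEdge(edgeId, srcId, dstId, attrs);                                   // step 1003: both endpoints exist
        }                                                                              // step 1004: otherwise do not enter the edge
    }

    public static void main(String[] args) {
        InMemoryDb db = new InMemoryDb();
        db.addVertex(113L, Map.of("gender", "M"));
        db.addVertex(222L, Map.of("institution", "bank A"));
        enterEdge(db, 1L, 113L, 222L, Map.of("amount", "100"));   // entered
        enterEdge(db, 2L, 113L, 999L, Map.of("amount", "50"));    // skipped: vertex 999 is absent
        System.out.println(db.edges); // {1=113->222}
    }
}
```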
According to another aspect of the present disclosure, there is also provided an apparatus for graph data warehousing, and fig. 11 shows a schematic diagram of an apparatus 1100 for graph data warehousing according to an embodiment of the present disclosure. The graph data includes at least one vertex data table having a plurality of vertex data and an edge data table having a plurality of edge data, respectively, each vertex data including a vertex value, each edge data including associated two vertex values and at least one edge primary key, as shown in fig. 11, the apparatus 1100 includes: a vertex ID generation unit 1110 configured to generate a unique vertex ID for each vertex data in the at least one vertex data table according to a preset vertex ID generation rule; a vertex data entry unit 1120 configured to enter each vertex data in the vertex data table into the graph database according to the corresponding unique vertex ID; a replacing unit 1130 configured to, for each edge data in the edge data table, replace two vertex values in the edge data with corresponding unique vertex IDs, respectively; an edge ID generation unit 1140 configured to generate a unique edge ID for the edge data based on at least one edge primary key in the edge data according to a preset edge ID generation rule; and an edge data entry unit 1150 configured to, for each edge data in the edge data table, enter the edge data into the graph database according to the unique edge ID and the two unique vertex IDs of the edge data.
Fig. 12 shows a schematic diagram of an apparatus 1200 for graph data warehousing, according to another embodiment of the present disclosure. As shown in fig. 12, in some embodiments, the vertex ID generation unit 1210 includes: a grouping module 1211 configured to group a plurality of vertex data in the vertex data table into a plurality of groups, each group including at least one vertex data; wherein the vertex ID generation unit 1210 is further configured to generate, for each set of vertex data, a unique vertex ID for each vertex data in parallel.
In some embodiments, the unique vertex ID includes a vertex ID partition portion, and the vertex ID generation unit 1210 further includes: a first ID generation module 1212 configured to randomly assign, to each vertex data in the set of vertex data, a mutually different value as the vertex ID partition portion of the unique vertex ID of the vertex data.
In some embodiments, the unique vertex ID further includes a vertex ID count portion, and the vertex ID generation unit 1210 further includes: a second ID generation module 1213 configured to, for each vertex data in the set of vertex data, after the mutually different values have been randomly assigned as the vertex ID partition portions of the unique vertex IDs, determine, among the generated vertex IDs, the most recently generated vertex ID whose vertex ID partition portion has the same value as that of the vertex data; obtain a first value of the vertex ID count portion of that most recently generated unique vertex ID; and add a preset value to the first value to obtain a second value, which serves as the vertex ID count portion of the unique vertex ID of the vertex data.
In some embodiments, the unique edge ID includes an edge ID count portion, and the edge ID generation unit 1240 includes: a third ID generation module 1241 configured to hash at least one edge primary key of the edge data to obtain a hash value, and to generate an edge ID count portion of the unique edge ID from the hash value.
In some embodiments, the unique vertex ID comprises a vertex ID partition portion, the unique edge ID further comprises an edge ID partition portion, and the two vertex values associated with the edge data comprise a source vertex value and a target vertex value; the edge ID generation unit 1240 further comprises: a fourth ID generation module 1242 configured to obtain, based on the source vertex value of the edge data, the corresponding vertex data in the vertex data table, the vertex ID partition portion of the unique vertex ID of that vertex data comprising a third value, and to take the third value as the edge ID partition portion of the unique edge ID of the edge data.
In some embodiments, the apparatus 1200 further comprises: a determination unit 1260 configured to determine, for each edge data, whether two vertex data respectively corresponding to the two vertex IDs of the edge data exist in the graph database, wherein the edge data entry unit is further configured to enter the edge data into the graph database, according to the unique edge ID and the two unique vertex IDs of the edge data, in response to determining that both vertex data exist in the graph database.
In some embodiments, the vertex data entry unit 1220 is further configured to: each vertex data is entered into a graph database using a vertex add interface of the graph database.
In some embodiments, the vertex data entry unit 1220 is further configured to: for each vertex data, an attribute value of each attribute is entered according to an attribute ID.
In some embodiments, the edge data entry unit 1250 is further configured to: enter each edge data into the graph database using an edge-adding interface of the graph database.
It should be understood that the various modules of the apparatus 1100 shown in FIG. 11 may correspond to the various steps in the method 300 described with reference to FIG. 3, and the various modules of the apparatus 1200 shown in FIG. 12 may correspond to the various steps in the methods 800 to 1000 described with reference to FIGS. 8 to 10. Thus, the operations, features and advantages described above with respect to the method 300 are equally applicable to the apparatus 1100 and the modules included therein, and the operations, features and advantages described above with respect to the methods 800 to 1000 are equally applicable to the apparatus 1200 and the modules included therein. Certain operations, features and advantages may not be described in detail herein for the sake of brevity.
Although specific functionality is discussed above with reference to particular modules, it should be noted that the functionality of the various modules discussed herein may be divided into multiple modules and/or at least some of the functionality of multiple modules may be combined into a single module. Performing an action by a particular module discussed herein includes the particular module itself performing the action, or alternatively the particular module invoking or otherwise accessing another component or module that performs the action (or performs the action in conjunction with the particular module). Thus, a particular module that performs an action can include the particular module that performs the action itself and/or another module that the particular module invokes or otherwise accesses that performs the action.
It should also be appreciated that various techniques may be described herein in the general context of software, hardware elements, or program modules. The various modules described above with respect to fig. 11 and 12 may be implemented in hardware or in hardware in combination with software and/or firmware. For example, the modules may be implemented as computer program code/instructions configured to be executed in one or more processors and stored in a computer-readable storage medium. Alternatively, the modules may be implemented as hardware logic/circuitry. For example, in some embodiments, one or more of the first ID generation module 1212, the second ID generation module 1213, the third ID generation module 1241, and the fourth ID generation module 1242 may be implemented together in a System on Chip (SoC). The SoC may include an integrated circuit chip (which includes one or more components of a Processor (e.g., a Central Processing Unit (CPU), microcontroller, microprocessor, Digital Signal Processor (DSP), etc.), memory, one or more communication interfaces, and/or other circuitry), and may optionally execute received program code and/or include embedded firmware to perform functions.
According to an aspect of the disclosure, a computer device is provided that includes a memory, a processor, and a computer program stored on the memory. The processor is configured to execute the computer program to implement the steps of any of the method embodiments described above.
According to an aspect of the present disclosure, a non-transitory computer-readable storage medium is provided, having stored thereon a computer program which, when executed by a processor, implements the steps of any of the method embodiments described above.
According to an aspect of the present disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, performs the steps of any of the method embodiments described above.
Illustrative examples of such computer devices, non-transitory computer-readable storage media, and computer program products are described below in connection with fig. 13.
Fig. 13 shows an example configuration of a computer device 1300 that can be used to implement the methods described herein. Computer device 1300 can be a variety of different types of devices, such as a server of a service provider, a device associated with a client (e.g., a client device), a system on a chip, and/or any other suitable computer device or computing system. Examples of computer device 1300 include, but are not limited to: a desktop computer, a server computer, a notebook or netbook computer, a mobile device (e.g., a tablet, a cellular or other wireless telephone (e.g., a smartphone), a notepad computer, a mobile station), a wearable device (e.g., glasses, a watch), an entertainment device (e.g., an entertainment appliance, a set-top box communicatively coupled to a display device, a gaming console), a television or other display device, an automotive computer, and so forth. Thus, the computer device 1300 may range from a full resource device with substantial memory and processor resources (e.g., personal computers, game consoles) to a low-resource device with limited memory and/or processing resources (e.g., traditional set-top boxes, hand-held game consoles).
Computer device 1300 may include at least one processor 1302, memory 1304, communication interface(s) 1306, display device 1308, other input/output (I/O) devices 1310, and one or more mass storage devices 1312, which can communicate with each other, such as through system bus 1314 or other suitable connection.
Processor 1302 may be a single processing unit or multiple processing units, all of which may include a single or multiple computing units or multiple cores. Processor 1302 may be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitry, and/or any devices that manipulate signals based on operational instructions. The processor 1302 may be configured to retrieve and execute computer readable instructions, such as program code for an operating system 1316, program code for an application program 1318, program code for other programs 1320, and the like, stored in the memory 1304, mass storage device 1312, or other computer readable medium, among other capabilities.
Memory 1304 and mass storage device 1312 are examples of computer readable storage media for storing instructions that are executed by processor 1302 to perform the various functions described above. By way of example, the memory 1304 may generally include both volatile and non-volatile memory (e.g., RAM, ROM, and the like). In addition, mass storage devices 1312 may generally include hard disk drives, solid state drives, removable media, including external and removable drives, memory cards, flash memory, floppy disks, optical disks (e.g., CDs, DVDs), storage arrays, network attached storage, storage area networks, and so forth. Memory 1304 and mass storage device 1312 may both be collectively referred to herein as memory or computer-readable storage medium, and may be non-transitory media capable of storing computer-readable, processor-executable program instructions as computer program code, which may be executed by processor 1302 as a particular machine configured to implement the operations and functions described in the examples herein.
A number of program modules may be stored on the mass storage device 1312. These programs include an operating system 1316, one or more application programs 1318, other programs 1320, and program data 1322, and may be loaded into memory 1304 for execution. Examples of such applications or program modules may include, for instance, computer program logic (e.g., computer program code or instructions) for implementing the following components/functions: a first ID generation module 1212, a second ID generation module 1213, a third ID generation module 1241, and a fourth ID generation module 1242, a vertex data entry unit 1220, an edge data entry unit 1250, method 300, and/or methods 800, 900, and/or further embodiments described herein.
Although illustrated in FIG. 13 as being stored in memory 1304 of computer device 1300, modules 1316, 1318, 1320, and 1322, or portions thereof, may be implemented using any form of computer-readable media that is accessible by computer device 1300. As used herein, "computer-readable media" includes at least two types of computer-readable media, namely computer storage media and communication media.
Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium which can be used to store information for access by a computer device.
In contrast, communication media may embody computer readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave or other transport mechanism. Computer storage media, as defined herein, does not include communication media.
Computer device 1300 may also include one or more communication interfaces 1306 for exchanging data with other devices, such as over a network, direct connection, and the like, as discussed above. Such communication interfaces may be one or more of the following: any type of network interface (e.g., a network interface card (NIC)), a wired or wireless (such as IEEE 802.11 wireless LAN (WLAN)) interface, a Worldwide Interoperability for Microwave Access (Wi-MAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth™ interface, a Near Field Communication (NFC) interface, etc. The communication interface 1306 may facilitate communications within a variety of networks and protocol types, including wired networks (e.g., LAN, cable, etc.) and wireless networks (e.g., WLAN, cellular, satellite, etc.), the Internet, and so forth. The communication interface 1306 may also provide for communication with external storage devices (not shown), such as in storage arrays, network attached storage, storage area networks, and so forth.
In some examples, a display device 1308, such as a monitor, may be included for displaying information and images to a user. Other I/O devices 1310 may be devices that receive various inputs from a user and provide various outputs to the user, and may include touch input devices, gesture input devices, cameras, keyboards, remote controls, mice, printers, audio input/output devices, and so forth.
While the disclosure has been illustrated and described in detail in the drawings and foregoing description, such illustration and description are to be considered illustrative and exemplary and not restrictive; the present disclosure is not limited to the disclosed embodiments. Variations to the disclosed embodiments can be understood and effected by those skilled in the art in practicing the claimed subject matter, from a study of the drawings, the disclosure, and the appended claims. In the claims, the word "comprising" does not exclude other elements or steps than those listed and the words "a" or "an" do not exclude a plurality. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.

Claims (16)

1. A method for graph data warehousing, wherein the graph data includes at least one vertex data table having a plurality of vertex data and an edge data table having a plurality of edge data, each of the vertex data including a vertex value and each of the edge data including two associated vertex values and at least one edge primary key, the method comprising:
generating a unique vertex ID for each vertex data in the at least one vertex data table according to a preset vertex ID generation rule, wherein the unique vertex ID comprises a vertex ID partition part and a vertex ID counting part;
recording each vertex data in the vertex data table into a graph database according to the corresponding unique vertex ID;
for each edge data in the edge data table,
replacing two vertex values in the edge data with corresponding unique vertex IDs respectively;
generating a unique edge ID for the edge data based on at least one edge primary key in the edge data according to a preset edge ID generation rule; and
for each piece of edge data in the edge data table, entering the piece of edge data into the graph database according to a unique edge ID and two unique vertex IDs of the piece of edge data, wherein generating a unique vertex ID for each vertex data in the at least one vertex data table according to a preset vertex ID generation rule includes: for each of the vertex data tables,
dividing a plurality of vertex data in the vertex data table into a plurality of groups, wherein each group comprises at least one vertex data;
generating, in parallel, for each set of vertex data, a unique vertex ID for each vertex data, wherein, for each vertex data in the set of vertex data,
randomly assigning mutually different values as the vertex ID partition portions of the unique vertex IDs of the vertex data;
determining a last generated unique vertex ID which is the same as a vertex ID partition part value of the vertex data in the generated unique vertex IDs;
obtaining a first value of a vertex ID count portion of the last generated unique vertex ID; and
and accumulating the first numerical value by a preset numerical value to obtain a second numerical value which is used as a vertex ID counting part of the unique vertex ID of the vertex data.
2. The method of claim 1, wherein the unique edge ID includes an edge ID count portion, and generating a unique edge ID for the edge data based on at least one edge primary key in the edge data according to an edge ID generation rule comprises:
hashing at least one edge main key of the edge data to obtain a hash value; and
an edge ID count portion of the unique edge ID is generated from the hash value.
3. The method of claim 2, wherein the unique vertex ID comprises a vertex ID partition portion, the unique edge ID further comprises an edge ID partition portion, the two vertex values associated with the edge data comprise a source vertex value and a target vertex value, and generating the unique edge ID for the edge data based on at least one edge primary key in the edge data according to the edge ID generation rule further comprises:
acquiring, based on the source vertex value of the edge data, the corresponding vertex data in the vertex data table, wherein the vertex ID partition portion of the unique vertex ID of that vertex data comprises a third value; and
using the third value as the edge ID partition portion of the unique edge ID of the edge data.
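A minimal sketch of the edge ID generation rule in claims 2 and 3: the edge ID count portion is derived from a hash over the edge primary key(s), and the edge ID partition portion is copied from the partition portion of the source vertex's unique vertex ID (the "third value"). The choice of MD5, the key separator and the 16-bit/48-bit split (matching the vertex ID sketch above) are assumptions for illustration only.

```python
import hashlib

COUNT_BITS = 48  # assumed width of the edge ID count portion

def generate_edge_id(edge_primary_keys, source_vertex_id):
    # hash the edge primary key(s) to obtain the edge ID count portion
    digest = hashlib.md5("|".join(map(str, edge_primary_keys)).encode("utf-8")).digest()
    count = int.from_bytes(digest[:COUNT_BITS // 8], "big")
    # reuse the source vertex's partition value as the edge ID partition portion
    partition = source_vertex_id >> COUNT_BITS
    return (partition << COUNT_BITS) | count
```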
4. The method of any of claims 1-3, wherein, before entering, for each edge data in the edge data table, the edge data into the graph database according to the unique edge ID and the two unique vertex IDs of the edge data, the method further comprises:
determining, for each edge data, whether two vertex data respectively corresponding to the two unique vertex IDs of the edge data exist in the graph database,
wherein entering the edge data into the graph database according to the unique edge ID and the two unique vertex IDs of the edge data is performed in response to determining that both vertex data are present in the graph database.
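A minimal sketch of the check in claim 4: an edge is entered only after both of its endpoint vertices are confirmed to exist in the graph database. has_vertex and add_edge stand in for whatever lookup and entry interfaces the target graph database actually exposes; the claims do not name a concrete API.

```python
def enter_edge_if_vertices_exist(graph_db, edge_id, src_vertex_id, dst_vertex_id, edge_attrs):
    # enter the edge only when both endpoint vertices are already present
    if graph_db.has_vertex(src_vertex_id) and graph_db.has_vertex(dst_vertex_id):
        graph_db.add_edge(edge_id, src_vertex_id, dst_vertex_id, edge_attrs)
        return True
    return False  # skip (or queue for retry) edges with missing endpoints
```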
5. The method of any of claims 1 to 3, wherein entering each vertex data in the vertex data table into a graph database according to a corresponding unique vertex ID comprises:
each vertex data is entered into the graph database using a vertex add interface of the graph database.
6. The method of any of claims 1 to 3, wherein each of the vertex data comprises a plurality of attributes, each attribute comprising an attribute ID and an attribute value, and entering each vertex data in the vertex data table into a graph database according to a corresponding unique vertex ID comprises:
for each vertex data, recording the attribute value of each attribute according to its attribute ID.
7. The method of any of claims 1-3, wherein entering each edge data in the edge data table into a respective storage location in the graph database according to its unique edge ID and two unique vertex IDs comprises:
entering each edge data into the graph database using an edge add interface of the graph database.
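A minimal sketch of claims 5 to 7 taken together: vertices are entered through the graph database's vertex add interface with each attribute value recorded under its attribute ID, and edges are entered through the edge add interface after their vertex values have been replaced by unique vertex IDs. add_vertex and add_edge are placeholders for the concrete client API, and the row layout is an assumption.

```python
def enter_vertices(graph_db, vertex_ids, vertex_table):
    for row in vertex_table:                      # one row per vertex data
        vid = vertex_ids[row["vertex_value"]]     # unique vertex ID generated earlier
        attrs = {a["attribute_id"]: a["attribute_value"] for a in row["attributes"]}
        graph_db.add_vertex(vid, attrs)           # vertex add interface

def enter_edges(graph_db, edge_table):
    for row in edge_table:                        # vertex values already replaced by unique IDs
        graph_db.add_edge(row["edge_id"], row["src_vertex_id"],
                          row["dst_vertex_id"], row.get("attributes", {}))  # edge add interface
```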
8. An apparatus for storing graph data, wherein the graph data includes at least one vertex data table having a plurality of vertex data and an edge data table having a plurality of edge data, each of the vertex data including a vertex value and each of the edge data including two associated vertex values and at least one edge primary key, the apparatus comprising:
a vertex ID generation unit configured to generate a unique vertex ID for each vertex data in the at least one vertex data table according to a preset vertex ID generation rule, wherein the unique vertex ID includes a vertex ID partition portion and a vertex ID count portion;
a vertex data entry unit configured to enter each vertex data in the vertex data table into a graph database according to a corresponding unique vertex ID;
a replacement unit configured to replace, for each edge data in the edge data table, two vertex values in the edge data with corresponding unique vertex IDs, respectively;
an edge ID generation unit configured to generate a unique edge ID for the edge data based on at least one edge primary key in the edge data according to a preset edge ID generation rule; and
an edge data entry unit configured to, for each piece of edge data in the edge data table, enter the piece of edge data into the graph database according to a unique edge ID of the piece of edge data and two unique vertex IDs, wherein the vertex ID generation unit includes:
a grouping module configured to group, for each of the vertex data tables, a plurality of vertex data in the vertex data table into a plurality of groups, each group including at least one of the vertex data, and generate, for each group of vertex data, a unique vertex ID for each vertex data in parallel;
a first ID generation module configured to randomly assign, for each set of vertex data, values different from each other as the vertex ID partition portion of the unique vertex ID of each vertex data in the set; and
a second ID generation module configured to, for each set of vertex data and for each vertex data in the set, after the vertex ID partition portion has been randomly assigned, determine, among the generated unique vertex IDs, a last generated unique vertex ID whose vertex ID partition portion value is the same as that of the vertex data; obtain a first value of the vertex ID count portion of the last generated unique vertex ID; and add a preset value to the first value to obtain a second value, which is used as the vertex ID count portion of the unique vertex ID of the vertex data.
9. The apparatus of claim 8, wherein the unique edge ID includes an edge ID count portion, and the edge ID generation unit includes:
the third ID generation module is configured to hash at least one edge primary key of the edge data to obtain a hash value; and an edge ID counting section that generates the unique edge ID from the hash value.
10. The apparatus of claim 9, wherein the unique vertex ID comprises a vertex ID partition portion, the unique edge ID further comprises an edge ID partition portion, the two vertex values associated with the edge data comprise a source vertex value and a target vertex value, and the edge ID generation unit further comprises:
a fourth ID generation module configured to acquire, based on the source vertex value of the edge data, the corresponding vertex data in the vertex data table, wherein the vertex ID partition portion of the unique vertex ID of that vertex data comprises a third value, and to use the third value as the edge ID partition portion of the unique edge ID of the edge data.
11. The apparatus of any of claims 8 to 10, further comprising:
a determination unit configured to determine, for each edge data, whether two vertex data respectively corresponding to the two unique vertex IDs of the edge data exist in the graph database,
wherein the edge data entry unit is further configured to enter the edge data into the graph database according to the unique edge ID and the two unique vertex IDs of the edge data in response to determining that both vertex data are present in the graph database.
12. The apparatus of any of claims 8 to 10, wherein the vertex data entry unit is further configured to:
each vertex data is entered into the graph database using a vertex add interface of the graph database.
13. The apparatus of any of claims 8 to 10, wherein each of the vertex data comprises a plurality of attributes, each attribute comprising an attribute ID and an attribute value, the vertex data entry unit being further configured to:
record, for each vertex data, the attribute value of each attribute according to its attribute ID.
14. The apparatus of any of claims 8 to 10, wherein the edge data entry unit is further configured to:
enter each edge data into the graph database using an edge add interface of the graph database.
15. A computer device, comprising:
a memory, a processor, and a computer program stored on the memory,
wherein the processor is configured to execute the computer program to implement the steps of the method of any one of claims 1-7.
16. A non-transitory computer readable storage medium having a computer program stored thereon, wherein the computer program when executed by a processor implements the steps of the method of any of claims 1-7.
CN202110963045.9A 2021-08-20 2021-08-20 Method and device for storing graph data Active CN113656411B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110963045.9A CN113656411B (en) 2021-08-20 2021-08-20 Method and device for storing graph data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110963045.9A CN113656411B (en) 2021-08-20 2021-08-20 Method and device for storing graph data

Publications (2)

Publication Number Publication Date
CN113656411A CN113656411A (en) 2021-11-16
CN113656411B true CN113656411B (en) 2022-08-05

Family

ID=78491847

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110963045.9A Active CN113656411B (en) 2021-08-20 2021-08-20 Method and device for storing graph data

Country Status (1)

Country Link
CN (1) CN113656411B (en)

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9128967B2 (en) * 2011-10-24 2015-09-08 Accenture Global Services Limited Storing graph data in a column-oriented data store
US10120956B2 (en) * 2014-08-29 2018-11-06 GraphSQL, Inc. Methods and systems for distributed computation of graph data
US10169485B2 (en) * 2015-09-04 2019-01-01 International Business Machines Corporation Dynamic partitioning of graph databases based on edge sampling
US10002205B2 (en) * 2015-11-20 2018-06-19 Oracle International Corporation Efficient method for indexing data transferred between machines in distributed graph processing systems
US10846069B2 (en) * 2017-09-30 2020-11-24 Oracle International Corporation Method of distributed graph loading for minimal communication and good balance via lazy materialization and directory indirection using indexed tabular representation
CN110895548B (en) * 2018-08-24 2022-08-09 百度在线网络技术(北京)有限公司 Method and apparatus for processing information
CN111694834A (en) * 2019-03-15 2020-09-22 杭州海康威视数字技术股份有限公司 Method, device and equipment for putting picture data into storage and readable storage medium
CN110427359A (en) * 2019-06-27 2019-11-08 苏州浪潮智能科技有限公司 A kind of diagram data treating method and apparatus
US11216455B2 (en) * 2019-08-24 2022-01-04 International Business Machines Corporation Supporting synergistic and retrofittable graph queries inside a relational database

Also Published As

Publication number Publication date
CN113656411A (en) 2021-11-16

Similar Documents

Publication Publication Date Title
US10726356B1 (en) Target variable distribution-based acceptance of machine learning test data sets
JP6542909B2 (en) File operation method and apparatus
EP2863310B1 (en) Data processing method and apparatus, and shared storage device
CN110597852B (en) Data processing method, device, terminal and storage medium
WO2022142616A1 (en) Data storage method and apparatus based on redis, and device and storage medium
WO2010129063A1 (en) Method and system for search engine indexing and searching using the index
CN108027713A (en) Data de-duplication for solid state drive controller
US20230267116A1 (en) Translation of tenant identifiers
CN112925792B (en) Data storage control method, device, computing equipment and medium
US10282120B2 (en) Method, apparatus and system for inserting disk
EP2998862A1 (en) Method, device, and system for memory management
US10936499B2 (en) Method, device and computer programme product for storage management
CN109934712A (en) Account checking method, account checking apparatus and electronic equipment applied to distributed system
CN109597903B (en) Image file processing apparatus and method, file storage system, and storage medium
CN111475105A (en) Monitoring data storage method, device, server and storage medium
CN111966631A (en) Mirror image file generation method, system, equipment and medium capable of being rapidly distributed
CN105637489A (en) Asynchronous garbage collection in a distributed database system
JP2019504415A (en) Data storage service processing method and apparatus
CN111385294B (en) Data processing method, system, computer device and storage medium
US20150278543A1 (en) System and Method for Optimizing Storage of File System Access Control Lists
CN109460406A (en) Data processing method and device
US20240220334A1 (en) Data processing method in distributed system, and related system
CN111475279B (en) System and method for intelligent data load balancing for backup
CN113656411B (en) Method and device for storing graph data
US11960453B2 (en) Techniques for asynchronous snapshot invalidation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: Method and device for inputting graph data

Effective date of registration: 20230407

Granted publication date: 20220805

Pledgee: Industrial Commercial Bank of China Ltd. Beijing Cuiwei Road Branch

Pledgor: Beijing Zhongjing Huizhong Technology Co.,Ltd.

Registration number: Y2023980037472