CN113868434A - Data processing method, device and storage medium for graph database - Google Patents

Data processing method, device and storage medium for graph database

Info

Publication number
CN113868434A
Authority
CN
China
Prior art keywords
entity node
node
entity
bit
hop
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111146519.7A
Other languages
Chinese (zh)
Inventor
王益飞
汪洋
王宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202111146519.7A
Publication of CN113868434A
Legal status: Pending (current)

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 - Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 - Ontology
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 - Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 - Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 - Relational databases
    • G06F16/288 - Entity relationship models

Abstract

The disclosure provides a data processing method, device, and storage medium for a graph database, and relates to the field of artificial intelligence, in particular to knowledge graph technology. The specific implementation scheme is as follows: an entity node set to be processed is divided to obtain a plurality of small-batch entity node subsets; as soon as the storage locations of the entity nodes in a subset have been resolved, the entity nodes are queried directly through the corresponding storage nodes to obtain the next-hop entity nodes for that subset; the next-hop entity node set of each subset is determined from the next-hop entity nodes of all entity nodes in the subset; and the per-subset next-hop entity node sets are then processed to obtain the next-hop entity node set corresponding to the entity node set. In this way, multiple small-batch entity node subsets can be processed concurrently and network resources are used efficiently.

Description

Data processing method, device and storage medium for graph database
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method and an apparatus for processing data in a graph database, and a storage medium.
Background
A knowledge graph is an associative network representation of knowledge based on a graph model. It abstracts entities as vertices and the relationships between entities as edges, models and describes knowledge in a structured form, and makes that knowledge visual. The knowledge graph is the application direction most closely associated with graph databases and with the widest range of scenarios. A knowledge graph uses a graph database as its storage engine and intelligently processes massive amounts of information to form a large-scale knowledge base, which in turn supports business applications.
Disclosure of Invention
The present disclosure provides a data processing method, apparatus, and storage medium for a graph database.
According to an aspect of the present disclosure, there is provided a method of data processing of a graph database, the method comprising: acquiring an entity node set to be processed in the graph database; dividing the entity node set to obtain a plurality of entity node subsets; for each entity node subset, determining a storage node to which each entity node in the entity node subset belongs, wherein the storage node to which the entity node belongs comprises the entity node and a corresponding next hop entity node; determining a next hop entity node set corresponding to the entity node subset according to a next hop entity node corresponding to each entity node in the entity node subset; and determining a next hop entity node set corresponding to the entity node set according to the next hop entity node set corresponding to each entity node subset.
According to another aspect of the present disclosure, there is provided a data processing apparatus of a graph database, the apparatus comprising: the first acquisition module is used for acquiring an entity node set to be processed in the graph database; the partitioning module is used for partitioning the entity node set to obtain a plurality of entity node subsets; a first determining module, configured to determine, for each subset of entity nodes, a storage node to which each entity node in the subset of entity nodes belongs, where the storage node to which the entity node belongs includes the entity node and a corresponding next-hop entity node; a second determining module, configured to determine, according to a next-hop entity node corresponding to each entity node in the entity node subset, a next-hop entity node set corresponding to the entity node subset; and a third determining module, configured to determine, according to the next-hop entity node set corresponding to each entity node subset, a next-hop entity node set corresponding to the entity node set.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the data processing method of a graph database of the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to execute a data processing method of a graph database disclosed in an embodiment of the present disclosure.
According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the data processing method of a graph database of the present disclosure.
One embodiment in the above application has the following advantages or benefits:
when an entity node set in a graph database is processed, the pending entity node set is divided to obtain a plurality of small-batch entity node subsets; as soon as the storage locations of the entity nodes in a subset have been resolved, the entity nodes are queried directly through the corresponding storage nodes to obtain the next-hop entity nodes for that subset; the next-hop entity node set of the subset is determined from the next-hop entity nodes of all entity nodes in the subset; and the per-subset next-hop entity node sets are then processed to obtain the next-hop entity node set corresponding to the entity node set. In this way, multiple small-batch entity node subsets can be processed concurrently and network resources are used efficiently.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic flow chart diagram of a method of data processing of a graph database according to one embodiment of the present disclosure;
fig. 2 is a schematic diagram of a refinement process for determining a next-hop entity node set corresponding to an entity node set according to a next-hop entity node set corresponding to each entity node subset, provided according to another embodiment of the present disclosure;
fig. 3 is a schematic diagram of a refinement process for determining a next-hop entity node set corresponding to an entity node subset according to a storage node to which each entity node in the entity node subset belongs, provided according to another embodiment of the present disclosure;
fig. 4 is a schematic diagram of a detailed process of performing deduplication on a next hop entity node set corresponding to each entity node subset respectively to obtain a deduplication result corresponding to each entity node subset, according to another embodiment of the present disclosure;
FIG. 5 is a schematic flow chart diagram of a method for data processing of a graph database according to another embodiment of the present disclosure;
FIG. 6 is a schematic diagram of a data processing apparatus for a graph database provided according to an embodiment of the present disclosure;
FIG. 7 is a schematic diagram of a data processing apparatus for a graph database provided according to another embodiment of the present disclosure;
FIG. 8 is a schematic diagram of a data processing apparatus for a graph database provided according to another embodiment of the present disclosure;
FIG. 9 is a block diagram of an electronic device for implementing a method of data processing of a graph database according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Graph walking is a fundamental problem in the field of graph retrieval; in essence, it is a multi-level traversal over an abstract logical graph model (i.e., a topological graph). At present, distributed graph databases generally adopt a single-step walking strategy for graph walking: before each walking step, the computing node resolves the storage locations of all the input nodes' related information at once and sends them to the corresponding storage nodes for querying; the query results are gathered together and de-duplicated, and the de-duplicated result is used as the input of the next walking step. In this manner, however, only one stage is executed at a time, resources are under-utilized, and concurrency is low.
In the present disclosure, when an entity node set in a graph database is processed, the pending entity node set is divided to obtain a plurality of small-batch entity node subsets; as soon as the storage locations of the entity nodes in a subset have been resolved, the entity nodes are queried directly through the corresponding storage nodes to obtain the next-hop entity nodes for that subset; the next-hop entity node set of the subset is determined from the next-hop entity nodes of all entity nodes in the subset; and the per-subset next-hop entity node sets are then processed to obtain the next-hop entity node set corresponding to the entity node set. In this way, multiple small-batch entity node subsets can be processed concurrently and network resources are used efficiently.
A data processing method, apparatus, and storage medium of a graph database according to an embodiment of the present disclosure are described below with reference to the accompanying drawings.
FIG. 1 is a flow chart of a data processing method for a graph database according to one embodiment of the present disclosure. The graph database in this embodiment is a distributed graph database designed with a storage-computation separation architecture. The graph database includes computing nodes and multiple storage nodes; each storage node stores a different part of a knowledge graph, and the parts stored on all the storage nodes together make up the complete knowledge graph, which includes multiple entity nodes and the connecting edges between them. It should be noted that the method of this embodiment is executed by a computing node of the graph database.
As shown in fig. 1, the data processing method of the graph database may include:
step 101, acquiring an entity node set to be processed in a graph database.
Step 102, dividing the entity node set to obtain a plurality of entity node subsets.
In some embodiments, after the entity node set to be processed in the graph database is obtained, the entity node set may be divided into a plurality of entity node subsets, each containing a preset number of entity nodes.
The preset number is a number of nodes preset in the computing node; for example, the preset number may be 10 thousand or 30 thousand. In practical applications, the preset number may be set according to actual service requirements, and this embodiment does not specifically limit it.
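As an illustration of this partitioning step, the following minimal sketch in Python splits a pending entity node set into fixed-size subsets; the batch size and the representation of entity nodes as plain integer identifiers are assumptions for illustration, not values taken from the disclosure.

from typing import List

BATCH_SIZE = 10_000  # assumed preset number of entity nodes per subset

def split_into_subsets(entity_nodes: List[int], batch_size: int = BATCH_SIZE) -> List[List[int]]:
    """Divide the pending entity node set into small-batch subsets."""
    return [entity_nodes[i:i + batch_size] for i in range(0, len(entity_nodes), batch_size)]

Each returned subset can then have its storage locations resolved and be queried independently, which is what enables the concurrent processing described below.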
In some embodiments, in a graph walk scenario, for each step of the walk, a set of entity nodes pending for the current step of the walk may be obtained.
The entity node set to be processed in the current walking step is the entity node result obtained in the walking step immediately preceding it.
Step 103, determining the storage node to which each entity node in the entity node subset belongs, aiming at each entity node subset.
The storage node to which an entity node belongs comprises that entity node and the corresponding next-hop entity node.
And step 104, determining a next hop entity node set corresponding to the entity node subset according to the next hop entity node corresponding to each entity node in the entity node subset.
In some embodiments, one possible implementation of determining the next-hop entity node set corresponding to the entity node subset according to the next-hop entity node corresponding to each entity node in the subset is as follows: dividing the entity nodes belonging to the same storage node into one group according to the storage node to which each entity node in the subset belongs, so as to obtain multiple groups of entity nodes; for each group of entity nodes, receiving the de-duplicated next-hop entity node result returned by the storage node corresponding to that group; and determining the next-hop entity node set corresponding to the subset according to the next-hop entity node results of all the groups.
In some exemplary embodiments, when the computing node resolves the storage locations of the entity nodes in a single-threaded manner, it may traverse the plurality of entity node subsets in order. For the currently traversed entity node subset, the computing node resolves the storage location of each entity node in that subset to determine the storage node to which each entity node belongs. After determining the storage node to which each entity node in the current subset belongs, the computing node groups the entity nodes that belong to the same storage node, obtaining multiple groups of entity nodes, and sends a query request to the storage node corresponding to each group. The storage node queries the part of the knowledge graph it stores according to the query request, obtains the entity node reached after each entity node in the current group is walked one step, takes the obtained entity node as the next-hop entity node of the corresponding entity node, and assembles the next-hop entity node result of the current group from the next-hop entity nodes of all entity nodes in the group.
It should be understood that, for any entity node, once the next-hop entity node corresponding to that entity node has been determined at the storage node to which it belongs, the entity node can be considered to have completed one walking step.
Correspondingly, the computing node receives the next-hop entity node result returned by each storage node for the corresponding entity node group, and determines the next-hop entity node set corresponding to the current entity node subset according to the next-hop entity node results of all the groups.
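The single-threaded flow described above can be sketched as follows; resolve_storage_node and query_next_hops are hypothetical stand-ins for the storage-location analysis and the storage-node query, which the disclosure does not name, and it is assumed that each storage node returns an already de-duplicated next-hop result.

from collections import defaultdict
from typing import Callable, Dict, List, Set

def next_hops_for_subset(
    subset: List[int],
    resolve_storage_node: Callable[[int], str],
    query_next_hops: Callable[[str, List[int]], Set[int]],
) -> Set[int]:
    """Group a subset's entity nodes by owning storage node, query each group,
    and merge the per-group next-hop results into the subset's next-hop set."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for node in subset:
        groups[resolve_storage_node(node)].append(node)  # nodes on the same storage node form one group
    next_hop_set: Set[int] = set()
    for storage_node, group in groups.items():
        # each storage node is assumed to return an already de-duplicated next-hop result
        next_hop_set |= query_next_hops(storage_node, group)
    return next_hop_set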
In some embodiments, in order to reduce the amount of data that needs to be sent by the storage node and reduce the number of nodes processed by the computing node, after the storage node obtains the next-hop entity node result of the current entity node group, the storage node may perform deduplication processing on the next-hop entity node result corresponding to the current entity node group.
The node may be deduplicated in various ways, and the way of deduplicating the node will be described in the following embodiments.
In other embodiments, when the computing node processes the plurality of entity node subsets with multiple threads, after the storage node to which each entity node in an entity node subset belongs has been determined, the thread assigned to that subset may query the corresponding storage node. The storage node then queries, within the part of the knowledge graph it stores, the next-hop entity node corresponding to each entity node in the subset, and the next-hop entity node set corresponding to the subset is obtained from the next-hop entity nodes of all its entity nodes.
In some embodiments, in order to reduce the amount of data that the storage node needs to send and the number of nodes that the computing node has to process, the storage node may also perform deduplication on the next-hop entity node set corresponding to the entity node subset before sending it to the computing node.
And 105, determining a next hop entity node set corresponding to the entity node set according to the next hop entity node set corresponding to each entity node subset.
With the data processing method of the graph database according to the embodiment of the present disclosure, when an entity node set in the graph database is processed, the pending entity node set is divided to obtain a plurality of small-batch entity node subsets; as soon as the storage locations of the entity nodes in a subset have been resolved, the entity nodes are queried directly through the corresponding storage nodes to obtain the next-hop entity nodes for that subset; the next-hop entity node set of the subset is determined from the next-hop entity nodes of all entity nodes in the subset; and the per-subset next-hop entity node sets are then processed to obtain the next-hop entity node set corresponding to the entity node set. In this way, multiple small-batch entity node subsets can be processed concurrently and network resources are used efficiently.
Based on the above description, compared with the existing walking manner, in which the computing node must resolve all input entity nodes of a walking step before interacting with the storage nodes, the computing node in this embodiment divides the entity node set of the walking step into a plurality of small-batch entity node subsets; once the storage locations of the entity nodes in a subset have been resolved, the entity nodes are queried directly through the corresponding storage nodes to obtain the next-hop entity node set of that subset, which the computing node then processes. The computing node can therefore interact with the corresponding storage nodes as soon as the storage locations of part of the entity nodes have been resolved, so that multiple small-batch entity node subsets are processed concurrently, network resources are used efficiently, and query efficiency is improved.
In different application scenarios, determining the next-hop entity node set corresponding to the entity node set according to the next-hop entity node set corresponding to each entity node subset can be implemented in various ways, and an exemplary description is as follows.
In some embodiments, as shown in fig. 2, may include:
step 201, respectively performing deduplication on the next hop entity node set corresponding to each entity node subset to obtain a deduplication result corresponding to each entity node subset.
Step 202, combining the duplicate removal results corresponding to all entity node subsets to obtain a candidate next hop entity node set corresponding to the entity node set.
And 203, performing deduplication again on the candidate next hop entity node set to obtain a next hop entity node set corresponding to the entity node set.
In this embodiment, after the deduplication results corresponding to all entity node subsets have been merged, the same entity node may still appear in the deduplication results of different subsets. To avoid walking the same entity node repeatedly, the merged entity node set is therefore de-duplicated once more, and the re-deduplicated set is taken as the next-hop entity node set corresponding to the entity node set, that is, as the node query result corresponding to the entity node set.
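A compact sketch of this two-stage deduplication (steps 201 to 203), under the assumption that entity nodes are hashable integer identifiers, might look like this; the function name is illustrative.

from typing import Iterable, List, Set

def merge_next_hop_sets(per_subset_next_hops: Iterable[Iterable[int]]) -> Set[int]:
    """De-duplicate each subset's next-hop nodes, merge them into a candidate
    set, and de-duplicate the candidate set again across subsets."""
    candidate: List[int] = []
    for next_hops in per_subset_next_hops:
        candidate.extend(set(next_hops))  # step 201: per-subset deduplication
    return set(candidate)                 # steps 202-203: merge, then de-duplicate again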
In other embodiments, when the next-hop entity node sets corresponding to the entity node subsets are processed in traversal order, the next-hop entity node set of the currently traversed subset may be de-duplicated against the node deduplication result of the previously traversed subset, and so on until all next-hop entity node sets have been de-duplicated. Correspondingly, all the de-duplicated next-hop entity node sets are merged, and the merged result is used as the next-hop entity node set corresponding to the entity node set.
In other embodiments, one implementation manner of determining the next-hop entity node set corresponding to the entity node set according to the next-hop entity node set corresponding to each entity node subset is as follows: merging the next hop entity node sets corresponding to all the node subsets to obtain a merged next hop entity node set; and carrying out deduplication processing on the merged next hop entity node set to obtain a next hop entity node set corresponding to the entity node set.
In this embodiment, the next-hop entity node sets corresponding to the plurality of entity node subsets are merged directly, and the merged entity nodes are de-duplicated to obtain the next-hop entity node set corresponding to the entity node set. In this way the result for the entity node set is de-duplicated, repeated entity nodes are prevented from being walked again in the next step, wasted network resources are reduced, and the efficiency of obtaining the walking result is improved.
Based on the foregoing embodiment, in order to reduce the amount of data that the storage node needs to send and the number of nodes that the computing node has to process, one possible implementation of determining the next-hop entity node set corresponding to the entity node subset according to the next-hop entity node corresponding to each entity node in the subset may, as shown in fig. 3, include:
step 301, dividing the entity nodes belonging to the same storage node into one group according to the storage node to which each entity node in the entity node subset belongs, so as to obtain a plurality of groups of entity nodes.
Step 302, for each group of entity nodes, receiving the deduplicated next hop entity node result returned by the storage node corresponding to the current entity node group for the current entity node group.
Step 303, determining a next hop entity node set corresponding to the entity node subset according to the next hop entity node result of each group of entity nodes.
In an embodiment of the present disclosure, in order to reduce memory consumption in the deduplication process, one possible implementation manner of performing deduplication on the next-hop entity node sets corresponding to each entity node subset respectively to obtain a deduplication result corresponding to each entity node subset, as shown in fig. 4, may include:
step 401, for each entity node subset, performing node traversal on a next hop entity node set corresponding to the entity node subset.
Step 402, for the traversed target entity node, determining an interval to which the node identifier of the target entity node belongs.
The interval is obtained by segmenting the value range of the node identifier.
In this embodiment, the data type of the node identifier may be integer data; for example, the node identifier may be represented as 8-byte integer data, in which case the value range of the node identifier is 1 to 2^64.
In some embodiments, the node identifier of the target entity node may be compared with the preset interval ranges corresponding to the multiple intervals, and according to the comparison result, the interval to which the node identifier of the target entity node belongs may be determined.
For example, suppose the node identifier of the target entity node is 1000 and the interval ranges are as follows: interval 1 covers 0 to 2^32, interval 2 covers 2^32 to 2^33, interval 3 covers 2^33 to 3*2^32, interval 4 covers 3*2^32 to 2^34, and interval 5 covers 2^34 to 2^64. By comparing the node identifier with each of these interval ranges, it can be determined that the node identifier 1000 falls within the range 0 to 2^32 of interval 1, so the node identifier of the target node is determined to belong to interval 1.
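Mapping a node identifier to its interval can be done by comparing it with the preset interval boundaries, for instance with a binary search over the upper bounds. The boundaries below mirror the example above; treating each interval as half-open (lower bound included, upper bound excluded) is an assumption, since the example does not say how identifiers lying exactly on a boundary are assigned.

import bisect

# right edges of intervals 1-5, taken from the example above
INTERVAL_UPPER_BOUNDS = [2**32, 2**33, 3 * 2**32, 2**34, 2**64]

def interval_of(node_id: int) -> int:
    """Return the 1-based index of the interval the node identifier falls into,
    treating each interval as half-open: [lower bound, upper bound)."""
    return bisect.bisect_right(INTERVAL_UPPER_BOUNDS, node_id) + 1

For instance, interval_of(1000) returns 1, matching the example.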
And step 403, determining that the target entity node appears in the traversed nodes according to the duplication eliminating mode corresponding to the interval.
In different application scenarios, determining that the target entity node has already appeared among the traversed nodes according to the deduplication mode corresponding to the interval can be implemented in different ways, described by way of example as follows:
as an exemplary embodiment, a plurality of bit sets corresponding to the intervals are obtained; segmenting the bits of the node identification to obtain a plurality of bit segments corresponding to the node identification; for each bit segment, obtaining a target bit set corresponding to the bit segment from the plurality of bit sets, wherein the bit length of the target bit set is greater than or equal to the bit length of the bit segment; acquiring bit bits corresponding to the values of the bit segments from the target bit set; and under the condition that the values of the bits corresponding to the bit segments are all the first values, determining that the target entity node appears in the traversed nodes, wherein the first values indicate that the values of the corresponding bit segments appear.
In this embodiment, during node deduplication of the next-hop entity node set corresponding to the entity node subset, several bit sets whose bit lengths are smaller than the value range of the node identifier are used in combination to de-duplicate the entity nodes in the next-hop entity node set, which reduces the memory occupied by the bit sets during deduplication.
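One way to read this bit-segment scheme is sketched below. It assumes a 64-bit identifier split into four 16-bit segments, with one bit set per segment holding one bit for every possible segment value; because different identifiers can share all four segment values, the check is probabilistic, similar in spirit to a partitioned Bloom filter. The segment width and this exactness trade-off are assumptions, not statements from the disclosure.

SEGMENT_BITS = 16                      # assumed width of each bit segment
SEGMENT_COUNT = 64 // SEGMENT_BITS     # four segments for a 64-bit identifier
SEGMENT_MASK = (1 << SEGMENT_BITS) - 1

class SegmentedBitSetDeduper:
    """De-duplicate node identifiers with several small bit sets, one per bit segment."""

    def __init__(self) -> None:
        # each bit set holds one bit for every possible value of its segment (2**16 bits)
        self._bitsets = [bytearray((1 << SEGMENT_BITS) // 8) for _ in range(SEGMENT_COUNT)]

    def seen_before(self, node_id: int) -> bool:
        """Return True if all segment bits are already set (node treated as seen);
        otherwise set the missing bits and return False."""
        hits = 0
        for i in range(SEGMENT_COUNT):
            value = (node_id >> (i * SEGMENT_BITS)) & SEGMENT_MASK
            byte_index, bit_index = divmod(value, 8)
            if self._bitsets[i][byte_index] & (1 << bit_index):
                hits += 1
            else:
                self._bitsets[i][byte_index] |= 1 << bit_index
        return hits == SEGMENT_COUNT

With this layout the four bit sets together occupy 32 KB, far less than a single bit set covering the full 64-bit identifier space, which is the memory saving the paragraph above refers to.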
As another exemplary embodiment, a hash table corresponding to a section is obtained; acquiring high N bits and low M bits of a node identifier, wherein N is an integer which is greater than 1 and smaller than K, K represents the bit length of the node identifier, and M is a difference value between K and N; acquiring a bit set corresponding to the values of the high N bits from the hash table; acquiring target bits corresponding to the values of the lower M bits from the bit set; and under the condition that the value on the target bit is the first value, determining that the target entity node appears in the traversed nodes, wherein the first value is used for indicating that the values of the lower M bits appear.
In this embodiment, in the process of performing node deduplication on the next-hop entity node set corresponding to the entity node subset, the entity nodes in the next-hop entity node set are deduplicated with the hash table and the bit set, so that memory resources occupied by the hash table and the bit set can be reduced, and memory resources occupied in the deduplication process can be reduced.
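A sketch of the hash-table-plus-bit-set variant follows. The split of N = 44 high bits and M = 20 low bits is an assumed choice (the disclosure only requires 1 < N < K); each bucket's bit set covers the 2^20 possible low-bit values (128 KB) and is allocated lazily, only when its high-bit key first occurs.

from typing import Dict

N_HIGH_BITS = 44                      # assumed N
M_LOW_BITS = 64 - N_HIGH_BITS         # M = K - N = 20 for a 64-bit identifier
LOW_MASK = (1 << M_LOW_BITS) - 1

class HighLowBitDeduper:
    """Hash table keyed by the high N bits; each value is a bit set over the low M bits."""

    def __init__(self) -> None:
        self._buckets: Dict[int, bytearray] = {}

    def seen_before(self, node_id: int) -> bool:
        """Return True if the identifier was seen before; otherwise record it and return False."""
        high, low = node_id >> M_LOW_BITS, node_id & LOW_MASK
        bitset = self._buckets.setdefault(high, bytearray((1 << M_LOW_BITS) // 8))
        byte_index, bit_index = divmod(low, 8)
        if bitset[byte_index] & (1 << bit_index):
            return True                # the target bit already holds the first value
        bitset[byte_index] |= 1 << bit_index
        return False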
As another exemplary embodiment, a bit set corresponding to an interval is obtained; acquiring a target bit corresponding to the node identifier from the bit set; and under the condition that the value on the target bit is the first value, determining that the target entity node appears in the traversed nodes, wherein the first value indicates that the node identifier appears.
In this embodiment, during deduplication of the entity nodes, a bit set whose bit length is smaller than the value range of the node identifier is used, which reduces the memory occupied by the bit set during deduplication.
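Because the per-interval bit set only has to cover one interval of the identifier range, the bit index can simply be the identifier's offset from the interval's lower bound, as in the sketch below; the interval bounds passed in are assumptions for illustration.

class IntervalBitSetDeduper:
    """Exact deduplication within one interval of the identifier range."""

    def __init__(self, lower_bound: int, upper_bound: int) -> None:
        self._lower_bound = lower_bound
        # one bit per identifier in [lower_bound, upper_bound)
        self._bits = bytearray((upper_bound - lower_bound + 7) // 8)

    def seen_before(self, node_id: int) -> bool:
        """Return True if the identifier's bit is already set; otherwise set it and return False."""
        byte_index, bit_index = divmod(node_id - self._lower_bound, 8)
        if self._bits[byte_index] & (1 << bit_index):
            return True
        self._bits[byte_index] |= 1 << bit_index
        return False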
As another exemplary implementation manner, a hash table corresponding to the interval may be obtained, hash calculation may be performed on the node identifier to obtain a hash value corresponding to the node identifier, at least one candidate value corresponding to the hash value is obtained from the hash table, and when a candidate value whose value is a first value exists in the at least one candidate value, it is determined that a target entity node has appeared in the traversed node, where the first value indicates that the node identifier has appeared.
In this embodiment, in the process of performing node deduplication on the entity node, the entity node is deduplicated by combining the hash table corresponding to the interval to which the node identifier belongs, so that the time for searching the hash table can be reduced, and the memory consumption can be reduced.
And step 404, deleting the target entity node in the next hop entity node set.
In some embodiments, the above-mentioned deduplication is performed again on the candidate next-hop entity node set to obtain the next-hop entity node set obtained by the current walking, and one possible implementation manner is as follows: performing node traversal on a candidate next hop entity node set, and determining an interval to which a node identifier of the next hop entity node belongs for the traversed next hop entity node; and determining that the next hop entity node appears in the traversed nodes according to the duplicate removal mode corresponding to the interval, and deleting the next hop entity node in the candidate next hop entity node set.
The process of determining that the next-hop entity node has appeared in the traversed nodes according to the deduplication mode corresponding to the interval is similar to the process of determining that the target entity node has appeared in the traversed nodes according to the deduplication mode corresponding to the interval, and details of the embodiment are omitted.
In order to make the present disclosure clear to those skilled in the art, the method of this embodiment is further described below with reference to fig. 5, where it is to be noted that this embodiment takes as an example that the compute node and the storage node both perform data processing in a single-threaded manner, where it is to be noted that a in fig. 5 represents a first batch of input data, B represents a second batch of input data, C represents a third batch of input data, and D represents a fourth batch of input data.
As shown in fig. 5, may include:
in this embodiment, after the computing node obtains the input data of the wandering input in this step, the computing node may divide the input data into a plurality of batches of input data. In fig. 5, the input data is divided into four batches of input data for example. Correspondingly, the computing nodes sequentially analyze a plurality of batches of input data according to the sequence, after the first batch of input data is analyzed, the first batch of input data is sent to the corresponding storage nodes, and the storage nodes collect the relation data of the first batch of input data and send the relation data to the computing nodes.
It should be noted that the input data here is an entity node set.
It should be noted that, in the process of collecting the relational data of the first batch of input data by the storage node, the computing node analyzes the storage location of the second batch of input data, and after the storage location corresponding to each input data in the second batch of input data is analyzed, the computing node sends a query request to the storage node corresponding to the storage location to query the relational data corresponding to the input data.
It should be noted that, for the current batch of input data, if the computing node receives the query result of the current batch while it is still de-duplicating the query result of the previous batch, then, since the query results of different batches may contain the same entity nodes, the computing node first finishes de-duplicating the previous batch's query result and then de-duplicates the current batch's query result against the de-duplicated result of the previous batch. It should also be noted that, in order to reduce the deduplication burden on the computing node and the amount of data to be transmitted over the network, the storage node may perform deduplication on the query result of each batch after obtaining it.
For the duplicate removal processing, reference may be made to the related description of the above embodiments, and details are not repeated here.
In this embodiment, the input data of this walking step is split into multiple batches, and the batches are processed in a parallel, pipelined manner, which improves the concurrency between different execution stages, makes reasonable use of network resources, and improves the efficiency of obtaining query results.
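A rough sketch of the pipelining shown in fig. 5 is given below, using a thread pool so that the query for one batch can be in flight while the next batch's storage locations are being resolved; resolve_locations and query_batch are hypothetical stand-ins for the computing-node analysis and the storage-node query.

from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Set

def walk_in_batches(
    batches: List[List[int]],
    resolve_locations: Callable[[List[int]], Dict[str, List[int]]],
    query_batch: Callable[[Dict[str, List[int]]], Set[int]],
) -> Set[int]:
    """Resolve batches sequentially, run their queries concurrently, and merge
    the de-duplicated results into the next walking step's input."""
    pending = []
    with ThreadPoolExecutor() as pool:
        for batch in batches:
            locations = resolve_locations(batch)                 # analyze this batch's storage locations
            pending.append(pool.submit(query_batch, locations))  # query overlaps the next batch's analysis
        merged: Set[int] = set()
        for future in pending:
            merged |= future.result()                            # cross-batch deduplication via set union
    return merged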
In addition, this embodiment adopts a two-stage deduplication scheme: the query result obtained by a walking step is first de-duplicated at the storage node, and then de-duplicated again at the computing node. Performing deduplication at the storage nodes reduces the amount of data that needs to be transmitted over the network and the number of nodes the computing node has to de-duplicate, thereby reducing the deduplication burden on the computing node.
In order to implement the above embodiments, the embodiments of the present disclosure further provide a data processing device for a graph database.
FIG. 6 is a block diagram of a data processing apparatus for a graph database according to an embodiment of the present disclosure.
As shown in fig. 6, the data processing apparatus 600 of the graph database may include a first obtaining module 601, a dividing module 602, a first determining module 603, a second determining module 604, and a third determining module 605, wherein:
the first obtaining module 601 is configured to obtain an entity node set to be processed in a graph database.
A partitioning module 602, configured to partition the entity node set to obtain a plurality of entity node subsets.
A first determining module 603, configured to determine, for each entity node subset, a storage node to which each entity node in the entity node subset belongs, where the storage node to which the entity node belongs includes the entity node and a corresponding next-hop entity node.
The second determining module 604 is configured to determine, according to a next-hop entity node corresponding to each entity node in the entity node subset, a next-hop entity node set corresponding to the entity node subset.
A third determining module 605, configured to determine, according to the next-hop entity node set corresponding to each entity node subset, a next-hop entity node set corresponding to the entity node set.
It should be noted that the explanation of the embodiment of the data processing method for a graph database is also applicable to this embodiment, and the description of this embodiment is omitted.
With the data processing device of the graph database according to the embodiments of the present disclosure, when an entity node set in the graph database is processed, the pending entity node set is divided to obtain a plurality of small-batch entity node subsets; as soon as the storage locations of the entity nodes in a subset have been resolved, the entity nodes are queried directly through the corresponding storage nodes to obtain the next-hop entity nodes for that subset; the next-hop entity node set of the subset is determined from the next-hop entity nodes of all entity nodes in the subset; and the per-subset next-hop entity node sets are then processed to obtain the next-hop entity node set corresponding to the entity node set. In this way, multiple small-batch entity node subsets can be processed concurrently and network resources are used efficiently.
In one embodiment of the present disclosure, as shown in fig. 7, the data processing apparatus of the graph database may include: a first obtaining module 701, a dividing module 702, a first determining module 703, a second determining module 704, and a third determining module 705, where the second determining module 704 includes: a first deduplication unit 7041, a first merging unit 7042, and a second deduplication unit 7043; and the first deduplication unit 7041 may include: a node traversal subunit 70411, a first determining subunit 70412, a second determining subunit 70413, and a deletion subunit 70414.
For detailed description of the first obtaining module 701, the dividing module 702, the first determining module 703 and the third determining module 705, please refer to the description of the first obtaining module 601, the dividing module 602, the first determining module 603 and the third determining module 605 in the embodiment shown in fig. 6, and the description is not repeated here.
In one embodiment of the present disclosure, as shown in fig. 7, the second determining module 704 includes:
a first duplicate removal unit 7041, configured to separately perform duplicate removal on the next hop entity node set corresponding to each entity node subset, so as to obtain a duplicate removal result corresponding to each entity node subset;
a first merging unit 7042, configured to merge respective duplicate removal results corresponding to all entity node subsets to obtain a candidate next-hop entity node set of the entity node set;
a second deduplication unit 7043 is configured to perform deduplication again on the candidate next hop entity node set to obtain a next hop entity node set corresponding to the entity node set.
In an embodiment of the present disclosure, the first determining module 703 is specifically configured to: divide the entity nodes belonging to the same storage node into one group according to the storage node to which each entity node in the entity node subset belongs, so as to obtain multiple groups of entity nodes; for each group of entity nodes, receive the de-duplicated next-hop entity node result returned by the storage node corresponding to the current entity node group for that group; and determine the next-hop entity node set corresponding to the entity node subset according to the next-hop entity node results of all the groups.
In an embodiment of the present disclosure, as shown in fig. 7, the first deduplication unit 7041 includes:
a node traversal subunit 70411, configured to perform node traversal on the next-hop entity node set corresponding to each entity node subset;
a first determining subunit 70412, configured to determine, for the traversed target entity node, an interval to which the node identifier of the target entity node belongs, where the interval is obtained by segmenting a value range of the node identifier;
a second determining subunit 70413, configured to determine that a target entity node has occurred in the traversed nodes according to the deduplication manner corresponding to the interval;
a delete sub-unit 70414, configured to delete the target entity node in the next-hop entity node set.
In an embodiment of the disclosure, the second determining subunit 70413 is specifically configured to: acquiring a plurality of bit sets corresponding to the intervals; segmenting the bits of the node identification to obtain a plurality of bit segments corresponding to the node identification; for each bit segment, obtaining a target bit set corresponding to the bit segment from the plurality of bit sets, wherein the bit length of the target bit set is greater than or equal to the bit length of the bit segment; acquiring bit bits corresponding to the values of the bit segments from the target bit set; and under the condition that the values of the bits corresponding to the bit segments are all the first values, determining that the target entity node appears in the traversed nodes, wherein the first values indicate that the values of the corresponding bit segments appear.
In an embodiment of the disclosure, the second determining subunit 70413 is specifically configured to: acquiring a hash table corresponding to the interval; acquiring high N bits and low M bits of a node identifier, wherein N is an integer which is greater than 1 and smaller than K, K represents the bit length of the node identifier, and M is a difference value between K and N; acquiring a bit set corresponding to the values of the high N bits from the hash table; acquiring target bits corresponding to the values of the lower M bits from the bit set; and under the condition that the value on the target bit is the first value, determining that the target entity node appears in the traversed nodes, wherein the first value is used for indicating that the values of the lower M bits appear.
In an embodiment of the disclosure, the second determining subunit 70413 is specifically configured to: acquiring a bit set corresponding to the interval; acquiring a target bit corresponding to the node identifier from the bit set; and under the condition that the value on the target bit is the first value, determining that the target entity node appears in the traversed nodes, wherein the first value represents that the node identifier appears.
In one embodiment of the present disclosure, as shown in fig. 8, the data processing device of the graph database may include: a first obtaining module 801, a dividing module 802, a first determining module 803, a second determining module 804, and a third determining module 805, where the second determining module 804 may include: a second combining unit 8041 and a third deduplication unit 8042.
In one embodiment of the present disclosure, as shown in fig. 8, the second determining module 804 includes:
a second merging unit 8041, configured to merge next-hop entity node sets corresponding to all node subsets to obtain a merged next-hop entity node set;
a third deduplication unit 8042, configured to perform deduplication processing on the merged next-hop entity node set to obtain the next-hop entity node set corresponding to the entity node set.
It should be noted that the explanation of the embodiment of the data processing method for a graph database is also applicable to the data processing device for a graph database in this embodiment, and is not repeated herein.
The present disclosure also provides an electronic device and a readable storage medium and a computer program product according to embodiments of the present disclosure.
FIG. 9 illustrates a schematic block diagram of an example electronic device 900 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 9, the device 900 includes a computing unit 901, which can perform various appropriate actions and processes in accordance with a computer program stored in a Read Only Memory (ROM) 902 or a computer program loaded from a storage unit 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data required for the operation of the device 900 can also be stored. The computing unit 901, the ROM 902, and the RAM 903 are connected to each other via a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
A number of components in the device 900 are connected to the I/O interface 905, including: an input unit 906 such as a keyboard, a mouse, and the like; an output unit 907 such as various types of displays, speakers, and the like; a storage unit 908 such as a magnetic disk, optical disk, or the like; and a communication unit 909 such as a network card, a modem, a wireless communication transceiver, and the like. The communication unit 909 allows the device 900 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 901 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 901 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 901 performs the respective methods and processes described above, such as the data processing method of the graph database. For example, in some embodiments, the data processing method of a graph database may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 908. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 900 via the ROM 902 and/or the communication unit 909. When the computer program is loaded into the RAM 903 and executed by the computing unit 901, one or more steps of the data processing method of a graph database described above may be performed. Alternatively, in other embodiments, the computing unit 901 may be configured in any other suitable way (e.g., by means of firmware) to execute the data processing method of a graph database.
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), the internet, and blockchain networks.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system that overcomes the drawbacks of high management difficulty and weak service scalability of traditional physical hosts and VPS ("Virtual Private Server") services. The server may also be a server of a distributed system, or a server combined with a blockchain.
It should be noted that artificial intelligence is the discipline of making a computer simulate certain human thinking processes and intelligent behaviors (such as learning, reasoning, thinking, and planning), and it involves both hardware-level and software-level technologies. Artificial intelligence hardware technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, and big data processing; artificial intelligence software technologies mainly include computer vision technology, speech recognition technology, natural language processing technology, machine learning/deep learning, big data processing technology, and knowledge graph technology.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (19)

1. A method of data processing of a graph database, the method comprising:
acquiring an entity node set to be processed in the graph database;
dividing the entity node set to obtain a plurality of entity node subsets;
for each entity node subset, determining a storage node to which each entity node in the entity node subset belongs, wherein the storage node to which the entity node belongs comprises the entity node and a corresponding next hop entity node;
determining a next hop entity node set corresponding to the entity node subset according to a next hop entity node corresponding to each entity node in the entity node subset;
and determining a next hop entity node set corresponding to the entity node set according to the next hop entity node set corresponding to each entity node subset.
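For illustration only, the following Python sketch walks through the procedure of claim 1 under assumed names: resolve_storage_node() stands in for whatever routing logic maps an entity node to its storage node, StorageNode.next_hops() stands in for the storage-side query, and the batch size of 1000 is an arbitrary choice since the claim does not fix one.

```python
from typing import Callable, Iterable, List, Set

BATCH_SIZE = 1000  # assumed subset size; the claim does not fix a value

def expand_one_hop(entity_nodes: Iterable[int],
                   resolve_storage_node: Callable[[int], "StorageNode"],
                   batch_size: int = BATCH_SIZE) -> Set[int]:
    nodes: List[int] = list(entity_nodes)
    next_hop_set: Set[int] = set()
    # Divide the entity node set to be processed into small entity node subsets.
    for start in range(0, len(nodes), batch_size):
        subset = nodes[start:start + batch_size]
        subset_next: Set[int] = set()
        # For each entity node, determine the storage node it belongs to and
        # query that storage node for the node's next-hop entities.
        for node_id in subset:
            storage = resolve_storage_node(node_id)
            subset_next.update(storage.next_hops(node_id))
        # The per-subset next-hop sets together yield the next-hop set
        # for the whole entity node set.
        next_hop_set.update(subset_next)
    return next_hop_set
```

In practice the per-subset work would typically run concurrently across storage nodes; the sketch keeps it sequential for readability.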
2. The method of claim 1, wherein the determining a next-hop entity node set corresponding to the entity node set according to the next-hop entity node set corresponding to each of the entity node subsets comprises:
respectively carrying out duplicate removal on the next hop entity node set corresponding to each entity node subset to obtain a duplicate removal result corresponding to each entity node subset;
combining the duplicate removal results corresponding to all the entity node subsets to obtain a candidate next hop entity node set corresponding to the entity node set;
and performing duplicate removal on the candidate next hop entity node set again to obtain a next hop entity node set corresponding to the entity node set.
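A minimal sketch of claim 2's two-stage deduplication, assuming the per-subset next-hop results arrive as plain lists of integer node identifiers and using Python sets in place of whatever deduplication structure the database actually uses:

```python
from typing import Iterable, List, Set

def two_stage_dedup(per_subset_next_hops: Iterable[List[int]]) -> Set[int]:
    candidates: List[int] = []
    for subset_hops in per_subset_next_hops:
        # Stage 1: deduplicate within each subset's next-hop result.
        candidates.extend(set(subset_hops))
    # Combining the per-subset results can reintroduce duplicates across
    # subsets, so stage 2 deduplicates the candidate set once more.
    return set(candidates)
```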
3. The method of claim 1, wherein the determining a next-hop entity node set corresponding to the entity node set according to the next-hop entity node set corresponding to each of the entity node subsets comprises:
merging the next hop entity node sets corresponding to all the entity node subsets to obtain a merged next hop entity node set;
and carrying out deduplication processing on the merged next hop entity node set to obtain a next hop entity node set corresponding to the entity node set.
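Claim 3's simpler variant, sketched under the same assumptions (integer identifiers, Python sets as stand-ins): the per-subset sets are merged first and deduplicated in a single pass.

```python
from typing import Iterable, List, Set

def merge_then_dedup(per_subset_next_hops: Iterable[List[int]]) -> Set[int]:
    merged: List[int] = []
    for subset_hops in per_subset_next_hops:
        merged.extend(subset_hops)   # merge without deduplicating
    return set(merged)               # single deduplication pass over the merge
```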
4. The method according to any one of claims 1 to 3, wherein the determining, according to the next-hop entity node corresponding to each entity node in the subset of entity nodes, the set of next-hop entity nodes corresponding to the subset of entity nodes comprises:
dividing the entity nodes belonging to the same storage node into one group according to the storage node to which each entity node in the entity node subset belongs, so as to obtain a plurality of groups of entity nodes;
for each group of entity nodes, receiving a deduplicated next hop entity node result returned, for the current entity node group, by the storage node corresponding to the current entity node group;
and determining a next hop entity node set corresponding to the entity node subset according to the next hop entity node result of each group of entity nodes.
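A sketch of the per-storage-node grouping in claim 4; next_hops_deduped() is a hypothetical storage-side call that returns an already deduplicated next-hop result for a whole group of entity nodes, and resolve_storage_node() is the same assumed routing helper as above.

```python
from collections import defaultdict
from typing import Callable, Iterable, Set

def subset_next_hops(subset: Iterable[int],
                     resolve_storage_node: Callable[[int], "StorageNode"]) -> Set[int]:
    # Group the subset's entity nodes by the storage node they belong to,
    # so each storage node is queried once per group.
    groups = defaultdict(list)
    for node_id in subset:
        groups[resolve_storage_node(node_id)].append(node_id)
    result: Set[int] = set()
    for storage, group in groups.items():
        # Each storage node deduplicates its own next-hop result before
        # returning it; the union over groups is the subset's next-hop set.
        result.update(storage.next_hops_deduped(group))
    return result
```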
5. The method according to claim 2, wherein the separately performing deduplication on the next-hop entity node set corresponding to each entity node subset to obtain a deduplication result corresponding to each entity node subset comprises:
for each entity node subset, performing node traversal on a next hop entity node set corresponding to the entity node subset;
for the traversed target entity node, determining an interval to which a node identifier of the target entity node belongs, wherein the interval is obtained by segmenting a value range of the node identifier;
determining, according to the deduplication manner corresponding to the interval, that the target entity node has appeared in the traversed nodes;
and deleting the target entity node from the next hop entity node set.
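A sketch of the interval-based traversal in claim 5, assuming integer node identifiers and an assumed fixed interval width; a plain Python set stands in here for each interval's deduplication structure, while claims 6 to 8 describe bit-set based alternatives.

```python
INTERVAL_WIDTH = 1 << 20   # assumed number of identifiers per interval

def dedup_by_interval(next_hop_nodes):
    seen_per_interval = {}          # interval index -> identifiers already seen
    deduped = []
    for node_id in next_hop_nodes:  # traverse the subset's next-hop set
        interval = node_id // INTERVAL_WIDTH
        seen = seen_per_interval.setdefault(interval, set())
        if node_id in seen:
            continue                # already appeared: drop (delete) the node
        seen.add(node_id)
        deduped.append(node_id)
    return deduped
```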
6. The method of claim 5, wherein the determining that the target entity node has appeared in the traversed nodes according to the deduplication manner corresponding to the interval comprises:
acquiring a plurality of bit sets corresponding to the intervals;
segmenting the bits of the node identifier to obtain a plurality of bit segments corresponding to the node identifier;
for each bit segment, obtaining a target bit set corresponding to the bit segment from the plurality of bit sets, wherein the bit length of the target bit set is greater than or equal to the bit length of the bit segment;
acquiring the bit corresponding to the value of the bit segment from the target bit set;
and determining that the target entity node has appeared in the traversed nodes under the condition that the values of the bits corresponding to the bit segments are all first values, wherein the first values indicate that the values of the corresponding bit segments have appeared.
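A sketch of the segmented bit-set check in claim 6 under assumed sizes: 64-bit identifiers cut into four 16-bit segments, each segment with its own bytearray-backed bit set.

```python
SEGMENT_BITS = 16     # assumed width of each bit segment
NUM_SEGMENTS = 4      # assumed: 64-bit node identifiers split into 4 segments

class SegmentedBitSets:
    def __init__(self):
        # One bit set per segment; each holds 2**SEGMENT_BITS bits.
        self.bitsets = [bytearray((1 << SEGMENT_BITS) // 8)
                        for _ in range(NUM_SEGMENTS)]

    def _segments(self, node_id: int):
        mask = (1 << SEGMENT_BITS) - 1
        return [(node_id >> (i * SEGMENT_BITS)) & mask
                for i in range(NUM_SEGMENTS)]

    def seen_before(self, node_id: int) -> bool:
        # The node is treated as already seen only if the bit for every
        # segment value is already the "first value" (1).
        return all(self.bitsets[i][seg >> 3] & (1 << (seg & 7))
                   for i, seg in enumerate(self._segments(node_id)))

    def mark(self, node_id: int) -> None:
        # Set the bit corresponding to each segment's value.
        for i, seg in enumerate(self._segments(node_id)):
            self.bitsets[i][seg >> 3] |= 1 << (seg & 7)
```

In this sketch the test behaves like a Bloom filter: distinct identifiers whose segment values all collide are reported as already seen, so a positive answer means the node may have appeared.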
7. The method of claim 5, wherein the determining that the target entity node has appeared in the traversed nodes according to the deduplication manner corresponding to the interval comprises:
acquiring a hash table corresponding to the interval;
acquiring high N bits and low M bits of the node identifier, wherein N is an integer greater than 1 and smaller than K, K represents the bit length of the node identifier, and M is the difference between K and N;
acquiring a bit set corresponding to the values of the high N bits from the hash table;
acquiring target bits corresponding to the values of the lower M bits from the bit set;
and under the condition that the value of the target bit is a first value, determining that the target entity node appears in the traversed nodes, wherein the first value is used for indicating that the values of the lower M bits appear.
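A sketch of claim 7's layout under assumed sizes: K = 64-bit identifiers, the high N = 48 bits key a hash table, and the low M = 16 bits pick one bit inside an 8 KiB bit set allocated lazily per key.

```python
K = 64          # assumed bit length of a node identifier
N = 48          # assumed number of high bits used as the hash-table key
M = K - N       # low bits that select one bit inside the chosen bit set

class HighLowBitIndex:
    def __init__(self):
        self.table = {}   # value of the high N bits -> bit set for that bucket

    def _split(self, node_id: int):
        return node_id >> M, node_id & ((1 << M) - 1)

    def seen_before(self, node_id: int) -> bool:
        high, low = self._split(node_id)
        bitset = self.table.get(high)
        if bitset is None:
            return False
        # The node has appeared if the target bit for the low M bits is 1.
        return bool(bitset[low >> 3] & (1 << (low & 7)))

    def mark(self, node_id: int) -> None:
        high, low = self._split(node_id)
        bitset = self.table.setdefault(high, bytearray((1 << M) // 8))
        bitset[low >> 3] |= 1 << (low & 7)
```

Allocating a bit set lazily per high-bit prefix keeps memory proportional to the prefixes actually encountered rather than to the whole identifier space.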
8. The method of claim 5, wherein the determining that the target entity node has appeared in the traversed nodes according to the deduplication manner corresponding to the interval comprises:
acquiring a bit set corresponding to the interval;
acquiring a target bit corresponding to the node identifier from the bit set;
and under the condition that the value on the target bit is the first value, determining that the target entity node appears in the traversed nodes.
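A sketch of claim 8's single bit set per interval, under the assumptions that the caller passes the identifier's offset within the interval and that the interval covers a fixed number of identifiers.

```python
INTERVAL_SIZE = 1 << 24   # assumed number of identifiers covered by one bit set

class IntervalBitSet:
    def __init__(self):
        self.bits = bytearray(INTERVAL_SIZE // 8)

    def seen_before(self, offset: int) -> bool:
        # The target entity node has appeared if the target bit is already 1.
        return bool(self.bits[offset >> 3] & (1 << (offset & 7)))

    def mark(self, offset: int) -> None:
        self.bits[offset >> 3] |= 1 << (offset & 7)
```

One bit per identifier makes this variant exact within its interval, at the cost of memory proportional to the interval size; the segmented scheme sketched under claim 6 trades that exactness for a much smaller footprint.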
9. A data processing apparatus for a graph database, the apparatus comprising:
the first acquisition module is used for acquiring an entity node set to be processed in the graph database;
the partitioning module is used for partitioning the entity node set to obtain a plurality of entity node subsets;
a first determining module, configured to determine, for each subset of entity nodes, a storage node to which each entity node in the subset of entity nodes belongs, where the storage node to which the entity node belongs includes the entity node and a corresponding next-hop entity node;
a second determining module, configured to determine, according to a next-hop entity node corresponding to each entity node in the entity node subset, a next-hop entity node set corresponding to the entity node subset;
and a third determining module, configured to determine, according to the next-hop entity node set corresponding to each entity node subset, a next-hop entity node set corresponding to the entity node set.
10. The apparatus of claim 9, wherein the second determining module comprises:
a first duplicate removal unit, configured to separately remove duplicates of a next hop entity node set corresponding to each entity node subset, so as to obtain a duplicate removal result corresponding to each entity node subset;
a first merging unit, configured to merge duplicate removal results corresponding to all the entity node subsets to obtain a candidate next-hop entity node set corresponding to the entity node set;
and the second duplicate removal unit is used for removing the duplicate of the candidate next hop entity node set again to obtain a next hop entity node set corresponding to the entity node set.
11. The apparatus of claim 9, wherein the second determining module comprises:
a second merging unit, configured to merge the next-hop entity node sets corresponding to all the entity node subsets to obtain a merged next-hop entity node set;
and a third deduplication unit, configured to perform deduplication processing on the merged next hop entity node set to obtain a next hop entity node set corresponding to the entity node set.
12. The apparatus according to any one of claims 9 to 11, wherein the second determining module is specifically configured to:
divide the entity nodes belonging to the same storage node into one group according to the storage node to which each entity node in the entity node subset belongs, so as to obtain a plurality of groups of entity nodes;
for each group of entity nodes, receive a deduplicated next hop entity node result returned, for the current entity node group, by the storage node corresponding to the current entity node group;
and determine a next hop entity node set corresponding to the entity node subset according to the next hop entity node result of each group of entity nodes.
13. The apparatus of claim 10, wherein the first deduplication unit comprises:
a node traversal subunit, configured to perform node traversal on a next-hop entity node set corresponding to each entity node subset;
the first determining subunit is configured to determine, for a traversed target entity node, an interval to which a node identifier of the target entity node belongs, where the interval is obtained by segmenting a value range of the node identifier;
a second determining subunit, configured to determine, according to the deduplication manner corresponding to the interval, that the target entity node has appeared in the traversed nodes;
and the deleting subunit is configured to delete the target entity node in the next-hop entity node set.
14. The apparatus according to claim 13, wherein the second determining subunit is specifically configured to:
acquire a plurality of bit sets corresponding to the intervals;
segment the bits of the node identifier to obtain a plurality of bit segments corresponding to the node identifier;
for each bit segment, obtain a target bit set corresponding to the bit segment from the plurality of bit sets, wherein the bit length of the target bit set is greater than or equal to the bit length of the bit segment;
acquire the bit corresponding to the value of the bit segment from the target bit set;
and determine that the target entity node has appeared in the traversed nodes under the condition that the values of the bits corresponding to the bit segments are all first values, wherein the first values indicate that the values of the corresponding bit segments have appeared.
15. The apparatus according to claim 13, wherein the second determining subunit is specifically configured to:
acquire a hash table corresponding to the interval;
acquire high N bits and low M bits of the node identifier, wherein N is an integer greater than 1 and smaller than K, K represents the bit length of the node identifier, and M is the difference between K and N;
acquire a bit set corresponding to the values of the high N bits from the hash table;
acquire target bits corresponding to the values of the lower M bits from the bit set;
and under the condition that the value of the target bit is a first value, determine that the target entity node appears in the traversed nodes, wherein the first value is used for indicating that the values of the lower M bits appear.
16. The apparatus according to claim 13, wherein the second determining subunit is specifically configured to:
acquire a bit set corresponding to the interval;
acquire a target bit corresponding to the node identifier from the bit set;
and under the condition that the value on the target bit is the first value, determine that the target entity node appears in the traversed nodes.
17. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.
18. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-8.
19. A computer program product comprising a computer program which, when executed by a processor, carries out the steps of the method of any one of claims 1-8.
CN202111146519.7A 2021-09-28 2021-09-28 Data processing method, device and storage medium for graph database Pending CN113868434A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111146519.7A CN113868434A (en) 2021-09-28 2021-09-28 Data processing method, device and storage medium for graph database

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111146519.7A CN113868434A (en) 2021-09-28 2021-09-28 Data processing method, device and storage medium for graph database

Publications (1)

Publication Number Publication Date
CN113868434A true CN113868434A (en) 2021-12-31

Family

ID=78992345

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111146519.7A Pending CN113868434A (en) 2021-09-28 2021-09-28 Data processing method, device and storage medium for graph database

Country Status (1)

Country Link
CN (1) CN113868434A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114416913A (en) * 2022-03-28 2022-04-29 支付宝(杭州)信息技术有限公司 Method and device for data slicing of knowledge graph
WO2023185186A1 (en) * 2022-03-28 2023-10-05 支付宝(杭州)信息技术有限公司 Method and apparatus for performing data fragmentation on knowledge graph

Similar Documents

Publication Publication Date Title
US9275422B2 (en) Distributed k-core view materialization and maintenance for graphs
CN113155141A (en) Map generation method and device, electronic equipment and storage medium
US8635224B2 (en) Clustering streaming graphs
KR101617696B1 (en) Method and device for mining data regular expression
CN112559631B (en) Data processing method and device of distributed graph database and electronic equipment
CN112528067A (en) Graph database storage method, graph database reading method, graph database storage device, graph database reading device and graph database reading equipment
US20240095241A1 (en) Data search method and apparatus, and device
CN113868434A (en) Data processing method, device and storage medium for graph database
CN113722600A (en) Data query method, device, equipment and product applied to big data
CN113868254B (en) Method, device and storage medium for removing duplication of entity node in graph database
CN115099175B (en) Method and device for acquiring time sequence netlist, electronic equipment and storage medium
CN114897666B (en) Graph data storage, access, processing method, training method, device and medium
CN113240089B (en) Graph neural network model training method and device based on graph retrieval engine
CN112860811B (en) Method and device for determining data blood relationship, electronic equipment and storage medium
CN114969444A (en) Data processing method and device, electronic equipment and storage medium
CN109800231A (en) A kind of real-time track co-movement motion pattern detection method based on Flink
CN110309367B (en) Information classification method, information processing method and device
CN114359610B (en) Entity classification method, device, equipment and storage medium
CN116383454B (en) Data query method of graph database, electronic equipment and storage medium
CN112783507B (en) Data stream guiding playback method and device, electronic equipment and readable storage medium
CN113254993B (en) Data protection method, apparatus, device, storage medium, and program product
CN115935027A (en) Data processing method of target object topological graph and training method of graph classification model
CN115730681A (en) Model training method, device, equipment and storage medium
CN113836358A (en) Data processing method and device, electronic equipment and storage medium
CN113792117A (en) Method and device for determining data update context, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination