CN111309750A - Data updating method and device for graph database - Google Patents

Data updating method and device for graph database

Info

Publication number
CN111309750A
Authority
CN
China
Prior art keywords
vertex
data
message
edge
updating
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010241791.2A
Other languages
Chinese (zh)
Inventor
邓崇鑫
蔡苗
陈震宇
刘国华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Postal Savings Bank of China Ltd
Original Assignee
Postal Savings Bank of China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Postal Savings Bank of China Ltd filed Critical Postal Savings Bank of China Ltd
Priority to CN202010241791.2A priority Critical patent/CN111309750A/en
Publication of CN111309750A publication Critical patent/CN111309750A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23Updating
    • G06F16/2365Ensuring data consistency and integrity
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/901Indexing; Data structures therefor; Storage structures
    • G06F16/9024Graphs; Linked lists

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to a data updating method and device for a graph database, wherein the method comprises the following steps: determining the incremental data and operation type of an update operation; locking the data number corresponding to the incremental data with a distributed lock; calling the query interface of the graph database and querying whether historical data corresponding to the data number exists; updating the corresponding data in the graph database according to the operation type and the query result; and unlocking the data number after the update operation is completed. The scheme of the application supports distributed parallel transmission and processing of data, including multi-process and multi-thread processing; without modifying the existing systems (the data production and transmission systems and the graph database), it achieves distribution, real-time performance, time ordering and idempotence simultaneously in the cross-system processing from source data to graph database.

Description

Data updating method and device for graph database
Technical Field
The application relates to the technical field of databases, in particular to a data updating method and device of a graph database.
Background
With the development of internet and internet-of-things technology, data is growing ever faster. At the same time, graph data is being applied more and more widely, and many of these applications have strict real-time requirements. Graph databases therefore need to process massive amounts of data in real time, and real-time incremental updating of massive data in a graph database is a problem that has to be solved.
A graph database stores entity information and the relationship information between entities, corresponding to points (also called nodes or vertices) and edges (also called arcs or lines) in graph theory. For example, relationships between people can be stored in a graph database: each person is entity information and corresponds to a vertex, and a relationship between people is relationship information between entities and corresponds to an edge. Many graph databases exist, such as Neo4j, ArangoDB and OrientDB.
In the related art, different organizations, enterprises and applications implement their data production and transmission systems and graph databases with different technologies. Modifying these systems one by one to add the required functions or features would incur high labor and time costs. Without modifying the existing systems (the data production and transmission systems and the graph database), there is currently no technical scheme that simultaneously addresses the distribution, real-time performance, time ordering and idempotence of the process from source data to graph database.
Disclosure of Invention
To overcome, at least to some extent, the problems in the related art, the present application provides a method and apparatus for updating data in a graph database.
According to a first aspect of embodiments of the present application, there is provided a data updating method for a graph database, including:
determining incremental data and operation types of the updating operation;
locking the data number corresponding to the incremental data by using a distributed lock;
calling a query interface of the graph database, and querying whether historical data corresponding to the data number exists;
updating the corresponding data in the graph database according to the operation type and the query result;
and after the updating operation is completed, unlocking the data number.
Further, locking the data number corresponding to the incremental data includes:
when the incremental data is a vertex, locking the vertex id.
Further, updating the corresponding data in the graph database according to the operation type and the query result includes:
when the operation type is add or overwrite, entering the add-vertex sub-flow if the query result shows that the historical data does not exist, or the overwrite-vertex sub-flow if it does;
and when the operation type is delete, entering the delete-vertex sub-flow and updating the vertex log if the query result shows that the historical data exists.
Further, the add-vertex sub-flow includes:
setting the vertex id of the vertex to the vertex id of the message;
setting the available field of the vertex to true;
traversing the properties of the message to generate the corresponding properties of the vertex;
updating the vertex log;
and adding the vertex data to the graph database.
Further, the overwrite-vertex sub-flow includes:
comparing the vertex update message id of the vertex with the message id of the message, and ending the flow if the two message ids are the same;
traversing the properties of the message and searching the properties of the vertex for a ${property name} equal to the name of the message property;
overwriting the property of the vertex if it exists, or adding the property to the vertex if it does not;
traversing the properties of the vertex and comparing the ${property name} update message id of the vertex with the message id of the message;
if the two message ids are different, comparing the timestamp of the message with the ${property name} update datetime of the vertex;
and if the timestamp of the message is later, updating the related information.
Further, locking the data number corresponding to the incremental data includes:
when the incremental data is an edge, sorting the start vertex id and the end vertex id of the edge and locking them in that order.
Further, updating the corresponding data in the graph database according to the operation type and the query result includes:
when the operation type is add or overwrite, entering the add-edge sub-flow if the query result shows that the historical data does not exist, or the overwrite-edge sub-flow if it does;
and when the operation type is delete, deleting the corresponding edge data according to the additional conditions.
Further, the add-edge sub-flow includes:
setting the from vertex id of the edge to the from vertex id of the message;
setting the to vertex id of the edge to the to vertex id of the message;
setting the relationship of the edge to the relationship of the message;
setting the available field of the edge to true;
traversing the properties of the message to generate the corresponding properties of the edge;
updating the edge log;
and adding the edge data to the graph database.
Further, deleting the corresponding edge data according to the additional conditions includes:
if the additional conditions are from vertex id, to vertex id and relationship, deleting the single corresponding edge;
if the additional conditions are from vertex id and to vertex id, deleting the multiple corresponding edges;
and if the additional condition is from vertex id or to vertex id, deleting the multiple corresponding edges.
According to a second aspect of embodiments of the present application, there is provided a data updating apparatus for a graph database, comprising:
the determining module is used for determining the incremental data and the operation type of the updating operation;
the locking module is used for locking the data number corresponding to the incremental data by using a distributed lock;
the query module is used for calling a query interface of the graph database and querying whether the data number has corresponding historical data;
the updating module is used for updating the corresponding data in the graph database according to the operation type and the query result;
and the unlocking module is used for unlocking the data number after the updating operation is finished.
The technical scheme provided by the embodiment of the application has the following beneficial effects:
the scheme of the application supports distributed parallel transmission and processing of data, and supports multiple processes and multiple threads; on the premise of not modifying the existing system (data production and transmission system and graph database), the distributed type, the real-time performance, the time sequence and the idempotent performance can be simultaneously realized in the cross-system processing process from source data to the graph database.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
FIG. 1 is a flow chart illustrating a method for data updating of a graph database according to an exemplary embodiment.
FIG. 2 is a system block diagram illustrating a distributed graph database according to an exemplary embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of methods and apparatus consistent with certain aspects of the present application, as detailed in the appended claims.
FIG. 1 is a flow chart illustrating a method for data updating of a graph database according to an exemplary embodiment. The method can be applied to a distributed graph database, and specifically comprises the following steps:
step S1: determining incremental data and operation types of the updating operation;
step S2: locking the data number corresponding to the incremental data by using a distributed lock;
step S3: calling a query interface of the graph database, and querying whether historical data corresponding to the data number exists;
step S4: updating the corresponding data in the graph database according to the operation type and the query result;
step S5: and after the updating operation is completed, unlocking the data number.
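For illustration only, the five steps above can be sketched as follows. An in-memory dictionary stands in for the graph database, a process-local lock stands in for the distributed lock, and every name used here is an assumption made for the sketch rather than the claimed implementation.

import threading

graph = {}                        # data number -> stored record (stand-in for the graph database)
locks = {}                        # data number -> lock (stand-in for distributed locks)
registry_guard = threading.Lock()

def lock_for(key):
    # One lock per data number; the guard keeps the lock registry itself consistent.
    with registry_guard:
        return locks.setdefault(key, threading.Lock())

def update_graph(message):
    data, operation = message["data"], message["operation"]      # S1: incremental data and operation type
    key = data["vertex id"]                                       # the data number to lock
    lock = lock_for(key)
    lock.acquire()                                                # S2: lock the data number
    try:
        existing = graph.get(key)                                 # S3: query for historical data
        if operation in ("add", "replace"):                       # S4: update by type and query result
            graph[key] = data if existing is None else {**existing, **data}
        elif operation == "delete" and existing is not None:
            graph[key] = {**existing, "vertex available": False}  # logical deletion
    finally:
        lock.release()                                            # S5: unlock after the update completes

update_graph({"operation": "add",
              "data": {"vertex id": 506, "name value": "Zhang San"}})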
The scheme of the application supports distributed parallel transmission and processing of data, including multi-process and multi-thread processing; without modifying the existing systems (the data production and transmission systems and the graph database), it achieves distribution, real-time performance, time ordering and idempotence simultaneously in the cross-system processing from source data to graph database.
The scheme achieves distribution, real-time performance, time ordering and idempotence for the incremental update process from source data to graph database. It supports distributed parallel transmission and processing of data (with multiple processes and multiple threads): regardless of the order in which data is transmitted and processed, and regardless of whether transmission or processing is repeated, as long as every piece of related data is successfully transmitted and processed at least once, the final result is guaranteed to be completely consistent with the result of executing the data serially in time order.
To further detail the technical solution of the present application, first, a description is made of related concepts of a distributed architecture and source data.
A distributed architecture is a common solution for high concurrency. The source-data production systems that feed a graph database often adopt it, and the transmission and processing of source data from the production system to the graph database also involve distribution. The core design idea of a distributed architecture is parallel splitting and horizontal scaling, which offers the following advantages and is being adopted by more and more systems. Computing and storage capacity can be scaled out on general-purpose hardware, raising the processing capacity of the system to meet continuously growing business demands. Parallel processing breaks through the efficiency bottleneck of traditional serial processing. System reliability is improved, avoiding loss of system functions caused by single points of failure. And the same processing capacity can be obtained at lower cost than with a traditional architecture, based on relatively inexpensive general-purpose computing and storage devices.
Most source-data production systems adopt a distributed architecture, and to guarantee real-time processing of massive source data, the link from source data to graph database also needs to be distributed. In general, source data generated by a distributed parallel system cannot be transmitted in a guaranteed global time order. The same processing runs in multiple processes or threads at the same time, the time consumed is uncertain, and some nodes may fail or produce errors during processing. When a batch of ordered data is processed in a distributed, parallel fashion, the time order of that batch during processing cannot be guaranteed, and if the time order is not guaranteed, processing can lead to logical errors in the data updates. For example, suppose the value of field a of a piece of data should be updated to a0 and then to a1; if the update order is reversed, the field is updated to a1 first and then to a0, and its final value is a0 instead of a1. To guarantee time order during transmission, a cache is usually used: the system waits until every node has finished sending its data, sorts the data in the cache, and then transmits it serially. This delays data transmission and reduces timeliness; moreover, if a node cannot transmit data normally for a long time because of a fault, the delay becomes severe, or the data loses its timeliness altogether. To guarantee ordering during processing, serial processing is usually used, but as the volume of source data keeps growing, serial processing eventually reaches the processing limit of a single node; and if a distributed architecture is adopted instead, the global time order of source-data processing cannot be guaranteed. In other words, under a distributed architecture it is difficult to guarantee real-time performance and time ordering simultaneously for the transmission and processing of source data.
Since errors cannot be eliminated, error handling must be considered in any data transmission or processing procedure. It is easy to guarantee that data transmission succeeds at least once, but when data is transmitted and processed across systems it is hard to guarantee that it is processed exactly once. For example, if the receiver processes the data first and then sends a data-reception acknowledgement (ACK), an error occurring after processing but before the acknowledgement means the sender never receives the ACK, considers the transmission to have failed, and sends the data again, so the data is processed repeatedly. If instead the receiver sends the ACK first and then processes the data, an error during processing may cause data loss. If the process of persisting data to the graph database is idempotent, then the easily achieved at-least-once transmission guarantee is sufficient: source data can be transmitted, resolved and updated into the graph database with the complete process effectively succeeding exactly once.
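As a hedged illustration of this point, the sketch below combines "process first, acknowledge afterwards" with a handler that deduplicates on message id; the persist and ack callables and the in-memory id set are assumptions for the sketch, not part of the disclosed systems.

processed_ids = set()   # ids of messages already applied (idempotence check)

def handle(message, persist, ack):
    if message["message id"] in processed_ids:
        ack(message)                         # duplicate delivery: already applied, just confirm
        return
    persist(message)                         # idempotent write towards the graph database
    processed_ids.add(message["message id"])
    ack(message)                             # ACK only after successful processing; a crash before
                                             # this line causes a redelivery, which the check above absorbs

handle({"message id": "187374"}, persist=lambda m: None, ack=lambda m: None)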
The following describes the scheme of the present application in an expanded manner with reference to specific application scenarios, and introduces a system architecture, a related data structure, and a graph data processing procedure, respectively.
In the first section, a system architecture is introduced.
Referring to FIG. 2, the system architecture of a distributed graph database can be divided into three parts: source data production, graph data processing, and graph database operations.
The source data for the graph can come from any business system, and the source data needs to be arranged into the associated message data structure when transmitted. It can be sent directly by the business system or transmitted through a message queue, so a distributed parallel architecture is supported. During transmission, neither the business system nor the message queue is required to guarantee the time order of the data; it is only required that the clocks of all transmitting nodes be consistent.
Graph data processing supports a distributed architecture, and multiple graph-data processing nodes can be deployed for parallel processing. The processing flow is divided into three stages: data reception, data processing, and reception acknowledgement. A temporary error in any stage can be handled by retrying, and the final result is guaranteed to be unaffected. Data reception supports parallel batch reception, and data processing can use a thread pool for multi-thread parallel processing, with each thread handling one piece of data.
During data processing, graph data can be queried and persisted through the graph database API. The scheme does not depend on any special function of the graph database; the related operations use only the most basic graph database API: querying a vertex by vertex id; querying an edge by from vertex id, to vertex id and relationship; querying multiple edges by from vertex id and to vertex id; querying multiple edges by from vertex id or to vertex id; adding or updating a vertex; and adding or updating an edge. There is no particular requirement on how data is stored in the graph database; only the most basic fields of the graph data need to be stored. The vertex data includes a vertex id field and the vertex property fields (in the vertex data structure, information other than the vertex id can be stored in any fields; the vertex log information can be merged into one field, and each ${property name} log can be merged into one field). The edge data includes an edge id field, a from vertex id field, a to vertex id field, a relationship field and the edge property fields (in the edge data structure, information other than the edge id, from vertex id, to vertex id and relationship can be stored in any fields; the edge log information can be merged into one field, and each ${property name} log can be merged into one field). Various graph databases may be used, including but not limited to Neo4j, ArangoDB and OrientDB. After one batch of data is processed, a data-reception acknowledgement (ACK) is sent, and then the next batch of data is received for the next round of processing.
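The minimal API surface listed above can be summarised, purely as an assumed sketch that does not mirror the actual interface of Neo4j, ArangoDB, OrientDB or any other product, as:

from typing import Any, Dict, List, Optional, Protocol

class GraphAPI(Protocol):
    # query a vertex by vertex id
    def query_vertex(self, vertex_id: str) -> Optional[Dict[str, Any]]: ...
    # query one edge by from vertex id, to vertex id and relationship
    def query_edge(self, from_id: str, to_id: str, relationship: str) -> Optional[Dict[str, Any]]: ...
    # query multiple edges by from vertex id and to vertex id
    def query_edges_by_both(self, from_id: str, to_id: str) -> List[Dict[str, Any]]: ...
    # query multiple edges by from vertex id or to vertex id
    def query_edges_by_either(self, vertex_id: str) -> List[Dict[str, Any]]: ...
    # add or update one vertex / one edge
    def upsert_vertex(self, vertex: Dict[str, Any]) -> None: ...
    def upsert_edge(self, edge: Dict[str, Any]) -> None: ...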
In the second section, a related data structure is introduced.
The data structures include three parts: the message data structure used when transmitting source data, and the vertex and edge data structures in the graph database.
2.1. Message data structure
This is the data structure used when transmitting source data; it includes the following parts.
message id: the id of the message, which is globally unique (a UUID can be used); it is used to detect and handle repeated data.
timestamp: the time at which the message occurred. Time ordering in this scheme means ordering this field from earliest to latest.
operation: the data processing type; there are 8 types in total.
data: the graph data involved in the processing; different processing types involve different specific data structures. The data structure corresponding to this field is described in detail together with each processing type below.
Data structure of the properties part of data:
properties:
property:
name: attribute name-1
value: attribute value-1
property:
name: attribute name-2
value: attribute value-2
...
property:
name: attribute name-n
value: attribute value-n
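By way of illustration, a message carrying the parts above might look like the following sketch; the concrete spellings and values are assumptions based on the examples given later in this description.

message = {
    "message id": "5894309905",              # globally unique, e.g. a UUID
    "timestamp": "2019-08-17 13:53:09.970",  # time at which the message occurred
    "operation": "add or replace vertex",    # one of the 8 processing types
    "data": {
        "vertex id": "506",
        "properties": [
            {"name": "name", "value": "Zhang San"},
            {"name": "education", "value": "University"},
        ],
    },
}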
2.2. Vertex data structure
The data structure of a vertex in the graph database. Illustratively, ${property name} represents the name of a vertex property, e.g. name or age. There may be multiple sets of ${property name} value, ${property name} available and ${property name} log fields.
vertex id: the id of the vertex.
vertex available: whether the vertex is deleted; false means deleted, true means present.
vertex log: the processing log information of the vertex, including log information related to first creation, latest update, deletion and overwriting.
vertex create datetime: the timestamp of the message when the vertex was added.
vertex create message id: the message id of the message when the vertex was added. This value is only set when the vertex is added for the first time and is not modified thereafter, even if the vertex is added again after deletion.
vertex update datetime: the timestamp of the message when the vertex was last updated. The initial value is vertex create datetime.
vertex update message id: the message id of the message when the vertex was last updated. The initial value is vertex create message id.
vertex delete datetime: the timestamp of the message when the vertex was deleted. This field may not be present.
vertex delete message id: the message id of the message when the vertex was deleted. This field may not be present.
vertex replace datetime: the timestamp of the message when the vertex was overwritten. This field may not be present.
vertex replace message id: the message id of the message when the vertex was overwritten. This field may not be present.
${property name} value: the value of property ${property name}.
${property name} available: whether property ${property name} is deleted; false means deleted, true means present.
${property name} log: the processing log information of property ${property name}, including log information related to first creation, latest update, deletion and overwriting.
${property name} create datetime: the timestamp of the message when ${property name} was added.
${property name} create message id: the message id of the message when ${property name} was added. This value is only set when ${property name} is added for the first time and is not modified thereafter, even if it is added again after deletion.
${property name} update datetime: the timestamp of the message when ${property name} was last updated. The initial value is ${property name} create datetime.
${property name} update message id: the message id of the message when ${property name} was last updated. The initial value is ${property name} create message id.
${property name} delete datetime: the timestamp of the message when ${property name} was deleted. This field may not be present.
${property name} delete message id: the message id of the message when ${property name} was deleted. This field may not be present.
${property name} replace datetime: the timestamp of the message when ${property name} was overwritten. This field may not be present.
${property name} replace message id: the message id of the message when ${property name} was overwritten. This field may not be present.
For example, one vertex representing a person:
vertex id: 506
vertex available: true
vertex log:
vertex create datetime: 2001-02-28 10:05:23.613
vertex create message id: 5432898950
vertex update datetime: 2019-08-17 13:53:09.97
vertex update message id: 5894309905
name value: Zhang San
name available: true
name log:
name create datetime: 2001-02-28 10:05:23.613
name create message id: 5432898950
name update datetime: 2001-02-28 10:05:23.613
name update message id: 5432898950
education value: University
education available: true
education log:
education create datetime: 2001-02-28 10:05:23.613
education create message id: 5432898950
education update datetime: 2019-08-17 13:53:09.97
education update message id: 5894309905
Another example, a vertex representing a course:
vertex id: 7960
vertex available: true
vertex log:
vertex create datetime: 2005-12-28 15:11:37.889
vertex create message id: 5589049053
name value: College English
name available: true
name log:
name create datetime: 2005-12-28 15:11:37.889
name create message id: 5589049053
2.3. Edge data structure
The data structure of an edge in the graph database. Illustratively, ${property name} represents the name of an edge property, e.g. score or rank. There may be multiple sets of ${property name} value, ${property name} available and ${property name} log fields.
edge id: the id of the edge. A given combination of from vertex id, to vertex id and relationship corresponds to exactly one edge id, and an edge id can be obtained by concatenating them as strings, for example from vertex id + separator + to vertex id + separator + relationship, where the separator is a character that cannot appear in the from vertex id, to vertex id or relationship (a sketch of this construction follows the example at the end of this section).
from vertex id: the id of the edge's start vertex.
to vertex id: the id of the edge's end vertex.
relationship: the relationship between the start vertex and the end vertex.
edge available: whether the edge is deleted; false means deleted, true means present.
edge log: the processing log information of the edge, including log information related to first creation, latest update, deletion and overwriting.
edge create datetime: the timestamp of the message when the edge was added.
edge create message id: the message id of the message when the edge was added. This value is only set when the edge is added for the first time and is not modified thereafter, even if the edge is added again after deletion.
edge update datetime: the timestamp of the message when the edge was last updated. The initial value is edge create datetime.
edge update message id: the message id of the message when the edge was last updated. The initial value is edge create message id.
edge delete datetime: the timestamp of the message when the edge was deleted. This field may not be present.
edge delete message id: the message id of the message when the edge was deleted. This field may not be present.
edge replace datetime: the timestamp of the message when the edge was overwritten. This field may not be present.
edge replace message id: the message id of the message when the edge was overwritten. This field may not be present.
${property name} value: the value of property ${property name}.
${property name} available: whether property ${property name} is deleted; false means deleted, true means present.
${property name} log: the processing log information of property ${property name}, including log information related to first creation, latest update, deletion and overwriting.
${property name} create datetime: the timestamp of the message when ${property name} was added.
${property name} create message id: the message id of the message when ${property name} was added. This value is only set when ${property name} is added for the first time and is not modified thereafter, even if it is added again after deletion.
${property name} update datetime: the timestamp of the message when ${property name} was last updated. The initial value is ${property name} create datetime.
${property name} update message id: the message id of the message when ${property name} was last updated. The initial value is ${property name} create message id.
${property name} delete datetime: the timestamp of the message when ${property name} was deleted. This field may not be present.
${property name} delete message id: the message id of the message when ${property name} was deleted. This field may not be present.
${property name} replace datetime: the timestamp of the message when ${property name} was overwritten. This field may not be present.
${property name} replace message id: the message id of the message when ${property name} was overwritten. This field may not be present.
For example, an edge representing a person studying a course:
edge id: 26171
from vertex id: 506
to vertex id: 7960
relationship: study
edge available: true
edge log:
edge create datetime: 2019-06-15 17:00:00.59
edge create message id: 7099812
edge update datetime: 2019-09-05 15:00:01.237
edge update message id: 7234456
score value: 86
score available: true
score log:
score create datetime: 2019-06-15 17:00:00.59
score create message id: 7099812
score update datetime: 2019-09-05 15:00:01.237
score update message id: 7234456
score delete datetime: 2019-07-15 14:22:51.9
score delete message id: 7134112
score replace datetime: 2019-09-05 15:00:01.237
score replace message id: 7234456
rank value: 13
rank available: true
rank log:
rank create datetime: 2019-06-15 17:00:00.59
rank create message id: 7099812
rank update datetime: 2019-09-05 15:00:01.237
rank update message id: 7234456
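As referenced in the edge id field above, the construction of an edge id from from vertex id, to vertex id and relationship can be sketched as follows; the "|" separator is an assumption and must be a character that cannot occur in any of the three parts.

def edge_id(from_vertex_id: str, to_vertex_id: str, relationship: str, sep: str = "|") -> str:
    # One (from vertex id, to vertex id, relationship) combination maps to exactly one edge id.
    return sep.join((from_vertex_id, to_vertex_id, relationship))

assert edge_id("506", "7960", "study") == "506|7960|study"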
In the third section, the graph data processing procedure is described.
The data processing types fall into two categories: vertex-related processing and edge-related processing. For parallel processing, the vertex is taken as the minimum unit: different vertices are allowed to be processed simultaneously, and processing an edge is treated as processing its two vertices simultaneously. Before processing, the vertex id is locked. If the processing involves multiple vertices, those vertices are sorted by vertex id and then locked in that order; locking in this way avoids deadlock under concurrent locking. Distributed locks are used, and they can be implemented in Redis, for example as sketched below.
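A minimal sketch of this locking discipline is given below, assuming Redis "SET key value NX EX ttl" as the distributed-lock primitive; the key naming, spin-wait and time-to-live are illustrative choices, and a production implementation would typically use a Lua script or a ready-made lock recipe instead of the non-atomic check in unlock_vertices.

import time
import uuid

import redis  # third-party client, e.g. pip install redis

r = redis.Redis()

def lock_vertices(vertex_ids, ttl=30):
    token = str(uuid.uuid4())
    for vid in sorted(vertex_ids):                      # fixed global order avoids deadlock
        while not r.set(f"lock:vertex:{vid}", token, nx=True, ex=ttl):
            time.sleep(0.01)                            # simple spin-wait; real code would back off
    return token

def unlock_vertices(vertex_ids, token):
    for vid in sorted(vertex_ids, reverse=True):
        key = f"lock:vertex:{vid}"
        if r.get(key) == token.encode():                # only release a lock this caller still owns
            r.delete(key)                               # not atomic; shown only as a sketch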
In some embodiments, locking the data number corresponding to the incremental data includes:
when the incremental data is a vertex, locking the vertex id.
In some embodiments, updating the corresponding data in the graph database according to the operation type and the query result includes:
when the operation type is add or overwrite, entering the add-vertex sub-flow if the query result shows that the historical data does not exist, or the overwrite-vertex sub-flow if it does;
and when the operation type is delete, entering the delete-vertex sub-flow and updating the vertex log if the query result shows that the historical data exists.
3.1. Data processing types
3.1.1. Vertex-related data processing:
Add or overwrite a vertex (add or replace vertex). If this occurs after the vertex was deleted, the vertex is re-added by means of overwriting.
Delete a vertex (delete vertex). This is a logical deletion.
Add or overwrite a vertex property (add or replace vertex property). If this occurs after the vertex property was deleted, the property is re-added by means of overwriting.
Delete a vertex property (delete vertex property). This is a logical deletion.
3.1.2. Edge-related data processing:
Add or overwrite an edge (add or replace edge). If this occurs after the edge was deleted, the edge is re-added by means of overwriting.
Delete an edge (delete edge). This is a logical deletion.
Delete one edge by from vertex id, to vertex id and relationship (delete edge by from vertex id, to vertex id and relationship).
Delete multiple edges by from vertex id and to vertex id (delete edges by from vertex id and to vertex id).
Delete multiple edges by from vertex id or to vertex id (delete edges by from vertex id or to vertex id).
Add or overwrite an edge property (add or replace edge property). If this occurs after the edge property was deleted, the property is re-added by means of overwriting.
Delete an edge property (delete edge property). This is a logical deletion.
3.2. Flow for adding or overwriting a vertex
Data structure:
vertex id: vertex id.
properties: see the data structure of the properties part of data in the message data structure.
The specific flow is as follows:
1. Lock the vertex id using a distributed lock.
2. Call the graph database API and query the vertex by vertex id.
3. Determine whether the vertex already exists.
3.1. If the vertex does not exist, enter the add-vertex sub-flow.
3.2. If the vertex already exists, enter the overwrite-vertex sub-flow.
4. Unlock the vertex id using the distributed lock.
In some embodiments, the add-vertex sub-flow includes:
setting the vertex id of the vertex to the vertex id of the message;
setting the available field of the vertex to true;
traversing the properties of the message to generate the corresponding properties of the vertex;
updating the vertex log;
and adding the vertex data to the graph database.
3.2.1. Add-vertex sub-flow
1. Generate the vertex data.
1.1. Set the vertex id of this vertex to the vertex id of the message.
1.2. Set the available field of this vertex to true.
1.3. Traverse the properties of the message to generate the corresponding properties of the vertex.
1.3.1. Set the ${property name} of this vertex to the name of the message property.
1.3.2. Set the ${property name} value of this vertex to the value of the message property.
1.3.3. Enter the property log update sub-flow for property addition.
1.4. Enter the vertex log update sub-flow for vertex addition.
2. Call the graph database API and add the vertex data to the graph database.
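The sub-flow above can be sketched as a function that builds the vertex record from the message; the field spellings follow the data structures in section 2 but are assumptions as written here, and persisting via the graph database API is left out.

def build_new_vertex(message):
    vertex = {
        "vertex id": message["data"]["vertex id"],            # 1.1
        "vertex available": True,                             # 1.2
        "vertex create datetime": message["timestamp"],       # 1.4: vertex log for addition
        "vertex create message id": message["message id"],
        "vertex update datetime": message["timestamp"],
        "vertex update message id": message["message id"],
    }
    for prop in message["data"]["properties"]:                # 1.3: properties of the message
        name = prop["name"]
        vertex[f"{name} value"] = prop["value"]               # 1.3.1 / 1.3.2
        vertex[f"{name} available"] = True
        vertex[f"{name} create datetime"] = message["timestamp"]    # 1.3.3: property log for addition
        vertex[f"{name} create message id"] = message["message id"]
        vertex[f"{name} update datetime"] = message["timestamp"]
        vertex[f"{name} update message id"] = message["message id"]
    return vertex                                             # step 2 would persist this via the API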
In some embodiments, the overwrite-vertex sub-flow includes:
comparing the vertex update message id of the vertex with the message id of the message, and ending the flow if the two message ids are the same;
traversing the properties of the message and searching the properties of the vertex for a ${property name} equal to the name of the message property;
overwriting the property of the vertex if it exists, or adding the property to the vertex if it does not;
traversing the properties of the vertex and comparing the ${property name} update message id of the vertex with the message id of the message;
if the two message ids are different, comparing the timestamp of the message with the ${property name} update datetime of the vertex;
and if the timestamp of the message is later, updating the related information.
3.2.2. Universal sub-flow for overwriting a vertex/edge
The following modifications to the vertex/edge data are made on the basis of the existing vertex/edge data.
1. Compare the vertex/edge update message id of the vertex/edge with the message id of the message.
1.1. If the two message ids are the same, the message is a repeat and the flow ends.
2. Traverse the properties of the message.
2.1. Search the properties of the vertex/edge for a ${property name} equal to the name of the message property.
2.1.1. If it exists, overwrite the property of the vertex/edge.
2.1.1.1. Compare the timestamp of the message with the ${property name} update datetime of this vertex/edge.
2.1.1.1.1. If the timestamp of the message is later, the message is in normal time order; update the related information.
2.1.1.1.1.1. Set the ${property name} value of this vertex/edge to the value of the message property.
2.1.1.1.2. If the timestamp of the message is earlier, the message is out of time order and the related information does not need to be updated.
2.1.1.2. Enter the property log update sub-flow for property overwriting.
2.1.2. If it does not exist, add the property to this vertex/edge.
2.1.2.1. Set the ${property name} of this vertex/edge to the name of the message property.
2.1.2.2. Set the ${property name} value of this vertex/edge to the value of the message property.
2.1.2.3. Enter the property log update sub-flow for property addition.
3. Traverse the properties of the vertex/edge.
3.1. Compare the ${property name} update message id of the vertex/edge with the message id of the message.
3.1.1. If the two message ids are the same, the property of the vertex/edge has already been handled by the previous step.
3.1.2. If the two message ids are different:
3.1.2.1. Compare the timestamp of the message with the ${property name} update datetime of this vertex/edge.
3.1.2.1.1. If the timestamp of the message is later, the message is in normal time order; update the related information.
3.1.2.1.1.1. Set the ${property name} value of this vertex/edge to null.
3.1.2.1.2. If the timestamp of the message is earlier, the message is out of time order and the related information does not need to be updated.
3.1.2.2. Enter the property log update sub-flow for property deletion.
4. Enter the vertex log/edge log update sub-flow for vertex/edge overwriting.
5. Call the graph database API and update the vertex/edge data in the graph database.
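Steps 1 and 2 of this sub-flow can be sketched as follows; step 3 (setting to null the stored properties absent from the message) and the log sub-flows are omitted, timestamps are assumed to be fixed-width strings so that string comparison matches time order, and all field spellings are assumptions.

def overwrite_properties(stored, message):
    if stored.get("vertex update message id") == message["message id"]:
        return stored                                         # step 1.1: repeated message, nothing to do
    for prop in message["data"]["properties"]:                # step 2: traverse the message properties
        name, ts = prop["name"], message["timestamp"]
        if f"{name} value" not in stored:                     # step 2.1.2: property missing, add it
            stored[f"{name} value"] = prop["value"]
            stored[f"{name} update datetime"] = ts
            stored[f"{name} update message id"] = message["message id"]
        elif ts > stored[f"{name} update datetime"]:          # step 2.1.1: later timestamp, overwrite
            stored[f"{name} value"] = prop["value"]
            stored[f"{name} update datetime"] = ts
            stored[f"{name} update message id"] = message["message id"]
        # earlier timestamp: out-of-order message, the stored value is left untouched
    return stored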
3.3. Flow for deleting a vertex
This flow does not affect the property information of the vertex.
Data structure:
vertex id: vertex id.
The specific flow is as follows:
1. Lock the vertex id using a distributed lock.
2. Call the graph database API and query the vertex by vertex id.
3. Determine whether the vertex already exists.
3.1. If the vertex already exists:
3.1.1. Enter the vertex log update sub-flow for vertex deletion.
4. Unlock the vertex id using the distributed lock.
3.4. Flow for adding or overwriting an edge
Data structure:
from vertex id: the id of the edge's start vertex.
to vertex id: the id of the edge's end vertex.
relationship: the relationship between the start vertex and the end vertex.
properties: see the data structure of the properties part of data in the message data structure.
In some embodiments, locking the data number corresponding to the incremental data includes:
when the incremental data is an edge, sorting the start vertex id and the end vertex id of the edge and locking them in that order.
In some embodiments, updating the corresponding data in the graph database according to the operation type and the query result includes:
when the operation type is add or overwrite, entering the add-edge sub-flow if the query result shows that the historical data does not exist, or the overwrite-edge sub-flow if it does;
and when the operation type is delete, deleting the corresponding edge data according to the additional conditions.
The specific flow is as follows:
1. Sort the start vertex id and the end vertex id of the edge and lock the vertex ids in that order using distributed locks.
2. Call the graph database API and query the edge by from vertex id, to vertex id and relationship.
3. Determine whether the edge exists.
3.1. If the edge does not exist, enter the add-edge sub-flow.
3.2. If the edge already exists, enter the overwrite-edge sub-flow.
4. Unlock the vertex ids using the distributed locks.
In some embodiments, the add-edge sub-flow includes:
setting the from vertex id of the edge to the from vertex id of the message;
setting the to vertex id of the edge to the to vertex id of the message;
setting the relationship of the edge to the relationship of the message;
setting the available field of the edge to true;
traversing the properties of the message to generate the corresponding properties of the edge;
updating the edge log;
and adding the edge data to the graph database.
3.4.1. Add-edge sub-flow
1. Generate the edge data.
1.1. Set the from vertex id of this edge to the from vertex id of the message.
1.2. Set the to vertex id of this edge to the to vertex id of the message.
1.3. Set the relationship of this edge to the relationship of the message.
1.4. Set the available field of this edge to true.
1.5. Traverse the properties of the message to generate the corresponding properties of the edge.
1.5.1. Set the ${property name} of this edge to the name of the message property.
1.5.2. Set the ${property name} value of this edge to the value of the message property.
1.5.3. Enter the property log update sub-flow for property addition.
1.6. Enter the edge log update sub-flow for edge addition.
2. Call the graph database API and add the edge data to the graph database.
3.5. Flows for deleting edges
This processing does not affect the property information of the vertices. Edge deletion is divided into three sub-types: deleting one edge by from vertex id, to vertex id and relationship; deleting multiple edges by from vertex id and to vertex id; and deleting multiple edges by from vertex id or to vertex id. Note that when multiple edges are deleted, each edge must be queried again from the graph database after locking to avoid dirty writes (that is, between querying the multiple edges and locking the two vertices of one of them, another process may update that edge and change its information, so the edge information must be re-read).
In some embodiments, deleting the corresponding edge data according to the additional conditions includes:
if the additional conditions are from vertex id, to vertex id and relationship, deleting the single corresponding edge;
if the additional conditions are from vertex id and to vertex id, deleting the multiple corresponding edges;
and if the additional condition is from vertex id or to vertex id, deleting the multiple corresponding edges.
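The three cases can be sketched as a single dispatch on which additional conditions are present; the query method names reused here are the assumed ones from the GraphAPI sketch in the first section, not a real product API.

def delete_edges(api, from_id=None, to_id=None, relationship=None):
    if from_id and to_id and relationship:
        edges = [api.query_edge(from_id, to_id, relationship)]    # exactly one edge
    elif from_id and to_id:
        edges = api.query_edges_by_both(from_id, to_id)            # several edges
    else:
        edges = api.query_edges_by_either(from_id or to_id)        # several edges
    # each returned edge is then locked, logically deleted and its edge log updated
    return [e for e in edges if e is not None]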
3.5.1. Flow for deleting one edge by from vertex id, to vertex id and relationship
Data structure:
from vertex id: the id of the edge's start vertex.
to vertex id: the id of the edge's end vertex.
relationship: the relationship between the start vertex and the end vertex.
The flow is as follows:
1. Sort the start vertex id and the end vertex id of the edge and lock the vertex ids in that order using distributed locks.
2. Call the graph database API and query the edge by start vertex id, end vertex id and relationship.
3. Determine whether the edge exists.
3.1. If the edge already exists:
3.1.1. Sort the vertex ids of the edge's start vertex and end vertex and lock the vertex ids in that order using distributed locks.
3.1.2. Enter the edge log update sub-flow for edge deletion.
3.1.3. Unlock the vertex ids using the distributed locks.
4. Unlock the vertex ids using the distributed locks.
3.5.2. Flow for deleting multiple edges by from vertex id and to vertex id
Data structure:
from vertex id: the id of the edge's start vertex.
to vertex id: the id of the edge's end vertex.
The flow is as follows:
1. Call the graph database API and query the related edges by start vertex id and end vertex id.
2. Traverse the queried edges.
2.1. Sort the vertex ids of the edge's start vertex and end vertex and lock the vertex ids in that order using distributed locks.
2.2. Enter the edge log update sub-flow for edge deletion.
2.3. Unlock the vertex ids using the distributed locks.
3.5.3. Flow for deleting multiple edges by from vertex id or to vertex id
Data structure:
from vertex id: the id of the edge's start vertex.
or
to vertex id: the id of the edge's end vertex.
The flow is as follows:
1. Call the graph database API and query the related edges by start vertex id or end vertex id.
2. Traverse the queried edges.
2.1. Sort the vertex ids of the edge's start vertex and end vertex and lock the vertex ids in that order using distributed locks.
2.2. Enter the edge log update sub-flow for edge deletion.
2.3. Unlock the vertex ids using the distributed locks.
3.6. Universal sub-flows for the vertex log/edge log/property log
These sub-flows include updating the vertex/edge/${property name} available field.
3.6.1. Log update sub-flow when a vertex/edge/property is added
1. Update the vertex/edge/${property name} create message id to the message id of the message.
2. Update the vertex/edge/${property name} create datetime to the timestamp of the message.
3. Update the vertex/edge/${property name} update message id to the message id of the message.
4. Update the vertex/edge/${property name} update datetime to the timestamp of the message.
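A sketch of this sub-flow, with the prefix standing for "vertex", "edge" or a property name and all field spellings assumed:

def init_log_fields(record, prefix, message):
    record[f"{prefix} create message id"] = message["message id"]   # step 1
    record[f"{prefix} create datetime"] = message["timestamp"]      # step 2
    record[f"{prefix} update message id"] = message["message id"]   # step 3
    record[f"{prefix} update datetime"] = message["timestamp"]      # step 4
    return record

vertex = init_log_fields({"vertex id": "506"}, "vertex",
                         {"message id": "5432898950",
                          "timestamp": "2001-02-28 10:05:23.613"})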
3.6.2. Log update sub-flow when a vertex/edge/property is overwritten
1. Compare the timestamp of the message with the vertex/edge/${property name} create datetime.
1.1. If the timestamp of the message is later, the message is in normal time order and the related information does not need to be updated.
1.2. If the timestamp of the message is earlier, the message is out of time order; update the related information.
1.2.1. Update the vertex/edge/${property name} create datetime to the timestamp of the message.
1.2.2. Update the vertex/edge/${property name} create message id to the message id of the message.
2. Compare the timestamp of the message with the vertex/edge/${property name} update datetime.
2.1. If the timestamp of the message is later, the message is in normal time order; update the related information.
2.1.1. Update the vertex/edge/${property name} update datetime to the timestamp of the message.
2.1.2. Update the vertex/edge/${property name} update message id to the message id of the message.
2.2. If the timestamp of the message is earlier, the message is out of time order and the related information does not need to be updated.
3. Compare the timestamp of the message with the vertex/edge/${property name} replace datetime.
3.1. If the vertex/edge/${property name} replace datetime does not exist, or the timestamp of the message is later, the time order is normal; update the related information.
3.1.1. Update the vertex/edge/${property name} replace message id to the message id of the message.
3.1.2. Update the vertex/edge/${property name} replace datetime to the timestamp of the message.
3.2. If the timestamp of the message is earlier, the message is out of time order and the related information does not need to be updated.
4. Compare the timestamp of the message with the vertex/edge/${property name} delete datetime.
4.1. If the vertex/edge/${property name} delete datetime exists and the timestamp of the message is later, the vertex/edge/${property name} was deleted and is now being added again.
4.1.1. Update the vertex/edge/${property name} available to true.
3.6.3. Log update sub-flow when a vertex/edge/property is deleted
1. Compare the timestamp of the message with the vertex/edge/${property name} update datetime.
1.1. If the timestamp of the message is later, the message is in normal time order; update the related information.
1.1.1. Update the vertex/edge/${property name} update datetime to the timestamp of the message.
1.1.2. Update the vertex/edge/${property name} update message id to the message id of the message.
1.2. If the timestamp of the message is earlier, the message is out of time order and the related information does not need to be updated.
2. Compare the timestamp of the message with the vertex/edge/${property name} replace datetime.
2.1. If the vertex/edge/${property name} replace datetime does not exist, or the timestamp of the message is later, the time order is normal; update the related information.
2.1.1. Update the vertex/edge/${property name} available to false.
2.1.2. Update the vertex/edge/${property name} delete message id to the message id of the message.
2.1.3. Update the vertex/edge/${property name} delete datetime to the timestamp of the message.
2.2. If the timestamp of the message is earlier, the message is out of time order and the related information does not need to be updated.
3. Compare the timestamp of the message with the vertex/edge/${property name} delete datetime.
3.1. If the vertex/edge/${property name} delete datetime exists and the timestamp of the message is later, the vertex/edge/${property name} is being deleted again.
3.1.1. Update the vertex/edge/${property name} delete message id to the message id of the message.
3.1.2. Update the vertex/edge/${property name} delete datetime to the timestamp of the message.
In this scheme, the vertex/edge/property log records the creation time, latest modification time, deletion time, overwrite time and the corresponding message ids of the vertex/edge/property. The message id is used to determine whether the same message has already been processed. The creation, latest modification, deletion and overwrite times are used together to solve the time-ordering problem: the scheme first judges whether the time order of the message currently being processed is correct and then applies the corresponding flow, so that even when messages arrive out of order, the final result is the same as it would be for the normal order.
For example, suppose there are two messages (message-1 and message-2) that update the same ${property name} value (score value) of the same vertex (vertex id 1200) but with different values (80.5 and 60.9). In the normal order, message-1 is processed first and message-2 second, and score value ends up as 60.9. In the abnormal order, message-2 is processed first and message-1 second. After message-2 is processed, score update datetime has been updated to 2019-10-03 13:19:26.152. When message-1 is then processed, the designed flow first compares the timestamp of message-1 with score update datetime, finds that the timestamp of message-1 is earlier and the time order is abnormal, and therefore does not update score value. The final results of the normal and abnormal orders are thus identical, including the score log, achieving both time ordering and idempotence.
message-1:
message id: 187374
timestamp: 2019-10-03 09:30:29.374
operation: add or replace vertex property
data:
vertex id: 1200
properties:
property:
name: score
value: 80.5
message-2:
message id: 187481
timestamp: 2019-10-03 13:19:26.152
operation: add or replace vertex property
data:
vertex id: 1200
properties:
property:
name: score
value: 60.9
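Replaying these two messages in the wrong order with the same timestamp rule as the overwrite sub-flow gives the same final state; the small apply function below is an assumed stand-in for that sub-flow, and the timestamps are fixed-width strings so string comparison matches time order.

def apply(vertex, msg):
    ts, val = msg["timestamp"], msg["data"]["properties"][0]["value"]
    if "score value" not in vertex or ts > vertex["score update datetime"]:
        vertex["score value"] = val                   # normal time order: overwrite
        vertex["score update datetime"] = ts
        vertex["score update message id"] = msg["message id"]
    # otherwise the message is out of time order and score value is left alone

message_1 = {"message id": "187374", "timestamp": "2019-10-03 09:30:29.374",
             "data": {"vertex id": "1200",
                      "properties": [{"name": "score", "value": 80.5}]}}
message_2 = {"message id": "187481", "timestamp": "2019-10-03 13:19:26.152",
             "data": {"vertex id": "1200",
                      "properties": [{"name": "score", "value": 60.9}]}}

vertex = {"vertex id": "1200"}
for msg in (message_2, message_1):        # abnormal order: message-2 first
    apply(vertex, msg)

assert vertex["score value"] == 60.9      # identical to the normal-order result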
With the vertex as the minimum unit, different vertices can be processed simultaneously, and processing an edge is treated as processing its two vertices simultaneously. Locking the vertex id avoids dirty reads and dirty writes when the same vertex or edge is processed concurrently. Handling concurrency at the granularity of a single vertex keeps conflicts as small as possible and effectively increases the actual parallelism of distributed parallel processing. In general, as the volume of graph data processing grows, the data in the graph database also grows, i.e. the numbers of vertices and edges increase. When the processing volume grows and the processing parallelism is increased accordingly, the probability of conflicts does not grow linearly. The processing load per node can therefore be reduced by scaling out, which lowers latency and guarantees real-time performance.
In conclusion, the scheme simultaneously achieves distribution, real-time performance, time ordering and idempotence for the incremental update process from source data to graph database. It supports distributed parallel transmission and processing of data (with multiple processes and multiple threads): regardless of the order in which data is transmitted and processed, and regardless of whether transmission or processing is repeated, as long as every piece of related data is successfully transmitted and processed at least once, the final result is guaranteed to be completely consistent with the result of executing the data serially in time order.
The present application further provides the following embodiments:
a data updating apparatus for a graph database, comprising:
the determining module is used for determining the incremental data and the operation type of the updating operation;
the locking module is used for locking the data number corresponding to the incremental data by using a distributed lock;
the query module is used for calling a query interface of the graph database and querying whether the data number has corresponding historical data;
the judging module is used for updating corresponding data in the graph database according to the operation type and the judging result;
and the unlocking module is used for unlocking the data number after the updating operation is completed.
With regard to the apparatus in the above embodiment, the specific manner in which each module performs its operations has been described in detail in the embodiments of the method, and is not elaborated here.
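Purely for illustration, the cooperation of the five modules for a vertex message might be sketched as below. The GraphClient and LockManager classes, their method names, and the message layout are hypothetical stand-ins rather than the interface of any particular graph database or lock service, and the timestamp and message id checks shown earlier are omitted for brevity.

import threading
from contextlib import contextmanager

class LockManager:
    """In-process stand-in for the distributed lock used by the locking module."""
    def __init__(self):
        self._locks, self._guard = {}, threading.Lock()
    @contextmanager
    def lock(self, data_number):
        with self._guard:
            lk = self._locks.setdefault(data_number, threading.Lock())
        with lk:
            yield

class GraphClient:
    """Hypothetical wrapper around the graph database's query interface."""
    def __init__(self):
        self._vertices = {}
    def query_vertex(self, vertex_id):
        return self._vertices.get(vertex_id)
    def add_vertex(self, vertex):
        self._vertices[vertex["vertex id"]] = vertex
    def delete_vertex(self, vertex_id):
        self._vertices.pop(vertex_id, None)

def update_graph(message, graph, locks):
    # determining module: incremental data (a vertex here) and operation type
    vertex_id, operation = message["vertex id"], message["operation"]
    # locking module: lock the data number (the vertex id) with the distributed lock
    with locks.lock(("vertex", vertex_id)):
        # query module: does historical data exist for this data number?
        existing = graph.query_vertex(vertex_id)
        # judging module: update according to operation type and query result
        if operation == "add or replace":
            if existing is None:
                graph.add_vertex({"vertex id": vertex_id, "available": True,
                                  "properties": dict(message["properties"])})
            else:
                existing["properties"].update(message["properties"])
        elif operation == "delete" and existing is not None:
            graph.delete_vertex(vertex_id)
    # unlocking module: the lock is released when the with-block exits

graph, locks = GraphClient(), LockManager()
update_graph({"vertex id": 1200, "operation": "add or replace",
              "properties": {"score": 80.5}}, graph, locks)
print(graph.query_vertex(1200)["properties"])   # {'score': 80.5}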
It should be understood that the same or similar parts of the above embodiments may be referred to one another, and content not described in detail in one embodiment may refer to the same or similar description in other embodiments.
It should be noted that, in the description of the present application, the terms "first", "second", etc. are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. Further, in the description of the present application, "a plurality" means at least two unless otherwise specified.
Any process or method description in a flowchart, or otherwise described herein, may be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing specific logical functions or steps of the process. The scope of the preferred embodiments of the present application also includes implementations in which functions are executed out of the order shown or discussed, including substantially concurrently or in reverse order depending on the functionality involved, as would be understood by those skilled in the art to which the present application pertains.
It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist alone physically, or two or more units may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc.
In the description herein, reference to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Although embodiments of the present application have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present application, and that variations, modifications, substitutions and alterations may be made to the above embodiments by those of ordinary skill in the art within the scope of the present application.

Claims (10)

1. A method for updating data in a graph database, comprising:
determining incremental data and operation types of the updating operation;
locking the data number corresponding to the incremental data by using a distributed lock;
calling a query interface of a graph database, and querying whether corresponding historical data exist in the data number;
updating corresponding data in the graph database according to the operation type and the judgment result;
and after the updating operation is completed, unlocking the data number.
2. The method of claim 1, wherein locking the data number corresponding to the incremental data comprises:
when the incremental data is a vertex, the vertex id is locked.
3. The method according to claim 2, wherein said updating the corresponding data in the graph database according to the operation type and the determination result comprises:
when the operation type is adding or covering, if the judgment result is that the corresponding historical data does not exist, entering a sub-process of adding a vertex; if the judgment result is that the corresponding historical data exists, entering a sub-process of covering the vertex;
and when the operation type is deletion, if the judgment result is that the corresponding historical data exists, entering a sub-process of deleting the vertex and updating the vertex log.
4. The method of claim 3, wherein the sub-process of adding vertex comprises:
setting the vertex id of the vertex as the vertex id of the message;
setting available of vertex as true;
traversing properties of the message to generate properties corresponding to vertex;
updating the vertex log;
and adding the vertex data to the graph database.
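A minimal sketch of the add-vertex sub-process of claim 4, under the assumption that the vertex log and the per-property logs are stored as ordinary fields of the vertex record (the field names are illustrative):

from datetime import datetime

def build_new_vertex(message):
    """Build the vertex record from the message before writing it to the graph database."""
    update_date = datetime.strptime(message["timestamp"], "%Y-%m-%d %H:%M:%S.%f")
    vertex = {
        "vertex id": message["vertex id"],        # vertex id taken from the message
        "available": True,                        # a newly added vertex is available
        "properties": {},
        # vertex-level log recording which message produced this state
        "vertex update message id": message["message id"],
        "vertex update date": update_date,
    }
    # traverse the message properties and generate the vertex properties,
    # each carrying its own per-property update log
    for name, value in message["properties"].items():
        vertex["properties"][name] = {
            "value": value,
            "update message id": message["message id"],
            "update date": update_date,
        }
    return vertex

msg = {"message id": 187374, "timestamp": "2019-10-03 09:30:29.374",
       "vertex id": 1200, "properties": {"score": 80.5}}
print(build_new_vertex(msg)["properties"]["score"]["value"])   # 80.5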
5. The method of claim 3, wherein the sub-flow of the overlay vertex comprises:
comparing the vertex update message id of the vertex with the message id of the message, and if the two message ids are the same, ending the process;
traversing the properties of the message, and searching the properties of the vertex for a ${property name} that is the same as the name of the property of the message;
if such a property exists, overriding the property of the vertex; if not, adding the property to the vertex;
traversing the properties of the vertex, and comparing the ${property name} update message id of the vertex with the message id of the message;
if the two message ids are not the same, comparing the timestamp of the message with the ${property name} update date of the vertex;
and if the timestamp of the message is later, updating the related information.
6. The method of claim 2, wherein locking the data number corresponding to the incremental data comprises:
and when the incremental data is an edge, sorting the starting vertex id and the end vertex id of the edge, and locking the starting vertex id and the end vertex id in the sorted order.
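One way to realize the ordered locking of claim 6 is sketched below. The in-process lock registry stands in for a distributed lock service; sorting the two vertex ids before acquisition is what prevents two workers that touch the same pair of vertices from deadlocking.

import threading
from contextlib import contextmanager

_locks = {}
_registry_guard = threading.Lock()

def _lock_for(vertex_id):
    # stand-in for obtaining a distributed lock handle keyed by the vertex id
    with _registry_guard:
        return _locks.setdefault(vertex_id, threading.Lock())

@contextmanager
def edge_lock(start_vertex_id, end_vertex_id):
    """Lock both endpoint vertex ids of an edge, always in sorted order."""
    first, second = sorted((start_vertex_id, end_vertex_id))
    with _lock_for(first):
        if second != first:            # a self-loop edge needs only one lock
            with _lock_for(second):
                yield
        else:
            yield

# Updates to edge 1200->900 and edge 900->1200 both lock 900 before 1200,
# so two concurrent workers cannot each hold one lock and wait for the other.
with edge_lock(1200, 900):
    pass   # query / add / cover / delete the edge data here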
7. The method according to claim 6, wherein said updating the corresponding data in the graph database according to the operation type and the determination result comprises:
when the operation type is adding or covering, if the judgment result is that the corresponding historical data does not exist, entering a sub-process of adding an edge; if the judgment result is that the corresponding historical data exists, entering a sub-process of covering the edge;
and when the operation type is deletion, deleting the corresponding edge data according to the additional condition.
8. The method of claim 7, wherein the adding edge sub-flow comprises:
setting the from vertex id of the edge as the from vertex id of the message;
setting the to vertex id of the edge as the to vertex id of the message;
setting the relationship of the edge as the relationship of the message;
setting available of the edge as true;
traversing properties of the message to generate properties corresponding to the edge;
updating the edge log;
adding edge data to the graph database.
9. The method of claim 6, wherein deleting the corresponding edge data according to the additional condition comprises:
if the additional conditions are from vertex id, to vertex id and relationship, deleting the corresponding 1 edge data;
if the additional conditions are from vertex id and to vertex id, deleting a plurality of corresponding edge data;
and if the additional condition is from vertex id or to vertex id, deleting a plurality of corresponding edge data.
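The additional conditions of claim 9 can be read as progressively narrower filters. The sketch below selects the edge data that a given combination of conditions would delete; the in-memory edge list and field names are illustrative stand-ins for a graph-database query.

def matching_edges(edges, from_vertex_id=None, to_vertex_id=None, relationship=None):
    """Return the edges matching the supplied additional conditions; the more
    conditions given, the narrower the match."""
    result = []
    for edge in edges:
        if from_vertex_id is not None and edge["from vertex id"] != from_vertex_id:
            continue
        if to_vertex_id is not None and edge["to vertex id"] != to_vertex_id:
            continue
        if relationship is not None and edge["relationship"] != relationship:
            continue
        result.append(edge)
    return result

edges = [
    {"from vertex id": 1200, "to vertex id": 900, "relationship": "transfer"},
    {"from vertex id": 1200, "to vertex id": 900, "relationship": "owns"},
    {"from vertex id": 900,  "to vertex id": 700, "relationship": "transfer"},
]
# All three conditions: exactly one edge matches and would be deleted.
print(len(matching_edges(edges, 1200, 900, "transfer")))   # 1
# Only from vertex id: every outgoing edge of vertex 1200 matches.
print(len(matching_edges(edges, from_vertex_id=1200)))     # 2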
10. An apparatus for updating data in a graph database, comprising:
the determining module is used for determining the incremental data and the operation type of the updating operation;
the locking module is used for locking the data number corresponding to the incremental data by using a distributed lock;
the query module is used for calling a query interface of the graph database and querying whether the data number has corresponding historical data;
the judging module is used for updating corresponding data in the graph database according to the operation type and the judging result;
and the unlocking module is used for unlocking the data number after the updating operation is completed.
CN202010241791.2A 2020-03-31 2020-03-31 Data updating method and device for graph database Pending CN111309750A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010241791.2A CN111309750A (en) 2020-03-31 2020-03-31 Data updating method and device for graph database

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010241791.2A CN111309750A (en) 2020-03-31 2020-03-31 Data updating method and device for graph database

Publications (1)

Publication Number Publication Date
CN111309750A true CN111309750A (en) 2020-06-19

Family

ID=71146053

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010241791.2A Pending CN111309750A (en) 2020-03-31 2020-03-31 Data updating method and device for graph database

Country Status (1)

Country Link
CN (1) CN111309750A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112015819A (en) * 2020-08-31 2020-12-01 杭州欧若数网科技有限公司 Data updating method, device, equipment and medium for distributed graph database
CN112860953A (en) * 2021-01-27 2021-05-28 国家计算机网络与信息安全管理中心 Data importing method, device, equipment and storage medium of graph database
CN116028651A (en) * 2023-03-28 2023-04-28 南京万得资讯科技有限公司 Knowledge graph construction system and method supporting ontology and data increment updating

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101726309A (en) * 2009-12-18 2010-06-09 吉林大学 Navigation electronic map dynamic topology rebuilding system method based on road data increment updating
CN102693324A (en) * 2012-01-09 2012-09-26 西安电子科技大学 Distributed database synchronization system, synchronization method and node management method
US20130110766A1 (en) * 2009-10-13 2013-05-02 Open Text Software Gmbh Method for performing transactions on data and a transactional database
CN106354729A (en) * 2015-07-16 2017-01-25 阿里巴巴集团控股有限公司 Graph data handling method, device and system
CN107967279A (en) * 2016-10-19 2018-04-27 北京国双科技有限公司 The data-updating method and device of distributed data base
US20180144060A1 (en) * 2016-11-23 2018-05-24 Linkedin Corporation Processing deleted edges in graph databases
CN108415835A (en) * 2018-02-22 2018-08-17 北京百度网讯科技有限公司 Distributed data library test method, device, equipment and computer-readable medium
CN108595251A (en) * 2018-05-10 2018-09-28 腾讯科技(深圳)有限公司 Dynamic Graph update method, device, storage engines interface and program medium
CN109033234A (en) * 2018-07-04 2018-12-18 中国科学院软件研究所 It is a kind of to update the streaming figure calculation method and system propagated based on state
CN109582831A (en) * 2018-10-16 2019-04-05 中国科学院计算机网络信息中心 A kind of chart database management system for supporting unstructured data storage and inquiry
CN109670089A (en) * 2018-12-29 2019-04-23 颖投信息科技(上海)有限公司 Knowledge mapping system and its figure server
CN110609904A (en) * 2019-09-11 2019-12-24 深圳众赢维融科技有限公司 Graph database data processing method and device, electronic equipment and storage medium
CN110866024A (en) * 2019-11-06 2020-03-06 山东省国土测绘院 Vector database increment updating method and system


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200619