CN111046241A

CN111046241A - Graph storage method and device for stream graph processing

Info

Publication number: CN111046241A
Application number: CN201911178913.1A
Authority: CN
Inventors: 李东升; 贾孟涵; 赖志权; 陈易欣
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2019-11-27
Filing date: 2019-11-27
Publication date: 2020-04-21
Anticipated expiration: 2039-11-27
Also published as: CN111046241B

Abstract

The application relates to a graph storage method and device for stream graph processing. The method comprises the following steps: acquiring triple data corresponding to data to be stored in a data set; the triple data includes: a subject entity, an entity relationship, and an object entity; storing a subject entity through a first array of a stream graph, storing an entity relation through a second array of the stream graph, and storing an object entity through an array chain of the stream graph to obtain graph data corresponding to a data set; the array elements in the first array point to the second array, and the array elements in the second array point to an array chain. By adopting the method, the requirements of low storage overhead and high throughput can be met under the condition of ensuring the accuracy.

Description

Graph storage method and device for stream graph processing

Technical Field

The present application relates to the field of graph storage technologies, and in particular, to a graph storage method and apparatus for stream graph processing.

Background

Streaming graph processing is currently a difficult problem in the field of graph computing and is of great practical significance. Social networks, and indeed many kinds of user information, can be regarded as streaming graphs; in a social network, for example, the relationships among multiple users can be represented by a streaming graph. In recent years, streaming graph processing has developed toward the goals of high throughput and reduced storage overhead. To reach these goals, however, conventional techniques process data by hashing. Although this improves throughput and reduces storage overhead to some extent, it is difficult to meet the requirements of low storage overhead and high throughput in a system that must also guarantee accuracy.

Disclosure of Invention

In view of the foregoing, it is desirable to provide a graph storage method and apparatus for streaming graph processing that can achieve low storage overhead and high throughput while ensuring accuracy.

A graph storage method for streaming graph processing, the method comprising:

acquiring triple data corresponding to data to be stored in a data set; the triple data includes: a subject entity, an entity relationship, and an object entity;

storing the subject entity through a first array of a streaming graph, storing the entity relationship through a second array of the streaming graph, and storing the object entity through an array chain of the streaming graph to obtain graph data corresponding to the data set;

wherein array elements in the first array point to the second array, and array elements in the second array point to one of the array chains.

In one embodiment, the method further comprises the following steps: and sequentially storing object entities as array elements in the arrays of the array chain, and generating a new array in the array chain to store the object entities when the arrays of the array chain are full.

In one embodiment, the method further comprises the following steps: setting a first pointer corresponding to array elements in the first array, wherein the first pointer points to the second array; and setting a second pointer corresponding to the array elements in the second array, wherein the second pointer points to the array chain.

In one embodiment, the method further comprises the following steps: when detecting that a first entity is stored in the first array, pointing an array element corresponding to the first entity to a null pointer; the first entity is not in a set of subject entities consisting of the subject entities; and when detecting that the subject entity stored in the first array and the entity relationship stored in the second array do not have the object entity, pointing array elements corresponding to the entity relationship in the second array to a null pointer.

A graph storage device for streaming graph processing, the device comprising:

a data parsing module, configured to acquire triple data corresponding to data to be stored in a data set; the triple data includes: a subject entity, an entity relationship, and an object entity;

a data storage module, configured to store the subject entity through a first array of a streaming graph, store the entity relationship through a second array of the streaming graph, and store the object entity through an array chain of the streaming graph to obtain graph data corresponding to the data set;

A data insertion method of streaming graph processing, comprising:

acquiring insertion triple data of data to be inserted;

detecting whether the first array of the graph data includes the subject entity of the insertion triple data;

if yes, detecting whether the second array of the graph data includes the entity relationship of the insertion triple data;

if yes, detecting whether the array chain of the graph data includes the object entity of the insertion triple data;

if not, inserting the object entity of the insertion triple data into the array chain.

A data query method for stream graph processing comprises the following steps:

acquiring query triple data of data to be queried;

detecting whether the first array of the graph data includes the subject entity of the query triple data;

if not, returning the query result as null; if yes, detecting whether the second array of the graph data includes the entity relationship of the query triple data;

if not, returning the query result as null; if yes, detecting whether the array chain of the graph data includes the object entity of the query triple data;

if not, returning the query result as null; if yes, returning the result as present.

A data deletion method for stream graph processing comprises the following steps:

acquiring deletion triple data of data to be deleted;

detecting whether the first array of the graph data includes the subject entity of the deletion triple data;

if yes, detecting whether the second array of the graph data includes the entity relationship of the deletion triple data;

if yes, detecting whether the array chain of the graph data includes the object entity of the deletion triple data;

and if yes, deleting the object entity of the deletion triple data from the array chain.

A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:

A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:

According to the graph storage method and device for streaming graph processing, the data to be stored is converted into triple data, and the parts of each triple are stored separately in different arrays and array chains. Because fixed streaming logic exists between the arrays and the array chains in the streaming graph, the flow of data operations on the streaming graph is greatly simplified, which improves graph storage throughput and saves storage overhead. In addition, because the streaming logic is fixed, the accuracy of data operations is guaranteed.

Drawings

FIG. 1 is a diagram of an application scenario of a graph storage method for stream graph processing in one embodiment;

FIG. 2 is a flow diagram that illustrates a graph storage method for stream graph processing, according to one embodiment;

FIG. 3 is a block diagram of a streaming graph structure in accordance with an embodiment;

FIG. 4 is a flow diagram that illustrates a data insertion method for streaming graph processing, according to one embodiment;

FIG. 5 is a flow diagram that illustrates a data query method of the streaming graph process, according to one embodiment;

FIG. 6 is a flow diagram that illustrates a data deletion methodology for streaming graph processing, according to one embodiment;

FIG. 7 is a block diagram of a graph storage device for streaming graph processing in one embodiment;

FIG. 8 is a diagram illustrating an internal structure of a computer device according to an embodiment.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.

The graph storage method for streaming graph processing provided by the application can be applied to the application environment shown in fig. 1. Wherein the terminal 102 communicates with the server 104 via a network. The terminal 102 may be, but not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices, and the server 104 may be implemented by an independent server or a server cluster formed by a plurality of servers.

Specifically, the terminal 102 acquires the triple data from the server 104, so that graph data corresponding to the data set is established in the terminal 102. The terminal 102 may perform streaming graph processing on the graph data.

In one embodiment, as shown in fig. 2, a graph storing method for streaming graph processing is provided, which is described by taking the method as an example applied to the terminal in fig. 1, and includes the following steps:

step 202, triple data corresponding to data to be stored in the data set is obtained.

The triple data includes: subject entities, entity relationships, and object entities. An entity is a thing that exists objectively and can be distinguished from other things. For example, in the statement "Changsha is the provincial capital of Hunan", Changsha and Hunan are both entities, and the entity relationship is "provincial capital". When processed by a computer, the triple data is therefore (Changsha, Hunan, provincial capital), where Changsha is the subject entity and Hunan is the object entity. Each array occupies a certain amount of storage, so the fewer the arrays and the fewer the elements in each array, the less storage is consumed.
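As a minimal illustration of the triple form, the sketch below models one triple in Python. The `Triple` type and its field names are assumptions introduced here for illustration; the application does not prescribe any particular in-memory representation.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """Hypothetical container for one piece of triple data."""
    subject: str   # subject entity
    relation: str  # entity relationship
    obj: str       # object entity


# The example from the text, addressed by field name rather than position.
triple = Triple(subject="Changsha", relation="provincial capital", obj="Hunan")
print(triple.subject, triple.relation, triple.obj)
```

Naming the fields avoids any ambiguity about the positional order in which a triple is written.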

For the data acquired from the server, entities and entity relationships contained in the text can be extracted by predefining the entities, or the entities and the entity relationships contained in the text can be extracted by semantically identifying the text, so that the information contained in the text is stored in a form of triple data.

Step 204, storing the subject entity through the first array of the streaming graph, storing the entity relationship through the second array of the streaming graph, and storing the object entity through the array chain of the streaming graph to obtain graph data corresponding to the data set.

As shown in fig. 3, the array elements in the first array point to the second array, and the array elements in the second array point to an array chain, thereby forming the streaming processing logic of the streaming graph. Array elements can be added, queried, and deleted, so data operations can be performed quickly through the processing logic of the streaming graph. Graph data refers to the data of the data set as stored using the streaming graph logic.
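The three-level layout described above can be sketched as follows. This is an illustrative model only: the class names `SubjectElement`, `RelationElement`, and `ChainBlock` are hypothetical, and Python object references stand in for the pointers of the embodiment.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ChainBlock:
    """One array in an array chain; `next` links to the following array."""
    objects: List[str] = field(default_factory=list)
    next: Optional["ChainBlock"] = None


@dataclass
class RelationElement:
    """Element of a second array: an entity relationship and its chain pointer."""
    relation: str
    chain: Optional[ChainBlock] = None


@dataclass
class SubjectElement:
    """Element of the first array: a subject entity and its second-array pointer."""
    subject: str
    second_array: Optional[List[RelationElement]] = None


# Store the single example triple from the text.
chain = ChainBlock(objects=["Hunan"])
second_array = [RelationElement("provincial capital", chain)]
first_array = [SubjectElement("Changsha", second_array)]

# Follow the pointers: first array -> second array -> array chain.
subject_element = first_array[0]
relation_element = subject_element.second_array[0]
stored_object = relation_element.chain.objects[0]
print(stored_object)
```

Every operation walks the same fixed path, which is what the text calls the streaming processing logic.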

In the above graph storage method for streaming graph processing, the data to be stored is converted into triple data, and the parts of each triple are stored separately in different arrays and array chains. Because fixed streaming logic exists between the arrays and the array chains in the streaming graph, the flow of data operations on the streaming graph is greatly simplified, which improves graph storage throughput and saves storage overhead. In addition, because the streaming logic is fixed, the accuracy of data operations is guaranteed.

In one embodiment, each array in the array chain has a fixed length to ensure the efficiency of data queries. Object entities are stored as array elements in an array of the array chain, and when that array is full, a new array is generated in the array chain to store further object entities.

Specifically, the length of each array in the array chain is fixed at 5, that is, each array can store 5 elements. When an array holds 5 elements, a new array is automatically generated, and the original array and the new array are linked to form the array chain. The number of elements in the array chain as a whole is not limited.
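A minimal sketch of such an array chain is given below, assuming a singly linked list of fixed-capacity Python lists. The names `ArrayChain`, `ChainBlock`, and `block_count` are hypothetical; only the capacity of 5 comes from the embodiment.

```python
from typing import Iterator, List, Optional


class ChainBlock:
    """One fixed-capacity array in the chain."""

    def __init__(self) -> None:
        self.items: List[str] = []
        self.next: Optional["ChainBlock"] = None


class ArrayChain:
    """Linked chain of arrays that grows by one array whenever the last is full."""

    def __init__(self, capacity: int = 5) -> None:
        self.capacity = capacity
        self.head = ChainBlock()
        self.tail = self.head

    def append(self, obj: str) -> None:
        if len(self.tail.items) == self.capacity:
            # The last array is full: generate a new array and link it in.
            new_block = ChainBlock()
            self.tail.next = new_block
            self.tail = new_block
        self.tail.items.append(obj)

    def __iter__(self) -> Iterator[str]:
        block: Optional[ChainBlock] = self.head
        while block is not None:
            yield from block.items
            block = block.next

    def block_count(self) -> int:
        count, block = 0, self.head
        while block is not None:
            count += 1
            block = block.next
        return count


chain = ArrayChain(capacity=5)
for i in range(6):
    chain.append(f"o{i}")
print(chain.block_count(), list(chain))
```

Storing six object entities fills the first array and places the sixth in a newly generated second array.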

In this embodiment, because the array elements in the first array and the second array are fixed, a lookup in either array, taking data query as an example, has only two possible outcomes: a result or no result. The number of entity relationships that a subject entity points to does not need to be considered, which greatly reduces the number of lookups per query. In addition, when the corresponding object entity is queried, the number of lookups equals the length of the array chain that stores the object entity. Using an array chain therefore allows the array length to be configured flexibly while maintaining query efficiency.

In one embodiment, the streaming graph processing logic is determined by setting array pointers. Specifically, a first pointer corresponding to an array element in the first array is set, where the first pointer points to the second array; and a second pointer corresponding to an array element in the second array is set, where the second pointer points to the array chain. In this embodiment, each array element serves as a pointer; for example, an array element in the first array serves as a pointer to a second array. The pointers guarantee the accuracy of data query, insertion, and deletion.

In another embodiment, when it is detected that a first entity is stored in the first array, the array element corresponding to the first entity is pointed to a null pointer, where the first entity is not in the set of subject entities; and when it is detected that no object entity exists for a subject entity stored in the first array and an entity relationship stored in the second array, the array element corresponding to that entity relationship in the second array is pointed to a null pointer. In this embodiment, if an entity is stored in the first array but has no corresponding entity relationship, its array element is pointed to a null pointer to keep the pointer logic of the first array correct. Similarly, when a subject entity and an entity relationship have no corresponding object entity, the array element corresponding to the entity relationship points to a null pointer.
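The null-pointer convention can be sketched as follows, with Python's `None` standing in for the null pointer. The nested-tuple layout, the helper `objects_of`, and the placeholder names `s1`, `p1`, `o1` are assumptions made for illustration, and each array chain is simplified to a list of arrays.

```python
from typing import List, Optional, Tuple

Chain = List[List[str]]                                 # simplified array chain
SecondArray = List[Tuple[str, Optional[Chain]]]         # (relationship, chain pointer)
FirstArray = List[Tuple[str, Optional[SecondArray]]]    # (entity, second-array pointer)

first_array: FirstArray = [
    ("s1", [("p1", [["o1", "o2"]]), ("p2", None)]),  # p2 has no object entity
    ("s2", None),                                    # s2 has no entity relationship
]


def objects_of(array: FirstArray, subject: str, relation: str) -> List[str]:
    """Return the stored object entities, or an empty list on any null pointer."""
    for s, second_array in array:
        if s != subject:
            continue
        if second_array is None:      # null pointer in the first array
            return []
        for p, chain in second_array:
            if p == relation:
                if chain is None:     # null pointer in the second array
                    return []
                return [o for block in chain for o in block]
        return []
    return []


print(objects_of(first_array, "s1", "p1"))
print(objects_of(first_array, "s1", "p2"))
print(objects_of(first_array, "s2", "p1"))
```

Checking for the null pointer before each dereference keeps a lookup from following a pointer that was never set.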

In one embodiment, as shown in fig. 4, there is provided a schematic flow chart of a data insertion method of a streaming graph process, including:

Step 402, obtaining the insertion triple data of the data to be inserted.

Step 404, detecting whether the first array of the graph data includes the subject entity of the insertion triple data.

Step 406, if yes, detecting whether the second array of the graph data includes the entity relationship of the insertion triple data.

Step 408, if yes, detecting whether the array chain of the graph data includes the object entity of the insertion triple data.

Step 410, if not, inserting the object entity of the insertion triple data into the array chain.

In this embodiment, a data insertion method for stream graph processing is provided, and when data insertion is performed, only simple logic judgment is needed to insert specified data into graph data, so that instructions for data storage are greatly reduced, and throughput of data insertion is improved.

For step 406, in one embodiment, if it is detected that the first array of the graph data does not include the subject entity of the insertion triple data, the subject entity is inserted into the first array and pointed to a second array through a pointer, and step 404 is then repeated.

For step 408, in an embodiment, if the second array of the graph data does not include the entity relationship of the insertion triple data, a second array is created, the entity relationship of the insertion triple data is stored in the second array, and the array element corresponding to the subject entity of the insertion triple data is pointed to the second array.

For step 410, in an embodiment, detecting that the array chain of the graph data already includes the object entity of the insertion triple data indicates that the triple data is already present in the graph data. The insertion is not performed, and an insertion-failure result is returned.
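Steps 402 to 410 and their fallback branches can be sketched as one function. This is an illustrative sketch under stated assumptions, not the claimed implementation: arrays are plain Python lists of `[key, pointer]` pairs, `None` stands in for the null pointer, each array chain is simplified to a list of arrays, and the names `find`, `insert`, and `BLOCK_CAPACITY` are hypothetical. The missing-subject case is handled inline rather than by literally repeating step 404.

```python
BLOCK_CAPACITY = 5  # fixed array length from the embodiment above


def find(array, key):
    """Return the element of `array` whose first field equals `key`, else None."""
    for element in array:
        if element[0] == key:
            return element
    return None


def insert(first_array, subject, relation, obj):
    """Insert a triple; return True on success, False if it is already stored."""
    s = find(first_array, subject)
    if s is None:
        # Subject entity missing: add it to the first array.
        s = [subject, None]
        first_array.append(s)
    if s[1] is None:
        s[1] = []  # create the second array for this subject entity
    p = find(s[1], relation)
    if p is None:
        # Entity relationship missing: add it to the second array.
        p = [relation, None]
        s[1].append(p)
    if p[1] is None:
        p[1] = [[]]  # create the array chain with one empty array
    chain = p[1]
    if any(obj in block for block in chain):
        return False  # triple already present: the insertion fails
    if len(chain[-1]) == BLOCK_CAPACITY:
        chain.append([])  # last array is full: extend the chain
    chain[-1].append(obj)
    return True


graph = []
first = insert(graph, "Changsha", "provincial capital", "Hunan")
second = insert(graph, "Changsha", "provincial capital", "Hunan")
print(first, second)
```

The second call finds the object entity already in the array chain and reports failure, matching the branch described for step 410.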

In one embodiment, as shown in fig. 5, a schematic flow chart of a data query method for stream graph processing is provided, which includes the following specific steps:

step 502, obtaining the query triple data of the data to be queried.

Step 504, whether the first array of the graph data comprises the subject entity in the query triple data is detected.

Step 506, if not, returning that the query result is null, and if so, detecting whether the second array of the graph data comprises the entity relationship in the query triple data.

Step 508, if not, returning the query result as null; if yes, detecting whether the array chain of the graph data comprises the object entity in the query triple data.

Step 510, if not, returning the query result as null; if yes, returning the result as present.

In this embodiment, a data query requires only simple judgment logic, so the number of data query instructions is reduced and the throughput of data storage is improved.
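The query flow of steps 502 to 510 can be sketched as follows. The nested-list layout, the function name `query`, and the placeholder data `s1`, `p1`, `o1` are assumptions for illustration; `None` stands in for the null pointer, and a Boolean return value stands in for the "null" and "present" results.

```python
def query(first_array, subject, relation, obj):
    """Return True if the triple is present, False for a null result."""
    # Step 504: look for the subject entity in the first array.
    second_array = None
    for s, pointer in first_array:
        if s == subject:
            second_array = pointer
            break
    if second_array is None:
        return False  # subject missing, or its pointer is null
    # Step 506: look for the entity relationship in the second array.
    chain = None
    for p, pointer in second_array:
        if p == relation:
            chain = pointer
            break
    if chain is None:
        return False  # relationship missing, or its pointer is null
    # Steps 508 and 510: scan the array chain for the object entity.
    return any(obj in block for block in chain)


# Hypothetical graph data: s1 -p1-> o1..o6, s1 -p2-> (none), s2 -> (none).
graph = [
    ["s1", [["p1", [["o1", "o2", "o3", "o4", "o5"], ["o6"]]], ["p2", None]]],
    ["s2", None],
]
print(query(graph, "s1", "p1", "o6"))
print(query(graph, "s2", "p1", "o1"))
```

Each level either yields a pointer to the next level or ends the query with a null result, so no step depends on the total number of edges.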

In one embodiment, as shown in fig. 6, a schematic flowchart of a data deleting method for stream graph processing is provided, which includes the following specific steps:

Step 602, obtaining the deletion triple data of the data to be deleted.

Step 604, detecting whether the first array of the graph data includes the subject entity of the deletion triple data.

Step 606, if yes, detecting whether the second array of the graph data includes the entity relationship of the deletion triple data.

Step 608, if yes, detecting whether the array chain of the graph data includes the object entity of the deletion triple data.

Step 610, if yes, deleting the object entity of the deletion triple data from the array chain.

In this embodiment, a deletion requires only simple judgment logic, so the number of data deletion instructions is reduced and the throughput of data storage is improved.

With respect to step 606, in one embodiment, detecting that the first array of the graph data does not include the subject entity of the deletion triple data indicates that the deletion triple data is not in the graph data. The deletion therefore fails, and a deletion-failure message is returned.

With respect to step 608, in one embodiment, detecting that the second array of the graph data does not include the entity relationship of the deletion triple data indicates that the deletion triple data is not in the graph data. The deletion therefore fails, and a deletion-failure message is returned.

With respect to step 610, in one embodiment, detecting that the array chain of the graph data does not include the object entity of the deletion triple data indicates that the deletion triple data is not in the graph data. The deletion therefore fails, and a deletion-failure message is returned.
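The deletion flow and its three failure branches can be sketched in the same style. The layout, the function name `delete`, and the placeholder data are assumptions for illustration. Dropping an emptied array from the chain, and resetting the relationship's pointer to `None` once the chain is empty, are design choices of this sketch; the text only states that the object entity is deleted from the array chain.

```python
def delete(first_array, subject, relation, obj):
    """Delete a triple; return True on success, False if it is not stored."""
    # Steps 604 and 606: locate the subject entity, then its second array.
    s = next((e for e in first_array if e[0] == subject), None)
    if s is None or s[1] is None:
        return False  # deletion fails: subject entity not in the graph data
    p = next((e for e in s[1] if e[0] == relation), None)
    if p is None or p[1] is None:
        return False  # deletion fails: entity relationship not in the graph data
    chain = p[1]
    # Steps 608 and 610: scan the array chain and remove the object entity.
    for index, block in enumerate(chain):
        if obj in block:
            block.remove(obj)
            if not block:
                del chain[index]  # assumption: discard an array left empty
            if not chain:
                p[1] = None       # assumption: no objects left, use a null pointer
            return True
    return False  # deletion fails: object entity not in the array chain


graph = [["s1", [["p1", [["o1", "o2"]]]]]]
print(delete(graph, "s1", "p1", "o1"))
print(delete(graph, "s1", "p1", "o1"))
```

The first call removes the object entity; the repeated call fails because the triple is no longer in the graph data.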

Taking a specific embodiment as an example, S denotes the first array, and S(s) denotes the s-th array element in the first array, which is a subject entity. P denotes a second array, where S(s)_P denotes the second array of subject entity s. O denotes an array chain, where S(s)_P(p)_O denotes the array chain of subject entity s and entity relationship p. "new P" denotes creating a new predicate array. Data query, deletion, and insertion are compared against conventional CSR compression using the following parameters: D denotes the average length of an array chain, E denotes the total number of edges, and H denotes the average number of entity relationships that each subject entity points to. The comparison is shown in Table 1:

TABLE 1 CSR compression vs. streaming graph processing

As can be seen from Table 1, in the embodiment of the present invention the influence of the edges does not need to be considered when performing data query, insertion, and deletion, so the number of instructions issued is greatly reduced, which shortens the data processing time and meets the requirement of high throughput.

The method was applied on a notebook computer. 1000 pieces of data were generated one by one and inserted, taking 14.09 seconds in total. 1000 pieces of data were generated one by one and queried, taking 10.493 seconds in total. 100 pieces of data were then generated and inserted, taking 1.284 seconds in total. Finally, 1000 pieces of data were generated one by one and deleted, taking 10.631 seconds in total. The total storage overhead was 997M.
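For readers who prefer rates to totals, the reported timings can be converted into operations per second. The short script below only divides the counts by the times stated above; the variable names are illustrative, and the timings include whatever data generation cost the experiment incurred.

```python
# (operation, number of pieces of data, total seconds) as reported in the text.
runs = [
    ("insert", 1000, 14.09),
    ("query", 1000, 10.493),
    ("insert", 100, 1.284),
    ("delete", 1000, 10.631),
]

rates = [(op, count / seconds) for op, count, seconds in runs]
for op, rate in rates:
    print(f"{op}: {rate:.1f} operations per second")
```

This works out to roughly 71 and 78 insertions, 95 queries, and 94 deletions per second.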

It should be understood that although the steps in the flowcharts of fig. 2 and figs. 4-6 are shown in sequence as indicated by the arrows, they are not necessarily performed in that sequence. Unless explicitly stated otherwise, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in fig. 2 and figs. 4-6 may include multiple sub-steps or stages. These sub-steps or stages are not necessarily performed at the same time or in sequence; they may be performed at different times, in turn, or alternately with other steps or with at least some of the sub-steps or stages of other steps.

In one embodiment, as shown in fig. 7, there is provided a graph storage apparatus for streaming graph processing, including: a data parsing module 702 and a data storage module 704, wherein:

the data parsing module 702 is configured to obtain triple data corresponding to data to be stored in a data set; the triple data includes: a subject entity, an entity relationship, and an object entity;

a data storage module 704, configured to store the subject entity through a first array of a streaming graph, store the entity relationship through a second array of the streaming graph, and store the object entity through an array chain of the streaming graph, so as to obtain graph data corresponding to the data set;

In one embodiment, the data storage module 704 is further configured to sequentially store the object entities as array elements in the arrays of the array chain, and when the arrays of the array chain are full, generate new arrays in the array chain to store the object entities.

In one embodiment, the device further comprises a pointer setting module, configured to set a first pointer corresponding to array elements in the first array, wherein the first pointer points to the second array; and to set a second pointer corresponding to array elements in the second array, wherein the second pointer points to the array chain.

In one embodiment, the data storage module 704 is further configured to, when it is detected that a first entity is stored in the first array, point an array element corresponding to the first entity to a null pointer; the first entity is not in a set of subject entities consisting of the subject entities; and when detecting that the subject entity stored in the first array and the entity relationship stored in the second array do not have the object entity, pointing array elements corresponding to the entity relationship in the second array to a null pointer.

For specific limitations of the graph storage device for stream graph processing, reference may be made to the above limitations of the graph storage method for stream graph processing, which are not described herein again. The various modules in the above-described graph storage apparatus for streaming graph processing may be implemented in whole or in part by software, hardware, and combinations thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.

In one embodiment, a computer device is provided, which may be a terminal, and its internal structure diagram may be as shown in fig. 8. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a graph storage method for streaming graph processing. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.

Those skilled in the art will appreciate that the architecture shown in fig. 8 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.

In an embodiment, a computer device is provided, comprising a memory and a processor, the memory storing a computer program, and the processor, when executing the computer program, performing the steps of the above-described method embodiments.

In an embodiment, a computer-readable storage medium is provided, on which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.

The above-mentioned embodiments express only several embodiments of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims

1. A graph storage method for streaming graph processing, the method comprising:

2. The method of claim 1, wherein storing the object entity through an array chain of the streaming graph comprises:

and sequentially storing object entities as array elements in the arrays of the array chain, and generating a new array in the array chain to store the object entities when the arrays of the array chain are full.

3. The method of claim 1, further comprising:

setting a first pointer corresponding to array elements in the first array, wherein the first pointer points to the second array;

and setting a second pointer corresponding to the array elements in the second array, wherein the second pointer points to the array chain.

4. The method of claim 3, further comprising:

when detecting that a first entity is stored in the first array, pointing an array element corresponding to the first entity to a null pointer; the first entity is not in a set of subject entities consisting of the subject entities;

and when detecting that the subject entity stored in the first array and the entity relationship stored in the second array do not have the object entity, pointing array elements corresponding to the entity relationship in the second array to a null pointer.

5. A graph storage device for streaming graph processing, the device comprising:

6. A data insertion method for stream graph processing, comprising:

acquiring insertion triple data of data to be inserted;

detecting whether the first array of the graph data as claimed in any one of claims 1 to 4 includes the subject entity of the insertion triple data;

and if not, inserting the object entity of the insertion triple data into the array chain.

7. A data query method for stream graph processing is characterized by comprising the following steps:

acquiring query triple data of data to be queried;

detecting whether the first array of the graph data as claimed in any one of claims 1 to 4 includes the subject entity of the query triple data;

8. A data deletion method for stream graph processing is characterized by comprising the following steps:

acquiring deletion triple data of data to be deleted;

detecting whether the first array of the graph data as claimed in any one of claims 1 to 4 includes the subject entity of the deletion triple data;

9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 4 when executing the computer program.

10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 4.