CN111046241B - Graph storage method and device for flow graph processing

Info

Publication number: CN111046241B
Application number: CN201911178913.1A
Authority: CN
Inventors: 李东升; 贾孟涵; 赖志权; 陈易欣
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2019-11-27
Filing date: 2019-11-27
Publication date: 2023-09-26
Anticipated expiration: 2039-11-27
Also published as: CN111046241A

Abstract

The application relates to a graph storage method and device for streaming graph processing. The method comprises: acquiring triplet data corresponding to data to be stored in a data set, the triplet data including a subject entity, an entity relationship, and an object entity; and storing the subject entity in a first array of the streaming graph, storing the entity relationship in a second array of the streaming graph, and storing the object entity in an array chain of the streaming graph, to obtain graph data corresponding to the data set; wherein the array elements in the first array point to the second array, and the array elements in the second array point to an array chain. The method can meet the requirements of low storage overhead and high throughput while ensuring accuracy.

Description

Graph storage method and device for flow graph processing

Technical Field

The present application relates to the field of graph storage technologies, and in particular, to a graph storage method and apparatus for streaming graph processing.

Background

Streaming graph processing is currently a difficult problem in the field of graph computing and is of great practical significance. Many kinds of data, such as social networks and various kinds of user information, can be regarded as streaming graphs; for example, in a social network, the relationships among multiple users can be represented by a streaming graph. In recent years, streaming graph processing has developed toward high throughput and reduced storage overhead. To achieve these goals, however, conventional techniques process the data by hashing. Although this improves throughput and reduces storage overhead to some extent, it is difficult to meet the requirements of low storage and high throughput in a system that must also guarantee accuracy.

Disclosure of Invention

In view of the above, it is necessary to provide a graph storage method and apparatus for streaming graph processing that can achieve low storage overhead and high throughput while ensuring accuracy.

A graph storage method for streaming graph processing, the method comprising:

acquiring triplet data corresponding to data to be stored in a data set; the triplet data includes: subject entities, entity relationships, and object entities;

storing the subject entity through a first array of the streaming graph, storing the entity relationship through a second array of the streaming graph, and storing the object entity through an array chain of the streaming graph to obtain graph data corresponding to the data set;

wherein the array elements in the first array point to the second array, and the array elements in the second array point to one of the array chains.

In one embodiment, the method further comprises: sequentially storing object entities as array elements in the arrays of the array chain, and, when an array of the array chain is full, generating a new array in the array chain to store further object entities.

In one embodiment, the method further comprises: setting a first pointer corresponding to an array element in the first array, wherein the first pointer points to the second array; setting a second pointer corresponding to the array element in the second array, wherein the second pointer points to the array chain.

In one embodiment, the method further comprises: when it is detected that a first entity stored in the first array is not in the subject entity set formed by the subject entities, pointing the array element corresponding to the first entity to a null pointer; and when it is detected that no object entity exists for a subject entity stored in the first array and an entity relationship stored in the second array, pointing the array element corresponding to that entity relationship in the second array to a null pointer.

A graph storage device for streaming graph processing, the device comprising:

the data analysis module is used for acquiring triplet data corresponding to data to be stored in the data set; the triplet data includes: subject entities, entity relationships, and object entities;

the data storage module is used for storing the subject entity through a first array of the streaming graph, storing the entity relationship through a second array of the streaming graph, and storing the object entity through an array chain of the streaming graph to obtain graph data corresponding to the data set;

A data insertion method for streaming graph processing, comprising:

acquiring insertion triplet data of data to be inserted;

detecting whether the first array of the graph data includes the subject entity of the insertion triplet data;

if yes, detecting whether the second array of the graph data includes the entity relationship of the insertion triplet data;

if yes, detecting whether the array chain of the graph data includes the object entity of the insertion triplet data;

if not, inserting the object entity of the insertion triplet data into the array chain.

A data query method for streaming graph processing, comprising:

acquiring query triplet data of data to be queried;

detecting whether the first array of the graph data includes the subject entity of the query triplet data;

if not, returning an empty query result; if yes, detecting whether the second array of the graph data includes the entity relationship of the query triplet data;

if not, returning an empty query result; if yes, detecting whether the array chain of the graph data includes the object entity of the query triplet data;

if not, returning an empty query result; if yes, returning a result indicating that the data exists.

A data deletion method for streaming graph processing, comprising:

acquiring deletion triplet data of data to be deleted;

detecting whether the first array of the graph data includes the subject entity of the deletion triplet data;

if yes, detecting whether the second array of the graph data includes the entity relationship of the deletion triplet data;

if yes, detecting whether the array chain of the graph data includes the object entity of the deletion triplet data;

if yes, deleting the object entity of the deletion triplet data from the array chain.

A computer device comprising a memory storing a computer program and a processor which when executing the computer program performs the steps of:

A computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of:

According to the graph storage method and device for streaming graph processing, the data to be stored is converted into triplet data, and each part of the triplet data is stored in a different array or array chain. Because a fixed flow logic exists between the arrays of the streaming graph and between the arrays and the array chains, data operations on the streaming graph are greatly simplified, which improves the throughput of graph storage and saves storage overhead. In addition, because the flow logic is fixed, the accuracy of data operations is ensured.

Drawings

FIG. 1 is an application scenario diagram of a graph storage method for streaming graph processing in one embodiment;

FIG. 2 is a schematic flowchart of a graph storage method for streaming graph processing in one embodiment;

FIG. 3 is a schematic diagram of the streaming graph structure in one embodiment;

FIG. 4 is a schematic flowchart of a data insertion method for streaming graph processing in one embodiment;

FIG. 5 is a schematic flowchart of a data query method for streaming graph processing in one embodiment;

FIG. 6 is a schematic flowchart of a data deletion method for streaming graph processing in one embodiment;

FIG. 7 is a block diagram of a graph storage device for streaming graph processing in one embodiment;

FIG. 8 is an internal structural diagram of a computer device in one embodiment.

Detailed Description

The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.

The graph storage method for streaming graph processing provided by the application can be applied to the application environment shown in FIG. 1, in which the terminal 102 communicates with the server 104 via a network. The terminal 102 may be, but is not limited to, a personal computer, notebook computer, smartphone, tablet computer, or portable wearable device, and the server 104 may be implemented as a stand-alone server or as a server cluster composed of a plurality of servers.

Specifically, the terminal 102 acquires the triplet data from the server 104, so that the graph data corresponding to the data set is established in the terminal 102. The terminal 102 may then perform streaming graph processing on the graph data.

In one embodiment, as shown in FIG. 2, a graph storage method for streaming graph processing is provided. Taking its application to the terminal in FIG. 1 as an example, the method includes the following steps:

Step 202, acquiring triplet data corresponding to data to be stored in a data set.

The triplet data includes a subject entity, an entity relationship, and an object entity. An entity is something that exists objectively and can be distinguished from other things. For example, in the statement "Changsha is the provincial capital of Hunan", Changsha and Hunan are entities and the entity relationship is "provincial capital", so for computer processing the triplet data is (Changsha, provincial capital, Hunan), where Changsha is the subject entity and Hunan is the object entity. Each array occupies a certain amount of memory, so the fewer the arrays and the fewer the elements in them, the less memory is consumed.

For the data acquired from the server, the entity and the entity relationship contained in the text can be extracted by predefining the entity, or the entity and the entity relationship contained in the text can be extracted by carrying out semantic recognition on the text, so that the information contained in the text is stored in the form of the triplet data.
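The triplet form described above can be illustrated with a short Python sketch. It is not part of the original disclosure: the `Triple` type, the `encode` helper, and the idea of mapping entity and relationship names to integer IDs (so that they can later serve as array indices) are assumptions made only for illustration.

```python
from typing import Dict, NamedTuple, Tuple


class Triple(NamedTuple):
    """One piece of triplet data: (subject entity, entity relationship, object entity)."""
    subject: str
    relation: str
    obj: str


def encode(triple: Triple,
           entity_ids: Dict[str, int],
           relation_ids: Dict[str, int]) -> Tuple[int, int, int]:
    """Map names to dense integer IDs (hypothetical helper, assumed for illustration)."""
    s = entity_ids.setdefault(triple.subject, len(entity_ids))
    p = relation_ids.setdefault(triple.relation, len(relation_ids))
    o = entity_ids.setdefault(triple.obj, len(entity_ids))
    return s, p, o


entity_ids: Dict[str, int] = {}
relation_ids: Dict[str, int] = {}
ids = encode(Triple("Changsha", "provincial capital", "Hunan"), entity_ids, relation_ids)
print(ids)
```

Subject and object entities share one ID space in this sketch because the same entity may appear as a subject in one triplet and as an object in another.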

Step 204, storing the subject entity through the first array of the streaming graph, storing the entity relationship through the second array of the streaming graph, and storing the object entity through the array chain of the streaming graph, to obtain graph data corresponding to the data set.

As shown in FIG. 3, the array elements in the first array point to the second array, and the array elements in the second array point to an array chain, thereby forming the processing logic of the streaming graph. Array elements can be added, queried, and deleted, so data operations can be performed conveniently and quickly through this processing logic. Graph data refers to the data of a data set stored using the streaming graph logic.
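The pointer layout described above can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: it assumes that entities and relationships have already been mapped to small non-negative integer IDs used as array indices, it uses `None` as the null pointer, and the names `NUM_SUBJECTS`, `NUM_RELATIONS`, `first_array`, and `store` are hypothetical.

```python
from typing import List, Optional

NUM_SUBJECTS = 4    # assumed number of subject slots (hypothetical)
NUM_RELATIONS = 3   # assumed number of relationship slots (hypothetical)

# First array: slot s holds None (null pointer) or a reference to the
# second array of subject s.
first_array: List[Optional[list]] = [None] * NUM_SUBJECTS


def store(s: int, p: int, o: int) -> None:
    """Store (s, p, o) by following first array -> second array -> array chain."""
    if first_array[s] is None:
        # Second array: slot p holds None or a reference to an array chain.
        first_array[s] = [None] * NUM_RELATIONS
    second_array = first_array[s]
    if second_array[p] is None:
        second_array[p] = [[]]        # array chain that starts with one empty array
    second_array[p][-1].append(o)     # the object entity is stored in the chain


store(0, 1, 2)
store(0, 1, 3)
print(first_array)
```

For simplicity this sketch keeps every object in a single array of the chain; splitting the chain into fixed-length arrays is a separate concern.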

In the graph storage method for streaming graph processing, the data to be stored is converted into triplet data, and each part of the triplet data is stored in a different array or array chain. Because a fixed flow logic exists between the arrays of the streaming graph and between the arrays and the array chains, data operations on the streaming graph are greatly simplified, which improves the throughput of graph storage and saves storage overhead. In addition, because the flow logic is fixed, the accuracy of data operations is ensured.

In one embodiment, the length of each array in the array chain is fixed, which ensures the efficiency of data query. Object entities are stored sequentially as array elements in the arrays of the array chain, and when an array of the array chain is full, a new array is generated in the array chain to store further object entities.

Specifically, the length of each array in the array chain may be fixed at 5, that is, each array can store 5 elements. When an array holds 5 elements, a new array is generated automatically, and the original array and the new array are linked to form the array chain. The total number of elements in the array chain is not limited.

In this embodiment, the array elements in the first array and the second array are fixed. Taking data query as an example, a lookup in either array has only two possible results, present or empty, and the number of entity relationships pointed to by a subject entity need not be considered, which greatly reduces the number of query operations. In addition, when the corresponding object entity is queried, the number of lookups is bounded by the length of the array chain storing the object entities. Using an array chain therefore allows the array length to be configured flexibly while query efficiency is ensured.
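The fixed-length arrays of the array chain can be sketched as follows. This is only an illustration under stated assumptions: the class name `ArrayChain` and its methods are hypothetical, and Python lists stand in for fixed-size arrays, with the length limit of 5 enforced by the code rather than by the data type.

```python
class ArrayChain:
    """Chain of arrays that each hold at most block_len object entities."""

    def __init__(self, block_len: int = 5):
        self.block_len = block_len
        self.blocks = [[]]            # the chain starts with one empty array

    def append(self, obj) -> None:
        # When the last array is full, a new array is added to the chain.
        if len(self.blocks[-1]) == self.block_len:
            self.blocks.append([])
        self.blocks[-1].append(obj)

    def __contains__(self, obj) -> bool:
        # Scan the arrays of the chain one after another.
        return any(obj in block for block in self.blocks)

    def __len__(self) -> int:
        return sum(len(block) for block in self.blocks)


chain = ArrayChain()
for object_id in range(6):
    chain.append(object_id)
print(chain.blocks)
```

Storing a sixth object entity opens a second array, so the chain grows without any limit on its total number of elements.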

In one embodiment, the processing logic of the streaming graph is determined by setting array pointers. Specifically, a first pointer corresponding to an array element in the first array is set, the first pointer pointing to the second array; and a second pointer corresponding to an array element in the second array is set, the second pointer pointing to the array chain. In this embodiment, each array element acts as a pointer; for example, an array element in the first array acts as a pointer to a second array. The pointers ensure the accuracy of data query, insertion, and deletion.

In another embodiment, when it is detected that a first entity stored in the first array is not in the subject entity set formed by the subject entities, the array element corresponding to the first entity is pointed to a null pointer; and when it is detected that no object entity exists for a subject entity stored in the first array and an entity relationship stored in the second array, the array element corresponding to that entity relationship in the second array is pointed to a null pointer. In this embodiment, if an entity exists but has no corresponding entity relationship, the pointer of that first entity in the first array points to a null pointer so that the pointer logic of the first array remains correct. Similarly, when a subject entity and an entity relationship have no corresponding object entity, the array element corresponding to the entity relationship is pointed to a null pointer.

In one embodiment, as shown in FIG. 4, a schematic flowchart of a data insertion method for streaming graph processing is provided, including:

Step 402, acquiring insertion triplet data of data to be inserted.

Step 404, detecting whether the first array of the graph data includes the subject entity of the insertion triplet data.

Step 406, if yes, detecting whether the second array of the graph data includes the entity relationship of the insertion triplet data.

Step 408, if yes, detecting whether the array chain of the graph data includes the object entity of the insertion triplet data.

Step 410, if not, inserting the object entity of the insertion triplet data into the array chain.

In this embodiment, a data insertion method for streaming graph processing is provided. When data is inserted, the specified data can be inserted into the graph data through simple logical checks only, which greatly reduces the number of instructions needed for data storage and improves the throughput of data insertion.

For step 406, in one embodiment, if it is detected that the first array of the graph data does not include the subject entity of the insertion triplet data, the subject entity is inserted into the first array and pointed to a second array by a pointer, and step 404 is then repeated.

For step 408, in one embodiment, if it is detected that the second array of the graph data does not include the entity relationship of the insertion triplet data, a second array is created, the entity relationship of the insertion triplet data is stored in the second array, and the array element corresponding to the subject entity of the insertion triplet data is pointed to the second array.

For step 410, in one embodiment, if it is detected that the array chain of the graph data already includes the object entity of the insertion triplet data, the insertion triplet data is already present in the graph data, no insertion is required, and an insertion-failure result is returned.
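The insertion flow of steps 402 to 410 can be sketched as follows. This is a hedged illustration, not the patented implementation: it assumes integer IDs used as array indices, nested Python lists for the first array, second arrays, and array chains, `None` as the null pointer, and a hypothetical function name `insert`. Growing a list to reach index `s` or `p` stands in for inserting a missing subject entity or entity relationship.

```python
BLOCK_LEN = 5  # fixed array length in the chain, following the example length of 5


def insert(first_array: list, s: int, p: int, o: int) -> bool:
    """Insert (s, p, o); return True if stored, False if it was already present."""
    # Step 404: make sure the subject has a slot that points to a second array.
    while len(first_array) <= s:
        first_array.append(None)
    if first_array[s] is None:
        first_array[s] = []               # new, empty second array
    second_array = first_array[s]
    # Step 406: make sure the relationship has a slot that points to a chain.
    while len(second_array) <= p:
        second_array.append(None)
    if second_array[p] is None:
        second_array[p] = [[]]            # new chain with one empty array
    chain = second_array[p]
    # Step 408: the triplet already exists if the object is in the chain.
    if any(o in block for block in chain):
        return False                      # insertion failure
    # Step 410: store the object, opening a new array if the last one is full.
    if len(chain[-1]) == BLOCK_LEN:
        chain.append([])
    chain[-1].append(o)
    return True


graph: list = []
print(insert(graph, 1, 0, 7))   # first insertion of (1, 0, 7)
print(insert(graph, 1, 0, 7))   # the same triplet again
```

The second call reports failure because the triplet is already stored, matching the "no insertion is required" branch.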

In one embodiment, as shown in FIG. 5, a schematic flowchart of a data query method for streaming graph processing is provided, with the following specific steps:

Step 502, acquiring query triplet data of data to be queried.

Step 504, detecting whether the first array of the graph data includes the subject entity of the query triplet data.

Step 506, if not, returning an empty query result; if yes, detecting whether the second array of the graph data includes the entity relationship of the query triplet data.

Step 508, if not, returning an empty query result; if yes, detecting whether the array chain of the graph data includes the object entity of the query triplet data.

Step 510, if not, returning an empty query result; if yes, returning a result indicating that the data exists.

In this embodiment, a data query requires only simple decision logic, so the number of instructions per query is reduced and the throughput of the graph storage is improved.
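The query flow of steps 502 to 510 can be sketched in the same style. The nested-list layout, the integer IDs, the use of `None` as the null pointer, and the function name `query` are assumptions made for illustration, and the sample graph is hand-built rather than taken from the original text.

```python
def query(first_array: list, s: int, p: int, o: int) -> bool:
    """Return True if the triplet (s, p, o) is stored, False for an empty result."""
    # Step 504: is the subject present in the first array?
    if s >= len(first_array) or first_array[s] is None:
        return False
    second_array = first_array[s]
    # Step 506: is the relationship present in the subject's second array?
    if p >= len(second_array) or second_array[p] is None:
        return False
    # Steps 508 and 510: scan the array chain for the object.
    return any(o in block for block in second_array[p])


# Subject 0 has relationship 1 pointing to a chain that holds objects 2 and 3;
# subject 1 is a null pointer.
graph = [[None, [[2, 3]]], None]
print(query(graph, 0, 1, 3))
print(query(graph, 1, 1, 3))
```

Only the final step scans more than one slot, so the work per query is governed by the length of one array chain.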

In one embodiment, as shown in FIG. 6, a schematic flowchart of a data deletion method for streaming graph processing is provided, with the following specific steps:

Step 602, acquiring deletion triplet data of data to be deleted.

Step 604, detecting whether the first array of the graph data includes the subject entity of the deletion triplet data.

Step 606, if yes, detecting whether the second array of the graph data includes the entity relationship of the deletion triplet data.

Step 608, if yes, detecting whether the array chain of the graph data includes the object entity of the deletion triplet data.

Step 610, if yes, deleting the object entity of the deletion triplet data from the array chain.

In this embodiment, the deletion operation is performed by using simple judgment logic, so that the number of times of data deletion instructions can be reduced, thereby improving the throughput of data storage.

For step 606, in one embodiment, if it is detected that the first array of the graph data does not include the subject entity of the deletion triplet data, the deletion triplet data is not in the graph data, so the deletion fails and a deletion-failure message is returned.

For step 608, in one embodiment, if it is detected that the second array of the graph data does not include the entity relationship of the deletion triplet data, the deletion triplet data is not in the graph data, so the deletion fails and a deletion-failure message is returned.

For step 610, in one embodiment, if it is detected that the array chain of the graph data does not include the object entity of the deletion triplet data, the deletion triplet data is not in the graph data, so the deletion fails and a deletion-failure message is returned.
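The deletion flow of steps 602 to 610, including the three failure branches, can be sketched as follows. As with the other sketches, the nested-list layout, the integer IDs, `None` as the null pointer, and the function name `delete` are illustrative assumptions. The text does not say whether the arrays of a chain are compacted after a deletion, so this sketch simply removes the element in place.

```python
def delete(first_array: list, s: int, p: int, o: int) -> bool:
    """Delete (s, p, o); return True on success, False for deletion failure."""
    # Step 604: the subject must be present in the first array.
    if s >= len(first_array) or first_array[s] is None:
        return False
    second_array = first_array[s]
    # Step 606: the relationship must be present in the second array.
    if p >= len(second_array) or second_array[p] is None:
        return False
    # Steps 608 and 610: find the object in the chain and remove it.
    for block in second_array[p]:
        if o in block:
            block.remove(o)
            return True
    return False


graph = [[None, [[2, 3]]]]
print(delete(graph, 0, 1, 2))   # the triplet is present
print(delete(graph, 0, 1, 2))   # the triplet is no longer present
```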

Taking a specific embodiment as an example, S denotes the first array and S(s) denotes the s-th array element in the first array, which is a subject entity; P denotes a second array and S(s)_P denotes the second array of subject entity s; O denotes an array chain and S(s)_P(p)_O denotes the array chain whose subject entity is s and whose entity relationship is p; new P denotes creating a predicate array. Taking data query, deletion, and insertion as examples, the method is compared with conventional CSR compression using the parameters D, E, and H, where D denotes the average length of an array chain, E denotes the total number of edges, and H denotes the average number of entity relationships per subject entity. The comparison is shown in Table 1:

Table 1. Comparison of CSR compression and streaming graph processing

As can be seen from Table 1, data query, insertion, and deletion are not affected by the total number of edges, so the number of issued instructions is greatly reduced, data processing time is shortened, and the requirement of high throughput is met.

The method was applied on a notebook computer. 10 million records were generated one by one and inserted, taking 14.09 seconds; 10 million records were generated one by one and queried, taking 10.493 seconds in total; 1 million records were then generated and inserted, taking 1.284 seconds in total; finally, 10 million records were generated one by one and deleted, taking 10.631 seconds in total. The total storage overhead was 997 MB.
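The timings above can be converted into approximate throughput figures. The following calculation is a derived illustration and is not stated in the original text; it uses only the record counts and elapsed times reported above, and the variable names are hypothetical.

```python
# (operation, number of records, elapsed seconds) as reported in the text
runs = [
    ("insert", 10_000_000, 14.09),
    ("query", 10_000_000, 10.493),
    ("insert (second batch)", 1_000_000, 1.284),
    ("delete", 10_000_000, 10.631),
]

# Throughput in operations per second for each run.
throughput = {name: count / seconds for name, count, seconds in runs}

for name, ops_per_second in throughput.items():
    print(f"{name}: about {ops_per_second:,.0f} operations per second")
```

All four runs work out to roughly 0.7 to 1 million operations per second on the reported hardware.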

It should be understood that, although the steps in the flowcharts of FIGS. 2 and 4-6 are shown in the order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the order of execution is not strictly limited, and the steps may be executed in other orders. Moreover, at least some of the steps in FIGS. 2 and 4-6 may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed in sequence but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

In one embodiment, as shown in fig. 7, there is provided a graph storage device for streaming graph processing, including: a data parsing module 702 and a data storage module 704, wherein:

the data parsing module 702 is configured to obtain triplet data corresponding to data to be stored in the dataset; the triplet data includes: subject entities, entity relationships, and object entities;

a data storage module 704, configured to store the subject entity through a first array of the streaming graph, store the entity relationship through a second array of the streaming graph, and store the object entity through an array chain of the streaming graph, so as to obtain graph data corresponding to the data set;

In one embodiment, the data storage module 704 is further configured to sequentially store object entities as array elements in the array of the array chain, and when the array of the array chain is full, generate a new array in the array chain to store the object entities.

In one embodiment, the device further comprises a pointer setting module, configured to set a first pointer corresponding to an array element in the first array, the first pointer pointing to the second array, and to set a second pointer corresponding to an array element in the second array, the second pointer pointing to the array chain.

In one embodiment, the data storage module 704 is further configured to: when it is detected that a first entity stored in the first array is not in the subject entity set formed by the subject entities, point the array element corresponding to the first entity to a null pointer; and when it is detected that no object entity exists for a subject entity stored in the first array and an entity relationship stored in the second array, point the array element corresponding to that entity relationship in the second array to a null pointer.

For specific limitations on the graph storage device for streaming graph processing, reference may be made to the above limitations on the graph storage method for streaming graph processing, which are not repeated here. The modules in the graph storage device described above may be implemented in whole or in part in software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor in the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can call them and execute the operations corresponding to each module.

In one embodiment, a computer device is provided, which may be a terminal, and the internal structure thereof may be as shown in fig. 8. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a graph storage method for streaming graph processing. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, can also be keys, a track ball or a touch pad arranged on the shell of the computer equipment, and can also be an external keyboard, a touch pad or a mouse and the like.

It will be appreciated by those skilled in the art that the structure shown in FIG. 8 is merely a block diagram of some of the structures associated with the present inventive arrangements and is not limiting of the computer device to which the present inventive arrangements may be applied, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.

In one embodiment, a computer device is provided, comprising a memory storing a computer program and a processor that, when executing the computer program, performs the steps of the method embodiments described above.

In one embodiment, a computer readable storage medium is provided, on which a computer program is stored which, when executed by a processor, implements the steps of the method embodiments described above.

Those skilled in the art will appreciate that all or part of the methods described above may be implemented by a computer program stored on a non-transitory computer readable storage medium, and that the program, when executed, may include the steps of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.

The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features involves no contradiction, it should be considered within the scope of this description.

The above examples illustrate only a few embodiments of the application. They are described in specific detail but are not to be construed as limiting the scope of the application. It should be noted that those skilled in the art can make several variations and modifications without departing from the spirit of the application, all of which fall within the scope of the application. Accordingly, the scope of protection of the present application is to be determined by the appended claims.

Claims

1. A graph storage method for streaming graph processing, the method comprising:

2. The method of claim 1, wherein the storing the object entity through an array chain of flow graphs comprises:

and sequentially storing object entities as array elements in the arrays of the array chain, and generating a new array in the array chain when the arrays of the array chain are full so as to store the object entities.

3. The method according to claim 1, wherein the method further comprises:

setting a first pointer corresponding to an array element in the first array, wherein the first pointer points to the second array;

setting a second pointer corresponding to the array element in the second array, wherein the second pointer points to the array chain.

4. A method according to claim 3, characterized in that the method further comprises:

when detecting that a first entity is stored in the first array, pointing an array element corresponding to the first entity to a null pointer; the first entity is not in a subject entity set formed by the subject entities;

and when detecting that the object entity does not exist between the subject entity stored in the first array and the entity relationship stored in the second array, pointing an array element corresponding to the entity relationship in the second array to a null pointer.

5. A graph storage device for streaming graph processing, the device comprising:

6. A data insertion method for streaming graph processing, comprising:

step 402, acquiring insertion triplet data of data to be inserted;

step 404, detecting whether the subject entity of the insertion triplet data is included in the first array of graph data as claimed in any one of claims 1 to 4; if yes, executing step 406; if not, inserting the subject entity of the insertion triplet data into the first array, pointing it to the second array by a pointer, and then repeating step 404;

step 406, detecting whether the second array of the graph data includes the entity relationship of the insertion triplet data; if yes, executing step 408; if not, creating a second array, storing the entity relationship of the insertion triplet data in the second array, pointing the array element corresponding to the subject entity of the insertion triplet data to the second array, and executing step 408;

step 408, detecting whether the object entity of the insertion triplet data is included in the array chain of the graph data; if yes, no insertion is required; if not, executing step 410;

step 410, inserting the object entity of the insertion triplet data into the array chain.

7. A data query method for processing a flow chart, comprising:

acquiring query triplet data of data to be queried;

detecting whether a subject entity in the query triplet data is included in a first array of graph data as claimed in any one of claims 1 to 4;

8. A data deletion method for streaming graph processing, comprising:

step 602, acquiring deletion triplet data of data to be deleted;

step 604, detecting whether the subject entity of the deletion triplet data is included in the first array of graph data as claimed in any one of claims 1 to 4; if yes, executing step 606; if not, returning a deletion-failure message;

step 606, detecting whether the entity relationship of the deletion triplet data is included in the second array of the graph data; if yes, executing step 608; if not, returning a deletion-failure message;

step 608, detecting whether the object entity of the deletion triplet data is included in the array chain of the graph data; if yes, executing step 610; if not, returning a deletion-failure message;

step 610, deleting the object entity of the deletion triplet data from the array chain.

9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1 to 4 when the computer program is executed.

10. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 4.