CN111708923A - Method and device for determining graph data storage structure

Info

Publication number: CN111708923A
Application number: CN202010586928.8A
Authority: CN
Inventors: 王宏志; 孙颖凯; 郑博; 梁栋; 齐志鑫
Original assignee: Beijing Sqh Tech Co ltd
Current assignee: Beijing Sqh Tech Co ltd
Priority date: 2020-06-24
Filing date: 2020-06-24
Publication date: 2020-09-25

Abstract

The invention relates to the technical field of databases, and in particular to a method and a device for determining a graph data storage structure. One embodiment of the method comprises: obtaining prediction data for a graph data storage structure, the prediction data comprising a graph feature vector extracted from the graph data, an actual load feature vector extracted from the actual workload of the graph data under the current storage structure, and a storage structure feature vector extracted for each storage structure in the storage structure set; taking the prediction data as input, and predicting the system load of each storage structure in the storage structure set by using a pre-trained prediction network; and determining the optimal storage structure in the storage structure set according to the predicted system load, and replacing the current storage structure of the graph data with it. This embodiment enables intelligent selection of graph data storage structures.

Description

Method and device for determining graph data storage structure

Technical Field

The invention relates to the technical field of databases, in particular to a method and a device for determining a graph data storage structure.

Background

A graph is an abstract data structure for representing associative relationships between objects. Data that can be abstracted into a graph representation is commonly referred to as graph data, and it is more complex than linear structures such as linked lists and arrays and than non-linear structures such as trees. With the gradual development of graph data storage technology, a large graph data storage system often exposes multiple groups of adjustable settings for the storage structure of graph data, such as index column settings, graph partitioning strategies, and cache parameters. The complexity of such a system exceeds what ordinary technicians can manage, and an optimal setting often cannot be reached by manual means.

To make intelligent selections of graph data storage structures, features that can be used for statistical learning must first be extracted from the graph data. However, graph data includes global topology information such as the total number of nodes, the total number of edges, connectivity, and diameter; local topology information such as the shortest path between nodes and the largest cycle; and load information such as node access frequency. The data formats, acquisition methods, and update frequencies of these kinds of information all differ, so extracting features usable for statistical learning from graph data is very difficult. As a result, no method for intelligently selecting a graph data storage structure exists at present.

Therefore, in view of the above disadvantages, it is desirable to provide a method and apparatus for determining a graph data storage structure, which can realize intelligent selection of the graph data storage structure.

Disclosure of Invention

The technical problem to be solved by the present invention is, in view of the defects in the prior art, to provide a method and an apparatus for determining a graph data storage structure that can implement intelligent selection of the graph data storage structure.

In order to solve the above technical problem, the present invention provides a method for determining a graph data storage structure, where the graph data is preconfigured with a storage structure set, the storage structure set includes a current storage structure of the graph data and a plurality of preset alternative storage structures, and the method includes:

obtaining prediction data for a graph data storage structure, the prediction data comprising: a graph feature vector extracted from the graph data, an actual load feature vector extracted from the actual workload of the graph data under the current storage structure, and a storage structure feature vector extracted for each storage structure in the storage structure set;

taking the prediction data as input, and predicting the system load of each storage structure in the storage structure set by using a prediction network obtained by pre-training;

and determining the optimal storage structure in the storage structure set according to the system load obtained by prediction, and replacing the current storage structure of the graph data.

Further, the method for determining the graph data storage structure provided by the invention further comprises the following steps:

obtaining training data for the prediction network, the training data comprising: a graph feature vector extracted from the graph data, an actual load feature vector extracted from the actual workload of the graph data under the current storage structure, and a storage structure feature vector extracted from the current storage structure of the graph data;

and training by taking the training data as input to obtain the prediction network, wherein the loss function is the difference between the cost of the predicted system load and the cost of the actual system load.

Optionally, after the step of determining an optimal storage structure in the storage structure set according to the predicted system load and replacing the current storage structure of the graph data, the method further includes:

determining a preferred storage structure subset in the storage structure set according to the predicted system load;

and updating the storage structure set, and adding the preferred storage structure subset and the replaced current storage structure into the updated storage structure set.

Optionally, the graph feature vector is extracted from the graph data by:

splicing the quantity features of the graph data to obtain a quantity feature vector of the graph data;

performing a graph partitioning operation on the graph data, and obtaining a summary graph of the graph data according to the graph partitioning result;

taking the summary graph of the graph data as input, and performing a graph convolution operation to obtain a topological feature vector of the graph data;

and splicing the quantity feature vector and the topological feature vector of the graph data to obtain the graph feature vector of the graph data.

Optionally, the storage structure feature vector is extracted from a storage structure by:

acquiring the graph data storage options of the graph data in a graph data storage system;

determining the vector code of each graph data storage option of the graph data under the specified storage structure, and mapping the vector code of each graph data storage option to obtain the feature vector of that graph data storage option;

and splicing the feature vectors of the graph data storage options to obtain the storage structure feature vector of the graph data under the storage structure.

The invention also provides a device for determining the storage structure of graph data, wherein the graph data is pre-configured with a storage structure set, the storage structure set comprises a current storage structure of the graph data and a plurality of preset alternative storage structures, and the device comprises:

a prediction data acquisition module, configured to acquire prediction data of a graph data storage structure, the prediction data comprising: a graph feature vector extracted from the graph data, an actual load feature vector extracted from the actual workload of the graph data under the current storage structure, and a storage structure feature vector extracted for each storage structure in the storage structure set;

the prediction module is used for taking the prediction data as input and predicting the system load of each storage structure in the storage structure set by utilizing a prediction network obtained by pre-training;

and the optimization module is used for determining the optimal storage structure in the storage structure set according to the system load obtained by prediction and replacing the current storage structure of the graph data.

Further, the apparatus for determining a graph data storage structure provided by the present invention further includes:

a training data obtaining module, configured to obtain training data of the prediction network, where the training data includes: a graph feature vector extracted from the graph data, an actual load feature vector extracted from the actual workload of the graph data under the current storage structure, and a storage structure feature vector extracted from the current storage structure of the graph data;

and the training module is used for training by taking the training data as input to obtain the prediction network, wherein the loss function is the difference between the cost of the predicted system load and the cost of the actual system load.

Optionally, the apparatus further includes a storage structure set updating module, configured to determine a preferred storage structure subset in the storage structure set according to the predicted system load, update the storage structure set, and add the preferred storage structure subset and the replaced current storage structure to the updated storage structure set.

Optionally, the apparatus further includes a graph feature vector extraction module, configured to splice the quantity features of the graph data to obtain a quantity feature vector of the graph data, perform a graph partitioning operation on the graph data, obtain a summary graph of the graph data according to the graph partitioning result, perform a graph convolution operation with the summary graph as input to obtain a topological feature vector of the graph data, and splice the quantity feature vector and the topological feature vector of the graph data to obtain the graph feature vector of the graph data.

Optionally, the apparatus further includes a storage structure feature vector extraction module, configured to acquire the graph data storage options of the graph data in a graph data storage system, determine the vector code of each graph data storage option of the graph data under a specified storage structure, map the vector code of each graph data storage option to obtain the feature vector of that option, and splice the feature vectors of the graph data storage options to obtain the storage structure feature vector of the graph data under the storage structure.

According to the method and the device for determining the graph data storage structure, in the process of extracting graph data features, the summary graph is generated from graph partitioning information, which makes full use of the existing functions of the graph data storage system and greatly reduces the amount of computation for model training; the topological features of the graph are then efficiently vectorized by graph convolution network feature extraction on the summary graph. In the process of extracting storage structure features, the storage structure information is serialized on the basis of orthogonal grouping, so that complex graph storage structure information is converted into a uniform feature vector that is convenient to use as input to the prediction network. In the preparation of the prediction data, the next alternative storage structures are obtained by random adjustment of the existing storage structure, which avoids the complexity of manually adjusting the storage scheme, keeps the new scheme from differing too much from the previous one, and reduces the cost of replacing the storage scheme. In addition, candidate storage structures with higher predicted scores are added to the next set of alternatives, so that, compared with purely random adjustment, the probability that a new storage structure is superior to the existing one is improved and the convergence of the storage structure is accelerated.

According to the method and the device for determining the graph data storage structure, after the graph data and the load information are embedded into numerical vectors, downstream tasks such as performance evaluation of graph data access structures, automatic design of graph data storage structures, and automatic construction of graph data indexes can be handled by deep learning methods. Because of this design, the method is considerably flexible: any graph data information, load information, and storage structure that can be vectorized can be optimized within the framework of the method, and the vectorized representation methods and the load prediction network are highly replaceable and can be locally updated with newer techniques and means at any time.

Drawings

FIG. 1 is a schematic main flow chart of a method for determining a graph data storage structure according to an embodiment of the present invention;

FIG. 2 is a schematic flowchart illustrating a method for determining a graph data storage structure according to a second embodiment of the present invention;

FIG. 3 is a schematic diagram of an apparatus for determining a graph data storage structure according to a third embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.

Example one

An embodiment of the present invention provides a method for determining a graph data storage structure, and as shown in fig. 1, the method includes steps S1 to S3.

In step S1, prediction data of the graph data storage structure is obtained. The present invention predicts the system load of the graph data under different storage structures through the graph neural network, and in this step, prediction data prepared in advance is obtained as the input of the graph neural network. The prediction data includes: a graph feature vector, an actual load feature vector, a storage structure feature vector of the current storage structure of the graph data, and storage structure feature vectors of a plurality of preset alternative storage structures.

The graph feature vector is extracted from the graph data, and the actual load feature vector is extracted from the actual workload of the graph data under the current storage structure. A storage structure set is pre-configured for the graph data; it comprises the current storage structure of the graph data and a plurality of preset alternative storage structures, and feature extraction is performed on each storage structure in the set to obtain its storage structure feature vector. The preset alternative storage structures can be generated by randomized adjustment of the existing graph storage structure.
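The randomized adjustment that generates the alternative storage structures could be sketched as follows. This is a minimal illustration, not the patented procedure: the option names, their value domains, and the rule of changing exactly one option per alternative are all assumptions made for the example.

```python
import random

def perturb(structure, option_domains, rng):
    """Copy `structure` and re-draw one randomly chosen option to a different value."""
    adjustable = [name for name in sorted(option_domains) if len(option_domains[name]) > 1]
    name = rng.choice(adjustable)
    choices = [value for value in option_domains[name] if value != structure[name]]
    candidate = dict(structure)
    candidate[name] = rng.choice(choices)
    return candidate

def build_structure_set(current, option_domains, n_alternatives, seed=0):
    """Return [current, alt_1, ..., alt_n], each alternative differing in one option."""
    max_neighbours = sum(len(values) - 1 for values in option_domains.values())
    if n_alternatives > max_neighbours:
        raise ValueError("not enough distinct single-option alternatives")
    rng = random.Random(seed)
    structures = [current]
    while len(structures) < n_alternatives + 1:
        candidate = perturb(current, option_domains, rng)
        if candidate not in structures:  # keep the set free of duplicates
            structures.append(candidate)
    return structures

# Hypothetical storage options and domains, for illustration only.
option_domains = {
    "index_type": ["btree", "hash", "none"],
    "partition": ["hash", "range"],
    "cache_mb": [64, 128, 256],
}
current = {"index_type": "btree", "partition": "hash", "cache_mb": 128}
structure_set = build_structure_set(current, option_domains, n_alternatives=3)
```

Changing only one option per alternative keeps each candidate close to the structure in use, which is in line with the stated goal of limiting the cost of switching storage schemes.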

For the graph feature vector in the prediction data, the invention provides an extraction method that gives the graph data a uniform vectorized representation. The specific steps are as follows: the features of the graph data are divided into quantity features and topological features, which are vectorized by different means. The quantity features of the graph data are spliced to obtain the quantity feature vector of the graph data. For example, quantity features such as the number of vertices, the number of edges, connectivity, diameter, and radius are directly spliced to obtain a dense quantity feature vector.
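As a rough sketch of the quantity feature vector, the following computes the five example features on a small undirected graph and splices (concatenates) them into one dense list. Two assumptions are made here that the text does not fix: "connectivity" is represented by the number of connected components, and diameter and radius are taken from breadth-first-search eccentricities measured within each vertex's own component.

```python
from collections import deque

def bfs_distances(adj, src):
    """Hop distances from `src` to every vertex reachable from it."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def count_components(adj):
    """Number of connected components of an undirected graph."""
    seen = set()
    count = 0
    for start in adj:
        if start not in seen:
            count += 1
            seen.update(bfs_distances(adj, start))
    return count

def quantity_feature_vector(adj):
    """Splice vertex count, edge count, components, diameter and radius."""
    n_vertices = len(adj)
    n_edges = sum(len(neighbours) for neighbours in adj.values()) // 2
    eccentricity = [max(bfs_distances(adj, v).values()) for v in adj]
    return [
        float(n_vertices),
        float(n_edges),
        float(count_components(adj)),
        float(max(eccentricity)),  # diameter
        float(min(eccentricity)),  # radius
    ]

# Path graph 0 - 1 - 2 - 3 as an adjacency list.
path_graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
quantity_vec = quantity_feature_vector(path_graph)  # [4.0, 3.0, 1.0, 3.0, 2.0]
```

Running a breadth-first search from every vertex is only practical on small graphs; a production system would more likely read these statistics from the storage system's own metadata.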

The topological features of the graph data are vectorized using the graph convolution network technique from graph neural networks. A graph convolution network uses the graph as its neuron structure, propagates information between neurons through graph convolution operations, and obtains a vector representation by compression at the last layer. Because a graph convolution network is computationally heavy, it is not suitable to run it directly on the original graph; instead, a summary graph is generated first and the graph convolution operation is then performed on it.

A distributed graph data storage system is usually equipped with a graph partitioning function. Making use of this existing function, a graph partitioning operation is first performed on the graph data. According to the partitioning result, the vertices stored on the same data node are merged into one super node, and the edges running between two data nodes are collected into one super edge; that is, each subgraph is replaced by a super node, and the existence and attributes of the edges between super nodes are defined manually and derived from the original graph, which yields a summary graph at the data node level. The summary graph is small in scale and contains topological information such as the communication scale among data nodes; it is used as the input of the graph convolution network in order to reduce the computation scale of the graph convolution network.
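Building the summary graph from a partitioning result can be sketched as below. The sketch assumes the partitioner has already assigned every vertex to a data node (the `partition_of` mapping is hypothetical input), and it picks two simple attributes that the text leaves open: a super node carries its vertex count, and a super edge carries the number of original edges that cross between the two data nodes.

```python
from collections import Counter

def build_summary_graph(edges, partition_of):
    """Collapse each partition into a super node and cross-partition edges into super edges."""
    # Super node attribute: how many original vertices the data node holds.
    super_nodes = Counter(partition_of.values())
    # Super edge attribute: how many original edges run between two data nodes.
    super_edges = Counter()
    for u, v in edges:
        pu, pv = partition_of[u], partition_of[v]
        if pu != pv:  # edges inside one partition disappear into the super node
            super_edges[tuple(sorted((pu, pv)))] += 1
    return dict(super_nodes), dict(super_edges)

# Five vertices spread over three data nodes (0, 1, 2).
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "d"), ("d", "e")]
partition_of = {"a": 0, "b": 0, "c": 1, "d": 1, "e": 2}
super_nodes, super_edges = build_summary_graph(edges, partition_of)
# super_nodes == {0: 2, 1: 2, 2: 1}; super_edges == {(0, 1): 2, (1, 2): 1}
```

The summary graph has one vertex per data node regardless of how large the original graph is, which is what keeps the later graph convolution cheap.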

Then, taking the summary graph of the graph data as input, a graph convolution operation is performed to obtain the topological feature vector of the graph data.
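A single graph convolution layer followed by a pooling step gives a feel for how a summary graph becomes a fixed-length topological feature vector. The sketch below is an assumption-laden illustration: it uses the widely used propagation rule with self-loops and symmetric degree normalization, a ReLU activation, and mean pooling as the "compression" at the last layer, none of which is specified in the text. The weight matrix is fixed by hand here, whereas in practice it would be learned.

```python
import math

def gcn_layer(adj, features, weight):
    """One graph convolution: normalize(A + I) @ X @ W, followed by ReLU."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    degree = [sum(row) for row in a_hat]
    norm = [[a_hat[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
            for i in range(n)]
    n_in, n_out = len(weight), len(weight[0])
    # Aggregate neighbour features, then apply the linear transform and ReLU.
    aggregated = [[sum(norm[i][k] * features[k][f] for k in range(n)) for f in range(n_in)]
                  for i in range(n)]
    return [[max(0.0, sum(aggregated[i][f] * weight[f][h] for f in range(n_in)))
             for h in range(n_out)] for i in range(n)]

def topological_feature_vector(adj, features, weight):
    """Mean-pool the super node embeddings into one graph-level vector."""
    hidden = gcn_layer(adj, features, weight)
    return [sum(row[h] for row in hidden) / len(hidden) for h in range(len(hidden[0]))]

# Summary graph with three super nodes; entries are cross-partition edge counts.
summary_adj = [[0.0, 2.0, 0.0],
               [2.0, 0.0, 1.0],
               [0.0, 1.0, 0.0]]
node_features = [[2.0], [2.0], [1.0]]  # vertex count held by each super node
weight = [[0.5, -0.5]]                 # illustrative 1 x 2 weight matrix
topo_vec = topological_feature_vector(summary_adj, node_features, weight)
```

The output length depends only on the weight matrix, so summary graphs with different numbers of data nodes still yield vectors of the same size.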

Finally, the dense vectors of the quantity features of the graph data and the topological feature vectors output by the graph convolution network are spliced to obtain the graph feature vectors of the graph data, so that vectorization representation of the graph data is realized.

The actual load feature vector in the prediction data is extracted from the actual workload of the graph data under the current storage structure. During system operation, the access count of each data node in the graph and the system load are recorded in real time and stored as logs, and the recorded load logs are vectorized to obtain a workload sequence vector.
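One possible way to vectorize such a load log is to bucket the records into fixed time windows and, for each window, emit the per-data-node access counts followed by the mean system load. The record layout `(timestamp, data_node, load)`, the window length, and the choice of mean load are assumptions made for this sketch.

```python
def workload_sequence_vector(log, data_nodes, window, n_windows):
    """Flatten a load log into per-window access counts plus mean system load."""
    position = {node: i for i, node in enumerate(data_nodes)}
    counts = [[0] * len(data_nodes) for _ in range(n_windows)]
    load_sum = [0.0] * n_windows
    load_count = [0] * n_windows
    for timestamp, node, load in log:
        w = int(timestamp // window)
        if 0 <= w < n_windows:  # ignore records outside the observed horizon
            counts[w][position[node]] += 1
            load_sum[w] += load
            load_count[w] += 1
    vector = []
    for w in range(n_windows):
        vector.extend(float(c) for c in counts[w])
        vector.append(load_sum[w] / load_count[w] if load_count[w] else 0.0)
    return vector

# Hypothetical log records: (timestamp in seconds, data node, system load).
log = [(0.5, "n0", 0.2), (1.2, "n1", 0.4), (1.8, "n1", 0.6), (2.1, "n0", 0.9)]
workload_vec = workload_sequence_vector(log, ["n0", "n1"], window=1.0, n_windows=3)
# Layout per window: [count(n0), count(n1), mean load]
```

Keeping the windows in time order preserves the sequence character of the workload, which a sequence-aware prediction network could exploit.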

For the storage structure feature vector in the prediction data, the invention provides an extraction method that gives the storage structure a uniform vectorized representation. The specific steps are as follows. First, the graph data storage options of the graph data in the graph data storage system are acquired. In a large-scale graph data storage system there are often several mutually orthogonal graph data storage options; for example, "which attributes to index" and "which data structure to use for the index" are a pair of orthogonal storage options, that is, the choice made for the former does not restrict the free choice of the latter.

Then the vector code of each graph data storage option of the graph data under the specified storage structure is determined, and the vector code of each option is mapped to obtain the feature vector of that option. Following the technical idea of the field-aware factorization machine used in recommendation systems, each storage option (field) corresponds to a vector code, and the code is mapped to obtain the feature vector of the field. For example, if the field "which attributes to index" is set to "the 2nd, 3rd, and 5th attributes", it is encoded as the fixed-length vector (01101), and the feature vector corresponding to this storage choice is then obtained by a linear mapping (e.g., multiplying by the weight matrix of the field).

Finally, the feature vectors of all graph data storage options are spliced together to obtain the storage structure feature vector of the graph data under the storage structure. That is, the feature vector of each field is spliced with the feature vectors of the other fields to obtain a vectorized representation of the whole set of storage options.
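The three steps (encode each option, map the code linearly, splice the results) can be illustrated with the (01101) example from the text. The field names, the second field, and the weight matrices below are hypothetical; in a trained system the per-field weight matrices would be learned rather than written by hand.

```python
def multi_hot(selected, size):
    """Fixed-length 0/1 code; positions are 1-based, so {2, 3, 5} -> 0 1 1 0 1."""
    return [1.0 if position in selected else 0.0 for position in range(1, size + 1)]

def linear_map(code, weight):
    """Multiply a code (row vector) by the field's weight matrix."""
    return [sum(code[i] * weight[i][d] for i in range(len(code)))
            for d in range(len(weight[0]))]

def storage_structure_vector(codes, weights, field_order):
    """Map every field's code and splice the field vectors in a fixed order."""
    vector = []
    for field in field_order:
        vector.extend(linear_map(codes[field], weights[field]))
    return vector

field_order = ["indexed_attributes", "index_type"]
codes = {
    "indexed_attributes": multi_hot({2, 3, 5}, size=5),  # (0 1 1 0 1)
    "index_type": [0.0, 1.0],                            # one-hot: second of two types
}
weights = {  # illustrative per-field weight matrices
    "indexed_attributes": [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0], [0.0, 2.0]],
    "index_type": [[1.0, 0.0], [0.0, 1.0]],
}
structure_vec = storage_structure_vector(codes, weights, field_order)
# structure_vec == [1.0, 4.0, 0.0, 1.0]
```

Because every field has its own weight matrix and its own slot in the spliced vector, orthogonal options stay independent of one another in the representation.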

By using the method for extracting the characteristic vector of the storage structure, the current storage structure of the graph data and a plurality of preset alternative storage structures can be uniformly represented in a vectorization mode.

In step S2, the system load of each storage structure in the storage structure set is predicted using the prediction network obtained by the pre-training with the prediction data as input.

Further, in step S3, an optimal storage structure in the storage structure set, for example, the storage structure with the lowest system load, is determined according to the system load obtained by prediction, and the current storage structure of the graph data is replaced, so as to complete intelligent optimization of the graph data storage structure.
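Steps S2 and S3 amount to scoring every storage structure in the set and picking the one with the lowest predicted load. A minimal sketch follows, in which `toy_predictor` is a stand-in for the pre-trained prediction network and the one-element feature vectors are placeholders, both assumed for illustration.

```python
def select_best_structure(graph_vec, load_vec, structure_vecs, predict_load):
    """Predict the load of every candidate and return the lowest-load one."""
    predicted = {
        name: predict_load(graph_vec + load_vec + vec)  # spliced network input
        for name, vec in structure_vecs.items()
    }
    best = min(predicted, key=predicted.get)
    return best, predicted

def toy_predictor(features):
    """Placeholder for the trained network: here simply the sum of the inputs."""
    return sum(features)

structure_vecs = {"current": [3.0], "alt_a": [1.0], "alt_b": [2.0]}
best, predicted = select_best_structure([1.0], [2.0], structure_vecs, toy_predictor)
# best == "alt_a"; predicted == {"current": 6.0, "alt_a": 4.0, "alt_b": 5.0}
```

The current structure is scored alongside the alternatives, so no replacement takes place when it already has the lowest predicted load.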

The embodiment of the invention also provides a training method for the prediction network. First, training data of the prediction network is obtained, the training data comprising: a graph feature vector extracted from the graph data, an actual load feature vector extracted from the actual workload of the graph data under the current storage structure, and a storage structure feature vector extracted from the current storage structure of the graph data. The extraction methods for these vectors have been described above and are not repeated here. Then, the prediction network is trained with the training data as input; that is, the feature vector of the current graph storage structure is used together with the graph feature vector and the historical workload feature vector to predict a new workload vector, and the prediction network is trained on that prediction. The loss function of the prediction network is the difference between the cost of the predicted system load and the cost of the actual system load.
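One way to write this loss down is given below. The notation is an assumption introduced here, not taken from the source: $f_\theta$ is the prediction network with parameters $\theta$, $g$ the graph feature vector, $w$ the historical workload feature vector, $s$ the feature vector of the current storage structure, $y$ the system load actually observed, and $C(\cdot)$ a cost function over system loads. The source only says "difference"; taking the absolute value is one plausible reading.

```latex
\mathcal{L}(\theta) = \bigl|\, C\bigl(f_\theta(g, w, s)\bigr) - C(y) \,\bigr|
```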

After the training of the prediction network is completed, the prediction and optimization process of steps S1 to S3 may be performed using the prediction network. After steps S1 to S3 have been executed, the existing storage structure has been replaced by the best storage structure obtained by prediction, so the training data is updated the next time the training process of the prediction network is executed: the current storage structure of the graph data is now the storage structure after replacement, and the actual load feature vector and the storage structure feature vector in the training data are extracted under the storage structure after replacement. A new prediction network is trained with the updated training data.

After steps S1 to S3 are completed, a preferred storage structure subset in the storage structure set is determined according to the predicted system load, the storage structure set is updated, and the preferred storage structure subset and the replaced current storage structure are added to the updated storage structure set. That is, the best storage structure obtained by prediction replaces the existing storage structure, and the top-ranked several of the remaining storage structures are selected as part of the candidate storage schemes for the next prediction.
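The update of the storage structure set can be sketched as a ranking step. In this illustration the size of the preferred subset (`top_k`) is an assumed parameter, and the returned list would typically be topped up with freshly randomized alternatives before the next prediction round.

```python
def update_structure_set(predicted, current, top_k):
    """Keep the best structure, the next-best `top_k`, and the replaced current one."""
    ranked = sorted(predicted, key=predicted.get)  # lowest predicted load first
    best = ranked[0]
    preferred = [name for name in ranked[1:] if name != current][:top_k]
    updated = [best] + preferred
    if current != best:
        updated.append(current)  # the replaced structure stays a candidate
    return best, updated

predicted_load = {"current": 6.0, "alt_a": 4.0, "alt_b": 5.0, "alt_c": 9.0}
new_current, updated_set = update_structure_set(predicted_load, "current", top_k=1)
# new_current == "alt_a"; updated_set == ["alt_a", "alt_b", "current"]
```

Keeping the replaced structure in the set gives the system a way back if the workload shifts and the old structure becomes the better choice again.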

The prediction and optimization process of steps S1 to S3 can then be performed again with the new prediction network and the new storage structure set. By repeating these steps, the storage structure of the graph storage can be iteratively optimized while the data queries and the graph data keep changing. Re-executing steps S1 to S3 with new data may be triggered periodically, or trigger conditions may be set so that execution is triggered when the conditions are satisfied.

According to the method for determining the graph data storage structure, in the process of extracting graph data features, the summary graph is generated from graph partitioning information, which makes full use of the existing functions of the graph data storage system and greatly reduces the amount of computation for model training; the topological features of the graph are then efficiently vectorized by graph convolution network feature extraction on the summary graph. In the process of extracting storage structure features, the storage structure information is serialized on the basis of orthogonal grouping, so that complex graph storage structure information is converted into a uniform feature vector that is convenient to use as input to the prediction network. In the preparation of the prediction data, the next alternative storage structures are obtained by random adjustment of the existing storage structure, which avoids the complexity of manually adjusting the storage scheme, keeps the new scheme from differing too much from the previous one, and reduces the cost of replacing the storage scheme. In addition, candidate storage structures with higher predicted scores are added to the next set of alternatives, so that, compared with purely random adjustment, the probability that a new storage structure is superior to the existing one is improved and the convergence of the storage structure is accelerated.

According to the method for determining the graph data storage structure, after the graph data and the load information are embedded into numerical vectors, downstream tasks such as performance evaluation of graph data access structures, automatic design of graph data storage structures, and automatic construction of graph data indexes can be handled by deep learning methods. Because of this design, the method is considerably flexible: any graph data information, load information, and storage structure that can be vectorized can be optimized within the framework of the method, and the vectorized representation methods and the load prediction network are highly replaceable and can be locally updated with newer techniques and means at any time.

Example two

Compared with the first embodiment, the second embodiment is set in a specific application scenario and divides the method for determining the graph data storage structure into an initial stage, an operation stage, a training preparation stage, a training stage, a prediction stage, and a structure optimization stage, so as to describe the method provided by the present invention in more detail.

As shown in FIG. 2, when actually run in a system, the present invention follows the state flow below. In the initial stage, the graph data is stored in the graph network in a default storage mode, and operation starts.

In the operation stage, the number of times of accessing each data node in the graph and the system load are recorded in real time and stored as a log, and the training preparation stage is started regularly.

In the training preparation stage, the graph data is vectorized: a summary graph is generated, a graph convolution network then obtains the topological feature vector from it, and this vector together with the dense quantity feature vector serves as the feature vector of the graph data. The accumulated workload features are vectorized to obtain a workload sequence vector. Based on the existing graph storage structure, randomized adjustment is performed to generate a plurality of alternative graph storage schemes, which form the storage structure set, and the alternative graph storage structures are vectorized into a group of alternative graph storage structure feature vectors.

In the training stage, the feature vector of the current graph storage structure is used together with the graph feature vector and the historical workload feature vector to predict a new workload vector, and the prediction network is trained.

In the prediction stage, the graph feature vector and the complete workload feature vector are combined with the feature vector of each candidate storage structure (including the vector of the graph storage structure actually in use), and the prediction network predicts the system load of each candidate storage structure at the next moment.

In the structure optimization stage, the optimal storage structure obtained in the prediction stage is selected to replace the current storage structure, and the top-ranked several of the remaining candidates are selected to form the preferred storage structure subset, which serves as part of the alternative storage schemes of the next prediction stage. After the replacement is finished, the flow returns to the operation stage, which completes one cycle.
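The state flow of this embodiment can be summarized as a small state machine. The sketch below only models the order of the stages; the enum and function names are illustrative, and the work done inside each stage is left out.

```python
from enum import Enum

class Phase(Enum):
    INITIAL = "initial"
    OPERATION = "operation"
    TRAINING_PREPARATION = "training preparation"
    TRAINING = "training"
    PREDICTION = "prediction"
    STRUCTURE_OPTIMIZATION = "structure optimization"

# Each stage hands over to exactly one successor; optimization loops back to operation.
NEXT_PHASE = {
    Phase.INITIAL: Phase.OPERATION,
    Phase.OPERATION: Phase.TRAINING_PREPARATION,
    Phase.TRAINING_PREPARATION: Phase.TRAINING,
    Phase.TRAINING: Phase.PREDICTION,
    Phase.PREDICTION: Phase.STRUCTURE_OPTIMIZATION,
    Phase.STRUCTURE_OPTIMIZATION: Phase.OPERATION,
}

def run_phases(start, n_transitions):
    """Return the sequence of stages visited after `n_transitions` hand-overs."""
    trace = [start]
    for _ in range(n_transitions):
        trace.append(NEXT_PHASE[trace[-1]])
    return trace

trace = run_phases(Phase.INITIAL, 6)  # one full cycle, ending back in OPERATION
```

The initial stage is entered only once; every later cycle starts and ends in the operation stage.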

Example three

The embodiment provides an apparatus for determining a graph data storage structure, where the graph data is preconfigured with a storage structure set, where the storage structure set includes a current storage structure of the graph data and a plurality of preset alternative storage structures, as shown in fig. 3, the apparatus 101 includes: the device comprises a prediction data acquisition module 1, a prediction module 2, an optimization module 3, a training data acquisition module 4, a training module 5, a storage structure set updating module 6, a graph feature vector extraction module 7 and a storage structure feature vector extraction module 8.

The prediction data obtaining module 1 is configured to obtain prediction data of a graph data storage structure, where the prediction data includes: a graph feature vector extracted from the graph data, an actual load feature vector extracted from the actual workload of the graph data under the current storage structure, and a storage structure feature vector extracted for each storage structure in the storage structure set.

The prediction module 2 is configured to predict a system load of each storage structure in the storage structure set by using prediction data as an input and using a prediction network obtained through pre-training.

The optimization module 3 is used for determining an optimal storage structure in the storage structure set according to the system load obtained by prediction and replacing the current storage structure of the graph data.

The training data obtaining module 4 is configured to obtain training data of the prediction network, where the training data includes: a graph feature vector extracted from the graph data, an actual load feature vector extracted from the actual workload of the graph data under the current storage structure, and a storage structure feature vector extracted from the current storage structure of the graph data.

The training module 5 is configured to train the training data as input to obtain a prediction network, where the loss function is a difference between a cost of the predicted system load and a cost of the actual system load.

The storage structure set updating module 6 is configured to determine a preferred storage structure subset in the storage structure set according to the predicted system load, update the storage structure set, and add the preferred storage structure subset and the replaced current storage structure to the updated storage structure set.

The graph feature vector extraction module 7 is configured to splice the quantity features of the graph data to obtain quantity feature vectors of the graph data, perform graph partitioning on the graph data, obtain a summary graph of the graph data according to a result of the graph partitioning, perform graph convolution using the summary graph of the graph data as an input to obtain topological feature vectors of the graph data, and splice the quantity feature vectors and the topological feature vectors of the graph data to obtain the graph feature vectors of the graph data.

The storage structure feature vector extraction module 8 is configured to obtain a graph data storage option of the graph data in the graph data storage system, determine a vector code of each graph data storage option of the graph data in a specified storage structure, map the vector codes of the graph data storage options to obtain a feature vector of the graph data storage option, and splice the feature vectors of each graph data storage option to obtain a storage structure feature vector of the graph data in the storage structure.

The modules described in the embodiments of the present invention may be implemented by software or hardware. The described modules may also be provided in a processor, which may be described as: a processor comprises a prediction data acquisition module, a prediction module, an optimization module, a training data acquisition module, a training module, a storage structure set updating module, a graph feature vector extraction module and a storage structure feature vector extraction module. Wherein the names of the modules do not in some cases constitute a limitation of the module itself.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A method of determining a graph data storage structure, wherein the graph data is preconfigured with a set of storage structures including a current storage structure of the graph data and a plurality of preset alternative storage structures, the method comprising:

2. The method of claim 1, further comprising:

3. The method of claim 2, further comprising, after the step of determining an optimal storage structure in the set of storage structures based on the predicted system load and replacing a current storage structure of the graph data:

4. The method of any of claims 1 to 3, further comprising:

5. The method of any of claims 1 to 3, further comprising:

6. An apparatus for determining graph data storage structures, wherein the graph data is preconfigured with a set of storage structures including a current storage structure of the graph data and a plurality of preset alternative storage structures, the apparatus comprising:

7. The apparatus of claim 6, further comprising:

8. The apparatus of claim 7, further comprising:

9. The apparatus of any one of claims 6 to 8, further comprising:

10. The apparatus of any one of claims 6 to 8, further comprising: