CN112711659A - Model calculation method and device based on mass graph data - Google Patents

Model calculation method and device based on mass graph data

Info

Publication number
CN112711659A
Authority
CN
China
Prior art keywords
data
file
graph
database
hdfs
Prior art date
Legal status
Granted
Application number
CN202011625560.8A
Other languages
Chinese (zh)
Other versions
CN112711659B (en)
Inventor
顾凌云
郭志攀
王伟
李海全
Current Assignee
Nanjing Bingjian Information Technology Co ltd
Original Assignee
Nanjing Bingjian Information Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Nanjing Bingjian Information Technology Co ltd filed Critical Nanjing Bingjian Information Technology Co ltd
Priority to CN202011625560.8A priority Critical patent/CN112711659B/en
Publication of CN112711659A publication Critical patent/CN112711659A/en
Application granted granted Critical
Publication of CN112711659B publication Critical patent/CN112711659B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/335Filtering based on additional data, e.g. user or group profiles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/284Relational databases
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31Indexing; Data structures therefor; Storage structures
    • G06F16/316Indexing structures
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The model calculation method and device based on massive graph data import the graph data to be processed from the graph database JanusGraph into the hive database to obtain a data node list and a data relation list, determine a connected graph id for each data node and its corresponding data relation, aggregate the data of the same connected graph based on the connected graph id and push it to hdfs storage, keep the mapping between operation parameters and aggregation files during aggregation and import it into the hive database, and adjust preset thread parameters to obtain target thread parameters for data processing, thereby obtaining a data processing result. By splitting the data along connected graphs in advance, the method prepares the tasks for parallel execution; by screening and converting the data in advance, it reduces the data volume at calculation time so that the data can be loaded into memory; and by slightly modifying the single-machine model python code into spark code, the calculation can be parallelized and the degree of parallelism can be adjusted dynamically according to the computing resources and the requirements of the task.

Description

Model calculation method and device based on mass graph data
Technical Field
The invention relates to the technical field of data processing, in particular to a model calculation method and device based on massive graph data.
Background
In a typical knowledge graph project, the graph model is an important component of graph analysis and mining: it can perform deep analysis such as machine learning and data mining on graph data and better discover the knowledge implied in the data. In practice, however, graph data are highly interconnected and therefore hard to split for graph model calculation. When the data volume is small, all data can be loaded into memory and then calculated, so the impact is limited. Under massive data, however, the data cannot be fully loaded into memory, and the time consumed by single-machine computation is unacceptable. A method is therefore needed that consumes fewer resources and can parallelize the calculation over massive data.
Disclosure of Invention
In order to solve the problems, the invention provides a model calculation method and a model calculation device based on massive graph data.
The embodiment of the invention provides a model calculation method based on massive graph data, which is applied to computer equipment and comprises the following steps:
importing graph data to be processed into a hive database from a graph database JanusGraph to obtain a data node list and a data relation list;
determining, according to the data node list and the data relation list, a connected graph id for each data node and its corresponding data relation;
obtaining a target file based on the connected graph id, and pushing the target file to an hdfs database;
performing data screening on the hdfs file directory corresponding to the target file in the hdfs database to obtain a mapping file, and importing the mapping file into the hive database;
adjusting the preset thread parameters to obtain target thread parameters;
and starting a data processing task according to the target thread parameter, and performing data processing on the mapping file in the hive database to obtain a data processing result.
Optionally, determining a connected graph id for each data node and its corresponding data relation according to the data node list and the data relation list includes:
reading the data node list and the data relation list through the acquired spark code;
and calculating a connected graph id of each data node and the corresponding data relation thereof based on the graph framework of spark.
Optionally, obtaining a target file based on the connectivity graph id, and pushing the target file to an hdfs database, including:
grouping the connected graph ids;
sequentially writing the data nodes and the data relations of each group into the initial file to obtain a target file;
and pushing the target file to an hdfs database.
Optionally, the data screening is performed on the hdfs file directory corresponding to the target file in the hdfs database to obtain a mapping file, and the importing the mapping file into the hive database includes:
defining a data filtering file;
reading the hdfs file directory according to the data filtering file;
converting each file to be processed in the hdfs file directory into an sqlite file and a mapping file of para and sqlite;
pushing the sqlite file and the mapping file to a specified directory of the hive database.
Optionally, adjusting the preset thread parameter to obtain the target thread parameter includes:
and modifying the preset standalone code into a distributed code.
Optionally, starting a data processing task according to the target thread parameter includes:
and starting a computing task based on the distributed code, and submitting the task by using a submit command of spark.
Optionally, the method further comprises:
and verifying the data processing result.
The embodiment of the invention provides a model calculation device based on massive graph data, which is applied to computer equipment and comprises the following functional modules:
the data import module is used for importing the graph data to be processed into the hive database from the graph database JanusGraph to obtain a data node list and a data relationship list;
the connected graph determining module is used for determining, according to the data node list and the data relation list, a connected graph id for each data node and its corresponding data relation;
the file pushing module is used for obtaining a target file based on the connected graph id and pushing the target file to the hdfs database;
the data screening module is used for screening data of the hdfs file directory corresponding to the target file in the hdfs database to obtain a mapping file and importing the mapping file into the hive database;
the parameter adjusting module is used for adjusting the preset thread parameters to obtain target thread parameters;
and the data processing module is used for starting a data processing task according to the target thread parameter and carrying out data processing on the mapping file in the hive database to obtain a data processing result.
Optionally, the connected graph determining module is configured to:
reading the data node list and the data relation list through the acquired spark code;
and calculating a connected graph id of each data node and the corresponding data relation thereof based on the graph framework of spark.
Optionally, the file pushing module is configured to:
grouping the connected graph ids;
sequentially writing the data nodes and the data relations of each group into the initial file to obtain a target file;
and pushing the target file to an hdfs database.
The model calculation method and device based on massive graph data provided by the invention import the graph data to be processed from the graph database JanusGraph into the hive database to obtain a data node list and a data relation list, determine a connected graph id for each data node and its corresponding data relation, aggregate the data of the same connected graph based on the connected graph id and push it to hdfs storage, keep the mapping between operation parameters and aggregation files during aggregation and import it into the hive database, and adjust preset thread parameters to obtain target thread parameters so that a data processing task can be started according to the target thread parameters to process the mapping files in the hive database and obtain a data processing result. With this design, splitting the data along connected graphs in advance prepares the tasks for parallel execution; screening and converting the data in advance reduces the data volume at calculation time so that the data can be loaded into memory; and slightly modifying the single-machine model python code into spark code makes the calculation parallelizable, while the degree of parallelism can be adjusted dynamically according to the computing resources and the requirements of the task.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the embodiments are briefly described below. It should be understood that the following drawings only illustrate some embodiments of the present invention and should therefore not be regarded as limiting the scope; for those skilled in the art, other related drawings can be obtained from these drawings without inventive effort.
Fig. 1 is a flowchart of a model calculation method based on mass graph data according to an embodiment of the present invention.
Fig. 2 is a block diagram of a model calculation device based on mass graph data according to an embodiment of the present invention.
Detailed Description
In order to better understand the technical solutions of the present invention, they are described in detail below with reference to the accompanying drawings and specific embodiments. It should be understood that the specific features in the embodiments and examples of the present invention are detailed descriptions of the technical solutions rather than limitations of them, and the technical features in the embodiments and examples may be combined with each other without conflict.
The inventor finds that the common single-machine, in-memory graph model calculation scheme proceeds as follows: a. load a small amount of data into memory; b. write the corresponding analysis code to process the data in memory; c. start one or more threads to run the analysis code; d. output the results.
However, the prior art has the following disadvantages: it can only process a small amount of data and must load all of it into memory, which consumes a lot of resources; it can run multiple threads in parallel, but only on a single machine; and when the data volume is large, it cannot use a big-data computing engine to scale computing resources linearly.
To address these problems, the inventor provides a model calculation method and device based on massive graph data. Referring first to fig. 1, a model calculation method based on massive graph data is shown; it can be applied to a computer device and is implemented as described in the following steps S11-S16.
Step S11: importing the graph data to be processed from the graph database JanusGraph into the hive database to obtain a data node list and a data relation list.
In the data export step, data are exported from the graph database JanusGraph (a distributed graph database commonly used for small-batch OLTP queries) to Hive (a big-data distributed data warehouse). The purpose of the export is to keep the subsequent full-scale analysis (OLAP) consistent with the existing data. The export produces two Hive tables: a data node list (nodes table) and a data relation list (relations table).
nodes table: (table contents shown as images in the original publication)
relations table: (table contents shown as images in the original publication)
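As a minimal sketch of what consuming these two exported tables might look like in the later steps (the database, table and column names here are hypothetical, since the patent only shows the table layouts as images):

```python
# Sketch: read the exported nodes and relations tables from hive with spark.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("read-exported-graph")
         .enableHiveSupport()              # lets spark read the hive database directly
         .getOrCreate())

# Hypothetical table names for the two exported lists.
nodes_df = spark.table("graph.nodes")          # data node list
relations_df = spark.table("graph.relations")  # data relation list

# A simple consistency check after the export: record counts per table.
print("nodes:", nodes_df.count(), "relations:", relations_df.count())
```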
Step S12: determining, according to the data node list and the data relation list, a connected graph id for each data node and its corresponding data relation.
Step S13: obtaining a target file based on the connected graph id, and pushing the target file to the hdfs database.
Step S14: performing data screening on the hdfs file directory corresponding to the target file in the hdfs database to obtain a mapping file, and importing the mapping file into the hive database.
Step S15: adjusting the preset thread parameters to obtain the target thread parameters.
Step S16: starting a data processing task according to the target thread parameters, and performing data processing on the mapping file in the hive database to obtain a data processing result.
In this way, the graph data to be processed are imported from the graph database JanusGraph into the hive database to obtain a data node list and a data relation list, a connected graph id is determined for each data node and its corresponding data relation, the data of the same connected graph are aggregated and pushed to hdfs storage based on the connected graph id, the mapping between operation parameters and aggregation files is kept during aggregation and imported into the hive database, and preset thread parameters are adjusted to obtain target thread parameters so that a data processing task can be started accordingly to process the mapping files in the hive database and obtain a data processing result. With this design, splitting the data along connected graphs in advance prepares the tasks for parallel execution; screening and converting the data in advance reduces the data volume at calculation time so that the data can be loaded into memory; and slightly modifying the single-machine model python code into spark code makes the calculation parallelizable, while the degree of parallelism can be adjusted dynamically according to the computing resources and the requirements of the task.
Further, determining a connected graph id for each data node and its corresponding data relation according to the data node list and the data relation list in step S12 includes: reading the data node list and the data relation list through the acquired spark code; and calculating the connected graph id of each data node and its corresponding data relation based on spark's graph framework. It can be understood that the reason the conventional method cannot be parallelized is that the graph data are mutually associated and are therefore never split; the splitting in this step uses the connected-graph property of the graph data to calculate a connected graph id for each node.
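The patent does not name the spark graph framework or the table columns; as one possible reading, the sketch below assumes GraphFrames (which must be supplied to spark as a package) and hypothetical id/src/dst columns:

```python
# Sketch: compute a connected graph id per node and per relation with GraphFrames (assumed framework).
from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.appName("connected-graph-id").enableHiveSupport().getOrCreate()
spark.sparkContext.setCheckpointDir("/tmp/cc_checkpoint")   # required by connectedComponents

# Hypothetical column names: the nodes table exposes "id", the relations table "src" and "dst".
vertices = spark.sql("SELECT id FROM graph.nodes")
edges = spark.sql("SELECT src, dst FROM graph.relations")

g = GraphFrame(vertices, edges)
components = g.connectedComponents()              # adds a "component" column: the connected graph id
node_cc = components.select("id", "component")

# Propagate the same id to each relation through its source node.
rel_cc = edges.join(node_cc, edges.src == node_cc.id).select("src", "dst", "component")
```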
Further, obtaining a target file based on the connected graph id and pushing the target file to the hdfs database, as described in step S13, includes: grouping the connected graph ids; sequentially writing the data nodes and data relations of each group into an initial file to obtain a target file; and pushing the target file to the hdfs database.
It can be understood that the purpose of data aggregation is to write the nodes and relations of the same connected graph into the same file, in preparation for the subsequent calculation. Meanwhile, to prevent a large number of small files from being generated because some connected graphs are too small, a minimum number of records per file needs to be set.
For example, Spark code is written to read the result of step S12 and group it by connected graph id. For each group, the nodes are written to the file first, one record per line, each line starting with "node|"; the relations are then written, each line starting with "relations|". To prevent memory overflow, the cache is flushed to disk every 10,000 lines. If the number of records written to a file has not reached the threshold, the code does not switch to a new file but continues writing to the original one. The final file is formed as follows:
(example aggregated file contents shown as an image in the original publication)
When the file write is completed, the file is pushed to hdfs (a distributed file storage system).
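A minimal sketch of this aggregation step under stated assumptions: the "node|"/"relations|" prefixes and the 10,000-line flush interval follow the example above, while the minimum-record threshold, the file naming and the hdfs target path are hypothetical:

```python
# Sketch: write the nodes and relations of each connected graph into aggregated files,
# flushing every 10,000 lines and reusing a file until it reaches a minimum record count.
import subprocess

MIN_RECORDS_PER_FILE = 100_000   # assumed threshold to avoid producing many tiny files
FLUSH_EVERY = 10_000             # flush cadence from the example above


def write_groups(groups, out_dir):
    """groups: iterable of (component_id, node_lines, relation_lines), one entry per connected graph."""
    file_idx, written_in_file, total_lines = 0, 0, 0
    out = open(f"{out_dir}/part-{file_idx}.txt", "w", encoding="utf-8")
    for _component_id, node_lines, relation_lines in groups:
        for prefix, lines in (("node|", node_lines), ("relations|", relation_lines)):
            for line in lines:
                out.write(prefix + line + "\n")
                total_lines += 1
                written_in_file += 1
                if total_lines % FLUSH_EVERY == 0:
                    out.flush()              # flush the cache to disk to prevent memory overflow
        # only switch to a new file once the minimum record count has been reached
        if written_in_file >= MIN_RECORDS_PER_FILE:
            out.close()
            file_idx, written_in_file = file_idx + 1, 0
            out = open(f"{out_dir}/part-{file_idx}.txt", "w", encoding="utf-8")
    out.close()
    # push the finished files to hdfs (target path is hypothetical)
    subprocess.run(["hdfs", "dfs", "-put", out_dir, "/data/connected_graphs/"], check=True)
```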
For some possible examples, performing data screening in step S14 on the hdfs file directory corresponding to the target file in the hdfs database to obtain a mapping file, and importing the mapping file into the hive database, includes: defining a data filtering file; reading the hdfs file directory according to the data filtering file; converting each file to be processed in the hdfs file directory into an sqlite file and a mapping file of para and sqlite; and pushing the sqlite file and the mapping file to a specified directory of the hive database. The purpose of this step is to filter the data and convert the files of step S13 into the data required for calculation. Because a typical model only uses part of the node and relation data, the unused data are screened out in advance to reduce the amount of data to be calculated.
For example, the data filter file is organized as follows (its contents are shown as images in the original publication).
Code is written to read the configuration file and, using spark, to read the hdfs file directory obtained in step S13; each file under the directory is converted into an sqlite file. Each sqlite file contains two tables, nodes and relations, whose data and indexes conform to the definitions in the filter file above. The sqlite file name is the original file name with a ".db" suffix, and during the conversion the para associated with the source file also needs to be written into the file. A commit is issued for every 10,000 converted records. Each converted file yields two outputs: the converted sqlite file and the mapping file of para and sqlite; both are pushed to the directory specified on hdfs. The mapping file of para and sqlite is then imported into hive, where the hive table has two columns: para and db.
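A minimal sketch of converting one aggregated file into an sqlite file, assuming the "node|"/"relations|" line format from the example above; the filtered column layout and the field separator are hypothetical, since the original filter file is only shown as images:

```python
# Sketch: convert one aggregated file into an sqlite file with nodes/relations tables,
# keeping only the filtered fields and committing every 10,000 records.
import sqlite3

COMMIT_EVERY = 10_000


def convert(txt_path, db_path):
    """db_path is the original file name with a ".db" suffix."""
    conn = sqlite3.connect(db_path)
    cur = conn.cursor()
    # Hypothetical filtered columns; real column sets would come from the data filter file.
    cur.execute("CREATE TABLE nodes (id TEXT, name TEXT, type TEXT)")
    cur.execute("CREATE TABLE relations (src TEXT, dst TEXT, label TEXT)")
    cur.execute("CREATE INDEX idx_nodes_id ON nodes (id)")
    cur.execute("CREATE INDEX idx_relations_src ON relations (src)")

    pending = 0
    with open(txt_path, encoding="utf-8") as f:
        for line in f:
            prefix, _, payload = line.rstrip("\n").partition("|")
            fields = payload.split("\t")                 # assumed field separator
            if prefix == "node":
                cur.execute("INSERT INTO nodes VALUES (?, ?, ?)", fields[:3])
            elif prefix == "relations":
                cur.execute("INSERT INTO relations VALUES (?, ?, ?)", fields[:3])
            pending += 1
            if pending % COMMIT_EVERY == 0:
                conn.commit()                            # commit every 10,000 records
    conn.commit()
    conn.close()
    # The para-to-sqlite mapping file would be written alongside and both files
    # pushed to the hdfs directory; that part is omitted here.
```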
Further, adjusting the preset thread parameter described in step S15 to obtain the target thread parameter includes: modifying the preset standalone code into distributed code. The purpose of this step is to convert the analyst-written standalone code into distributed code.
In some embodiments, starting a data processing task according to the target thread parameter as described in step S16 includes: starting a computing task based on the distributed code, and submitting the task using spark's submit command.
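As a minimal sketch of the conversion from standalone to distributed code (the mapping table name, the model module and the spark-submit options are hypothetical; only the pattern of running the single-machine model once per sqlite file is taken from the description above):

```python
# Sketch: run the single-machine model once per sqlite file, in parallel, through spark.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("graph-model").enableHiveSupport().getOrCreate()


def run_model_on_row(row):
    # The original standalone python model applied to one small sqlite file; because each
    # file holds complete connected graphs, no communication between tasks is needed.
    # Fetching the sqlite file from hdfs to the local executor is omitted here.
    from my_model import analyse_sqlite      # hypothetical module wrapping the standalone model
    return (row.para, analyse_sqlite(row.db))


# Hive mapping table with two columns, para and db, as described above.
mapping = spark.table("graph.para_db_mapping")
results = mapping.rdd.map(run_model_on_row).collect()

# The job itself would be submitted with spark's submit command, for example (options illustrative):
#   spark-submit --master yarn --deploy-mode cluster --num-executors 20 --executor-cores 4 model_job.py
```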
On this basis, the method may further include the step of verifying the data processing result. For example, after the calculation is completed, a selected portion of the data is compared against the single-machine results for verification.
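A minimal sketch of such a verification, comparing a sampled portion of the distributed results against a single-machine reference run (the result table, reference file, join key and score column are all assumptions):

```python
# Sketch: spot-check the distributed results against a single-machine reference run.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("verify-results").enableHiveSupport().getOrCreate()

distributed = spark.table("graph.model_results")                       # hypothetical result table
reference = spark.read.csv("file:///tmp/single_machine_results.csv",   # hypothetical reference
                           header=True)

sample = distributed.sample(fraction=0.01, seed=42)
joined = sample.join(reference, on="para", how="inner")                # "para" as assumed join key
mismatches = joined.filter(joined["score"] != joined["ref_score"]).count()
print(f"checked {joined.count()} sampled records, found {mismatches} mismatches")
```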
Based on the same inventive concept as above, please refer to fig. 2, which shows a model calculation apparatus 200 based on mass graph data. The apparatus is applied to a computer device and includes the following modules:
the data import module 210 is configured to import the graph data to be processed from the graph database JanusGraph into the hive database to obtain a data node list and a data relationship list;
a connected graph determining module 220, configured to determine, according to the data node list and the data relation list, a connected graph id for each data node and its corresponding data relation;
the file pushing module 230 is configured to obtain a target file based on the connected graph id, and push the target file to an hdfs database;
a data screening module 240, configured to perform data screening on an hdfs file directory corresponding to the target file in the hdfs database to obtain a mapping file, and import the mapping file into the hive database;
a parameter adjusting module 250, configured to adjust a preset thread parameter to obtain a target thread parameter;
and the data processing module 260 is configured to start a data processing task according to the target thread parameter, and perform data processing on the mapping file in the hive database to obtain a data processing result.
Optionally, the connected graph determining module 220 is configured to: read the data node list and the data relation list through the acquired spark code, and calculate a connected graph id for each data node and its corresponding data relation based on spark's graph framework.
Optionally, the file pushing module 230 is configured to: grouping the connected graph ids; sequentially writing the data nodes and the data relations of each group into the initial file to obtain a target file; and pushing the target file to an hdfs database.
To sum up, the model calculation method and device based on massive graph data of the invention import the graph data to be processed from the graph database JanusGraph into the hive database to obtain a data node list and a data relation list, determine a connected graph id for each data node and its corresponding data relation, aggregate the data of the same connected graph based on the connected graph id and push it to hdfs storage, keep the mapping between operation parameters and aggregation files during aggregation and import it into the hive database, and adjust preset thread parameters to obtain target thread parameters for data processing, thereby obtaining a data processing result. Splitting the data along connected graphs in advance therefore prepares the tasks for parallel execution; screening and converting the data in advance reduces the data volume at calculation time so that the data can be loaded into memory; and slightly modifying the single-machine model python code into spark code makes the calculation parallelizable, while the degree of parallelism can be adjusted dynamically according to the computing resources and the requirements of the task.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (10)

1. A model calculation method based on massive graph data is applied to computer equipment, and the method comprises the following steps:
importing graph data to be processed into a hive database from a graph database JanusGraph to obtain a data node list and a data relation list;
determining each data node and a connected graph id of a corresponding data relationship according to the data node list and the data relationship list;
obtaining a target file based on the connected graph id, and pushing the target file to an hdfs database;
performing data screening on the hdfs file directory corresponding to the target file in the hdfs database to obtain a mapping file, and importing the mapping file into the hive database;
adjusting the preset thread parameters to obtain target thread parameters;
and starting a data processing task according to the target thread parameter, and performing data processing on the mapping file in the hive database to obtain a data processing result.
2. The data processing method according to claim 1, wherein determining a connected graph id of each data node and its corresponding data relationship according to the data node list and the data relationship list comprises:
reading the data node list and the data relation list through the acquired spark code;
and calculating a connected graph id of each data node and the corresponding data relation thereof based on the graph framework of spark.
3. The data processing method according to claim 1, wherein obtaining a target file based on the connectivity graph id and pushing the target file to an hdfs database comprises:
grouping the connected graph ids;
sequentially writing the data nodes and the data relations of each group into the initial file to obtain a target file;
and pushing the target file to an hdfs database.
4. The data processing method of claim 1, wherein the data screening is performed on an hdfs file directory corresponding to the target file in the hdfs database to obtain a mapping file, and the importing the mapping file into the hive database comprises:
defining a data filtering file;
reading the hdfs file directory according to the data filtering file;
converting each file to be processed in the hdfs file directory into an sqlite file and a mapping file of para and sqlite;
pushing the sqlite file and the mapping file to a specified directory of the hive database.
5. The data processing method of claim 1, wherein adjusting the preset thread parameter to obtain the target thread parameter comprises:
and modifying the preset standalone code into a distributed code.
6. The data processing method of claim 5, wherein initiating a data processing task according to the target thread parameter comprises:
and starting a computing task based on the distributed code, and submitting the task by using a submit command of spark.
7. The data processing method of claim 1, wherein the method further comprises:
and verifying the data processing result.
8. A model calculation device based on massive graph data is applied to computer equipment and comprises the following functional modules:
the data import module is used for importing the graph data to be processed into the hive database from the graph database JanusGraph to obtain a data node list and a data relationship list;
the connected graph determining module is used for determining each data node and a connected graph id of a corresponding data relation according to the data node list and the data relation list;
the file pushing module is used for obtaining a target file based on the connected graph id and pushing the target file to the hdfs database;
the data screening module is used for screening data of the hdfs file directory corresponding to the target file in the hdfs database to obtain a mapping file and importing the mapping file into the hive database;
the parameter adjusting module is used for adjusting the preset thread parameters to obtain target thread parameters;
and the data processing module is used for starting a data processing task according to the target thread parameter and carrying out data processing on the mapping file in the hive database to obtain a data processing result.
9. The data processing apparatus of claim 8, wherein the connected graph determining module is configured to:
reading the data node list and the data relation list through the acquired spark code;
and calculating a connected graph id of each data node and the corresponding data relation thereof based on the graph framework of spark.
10. The data processing apparatus of claim 8, wherein the file push module is configured to:
grouping the connected graph ids;
sequentially writing the data nodes and the data relations of each group into the initial file to obtain a target file;
and pushing the target file to an hdfs database.
CN202011625560.8A 2020-12-31 2020-12-31 Model calculation method and device based on mass graph data Active CN112711659B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011625560.8A CN112711659B (en) 2020-12-31 2020-12-31 Model calculation method and device based on mass graph data


Publications (2)

Publication Number Publication Date
CN112711659A true CN112711659A (en) 2021-04-27
CN112711659B CN112711659B (en) 2024-03-15

Family

ID=75547652

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011625560.8A Active CN112711659B (en) 2020-12-31 2020-12-31 Model calculation method and device based on mass graph data

Country Status (1)

Country Link
CN (1) CN112711659B (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105849764A (en) * 2013-10-25 2016-08-10 西斯摩斯公司 Systems and methods for identifying influencers and their communities in a social data network
US20150169758A1 (en) * 2013-12-17 2015-06-18 Luigi ASSOM Multi-partite graph database
CN105335230A (en) * 2014-07-30 2016-02-17 阿里巴巴集团控股有限公司 Service processing method and apparatus
CN104809168A (en) * 2015-04-06 2015-07-29 华中科技大学 Partitioning and parallel distribution processing method of super-large scale RDF graph data
CN110134516A (en) * 2019-05-16 2019-08-16 深圳前海微众银行股份有限公司 Finance data processing method, device, equipment and computer readable storage medium
CN111460234A (en) * 2020-03-26 2020-07-28 平安科技(深圳)有限公司 Graph query method and device, electronic equipment and computer readable storage medium
CN111428095A (en) * 2020-06-11 2020-07-17 上海冰鉴信息科技有限公司 Graph data quality verification method and graph data quality verification device

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
TODOR IVANOV et al.: "Evaluating Hive and Spark SQL with BigBench", arXiv, 28 December 2015 (2015-12-28), pages 1-10 *
刘超; 唐郑望; 姚宏; 胡成玉; 梁庆中: "Graph data processing technology on cloud platforms" (云平台下图数据处理技术), Journal of Computer Applications (计算机应用), vol. 35, no. 01, 10 January 2015 (2015-01-10), pages 43-47 *
唐德权; 张波云: "Research on path-based frequent subgraph mining algorithms" (基于路径的频繁子图挖掘算法研究), Computer Engineering & Science (计算机工程与科学), vol. 41, no. 12, 15 December 2019 (2019-12-15), pages 2223-2230 *

Also Published As

Publication number Publication date
CN112711659B (en) 2024-03-15

Similar Documents

Publication Publication Date Title
JP5298117B2 (en) Data merging in distributed computing
US9471607B2 (en) Data loading tool
US9460188B2 (en) Data warehouse compatibility
US9356966B2 (en) System and method to provide management of test data at various lifecycle stages
US8949222B2 (en) Changing the compression level of query plans
Kunda et al. A comparative study of nosql and relational database
CN108197306B (en) SQL statement processing method and device, computer equipment and storage medium
US20110307502A1 (en) Extensible event-driven log analysis framework
US20120078904A1 (en) Approximate Index in Relational Databases
DE112011101200T5 (en) Column-oriented memory representations of data records
CN106528898A (en) Method and device for converting data of non-relational database into relational database
Mehmood et al. Performance analysis of not only SQL semi-stream join using MongoDB for real-time data warehousing
CN109669975B (en) Industrial big data processing system and method
Sinthong et al. Aframe: Extending dataframes for large-scale modern data analysis
CN111966760A (en) Hive data warehouse-based test data generation method and device
CN107430633B (en) System and method for data storage and computer readable medium
CN110851515A (en) Big data ETL model execution method and medium based on Spark distributed environment
CN112711659A (en) Model calculation method and device based on mass graph data
Martins et al. NoSQL comparative performance study
CN109543079B (en) Data query method and device, computing equipment and storage medium
Dai et al. The Hadoop stack: new paradigm for big data storage and processing
WO2021061183A1 (en) Shuffle reduce tasks to reduce i/o overhead
US20220222229A1 (en) Automated database modeling
CN117390040B (en) Service request processing method, device and storage medium based on real-time wide table
Huang et al. Cloud Based Test Coverage Service

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant