WO2020206952A1 - Data import method and apparatus for a graph database - Google Patents
Data import method and apparatus for a graph database
- Publication number
- WO2020206952A1 (PCT/CN2019/109096)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- spark
- graph database
- data
- importing
- database
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/51—Indexing; Data structures therefor; Storage structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/53—Querying
Definitions
- the present invention relates to the technical field of data processing, in particular to a data import method and device for a graph database.
- Spark is a cluster- and memory-based data processing technology. it can process large amounts of data across many machines working together, and it can also be integrated with a graph computing framework to perform data computing operations. Spark can not only be integrated in different ways, but can also preprocess the data (including aggregation, filtering, transformation, etc.) and then import the preprocessed data into the graph database.
- the embodiments of the present invention provide a data import method and device for a graph database, to overcome problems in the prior art such as the need to export data to csv format before importing it into the graph database, the inability to import data in real time, and the non-adjustable import speed.
- the technical solution adopted by the present invention is:
- a data import method for a graph database includes the following steps: registering a custom spark udf function with the graph database program, so that the graph database establishes a connection with spark resources through the spark udf function; creating a node attribute index in the graph database; querying the hive database by using the spark resources to obtain the queried data; repartitioning the queried data and registering it as a temporary data table; and importing the temporary data table into the graph database through the spark udf function and the node attribute index.
- before registering the custom spark udf function with the graph database program, the method further includes: setting the driver that connects to the graph database in the spark udf function to be written in a static method.
- before registering the custom spark udf function with the graph database program, the method further includes: defining the input and output parameters of the spark udf function.
- after importing the temporary data table into the graph database, the method further includes: closing the driver of the graph database and the spark resources.
- the query of the hive database by using the spark resources includes: using the reduce operator of the spark resources to perform corresponding calculations on the queried data.
- a data import device for a graph database includes:
- the connection module, used to register a custom spark udf function with the graph database program, so that the graph database establishes a connection with spark resources through the spark udf function;
- the creation module, used to create a node attribute index in the graph database;
- the query module, used to query the hive database by using the spark resources to obtain the queried data;
- the partition module, used to repartition the queried data and register it as a temporary data table;
- the import module, used to import the temporary data table into the graph database through the spark udf function and the node attribute index.
- the device further includes:
- the driver connection module is used to set the driver for connecting the graph database in the spark udf function to be written in a static method.
- the device further includes:
- the configuration module is used to define the input and output parameters of the spark udf function.
- the device further includes:
- the closing module, used to close the driver of the graph database and the spark resources.
- the query of the hive database by using the spark resources includes: using the reduce operator of the spark resources to perform corresponding calculations on the queried data.
- the method and device for importing graph database data provided by the embodiment of the present invention can realize real-time import of data by combining spark and graph database without exporting the data into csv format;
- the method and device for importing graph database data provided by the embodiments of the present invention utilize spark technology to facilitate spark performance tuning and adjust the speed of data import;
- the method and device for importing graph database data provided by the embodiments of the present invention utilize the concurrent feature of spark to speed up the import of data without losing data.
- Fig. 1 is a flow chart showing a method for importing data from a graph database according to an exemplary embodiment
- Fig. 2 is a schematic structural diagram of a data importing device for a graph database according to an exemplary embodiment.
- Fig. 1 is a flowchart showing a method for importing data from a graph database according to an exemplary embodiment. Referring to Fig. 1, the method includes the following steps:
- in the embodiment of the present invention, the graph database is combined with spark resources through a custom spark udf function (that is, the graph database establishes a connection with the spark resources), so that data can be imported into the graph database in real time without exporting the data to csv format.
- custom spark udf functions can be developed in the Java language or in other programming languages.
- the written custom spark udf function must be registered with the graph database program, because in Java a custom udf must be registered before it can be used.
- using the spark udf function facilitates spark performance tuning, for example: how many partitions the data queried from hive is re-divided into, how to set the parallelism of spark, how many computing nodes (executors) to allocate to the spark job, how much memory and how many cores to allocate to each executor, and how much memory to allocate to the driver.
- a spark udf (user-defined function) lets users implement their own business logic when the functions provided by spark itself cannot satisfy their needs. for example:
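As an illustrative sketch (not part of the patent text), a custom Spark UDF can be written and registered in Scala as follows; the `normalizeName` function, the table-free query, and the local-mode session are assumptions made only for this example.

```scala
import org.apache.spark.sql.SparkSession

object UdfExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("udf-example")
      .master("local[*]") // assumption: local mode, for illustration only
      .getOrCreate()

    // Custom business logic that Spark's built-in functions do not provide:
    // here, a hypothetical normalizer applied to node names before import.
    spark.udf.register("normalizeName", (s: String) =>
      if (s == null) "" else s.trim.toLowerCase)

    // Once registered, the UDF can be called from Spark SQL like a built-in.
    spark.sql("SELECT normalizeName(' Alice ') AS name").show()

    spark.stop()
  }
}
```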
- a node attribute index is created in the graph database, that is, a property index is created for the attributes of each node in the graph database. if the attribute index is not established, the speed of data insertion is significantly slower.
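A minimal sketch of creating such a property index with the neo4j Java driver from Scala; the label `Person`, property `id`, connection details, and the Neo4j 4.x `CREATE INDEX` syntax are all assumptions, not taken from the patent.

```scala
import org.neo4j.driver.{AuthTokens, GraphDatabase}

object CreateNodeIndex {
  def main(args: Array[String]): Unit = {
    // Assumed connection details for a local neo4j instance.
    val driver = GraphDatabase.driver("bolt://localhost:7687",
      AuthTokens.basic("neo4j", "password"))
    val session = driver.session()
    try {
      // Create the node attribute (property) index before bulk import;
      // without it, per-row MATCH lookups degrade to full label scans.
      session.run("CREATE INDEX person_id IF NOT EXISTS FOR (n:Person) ON (n.id)")
    } finally {
      session.close()
      driver.close()
    }
  }
}
```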
- importing hive table data into a graph database is taken as an example.
- first, the spark resources can be used to query the hive database (specifically, the data in the hive database can be queried through a written spark sql statement) to obtain the queried data.
- next, the data queried from the hive database by spark sql is repartitioned.
- each partition of an RDD (Resilient Distributed Dataset) starts one task, so the number of RDD partitions determines the total number of tasks, which coordinates with spark performance tuning.
- an RDD has the characteristics of a data flow model: automatic fault tolerance, location-aware scheduling, and scalability. RDDs allow users to explicitly cache data in memory across multiple queries, and subsequent queries can reuse these data, which greatly improves query speed. and because spark computes concurrently, each task processes part of the data without causing data loss.
- the rationale for repartitioning is that, without it, the number of partitions of the queried hive table is the same as the number of partitions of the table itself, which cannot improve the degree of parallelism, that is, the number of concurrently executing tasks; after repartitioning, the number of concurrently executing tasks can be increased, so that execution is accelerated.
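The query, repartition, and register steps above can be sketched in Scala as follows; the database name, table name, column names, and the partition count of 200 are assumptions made for illustration.

```scala
import org.apache.spark.sql.SparkSession

object HiveQueryAndRepartition {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hive-to-graph")
      .enableHiveSupport() // needed so spark.sql can reach the Hive metastore
      .getOrCreate()

    // Query the Hive table through a written Spark SQL statement.
    val df = spark.sql("SELECT id, name, friend_id FROM demo_db.person_edges")

    // One task runs per partition, so repartitioning raises the number of
    // concurrently executing tasks beyond the Hive table's own partition count.
    val repartitioned = df.repartition(200)

    // Register the result as a temporary data table for the import step.
    repartitioned.createOrReplaceTempView("tmp_person_edges")
  }
}
```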
- finally, the above-mentioned custom spark udf function is used in combination with the node attribute index created in the graph database to import the temporary data table into the graph database.
- because the hive database is connected through spark, data can be imported directly into the graph database without exporting the data to a csv file, and real-time insertion can be achieved.
- optionally, before registering the custom spark udf function with the graph database program, the method further includes:
- setting the driver that connects to the graph database (taking neo4j as an example) in the custom spark udf function to be written in a static method. this setting reduces the number of times the spark udf function connects to the graph database, thereby reducing resource consumption.
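In Scala, a top-level `object` plays the role of Java's static members, so a driver held in an object is created once per JVM (that is, once per executor) rather than once per UDF call. A sketch under assumed connection details:

```scala
import org.neo4j.driver.{AuthTokens, Driver, GraphDatabase}

// Holding the graph database driver in an object gives it static (per-JVM)
// lifetime: every UDF invocation on the same executor reuses one connection
// pool instead of opening a fresh driver, which reduces resource consumption.
object GraphDriverHolder {
  // Assumption: illustrative connection details only.
  lazy val driver: Driver = GraphDatabase.driver(
    "bolt://localhost:7687", AuthTokens.basic("neo4j", "password"))
}
```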
- the method further includes:
- defining the input and output parameters of the spark udf function; that is, it is necessary to define how many input parameters there are and their types, what the output parameter type is, and the return value, which cannot be null.
- the method further includes:
- after the import is complete, the driver of the graph database and the spark resources need to be closed to reduce resource consumption.
- the query of the hive database using the spark resource includes:
- in order to trigger spark execution, an action operator of spark must be used, because only action operators perform computation.
- in the embodiment of the present invention, the reduce operator is selected among the action operators instead of operators such as collect and show, because operators such as collect and show affect performance, and an operator like show cannot compute all the data. using the reduce operator to trigger spark execution therefore speeds up the data import and ensures that no data is lost.
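A sketch of triggering the import with `reduce`; the UDF name `importRow` and the temporary table `tmp_person_edges` are assumptions carried over from the earlier illustrative steps, not names given in the patent.

```scala
import org.apache.spark.sql.SparkSession

object TriggerImportWithReduce {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("trigger-import").getOrCreate()

    // Assumed: importRow is a registered custom UDF that writes one row into
    // the graph database, and tmp_person_edges is the repartitioned temp table.
    val rows = spark.sql(
      "SELECT importRow(id, name, friend_id) AS ok FROM tmp_person_edges")

    // reduce is an action operator: it forces every partition to be computed,
    // but each task returns only a single Long to the driver, unlike collect
    // (which ships all rows back) or show (which computes only a sample).
    val imported = rows.rdd.map(_ => 1L).reduce(_ + _)
    println(s"imported $imported rows")

    spark.stop()
  }
}
```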
- Fig. 2 is a schematic structural diagram of an apparatus for importing graph database data according to an exemplary embodiment.
- the apparatus includes:
- the connection module, used to register a custom spark udf function with the graph database program, so that the graph database establishes a connection with spark resources through the spark udf function;
- the creation module, used to create a node attribute index in the graph database;
- the query module, used to query the hive database by using the spark resources to obtain the queried data;
- the partition module, used to repartition the queried data and register it as a temporary data table;
- the import module, used to import the temporary data table into the graph database through the spark udf function and the node attribute index.
- the device further includes:
- the driver connection module is used to set the driver for connecting the graph database in the spark udf function to be written in a static method.
- the device further includes:
- the configuration module is used to define the input and output parameters of the spark udf function.
- the device further includes:
- the closing module, used to close the driver of the graph database and the spark resources.
- the query of the hive database by using the spark resources includes: using the reduce operator of the spark resources to perform corresponding calculations on the queried data.
- the method and device for importing graph database data provided by the embodiment of the present invention can realize real-time import of data by combining spark and graph database without exporting the data into csv format;
- the method and device for importing graph database data provided by the embodiments of the present invention utilize spark technology to facilitate spark performance tuning and adjust the speed of data import;
- the method and device for importing graph database data provided by the embodiments of the present invention utilize the concurrent feature of spark to speed up the import of data without losing data.
- in the data import device for a graph database provided by the above embodiment, the division into the above functional modules is only taken as an example; in practical applications, the above functions can be assigned to different functional modules as needed, that is, the internal structure of the device can be divided into different functional modules to complete all or part of the functions described above.
- the data import device for a graph database provided by the above embodiment belongs to the same concept as the data import method for a graph database; its specific implementation process is described in the method embodiments and will not be repeated here.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Software Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
Description
Claims (10)
- A data import method for a graph database, characterized in that the method comprises the following steps: registering a custom spark udf function with the graph database program, so that the graph database establishes a connection with spark resources through the spark udf function; creating a node attribute index in the graph database; querying the hive database by using the spark resources to obtain the queried data; repartitioning the queried data and registering it as a temporary data table; and importing the temporary data table into the graph database through the spark udf function and the node attribute index.
- The data import method for a graph database according to claim 1, characterized in that, before registering the custom spark udf function with the graph database program, the method further comprises: setting the driver that connects to the graph database in the spark udf function to be written in a static method.
- The data import method for a graph database according to claim 1 or 2, characterized in that, before registering the custom spark udf function with the graph database program, the method further comprises: defining the input and output parameters of the spark udf function.
- The data import method for a graph database according to claim 2, characterized in that, after importing the temporary data table into the graph database, the method further comprises: closing the driver of the graph database and the spark resources.
- The data import method for a graph database according to claim 1 or 2, characterized in that the querying of the hive database by using the spark resources comprises: using the reduce operator of the spark resources to perform corresponding calculations on the queried data.
- A data import device for a graph database, characterized in that the device comprises: a connection module, configured to register a custom spark udf function with the graph database program, so that the graph database establishes a connection with spark resources through the spark udf function; a creation module, configured to create a node attribute index in the graph database; a query module, configured to query the hive database by using the spark resources to obtain the queried data; a partition module, configured to repartition the queried data and register it as a temporary data table; and an import module, configured to import the temporary data table into the graph database through the spark udf function and the node attribute index.
- The data import device for a graph database according to claim 6, characterized in that the device further comprises: a driver connection module, configured to set the driver that connects to the graph database in the spark udf function to be written in a static method.
- The data import device for a graph database according to claim 6 or 7, characterized in that the device further comprises: a configuration module, configured to define the input and output parameters of the spark udf function.
- The data import device for a graph database according to claim 7, characterized in that the device further comprises: a closing module, configured to close the driver of the graph database and the spark resources.
- The data import device for a graph database according to claim 6 or 7, characterized in that the querying of the hive database by using the spark resources comprises: using the reduce operator of the spark resources to perform corresponding calculations on the queried data.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CA3176758A CA3176758A1 (en) | 2019-04-09 | 2019-09-29 | Method and apparatus for introducing data to a graph database |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910282923.3A CN110110108B (zh) | 2019-04-09 | 2019-04-09 | 一种图数据库的数据导入方法及装置 |
CN201910282923.3 | 2019-04-09 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2020206952A1 true WO2020206952A1 (zh) | 2020-10-15 |
Family
ID=67485283
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2019/109096 WO2020206952A1 (zh) | 2019-04-09 | 2019-09-29 | 一种图数据库的数据导入方法及装置 |
Country Status (3)
Country | Link |
---|---|
CN (1) | CN110110108B (zh) |
CA (1) | CA3176758A1 (zh) |
WO (1) | WO2020206952A1 (zh) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108596561A (zh) * | 2018-03-29 | 2018-09-28 | 客如云科技(成都)有限责任公司 | 一种基于大数据架构的人效服务系统及方法 |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110110108B (zh) * | 2019-04-09 | 2021-03-30 | 苏宁易购集团股份有限公司 | 一种图数据库的数据导入方法及装置 |
CN112905854A (zh) * | 2021-03-05 | 2021-06-04 | 北京中经惠众科技有限公司 | 数据处理方法、装置、计算设备及存储介质 |
CN112925952A (zh) * | 2021-03-05 | 2021-06-08 | 北京中经惠众科技有限公司 | 数据查询方法、装置、计算设备及存储介质 |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104391957A (zh) * | 2014-12-01 | 2015-03-04 | 浪潮电子信息产业股份有限公司 | 一种针对混合型大数据处理系统的数据交互分析方法 |
CN105528367A (zh) * | 2014-09-30 | 2016-04-27 | 华东师范大学 | 基于开源大数据对时间敏感数据的存储和近实时查询方法 |
CN106528773A (zh) * | 2016-11-07 | 2017-03-22 | 山东首讯信息技术有限公司 | 一种基于Spark平台支持空间数据管理的图计算系统及方法 |
CN109460416A (zh) * | 2018-12-12 | 2019-03-12 | 成都四方伟业软件股份有限公司 | 一种数据处理方法、装置、电子设备及存储介质 |
CN110110108A (zh) * | 2019-04-09 | 2019-08-09 | 苏宁易购集团股份有限公司 | 一种图数据库的数据导入方法及装置 |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105468671B (zh) * | 2015-11-12 | 2019-04-02 | 杭州中奥科技有限公司 | 实现人员关系建模的方法 |
US10409782B2 (en) * | 2016-06-15 | 2019-09-10 | Chen Zhang | Platform, system, process for distributed graph databases and computing |
CN106815353B (zh) * | 2017-01-20 | 2020-02-21 | 星环信息科技(上海)有限公司 | 一种数据查询的方法及设备 |
CN109344268A (zh) * | 2018-08-14 | 2019-02-15 | 北京奇虎科技有限公司 | 图形数据库写入的方法、电子设备及计算机可读存储介质 |
-
2019
- 2019-04-09 CN CN201910282923.3A patent/CN110110108B/zh active Active
- 2019-09-29 WO PCT/CN2019/109096 patent/WO2020206952A1/zh active Application Filing
- 2019-09-29 CA CA3176758A patent/CA3176758A1/en active Pending
Also Published As
Publication number | Publication date |
---|---|
CN110110108A (zh) | 2019-08-09 |
CA3176758A1 (en) | 2020-10-15 |
CN110110108B (zh) | 2021-03-30 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 19924609 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 19924609 Country of ref document: EP Kind code of ref document: A1 |
|
32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205 DATED 22/04/2022) |
|
ENP | Entry into the national phase |
Ref document number: 3176758 Country of ref document: CA |
|