WO2018045753A1 - Method and device for distributed graph computing - Google Patents
Method and device for distributed graph computing
- Publication number
- WO2018045753A1 (PCT application no. PCT/CN2017/080845)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- graph
- data
- computing
- graph algorithm
- distributed
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2228—Indexing structures
- G06F16/2237—Vectors, bitmaps or matrices
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/182—Distributed file systems
Definitions
- the present application relates to the field of computers, and in particular to a technique for distributed graph calculation.
- As the scale of graphs grows, a single-machine, single-threaded graph processing algorithm is limited by system resources and computation time and cannot be guaranteed to run successfully and efficiently. Parallelizing and distributing the graph processing is therefore the way to solve this problem.
- a method for distributed graph computing, comprising: acquiring original graph data; processing the original graph data according to a graph algorithm to obtain regular graph data corresponding to the graph algorithm; and distributing the computing task corresponding to the graph algorithm to a plurality of computing nodes for execution, wherein a persistence operation is performed when a persistence condition is met during execution.
- an apparatus for distributed graph computing comprising:
- a first device configured to acquire original graph data;
- a second device configured to process the original graph data according to a graph algorithm to obtain regular graph data corresponding to the graph algorithm;
- a third device configured to distribute the computing task corresponding to the graph algorithm to a plurality of computing nodes for execution, wherein a persistence operation is performed when the persistence condition is met during execution.
- Compared with the prior art, the present application first acquires original graph data, then processes the original graph data according to a graph algorithm to obtain regular graph data corresponding to the graph algorithm, so as to adapt to different types of graph algorithms, and then distributes the computing task corresponding to the graph algorithm to multiple computing nodes for execution.
- When the persistence condition is met during execution, a persistence operation is performed, which cuts off data dependencies, reduces repeated computation, and improves processing efficiency.
- the present application performs a merge operation on the graph data before performing the aggregation and join operations, thereby improving computation efficiency and reducing network transmission pressure.
- the present application employs a data serialization and deserialization method to facilitate the transfer, between computing nodes, of intermediate data generated during computation.
- the present application enables a graph algorithm to be launched via an SQL statement and, by improving the processing logic, ensures that the data entering the graph algorithm is complete graph data.
- FIG. 1 shows a flow chart of a method for distributed graph calculations in accordance with an aspect of the present application
- FIG. 2 is a schematic diagram of distributing a computing task corresponding to a graph algorithm to a plurality of computing nodes according to a preferred embodiment of the present application;
- FIG. 3 shows a flow chart of a method for distributed graph calculation in accordance with another preferred embodiment of the present application.
- FIG. 4 shows a schematic diagram of an apparatus for distributed graph computing in accordance with another aspect of the present application.
- FIG. 5 shows a schematic diagram of an apparatus for distributed graph calculation in accordance with yet another preferred embodiment of the present application.
- In a typical configuration of the present application, the terminal, the device of the service network, and the trusted party each include one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
- the memory may include non-persistent memory, random access memory (RAM), and/or non-volatile memory in a computer readable medium, such as read only memory (ROM) or flash memory.
- Memory is an example of a computer readable medium.
- Computer readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology.
- the information can be computer readable instructions, data structures, modules of programs, or other data.
- Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technologies, compact disc read only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
- As defined herein, computer readable media do not include transitory media, such as modulated data signals and carrier waves.
- FIG. 1 shows a flow chart of a method for distributed graph computing in accordance with an aspect of the present application, wherein the method includes step S11, step S12, and step S13.
- Specifically, in step S11, the device 1 acquires original graph data; in step S12, the device 1 processes the original graph data according to a graph algorithm to obtain regular graph data corresponding to the graph algorithm; in step S13, the device 1 distributes the computing task corresponding to the graph algorithm to a plurality of computing nodes for execution, wherein a persistence operation is performed when the persistence condition is met during execution.
- the device 1 includes, but is not limited to, a user equipment, a network device, or a device formed by integrating a user equipment and a network device through a network.
- the user equipment includes, but is not limited to, any mobile electronic product that can interact with a user through a touchpad, such as a smart phone, a tablet computer, a notebook computer, etc., and the mobile electronic product can adopt any operating system, such as Android operating system, iOS operating system, etc.
- the network device includes an electronic device capable of automatically performing numerical calculation and information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an application specific integrated circuit (ASIC), a programmable gate array (FPGA), a digital signal processor (DSP), an embedded device, and the like.
- the network device includes, but is not limited to, a computer, a network host, a single network server, a set of multiple network servers, or a cloud consisting of multiple servers; here, the cloud is composed of a large number of computers or network servers based on cloud computing, where cloud computing is a kind of distributed computing: a virtual supercomputer consisting of a group of loosely coupled computers.
- the network includes, but is not limited to, the Internet, a wide area network, a metropolitan area network, a local area network, a VPN network, a wireless ad hoc network (Ad Hoc network), and the like.
- Preferably, the device 1 may also be a script program running on the user equipment, the network device, or a device formed by integrating the user equipment with the network device, or the network device with a touch terminal, through a network. Of course, those skilled in the art should understand that the above device 1 is only an example; other existing or future forms of the device 1, if applicable to the present application, should also be included within the protection scope of the present application and are hereby incorporated by reference.
- In step S11, the device 1 acquires original graph data.
- Here, the original graph data includes vertex data and edge data of the graph; the edge data may include information on the starting point and the arrival point, and may also include any information required by the graph algorithm; if the graph is a weighted graph, the edge data also carries weight data.
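- For illustration only, the following Scala sketch shows one possible shape for such original graph data; the field names and the optional weight field are assumptions, not a schema prescribed by the present application.

```scala
// Minimal sketch of the kind of records the original graph data could contain.
// The field names and the weighted-edge layout are illustrative assumptions.
case class Vertex(id: Long, attributes: Map[String, String] = Map.empty)

case class Edge(
  src: Long,                     // starting point (source vertex id)
  dst: Long,                     // arrival point (destination vertex id)
  weight: Option[Double] = None  // present only for weighted graphs
)

object RawGraphExample {
  def main(args: Array[String]): Unit = {
    val vertices = Seq(Vertex(1L), Vertex(2L), Vertex(3L))
    val edges = Seq(Edge(1L, 2L, Some(0.5)), Edge(2L, 3L), Edge(3L, 1L, Some(1.2)))
    println(s"${vertices.size} vertices, ${edges.size} edges")
  }
}
```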
- In step S12, the device 1 processes the original graph data according to a graph algorithm to obtain regular graph data corresponding to the graph algorithm.
- graph algorithms often require some parameters to control key information such as precision and number of operations.
- the parameters may be different for different graph algorithms.
- By processing the original graph data, corresponding regular graph data is obtained so as to adapt to different kinds of graph algorithms.
- Preferably, in step S12, the device 1 also stores the regular graph data in a distributed file system.
- the distributed file system may include a Hadoop Distributed File System (HDFS); in order to increase the degree of parallelism of processing, in a preferred embodiment, the present application stores graph data in a Hadoop distributed file system.
- Of course, the Hadoop distributed file system is only an example; other existing or future distributed file systems, if applicable to the present application, should also be included within the scope of the present application and are hereby incorporated by reference.
- In a preferred embodiment, the Hadoop distributed file system is used for storage, and Hive is used as the interaction tool; in practical application scenarios, in addition to the data, some Hive configuration usually also needs to be passed to the computing nodes.
- Hive is a data warehousing tool based on Hadoop.
- Hive can apply SQL language to big data scenarios, which is compatible with traditional data applications and shields complex distributed programming details.
- Hive supports a variety of computing engines; among them, Spark, as a computing engine, has rich computing models and operators and can be used to implement graph algorithms.
- Preferably, in step S12, the device 1 further performs type checking on the regular graph data according to the graph algorithm.
- For example, before data enters the graph algorithm, type checking is required to prevent erroneous data from causing algorithm errors. Specifically, the regular graph data may first be split into fields, and a column type check may then be performed.
- In a preferred embodiment, the structure type checker StandardStructObjectInspector of the input data is retrieved from Hive by the GraphOperator operator; this structure type checker contains an element type checker (ObjectInspector) for each field.
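- The following Scala sketch illustrates the kind of per-field type check described above using Hive's ObjectInspector classes; the (src, dst, weight) edge layout is an assumed example, and the snippet is not the application's GraphOperator code.

```scala
import java.util.Arrays
import org.apache.hadoop.hive.serde2.objectinspector.{ObjectInspector, ObjectInspectorFactory}
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory

object TypeCheckSketch {
  def main(args: Array[String]): Unit = {
    // Struct inspector for an assumed (src, dst, weight) edge layout.
    val fieldNames = Arrays.asList("src", "dst", "weight")
    val fieldOIs = Arrays.asList[ObjectInspector](
      PrimitiveObjectInspectorFactory.javaLongObjectInspector,
      PrimitiveObjectInspectorFactory.javaLongObjectInspector,
      PrimitiveObjectInspectorFactory.javaDoubleObjectInspector)
    val structOI = ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs)

    // One row of regular graph data, already split into fields.
    val row: AnyRef = Arrays.asList[AnyRef](Long.box(1L), Long.box(2L), Double.box(0.5))

    // Column-by-column check: the struct inspector holds an element ObjectInspector per field.
    val it = structOI.getAllStructFieldRefs.iterator()
    while (it.hasNext) {
      val field = it.next()
      val value = structOI.getStructFieldData(row, field)
      println(s"${field.getFieldName}: ${field.getFieldObjectInspector.getTypeName} -> $value")
    }
  }
}
```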
- In step S13, the device 1 distributes the computing task corresponding to the graph algorithm to a plurality of computing nodes for execution, wherein a persistence operation is performed when the persistence condition is satisfied during execution.
- In a preferred embodiment, in the process of distributing computing tasks, in order to improve processing efficiency, each computing node is allocated, as far as possible, on the HDFS node where the graph data is stored.
- When the computation process is complex and time-consuming, saving intermediate results through the persistence operation can cut off data dependencies and reduce repeated computation.
- the device 1 creates a plurality of computing nodes through the resource management framework for executing computing tasks corresponding to the graph algorithm.
- the resource management framework may include Yarn.
- Referring to FIG. 2, a plurality of computing nodes are created by the resource management framework Yarn for the computing task corresponding to the graph algorithm.
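- As a rough sketch of requesting computing nodes from Yarn when Spark is the engine, the configuration below is illustrative only; the executor counts, memory sizes, and queue name are placeholder assumptions, and in practice such settings are usually supplied via spark-submit rather than hard-coded.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object YarnSubmitSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("distributed-graph-computing")
      .set("spark.executor.instances", "8")  // number of computing nodes requested from Yarn (placeholder)
      .set("spark.executor.cores", "4")
      .set("spark.executor.memory", "8g")
      .set("spark.yarn.queue", "default")

    // The master is usually supplied as `--master yarn` to spark-submit; it is set
    // here only to keep the sketch self-contained.
    val spark = SparkSession.builder().config(conf).master("yarn").getOrCreate()
    println(s"Running with ${spark.sparkContext.defaultParallelism} default partitions")
    spark.stop()
  }
}
```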
- the device 1 distributes the computing tasks corresponding to the graph algorithm to a plurality of computing nodes in the distributed computing framework for execution.
- For example, the distributed computing framework may include Spark; referring to FIG. 2, the distributed computing framework Spark is used as the computing engine, and since its data computation follows a lazy evaluation model, it is better suited to graph computations of high complexity.
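- The following minimal Scala example illustrates the lazy evaluation model mentioned above: transformations only record lineage, and nothing is computed until an action runs, which is why long graph pipelines benefit from explicit persistence points.

```scala
import org.apache.spark.sql.SparkSession

object LazyEvalSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("lazy-eval").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val edges = sc.parallelize(Seq((1L, 2L), (2L, 3L), (3L, 1L)))
    // Transformations: recorded in the lineage, not executed yet.
    val outDegrees = edges.map { case (src, _) => (src, 1L) }.reduceByKey(_ + _)
    // Action: triggers the whole chain of transformations.
    outDegrees.collect().foreach { case (v, d) => println(s"vertex $v has out-degree $d") }

    spark.stop()
  }
}
```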
- the persistence condition comprises at least one of the following: the computation time of a resilient distributed dataset of the distributed computing framework reaches a corresponding duration threshold; the current dependency chain length of a resilient distributed dataset of the distributed computing framework reaches a corresponding length threshold.
- For example, if the distributed computing framework Spark is used, Spark's Resilient Distributed Datasets (RDDs) are used during computation.
- When a GraphRDD has a long computation time or a long dependency chain (for example, the duration threshold for computation time can be set to 10 minutes, so that the condition is met when the computation of the GraphRDD takes 10 minutes), the persistence operation is performed: the data and the element type checker ObjectInspector are written together to the local disk, and the corresponding BlockId is reported to the Spark Driver.
- When processing graph algorithms that require multiple rounds of iteration, in order to avoid data loss caused by computing node failures, the persistence operation can also write the data to the Hadoop Distributed File System (HDFS).
- the persistence operation comprises at least one of: storing a current calculation result; clearing a current dependency.
- the persistence operation can save calculation results, clear dependencies, reduce the computational cost of some complex transformations that are used repeatedly, and provide fault tolerance.
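- A minimal Spark sketch of these two behaviours is shown below; the checkpoint directory, the iteration loop, and the trigger every five rounds are illustrative assumptions standing in for the persistence condition described above.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object PersistenceSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("persistence").master("local[*]").getOrCreate()
    val sc = spark.sparkContext
    sc.setCheckpointDir("hdfs:///tmp/graph-checkpoints") // assumed checkpoint location

    var graphRdd = sc.parallelize(1L to 100000L).map(v => (v, v % 97))
    for (iteration <- 1 to 20) {
      graphRdd = graphRdd.mapValues(_ + 1) // stand-in for one round of graph computation

      // Persistence condition: e.g. the dependency chain has grown too long.
      if (iteration % 5 == 0) {
        graphRdd.persist(StorageLevel.MEMORY_AND_DISK) // store the current result
        graphRdd.checkpoint()                          // clear the current dependencies
        graphRdd.count()                               // action that materializes the checkpoint
      }
    }
    println(graphRdd.toDebugString) // lineage is truncated at the last checkpoint
    spark.stop()
  }
}
```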
- Preferably, in step S13, the device 1 also performs an aggregation operation and a join operation on the regular graph data having the same key.
- For example, one or several columns of the graph data are used as the key, and groupBy and join operations are performed; all data with the same key are processed by one computing node, so there will be a large amount of data transfer between computing nodes.
- Specifically, the specified fields of the data are selected through the GraphRDD, these fields are serialized into a key, data with the same key are merged through the aggregation and join operations, and different computations are applied depending on the type of graph algorithm.
- To reduce network transmission pressure, the data is first merged once on each computing node by the aggregation operation, and the merged result is then transmitted to the other computing nodes according to the key.
- In a preferred implementation, an optimized data structure and optimization strategy can be used to improve join efficiency.
- When two GraphRDDs with huge data volumes are joined, great pressure is placed on memory.
- In this embodiment, the data structure used spills data to disk when memory resources are tight, thereby avoiding out-of-memory problems.
- When a GraphRDD with a very small data volume is joined with a GraphRDD with a very large data volume, a join optimization strategy of copying the smaller GraphRDD to every computing node is adopted, which speeds up the join and also reduces network pressure.
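- The following Scala sketch illustrates the broadcast-style join optimization described above, assuming a hypothetical small vertex-tag table; it is not the application's own join implementation.

```scala
import org.apache.spark.sql.SparkSession

object BroadcastJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("broadcast-join").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val largeEdges = sc.parallelize(Seq((1L, 2L), (2L, 3L), (3L, 1L))) // (src, dst)
    val smallVertexTags = Map(1L -> "seed", 3L -> "seed")              // tiny lookup table (assumed)

    // Copy the small dataset to every computing node instead of shuffling the large one.
    val tagsBroadcast = sc.broadcast(smallVertexTags)
    val tagged = largeEdges.mapPartitions { iter =>
      val tags = tagsBroadcast.value
      iter.map { case (src, dst) => (src, dst, tags.getOrElse(src, "none")) }
    }
    tagged.collect().foreach(println)
    spark.stop()
  }
}
```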
- Preferably, performing the aggregation operation and the join operation on the regular graph data having the same key further comprises: performing a merge operation on the regular graph data on each of the computing nodes before performing the aggregation operation.
- Here, performing the data merge operation on the current computing node before the aggregation and join operations reduces the amount of data transmitted over the network and improves computation efficiency, thereby relieving network transmission pressure.
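- As a sketch of this merge-before-aggregation idea in Spark, the example below uses reduceByKey, which combines values on each node before anything is shuffled by key (unlike groupByKey, which ships every record); the serialized key format is an assumption.

```scala
import org.apache.spark.sql.SparkSession

object MapSideCombineSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("map-side-combine").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // (key, value) pairs: the key is assumed to be a serialized selection of edge fields.
    val keyed = sc.parallelize(Seq(("1|2", 1L), ("1|2", 1L), ("2|3", 1L), ("3|1", 1L)))

    // Values are merged once per node, then the partial results are sent to the node owning the key.
    val combined = keyed.reduceByKey(_ + _)

    combined.collect().foreach { case (k, n) => println(s"key $k -> $n records merged") }
    spark.stop()
  }
}
```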
- Preferably, in step S13, when a computing node acquires intermediate data, the device 1 first deserializes the intermediate data, processes the deserialized intermediate data according to the graph algorithm, and then serializes the intermediate data that has been processed according to the graph algorithm.
- For example, various kinds of intermediate data are generated during graph computation; in order to reduce the CPU and memory overhead of parsing data types on the computing nodes, this embodiment adopts a type-check-based data serialization and deserialization method, in which the data type parser is passed to the computing nodes along with the data.
- Specifically, the GraphOperator combines the raw data and the element type checker ObjectInspector into a GraphRDD, which serves as the input data for each graph algorithm operator.
- When a shuffle operation is involved, each datum is serialized using the ObjectInspector and deserialized on the other computing nodes.
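- The sketch below illustrates the general idea of shipping a type descriptor together with the data so the receiving node knows how to interpret each field; the FieldType and TypedRow classes are hypothetical stand-ins for the ObjectInspector carried in the GraphRDD, not classes used by the application.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

object TypedSerdeSketch {
  sealed trait FieldType extends Serializable
  case object LongField extends FieldType
  case object DoubleField extends FieldType

  // Hypothetical container: the type descriptor travels together with the values.
  case class TypedRow(types: Seq[FieldType], values: Seq[Any])

  def serialize(row: TypedRow): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(row) // descriptor and values are written as one unit
    out.close()
    bytes.toByteArray
  }

  def deserialize(data: Array[Byte]): TypedRow = {
    val in = new ObjectInputStream(new ByteArrayInputStream(data))
    in.readObject().asInstanceOf[TypedRow]
  }

  def main(args: Array[String]): Unit = {
    val row = TypedRow(Seq(LongField, LongField, DoubleField), Seq(1L, 2L, 0.5))
    val restored = deserialize(serialize(row))
    println(restored) // the receiving node can check restored.types before using the values
  }
}
```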
- Preferably, referring to FIG. 3, the method further includes step S14' and step S15'; in step S14', the device 1 acquires an SQL statement to be executed; in step S15', the device 1 parses the SQL statement to invoke the corresponding graph algorithm.
- In the prior art, graph algorithms cannot be implemented through SQL because they involve complex computation processes and a large number of iterations. In this embodiment, however, the distributed computing framework Spark is used as the computing engine, and many graph algorithms are integrated into Hive as custom functions; the graph algorithms can thus be organically combined with other SQL statements, reducing processing difficulty.
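- A hedged sketch of launching a graph algorithm from SQL is shown below; the jar path, function name, implementation class, and table and column names are all hypothetical, and only the general register-then-SELECT flow follows the description above.

```scala
import org.apache.spark.sql.SparkSession

object SqlLaunchSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("sql-launch")
      .enableHiveSupport() // requires a Spark build with Hive support
      .getOrCreate()

    spark.sql("ADD JAR hdfs:///libs/graph-operators.jar")                              // hypothetical jar
    spark.sql("CREATE TEMPORARY FUNCTION page_rank AS 'com.example.graph.PageRankUDTF'") // hypothetical class
    // The graph algorithm consumes the edge table and emits (vertex, rank) rows.
    spark.sql("SELECT page_rank(src, dst) FROM edge_table").show()

    spark.stop()
  }
}
```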
- Preferably, in step S15', the device 1 registers a plurality of graph algorithms as custom functions, wherein each graph algorithm corresponds to one registered function.
- For example, Hive's UDTF (User Defined Table-Generating Function) mechanism can be used to register the implementation class names of the graph algorithms with Hive, so that a graph algorithm can be started through an SQL statement.
- Here, UDTF is an interface designed by Hive for users to add custom functions; through the UDTF's process method, a user obtains one row of input and converts it into one or more rows of output.
- the UDTF's "one-line input, multi-line output" model does not meet the needs of graph calculations.
- the data entering the graph algorithm is complete graph data by adding new processing logic on the basis of the UDTF.
- a function can be registered for each graph algorithm using the UDTF interface.
- This embodiment implements a UDTF-based Operator operator to solve the problem of graph calculation.
- a GraphOperator operator is first implemented as the base class for all graph algorithm operators. GraphOperator inherits the UDTF interface, so you can register different graph algorithms into Hive through the RegisterGenericUDTF method of FunctionRegistry.
- Hive's TableScanOperator operator and UDTFOperator operator are modified.
- the UDTFOperator operator takes the input data encapsulated as an RDD from the TableScanOperator operator and passes it to the GraphOperator operator.
- Each graph algorithm operator that inherits from GraphOperator can access the complete graph data.
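- For illustration, the skeleton below shows a UDTF-style operator in Scala using Hive's GenericUDTF interface (initialize / process / forward / close); it is a simplified stand-in, not the application's GraphOperator, and omits the added logic that buffers the complete graph as an RDD.

```scala
import java.util.Arrays
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF
import org.apache.hadoop.hive.serde2.objectinspector.{ObjectInspector, ObjectInspectorFactory, StructObjectInspector}
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory

// Toy operator: counts the out-degree of each vertex from (src, dst) input rows.
class DegreeCountUDTF extends GenericUDTF {

  private val degrees = scala.collection.mutable.Map.empty[Long, Long]

  override def initialize(argOIs: Array[ObjectInspector]): StructObjectInspector = {
    // Output schema: (vertex BIGINT, degree BIGINT)
    ObjectInspectorFactory.getStandardStructObjectInspector(
      Arrays.asList("vertex", "degree"),
      Arrays.asList[ObjectInspector](
        PrimitiveObjectInspectorFactory.javaLongObjectInspector,
        PrimitiveObjectInspectorFactory.javaLongObjectInspector))
  }

  // Called once per input row; each row is assumed to be an edge (src, dst).
  override def process(args: Array[AnyRef]): Unit = {
    val src = args(0).toString.toLong // simplified parsing for the sketch
    degrees(src) = degrees.getOrElse(src, 0L) + 1L
  }

  // All rows have been seen: emit one output row per vertex.
  override def close(): Unit = {
    degrees.foreach { case (v, d) => forward(Array[AnyRef](Long.box(v), Long.box(d))) }
  }
}
```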
- FIG. 4 shows an apparatus 1 for distributed graph calculation in accordance with another aspect of the present application, wherein the apparatus 1 includes a first device 11, a second device 12, and a third device 13.
- Specifically, the first device 11 acquires original graph data; the second device 12 processes the original graph data according to a graph algorithm to obtain regular graph data corresponding to the graph algorithm; and the third device 13 distributes the computing task corresponding to the graph algorithm to a plurality of computing nodes for execution, wherein a persistence operation is performed when the persistence condition is satisfied during execution.
- the device 1 includes, but is not limited to, a user equipment, a network device, or a device formed by integrating a user equipment and a network device through a network.
- the user equipment includes, but is not limited to, any mobile electronic product that can interact with a user through a touchpad, such as a smart phone, a tablet computer, a notebook computer, etc., and the mobile electronic product can adopt any operating system, such as Android operating system, iOS operating system, etc.
- the network device includes an electronic device capable of automatically performing numerical calculation and information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an application specific integrated circuit (ASIC), a programmable gate array (FPGA), a digital signal processor (DSP), an embedded device, and the like.
- the network device includes, but is not limited to, a computer, a network host, a single network server, a set of multiple network servers, or a cloud consisting of multiple servers; here, the cloud is composed of a large number of computers or network servers based on cloud computing, where cloud computing is a kind of distributed computing: a virtual supercomputer consisting of a group of loosely coupled computers.
- the network includes, but is not limited to, the Internet, a wide area network, a metropolitan area network, a local area network, a VPN network, a wireless ad hoc network (Ad Hoc network), and the like.
- Preferably, the device 1 may also be a script program running on the user equipment, the network device, or a device formed by integrating the user equipment with the network device, or the network device with a touch terminal, through a network.
- Of course, those skilled in the art should understand that the above device 1 is only an example; other existing or future forms of the device 1, if applicable to the present application, should also be included within the protection scope of the present application and are hereby incorporated by reference.
- The first device 11 acquires original graph data.
- Here, the original graph data includes vertex data and edge data of the graph; the edge data may include information on the starting point and the arrival point, and may also include any information required by the graph algorithm; if the graph is a weighted graph, the edge data also carries weight data.
- The second device 12 processes the original graph data according to a graph algorithm to obtain regular graph data corresponding to the graph algorithm.
- graph algorithms often require some parameters to control key information such as precision and number of operations.
- the parameters may be different for different graph algorithms.
- By processing the original graph data, corresponding regular graph data is obtained so as to adapt to different kinds of graph algorithms.
- Preferably, the second device 12 also stores the regular graph data in a distributed file system.
- the distributed file system may include a Hadoop Distributed File System (HDFS); in order to increase the degree of parallelism of processing, in a preferred embodiment, the present application stores graph data in a Hadoop distributed file system.
- Of course, the Hadoop distributed file system is only an example; other existing or future distributed file systems, if applicable to the present application, should also be included within the scope of the present application and are hereby incorporated by reference.
- In a preferred embodiment, the Hadoop distributed file system is used for storage, and Hive is used as the interaction tool; in practical application scenarios, in addition to the data, some Hive configuration usually also needs to be passed to the computing nodes.
- Hive is a data warehousing tool based on Hadoop.
- Hive can apply SQL language to big data scenarios, which is compatible with traditional data applications and shields complex distributed programming details.
- Hive supports a variety of computing engines; among them, Spark, as a computing engine, has rich computing models and operators and can be used to implement graph algorithms.
- Preferably, the second device 12 further performs type checking on the regular graph data according to the graph algorithm.
- For example, before data enters the graph algorithm, type checking is required to prevent erroneous data from causing algorithm errors. Specifically, the regular graph data may first be split into fields, and a column type check may then be performed.
- In a preferred embodiment, the structure type checker StandardStructObjectInspector of the input data is retrieved from Hive by the GraphOperator operator; this structure type checker contains an element type checker (ObjectInspector) for each field.
- the third device 13 distributes the computing task corresponding to the graph algorithm to a plurality of computing nodes for execution, wherein the persistence operation is performed when the persistence condition is satisfied during the execution.
- In a preferred embodiment, in the process of distributing computing tasks, in order to improve processing efficiency, each computing node is allocated, as far as possible, on the HDFS node where the graph data is stored.
- When the computation process is complex and time-consuming, saving intermediate results through the persistence operation can cut off data dependencies and reduce repeated computation.
- the third device 13 creates a plurality of computing nodes through the resource management framework for executing computing tasks corresponding to the graph algorithm.
- the resource management framework may include Yarn.
- Referring to FIG. 2, a plurality of computing nodes are created by the resource management framework Yarn for the computing task corresponding to the graph algorithm.
- the third device 13 distributes the computing tasks corresponding to the graph algorithm to a plurality of computing nodes in the distributed computing framework for execution.
- For example, the distributed computing framework may include Spark; referring to FIG. 2, the distributed computing framework Spark is used as the computing engine, and since its data computation follows a lazy evaluation model, it is better suited to graph computations of high complexity.
- the persistence condition comprises at least one of the following: the computation time of a resilient distributed dataset of the distributed computing framework reaches a corresponding duration threshold; the current dependency chain length of a resilient distributed dataset of the distributed computing framework reaches a corresponding length threshold.
- For example, if the distributed computing framework Spark is used, Spark's Resilient Distributed Datasets (RDDs) are used during computation.
- When a GraphRDD has a long computation time or a long dependency chain (for example, the duration threshold for computation time can be set to 10 minutes, so that the condition is met when the computation of the GraphRDD takes 10 minutes), the persistence operation is performed: the data and the element type checker ObjectInspector are written together to the local disk, and the corresponding BlockId is reported to the Spark Driver.
- When processing graph algorithms that require multiple rounds of iteration, in order to avoid data loss caused by computing node failures, the persistence operation can also write the data to the Hadoop Distributed File System (HDFS).
- the persistence operation comprises at least one of: storing a current calculation result; clearing a current dependency.
- the persistence operation can save calculation results, clear dependencies, reduce the computational cost of some complex transformations that are used repeatedly, and provide fault tolerance.
- Preferably, the third device 13 also performs an aggregation operation and a join operation on the regular graph data having the same key.
- For example, one or several columns of the graph data are used as the key, and groupBy and join operations are performed; all data with the same key are processed by one computing node, so there will be a large amount of data transfer between computing nodes.
- Specifically, the specified fields of the data are selected through the GraphRDD, these fields are serialized into a key, data with the same key are merged through the aggregation and join operations, and different computations are applied depending on the type of graph algorithm.
- To reduce network transmission pressure, the data is first merged once on each computing node by the aggregation operation, and the merged result is then transmitted to the other computing nodes according to the key.
- In a preferred implementation, an optimized data structure and optimization strategy can be used to improve join efficiency.
- When two GraphRDDs with huge data volumes are joined, great pressure is placed on memory.
- In this embodiment, the data structure used spills data to disk when memory resources are tight, thereby avoiding out-of-memory problems.
- When a GraphRDD with a very small data volume is joined with a GraphRDD with a very large data volume, a join optimization strategy of copying the smaller GraphRDD to every computing node is adopted, which speeds up the join and also reduces network pressure.
- Preferably, performing the aggregation operation and the join operation on the regular graph data having the same key further comprises: performing a merge operation on the regular graph data on each of the computing nodes before performing the aggregation operation.
- Here, performing the data merge operation on the current computing node before the aggregation and join operations reduces the amount of data transmitted over the network and improves computation efficiency, thereby relieving network transmission pressure.
- Preferably, when a computing node acquires intermediate data, the third device 13 first deserializes the intermediate data, processes the deserialized intermediate data according to the graph algorithm, and then serializes the intermediate data that has been processed according to the graph algorithm.
- For example, various kinds of intermediate data are generated during graph computation; in order to reduce the CPU and memory overhead of parsing data types on the computing nodes, this embodiment adopts a type-check-based data serialization and deserialization method, in which the data type parser is passed to the computing nodes along with the data.
- Specifically, the GraphOperator combines the raw data and the element type checker ObjectInspector into a GraphRDD, which serves as the input data for each graph algorithm operator.
- When a shuffle operation is involved, each datum is serialized using the ObjectInspector and deserialized on the other computing nodes.
- Preferably, referring to FIG. 5, the device 1 further includes a fourth device 14' and a fifth device 15'; the fourth device 14' acquires an SQL statement to be executed; the fifth device 15' parses the SQL statement to invoke the corresponding graph algorithm.
- In the prior art, graph algorithms cannot be implemented through SQL because they involve complex computation processes and a large number of iterations. In this embodiment, however, the distributed computing framework Spark is used as the computing engine, and many graph algorithms are integrated into Hive as custom functions; the graph algorithms can thus be organically combined with other SQL statements, reducing processing difficulty.
- Preferably, the fifth device 15' registers a plurality of graph algorithms as custom functions, wherein each graph algorithm corresponds to one registered function.
- For example, Hive's UDTF (User Defined Table-Generating Function) mechanism can be used to register the implementation class names of the graph algorithms with Hive, so that a graph algorithm can be started through an SQL statement.
- Here, UDTF is an interface designed by Hive for users to add custom functions; through the UDTF's process method, a user obtains one row of input and converts it into one or more rows of output.
- the UDTF's "one-line input, multi-line output" model does not meet the needs of graph calculations.
- the data entering the graph algorithm is complete graph data by adding new processing logic on the basis of the UDTF.
- a function can be registered for each graph algorithm using the UDTF interface.
- This embodiment implements a UDTF-based Operator operator to solve the problem of graph calculation.
- a GraphOperator operator is first implemented as the base class for all graph algorithm operators. GraphOperator inherits the UDTF interface, so you can register different graph algorithms into Hive through the RegisterGenericUDTF method of FunctionRegistry.
- Hive's TableScanOperator operator and UDTFOperator operator are modified.
- the UDTFOperator operator takes the input data encapsulated as an RDD from the TableScanOperator operator and passes it to the GraphOperator operator.
- Each graph algorithm operator that inherits from GraphOperator can access the complete graph data.
- Compared with the prior art, the present application first acquires original graph data, then processes the original graph data according to a graph algorithm to obtain regular graph data corresponding to the graph algorithm, so as to adapt to different types of graph algorithms, and then distributes the computing task corresponding to the graph algorithm to a plurality of computing nodes for execution.
- When the persistence condition is met during execution, a persistence operation is performed, which cuts off data dependencies, reduces repeated computation, and improves processing efficiency.
- Further, the present application performs a merge operation on the graph data before performing the aggregation and join operations, thereby improving computation efficiency and reducing network transmission pressure.
- Further, the present application employs a data serialization and deserialization method to facilitate the transfer, between computing nodes, of intermediate data generated during computation. Further, the present application enables a graph algorithm to be launched via an SQL statement and, by improving the processing logic, ensures that the data entering the graph algorithm is complete graph data.
- the present application can be implemented in software and/or a combination of software and hardware, for example, using an application specific integrated circuit (ASIC), a general purpose computer, or any other similar hardware device.
- the software program of the present application can be executed by a processor to implement the steps or functions described above.
- the software programs (including related data structures) of the present application can be stored in a computer readable recording medium such as a RAM memory, a magnetic or optical drive or a floppy disk and the like.
- some of the steps or functions of the present application may be implemented in hardware, for example, as a circuit that cooperates with a processor to perform various steps or functions.
- a portion of the present application can be applied as a computer program product, such as computer program instructions, which, when executed by a computer, can invoke or provide a method and/or technical solution in accordance with the present application.
- the program instructions for invoking the method of the present application may be stored in a fixed or removable recording medium, and/or transmitted by a data stream in a broadcast or other signal bearing medium, and/or stored in the working memory of a computer device that runs according to the program instructions.
- an embodiment in accordance with the present application includes a device comprising a memory for storing computer program instructions and a processor for executing the program instructions, wherein, when the computer program instructions are executed by the processor, the device is triggered to operate based on the methods and/or technical solutions according to the foregoing embodiments of the present application.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Software Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
A method and device for distributed graph computing. The method first acquires original graph data (S11), then processes the original graph data according to a graph algorithm to obtain regular graph data corresponding to the graph algorithm (S12), so as to adapt to different kinds of graph algorithms, and then distributes the computing task corresponding to the graph algorithm to multiple computing nodes for execution, wherein, when a persistence condition is met during execution, a persistence operation is performed (S13), cutting off data dependencies, reducing repeated computation, and improving processing efficiency. Further, before performing the aggregation and join operations on the graph data, the method performs a merge operation on the data, thereby improving computation efficiency and relieving network transmission pressure. Further, the method employs a data serialization and deserialization approach so that intermediate data generated during computation can be transferred between computing nodes.
Description
本申请涉及计算机领域,尤其涉及一种用于分布式图计算的技术。
随着图规模的膨胀,单机、单线程的图处理算法受到系统资源和计算时间的限制,无法保证算法成功而有效地运行。因此将图处理过程并行化、分布式化是解决问题的途径。
发明内容
本申请的目的是提供一种用于分布式图计算的方法与设备。
根据本申请的一个方面,提供了一种用于分布式图计算的方法,其中,所述方法包括:
获取原始图数据;
根据图算法,处理所述原始图数据以获得所述图算法对应的规整图数据;
将所述图算法对应的计算任务分发至多个计算节点执行,其中,在执行过程中当满足持久化条件,进行持久化操作。
根据本申请的另一个方面,提供了一种用于分布式图计算的设备,其中,所述设备包括:
第一装置,用于获取原始图数据;
第二装置,用于根据图算法,处理所述原始图数据以获得所述图算法对应的规整图数据;
第三装置,用于将所述图算法对应的计算任务分发至多个计算节点执行,其中,在执行过程中当满足持久化条件,进行持久化操作。
与现有技术相比,本申请先获取原始图数据,然后根据图算法处理所述原始图数据以获得所述图算法对应的规整图数据,以便于适配不同种类的图算法,接着将所述图算法对应的计算任务分发至多个计算节点执行,其中,在执行过程中当满足持久化条件,进行持久化操作,切断数据依赖,减少重
复计算量,提高处理效率。进一步地,本申请在对图数据进行聚合操作及连接操作之前,先对其进行合并操作,从而提高运算效率,减轻网络传输压力。进一步地,本申请采用一种数据序列化和反序列化的方法,以便于计算过程中的产生的中间数据在计算节点之间传递。进一步地,本申请实现了通过SQL语句启动图算法,并且通过改进处理逻辑,使得进入图算法的数据是完整的图数据。
通过阅读参照以下附图所作的对非限制性实施例所作的详细描述,本发明的其它特征、目的和优点将会变得更明显:
图1示出根据本申请一个方面的一种用于分布式图计算的方法流程图;
图2示出根据本申请一个优选实施例的一种将图算法对应的计算任务分发至多个计算节点的示意图;
图3示出根据本申请另一个优选实施例的一种用于分布式图计算的方法流程图;
图4示出根据本申请另一个方面的一种用于分布式图计算的设备示意图;
图5示出根据本申请又一个优选实施例的一种用于分布式图计算的设备示意图。
附图中相同或相似的附图标记代表相同或相似的部件。
下面结合附图对本发明作进一步详细描述。
在本申请一个典型的配置中,终端、服务网络的设备和可信方均包括一个或多个处理器(CPU)、输入/输出接口、网络接口和内存。
内存可能包括计算机可读介质中的非永久性存储器,随机存取存储器(RAM)和/或非易失性内存等形式,如只读存储器(ROM)或闪存(flash RAM)。内存是计算机可读介质的示例。
计算机可读介质包括永久性和非永久性、可移动和非可移动媒体可以
由任何方法或技术来实现信息存储。信息可以是计算机可读指令、数据结构、程序的模块或其他数据。计算机的存储介质的例子包括,但不限于相变内存(PRAM)、静态随机存取存储器(SRAM)、动态随机存取存储器(DRAM)、其他类型的随机存取存储器(RAM)、只读存储器(ROM)、电可擦除可编程只读存储器(EEPROM)、快闪记忆体或其他内存技术、只读光盘只读存储器(CD-ROM)、数字多功能光盘(DVD)或其他光学存储、磁盒式磁带,磁带磁盘存储或其他磁性存储设备或任何其他非传输介质,可用于存储可以被计算设备访问的信息。按照本文中的界定,计算机可读介质不包括非暂存电脑可读媒体(transitory media),如调制的数据信号和载波。
图1示出根据本申请一个方面的一种用于分布式图计算的方法流程图,其中,所述方法包括步骤S11、步骤S12和步骤S13。
具体地,在步骤S11中,设备1获取原始图数据;在步骤S12中,设备1根据图算法,处理所述原始图数据以获得所述图算法对应的规整图数据;在步骤S13中,设备1将所述图算法对应的计算任务分发至多个计算节点执行,其中,在执行过程中当满足持久化条件,进行持久化操作。
在此,所述设备1包括但不限于用户设备、网络设备、或用户设备与网络设备通过网络相集成所构成的设备。所述用户设备其包括但不限于任何一种可与用户通过触摸板进行人机交互的移动电子产品,例如智能手机、平板电脑、笔记本电脑等,所述移动电子产品可以采用任意操作系统,如android操作系统、iOS操作系统等。其中,所述网络设备包括一种能够按照事先设定或存储的指令,自动进行数值计算和信息处理的电子设备,其硬件包括但不限于微处理器、专用集成电路(ASIC)、可编程门阵列(FPGA)、数字处理器(DSP)、嵌入式设备等。所述网络设备其包括但不限于计算机、网络主机、单个网络服务器、多个网络服务器集或多个服务器构成的云;在此,云由基于云计算(Cloud Computing)的大量计算机或网络服务器构成,其中,云计算是分布式计算的一种,由一群松散耦合的计算机集组成的一个虚拟超级计算机。所述网络包括但不限于互联网、广域网、城域网、局域网、VPN网络、无线自组织网络(Ad Hoc网络)等。优选地,
设备1还可以是运行于所述用户设备、网络设备、或用户设备与网络设备、网络设备、触摸终端或网络设备与触摸终端通过网络相集成所构成的设备上的脚本程序。当然,本领域技术人员应能理解上述设备1仅为举例,其他现有的或今后可能出现的设备1如可适用于本申请,也应包含在本申请保护范围以内,并在此以引用方式包含于此。
在步骤S11中,设备1获取原始图数据。
在此,所述原始图数据包括图的点数据和边数据;其中,边数据可以包括出发点和到达点的信息,还可以包括任何图算法所需的信息;如果是权重图,则边数据还带有权重数据。
在步骤S12中,设备1根据图算法,处理所述原始图数据以获得所述图算法对应的规整图数据。
例如,图算法往往需要一些参数来控制精度、运算次数等关键信息;所述图算法的种类可能有多种,对于不同的图算法,其参数也可能不同。在此,通过处理所述原始图数据,得到对应的规整图数据,以适配不同种类的图算法。
优选地,在步骤S12中,设备1还将所述规整图数据存储于分布式文件系统。
例如,所述分布式文件系统可以包括Hadoop分布式文件系统(Hadoop Distributed File System,HDFS);为了增加处理的并行度,在优选的实施例中,本申请将图数据存储于Hadoop分布式文件系统。
当然,本领域应能理解上述Hadoop分布式文件系统仅为举例,其他现有的或今后可能出现的分布式文件系统如可适用于本申请,也应包含在本申请保护范围以内,并在此以引用方式包含于此。
在优选的实施例中,采用了Hadoop分布式文件系统进行存储,还采用Hive作为交互工具;在实际应用场景中,除了数据外,通常还需要将一些Hive的配置传递给运算节点。在此,Hive是基于Hadoop的数据仓库工具,通过Hive可以将SQL语言应用于大数据场景,一方面兼容传统数据应用,另一方面屏蔽复杂的分布式编程细节。Hive支持多种计算引擎,其中Spark作为计算引擎拥有丰富的计算模型和算子,可以用于实现图算
法。
优选地,在步骤S12中,设备1还根据所述图算法,对所述规整图数据进行类型检查。
例如,在数据进入图算法前,需要进行类型检查,避免错误数据导致算法出错。具体地,可以先对所述规整图数据进行字段分割,再进行列类型检查。在优选的实施例中,通过GraphOperator算子从Hive处获取输入数据的结构类型检查器StandardStructObjectInspector,该类型检查器囊括了每个字段的元素类型检查器ObjectInspector。
在步骤S13中,设备1将所述图算法对应的计算任务分发至多个计算节点执行,其中,在执行过程中当满足持久化条件,进行持久化操作。
在优选的实施例中,在计算任务分发的过程中,为了提高处理效率,尽量将每个计算节点分配在存有图数据的HDFS节点上。在计算执行过程复杂而费时的情况下,通过持久化操作保存中间结果可以切断数据依赖,减少重复计算量。
优选地,在步骤S13中,设备1通过资源管理框架创建多个计算节点用于执行所述图算法对应的计算任务。
例如,所述通过资源管理框架可以包括Yarn;参照图2,通过资源管理框架Yarn为所述图算法对应的计算任务创建多个计算节点。
当然,本领域应能理解上述资源管理框架Yarn仅为举例,其他现有的或今后可能出现的资源管理框架如可适用于本申请,也应包含在本申请保护范围以内,并在此以引用方式包含于此。
优选地,在步骤S13中,设备1将所述图算法对应的计算任务分发至分布式计算框架中的多个计算节点执行。
例如,所述分布式计算框架可以包括Spark;参照图2,采用分布式计算框架Spark作为计算引擎,由于数据的计算过程是滞后(lazy)模型,更有利于计算复杂度高的图计算。
当然,本领域应能理解上述分布式计算框架Spark仅为举例,其他现有的或今后可能出现的分布式计算框架如可适用于本申请,也应包含在本申请保护范围以内,并在此以引用方式包含于此。
优选地,所述持久化条件包括以下至少任一项:所述分布式计算框架的弹性分布式数据集的计算耗时达到对应的时长阈值;所述分布式计算框架的弹性分布式数据集的当前依赖关系长度达到对应的长度阈值。
例如,若采用分布式计算框架Spark,在计算过程中会用到Spark的弹性分布式数据集(Resilient Distributed Datasets,RDD)。当GraphRDD有较长计算时间或依赖关系时(比如可以将计算耗时对应的时长阈值设置为10分钟,当GraphRDD的计算耗时达到10分钟),进行持久化操作,将数据和元素类型检查器ObjectInspector一起写入到本地磁盘中,并将对应的BlockId汇报给Spark Driver。在处理需要多轮迭代运算的图算法时,为了避免计算节点故障造成的数据丢失,持久化操作亦可以将数据写入Hadoop分布式文件系统(HDFS)中。
优选地,所述持久化操作包括以下至少任一项:存储当前计算结果;清除当前依赖关系。
在此,通过持久化(persist)操作可以保存计算结果、清除依赖关系,降低一些被反复使用的复杂变换的计算成本,并提供容错性。
优选地,在步骤S13中,设备1还对键值相同的所述规整图数据进行聚合操作及连接操作。
例如,以图数据的某一列或几列数据作为键值(key),进行聚合(groupBy)操作及连接(join)操作,由一个计算节点处理相同键值(key)的所有数据,因此计算节点间会有大量的数据传输。具体地,通过GraphRDD选取数据的指定字段,将这些字段序列化成键值(key),通过聚合操作及连接操作合并键值(key)相同的数据,并根据图算法的种类不同,施加不同运算。为了减少网络传输压力,在此,通过聚合操作先将数据在每个计算节点上合并一次,再将合并完的结果根据键值(key)传递到其他计算节点上。
在优选的实施中,为了提高连接效率,可以使用一种优化的数据结构和优化策略。当两个数据量庞大的GraphRDD做连接操作时,会对内存产生极大压力。在本实施例中,所采用的数据结构在内存资源紧张时会把数据存入磁盘中,从而避免内存溢出问题。当数据量极小的GraphRDD与数
据量极大的GraphRDD做连接操作时,采用将较小数据量的GraphRDD拷贝至每个计算节点的连接优化策略,加快连接速度的同时也减轻了网络压力。
优选地,所述对键值相同的所述规整图数据进行聚合操作及连接操作还包括:进行所述聚合操作之前,在每个所述计算节点上对所述规整图数据进行合并操作。
在此,进行聚合操作及连接操作之前,在当前计算节点执行数据合并操作,可以减少网络数据的传输量,提高运算效率,从而减轻网络传输压力。
优选地,在步骤S13中,设备1当所述计算节点获取中间数据,先对中间数据进行反序列化,根据所述图算法处理反序列化后的中间数据,再对根据所述图算法处理后的中间数据进行序列化。
例如,图计算过程中会产生各种中间数据,为了能降低计算节点解析数据类型的CPU以及内存开销,本实施例采用一种基于类型检查的数据序列化和反序列化方法,将数据类型解析器随同数据一起传递给计算节点。具体地,GraphOperator将原始数据和元素类型检查器ObjectInspector组合成GraphRDD,作为每个图算法算子的输入数据。当涉及到Shuffle操作时,使用ObjectInspector序列化每个数据,并在其他计算节点反序列化。
优选地,参照图3,所述方法还包括步骤S14’和步骤S15’;在步骤S14’中,设备1获取待执行的SQL语句;在步骤S15’中,设备1解析所述SQL语句以调用对应的图算法。
在现有技术中,因为图算法有复杂的计算过程和大量迭代次数,无法通过SQL实现。
而在本实施例中,将分布式计算框架Spark作为计算引擎,以自定义函数的方式将众多图算法集成于Hive中。从而,可以将图算法与其他SQL语句有机组合,降低处理难度。
优选地,在步骤S15’中,设备1利用自定义函数注册多个图算法,其中,每个图算法对应一个注册函数。
例如,可以利用Hive的UDTF(User Defined Table-Generating Function,
用户自定义表生成函数)机制,向Hive注册图算法的实现类名,达到通过SQL语句启动图算法的目的。在此,UDTF是Hive为用户添加自定函数而设计的一种接口,用户可以通过UDTF的process方法,获取一行输入,并将其转换成一行或多行输出。
然而UDTF的“一行输入,多行输出”的模型不能满足图计算的需求。在本实施例中,通过在UDTF基础上添加新的处理逻辑,使得进入图算法的数据是完整的图数据。在此,可以利用UDTF接口,为每个图算法注册一个函数。本实施例实现了基于UDTF的Operator算子,从而解决图计算的需求。具体地,首先实现一个GraphOperator算子,作为所有图算法算子的基类。GraphOperator继承UDTF接口,因此可以通过FunctionRegistry的registerGenericUDTF方法,将不同图算法注册进Hive中。在本实施例改动了Hive的TableScanOperator算子和UDTFOperator算子。UDTFOperator算子从TableScanOperator算子处获取被封装为RDD的输入数据,并传递给GraphOperator算子。每个继承GraphOperator的图算法算子就都可以访问到完整的图数据。
图4示出根据本申请另一个方面的一种用于分布式图计算的设备1,其中,所述设备1包括第一装置11、第二装置12和第三装置13。
具体地,所述第一装置11获取原始图数据;所述第二装置12根据图算法,处理所述原始图数据以获得所述图算法对应的规整图数据;所述第三装置13将所述图算法对应的计算任务分发至多个计算节点执行,其中,在执行过程中当满足持久化条件,进行持久化操作。
在此,所述设备1包括但不限于用户设备、网络设备、或用户设备与网络设备通过网络相集成所构成的设备。所述用户设备其包括但不限于任何一种可与用户通过触摸板进行人机交互的移动电子产品,例如智能手机、平板电脑、笔记本电脑等,所述移动电子产品可以采用任意操作系统,如android操作系统、iOS操作系统等。其中,所述网络设备包括一种能够按照事先设定或存储的指令,自动进行数值计算和信息处理的电子设备,其硬件包括但不限于微处理器、专用集成电路(ASIC)、可编程门阵列(FPGA)、数字处理器(DSP)、嵌入式设备等。所述网络设备其包括但不限于计算
机、网络主机、单个网络服务器、多个网络服务器集或多个服务器构成的云;在此,云由基于云计算(Cloud Computing)的大量计算机或网络服务器构成,其中,云计算是分布式计算的一种,由一群松散耦合的计算机集组成的一个虚拟超级计算机。所述网络包括但不限于互联网、广域网、城域网、局域网、VPN网络、无线自组织网络(Ad Hoc网络)等。优选地,设备1还可以是运行于所述用户设备、网络设备、或用户设备与网络设备、网络设备、触摸终端或网络设备与触摸终端通过网络相集成所构成的设备上的脚本程序。当然,本领域技术人员应能理解上述设备1仅为举例,其他现有的或今后可能出现的设备1如可适用于本申请,也应包含在本申请保护范围以内,并在此以引用方式包含于此。
所述第一装置11获取原始图数据。
在此,所述原始图数据包括图的点数据和边数据;其中,边数据可以包括出发点和到达点的信息,还可以包括任何图算法所需的信息;如果是权重图,则边数据还带有权重数据。
所述第二装置12根据图算法,处理所述原始图数据以获得所述图算法对应的规整图数据。
例如,图算法往往需要一些参数来控制精度、运算次数等关键信息;所述图算法的种类可能有多种,对于不同的图算法,其参数也可能不同。在此,通过处理所述原始图数据,得到对应的规整图数据,以适配不同种类的图算法。
优选地,所述第二装置12还将所述规整图数据存储于分布式文件系统。
例如,所述分布式文件系统可以包括Hadoop分布式文件系统(Hadoop Distributed File System,HDFS);为了增加处理的并行度,在优选的实施例中,本申请将图数据存储于Hadoop分布式文件系统。
当然,本领域应能理解上述Hadoop分布式文件系统仅为举例,其他现有的或今后可能出现的分布式文件系统如可适用于本申请,也应包含在本申请保护范围以内,并在此以引用方式包含于此。
在优选的实施例中,采用了Hadoop分布式文件系统进行存储,还采
用Hive作为交互工具;在实际应用场景中,除了数据外,通常还需要将一些Hive的配置传递给运算节点。在此,Hive是基于Hadoop的数据仓库工具,通过Hive可以将SQL语言应用于大数据场景,一方面兼容传统数据应用,另一方面屏蔽复杂的分布式编程细节。Hive支持多种计算引擎,其中Spark作为计算引擎拥有丰富的计算模型和算子,可以用于实现图算法。
优选地,所述第二装置12还根据所述图算法,对所述规整图数据进行类型检查。
例如,在数据进入图算法前,需要进行类型检查,避免错误数据导致算法出错。具体地,可以先对所述规整图数据进行字段分割,再进行列类型检查。在优选的实施例中,通过GraphOperator算子从Hive处获取输入数据的结构类型检查器StandardStructObjectInspector,该类型检查器囊括了每个字段的元素类型检查器ObjectInspector。
所述第三装置13将所述图算法对应的计算任务分发至多个计算节点执行,其中,在执行过程中当满足持久化条件,进行持久化操作。
在优选的实施例中,在计算任务分发的过程中,为了提高处理效率,尽量将每个计算节点分配在存有图数据的HDFS节点上。在计算执行过程复杂而费时的情况下,通过持久化操作保存中间结果可以切断数据依赖,减少重复计算量。
优选地,所述第三装置13通过资源管理框架创建多个计算节点用于执行所述图算法对应的计算任务。
例如,所述通过资源管理框架可以包括Yarn;参照图2,通过资源管理框架Yarn为所述图算法对应的计算任务创建多个计算节点。
当然,本领域应能理解上述资源管理框架Yarn仅为举例,其他现有的或今后可能出现的资源管理框架如可适用于本申请,也应包含在本申请保护范围以内,并在此以引用方式包含于此。
优选地,所述第三装置13将所述图算法对应的计算任务分发至分布式计算框架中的多个计算节点执行。
例如,所述分布式计算框架可以包括Spark;参照图2,采用分布式计
算框架Spark作为计算引擎,由于数据的计算过程是滞后(lazy)模型,更有利于计算复杂度高的图计算。
当然,本领域应能理解上述分布式计算框架Spark仅为举例,其他现有的或今后可能出现的分布式计算框架如可适用于本申请,也应包含在本申请保护范围以内,并在此以引用方式包含于此。
优选地,所述持久化条件包括以下至少任一项:所述分布式计算框架的弹性分布式数据集的计算耗时达到对应的时长阈值;所述分布式计算框架的弹性分布式数据集的当前依赖关系长度达到对应的长度阈值。
例如,若采用分布式计算框架Spark,在计算过程中会用到Spark的弹性分布式数据集(Resilient Distributed Datasets,RDD)。当GraphRDD有较长计算时间或依赖关系时(比如可以将计算耗时对应的时长阈值设置为10分钟,当GraphRDD的计算耗时达到10分钟),进行持久化操作,将数据和元素类型检查器ObjectInspector一起写入到本地磁盘中,并将对应的BlockId汇报给Spark Driver。在处理需要多轮迭代运算的图算法时,为了避免计算节点故障造成的数据丢失,持久化操作亦可以将数据写入Hadoop分布式文件系统(HDFS)中。
优选地,所述持久化操作包括以下至少任一项:存储当前计算结果;清除当前依赖关系。
在此,通过持久化(persist)操作可以保存计算结果、清除依赖关系,降低一些被反复使用的复杂变换的计算成本,并提供容错性。
优选地,所述三装置13还对键值相同的所述规整图数据进行聚合操作及连接操作。
例如,以图数据的某一列或几列数据作为键值(key),进行聚合(groupBy)操作及连接(join)操作,由一个计算节点处理相同键值(key)的所有数据,因此计算节点间会有大量的数据传输。具体地,通过GraphRDD选取数据的指定字段,将这些字段序列化成键值(key),通过聚合操作及连接操作合并键值(key)相同的数据,并根据图算法的种类不同,施加不同运算。为了减少网络传输压力,在此,通过聚合操作先将数据在每个计算节点上合并一次,再将合并完的结果根据键值(key)传
递到其他计算节点上。
在优选的实施中,为了提高连接效率,可以使用一种优化的数据结构和优化策略。当两个数据量庞大的GraphRDD做连接操作时,会对内存产生极大压力。在本实施例中,所采用的数据结构在内存资源紧张时会把数据存入磁盘中,从而避免内存溢出问题。当数据量极小的GraphRDD与数据量极大的GraphRDD做连接操作时,采用将较小数据量的GraphRDD拷贝至每个计算节点的连接优化策略,加快连接速度的同时也减轻了网络压力。
优选地,所述对键值相同的所述规整图数据进行聚合操作及连接操作还包括:进行所述聚合操作之前,在每个所述计算节点上对所述规整图数据进行合并操作。
在此,进行聚合操作及连接操作之前,在当前计算节点执行数据合并操作,可以减少网络数据的传输量,提高运算效率,从而减轻网络传输压力。
优选地,所述第三装置13当所述计算节点获取中间数据,先对中间数据进行反序列化,根据所述图算法处理反序列化后的中间数据,再对根据所述图算法处理后的中间数据进行序列化。
例如,图计算过程中会产生各种中间数据,为了能降低计算节点解析数据类型的CPU以及内存开销,本实施例采用一种基于类型检查的数据序列化和反序列化方法,将数据类型解析器随同数据一起传递给计算节点。具体地,GraphOperator将原始数据和元素类型检查器ObjectInspector组合成GraphRDD,作为每个图算法算子的输入数据。当涉及到Shuffle操作时,使用ObjectInspector序列化每个数据,并在其他计算节点反序列化。
优选地,参照图5,所述设备1还包括第四装置14’和第五装置15’;所述第四装置14’获取待执行的SQL语句;所述第五装置15’解析所述SQL语句以调用对应的图算法。
在现有技术中,因为图算法有复杂的计算过程和大量迭代次数,无法通过SQL实现。
而在本实施例中,将分布式计算框架Spark作为计算引擎,以自定义
函数的方式将众多图算法集成于Hive中。从而,可以将图算法与其他SQL语句有机组合,降低处理难度。
优选地,所述第五装置15’利用自定义函数注册多个图算法,其中,每个图算法对应一个注册函数。
例如,可以利用Hive的UDTF(User Defined Table-Generating Function,用户自定义表生成函数)机制,向Hive注册图算法的实现类名,达到通过SQL语句启动图算法的目的。在此,UDTF是Hive为用户添加自定函数而设计的接口,用户可以通过UDTF的process方法,获取一行输入,并将其转换成一行或多行输出。
然而UDTF的“一行输入,多行输出”的模型不能满足图计算的需求。在本实施例中,通过在UDTF基础上添加新的处理逻辑,使得进入图算法的数据是完整的图数据。在此,可以利用UDTF接口,为每个图算法注册一个函数。本实施例实现了基于UDTF的Operator算子,从而解决图计算的需求。具体地,首先实现一个GraphOperator算子,作为所有图算法算子的基类。GraphOperator继承UDTF接口,因此可以通过FunctionRegistry的registerGenericUDTF方法,将不同图算法注册进Hive中。在本实施例改动了Hive的TableScanOperator算子和UDTFOperator算子。UDTFOperator算子从TableScanOperator算子处获取被封装为RDD的输入数据,并传递给GraphOperator算子。每个继承GraphOperator的图算法算子就都可以访问到完整的图数据。
与现有技术相比,本申请先获取原始图数据,然后根据图算法处理所述原始图数据以获得所述图算法对应的规整图数据,以便于适配不同种类的图算法,接着将所述图算法对应的计算任务分发至多个计算节点执行,其中,在执行过程中当满足持久化条件,进行持久化操作,切断数据依赖,减少重复计算量,提高处理效率。进一步地,本申请在对图数据进行聚合操作及连接操作之前,先对其进行合并操作,从而提高运算效率,减轻网络传输压力。进一步地,本申请采用一种数据序列化和反序列化的方法,以便于计算过程中的产生的中间数据在计算节点之间传递。进一步地,本申请实现了通过SQL语句启动图算法,并且通过改进处理逻辑,使得进
入图算法的数据是完整的图数据。
需要注意的是,本申请可在软件和/或软件与硬件的组合体中被实施,例如,可采用专用集成电路(ASIC)、通用目的计算机或任何其他类似硬件设备来实现。在一个实施例中,本申请的软件程序可以通过处理器执行以实现上文所述步骤或功能。同样地,本申请的软件程序(包括相关的数据结构)可以被存储到计算机可读记录介质中,例如,RAM存储器,磁或光驱动器或软磁盘及类似设备。另外,本申请的一些步骤或功能可采用硬件来实现,例如,作为与处理器配合从而执行各个步骤或功能的电路。
另外,本申请的一部分可被应用为计算机程序产品,例如计算机程序指令,当其被计算机执行时,通过该计算机的操作,可以调用或提供根据本申请的方法和/或技术方案。而调用本申请的方法的程序指令,可能被存储在固定的或可移动的记录介质中,和/或通过广播或其他信号承载媒体中的数据流而被传输,和/或被存储在根据所述程序指令运行的计算机设备的工作存储器中。在此,根据本申请的一个实施例包括一个装置,该装置包括用于存储计算机程序指令的存储器和用于执行程序指令的处理器,其中,当该计算机程序指令被该处理器执行时,触发该装置运行基于前述根据本申请的多个实施例的方法和/或技术方案。
对于本领域技术人员而言,显然本申请不限于上述示范性实施例的细节,而且在不背离本申请的精神或基本特征的情况下,能够以其他的具体形式实现本申请。因此,无论从哪一点来看,均应将实施例看作是示范性的,而且是非限制性的,本申请的范围由所附权利要求而不是上述说明限定,因此旨在将落在权利要求的等同要件的含义和范围内的所有变化涵括在本申请内。不应将权利要求中的任何附图标记视为限制所涉及的权利要求。此外,显然“包括”一词不排除其他单元或步骤,单数不排除复数。装置权利要求中陈述的多个单元或装置也可以由一个单元或装置通过软件或者硬件来实现。第一,第二等词语用来表示名称,而并不表示任何特定的顺序。
对于本领域技术人员而言,显然本发明不限于上述示范性实施例的细节,而且在不背离本发明的精神或基本特征的情况下,能够以其他的具体
形式实现本发明。因此,无论从哪一点来看,均应将实施例看作是示范性的,而且是非限制性的,本发明的范围由所附权利要求而不是上述说明限定,因此旨在将落在权利要求的等同要件的含义和范围内的所有变化涵括在本发明内。不应将权利要求中的任何附图标记视为限制所涉及的权利要求。此外,显然“包括”一词不排除其他单元或步骤,单数不排除复数。装置权利要求中陈述的多个单元或装置也可以由一个单元或装置通过软件或者硬件来实现。第一,第二等词语用来表示名称,而并不表示任何特定的顺序。
Claims (24)
- 一种用于分布式图计算的方法,其中,所述方法包括:a获取原始图数据;b根据图算法,处理所述原始图数据以获得所述图算法对应的规整图数据;c将所述图算法对应的计算任务分发至多个计算节点执行,其中,在执行过程中当满足持久化条件,进行持久化操作。
- 根据权利要求1所述的方法,其中,所述方法还包括:获取待执行的SQL语句;解析所述SQL语句以调用对应的图算法。
- 根据权利要求2所述的方法,其中,所述解析所述SQL语句以调用对应的图算法包括:利用自定义函数注册多个图算法,其中,每个图算法对应一个注册函数。
- 根据权利要求1所述的方法,其中,所述步骤c还包括:对键值相同的所述规整图数据进行聚合操作及连接操作。
- 根据权利要求4所述的方法,其中,所述对键值相同的所述规整图数据进行聚合操作及连接操作还包括:进行所述聚合操作之前,在每个所述计算节点上对所述规整图数据进行合并操作。
- 根据权利要求1所述的方法,其中,所述步骤c包括:当所述计算节点获取中间数据,先对中间数据进行反序列化,根据所述图算法处理反序列化后的中间数据,再对根据所述图算法处理后的中间数据进行序列化。
- 根据权利要求1所述的方法,其中,所述步骤b还包括:将所述规整图数据存储于分布式文件系统。
- 根据权利要求1所述的方法,其中,所述步骤b还包括:根据所述图算法,对所述规整图数据进行类型检查。
- 根据权利要求1所述的方法,其中,所述步骤c包括:通过资源管理框架创建多个计算节点用于执行所述图算法对应的计算任务。
- 根据权利要求1所述的方法,其中,所述步骤c包括:将所述图算法对应的计算任务分发至分布式计算框架中的多个计算节点执行。
- 根据权利要求10所述的方法,其中,所述持久化条件包括以下至少任一项:所述分布式计算框架的弹性分布式数据集的计算耗时达到对应的时长阈值;所述分布式计算框架的弹性分布式数据集的当前依赖关系长度达到对应的长度阈值。
- 根据权利要求1至11中任一项所述的方法,其中,所述持久化操作包括以下至少任一项:存储当前计算结果;清除当前依赖关系。
- 一种用于分布式图计算的设备,其中,所述设备包括:第一装置,用于获取原始图数据;第二装置,用于根据图算法,处理所述原始图数据以获得所述图算法对应的规整图数据;第三装置,用于将所述图算法对应的计算任务分发至多个计算节点执行,其中,在执行过程中当满足持久化条件,进行持久化操作。
- 根据权利要求13所述的设备,其中,所述设备还包括:第四装置,用于获取待执行的SQL语句;第五装置,用于解析所述SQL语句以调用对应的图算法。
- 根据权利要求14所述的设备,其中,所述第五装置用于:利用自定义函数注册多个图算法,其中,每个图算法对应一个注册函数。
- 根据权利要求13所述的设备,其中,所述第三装置还用于:对键值相同的所述规整图数据进行聚合操作及连接操作。
- 根据权利要求16所述的设备,其中,所述对键值相同的所述规整图数据进行聚合操作及连接操作还包括:进行所述聚合操作之前,在每个所述计算节点上对所述规整图数据进行 合并操作。
- 根据权利要求13所述的设备,其中,所述第三装置用于:当所述计算节点获取中间数据,先对中间数据进行反序列化,根据所述图算法处理反序列化后的中间数据,再对根据所述图算法处理后的中间数据进行序列化。
- 根据权利要求13所述的设备,其中,所述第二装置还用于:将所述规整图数据存储于分布式文件系统。
- 根据权利要求13所述的设备,其中,所述第二装置还用于:根据所述图算法,对所述规整图数据进行类型检查。
- 根据权利要求13所述的设备,其中,所述第三装置用于:通过资源管理框架创建多个计算节点用于执行所述图算法对应的计算任务。
- 根据权利要求13所述的设备,其中,所述第三装置用于:将所述图算法对应的计算任务分发至分布式计算框架中的多个计算节点执行。
- 根据权利要求22所述的设备,其中,所述持久化条件包括以下至少任一项:所述分布式计算框架的弹性分布式数据集的计算耗时达到对应的时长阈值;所述分布式计算框架的弹性分布式数据集的当前依赖关系长度达到对应的长度阈值。
- 根据权利要求13至23中任一项所述的设备,其中,所述持久化操作包括以下至少任一项:存储当前计算结果;清除当前依赖关系。
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610818819.8A CN106611037A (zh) | 2016-09-12 | 2016-09-12 | 用于分布式图计算的方法与设备 |
CN201610818819.8 | 2016-09-12 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2018045753A1 true WO2018045753A1 (zh) | 2018-03-15 |
Family
ID=58614973
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2017/080845 WO2018045753A1 (zh) | 2016-09-12 | 2017-04-18 | 用于分布式图计算的方法与设备 |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN106611037A (zh) |
WO (1) | WO2018045753A1 (zh) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109918199A (zh) * | 2019-02-28 | 2019-06-21 | 中国科学技术大学苏州研究院 | 基于gpu的分布式图处理系统 |
CN111367936A (zh) * | 2020-02-28 | 2020-07-03 | 中国工商银行股份有限公司 | 一种结构化查询语言语法的离线校验方法及装置 |
CN114925123A (zh) * | 2022-04-24 | 2022-08-19 | 杭州悦数科技有限公司 | 一种分布式的图数据库与图计算系统间的数据传输方法 |
Families Citing this family (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107729523A (zh) * | 2017-10-27 | 2018-02-23 | 平安科技(深圳)有限公司 | 数据服务方法、电子装置及存储介质 |
CN109189732A (zh) * | 2018-08-03 | 2019-01-11 | 成都四方伟业软件股份有限公司 | 一种中位数分析方法及装置 |
CN111211993B (zh) * | 2018-11-21 | 2023-08-11 | 百度在线网络技术(北京)有限公司 | 流式计算的增量持久化方法、装置及存储介质 |
CN110427359A (zh) * | 2019-06-27 | 2019-11-08 | 苏州浪潮智能科技有限公司 | 一种图数据处理方法和装置 |
CN110516117A (zh) * | 2019-07-22 | 2019-11-29 | 平安科技(深圳)有限公司 | 图计算的类别型变量存储方法、装置、设备及存储介质 |
CN110688610B (zh) * | 2019-09-27 | 2023-05-09 | 支付宝(杭州)信息技术有限公司 | 图数据的权重计算方法、装置和电子设备 |
CN113495679B (zh) * | 2020-04-01 | 2022-10-21 | 北京大学 | 基于非易失存储介质的大数据存储访问与处理的优化方法 |
CN111475684B (zh) * | 2020-06-29 | 2020-09-22 | 北京一流科技有限公司 | 数据处理网络系统及其计算图生成方法 |
CN111935026B (zh) * | 2020-08-07 | 2024-02-13 | 腾讯科技(深圳)有限公司 | 一种数据传输方法、装置、处理设备及介质 |
CN113626207B (zh) * | 2021-10-12 | 2022-03-08 | 苍穹数码技术股份有限公司 | 地图数据处理方法、装置、设备及存储介质 |
CN113806302B (zh) * | 2021-11-11 | 2022-02-22 | 支付宝(杭州)信息技术有限公司 | 图状态数据管理方法及装置 |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102591709A (zh) * | 2011-12-20 | 2012-07-18 | 南京大学 | 基于OGR的shapefile文件主从式并行写方法 |
CN103793442A (zh) * | 2012-11-05 | 2014-05-14 | 北京超图软件股份有限公司 | 空间数据的处理方法及系统 |
CN103970604A (zh) * | 2013-01-31 | 2014-08-06 | 国际商业机器公司 | 基于MapReduce架构实现图处理的方法和装置 |
CN104978228A (zh) * | 2014-04-09 | 2015-10-14 | 腾讯科技(深圳)有限公司 | 一种分布式计算系统的调度方法和装置 |
CN105335135A (zh) * | 2014-07-14 | 2016-02-17 | 华为技术有限公司 | 数据处理方法和中心节点 |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103336808B (zh) * | 2013-06-25 | 2017-12-15 | 中国科学院信息工程研究所 | 一种基于bsp模型的实时图数据处理系统及方法 |
- 2016-09-12: CN application CN201610818819.8A filed; published as CN106611037A (status: pending)
- 2017-04-18: PCT application PCT/CN2017/080845 filed as WO2018045753A1 (status: application filing)
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102591709A (zh) * | 2011-12-20 | 2012-07-18 | 南京大学 | 基于OGR的shapefile文件主从式并行写方法 |
CN103793442A (zh) * | 2012-11-05 | 2014-05-14 | 北京超图软件股份有限公司 | 空间数据的处理方法及系统 |
CN103970604A (zh) * | 2013-01-31 | 2014-08-06 | 国际商业机器公司 | 基于MapReduce架构实现图处理的方法和装置 |
CN104978228A (zh) * | 2014-04-09 | 2015-10-14 | 腾讯科技(深圳)有限公司 | 一种分布式计算系统的调度方法和装置 |
CN105335135A (zh) * | 2014-07-14 | 2016-02-17 | 华为技术有限公司 | 数据处理方法和中心节点 |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109918199A (zh) * | 2019-02-28 | 2019-06-21 | 中国科学技术大学苏州研究院 | 基于gpu的分布式图处理系统 |
CN109918199B (zh) * | 2019-02-28 | 2023-06-16 | 中国科学技术大学苏州研究院 | 基于gpu的分布式图处理系统 |
CN111367936A (zh) * | 2020-02-28 | 2020-07-03 | 中国工商银行股份有限公司 | 一种结构化查询语言语法的离线校验方法及装置 |
CN111367936B (zh) * | 2020-02-28 | 2023-08-22 | 中国工商银行股份有限公司 | 一种结构化查询语言语法的离线校验方法及装置 |
CN114925123A (zh) * | 2022-04-24 | 2022-08-19 | 杭州悦数科技有限公司 | 一种分布式的图数据库与图计算系统间的数据传输方法 |
CN114925123B (zh) * | 2022-04-24 | 2024-06-07 | 杭州悦数科技有限公司 | 一种分布式的图数据库与图计算系统间的数据传输方法 |
Also Published As
Publication number | Publication date |
---|---|
CN106611037A (zh) | 2017-05-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2018045753A1 (zh) | 用于分布式图计算的方法与设备 | |
US11442942B2 (en) | Modified representational state transfer (REST) application programming interface (API) including a customized GraphQL framework | |
US20210406068A1 (en) | Method and system for stream computation based on directed acyclic graph (dag) interaction | |
WO2017041657A1 (zh) | 一种应用接口管理方法和装置 | |
WO2016095726A1 (zh) | 一种用于分布式执行关系型计算指令的方法与设备 | |
AU2017254506B2 (en) | Method, apparatus, computing device and storage medium for data analyzing and processing | |
US10983815B1 (en) | System and method for implementing a generic parser module | |
US8799861B2 (en) | Performance-testing a system with functional-test software and a transformation-accelerator | |
US11379499B2 (en) | Method and apparatus for executing distributed computing task | |
CN112860730A (zh) | Sql语句的处理方法、装置、电子设备及可读存储介质 | |
US11366704B2 (en) | Configurable analytics for microservices performance analysis | |
US20190213188A1 (en) | Distributed computing framework and distributed computing method | |
US20200278969A1 (en) | Unified metrics computation platform | |
WO2020015087A1 (zh) | 大规模图片处理方法、系统、计算机设备及计算机存储介质 | |
Miller et al. | Open source big data analytics frameworks written in scala | |
WO2016023372A1 (zh) | 数据存储处理方法及装置 | |
WO2016008317A1 (zh) | 数据处理方法和中心节点 | |
US20150067089A1 (en) | Metadata driven declarative client-side session management and differential server side data submission | |
CN111580938A (zh) | 一种工作单元的事务处理方法、装置、设备及介质 | |
Diez Dolinski et al. | Distributed simulation of P systems by means of map-reduce: first steps with Hadoop and P-Lingua | |
US11757959B2 (en) | Dynamic data stream processing for Apache Kafka using GraphQL | |
CN111753017B (zh) | 基于Kylin系统的维表处理方法、装置、电子设备及存储介质 | |
JP2022551454A (ja) | ストアドプロシージャの実行方法、装置、データベースシステム及び記憶媒体 | |
Jiang et al. | Architecture Analysis and Implementation of 3D Theatre Display System Based on Node. js | |
CN117435367B (zh) | 用户行为处理方法、装置、设备、存储介质和程序产品 |
Legal Events
Date | Code | Title | Description
---|---|---|---
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 17847927; Country of ref document: EP; Kind code of ref document: A1
| NENP | Non-entry into the national phase | Ref country code: DE
| 122 | Ep: pct application non-entry in european phase | Ref document number: 17847927; Country of ref document: EP; Kind code of ref document: A1