WO2023124543A1 - Data processing method and data processing apparatus for big data

Info

Publication number
WO2023124543A1
Authority
WO
WIPO (PCT)
Prior art keywords
program
data processing
virtual machine
operator
big data
Prior art date
Application number
PCT/CN2022/130286
Inventor
俞博文
冯冠宇
曹焕琦
郑纬民
陈文光
Original Assignee
清华大学
Application filed by 清华大学
Publication of WO2023124543A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/40 Transformation of program code
    • G06F8/41 Compilation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]

Definitions

  • optimizing the first program part includes: setting at least one operator in the first program part to process multiple data items in a batch each time.
  • optimizing the first program part may include: for the data to be processed by the first program part, converting data of the same data type into a data arrangement stored contiguously in memory. This optimization operation may be referred to as a data arrangement operation.
  • the function pointer in the second program part is converted into a pointer address that can actually call a function in the predetermined virtual machine big data processing system, and the called functions implement the function of the second operator in the second program part.
  • the virtual machine code calls the functional modules of the predetermined virtual machine big data processing system (for example, the processing module of the corresponding operator), which avoids re-implementing these functional modules.
  • "predetermined" here means determined in advance, that is, a target processing system selected in advance for code conversion and invocation.
  • the code compiled in the native system includes the functional implementation of the operator, so it can be run by the native system using the virtual machine's mechanism for executing machine code (e.g., JNI, JNA). Once this part of the program has run in the native system, the function of the operator has been carried out.
  • the code compiled in the native system does not contain the functional implementation of the operator, but is converted into virtual machine code supported by the predetermined virtual machine big data processing system, so it can be executed by the predetermined virtual machine big data processing system residing in the virtual machine.
  • the engine of the predetermined virtual machine big data processing system can be reused to provide functions such as distributed execution, elasticity, straggler mitigation, and monitoring.
  • the engine of the native-system big data processing system in the embodiments of the present disclosure may include a driver in the virtual machine system, which is an ordinary application of a virtual machine big data processing system (such as Spark), implemented in a virtual machine programming language (such as Java), for loading the compiled program code (i.e., the "loadable module").
  • the driver can be submitted to a Spark-compatible cluster just like a normal Spark application. After a successful submission, the driver loads the loadable module from the native system, registers the engine implementation with the loadable module, and starts the main program contained in the loadable module.
  • the driver can also instruct a newly created executor to prepare its environment, for example by downloading the above-mentioned loadable module.

Abstract

A data processing method and data processing apparatus for big data. The data processing method comprises: acquiring a main program written in a native programming language (S101); compiling the main program in a native system and generating a loadable module, wherein the loadable module comprises a first program part and a second program part (S102); loading the loadable module by a virtual machine running in the native system, and converting the second program part into virtual machine code supported by a predetermined virtual machine big data processing system (S103); and running, by the virtual machine, the main program contained in the loadable module, wherein the virtual machine constructs a directed acyclic graph program corresponding to the loadable module, then runs the directed acyclic graph program, hands over the first program part to the native system for running, and hands over the converted second program part to the predetermined virtual machine big data processing system for running (S104). According to the data processing method, a high-performance big data processing framework can be constructed while integrating an existing virtual machine big data software ecosystem.

Description

Data processing method and data processing device for big data
This application claims priority to Chinese Patent Application No. 202111618375.0, filed on December 27, 2021, the disclosure of which is incorporated herein by reference in its entirety as part of this application.
Technical field
The present disclosure relates to a data processing method and a data processing device for big data.
Background
Big data refers to large, rapidly growing data sets in heterogeneous formats. Apache Hadoop is an early open-source big data solution that includes a distributed file system (HDFS) for persistent storage of big data and an analysis framework based on the MapReduce abstraction. Apache Spark is a fast, general-purpose computing system designed for large-scale data processing. Spark is an open-source cluster computing environment similar to Hadoop, but it performs better on certain workloads. More recently, Apache Spark introduced a new abstraction called the Resilient Distributed Dataset (RDD) to support fault-tolerant data reuse for iterative workloads, which can achieve performance an order of magnitude better than Hadoop MapReduce. Spark provides a rich and easy-to-use application programming interface on which support libraries for graph computing, stream processing, machine learning, and SQL queries can be built. Today, Spark is widely deployed for big data analysis.
Summary of the invention
At least one embodiment of the present disclosure provides a data processing method for big data, including: acquiring a main program written in a native programming language; compiling the main program in a native system and generating a loadable module, wherein the loadable module includes a first program part and a second program part, the first program part includes the functional implementation of a first operator, and the second program part includes a function pointer for calling a function corresponding to a second operator; loading the loadable module by a virtual machine running in the native system, and converting the second program part into virtual machine code supported by a predetermined virtual machine big data processing system, wherein the virtual machine code calls the functional implementation of the function in the predetermined virtual machine big data processing system; and running, by the virtual machine, the main program contained in the loadable module, wherein, while the main program runs, the virtual machine constructs a directed acyclic graph program corresponding to the loadable module, then runs the directed acyclic graph program, handing the first program part over to the native system to run and handing the converted second program part over to the predetermined virtual machine big data processing system to run.
For example, in the data processing method provided by at least one embodiment of the present disclosure, the data processing method uses distributed computing, the first operator is a local operator, and the second operator is a global operator.
For example, in the data processing method provided by at least one embodiment of the present disclosure, compiling the main program in the native system and generating the loadable module includes: performing an optimization operation on the first program part to reduce the interaction overhead between the native system and the virtual machine.
For example, in the data processing method provided by at least one embodiment of the present disclosure, performing the optimization operation on the first program part includes: fusing a series of operation steps in the first program part into a single operation step.
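The fusion step can be pictured with a minimal Python sketch. It is only an illustration under stated assumptions: the helper `fuse` and the three per-record steps are hypothetical names, and the disclosure does not say how fusion is implemented in the compiled native code. The point is that a chain of steps becomes one callable, so each record is handed over once instead of once per step.

```python
def fuse(*steps):
    """Compose a chain of per-record steps into one callable (hypothetical helper)."""
    def fused(record):
        for step in steps:
            record = step(record)
        return record
    return fused

# Hypothetical three-step chain that would otherwise run as three separate operators.
add_one = lambda x: x + 1
double = lambda x: x * 2
to_text = lambda x: f"v={x}"

fused_step = fuse(add_one, double, to_text)

# One call per record instead of three.
result = [fused_step(x) for x in [1, 2, 3]]
```

With this sketch, every record passes through a single fused step, which is the property the optimization relies on to cut the number of interactions.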
For example, in the data processing method provided by at least one embodiment of the present disclosure, performing the optimization operation on the first program part includes: setting at least one operator in the first program part to process multiple data items in a batch each time.
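As a rough illustration of why batching reduces interaction overhead, the following Python sketch counts how many times an operator is invoked. The class `CountingOperator` and the batch size of 4 are assumptions made for the example; in the disclosure the invocations would be calls across the virtual machine / native boundary.

```python
def batches(records, batch_size):
    """Yield consecutive slices of at most batch_size records."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]

class CountingOperator:
    """Stand-in for a native operator; `calls` counts boundary crossings."""
    def __init__(self, fn):
        self.fn = fn
        self.calls = 0

    def __call__(self, batch):
        self.calls += 1
        return [self.fn(x) for x in batch]

square = CountingOperator(lambda x: x * x)
data = list(range(10))

# Ten records are processed with three operator calls instead of ten.
out = [y for batch in batches(data, 4) for y in square(batch)]
```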
For example, in the data processing method provided by at least one embodiment of the present disclosure, performing the optimization operation on the first program part includes: for the data to be processed by the first program part, converting data of the same data type into a data arrangement stored contiguously in memory.
For example, in the data processing method provided by at least one embodiment of the present disclosure, the input of at least one operator in the first program part is set to a pointer to the start address of the data arrangement.
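The following Python sketch illustrates the data arrangement idea using only the standard `array` and `ctypes` modules. The record shape `(id, score)` and the function `sum_doubles` are hypothetical; the sketch only shows values of one type laid out back to back and an operator that receives nothing but a start address and an element count.

```python
import ctypes
from array import array

# Hypothetical (id, score) records in row form.
records = [(1, 0.5), (2, 1.5), (3, 2.5)]

# Column layout: values of the same type are stored contiguously.
ids = array("q", (r[0] for r in records))      # 64-bit signed integers
scores = array("d", (r[1] for r in records))   # 64-bit floats

def sum_doubles(start_address, count):
    """Operator whose input is the start address of a contiguous double column."""
    column = (ctypes.c_double * count).from_address(start_address)
    return sum(column)

# buffer_info() returns the buffer's memory address and its length in elements.
address, count = scores.buffer_info()
total = sum_doubles(address, count)
```

Passing one address for a whole column, rather than one object per value, is what lets a native operator walk the data without further conversions.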
For example, in the data processing method provided by at least one embodiment of the present disclosure, the predetermined virtual machine big data processing system is Apache Spark.
At least one embodiment of the present disclosure provides a data processing device for big data, including: a program acquisition unit configured to acquire a main program written in a native programming language; a program compilation unit configured to compile the main program in a native system and generate a loadable module, wherein the loadable module includes a first program part and a second program part, the first program part includes the functional implementation of a first operator, and the second program part includes a function pointer for calling a function corresponding to a second operator; a loading and conversion unit configured to cause a virtual machine running in the native system to load the loadable module and to convert the second program part into virtual machine code supported by a predetermined virtual machine big data processing system, wherein the virtual machine code calls the functional implementation of the function in the predetermined virtual machine big data processing system; and a running unit configured to cause the virtual machine to run the main program contained in the loadable module, wherein, while the main program runs, the virtual machine constructs a directed acyclic graph program corresponding to the loadable module, then runs the directed acyclic graph program, handing the first program part over to the native system to run and handing the converted second program part over to the predetermined virtual machine big data processing system to run.
The data processing device provided by at least one embodiment of the present disclosure further includes: an optimization unit configured to perform an optimization operation on the first program part to reduce the interaction overhead between the native system and the virtual machine.
For example, in the data processing device provided by at least one embodiment of the present disclosure, performing the optimization operation on the first program part includes: fusing a series of operation steps in the first program part into a single operation step.
For example, in the data processing device provided by at least one embodiment of the present disclosure, performing the optimization operation on the first program part includes: setting at least one operator in the first program part to process multiple data items in a batch each time.
For example, in the data processing device provided by at least one embodiment of the present disclosure, performing the optimization operation on the first program part includes: for the data to be processed by the first program part, converting data of the same data type into a data arrangement stored contiguously in memory.
For example, in the data processing device provided by at least one embodiment of the present disclosure, the input of at least one operator in the first program part is set to a pointer to the start address of the data arrangement.
Brief description of the drawings
In order to explain the technical solutions of the embodiments of the present disclosure more clearly, the drawings of the embodiments are briefly introduced below. Obviously, the drawings described below relate only to some embodiments of the present disclosure and do not limit the present disclosure.
Fig. 1 shows a schematic flowchart of a data processing method for big data provided by at least one embodiment of the present disclosure;
Fig. 2 shows an exemplary schematic diagram of the optimization operation performed on the first program part, provided by at least one embodiment of the present disclosure;
Fig. 3 shows an exemplary schematic diagram of a directed acyclic graph program provided by at least one embodiment of the present disclosure;
Fig. 4A shows an exemplary schematic diagram of a virtual machine big data processing system (Apache Spark) provided by at least one embodiment of the present disclosure;
Fig. 4B shows an exemplary schematic diagram of a native-system big data processing system provided by at least one embodiment of the present disclosure;
Fig. 5 shows a schematic diagram of an example workflow using the data processing method provided by at least one embodiment of the present disclosure;
Fig. 6 shows a schematic block diagram of a data processing device for big data provided by at least one embodiment of the present disclosure.
Detailed description
In order to make the objectives, technical solutions, and advantages of the embodiments of the present disclosure clearer, the technical solutions of the embodiments are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are some, but not all, of the embodiments of the present disclosure. All other embodiments obtained by a person of ordinary skill in the art on the basis of the described embodiments without creative effort fall within the scope of protection of the present disclosure.
Unless otherwise defined, the technical or scientific terms used in the present disclosure have the ordinary meaning understood by a person of ordinary skill in the field to which the present disclosure belongs. "First", "second", and similar words used in the present disclosure do not indicate any order, quantity, or importance, and are used only to distinguish different components. Likewise, words such as "a", "an", or "the" do not indicate a limitation of quantity, but indicate that at least one is present. Words such as "include" or "comprise" mean that the element or item preceding the word covers the elements or items listed after the word and their equivalents, without excluding other elements or items. Words such as "connected" or "coupled" are not limited to physical or mechanical connections, and may include electrical connections, whether direct or indirect. "Up", "down", "left", "right", and the like are used only to indicate relative positional relationships; when the absolute position of the described object changes, the relative positional relationship may change accordingly.
Although Spark has advantages for in-memory distributed data sets and, in addition to providing interactive queries, can also optimize iterative workloads, recent work shows that there is still considerable room to improve its performance. Spark programs run on top of the JVM (Java virtual machine): a program must first be translated into code that the JVM executes indirectly. Big data analysis frameworks built with the C++ programming language have a large performance advantage over Spark, because code written in C++ can be translated into code that the machine executes directly. For example, Thrill, a C++-based big data analysis framework, achieves an average speedup of 3.26 times over Spark on typical big data workloads. In addition, for a Java matrix multiplication kernel, switching from Java to C yields a 4.4 times speedup, and with the vectorization and AVX intrinsics provided by the C compiler, performance is improved by 9.45 times. However, performance is only one aspect of big data processing. Although C++-based big data analysis frameworks perform better than Spark, they lack many key functions that Spark provides for big data processing, such as lineage-based resilience. Big data analysis usually uses distributed computing, that is, it runs in multi-tenant clusters built on commodity hardware (multi-tenant commodity clusters), where task failures caused by machine failures, network jitter, and preemptive scheduling are very common, which makes checkpointing an inefficient way to handle these frequent failures. Spark's lineage-based fault-tolerance mechanism allows only part of the data, rather than all of it, to be recomputed. Spark's elasticity also supports other functions, such as load balancing, straggler mitigation, and auto-scaling, thereby improving the resource utilization of the cluster. In addition, Spark's ecosystem, for example a performance analyzer with a Web UI and integration with various resource managers, makes it easy to deploy, monitor, and analyze applications on various private or public clouds. Thrill has a native RDD-like (Resilient Distributed Datasets) abstraction called DIA, but it tightly couples data distribution to physical machines, which defeats elasticity. Husky uses an upstream message logging fault-tolerance mechanism, which incurs non-negligible overhead even when no failure occurs. Compared with Spark, these C++-based big data analysis frameworks lack many basic functions.
Clearly, a full-featured native big data framework needs to be designed. A straightforward solution is to reimplement the functions provided by Spark in a native programming language (for example, C++). This is feasible in theory, but may be too expensive and unnecessary: the core component of Spark 3.0.1 has 74K lines of code, of which only 9K lines are directly related to the programming framework (including the RDD application programming interface and the operator implementations), while the rest of the code consists of components serving various big data functions. It is therefore possible to build a big data framework that reuses Spark's mature big data functions instead of reimplementing them.
However, reusing Spark's big data functions raises several challenges. First, existing native big data frameworks (for example, Thrill and Husky) are incompatible with Spark's execution model, which makes it infeasible to integrate them into Spark. For example, Thrill couples each data set partition to a specific machine, and Husky relies on stateful task execution, which violates Spark's assumptions of dynamic scheduling and statelessness. Second, fine-grained interaction between the JVM and the native world, whether through JNI (Java Native Interface) or JNA (Java Native Access), incurs high overhead and may become a new performance bottleneck.
At least one embodiment of the present disclosure provides a data processing method for big data, including: acquiring a main program written in a native programming language; compiling the main program in a native system and generating a loadable module, wherein the loadable module includes a first program part and a second program part, the first program part includes the functional implementation of a first operator, and the second program part includes a function pointer for calling a function corresponding to a second operator; loading the loadable module by a virtual machine running in the native system, and converting the second program part into virtual machine code supported by a predetermined virtual machine big data processing system, wherein the virtual machine code calls the functional implementation of the function in the predetermined virtual machine big data processing system; and running, by the virtual machine, the main program contained in the loadable module, wherein, while the main program runs, the virtual machine constructs a directed acyclic graph program corresponding to the loadable module, then runs the directed acyclic graph program, handing the first program part over to the native system to run and handing the converted second program part over to the predetermined virtual machine big data processing system to run.
This data processing method divides a program for processing big data into a first program part and a second program part that are handled separately, so that the important functions of the predetermined virtual machine big data processing system (for example, Apache Spark) can be reused while the performance of big data processing is improved. The first program part includes the functional implementation of the first operator and can be run by the native system without calling the functions of the predetermined virtual machine big data processing system, so it is processed faster. The second program part calls the functions of the predetermined virtual machine big data processing system, so the functions that system already provides are readily available. The data processing method according to the embodiments of the present disclosure can therefore achieve high-speed processing of big data and, by making full use of the functions of an existing big data processing system, full-featured big data processing.
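The dispatch described above can be sketched in Python with the standard `graphlib` module. This is only a toy model under stated assumptions: the node names, the `"native"` / `"vm"` tags, and the linear shape of the graph are all hypothetical, and the disclosure does not specify how the directed acyclic graph program is represented.

```python
from graphlib import TopologicalSorter

# Hypothetical DAG: node -> (side that runs it, list of predecessor nodes).
dag = {
    "read": ("native", []),
    "filter": ("native", ["read"]),
    "shuffle": ("vm", ["filter"]),
    "aggregate": ("native", ["shuffle"]),
}

# TopologicalSorter expects a mapping from each node to its predecessors.
sorter = TopologicalSorter({name: set(deps) for name, (_, deps) in dag.items()})
order = list(sorter.static_order())

# Each node is handed to the side recorded for it when the graph was built.
dispatch = [(name, dag[name][0]) for name in order]
```

Here the three nodes tagged `"native"` stand for operators of the first program part, and the node tagged `"vm"` stands for an operator of the converted second program part.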
The native system is a concept defined relative to the virtual machine: it refers to the operating system inherent to the computer, for example a Linux system or Windows system installed on the computer itself. A native programming language is a programming language used on the native system, for example the C++ programming language. For example, in some embodiments of the present disclosure, the main program is written in the C++ programming language and compiled into a loadable module in the form of object code (for example, machine code), and the loadable module is then loaded into a virtual machine running in the native system, for example a JVM running on a Linux system.
It should be noted that, although some embodiments of the present disclosure are described with Apache Spark as an example of the predetermined virtual machine big data processing system, the present disclosure is not limited to Apache Spark. Those skilled in the art will understand that other virtual machine big data processing systems have similar technical problems, and the technical solutions of the embodiments of the present disclosure apply equally to them. In the present disclosure, a virtual machine big data processing system (framework) is a big data processing system that runs on a virtual machine (for example, the JVM) and includes an engine and a programming framework. The big data processing system provides the various functions of big data processing, for example the functions of Apache Spark described above.
At least one embodiment of the present disclosure further provides a data processing device corresponding to the above data processing method.
The embodiments of the present disclosure are described in detail below with reference to the accompanying drawings, but the present disclosure is not limited to these specific embodiments.
Fig. 1 shows a schematic flowchart of a data processing method for big data provided by at least one embodiment of the present disclosure.
As shown in Fig. 1, the data processing method includes the following steps S101 to S104.
Step S101: acquire a main program written in a native programming language.
The user writes the main program in a native programming language, for example the C++ programming language, and the processing device acquires the main program written by the user.
Step S102: compile the main program in the native system and generate a loadable module, wherein the loadable module includes a first program part and a second program part, the first program part includes the functional implementation of a first operator, and the second program part includes a function pointer for calling a function corresponding to a second operator.
In step S102, the main program is compiled into a module in object code form (for example, machine code form) that can be loaded into a virtual machine, and is therefore called a loadable module. The main program generally includes calls to functions provided by the system. At compile time, the main program is compiled together with the function libraries or data sets provided by the system, which resolve the called functions so that they can be processed accordingly. According to the embodiments of the present disclosure, the compiled loadable module may include a first program part and a second program part, which are processed differently in subsequent steps. In the embodiments of the present disclosure, the "first program part" is also called the "compile-time program part", and the "second program part" is also called the "run-time program part". The two parts differ in whether they include the concrete functional implementation of the operators in the program. A part that includes the concrete functional implementation of its operators is a first program part, and the corresponding operators are called first operators; otherwise it is a second program part, and the corresponding operators are called second operators. The second program part does not include the concrete functional implementation of its operators, but instead includes function pointers for calling the functions corresponding to those operators. These operators (the second operators) are implemented later in the process by assigning the address of the function that implements each operator to the corresponding function pointer (described in detail below). Operators can be divided into first and second operators by category, for example by assigning certain categories of operators to the first operators and other categories to the second operators according to a predetermined division strategy. The division strategy can be determined by the specific application. For example, operators whose implementation is relatively simple can be classified as first operators and operators whose implementation is relatively complex as second operators, so that the functions of the virtual machine big data processing system are reused for the complex second operators. As another example, in a data processing method that uses distributed computing, local operators can be classified as first operators and global operators as second operators. In other words, in some embodiments of the present disclosure, the data processing method uses distributed computing, the first operator is a local operator, and the second operator is a global operator.
A local operator is an operator that does not need to consider coordination between different compute nodes in a distributed computation, for example, the mapValues operator, the filter operator, the flatMap operator, and hash operators (such as Hash Aggregate and Build Hash).
A global operator is an operator that needs to consider coordination between different compute nodes in a distributed computation, for example, partition-pruning operators, operators with more than one dependency (such as the union operator, zip operators (such as Zip Partitions), and the cartesian operator), shuffle operators (such as Shuffle Write and Shuffle Read), cache operators, and data-source operators.
Because global operators must take coordination between different compute nodes into account, they are relatively complex to implement. In the embodiments of the present disclosure they are therefore classified as second operators, so that the global operator functionality of an existing virtual machine big data processing system (for example, Spark) can be reused.
For example, partition-pruning operators and operators with more than one dependency can provide lineage-related information to Spark and can reuse Spark's lineage-based fault tolerance mechanism.
For example, shuffle operators can reuse Spark's fault-tolerant data shuffling mechanism, a mechanism that is not only difficult to implement but also cumbersome to deploy.
For example, cache operators can reuse Spark's intermediate data management mechanism. Data-source operators can provide location information to Spark and reuse Spark's location-aware task scheduling.
Classifying the local operators as first operators and the global operators as second operators allows the second part of the program, which contains the second operators, to be handed in subsequent steps to a virtual machine big data processing system such as Spark. The complex functionality then does not have to be reimplemented, which avoids unnecessary engineering effort.
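The partition policy described above can be illustrated with a minimal C++ sketch. The names OperatorPart and classify_operator and the operator name strings are hypothetical and chosen only for illustration; the disclosure does not prescribe how the policy is encoded.

```cpp
#include <string>
#include <unordered_set>

// Which part of the loadable module an operator belongs to.
enum class OperatorPart { First, Second };

// Assumed partition policy: operators that need coordination between
// compute nodes (global operators) go to the second part; every other
// operator is treated as local and goes to the first part.
inline OperatorPart classify_operator(const std::string& name) {
    static const std::unordered_set<std::string> global_ops = {
        "PartitionPruning", "Union", "ZipPartitions", "Cartesian",
        "ShuffleWrite", "ShuffleRead", "Cache", "DataSource"};
    return global_ops.count(name) ? OperatorPart::Second : OperatorPart::First;
}
```

Under this sketch, mapValues, filter, flatMap, Hash Aggregate, and Build Hash fall into the first part, while the shuffle, cache, and data-source operators fall into the second part.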
In some embodiments of the present disclosure, compiling the main program in the native system in step S102 may include: performing optimization operations on the first part of the program to reduce the interaction overhead between the native system and the virtual machine.
Integrating a native-system program into a virtual machine big data processing system introduces interaction overhead between the native system and the virtual machine. According to some embodiments of the present disclosure, this overhead can be reduced by optimizing the first part of the program.
FIG. 2 shows an exemplary schematic diagram of optimization operations on the first part of the program. The two mapValues operators in the program "contribs.reduceByKey(_+_).mapValues(v=>0.85*v).mapValues(v=>0.15+v).join(links)" are used as an example. The native system and the virtual machine interact through a JNA Pointer.
(a) of FIG. 2 shows the case without optimization. In (a) of FIG. 2, each mapValues transformation is mapped directly to a dataset in the virtual machine (for example, a Spark RDD), and each value is mapped to a virtual machine program object (for example, a Java object). This approach is simple but incurs significant overhead. Processing even a simple value such as a (long, double) pair requires one serialization/deserialization on the virtual machine side and one on the native side, plus the creation of a virtual machine program object. The resulting interaction overhead may cancel out the performance advantage of the native system.
In some embodiments of the present disclosure, performing optimization operations on the first part of the program may include: fusing a sequence of operation steps in the first part of the program into a single operation step. This optimization may be called operator fusion, as shown in (b) of FIG. 2.
In (b) of FIG. 2, the two consecutive mapValues steps (.mapValues(v=>0.85*v) and .mapValues(v=>0.15+v)) are fused into a single mapValues step (.mapValues(v=>0.15+0.85*v)). This reduces the number of serializations/deserializations and thus the interaction overhead between the native system and the virtual machine.
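A minimal C++ sketch of operator fusion follows. The helper names fuse and map_values are hypothetical, and the sketch only composes two per-value functions on the native side; it does not model the serialization that fusion is meant to avoid.

```cpp
#include <utility>
#include <vector>

// Fuse two per-value functions into one, so that each value is visited
// (and would cross the native/VM boundary) once instead of twice.
template <typename F, typename G>
auto fuse(F first, G second) {
    return [first, second](double v) { return second(first(v)); };
}

// Apply a per-value function to every (key, value) pair; keys are unchanged.
template <typename Fn>
std::vector<std::pair<long, double>> map_values(
        std::vector<std::pair<long, double>> pairs, Fn fn) {
    for (auto& kv : pairs) kv.second = fn(kv.second);
    return pairs;
}
```

Calling fuse with the functions v => 0.85 * v and v => 0.15 + v yields a single function equivalent to v => 0.15 + 0.85 * v, which map_values then applies in one pass.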
In some embodiments of the present disclosure, performing optimization operations on the first part of the program may include: setting at least one operator in the first part of the program to process a batch of multiple data items on each call. This optimization may be called vectorization, as shown in (c) of FIG. 2.
In (c) of FIG. 2, the mapValues operator in the first part of the program, which originally processed one element at a time, is set to process a batch of data on each call. That is, instead of processing one (long, double) pair per call, it now processes one array of (long, double) pairs per call. The array (denoted R) contains multiple (long, double) pairs, each of which is key-value data of type (k, v), where k is the key and v is the value. This reduces the number of times the virtual machine invokes data in the native system and the number of serializations/deserializations, thereby reducing the interaction overhead between the native system and the virtual machine.
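The effect of vectorization on the number of calls can be sketched in C++ as follows. CallCounter, map_one, and map_batch are hypothetical names, and the counter merely stands in for the per-call cost of crossing the native/VM boundary.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

using Pair = std::pair<long, double>;

// Counts calls; a stand-in for per-call serialization/deserialization cost.
struct CallCounter { std::size_t calls = 0; };

// Element-wise variant: one call per (k, v) pair.
inline Pair map_one(const Pair& p, CallCounter& c) {
    ++c.calls;
    return {p.first, 0.15 + 0.85 * p.second};
}

// Batched variant: one call per array R of (k, v) pairs.
inline std::vector<Pair> map_batch(const std::vector<Pair>& r, CallCounter& c) {
    ++c.calls;
    std::vector<Pair> out;
    out.reserve(r.size());
    for (const Pair& p : r) out.emplace_back(p.first, 0.15 + 0.85 * p.second);
    return out;
}
```

For an array of n pairs, the element-wise variant is called n times while the batched variant is called once, which is the reduction in interaction count that the paragraph above describes.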
In some embodiments of the present disclosure, performing optimization operations on the first part of the program may include: for the data to be processed by the first part of the program, converting data of the same data type into a data layout stored contiguously in memory. This optimization may be called a data layout operation.
Storing batches of data contiguously in memory (used as a buffer) helps avoid the memory fragmentation caused by allocating small objects and removes the need for the garbage collector to perform compaction at run time. In addition, the data layout increases data locality and provides regular memory access patterns, which can improve CPU efficiency. For example, in some embodiments of the present disclosure, the input of at least one operator in the first part of the program may be set to the start address pointer of the data layout. Data of the same type is usually processed by one or more common operators. Therefore, when the data for these operators is stored contiguously in memory, it is not necessary to provide each data item or its address pointer; providing the address pointer of the first item (that is, the start address pointer of the data layout) is enough to access all of the contiguously stored data. In this case, the virtual machine and the native system do not need to exchange each data item to be processed, only the start address pointer of the data layout, which greatly reduces the interaction overhead between the two. (d) of FIG. 2 shows the pointer passing between the virtual machine and the native system. In (d) of FIG. 2, this data layout is called "CompactArray", which is an array of (long, double) pairs stored in contiguous memory. Only the address pointer is passed between the virtual machine and the native system, and no data serialization/deserialization is required.
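A minimal C++ sketch of such a contiguous layout is given below. The class here only borrows the name "CompactArray" from FIG. 2; its members, the element struct LongDoublePair, and the function sum_values are assumptions made for illustration, and the actual pointer exchange with the virtual machine (for example, through JNA) is not modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// One key-value element with a fixed-size layout.
struct LongDoublePair {
    std::int64_t key;
    double value;
};

// All elements live in one contiguous buffer owned by the native side.
class CompactArray {
public:
    explicit CompactArray(std::vector<LongDoublePair> elems)
        : buf_(std::move(elems)) {}
    // Only the start pointer and the length need to cross the boundary.
    const LongDoublePair* start() const { return buf_.data(); }
    std::size_t size() const { return buf_.size(); }
private:
    std::vector<LongDoublePair> buf_;
};

// An operator whose input is the start address pointer of the layout.
inline double sum_values(const LongDoublePair* start, std::size_t n) {
    assert(start != nullptr || n == 0);
    double total = 0.0;
    for (std::size_t i = 0; i < n; ++i) total += start[i].value;
    return total;
}
```

Because the elements are adjacent in memory, the operator reaches element i by pointer arithmetic from the start address, so no per-element object or pointer has to be passed.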
In some embodiments of the present disclosure, multiple templates are provided to generate data layouts for various data types. For example, a flat array (Flat Array) places fixed-length elements in the buffer. A bitmap array (Bitmap Array) places Boolean elements in the buffer. An array of arrays (Array Array) and a string array (String Array) place array elements in the buffer. A nullable array (Nullable Array) places nullable elements in the buffer. A tuple array (Tuple Array) places tuple elements in the buffer.
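Three of these layout templates can be sketched in C++ as follows. The disclosure names the templates but does not specify their internal layout, so the bit packing and the validity-bitmap design below are assumptions, as are all member names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Fixed-length elements stored back to back in one buffer.
template <typename T>
struct FlatArray {
    std::vector<T> buf;
    const T* start() const { return buf.data(); }
};

// Boolean elements packed one bit each (assumed packing).
struct BitmapArray {
    std::vector<std::uint8_t> bits;
    std::size_t count = 0;
    void push_back(bool b) {
        if (count % 8 == 0) bits.push_back(0);
        if (b) bits[count / 8] |= static_cast<std::uint8_t>(1u << (count % 8));
        ++count;
    }
    bool get(std::size_t i) const {
        assert(i < count);
        return ((bits[i / 8] >> (i % 8)) & 1u) != 0;
    }
};

// Nullable elements: a validity bitmap plus a flat array of values
// (assumed design; a null slot stores a default-constructed value).
template <typename T>
struct NullableArray {
    BitmapArray valid;
    FlatArray<T> values;
    void push_value(const T& v) { valid.push_back(true); values.buf.push_back(v); }
    void push_null() { valid.push_back(false); values.buf.push_back(T{}); }
    bool is_null(std::size_t i) const { return !valid.get(i); }
};
```

Writing the layouts as templates lets the compiler generate a specialized contiguous layout for each element type, which fits the template-based approach the paragraph above describes.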
The optimization operations on the first part of the program described above reduce the interaction overhead between the native system and the virtual machine and thereby improve the performance of big data processing. It should be noted that the present disclosure is not limited to these optimization operations; other optimization operations that reduce the interaction overhead between the native system and the virtual machine are also applicable.
Step S103: The virtual machine running in the native system loads the loadable module and converts the second part of the program into virtual machine code supported by a predetermined virtual machine big data processing system, where the virtual machine code calls the functional implementations of functions of the predetermined virtual machine big data processing system.
The main program is compiled into a loadable module in object-code form, and the loadable module is loaded into the virtual machine running in the native system, for example, into a JVM running on a Linux system. The loading can be performed through an interface between the virtual machine and the native system, for example, JNA.
After the loading in step S103, the virtual machine also converts the second part of the program into virtual machine code supported by the predetermined virtual machine big data processing system. The second part of the program usually consists of functionality that is relatively complex to implement, so it calls the functionality of the predetermined virtual machine big data processing system (for example, Spark) on the virtual machine instead of implementing it again. The second part of the program is therefore converted by the virtual machine, while being loaded, into code supported by the predetermined virtual machine big data processing system (for example, Spark). For example, each function pointer in the second part of the program is set to an address through which a function in the predetermined virtual machine big data processing system can actually be called, and the called functions provide the functional implementation of the second operators in the second part of the program. The virtual machine code calls the functional modules of the predetermined virtual machine big data processing system (for example, the processing modules of the corresponding operators), which avoids reimplementing these modules. It should be noted that "predetermined" here means determined in advance, that is, the target processing system is selected beforehand so that code conversion and invocation can be carried out.
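The binding of function pointers at load time can be illustrated with a simplified C++ sketch. All names here (OperatorFn, SecondOperatorSlot, framework_shuffle, bind_slot) and the int (*)(int) signature are hypothetical; in the disclosure the target functions belong to a framework on the virtual machine, whereas this sketch uses an ordinary native function as a stand-in.

```cpp
#include <cassert>
#include <map>
#include <string>

// Assumed signature of a second-operator entry point.
using OperatorFn = int (*)(int);

// The compiled module holds only an unbound slot for each second operator.
struct SecondOperatorSlot {
    std::string name;
    OperatorFn fn = nullptr;
    int call(int arg) const {
        assert(fn != nullptr);  // the slot must be bound before the program runs
        return fn(arg);
    }
};

// Stand-in for a function provided by the big data processing system.
inline int framework_shuffle(int partitions) { return partitions * 2; }

// At load time, set the slot to the address of the matching function.
// Returns false if the framework offers no function with that name.
inline bool bind_slot(SecondOperatorSlot& slot,
                      const std::map<std::string, OperatorFn>& framework) {
    auto it = framework.find(slot.name);
    if (it == framework.end()) return false;
    slot.fn = it->second;
    return true;
}
```

The design point is that the module is compiled without any implementation of the second operator; only after binding does a call through the slot reach the reused implementation.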
Step S104: The virtual machine runs the main program contained in the loadable module, where, while the main program is running, the virtual machine builds a directed acyclic graph program corresponding to the loadable module and then runs the directed acyclic graph program, handing the first part of the program to the native system to run and handing the converted second part of the program to the predetermined virtual machine big data processing system to run.
A directed acyclic graph (DAG) is a data structure from graph theory: a directed graph is acyclic if there is no vertex from which one can return to that same vertex by following one or more edges. A directed acyclic graph program is a program that has the form of a directed acyclic graph. A directed acyclic graph program for big data processing consists of a series of operators that process the data. FIG. 3 shows an example of a directed acyclic graph program. The first line in FIG. 3 is data processing code taken from an application (that is, part of the main program written by the user). The corresponding directed acyclic graph program is shown below the code and consists of several operators executed in order: Hash Aggregate, Shuffle Write, Shuffle Read, Map Values, Build Hash, Zip Partitions, and Probe Hash. These operators process the data accordingly. In the directed acyclic graph program of FIG. 3, vertices represent operators, and directed edges represent the dependency between two operators. The directed acyclic graph program in step S104 can be built by transforming the code written by the user. For example, the code "reduceByKey" in FIG. 3 can be built as Hash Aggregate, Shuffle Write, Shuffle Read, and Hash Aggregate executed in order, and the code "join" can be built as Build Hash, Zip Partitions, and Probe Hash executed in order. These operators include first operators and second operators, which belong to the first part of the program and the second part of the program, respectively.
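A directed acyclic graph of operators and a dependency-respecting execution order can be sketched in C++ as follows. The struct Dag and its members are hypothetical, and Kahn's algorithm is used here only as one standard way to order a DAG; the disclosure does not state which scheduling algorithm is used.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <string>
#include <vector>

// Vertices are operators; an edge u -> v means v depends on the output of u.
struct Dag {
    std::vector<std::string> ops;
    std::vector<std::vector<std::size_t>> out_edges;

    std::size_t add(const std::string& name) {
        ops.push_back(name);
        out_edges.emplace_back();
        return ops.size() - 1;
    }
    void depend(std::size_t from, std::size_t to) {
        assert(from < ops.size() && to < ops.size());
        out_edges[from].push_back(to);
    }

    // Kahn's algorithm: returns an execution order that respects all
    // dependencies, or an empty vector if the graph contains a cycle.
    std::vector<std::string> order() const {
        std::vector<std::size_t> indeg(ops.size(), 0);
        for (const auto& edges : out_edges)
            for (std::size_t v : edges) ++indeg[v];
        std::queue<std::size_t> ready;
        for (std::size_t i = 0; i < ops.size(); ++i)
            if (indeg[i] == 0) ready.push(i);
        std::vector<std::string> result;
        while (!ready.empty()) {
            std::size_t u = ready.front();
            ready.pop();
            result.push_back(ops[u]);
            for (std::size_t v : out_edges[u])
                if (--indeg[v] == 0) ready.push(v);
        }
        if (result.size() != ops.size()) result.clear();
        return result;
    }
};
```

Adding the four operators that "reduceByKey" expands to (Hash Aggregate, Shuffle Write, Shuffle Read, Hash Aggregate) with an edge between each consecutive pair yields that same sequence as the execution order.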
Step S104 also specifies how the first part of the program and the second part of the program are handled differently.
For the first part of the program, the code compiled in the native system contains the functional implementation of the operators, so it can be run by the native system through the virtual machine's mechanism for executing machine code (for example, JNI or JNA). Running this part in the native system carries out the processing of its operators. The second part of the program, by contrast, does not contain the functional implementation of its operators after compilation in the native system. Instead, it is converted into virtual machine code supported by the predetermined virtual machine big data processing system, and can therefore be executed by that system, which resides on the virtual machine. When executing step S104, the virtual machine can reuse the engine of the predetermined virtual machine big data processing system to obtain functions such as distributed execution, elasticity, straggler mitigation, and monitoring.
As described above, the predetermined virtual machine big data processing system is a big data processing system that includes an execution engine and a programming framework; for example, the predetermined virtual machine big data processing system is Apache Spark. The predetermined virtual machine big data processing system is described below with reference to FIG. 4A, using Apache Spark as an example.
FIG. 4A shows an exemplary schematic diagram of a virtual machine big data processing system (Apache Spark) provided by at least one embodiment of the present disclosure.
As shown in FIG. 4A, the predetermined virtual machine big data processing system includes a programming framework (RDD) and the Spark engine. The programming framework includes the Spark dataset representation, which provides the Spark RDD application programming interface (API) to users.
The Spark dataset representation contains a set of partitions, a set of dependencies, a computation definition, and metadata about dataset distribution and data placement. The Spark RDD API provided by the Spark dataset representation can be used to build DAG programs.
The Spark engine supports and manages various Spark functions, such as distributed execution, elasticity, straggler mitigation, and monitoring. The Spark engine includes a resource manager (for example, YARN or AWS), which provides unified resource management and scheduling for upper-layer applications.
Apache Spark, residing on the virtual machine, can use its programming framework and the Spark engine to parse and execute the converted code of the second part of the program (that is, code that conforms to the Spark RDD API).
Based on the data processing method of the embodiments of the present disclosure, the embodiments of the present disclosure provide a native-system big data processing system whose structure is similar to that of the virtual machine big data processing system described above.
FIG. 4B shows an exemplary schematic diagram of a native-system big data processing system provided by at least one embodiment of the present disclosure. The big data processing system of the embodiments of the present disclosure includes a programming framework and an engine. The programming framework includes a dataset representation, which can provide an application programming interface (API).
The API of the native-system big data processing system of the embodiments of the present disclosure can be designed as an RDD-style API similar to the Spark RDD API. For example, the names and semantics of its operators can be the same as those of the Spark operators, which makes it easy for Spark users to adopt. It can also be designed to differ from the Spark RDD API, as long as user programs can call it.
In addition to the functions of a typical big data processing system (for example, those of the Spark dataset representation described above), the dataset representation of the native-system big data processing system of the embodiments of the present disclosure can also provide the functional implementation of the first operators, the function pointers for calling the functions corresponding to the second operators, and the optimization operations on the first part of the program. The part of the dataset representation related to the second part of the program can reuse the Spark dataset representation, that is, it contains function pointers for calling the functions corresponding to the operators in the second part of the program. The part related to the first part of the program may contain the functional implementation of the operators in the first part of the program. The operators in the dataset representation may take the form of code templates, so that the first part of the program can be optimized through, for example, C++ template metaprogramming.
The engine of the native-system big data processing system of the embodiments of the present disclosure may be a wrapper around the engine of a virtual machine big data processing system (for example, Spark) so as to reuse Spark's big data functionality, and it additionally includes functional units for integrating native code into the virtual machine. The engine supports and manages the various functions of the native-system big data processing system. Because it wraps the engine of an existing virtual machine big data processing system, it does not require reconfiguring an existing cluster resource manager or recompiling an existing Spark installation. The engine of the native-system big data processing system of the present disclosure borrows most of the functionality of the Spark engine but enhances it to support efficient Spark integration. For example, on the native-system side the engine includes a library that provides C++ bindings for Spark (called CppSpark). CppSpark is an interface that exposes the functionality of the virtual machine in C++, so CppSpark can be used in the native system to call virtual machine functionality. When the second part of the program is converted, the C++ programming interface provided by CppSpark is called. CppSpark forwards the call from the native system to the virtual machine, so that the conversion of the second part of the program, that is, its conversion into virtual machine code supported by the predetermined virtual machine big data processing system, is carried out in the virtual machine. On the virtual machine side, the engine may include a driver, which is an ordinary application of the virtual machine big data processing system (for example, Spark) implemented in a virtual machine programming language (for example, Java). The driver loads the compiled program code (that is, the "loadable module") from the native system, converts the second part of the program, and executes the loadable module. For example, in Spark, the driver can be submitted to a Spark-compatible cluster like any ordinary Spark application. After a successful submission, the driver loads the loadable module from the native system, registers the engine implementation with the loadable module, and starts the main program contained in the loadable module. In addition, to support remote execution, the driver can instruct newly created executors to prepare their environment, for example, by downloading the loadable module.
An example workflow of the data processing method provided by at least one embodiment of the present disclosure is briefly described below with reference to the native-system big data processing system of FIG. 4B. The workflow is shown in FIG. 5.
As shown in FIG. 5, in the native system, the user writes the main program using the API. The native system compiles the main program together with the dataset representation and generates a loadable module, where the loadable module includes a first part of the program and a second part of the program, the first part of the program includes the functional implementation of the first operator, and the second part of the program includes a function pointer for calling the function corresponding to the second operator. The driver in the native system loads the loadable module and, during loading, uses the library that provides the C++ bindings for Spark to convert the second part of the program into virtual machine code supported by the predetermined virtual machine big data processing system. It also registers the engine implementation with the loadable module and starts the main program contained in the loadable module. While the main program is running, the driver builds the directed acyclic graph program corresponding to the loadable module and then runs it, handing the first part of the program to the native system to run. The driver can use a cluster resource manager (for example, YARN) to distribute the converted second part of the program to cluster executors (shown as Cloud in the figure) for distributed computation.
FIG. 6 shows a schematic block diagram of a data processing apparatus 600 for big data provided by at least one embodiment of the present disclosure. The data processing apparatus 600 can be used to execute the data processing method shown in FIG. 1.
As shown in FIG. 6, the data processing apparatus 600 includes a program acquisition unit 601, a program compilation unit 602, a loading and conversion unit 603, and a running unit 604.
The program acquisition unit 601 is configured to acquire a main program written in a native programming language.
The program compilation unit 602 is configured to compile the main program in the native system and generate a loadable module, where the loadable module includes a first part of the program and a second part of the program, the first part of the program includes the functional implementation of the first operator, and the second part of the program includes a function pointer for calling the function corresponding to the second operator.
The loading and conversion unit 603 is configured to have the virtual machine running in the native system load the loadable module and convert the second part of the program into virtual machine code supported by the predetermined virtual machine big data processing system, where the virtual machine code calls the functional implementations of functions of the predetermined virtual machine big data processing system.
The running unit 604 is configured to have the virtual machine run the main program contained in the loadable module, wherein, while the main program is running, the virtual machine builds the directed acyclic graph (DAG) program corresponding to the loadable module and then runs the DAG program, handing the first part of the program to the native system to run and handing the converted second part of the program to the predetermined virtual machine big data processing system to run.
For example, the data processing method performed by the data processing apparatus 600 uses distributed computing, the first operator is a local operator, and the second operator is a global operator.
For example, in at least one embodiment, the data processing apparatus 600 may further include an optimization unit 605. The optimization unit 605 is configured to perform an optimization operation on the first part of the program to reduce the interaction overhead between the native system and the virtual machine.
For example, in at least one embodiment, the optimization unit 605 is further configured to fuse a series of operation steps in the first part of the program into a single operation step.
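As a hedged illustration of why fusing operation steps reduces interaction overhead, the Python sketch below composes several per-element steps into one function so that the data crosses the boundary once instead of once per step. The helper names `call_across_boundary` and `fuse` are hypothetical, and the counter merely stands in for one native/virtual machine interaction; the disclosure does not prescribe this implementation.

```python
from typing import Callable, Dict, List


def call_across_boundary(fn: Callable, data: List[int], stats: Dict[str, int]) -> List[int]:
    """Hypothetical stand-in for one native/VM interaction; counts each crossing."""
    stats["crossings"] += 1
    return [fn(x) for x in data]


def fuse(steps: List[Callable]) -> Callable:
    """Compose a chain of per-element steps into a single per-element step."""
    def fused(x):
        for step in steps:
            x = step(x)
        return x
    return fused


steps = [lambda x: x + 1, lambda x: x * 10, lambda x: x - 3]
data = [1, 2, 3]

# Unfused: every step is a separate pass and a separate boundary crossing.
unfused_stats = {"crossings": 0}
unfused = data
for step in steps:
    unfused = call_across_boundary(step, unfused, unfused_stats)

# Fused: one pass, one boundary crossing, same result.
fused_stats = {"crossings": 0}
fused = call_across_boundary(fuse(steps), data, fused_stats)
```

Both variants compute the same values; only the number of crossings differs, which is the quantity the optimization aims to reduce.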
For example, in at least one embodiment, the optimization unit 605 is further configured to set at least one operator in the first part of the program to process multiple data items in a batch each time.
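Batch processing can be sketched in the same spirit. In the Python example below, an operator receives a batch of records per invocation rather than a single record, so the number of invocations drops from the record count to roughly the record count divided by the batch size. The names `make_batches`, `run_batched`, and `square_batch`, as well as the batch size of 4, are assumptions chosen for illustration only.

```python
from typing import Callable, Iterator, List, Tuple


def make_batches(records: List[int], batch_size: int) -> Iterator[List[int]]:
    """Yield consecutive slices of at most batch_size records."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]


def run_batched(batch_op: Callable[[List[int]], List[int]],
                records: List[int],
                batch_size: int) -> Tuple[List[int], int]:
    """Apply a batch operator and count how many times it was invoked."""
    out: List[int] = []
    calls = 0
    for batch in make_batches(records, batch_size):
        out.extend(batch_op(batch))
        calls += 1
    return out, calls


def square_batch(batch: List[int]) -> List[int]:
    """Hypothetical operator that handles a whole batch per call."""
    return [x * x for x in batch]


records = list(range(10))
squares, calls = run_batched(square_batch, records, batch_size=4)
```

With ten records and a batch size of 4, the operator is invoked three times instead of ten.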
For example, in at least one embodiment, the optimization unit 605 is further configured to, for the data to be processed by the first part of the program, convert data of the same data type into a data arrangement stored contiguously in memory.
For example, in at least one embodiment, the optimization unit 605 is further configured to set the input of at least one operator in the first part of the program to the start address pointer of the data arrangement.
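The data arrangement operation and the pointer-based operator input can be sketched as follows, under the assumption that records are simple (id, score) pairs. The Python standard library `array` type stores values of one type contiguously, `array.buffer_info()` returns the buffer's start address and element count, and `ctypes.cast` lets an operator read the values starting from that address. The record layout and the `sum_doubles` operator are hypothetical; a native implementation would more likely pass a raw C or C++ pointer.

```python
import ctypes
from array import array

# Row-oriented input: each record mixes an integer id and a floating-point score.
rows = [(1, 0.5), (2, 1.5), (3, 2.5)]

# Data arrangement: values of the same type are stored contiguously, one column per type.
ids = array("q", (r[0] for r in rows))     # signed 64-bit integers
scores = array("d", (r[1] for r in rows))  # C doubles


def sum_doubles(start_address: int, count: int) -> float:
    """Hypothetical operator whose input is the start address of a contiguous
    run of C doubles plus the number of elements."""
    ptr = ctypes.cast(start_address, ctypes.POINTER(ctypes.c_double))
    return sum(ptr[i] for i in range(count))


# buffer_info() returns (start address, number of elements) for the array's buffer.
address, count = scores.buffer_info()
total = sum_doubles(address, count)
```

The `scores` array must stay alive and unresized while the operator reads from its address, which mirrors the ownership discipline a native implementation would also need.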
For example, the data processing apparatus 600 may be implemented in hardware, software, firmware, or any feasible combination thereof, which is not limited by the present disclosure.
The above description of the data processing method also applies to the data processing apparatus 600 and is not repeated here.
Embodiments of the present disclosure further provide a computer program product including program code that, when executed by a processor, performs the data processing method according to the embodiments of the present disclosure.
Embodiments of the present disclosure further provide a computer-readable medium storing program code that, when executed by a processor, performs the data processing method according to the embodiments of the present disclosure.
Regarding the present disclosure, the following points should also be noted:
(1) The drawings of the embodiments of the present disclosure relate only to the structures involved in those embodiments; for other structures, reference may be made to common designs.
(2) Where no conflict arises, the embodiments of the present disclosure and the features in the embodiments may be combined with one another to obtain new embodiments.
The above is only a specific implementation of the present disclosure, but the protection scope of the present disclosure is not limited thereto; the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (14)

  1. A data processing method for big data, comprising:
    acquiring a main program written in a native programming language;
    compiling the main program in a native system and generating a loadable module, wherein the loadable module includes a first part of the program and a second part of the program, the first part of the program includes the functional implementation of a first operator, and the second part of the program includes a function pointer for calling the function corresponding to a second operator;
    loading the loadable module by a virtual machine running in the native system, and converting the second part of the program into virtual machine code supported by a predetermined virtual machine big data processing system, wherein the virtual machine code calls the functional implementations of the functions of the predetermined virtual machine big data processing system; and
    running, by the virtual machine, the main program contained in the loadable module, wherein, while the main program is running, the virtual machine builds the directed acyclic graph program corresponding to the loadable module and then runs the directed acyclic graph program, handing the first part of the program to the native system to run and handing the converted second part of the program to the predetermined virtual machine big data processing system to run.
  2. The data processing method according to claim 1, wherein the data processing method uses distributed computing, the first operator is a local operator, and the second operator is a global operator.
  3. The data processing method according to claim 1 or 2, wherein compiling the main program in the native system and generating the loadable module comprises:
    performing an optimization operation on the first part of the program to reduce the interaction overhead between the native system and the virtual machine.
  4. The data processing method according to claim 3, wherein performing the optimization operation on the first part of the program comprises:
    fusing a series of operation steps in the first part of the program into a single operation step.
  5. The data processing method according to claim 3, wherein performing the optimization operation on the first part of the program comprises:
    setting at least one operator in the first part of the program to process multiple data items in a batch each time.
  6. The data processing method according to claim 3, wherein performing the optimization operation on the first part of the program comprises:
    for the data to be processed by the first part of the program, converting data of the same data type into a data arrangement stored contiguously in memory.
  7. The data processing method according to claim 6, wherein
    the input of at least one operator in the first part of the program is set to the start address pointer of the data arrangement.
  8. The data processing method according to any one of claims 1-7, wherein
    the predetermined virtual machine big data processing system is Apache Spark.
  9. A data processing apparatus for big data, comprising:
    a program acquisition unit configured to acquire a main program written in a native programming language;
    a program compilation unit configured to compile the main program in a native system and generate a loadable module, wherein the loadable module includes a first part of the program and a second part of the program, the first part of the program includes the functional implementation of a first operator, and the second part of the program includes a function pointer for calling the function corresponding to a second operator;
    a loading and conversion unit configured to have a virtual machine running in the native system load the loadable module and convert the second part of the program into virtual machine code supported by a predetermined virtual machine big data processing system, wherein the virtual machine code calls the functional implementations of the functions of the predetermined virtual machine big data processing system; and
    a running unit configured to have the virtual machine run the main program contained in the loadable module, wherein, while the main program is running, the virtual machine builds the directed acyclic graph program corresponding to the loadable module and then runs the directed acyclic graph program, handing the first part of the program to the native system to run and handing the converted second part of the program to the predetermined virtual machine big data processing system to run.
  10. The data processing apparatus according to claim 9, further comprising:
    an optimization unit configured to perform an optimization operation on the first part of the program to reduce the interaction overhead between the native system and the virtual machine.
  11. The data processing apparatus according to claim 10, wherein performing the optimization operation on the first part of the program comprises:
    fusing a series of operation steps in the first part of the program into a single operation step.
  12. The data processing apparatus according to claim 10, wherein performing the optimization operation on the first part of the program comprises:
    setting at least one operator in the first part of the program to process multiple data items in a batch each time.
  13. The data processing apparatus according to claim 10, wherein performing the optimization operation on the first part of the program comprises:
    for the data to be processed by the first part of the program, converting data of the same data type into a data arrangement stored contiguously in memory.
  14. The data processing apparatus according to claim 13, wherein
    the input of at least one operator in the first part of the program is set to the start address pointer of the data arrangement.
PCT/CN2022/130286 2021-12-27 2022-11-07 Data processing method and data processing apparatus for big data WO2023124543A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111618375.0 2021-12-27
CN202111618375.0A CN114327479A (en) 2021-12-27 2021-12-27 Data processing method and data processing device for big data

Publications (1)

Publication Number Publication Date
WO2023124543A1 true WO2023124543A1 (en) 2023-07-06

Family

ID=81014410

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/130286 WO2023124543A1 (en) 2021-12-27 2022-11-07 Data processing method and data processing apparatus for big data

Country Status (2)

Country Link
CN (1) CN114327479A (en)
WO (1) WO2023124543A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114327479A (en) * 2021-12-27 2022-04-12 清华大学 Data processing method and data processing device for big data
CN115378789B (en) * 2022-10-24 2023-01-10 中国地质大学(北京) Multi-level cooperative stream resource management method and system

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103777997A (en) * 2013-12-25 2014-05-07 中软信息系统工程有限公司 JAVA virtual machine hardware independency platform based on MIPS and independency improvement method thereof
CN106648681A (en) * 2016-12-29 2017-05-10 南京科远自动化集团股份有限公司 System and method for compiling and loading programmable language
CN111309449A (en) * 2020-03-17 2020-06-19 上海蓝载信息科技有限公司 Programming language independent virtual machine oriented to meta-programming, interactive programming and blockchain interoperation
CN111767116A (en) * 2020-06-03 2020-10-13 江苏中科重德智能科技有限公司 Virtual machine for mechanical arm program development programming language and operation method for assembly file
US20210124600A1 (en) * 2019-10-29 2021-04-29 International Business Machines Corporation Rescheduling jit compilation based on jobs of parallel distributed computing framework
CN114327479A (en) * 2021-12-27 2022-04-12 清华大学 Data processing method and data processing device for big data

Also Published As

Publication number Publication date
CN114327479A (en) 2022-04-12

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22913806

Country of ref document: EP

Kind code of ref document: A1