WO2023124543A1 - Data processing method and data processing apparatus for big data

Info

Publication number: WO2023124543A1
Application number: PCT/CN2022/130286
Authority: WO
Inventors: 俞博文; 冯冠宇; 曹焕琦; 郑纬民; 陈文光
Original assignee: Tsinghua University (清华大学)
Priority date: 2021-12-27
Filing date: 2022-11-07
Publication date: 2023-07-06
Also published as: CN114327479A

Abstract

A data processing method and data processing apparatus for big data. The data processing method comprises: acquiring a main program written in a native programming language (S101); compiling the main program in a native system and generating a loadable module, wherein the loadable module comprises a first program part and a second program part (S102); loading the loadable module by a virtual machine running in the native system, and converting the second program part into virtual machine code supported by a predetermined virtual machine big data processing system (S103); and running, by the virtual machine, the main program contained in the loadable module, wherein the virtual machine constructs a directed acyclic graph program corresponding to the loadable module, then runs the directed acyclic graph program, hands over the first program part to the native system for running, and hands over the converted second program part to the predetermined virtual machine big data processing system for running (S104). According to the data processing method, a high-performance big data processing framework can be constructed while integrating an existing virtual machine big data software ecosystem.

Description

Data processing method and data processing device for big data

This application claims priority to Chinese Patent Application No. 202111618375.0, filed on December 27, 2021, the disclosure of which is incorporated herein by reference in its entirety as a part of this application.

Technical field

The present disclosure relates to a data processing method and a data processing device for big data.

Background

Big data refers to large and rapidly growing data sets in heterogeneous formats. Apache Hadoop is an early open source big data solution, which includes a distributed file system (HDFS) for persistent storage of big data, and an analysis framework based on the MapReduce abstraction. Apache Spark is a fast and general-purpose computing system designed for large-scale data processing. Spark is an open source cluster computing environment similar to Hadoop, but Spark is superior in certain workloads. Recently, Apache Spark introduced a new abstraction called Resilient Distributed Dataset (RDD) to support fault-tolerant data reuse for iterative workloads, which can achieve an order of magnitude better performance than Hadoop MapReduce. Spark provides a rich and easy-to-use API for building support libraries for graph computing, stream processing, machine learning, and SQL queries. Today, Spark is widely deployed to serve big data analysis.

Summary of the invention

At least one embodiment of the present disclosure provides a data processing method for big data, including: obtaining a main program written in a native programming language; compiling the main program in a native system and generating a loadable module, wherein the loadable module includes a first part of the program and a second part of the program, the first part of the program includes the functional implementation of a first operator, and the second part of the program includes a function pointer for calling the function corresponding to a second operator; loading the loadable module by a virtual machine running in the native system, and converting the second part of the program into virtual machine code supported by a predetermined virtual machine big data processing system, wherein the virtual machine code calls the functional implementations provided by the predetermined virtual machine big data processing system; and running, by the virtual machine, the main program contained in the loadable module, wherein, while the main program runs, the virtual machine constructs a directed acyclic graph program corresponding to the loadable module and then runs the directed acyclic graph program, handing the first part of the program to the native system to run and handing the converted second part of the program to the predetermined virtual machine big data processing system to run.

For example, in the data processing method provided by at least one embodiment of the present disclosure, the data processing method adopts distributed computing, the first operator is a local type operator, and the second operator is a global type operator.

For example, in the data processing method provided by at least one embodiment of the present disclosure, compiling the main program in the native system and generating a loadable module includes: optimizing the first part of the program to reduce the interaction overhead between the native system and the virtual machine.

For example, in the data processing method provided by at least one embodiment of the present disclosure, optimizing the first part of the program includes: fusing a series of operation steps in the first part of the program into one operation step.

For example, in the data processing method provided by at least one embodiment of the present disclosure, optimizing the first part of the program includes: setting at least one operator in the first part of the program to batch process multiple data each time.

For example, in the data processing method provided by at least one embodiment of the present disclosure, optimizing the first part of the program includes: for the data to be processed by the first part of the program, converting data of the same data type into a data arrangement stored contiguously in memory.

For example, in the data processing method provided by at least one embodiment of the present disclosure, the input of at least one operator in the first part of the program is set as the start address pointer of the data arrangement.

For example, in the data processing method provided in at least one embodiment of the present disclosure, the predetermined virtual machine big data processing system is Apache Spark.

At least one embodiment of the present disclosure provides a data processing device for big data, including: a program acquisition unit configured to acquire a main program written in a native programming language; a program compilation unit configured to compile the main program in a native system and generate a loadable module, wherein the loadable module includes a first part of the program and a second part of the program, the first part of the program includes the functional implementation of a first operator, and the second part of the program includes a function pointer for calling the function corresponding to a second operator; a loading and conversion unit configured to load the loadable module by a virtual machine running in the native system and convert the second part of the program into virtual machine code supported by a predetermined virtual machine big data processing system, wherein the virtual machine code calls the functional implementations provided by the predetermined virtual machine big data processing system; and a running unit configured to run, by the virtual machine, the main program contained in the loadable module, wherein, while the main program runs, the virtual machine constructs a directed acyclic graph program corresponding to the loadable module and then runs the directed acyclic graph program, handing the first part of the program to the native system to run and handing the converted second part of the program to the predetermined virtual machine big data processing system to run.

The data processing device provided by at least one embodiment of the present disclosure further includes: an optimization unit configured to: perform an optimization operation on the first part of the program to reduce the interaction overhead between the native system and the virtual machine.

For example, in the data processing device provided by at least one embodiment of the present disclosure, optimizing the first part of the program includes: fusing a series of operation steps in the first part of the program into one operation step.

For example, in the data processing device provided by at least one embodiment of the present disclosure, optimizing the first part of the program includes: setting at least one operator in the first part of the program to batch process multiple data each time.

For example, in the data processing device provided by at least one embodiment of the present disclosure, optimizing the first part of the program includes: for the data to be processed by the first part of the program, converting data of the same data type into a data arrangement stored contiguously in memory.

For example, in the data processing device provided by at least one embodiment of the present disclosure, the input of at least one operator in the first part of the program is set as the start address pointer of the data arrangement.

Brief description of the drawings

In order to illustrate the technical solutions of the embodiments of the present disclosure more clearly, the accompanying drawings of the embodiments are briefly introduced below. The accompanying drawings in the following description relate only to some embodiments of the present disclosure and do not limit the present disclosure.

Fig. 1 shows a schematic flowchart of a data processing method for big data provided by at least one embodiment of the present disclosure;

Fig. 2 shows an exemplary schematic diagram of an optimization operation performed on the first part of the program provided by at least one embodiment of the present disclosure;

Fig. 3 shows an exemplary schematic diagram of a directed acyclic graph program provided by at least one embodiment of the present disclosure;

FIG. 4A shows an exemplary schematic diagram of a virtual machine big data processing system (Apache Spark) provided by at least one embodiment of the present disclosure;

Fig. 4B shows an exemplary schematic diagram of a native system big data processing system provided by at least one embodiment of the present disclosure;

Fig. 5 shows a schematic diagram of an example workflow of a data processing method provided by at least one embodiment of the present disclosure;

Fig. 6 shows a schematic block diagram of a data processing device for big data provided by at least one embodiment of the present disclosure.

Detailed description

In order to make the purpose, technical solutions and advantages of the embodiments of the present disclosure clearer, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below in conjunction with the accompanying drawings of the embodiments of the present disclosure. Apparently, the described embodiments are some of the embodiments of the present disclosure, not all of them. Based on the described embodiments of the present disclosure, all other embodiments obtained by persons of ordinary skill in the art without creative effort fall within the protection scope of the present disclosure.

Unless otherwise defined, the technical terms or scientific terms used in the present disclosure shall have the usual meanings understood by those skilled in the art to which the present disclosure belongs. "First", "second" and similar words used in the present disclosure do not indicate any order, quantity or importance, but are only used to distinguish different components. Likewise, words like "a", "an" or "the" do not denote a limitation of quantity, but mean that there is at least one. "Comprising", "including" and similar words mean that the elements or items appearing before the word cover the elements or items listed after the word and their equivalents, without excluding other elements or items. Words such as "connected" or "coupled" are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "Up", "down", "left", "right" and so on are only used to indicate relative positional relationships. When the absolute position of the described object changes, the relative positional relationship may also change accordingly.

Spark has clear strengths: in-memory distributed datasets, support for interactive queries, and optimization of iterative workloads. Nevertheless, recent work has shown that its performance still leaves considerable room for improvement. Spark programs run on the JVM (Java virtual machine), so a program must first be translated into code that the JVM executes indirectly. Big data analysis frameworks built with the C++ programming language have a large performance advantage over Spark, because code written in C++ is translated into code that the machine executes directly. For example, Thrill, a C++-based big data analysis framework, achieves an average speedup of 3.26 times over Spark on typical big data workloads. Additionally, switching from Java to C yields a 4.4x speedup for a Java matrix multiplication kernel, and vectorization together with the AVX intrinsics provided by the C compiler yields a 9.45x performance boost.

However, performance is only one aspect of big data processing. Although C++-based big data analysis frameworks outperform Spark, they lack many key functions that Spark provides for big data processing, such as lineage-based elasticity. Big data analysis usually uses distributed computing, that is, it runs in a multi-tenant cluster built on commodity hardware, where task failures caused by machine failures, network jitter, and preemptive scheduling are very common, which makes checkpointing an inefficient way of handling these frequent failures. Spark's lineage-based fault-tolerance mechanism allows only part of the data to be recomputed rather than all of it. Spark's elasticity also supports other functions, such as load balancing, straggler mitigation, and auto-scaling, thereby improving the resource utilization of the cluster.

In addition, Spark's ecosystem, such as a performance analyzer with a Web UI and integration with various resource managers, makes it easy to deploy, monitor, and analyze applications on various private or public clouds. Thrill has a native RDD (Resilient Distributed Dataset)-like abstraction called DIA, but its tight coupling of data distribution to physical machines makes elasticity ineffective. Husky uses an upstream message logging fault-tolerance mechanism, which incurs non-negligible overhead even when no failure occurs. Compared with Spark, these C++-based big data analysis frameworks lack many basic functions.

Clearly, a full-featured native big data framework needs to be designed. A straightforward solution is to reimplement the functionality provided by Spark in a native programming language (such as C++). This is theoretically possible, but may be too expensive and is unnecessary: the core component of Spark 3.0.1 has 74K lines of code, of which only 9K lines are directly related to the programming framework (including the RDD application programming interface and the operator implementations), while the rest are components that serve various big data functions. Therefore, it is possible to build a big data framework that reuses Spark's mature big data functions without reimplementing them.

However, reusing Spark's big data capabilities creates several challenges. First, existing native big data frameworks (e.g., Thrill, Husky, etc.) are not compatible with Spark's execution model, making it infeasible to integrate them into Spark. For example, Thrill couples each dataset partition to a specific machine, and Husky relies on stateful task execution, both of which violate Spark's assumptions of dynamic scheduling and stateless tasks. Second, fine-grained interaction between the JVM and the native world, whether through JNI (Java Native Interface) or JNA (Java Native Access), incurs high overhead and can become a new performance bottleneck.

In this data processing method, the program for processing big data is divided into a first part of the program and a second part of the program, which are processed separately. This not only reuses important functions of a predetermined virtual machine big data processing system (for example, Apache Spark), but also improves the performance of big data processing. The first part of the program includes the functional implementation of the first operator and can be run by the native system without calling the functions of the predetermined virtual machine big data processing system, so it is processed faster. The second part of the program calls the predetermined virtual machine big data processing system, so it can readily use the existing functions of that system. Therefore, the data processing method according to the embodiment of the present disclosure can not only process big data at high speed, but also provide full-featured big data processing by making full use of the functions of the existing big data processing system.

The native system is a concept defined relative to the virtual machine. It refers to the operating system inherent to the computer, such as a Linux or Windows system installed on the computer itself. A native programming language is a programming language used on a native system, such as the C++ programming language. For example, in some embodiments of the present disclosure, the main program is written in the C++ programming language, the main program is compiled to form a loadable module in the form of object code (such as machine code), and the loadable module is then loaded into a virtual machine running in the native system, for example into a JVM running on a Linux system.

It should be noted that although some embodiments of the present disclosure use Apache Spark as an example of a predetermined virtual machine big data processing system for illustration, the present disclosure is not limited to be based on Apache Spark. Those skilled in the art can understand that other virtual machine big data processing systems also have similar technical problems, and the technical solutions of the embodiments of the present disclosure are also applicable to other virtual machine big data processing systems. In this disclosure, the virtual machine big data processing system (framework) refers to a big data processing system running on a virtual machine (such as JVM), including an engine and a programming framework. The big data processing system provides various functions of big data processing, for example, the functions of Apache Spark as described above.

At least one embodiment of the present disclosure further provides a data processing device corresponding to the above data processing method.

Embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings, but the present disclosure is not limited to these specific embodiments.

Fig. 1 shows a schematic flowchart of a data processing method for big data provided by at least one embodiment of the present disclosure.

As shown in FIG. 1, the data processing method includes the following steps S101-S104.

Step S101: Obtain a main program written in a native programming language.

The user writes the main program in a native programming language such as the C++ programming language, and the processing device obtains the main program written by the user.

Step S102: Compile the main program in the native system and generate a loadable module, wherein the loadable module includes a first part of the program and a second part of the program, the first part of the program includes the functional implementation of the first operator, and the second part of the program includes a function pointer for calling the function corresponding to the second operator.

In step S102, the main program is compiled into a module in the form of object code (for example, machine code) that can be loaded into a virtual machine, and is therefore called a loadable module. The main program generally includes calls to functions provided by the system. At compile time, the main program is compiled together with the function libraries or data sets provided by the system, which analyze the called functions and process them accordingly. According to an embodiment of the present disclosure, the compiled loadable module may include a first part of the program and a second part of the program, which are processed differently in subsequent steps. In the embodiments of the present disclosure, the "first part of the program" is also called the "compile-time part of the program", and the "second part of the program" is also called the "run-time part of the program". The two parts differ in whether the part includes the specific functional implementation of its operators. A part that includes the specific functional implementation of an operator is the first part of the program, and the corresponding operator is called a first operator; otherwise it is the second part of the program, and the corresponding operator is called a second operator. The second part of the program does not include the specific functional implementation of an operator, but instead includes a function pointer for calling the function corresponding to the operator. These second operators are realized in a subsequent step by assigning the address of the implementing function to the function pointer (described in detail below).

The first and second operators can be divided according to operator category: under a predetermined division strategy, certain categories of operators are classified as first operators and other categories as second operators. The division strategy can be determined according to the specific application. For example, operators with relatively simple functions can be classified as first operators and operators with relatively complex functions as second operators, so that the complex second operators reuse the functions of the virtual machine big data processing system. As another example, in a data processing method using distributed computing, local operators may be classified as first operators and global operators may be classified as second operators. In other words, in some embodiments of the present disclosure, the data processing method adopts distributed computing, the first operator is a local operator, and the second operator is a global operator.

Local operators are operators that need not take into account coordination between different computing nodes in a distributed computation, for example, the mapValues operator, the filter operator, the flatMap operator, and hash operators (such as Hash Aggregate and Build Hash).

Global operators are operators that must take into account coordination between different computing nodes in a distributed computation, for example, partition-pruning operators, operators with more than one dependency (such as the union operator, the zip operator (for example, Zip Partitions), and the cartesian operator), shuffle operators (such as Shuffle Write and Shuffle Read), cache operators, and data-source operators.

Because a global operator must take into account coordination between different computing nodes, it is more complicated to implement. In the embodiments of the present disclosure, global operators are therefore classified as second operators so that the global operator functions of an existing virtual machine big data processing system (such as Spark) can be reused.

For example, partition-pruning operators and operators with more than one dependency can provide lineage-related information to Spark and can reuse Spark's lineage-based fault-tolerance mechanism.

For example, the shuffle operator can reuse Spark's fault-tolerant data shuffling mechanism, a mechanism that would be not only difficult to reimplement but also cumbersome to deploy.

For example, caching operators can reuse Spark's intermediate data management mechanisms. Data source operators can provide location information to Spark and reuse Spark's location-aware task scheduling.

By classifying local operators as first operators and global operators as second operators, the second part of the program, which contains the second operators, can be handed over in a subsequent step to a virtual machine big data processing system such as Spark for processing. The complex functions do not need to be reimplemented, which avoids unnecessary engineering effort.
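The division strategy described above can be illustrated with a minimal C++ sketch. The enum, the function name, and the operator name strings are hypothetical and chosen only for illustration; the sketch also assumes that any operator not listed as global is treated as local, which the disclosure does not state.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>

// Hypothetical labels for the two program parts described in the text.
enum class ProgramPart {
    First,   // implementation compiled in, run by the native system
    Second   // function pointer only, delegated to the VM big data system
};

// Illustrative division strategy: operators the text lists as global are
// assigned to the second part; everything else is assumed to be local.
inline ProgramPart classifyOperator(const std::string& name) {
    static const std::unordered_set<std::string> globalOps = {
        "partitionPruning", "union", "zipPartitions", "cartesian",
        "shuffleWrite", "shuffleRead", "cache", "dataSource"};
    assert(!name.empty());
    return globalOps.count(name) != 0 ? ProgramPart::Second
                                      : ProgramPart::First;
}
```

Under these assumptions, `classifyOperator("mapValues")` yields `ProgramPart::First` and `classifyOperator("union")` yields `ProgramPart::Second`.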

In some embodiments of the present disclosure, compiling the main program in the native system in step S102 may include: optimizing the first part of the program to reduce the interaction overhead between the native system and the virtual machine.

The integration of native system programs into the virtual machine big data processing system will result in interaction overhead between the native system and the virtual machine. According to some embodiments of the present disclosure, these overheads can be reduced by optimizing the first part of the program.

FIG. 2 shows an exemplary schematic diagram of an optimization operation on the first part of the program. The two mapValues operators in the program "contribs.reduceByKey(_+_).mapValues(v=>0.85*v).mapValues(v=>0.15+v).join(links)" are taken as an example for illustration. The native system and the virtual machine interact through a JNA Pointer.

(a) of FIG. 2 shows the case where no optimization is performed. In (a) of FIG. 2, each mapValues transformation is directly mapped to a dataset in the virtual machine (e.g., a Spark RDD), and each value is mapped to a virtual machine program object (e.g., a Java object). This method is simple, but brings a lot of overhead. Processing even a simple value such as a (long, double) pair (that is, a long integer paired with a double-precision value) requires serialization/deserialization in both the virtual machine and the native system as well as the creation of a virtual machine program object. The resulting interaction overhead is large and may offset the performance advantage of the native system.

In some embodiments of the present disclosure, optimizing the first part of the program may include: fusing a series of operation steps in the first part of the program into one operation step. This optimization operation may be called an operator fusion operation, as shown in (b) of FIG. 2 .

In (b) of FIG. 2, two consecutive mapValues operation steps (.mapValues(v=>0.85*v) and .mapValues(v=>0.15+v)) are fused into one mapValues operation step (.mapValues(v=>0.15+0.85*v)), which reduces the number of serialization/deserialization operations, thereby reducing the interaction overhead between the native system and the virtual machine.
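Operator fusion of this kind amounts to function composition. The following is a minimal C++ sketch of the idea, not the disclosed implementation; `fuse`, `scale`, `shift`, and `mapValuesFused` are hypothetical names, and the values are modeled as plain doubles.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Compose two element-wise functions into a single function, so that each
// value is visited once instead of once per operator.
template <typename F, typename G>
auto fuse(F first, G second) {
    return [first = std::move(first), second = std::move(second)](double v) {
        return second(first(v));
    };
}

// The two mapValues steps from the example in the text.
inline double scale(double v) { return 0.85 * v; }
inline double shift(double v) { return 0.15 + v; }

// Applies the fused step to every value in a single pass.
inline void mapValuesFused(double* values, std::size_t count) {
    assert(values != nullptr || count == 0);
    const auto fused = fuse(scale, shift);  // computes 0.15 + 0.85 * v
    for (std::size_t i = 0; i < count; ++i) {
        values[i] = fused(values[i]);
    }
}
```

Composing the functions before they are applied means that only one operator, rather than two, has to exchange values across the boundary between the virtual machine and the native system.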

In some embodiments of the present disclosure, performing an optimization operation on the first part of the program may include: setting at least one operator in the first part of the program to batch process multiple data each time. This optimization operation may be referred to as a vectorization operation, as shown in (c) of FIG. 2 .

In (c) of FIG. 2, the mapValues operator, which originally processed data element by element in the first part of the program, is set to batch process multiple data items per call. That is, where one (long, double) pair was previously processed per call, one array of (long, double) pairs is now processed per call. The array (denoted R) contains multiple (long, double) pairs, each of which is a key-value pair ((k, v), where k is the key and v is the value). This reduces the number of calls from the virtual machine into the native system and the number of serialization/deserialization operations, thereby reducing the interaction overhead between the native system and the virtual machine.
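The difference between element-wise and batch processing can be sketched in C++ as follows. The struct layout and function names are hypothetical; the sketch assumes the key is a 64-bit integer and the value a double, which the text suggests but does not specify.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical record layout for the key-value pairs (k, v) in the example.
struct LongDoublePair {
    std::int64_t key;
    double value;
};

// Element-wise form: one call per pair.
inline void mapValueOne(LongDoublePair& p) {
    p.value = 0.15 + 0.85 * p.value;
}

// Vectorized form: one call processes a whole batch R of pairs, so the
// caller crosses the boundary once per batch instead of once per element.
inline void mapValuesBatch(LongDoublePair* batch, std::size_t count) {
    assert(batch != nullptr || count == 0);
    for (std::size_t i = 0; i < count; ++i) {
        mapValueOne(batch[i]);
    }
}
```

With a batch of n pairs, the number of calls into the native side drops from n to 1, while the keys are left untouched, as expected of a mapValues operator.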

In some embodiments of the present disclosure, optimizing the first part of the program may include: for the data to be processed by the first part of the program, converting data of the same data type into a data arrangement continuously stored in the memory. This optimization operation may be referred to as a data arrangement operation.

Storing bulk data contiguously in memory (as a buffer) helps avoid memory fragmentation caused by the allocation of small objects and eliminates the need for runtime compaction by the garbage collector. In addition, the data arrangement increases data locality and provides regular memory access patterns, which can improve CPU efficiency. For example, in some embodiments of the present disclosure, the input of at least one operator in the first part of the program may be set as the start address pointer of the data arrangement. Data of the same type is often processed by one or several common operators. Therefore, when the data of these operators is stored contiguously in memory, it is not necessary to provide each data item or its address pointer; providing the address pointer of the first data item (that is, the start address pointer of the data arrangement) is enough to reach all of the contiguously stored data. In this case, the virtual machine and the native system do not need to exchange each piece of data to be processed, but only the start address pointer of the data arrangement, which greatly reduces the interaction overhead between the two. (d) of FIG. 2 shows the pointer passing between the virtual machine and the native system. In (d) of FIG. 2, this data arrangement is called "CompactArray", which is an array of (long, double) pairs stored in contiguous memory. Only address pointers are passed between the virtual machine and the native system, without data serialization/deserialization.
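A minimal C++ sketch of this pointer-passing idea follows. The class below is a hypothetical stand-in that borrows the name "CompactArray" from FIG. 2; its interface, the pair layout, and the `sumValues` operator are assumptions for illustration and are not taken from the disclosure.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical pair layout (64-bit key, double value).
struct LongDoublePair {
    std::int64_t key;
    double value;
};
static_assert(std::is_trivially_copyable<LongDoublePair>::value,
              "elements must be plain bytes to be shared without serialization");

// Hypothetical stand-in for the "CompactArray" of FIG. 2: elements of one
// type stored back to back in a single contiguous buffer.
class CompactArray {
public:
    explicit CompactArray(std::vector<LongDoublePair> elems)
        : elems_(std::move(elems)) {}
    // Start address of the data arrangement; this is all that is handed over.
    const void* startAddress() const { return elems_.data(); }
    std::size_t size() const { return elems_.size(); }

private:
    std::vector<LongDoublePair> elems_;
};

// Native operator whose input is only the start address and element count.
// extern "C" gives it an unmangled name, as a foreign caller would need.
extern "C" double sumValues(const void* start, std::size_t count) {
    assert(start != nullptr || count == 0);
    const auto* pairs = static_cast<const LongDoublePair*>(start);
    double total = 0.0;
    for (std::size_t i = 0; i < count; ++i) {
        total += pairs[i].value;
    }
    return total;
}
```

In this sketch the caller passes `startAddress()` and `size()` rather than the elements themselves, so no per-element copy or serialization is involved in the call.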

In some embodiments of the present disclosure, various templates are provided to generate data arrangements for various data types. For example, a flat array (Flat Array) places fixed-length elements in a buffer. A bitmap array (Bitmap Array) places Boolean elements in a buffer. An array of arrays (Array Array) and a string array (String Array) place array elements in a buffer. A nullable array (Nullable Array) places nullable elements in a buffer. A tuple array (Tuple Array) places tuple elements in a buffer.
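As a rough illustration of how such templates might look in C++, the sketch below covers three of the listed arrangements. All class names and interfaces are hypothetical, and building the nullable array from a validity bitmap plus a flat array is an assumed design, not one stated in the disclosure.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <type_traits>
#include <vector>

// Flat array: fixed-length elements stored back to back in one buffer.
template <typename T>
class FlatArray {
    static_assert(std::is_trivially_copyable<T>::value,
                  "flat arrays hold fixed-length, plain-data elements");
public:
    void push(const T& v) { buf_.push_back(v); }
    const T& at(std::size_t i) const { assert(i < buf_.size()); return buf_[i]; }
    std::size_t size() const { return buf_.size(); }
    const T* data() const { return buf_.data(); }
private:
    std::vector<T> buf_;
};

// Bitmap array: Boolean elements packed one bit each.
class BitmapArray {
public:
    void push(bool b) {
        if (size_ % 8 == 0) bits_.push_back(0);
        if (b) bits_[size_ / 8] |= static_cast<std::uint8_t>(1u << (size_ % 8));
        ++size_;
    }
    bool at(std::size_t i) const {
        assert(i < size_);
        return ((bits_[i / 8] >> (i % 8)) & 1u) != 0;
    }
    std::size_t size() const { return size_; }
private:
    std::vector<std::uint8_t> bits_;
    std::size_t size_ = 0;
};

// Nullable array (assumed design): a validity bitmap beside a flat array.
template <typename T>
class NullableArray {
public:
    void push(const T& v) { valid_.push(true); values_.push(v); }
    void pushNull() { valid_.push(false); values_.push(T{}); }
    bool isNull(std::size_t i) const { return !valid_.at(i); }
    const T& at(std::size_t i) const { assert(!isNull(i)); return values_.at(i); }
    std::size_t size() const { return values_.size(); }
private:
    BitmapArray valid_;
    FlatArray<T> values_;
};
```

Each arrangement keeps its elements in contiguous storage, so an operator can be given a start address instead of individual objects.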

The above-mentioned optimization operation of the first part of the program reduces the interaction overhead between the native system and the virtual machine, thereby improving the performance of big data processing. It should be noted that the present disclosure is not limited to the above-mentioned optimization operations, and is also applicable to other optimization operations that can reduce the interaction overhead between the native system and the virtual machine.

Step S103: Load the loadable module by the virtual machine running in the native system, and convert the second part of the program into virtual machine code supported by the predetermined virtual machine big data processing system, wherein the virtual machine code calls the functional implementations provided by the predetermined virtual machine big data processing system.

The main program is compiled to form a loadable module in the form of object code, and the loadable module is loaded into a virtual machine running on the native system, such as a JVM running on a Linux system. The loading process can be performed by using an interface between the virtual machine and the native system, for example, by using JNA.

After the loading in step S103, the virtual machine also converts the second part of the program into virtual machine code supported by the predetermined virtual machine big data processing system. The second part of the program usually involves relatively complex functionality, so it calls the functions that the predetermined virtual machine big data processing system (such as Spark) already provides on the virtual machine instead of implementing them again. Therefore, while it is being loaded into the virtual machine, the second part of the program is converted by the virtual machine into code supported by the predetermined virtual machine big data processing system (such as Spark). For example, a function pointer in the second part of the program is converted into a pointer address that can actually call a function in the predetermined virtual machine big data processing system, and the called functions implement the functionality of the second operator in the second part of the program. The virtual machine code calls the functional modules of the predetermined virtual machine big data processing system (for example, the processing module of the corresponding operator), which avoids programming these functional modules again. It should be noted that "predetermined" here means determined in advance, that is, a target processing system selected in advance for code conversion and invocation.
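The binding of function pointers at load time can be pictured with the following simplified C++ sketch. It is only an in-process analogy: the table of functions stands in for the functions exposed by the virtual machine big data processing system, and all names (SecondPartOperator, FunctionTable, bind_second_part, invoke) are hypothetical rather than taken from the disclosure or from Spark.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

using Rows = std::vector<int>;
using OperatorFn = std::function<Rows(const Rows&)>;

// A second-part operator as compiled: it names the function it needs but
// carries no implementation until the module is loaded.
struct SecondPartOperator {
    std::string function_name;
    OperatorFn bound;  // empty until bind_second_part fills it in
};

// Stand-in for the functions provided by the virtual machine system.
using FunctionTable = std::map<std::string, OperatorFn>;

// Load-time step: resolve every named function to a callable target.
void bind_second_part(std::vector<SecondPartOperator>& ops,
                      const FunctionTable& table) {
    for (auto& op : ops) {
        auto it = table.find(op.function_name);
        if (it == table.end()) {
            throw std::runtime_error("unresolved operator: " + op.function_name);
        }
        op.bound = it->second;
    }
}

// Run-time step: the call goes to whatever implementation was bound.
Rows invoke(const SecondPartOperator& op, const Rows& input) {
    assert(op.bound);  // must have been bound during loading
    return op.bound(input);
}
```

The point of the sketch is that the compiled module never contains the operator body; it only holds a slot that is filled in once the target system is known.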

Step S104: The main program contained in the loadable module is run by the virtual machine, wherein, during the running of the main program, the virtual machine constructs a DAG program corresponding to the loadable module and then runs the DAG program, the first part of the program is handed over to the native system to run, and the converted second part of the program is handed over to the predetermined virtual machine big data processing system to run.

A directed acyclic graph (DAG) is a data structure from graph theory. If, in a directed graph, it is impossible to start from any vertex and return to that vertex by following a sequence of edges, the graph is a directed acyclic graph. A DAG program is a program represented as a DAG. A DAG program for big data processing includes a series of operators that process the big data. FIG. 3 shows an example of a directed acyclic graph program. The first line in FIG. 3 is data processing code excerpted from an application (that is, a part of the main program written by the user). The DAG program corresponding to this code is located below the code and includes multiple operators executed in sequence: Hash Aggregate (hash aggregation), Shuffle Write (shuffle write), Shuffle Read (shuffle read), Map Values (map values), Build Hash (build hash), Zip Partitions (zip partitions) and Probe Hash (probe hash). These operators process the data to be processed accordingly. As shown in the DAG program of FIG. 3, vertices represent operators, and edges with arrows represent dependencies between two operators. The DAG program in step S104 can be constructed by converting the code written by the user. For example, the code "reduceByKey" in FIG. 3 can be constructed as Hash Aggregate, Shuffle Write, Shuffle Read and Hash Aggregate executed sequentially, and the code "join" can be constructed as Build Hash, Zip Partitions and Probe Hash executed sequentially. These operators include first operators and second operators, which correspond to the first part of the program and the second part of the program, respectively.
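The construction of a DAG program from user code can be sketched in C++ as follows. The expansion table follows the examples given for "reduceByKey" and "join"; the Dag structure, the purely linear chaining of operators and the function names are simplifying assumptions for illustration, and the real graph in FIG. 3 may branch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Vertices are operators; an edge (a, b) means operator b depends on a.
struct Dag {
    std::vector<std::string> ops;
    std::vector<std::pair<std::size_t, std::size_t>> edges;
};

// Expand one user-level call into the operators it is built from.
std::vector<std::string> expand(const std::string& user_call) {
    if (user_call == "reduceByKey") {
        return {"HashAggregate", "ShuffleWrite", "ShuffleRead", "HashAggregate"};
    }
    if (user_call == "join") {
        return {"BuildHash", "ZipPartitions", "ProbeHash"};
    }
    if (user_call == "mapValues") {
        return {"MapValues"};
    }
    return {user_call};  // unknown calls are kept as a single vertex
}

// Simplification: operators are chained linearly in program order.
Dag build_dag(const std::vector<std::string>& user_calls) {
    Dag dag;
    for (const auto& call : user_calls) {
        for (const auto& op : expand(call)) {
            if (!dag.ops.empty()) {
                dag.edges.emplace_back(dag.ops.size() - 1, dag.ops.size());
            }
            dag.ops.push_back(op);
        }
    }
    return dag;
}

// Kahn's algorithm. The result covers every vertex only if the graph is acyclic.
std::vector<std::size_t> topological_order(const Dag& dag) {
    std::vector<std::size_t> indegree(dag.ops.size(), 0);
    for (const auto& e : dag.edges) {
        assert(e.first < dag.ops.size() && e.second < dag.ops.size());
        ++indegree[e.second];
    }
    std::vector<std::size_t> ready;
    for (std::size_t i = 0; i < indegree.size(); ++i) {
        if (indegree[i] == 0) ready.push_back(i);
    }
    std::vector<std::size_t> order;
    while (!ready.empty()) {
        const std::size_t v = ready.back();
        ready.pop_back();
        order.push_back(v);
        for (const auto& e : dag.edges) {
            if (e.first == v && --indegree[e.second] == 0) ready.push_back(e.second);
        }
    }
    return order;
}
```

If topological_order returns fewer indices than there are operators, the graph contains a cycle and is therefore not a valid DAG program.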

Step S104 also illustrates different processing methods for the first part of the program and the second part of the program.

For the first part of the program, the code compiled in the native system includes the function implementations of the operators, so it can be run by the native system through the machine code execution mechanism of the virtual machine (e.g., JNI, JNA). When this part of the program runs in the native system, the functions of the operators are carried out. For the second part of the program, the code compiled in the native system does not contain the function implementations of the operators; instead, it has been converted into virtual machine code supported by the predetermined virtual machine big data processing system, so it can be executed by the predetermined virtual machine big data processing system residing in the virtual machine. When the virtual machine executes step S104, the engine of the predetermined virtual machine big data processing system can be reused, so as to provide functions such as distributed execution, elasticity, straggler mitigation, and monitoring.
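The two execution paths can be summarized in a small C++ sketch. Everything here runs in one process, so the "virtual machine" path is only simulated by an ordinary callable; the Backend tag, the Trace record and run_in_order are hypothetical names introduced for the example.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

using Rows = std::vector<int>;

// Which side runs an operator: the native system (first part of the
// program) or the virtual machine big data system (second part).
enum class Backend { Native, VirtualMachine };

struct Operator {
    std::string name;
    Backend backend;
    std::function<Rows(const Rows&)> run;
};

// Records where each operator was dispatched, for inspection.
struct Trace {
    std::vector<std::string> native_ops;
    std::vector<std::string> vm_ops;
};

// Run the operators in dependency order, handing each one to its backend.
Rows run_in_order(const std::vector<Operator>& ops, Rows data, Trace& trace) {
    for (const auto& op : ops) {
        assert(op.run);
        if (op.backend == Backend::Native) {
            trace.native_ops.push_back(op.name);
        } else {
            trace.vm_ops.push_back(op.name);
        }
        data = op.run(data);
    }
    return data;
}
```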

As mentioned above, the predetermined virtual machine big data processing system is a big data processing system that includes an execution engine and a programming framework; for example, the predetermined virtual machine big data processing system is Apache Spark. The predetermined virtual machine big data processing system is introduced below with reference to FIG. 4A, taking Apache Spark as an example.

Fig. 4A shows an exemplary schematic diagram of a virtual machine big data processing system (Apache Spark) provided by at least one embodiment of the present disclosure.

As shown in FIG. 4A, the predetermined virtual machine big data processing system includes a programming framework (RDD) and a Spark engine. The programming framework includes a Spark dataset representation, and the Spark dataset representation can provide users with the Spark RDD application programming interface (API).

A Spark dataset representation contains a collection of partitions, a collection of dependencies, a computation definition, and metadata about dataset distribution and data placement. The Spark RDD API provided by the Spark dataset representation can be used to build a DAG program.

The Spark engine provides support and management for various functions of Spark, such as distributed execution, elasticity, straggler mitigation, monitoring, etc. The Spark engine includes a resource manager (such as YARN, AWS), which can provide unified resource management and scheduling for upper-layer applications.

Apache Spark residing on the virtual machine can analyze and execute the converted code of the second part of the program (that is, the code conforming to the Spark RDD application program interface) through its programming framework and Spark engine.

Based on the data processing method of the embodiments of the present disclosure, the embodiments of the present disclosure also provide a native system big data processing system, the structure of which is similar to that of the above-mentioned virtual machine big data processing system.

Fig. 4B shows an exemplary schematic diagram of a native system big data processing system provided by at least one embodiment of the present disclosure. The big data processing system in the embodiment of the present disclosure includes a programming framework and an engine. The programming framework includes a dataset representation, and the dataset representation can provide an application programming interface (API).

The application programming interface of the native system big data processing system of the embodiments of the present disclosure can be designed as an RDD-style interface similar to the Spark RDD API; for example, the names and semantics of its operators can be the same as those of the Spark operators, for ease of use by Spark users. Of course, it can also be designed to differ from the Spark RDD API, as long as it can be called from user programs.

The dataset representation of the native system big data processing system in the embodiments of the present disclosure can contain the function implementation of the first operator, the function pointer for calling the function corresponding to the second operator, and the optimization operations for the first part of the program. The part of the dataset representation related to the second part of the program can reuse the Spark dataset representation, that is, it contains function pointers for calling the functions corresponding to the operators in the second part of the program. The part of the dataset representation related to the first part of the program may contain the function implementations of the operators in the first part of the program. The operators in the dataset representation can be in the form of code templates, so that the optimization of the first part of the program can be realized by, for example, C++ template metaprogramming.
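One way operators in the form of code templates could enable such optimization is compile-time operator fusion, sketched below in C++14. This is an illustrative assumption rather than the disclosed implementation: Fused, fuse and map_buffer are hypothetical names, and the sketch only fuses element-wise operators.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Fused<F, G> applies F and then G to one element. Because both types are
// template parameters, the compiler can inline the two steps into one.
template <typename F, typename G>
struct Fused {
    F first;
    G second;
    template <typename T>
    auto operator()(T&& x) const {
        return second(first(std::forward<T>(x)));
    }
};

template <typename F, typename G>
Fused<F, G> fuse(F f, G g) {
    return Fused<F, G>{std::move(f), std::move(g)};
}

// Apply an element-wise operator over a contiguous buffer in one pass.
template <typename Op>
void map_buffer(const double* in, double* out, std::size_t n, Op op) {
    assert(n == 0 || (in != nullptr && out != nullptr));
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = op(in[i]);
    }
}
```

Passing fuse(f, g) to map_buffer traverses the buffer once and produces no intermediate array, whereas applying f and g in two separate passes would traverse it twice.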

The engine of the native system big data processing system in the embodiments of the present disclosure may be an encapsulation of the engine of the virtual machine big data processing system (such as Spark), so as to reuse the big data functions of Spark, while also including additional functional units for integrating native code into the virtual machine. The engine is used to support and manage the various functions of the native system big data processing system. Since the engine of the native system big data processing system is an encapsulation of an existing virtual machine big data processing system engine, there is no need to reconfigure the existing cluster resource manager or to recompile an existing Spark. The engine of the native system big data processing system of the present disclosure borrows most of the functions of the Spark engine, but enhances the Spark engine to support efficient Spark integration. For example, the engine includes a library in the native system that provides C++ bindings for Spark (called CppSpark). CppSpark is used to call functions of the virtual machine from the native system. When the second part of the program is converted, the C++ programming interface provided by CppSpark is called. The role of CppSpark is to forward calls from the native system to the virtual machine, so that the conversion of the second part of the program is carried out in the virtual machine, that is, the second part of the program is converted into virtual machine code supported by the predetermined virtual machine big data processing system.

The engine of the native system big data processing system in the embodiments of the present disclosure may also include a driver in the virtual machine system. The driver is an application of the virtual machine big data processing system (such as Spark), usually implemented in a virtual machine programming language (such as Java), that loads the compiled program code (i.e., the "loadable module") from the native system, converts the second part of the program, and executes the loadable module. For example, in Spark, the driver can be submitted to a Spark-compatible cluster just like a normal Spark application. After a successful submission, the driver loads the loadable module from the native system, registers the engine implementation with the loadable module, and starts the main program contained in the loadable module. In addition, in order to support remote execution, the driver can also instruct newly created executors to prepare the environment, for example by downloading the above-mentioned loadable module.
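The order of the driver's actions (load the module, register the engine implementation, start the main program) can be sketched as follows. The sketch is written in C++ for consistency with the other examples, even though the text notes that the driver is usually written in a virtual machine language such as Java; the module is an in-memory stand-in rather than a real shared library, and Engine, LoadableModule and run_driver are hypothetical names.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical engine interface that the driver hands to the module.
struct Engine {
    std::function<void(const std::string&)> run_operator;
};

// Hypothetical in-memory stand-in for a loaded module: a slot for the
// registered engine plus the entry point of the user's main program.
struct LoadableModule {
    const Engine* engine = nullptr;
    std::function<int(const LoadableModule&)> main_program;
};

// Driver steps after loading: register the engine, then start main.
int run_driver(LoadableModule& loaded, const Engine& engine) {
    assert(loaded.main_program);
    loaded.engine = &engine;             // register the engine implementation
    return loaded.main_program(loaded);  // start the main program
}
```

Registering the engine before starting the main program means that operator calls made by the main program always have an implementation to forward to.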

The following briefly describes an example workflow of the data processing method provided by at least one embodiment of the present disclosure with reference to the native system big data processing system in FIG. 4B , and the workflow is shown in FIG. 5 .

As shown in FIG. 5, in the native system, the user writes the main program through the API. The native system compiles the main program in combination with the dataset representation and generates a loadable module, wherein the loadable module includes the first part of the program and the second part of the program, the first part of the program includes the function implementation of the first operator, and the second part of the program includes a function pointer for calling the function corresponding to the second operator. The driver in the virtual machine loads the loadable module and, during the loading process, uses the library that provides the C++ bindings for Spark to convert the second part of the program into virtual machine code supported by the predetermined virtual machine big data processing system, registers the engine implementation with the loadable module, and starts the main program contained in the loadable module. During the running of the main program, the driver builds the DAG program corresponding to the loadable module, then runs the DAG program, and hands the first part of the program to the native system to run. The driver can distribute the converted second part of the program to cluster executors (indicated by "Cloud" in the figure) through a cluster resource manager (such as YARN) for distributed computing.

FIG. 6 shows a schematic block diagram of a data processing apparatus 600 for big data provided by at least one embodiment of the present disclosure. The data processing apparatus 600 can be used to execute the data processing method shown in FIG. 1 .

As shown in FIG. 6 , the data processing device 600 includes a program acquiring unit 601 , a program compiling unit 602 , a loading and converting unit 603 and an operating unit 604 .

The program obtaining unit 601 is configured to obtain a main program written in a native programming language.

The program compiling unit 602 is configured to compile the main program in the native system and generate a loadable module, wherein the loadable module includes a first part of the program and a second part of the program, the first part of the program includes the function implementation of the first operator, and the second part of the program includes a function pointer for calling a function corresponding to the second operator.

The loading and conversion unit 603 is configured to load the loadable module by the virtual machine running in the native system, and convert the second part of the program into virtual machine code supported by the predetermined virtual machine big data processing system, wherein the virtual machine code calls the function implementations of the functions of the predetermined virtual machine big data processing system.

The running unit 604 is configured to run the main program contained in the loadable module by the virtual machine, wherein, during the running of the main program, the virtual machine builds a directed acyclic graph program corresponding to the loadable module, and then runs the directed acyclic graph program, the first part of the program is run by the native system, and the converted second part of the program is run by the predetermined virtual machine big data processing system.

For example, the data processing method of the data processing apparatus 600 adopts distributed computing, the first operator is a local type operator, and the second operator is a global type operator.

For example, in at least one embodiment, the data processing apparatus 600 may further include an optimization unit 605 . The optimization unit 605 is configured to perform an optimization operation on the first part of the program to reduce the interaction overhead between the native system and the virtual machine.

For example, in at least one embodiment, the optimization unit 605 is further configured to fuse a series of operation steps in the first partial program into one operation step.

For example, in at least one embodiment, the optimization unit 605 is further configured to set at least one operator in the first part of the program to batch process multiple data each time.

For example, in at least one embodiment, the optimization unit 605 is further configured to convert data of the same data type into a data arrangement continuously stored in memory for the data to be processed by the first part of the program.

For example, in at least one embodiment, the optimization unit 605 is further configured to set the input of at least one operator in the first part of the program as the starting address pointer of the data arrangement.

For example, the data processing device 600 may be implemented by using hardware, software, firmware and any feasible combination thereof, which is not limited in the present disclosure.

The above descriptions about the data processing method are also applicable to the data processing device 600 , and will not be repeated here.

Embodiments of the present disclosure further provide a computer program product, including program codes, and the program codes execute the data processing method according to the embodiments of the present disclosure when executed by a processor.

Embodiments according to the present disclosure also provide a computer-readable medium on which program codes are stored, and the program codes execute the data processing method according to the embodiments of the present disclosure when executed by a processor.

For this disclosure, the following points need to be explained:

(1) The drawings of the embodiments of the present disclosure only relate to the structures involved in the embodiments of the present disclosure, and other structures may refer to general designs.

(2) In the case of no conflict, the embodiments of the present disclosure and the features in the embodiments can be combined with each other to obtain new embodiments.

The above description is only a specific implementation manner of the present disclosure, but the protection scope of the present disclosure is not limited thereto, and the protection scope of the present disclosure should be based on the protection scope of the claims.

Claims

A data processing method for big data, comprising:

Get the main program written in the native programming language;

Compile the main program in the native system and generate a loadable module, wherein the loadable module includes a first part of the program and a second part of the program, the first part of the program includes the function implementation of the first operator, and the second part of the program includes a function pointer for calling a function corresponding to the second operator;

The loadable module is loaded by a virtual machine running in the native system, and the second part of the program is converted into virtual machine code supported by a predetermined virtual machine big data processing system, wherein the virtual machine code calls the function implementations of the functions of the predetermined virtual machine big data processing system; and

The virtual machine runs the main program contained in the loadable module, wherein, during the running of the main program, the virtual machine builds a directed acyclic graph program corresponding to the loadable module and then runs the directed acyclic graph program, the first part of the program is handed over to the native system to run, and the converted second part of the program is handed over to the predetermined virtual machine big data processing system to run.
The data processing method according to claim 1, wherein the data processing method adopts distributed computing, the first operator is a local type operator, and the second operator is a global type operator.
The data processing method according to claim 1 or 2, wherein, in the native system, the main program is compiled and a loadable module is generated, comprising:

An optimization operation is performed on the first part of the program to reduce the interaction overhead between the native system and the virtual machine.
The data processing method according to claim 3, wherein, performing an optimization operation on the first part of the program comprises:

A series of operation steps in the first part of the program are merged into one operation step.
The data processing method according to claim 3, wherein, performing an optimization operation on the first part of the program comprises:

At least one operator in the first part of the program is set to batch process multiple data at a time.
The data processing method according to claim 3, wherein, performing an optimization operation on the first part of the program comprises:

For the data to be processed by the first part of the program, the data of the same data type is converted into a data arrangement continuously stored in the memory.
The data processing method according to claim 6, wherein,

The input of at least one operator in the first partial program is set as a start address pointer of the data arrangement.
The data processing method according to any one of claims 1-7, wherein,

The predetermined virtual machine big data processing system is Apache Spark.
A data processing device for big data, comprising:

a program acquisition unit configured to acquire a main program written in a native programming language;

A program compiling unit configured to compile the main program in the native system and generate a loadable module, wherein the loadable module includes a first part of the program and a second part of the program, the first part of the program includes the function implementation of the first operator, and the second part of the program includes a function pointer for calling a function corresponding to the second operator;

The loading and conversion unit is configured to load the loadable module by a virtual machine running in the native system, and convert the second part of the program into virtual machine code supported by a predetermined virtual machine big data processing system, wherein the virtual machine code calls the function implementations of the functions of the predetermined virtual machine big data processing system; and

The running unit is configured to run the main program contained in the loadable module by the virtual machine, wherein, during the running of the main program, the virtual machine constructs a directed acyclic graph program corresponding to the loadable module and then runs the directed acyclic graph program, hands over the first part of the program to the native system to run, and hands over the converted second part of the program to the predetermined virtual machine big data processing system to run.
The data processing device according to claim 9, further comprising:

The optimization unit is configured to: perform an optimization operation on the first part of the program to reduce the interaction overhead between the native system and the virtual machine.
The data processing device according to claim 10, wherein performing an optimization operation on the first part of the program comprises:

A series of operation steps in the first part of the program are merged into one operation step.
The data processing device according to claim 10, wherein performing an optimization operation on the first part of the program comprises:

At least one operator in the first part of the program is set to batch process multiple data at a time.
The data processing device according to claim 10, wherein, performing an optimization operation on the first part of the program comprises:

For the data to be processed by the first part of the program, the data of the same data type is converted into a data arrangement continuously stored in the memory.
The data processing apparatus according to claim 13, wherein,

The input of at least one operator in the first partial program is set as a start address pointer of the data arrangement.