CN114327479A - Data processing method and data processing device for big data - Google Patents

Data processing method and data processing device for big data

Info

Publication number
CN114327479A
CN114327479A (application CN202111618375.0A)
Authority
CN
China
Prior art keywords
program
data processing
virtual machine
operator
big data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111618375.0A
Other languages
Chinese (zh)
Inventor
俞博文
冯冠宇
曹焕琦
郑纬民
陈文光
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN202111618375.0A
Publication of CN114327479A
Priority to PCT/CN2022/130286 (WO2023124543A1)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 8/00 Arrangements for software engineering
    • G06F 8/40 Transformation of program code
    • G06F 8/41 Compilation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Devices For Executing Special Programs (AREA)

Abstract

A data processing method and a data processing device for big data. The data processing method includes: acquiring a main program written in a native programming language; compiling the main program in the native system to generate a loadable module, where the loadable module includes a first partial program and a second partial program; loading the loadable module by a virtual machine running in the native system and converting the second partial program into virtual machine code supported by a predetermined virtual machine big data processing system; and running, by the virtual machine, the main program contained in the loadable module, where the virtual machine constructs a directed acyclic graph program corresponding to the loadable module, then runs the directed acyclic graph program, handing the first partial program to the native system for execution and the converted second partial program to the predetermined virtual machine big data processing system for execution. The data processing method can build a high-performance big data processing framework while integrating into the existing virtual machine big data software ecosystem.

Description

Data processing method and data processing device for big data
Technical Field
The present disclosure relates to a data processing method and a data processing apparatus for big data.
Background
Big data refers to large, rapidly growing collections of data in heterogeneous formats. Apache Hadoop is an early open-source big data solution that includes a distributed file system (HDFS) for persisting big data and an analysis framework based on the MapReduce abstraction. Apache Spark, a fast general-purpose computing system designed for large-scale data processing, is an open-source cluster computing environment similar to Hadoop but with much better performance on some workloads. Apache Spark introduced a new abstraction called the resilient distributed dataset (RDD) to support fault-tolerant data reuse in iterative workloads, which can achieve an order of magnitude better performance than Hadoop MapReduce. Spark provides a rich and easy-to-use application program interface on which supporting libraries for graph computing, stream processing, machine learning, and SQL queries are built. Today, Spark is widely deployed to serve big data analytics.
Disclosure of Invention
At least one embodiment of the present disclosure provides a data processing method for big data, including: acquiring a main program written in a native programming language; compiling the main program in the native system to generate a loadable module, where the loadable module includes a first partial program and a second partial program, the first partial program includes the function implementations of first operators, and the second partial program includes function pointers for calling the functions corresponding to second operators; loading the loadable module by a virtual machine running in the native system and converting the second partial program into virtual machine code supported by a predetermined virtual machine big data processing system, the virtual machine code calling the function implementations provided by the predetermined virtual machine big data processing system; and running, by the virtual machine, the main program contained in the loadable module, where, during the running of the main program, the virtual machine constructs a directed acyclic graph program corresponding to the loadable module, then runs the directed acyclic graph program, handing the first partial program to the native system for execution and the converted second partial program to the predetermined virtual machine big data processing system for execution.
For example, in a data processing method provided in at least one embodiment of the present disclosure, the data processing method employs distributed computation, a first operator is a local class operator, and a second operator is a global class operator.
For example, in a data processing method provided by at least one embodiment of the present disclosure, compiling the main program in the native system and generating a loadable module includes: performing an optimization operation on the first partial program to reduce the interaction overhead between the native system and the virtual machine.
For example, in a data processing method provided in at least one embodiment of the present disclosure, performing an optimization operation on the first partial program includes: fusing a series of operation steps in the first partial program into a single operation step.
For example, in a data processing method provided in at least one embodiment of the present disclosure, performing an optimization operation on a first partial program includes: at least one operator in the first partial program is set to batch process a plurality of data at a time.
For example, in a data processing method provided in at least one embodiment of the present disclosure, performing an optimization operation on the first partial program includes: for the data to be processed by the first partial program, converting data of the same data type into a data arrangement stored contiguously in memory.
For example, in a data processing method provided by at least one embodiment of the present disclosure, an input of at least one operator in the first partial program is set as a start address pointer of the data arrangement.
For example, in the data processing method provided in at least one embodiment of the present disclosure, the predetermined virtual machine big data processing system is Apache Spark.
At least one embodiment of the present disclosure provides a data processing apparatus for big data, including: a program acquisition unit configured to acquire a main program written in a native programming language; a program compiling unit configured to compile the main program in the native system and generate a loadable module, where the loadable module includes a first partial program and a second partial program, the first partial program includes the function implementations of first operators, and the second partial program includes function pointers for calling the functions corresponding to second operators; a loading and conversion unit configured to load the loadable module by a virtual machine running in the native system and convert the second partial program into virtual machine code supported by a predetermined virtual machine big data processing system, the virtual machine code calling the function implementations provided by the predetermined virtual machine big data processing system; and a running unit configured to run, by the virtual machine, the main program contained in the loadable module, where, during the running of the main program, the virtual machine constructs a directed acyclic graph program corresponding to the loadable module, then runs the directed acyclic graph program, handing the first partial program to the native system for execution and the converted second partial program to the predetermined virtual machine big data processing system for execution.
The data processing apparatus provided in at least one embodiment of the present disclosure further includes: an optimization unit configured to perform an optimization operation on the first partial program to reduce the interaction overhead between the native system and the virtual machine.
For example, in a data processing apparatus provided in at least one embodiment of the present disclosure, performing an optimization operation on the first partial program includes: fusing a series of operation steps in the first partial program into a single operation step.
For example, in a data processing apparatus provided in at least one embodiment of the present disclosure, performing an optimization operation on a first partial program includes: at least one operator in the first partial program is set to batch process a plurality of data at a time.
For example, in a data processing apparatus provided in at least one embodiment of the present disclosure, performing an optimization operation on the first partial program includes: for the data to be processed by the first partial program, converting data of the same data type into a data arrangement stored contiguously in memory.
For example, in the data processing apparatus provided in at least one embodiment of the present disclosure, the input of at least one operator in the first partial program is set as a start address pointer of the data arrangement.
Drawings
To more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings of the embodiments will be briefly introduced below, and it is apparent that the drawings in the following description relate only to some embodiments of the present disclosure and are not limiting to the present disclosure.
Fig. 1 illustrates a schematic flow chart of a data processing method for big data according to at least one embodiment of the present disclosure;
FIG. 2 illustrates an exemplary diagram of an optimization operation performed on a first partial program provided by at least one embodiment of the present disclosure;
FIG. 3 illustrates an exemplary diagram of a directed acyclic graph program provided by at least one embodiment of the present disclosure;
FIG. 4A illustrates an exemplary diagram of a virtual machine big data processing architecture (Apache Spark) in accordance with at least one embodiment of the present disclosure;
FIG. 4B illustrates an exemplary diagram of a native system big data processing architecture provided by at least one embodiment of the present disclosure;
FIG. 5 illustrates a schematic diagram of an example workflow of a data processing method provided by at least one embodiment of the present disclosure;
fig. 6 shows a schematic block diagram of a data processing apparatus for big data according to at least one embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present disclosure more apparent, the technical solutions of the embodiments of the present disclosure will be described clearly and completely with reference to the drawings of the embodiments of the present disclosure. It is to be understood that the described embodiments are only a few embodiments of the present disclosure, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the described embodiments of the disclosure without any inventive step, are within the scope of protection of the disclosure.
Unless otherwise defined, technical or scientific terms used herein shall have the ordinary meaning as understood by one of ordinary skill in the art to which this disclosure belongs. The use of "first," "second," and similar terms in this disclosure is not intended to indicate any order, quantity, or importance, but rather is used to distinguish one element from another. Also, the use of the terms "a," "an," or "the" and similar referents do not denote a limitation of quantity, but rather denote the presence of at least one. The word "comprising" or "comprises", and the like, means that the element or item listed before the word covers the element or item listed after the word and its equivalents, but does not exclude other elements or items. The terms "connected" or "coupled" and the like are not restricted to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "upper", "lower", "left", "right", and the like are used merely to indicate relative positional relationships, and when the absolute position of the object being described is changed, the relative positional relationships may also be changed accordingly.
Although Spark has the advantage of in-memory distributed datasets, which can optimize iterative workloads in addition to providing interactive queries, recent work has shown that there is still large room for performance improvement. A Spark program runs on top of a JVM (Java virtual machine) and must be translated into code that the JVM executes indirectly. Compared with Spark, big data analysis frameworks built in the C++ programming language have a large performance advantage, because code written in C++ can be translated into code executed directly by the machine. For example, the C++-based big data analysis framework Thrill achieves an average speedup of 3.26 times over Spark on typical big data workloads. Likewise, for a Java matrix multiplication kernel, switching from Java to C yields a 4.4-times speedup, and vectorization with the AVX intrinsic functions (AVX intrinsics) provided by the C compiler improves performance by 9.45 times. However, performance is only one aspect of big data processing, and the C++-based big data analysis frameworks, although faster than Spark, lack many of the key functions for big data processing that Spark provides, such as lineage-based resilience. Big data analytics is typically performed in distributed computing environments, i.e., in multi-tenant clusters built on commodity hardware, where task failures caused by machine failures, network jitter, and preemptive scheduling are very common, making checkpoints inefficient at handling these frequent failures. Spark's lineage-based fault tolerance mechanism allows only part of the data to be recomputed instead of the entire dataset. Spark's resilience also supports other functions, such as load balancing, straggler mitigation, and auto-scaling, thereby improving the resource utilization of the cluster. In addition, the Spark ecosystem, such as the performance analyzer with a Web UI and the integration with various resource managers, makes it easy to deploy, monitor, and analyze applications on various private or public clouds. Thrill has an RDD (resilient distributed dataset)-like abstraction called DIA, but it tightly couples the data distribution to physical machines, rendering resilience ineffective. Husky uses an upstream message-logging fault tolerance mechanism that incurs non-negligible overhead even in the absence of failures. Compared with Spark, these C++-based big data analysis frameworks lack many basic functions.
Clearly, a fully functional native big data framework needs to be designed. A straightforward solution is to re-implement the functionality provided by Spark in a native programming language (e.g., C++). This is theoretically possible but may be prohibitively expensive and unnecessary: the kernel component of Spark 3.0.1 has 74K lines of code, of which only 9K lines are directly related to the programming framework (including the RDD application program interface and operator implementations), while the rest implements the components that serve the various big data functions. Thus, a big data framework can be constructed that reuses Spark's mature big data functions without re-implementing them.
However, reusing Spark's big data functions raises many challenges. First, existing native big data frameworks (e.g., Thrill, Husky, etc.) are not compatible with Spark's execution model, making it infeasible to integrate them into Spark. For example, Thrill couples each dataset partition to a particular machine, and Husky relies on stateful task execution, which violates Spark's dynamic scheduling and statelessness assumptions. Second, fine-grained interaction between the JVM and the native world, whether through JNI (Java Native Interface) or JNA (Java Native Access), incurs high overhead and may become a new performance bottleneck.
At least one embodiment of the present disclosure provides a data processing method for big data, including: acquiring a main program written in a native programming language; compiling the main program in the native system to generate a loadable module, where the loadable module includes a first partial program and a second partial program, the first partial program includes the function implementations of first operators, and the second partial program includes function pointers for calling the functions corresponding to second operators; loading the loadable module by a virtual machine running in the native system and converting the second partial program into virtual machine code supported by a predetermined virtual machine big data processing system, the virtual machine code calling the function implementations provided by the predetermined virtual machine big data processing system; and running, by the virtual machine, the main program contained in the loadable module, where, during the running of the main program, the virtual machine constructs a directed acyclic graph program corresponding to the loadable module, then runs the directed acyclic graph program, handing the first partial program to the native system for execution and the converted second partial program to the predetermined virtual machine big data processing system for execution.
The data processing method divides the program for processing big data into a first partial program and a second partial program that are processed separately, so that it can both reuse the important functions of a predetermined virtual machine big data processing system (e.g., Apache Spark) and improve the performance of big data processing. The first partial program contains the function implementations of the first operators and can be run by the native system without calling the functions of the predetermined virtual machine big data processing system, giving a higher processing speed; the second partial program calls the functions of the predetermined virtual machine big data processing system, so the existing functions of that system can be reused easily. Therefore, the data processing method according to the embodiments of the present disclosure can achieve high-speed processing of big data while making full use of the functions of an existing big data processing system to achieve full-featured big data processing.
A native system is a concept relative to a virtual machine and refers to the operating system installed directly on a computer, such as a Linux system or a Windows system. A native programming language is a programming language used on the native system, such as the C++ programming language. For example, in some embodiments of the present disclosure, a main program is written in the C++ programming language and compiled into a loadable module in object-code form (e.g., machine code), and the loadable module is then loaded into a virtual machine running in the native system, such as a JVM running on a Linux system.
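To make this concrete, the following is a minimal, self-contained C++ sketch of what such a main program might look like. It is not taken from the disclosure: all names (native::Context, PairDataset, user_main) are illustrative stand-ins chosen to mirror Spark-style operators, and the stub bodies simulate locally what the real framework would hand to the native system (first operators) or to the predetermined virtual machine system (second operators).

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <utility>
#include <vector>

namespace native {  // hypothetical framework namespace (illustrative only)

template <typename K, typename V>
class PairDataset {
 public:
  explicit PairDataset(std::vector<std::pair<K, V>> d) : data_(std::move(d)) {}

  // mapValues: a local (first) operator; its lambda would be compiled to
  // machine code inside the loadable module and run by the native system.
  PairDataset mapValues(std::function<V(V)> f) const {
    auto out = data_;
    for (auto& kv : out) kv.second = f(kv.second);
    return PairDataset(std::move(out));
  }

  // reduceByKey: needs a shuffle, i.e. a global (second) operator; the real
  // module would hold only a function pointer bound to the VM system.
  PairDataset reduceByKey(std::function<V(V, V)> f) const {
    std::map<K, V> acc;
    for (const auto& kv : data_) {
      auto it = acc.find(kv.first);
      if (it == acc.end()) acc.emplace(kv.first, kv.second);
      else it->second = f(it->second, kv.second);
    }
    return PairDataset(std::vector<std::pair<K, V>>(acc.begin(), acc.end()));
  }

  std::vector<std::pair<K, V>> collect() const { return data_; }

 private:
  std::vector<std::pair<K, V>> data_;
};

struct Context {
  template <typename K, typename V>
  PairDataset<K, V> parallelize(std::vector<std::pair<K, V>> d) {
    return PairDataset<K, V>(std::move(d));
  }
};

}  // namespace native

// Entry point compiled into the loadable module and started by the driver.
extern "C" void user_main() {
  native::Context ctx;
  auto ranks = ctx.parallelize<int64_t, double>({{1, 0.5}, {2, 1.5}, {1, 2.0}});
  auto contribs = ranks.mapValues([](double v) { return 0.15 + 0.85 * v; });
  auto sums = contribs.reduceByKey([](double a, double b) { return a + b; });
  sums.collect();  // action: triggers DAG construction and execution
}
```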
It should be noted that, although some embodiments of the present disclosure are described with Apache Spark as an example of a predetermined virtual machine big data processing architecture, the present disclosure is not limited to being based on Apache Spark. Those skilled in the art can understand that similar technical problems also exist in other virtual machine big data processing architectures, and the technical solution of the embodiment of the present disclosure is also applicable to other virtual machine big data processing architectures. In this disclosure, a virtual machine big data processing architecture (framework) represents a big data processing system that runs based on a virtual machine (e.g., JVM), including an engine and a programming framework. The big data processing architecture provides various functions for big data processing, for example, the function of Apache Spark as described above.
At least one embodiment of the present disclosure further provides a data processing apparatus corresponding to the data processing method.
Embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings, but the present disclosure is not limited to these specific embodiments.
Fig. 1 illustrates a schematic flow chart of a data processing method for big data according to at least one embodiment of the present disclosure.
As shown in fig. 1, the data processing method includes steps S101 to S104 as follows.
Step S101: a main program written in a native programming language is obtained.
A user writes a main program using a native programming language such as the C + + programming language, and the processing device obtains the user-written main program.
Step S102: compiling the main program in the native system and generating a loadable module, where the loadable module includes a first partial program and a second partial program, the first partial program includes the function implementations of first operators, and the second partial program includes function pointers for calling the functions corresponding to second operators.
In step S102, the main program is compiled into a module in object-code form (e.g., machine-code form) that can be loaded into a virtual machine, and is therefore called a loadable module. The main program generally contains calls to functions provided by the system; when the main program is compiled, the called functions are resolved against the function library or dataset representation provided by the system so that corresponding processing can be performed. According to an embodiment of the present disclosure, the compiled loadable module may include a first partial program and a second partial program, which are processed differently in subsequent steps. In the embodiments of the present disclosure, the "first partial program" is also referred to as the "compile-time partial program", and the "second partial program" is also referred to as the "runtime partial program". The difference between the two lies in whether the partial program includes the concrete function implementations of its operators: if it does, it is the first partial program and the corresponding operators are called first operators; otherwise, it is the second partial program and the corresponding operators are called second operators. The second partial program does not include concrete function implementations of its operators but includes function pointers for calling the functions corresponding to those operators. The concrete implementations of these operators (the second operators) are supplied in a subsequent process by setting the addresses of the functions that implement them into the function pointers (described in detail later). The division into first and second operators may be performed by operator category, e.g., according to a predetermined division policy, some categories of operators are classified as first operators and others as second operators. The division policy may be determined by the specific application; for example, operators whose function implementations are relatively simple may be classified as first operators, and operators whose function implementations are relatively complex may be classified as second operators, so that the functions of the virtual machine big data processing system can be reused for the second operators with complex implementations. As another example, in a data processing method using distributed computing, local-class operators may be classified as first operators and global-class operators as second operators. In other words, in some embodiments of the present disclosure, the data processing method employs distributed computing, the first operators are local-class operators, and the second operators are global-class operators.
Local-class operators are operators that do not need to consider coordination between different compute nodes in a distributed computation, such as the mapValues operator, the filter operator, the flatMap operator, and hash operators (e.g., Hash Aggregate, Build Hash).
Global-class operators are operators that need to consider coordination between different compute nodes in a distributed computation, such as partition-pruning operators, operators with more than one dependency (e.g., the union operator, zip operators (e.g., Zip Partitions), the cartesian operator), shuffle operators (e.g., Shuffle Write, Shuffle Read), the cache operator, and data-source operators.
For global-class operators, coordination among different compute nodes must be considered, so their implementation is relatively complex; in the embodiments of the present disclosure, global-class operators are therefore classified as second operators so that the global-operator functions of an existing virtual machine big data processing system (e.g., Spark) can be reused.
For example, partition-pruning operators and operators with more than one dependency can provide lineage-related information to Spark and can reuse Spark's lineage-based fault-tolerance mechanism.
For example, shuffle operators can reuse Spark's fault-tolerant data shuffling mechanism, which would otherwise be not only difficult to implement but also cumbersome to deploy.
For example, the cache operator can reuse Spark's intermediate data management mechanism, and data-source operators can provide location information to Spark and reuse Spark's locality-aware task scheduling.
The method classifies the local-class operators as first operators and the global-class operators as second operators, so that the second partial program containing the second operators can be handed to a virtual machine big data processing system such as Spark in subsequent steps without re-implementing complex functions, avoiding unnecessary engineering effort.
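As a rough illustration of this division, a compiled loadable module could describe each operator with a tag telling the loader which part it belongs to. The following C++ sketch uses illustrative names only (OperatorKind, OperatorDesc); the disclosure does not specify such a structure.

```cpp
#include <cstdint>

// Which part of the loadable module an operator belongs to.
enum class OperatorKind {
  Local,   // first operator: implementation compiled into this module
  Global   // second operator: resolved against the VM system at load time
};

// Native implementation of a local operator (part of the first partial
// program), compiled directly to machine code.
static void map_values_fn(const double* in, double* out, int64_t n) {
  for (int64_t i = 0; i < n; ++i) out[i] = 0.15 + 0.85 * in[i];
}

struct OperatorDesc {
  OperatorKind kind;
  const char* name;  // semantic name matching the corresponding Spark operator
  void (*fn)(const double*, double*, int64_t);  // null for global operators
};

// The loader keeps Local entries for direct native execution and maps
// Global entries onto the predetermined virtual machine system.
static const OperatorDesc kOperators[] = {
    {OperatorKind::Local, "mapValues", &map_values_fn},
    {OperatorKind::Global, "shuffleRead", nullptr},
};
```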
In some embodiments of the present disclosure, compiling the main program in the native system in step S102 may include: performing an optimization operation on the first partial program to reduce the interaction overhead between the native system and the virtual machine.
Integrating programs of the native system into a virtual machine big data processing architecture results in interaction overhead between the native system and the virtual machine. According to some embodiments of the present disclosure, this overhead can be reduced by optimizing the first partial program.
An exemplary schematic diagram of the optimization operations on the first partial program is shown in fig. 2. The two mapValues operators in the program "contribs.reduceByKey(_ + _).mapValues(v => 0.85 * v).mapValues(v => 0.15 + v).join(links)" are taken as an example. The native system and the virtual machine interact through a JNA pointer (JNA Pointer).
Fig. 2 (a) shows the case without optimization. In fig. 2 (a), each mapValues transformation maps directly to a dataset in the virtual machine (e.g., a Spark RDD), and each value is mapped to a virtual machine program object (e.g., a Java object). This approach is simple but incurs significant overhead: processing even a simple value such as a (long, double) pair requires one serialization/deserialization on each of the virtual machine and the native system, plus the creation of a virtual machine program object, and this interaction overhead may offset the performance advantage of the native system.
In some embodiments of the present disclosure, performing the optimization operation on the first partial program may include: fusing a series of operation steps in the first partial program into a single operation step. This optimization operation may be referred to as operator fusion, as shown in fig. 2 (b).
In fig. 2 (b), two consecutive mapValues operation steps (.mapValues(v => 0.85 * v) and .mapValues(v => 0.15 + v)) are fused into one mapValues operation step (.mapValues(v => 0.15 + 0.85 * v)), which reduces the number of serializations/deserializations and thus the interaction overhead between the native system and the virtual machine.
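The fusion in fig. 2 (b) can be pictured as composing the two lambdas at compile time. The following C++ sketch is illustrative only (the fuse helper is a hypothetical name), but it shows why two mapValues steps collapse into a single native call:

```cpp
#include <iostream>

// Compose two per-value steps into one: the fused step computes g(f(v)).
template <typename F, typename G>
auto fuse(F f, G g) {
  return [f, g](auto v) { return g(f(v)); };
}

int main() {
  auto step1 = [](double v) { return 0.85 * v; };  // first mapValues
  auto step2 = [](double v) { return 0.15 + v; };  // second mapValues
  auto fused = fuse(step1, step2);  // equivalent to v => 0.15 + 0.85 * v

  // One call (and one boundary crossing) instead of two.
  std::cout << fused(2.0) << "\n";  // prints 1.85
  return 0;
}
```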
In some embodiments of the present disclosure, performing the optimization operation on the first partial program may include: setting at least one operator in the first partial program to process a batch of a plurality of data items at a time. This optimization operation may be referred to as vectorization, as shown in fig. 2 (c).
In fig. 2 (c), the mapValues operator in the first partial program, which originally processed elements one by one, is set to process a batch of data at a time. That is, instead of processing a single (long, double) pair per call, it now processes an array of (long, double) pairs per call; the array (denoted R) contains a plurality of (long, double) pairs, each of which is key-value-pair data (of type (k, v), where k denotes the key and v denotes the value). This reduces the number of calls from the virtual machine world into native-world data and the number of serializations/deserializations, thereby reducing the interaction overhead between the native system and the virtual machine.
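A sketch of the vectorized form, with illustrative names (Pair, map_values_batch): one native call now handles a whole batch of (long, double) pairs, so the per-element boundary crossing disappears.

```cpp
#include <cstddef>
#include <cstdint>

// One (key, value) pair of type (long, double), laid out as plain data.
struct Pair {
  int64_t key;
  double value;
};

// Called once per batch from the virtual machine side, so the per-element
// JNI/JNA crossing and serialization are amortized over the whole batch.
extern "C" void map_values_batch(Pair* batch, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    batch[i].value = 0.15 + 0.85 * batch[i].value;  // the fused mapValues
  }
}
```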
In some embodiments of the present disclosure, performing the optimization operation on the first partial program may include: for the data to be processed by the first partial program, converting data of the same data type into a data arrangement stored contiguously in memory. This optimization operation may be referred to as the data arrangement operation.
Storing batch data contiguously in memory (as a buffer) helps avoid the memory fragmentation caused by allocating many small objects and eliminates the need for run-time compaction by a garbage collector. In addition, the data arrangement increases data locality and provides a regular memory access pattern, which improves the efficiency of the central processing unit. For example, in some embodiments of the present disclosure, the input of at least one operator in the first partial program may be set to the start address pointer of the data arrangement. Data of the same type is often processed by one or a few common operators, so when an operator's data is stored contiguously in memory, it is unnecessary to pass each data item or its address pointer; providing the address pointer of the first item (i.e., the start address pointer of the data arrangement) is enough to reach all of the contiguously stored data. In this case, the individual data items to be processed need not cross between the virtual machine and the native system; only the start address pointer of the data arrangement crosses, which greatly reduces the interaction overhead between the virtual machine and the native system. Fig. 2 (d) shows the pointer passing between the virtual machine and the native system. In fig. 2 (d), this data arrangement is referred to as a "compact array", i.e., an array of (long, double) pairs stored in contiguous memory. Only the address pointer is passed between the virtual machine and the native system, and no data serialization/deserialization is needed.
In some embodiments of the present disclosure, a variety of templates are provided to generate data arrangements for various data types. For example, a flat array (Flat Array) places fixed-length elements in a buffer; a bitmap array (Bitmap Array) places Boolean elements in a buffer; an array array and a string array (String Array) place array elements in a buffer; a nullable array (Nullable Array) places nullable elements in a buffer; and a tuple array (Tuple Array) places tuple elements in a buffer.
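The following C++ sketch illustrates the flat-array idea under the assumption, not stated in the disclosure, that the templates are thin wrappers over one contiguous buffer; FlatArray and sum_values are illustrative names. Only a start address pointer and a length cross the JVM/native boundary.

```cpp
#include <cstddef>
#include <vector>

// A data arrangement template: fixed-length elements in one contiguous
// buffer, so no per-element virtual machine objects are created.
template <typename T>  // T: a fixed-length, trivially copyable element type
class FlatArray {
 public:
  void push(const T& v) { buf_.push_back(v); }
  const T* data() const { return buf_.data(); }  // start address pointer
  std::size_t size() const { return buf_.size(); }

 private:
  std::vector<T> buf_;  // contiguous storage, regular access pattern
};

// Native operator input: only the start address pointer and the length
// are passed from the virtual machine, not the individual elements.
extern "C" double sum_values(const double* start, std::size_t n) {
  double s = 0.0;
  for (std::size_t i = 0; i < n; ++i) s += start[i];
  return s;
}
```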
The above optimization operations on the first partial program reduce the interaction overhead between the native system and the virtual machine, thereby improving the performance of big data processing. It should be noted that the present disclosure is not limited to the above optimization operations; other optimization operations capable of reducing the interaction overhead between the native system and the virtual machine are also applicable.
Step S103: loading the loadable module by a virtual machine running in the native system and converting the second partial program into virtual machine code supported by the predetermined virtual machine big data processing system, the virtual machine code calling the function implementations provided by the predetermined virtual machine big data processing system.
The main program is compiled to form a loadable module in the form of object code, which is loaded into a virtual machine running in a native system, for example into a JVM running on a Linux system. The loading process may be performed using an interface between the virtual machine and the native system, for example, using JNA.
After the loading in step S103, the second partial program is further converted by the virtual machine into virtual machine code supported by the predetermined virtual machine big data processing architecture. The second partial program typically implements relatively complex functionality, so it calls the functions of the predetermined virtual machine big data processing architecture (e.g., Spark) on the virtual machine instead of re-implementing them. Accordingly, while being loaded into the virtual machine, the second partial program is translated into code supported by the predetermined virtual machine big data processing architecture (e.g., Spark). For example, the function pointers in the second partial program are bound to the actually callable addresses of functions in the predetermined virtual machine big data processing architecture, and the called functions provide the function implementations of the second operators in the second partial program. The virtual machine code calls the functional modules (e.g., the processing modules of the corresponding operators) of the predetermined virtual machine big data processing architecture, avoiding re-implementing those modules. It should be noted that "predetermined" here means determined in advance, i.e., the processing system used for code conversion and calling is selected in advance.
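A possible shape of this load-time binding, sketched in C++ with illustrative names (ShuffleWriteFn, bind_shuffle_write): the second partial program exports unresolved function-pointer slots, and the loader fills them with stubs that forward into the predetermined virtual machine big data processing system.

```cpp
#include <cstdint>

using ShuffleWriteFn = void (*)(const void* partition, int64_t bytes);

extern "C" {
// Slot in the second partial program; null until the module is loaded.
ShuffleWriteFn g_shuffle_write = nullptr;

// Called by the loader (e.g. through JNA) after it has generated the
// virtual machine code that implements the shuffle via the VM system.
void bind_shuffle_write(ShuffleWriteFn stub) { g_shuffle_write = stub; }
}

// The compiled main program only ever calls through the pointer, so the
// concrete function implementation lives in the VM system, not here.
void shuffle_stage(const void* partition, int64_t bytes) {
  if (g_shuffle_write != nullptr) g_shuffle_write(partition, bytes);
}
```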
Step S104: running, by the virtual machine, the main program contained in the loadable module, where, during the running of the main program, the virtual machine constructs a directed acyclic graph program corresponding to the loadable module, then runs the directed acyclic graph program, handing the first partial program to the native system for execution and the converted second partial program to the predetermined virtual machine big data processing system for execution.
A directed acyclic graph (DAG) is a data structure from graph theory: a directed graph is acyclic if, starting from any vertex, it is impossible to return to that vertex by following a sequence of directed edges. A directed acyclic graph program is a program presented as a directed acyclic graph. A directed acyclic graph program for big data processing includes a series of operators for processing the big data. Fig. 3 shows an example of a directed acyclic graph program. The first row in fig. 3 is code intercepted from an application for processing data (i.e., part of a main program written by a user), and the corresponding directed acyclic graph program is shown below it, including a plurality of operators executed in sequence: Hash Aggregate, Shuffle Write, Shuffle Read, Map Values, Build Hash, Zip Partitions, and Probe Hash. These operators process the data to be processed accordingly. In the directed acyclic graph program of fig. 3, vertices represent operators and edges with arrows represent the dependencies between two operators. The directed acyclic graph program in step S104 can be constructed from the user-written code. For example, the code "reduceByKey" in fig. 3 may be constructed as a sequentially executed Hash Aggregate, Shuffle Write, Shuffle Read, and Hash Aggregate, and the code "join" may be constructed as a sequentially executed Build Hash, Zip Partitions, and Probe Hash. The operators include first operators and second operators, corresponding to the first partial program and the second partial program respectively.
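A minimal sketch of such a DAG representation in C++ follows (DagNode and build_dag are illustrative names; the real in-memory form is not specified in the disclosure). The node order mirrors the operators of fig. 3; the sketch keeps the dependencies linear, although operators such as Zip Partitions may in general have more than one dependency.

```cpp
#include <string>
#include <vector>

struct DagNode {
  std::string op;         // operator name, e.g. "Shuffle Write"
  bool is_local;          // first operator (native) or second operator (VM)
  std::vector<int> deps;  // indices of upstream nodes (incoming edges)
};

// The operator chain of Fig. 3: reduceByKey followed by join.
std::vector<DagNode> build_dag() {
  return {
      {"Hash Aggregate", true, {}},    // 0: local hash operator
      {"Shuffle Write", false, {0}},   // 1: global, reuses Spark's shuffle
      {"Shuffle Read", false, {1}},    // 2: global, reuses Spark's shuffle
      {"Map Values", true, {2}},       // 3: local, runs in the native system
      {"Build Hash", true, {3}},       // 4: local hash operator
      {"Zip Partitions", false, {4}},  // 5: global (may have several deps)
      {"Probe Hash", true, {5}},       // 6: local hash operator
  };
}
```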
Step S104 also illustrates different processing manners for the first partial program and the second partial program.
For the first partial program, its code compiled in the native system contains the function implementations of its operators, and it can therefore be run by the native system through the virtual machine's machine-code execution mechanisms (e.g., JNI, JNA). After running in the native system, this part of the program completes the processing of its operators' functions. For the second partial program, its code compiled in the native system does not contain the function implementations of its operators; instead, it has been converted into virtual machine code supported by the predetermined virtual machine big data processing architecture and can thus be executed by the predetermined virtual machine big data processing architecture residing on the virtual machine. The virtual machine executing step S104 can reuse the engine of the predetermined virtual machine big data processing architecture to obtain functions such as distributed execution, resilience, straggler mitigation, and monitoring.
As mentioned above, the predetermined virtual machine big data processing system is a big data processing system including an execution engine and a programming framework; for example, the predetermined virtual machine big data processing system is Apache Spark. The predetermined virtual machine big data processing architecture is described below with reference to fig. 4A, taking Apache Spark as an example.
Fig. 4A illustrates an exemplary diagram of a virtual machine big data processing architecture (Apache Spark) according to at least one embodiment of the present disclosure.
As shown in fig. 4A, the predetermined virtual machine big data processing architecture includes a programming framework (RDD) and a Spark engine, the programming framework including a Spark dataset representation that provides the Spark RDD application program interface (API) to users.
The Spark dataset representation contains a set of partitions, a set of dependencies, a computation definition, and metadata about the dataset distribution and data placement. The Spark RDD application program interface provided by the Spark dataset representation may be used to build the DAG program.
The Spark engine provides support and management for Spark's various functions, including, for example, distributed execution, resilience, straggler mitigation, and monitoring. The Spark engine includes a resource manager (e.g., YARN, AWS) that provides unified resource management and scheduling for upper-layer applications.
Apache Spark residing on the virtual machine can parse and execute the converted code of the second partial program (i.e., code conforming to the Spark RDD application program interface) through its programming framework and Spark engine.
To support the data processing method of the embodiments of the present disclosure, an embodiment of the present disclosure provides a native system big data processing system whose structure is similar to that of the virtual machine big data processing system.
Fig. 4B illustrates an exemplary diagram of a native system big data processing architecture provided by at least one embodiment of the present disclosure. The big data processing architecture of embodiments of the present disclosure includes a programming framework that includes a dataset representation that can provide an Application Program Interface (API), and an engine.
The application program interface of the native system big data processing system of the embodiments of the present disclosure may be designed to be similar to the Spark RDD application program interface, e.g., with operator names and semantics identical to those of Spark operators, for the convenience of Spark users. Of course, it may also be designed differently from the Spark RDD application program interface, as long as it can be called by user programs.
In addition to the functions of an ordinary big data processing system's dataset representation (e.g., those of the Spark dataset representation described above), the dataset representation of the native system big data processing system according to the embodiments of the present disclosure may implement the function implementations of the first operators, the function pointers for calling the functions corresponding to the second operators, and the optimization operations on the first partial program. The part of the dataset representation associated with the second partial program may reuse Spark's dataset representation, i.e., it contains function pointers for calling the functions corresponding to the operators in the second partial program. The part of the dataset representation associated with the first partial program may contain the function implementations of the operators in the first partial program. Operators in the dataset representation may take the form of code templates, so that the optimization of the first partial program can be achieved by, for example, C++ template metaprogramming.
The engine of the native system big data processing architecture of the embodiments of the present disclosure may be a wrapper around the engine of a virtual machine big data processing architecture (e.g., Spark) so as to reuse Spark's big data functions, while including additional functional units for integrating native code into the virtual machine. The engine supports and manages the various functions of the native system big data processing architecture; since it is a wrapper around the existing virtual machine big data processing architecture engine, there is no need to reconfigure an existing cluster resource manager or recompile the existing Spark. The engine of the native system big data processing architecture of the present disclosure borrows most of the functionality of the Spark engine but enhances it to support efficient Spark integration. For example, the engine includes a library in the native system that provides a C++ binding of Spark (referred to as CppSpark); CppSpark is an interface, implemented in the C++ language, to the functions of the virtual machine, so that CppSpark can be used to call virtual machine functions from the native system. The C++ programming interface provided by CppSpark is invoked when the second partial program is converted, and CppSpark forwards the call from the native system to the virtual machine, thereby performing the conversion of the second partial program in the virtual machine, i.e., converting the second partial program into the virtual machine code supported by the predetermined virtual machine big data processing system. The engine of the native system big data processing architecture of the embodiments of the present disclosure may include a driver, which is an ordinary virtual machine big data processing architecture (e.g., Spark) application implemented in the virtual machine's programming language (e.g., Java) and running in the virtual machine system, used for loading the compiled program code (i.e., the "loadable module") from the native system, converting the second partial program, and executing the loadable module. For example, on Spark, the driver can be submitted to any Spark-compatible cluster like an ordinary Spark application. After successful submission, the driver loads the loadable module from the native system, registers the loadable module with the engine implementation, and starts the main program contained in the loadable module. In addition, to support remote execution, the driver may also instruct newly created executors to prepare their environments, e.g., to download the loadable module described above.
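The call-forwarding idea behind such a binding can be sketched as follows in C++; the handle type and function names are illustrative assumptions, since the disclosure does not specify the CppSpark surface. The native side holds only an opaque handle to a dataset object living in the virtual machine, and every operation on it is forwarded across the boundary.

```cpp
#include <cstdint>

// Opaque handle to a dataset object living inside the virtual machine.
struct VmDatasetHandle;

// Implemented on the VM side (reached through, e.g., JNI/JNA); declared
// here so the native second partial program can link against it.
extern "C" VmDatasetHandle* vm_reduce_by_key(VmDatasetHandle* in,
                                             const char* reduce_fn_symbol);

// Native-side wrapper: the second partial program calls this, and the
// binding forwards the operation into the predetermined VM system.
VmDatasetHandle* reduce_by_key(VmDatasetHandle* in) {
  return vm_reduce_by_key(in, "reduce_add_f64");  // symbol name illustrative
}
```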
An example workflow of a data processing method provided by at least one embodiment of the present disclosure is briefly described below with reference to the native system big data processing architecture of fig. 4B, and the workflow is illustrated in fig. 5.
As shown in fig. 5, in the native system, a user writes a main program against the application program interface. The native system compiles the main program together with the dataset representation and generates a loadable module, where the loadable module includes a first partial program and a second partial program, the first partial program includes the function implementations of first operators, and the second partial program includes function pointers for calling the functions corresponding to second operators. The driver loads the loadable module from the native system; during loading, combined with the library providing the C++ binding of Spark, it converts the second partial program into virtual machine code supported by the predetermined virtual machine big data processing system, registers the loadable module with the engine implementation, and starts the main program contained in the loadable module. During the running of the main program, the driver constructs a directed acyclic graph program corresponding to the loadable module, then runs the directed acyclic graph program and hands the first partial program to the native system for execution. The driver may distribute the converted second partial program through a cluster resource manager (e.g., YARN) to the cluster executors (represented by Cloud in the figure) for distributed computing.
Fig. 6 shows a schematic block diagram of a data processing apparatus 600 for big data according to at least one embodiment of the present disclosure, where the data processing apparatus 600 may be used to execute the data processing method shown in fig. 1.
As shown in fig. 6, the data processing apparatus 600 includes a program acquisition unit 601, a program compiling unit 602, a loading and converting unit 603, and an execution unit 604.
The program acquisition unit 601 is configured to acquire a main program written in a native programming language.
The program compiling unit 602 is configured to compile a main program in a native system and generate a loadable module, where the loadable module includes a first part of program and a second part of program, the first part of program includes a function implementation of a first operator, and the second part of program includes a function pointer for calling a function corresponding to a second operator.
The loading and conversion unit 603 is configured to load a loadable module by a virtual machine running in the native system and convert the second partial program into virtual machine code supported by the predetermined virtual machine big data processing architecture, the virtual machine code calling a functional implementation of a function of the predetermined virtual machine big data processing architecture.
The running unit 604 is configured to run a main program included in the loadable module by the virtual machine, wherein during the running of the main program, the virtual machine constructs a directed acyclic graph program corresponding to the loadable module, then runs the directed acyclic graph program, hands over a first part of the program to the native system for running, and hands over a second part of the program after conversion to a predetermined virtual machine big data processing system for running.
For example, the data processing method of the data processing apparatus 600 employs distributed computing, the first operator being a local class operator, and the second operator being a global class operator.
For example, in at least one embodiment, data processing apparatus 600 may also include optimization unit 605. Optimization unit 605 is configured to perform optimization operations on the first part of the program to reduce interaction overhead between the native system and the virtual machine.
For example, in at least one embodiment, the optimization unit 605 is further configured to fuse a series of operation steps in the first partial program into a single operation step.
For example, in at least one embodiment, the optimization unit 605 is further configured to set at least one operator in the first part program to batch process multiple data at a time.
For example, in at least one embodiment, the optimization unit 605 is further configured to convert data of the same data type into a data arrangement continuously stored in the memory for data to be processed by the first partial program.
For example, in at least one embodiment, the optimization unit 605 is further configured to set the input of the at least one operator in the first partial program as a start address pointer for the data arrangement.
For example, data processing apparatus 600 may be implemented in hardware, software, firmware, or any feasible combination thereof, as not limited by the present disclosure.
The above description of the data processing method is also applicable to the data processing apparatus 600, and is not repeated herein.
There is also provided, in accordance with an embodiment of the present disclosure, a computer program product comprising program code which, when executed by a processor, performs a data processing method in accordance with an embodiment of the present disclosure.
There is also provided, in accordance with an embodiment of the present disclosure, a computer readable medium having stored thereon program code, which, when executed by a processor, performs a data processing method in accordance with an embodiment of the present disclosure.
For the present disclosure, there are also the following points to be explained:
(1) the drawings of the embodiments of the disclosure only relate to the structures related to the embodiments of the disclosure, and other structures can refer to the common design.
(2) Without conflict, embodiments of the present disclosure and features of the embodiments may be combined with each other to arrive at new embodiments.
The above description is only for the specific embodiments of the present disclosure, but the scope of the present disclosure is not limited thereto, and the scope of the present disclosure should be subject to the scope of the claims.

Claims (14)

1. A data processing method for big data, comprising:
acquiring a main program written by a native programming language;
compiling the main program in a native system and generating a loadable module, wherein the loadable module comprises a first part program and a second part program, the first part program comprises a function implementation of a first operator, and the second part program comprises a function pointer for calling a function corresponding to a second operator;
loading, by a virtual machine running in the native system, the loadable module and converting the second partial program into virtual machine code supported by a predetermined virtual machine big data processing architecture, the virtual machine code calling a functional implementation of a function of the predetermined virtual machine big data processing architecture; and
running, by the virtual machine, a main program contained in the loadable module, wherein in the running process of the main program, the virtual machine builds a directed acyclic graph program corresponding to the loadable module, then runs the directed acyclic graph program, hands the first partial program to the native system for running, and hands the converted second partial program to the predetermined virtual machine big data processing system for running.
2. The data processing method of claim 1, wherein the data processing method employs distributed computing, the first operator is a local class operator, and the second operator is a global class operator.
3. The data processing method of claim 1 or 2, wherein compiling the main program and generating a loadable module in the native system comprises:
performing an optimization operation on the first partial program to reduce interaction overhead between the native system and the virtual machine.
4. The data processing method of claim 3, wherein performing optimization operations on the first partial program comprises:
fusing a series of operation steps in the first partial program into a single operation step.
5. The data processing method of claim 3, wherein performing optimization operations on the first partial program comprises:
setting at least one operator in the first partial program to batch process a plurality of data at a time.
6. The data processing method of claim 3, wherein performing optimization operations on the first partial program comprises:
for the data to be processed by the first partial program, converting data of the same data type into a data arrangement stored contiguously in memory.
7. The data processing method of claim 6,
the input of at least one operator in the first partial program is set as a start address pointer for the data arrangement.
8. The data processing method according to claim 1,
the predetermined virtual machine big data processing system is Apache Spark.
9. A data processing apparatus for big data, comprising:
a program acquisition unit configured to acquire a main program written in a native programming language;
a program compiling unit configured to compile the main program in a native system and generate a loadable module, wherein the loadable module comprises a first partial program and a second partial program, the first partial program comprises a function implementation of a first operator, and the second partial program comprises a function pointer for calling a function corresponding to a second operator;
a loading and conversion unit configured to load, by a virtual machine running in the native system, the loadable module and to convert the second partial program into virtual machine code supported by a predetermined virtual machine big data processing system, the virtual machine code calling a corresponding function implementation of the predetermined virtual machine big data processing system; and
an operation unit configured to run, by the virtual machine, the main program contained in the loadable module, wherein, during the running of the main program, the virtual machine builds a directed acyclic graph program corresponding to the loadable module and then runs the directed acyclic graph program, delivering the first partial program to the native system for running and delivering the converted second partial program to the predetermined virtual machine big data processing system for running.
10. The data processing apparatus of claim 9, further comprising:
an optimization unit configured to perform an optimization operation on the first partial program to reduce interaction overhead between the native system and the virtual machine.
11. The data processing apparatus of claim 10, wherein performing the optimization operation on the first partial program comprises:
combining a series of operation steps in the first partial program into a single operation step.
12. The data processing apparatus of claim 10, wherein performing the optimization operation on the first partial program comprises:
setting at least one operator in the first partial program to process a plurality of data items in a batch at a time.
13. The data processing apparatus of claim 10, wherein performing the optimization operation on the first partial program comprises:
for the data to be processed by the first partial program, converting data of the same data type into a data arrangement stored contiguously in memory.
14. The data processing apparatus of claim 13, wherein an input of at least one operator in the first partial program is set as a start address pointer of the data arrangement.
CN202111618375.0A 2021-12-27 2021-12-27 Data processing method and data processing device for big data Pending CN114327479A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111618375.0A CN114327479A (en) 2021-12-27 2021-12-27 Data processing method and data processing device for big data
PCT/CN2022/130286 WO2023124543A1 (en) 2021-12-27 2022-11-07 Data processing method and data processing apparatus for big data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111618375.0A CN114327479A (en) 2021-12-27 2021-12-27 Data processing method and data processing device for big data

Publications (1)

Publication Number Publication Date
CN114327479A (en) 2022-04-12

Family

ID=81014410

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111618375.0A Pending CN114327479A (en) 2021-12-27 2021-12-27 Data processing method and data processing device for big data

Country Status (2)

Country Link
CN (1) CN114327479A (en)
WO (1) WO2023124543A1 (en)



Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103777997A * 2013-12-25 2014-05-07 Chinasoft Information System Engineering Co., Ltd. JAVA virtual machine hardware independency platform based on MIPS and independency improvement method thereof
CN106648681A * 2016-12-29 2017-05-10 Nanjing SCIYON Automation Group Co., Ltd. System and method for compiling and loading programmable language
US11163594B2 (en) * 2019-10-29 2021-11-02 International Business Machines Corporation Rescheduling JIT compilation based on jobs of parallel distributed computing framework
CN111309449B (en) * 2020-03-17 2023-09-08 Shanghai Lanzai Information Technology Co., Ltd. Programming language independent virtual machine oriented to meta programming, interactive programming and blockchain interoperation
CN111767116B (en) * 2020-06-03 2023-09-05 Jiangsu Zhongke Zhongde Intelligent Technology Co., Ltd. Virtual machine for developing programming language for mechanical arm program and running method for assembly file
CN114327479A (en) * 2021-12-27 2022-04-12 Tsinghua University Data processing method and data processing device for big data

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105550296A * 2015-12-10 2016-05-04 Shenzhen Huaxun Fangzhou Software Technology Co., Ltd. Data importing method based on spark-SQL big data processing platform

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
BOWEN YU; GUANYU FENG; et al.: "Chukonu: A Fully-Featured High-Performance Big Data Framework that Integrates a Native Compute Engine into Spark", Proceedings of the VLDB Endowment, vol. 15, no. 4, 1 December 2021 (2021-12-01), pages 872-877 *
DING Mengsu; CHEN Shimin: "Helius: A Lightweight Big Data Processing System" (轻量级大数据运算系统Helius), Journal of Computer Applications (计算机应用), no. 02, 10 February 2017 (2017-02-10) *
LI Dingji; MI Zeyu; WU Baodong; CHEN Xun; ZHAO Yongwang; DING Zuohua; CHEN Haibo: "Accelerator Virtualization Framework Based on Cross-Virtual-Machine Exitless Communication" (基于跨虚拟机零下陷通信的加速器虚拟化框架), Journal of Software (软件学报), no. 10, 14 October 2020 (2020-10-14) *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023124543A1 (en) * 2021-12-27 2023-07-06 清华大学 Data processing method and data processing apparatus for big data
CN115378789A * 2022-10-24 2022-11-22 China University of Geosciences (Beijing) Multi-level cooperative stream resource management method and system
CN115378789B * 2022-10-24 2023-01-10 China University of Geosciences (Beijing) Multi-level cooperative stream resource management method and system

Also Published As

Publication number Publication date
WO2023124543A1 (en) 2023-07-06

Similar Documents

Publication Publication Date Title
AU2022200853B2 (en) Containerized deployment of microservices based on monolithic legacy applications
US11068245B2 (en) Containerized deployment of microservices based on monolithic legacy applications
Elser et al. An evaluation study of BigData frameworks for graph processing
Jha et al. A tale of two data-intensive paradigms: Applications, abstractions, and architectures
Iordache et al. Resilin: Elastic mapreduce over multiple clouds
WO2023124543A1 (en) Data processing method and data processing apparatus for big data
Zatsarinny et al. Toward high performance solutions as services of research digital platform
US9684493B2 (en) R-language integration with a declarative machine learning language
US10749984B2 (en) Processing requests for multi-versioned service
US11762639B2 (en) Containerized deployment of microservices based on monolithic legacy applications
CN110532072A (en) Distributive type data processing method and system based on Mach
Miceli et al. Programming abstractions for data intensive computing on clouds and grids
Murray et al. Scripting the cloud with Skywriting
Palyvos-Giannas et al. Lachesis: a middleware for customizing OS scheduling of stream processing queries
de Carvalho Junior et al. Contextual abstraction in a type system for component-based high performance computing platforms
Ottimo et al. FSP: a framework for data stream processing applications targeting FPGAs
Thor et al. Cloudfuice: A flexible cloud-based data integration system
KR100294876B1 (en) Operation system capable of dynamic reconfiguration and method for the same
Tian Multi-Stage Vertex-Centric Programming for Agent-Based Simulations
JP2022009562A (en) Containerized deployment of microservices based on monolithic legacy application
Qiao Mr-Tree: A Customizable GPU MapReduce Framework

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination