WO2020233584A1 - Mixed language task execution method and device, and cluster - Google Patents

Mixed language task execution method and device, and cluster

Info

Publication number
WO2020233584A1
WO2020233584A1 (PCT/CN2020/091189)
Authority
WO
WIPO (PCT)
Prior art keywords
language
subtasks
code
subtask
data
Prior art date
Application number
PCT/CN2020/091189
Other languages
French (fr)
Chinese (zh)
Inventor
刘铖
Original Assignee
星环信息科技(上海)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 星环信息科技(上海)有限公司 filed Critical 星环信息科技(上海)有限公司
Publication of WO2020233584A1 publication Critical patent/WO2020233584A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues

Definitions

  • This application relates to the field of distributed technology, for example, to a mixed language task execution method, device, and cluster.
  • a cluster is a parallel system or a distributed system composed of a number of computers connected to each other.
  • a single computer in a cluster is usually called a node, which is usually connected through a local area network, but there are other possible connection methods.
  • Cluster computers are generally used to improve the calculation speed and/or reliability of a single computer.
  • Distributed computing framework is the running and programming framework of distributed systems for processing big data.
  • Storm is a distributed real-time computing system for processing high-speed and large data streams, adding reliable real-time data processing functions to Hadoop; Spark uses memory computing, starting from multi-iteration batch processing, allowing data to be loaded into memory for repeated queries, and it also integrates multiple computing paradigms such as data warehouse, stream processing, and graph computing.
  • Spark is built on the Hadoop Distributed File System (HDFS) and can be well integrated with Hadoop. Among them, Spark has multiple versions, such as Spark1, Spark2, etc.
  • the distributed computing framework can support multiple languages.
  • Spark2 can support the Python language and the R language, which means that Spark2 can support the distributed execution of tasks in the Python language and the distributed execution of tasks in the R language.
  • in the related art, because each language has certain limitations in realizing services, tasks often need to mix multiple languages to meet the needs of users; however, the distributed computing framework can only support distributed execution of single-language tasks and does not support distributed execution of mixed-language tasks. As a result, distributed execution of single-language tasks has certain limitations in realizing services and cannot realize some specific functions, thus failing to meet the needs of users.
  • the embodiments of the present application provide a method, device, and cluster for executing a mixed language task, which can overcome the limitation of executing a single language task to implement a business, and can implement more business functions.
  • the embodiment of the present application provides a method for executing mixed language tasks.
  • the method is applied to a cluster, and the method includes:
  • obtaining a task to be executed, and dividing the task to be executed into at least two subtasks, where different subtasks are written in code of different programming languages;
  • determining the language type of the code in the at least two subtasks;
  • executing the at least two subtasks in a manner corresponding to the language type of the code in the at least two subtasks;
  • storing the execution result in a Java virtual machine, so that the execution result is read from the Java virtual machine in the case of subsequent calculations.
  • An embodiment of the present application also provides a mixed language task execution device, the device is applied to a cluster, and the device includes:
  • the obtaining module is configured to obtain the task to be executed, and divide the task to be executed into at least two subtasks, wherein different subtasks are written in codes in different programming languages;
  • a judging module configured to judge the language type of the codes in the at least two subtasks
  • the execution module is configured to execute the at least two subtasks in a manner corresponding to the language type of the codes in the at least two subtasks;
  • the storage module is configured to store the execution result in the Java virtual machine, so that the execution result is read from the Java virtual machine in the case of subsequent calculations.
  • An embodiment of the present application also provides a cluster, including a mixed-language task execution device provided by the embodiment of the present application.
  • Fig. 1 is a flowchart of a method for executing a mixed language task provided by an embodiment of the present application;
  • Fig. 2 is a flowchart of another method for executing a mixed language task provided by an embodiment of the present application;
  • Fig. 3a is a flowchart of another method for executing a mixed language task provided by an embodiment of the present application;
  • Fig. 3b is a flowchart of a method for executing subtasks other than the Python and R languages provided by an embodiment of the present application;
  • Fig. 3c is a flowchart of another method for executing a mixed language task provided by an embodiment of the present application;
  • Fig. 4 is a structural block diagram of a mixed language task execution device provided by an embodiment of the present application;
  • Fig. 5 is a schematic structural diagram of a cluster provided by an embodiment of the present application.
  • FIG. 1 is a flowchart of a method for executing a mixed language task provided by an embodiment of the present application.
  • the method may be executed by a mixed language task execution device, and the device may be implemented by software and/or hardware and may be configured in a cluster.
  • the method can be applied to scenarios in which big data is calculated, including scenarios in which big data is calculated in multiple ways.
  • the embodiment of the present application takes Spark2 configured in a cluster as an example for description.
  • the cluster may include at least three servers, and the cluster in the embodiment of the present application may be a Hadoop cluster.
  • the Spark2 distributed computing framework makes full use of the advantages of clusters and breaks through the limitations of a single server on memory, central processing unit (CPU), and storage compared to stand-alone computing.
  • Using distributed computing methods can accelerate the calculation of big data.
  • with the stand-alone computing method, when the business and data volume increase, stand-alone resources are prone to bottlenecks, accompanied by frequent garbage collection (gc) and high input/output (I/O) from writing back to disk, which affects the correctness of business functions and results.
  • the distributed computing framework Spark2 can expand computing resources horizontally, freely allocate the resources required by tasks, and assign tasks to corresponding nodes for calculation according to the selected resources; combined with Spark2's distributed algorithms and feature engineering, the data is processed in a distributed manner.
  • Spark2 supports Scala, Python and R languages, and can provide related interfaces to call distributed algorithms implemented by Spark2.
  • the core algorithm of Spark2 is implemented in Scala language.
  • Python and R can provide functions that call the implementation of Scala and provide distributed functions for basic algorithms.
  • Spark2 includes functions such as map and mapPartition, and supports custom implementation of distributed functions.
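  • as an illustration of the custom distributed functions mentioned above, the following minimal sketch (not taken from the patent; the column names and the per-partition function are illustrative) applies a user-defined function to every partition of a Spark DataFrame through mapPartitions:

      from pyspark.sql import SparkSession

      spark = SparkSession.builder.appName("custom-distributed-fn").getOrCreate()
      df = spark.createDataFrame([(i, float(i)) for i in range(1000)], ["id", "value"])

      def scale_partition(rows):
          # runs once per partition, on the executor that holds that partition
          for row in rows:
              yield (row.id, row.value * 2.0)

      scaled = df.rdd.mapPartitions(scale_partition).toDF(["id", "value_x2"])
      scaled.show(5)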
  • Spark2 supports user-defined implementation of distributed functions, provides Python, R and Scala interfaces, and supports distributed execution of single-language tasks, but it does not support distributed execution of mixed-language tasks, which results in certain limitations in realizing services: some specific functions cannot be implemented.
  • the embodiment of the application provides that, by dividing a task into subtasks written in code of different languages and executing the subtasks in a manner corresponding to the language type of the code in each subtask, a task to be executed can be written in multiple languages, which can overcome the limitation of running single-language tasks to realize services and can realize more business functions.
  • the technical solution provided by the embodiment of the present application includes: S110 to S140.
  • S110 Obtain a task to be executed, and divide the task to be executed into at least two subtasks, where different subtasks are written using codes in different languages.
  • the user equipment can communicate with the cluster.
  • a system for communicating with the cluster may be configured on the user equipment, and the user may use code writing tasks in the system.
  • when users write tasks, they can write the tasks in different languages.
  • the cluster can obtain the task to be executed through the interface, and divide the task to be executed into at least two subtasks.
  • the task may be divided into subtasks according to the identification of the code in the task to be executed. For example, subtasks in different programming languages have different code identifiers, and the task to be executed can be divided into at least two subtasks according to the identifiers in the code.
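  • the patent does not fix a concrete identifier format; a minimal sketch, assuming hypothetical markers such as %%python and %%r at the start of each code section, could split the submitted task text into per-language subtasks as follows:

      import re

      # hypothetical language markers; the real identifiers are not specified in the patent
      LANG_MARKER = re.compile(r"^%%(python|r|scala|bash)\s*$", re.IGNORECASE | re.MULTILINE)

      def split_task(task_text):
          """Split a mixed-language task into (language, code) subtasks."""
          pieces = LANG_MARKER.split(task_text)
          # pieces = [text before first marker, lang1, code1, lang2, code2, ...]
          return [(lang.lower(), code.strip())
                  for lang, code in zip(pieces[1::2], pieces[2::2])]

      task = "\n".join([
          "%%python",
          "result = sum(range(10))",
          "%%r",
          "x <- c(1, 2, 3); mean(x)",
      ])
      print(split_task(task))   # [('python', 'result = sum(range(10))'), ('r', 'x <- c(1, 2, 3); mean(x)')]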
  • S120 Determine the language type of the code in the at least two subtasks.
  • the codes of different subtasks are stored in different functional modules.
  • the judging the language type of the code in the at least two subtasks includes: judging the language type of the code in the at least two subtasks based on the identifier of the functional module where the code in the subtask is located.
  • the code of each subtask may be stored in different functional modules, and the stored functional modules of subtasks written in different languages are not the same.
  • Each functional module that stores the subtask has an identification that is different from other functional modules, and the language type of the code in the subtask in the functional module can be determined through the identification of the functional module.
  • the method of determining the language type of the code in the subtask is not limited to the above method, and may also be other methods.
  • S130 Execute the at least two subtasks in a manner corresponding to the language types of the codes in the at least two subtasks.
  • if the language types of the code in the at least two subtasks are different, the manners of executing the at least two subtasks may also be different.
  • the language type of the subtask code can be Python language, R language or other languages.
  • the manner of executing at least two subtasks may be executed serially, or may also be executed in parallel.
  • when there is a dependency relationship among multiple subtasks, the manner in which the cluster executes the subtasks may be serial execution, that is, the multiple subtasks are executed in the order defined by the dependency relationship.
  • when there is no dependency relationship among multiple subtasks, the cluster can execute the subtasks in parallel, which can save time and improve efficiency.
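  • a minimal sketch of this scheduling choice (run_subtask is a placeholder, not part of the patent): subtasks with a dependency relationship run serially in order, while independent subtasks are submitted to a thread pool and run in parallel:

      from concurrent.futures import ThreadPoolExecutor

      def run_subtask(subtask):
          # placeholder: dispatch to the executor matching the subtask's language
          return f"result of {subtask}"

      def execute_subtasks(ordered_subtasks, has_dependencies):
          if has_dependencies:
              # serial: respect the order implied by the dependency relationship
              return [run_subtask(s) for s in ordered_subtasks]
          # independent subtasks: run them in parallel to save time
          with ThreadPoolExecutor() as pool:
              return list(pool.map(run_subtask, ordered_subtasks))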
  • S140 Store the execution result in the Java virtual machine, so that the execution result is read from the Java virtual machine during subsequent calculations.
  • the execution result may be stored in the form of Data Frame in the Java virtual machine.
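  • one way to realize this step, sketched below under the assumption that the result is registered as a named Spark table (a Spark DataFrame is backed by JVM-side objects); the table name is illustrative:

      from pyspark.sql import SparkSession

      spark = SparkSession.builder.getOrCreate()

      # a Python subtask produces its result and registers it on the JVM side
      result_df = spark.createDataFrame([(1, 0.5), (2, 1.5)], ["id", "score"])
      result_df.createOrReplaceTempView("subtask1_result")   # illustrative name

      # a later subtask reads the result back without any intermediate file
      reloaded = spark.table("subtask1_result")
      reloaded.show()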
  • in the related art, Spark2 supports user-defined implementation of distributed functions, provides Python, R and Scala interfaces, and supports distributed execution of single-language tasks, but it does not support distributed execution of mixed-language tasks, which results in certain limitations in realizing services and prevents some specific functions from being implemented, so that user needs cannot be met. To meet user needs, multiple tasks often have to be written in different languages and executed separately by the cluster configured with Spark2; after each task is executed, the execution results of the tasks often need to interact through intermediate files, so the accuracy and stability of the processing results are not controllable.
  • in the embodiment of the present application, by dividing a task into subtasks written in code of different languages and executing the subtasks in a manner corresponding to the language type of the code in each subtask, a task to be executed can be written in multiple languages, that is, multiple languages can be mixed within the Spark resource life cycle, which can overcome the limitations of running single-language tasks to realize services and can realize more business functions.
  • by dividing the task into subtasks and storing the execution results through the Java virtual machine, the embodiment avoids the interaction of calculation results through intermediate files and ensures the accuracy and controllability of the processing results.
  • the method provided in the embodiments of the present application is not limited to being applied to clusters configured with the Spark2 distributed computing framework, and can also be applied to clusters configured with other distributed computing frameworks.
  • the mixed language task execution method provided by the embodiment of the present application divides a task into subtasks, where the subtasks are written in code of different languages, and executes the subtasks in a manner corresponding to the language type of the code in each subtask. In this way, a task to be executed can be written in multiple languages, which overcomes the functional limitations of running single-language tasks and enables more functions to be realized.
  • the embodiment of the present application feeds the execution result of each subtask back to the virtual machine for storage, so that the calculation result is read from the virtual machine during subsequent calculations, thereby avoiding the situation in which calculation results interact through intermediate files, and ensuring the accuracy and controllability of the processing results.
  • FIG. 2 is a flowchart of another method for executing a mixed language task provided by an embodiment of the present application.
  • in the embodiment of the present application, the operation of executing the subtasks in a manner corresponding to the language type of the code in the subtasks is described.
  • the technical solutions provided by the embodiments of the present application include: S210 to S290.
  • S210 Obtain a task to be executed, and divide the task to be executed into at least two subtasks, where different subtasks are written using codes in different programming languages.
  • S220 Determine the language type of the code in the at least two subtasks.
  • S230 If it is determined that the language type of the code of the first subtask of the at least two subtasks is a target language, establish a connection between a target language interface and the Java virtual machine, where the target language includes the Python language or the R language, and the target language interface includes the Pyspark interface corresponding to the Python language or the SparkR interface corresponding to the R language.
  • Spark2 is configured in the cluster, where the target language interface may be an interface for performing a set function.
  • in the cluster, if the language type of the code in the subtask is the Python language, a connection between the Pyspark interface and the Java virtual machine is established; if the language type of the code in the subtask is the R language, a connection between the SparkR interface and the Java virtual machine is established.
  • the Java virtual machine can be created on the master node in the cluster, and the Java virtual machine can be used to store data or variables.
  • the Pyspark interface and the Java virtual machine are connected through a gateway.
  • the SparkR interface and the Java virtual machine are connected through callJStatic; before the SparkR interface is connected to the Java virtual machine, the SparkR interface and the Java virtual machine may first exchange Socket parameters, thereby realizing the connection between the SparkR interface and the Java virtual machine.
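  • PySpark reaches the JVM through a Py4J gateway; the sketch below assumes a GatewayServer is already listening on the JVM side (as the PySpark driver arranges) and simply attaches to it from Python — the port and the HashMap example are assumptions, not details from the patent:

      from py4j.java_gateway import JavaGateway, GatewayParameters

      # attach to a JVM-side Py4J GatewayServer (assumed to be running on the default port)
      gateway = JavaGateway(gateway_parameters=GatewayParameters(port=25333))
      jvm = gateway.jvm

      # JVM objects can now be reached from Python, e.g. a java.util.HashMap holding shared variables
      shared = jvm.java.util.HashMap()
      shared.put("table_name", "table_1")
      print(shared.get("table_name"))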
  • S240 Read data or variables stored in the virtual machine through the target language interface, and perform distributed calculation on the read data, or perform distributed calculation on the data corresponding to the variables, to obtain a calculation result.
  • based on the code of the first subtask, data or variables are read from the Java virtual machine through the target language interface, and distributed calculation is performed on the variables or the data to obtain the calculation result.
  • if the language of the code in the subtask is the Python language, a connection between the Pyspark interface and the Java virtual machine is established, and the stored data or variables are read from the virtual machine through the Pyspark interface based on the code in the subtask.
  • in an embodiment, the code in the subtask contains variables; the variables can be read from the virtual machine through the Pyspark interface based on the variables contained in the code, and the data corresponding to the variables is calculated by the cluster in a distributed manner.
  • when a variable is read from the Java virtual machine through the Pyspark interface, the data corresponding to the variable may be stored on the slave nodes in the cluster.
  • the distributed calculation of variables can be performed according to the method followed by the Spark2 distributed computing framework.
  • the corresponding data can be read from the virtual machine based on the variables contained in the code in the subtask through the Pyspark interface, and the cluster performs distributed calculation on the read data.
  • for example, if the variable included in the subtask code is Table 1, the data in Table 1 can be read from the virtual machine through the Pyspark interface.
  • the distributed calculation of the data can be performed according to the method followed by the Spark2 distributed computing framework.
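  • continuing the Table 1 example, a Python subtask could resolve the variable to a JVM-side table through the PySpark interface and let the cluster run the computation; the table name and the chosen statistic are illustrative:

      from pyspark.sql import SparkSession, functions as F

      spark = SparkSession.builder.getOrCreate()

      # resolve the variable "Table 1" to the DataFrame held on the JVM side (illustrative name)
      table_1 = spark.table("table_1")

      # the aggregation itself is executed by the cluster in a distributed manner
      column_means = table_1.select([F.mean(c).alias("avg_" + c) for c in table_1.columns])
      column_means.show()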
  • S250 Feed back the calculation result to the Java virtual machine, so that the calculation result is read from the Java virtual machine during subsequent calculations.
  • the master node in the cluster assigns computing tasks to the slave nodes, the slave nodes perform the calculations to obtain calculation results, and the calculation results are fed back to the master node in the cluster; the master node then feeds the calculation results back to the Java virtual machine, so that the calculation results can be read from the Java virtual machine during subsequent calculations.
  • the calculation result may be stored in the form of Data Frame in the Java virtual machine.
  • S260 If the master node in the cluster determines that the language type of the code of the second subtask is a language other than the Python language and the R language, the master node reads variables or data from the Java virtual machine based on the code of the second subtask.
  • the master node in the cluster can read variables from the Java virtual machine based on the variables contained in the code in the subtask. Among them, when the variable is read from the Java virtual machine through the master node, the data corresponding to the variable included in the subtask code may not exist in the Java virtual machine, and the data corresponding to the variable may be stored in the slave node.
  • the master node can read data corresponding to the variable from the Java virtual machine based on the variable included in the subtask code, where the data corresponding to the variable can be stored in the Java virtual machine.
  • S270 Modify the variable through the master node and send the modified variable and the code of the second subtask to the slave node corresponding to the modified variable, or divide the read data and send the code of the second subtask and the divided data to the slave node corresponding to the divided data.
  • in an embodiment, sending the modified variable and the code of the second subtask to the slave node corresponding to the modified variable, or dividing the read data and sending the code of the second subtask and the divided data to the slave node corresponding to the divided data, includes: sending the modified variable and the code of the second subtask to the distributed file system, or sending the divided data and the code of the second subtask to the distributed file system; and downloading, through the slave node, the code of the second subtask and the modified variable or divided data corresponding to the slave node from the distributed file system.
  • different slave nodes correspond to different modified variables, or different slave nodes correspond to different divided data.
  • the master node in the cluster may allocate the task to multiple slave nodes for execution, and each slave node may execute a calculation task on part of the data. Therefore, when the master node reads variables or data from the virtual machine based on the code in the subtask, the master node needs to modify the read variables or divide the read data, so that multiple slave nodes perform calculations on different data, thereby completing the calculation of the read data or the calculation of the data corresponding to the read variables.
  • the distributed file system may be a Hadoop distributed file system, and each slave node downloads the code and modified variables (or partitioned data) in the subtasks from the distributed file system.
  • the method for each slave node to download the modified variable or divide the data from the distributed file system can be: establish a correspondence between the modified variable and the slave node number, and the slave node can download the corresponding modified variable according to the relationship; Or a corresponding relationship between the division data and the number of the slave node is established, and the slave node can download the corresponding division data according to the relationship.
  • the method for each slave node to download modified variables or partition data from the distributed file system may also be other methods.
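  • a sketch of this correspondence, assuming the master writes one descriptor per partition to HDFS under a path keyed by the partition number and each slave fetches only its own descriptor; the path layout, file format and command-line calls are assumptions:

      import json
      import subprocess

      def publish_partitions(job_id, partition_specs):
          """Master side: write one spec per slave, keyed by partition number (layout is illustrative)."""
          for part_no, spec in enumerate(partition_specs):
              local = f"/tmp/{job_id}_{part_no}.json"
              with open(local, "w") as f:
                  json.dump(spec, f)   # e.g. {"table": "table_1", "rows": [1, 100]}
              subprocess.run(["hdfs", "dfs", "-put", "-f", local,
                              f"/jobs/{job_id}/part_{part_no}.json"], check=True)

      def fetch_partition(job_id, part_no):
          """Slave side: download only the spec that matches this node's partition number."""
          local = f"/tmp/spec_{part_no}.json"
          subprocess.run(["hdfs", "dfs", "-get",
                          f"/jobs/{job_id}/part_{part_no}.json", local], check=True)
          with open(local) as f:
              return json.load(f)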
  • S280 Run the code of the second subtask through the slave node, calculate the data corresponding to the modified variable, or calculate the divided data, to obtain a calculation result, and feed the calculation result back to the master node.
  • in an embodiment, the slave node runs the code in the subtask, queries the corresponding data in the storage location of the slave node according to the modified variable, and calculates the data; or the slave node calculates the downloaded divided data to obtain the calculation result. The calculation result of each slave node is fed back to the master node.
  • S290 Receive the calculation result fed back by the slave node through the master node, and feed the calculation result back to the Java virtual machine.
  • the technical solutions of S260-S290 are illustrated by an example. Suppose the subtask is to calculate the sum of each row of data in Table 1, where the variable included in the subtask code is Table 1.
  • the variable read from the Java virtual machine by the master node in the cluster based on the subtask code is Table 1, and the read variable is modified. The modified variables may be: Lines 1 to 100 of Table 1, and Lines 101 to 200 of Table 1.
  • the master node in the cluster sends the modified variables and the subtask code to the distributed file system. Slave node 1 downloads the subtask code and the modified variable (Lines 1 to 100 of Table 1) from the distributed file system, so that when slave node 1 runs the subtask code, it queries the data in rows 1-100 of Table 1 and sums each row of data in rows 1-100 of Table 1.
  • slave node 2 downloads the subtask code and the modified variable (Lines 101 to 200 of Table 1) from the distributed file system, so that when slave node 2 runs the subtask code, it looks up the data in rows 101-200 of Table 1 based on the downloaded modified variable and calculates the sum of each row of data in rows 101-200 of Table 1. The calculation results of the slave nodes are then summarized, so that the calculation results can be obtained in the Java Virtual Machine (JVM) of the master node for subsequent task calculation.
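  • a hedged PySpark rendering of this Table 1 example: every row is given a line number, the per-node row ranges play the role of the modified variables, and each range is summed by the cluster; the table name, the numbering helper, and the assumption that all columns are numeric are illustrative:

      from pyspark.sql import SparkSession, functions as F

      spark = SparkSession.builder.getOrCreate()
      table_1 = spark.table("table_1")   # illustrative table name

      # attach a 1-based line number so the "Lines 1-100 / 101-200" split can be expressed
      numbered = table_1.rdd.zipWithIndex().map(
          lambda pair: tuple(pair[0]) + (pair[1] + 1,)
      ).toDF(table_1.columns + ["line_no"])

      for lo, hi in [(1, 100), (101, 200)]:   # the modified variables of the example
          chunk = numbered.filter((F.col("line_no") >= lo) & (F.col("line_no") <= hi))
          row_sums = chunk.withColumn(
              "row_sum", sum(F.col(c) for c in table_1.columns))   # assumes numeric columns
          row_sums.select("line_no", "row_sum").show(5)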
  • in the related art, the languages supported for distributed tasks have certain limitations; for example, they are limited to Scala, R, and Python, and distributed execution is not supported for tasks in other languages. In addition, the languages that users can use are relatively limited.
  • in the embodiment of the present application, when the language type of the code in the subtask is a language other than the target language, the code and data in the subtask can be distributed to the slave nodes through the master node, so that the slave nodes can run the subtask.
  • this overcomes the limitation on the languages available for distributed task execution: tasks in multiple languages can be executed in a distributed manner, avoiding users' dependence on a single language and enriching the languages that users can use.
  • in the related art, the execution of some tasks is done on a single machine. If such tasks need to be executed in a distributed manner, the single-machine program needs to be modified, often by deleting the program and rewriting it, which makes the modification to distributed execution difficult.
  • with the method provided by the embodiment of this application, by judging the language type of the code in the task and executing the task in a distributed manner in the corresponding way, there is no need to rewrite the single-machine program; the purpose of distributed task execution can be achieved by only adding the code of the method provided by the embodiment of this application to the original program, which saves time and reduces the difficulty of converting single-machine computing into distributed computing.
  • the embodiment of the present application exemplarily combines S210-S290 into one embodiment of the task execution method, but this embodiment is only an example. In other embodiments of the present application, S210-S250 may be combined into one embodiment of the task execution method, or S210, S220, and S260-S290 may form an embodiment of the task execution method.
  • Fig. 3a is a flowchart of another method for executing a mixed language task provided by an embodiment of the present application. As shown in Fig. 3a, the technical solution provided by an embodiment of the present application includes:
  • initializing Spark resources may include applying for Spark resources.
  • the required Spark resources can be preset, and the Spark resources can be applied for in advance.
  • the JVM may store the requested Spark resource list, may store data and variables, may also store intermediate data and variables generated in the distributed computing process, or may also store some other data.
  • S330 Obtain the task to be executed, and divide the task to be executed into at least two subtasks, where different subtasks are written using codes in different programming languages.
  • Pyspark implements Spark's Application Programming Interface (API) for Python; through this interface, users can write Python programs that run on Spark and thereby take advantage of Spark's distributed computing features.
  • the native Pyspark uses java_gateway to call the Spark interface, providing Python with methods such as calling the SparkContext implemented in the Scala language to initialize resources and call distributed algorithms, so as to achieve distributed computing.
  • in the related art, the above methods are used to initialize resources; every task needs its own initialization, and the processing mechanism is cumbersome and time-consuming.
  • the embodiment of the application initializes resources in advance. When the code language of the executed subtask is Python, a gatewayServer instance can be created, allowing the Python program in the cluster to communicate with the JVM; the data and Spark objects in the JVM are serialized so that data or Spark objects can be read from the JVM through the Pyspark interface, and distributed calculation is then performed based on the read data.
  • the approach of initializing resources first and then performing distributed calculations provided by the embodiment of the application can avoid initializing resources for every task processed, which can improve efficiency.
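  • a minimal sketch of this reuse, assuming one long-lived SparkSession serves every Python subtask instead of each task initializing its own resources; the exec-based dispatch is purely illustrative and ignores sandboxing:

      from pyspark.sql import SparkSession

      # initialize Spark resources once, before any subtask runs
      spark = SparkSession.builder.appName("mixed-language-job").getOrCreate()

      def run_python_subtask(code, spark):
          # illustrative only: run the subtask body with the shared session in scope
          namespace = {"spark": spark}
          exec(code, namespace)
          return namespace.get("result")

      subtasks = [
          "result = spark.range(10).count()",
          "result = spark.range(5).selectExpr('sum(id)').first()[0]",
      ]
      for code in subtasks:
          print(run_python_subtask(code, spark))   # the same SparkSession serves every subtask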
  • S360 If the type of the code in the subtask is judged to be the R language, Socket parameter interaction is performed between the SparkR interface and the JVM, so that the SparkR interface is connected to the JVM through callJStatic and data is read from the JVM; the SparkR calculation is then performed, the calculation result is transferred, and the resources are released.
  • SparkR is an R language package that provides a lightweight way to use Apache Spark in the R language. SparkR implements a distributed data frame and supports operations such as query, filtering, and aggregation.
  • the native SparkR uses the callJStatic method to call the method of initializing resources defined in the scala language to complete the initialization of the resources, and then perform distributed calculations by calling the initialized resources, and interact the calculation results through the socket.
  • the above-mentioned methods in the related art need to initialize resources every time an R language task is executed, and the processing mechanism is cumbersome and wastes time.
  • the embodiment of this application initializes resources in advance and connects to the JVM through the SparkR interface, so that it can be connected to the resources that have been applied for, and data interaction with the JVM can be realized. After the cluster obtains the data, distributed computing is implemented, and the calculation results are fed back to the JVM. It can avoid the need to initialize resources for processing each R language task, which can improve efficiency.
  • the master node can use the mapPartition method to transfer the subtask code and data through HDFS and distribute them to each slave node included in the applied resources, so that the slave nodes can continue to calculate the data using the instructions supported in their operating environment (python2, python3, bash, rscript, etc.). After the calculation is completed, the calculation results are collected through the master node and converted into a Spark DataFrame for subsequent data processing, storage, distributed calculation, and so on.
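  • a sketch of this mapPartition-based distribution: each partition's rows are piped to an interpreter available in the worker's operating environment (bash/awk here, purely as an illustration) and the collected per-partition results are turned back into a Spark DataFrame:

      import subprocess
      from pyspark.sql import SparkSession

      spark = SparkSession.builder.getOrCreate()
      df = spark.createDataFrame([(i,) for i in range(100)], ["value"])

      def run_with_bash(rows):
          rows = list(rows)
          if not rows:   # skip empty partitions
              return
          values = "\n".join(str(r.value) for r in rows)
          # feed the partition's values to an external interpreter on the worker (illustrative command)
          out = subprocess.run(["bash", "-c", "awk '{s+=$1} END {print s}'"],
                               input=values, text=True, capture_output=True, check=True)
          yield (int(out.stdout.strip()),)

      partial_sums = df.rdd.mapPartitions(run_with_bash).toDF(["partition_sum"])
      # the master collects the per-partition results and keeps them as a Spark DataFrame
      partial_sums.show()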
  • the mixed language task execution method provided by the embodiment of the present application can also refer to the flow shown in Fig. 3c.
  • Fig. 4 is a structural block diagram of a mixed language task execution device provided by an embodiment of the present application.
  • the device is configured in a cluster.
  • the device includes: an acquisition module 410, a judgment module 420, an execution module 430, and a storage module 440.
  • the obtaining module 410 is configured to obtain a task to be executed, and divide the task to be executed into at least two subtasks, where different subtasks are written in codes in different programming languages;
  • the judgment module 420 is configured to judge the language type of the codes in the at least two subtasks;
  • the execution module 430 is configured to execute the at least two subtasks in a manner corresponding to the language types of the codes in the at least two subtasks;
  • the storage module 440 is configured to store the execution result in the Java virtual machine, so that the execution result is read from the Java virtual machine during subsequent calculations.
  • the execution module 430 is configured to: if it is determined that the language type of the code of the first subtask in the at least two subtasks is the target language, establish a connection between the target language interface and the Java virtual machine, wherein the target language includes Python Language or R language, the target language interface includes a Pyspark interface corresponding to the Python language or a SparkR interface corresponding to the R language;
  • the code based on the first subtask reads data or variables from the Java virtual machine through the target language interface, performs distributed calculations on the variables or data, and obtains a calculation result.
  • the Pyspark interface and the Java virtual machine are connected through a gateway, and the SparkR interface and the Java virtual machine are connected through a callJStatic.
  • the execution module 430 is set to:
  • if the master node in the cluster determines that the language type of the code of the second subtask in the at least two subtasks is a language other than the target language, the master node reads variables or data from the Java virtual machine based on the code of the second subtask;
  • modify the variable through the master node, and send the modified variable and the code of the second subtask to the slave node corresponding to the modified variable, or divide the read data, and send the code of the second subtask and the divided data to the slave node corresponding to the divided data;
  • the storage module 440 is configured to receive the calculation result fed back by the slave node through the master node, and feed the calculation result back to the Java virtual machine.
  • in an embodiment, modifying the variable through the master node and sending the modified variable and the code of the second subtask to the corresponding slave node, or dividing the read data and sending the code of the second subtask and the divided data to the slave node, includes: sending the modified variable and the code of the second subtask to the distributed file system, or sending the divided data and the code of the second subtask to the distributed file system; and downloading, through the slave node, the code of the second subtask and the modified variable or divided data corresponding to the slave node from the distributed file system.
  • the codes of different subtasks are stored in different functional modules, and the judgment module 420 is configured to judge the language types of the codes in the at least two subtasks based on the identifier of the functional module where the codes in the subtask are located.
  • the cluster is a Hadoop cluster.
  • Spark2 is configured in the cluster.
  • the above-mentioned device can execute the method provided in any embodiment of the present application, and has corresponding functional modules and beneficial effects for executing the method.
  • FIG. 5 is a structural block diagram of a cluster provided by an embodiment of the present application.
  • a cluster 500 provided by an embodiment of the present application includes a mixed language task execution device 501 provided by an embodiment of the present application.

Abstract

Embodiments of the present application provide a mixed language task execution method and device, and a cluster. The method is applied to the cluster and comprises: obtaining a task to be executed, and dividing the task to be executed into at least two sub-tasks, wherein different sub-tasks are written in codes of different programming languages; determining the language types of the codes in the at least two sub-tasks; respectively executing the at least two sub-tasks in a manner corresponding to the language types of the codes in the at least two sub-tasks; and storing an execution result in a Java virtual machine, so that the execution result is read from the Java virtual machine during subsequent calculation.

Description

Mixed language task execution method, device and cluster
This application claims priority to the Chinese patent application with application number 201910425952.0, filed with the Chinese Patent Office on May 21, 2019, the entire content of which is incorporated into this application by reference.
Technical field
This application relates to the field of distributed technology, for example, to a mixed language task execution method, device, and cluster.
Background
A cluster is a parallel system or a distributed system composed of a number of computers connected to each other. A single computer in a cluster is usually called a node; nodes are usually connected through a local area network, but other connection methods are also possible. Cluster computers are generally used to improve the calculation speed and/or reliability of a single computer.
A distributed computing framework is the running and programming framework of distributed systems for processing big data. For example, Storm is a distributed real-time computing system for processing high-speed and large data streams, adding reliable real-time data processing functions to Hadoop; Spark uses in-memory computing, starting from multi-iteration batch processing, allowing data to be loaded into memory for repeated queries, and it also integrates multiple computing paradigms such as data warehouse, stream processing, and graph computing. Spark is built on the Hadoop Distributed File System (HDFS) and can be well integrated with Hadoop. Spark has multiple versions, such as Spark1, Spark2, etc.
A distributed computing framework can support multiple languages. For example, Spark2 can support the Python language and the R language, which means that Spark2 can support the distributed execution of tasks in the Python language and the distributed execution of tasks in the R language. However, in the related art, because each language has certain limitations in realizing services, tasks often need to mix multiple languages to meet the needs of users, while the distributed computing framework can only support distributed execution of single-language tasks and does not support distributed execution of mixed-language tasks. As a result, distributed execution of single-language tasks has certain limitations in realizing services and cannot realize some specific functions, thus failing to meet the needs of users.
Summary of the invention
The embodiments of the present application provide a method, device, and cluster for executing a mixed language task, which can overcome the limitation of executing a single language task to implement a business, and can implement more business functions.
The embodiment of the present application provides a method for executing mixed language tasks. The method is applied to a cluster, and the method includes:
obtaining the task to be executed, and dividing the task to be executed into at least two subtasks, where different subtasks are written using codes in different programming languages;
determining the language type of the code in the at least two subtasks;
respectively executing the at least two subtasks in a manner corresponding to the language type of the codes in the at least two subtasks;
storing the execution result in the Java virtual machine, so that the execution result is read from the Java virtual machine in the case of subsequent calculations.
An embodiment of the present application also provides a mixed language task execution device, the device is applied to a cluster, and the device includes:
an obtaining module, configured to obtain the task to be executed, and divide the task to be executed into at least two subtasks, wherein different subtasks are written in codes in different programming languages;
a judging module, configured to judge the language type of the codes in the at least two subtasks;
an execution module, configured to execute the at least two subtasks in a manner corresponding to the language type of the codes in the at least two subtasks;
a storage module, configured to store the execution result in the Java virtual machine, so that the execution result is read from the Java virtual machine in the case of subsequent calculations.
An embodiment of the present application also provides a cluster, including a mixed-language task execution device provided by the embodiment of the present application.
Description of the drawings
Fig. 1 is a flowchart of a method for executing a mixed language task provided by an embodiment of the present application;
Fig. 2 is a flowchart of another method for executing a mixed language task provided by an embodiment of the present application;
Fig. 3a is a flowchart of another method for executing a mixed language task provided by an embodiment of the present application;
Fig. 3b is a flowchart of a method for executing subtasks other than the Python and R languages provided by an embodiment of the present application;
Fig. 3c is a flowchart of another method for executing a mixed language task provided by an embodiment of the present application;
Fig. 4 is a structural block diagram of a mixed language task execution device provided by an embodiment of the present application;
Fig. 5 is a schematic structural diagram of a cluster provided by an embodiment of the present application.
Detailed description
The application will be described below with reference to the drawings and embodiments. The drawings only show a part, but not all, of the structure related to this application.
Fig. 1 is a flowchart of a method for executing a mixed language task provided by an embodiment of the present application. The method may be executed by a mixed language task execution device, and the device may be implemented by software and/or hardware and may be configured in a cluster. The method can be applied to scenarios in which big data is calculated, including scenarios in which big data is calculated in multiple ways.
The embodiment of the present application takes Spark2 configured in a cluster as an example for description. Optionally, the cluster may include at least three servers, and the cluster in the embodiment of the present application may be a Hadoop cluster.
In one embodiment, the Spark2 distributed computing framework makes full use of the advantages of clusters and, compared with stand-alone computing, breaks through the limitations of a single server on memory, central processing unit (CPU), and storage. Using distributed computing methods can accelerate the calculation of big data. With the stand-alone computing method, when the business and data volume increase, stand-alone resources are prone to bottlenecks, accompanied by frequent garbage collection (gc) and high input/output (I/O) from writing back to disk, which affects the correctness of business functions and results. The distributed computing framework Spark2 can expand computing resources horizontally, freely allocate the resources required by tasks, and assign tasks to corresponding nodes for calculation according to the selected resources; combined with Spark2's distributed algorithms and feature engineering, the data is processed in a distributed manner.
Spark2 supports the Scala, Python and R languages, and can provide related interfaces to call distributed algorithms implemented by Spark2. The core algorithm of Spark2 is implemented in the Scala language; Python and R provide functions that call the Scala implementation and provide distributed functions for basic algorithms. At the same time, Spark2 includes functions such as map and mapPartition, and supports custom implementation of distributed functions.
In related technologies, Spark2 supports user-defined implementation of distributed functions, provides Python, R and Scala interfaces, and supports distributed execution of single-language tasks, but it does not support distributed execution of mixed-language tasks, which results in certain limitations in realizing services: some specific functions cannot be implemented. The embodiment of the application provides that, by dividing a task into subtasks written in code of different languages and executing the subtasks in a manner corresponding to the language type of the code in each subtask, a task to be executed can be written in multiple languages, which can overcome the limitation of running single-language tasks to realize services and can realize more business functions.
As shown in Fig. 1, the technical solution provided by the embodiment of the present application includes: S110 to S140.
S110: Obtain a task to be executed, and divide the task to be executed into at least two subtasks, where different subtasks are written using codes in different languages.
In the embodiment of this application, the user equipment can communicate with the cluster. Optionally, a system for communicating with the cluster may be configured on the user equipment, and the user may write tasks using code in the system. When users write tasks, they can write the tasks in different languages. In the embodiment of the present application, the cluster can obtain the task to be executed through the interface, and divide the task to be executed into at least two subtasks. In an embodiment, the task may be divided into subtasks according to the identification of the code in the task to be executed. For example, subtasks in different programming languages have different code identifiers, and the task to be executed can be divided into at least two subtasks according to the identifiers in the code.
S120: Determine the language type of the code in the at least two subtasks.
In the embodiment of the present application, optionally, the codes of different subtasks are stored in different functional modules. Judging the language type of the code in the at least two subtasks includes: judging the language type of the code in the at least two subtasks based on the identifier of the functional module where the code in the subtask is located.
In an embodiment, the code of each subtask may be stored in different functional modules, and the functional modules storing subtasks written in different languages are not the same. Each functional module that stores a subtask has an identifier that is different from those of other functional modules, and the language type of the code of the subtask in the functional module can be determined through the identifier of the functional module. The method of determining the language type of the code in the subtask is not limited to the above method, and may also be other methods.
S130: Execute the at least two subtasks in a manner corresponding to the language types of the codes in the at least two subtasks.
In the embodiment of the present application, if the language types of the codes in the at least two subtasks are not the same, the manners of executing the at least two subtasks may also be different. The language type of the subtask code can be the Python language, the R language or another language. For the manner of executing the subtasks, refer to the introduction of the following embodiment.
In the embodiment of the present application, the at least two subtasks may be executed serially or in parallel. When multiple subtasks have a dependency relationship, the manner in which the cluster executes the subtasks may be serial execution, that is, the multiple subtasks are executed in the order defined by the dependency relationship. When multiple subtasks do not have a dependency relationship, the cluster can execute the subtasks in parallel, which can save time and improve efficiency.
S140: Store the execution result in the Java virtual machine, so that the execution result is read from the Java virtual machine during subsequent calculations.
In the embodiment of the present application, the execution result may be stored in the form of a Data Frame in the Java virtual machine.
In related technologies, Spark2 supports user-defined implementation of distributed functions, provides Python, R and Scala interfaces, and supports distributed execution of single-language tasks, but it does not support distributed execution of mixed-language tasks, which results in certain limitations in realizing services: some specific functions cannot be implemented, so user needs cannot be met. In order to meet the needs of users, it is often necessary to write multiple tasks in different languages and have the cluster configured with Spark2 execute the multiple tasks to achieve the goal; after each task is executed, the execution results of the tasks often need to interact through intermediate files, so the accuracy and stability of the processing results are not controllable. In the embodiment of this application, by dividing a task into subtasks written in code of different languages and executing the subtasks in a manner corresponding to the language type of the code in each subtask, a task to be executed can be written in multiple languages, that is, multiple languages can be mixed within the Spark resource life cycle, which can overcome the limitations of running single-language tasks to realize services and can realize more business functions. In the embodiment of the present application, the task is divided into subtasks and the execution results are stored through the Java virtual machine, thereby avoiding the interaction of calculation results through intermediate files and ensuring the accuracy and controllability of the processing results.
The method provided in the embodiments of the present application is not limited to being applied to clusters configured with the Spark2 distributed computing framework, and can also be applied to clusters configured with other distributed computing frameworks.
The mixed language task execution method provided by the embodiments of the present application divides a task into subtasks, where the subtasks are written in code of different languages, and executes the subtasks in a manner corresponding to the language type of the code in each subtask. In this way, a task to be executed can be written in multiple languages, which overcomes the functional limitations of running single-language tasks and enables more functions to be realized. The embodiment of the present application feeds the execution result of each subtask back to the virtual machine for storage, so that the calculation result is read from the virtual machine during subsequent calculations, thereby avoiding the situation in which calculation results interact through intermediate files, and ensuring the accuracy and controllability of the processing results.
Fig. 2 is a flowchart of another method for executing a mixed language task provided by an embodiment of the present application. In the embodiment of the present application, the operation of executing the subtasks in a manner corresponding to the language type of the code in the subtasks is described. As shown in Fig. 2, the technical solution provided by the embodiment of the present application includes: S210 to S290.
S210: Obtain a task to be executed, and divide the task to be executed into at least two subtasks, where different subtasks are written using codes in different programming languages.
S220: Determine the language type of the code in the at least two subtasks.
S230: If it is determined that the language type of the code of the first subtask of the at least two subtasks is a target language, establish a connection between a target language interface and the Java virtual machine, where the target language includes the Python language or the R language, and the target language interface includes the Pyspark interface corresponding to the Python language or the SparkR interface corresponding to the R language.
In the embodiment of the present application, optionally, Spark2 is configured in the cluster, and the target language interface may be an interface for performing a set function.
In this embodiment of the application, in the cluster, if the language type of the code in the subtask is the Python language, a connection between the Pyspark interface and the Java virtual machine is established; if the language type of the code in the subtask is the R language, a connection between the SparkR interface and the Java virtual machine is established. The Java virtual machine can be created on the master node in the cluster, and the Java virtual machine can be used to store data or variables. Optionally, the Pyspark interface and the Java virtual machine are connected through a gateway. The SparkR interface and the Java virtual machine are connected through callJStatic; before the SparkR interface is connected to the Java virtual machine, the SparkR interface and the Java virtual machine may first exchange Socket parameters, thereby realizing the connection between the SparkR interface and the Java virtual machine.
S240: Read data or variables stored in the virtual machine through the target language interface, and perform distributed computation on the read data, or on the data corresponding to the variables, to obtain a calculation result.
In an embodiment, based on the code of the first subtask, data or variables are read from the Java virtual machine through the target language interface, and distributed computation is performed on the variables or the data to obtain a calculation result.
In this embodiment, if the language of the code in the subtask is the Python language, a connection between the Pyspark interface and the Java virtual machine is established, and the stored data or variables are read from the virtual machine through the Pyspark interface based on the code in the subtask. In an embodiment, the code in the subtask contains variables, and these variables can be read from the virtual machine through the Pyspark interface, after which the cluster performs distributed computation on the data corresponding to the variables. When a variable is read from the Java virtual machine through the Pyspark interface, the data corresponding to the variable may be stored on the slave nodes of the cluster. The distributed computation on the variables may follow the method of the Spark2 distributed computing framework.
In an embodiment, the corresponding data may also be read from the virtual machine through the Pyspark interface based on the variables contained in the code of the subtask, and the cluster then performs distributed computation on the read data. For example, if the variable contained in the subtask code is Table 1, the data in Table 1 can be read from the virtual machine through the Pyspark interface. The distributed computation on the data may follow the method of the Spark2 distributed computing framework.
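As an illustration only, the reading and computing flow of S240 could look like the following sketch on the Python side; the temporary view name table_1 and the use of a temporary view as the JVM-side storage are assumptions made for this sketch, not a statement of the claimed implementation.

```python
# Hedged sketch of S240: resolve the variable against JVM-held state and run a
# distributed computation on it.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mixed-language-subtask").getOrCreate()

# The variable in the subtask code (e.g. "Table 1") is assumed to have been
# registered beforehand as a temporary view backed by the JVM.
df = spark.table("table_1")

# Distributed computation following the normal Spark2 execution model; the
# describe() call stands in for whatever the subtask actually computes.
result = df.describe()
```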
S250: Feed the calculation result back to the Java virtual machine, so that the calculation result is read from the Java virtual machine during subsequent calculations.
In this embodiment, the master node of the cluster assigns computation tasks to the slave nodes, the slave nodes perform the computation to obtain calculation results and feed them back to the master node, and the master node feeds the calculation results back to the Java virtual machine, so that the calculation results can be read from the Java virtual machine during subsequent calculations. Optionally, the calculation results may be stored in the Java virtual machine in the form of a DataFrame.
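Continuing the sketch above, the hand-back of S250 could be expressed as follows; the view name subtask_result is an assumption, and caching plus registering a temporary view is only one plausible way to keep the result addressable for later subtasks.

```python
# Sketch of S250: keep the result on the JVM side so later subtasks, possibly
# written in another language, can read it without intermediate files.
result.cache()
result.createOrReplaceTempView("subtask_result")

# A later subtask would then read it back directly:
later_input = spark.table("subtask_result")
```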
S260: If the master node of the cluster determines that the language type of the code of a second subtask of the at least two subtasks is a language other than the target language, read variables or data from the Java virtual machine through the master node based on the code of the second subtask.
In this embodiment, optionally, if the master node of the cluster determines that the language type of the code of the second subtask is a language other than the Python language and the R language, the master node reads variables or data from the Java virtual machine based on the code of the second subtask. In an embodiment, the master node of the cluster may read a variable from the Java virtual machine based on the variables contained in the code of the subtask. When the variable is read from the Java virtual machine through the master node, the data corresponding to the variable contained in the subtask code may not exist in the Java virtual machine and may instead be stored on the slave nodes. Alternatively, the master node may read, from the Java virtual machine, the data corresponding to a variable contained in the subtask code, in which case the data corresponding to the variable is stored in the Java virtual machine.
S270: Modify the variable through the master node, and send the modified variable and the code of the second subtask to the slave node corresponding to the modified variable; or divide the read data, and send the code of the second subtask and the divided data to the slave node corresponding to the divided data.
In an implementation of this embodiment, optionally, sending the modified variable and the code of the second subtask to the slave node corresponding to the modified variable, or dividing the read data and sending the code of the second subtask and the divided data to the slave node corresponding to the divided data, includes: sending the modified variable and the code of the second subtask to a distributed file system, or sending the divided data and the code of the second subtask to the distributed file system; and downloading, through the slave node, the code of the second subtask and the modified variable corresponding to the slave node, or the code of the second subtask and the divided data corresponding to the slave node, from the distributed file system.
In an embodiment, different slave nodes correspond to different modified variables, or different slave nodes correspond to different divided data. In an embodiment, when the cluster executes a computation task on data, the master node of the cluster may assign the task to multiple slave nodes for execution, and each slave node may execute the computation task on part of the data. Therefore, after the master node reads variables or data from the virtual machine based on the code of the subtask, the master node needs to modify the read variables or divide the read data, so that multiple slave nodes compute on different data and thereby complete the computation on the read data or on the data corresponding to the read variables.
In this embodiment, the distributed file system may be the Hadoop distributed file system, and each slave node downloads the code of the subtask and the modified variables (or divided data) from the distributed file system. Each slave node may download the modified variables or divided data from the distributed file system as follows: a correspondence between the modified variables and the slave node numbers is established, and a slave node downloads the corresponding modified variables according to this correspondence; or a correspondence between the divided data and the slave node numbers is established, and a slave node downloads the corresponding divided data according to this correspondence. Each slave node may also download the modified variables or divided data from the distributed file system in other ways.
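The following sketch illustrates, under stated assumptions, how the master node could derive the "modified variables" of S270 and associate them with slave node numbers; the slicing rule, the node count, and the staging paths mentioned in the comments are placeholders, not the claimed implementation.

```python
# Illustrative sketch: split a table variable into per-node slices
# ("modified variables") that can be staged on HDFS together with the code.
def split_variable(table_name, total_rows, num_nodes):
    rows_per_node = total_rows // num_nodes
    slices = []
    for i in range(num_nodes):
        start = i * rows_per_node + 1
        end = total_rows if i == num_nodes - 1 else (i + 1) * rows_per_node
        slices.append({"node": i, "variable": f"lines {start}-{end} of {table_name}"})
    return slices

staged = split_variable("Table 1", total_rows=200, num_nodes=2)
# Each entry would be written to an agreed HDFS location (for example
# /staging/subtask_2/node_<i>/) together with the subtask code, so that the
# matching slave node can download exactly its own slice.
```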
S280: Run the code of the second subtask through the slave node, compute the data corresponding to the modified variable, or compute the divided data, to obtain a calculation result, and feed the calculation result back to the master node.
In this embodiment, the slave node runs the code of the subtask, looks up the corresponding data in the slave node's storage location according to the modified variable, and computes on that data; alternatively, the slave node computes on the downloaded divided data to obtain a calculation result. The calculation result of every slave node is fed back to the master node.
S290: Receive, through the master node, the calculation results fed back by the slave nodes, and feed the calculation results back to the Java virtual machine.
The technical solution of S260-S290 is illustrated by an example. Suppose the subtask is to sum each row of data in Table 1, where the variable contained in the subtask code is Table 1. The variable read by the master node of the cluster from the Java virtual machine based on the subtask code is Table 1, and the read variable is modified. The modified variables may be: lines 1 to 100 of Table 1, and lines 101 to 200 of Table 1. The master node of the cluster sends the modified variables and the subtask code to the distributed file system. Slave node 1 downloads the subtask code and the modified variable (lines 1 to 100 of Table 1) from the distributed file system, so that when slave node 1 runs the subtask code it queries the data in rows 1-100 of Table 1 and sums each of those rows. Slave node 2 downloads the subtask code and the modified variable (lines 101 to 200 of Table 1) from the distributed file system, so that when slave node 2 runs the subtask code it queries the data in rows 101-200 of Table 1 based on the downloaded modified variable and sums each of those rows. The calculation results of the nodes are then aggregated, so that the calculation result is available in the Java Virtual Machine (JVM) of the master node for subsequent task calculations.
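A toy, self-contained sketch of this worked example is given below; the in-memory lists merely stand in for the rows each slave would query from its local storage, and the final merge stands in for the aggregation performed by the master node.

```python
# Each slave sums every row of its own slice of Table 1 and reports the
# per-row sums; the master merges the partial results.
def run_subtask_on_slice(rows):
    return [sum(row) for row in rows]

slice_for_node_1 = [[1, 2, 3], [4, 5, 6]]      # stands in for lines 1-100
slice_for_node_2 = [[7, 8, 9], [10, 11, 12]]   # stands in for lines 101-200

partial_1 = run_subtask_on_slice(slice_for_node_1)
partial_2 = run_subtask_on_slice(slice_for_node_2)
combined = partial_1 + partial_2                # handed back to the master's JVM
```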
In the related art, when distributed computing is implemented in a cluster, the languages of tasks that can be executed in a distributed manner are limited, for example, to the Scala, R, and Python languages; tasks in other languages are not supported for distributed execution, and the languages available to users are correspondingly limited. With the method provided by the embodiments of the present application, when the language type of the code in a subtask is a language other than the target language, the master node can distribute the code and data of the subtask to the slave nodes so that the slave nodes run the code of the subtask and compute on the distributed data, thereby implementing distributed computation. This overcomes the language limitations of distributed task execution: tasks in multiple languages can be executed in a distributed manner, which avoids users' dependence on a fixed language and enriches the languages available to users.
In the related art, some tasks are executed by a single machine. If such tasks need to be executed in a distributed manner, the program on the single machine has to be modified; often the original program has to be deleted and rewritten, which makes the conversion to distributed execution difficult. With the method provided by the embodiments of the present application, by determining the language type of the code in a task and executing the task in a distributed manner in the corresponding way, the single-machine program does not need to be modified; it is only necessary to add, on top of the original program, the code implementing the method provided by the embodiments of the present application to achieve distributed task execution, which saves time and reduces the difficulty of converting single-machine computation into distributed computation.
In the embodiments of the present application, S210-S290 are exemplarily combined into one embodiment that executes a task execution method, but this is only an example. In other embodiments of the present application, S210-S250 may form one embodiment that executes a task execution method, or S210, S220, and S260-S290 may form one embodiment that executes a task execution method.
FIG. 3a is a flowchart of another mixed language task execution method provided by an embodiment of the present application. As shown in FIG. 3a, the technical solution provided by this embodiment includes:
S310: Initialize Spark resources.
In this embodiment, initializing Spark resources may include applying for Spark resources. The required Spark resources may be preset and applied for in advance.
S320: The JVM stores data and variables.
In this embodiment, the JVM may store the list of requested Spark resources, may store data and variables, may store intermediate data and variables generated during distributed computation, or may store other data.
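Purely as an illustration, S310-S320 could be realized as below; the resource figures, the input path, and the view name are assumptions introduced for this sketch rather than values taken from the application.

```python
# Hedged sketch of S310-S320: request Spark resources once up front and park
# data on the JVM under a name that every later subtask can resolve.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("mixed-language-pipeline")
         .config("spark.executor.instances", "4")     # placeholder resource figures
         .config("spark.executor.memory", "2g")
         .getOrCreate())

table_1 = spark.read.parquet("hdfs:///data/table_1")  # path is an assumption
table_1.cache()
table_1.createOrReplaceTempView("table_1")
```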
S330: Obtain the task to be executed, and divide the task to be executed into at least two subtasks, where different subtasks are written in code of different programming languages.
S340: Determine the language type of the code in the subtask.
S350: If it is determined that the type of the code in the subtask is the Python language, connect to the JVM through the Pyspark interface in the gateway mode and read data from the JVM; call the Pyspark interface to perform distributed computation, and transfer the calculation result to the JVM so that the Python resources are released.
In this embodiment, Pyspark implements Spark's Application Programming Interface (API) for Python; through Pyspark, users can write Python programs that run on Spark and thereby benefit from Spark's distributed computing features.
In the related art, the native Pyspark calls the Spark interface in the java_gateway manner, providing Python with methods such as sparkcontext, implemented in the Scala language, to initialize resources and with calls to distributed algorithms to implement distributed computation. With this method of the related art, resources have to be initialized every time a task is executed, which makes the processing mechanism cumbersome and wastes time. In the embodiments of the present application, resources are initialized in advance; when the code language of the executed subtask is the Python language, a gatewayServer instance can be created, allowing the Python program in the cluster to communicate with the JVM, and the data and Spark objects in the JVM are serialized so that data or Spark objects can be read from the JVM through the Pyspark interface, after which distributed computation is performed based on the read data. By initializing resources first and then performing distributed computation, the method provided by the embodiments of the present application avoids initializing resources for every task and improves efficiency.
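The reuse idea can be illustrated by the following sketch; getOrCreate attaching to an existing session is standard PySpark behaviour, while the view-name convention and the describe() computation are assumptions made only for the illustration.

```python
# Sketch: a Python subtask attaches to the pre-initialized session instead of
# paying the per-task initialization cost described above.
from pyspark.sql import SparkSession

def run_python_subtask(view_name):
    spark = SparkSession.builder.getOrCreate()   # reuses the active session if any
    df = spark.table(view_name)                  # data handed over from the JVM side
    result = df.describe()                       # stand-in for the real computation
    result.createOrReplaceTempView(view_name + "_result")
    return result
```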
S360: If it is determined that the type of the code in the subtask is the R language, exchange Socket parameters between the SparkR interface and the JVM so that the SparkR interface connects to the JVM through callJStatic and reads data from the JVM; perform the SparkR computation, transfer the calculation result, and release the resources.
In this embodiment, SparkR is an R language package that provides a lightweight way to use Apache Spark from the R language; SparkR implements a distributed data frame and supports operations such as querying, filtering, and aggregation.
In the related art, the native SparkR uses the callJStatic method to call the resource initialization method defined in the Scala language to complete the initialization of resources, then performs distributed computation by calling the initialized resources and exchanges the calculation results through a socket. With this method of the related art, resources have to be initialized every time an R language task is executed, which makes the processing mechanism cumbersome and wastes time. In the embodiments of the present application, resources are initialized in advance and the JVM is connected through the SparkR interface, so that the already-requested resources can be reached and data can be exchanged with the JVM; after the cluster obtains the data, distributed computation is performed and the calculation result is fed back to the JVM. This avoids initializing resources for every R language task and improves efficiency.
S370: If it is determined that the language type of the code in the subtask is a language other than the Python language and the R language, distribute the subtask code and data through the master node, so that the slave nodes run the subtask code on single machines, perform the computation on the data on single machines, and feed the calculation results back to the master node; the master node then integrates the calculation results and feeds them back to the JVM.
In this embodiment, as shown in FIG. 3b, the master node can use the mapPartition method to relay the subtask code and data through HDFS and distribute them to every slave node included in the requested resources, so that each slave node continues computing on the data using the instructions supported in its runtime environment (python2, python3, bash, rscript, etc.). After the computation is completed, the master node collects the calculation results and converts them into a Spark DataFrame for subsequent data processing, storage, distributed computation, and so on.
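One possible shape of this per-partition hand-off to an external interpreter is sketched below; the script path, the choice of bash as the interpreter, and the line-oriented data format are assumptions for the sketch, and the mapPartitions call is shown commented out because the surrounding RDD is not defined here.

```python
# Illustrative sketch of S370 / FIG. 3b: each partition is piped into an
# interpreter supported by the worker's runtime environment, and its stdout is
# collected as the partition's result.
import subprocess

def run_external_script(partition_rows):
    lines = "\n".join(",".join(map(str, row)) for row in partition_rows)
    proc = subprocess.run(
        ["bash", "/tmp/subtask.sh"],           # assumed to be downloaded from HDFS
        input=lines, capture_output=True, text=True, check=True)
    for out_line in proc.stdout.splitlines():
        yield out_line

# output_rdd = rdd.mapPartitions(run_external_script)
# The master would then collect output_rdd and convert it into a Spark
# DataFrame for subsequent processing, as described above.
```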
The mixed language task execution method provided by the embodiments of the present application may also refer to the flow shown in FIG. 3c.
FIG. 4 is a structural block diagram of a mixed language task execution device provided by an embodiment of the present application. The device is configured in a cluster and includes: an acquisition module 410, a judgment module 420, an execution module 430, and a storage module 440.
The acquisition module 410 is configured to obtain a task to be executed and divide the task to be executed into at least two subtasks, where different subtasks are written in code of different programming languages;
the judgment module 420 is configured to determine the language type of the code in the at least two subtasks;
the execution module 430 is configured to execute the at least two subtasks respectively in a manner corresponding to the language types of the code in the at least two subtasks; and
the storage module 440 is configured to store the execution results in the Java virtual machine, so that the execution results are read from the Java virtual machine during subsequent calculations.
Optionally, the execution module 430 is configured to: if it is determined that the language type of the code of a first subtask of the at least two subtasks is a target language, establish a connection between a target language interface and the Java virtual machine, where the target language includes the Python language or the R language, and the target language interface includes the Pyspark interface corresponding to the Python language or the SparkR interface corresponding to the R language;
and, based on the code of the first subtask, read data or variables from the Java virtual machine through the target language interface and perform distributed computation on the variables or data to obtain a calculation result.
Optionally, the Pyspark interface is connected to the Java virtual machine through a gateway, and the SparkR interface is connected to the Java virtual machine through callJStatic.
Optionally, the execution module 430 is configured to:
if the master node of the cluster determines that the language type of the code of a second subtask of the at least two subtasks is a language other than the target language, read variables or data from the Java virtual machine through the master node based on the code of the second subtask;
modify the variable through the master node, and send the modified variable and the code of the second subtask to the slave node corresponding to the modified variable, or divide the read data and send the code of the second subtask and the divided data to the slave node corresponding to the divided data; and
run the code of the second subtask through the slave node, compute the data corresponding to the modified variable or compute the divided data to obtain a calculation result, and feed the calculation result back to the master node.
Correspondingly, the storage module 440 is configured to receive, through the master node, the calculation results fed back by the slave nodes and feed the calculation results back to the Java virtual machine.
Optionally, modifying the variable through the master node and sending the modified variable and the code of the second subtask to the corresponding slave node, or dividing the read data and sending the code of the second subtask and the divided data to the slave node, includes:
sending the modified variable and the code of the second subtask to a distributed file system, or sending the divided data and the code of the second subtask to the distributed file system; and
downloading, through the slave node, the code of the second subtask and the modified variable corresponding to the slave node, or the code of the second subtask and the divided data corresponding to the slave node, from the distributed file system, where different slave nodes correspond to different modified variables, or different slave nodes correspond to different divided data.
Optionally, the code of different subtasks is stored in different functional modules, and the judgment module 420 is configured to determine the language type of the code in the at least two subtasks based on the identifier of the functional module in which the code of each subtask is located.
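A minimal sketch of such an identifier-based judgment is given below; the suffix-to-language mapping and the example identifier are assumptions introduced for illustration, since the application only states that the identifier of the functional module holding the code is used.

```python
# Sketch: map a functional module's identifier to a language type.
LANGUAGE_BY_MODULE_SUFFIX = {
    ".py": "python",
    ".r": "r",
    ".scala": "scala",
}

def detect_language(module_identifier):
    lowered = module_identifier.lower()
    for suffix, language in LANGUAGE_BY_MODULE_SUFFIX.items():
        if lowered.endswith(suffix):
            return language
    return "other"

detect_language("subtask_feature_cleaning.py")   # -> "python"
```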
Optionally, the cluster is a Hadoop cluster.
Optionally, Spark2 is configured in the cluster.
The above device can execute the method provided by any embodiment of the present application and has the functional modules and beneficial effects corresponding to the executed method.
FIG. 5 is a structural block diagram of a cluster provided by an embodiment of the present application. As shown in FIG. 5, the cluster 500 provided by an embodiment of the present application includes a mixed language task execution device 501 provided by an embodiment of the present application.

Claims (10)

  1. A mixed language task execution method, applied to a cluster, the method comprising:
    obtaining a task to be executed, and dividing the task to be executed into at least two subtasks, wherein different subtasks are written in code of different programming languages;
    determining the language type of the code in the at least two subtasks;
    executing the at least two subtasks respectively in a manner corresponding to the language types of the code in the at least two subtasks; and
    storing execution results in a Java virtual machine, so that the execution results are read from the Java virtual machine in the case of subsequent calculations.
  2. The method according to claim 1, wherein executing the at least two subtasks respectively in a manner corresponding to the language types of the code in the at least two subtasks comprises:
    in a case where it is determined that the language type of the code of a first subtask of the at least two subtasks is a target language, establishing a connection between a target language interface and the Java virtual machine, wherein the target language comprises the Python language or the R language, and the target language interface comprises a Pyspark interface corresponding to the Python language or a SparkR interface corresponding to the R language; and
    reading data or variables from the Java virtual machine through the target language interface based on the code of the first subtask, and performing distributed computation on the variables or the data to obtain a calculation result.
  3. The method according to claim 2, wherein the Pyspark interface is connected to the Java virtual machine through a gateway, and the SparkR interface is connected to the Java virtual machine through callJStatic.
  4. The method according to claim 1, wherein executing the at least two subtasks respectively in a manner corresponding to the language types of the code in the at least two subtasks comprises:
    in a case where a master node of the cluster determines that the language type of the code of a second subtask of the at least two subtasks is a language other than the target language, reading variables or data from the Java virtual machine through the master node based on the code of the second subtask;
    modifying the variable through the master node, and sending the modified variable and the code of the second subtask to the slave node corresponding to the modified variable, or dividing the read data, and sending the code of the second subtask and the divided data to the slave node corresponding to the divided data; and
    running the code of the second subtask through the slave node, computing the data corresponding to the modified variable, or computing the divided data, to obtain a calculation result, and feeding the calculation result back to the master node;
    wherein storing the execution results in the Java virtual machine, so that the execution results are read from the Java virtual machine in the case of subsequent calculations, comprises: receiving, through the master node, the calculation result fed back by the slave node, and feeding the calculation result back to the Java virtual machine.
  5. The method according to claim 4, wherein modifying the variable through the master node, and sending the modified variable and the code of the second subtask to the slave node corresponding to the modified variable, or dividing the read data, and sending the code of the second subtask and the divided data to the slave node corresponding to the divided data, comprises:
    sending the modified variable and the code of the second subtask to a distributed file system, or sending the divided data and the code of the second subtask to the distributed file system; and
    downloading, through the slave node, the code of the second subtask and the modified variable corresponding to the slave node, or the code of the second subtask and the divided data corresponding to the slave node, from the distributed file system, wherein different slave nodes correspond to different modified variables, or different slave nodes correspond to different divided data.
  6. The method according to claim 1, wherein the code of different subtasks is stored in different functional modules; and
    determining the language type of the code in the at least two subtasks comprises:
    determining the language type of the code in the at least two subtasks based on an identifier of the functional module in which the code of each subtask is located.
  7. The method according to any one of claims 1-6, wherein the cluster is a Hadoop cluster.
  8. The method according to claim 7, wherein Spark2 is configured in the cluster.
  9. A mixed language task execution device, applied to a cluster, the device comprising:
    an acquisition module, configured to obtain a task to be executed and divide the task to be executed into at least two subtasks, wherein different subtasks are written in code of different programming languages;
    a judgment module, configured to determine the language type of the code in the at least two subtasks;
    an execution module, configured to execute the at least two subtasks respectively in a manner corresponding to the language types of the code in the at least two subtasks; and
    a storage module, configured to store execution results in a Java virtual machine, so that the execution results are read from the Java virtual machine in the case of subsequent calculations.
  10. A cluster, comprising: the mixed language task execution device according to claim 9.
PCT/CN2020/091189 2019-05-21 2020-05-20 Mixed language task execution method and device, and cluster WO2020233584A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910425952.0 2019-05-21
CN201910425952.0A CN110109748B (en) 2019-05-21 2019-05-21 Mixed language task execution method, device and cluster

Publications (1)

Publication Number Publication Date
WO2020233584A1 2020-11-26



Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110109748B (en) * 2019-05-21 2020-03-17 星环信息科技(上海)有限公司 Mixed language task execution method, device and cluster
CN113918211B (en) * 2021-12-13 2022-06-07 昆仑智汇数据科技(北京)有限公司 Method, device and equipment for executing industrial equipment object data model
CN114579261B (en) * 2022-04-29 2022-09-20 支付宝(杭州)信息技术有限公司 Processing method and device for multi-language mixed stream

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6282697B1 (en) * 1998-09-18 2001-08-28 Wylci Fables Computer processing and programming method using autonomous data handlers
US8881158B2 (en) * 2008-11-14 2014-11-04 Nec Corporation Schedule decision device, parallel execution device, schedule decision method, and program
US9959142B2 (en) * 2014-06-17 2018-05-01 Mediatek Inc. Dynamic task scheduling method for dispatching sub-tasks to computing devices of heterogeneous computing system and related computer readable medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090172657A1 (en) * 2007-12-28 2009-07-02 Nokia, Inc. System, Method, Apparatus, Mobile Terminal and Computer Program Product for Providing Secure Mixed-Language Components to a System Dynamically
US20090313319A1 (en) * 2008-06-16 2009-12-17 International Business Machines Corporation System and Method for Dynamic Partitioning of Applications in Client-Server Environments
CN106415495A (en) * 2014-05-30 2017-02-15 苹果公司 Programming system and language for application development
CN104834561A (en) * 2015-04-29 2015-08-12 华为技术有限公司 Data processing method and device
CN110109748A (en) * 2019-05-21 2019-08-09 星环信息科技(上海)有限公司 A kind of hybrid language task executing method, device and cluster

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112685150A (en) * 2020-12-21 2021-04-20 联想(北京)有限公司 Multi-language program execution method, device and storage medium

Also Published As

Publication number Publication date
CN110109748A (en) 2019-08-09
CN110109748B (en) 2020-03-17


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 20810452; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 20810452; Country of ref document: EP; Kind code of ref document: A1)