WO2020233584A1 - Mixed language task execution method and device, and cluster - Google Patents

Mixed language task execution method and device, and cluster

Info

Publication number
WO2020233584A1
WO2020233584A1 (PCT/CN2020/091189)
Authority
WO
WIPO (PCT)
Prior art keywords
language
subtasks
code
subtask
data
Prior art date
Application number
PCT/CN2020/091189
Other languages
French (fr)
Chinese (zh)
Inventor
刘铖
Original Assignee
星环信息科技(上海)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 星环信息科技(上海)有限公司 filed Critical 星环信息科技(上海)有限公司
Publication of WO2020233584A1 publication Critical patent/WO2020233584A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues

Definitions

  • This application relates to the field of distributed technology, for example, to a mixed language task execution method, device, and cluster.
  • a cluster is a parallel system or a distributed system composed of a number of computers connected to each other.
  • a single computer in a cluster is usually called a node, which is usually connected through a local area network, but there are other possible connection methods.
  • Cluster computers are generally used to improve the calculation speed and/or reliability of a single computer.
  • Distributed computing framework is the running and programming framework of distributed systems for processing big data.
  • Storm is a distributed real-time computing system for processing high-speed and large data streams, adding reliable real-time data processing functions to Hadoop; Spark uses memory computing, starting from multi-iteration batch processing, allowing data to be loaded into memory for repeated queries, and it also integrates multiple computing paradigms such as data warehouse, stream processing, and graph computing.
  • Spark is built on the Hadoop Distributed File System (HDFS) and can be well integrated with Hadoop. Among them, Spark has multiple versions, such as Spark1, Spark2, etc.
  • the distributed computing framework can support multiple languages.
  • Spark2 can support the Python language and the R language, which means that Spark2 can support the distributed execution of tasks in the Python language and the distributed execution of tasks in the R language.
  • in the related art, because each language has certain limitations in realizing services, tasks often need to mix multiple languages to meet the needs of users; however, the distributed computing framework can only support distributed execution of single-language tasks and does not support distributed execution of mixed-language tasks. As a result, distributed execution of single-language tasks has certain limitations in realizing services and cannot realize some specific functions, thus failing to meet the needs of users.
  • the embodiments of the present application provide a method, device, and cluster for executing a mixed language task, which can overcome the limitation of executing a single language task to implement a business, and can implement more business functions.
  • the embodiment of the present application provides a method for executing mixed language tasks.
  • the method is applied to a cluster, and the method includes:
  • obtaining a task to be executed, and dividing the task to be executed into at least two subtasks, where different subtasks are written in code of different programming languages;
  • determining the language type of the code in the at least two subtasks;
  • executing the at least two subtasks in a manner corresponding to the language type of the code in the at least two subtasks;
  • storing the execution result in a Java virtual machine, so that the execution result is read from the Java virtual machine in the case of subsequent calculations.
  • An embodiment of the present application also provides a mixed language task execution device, the device is applied to a cluster, and the device includes:
  • the obtaining module is configured to obtain the task to be executed, and divide the task to be executed into at least two subtasks, wherein different subtasks are written in codes in different programming languages;
  • a judging module configured to judge the language type of the codes in the at least two subtasks
  • the execution module is configured to execute the at least two subtasks in a manner corresponding to the language type of the codes in the at least two subtasks;
  • the storage module is configured to store the execution result in the Java virtual machine, so that the execution result is read from the Java virtual machine in the case of subsequent calculations.
  • An embodiment of the present application also provides a cluster, including a mixed-language task execution device provided by the embodiment of the present application.
  • Fig. 1 is a flowchart of a method for executing a mixed language task provided by an embodiment of the present application;
  • Fig. 2 is a flowchart of another method for executing a mixed language task provided by an embodiment of the present application;
  • Fig. 3a is a flowchart of another method for executing a mixed language task provided by an embodiment of the present application;
  • Fig. 3b is a flowchart of a method for executing subtasks other than the Python and R languages provided by an embodiment of the present application;
  • Fig. 3c is a flowchart of another method for executing a mixed language task provided by an embodiment of the present application;
  • Fig. 4 is a structural block diagram of a mixed language task execution device provided by an embodiment of the present application;
  • Fig. 5 is a schematic structural diagram of a cluster provided by an embodiment of the present application.
  • FIG. 1 is a flowchart of a method for executing a mixed language task provided by an embodiment of the present application.
  • the method may be executed by a mixed language task execution device, and the device may be implemented by software and/or hardware and may be configured in a cluster.
  • the method can be applied to scenarios in which big data is calculated, including scenarios in which big data is calculated in multiple ways.
  • the embodiment of the present application takes Spark2 configured in a cluster as an example for description.
  • the cluster may include at least three servers, and the cluster in the embodiment of the present application may be a Hadoop cluster.
  • the Spark2 distributed computing framework makes full use of the advantages of clusters and breaks through the limitations of a single server on memory, central processing unit (CPU), and storage compared to stand-alone computing.
  • Using distributed computing methods can accelerate the calculation of big data.
  • with the stand-alone computing method, when the business and data volume increase, stand-alone resources are prone to bottlenecks, accompanied by frequent garbage collection (gc) and high input/output (I/O) from writing back to disk, which affects the correctness of business functions and results.
  • the distributed computing framework Spark2 can expand computing resources horizontally, freely allocate the resources required by tasks, and assign tasks to corresponding nodes for calculation according to the selected resources; combined with Spark2's distributed algorithms and feature engineering, the data is processed in a distributed manner.
  • Spark2 supports Scala, Python and R languages, and can provide related interfaces to call distributed algorithms implemented by Spark2.
  • the core algorithm of Spark2 is implemented in Scala language.
  • Python and R can provide functions that call the implementation of Scala and provide distributed functions for basic algorithms.
  • Spark2 includes functions such as map and mapPartition, and supports custom implementation of distributed functions.
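  • as an illustration of the custom distributed functions mentioned above, the following minimal sketch (not taken from the patent; the column names and the per-partition function are illustrative) applies a user-defined function to every partition of a Spark DataFrame through mapPartitions:

      from pyspark.sql import SparkSession

      spark = SparkSession.builder.appName("custom-distributed-fn").getOrCreate()
      df = spark.createDataFrame([(i, float(i)) for i in range(1000)], ["id", "value"])

      def scale_partition(rows):
          # runs once per partition, on the executor that holds that partition
          for row in rows:
              yield (row.id, row.value * 2.0)

      scaled = df.rdd.mapPartitions(scale_partition).toDF(["id", "value_x2"])
      scaled.show(5)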
  • Spark2 supports user-defined implementation of distributed functions, provides Python, R and Scala interfaces, and supports distributed execution of single-language tasks, but it does not support distributed execution of mixed-language tasks, which results in certain limitations in realizing services: some specific functions cannot be implemented.
  • the embodiment of the application provides that, by dividing a task into subtasks written in code of different languages and executing the subtasks in a manner corresponding to the language type of the code in each subtask, a task to be executed can be written in multiple languages, which can overcome the limitation of running single-language tasks to realize services and can realize more business functions.
  • the technical solution provided by the embodiment of the present application includes: S110 to S140.
  • S110 Obtain a task to be executed, and divide the task to be executed into at least two subtasks, where different subtasks are written using codes in different languages.
  • the user equipment can communicate with the cluster.
  • a system for communicating with the cluster may be configured on the user equipment, and the user may use code writing tasks in the system.
  • when users write tasks, they can write the tasks in different languages.
  • the cluster can obtain the task to be executed through the interface, and divide the task to be executed into at least two subtasks.
  • the task may be divided into subtasks according to the identification of the code in the task to be executed. For example, subtasks in different programming languages have different code identifiers, and the task to be executed can be divided into at least two subtasks according to the identifiers in the code.
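  • the patent does not fix a concrete identifier format; a minimal sketch, assuming hypothetical markers such as %%python and %%r at the start of each code section, could split the submitted task text into per-language subtasks as follows:

      import re

      # hypothetical language markers; the real identifiers are not specified in the patent
      LANG_MARKER = re.compile(r"^%%(python|r|scala|bash)\s*$", re.IGNORECASE | re.MULTILINE)

      def split_task(task_text):
          """Split a mixed-language task into (language, code) subtasks."""
          pieces = LANG_MARKER.split(task_text)
          # pieces = [text before first marker, lang1, code1, lang2, code2, ...]
          return [(lang.lower(), code.strip())
                  for lang, code in zip(pieces[1::2], pieces[2::2])]

      task = "\n".join([
          "%%python",
          "result = sum(range(10))",
          "%%r",
          "x <- c(1, 2, 3); mean(x)",
      ])
      print(split_task(task))   # [('python', 'result = sum(range(10))'), ('r', 'x <- c(1, 2, 3); mean(x)')]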
  • S120 Determine the language type of the code in the at least two subtasks.
  • the codes of different subtasks are stored in different functional modules.
  • the judging the language type of the code in the at least two subtasks includes: judging the language type of the code in the at least two subtasks based on the identifier of the functional module where the code in the subtask is located.
  • the code of each subtask may be stored in different functional modules, and the stored functional modules of subtasks written in different languages are not the same.
  • Each functional module that stores the subtask has an identification that is different from other functional modules, and the language type of the code in the subtask in the functional module can be determined through the identification of the functional module.
  • the method of determining the language type of the code in the subtask is not limited to the above method, and may also be other methods.
  • S130 Execute the at least two subtasks in a manner corresponding to the language types of the codes in the at least two subtasks.
  • if the language types of the code in the at least two subtasks are different, the manners of executing the at least two subtasks may also be different.
  • the language type of the subtask code can be Python language, R language or other languages.
  • the manner of executing at least two subtasks may be executed serially, or may also be executed in parallel.
  • when there is a dependency relationship among multiple subtasks, the manner in which the cluster executes the subtasks may be serial execution, that is, the multiple subtasks are executed in the order defined by the dependency relationship.
  • when there is no dependency relationship among multiple subtasks, the cluster can execute the subtasks in parallel, which can save time and improve efficiency.
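  • a minimal sketch of this scheduling choice (run_subtask is a placeholder, not part of the patent): subtasks with a dependency relationship run serially in order, while independent subtasks are submitted to a thread pool and run in parallel:

      from concurrent.futures import ThreadPoolExecutor

      def run_subtask(subtask):
          # placeholder: dispatch to the executor matching the subtask's language
          return f"result of {subtask}"

      def execute_subtasks(ordered_subtasks, has_dependencies):
          if has_dependencies:
              # serial: respect the order implied by the dependency relationship
              return [run_subtask(s) for s in ordered_subtasks]
          # independent subtasks: run them in parallel to save time
          with ThreadPoolExecutor() as pool:
              return list(pool.map(run_subtask, ordered_subtasks))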
  • S140 Store the execution result in the Java virtual machine, so that the execution result is read from the Java virtual machine during subsequent calculations.
  • the execution result may be stored in the form of Data Frame in the Java virtual machine.
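  • one way to realize this step, sketched below under the assumption that the result is registered as a named Spark table (a Spark DataFrame is backed by JVM-side objects); the table name is illustrative:

      from pyspark.sql import SparkSession

      spark = SparkSession.builder.getOrCreate()

      # a Python subtask produces its result and registers it on the JVM side
      result_df = spark.createDataFrame([(1, 0.5), (2, 1.5)], ["id", "score"])
      result_df.createOrReplaceTempView("subtask1_result")   # illustrative name

      # a later subtask reads the result back without any intermediate file
      reloaded = spark.table("subtask1_result")
      reloaded.show()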
  • in the related art, Spark2 supports user-defined implementation of distributed functions, provides Python, R and Scala interfaces, and supports distributed execution of single-language tasks, but it does not support distributed execution of mixed-language tasks, which results in certain limitations in realizing services and prevents some specific functions from being implemented, so that user needs cannot be met. To meet user needs, multiple tasks often have to be written in different languages and executed separately by the cluster configured with Spark2; after each task is executed, the execution results of the tasks often need to interact through intermediate files, so the accuracy and stability of the processing results are not controllable.
  • in the embodiment of the present application, by dividing a task into subtasks written in code of different languages and executing the subtasks in a manner corresponding to the language type of the code in each subtask, a task to be executed can be written in multiple languages, that is, multiple languages can be mixed within the Spark resource life cycle, which can overcome the limitations of running single-language tasks to realize services and can realize more business functions.
  • by dividing the task into subtasks and storing the execution results through the Java virtual machine, the embodiment avoids the interaction of calculation results through intermediate files and ensures the accuracy and controllability of the processing results.
  • the method provided in the embodiments of the present application is not limited to being applied to clusters configured with the Spark2 distributed computing framework, and can also be applied to clusters configured with other distributed computing frameworks.
  • the mixed language task execution method provided by the embodiment of the present application divides a task into subtasks, where the subtasks are written in code of different languages, and executes the subtasks in a manner corresponding to the language type of the code in each subtask. In this way, a task to be executed can be written in multiple languages, which overcomes the functional limitations of running single-language tasks and enables more functions to be realized.
  • the embodiment of the present application feeds the execution result of each subtask back to the virtual machine for storage, so that the calculation result is read from the virtual machine during subsequent calculations, thereby avoiding the situation in which calculation results interact through intermediate files, and ensuring the accuracy and controllability of the processing results.
  • FIG. 2 is a flowchart of another method for executing a mixed language task provided by an embodiment of the present application.
  • in the embodiment of the present application, the operation of executing the subtasks in a manner corresponding to the language type of the code in the subtasks is described.
  • the technical solutions provided by the embodiments of the present application include: S210 to S290.
  • S210 Obtain a task to be executed, and divide the task to be executed into at least two subtasks, where different subtasks are written using codes in different programming languages.
  • S220 Determine the language type of the code in the at least two subtasks.
  • S230 If it is determined that the language type of the code of the first subtask of the at least two subtasks is a target language, establish a connection between a target language interface and the Java virtual machine, where the target language includes the Python language or the R language, and the target language interface includes the Pyspark interface corresponding to the Python language or the SparkR interface corresponding to the R language.
  • Spark2 is configured in the cluster, where the target language interface may be an interface for performing a set function.
  • in the cluster, if the language type of the code in the subtask is the Python language, a connection between the Pyspark interface and the Java virtual machine is established; if the language type of the code in the subtask is the R language, a connection between the SparkR interface and the Java virtual machine is established.
  • the Java virtual machine can be created on the master node in the cluster, and the Java virtual machine can be used to store data or variables.
  • the Pyspark interface and the Java virtual machine are connected through a gateway.
  • the SparkR interface and the Java virtual machine are connected through callJStatic; before the SparkR interface is connected to the Java virtual machine, the SparkR interface and the Java virtual machine may first exchange Socket parameters, thereby realizing the connection between the SparkR interface and the Java virtual machine.
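  • PySpark reaches the JVM through a Py4J gateway; the sketch below assumes a GatewayServer is already listening on the JVM side (as the PySpark driver arranges) and simply attaches to it from Python — the port and the HashMap example are assumptions, not details from the patent:

      from py4j.java_gateway import JavaGateway, GatewayParameters

      # attach to a JVM-side Py4J GatewayServer (assumed to be running on the default port)
      gateway = JavaGateway(gateway_parameters=GatewayParameters(port=25333))
      jvm = gateway.jvm

      # JVM objects can now be reached from Python, e.g. a java.util.HashMap holding shared variables
      shared = jvm.java.util.HashMap()
      shared.put("table_name", "table_1")
      print(shared.get("table_name"))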
  • S240 Read data or variables stored in the virtual machine through the target language interface, and perform distributed calculation on the read data, or perform distributed calculation on the data corresponding to the variables, to obtain a calculation result.
  • based on the code of the first subtask, data or variables are read from the Java virtual machine through the target language interface, and distributed calculation is performed on the variables or the data to obtain the calculation result.
  • if the language of the code in the subtask is the Python language, a connection between the Pyspark interface and the Java virtual machine is established, and the stored data or variables are read from the virtual machine through the Pyspark interface based on the code in the subtask.
  • in an embodiment, the code in the subtask contains variables; the variables can be read from the virtual machine through the Pyspark interface based on the variables contained in the code, and the data corresponding to the variables is calculated by the cluster in a distributed manner.
  • when a variable is read from the Java virtual machine through the Pyspark interface, the data corresponding to the variable may be stored on the slave nodes in the cluster.
  • the distributed calculation of variables can be performed according to the method followed by the Spark2 distributed computing framework.
  • the corresponding data can be read from the virtual machine based on the variables contained in the code in the subtask through the Pyspark interface, and the cluster performs distributed calculation on the read data.
  • for example, if the variable included in the subtask code is Table 1, the data in Table 1 can be read from the virtual machine through the Pyspark interface.
  • the distributed calculation of the data can be performed according to the method followed by the Spark2 distributed computing framework.
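  • continuing the Table 1 example, a Python subtask could resolve the variable to a JVM-side table through the PySpark interface and let the cluster run the computation; the table name and the chosen statistic are illustrative:

      from pyspark.sql import SparkSession, functions as F

      spark = SparkSession.builder.getOrCreate()

      # resolve the variable "Table 1" to the DataFrame held on the JVM side (illustrative name)
      table_1 = spark.table("table_1")

      # the aggregation itself is executed by the cluster in a distributed manner
      column_means = table_1.select([F.mean(c).alias("avg_" + c) for c in table_1.columns])
      column_means.show()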
  • S250 Feed back the calculation result to the Java virtual machine, so that the calculation result is read from the Java virtual machine during subsequent calculations.
  • the master node in the cluster assigns computing tasks to the slave nodes, the slave nodes perform the calculations to obtain calculation results, and the calculation results are fed back to the master node in the cluster; the master node then feeds the calculation results back to the Java virtual machine, so that the calculation results can be read from the Java virtual machine during subsequent calculations.
  • the calculation result may be stored in the form of Data Frame in the Java virtual machine.
  • S260 If the master node in the cluster determines that the language type of the code of the second subtask is a language other than the Python language and the R language, the master node reads variables or data from the Java virtual machine based on the code of the second subtask.
  • the master node in the cluster can read variables from the Java virtual machine based on the variables contained in the code in the subtask. Among them, when the variable is read from the Java virtual machine through the master node, the data corresponding to the variable included in the subtask code may not exist in the Java virtual machine, and the data corresponding to the variable may be stored in the slave node.
  • the master node can read data corresponding to the variable from the Java virtual machine based on the variable included in the subtask code, where the data corresponding to the variable can be stored in the Java virtual machine.
  • S270 Modify the variable through the master node and send the modified variable and the code of the second subtask to the slave node corresponding to the modified variable, or divide the read data and send the code of the second subtask and the divided data to the slave node corresponding to the divided data.
  • in an embodiment, sending the modified variable and the code of the second subtask to the slave node corresponding to the modified variable, or dividing the read data and sending the code of the second subtask and the divided data to the slave node corresponding to the divided data, includes: sending the modified variable and the code of the second subtask to the distributed file system, or sending the divided data and the code of the second subtask to the distributed file system; and downloading, through the slave node, the code of the second subtask and the modified variable or divided data corresponding to the slave node from the distributed file system.
  • different slave nodes correspond to different modified variables, or different slave nodes correspond to different divided data.
  • the master node in the cluster may allocate the task to multiple slave nodes for execution, and each slave node may execute a calculation task on part of the data. Therefore, when the master node reads variables or data from the virtual machine based on the code in the subtask, the master node needs to modify the read variables or divide the read data, so that multiple slave nodes perform calculations on different data, thereby completing the calculation of the read data or the calculation of the data corresponding to the read variables.
  • the distributed file system may be a Hadoop distributed file system, and each slave node downloads the code and modified variables (or partitioned data) in the subtasks from the distributed file system.
  • the method for each slave node to download the modified variable or divide the data from the distributed file system can be: establish a correspondence between the modified variable and the slave node number, and the slave node can download the corresponding modified variable according to the relationship; Or a corresponding relationship between the division data and the number of the slave node is established, and the slave node can download the corresponding division data according to the relationship.
  • the method for each slave node to download modified variables or partition data from the distributed file system may also be other methods.
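  • a sketch of this correspondence, assuming the master writes one descriptor per partition to HDFS under a path keyed by the partition number and each slave fetches only its own descriptor; the path layout, file format and command-line calls are assumptions:

      import json
      import subprocess

      def publish_partitions(job_id, partition_specs):
          """Master side: write one spec per slave, keyed by partition number (layout is illustrative)."""
          for part_no, spec in enumerate(partition_specs):
              local = f"/tmp/{job_id}_{part_no}.json"
              with open(local, "w") as f:
                  json.dump(spec, f)   # e.g. {"table": "table_1", "rows": [1, 100]}
              subprocess.run(["hdfs", "dfs", "-put", "-f", local,
                              f"/jobs/{job_id}/part_{part_no}.json"], check=True)

      def fetch_partition(job_id, part_no):
          """Slave side: download only the spec that matches this node's partition number."""
          local = f"/tmp/spec_{part_no}.json"
          subprocess.run(["hdfs", "dfs", "-get",
                          f"/jobs/{job_id}/part_{part_no}.json", local], check=True)
          with open(local) as f:
              return json.load(f)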
  • S280 Run the code of the second subtask through the slave node, calculate the data corresponding to the modified variable, or calculate the divided data, to obtain a calculation result, and feed the calculation result back to the master node.
  • in an embodiment, the slave node runs the code in the subtask, queries the corresponding data in the storage location of the slave node according to the modified variable, and calculates the data; or the slave node calculates the downloaded divided data to obtain the calculation result. The calculation result of each slave node is fed back to the master node.
  • S290 Receive the calculation result fed back by the slave node through the master node, and feed the calculation result back to the Java virtual machine.
  • the technical solutions of S260-S290 are illustrated by an example. Suppose the subtask is to calculate the sum of each row of data in Table 1, where the variable included in the subtask code is Table 1.
  • the variable read from the Java virtual machine by the master node in the cluster based on the subtask code is Table 1, and the read variable is modified. The modified variables may be: Lines 1 to 100 of Table 1, and Lines 101 to 200 of Table 1.
  • the master node in the cluster sends the modified variables and the subtask code to the distributed file system. Slave node 1 downloads the subtask code and the modified variable (Lines 1 to 100 of Table 1) from the distributed file system, so that when slave node 1 runs the subtask code, it queries the data in rows 1-100 of Table 1 and sums each row of data in rows 1-100 of Table 1.
  • slave node 2 downloads the subtask code and the modified variable (Lines 101 to 200 of Table 1) from the distributed file system, so that when slave node 2 runs the subtask code, it looks up the data in rows 101-200 of Table 1 based on the downloaded modified variable and calculates the sum of each row of data in rows 101-200 of Table 1. The calculation results of the slave nodes are then summarized, so that the calculation results can be obtained in the Java Virtual Machine (JVM) of the master node for subsequent task calculation.
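  • a hedged PySpark rendering of this Table 1 example: every row is given a line number, the per-node row ranges play the role of the modified variables, and each range is summed by the cluster; the table name, the numbering helper, and the assumption that all columns are numeric are illustrative:

      from pyspark.sql import SparkSession, functions as F

      spark = SparkSession.builder.getOrCreate()
      table_1 = spark.table("table_1")   # illustrative table name

      # attach a 1-based line number so the "Lines 1-100 / 101-200" split can be expressed
      numbered = table_1.rdd.zipWithIndex().map(
          lambda pair: tuple(pair[0]) + (pair[1] + 1,)
      ).toDF(table_1.columns + ["line_no"])

      for lo, hi in [(1, 100), (101, 200)]:   # the modified variables of the example
          chunk = numbered.filter((F.col("line_no") >= lo) & (F.col("line_no") <= hi))
          row_sums = chunk.withColumn(
              "row_sum", sum(F.col(c) for c in table_1.columns))   # assumes numeric columns
          row_sums.select("line_no", "row_sum").show(5)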
  • in the related art, the languages supported for distributed tasks have certain limitations; for example, they are limited to Scala, R, and Python, and distributed execution is not supported for tasks in other languages. In addition, the languages that users can use are relatively limited.
  • in the embodiment of the present application, when the language type of the code in the subtask is a language other than the target language, the code and data in the subtask can be distributed to the slave nodes through the master node, so that the slave nodes can run the subtask.
  • this overcomes the limitation on the languages available for distributed task execution: tasks in multiple languages can be executed in a distributed manner, avoiding users' dependence on a single language and enriching the languages that users can use.
  • in the related art, the execution of some tasks is done on a single machine. If such tasks need to be executed in a distributed manner, the single-machine program needs to be modified, often by deleting the program and rewriting it, which makes the modification to distributed execution difficult.
  • with the method provided by the embodiment of this application, by judging the language type of the code in the task and executing the task in a distributed manner in the corresponding way, there is no need to rewrite the single-machine program; the purpose of distributed task execution can be achieved by only adding the code of the method provided by the embodiment of this application to the original program, which saves time and reduces the difficulty of converting single-machine computing into distributed computing.
  • the embodiment of the present application exemplarily combines S210-S290 into one embodiment of the task execution method, but this embodiment is only an example. In other embodiments of the present application, S210-S250 may be combined into one embodiment of the task execution method, or S210, S220, and S260-S290 may form an embodiment of the task execution method.
  • Fig. 3a is a flowchart of another method for executing a mixed language task provided by an embodiment of the present application. As shown in Fig. 3a, the technical solution provided by an embodiment of the present application includes:
  • initializing Spark resources may include applying for Spark resources.
  • the required Spark resources can be preset, and the Spark resources can be applied for in advance.
  • the JVM may store the requested Spark resource list, may store data and variables, may also store intermediate data and variables generated in the distributed computing process, or may also store some other data.
  • S330 Obtain the task to be executed, and divide the task to be executed into at least two subtasks, where different subtasks are written using codes in different programming languages.
  • Pyspark implements Spark's Application Programming Interface (API) for Python; through this interface, users can write Python programs that run on Spark and thereby take advantage of Spark's distributed computing features.
  • the native Pyspark uses java_gateway to call the Spark interface, providing Python with methods such as calling the SparkContext implemented in the Scala language to initialize resources and call distributed algorithms, so as to achieve distributed computing.
  • in the related art, the above methods are used to initialize resources; every task needs its own initialization, and the processing mechanism is cumbersome and time-consuming.
  • the embodiment of the application initializes resources in advance. When the code language of the executed subtask is Python, a gatewayServer instance can be created, allowing the Python program in the cluster to communicate with the JVM; the data and Spark objects in the JVM are serialized so that data or Spark objects can be read from the JVM through the Pyspark interface, and distributed calculation is then performed based on the read data.
  • the approach of initializing resources first and then performing distributed calculations provided by the embodiment of the application can avoid initializing resources for every task processed, which can improve efficiency.
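  • a minimal sketch of this reuse, assuming one long-lived SparkSession serves every Python subtask instead of each task initializing its own resources; the exec-based dispatch is purely illustrative and ignores sandboxing:

      from pyspark.sql import SparkSession

      # initialize Spark resources once, before any subtask runs
      spark = SparkSession.builder.appName("mixed-language-job").getOrCreate()

      def run_python_subtask(code, spark):
          # illustrative only: run the subtask body with the shared session in scope
          namespace = {"spark": spark}
          exec(code, namespace)
          return namespace.get("result")

      subtasks = [
          "result = spark.range(10).count()",
          "result = spark.range(5).selectExpr('sum(id)').first()[0]",
      ]
      for code in subtasks:
          print(run_python_subtask(code, spark))   # the same SparkSession serves every subtask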
  • S360 If the type of the code in the subtask is judged to be the R language, Socket parameter interaction is performed between the SparkR interface and the JVM, so that the SparkR interface is connected to the JVM through callJStatic and data is read from the JVM; the SparkR calculation is then performed, the calculation result is transferred, and the resources are released.
  • SparkR is an R language package that provides a lightweight way to use Apache Spark in the R language. SparkR implements a distributed data frame and supports operations such as query, filtering, and aggregation.
  • the native SparkR uses the callJStatic method to call the method of initializing resources defined in the scala language to complete the initialization of the resources, and then perform distributed calculations by calling the initialized resources, and interact the calculation results through the socket.
  • the above-mentioned methods in the related art need to initialize resources every time an R language task is executed, and the processing mechanism is cumbersome and wastes time.
  • the embodiment of this application initializes resources in advance and connects to the JVM through the SparkR interface, so that it can be connected to the resources that have been applied for, and data interaction with the JVM can be realized. After the cluster obtains the data, distributed computing is implemented, and the calculation results are fed back to the JVM. It can avoid the need to initialize resources for processing each R language task, which can improve efficiency.
  • the master node can use the mapPartition method to transfer the subtask code and data through HDFS and distribute them to each slave node included in the applied resources, so that the slave nodes can continue to calculate the data using the instructions supported in their operating environment (python2, python3, bash, rscript, etc.). After the calculation is completed, the calculation results are collected through the master node and converted into a Spark DataFrame for subsequent data processing, storage, distributed calculation, and so on.
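  • a sketch of this mapPartition-based distribution: each partition's rows are piped to an interpreter available in the worker's operating environment (bash/awk here, purely as an illustration) and the collected per-partition results are turned back into a Spark DataFrame:

      import subprocess
      from pyspark.sql import SparkSession

      spark = SparkSession.builder.getOrCreate()
      df = spark.createDataFrame([(i,) for i in range(100)], ["value"])

      def run_with_bash(rows):
          rows = list(rows)
          if not rows:   # skip empty partitions
              return
          values = "\n".join(str(r.value) for r in rows)
          # feed the partition's values to an external interpreter on the worker (illustrative command)
          out = subprocess.run(["bash", "-c", "awk '{s+=$1} END {print s}'"],
                               input=values, text=True, capture_output=True, check=True)
          yield (int(out.stdout.strip()),)

      partial_sums = df.rdd.mapPartitions(run_with_bash).toDF(["partition_sum"])
      # the master collects the per-partition results and keeps them as a Spark DataFrame
      partial_sums.show()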
  • the mixed language task execution method provided by the embodiment of the present application can also refer to the flow shown in Fig. 3c.
  • Fig. 4 is a structural block diagram of a mixed language task execution device provided by an embodiment of the present application.
  • the device is configured in a cluster.
  • the device includes: an acquisition module 410, a judgment module 420, an execution module 430, and a storage module 440.
  • the obtaining module 410 is configured to obtain a task to be executed, and divide the task to be executed into at least two subtasks, where different subtasks are written in codes in different programming languages;
  • the judgment module 420 is configured to judge the language type of the codes in the at least two subtasks;
  • the execution module 430 is configured to execute the at least two subtasks in a manner corresponding to the language types of the codes in the at least two subtasks;
  • the storage module 440 is configured to store the execution result in the Java virtual machine, so that the execution result is read from the Java virtual machine during subsequent calculations.
  • the execution module 430 is configured to: if it is determined that the language type of the code of the first subtask in the at least two subtasks is the target language, establish a connection between the target language interface and the Java virtual machine, wherein the target language includes Python Language or R language, the target language interface includes a Pyspark interface corresponding to the Python language or a SparkR interface corresponding to the R language;
  • the code based on the first subtask reads data or variables from the Java virtual machine through the target language interface, performs distributed calculations on the variables or data, and obtains a calculation result.
  • the Pyspark interface and the Java virtual machine are connected through a gateway, and the SparkR interface and the Java virtual machine are connected through a callJStatic.
  • the execution module 430 is set to:
  • if the master node in the cluster determines that the language type of the code of the second subtask in the at least two subtasks is a language other than the target language, the master node reads variables or data from the Java virtual machine based on the code of the second subtask;
  • modify the variable through the master node, and send the modified variable and the code of the second subtask to the slave node corresponding to the modified variable, or divide the read data, and send the code of the second subtask and the divided data to the slave node corresponding to the divided data;
  • the storage module 440 is configured to receive the calculation result fed back by the slave node through the master node, and feed the calculation result back to the Java virtual machine.
  • in an embodiment, modifying the variable through the master node and sending the modified variable and the code of the second subtask to the corresponding slave node, or dividing the read data and sending the code of the second subtask and the divided data to the slave node, includes: sending the modified variable and the code of the second subtask to the distributed file system, or sending the divided data and the code of the second subtask to the distributed file system; and downloading, through the slave node, the code of the second subtask and the modified variable or divided data corresponding to the slave node from the distributed file system.
  • the codes of different subtasks are stored in different functional modules, and the judgment module 420 is configured to judge the language types of the codes in the at least two subtasks based on the identifier of the functional module where the codes in the subtask are located.
  • the cluster is a Hadoop cluster.
  • Spark2 is configured in the cluster.
  • the above-mentioned device can execute the method provided in any embodiment of the present application, and has corresponding functional modules and beneficial effects for executing the method.
  • FIG. 5 is a structural block diagram of a cluster provided by an embodiment of the present application.
  • a cluster 500 provided by an embodiment of the present application includes a mixed language task execution device 501 provided by an embodiment of the present application.

Abstract

Embodiments of the present application provide a mixed language task execution method and device, and a cluster. The method is applied to the cluster and comprises: obtaining a task to be executed, and dividing the task to be executed into at least two sub-tasks, wherein different sub-tasks are written in codes of different programming languages; determining the language types of the codes in the at least two sub-tasks; respectively executing the at least two sub-tasks in a manner corresponding to the language types of the codes in the at least two sub-tasks; and storing an execution result in a Java virtual machine, so that the execution result is read from the Java virtual machine during subsequent calculation.

Description

Mixed language task execution method, device and cluster
This application claims priority to the Chinese patent application with application number 201910425952.0, filed with the Chinese Patent Office on May 21, 2019, the entire content of which is incorporated into this application by reference.
Technical field
This application relates to the field of distributed technology, for example, to a mixed language task execution method, device, and cluster.
Background
A cluster is a parallel system or a distributed system composed of a number of computers connected to each other. A single computer in a cluster is usually called a node; nodes are usually connected through a local area network, but other connection methods are also possible. Cluster computers are generally used to improve the calculation speed and/or reliability of a single computer.
A distributed computing framework is the running and programming framework of distributed systems for processing big data. For example, Storm is a distributed real-time computing system for processing high-speed and large data streams, adding reliable real-time data processing functions to Hadoop; Spark uses in-memory computing, starting from multi-iteration batch processing, allowing data to be loaded into memory for repeated queries, and it also integrates multiple computing paradigms such as data warehouse, stream processing, and graph computing. Spark is built on the Hadoop Distributed File System (HDFS) and can be well integrated with Hadoop. Spark has multiple versions, such as Spark1, Spark2, etc.
A distributed computing framework can support multiple languages. For example, Spark2 can support the Python language and the R language, which means that Spark2 can support the distributed execution of tasks in the Python language and the distributed execution of tasks in the R language. However, in the related art, because each language has certain limitations in realizing services, tasks often need to mix multiple languages to meet the needs of users, while the distributed computing framework can only support distributed execution of single-language tasks and does not support distributed execution of mixed-language tasks. As a result, distributed execution of single-language tasks has certain limitations in realizing services and cannot realize some specific functions, thus failing to meet the needs of users.
Summary of the invention
The embodiments of the present application provide a method, device, and cluster for executing a mixed language task, which can overcome the limitation of executing a single language task to implement a business, and can implement more business functions.
The embodiment of the present application provides a method for executing mixed language tasks. The method is applied to a cluster, and the method includes:
obtaining the task to be executed, and dividing the task to be executed into at least two subtasks, where different subtasks are written using codes in different programming languages;
determining the language type of the code in the at least two subtasks;
respectively executing the at least two subtasks in a manner corresponding to the language type of the codes in the at least two subtasks;
storing the execution result in the Java virtual machine, so that the execution result is read from the Java virtual machine in the case of subsequent calculations.
An embodiment of the present application also provides a mixed language task execution device, the device is applied to a cluster, and the device includes:
an obtaining module, configured to obtain the task to be executed, and divide the task to be executed into at least two subtasks, wherein different subtasks are written in codes in different programming languages;
a judging module, configured to judge the language type of the codes in the at least two subtasks;
an execution module, configured to execute the at least two subtasks in a manner corresponding to the language type of the codes in the at least two subtasks;
a storage module, configured to store the execution result in the Java virtual machine, so that the execution result is read from the Java virtual machine in the case of subsequent calculations.
An embodiment of the present application also provides a cluster, including a mixed-language task execution device provided by the embodiment of the present application.
Description of the drawings
Fig. 1 is a flowchart of a method for executing a mixed language task provided by an embodiment of the present application;
Fig. 2 is a flowchart of another method for executing a mixed language task provided by an embodiment of the present application;
Fig. 3a is a flowchart of another method for executing a mixed language task provided by an embodiment of the present application;
Fig. 3b is a flowchart of a method for executing subtasks other than the Python and R languages provided by an embodiment of the present application;
Fig. 3c is a flowchart of another method for executing a mixed language task provided by an embodiment of the present application;
Fig. 4 is a structural block diagram of a mixed language task execution device provided by an embodiment of the present application;
Fig. 5 is a schematic structural diagram of a cluster provided by an embodiment of the present application.
Detailed description
The application will be described below with reference to the drawings and embodiments. The drawings only show a part, but not all, of the structure related to this application.
Fig. 1 is a flowchart of a method for executing a mixed language task provided by an embodiment of the present application. The method may be executed by a mixed language task execution device, and the device may be implemented by software and/or hardware and may be configured in a cluster. The method can be applied to scenarios in which big data is calculated, including scenarios in which big data is calculated in multiple ways.
The embodiment of the present application takes Spark2 configured in a cluster as an example for description. Optionally, the cluster may include at least three servers, and the cluster in the embodiment of the present application may be a Hadoop cluster.
In one embodiment, the Spark2 distributed computing framework makes full use of the advantages of clusters and, compared with stand-alone computing, breaks through the limitations of a single server on memory, central processing unit (CPU), and storage. Using distributed computing methods can accelerate the calculation of big data. With the stand-alone computing method, when the business and data volume increase, stand-alone resources are prone to bottlenecks, accompanied by frequent garbage collection (gc) and high input/output (I/O) from writing back to disk, which affects the correctness of business functions and results. The distributed computing framework Spark2 can expand computing resources horizontally, freely allocate the resources required by tasks, and assign tasks to corresponding nodes for calculation according to the selected resources; combined with Spark2's distributed algorithms and feature engineering, the data is processed in a distributed manner.
Spark2 supports the Scala, Python and R languages, and can provide related interfaces to call distributed algorithms implemented by Spark2. The core algorithm of Spark2 is implemented in the Scala language; Python and R provide functions that call the Scala implementation and provide distributed functions for basic algorithms. At the same time, Spark2 includes functions such as map and mapPartition, and supports custom implementation of distributed functions.
In related technologies, Spark2 supports user-defined implementation of distributed functions, provides Python, R and Scala interfaces, and supports distributed execution of single-language tasks, but it does not support distributed execution of mixed-language tasks, which results in certain limitations in realizing services: some specific functions cannot be implemented. The embodiment of the application provides that, by dividing a task into subtasks written in code of different languages and executing the subtasks in a manner corresponding to the language type of the code in each subtask, a task to be executed can be written in multiple languages, which can overcome the limitation of running single-language tasks to realize services and can realize more business functions.
As shown in Fig. 1, the technical solution provided by the embodiment of the present application includes: S110 to S140.
S110: Obtain a task to be executed, and divide the task to be executed into at least two subtasks, where different subtasks are written using codes in different languages.
In the embodiment of this application, the user equipment can communicate with the cluster. Optionally, a system for communicating with the cluster may be configured on the user equipment, and the user may write tasks using code in the system. When users write tasks, they can write the tasks in different languages. In the embodiment of the present application, the cluster can obtain the task to be executed through the interface, and divide the task to be executed into at least two subtasks. In an embodiment, the task may be divided into subtasks according to the identification of the code in the task to be executed. For example, subtasks in different programming languages have different code identifiers, and the task to be executed can be divided into at least two subtasks according to the identifiers in the code.
S120: Determine the language type of the code in the at least two subtasks.
In the embodiment of the present application, optionally, the codes of different subtasks are stored in different functional modules. Judging the language type of the code in the at least two subtasks includes: judging the language type of the code in the at least two subtasks based on the identifier of the functional module where the code in the subtask is located.
In an embodiment, the code of each subtask may be stored in different functional modules, and the functional modules storing subtasks written in different languages are not the same. Each functional module that stores a subtask has an identifier that is different from those of other functional modules, and the language type of the code of the subtask in the functional module can be determined through the identifier of the functional module. The method of determining the language type of the code in the subtask is not limited to the above method, and may also be other methods.
S130: Execute the at least two subtasks in a manner corresponding to the language types of the codes in the at least two subtasks.
In the embodiment of the present application, if the language types of the codes in the at least two subtasks are not the same, the manners of executing the at least two subtasks may also be different. The language type of the subtask code can be the Python language, the R language or another language. For the manner of executing the subtasks, refer to the introduction of the following embodiment.
In the embodiment of the present application, the at least two subtasks may be executed serially or in parallel. When multiple subtasks have a dependency relationship, the manner in which the cluster executes the subtasks may be serial execution, that is, the multiple subtasks are executed in the order defined by the dependency relationship. When multiple subtasks do not have a dependency relationship, the cluster can execute the subtasks in parallel, which can save time and improve efficiency.
S140: Store the execution result in the Java virtual machine, so that the execution result is read from the Java virtual machine during subsequent calculations.
In the embodiment of the present application, the execution result may be stored in the form of a Data Frame in the Java virtual machine.
In related technologies, Spark2 supports user-defined implementation of distributed functions, provides Python, R and Scala interfaces, and supports distributed execution of single-language tasks, but it does not support distributed execution of mixed-language tasks, which results in certain limitations in realizing services: some specific functions cannot be implemented, so user needs cannot be met. In order to meet the needs of users, it is often necessary to write multiple tasks in different languages and have the cluster configured with Spark2 execute the multiple tasks to achieve the goal; after each task is executed, the execution results of the tasks often need to interact through intermediate files, so the accuracy and stability of the processing results are not controllable. In the embodiment of this application, by dividing a task into subtasks written in code of different languages and executing the subtasks in a manner corresponding to the language type of the code in each subtask, a task to be executed can be written in multiple languages, that is, multiple languages can be mixed within the Spark resource life cycle, which can overcome the limitations of running single-language tasks to realize services and can realize more business functions. In the embodiment of the present application, the task is divided into subtasks and the execution results are stored through the Java virtual machine, thereby avoiding the interaction of calculation results through intermediate files and ensuring the accuracy and controllability of the processing results.
The method provided in the embodiments of the present application is not limited to being applied to clusters configured with the Spark2 distributed computing framework, and can also be applied to clusters configured with other distributed computing frameworks.
The mixed language task execution method provided by the embodiments of the present application divides a task into subtasks, where the subtasks are written in code of different languages, and executes the subtasks in a manner corresponding to the language type of the code in each subtask. In this way, a task to be executed can be written in multiple languages, which overcomes the functional limitations of running single-language tasks and enables more functions to be realized. The embodiment of the present application feeds the execution result of each subtask back to the virtual machine for storage, so that the calculation result is read from the virtual machine during subsequent calculations, thereby avoiding the situation in which calculation results interact through intermediate files, and ensuring the accuracy and controllability of the processing results.
Fig. 2 is a flowchart of another method for executing a mixed language task provided by an embodiment of the present application. In the embodiment of the present application, the operation of executing the subtasks in a manner corresponding to the language type of the code in the subtasks is described. As shown in Fig. 2, the technical solution provided by the embodiment of the present application includes: S210 to S290.
S210: Obtain a task to be executed, and divide the task to be executed into at least two subtasks, where different subtasks are written using codes in different programming languages.
S220: Determine the language type of the code in the at least two subtasks.
S230: If it is determined that the language type of the code of the first subtask of the at least two subtasks is a target language, establish a connection between a target language interface and the Java virtual machine, where the target language includes the Python language or the R language, and the target language interface includes the Pyspark interface corresponding to the Python language or the SparkR interface corresponding to the R language.
In the embodiment of the present application, optionally, Spark2 is configured in the cluster, and the target language interface may be an interface for performing a set function.
In this embodiment of the application, in the cluster, if the language type of the code in the subtask is the Python language, a connection between the Pyspark interface and the Java virtual machine is established; if the language type of the code in the subtask is the R language, a connection between the SparkR interface and the Java virtual machine is established. The Java virtual machine can be created on the master node in the cluster, and the Java virtual machine can be used to store data or variables. Optionally, the Pyspark interface and the Java virtual machine are connected through a gateway. The SparkR interface and the Java virtual machine are connected through callJStatic; before the SparkR interface is connected to the Java virtual machine, the SparkR interface and the Java virtual machine may first exchange Socket parameters, thereby realizing the connection between the SparkR interface and the Java virtual machine.
S240: Read data or variables stored in the virtual machine through the target language interface, and perform distributed computation on the read data, or on the data corresponding to the variables, to obtain a calculation result.
In an embodiment, based on the code of the first subtask, data or variables are read from the Java virtual machine through the target language interface, and distributed computation is performed on the variables or the data to obtain a calculation result.
In this embodiment, if the language of the code in the subtask is the Python language, a connection between the Pyspark interface and the Java virtual machine is established, and the stored data or variables are read from the virtual machine through the Pyspark interface based on the code in the subtask. In an embodiment, the code in the subtask contains variables, and these variables can be read from the virtual machine through the Pyspark interface, after which the cluster performs distributed computation on the data corresponding to the variables. When a variable is read from the Java virtual machine through the Pyspark interface, the data corresponding to the variable may be stored on the slave nodes of the cluster. The distributed computation on the variables may follow the method of the Spark2 distributed computing framework.
In an embodiment, the corresponding data may also be read from the virtual machine through the Pyspark interface based on the variables contained in the code of the subtask, and the cluster then performs distributed computation on the read data. For example, if the variable contained in the subtask code is Table 1, the data in Table 1 can be read from the virtual machine through the Pyspark interface. The distributed computation on the data may follow the method of the Spark2 distributed computing framework.
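As an illustration only, the reading and computing flow of S240 could look like the following sketch on the Python side; the temporary view name table_1 and the use of a temporary view as the JVM-side storage are assumptions made for this sketch, not a statement of the claimed implementation.

```python
# Hedged sketch of S240: resolve the variable against JVM-held state and run a
# distributed computation on it.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mixed-language-subtask").getOrCreate()

# The variable in the subtask code (e.g. "Table 1") is assumed to have been
# registered beforehand as a temporary view backed by the JVM.
df = spark.table("table_1")

# Distributed computation following the normal Spark2 execution model; the
# describe() call stands in for whatever the subtask actually computes.
result = df.describe()
```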
S250: Feed the calculation result back to the Java virtual machine, so that the calculation result is read from the Java virtual machine during subsequent calculations.
In this embodiment, the master node of the cluster assigns computation tasks to the slave nodes, the slave nodes perform the computation to obtain calculation results and feed them back to the master node, and the master node feeds the calculation results back to the Java virtual machine, so that the calculation results can be read from the Java virtual machine during subsequent calculations. Optionally, the calculation results may be stored in the Java virtual machine in the form of a DataFrame.
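Continuing the sketch above, the hand-back of S250 could be expressed as follows; the view name subtask_result is an assumption, and caching plus registering a temporary view is only one plausible way to keep the result addressable for later subtasks.

```python
# Sketch of S250: keep the result on the JVM side so later subtasks, possibly
# written in another language, can read it without intermediate files.
result.cache()
result.createOrReplaceTempView("subtask_result")

# A later subtask would then read it back directly:
later_input = spark.table("subtask_result")
```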
S260: If the master node of the cluster determines that the language type of the code of a second subtask of the at least two subtasks is a language other than the target language, read variables or data from the Java virtual machine through the master node based on the code of the second subtask.
In this embodiment, optionally, if the master node of the cluster determines that the language type of the code of the second subtask is a language other than the Python language and the R language, the master node reads variables or data from the Java virtual machine based on the code of the second subtask. In an embodiment, the master node of the cluster may read a variable from the Java virtual machine based on the variables contained in the code of the subtask. When the variable is read from the Java virtual machine through the master node, the data corresponding to the variable contained in the subtask code may not exist in the Java virtual machine and may instead be stored on the slave nodes. Alternatively, the master node may read, from the Java virtual machine, the data corresponding to a variable contained in the subtask code, in which case the data corresponding to the variable is stored in the Java virtual machine.
S270: Modify the variable through the master node, and send the modified variable and the code of the second subtask to the slave node corresponding to the modified variable; or divide the read data, and send the code of the second subtask and the divided data to the slave node corresponding to the divided data.
In an implementation of this embodiment, optionally, sending the modified variable and the code of the second subtask to the slave node corresponding to the modified variable, or dividing the read data and sending the code of the second subtask and the divided data to the slave node corresponding to the divided data, includes: sending the modified variable and the code of the second subtask to a distributed file system, or sending the divided data and the code of the second subtask to the distributed file system; and downloading, through the slave node, the code of the second subtask and the modified variable corresponding to the slave node, or the code of the second subtask and the divided data corresponding to the slave node, from the distributed file system.
In an embodiment, different slave nodes correspond to different modified variables, or different slave nodes correspond to different divided data. In an embodiment, when the cluster executes a computation task on data, the master node of the cluster may assign the task to multiple slave nodes for execution, and each slave node may execute the computation task on part of the data. Therefore, after the master node reads variables or data from the virtual machine based on the code of the subtask, the master node needs to modify the read variables or divide the read data, so that multiple slave nodes compute on different data and thereby complete the computation on the read data or on the data corresponding to the read variables.
In this embodiment, the distributed file system may be the Hadoop distributed file system, and each slave node downloads the code of the subtask and the modified variables (or divided data) from the distributed file system. Each slave node may download the modified variables or divided data from the distributed file system as follows: a correspondence between the modified variables and the slave node numbers is established, and a slave node downloads the corresponding modified variables according to this correspondence; or a correspondence between the divided data and the slave node numbers is established, and a slave node downloads the corresponding divided data according to this correspondence. Each slave node may also download the modified variables or divided data from the distributed file system in other ways.
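The following sketch illustrates, under stated assumptions, how the master node could derive the "modified variables" of S270 and associate them with slave node numbers; the slicing rule, the node count, and the staging paths mentioned in the comments are placeholders, not the claimed implementation.

```python
# Illustrative sketch: split a table variable into per-node slices
# ("modified variables") that can be staged on HDFS together with the code.
def split_variable(table_name, total_rows, num_nodes):
    rows_per_node = total_rows // num_nodes
    slices = []
    for i in range(num_nodes):
        start = i * rows_per_node + 1
        end = total_rows if i == num_nodes - 1 else (i + 1) * rows_per_node
        slices.append({"node": i, "variable": f"lines {start}-{end} of {table_name}"})
    return slices

staged = split_variable("Table 1", total_rows=200, num_nodes=2)
# Each entry would be written to an agreed HDFS location (for example
# /staging/subtask_2/node_<i>/) together with the subtask code, so that the
# matching slave node can download exactly its own slice.
```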
S280: Run the code of the second subtask through the slave node, compute the data corresponding to the modified variable, or compute the divided data, to obtain a calculation result, and feed the calculation result back to the master node.
In this embodiment, the slave node runs the code of the subtask, looks up the corresponding data in the slave node's storage location according to the modified variable, and computes on that data; alternatively, the slave node computes on the downloaded divided data to obtain a calculation result. The calculation result of every slave node is fed back to the master node.
S290: Receive, through the master node, the calculation results fed back by the slave nodes, and feed the calculation results back to the Java virtual machine.
The technical solution of S260-S290 is illustrated by an example. Suppose the subtask is to sum each row of data in Table 1, where the variable contained in the subtask code is Table 1. The variable read by the master node of the cluster from the Java virtual machine based on the subtask code is Table 1, and the read variable is modified. The modified variables may be: lines 1 to 100 of Table 1, and lines 101 to 200 of Table 1. The master node of the cluster sends the modified variables and the subtask code to the distributed file system. Slave node 1 downloads the subtask code and the modified variable (lines 1 to 100 of Table 1) from the distributed file system, so that when slave node 1 runs the subtask code it queries the data in rows 1-100 of Table 1 and sums each of those rows. Slave node 2 downloads the subtask code and the modified variable (lines 101 to 200 of Table 1) from the distributed file system, so that when slave node 2 runs the subtask code it queries the data in rows 101-200 of Table 1 based on the downloaded modified variable and sums each of those rows. The calculation results of the nodes are then aggregated, so that the calculation result is available in the Java Virtual Machine (JVM) of the master node for subsequent task calculations.
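A toy, self-contained sketch of this worked example is given below; the in-memory lists merely stand in for the rows each slave would query from its local storage, and the final merge stands in for the aggregation performed by the master node.

```python
# Each slave sums every row of its own slice of Table 1 and reports the
# per-row sums; the master merges the partial results.
def run_subtask_on_slice(rows):
    return [sum(row) for row in rows]

slice_for_node_1 = [[1, 2, 3], [4, 5, 6]]      # stands in for lines 1-100
slice_for_node_2 = [[7, 8, 9], [10, 11, 12]]   # stands in for lines 101-200

partial_1 = run_subtask_on_slice(slice_for_node_1)
partial_2 = run_subtask_on_slice(slice_for_node_2)
combined = partial_1 + partial_2                # handed back to the master's JVM
```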
In the related art, when distributed computing is implemented in a cluster, the languages of tasks that can be executed in a distributed manner are limited, for example, to the Scala, R, and Python languages; tasks in other languages are not supported for distributed execution, and the languages available to users are correspondingly limited. With the method provided by the embodiments of the present application, when the language type of the code in a subtask is a language other than the target language, the master node can distribute the code and data of the subtask to the slave nodes so that the slave nodes run the code of the subtask and compute on the distributed data, thereby implementing distributed computation. This overcomes the language limitations of distributed task execution: tasks in multiple languages can be executed in a distributed manner, which avoids users' dependence on a fixed language and enriches the languages available to users.
In the related art, some tasks are executed by a single machine. If such tasks need to be executed in a distributed manner, the program on the single machine has to be modified; often the original program has to be deleted and rewritten, which makes the conversion to distributed execution difficult. With the method provided by the embodiments of the present application, by determining the language type of the code in a task and executing the task in a distributed manner in the corresponding way, the single-machine program does not need to be modified; it is only necessary to add, on top of the original program, the code implementing the method provided by the embodiments of the present application to achieve distributed task execution, which saves time and reduces the difficulty of converting single-machine computation into distributed computation.
In the embodiments of the present application, S210-S290 are exemplarily combined into one embodiment that executes a task execution method, but this is only an example. In other embodiments of the present application, S210-S250 may form one embodiment that executes a task execution method, or S210, S220, and S260-S290 may form one embodiment that executes a task execution method.
FIG. 3a is a flowchart of another mixed language task execution method provided by an embodiment of the present application. As shown in FIG. 3a, the technical solution provided by this embodiment includes:
S310: Initialize Spark resources.
In this embodiment, initializing Spark resources may include applying for Spark resources. The required Spark resources may be preset and applied for in advance.
S320: The JVM stores data and variables.
In this embodiment, the JVM may store the list of requested Spark resources, may store data and variables, may store intermediate data and variables generated during distributed computation, or may store other data.
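Purely as an illustration, S310-S320 could be realized as below; the resource figures, the input path, and the view name are assumptions introduced for this sketch rather than values taken from the application.

```python
# Hedged sketch of S310-S320: request Spark resources once up front and park
# data on the JVM under a name that every later subtask can resolve.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("mixed-language-pipeline")
         .config("spark.executor.instances", "4")     # placeholder resource figures
         .config("spark.executor.memory", "2g")
         .getOrCreate())

table_1 = spark.read.parquet("hdfs:///data/table_1")  # path is an assumption
table_1.cache()
table_1.createOrReplaceTempView("table_1")
```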
S330: Obtain the task to be executed, and divide the task to be executed into at least two subtasks, where different subtasks are written in code of different programming languages.
S340: Determine the language type of the code in the subtask.
S350: If it is determined that the type of the code in the subtask is the Python language, connect to the JVM through the Pyspark interface in the gateway mode and read data from the JVM; call the Pyspark interface to perform distributed computation, and transfer the calculation result to the JVM so that the Python resources are released.
In this embodiment, Pyspark implements Spark's Application Programming Interface (API) for Python; through Pyspark, users can write Python programs that run on Spark and thereby benefit from Spark's distributed computing features.
In the related art, the native Pyspark calls the Spark interface in the java_gateway manner, providing Python with methods such as sparkcontext, implemented in the Scala language, to initialize resources and with calls to distributed algorithms to implement distributed computation. With this method of the related art, resources have to be initialized every time a task is executed, which makes the processing mechanism cumbersome and wastes time. In the embodiments of the present application, resources are initialized in advance; when the code language of the executed subtask is the Python language, a gatewayServer instance can be created, allowing the Python program in the cluster to communicate with the JVM, and the data and Spark objects in the JVM are serialized so that data or Spark objects can be read from the JVM through the Pyspark interface, after which distributed computation is performed based on the read data. By initializing resources first and then performing distributed computation, the method provided by the embodiments of the present application avoids initializing resources for every task and improves efficiency.
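The reuse idea can be illustrated by the following sketch; getOrCreate attaching to an existing session is standard PySpark behaviour, while the view-name convention and the describe() computation are assumptions made only for the illustration.

```python
# Sketch: a Python subtask attaches to the pre-initialized session instead of
# paying the per-task initialization cost described above.
from pyspark.sql import SparkSession

def run_python_subtask(view_name):
    spark = SparkSession.builder.getOrCreate()   # reuses the active session if any
    df = spark.table(view_name)                  # data handed over from the JVM side
    result = df.describe()                       # stand-in for the real computation
    result.createOrReplaceTempView(view_name + "_result")
    return result
```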
S360: If it is determined that the type of the code in the subtask is the R language, exchange Socket parameters between the SparkR interface and the JVM so that the SparkR interface connects to the JVM through callJStatic and reads data from the JVM; perform the SparkR computation, transfer the calculation result, and release the resources.
In this embodiment, SparkR is an R language package that provides a lightweight way to use Apache Spark from the R language; SparkR implements a distributed data frame and supports operations such as querying, filtering, and aggregation.
In the related art, the native SparkR uses the callJStatic method to call the resource initialization method defined in the Scala language to complete the initialization of resources, then performs distributed computation by calling the initialized resources and exchanges the calculation results through a socket. With this method of the related art, resources have to be initialized every time an R language task is executed, which makes the processing mechanism cumbersome and wastes time. In the embodiments of the present application, resources are initialized in advance and the JVM is connected through the SparkR interface, so that the already-requested resources can be reached and data can be exchanged with the JVM; after the cluster obtains the data, distributed computation is performed and the calculation result is fed back to the JVM. This avoids initializing resources for every R language task and improves efficiency.
S370: If it is determined that the language type of the code in the subtask is a language other than the Python language and the R language, distribute the subtask code and data through the master node, so that the slave nodes run the subtask code on single machines, perform the computation on the data on single machines, and feed the calculation results back to the master node; the master node then integrates the calculation results and feeds them back to the JVM.
In this embodiment, as shown in FIG. 3b, the master node can use the mapPartition method to relay the subtask code and data through HDFS and distribute them to every slave node included in the requested resources, so that each slave node continues computing on the data using the instructions supported in its runtime environment (python2, python3, bash, rscript, etc.). After the computation is completed, the master node collects the calculation results and converts them into a Spark DataFrame for subsequent data processing, storage, distributed computation, and so on.
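One possible shape of this per-partition hand-off to an external interpreter is sketched below; the script path, the choice of bash as the interpreter, and the line-oriented data format are assumptions for the sketch, and the mapPartitions call is shown commented out because the surrounding RDD is not defined here.

```python
# Illustrative sketch of S370 / FIG. 3b: each partition is piped into an
# interpreter supported by the worker's runtime environment, and its stdout is
# collected as the partition's result.
import subprocess

def run_external_script(partition_rows):
    lines = "\n".join(",".join(map(str, row)) for row in partition_rows)
    proc = subprocess.run(
        ["bash", "/tmp/subtask.sh"],           # assumed to be downloaded from HDFS
        input=lines, capture_output=True, text=True, check=True)
    for out_line in proc.stdout.splitlines():
        yield out_line

# output_rdd = rdd.mapPartitions(run_external_script)
# The master would then collect output_rdd and convert it into a Spark
# DataFrame for subsequent processing, as described above.
```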
The mixed language task execution method provided by the embodiments of the present application may also refer to the flow shown in FIG. 3c.
FIG. 4 is a structural block diagram of a mixed language task execution device provided by an embodiment of the present application. The device is configured in a cluster and includes: an acquisition module 410, a judgment module 420, an execution module 430, and a storage module 440.
The acquisition module 410 is configured to obtain a task to be executed and divide the task to be executed into at least two subtasks, where different subtasks are written in code of different programming languages;
the judgment module 420 is configured to determine the language type of the code in the at least two subtasks;
the execution module 430 is configured to execute the at least two subtasks respectively in a manner corresponding to the language types of the code in the at least two subtasks; and
the storage module 440 is configured to store the execution results in the Java virtual machine, so that the execution results are read from the Java virtual machine during subsequent calculations.
Optionally, the execution module 430 is configured to: if it is determined that the language type of the code of a first subtask of the at least two subtasks is a target language, establish a connection between a target language interface and the Java virtual machine, where the target language includes the Python language or the R language, and the target language interface includes the Pyspark interface corresponding to the Python language or the SparkR interface corresponding to the R language;
and, based on the code of the first subtask, read data or variables from the Java virtual machine through the target language interface and perform distributed computation on the variables or data to obtain a calculation result.
Optionally, the Pyspark interface is connected to the Java virtual machine through a gateway, and the SparkR interface is connected to the Java virtual machine through callJStatic.
Optionally, the execution module 430 is configured to:
if the master node of the cluster determines that the language type of the code of a second subtask of the at least two subtasks is a language other than the target language, read variables or data from the Java virtual machine through the master node based on the code of the second subtask;
modify the variable through the master node, and send the modified variable and the code of the second subtask to the slave node corresponding to the modified variable, or divide the read data and send the code of the second subtask and the divided data to the slave node corresponding to the divided data; and
run the code of the second subtask through the slave node, compute the data corresponding to the modified variable or compute the divided data to obtain a calculation result, and feed the calculation result back to the master node.
Correspondingly, the storage module 440 is configured to receive, through the master node, the calculation results fed back by the slave nodes and feed the calculation results back to the Java virtual machine.
Optionally, modifying the variable through the master node and sending the modified variable and the code of the second subtask to the corresponding slave node, or dividing the read data and sending the code of the second subtask and the divided data to the slave node, includes:
sending the modified variable and the code of the second subtask to a distributed file system, or sending the divided data and the code of the second subtask to the distributed file system; and
downloading, through the slave node, the code of the second subtask and the modified variable corresponding to the slave node, or the code of the second subtask and the divided data corresponding to the slave node, from the distributed file system, where different slave nodes correspond to different modified variables, or different slave nodes correspond to different divided data.
Optionally, the code of different subtasks is stored in different functional modules, and the judgment module 420 is configured to determine the language type of the code in the at least two subtasks based on the identifier of the functional module in which the code of each subtask is located.
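A minimal sketch of such an identifier-based judgment is given below; the suffix-to-language mapping and the example identifier are assumptions introduced for illustration, since the application only states that the identifier of the functional module holding the code is used.

```python
# Sketch: map a functional module's identifier to a language type.
LANGUAGE_BY_MODULE_SUFFIX = {
    ".py": "python",
    ".r": "r",
    ".scala": "scala",
}

def detect_language(module_identifier):
    lowered = module_identifier.lower()
    for suffix, language in LANGUAGE_BY_MODULE_SUFFIX.items():
        if lowered.endswith(suffix):
            return language
    return "other"

detect_language("subtask_feature_cleaning.py")   # -> "python"
```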
Optionally, the cluster is a Hadoop cluster.
Optionally, Spark2 is configured in the cluster.
The above device can execute the method provided by any embodiment of the present application and has the functional modules and beneficial effects corresponding to the executed method.
FIG. 5 is a structural block diagram of a cluster provided by an embodiment of the present application. As shown in FIG. 5, the cluster 500 provided by an embodiment of the present application includes a mixed language task execution device 501 provided by an embodiment of the present application.

Claims (10)

  1. A mixed language task execution method, applied to a cluster, the method comprising:
    obtaining a task to be executed, and dividing the task to be executed into at least two subtasks, wherein different subtasks are written in code of different programming languages;
    determining the language type of the code in the at least two subtasks;
    executing the at least two subtasks respectively in a manner corresponding to the language types of the code in the at least two subtasks; and
    storing execution results in a Java virtual machine, so that the execution results are read from the Java virtual machine in the case of subsequent calculations.
  2. The method according to claim 1, wherein executing the at least two subtasks respectively in a manner corresponding to the language types of the code in the at least two subtasks comprises:
    in a case where it is determined that the language type of the code of a first subtask of the at least two subtasks is a target language, establishing a connection between a target language interface and the Java virtual machine, wherein the target language comprises the Python language or the R language, and the target language interface comprises a Pyspark interface corresponding to the Python language or a SparkR interface corresponding to the R language; and
    reading data or variables from the Java virtual machine through the target language interface based on the code of the first subtask, and performing distributed computation on the variables or the data to obtain a calculation result.
  3. The method according to claim 2, wherein the Pyspark interface is connected to the Java virtual machine through a gateway, and the SparkR interface is connected to the Java virtual machine through callJStatic.
  4. The method according to claim 1, wherein executing the at least two subtasks respectively in a manner corresponding to the language types of the code in the at least two subtasks comprises:
    in a case where a master node of the cluster determines that the language type of the code of a second subtask of the at least two subtasks is a language other than the target language, reading variables or data from the Java virtual machine through the master node based on the code of the second subtask;
    modifying the variable through the master node, and sending the modified variable and the code of the second subtask to the slave node corresponding to the modified variable, or dividing the read data, and sending the code of the second subtask and the divided data to the slave node corresponding to the divided data; and
    running the code of the second subtask through the slave node, computing the data corresponding to the modified variable, or computing the divided data, to obtain a calculation result, and feeding the calculation result back to the master node;
    wherein storing the execution results in the Java virtual machine, so that the execution results are read from the Java virtual machine in the case of subsequent calculations, comprises: receiving, through the master node, the calculation result fed back by the slave node, and feeding the calculation result back to the Java virtual machine.
  5. The method according to claim 4, wherein modifying the variable through the master node, and sending the modified variable and the code of the second subtask to the slave node corresponding to the modified variable, or dividing the read data, and sending the code of the second subtask and the divided data to the slave node corresponding to the divided data, comprises:
    sending the modified variable and the code of the second subtask to a distributed file system, or sending the divided data and the code of the second subtask to the distributed file system; and
    downloading, through the slave node, the code of the second subtask and the modified variable corresponding to the slave node, or the code of the second subtask and the divided data corresponding to the slave node, from the distributed file system, wherein different slave nodes correspond to different modified variables, or different slave nodes correspond to different divided data.
  6. The method according to claim 1, wherein the code of different subtasks is stored in different functional modules; and
    determining the language type of the code in the at least two subtasks comprises:
    determining the language type of the code in the at least two subtasks based on an identifier of the functional module in which the code of each subtask is located.
  7. The method according to any one of claims 1-6, wherein the cluster is a Hadoop cluster.
  8. The method according to claim 7, wherein Spark2 is configured in the cluster.
  9. A mixed language task execution device, applied to a cluster, the device comprising:
    an acquisition module, configured to obtain a task to be executed and divide the task to be executed into at least two subtasks, wherein different subtasks are written in code of different programming languages;
    a judgment module, configured to determine the language type of the code in the at least two subtasks;
    an execution module, configured to execute the at least two subtasks respectively in a manner corresponding to the language types of the code in the at least two subtasks; and
    a storage module, configured to store execution results in a Java virtual machine, so that the execution results are read from the Java virtual machine in the case of subsequent calculations.
  10. A cluster, comprising: the mixed language task execution device according to claim 9.
PCT/CN2020/091189 2019-05-21 2020-05-20 Mixed language task execution method and device, and cluster WO2020233584A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910425952.0 2019-05-21
CN201910425952.0A CN110109748B (en) 2019-05-21 2019-05-21 Mixed language task execution method, device and cluster

Publications (1)

Publication Number Publication Date
WO2020233584A1 2020-11-26



Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110109748B (en) * 2019-05-21 2020-03-17 星环信息科技(上海)有限公司 Mixed language task execution method, device and cluster
CN113918211B (en) * 2021-12-13 2022-06-07 昆仑智汇数据科技(北京)有限公司 Method, device and equipment for executing industrial equipment object data model
CN114579261B (en) * 2022-04-29 2022-09-20 支付宝(杭州)信息技术有限公司 Processing method and device for multi-language mixed stream

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6282697B1 (en) * 1998-09-18 2001-08-28 Wylci Fables Computer processing and programming method using autonomous data handlers
US8881158B2 (en) * 2008-11-14 2014-11-04 Nec Corporation Schedule decision device, parallel execution device, schedule decision method, and program
US9959142B2 (en) * 2014-06-17 2018-05-01 Mediatek Inc. Dynamic task scheduling method for dispatching sub-tasks to computing devices of heterogeneous computing system and related computer readable medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090172657A1 (en) * 2007-12-28 2009-07-02 Nokia, Inc. System, Method, Apparatus, Mobile Terminal and Computer Program Product for Providing Secure Mixed-Language Components to a System Dynamically
US20090313319A1 (en) * 2008-06-16 2009-12-17 International Business Machines Corporation System and Method for Dynamic Partitioning of Applications in Client-Server Environments
CN106415495A (en) * 2014-05-30 2017-02-15 苹果公司 Programming system and language for application development
CN104834561A (en) * 2015-04-29 2015-08-12 华为技术有限公司 Data processing method and device
CN110109748A (en) * 2019-05-21 2019-08-09 星环信息科技(上海)有限公司 A kind of hybrid language task executing method, device and cluster

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112685150A (en) * 2020-12-21 2021-04-20 联想(北京)有限公司 Multi-language program execution method, device and storage medium

Also Published As

Publication number Publication date
CN110109748A (en) 2019-08-09
CN110109748B (en) 2020-03-17


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 20810452; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 20810452; Country of ref document: EP; Kind code of ref document: A1)