CN111767092B - Job execution method, apparatus, system and computer readable storage medium - Google Patents

Job execution method, apparatus, system and computer readable storage medium

Info

Publication number
CN111767092B
CN111767092B (application number CN202010624055.5A)
Authority
CN
China
Prior art keywords
spark
job
engine
target
version
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010624055.5A
Other languages
Chinese (zh)
Other versions
CN111767092A
Inventor
刘有
尹强
王和平
黄山
杨峙岳
冯朝阁
杨永坤
邸帅
卢道和
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
WeBank Co Ltd
Original Assignee
WeBank Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by WeBank Co Ltd filed Critical WeBank Co Ltd
Priority to CN202010624055.5A priority Critical patent/CN111767092B/en
Publication of CN111767092A publication Critical patent/CN111767092A/en
Priority to PCT/CN2021/081960 priority patent/WO2022001209A1/en
Application granted granted Critical
Publication of CN111767092B publication Critical patent/CN111767092B/en

Classifications

    • G Physics
    • G06 Computing; Calculating or Counting
    • G06F Electric digital data processing
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/445 Program loading or initiating
    • G06F9/44505 Configuring for program initiating, e.g. using registry, configuration files
    • G06F9/4451 User profiles; Roaming
    • G06F9/44521 Dynamic linking or loading; link editing at or after load time, e.g. Java class loading
    • G06F9/44536 Selecting among different versions
    • G06F9/448 Execution paradigms, e.g. implementations of programming paradigms
    • G06F9/4488 Object-oriented
    • G06F9/449 Object-oriented method invocation or resolution
    • Y General tagging of new technological developments; general tagging of cross-sectional technologies spanning over several sections of the IPC; technical subjects covered by former USPC cross-reference art collections [XRACs] and digests
    • Y02 Technologies or applications for mitigation or adaptation against climate change
    • Y02D Climate change mitigation technologies in information and communication technologies [ICT], i.e. information and communication technologies aiming at the reduction of their own energy use
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Stored Programmes (AREA)

Abstract

The invention relates to the field of financial technology (Fintech) and discloses a job execution method, apparatus, system, and computer-readable storage medium. The job execution method includes: when an execution request for a Spark job is received, acquiring the version number of the target Spark engine, the dynamic configuration parameters, and the Spark job code according to the execution request; determining the deployment directory information and version loading rules of the target Spark engine according to the version number; acquiring static configuration parameters according to the deployment directory information, and initializing the target Spark engine with the dynamic and static configuration parameters according to the version loading rules so as to start the target Spark engine; and submitting the Spark job code to the target Spark engine to execute the job. The invention enables a single Linkis service cluster to run Spark jobs of multiple versions simultaneously, reducing operation and maintenance costs.

Description

Job execution method, apparatus, system and computer readable storage medium
Technical Field
The present invention relates to the field of financial technology (Fintech), and in particular, to a job execution method, apparatus, system, and computer readable storage medium.
Background
With the development of computer technology, more and more technologies are being applied in the financial field, and the traditional financial industry is gradually shifting toward financial technology (Fintech). Spark technology is no exception, but the security and real-time requirements of the financial industry place higher demands on it.
In existing big data cluster applications, the big data processing environments of financial institutions (such as banks) are centralized, the data are highly concentrated, and the data volume is very large. Because data are processed centrally, many big data application components are also deployed centrally, and these components are updated frequently, with several new versions released each year. Apache Spark (a fast, general-purpose computing engine designed for large-scale data processing) is a typical example.
Each Spark version update brings many new features but is sometimes not well compatible with jobs written for older versions. A single environment often serves a large number of users, some of whom need the new Spark version while others need the old one. However, a current Linkis (a data middleware connecting multiple computing and storage engines) service cluster can only support one Spark version and cannot run multiple Spark versions at the same time. To meet these users' multi-version requirements, multiple Linkis clusters usually have to be deployed, which requires a large number of machines to host Spark drivers of different versions and forces users' jobs to switch between environments. Every additional Linkis cluster increases operation and maintenance cost and difficulty, and users find it hard to keep track of each environment.
Disclosure of Invention
The main object of the present invention is to provide a job execution method, apparatus, system, and computer-readable storage medium, aiming to support Spark jobs of multiple versions simultaneously within a single Linkis service cluster, thereby reducing operation and maintenance costs.
In order to achieve the above object, the present invention provides a job execution method including:
when an execution request for a Spark job is received, acquiring a version number of a target Spark engine, dynamic configuration parameters, and Spark job code according to the execution request;
determining deployment directory information and version loading rules of the target Spark engine according to the version number;
acquiring static configuration parameters according to the deployment directory information, and initializing the target Spark engine by using the dynamic configuration parameters and the static configuration parameters according to the version loading rules, so as to start the target Spark engine;
and submitting the Spark job code to the target Spark engine to execute the job.
Optionally, before the step of determining the deployment directory information and version loading rules of the target Spark engine according to the version number, the method further includes:
acquiring a user identifier corresponding to the execution request, and detecting whether an idle Spark engine corresponding to the user identifier and the version number exists;
if not, executing the step of determining the deployment directory information and version loading rules of the target Spark engine according to the version number;
if so, submitting the Spark job code to the idle Spark engine to execute the job.
Optionally, before the step of determining the deployment directory information and version loading rules of the target Spark engine according to the version number, the method further includes:
acquiring a user identifier corresponding to the execution request, and judging, according to the user identifier, whether the user is in a preset gray list;
if the user is not in the preset gray list, executing the step of determining the deployment directory information and version loading rules of the target Spark engine according to the version number;
if the user is in the preset gray list, creating a gray Spark engine and submitting the Spark job code to the gray Spark engine to execute the job.
Optionally, the job execution method further includes:
during initialization, determining a target calling method according to the version number and a preset abstraction layer interface;
and loading, according to the target calling method, the dependency file packages of the target Spark engine under the directory corresponding to the deployment directory information.
Optionally, before the step of submitting the Spark job code to the target Spark engine to execute a job, the method further includes:
modifying the Spark job code according to the version number;
the step of submitting the Spark job code to the target Spark engine to execute a job includes:
and submitting the modified Spark job code to the target Spark engine to execute the job.
Optionally, the step of submitting the Spark job code to the target Spark engine to execute a job includes:
submitting the Spark job code to a driver node of the target Spark engine;
converting the Spark job code through the driver node to obtain a Spark task;
and distributing the Spark task to an executor node deployed on the Yarn cluster to execute the job.
Optionally, before the step of converting, by the driver node, the Spark job code to obtain a Spark task, the method further includes:
during initialization, when a Scala interpreter is created in the driver node of the target Spark engine, injecting the class loader of the main thread into the Scala interpreter, so that the class loader of the main thread becomes the parent of the Scala interpreter's class loader, and the Scala interpreter creates its own class loader from that parent;
the step of converting the Spark job code through the driver node to obtain a Spark task includes:
converting the Spark job code through the class loader of the Scala interpreter created in the driver node to obtain a Spark task;
and after the step of distributing the Spark task to the executor nodes deployed on the Yarn cluster to execute the job, the method further includes:
when a serialized execution result returned by the executor node based on the Spark task is received, changing the class loader of the current thread of the target Spark engine to the class loader of the Scala interpreter, so as to deserialize the serialized execution result through the class loader of the Scala interpreter.
In addition, in order to achieve the above object, the present invention also provides a job execution apparatus including:
a first acquisition module, configured to acquire, when an execution request for a Spark job is received, the version number of the target Spark engine, the dynamic configuration parameters, and the Spark job code according to the execution request;
a first determining module, configured to determine the deployment directory information and version loading rules of the target Spark engine according to the version number;
an engine initialization module, configured to acquire static configuration parameters according to the deployment directory information, and to initialize the target Spark engine with the dynamic configuration parameters and the static configuration parameters according to the version loading rules, so as to start the target Spark engine;
and a job execution module, configured to submit the Spark job code to the target Spark engine to execute the job.
In addition, in order to achieve the above object, the present invention also provides a job execution system including a memory, a processor, and a job execution program stored in the memory and executable on the processor, wherein the job execution program, when executed by the processor, implements the steps of the job execution method described above.
In addition, in order to achieve the above object, the present invention also provides a computer-readable storage medium having stored thereon a job execution program which, when executed by a processor, implements the steps of the job execution method as described above.
The invention provides a job execution method, apparatus, system, and computer-readable storage medium. When an execution request for a Spark job is received, the version number of the target Spark engine, the dynamic configuration parameters, and the Spark job code are acquired according to the execution request; the deployment directory information and version loading rules of the target Spark engine are then determined according to the version number; static configuration parameters are acquired according to the deployment directory information, and the target Spark engine is initialized with the dynamic and static configuration parameters according to the version loading rules so as to start it; the Spark job code is then submitted to the target Spark engine to execute the job. By installing and deploying in advance the conflict-prone dependency jar packages of the different Spark versions, obtaining the version number in an incoming execution request to determine the corresponding deployment directory information and the corresponding start-up parameters (including dynamic and static configuration parameters), and then initializing and starting the Spark engine of the target version according to those start-up parameters, the invention achieves dynamic loading of the dependency jar packages of different Spark versions, avoids multi-version jar conflicts, and allows multiple Spark versions to be executed in parallel under the same Linkis cluster. Therefore, only one Linkis service cluster needs to be deployed, and Spark engines of different versions can be created within it to support the running of multi-version Spark jobs.
Drawings
FIG. 1 is a schematic diagram of a hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a flowchart of a first embodiment of a job execution method according to the present invention;
FIG. 3 is a system architecture diagram of a job execution system according to the present invention;
fig. 4 is a schematic functional block diagram of a first embodiment of the job execution apparatus of the present invention.
The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Referring to fig. 1, fig. 1 is a schematic device structure diagram of a hardware running environment according to an embodiment of the present invention.
The job execution device in the embodiment of the present invention may be a server, or may be a terminal device such as a PC (Personal Computer), a tablet computer, or a portable computer.
As shown in fig. 1, the job execution apparatus may include: a processor 1001 (such as a CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 is used to enable communication between these components. The user interface 1003 may include a display (Display) and an input unit such as a keyboard (Keyboard), and may optionally further include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wi-Fi interface). The memory 1005 may be a high-speed RAM memory or a stable non-volatile memory, such as a disk memory. The memory 1005 may optionally also be a storage device separate from the processor 1001.
Those skilled in the art will appreciate that the configuration of the job execution apparatus shown in fig. 1 is not limiting of the job execution apparatus, and may include more or fewer components than shown, or may combine certain components, or a different arrangement of components.
As shown in fig. 1, an operating system, a network communication module, and a job execution program may be included in the memory 1005 as one type of computer storage medium.
In the terminal shown in fig. 1, the network interface 1004 is mainly used for connecting to a background server and performing data communication with the background server; the user interface 1003 is mainly used for connecting a client and communicating data with the client; and the processor 1001 may be used to call a job execution program stored in the memory 1005 and execute the respective steps of the following job execution method.
Based on the above hardware structure, various embodiments of the job execution method of the present invention are presented.
The invention provides a job execution method.
Referring to fig. 2, fig. 2 is a flowchart illustrating a first embodiment of a job execution method according to the present invention.
In this embodiment, the job execution method includes:
step S10, when an execution request of Spark job is received, acquiring a version number, dynamic configuration parameters and Spark job codes of a target Spark engine according to the execution request;
The job execution method of this embodiment is applied to a job execution system. As shown in fig. 3, the job execution system includes a Linkis cluster service and a Yarn cluster; the Linkis cluster service includes an external service portal and an engine management service, and the external service portal includes a Job management module and an engine manager. Compared with existing job execution systems, the engine management service is improved by adding a dynamic loading scheme for multi-version Spark dependency jars and dynamic adjustment of class loaders, solving the problem of executing Spark jobs of different versions in parallel. In addition, a version identifier is added to the execution request parameters of the Spark job, and correspondingly an engine manager is added to manage the Spark engine services of different versions and different users and to create or select engines according to the version number.
The Linkis cluster service is a data middleware that connects multiple computing and storage engines and provides a unified RESTful (a design style and development mode for web applications) interface to the outside for submitting and executing scripts such as SQL (Structured Query Language), PySpark (the Python API provided by Spark), HiveQL (Hive Query Language, the query language of the Hive data warehouse tool), and Scala (a multi-paradigm programming language). The Job management module in the external service portal is used to receive execution requests for Spark jobs and to automatically adjust the error codes of Spark jobs. The engine manager is used to manage the Spark engine services of different versions and different users and to create or select engines according to the version number. The engine management service manages the process creation, state tracking, and process destruction of SparkContext, and includes one or more Spark engines, which can execute in parallel and communicate with the external service portal through RPC (Remote Procedure Call). The Yarn cluster is a framework that provides job scheduling and cluster resource management in a big data platform.
In this embodiment, when the external service portal receives an execution request for a Spark job, it acquires the version number of the target Spark engine, the dynamic configuration parameters, and the Spark job code according to the execution request. The dynamic configuration parameters include one or more of: the number of executor nodes (Executors) and their CPU (Central Processing Unit) count and memory size, and the CPU count and memory size of the driver node (Driver).
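As a concrete illustration of the request payload described above, the following sketch models the fields an execution request would carry. This is a hedged example: the class and field names are invented for illustration and are not taken from the Linkis codebase.

```java
import java.util.Map;

// Illustrative sketch only: class and field names are assumptions, not Linkis APIs.
public class SparkJobRequest {
    private final String engineVersion;            // selects the target Spark engine, e.g. "2.4.3"
    private final Map<String, String> dynamicConf; // executor count, CPU and memory sizes, driver resources
    private final String jobCode;                  // the Spark job code submitted by the user
    private final String userId;                   // used later for engine reuse and gray-list routing

    public SparkJobRequest(String engineVersion, Map<String, String> dynamicConf,
                           String jobCode, String userId) {
        this.engineVersion = engineVersion;
        this.dynamicConf = dynamicConf;
        this.jobCode = jobCode;
        this.userId = userId;
    }

    public String engineVersion() { return engineVersion; }
    public Map<String, String> dynamicConf() { return dynamicConf; }
    public String jobCode() { return jobCode; }
    public String userId() { return userId; }
}
```

Carrying the version number inside the request itself is what lets the portal route the job without any cluster-wide default Spark version.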
Step S20, determining the deployment directory information and version loading rules of the target Spark engine according to the version number;
The deployment directory information and version loading rules of the target Spark engine are then determined according to the version number. The deployment directory information includes the root directory deployment information and the configuration directory deployment information, and the version loading rules include the loading rules for the dependency libraries of the different Spark versions.
In implementation, a root directory for the Spark service needs to be created in advance on the machine where the engine management service is located; SPARK_HOME and SPARK_CONF_DIR directories are then created under the root directory according to the different Spark version numbers, and Spark is installed and deployed following this path rule. Specifically, a subdirectory for each version is set up under the SPARK_HOME and SPARK_CONF_DIR directories, and each Spark version is installed under its corresponding subdirectory. During installation, the conflict-prone dependency jar packages of the different Spark versions need to be deployed separately. The SPARK_HOME directory under the root directory is the installation directory of Spark, and the SPARK_CONF_DIR directory contains the static configuration parameters of the different Spark versions, such as application attribute parameters, runtime environment parameters, runtime behavior parameters, and network parameters.
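The per-version directory convention above can be sketched as follows. The layout follows the rule described in the text (version subdirectories under SPARK_HOME-style and SPARK_CONF_DIR-style trees), but the concrete root path and the class and method names are assumptions for illustration.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Sketch of the per-version deployment layout described above.
// Root path and directory names are illustrative assumptions.
public class SparkDeployLayout {
    private final Path serviceRoot; // root directory of the Spark service, created in advance

    public SparkDeployLayout(String serviceRoot) {
        this.serviceRoot = Paths.get(serviceRoot);
    }

    // Installation directory of a given Spark version (the SPARK_HOME subtree).
    public Path sparkHome(String version) {
        return serviceRoot.resolve("spark_home").resolve(version);
    }

    // Static configuration directory of that version (the SPARK_CONF_DIR subtree):
    // application attributes, runtime environment, runtime behavior, network parameters.
    public Path sparkConfDir(String version) {
        return serviceRoot.resolve("spark_conf_dir").resolve(version);
    }

    // Conflict-prone dependency jars are deployed separately per version.
    public Path libDir(String version) {
        return sparkHome(version).resolve("lib");
    }
}
```

Because every path is derived from the version number alone, the engine management service can locate any version's installation and configuration without extra bookkeeping.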
In addition, before installing and deploying the different Spark versions, in order to support multiple versions of the Spark engine, the API (Application Programming Interface) interfaces of the Spark packages of different versions may first be compared to obtain the differing APIs; the Spark packages of the different versions are compiled according to these differing APIs, and the compiled Spark packages are then deployed under the corresponding directories. That is, the Spark package is compiled first, the part of the API that has changed is separated out, and the compiled Spark package is then installed and deployed. For example, for the Vector-related APIs, two projects can be split out, each adapting to a corresponding version; jar packages are then released for each, and the different adaptation packages are introduced according to the Spark version in Maven (a project-object-model based build tool) using its Profile mechanism.
Step S30, acquiring static configuration parameters according to the deployment directory information, and initializing the target Spark engine with the dynamic configuration parameters and the static configuration parameters according to the version loading rules, so as to start the target Spark engine;
Static configuration parameters are then acquired according to the deployment directory information; specifically, the static configuration parameters under the SPARK_CONF_DIR directory, such as application attribute parameters, runtime environment parameters, runtime behavior parameters, and network parameters, can be obtained. The target Spark engine is then initialized according to the version loading rules using the start-up parameters (including the dynamic configuration parameters and the static configuration parameters) so as to start it. That is, the dynamic and static configuration parameters are filled into the corresponding configuration parameters of the target Spark engine's configuration file for initialization.
It should also be noted that, during initialization, the dependency files used by PySpark (the Python API provided by Spark) need to be uploaded to the corresponding HDFS (Hadoop Distributed File System) directory according to the version, so that the executor nodes of the Spark engine can download these third-party files from the correct location, solving the jar dependency problem when an executor starts.
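The start-up parameter assembly in step S30 amounts to merging the file-based static configuration with the request-level dynamic configuration. A minimal sketch follows; that dynamic values override static ones on conflict is an assumption made here for illustration, as the text does not state the precedence rule explicitly.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of assembling the engine start-up configuration: static
// parameters read from the version's SPARK_CONF_DIR directory are merged
// with the request's dynamic parameters. Dynamic-over-static precedence
// is an illustrative assumption.
public class EngineConfMerger {
    public static Map<String, String> merge(Map<String, String> staticConf,
                                            Map<String, String> dynamicConf) {
        Map<String, String> merged = new HashMap<>(staticConf); // file defaults first
        merged.putAll(dynamicConf);                             // request-level settings win
        return merged;
    }
}
```

The merged map would then be written into the target engine's configuration for initialization.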
Further, the job execution method further includes:
step a1, in the initialization process, determining a target calling method according to the version number and a preset abstract layer interface;
and a step a2 of loading a file package which is dependent on the target Spark engine under the corresponding directory of the deployment directory information according to the target calling method.
Further, since Spark engine service of Linkis has some dependencies on Spark at runtime, these dependencies are all put under a lib directory of the root directory in jar packets. When different Spark versions exist, the jar conflicts, so that the dependent jar needs to be extracted in advance, not all the jar is loaded when an application is started, but an abstract interface definition is added in Spark engine service according to different engine versions to realize a multi-version dependent abstract layer, a bottom layer packet which solves the jar conflicts is made into a version module (namely a dependent file packet below), and only a designated version module is loaded when the engine is created, so that the problem of the jar conflicts of multiple versions is avoided.
Specifically, in the initialization process, a target calling method is determined according to the version number and a preset abstract layer interface, and then a file package relied on by a target Spark engine under a directory corresponding to deployment directory information is loaded according to the target calling method so as to load a conflict-free dependency library according to version differences.
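The multi-version dependency abstraction layer can be sketched as a version-neutral interface whose implementation is selected only when an engine of that version is created. All names here are invented; in the real system the per-version module would be resolved from the deployment directory and loaded in isolation rather than hard-coded as below.

```java
import java.util.List;

// Hedged sketch of the multi-version dependency abstraction layer: engine code
// calls a version-neutral interface, and a per-version module supplies the
// implementation when the engine is created. Jar names are illustrative only.
public interface VersionedEngineLoader {
    List<String> dependencyJars(); // jars to load from this version's lib directory

    static VersionedEngineLoader forVersion(String version) {
        // Dispatch on the major version; only the selected module's jars are
        // ever loaded, so conflicting versions never share a class path.
        if (version.startsWith("3.")) {
            return () -> List.of("spark-sql_2.12.jar", "adapter-spark3.jar");
        }
        return () -> List.of("spark-sql_2.11.jar", "adapter-spark2.jar");
    }
}
```

Keeping the conflicting jars behind a lazily selected module is what allows one engine management service to host several Spark versions at once.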
Step S40, submitting the Spark job code to the target Spark engine to execute the job.
Finally, the Spark job code is submitted to the target Spark engine to execute the job. Specifically, the Spark job code is submitted to the driver node (Driver) of the target Spark engine, and the driver then converts the Spark job code into Spark tasks (the specific conversion process can refer to the prior art); finally, the Spark tasks are distributed to the executor nodes (Executors) deployed on the Yarn cluster to execute the job.
The embodiment of the invention provides a job execution method: when an execution request for a Spark job is received, the version number of the target Spark engine, the dynamic configuration parameters, and the Spark job code are acquired according to the execution request; the deployment directory information and version loading rules of the target Spark engine are then determined according to the version number; static configuration parameters are acquired according to the deployment directory information, and the target Spark engine is initialized with the dynamic and static configuration parameters according to the version loading rules so as to start it; the Spark job code is then submitted to the target Spark engine to execute the job. In this embodiment, by installing and deploying in advance the conflict-prone dependency jar packages of the different Spark versions, obtaining the version number in an incoming execution request to determine the corresponding deployment directory information and start-up parameters (including dynamic and static configuration parameters), and then initializing and starting the Spark engine of the target version according to those start-up parameters, dynamic loading of the dependency jar packages of different Spark versions is achieved, multi-version jar conflicts are avoided, and multiple Spark versions can be executed in parallel under the same Linkis cluster. Therefore, only one Linkis service cluster needs to be deployed, and Spark engines of different versions can be created within it to support the running of multi-version Spark jobs.
Further, since starting a Spark engine takes a long time and locks a portion of the cluster's computing resources once started successfully, an engine that is already running is not terminated immediately when a user's job completes; instead, the user's next job is executed on it right away, improving the user experience. Engine reuse saves considerable time and computing resources. However, under the session-based management of existing Linkis users, when an existing Spark engine is reused, the job is randomly submitted to any idle Spark engine. When Spark engines of different versions coexist in one environment, some of a user's jobs may be submitted to an old-version engine and others to a new-version engine, causing job execution to fail.
In this regard, based on the above-described first embodiment, a second embodiment of the job execution method of the present invention is proposed.
In this embodiment, before the step S20, the job execution method further includes:
step A, obtaining a user identifier corresponding to the execution request, and detecting whether an idle Spark engine corresponding to the user identifier and the version number exists or not;
In this embodiment, when the external service portal receives the execution request of the Spark job, it obtains not only the version number, the dynamic configuration parameters and the Spark job code of the target Spark engine, but also the user identifier corresponding to the execution request, where the user identifier may be a user name or a user number. It then detects whether there is an idle Spark engine corresponding to both the user identifier and the version number. During detection, the version number of each currently idle Spark engine and the user to which it belongs can be obtained from the engine manager and matched against the obtained user identifier and version number.
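The matching step can be sketched as follows. This is a minimal illustration of reusing an engine on the (user, version) pair; the class and field names are illustrative assumptions, not the actual Linkis data model:

```java
import java.util.List;
import java.util.Optional;

// Sketch of engine reuse: match an idle engine on BOTH the requesting
// user and the requested Spark version. Names are illustrative.
public class EngineMatcher {
    public static class SparkEngine {
        final String owner;    // user the engine belongs to
        final String version;  // Spark version the engine runs
        final boolean idle;
        public SparkEngine(String owner, String version, boolean idle) {
            this.owner = owner; this.version = version; this.idle = idle;
        }
    }

    // Returns an idle engine for (user, version), or empty if a new
    // engine must be created (step S20 onward in the text).
    public static Optional<SparkEngine> findIdleEngine(
            List<SparkEngine> engines, String user, String version) {
        return engines.stream()
                .filter(e -> e.idle
                        && e.owner.equals(user)
                        && e.version.equals(version))
                .findFirst();
    }
}
```

A busy engine or an engine of another user never matches, which is what prevents a job from being routed to a random idle engine of the wrong version.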
If not, executing step S20: determining deployment catalog information and version loading rules of the target Spark engine according to the version number;
If there is no idle Spark engine corresponding to the user identifier and the version number, a new Spark engine needs to be created. In this case, the deployment catalog information and version loading rules of the target Spark engine are determined according to the version number and the subsequent steps are executed; for the specific execution process, refer to the first embodiment, which is not repeated here.
If yes, executing step B: and submitting the Spark job code to the idle Spark engine to execute a job.
If an idle Spark engine corresponding to the user identifier and the version number exists, the Spark job code is directly submitted to that idle Spark engine to execute the job.
In this embodiment, a version tag (i.e., the version number) is added to the execution request of a Spark job, and idle Spark engines corresponding to different users and version numbers are managed by the engine manager. After the external service portal of Linkis receives the execution request, it detects whether an idle Spark engine corresponding to the user identifier and version number exists, and the job is then automatically submitted to an idle Spark engine of the matching version according to its version tag. Engine reuse is thus achieved while avoiding the situation where Spark jobs are randomly submitted to Spark engines of different versions and fail.
Further, when the Spark version needs to be updated, Linkis must kill all running Spark engines, stop the existing Spark engine management service, update the configuration files of all Spark engine management servers, and update the dependent Jar (a software package file format) packages used by all Spark engine management services. The impact on the service is therefore very large, and users' Spark jobs cannot be executed during the update. In addition, after the update there is a risk that some Spark jobs cannot be executed correctly on the new-version Spark engine.
In this regard, based on the above-described first embodiment, a third embodiment of the job execution method of the present invention is proposed.
In this embodiment, before the step S20, the job execution method further includes:
step C, obtaining a user identifier corresponding to the execution request, and judging whether the user is in a preset gray list or not according to the user identifier;
In this embodiment, when the external service portal receives the execution request of the Spark job, it obtains not only the version number, the dynamic configuration parameters and the Spark job code of the target Spark engine, but also the user identifier corresponding to the execution request, where the user identifier may be a user name or a user number. Whether the user is in a preset gray list is then judged according to the user identifier. The preset gray list is set in advance and designates part of the users for gray release, i.e., the jobs of the designated users are submitted to the new-version Spark engine under the newly deployed Linkis service cluster for execution. In implementation, the number of gray users designated in the preset gray list can be adjusted according to the success rate of the designated users' jobs on the new-version Spark engine, so that users' Spark jobs are gradually migrated to the new-version Spark engine under the newly deployed Linkis service cluster.
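The routing decision described above can be sketched as follows — a minimal illustration with names of my choosing (the real Linkis gray-list storage and routing targets are not specified here):

```java
import java.util.Set;

// Sketch of user-based gray-release routing: users on the gray list go
// to the new-version gray engine under the newly deployed cluster,
// everyone else stays on the original cluster. Names are illustrative.
public class GrayRouter {
    public enum Target { ORIGINAL_CLUSTER, GRAY_ENGINE }

    private final Set<String> grayList;

    public GrayRouter(Set<String> grayList) { this.grayList = grayList; }

    public Target route(String userId) {
        return grayList.contains(userId) ? Target.GRAY_ENGINE
                                         : Target.ORIGINAL_CLUSTER;
    }
}
```

Growing the gray list incrementally, as the text suggests, amounts to adding user identifiers to the set as the success rate on the new engine is confirmed.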
If the user is not in the preset gray list, step S20 is executed: determining deployment catalog information and version loading rules of the target Spark engine according to the version number;
if the user is in the preset gray list, executing step D: creating a gray Spark engine, and submitting the Spark job code to the gray Spark engine to execute a job.
If the user is not in the preset gray list, the user does not belong to the designated gray users. In this case the job only needs to be submitted to the Spark engine under the original Linkis service cluster for execution; specifically, the deployment catalog information and version loading rules of the target Spark engine are determined according to the version number and the subsequent steps are executed. For the specific execution process, refer to the first embodiment, which is not repeated here.
If the user is in the preset gray list, a gray Spark engine is created and the Spark job code is submitted to it to execute the job. The gray Spark engine is a new-version Spark engine under the newly deployed Linkis service cluster, and it is created in the same way as the target Spark engine in the first embodiment, which is not repeated here. In implementation, the gray Spark engine may also be created in advance, in which case the Spark job code is directly submitted to the pre-created gray Spark engine to execute the job.
It should be noted that, because the Spark engine version to be submitted to is changed, the corresponding version number and configuration parameters are all preset, and the gray Spark engine is constructed based on these preset values when it is created. In addition, because the definition of Spark version parameters is currently complicated, in a specific implementation the front end and back end can adopt a unified version-code mechanism configured into the database, for example v1 representing Spark 1.6.0, so that versions are numbered uniformly and the situation of uncontrollable multiple versions is avoided. The user is free to designate which version to load, but only a version number with a uniform number can be submitted correctly; otherwise the job submission fails.
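The unified version-code mechanism can be sketched as a shared lookup table that rejects any code outside the uniform numbering. The concrete codes and Spark versions below are illustrative, not the actual table:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the unified version-code mechanism: front end and back end
// share one table of short codes (e.g. "v1" -> Spark 1.6.0); a job that
// carries any other code fails at submission. Entries are illustrative.
public class VersionRegistry {
    private static final Map<String, String> CODES = new HashMap<>();
    static {
        CODES.put("v1", "1.6.0");
        CODES.put("v2", "2.4.3");
    }

    // Resolve a unified code to a concrete Spark version, or fail the
    // submission if the code is not registered.
    public static String resolve(String code) {
        String version = CODES.get(code);
        if (version == null) {
            throw new IllegalArgumentException(
                    "job submission failed: unknown version code " + code);
        }
        return version;
    }
}
```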
In this embodiment, a user-based version gray-release mechanism is established: gray release is performed on a designated part of the users, and the job requests of gray users are submitted to the new-version gray Spark engine for execution. In this way, when the Spark version is updated under Linkis, the existing Spark engine management service does not need to be stopped, and users' Spark jobs can still be executed during the update, so adverse effects on the service are avoided.
Further, because Spark engines of different versions differ, there may be some differences in the Spark job code they accept. When a user sends a Spark job, the code may not be written for the new Spark version it is to be submitted to; if the execution request still carries code written for the old Spark version, job execution will fail.
In this regard, based on the above-described first to third embodiments, a fourth embodiment of the job execution method of the present invention is proposed.
In this embodiment, before the step S40, the job execution method further includes:
step E, modifying the Spark job code according to the version number;
In this embodiment, after the version number of the target Spark engine and the Spark job code are obtained from the execution request of the Spark job, the Spark job code may be modified according to the version number so that the modified Spark job code is compatible with that version.
Specifically, code parsers for different versions are defined in the external service portal. The differences between versions and the corresponding modification strategies are predefined in the code parser, so that syntax replacement operations between versions are predefined based on known changes, such as package changes between versions or changes to API (Application Programming Interface) function call interfaces. When the code is modified, the code parser matches the Spark job code against the predefined inter-version code differences and their modification strategies according to the version number, and then makes the corresponding modifications according to the matching results.
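A minimal sketch of such a parser follows, assuming the modification strategies are plain text replacements keyed by target version; the sample rule (the mllib-to-ml package rename) is one illustrative example, not the actual rule set:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of a per-version code parser: each target version owns a table
// of predefined syntax replacements (old API -> new API), applied to the
// job code before submission.
public class CodeParser {
    private final Map<String, Map<String, String>> rulesByVersion = new LinkedHashMap<>();

    public void addRule(String version, String oldSyntax, String newSyntax) {
        rulesByVersion
            .computeIfAbsent(version, v -> new LinkedHashMap<>())
            .put(oldSyntax, newSyntax);
    }

    // Apply every replacement registered for the requested version;
    // code for versions with no rules passes through unchanged.
    public String rewrite(String jobCode, String version) {
        Map<String, String> rules = rulesByVersion.getOrDefault(version, Map.of());
        String out = jobCode;
        for (Map.Entry<String, String> r : rules.entrySet()) {
            out = out.replace(r.getKey(), r.getValue());
        }
        return out;
    }
}
```

A production parser would match syntactically (on the parsed code) rather than on raw strings, but the dispatch-by-version structure is the same.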
At this time, step S40 includes: and submitting the modified Spark job code to the target Spark engine to execute the job.
The modified Spark job code is then submitted to the target Spark engine to execute the job.
In this embodiment, Spark job code is automatically modified at the code submission stage according to the code differences between versions and their modification strategies, so that multi-version operation is supported without the user having to modify existing code.
Further, based on the above-described first to third embodiments, a fifth embodiment of the job execution method of the present invention is proposed.
In the present embodiment, step S40 includes:
step a41, submitting the Spark job code to a driver node of the target Spark engine;
in the present embodiment, the job execution process is as follows:
the Spark job code is first submitted to the Driver node (Driver) of the target Spark engine. The Driver is used for converting Spark job codes of users into a plurality of units for physical execution, and the units are also called tasks (tasks), and are also used for tracking the running state of the Executor and distributing all tasks to the proper Executor based on the position of the data according to the current Executor node set.
Step a42, converting the Spark job code through the driver node to obtain a Spark task;
Then, the Spark job code is converted by the driver node to obtain Spark tasks. For the specific conversion method, refer to the prior art.
And a step a43, distributing the Spark task to an executor node deployed on the Yarn cluster to execute the job.
Finally, the Spark tasks are assigned to the Executor nodes (Executors) deployed on the Yarn cluster to execute the job. The Yarn cluster is a framework providing job scheduling and cluster resource management in a big data platform, and manages and schedules the executor nodes. An executor node runs Spark tasks and returns the execution results to the driver node.
Further, since Spark is distributed, the Spark tasks dynamically generated by the Driver are serialized and then distributed to the Executors for distributed execution, while the Driver also provides a remote class-loading service. Problems can therefore occur when an Executor deserializes the serialized code for dynamic loading, and in some cases the serialized result returned by an Executor after execution cannot be correctly deserialized by the Driver. When Linkis starts a Spark engine, there is a class loader, namely the default class loader of the Spark Driver. Because the user's Spark job code is submitted to the Driver for interpretation and execution, the Driver also starts a Scala interpreter: when the Spark Driver is initialized, a SparkILoop is created as well, and since the SparkILoop reuses the Scala code interpreter, a class loader is also set inside the Scala interpreter. When user-defined code (i.e., Spark job code) is submitted to the Spark engine, it is first interpreted and executed by the Scala interpreter in the Spark Driver, so any newly defined classes live entirely in the Scala interpreter's class loader. When a serialized Spark task is submitted to an Executor for execution and a reference to an instance of a user-defined class has to be returned, the serialized execution result is sent back to the Spark Driver; but at that moment the Spark Driver's class loader is the default class loader, which lacks the class information dynamically defined by the user in the Scala interpreter, and the run fails. That is, only part of the code can run, and execution fails whenever a new user-defined class object has to be returned.
In view of the above, a sixth embodiment of the job execution method of the present invention is proposed based on the above fifth embodiment.
In this embodiment, before step a42, the job execution method further includes:
step F, in the initialization process, when a Scala interpreter is created in a driver node of the target Spark engine, a class loader of a main thread is injected into the Scala interpreter, so that the class loader of the main thread becomes a parent level of the Scala interpreter class loader, and the Scala interpreter creates a corresponding class loader according to the class loader of the parent level;
In the process of initializing the target Spark engine, a Scala interpreter is created in the driver node of the target Spark engine. At that moment, the class loader of the main thread is injected into the Scala interpreter, so that it becomes the parent of the Scala interpreter's class loader, and the Scala interpreter creates its own class loader from this parent.
at this time, step a42 includes:
converting the Spark job code through a class loader of a Scala interpreter created in the driver node to obtain a Spark task;
then, the Spark job code is converted by a class loader of a Scala interpreter created in the Driver node Driver to obtain a Spark task, and then the Spark task is distributed to an Executor node Executor deployed on the Yarn cluster to execute the job.
Further, after step a43, the job execution method further includes:
and G, when receiving a serialization execution result returned by the executor node based on the Spark task, modifying a class loader of the current thread of the target Spark engine into a class loader of the Scala interpreter so as to deserialize the serialization execution result through the class loader of the Scala interpreter.
When a serialized execution result returned by an Executor node based on the Spark task is received, the class loader of the current thread of the target Spark engine is changed to the class loader of the Scala interpreter, so that the serialized execution result is deserialized through the Scala interpreter's class loader.
In this embodiment, the class loader of the Spark engine is dynamically modified to keep the class loading of the Spark engine consistent with the class loader of the Scala interpreter in the Driver, so that the serialized execution result returned by the Executor can be correctly deserialized and parsed by the Driver.
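The swap of the current thread's class loader around deserialization can be sketched in plain JVM terms. This is an illustration of the technique only — the interpreter's loader is simulated with an ordinary child loader, and the method names are my own, not the Linkis API:

```java
// Sketch of the class-loader fix: before deserializing a result that may
// reference classes defined in the Scala interpreter, set the current
// thread's context class loader to the interpreter's loader, then
// restore the previous loader afterwards.
public class LoaderSwap {
    // Runs `deserialize` with `interpreterLoader` installed as the
    // thread's context class loader and returns the loader that was
    // in effect during the run (for demonstration).
    public static ClassLoader deserializeWith(ClassLoader interpreterLoader,
                                              Runnable deserialize) {
        Thread current = Thread.currentThread();
        ClassLoader previous = current.getContextClassLoader();
        try {
            // Dynamically defined classes now resolve during deserialization.
            current.setContextClassLoader(interpreterLoader);
            deserialize.run();
            return current.getContextClassLoader();
        } finally {
            // Always restore the engine's original loader.
            current.setContextClassLoader(previous);
        }
    }
}
```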
Furthermore, it should be noted that, because of the differences between Spark versions, some code is difficult to make compatible and cannot even be compiled after switching versions. In this case a combination of dynamic compilation and reflection can be adopted. Normally, two versions of the Spark job code, one for each Spark version, could be prepared, and which one to compile decided at runtime. However, this approach has a problem: if the values returned by the dynamically compiled code are serialized and sent to the Executor, deserialization can fail, because some of the anonymous classes generated in the process do not exist on the Executor.
To address this, a layer of self-written wrapper classes can be implemented for the classes that changed, since in Spark the constant org.apache.spark.SPARK_VERSION can be used to obtain the Spark version. Specifically, the version parameter is obtained in the code from this constant, the class of the corresponding version is dynamically loaded, and the method is then called through reflection, which avoids compile-time errors. The self-written wrapper classes thus shield the call-interface differences between versions. However, code such as the original UDFs (user-defined functions) cannot be invoked by reflection. This is because a udf function requires that the concrete types of its input and return values can be inferred; with reflection, the return value cannot be determined (it may be org.apache.spark.ml.linalg.Vector, or it may be org.apache.spark.mllib.linalg.Vector), so the code cannot compile. For this case, the Spark source code of the relevant version needs to be modified so that, according to the current version parameter, the corresponding type is dynamically loaded as the return value.
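The load-by-version-and-reflect pattern can be sketched as follows. The two wrapper classes are hypothetical stand-ins for version-specific implementations (in a real engine the version string would come from org.apache.spark.SPARK_VERSION; here it is passed in):

```java
import java.lang.reflect.Method;

// Sketch of the reflection approach: two self-written wrapper classes,
// one per Spark version, expose the same method; the caller picks the
// class name from the version string at runtime and invokes it via
// reflection, so neither branch has to compile against the other
// version's API.
public class VersionedCall {
    public static class WrapperV1 {
        public String describe() { return "old-api"; }
    }
    public static class WrapperV2 {
        public String describe() { return "new-api"; }
    }

    public static String callForVersion(String sparkVersion) throws Exception {
        String base = VersionedCall.class.getName();
        String className = base + (sparkVersion.startsWith("1.")
                ? "$WrapperV1" : "$WrapperV2");
        // Dynamically load the version-matched class and call it by name,
        // so the choice is made at runtime rather than compile time.
        Class<?> cls = Class.forName(className);
        Object instance = cls.getDeclaredConstructor().newInstance();
        Method m = cls.getMethod("describe");
        return (String) m.invoke(instance);
    }
}
```

As the text notes, this works for ordinary calls but not for UDFs, whose input and return types must be statically inferable.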
The invention also provides a job execution device.
Referring to fig. 4, fig. 4 is a schematic functional block diagram of a first embodiment of a job execution apparatus according to the present invention.
As shown in fig. 4, the job execution apparatus includes:
the first obtaining module 10 is configured to obtain, when receiving an execution request of a Spark job, a version number, a dynamic configuration parameter and a Spark job code of a target Spark engine according to the execution request;
a first determining module 20, configured to determine deployment catalog information and version loading rules of the target Spark engine according to the version number;
the engine initialization module 30 is configured to obtain static configuration parameters according to the deployment catalog information, initialize the target Spark engine according to the version loading rule using the dynamic configuration parameters and the static configuration parameters, so as to start the target Spark engine;
and the job execution module 40 is configured to submit the Spark job code to the target Spark engine to execute a job.
Further, the job execution apparatus further includes:
the detection module is used for acquiring the user identification corresponding to the execution request and detecting whether an idle Spark engine corresponding to the user identification and the version number exists or not;
the first determining module 20 is further configured to execute, if there is no idle Spark engine corresponding to the user identifier and the version number, the steps of: determining deployment catalog information and version loading rules of the target Spark engine according to the version number;
And the first submitting module is configured to, if an idle Spark engine corresponding to the user identifier and the version number exists, submit the Spark job code to the idle Spark engine to execute the job.
Further, the job execution apparatus further includes:
the judging module is used for acquiring the user identification corresponding to the execution request and judging whether the user is in a preset gray list or not according to the user identification;
the first determining module 20 is further configured to, if the user is not in the preset gray list, execute the step of: determining deployment catalog information and version loading rules of the target Spark engine according to the version number;
and the second submitting module is used for creating a gray Spark engine if the user is in the preset gray list and submitting the Spark job code to the gray Spark engine so as to execute the job.
Further, the job execution apparatus further includes:
the second determining module is used for determining a target calling method according to the version number and a preset abstract layer interface in the initializing process;
and the file package loading module is used for loading the file package which is dependent on the target Spark engine under the corresponding directory of the deployment directory information according to the target calling method.
Further, the job execution apparatus further includes:
the first modification module is used for modifying the Spark job code according to the version number;
the job execution module 40 is further configured to:
and submitting the modified Spark job code to the target Spark engine to execute the job.
Further, the job execution module 40 includes:
the code submitting unit is used for submitting the Spark job code to a driver node of the target Spark engine;
the task generating unit is used for converting the Spark job code through the driver node to obtain a Spark task;
and the task allocation unit is used for allocating the Spark task to the executor nodes deployed on the Yarn cluster so as to execute the job.
Further, the job execution apparatus further includes:
the second modification module is used for injecting the class loader of the main thread into the Scala interpreter when the Scala interpreter is created in the driver node of the target Spark engine in the initialization process, so that the class loader of the main thread becomes a parent level of the class loader of the Scala interpreter, and the Scala interpreter creates a corresponding class loader according to the class loader of the parent level;
The task generating unit is further configured to: converting the Spark job code through a class loader of a Scala interpreter created in the driver node to obtain a Spark task;
and the third modification module is used for modifying the class loader of the current thread of the target Spark engine into the class loader of the Scala interpreter when receiving the serialization execution result returned by the executor node based on the Spark task, so as to deserialize the serialization execution result through the class loader of the Scala interpreter.
The function implementation of each module in the job execution device corresponds to each step in the embodiment of the job execution method, and the function and implementation process thereof are not described in detail herein.
The present invention also provides a computer-readable storage medium having stored thereon a job execution program which, when executed by a processor, implements the steps of the job execution method according to any one of the above embodiments.
The specific embodiments of the computer readable storage medium of the present invention are substantially the same as the embodiments of the job execution method described above, and will not be described herein.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) as described above, comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to perform the method according to the embodiments of the present invention.
The foregoing description is only of the preferred embodiments of the present invention, and is not intended to limit the scope of the invention, but rather is intended to cover any equivalents of the structures or equivalent processes disclosed herein or in the alternative, which may be employed directly or indirectly in other related arts.

Claims (9)

1. A job execution method, the job execution method comprising:
when an execution request of Spark job is received, acquiring a version number, dynamic configuration parameters and Spark job codes of a target Spark engine according to the execution request;
determining deployment catalog information and version loading rules of the target Spark engine according to the version number;
acquiring static configuration parameters according to the deployment catalog information, and initializing the target Spark engine by using the dynamic configuration parameters and the static configuration parameters according to the version loading rules so as to start the target Spark engine;
submitting the Spark job code to the target Spark engine to execute a job;
before the step of determining the deployment catalog information and the version loading rule of the target Spark engine according to the version number, the method further comprises the following steps:
acquiring a user identifier corresponding to the execution request, and judging whether a user is in a preset gray list or not according to the user identifier;
if the user is not in the preset gray list, executing the steps of: determining deployment catalog information and version loading rules of the target Spark engine according to the version number;
If the user is in the preset gray list, a gray Spark engine is created, and the Spark job code is submitted to the gray Spark engine to execute the job.
2. The job execution method as set forth in claim 1, wherein before the step of determining the deployment catalog information and version loading rules of the target Spark engine according to the version number, further comprising:
acquiring a user identifier corresponding to the execution request, and detecting whether an idle Spark engine corresponding to the user identifier and the version number exists;
if not, executing the steps of: determining deployment catalog information and version loading rules of the target Spark engine according to the version number;
if so, submitting the Spark job code to the idle Spark engine to execute the job.
3. The job execution method as set forth in claim 1, wherein the job execution method further comprises:
in the initialization process, determining a target calling method according to the version number and a preset abstract layer interface;
and loading a file package which is dependent on the target Spark engine under the corresponding directory of the deployment directory information according to the target calling method.
4. A method of executing a job as claimed in any one of claims 1 to 3, wherein before the step of submitting the Spark job code to the target Spark engine to execute the job, further comprising:
modifying the Spark job code according to the version number;
the step of submitting the Spark job code to the target Spark engine to execute a job includes:
and submitting the modified Spark job code to the target Spark engine to execute the job.
5. A job execution method as claimed in any one of claims 1 to 3, wherein said step of submitting said Spark job code to said target Spark engine to execute a job comprises:
submitting the Spark job code to a driver node of the target Spark engine;
converting the Spark job code through the driver node to obtain a Spark task;
and distributing the Spark task to an executor node deployed on the Yarn cluster to execute the job.
6. The job execution method as set forth in claim 5, wherein before the step of converting the Spark job code by the driver node to obtain a Spark task, further comprising:
In the initialization process, when a Scala interpreter is created in a driver node of the target Spark engine, a class loader of a main thread is injected into the Scala interpreter, so that the class loader of the main thread becomes a parent level of the Scala interpreter class loader, and the Scala interpreter creates a corresponding class loader according to the class loader of the parent level;
the step of converting the Spark job code through the driver node to obtain a Spark task includes:
converting the Spark job code through a class loader of a Scala interpreter created in the driver node to obtain a Spark task;
after the step of distributing the Spark task to the executor nodes deployed on the Yarn cluster to execute the job, the method further includes:
and when receiving a serialization execution result returned by the executor node based on the Spark task, modifying a class loader of the current thread of the target Spark engine into a class loader of the Scala interpreter so as to deserialize the serialization execution result through the class loader of the Scala interpreter.
7. A job execution device, the job execution device comprising:
The first acquisition module is used for acquiring the version number, the dynamic configuration parameters and the Spark job code of the target Spark engine according to the execution request when the execution request of the Spark job is received;
the first determining module is used for determining deployment catalog information and version loading rules of the target Spark engine according to the version number;
the engine initialization module is used for acquiring static configuration parameters according to the deployment catalog information, and initializing the target Spark engine by using the dynamic configuration parameters and the static configuration parameters according to the version loading rules so as to start the target Spark engine;
the job execution module is used for submitting the Spark job code to the target Spark engine so as to execute a job;
wherein the job execution apparatus further includes:
the judging module is used for acquiring the user identification corresponding to the execution request and judging whether the user is in a preset gray list or not according to the user identification;
the first determining module is further configured to execute the steps if the user is not in the preset gray list: determining deployment catalog information and version loading rules of the target Spark engine according to the version number;
And the second submitting module is used for creating a gray Spark engine if the user is in the preset gray list and submitting the Spark job code to the gray Spark engine so as to execute the job.
8. A job execution system, comprising: a memory, a processor, and a job execution program stored on the memory and executable on the processor, wherein the job execution program, when executed by the processor, implements the steps of the job execution method according to any one of claims 1 to 6.
9. A computer-readable storage medium, having a job execution program stored thereon, wherein the job execution program, when executed by a processor, implements the steps of the job execution method according to any one of claims 1 to 6.
CN202010624055.5A 2020-06-30 2020-06-30 Job execution method, apparatus, system and computer readable storage medium Active CN111767092B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010624055.5A CN111767092B (en) 2020-06-30 2020-06-30 Job execution method, apparatus, system and computer readable storage medium
PCT/CN2021/081960 WO2022001209A1 (en) 2020-06-30 2021-03-22 Job execution method, apparatus and system, and computer-readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010624055.5A CN111767092B (en) 2020-06-30 2020-06-30 Job execution method, apparatus, system and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN111767092A CN111767092A (en) 2020-10-13
CN111767092B true CN111767092B (en) 2023-05-12

Family

ID=72724494

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010624055.5A Active CN111767092B (en) 2020-06-30 2020-06-30 Job execution method, apparatus, system and computer readable storage medium

Country Status (2)

Country Link
CN (1) CN111767092B (en)
WO (1) WO2022001209A1 (en)

Families Citing this family (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111767092B (en) * 2020-06-30 2023-05-12 深圳前海微众银行股份有限公司 Job execution method, apparatus, system and computer readable storage medium
CN112311603A (en) * 2020-10-30 2021-02-02 上海中通吉网络技术有限公司 Method, device and system for dynamically changing Spark user configuration
CN112286650A (en) * 2020-11-04 2021-01-29 中国电力财务有限公司 Method and device for issuing distributed service
CN112698839B (en) * 2020-12-30 2024-04-12 深圳前海微众银行股份有限公司 Data center node deployment method, device and system and computer storage medium
CN114968267A (en) * 2021-02-26 2022-08-30 京东方科技集团股份有限公司 Service deployment method, device, electronic equipment and storage medium
CN113553533A (en) * 2021-06-10 2021-10-26 国网安徽省电力有限公司 Index calculation method based on digital internal five-level market assessment system
CN113642021B (en) * 2021-08-20 2024-05-28 深信服科技股份有限公司 Service code submitting method, processing method, device and electronic equipment
CN113722019B (en) * 2021-11-04 2022-02-08 海尔数字科技(青岛)有限公司 Display method, device and equipment of platform program
CN114615135B (en) * 2022-02-18 2024-03-22 佐朋数科(深圳)信息技术有限责任公司 Front-end gray level publishing method, system and storage medium
CN115061790B (en) * 2022-06-10 2024-05-14 苏州浪潮智能科技有限公司 SPARK KMEANS core allocation method and system for ARM two-way server
CN115129325B (en) * 2022-06-29 2023-05-23 北京五八信息技术有限公司 Data processing method and device, electronic equipment and storage medium
CN114968274B (en) * 2022-07-29 2022-11-08 之江实验室 Method and system for automatically and rapidly deploying front-end processor based on gray release
CN115242877B (en) * 2022-09-21 2023-01-24 之江实验室 Spark collaborative computing and operating method and device for multiple K8s clusters
US11954525B1 (en) 2022-09-21 2024-04-09 Zhejiang Lab Method and apparatus of executing collaborative job for spark faced to multiple K8s clusters
CN115237818A (en) * 2022-09-26 2022-10-25 浩鲸云计算科技股份有限公司 Method and system for realizing multi-environment multiplexing based on full link identification
CN116048817B (en) * 2023-03-29 2023-06-27 腾讯科技(深圳)有限公司 Data processing control method, device, computer equipment and storage medium
CN116909681A (en) * 2023-06-13 2023-10-20 北京远舢智能科技有限公司 Method and device for generating data processing assembly, electronic equipment and storage medium
CN117453278B (en) * 2023-11-01 2024-05-14 国任财产保险股份有限公司 Rule management system based on business rule

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3403199A4 (en) * 2016-01-12 2019-09-18 Kavi Associates, LLC Multi-technology visual integrated data management and analytics development and deployment environment
CN105867928B (en) * 2016-03-30 2019-06-04 北京奇虎科技有限公司 A kind of method and apparatus accessing specified computation model in specified distributed system
US10275278B2 (en) * 2016-09-14 2019-04-30 Salesforce.Com, Inc. Stream processing task deployment using precompiled libraries
EP3447642B1 (en) * 2017-08-24 2022-03-23 Tata Consultancy Services Limited System and method for predicting application performance for large data size on big data cluster
CN108255689B (en) * 2018-01-11 2021-02-12 哈尔滨工业大学 Automatic Apache Spark application tuning method based on historical task analysis
CN108845884B (en) * 2018-06-15 2024-04-19 中国平安人寿保险股份有限公司 Physical resource allocation method, device, computer equipment and storage medium
CN109614167B (en) * 2018-12-07 2023-10-20 杭州数澜科技有限公司 Method and system for managing plug-ins
CN110262881A (en) * 2019-06-12 2019-09-20 深圳前海微众银行股份有限公司 A kind of submission method and device of Spark operation
CN111767092B (en) * 2020-06-30 2023-05-12 深圳前海微众银行股份有限公司 Job execution method, apparatus, system and computer readable storage medium

Also Published As

Publication number Publication date
WO2022001209A1 (en) 2022-01-06
CN111767092A (en) 2020-10-13

Similar Documents

Publication Publication Date Title
CN111767092B (en) Job execution method, apparatus, system and computer readable storage medium
US11068245B2 (en) Containerized deployment of microservices based on monolithic legacy applications
AU2022200853B2 (en) Containerized deployment of microservices based on monolithic legacy applications
US11874929B2 (en) Method and system for automatically identifying and correcting security vulnerabilities in containers
US6871345B1 (en) Self managing software agents with introspection
US9779111B2 (en) Method and system for configuration of virtualized software applications
US7047529B2 (en) Software installation and validation using custom actions
US20120011496A1 (en) Service providing apparatus, service providing system, method of processing data in service providing apparatus, and computer program
US8640121B2 (en) Facilitating multi-installer product installations
US20090094596A1 (en) Systems and methods for an adaptive installation
US20170262266A1 (en) Orchestrating the lifecycle of multiple-target applications
US10684846B2 (en) Using semantic annotations to control compatibility behaviors
US11762639B2 (en) Containerized deployment of microservices based on monolithic legacy applications
US20170351506A1 (en) Automating feature graduation
WO2024002243A1 (en) Application management method, application subscription method, and related device
US9286083B2 (en) Satisfying missing dependencies on a running system
US10514940B2 (en) Virtual application package reconstruction
KR20100089831A (en) Generation and management of managed javabean objects
US20240152779A1 (en) Apparatus and method for sharing augmented intelligence model of containerized artificial intelligence module
CN117908993B (en) Limit modification method of process
CN117648094A (en) Low-open application independent deployment method, device, terminal equipment and storage medium
JP2022009562A (en) Containerized deployment of microservices based on monolithic legacy application
CN117608562A (en) Executable program generation method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant