WO2020248708A1 - Method and apparatus for submitting a Spark job - Google Patents

Method and apparatus for submitting a Spark job (一种Spark作业的提交方法及装置)

Info

Publication number
WO2020248708A1
Authority
WO
WIPO (PCT)
Prior art keywords
spark
job
execution
node
spark job
Prior art date
Application number
PCT/CN2020/085217
Other languages
English (en)
French (fr)
Inventor
刘有
尹强
王和平
黄山
杨峙岳
邸帅
卢道和
Original Assignee
深圳前海微众银行股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳前海微众银行股份有限公司
Publication of WO2020248708A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3006 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3055 Monitoring arrangements for monitoring the status of the computing system or of the computing system component, e.g. monitoring if the computing system is on, off, available, not available
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806 Task transfer initiation or dispatching
    • G06F9/4843 Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881 Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/54 Interprogram communication
    • G06F9/546 Message passing systems or structures, e.g. queues
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00 Indexing scheme relating to G06F9/00
    • G06F2209/54 Indexing scheme relating to G06F9/54
    • G06F2209/548 Queue

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Quality & Reliability (AREA)
  • Mathematical Physics (AREA)
  • Debugging And Monitoring (AREA)
  • Hardware Redundancy (AREA)

Abstract

The present invention relates to the field of financial technology (Fintech) and discloses a method and apparatus for submitting a Spark job. The method includes: receiving an execution request of a Spark job; obtaining a node blacklist of a Yarn cluster according to the execution request; creating a Spark engine according to job parameters of the Spark job in the execution request and the node blacklist of the Yarn cluster; and sending the Spark job to the Yarn cluster through the Spark engine. By obtaining the faulty nodes in the Yarn cluster, the technical solution avoids submitting the Spark job to the faulty nodes, so as to achieve efficient execution of the Spark job.

Description

Method and apparatus for submitting a Spark job
Cross-Reference to Related Applications
This application claims priority to Chinese patent application No. 201910504561.8, entitled "Method and apparatus for submitting a Spark job", filed with the Chinese Patent Office on June 12, 2019, the entire contents of which are incorporated herein by reference.
Technical Field
Embodiments of the present invention relate to the field of big data in financial technology (Fintech), and in particular to a method and apparatus for submitting a Spark job.
Background
With the development of computer technology, more and more technologies are being applied in the financial field, and the traditional financial industry is gradually shifting toward financial technology. Spark technology is no exception, but the security and real-time requirements of the finance and payment industries also place higher demands on it.
Spark is a fast, general-purpose computing engine designed for large-scale data processing. Spark uses in-memory computing, which allows data to be analyzed and computed in memory before it has been written to disk. In the prior art, if a compute node in a Yarn cluster fails, tasks must first be assigned to that node and fail several times before the node can be identified and reported as faulty. When a Spark job is submitted, the faulty nodes cannot be known in advance, so the progress of the job submission is affected.
Summary
Embodiments of the present invention provide a method and apparatus for submitting a Spark job, which obtain the faulty nodes in a Yarn cluster and avoid submitting the Spark job to those nodes, so as to achieve efficient execution of the Spark job.
An embodiment of the present invention provides a method for submitting a Spark job, including:
receiving an execution request of a Spark job;
obtaining a node blacklist of a Yarn cluster according to the execution request, the node blacklist being a list, generated by a monitoring and alarm platform, of execution nodes that are unavailable in the Yarn cluster;
creating a Spark engine according to job parameters of the Spark job in the execution request and the node blacklist of the Yarn cluster, and sending the Spark job to the Yarn cluster through the Spark engine.
In the above technical solution, a machine blacklist mechanism is established in combination with the monitoring and alarm platform: the node blacklist is obtained from the monitoring and alarm platform, and when the engine is initialized the parameters of the execution nodes in the node blacklist are taken into account, preventing the Spark job from being sent to blacklisted execution nodes. This effectively avoids job failures caused by faulty execution nodes, thereby achieving efficient execution of the Spark job.
Optionally, the method further includes:
obtaining, according to the execution request, the resource usage of a target queue of the Spark job in the Yarn cluster;
the creating a Spark engine according to the job parameters of the Spark job in the execution request and the node blacklist of the Yarn cluster, and sending the Spark job to the Yarn cluster through the Spark engine includes:
removing the unavailable execution nodes from the Yarn cluster according to the node blacklist;
determining, from the target queue of the Spark job, a first resource queue for executing the Spark job, creating the Spark engine in combination with the job parameters of the Spark job in the execution request, and sending the Spark job to the first resource queue through the Spark engine.
In the above technical solution, the resource usage of the target queue of the Spark job is obtained, and a queue with sufficient resources is selected according to the resource usage for submitting the Spark job; that is, the first resource queue for executing the Spark job is determined, so that the Spark job is scheduled and executed efficiently.
Optionally, the method further includes:
obtaining execution information of the Spark job in the first resource queue;
after determining that the Spark job has failed, determining an error code of the Spark job according to the execution information, and adjusting the job parameters of the Spark job;
regenerating the Spark job according to the adjusted job parameters and sending it to the first resource queue.
Optionally, the job parameters of the Spark job include:
the number of execution nodes during execution of the Spark job, the memory of each execution node, network delay parameters, and the number of failure retries of each task.
In the above technical solution, the execution status of the Spark job is tracked by obtaining its execution information in the first resource queue. After the Spark job is determined to have failed, the error code can be determined from the execution information, the job parameters can be adjusted according to the definition of the error code, and the Spark job can be resubmitted with the adjusted parameters until it executes successfully. No staff are needed to locate the cause of the failure or to adjust parameters and resubmit the job manually, which improves the execution efficiency of Spark jobs and reduces labor costs.
Correspondingly, an embodiment of the present invention further provides an apparatus for submitting a Spark job, including:
a transceiver unit, configured to receive an execution request of a Spark job;
a processing unit, configured to obtain a node blacklist of a Yarn cluster according to the execution request, the node blacklist being a list, generated by a monitoring and alarm platform, of execution nodes that are unavailable in the Yarn cluster; and to create a Spark engine according to job parameters of the Spark job in the execution request and the node blacklist of the Yarn cluster, and send the Spark job to the Yarn cluster through the Spark engine.
Optionally, the processing unit is further configured to:
obtain, according to the execution request, the resource usage of a target queue of the Spark job in the Yarn cluster;
the processing unit is specifically configured to:
remove the unavailable execution nodes from the Yarn cluster according to the node blacklist;
determine, from the target queue of the Spark job, a first resource queue for executing the Spark job, create the Spark engine in combination with the job parameters of the Spark job in the execution request, and send the Spark job to the first resource queue through the Spark engine.
Optionally, the processing unit is further configured to:
obtain execution information of the Spark job in the first resource queue;
after determining that the Spark job has failed, determine an error code of the Spark job according to the execution information, and adjust the job parameters of the Spark job;
regenerate the Spark job according to the adjusted job parameters and send it to the first resource queue.
Optionally, the job parameters of the Spark job include:
the number of execution nodes during execution of the Spark job, the memory of each execution node, network delay parameters, and the number of failure retries of each task.
Correspondingly, an embodiment of the present invention further provides a computing device, including:
a memory, configured to store program instructions;
a processor, configured to call the program instructions stored in the memory and execute the above method for submitting a Spark job according to the obtained program.
Correspondingly, an embodiment of the present invention further provides a computer-readable non-volatile storage medium including computer-readable instructions which, when read and executed by a computer, cause the computer to execute the above method for submitting a Spark job.
Correspondingly, an embodiment of the present invention further provides a computer program product, the computer program product including a computer program stored on a computer-readable non-volatile storage medium, the computer program including program instructions which, when executed by a computer, cause the computer to execute the above method for submitting a Spark job.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the accompanying drawings required for describing the embodiments are briefly introduced below. Apparently, the drawings described below are only some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
FIG. 1 is a schematic diagram of a system architecture provided by an embodiment of the present invention;
FIG. 2 is a schematic flowchart of a method for submitting a Spark job provided by an embodiment of the present invention;
FIG. 3 is a schematic diagram of another method for submitting a Spark job provided by an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of an apparatus for submitting a Spark job provided by an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a computing device provided by this application.
Detailed Description
To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings. Apparently, the described embodiments are only some rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
In the Spark-on-Yarn cluster mode, there are two main ways of running jobs: Yarn-Cluster mode and Yarn-Client mode.
In Yarn-Cluster mode, the driver node (Driver) runs in the AM (Application Master), which is responsible for applying for resources from Yarn and supervising the running status of the job. After the user submits the job, the client (Client) can be shut down and the job continues to run on Yarn; Yarn-Cluster mode is therefore not suitable for running interactive jobs.
In Yarn-Client mode, the AM requests execution nodes (Executors) from Yarn, and the Client communicates with the requested containers (Containers) to schedule the work of the execution nodes. Client mode is a more controllable way of running jobs, making it easy to obtain information such as the job's running progress, logs, and results.
Both Yarn-Cluster mode and Yarn-Client mode generally use the Spark-Submit command officially provided by Spark, and the parameters required for running must be set, such as the Yarn queue name, the number of execution nodes together with the number of CPUs (Central Processing Units) and memory size of each, and the number of CPUs and memory size of the driver node.
For example, the Spark-Submit command may be as follows:
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --queue QUEUE_NAME \
  --driver-memory 4G \
  --deploy-mode cluster \  # can be client for client mode
  --executor-memory 20G \
  --num-executors 50 \
  /path/to/examples.jar \
  1000
It should be noted that a submitted Spark job does not necessarily execute correctly every time it runs on the Yarn cluster. In a Yarn cluster with many nodes, an application sometimes fails because of a disk fault on a node, sometimes because of an OOM (Out Of Memory) error, and sometimes a job cannot be scheduled for execution because queue resources are tight; such problems prevent jobs from executing correctly unattended.
For the above problems, and considering data locality, the existing approach is likely to keep dispatching tasks to the faulty nodes described above; once the task failures reach a certain number, the Spark job ultimately fails. If the computing resources are configured unreasonably, nodes run into OOM errors, the Spark job ends in error, and repeated submissions still cannot make it run correctly. Similarly, when too many jobs run in a queue at the same time, some high-priority jobs keep waiting in line and cannot be scheduled in time. Clearly, the existing approach suffers from the above defects.
In the embodiments of the present invention, it was found by summarizing and classifying the failures that most of them are caused by the relatively common errors above. The simple job submission approach that relies on native Spark cannot improve the success rate of job execution; after a job fails, a large amount of manual intervention is still required, and locating the problem consumes considerable time and effort. This situation falls seriously short of the actual needs of banks and other financial institutions. Based on these defects, the method for submitting a Spark job of the present invention is proposed.
FIG. 1 exemplarily shows a system architecture to which the method for submitting a Spark job provided by an embodiment of the present invention is applicable. The system architecture may include an engine management server 100, an external service portal 200, a Yarn cluster 300, and a monitoring and alarm platform 400.
The engine management server 100 is responsible for managing the creation, state tracking, and destruction of the process in which the Spark Context resides. The engine management server 100 includes one or more Spark Yarn Client engines, multiple of which can execute in parallel; each Spark Yarn Client engine communicates with the external service portal 200 through RPC (Remote Procedure Call). In some embodiments, the Spark Yarn Client engine may also be called the Spark engine or the Spark Yarn client engine.
The external service portal 200 includes a job management module and an execution queue. It is responsible for receiving submissions of external jobs and tracking their execution status. When an exception occurs while a job is running, the external service portal 200 can automatically adjust the job parameter settings according to the error code of the job and automatically retry the submission. In this example, the external service portal can receive Spark job execution requests of types such as HTTP (HyperText Transfer Protocol) requests and Socket requests.
The Yarn cluster 300 is a framework that provides job scheduling and cluster resource management in a big data platform.
The monitoring and alarm platform 400 is used to monitor the running status of the nodes in the Yarn cluster 300 while jobs are running, to discover faulty nodes in time, such as nodes with unreadable or full disks, and to establish a node blacklist. The node blacklist is a list, generated by the monitoring and alarm platform 400, of execution nodes that are unavailable in the Yarn cluster 300. The node blacklist in the monitoring and alarm platform 400 can be obtained by the engine management server 100 and used to filter out the blacklisted execution nodes when starting a Spark job, thereby preventing the job from being scheduled onto these faulty nodes.
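For illustration only, the following Scala sketch shows one way the engine management server 100 might fetch the node blacklist and keep executors off the listed hosts. The /blacklist endpoint and its comma-separated response are assumptions (the embodiment does not specify the platform's interface), and the spark.yarn.exclude.nodes property exists only in newer Spark releases, so it should be verified against the Spark version in use:
import scala.io.Source

object NodeBlacklist {
  // Fetch the blacklist from the monitoring and alarm platform;
  // the "/blacklist" endpoint and comma-separated payload are assumed here.
  def fetch(platformUrl: String): Set[String] =
    Source.fromURL(s"$platformUrl/blacklist").mkString
      .split(",").map(_.trim).filter(_.nonEmpty).toSet

  // Hand the exclusion list to Spark on Yarn when building the submit command
  // (verify the property name against your Spark version).
  def asSubmitArgs(blacklist: Set[String]): Seq[String] =
    Seq("--conf", s"spark.yarn.exclude.nodes=${blacklist.mkString(",")}")
}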
Based on the above description, FIG. 2 exemplarily shows the flow of a method for submitting a Spark job provided by an embodiment of the present invention. The flow may be executed by an apparatus for submitting a Spark job.
As shown in FIG. 2, the flow specifically includes:
Step 201: receive an execution request of a Spark job.
Step 202: obtain a node blacklist of the Yarn cluster according to the execution request.
The node blacklist is a list, generated by the monitoring and alarm platform, of execution nodes that are unavailable in the Yarn cluster.
Step 203: create a Spark engine according to the job parameters of the Spark job in the execution request and the node blacklist of the Yarn cluster, and send the Spark job to the Yarn cluster through the Spark engine.
The external service portal receives the execution request of the Spark job sent by an external system and puts the Spark job into the execution queue; the external service portal tracks the execution status of the Spark jobs in the execution queue. The external service portal then sends the Spark jobs in the execution queue to the engine management server. The engine management server determines the Spark job to be executed through a unified external interface and obtains the node blacklist of the Yarn cluster from the monitoring and alarm platform. The engine management server creates a Spark engine according to the obtained node blacklist of the Yarn cluster and the job parameters of the Spark job, and submits the Spark job to the Yarn cluster through the Spark engine.
It should be noted that when a submitted Spark job cannot obtain computing resources in the Yarn cluster, it remains in the Accepted state and never actually gets scheduled for execution; the task then has to be resubmitted manually, which affects the progress of job submission. Therefore, to further improve the efficiency of job submission, in this embodiment, when the engine management server receives a Spark job it determines, according to the resource usage of the current resource queues, a queue with sufficient resources for submitting the Spark job. That is, through the Yarn interface it actively obtains the resource usage of the target queue of the Spark job in the Yarn cluster, determines from the target queue the first resource queue for executing the Spark job, and sends the Spark job to the first resource queue through the Spark engine, thereby achieving efficient scheduling and execution of the Spark job.
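As a minimal sketch of this "Yarn interface" step: the standard Hadoop Yarn ResourceManager REST API exposes scheduler and queue information at /ws/v1/cluster/scheduler; the queue-picking rule below (take the first candidate whose usage is under a threshold) is an illustrative assumption, not the algorithm fixed by the embodiment:
import scala.io.Source

object QueueSelector {
  // Raw scheduler JSON from the Yarn ResourceManager REST API,
  // e.g. rmUrl = "http://rm-host:8088".
  def schedulerInfo(rmUrl: String): String =
    Source.fromURL(s"$rmUrl/ws/v1/cluster/scheduler").mkString

  // Given per-queue usage percentages parsed from that JSON, pick the first
  // target queue below the threshold as the "first resource queue".
  def pickQueue(usagePercent: Map[String, Double],
                targets: Seq[String],
                threshold: Double = 80.0): Option[String] =
    targets.find(q => usagePercent.getOrElse(q, 100.0) < threshold)
}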
That is, the method further includes:
obtaining, according to the execution request, the resource usage of the target queue of the Spark job in the Yarn cluster;
the creating a Spark engine according to the job parameters of the Spark job in the execution request and the node blacklist of the Yarn cluster, and sending the Spark job to the Yarn cluster through the Spark engine includes:
removing the unavailable execution nodes from the Yarn cluster according to the node blacklist;
determining, from the target queue of the Spark job, the first resource queue for executing the Spark job, creating the Spark engine in combination with the job parameters of the Spark job in the execution request, and sending the Spark job to the first resource queue through the Spark engine.
It should be noted that the Yarn cluster contains multiple resource queues, each resource queue corresponds to a resource pool, and each resource pool consists of multiple nodes, including but not limited to Spark driver nodes, AM nodes, and execution nodes. The multiple nodes in the resource pool of each resource queue form a Spark application, which then executes the Spark job. Therefore, after the computing resources and the node blacklist of the Yarn cluster are obtained, the unavailable execution nodes are first removed according to the node blacklist; the resource usage of the target queue of the Spark job in the Yarn cluster is then obtained to find the currently idle resource queues; after that, nodes that satisfy the job parameters of the Spark job in the execution request are selected from an idle resource queue, and the Spark engine is created according to the identifiers of the selected nodes. The selected nodes form a Spark application, and the Spark engine then sends the Spark job to the Spark application in the first resource queue, thereby executing the Spark job.
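A compact Scala sketch of this filter-then-select step follows; the NodeInfo record and the minimum-cores/memory criterion are illustrative assumptions standing in for whatever node attributes the job parameters actually constrain:
// Illustrative node record; the embodiment selects nodes by their identifiers.
final case class NodeInfo(host: String, cores: Int, memGb: Int)

object NodeFilter {
  // Step 1: drop execution nodes that appear in the node blacklist;
  // step 2: keep only nodes that satisfy the job parameters.
  def selectNodes(pool: Seq[NodeInfo],
                  blacklist: Set[String],
                  minCores: Int,
                  minMemGb: Int): Seq[NodeInfo] =
    pool.filterNot(n => blacklist.contains(n.host))
        .filter(n => n.cores >= minCores && n.memGb >= minMemGb)
}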
In the embodiment of the present invention, the Spark engine submits the Spark job to the Spark application of the Yarn cluster; specifically, it sends the job to the RM (Resource Manager) node in the Yarn cluster, the RM node starts the AM node, and the AM node requests execution nodes from the RM node so that the execution nodes are started to run the submitted Spark job.
In addition, the engine management server wraps the creation of the Spark Context, using it for the creation and destruction of the Spark engine, and maintains heartbeat information periodically. Before the Spark engine is started, the resource usage of each resource queue can be actively obtained from the Yarn cluster, and a relatively idle resource queue can be determined from them as the first resource queue to which the Spark job is sent. Further, Spark jobs that have been submitted but have not been scheduled by Yarn after a preset period can be actively deleted, preventing a submitted Spark job from waiting so long, because of resource-queue problems in the Yarn cluster, that the final batch timing scheduling is affected.
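A minimal sketch of that preset-period cleanup rule is given below; the SubmittedJob record and the kill callback are illustrative assumptions standing in for whatever removal action the engine management server actually uses (e.g. a yarn application -kill invocation):
import java.time.{Duration, Instant}

// Illustrative record of a submitted-but-not-yet-scheduled job.
final case class SubmittedJob(id: String, submittedAt: Instant)

object StaleJobReaper {
  // Withdraw jobs that Yarn has not scheduled within the preset period.
  def reap(pending: Seq[SubmittedJob],
           maxWait: Duration,
           kill: SubmittedJob => Unit,
           now: Instant = Instant.now()): Seq[SubmittedJob] = {
    val (stale, fresh) = pending.partition(j =>
      Duration.between(j.submittedAt, now).compareTo(maxWait) > 0)
    stale.foreach(kill)
    fresh // jobs still within the allowed waiting window
  }
}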
In addition, the external service portal can also track the execution status of the Spark job. In a specific implementation, the external service portal obtains the execution information of the Spark job in the first resource queue, and after detecting that the Spark job has failed, it can determine the error code of the Spark job from the execution information, adjust the job parameters of the Spark job according to the error code, regenerate the Spark job according to the adjusted job parameters, and send it to the first resource queue.
Optionally, the external service portal can configure targeted parameter-adjustment policies for different Spark job error codes. Common Spark job errors include OOM occurring on a node as a result of a large number of shuffle operations, and execution nodes being lost because of GC (Garbage Collection) problems, the network, or other causes. Combined with these common errors, the job parameters that can be adjusted automatically include, but are not limited to, the number of execution nodes during execution of the Spark job, the memory of each execution node, network delay parameters, and the number of failure retries of each task (Task). For example, when an OOM occurs, the memory size of each execution node can be adjusted, e.g. the originally configured 2G of memory is doubled, that is, set to 2 × 2G, and the job is resubmitted.
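The following Scala sketch illustrates such an error-code-to-adjustment table; the error-code labels and the parameter field names are illustrative assumptions, since the embodiment leaves the exact code definitions open:
// Job parameters named in this embodiment: executor count, per-executor
// memory, network delay, per-task retry count (field names are illustrative).
final case class JobParams(numExecutors: Int,
                           executorMemGb: Int,
                           networkTimeoutSec: Int,
                           taskMaxFailures: Int)

object ParamAdjuster {
  def adjust(p: JobParams, errorCode: String): JobParams = errorCode match {
    case "OOM"           => p.copy(executorMemGb = p.executorMemGb * 2) // e.g. 2G -> 2x2G
    case "EXECUTOR_LOST" => p.copy(networkTimeoutSec = p.networkTimeoutSec * 2,
                                   taskMaxFailures = p.taskMaxFailures + 2)
    case _               => p // unknown code: resubmit unchanged
  }
}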
To better explain the embodiment of the present invention, the submission flow of the Spark job is described below in a specific implementation scenario, as shown in FIG. 3, specifically as follows:
Step 301: the external system sends an execution request of the Spark job.
The execution request may be an HTTP request, a Socket request, or the like.
Step 302: the external service portal receives the execution request of the Spark job, puts the Spark job into the execution queue, and tracks the execution status of the Spark job.
Step 303: the external service portal sends a Spark engine creation request to the engine management server through the RPC interface of the engine management server.
Step 304: the engine management server queries the RM in the Yarn cluster for computing resources.
Step 305: the engine management server obtains the node blacklist from the monitoring and alarm platform.
Step 306: the engine management server creates a Spark engine.
Step 307: the engine management server submits the Spark job to the RM of the Yarn cluster through the driver node in the Spark engine.
Step 308: after receiving the request of the Spark job, the RM in the Yarn cluster starts the AM in the Yarn cluster, where the AM runs in a Container.
Step 309: the AM in the Yarn cluster requests execution nodes from the RM.
Step 310: the AM in the Yarn cluster starts the requested execution nodes.
Step 311: the Spark engine created by the engine management server receives the Spark job sent by the external service portal, such as SQL (Structured Query Language) or Scala code.
Step 312: the Spark driver node in the Spark engine sends tasks to the execution nodes in the Yarn cluster.
Step 313: the execution nodes in the Yarn cluster execute the tasks, and the node monitoring thread monitors the running status of the nodes and sends it to the monitoring and alarm platform.
Step 314: the external service portal obtains the execution information of the Spark job and generates information such as the execution result, the execution log, and the execution status.
The embodiments of the present invention can be applied to the field of financial technology (Fintech). Financial technology refers to a new kind of innovative technology brought to the financial field by integrating information technology into it. Using advanced information technology to assist financial operations, transaction execution, and financial system improvement can increase the processing efficiency and business scale of the financial system and reduce costs and financial risks. For example, Spark can be used in a bank for whitelist and blacklist analysis of users, and ETL (Extract-Transform-Load: data extraction, cleaning, transformation, and loading) jobs can be executed in a bank based on Spark. When executing such Spark applications, efficient execution of Spark jobs can be achieved by reasonably scheduling the computing resources in the Yarn cluster.
This technical solution manages and controls the Client, tracks and processes the progress of Spark jobs, formulates parameter adjustments for common errors, and can automatically optimize, retry, and resubmit Spark jobs. The beneficial effects of this technical solution are as follows:
(1) After the execution request of the Spark job is received, the Spark job in the execution request is submitted to the Yarn cluster so that the Yarn cluster starts the resource queue. When the Spark job is received, a queue with sufficient resources is determined for its submission according to the resource usage of the current resource queues, that is, the first resource queue for executing the Spark job is determined, so that the Spark job is scheduled and executed efficiently; and Spark jobs that have been submitted but have not been scheduled by Yarn after a preset period are actively deleted, preventing a submitted Spark job from waiting so long, because of resource-queue problems in the Yarn cluster, that the final batch timing scheduling is affected.
(2) A machine blacklist mechanism is established in combination with the monitoring and alarm platform: the node blacklist is obtained from the monitoring and alarm platform, cluster nodes that report certain specific alarm errors are brought under node-blacklist management, and when the engine is initialized the parameters of the execution nodes in the node blacklist are taken into account, preventing the Spark job from being submitted to blacklisted execution nodes and effectively avoiding job failures caused by faulty execution nodes.
(3) The execution status of the Spark job is tracked by obtaining its execution information in the first resource queue. After the Spark job is determined to have failed, the error code can be determined from the execution information, the job parameters can be adjusted according to the definition of the error code, and the Spark job can be resubmitted with the adjusted parameters until it executes successfully. No staff are needed to locate the cause of the failure or to adjust parameters and resubmit the job manually, which improves the execution efficiency of Spark jobs and reduces labor costs.
Based on the same inventive concept, FIG. 4 exemplarily shows the structure of an apparatus for submitting a Spark job provided by an embodiment of the present invention. The apparatus can execute the flow of the method for submitting a Spark job.
The transceiver unit 401 is configured to receive an execution request of a Spark job.
The processing unit 402 is configured to obtain a node blacklist of a Yarn cluster according to the execution request, the node blacklist being a list, generated by a monitoring and alarm platform, of execution nodes that are unavailable in the Yarn cluster; and to create a Spark engine according to job parameters of the Spark job in the execution request and the node blacklist of the Yarn cluster, and send the Spark job to the Yarn cluster through the Spark engine.
Optionally, the processing unit 402 is further configured to:
obtain, according to the execution request, the resource usage of the target queue of the Spark job in the Yarn cluster;
the processing unit 402 is specifically configured to:
remove the unavailable execution nodes from the Yarn cluster according to the node blacklist;
determine, from the target queue of the Spark job, the first resource queue for executing the Spark job, create the Spark engine in combination with the job parameters of the Spark job in the execution request, and send the Spark job to the first resource queue through the Spark engine.
Optionally, the processing unit 402 is further configured to:
obtain the execution information of the Spark job in the first resource queue;
after determining that the Spark job has failed, determine the error code of the Spark job according to the execution information, and adjust the job parameters of the Spark job;
regenerate the Spark job according to the adjusted job parameters and send it to the first resource queue.
Optionally, the job parameters of the Spark job include:
the number of execution nodes during execution of the Spark job, the memory of each execution node, network delay parameters, and the number of failure retries of each task.
Based on the same inventive concept as the method described above, this application further provides a computing device, as shown in FIG. 5. The computing device includes:
a processor 501, a memory 502, a transceiver 503, and a bus interface 504, where the processor 501, the memory 502, and the transceiver 503 are connected through a bus;
the processor 501 is configured to read the program in the memory 502 and execute the above method for submitting a Spark job.
The processor 501 may be a central processing unit (CPU), a network processor (NP), or a combination of a CPU and an NP. It may also be a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
The memory 502 is configured to store one or more executable programs and can store data used by the processor 501 when performing operations.
Specifically, a program may include program code, and the program code includes computer operation instructions. The memory 502 may include volatile memory, such as random-access memory (RAM); the memory 502 may also include non-volatile memory, such as flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); the memory 502 may also include a combination of the above kinds of memory.
The memory 502 stores the following elements, executable modules or data structures, or a subset or an extended set thereof:
Operation instructions: including various operation instructions, used to implement various operations.
Operating system: including various system programs, used to implement various basic services and to process hardware-based tasks.
The bus may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one thick line is used in FIG. 5, but this does not mean that there is only one bus or one type of bus.
The bus interface 504 may be a wired communication access port, a wireless bus interface, or a combination thereof, where the wired bus interface may be, for example, an Ethernet interface. The Ethernet interface may be an optical interface, an electrical interface, or a combination thereof. The wireless bus interface may be a WLAN interface.
Based on the same inventive concept, an embodiment of the present invention further provides a computer-readable non-volatile storage medium including computer-readable instructions which, when read and executed by a computer, cause the computer to execute the above method for submitting a Spark job.
Based on the same inventive concept, an embodiment of the present invention further provides a computer program product, the computer program product including a computer program stored on a computer-readable non-volatile storage medium, the computer program including program instructions which, when executed by a computer, cause the computer to execute the above method for submitting a Spark job.
The present invention is described with reference to the flowcharts and/or block diagrams of the method, device (system), and computer program product according to the embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or other programmable data processing equipment to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing equipment produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions can also be stored in a computer-readable memory that can direct a computer or other programmable data processing equipment to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus, and the instruction apparatus implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions can also be loaded onto a computer or other programmable data processing equipment, so that a series of operation steps are executed on the computer or other programmable equipment to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable equipment provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although preferred embodiments of the present invention have been described, those skilled in the art can make additional changes and modifications to these embodiments once they learn the basic inventive concept. Therefore, the appended claims are intended to be construed as including the preferred embodiments and all changes and modifications that fall within the scope of the present invention.
Obviously, those skilled in the art can make various changes and variations to the present invention without departing from the spirit and scope of the present invention. Thus, if these modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalent technologies, the present invention is also intended to include these changes and variations.

Claims (11)

  1. A method for submitting a Spark job, comprising:
    receiving an execution request of a Spark job;
    obtaining a node blacklist of a Yarn cluster according to the execution request, wherein the node blacklist is a list, generated by a monitoring and alarm platform, of execution nodes that are unavailable in the Yarn cluster;
    creating a Spark engine according to job parameters of the Spark job in the execution request and the node blacklist of the Yarn cluster, and sending the Spark job to the Yarn cluster through the Spark engine.
  2. The method according to claim 1, further comprising:
    obtaining, according to the execution request, resource usage of a target queue of the Spark job in the Yarn cluster;
    wherein the creating a Spark engine according to the job parameters of the Spark job in the execution request and the node blacklist of the Yarn cluster, and sending the Spark job to the Yarn cluster through the Spark engine comprises:
    removing the unavailable execution nodes from the Yarn cluster according to the node blacklist;
    determining, from the target queue of the Spark job, a first resource queue for executing the Spark job, creating the Spark engine in combination with the job parameters of the Spark job in the execution request, and sending the Spark job to the first resource queue through the Spark engine.
  3. The method according to claim 2, further comprising:
    obtaining execution information of the Spark job in the first resource queue;
    after determining that the Spark job has failed, determining an error code of the Spark job according to the execution information, and adjusting the job parameters of the Spark job;
    regenerating the Spark job according to the adjusted job parameters and sending it to the first resource queue.
  4. The method according to claim 3, wherein the job parameters of the Spark job comprise:
    the number of execution nodes during execution of the Spark job, the memory of each execution node, network delay parameters, and the number of failure retries of each task.
  5. An apparatus for submitting a Spark job, comprising:
    a transceiver unit, configured to receive an execution request of a Spark job;
    a processing unit, configured to obtain a node blacklist of a Yarn cluster according to the execution request, wherein the node blacklist is a list, generated by a monitoring and alarm platform, of execution nodes that are unavailable in the Yarn cluster; and to create a Spark engine according to job parameters of the Spark job in the execution request and the node blacklist of the Yarn cluster, and send the Spark job to the Yarn cluster through the Spark engine.
  6. The apparatus according to claim 5, wherein the processing unit is further configured to:
    obtain, according to the execution request, resource usage of a target queue of the Spark job in the Yarn cluster;
    the processing unit is specifically configured to:
    remove the unavailable execution nodes from the Yarn cluster according to the node blacklist;
    determine, from the target queue of the Spark job, a first resource queue for executing the Spark job, create the Spark engine in combination with the job parameters of the Spark job in the execution request, and send the Spark job to the first resource queue through the Spark engine.
  7. The apparatus according to claim 6, wherein the processing unit is further configured to:
    obtain execution information of the Spark job in the first resource queue;
    after determining that the Spark job has failed, determine an error code of the Spark job according to the execution information, and adjust the job parameters of the Spark job;
    regenerate the Spark job according to the adjusted job parameters and send it to the first resource queue.
  8. The apparatus according to claim 7, wherein the job parameters of the Spark job comprise:
    the number of execution nodes during execution of the Spark job, the memory of each execution node, network delay parameters, and the number of failure retries of each task.
  9. A computing device, comprising:
    a memory, configured to store program instructions;
    a processor, configured to call the program instructions stored in the memory and execute the method according to any one of claims 1 to 4 according to the obtained program.
  10. A computer-readable non-volatile storage medium, comprising computer-readable instructions which, when read and executed by a computer, cause the computer to execute the method according to any one of claims 1 to 4.
  11. A computer program product, comprising a computer program stored on a computer-readable non-volatile storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to execute the method according to any one of claims 1 to 4.
PCT/CN2020/085217 2019-06-12 2020-04-16 Method and apparatus for submitting a Spark job WO2020248708A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910504561.8 2019-06-12
CN201910504561.8A CN110262881A (zh) 2019-06-12 Method and apparatus for submitting a Spark job

Publications (1)

Publication Number Publication Date
WO2020248708A1 (zh)

Family

ID=67917731

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/085217 WO2020248708A1 (zh) Method and apparatus for submitting a Spark job

Country Status (2)

Country Link
CN (1) CN110262881A (zh)
WO (1) WO2020248708A1 (zh)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110262881A (zh) * 2019-06-12 2019-09-20 深圳前海微众银行股份有限公司 Method and apparatus for submitting a Spark job
CN112540858B (zh) * 2019-09-23 2023-10-27 华为云计算技术有限公司 Task processing method, server, client, and system
CN111031123B (zh) * 2019-12-10 2022-06-03 中盈优创资讯科技有限公司 Spark task submission method, system, client, and server
CN113407331A (zh) * 2020-03-17 2021-09-17 腾讯科技(深圳)有限公司 Task processing method, apparatus, and storage medium
CN111767092B (zh) * 2020-06-30 2023-05-12 深圳前海微众银行股份有限公司 Job execution method, apparatus, system, and computer-readable storage medium
CN112000734A (zh) * 2020-08-04 2020-11-27 中国建设银行股份有限公司 Big data processing method and apparatus
CN112328403B (zh) * 2020-11-25 2024-06-25 北京中天孔明科技股份有限公司 SparkContext configuration method, apparatus, and server
CN112486468A (zh) * 2020-12-15 2021-03-12 恩亿科(北京)数据科技有限公司 Task execution method, system, and computer device based on the Spark kernel

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105653928A (zh) * 2016-02-03 2016-06-08 北京大学 Denial-of-service detection method for big data platforms
CN106980699A (zh) * 2017-04-14 2017-07-25 中国科学院深圳先进技术研究院 Data processing platform and system
CN110262881A (zh) * 2019-06-12 2019-09-20 深圳前海微众银行股份有限公司 Method and apparatus for submitting a Spark job
US20190370146A1 (en) * 2018-06-05 2019-12-05 Shivnath Babu System and method for data application performance management

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102033777B (zh) * 2010-09-17 2013-03-20 中国资源卫星应用中心 ICE-based distributed job scheduling engine
US9396031B2 (en) * 2013-09-27 2016-07-19 International Business Machines Corporation Distributed UIMA cluster computing (DUCC) facility
CN105205169B (zh) * 2015-10-12 2018-06-15 中国电子科技集团公司第二十八研究所 Distributed image indexing and retrieval method
US10305747B2 (en) * 2016-06-23 2019-05-28 Sap Se Container-based multi-tenant computing infrastructure
CN109684051B (zh) * 2018-12-17 2020-08-11 杭州玳数科技有限公司 Method and system for asynchronous submission of hybrid big data tasks

Also Published As

Publication number Publication date
CN110262881A (zh) 2019-09-20

Similar Documents

Publication Publication Date Title
WO2020248708A1 (zh) Method and apparatus for submitting a Spark job
US11250025B2 Methods and systems for bulk uploading of data in an on-demand service environment
US10453010B2 Computer device, method, and apparatus for scheduling business flow
WO2021237829A1 (zh) Method and system for integrating a code repository with computing services
WO2020233212A1 (zh) Log record processing method, server, and storage medium
US10261853B1 Dynamic replication error retry and recovery
US8365193B2 Recoverable asynchronous message driven processing in a multi-node system
US8166350B2 Apparatus and method for persistent report serving
CN110806933B (zh) Batch task processing method, apparatus, device, and storage medium
US9495201B2 Management of bottlenecks in database systems
US10455264B2 Bulk data extraction system
CN112000455B (zh) Multi-threaded task processing method, apparatus, and electronic device
US20060143290A1 Session monitoring using shared memory
JP2008015888A (ja) Load distribution control system and load distribution control method
CN111160873B (zh) Batch-run processing apparatus and method based on a distributed architecture
US10545815B2 System and method for data redistribution in a database
US20130254771A1 Systems and methods for continual, self-adjusting batch processing of a data stream
US20190238605A1 Verification of streaming message sequence
CN113157411B (zh) Reliable and configurable task system and apparatus based on Celery
WO2021118624A1 Efficient transaction log and database processing
WO2013155935A1 (zh) Method for data communication between a welding power supply and a computer
US10089375B2 Idling individually specified objects during data replication
CN112199432A (zh) Distributed high-performance data ETL apparatus and control method
CN116719623A (zh) Job scheduling method, job result processing method, and apparatus therefor
CN115904640A (zh) Distributed task processing system and method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20823627

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20823627

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205 DATED 21/03/2022)

122 Ep: pct application non-entry in european phase

Ref document number: 20823627

Country of ref document: EP

Kind code of ref document: A1