CN115237435A

CN115237435A - Method for deploying PyFlink task to yarn cluster

Info

Publication number: CN115237435A
Application number: CN202210951622.7A
Authority: CN
Inventors: 李志强; 陈吉平
Original assignee: Hangzhou Daishu Technology Co ltd
Current assignee: Hangzhou Daishu Technology Co ltd
Priority date: 2022-08-09
Filing date: 2022-08-09
Publication date: 2022-10-25
Anticipated expiration: 2042-08-09
Also published as: CN116382773A; CN115237435B

Abstract

The application discloses a method for deploying PyFlink tasks to a yarn cluster, which relates to the technical field of big data computing and comprises the following steps: downloading all resource files of a PyFlink task uploaded by a front end to a back end, and acquiring Python related information; constructing a PackagedProgram parameter of the PyFlink task according to the resource files and the Python related information, and calling the deployJobCluster method of a YarnClusterDescriptor to upload all related files of the PyFlink task to HDFS; starting a Python process, generating a JobGraph according to the logic of the PyFlink task, and submitting the JobGraph to the yarn cluster through the YarnClusterDescriptor. With this method, the uploaded resources and dependencies are directly reused when the PyFlink task is submitted, and the dependencies of the PyFlink task and the PyFlink environment do not need to be installed on the client in advance, so that the PyFlink task can be run in different PyFlink environments.

Description

Method for deploying PyFlink task to yarn cluster

The application relates to the technical field of big data computing, in particular to a method for deploying PyFlink tasks to a yarn cluster.

Background

In the prior art, a Flink task written in Python is mainly submitted from the command line. This requires the user to install a Python environment and the related Python dependencies on the client in advance, and the Python environment dependencies must be manually uploaded to the server for every submission. As a result, the Python environment dependencies cannot be reused, the process is cumbersome, and resources such as Python program files, jar package dependencies and Python dependencies cannot be managed effectively.

Disclosure of Invention

The method for deploying PyFlink tasks to a yarn cluster aims to effectively manage the Python program files, jar package dependencies, Python dependencies and other resources related to a PyFlink task.

In order to achieve the purpose, the following technical scheme is adopted in the application:

the method for deploying the PyFlink task to the yarn cluster comprises the following steps:

downloading all resource files of PyFlink tasks uploaded by a front end to a back end, and acquiring Python related information;

constructing a PackagedProgram parameter of the PyFlink task according to the resource files and the Python related information, and calling the deployJobCluster method of a YarnClusterDescriptor to upload all related files of the PyFlink task to HDFS;

starting a Python process, generating a JobGraph according to the logic of the PyFlink task, and submitting the JobGraph to the yarn cluster through the YarnClusterDescriptor.

Preferably, the downloading all resource files of the PyFlink task uploaded by the front end to the back end includes:

receiving a resource file of a PyFlink task uploaded by a front end, wherein the resource file comprises a Python file, a PyFlink environment compression package and a third party dependent jar package;

and storing the resource files into different storage media according to resource types, wherein the storage media comprise HDFS and SFTP, and downloading all the resource files to a back end.

Preferably, after downloading all resource files of the PyFlink task uploaded by the front end to the back end, the method further includes:

and decompressing the PyFlink environment compressed package in the resource files, and encapsulating the decompressed directory path into PyFlinkInfo.

Preferably, the obtaining Python related information includes:

and searching for the flink-python jar package under the back-end Flink lib directory, setting the jar package path into PyFlinkInfo, and acquiring path information of the PyFlink environment downloaded to the back end, path information of the PyFlink environment running on the yarn cluster, and path information of the resource files stored in the HDFS and the SFTP.

Preferably, the PackagedProgram parameter includes a Python file, Python input parameters, the path of the back-end PyFlink environment, and the PyFlink environment path of the yarn cluster, where the Python input parameters belong to the resource files.

Preferably, the related files include all packages in the Flink lib directory, the jar packages on which the PyFlink task depends, log configuration files, HDFS configuration files, and yarn configuration files.

Preferably, the starting of the Python process, the generating of the JobGraph according to the logic of the PyFlink task, and the submitting of the JobGraph to the yarn cluster through the YarnClusterDescriptor include:

calling the Flink PythonDriver to start a Python process, wherein the Python process is used for communicating with a Java JVM process;

and generating the JobGraph according to the logic of the PyFlink task and the Java JVM process, and submitting the JobGraph to the yarn cluster through the YarnClusterDescriptor.

Preferably, the method further comprises:

and after the PyFlink task is submitted, recursively deleting all files downloaded in the task submitting process.

An electronic device comprising a memory and a processor, the memory for storing one or more computer instructions, wherein the one or more computer instructions are executed by the processor to implement a method of deploying PyFlink tasks to a yarn cluster as in any of the above.

A computer readable storage medium storing a computer program which, when executed by a computer, implements a method of deploying PyFlink tasks to a yarn cluster as described in any one of the above.

The invention has the following beneficial effects:

compared with prior-art methods of integrating the Python ecosystem on top of the Flink computing framework, the application lowers the threshold for using the Python ecosystem and lets users edit PyFlink tasks directly on the platform. Uploaded resources and dependencies are reused directly when a PyFlink task is submitted, so the PyFlink environment and the task's dependencies do not need to be installed on the client in advance; the PyFlink task can be run in different PyFlink environments, and all resources of the PyFlink task can be managed effectively.

Drawings

In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a flowchart of a method for deploying PyFlink tasks to a yarn cluster according to the present application;

FIG. 2 is an exemplary diagram of PyFlink task resources provided at the front end of the present application;

FIG. 3 is an exemplary diagram of a front end user configuration relative path in accordance with the subject application.

Detailed Description

The technical solutions in the embodiments of the present application will be described clearly and completely with reference to the accompanying drawings, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

The terms "first," "second," and the like in the claims and in the description of the present application are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order, it being understood that the terms so used are interchangeable under appropriate circumstances and are merely used to describe a distinguishing manner between similar elements in the embodiments of the present application and that the terms "comprising" and "having" and any variations thereof are intended to cover a non-exclusive inclusion such that a process, method, system, article, or apparatus that comprises a list of elements is not necessarily limited to those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.

Examples

As shown in FIG. 1, a method for deploying PyFlink tasks to a yarn cluster includes the following steps:

S110, downloading all resource files of a PyFlink task uploaded by a front end to a back end, and acquiring Python related information;

S120, constructing a PackagedProgram parameter of the PyFlink task according to the resource files and the Python related information, and calling the deployJobCluster method of a YarnClusterDescriptor to upload all related files of the PyFlink task to HDFS;

S130, starting a Python process, generating a JobGraph according to the logic of the PyFlink task, and submitting the JobGraph to the yarn cluster through the YarnClusterDescriptor.

This embodiment involves interaction between a front end and a back end. The front end can be understood simply as the client; it is mainly used to provide the resource files of the PyFlink task and transmit them to the back end, i.e., the server, for management, and it is implemented in HTML and JavaScript. As shown in FIG. 2, the main contents provided at the front end include:

[001] task name and task type

The task name is the name of the current PyFlink task, such as Flink1, so that the task can be easily identified during subsequent operation and maintenance. The task type of a PyFlink task is fixed as PyFlink.

[002] Python file

The Python file mainly contains the logic of the current PyFlink task, which can be a Flink streaming task or a batch task, and is written in Python, such as kafka_stream. The Python file needs to be uploaded to the project resources of the system in advance and is finally stored in SFTP.

[003] PyFlink environment

The PyFlink environment is primarily a Python environment in which Flink's Python module has been installed. Most developers use Windows or macOS, but the back-end server runs Linux, so a Python environment capable of running Flink is packaged for Linux using Docker, such as linux_venv_final.

[004] PyFlink input parameters

Input parameters required by the PyFlink task, such as "a", "b".

[005] Third party dependencies

Because the Flink framework is written in Java, the Python Flink API needs to communicate with the Java Flink JVM; that is, the actual functionality is implemented by the Flink framework written in Java. Some Flink functions are implemented through Connector plug-ins, so the user needs to provide some jar packages that Flink depends on. For example, when the PyFlink task needs to read from Kafka, flink-sql-connector-kafka_2.12-1.12.7.jar is required, and the user needs to upload this jar package on the front-end page.

Further, receiving a resource file of a PyFlink task uploaded by a front end, wherein the resource file comprises a Python file, a PyFlink environment compression package and a third party dependent jar package;

Illustratively, the back end receives all PyFlink task resources transmitted from the front end and stores them in different storage media according to their content: the PyFlink environment file is large and is stored in HDFS, while other resource files such as Python files and third-party dependency files are small relative to the PyFlink environment file and are stored in SFTP. HDFS is the Hadoop Distributed File System; it stores very large files using a streaming data access pattern, manages storage across multiple computers in a network, and offers high fault tolerance and high throughput. SFTP is a secure file transfer protocol, in full the SSH File Transfer Protocol; it supports file access, transfer and management, and provides an encrypted channel for transmitting files securely.

In addition, because the resources and dependencies uploaded by the front end are kept in these storage media, the PyFlink task files needed at submission time can be reused directly from the uploaded files without being uploaded again, which is simpler and more convenient.
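
The storage rule described above can be sketched as a small routing function. This is only an illustration under stated assumptions, not the applicant's implementation: the names `ResourceType` and `choose_storage` are hypothetical and do not appear in the application.

```python
from enum import Enum


class ResourceType(Enum):
    """Resource kinds named in the description (the enum itself is hypothetical)."""
    PYTHON_FILE = "python_file"
    PYFLINK_ENV = "pyflink_env"
    THIRD_PARTY_JAR = "third_party_jar"


def choose_storage(resource_type: ResourceType) -> str:
    """Return the storage medium for a resource type.

    The large PyFlink environment archive goes to HDFS; the smaller
    Python files and third-party jar packages go to SFTP.
    """
    if resource_type is ResourceType.PYFLINK_ENV:
        return "HDFS"
    return "SFTP"
```

Keeping the rule in one function means a new resource type only needs one new branch, and the stored files can be located again at submission time from their type alone.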

When a task is submitted, all resources required by the current PyFlink task are submitted to the scheduling system, which orders tasks by priority and calls the PyFlink task submission component; this is prior art and is not described again here.

Before the task is submitted, the resource files transmitted by the front end are downloaded locally to the back end, the PyFlink environment file is decompressed, and the decompressed directory path is encapsulated into PyFlinkInfo in preparation for submitting the task.
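
As a rough sketch of this preparation step, the following assumes that the PyFlink environment is delivered as a zip archive (the application does not state the archive format) and uses a hypothetical Python data class as a stand-in for PyFlinkInfo; the field and function names are illustrative only.

```python
import zipfile
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class PyFlinkInfo:
    """Hypothetical stand-in for the PyFlinkInfo object named in the text."""
    env_dir: Optional[str] = None           # decompressed PyFlink environment directory
    flink_python_jar: Optional[str] = None  # path of the flink-python jar, set in a later step


def unpack_environment(archive_path: str, target_dir: str, info: PyFlinkInfo) -> PyFlinkInfo:
    """Decompress the environment archive and record the directory in info."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    # Assumption: the archive is a zip file.
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target)
    info.env_dir = str(target)
    return info
```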

Further, the flink-python jar package is searched for under the back-end Flink lib directory, the jar package path is set into PyFlinkInfo, and path information of the PyFlink environment downloaded to the back end, path information of the PyFlink environment running on the yarn cluster, and path information of the resource files stored in the HDFS and the SFTP are acquired.
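
The jar lookup can be illustrated with a short directory search. This sketch assumes the jar's file name starts with `flink-python` and ends with `.jar`; the exact file name is not given in the application, and `find_flink_python_jar` is a hypothetical helper.

```python
from pathlib import Path
from typing import Optional


def find_flink_python_jar(flink_lib_dir: str) -> Optional[str]:
    """Return the path of the first flink-python*.jar in the lib directory, or None.

    Assumption: the jar's file name begins with "flink-python".
    """
    matches = sorted(Path(flink_lib_dir).glob("flink-python*.jar"))
    return str(matches[0]) if matches else None
```

Returning `None` when nothing matches lets the caller fail early with a clear message before any files are uploaded to the cluster.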

Then, the Python related information is acquired, which mainly comprises three parts. The first part is the flink-python jar package, which contains the Flink Python module and needs to be submitted to the yarn cluster. This package is placed directly in the Flink lib directory at the back end, and the location of the Flink lib directory is configured by the user at the front end and passed to the back end. Once the jar package is found, its path is set into PyFlinkInfo and is taken directly from PyFlinkInfo when the task is submitted. The second part is the PyFlink environment paths, comprising python.client.executable and python.executable. python.client.executable is the PyFlink environment path required by the back end and is obtained by concatenating two paths: the location of the PyFlink environment decompressed from the PyFlink environment file downloaded from HDFS, and a relative path configured by the user and passed to the back end; that is, python.client.executable = PyFlink environment location + user-configured relative path, where the user-configured relative path is shown in FIG. 3. python.executable is the path at which the PyFlink environment file is archived on HDFS. The third part is the path information of the resource files transmitted by the front end, which is likewise divided into two parts: the path information of the PyFlink environment, which comes from HDFS, and the path information of the other resource files, which comes from SFTP.
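
The concatenation rule for python.client.executable can be written out as a one-line helper. The function name `client_executable` and the example paths in the comments are hypothetical; only the rule itself (environment location plus user-configured relative path) comes from the description.

```python
import posixpath


def client_executable(env_dir: str, relative_path: str) -> str:
    """Join the decompressed environment directory and the user-configured relative path.

    Leading slashes are stripped from relative_path so that it is always
    treated as relative to env_dir (the back-end server runs Linux, hence posixpath).
    """
    return posixpath.join(env_dir, relative_path.lstrip("/"))


# Hypothetical example values:
#   env_dir       = "/data/tmp/task_1/linux_venv_final"
#   relative_path = "bin/python"
#   result        = "/data/tmp/task_1/linux_venv_final/bin/python"
```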

Further, calling the Flink PythonDriver to start a Python process, wherein the Python process is used for communicating with a Java JVM process;

Illustratively, when a task is submitted, the PackagedProgram parameter information of the PyFlink task is constructed first, including the Python file, the PyFlink input parameters, the PyFlink environment path at the back end and the PyFlink environment path of the yarn cluster. The deployJobCluster method of the YarnClusterDescriptor is then called to upload all resources of the current PyFlink task to HDFS, namely all packages in the Flink lib directory, the jar packages on which the PyFlink task depends, the log configuration files, the HDFS configuration files and the yarn configuration files. Next, the Flink PythonDriver class is called to start a Python process, which communicates with the Flink JVM process; a JobGraph is generated from the logic of the PyFlink task through this communication with the Java JVM, and finally the JobGraph is submitted to the yarn cluster through the YarnClusterDescriptor.
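
In the application this assembly is performed in Java against Flink's client classes. The Python sketch below only mirrors which pieces of information are gathered before submission; it does not call Flink. The `ProgramSpec` class and `build_program_spec` function are hypothetical, and passing the Python file as `-py <file>` followed by the input parameters, as well as carrying the two environment paths as the configuration keys python.client.executable and python.executable, are assumptions about how the values would be handed to the PythonDriver entry class.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Entry class that starts the Python process for a PyFlink job.
PYTHON_DRIVER_CLASS = "org.apache.flink.client.python.PythonDriver"


@dataclass
class ProgramSpec:
    """Hypothetical summary of what the PackagedProgram parameter carries."""
    entry_class: str
    arguments: List[str]
    configuration: Dict[str, str] = field(default_factory=dict)


def build_program_spec(python_file: str, params: List[str],
                       client_executable: str, cluster_executable: str) -> ProgramSpec:
    """Collect the Python file, input parameters and both environment paths."""
    return ProgramSpec(
        entry_class=PYTHON_DRIVER_CLASS,
        # Assumption: the Python file is passed as "-py <file>", then the input parameters.
        arguments=["-py", python_file, *params],
        configuration={
            "python.client.executable": client_executable,  # back-end environment path
            "python.executable": cluster_executable,        # environment path for the yarn cluster
        },
    )
```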

After the task is submitted, all folders downloaded for the task's resources are deleted recursively, so that leftover resources do not occupy disk space on the server.
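
A minimal sketch of the cleanup step, assuming that everything downloaded for one task lives under a single working directory; the function name `cleanup_task_dir` is hypothetical.

```python
import shutil
from pathlib import Path


def cleanup_task_dir(task_dir: str) -> None:
    """Recursively delete the task's download directory.

    ignore_errors=True makes the call safe to repeat, for example when the
    directory has already been removed.
    """
    shutil.rmtree(Path(task_dir), ignore_errors=True)
```

In practice such a call would typically sit in a `finally` block around the submission, so that the directory is removed whether or not the submission succeeds.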

The method writes Flink real-time computing tasks in Python on top of the Flink computing framework, and better manages the Python files, jar package dependencies, Python environment and other resources that a Flink task written in Python depends on.

The present application also provides an electronic device comprising a memory and a processor, wherein the memory is used for storing one or more computer instructions, and the one or more computer instructions are executed by the processor to implement the above-mentioned method for deploying PyFlink tasks to a yarn cluster.

It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process of the electronic device described above may refer to the corresponding process in the foregoing method embodiment, and is not described herein again.

The present application also provides a computer readable storage medium storing a computer program which, when executed by a computer, implements the method for deploying PyFlink tasks to a yarn cluster as described above.

Illustratively, a computer program may be divided into one or more modules/units, which are stored in the memory and executed by the processor, with data transferred through the input interface and the output interface, to carry out the present invention. The one or more modules/units may be a series of computer program instruction segments describing the execution of the computer program in a computer device.

The computer device may be a desktop computer, a notebook, a palmtop computer, a cloud server, or another computing device. The computer device may include, but is not limited to, a memory and a processor. Those skilled in the art will appreciate that the present embodiment is only an example of the computer device and does not constitute a limitation of it; the computer device may include more or fewer components, combine certain components, or use different components. For example, the computer device may further include an input device, a network access device, a bus, and the like.

The processor may be a Central Processing Unit (CPU), another general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc. A general purpose processor may be a microprocessor or any conventional processor.

The memory may be an internal storage unit of the computer device, such as a hard disk or internal memory of the computer device. The memory may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash memory card (Flash Card) provided on the computer device. Further, the memory may include both an internal storage unit and an external storage device of the computer device. The memory is used for storing computer programs and other programs and data required by the computer device, and may also be used for temporarily storing data that has been output or is to be output. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

The above description is only an embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions within the technical scope of the present invention are intended to be covered by the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims

1. A method for deploying PyFlink tasks to a yarn cluster, comprising the steps of:

2. The method according to claim 1, wherein the downloading all resource files of the PyFlink task uploaded from the front end to the back end comprises:

3. The method according to claim 1, wherein after downloading all resource files of the PyFlink task uploaded from the front end to the back end, the method further comprises:

4. The method according to claim 2, wherein the obtaining of Python related information comprises:

searching for the flink-python jar package under the back-end Flink lib directory, setting the jar package path into PyFlinkInfo, and acquiring path information of the PyFlink environment downloaded to the back end, path information of the PyFlink environment running on the yarn cluster, and path information of the resource files stored in the HDFS and the SFTP.

5. The method of claim 1, wherein the PackagedProgram parameter includes a Python file, Python input parameters, the path of the back-end PyFlink environment, and the PyFlink environment path of the yarn cluster, the Python input parameters belonging to the resource files.

6. The method of claim 1, wherein the related files comprise all packages in the Flink lib directory, the jar packages on which the PyFlink task depends, a log configuration file, an HDFS configuration file, and a yarn configuration file.

7. The method for deploying PyFlink tasks to a yarn cluster as claimed in claim 1, wherein the starting of the Python process, the generating of the JobGraph according to the logic of the PyFlink task, and the submitting of the JobGraph to the yarn cluster through the YarnClusterDescriptor comprise:

8. The method of claim 1, wherein the PyFlink task is deployed to a yarn cluster, and wherein the Python process is initiated, the method further comprising:

9. An electronic device, comprising a memory and a processor, wherein the memory is configured to store one or more computer instructions, wherein the one or more computer instructions are executable by the processor to implement a method for deploying a PyFlink task to a yarn cluster as recited in any one of claims 1 to 8.

10. A computer readable storage medium storing a computer program, wherein the computer program when executed causes a computer to implement the method for deploying a PyFlink task to a yarn cluster according to any one of claims 1 to 8.