CN110689245A - Method and system for analyzing call relation of big data workflow - Google Patents

Method and system for analyzing call relation of big data workflow

Info

Publication number
CN110689245A
CN110689245A
Authority
CN
China
Prior art keywords
workflow
file
name
work
big data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910877666.8A
Other languages
Chinese (zh)
Other versions
CN110689245B (en)
Inventor
徐涛
吴峰
郭伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wheel Interconnection Technology Shanghai Co ltd
Original Assignee
Shanghai Yidianshikong Network Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Yidianshikong Network Co Ltd filed Critical Shanghai Yidianshikong Network Co Ltd
Priority to CN201910877666.8A priority Critical patent/CN110689245B/en
Publication of CN110689245A publication Critical patent/CN110689245A/en
Application granted granted Critical
Publication of CN110689245B publication Critical patent/CN110689245B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0633Workflow analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/2282Tablespace storage structures; Management thereof
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/242Query formulation
    • G06F16/2433Query languages
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/284Relational databases
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

A method for analyzing call relations of big data workflows comprises the following steps: storing metadata of a workflow scheduling engine, and obtaining the name of each workflow and the storage location of its description file; reading a first file according to the storage location, wherein the first file describes all job types and the execution order of the current workflow; reading the first file, and obtaining the job type and the execution file; writing the name of the workflow and the execution file into a first table, wherein the first table records which workflow the current job belongs to; and when the job type is hive, continuing to read the execution file to obtain the target table and the source tables among the data tables. With this method, the relationships between workflows and jobs, and between jobs themselves, in a big data workflow scheduling engine are analyzed. A global call-relationship chain can be established to assist in querying and analyzing the dependencies between them.

Description

Method and system for analyzing call relation of big data workflow
Technical Field
The application relates to the field of databases, in particular to a method and a system for analyzing call relation of big data workflow.
Background
Existing big data workflow scheduling engines mainly include Oozie, Azkaban, AirFlow and the like. They use a DAG graph to describe a group of jobs and flows that run in a workflow in a specific order, and the supported job types include java, map-reduce, hive, shell, python and the like. However, none of the existing workflow scheduling engines provides a way to query which workflow a given scheduled job belongs to; all workflows must be traversed manually, one by one, to find it. The invention provides a method for analyzing the scheduling relationship between jobs and workflows, which helps to quickly find the workflow to which a job belongs.
The disadvantages are as follows:
1. The existing big data workflow scheduling engines do not provide a way to query the workflow to which a job belongs.
2. The existing big data workflow scheduling engines do not expose the dependency relationships between the internal tables of hive jobs.
Disclosure of Invention
The main purpose of the present application is to provide a method for analyzing a big data workflow invocation relationship, which includes:
storing metadata of a workflow scheduling engine, and obtaining the name of a workflow and the storage position of a description file;
reading a first file according to the storage position, wherein the first file describes all job types and execution sequences of the current workflow;
reading the first file, and obtaining the job type and the execution file;
writing the name of the workflow and the execution file into a first table, wherein the first table records which workflow the current job belongs to;
and when the job type is hive, continuing to read the execution file to obtain the target table and the source tables among the data tables.
Optionally, the workflow to which a job execution file belongs is queried through the execution file table, and the workflow that produced a given data-warehouse table is queried by joining the execution file table with the work_table.
Optionally, the metadata of Oozie is stored through mysql, and the WF_JOBS table is queried through the sql below, so as to obtain the name app_name of each workflow and the storage location app_path of its description file in hdfs.
Optionally, the first file is the workflow.xml file read from the folder indicated by app_path.
Optionally, the job type work_type is obtained from the name attribute of the action tag, the execution file work_script is obtained from the value of script, and the app_name and the work_script are written into the app_work table of mysql.
Optionally, when work_type is hive, the work_script file is read further; the work_script file records the sql that the job needs to execute, and the sql is analyzed with the following regular expressions to obtain the target table and the source tables among the data tables:
target table name dst_table: insert[ \n\t]+overwrite[ \n\t]+table[ \n\t]+([^ \n\t]+); source table name src_table: [ \n\t]from[ \n\t]+([^ \n\t]+) or join[ \n\t]+([^ \n\t]+)
The work_script, dst_table and src_table are written into the work_table of mysql, wherein the work_table records the table dependencies of the hive job.
The application also provides a system for analyzing the call relation of the big data workflow, which comprises the following steps:
the storage module is used for storing the metadata of the workflow scheduling engine and acquiring the name of the workflow and the storage position of the description file;
the reading module is used for reading a first file according to the storage position, wherein the first file describes all job types and execution sequences of the current workflow, and the job types and the execution files are obtained;
the writing module is used for writing the name of the workflow and the execution file into a first table, wherein the first table records which workflow the current job belongs to;
and the obtaining module is used for continuing to read the execution file when the job type is hive, to obtain the target table and the source tables among the data tables.
The application also discloses a computer device, which comprises a memory, a processor and a computer program stored in the memory and capable of being executed by the processor, wherein the processor realizes the method of any one of the above items when executing the computer program.
The application also discloses a computer-readable storage medium, a non-volatile readable storage medium, having stored therein a computer program which, when executed by a processor, implements the method of any of the above.
The present application also discloses a computer program product comprising computer readable code which, when executed by a computer device, causes the computer device to perform the method of any of the above.
With this method for analyzing the call relations of big data workflows, the relationships between workflows and jobs, and between jobs themselves, in a big data workflow scheduling engine are analyzed. A global call-relationship chain can be established to assist in querying and analyzing the dependencies between them.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, serve to provide a further understanding of the application and to enable other features, objects, and advantages of the application to be more apparent. The drawings and their description illustrate the embodiments of the invention and do not limit it. In the drawings:
FIG. 1 is a schematic flow diagram of a method of analyzing big data workflow invocation relationships, according to one embodiment of the present application;
FIG. 2 is a flowchart of a method for analyzing call relations of a big data workflow according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a computer device according to one embodiment of the present application; and
FIG. 4 is a schematic diagram of a computer-readable storage medium according to one embodiment of the present application.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only partial embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that the terms "comprises" and "comprising," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Furthermore, the terms "mounted," "disposed," "provided," "connected," and "sleeved" are to be construed broadly. For example, it may be a fixed connection, a removable connection, or a unitary construction; can be a mechanical connection, or an electrical connection; may be directly connected, or indirectly connected through intervening media, or may be in internal communication between two devices, elements or components. The specific meaning of the above terms in the present application can be understood by those of ordinary skill in the art as appropriate.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
As shown in fig. 1, an embodiment of the present application provides a method for analyzing a call relationship of a big data workflow, including:
s2: storing metadata of a workflow scheduling engine, and obtaining the name of a workflow and the storage position of a description file;
s4: reading a first file according to the storage position, wherein the first file describes all job types and execution sequences of the current workflow;
s6: reading the first file, and obtaining the job type and the execution file;
s8: writing the name of the workflow and the execution file into a first table, wherein the first table records which workflow the current job belongs to;
s10: and when the job type is hive, continuing to read the execution file to obtain the target table and the source tables among the data tables.
In an embodiment of the application, the workflow to which a job execution file belongs is queried through the execution file table, and the workflow that produced a given data-warehouse table is queried by joining the execution file table with the work_table.
In an embodiment of the present application, mysql is used to store the metadata of Oozie, and the WF_JOBS table is queried through the following sql to obtain the name app_name of each workflow and the storage location app_path of its description file in hdfs.
In an embodiment of the present application, the first file is the workflow.xml file read from the folder indicated by app_path.
In an embodiment of the application, the job type work_type is obtained from the name attribute of the action tag, the execution file work_script is obtained from the value of script, and the app_name and the work_script are written into the app_work table of mysql.
In an embodiment of the application, when work_type is hive, the work_script file is read further; the work_script file records the sql that the job needs to execute, and the sql is analyzed with the following regular expressions to obtain the target table and the source tables among the data tables:
target table name dst_table: insert[ \n\t]+overwrite[ \n\t]+table[ \n\t]+([^ \n\t]+); source table name src_table: [ \n\t]from[ \n\t]+([^ \n\t]+) or join[ \n\t]+([^ \n\t]+)
The work_script, dst_table and src_table are written into the work_table of mysql, wherein the work_table records the table dependencies of the hive job.
The application also provides a system for analyzing the call relation of the big data workflow, which comprises the following steps:
the storage module is used for storing the metadata of the workflow scheduling engine and acquiring the name of the workflow and the storage position of the description file;
the reading module is used for reading a first file according to the storage position, wherein the first file describes all job types and execution sequences of the current workflow, and the job types and the execution files are obtained;
the writing module is used for writing the name of the workflow and the execution file into a first table, wherein the first table records which workflow the current job belongs to;
and the obtaining module is used for continuing to read the execution file when the job type is hive, to obtain the target table and the source tables among the data tables.
Referring to fig. 3, the present application further provides a computer device including a memory, a processor, and a computer program stored in the memory and executable by the processor, wherein the processor implements the method of any one of the above methods when executing the computer program.
Referring to fig. 4, the present application also provides a computer-readable storage medium, a non-volatile readable storage medium, having stored therein a computer program, which when executed by a processor implements the method of any of the above.
The present application also provides a computer program product comprising computer readable code which, when executed by a computer device, causes the computer device to perform the method of any of the above.
Referring to fig. 2, in an embodiment of the present application, the process of analyzing the call relations of a big data workflow scheduling engine (Oozie, Azkaban, AirFlow, etc.) is shown in the figure; the specific analysis steps are described in detail below, taking Oozie as an example:
1. The Oozie metadata is stored in mysql, and the WF_JOBS table is queried with the following sql, which yields the name app_name of every workflow and the location app_path where the workflow description file is stored in hdfs.
SELECT max(id) id, app_name, max(app_path) app_path FROM WF_JOBS GROUP BY app_name
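The grouping semantics of this query can be illustrated with an in-memory sqlite3 stand-in for Oozie's MySQL metadata store — a minimal sketch, not the real store; the table and column names follow the patent, and the sample rows are invented:

```python
import sqlite3

# In-memory stand-in for Oozie's MySQL metadata store (sample rows are invented).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE WF_JOBS (id INTEGER, app_name TEXT, app_path TEXT)")
con.executemany(
    "INSERT INTO WF_JOBS VALUES (?, ?, ?)",
    [
        (1, "daily_etl", "hdfs://nn/wf/daily_etl/v1"),
        (2, "daily_etl", "hdfs://nn/wf/daily_etl/v2"),  # the same workflow re-submitted
        (3, "report_wf", "hdfs://nn/wf/report_wf/v1"),
    ],
)

# One row per workflow name: the latest run id and its description-file location.
rows = con.execute(
    "SELECT max(id) id, app_name, max(app_path) app_path "
    "FROM WF_JOBS GROUP BY app_name"
).fetchall()
print(sorted(rows))
```

The GROUP BY collapses repeated submissions of the same workflow into a single row, so each app_name is looked up once.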
2. The workflow.xml file under the folder indicated by app_path is read; this file describes all job types and the job execution order in the current workflow.
3. The workflow.xml file is read; the job type work_type can be obtained from the name attribute of the <action> tag, and the job execution file work_script can be obtained from the value of <script>. The app_name from step 1 and the work_script are written into the app_work table of mysql; this table records which workflow each job belongs to.
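Step 3 can be sketched with Python's standard xml.etree parser. The workflow.xml below is a made-up minimal example (real Oozie definitions carry XML namespaces, which this sketch leaves out), and it follows the patent's convention that the action's name attribute carries the job type:

```python
import xml.etree.ElementTree as ET

# A minimal, invented workflow.xml in the spirit of an Oozie definition
# (real files carry XML namespaces, which this sketch ignores).
WORKFLOW_XML = """
<workflow-app name="daily_etl">
  <action name="hive">
    <script>etl.sql</script>
    <ok to="end"/>
  </action>
</workflow-app>
"""

root = ET.fromstring(WORKFLOW_XML)
jobs = []
for action in root.iter("action"):
    work_type = action.get("name")      # job type from the action's name attribute
    script = action.find("script")      # job execution file from the <script> value
    work_script = script.text if script is not None else None
    jobs.append((work_type, work_script))

print(jobs)  # each (work_type, work_script) pair would be written to app_work
```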
4. If work_type is hive, the work_script file is read further; it records the sql that the job needs to execute, and the sql is analyzed with the following regular expressions to find the target table and the source tables among the data tables:
target table name dst_table: insert[ \n\t]+overwrite[ \n\t]+table[ \n\t]+([^ \n\t]+)
source table name src_table: [ \n\t]from[ \n\t]+([^ \n\t]+) or join[ \n\t]+([^ \n\t]+)
The work_script, dst_table and src_table are written into the work_table of mysql; this table records the dependencies of the tables in hive jobs.
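The regex extraction of step 4 can be sketched as follows. The HiveQL sample is invented, the dst pattern is the one reconstructed above, and the two source patterns (from and join) are combined into a single alternation for brevity:

```python
import re

# Invented HiveQL of the shape the patent's regexes target.
sql = (
    "insert overwrite table dw.ads_orders\n"
    "select o.id, u.name from dw.dwd_orders o\n"
    "join dw.dim_users u on o.uid = u.id"
)

# Reconstructed from the patent's patterns: whitespace is [ \n\t],
# a table name is any run of non-whitespace characters.
dst_re = re.compile(r"insert[ \n\t]+overwrite[ \n\t]+table[ \n\t]+([^ \n\t]+)", re.I)
src_re = re.compile(r"(?:from|join)[ \n\t]+([^ \n\t]+)", re.I)

dst_table = dst_re.findall(sql)  # target table(s) written by the job
src_table = src_re.findall(sql)  # source table(s) the job reads from
print(dst_table, src_table)
```

Each (work_script, dst_table, src_table) triple would then be written to the work_table. Note this is a lexical sketch: subqueries, table aliases, and CTEs would need a real SQL parser.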
5. The workflow to which a job execution file belongs can be queried from the app_work table, and the workflow that produced a given data-warehouse table can be found by joining the app_work table with the work_table on work_script.
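The lineage query of step 5 can be sketched with in-memory sqlite3 stand-ins for the two tables (table and column names follow the patent; the rows are invented):

```python
import sqlite3

# Tiny in-memory stand-ins for app_work and work_table (rows are invented).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE app_work (app_name TEXT, work_script TEXT)")
con.execute("CREATE TABLE work_table (work_script TEXT, dst_table TEXT, src_table TEXT)")
con.execute("INSERT INTO app_work VALUES ('daily_etl', 'etl.sql')")
con.execute("INSERT INTO work_table VALUES ('etl.sql', 'dw.ads_orders', 'dw.dwd_orders')")

# Which workflow produces the data-warehouse table dw.ads_orders?
row = con.execute(
    "SELECT a.app_name FROM app_work a "
    "JOIN work_table w ON a.work_script = w.work_script "
    "WHERE w.dst_table = 'dw.ads_orders'"
).fetchone()
print(row[0])
```

Chaining such joins over dst_table and src_table is what makes the global call-relationship chain queryable.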
The basic concepts that may be used in this application are as follows:
big data:
https://wiki.mbalib.com/wiki/%E5%A4%A7%E6%95%B0%E6%8D%AE
Hadoop:https://hadoop.apache.org/
big data scheduling engine:
https://www.cnblogs.com/barneywill/p/10109820.html
Oozie:https://oozie.apache.org/
Hive:https://www.yiibai.com/hive/
Data warehouse:
https://baike.baidu.com/%E6%95%B0%E6%8D%AE%E4%BB%93%E5%BA%93
mysql: an open-source relational database management system (RDBMS) that uses the most common database management language, Structured Query Language (SQL), for database management
Regular expression: a concept in computer science; regular expressions are typically used to retrieve and replace text that conforms to a certain pattern (rule)
DAG: directed acyclic graph, a graph in which every edge has a direction and no cycle exists
Oozie: a workflow scheduler for managing Apache Hadoop jobs. Oozie combines multiple jobs sequentially into one logical unit of work as a directed acyclic graph (DAG) of actions
Hive: a data warehouse tool based on Hadoop that can map structured data files to database tables, provides a simple sql query capability, and converts sql statements into MapReduce tasks to run
hdfs: short for Hadoop Distributed File System, a distributed file system
sql: a standard database language
With this method for analyzing the call relations of big data workflows, the relationships between workflows and jobs, and between jobs themselves, in a big data workflow scheduling engine are analyzed. A global call-relationship chain can be established to assist in querying and analyzing the dependencies between them.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (10)

1. A method for analyzing call relation of big data workflow is characterized by comprising the following steps:
storing metadata of a workflow scheduling engine, and obtaining the name of a workflow and the storage position of a description file;
reading a first file according to the storage position, wherein the first file describes all job types and execution sequences of the current workflow;
reading the first file, and obtaining the job type and the execution file;
writing the name of the workflow and the execution file into a first table, wherein the first table records which workflow the current job belongs to;
and when the job type is hive, continuing to read the execution file to obtain the target table and the source tables among the data tables.
2. The method for analyzing call relations of big data workflows according to claim 1, wherein the workflow to which a job execution file belongs is queried through the execution file table, and the workflow that produced a given data-warehouse table is queried by joining the execution file table with the work_table.
3. The method for analyzing call relations of big data workflows according to claim 2, wherein the metadata of Oozie is stored through mysql and the WF_JOBS table is queried through the following sql, so as to obtain the name app_name of each workflow and the storage location app_path of its description file in hdfs.
4. The method for analyzing call relations of big data workflows according to claim 3, wherein said first file is the workflow.xml file under the folder indicated by said app_path.
5. The method for analyzing call relations of big data workflows according to claim 4, wherein the job type work_type is obtained from the name attribute of the action tag, the execution file work_script is obtained from the value of script, and the app_name and the work_script are written into the app_work table of mysql.
6. The method for analyzing call relations of big data workflows according to claim 5, wherein when work_type is hive, the work_script file is read further, the work_script file records the sql that the job needs to execute, and the target table and the source tables among the data tables are obtained by analyzing the sql with the following regular expressions:
target table name dst_table: insert[ \n\t]+overwrite[ \n\t]+table[ \n\t]+([^ \n\t]+)
source table name src_table: [ \n\t]from[ \n\t]+([^ \n\t]+) or join[ \n\t]+([^ \n\t]+)
and the work_script, the dst_table and the src_table are written into the work_table of mysql, wherein the work_table records the table dependencies of the hive job.
7. A system for analyzing call relations of a big data workflow, comprising:
the storage module is used for storing the metadata of the workflow scheduling engine and acquiring the name of the workflow and the storage position of the description file;
the reading module is used for reading a first file according to the storage position, wherein the first file describes all job types and execution sequences of the current workflow, and the job types and the execution files are obtained;
the writing module is used for writing the name of the workflow and the execution file into a first table, wherein the first table records which workflow the current job belongs to;
and the obtaining module is used for continuing to read the execution file when the job type is hive, to obtain the target table and the source tables among the data tables.
8. A computer device comprising a memory, a processor and a computer program stored in the memory and executable by the processor, the processor implementing the method of any one of claims 1-6 when executing the computer program.
9. A computer-readable storage medium, a non-transitory readable storage medium, having stored therein a computer program, which when executed by a processor, implements the method of any one of claims 1-6.
10. A computer program product comprising computer readable code which, when executed by a computer device, causes the computer device to perform the method of any of claims 1-6.
CN201910877666.8A 2019-09-17 2019-09-17 Method and system for analyzing call relation of big data workflow Active CN110689245B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910877666.8A CN110689245B (en) 2019-09-17 2019-09-17 Method and system for analyzing call relation of big data workflow

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910877666.8A CN110689245B (en) 2019-09-17 2019-09-17 Method and system for analyzing call relation of big data workflow

Publications (2)

Publication Number Publication Date
CN110689245A true CN110689245A (en) 2020-01-14
CN110689245B CN110689245B (en) 2022-07-12

Family

ID=69109305

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910877666.8A Active CN110689245B (en) 2019-09-17 2019-09-17 Method and system for analyzing call relation of big data workflow

Country Status (1)

Country Link
CN (1) CN110689245B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111651460A (en) * 2020-06-11 2020-09-11 上海德易车信息科技有限公司 Data management method and device, electronic equipment and readable storage medium
CN112506957A (en) * 2020-12-18 2021-03-16 杭州数梦工场科技有限公司 Method and device for determining workflow dependency relationship
CN115525680A (en) * 2022-09-21 2022-12-27 京信数据科技有限公司 Data processing job scheduling method and device, computer equipment and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101645180A (en) * 2009-05-31 2010-02-10 北京龙德时代科技发展有限公司 Intelligent data acquisition unit
CN106020955A (en) * 2016-05-12 2016-10-12 深圳市傲天科技股份有限公司 Infinite big data workflow dispatching platform
US20170193069A1 (en) * 2015-12-31 2017-07-06 Dropbox, Inc. Propagating Computable Dependencies Within Synchronized Content Items Between First and Third-Party Applications
CN107105009A (en) * 2017-03-22 2017-08-29 北京荣之联科技股份有限公司 Job scheduling method and device based on Kubernetes system docking workflow engines
CN107689982A (en) * 2017-06-25 2018-02-13 平安科技(深圳)有限公司 Multi-data source method of data synchronization, application server and computer-readable recording medium
CN109670780A (en) * 2018-12-03 2019-04-23 中国建设银行股份有限公司 Workflow processing method, equipment and storage medium under complex scene
US20190361718A1 (en) * 2018-05-25 2019-11-28 Vmware, Inc. 3D API Redirection for Virtual Desktop Infrastructure

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101645180A (en) * 2009-05-31 2010-02-10 北京龙德时代科技发展有限公司 Intelligent data acquisition unit
US20170193069A1 (en) * 2015-12-31 2017-07-06 Dropbox, Inc. Propagating Computable Dependencies Within Synchronized Content Items Between First and Third-Party Applications
CN106020955A (en) * 2016-05-12 2016-10-12 深圳市傲天科技股份有限公司 Infinite big data workflow dispatching platform
CN107105009A (en) * 2017-03-22 2017-08-29 北京荣之联科技股份有限公司 Job scheduling method and device based on Kubernetes system docking workflow engines
CN107689982A (en) * 2017-06-25 2018-02-13 平安科技(深圳)有限公司 Multi-data source method of data synchronization, application server and computer-readable recording medium
US20190361718A1 (en) * 2018-05-25 2019-11-28 Vmware, Inc. 3D API Redirection for Virtual Desktop Infrastructure
CN109670780A (en) * 2018-12-03 2019-04-23 中国建设银行股份有限公司 Workflow processing method, equipment and storage medium under complex scene

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111651460A (en) * 2020-06-11 2020-09-11 上海德易车信息科技有限公司 Data management method and device, electronic equipment and readable storage medium
CN112506957A (en) * 2020-12-18 2021-03-16 杭州数梦工场科技有限公司 Method and device for determining workflow dependency relationship
CN115525680A (en) * 2022-09-21 2022-12-27 京信数据科技有限公司 Data processing job scheduling method and device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN110689245B (en) 2022-07-12

Similar Documents

Publication Publication Date Title
CN110689245B (en) Method and system for analyzing call relation of big data workflow
CN107463632B (en) Distributed NewSQL database system and data query method
Hagedorn et al. The STARK framework for spatio-temporal data analytics on spark
US11423082B2 (en) Methods and apparatus for subgraph matching in big data analysis
US8898181B2 (en) Subscription for integrating external data from external system
JP6964384B2 (en) Methods, programs, and systems for the automatic discovery of relationships between fields in a mixed heterogeneous data source environment.
CN105630881A (en) Data storage method and query method for RDF (Resource Description Framework)
US9406018B2 (en) Systems and methods for semantic data integration
Santos et al. Data models in NoSQL databases for big data contexts
CN110555035A (en) Method and device for optimizing query statement
CN104156385B (en) A kind of method and apparatus of processing time sequence data
US20200159722A1 (en) Presenting updated data using persisting views
US10095737B2 (en) Information storage system
US9773003B2 (en) Computer implemented system and method for investigative data analytics
CN114564482A (en) Multi-entity-oriented label system and processing method
JPWO2017170459A6 (en) Method, program, and system for automatic discovery of relationships between fields in a heterogeneous data source mixed environment
RU2015112157A (en) SYSTEM AND METHOD OF DATA SEARCH IN THE DATABASE OF GRAPHS
US8825596B2 (en) Systems and methods for robust data source access
CN109471904B (en) Method and system for organizing labels
JP2020160494A (en) Information processing apparatus, document management system and program
US10324927B2 (en) Data-driven union pruning in a database semantic layer
JP2004192657A (en) Information retrieval system, and recording medium recording information retrieval method and program for information retrieval
JP5998835B2 (en) Information processing apparatus and program
US11126604B2 (en) Aggregation apparatus, aggregation method, and storage medium
GB2596769A (en) Extensible data skipping

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CP03 Change of name, title or address
CP03 Change of name, title or address

Address after: 200125, Room 325, 3rd Floor, Unit 2, No. 231, Shibocun Road, China (Shanghai) Pilot Free Trade Zone, Pudong New Area, Shanghai

Patentee after: Wheel interconnection technology (Shanghai) Co.,Ltd.

Address before: 200125 Room 501, 5 / F, building 3, 3601 Dongfang Road, Pudong New Area, Shanghai

Patentee before: SHANGHAI YIDIAN SPACE NETWORK Co.,Ltd.