CN117608780A - Automatic task rerun method, system, device and storage medium - Google Patents

Publication number
CN117608780A
Authority
CN
China
Prior art keywords
task
information
matrix
determining
relation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311577562.8A
Other languages
Chinese (zh)
Inventor
李忠财
邹京辰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianyi Shilian Technology Co ltd
Original Assignee
Tianyi Digital Life Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianyi Digital Life Technology Co Ltd filed Critical Tianyi Digital Life Technology Co Ltd
Priority to CN202311577562.8A priority Critical patent/CN117608780A/en
Publication of CN117608780A publication Critical patent/CN117608780A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806 Task transfer initiation or dispatching
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3065 Monitoring arrangements determined by the means or processing involved in reporting the monitored data
    • G06F11/3072 Monitoring arrangements determined by the means or processing involved in reporting the monitored data where the reporting involves data filtering, e.g. pattern matching, time or event triggered, adaptive or policy-based reporting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23 Updating
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Quality & Reliability (AREA)
  • Computing Systems (AREA)
  • Algebra (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an automatic task rerun method, system, device and storage medium. Edit log data of a system is obtained in real time and parsed to determine the data file objects updated by operations; the specific data tables and their corresponding table information are determined from those data file objects and sent to a task queue. A first relation matrix is constructed from the edited offline task list. The specific data tables and their corresponding table information are then read from the task queue in sequence and matched against the constructed first relation matrix to determine the task nodes in the task list whose rerun must be triggered, together with their downstream associated task nodes, so that the task rerun operation can be executed. Embodiments of the invention reduce the labor and time cost of manual intervention, achieve cost reduction and efficiency improvement, and can be widely applied in the field of big data.

Description

Automatic task rerun method, system, device and storage medium
Technical Field
The present invention relates to the field of big data, and in particular, to a method, a system, an apparatus, and a storage medium for task automatic rerun.
Background
At present, the big data field still widely uses the Hadoop architecture to process data, and technicians develop and deploy offline jobs on existing scheduling platforms, such as DolphinScheduler and Airflow, according to business development requirements. Owing to actual business requirements, the offline jobs that technicians develop and deploy usually depend on one another. Through these dependencies, the execution state of an upstream task can be detected automatically; if an upstream task is detected to have failed, its downstream tasks are not executed, and after the failure is detected the task is rerun through a preset rerun mechanism.
However, in actual business execution, abnormal situations arise, such as a change in the business caliber (the statistical definition of a business metric) or an incomplete initial data source. In such cases it is necessary to rerun a certain node of a DAG (directed acyclic graph: a directed graph in which no path along directed edges starting from any vertex returns to that vertex) together with all tasks downstream of that node, or to rerun all tasks manually. All of these operations require manual intervention by technicians, so the execution cost is high and the efficiency is low.
Disclosure of Invention
Therefore, the purpose of the embodiments of the present invention is to provide a method, a system, a device and a storage medium for automatic task rerun, which can reduce the degree of manual intervention and realize cost reduction and efficiency enhancement.
In one aspect, the embodiment of the invention provides a task automatic re-running method, which comprises the following steps:
acquiring editing log data; wherein the edit log data includes an operation record of a data file;
analyzing the editing log data to obtain a plurality of first information tables, and sending the plurality of first information tables to a task queue; wherein the first information table includes metadata information of an updated table;
constructing a first relation matrix, sequentially acquiring the first information table according to the task queue, and determining rerun task information according to the first information table and the first relation matrix to trigger task execution; wherein the rerun task information comprises rerun task nodes and a rerun sequence.
Optionally, the analyzing the editing log data to obtain a plurality of first information tables specifically includes:
determining operation information according to the editing log data; wherein the operation information comprises an operation data file object and an operation type, and the operation type comprises addition, deletion and modification;
matching the operation information with a first set to obtain the first information table; wherein the first set includes a number of Hive metadata tables.
Optionally, the constructing a first relation matrix specifically includes:
acquiring an offline task list and initializing a matrix set;
determining a first relation table according to the offline task list;
and determining a first relation matrix according to the first relation table and the initialized matrix set.
Optionally, the determining a first relationship table according to the offline task list specifically includes:
initializing a first list and a second list;
determining upstream and downstream dependency relationships among tasks according to the offline task list, and updating the initialized first list according to the upstream and downstream dependency relationships among the tasks to obtain an upstream and downstream dependency relationship table among the tasks;
determining a task type according to the offline task list, if the task type is a non-SQL task type, determining a first task input-output relationship according to the offline task list, and updating the initialized second list according to the first task input-output relationship to obtain a first task input-output table;
and determining a first relation table according to the inter-task upstream and downstream dependency relation table and the first task input and output table.
Optionally, the determining the first relation table according to the offline task list further includes:
initializing a third list;
determining upstream and downstream dependency relationships among tasks according to the offline task list, and updating the initialized third list according to the upstream and downstream dependency relationships among the tasks to obtain an upstream and downstream dependency relationship table among the tasks;
determining a task type according to the offline task list, and if the task type is an SQL task type, parsing the offline task list with a data lineage analysis tool to obtain a second task input-output table;
and determining a first relation table according to the inter-task upstream and downstream dependency relation table and the second task input and output table.
Optionally, the determining a first relationship matrix according to the first relationship table and the initialized matrix set specifically includes:
determining a first matrix and a second matrix according to the initialized matrix set;
updating the first matrix according to the first relation table to obtain a data table and a task relation matrix, and updating the second matrix according to the first relation table to obtain an inter-task relation matrix;
and determining a first relation matrix according to the relation matrix between the data table and the task and the relation matrix between the tasks.
Optionally, the determining the rerun task information according to the first information table and the first relation matrix specifically includes:
determining table information according to the first information table; the table information comprises a data table name, position information of the data table, size information of the data table and update time information of the data table;
and matching the table information with the first relation matrix, determining task nodes and corresponding associated task nodes, and taking the task nodes and the corresponding associated task nodes as the rerun task information.
In another aspect, an embodiment of the present invention provides a task automatic rerun system, including:
the first module is used for acquiring editing log data; wherein the edit log data includes an operation record of a data file;
the second module is used for analyzing the editing log data to obtain a plurality of first information tables and sending the plurality of first information tables to a task queue; wherein the first information table includes updated table metadata information;
the third module is used for constructing a first relation matrix, sequentially acquiring the first information table according to the task queue, and determining the rerun task information according to the first information table and the first relation matrix so as to trigger the task to be executed; the re-running task information comprises re-running task nodes and re-running sequences.
In another aspect, an embodiment of the present invention provides a task automatic rerun device, including:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method described in the method embodiments above.
In another aspect, embodiments of the present invention provide a computer readable storage medium, in which a processor executable program is stored, which when executed by a processor is configured to perform the method described in the previous method embodiments.
The embodiments of the invention have the following beneficial effects. Edit log data of the system is first obtained in real time and parsed to determine the data file objects that were updated by operations; the specific data tables and their corresponding table information, including name, location, size and the like, are obtained from those data file objects and sent to a task queue. A first relation matrix is constructed from the edited offline task list; the specific data tables and their table information are then read from the task queue in sequence and matched against the first relation matrix to determine the task nodes in the task list whose rerun must be triggered and the corresponding downstream associated task nodes, so that the task rerun operation is executed. By constructing the first relation matrix from the edited offline task list and parsing the edit log data, changes in data table metadata caused by a business caliber change or an upstream data source change are identified, and the rerun task node and its associated downstream task nodes are determined from the changed metadata and the first relation matrix. Because the updated table information is sent to a queue and task reruns are driven by reading that queue, task reruns are executed in closer to real time. A large number of manually started tasks is avoided, the labor and time cost of manual intervention is reduced, and cost reduction and efficiency improvement are achieved.
Drawings
FIG. 1 is a schematic flow chart of steps of a task automatic re-running method according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating steps for performing analysis of edit log data in a task automatic re-running method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an edit log data format in a task automatic re-running method according to an embodiment of the present invention;
FIG. 4 is a flowchart illustrating steps for determining a first relationship matrix in a task automatic re-running method according to an embodiment of the present invention;
FIG. 5 is a flowchart illustrating steps for determining a first relationship table in a task automatic re-running method according to an embodiment of the present invention;
FIG. 6 is a flowchart illustrating another step of determining a first relationship table in a task automatic re-running method according to an embodiment of the present invention;
FIG. 7 is a flowchart illustrating steps for determining a first relationship matrix in a task automatic re-running method according to an embodiment of the present invention;
FIG. 8 is a flowchart illustrating steps for determining information of a rerun task in a task automatic rerun method according to an embodiment of the present invention;
FIG. 9 is a schematic flow chart of steps of a specific embodiment provided in an embodiment of the present invention;
FIG. 10 is a flowchart illustrating a process for constructing a task-to-task relationship matrix in an embodiment of the present invention;
FIG. 11 is a flowchart illustrating a process for constructing a relationship matrix between a data table and a task in an embodiment of the present invention;
FIG. 12 is a block diagram of a task automatic rerun system provided by an embodiment of the present invention;
FIG. 13 is a block diagram of a task automatic rerun device according to an embodiment of the present invention.
Detailed Description
The invention will now be described in further detail with reference to the drawings and to specific examples. The step numbers in the following embodiments are set for convenience of illustration only, and the order between the steps is not limited in any way, and the execution order of the steps in the embodiments may be adaptively adjusted according to the understanding of those skilled in the art.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is to be understood that "some embodiments" can be the same subset or different subsets of all possible embodiments and can be combined with one another without conflict.
In the following description, the terms "first", "second", "third" and the like are merely used to distinguish similar objects and do not represent a specific ordering of the objects, it being understood that the "first", "second", "third" may be interchanged with a specific order or sequence, as permitted, to enable embodiments of the invention described herein to be practiced otherwise than as illustrated or described herein.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the embodiments of the invention is for the purpose of describing embodiments of the invention only and is not intended to be limiting of the invention.
Before describing embodiments of the present invention in further detail, the terms and terminology involved in the embodiments of the present invention will be described, and the terms and terminology involved in the embodiments of the present invention will be used in the following explanation.
Hadoop metadata: Hadoop is a distributed system infrastructure developed by the Apache Foundation and is composed of HDFS, MapReduce, and other components. HDFS consists of one NameNode and a plurality of DataNodes; the NameNode acts as the master server, managing the file system namespace and client access to files, while the DataNodes in the cluster manage the data stored on them. Hadoop metadata refers to the file system namespace managed by the NameNode.
As shown in fig. 1, an embodiment of the present invention provides a task automatic re-running method, which includes the following steps.
S100, acquiring editing log data; wherein the edit log data includes an operation record of the data file.
Specifically, in practical applications a Hadoop system is generally used to schedule or monitor offline tasks. When the business caliber or the data source of a task changes, the change is recorded in the NameNode edit log data of the Hadoop system. The NameNode edit log data is therefore collected in real time and analyzed to obtain the change records for the subsequent automatic task rerun.
S200, analyzing the editing log data to obtain a plurality of first information tables, and sending the plurality of first information tables to a task queue; wherein the first information table includes metadata information of the update table.
Specifically, after the NameNode edit log data is acquired, the table information of all Hive metadata tables involved in the NameNode edit log data is checked. The table information includes file attribute information managed by the NameNode, such as the specific location of the file blocks, the file size, and the update time. Whether the metadata has changed is determined by checking the table information; if it has, the table corresponding to the changed metadata is extracted for the subsequent automatic task rerun.
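By way of non-limiting illustration, the change check and the hand-off to the task queue might be sketched as follows. The `TableInfo` record, the `changed_tables` helper, the sample paths and the use of an in-process `queue.Queue` are all assumptions made for the sketch; the embodiment does not prescribe a particular data structure or queue implementation.

```python
from dataclasses import dataclass
from queue import Queue


@dataclass(frozen=True)
class TableInfo:
    """Hypothetical record of one Hive table's file-level metadata."""
    name: str
    location: str     # HDFS path of the table (made-up sample values below)
    size_bytes: int
    update_time: int  # epoch seconds of the last modification


def changed_tables(previous, current):
    """Return entries of `current` that are new or differ from `previous`.

    Both arguments map table name -> TableInfo.
    """
    return [info for name, info in current.items() if previous.get(name) != info]


previous = {
    "A": TableInfo("A", "/warehouse/db/a", 100, 1_700_000_000),
    "B": TableInfo("B", "/warehouse/db/b", 200, 1_700_000_000),
}
current = {
    "A": TableInfo("A", "/warehouse/db/a", 100, 1_700_000_000),
    "B": TableInfo("B", "/warehouse/db/b", 260, 1_700_003_600),  # size and time differ
}

task_queue = Queue()
for info in changed_tables(previous, current):
    task_queue.put(info)  # hand the changed table to the rerun stage
```

The sketch only reports tables that are new or modified; tables that disappear from `current` would need separate handling.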
S300, constructing a first relation matrix, sequentially acquiring a first information table from a task queue, and determining rerun task information according to the first information table and the first relation matrix so as to trigger task execution; the re-running task information comprises re-running task nodes and re-running sequences.
Specifically, after the table corresponding to the changed metadata is obtained in step S200, a relation matrix of the input-output relationships between the data tables and the offline tasks in the Hadoop system and a relation matrix of the upstream-downstream relationships between the offline tasks are first constructed. The starting node of the task rerun and its associated downstream tasks are then located according to the table corresponding to the changed metadata and the constructed relation matrices. In a specific embodiment, the Hadoop system has five data tables (Table A, Table B, Table C, Table D and Table E) and three tasks (task a, task b and task c). The input-output relationships between the data tables and the tasks may be as follows: Table A is input to task a to obtain Table B; Table B and Table C are input to task c to obtain Table E; and Table E and Table B are input to task b to obtain Table D. The upstream-downstream relationships between the tasks are then that task a is the upstream task of task b and task c, and task c is the upstream task of task b, because task b consumes Table E produced by task c. When Table B is changed, the tasks that need to be rerun according to the obtained relation matrices are task b and task c.
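The five-table, three-task example can be traced with a minimal sketch. It assumes the 0/1/2 encoding described for the data-table-to-task matrix, and it assumes, as the table flow implies, that task c (which produces Table E) is upstream of task b (which consumes it). The function name `rerun_plan` and the use of a breadth-first search followed by a topological sort are illustrative choices, not the patented procedure.

```python
from collections import deque

tables = ["A", "B", "C", "D", "E"]
tasks = ["a", "b", "c"]

# table_task[i][j]: 1 if tables[i] is an input of tasks[j], 2 if an output, else 0.
table_task = [
    # a  b  c
    [1, 0, 0],  # A
    [2, 1, 1],  # B
    [0, 0, 1],  # C
    [0, 2, 0],  # D
    [0, 1, 2],  # E
]

# task_task[i][j]: 1 if tasks[i] is a direct upstream task of tasks[j].
task_task = [
    # a  b  c
    [0, 1, 1],  # a
    [0, 0, 0],  # b
    [0, 1, 0],  # c
]


def rerun_plan(changed_table):
    """Return the tasks to rerun after `changed_table` changes, upstream first."""
    row = table_task[tables.index(changed_table)]
    affected = {j for j, flag in enumerate(row) if flag == 1}  # tasks reading the table
    pending = deque(affected)
    while pending:  # collect every task downstream of a start node
        i = pending.popleft()
        for j, edge in enumerate(task_task[i]):
            if edge == 1 and j not in affected:
                affected.add(j)
                pending.append(j)
    # Topological order (Kahn's algorithm) restricted to the affected tasks.
    indegree = {j: sum(task_task[i][j] for i in affected) for j in affected}
    ready = deque(sorted(j for j in affected if indegree[j] == 0))
    order = []
    while ready:
        i = ready.popleft()
        order.append(tasks[i])
        for j in sorted(affected):
            if task_task[i][j] == 1:
                indegree[j] -= 1
                if indegree[j] == 0:
                    ready.append(j)
    return order


print(rerun_plan("B"))  # ['c', 'b']
```

Under these assumptions a change to Table B selects task b and task c, and task c is scheduled before task b because task b also depends on Table E.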
Optionally, the editing log data is analyzed to obtain a plurality of first information tables, and the specific step flow is shown in fig. 2:
S210, determining operation information according to the editing log data; wherein the operation information includes an operation data file object and an operation type including addition, deletion, and modification.
Specifically, in this embodiment, the format of the NameNode edit log data collected from the system in real time is shown in FIG. 3. The NameNode edit log data contains all operation records on data files in the HDFS file system, for example the addition and deletion of files; in FIG. 3, op_delete represents deleting a data file and op_mkdir represents creating a data folder.
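As a hedged illustration of extracting operation information, the sketch below parses a small XML sample of the kind produced when an edit log is converted to XML (for example with Hadoop's offline edits viewer). The sample records are a simplified assumption, not the format shown in FIG. 3; the opcodes are written in upper case as they appear in that XML output; and the mapping of opcodes to the add/delete/modify operation types is a hypothetical choice made for the example.

```python
import xml.etree.ElementTree as ET

# Simplified, made-up sample; real records carry more fields.
SAMPLE_EDITS_XML = """
<EDITS>
  <RECORD>
    <OPCODE>OP_MKDIR</OPCODE>
    <DATA><TXID>101</TXID><PATH>/warehouse/db/b/dt=2023-11-01</PATH></DATA>
  </RECORD>
  <RECORD>
    <OPCODE>OP_DELETE</OPCODE>
    <DATA><TXID>102</TXID><PATH>/warehouse/db/c/part-00000</PATH></DATA>
  </RECORD>
  <RECORD>
    <OPCODE>OP_START_LOG_SEGMENT</OPCODE>
    <DATA><TXID>103</TXID></DATA>
  </RECORD>
</EDITS>
"""

# Assumed mapping from edit log opcodes to the operation types named in S210.
OPERATION_TYPES = {
    "OP_ADD": "add",
    "OP_MKDIR": "add",
    "OP_DELETE": "delete",
    "OP_APPEND": "modify",
}


def parse_operations(xml_text):
    """Return [{'path': ..., 'type': ...}] for records that touch a data file path."""
    operations = []
    for record in ET.fromstring(xml_text.strip()).iter("RECORD"):
        opcode = record.findtext("OPCODE")
        path = record.findtext("DATA/PATH")
        if opcode in OPERATION_TYPES and path:
            operations.append({"path": path, "type": OPERATION_TYPES[opcode]})
    return operations


operations = parse_operations(SAMPLE_EDITS_XML)
```

Records without a mapped opcode or without a path, such as the log-segment marker in the sample, are skipped.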
S220, matching the operation information with the first set to obtain a first information table; wherein the first set includes a number of Hive metadata tables.
Specifically, in this embodiment, the first set includes a plurality of Hive metadata tables, each of which contains an HDFS file path. The Hadoop system can therefore match the operation information against the Hive metadata tables in the first set by HDFS file path, so as to obtain the Hive metadata table to which the changed data file object maps.
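One possible way to perform this path matching, offered only as a sketch, is a longest-prefix lookup from the changed file path to a table location. The `match_table` helper, the table names and the locations below are hypothetical; the embodiment states only that matching is done by HDFS file path.

```python
def match_table(path, table_locations):
    """Return the table whose location is the longest directory prefix of `path`.

    `table_locations` maps table name -> HDFS directory. Returns None if no
    table location contains the path.
    """
    best = None
    best_len = -1
    for table, location in table_locations.items():
        root = location.rstrip("/")
        inside = path == root or path.startswith(root + "/")
        if inside and len(root) > best_len:
            best, best_len = table, len(root)
    return best


# Hypothetical Hive table locations.
table_locations = {
    "db.b": "/warehouse/db/b",
    "db.c": "/warehouse/db/c",
}

matched = match_table("/warehouse/db/b/dt=2023-11-01/part-00000", table_locations)
```

Comparing against `root + "/"` rather than `root` alone prevents `/warehouse/db/bb` from being attributed to the table stored at `/warehouse/db/b`.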
Optionally, a first relation matrix is constructed, and the specific step flow is as shown in fig. 4:
S310, acquiring an offline task list and initializing a matrix set.
Specifically, the offline task list consists of the offline jobs developed and deployed by developers on the scheduling platform, and it contains the upstream-downstream dependencies between tasks and the data table input-output relationships of the tasks. Because analyzing the NameNode edit log data may yield more than one changed metadata table, a corresponding first relation matrix needs to be constructed for each changed metadata table. Each first relation matrix comprises an input-output relation matrix between the data tables and the tasks and an upstream-downstream dependency matrix between the tasks. Each time a changed metadata table is obtained, the previously constructed first relation matrix is initialized and the first relation matrix corresponding to the newly obtained changed metadata table is rebuilt; the initialization sets the first relation matrix to a zero matrix.
S320, determining a first relation table according to the offline task list.
Specifically, the first relation table includes an inter-task upstream-downstream dependency table and a task input-output table. Illustratively, the offline task list involves Table A, Table B, Table C, Table D and Table E and task a, task b and task c, where the input-output relationships between the data tables and the tasks may be as follows: Table A is input to task a to obtain Table B; Table B and Table C are input to task c to obtain Table E; and Table E and Table B are input to task b to obtain Table D. The upstream-downstream relationships between the tasks are then that task a is the upstream task of task b and task c, and task c is the upstream task of task b, because task b consumes Table E produced by task c. The inter-task upstream-downstream dependency table is used to subsequently construct the inter-task relation matrix, and the task input-output table is used to subsequently construct the relation matrix between the data tables and the tasks.
S330, determining a first relation matrix according to the first relation table and the initialized matrix set.
Specifically, in this embodiment, after the first relation table is obtained in step S320, it is traversed and the initialized matrix set is updated according to its information. Illustratively, when the relation matrix between the data tables and the tasks is constructed, the input data and output data of each task are obtained from the first relation table; setting a matrix element to 1 indicates that the data table is a data source (input) of the task, and setting it to 2 indicates that the data table is an output of the task. When the inter-task relation matrix is constructed, each task and its upstream and downstream tasks are read from the first relation table; setting a matrix element to 1 indicates that an association exists between the two tasks, and leaving it at 0 indicates that no association exists.
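The matrix update can be sketched as follows, assuming the first relation table is available as two Python dictionaries. The `build_matrices` helper and the dictionary layout are assumptions for illustration, and the sample data assumes the table flow in which task c produces Table E and task b consumes it. The encoding (1 for input, 2 for output, 1 for an inter-task association) follows the text.

```python
def build_matrices(tables, tasks, io_table, dependency_table):
    """Fill zero-initialized matrices from the first relation table.

    io_table:         task -> (input tables, output tables)
    dependency_table: upstream task -> list of direct downstream tasks
    """
    t_idx = {name: i for i, name in enumerate(tables)}
    k_idx = {name: i for i, name in enumerate(tasks)}
    table_task = [[0] * len(tasks) for _ in tables]  # zero matrix
    task_task = [[0] * len(tasks) for _ in tasks]    # zero matrix
    for task, (inputs, outputs) in io_table.items():
        for table in inputs:
            table_task[t_idx[table]][k_idx[task]] = 1  # data source of the task
        for table in outputs:
            table_task[t_idx[table]][k_idx[task]] = 2  # output of the task
    for upstream, downstreams in dependency_table.items():
        for downstream in downstreams:
            task_task[k_idx[upstream]][k_idx[downstream]] = 1
    return table_task, task_task


tables = ["A", "B", "C", "D", "E"]
tasks = ["a", "b", "c"]
io_table = {
    "a": (["A"], ["B"]),
    "b": (["B", "E"], ["D"]),
    "c": (["B", "C"], ["E"]),
}
dependency_table = {"a": ["b", "c"], "b": [], "c": ["b"]}

table_task, task_task = build_matrices(tables, tasks, io_table, dependency_table)
```

In this sketch the inter-task matrix is directional (row is upstream, column is downstream), which is one way to record the association the text describes.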
Optionally, the first relation table is determined according to the offline task list, and the specific step flow is as shown in fig. 5:
S321, initializing a first list and a second list.
Specifically, in this embodiment, more than one piece of metadata may be changed, so analyzing the NameNode edit log data may yield more than one changed metadata table. A corresponding first relation matrix needs to be constructed for each changed metadata table, and the previously constructed first relation matrix needs to be initialized each time a changed metadata table is obtained. Because the construction of the first relation matrix depends on the information in the first relation table, the first relation table also needs to be initialized each time the offline task list is obtained, and the scope of this initialization differs according to the programming language used to write the deployed offline tasks. In the step flow shown in FIG. 5, the first relation table undergoes a first initialization: because the first relation table includes the inter-task upstream-downstream dependency table and the task input-output table, the first initialization resets both the dependency table and the input-output table left over from the previous rerun.
S322, determining the upstream and downstream dependency relationships among tasks according to the offline task list, and updating the initialized first list according to those relationships to obtain an inter-task upstream and downstream dependency relationship table.
Specifically, in this embodiment, the offline task list includes the upstream-downstream dependencies between tasks and the data table input-output relationships of task execution. For the inter-task upstream-downstream dependency table, the initialized table only needs to be updated according to the dependencies between the tasks in the offline task list. Illustratively, the offline task list involves Table A, Table B, Table C, Table D and Table E, and the input-output relationships between the data tables and the tasks may be as follows: Table A is input to task a to obtain Table B; Table B and Table C are input to task c to obtain Table E; and Table E and Table B are input to task b to obtain Table D. The dependencies obtained from the offline task list are then: task a has no upstream task, and its downstream tasks are task b and task c; the upstream tasks of task b are task a and task c, and task b has no downstream task; the upstream task of task c is task a, and its downstream task is task b. The inter-task upstream-downstream dependency table is updated according to these dependencies.
S323, determining a task type according to the offline task list; if the task type is a non-SQL task type, determining a first task input-output relationship according to the offline task list, and updating the initialized second list according to the first task input-output relationship to obtain a first task input-output table.
Specifically, the task input-output table is determined from the task input-output relationships in the offline task list. The task type of each deployed offline task must also be determined, because the way the task input-output table is obtained differs by task type. When the deployed offline task is written in a language other than SQL, the task input-output table can be updated directly according to the task input-output relationships in the offline task list, or configured manually according to those relationships, to obtain the first task input-output table.
S324, determining a first relation table according to the inter-task upstream and downstream dependency relation table and the first task input and output table.
Specifically, after the inter-task upstream and downstream dependency relationship table and the first task input and output table are obtained in step S322 and step S323, the two may be subjected to operations such as stacking or splicing to obtain the first relationship table.
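As one reading of the "stacking or splicing" operation, the sketch below joins the two per-task tables into a single row per task. The row layout, the `build_relation_table` helper and the sample data (which assume task c is upstream of task b, as the table flow implies) are hypothetical; the embodiment does not fix a concrete schema for the first relation table.

```python
def build_relation_table(dependency_table, io_table):
    """Splice the dependency table and the input-output table into one row per task.

    dependency_table: task -> {"upstream": [...], "downstream": [...]}
    io_table:         task -> (input tables, output tables)
    """
    rows = []
    for task in sorted(set(dependency_table) | set(io_table)):
        deps = dependency_table.get(task, {})
        inputs, outputs = io_table.get(task, ([], []))
        rows.append({
            "task": task,
            "upstream": list(deps.get("upstream", [])),
            "downstream": list(deps.get("downstream", [])),
            "inputs": list(inputs),
            "outputs": list(outputs),
        })
    return rows


dependency_table = {
    "a": {"upstream": [], "downstream": ["b", "c"]},
    "b": {"upstream": ["a", "c"], "downstream": []},
    "c": {"upstream": ["a"], "downstream": ["b"]},
}
io_table = {
    "a": (["A"], ["B"]),
    "b": (["B", "E"], ["D"]),
    "c": (["B", "C"], ["E"]),
}

relation_table = build_relation_table(dependency_table, io_table)
```

Keeping both kinds of information in one row per task means a single traversal of the relation table is enough to fill both relation matrices.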
Optionally, determining the first relationship table according to the offline task list further includes the step flow shown in fig. 6:
S325, initializing a third list.
Specifically, in this embodiment, analyzing the NameNode edit log data may yield more than one changed metadata table. A corresponding first relation matrix needs to be constructed for each changed metadata table, and the previously constructed first relation matrix needs to be initialized each time a changed metadata table is obtained. Because the construction of the first relation matrix depends on the information in the first relation table, the first relation table also needs to be initialized each time the offline task list is obtained, and the scope of this initialization differs according to the programming language used to write the deployed offline tasks. In the step flow shown in FIG. 6, the first relation table undergoes a second initialization: the first relation table includes the inter-task upstream-downstream dependency table and the task input-output table, and the second initialization resets the dependency table left over from the previous rerun and discards or deletes the input-output table left over from the previous rerun.
S326, determining the upstream and downstream dependency relationship among the tasks according to the offline task list, and updating the initialized third list according to the upstream and downstream dependency relationship among the tasks to obtain an upstream and downstream dependency relationship table among the tasks.
Specifically, in this embodiment, the offline task list includes the upstream and downstream dependency relationships between tasks and the data table input-output relationships of task execution. For the inter-task upstream and downstream dependency relationship table, it is only necessary to update the initialized table according to the upstream and downstream dependency relationships between tasks in the offline task list. For example, the offline task list involves table A, table B, table C, table D and table E, and the input-output relationships between data tables and tasks may be that table A is input to obtain table B, table B and table C are input to obtain table E, and table E and table B are input to obtain table D. The upstream and downstream dependency relationships between tasks obtained from the offline task list are then, as stated in the source example, that the downstream tasks are task b and task c, the upstream and downstream tasks of task b are task a and task c, the upstream task of task c is task b, and the downstream task of task c is task b; the initialized inter-task upstream and downstream dependency relationship table is updated according to these relationships.
S327, determining a task type according to the offline task list, and if the task type is an SQL task type, analyzing the offline task list with a blood-relationship (data lineage) analysis tool to obtain a second task input-output table.
Specifically, the task input-output table is determined according to the task input-output relationships in the offline task list, and the task type of each deployed offline task must also be judged, because the operation used to determine the task input-output table differs by task type. When the deployed offline task is written in SQL, the task input-output table cannot be updated directly according to the task input-output relationships in the offline task list, nor configured manually according to those relationships. Instead, the offline task list needs to be input into a lineage analysis tool, which analyzes the input offline tasks to obtain the corresponding second task input-output table. Lineage analysis tools include Datablau, SQLLineage and the like; one may be selected according to the specific application scenario, and this embodiment does not specifically limit the choice.
S328, determining a first relation table according to the inter-task upstream and downstream dependency relation table and the second task input and output table.
Specifically, after the upstream and downstream dependency relationship table between tasks and the second task input and output table are obtained in step S326 and step S327, the two may be overlapped or spliced to obtain the first relationship table.
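To make the SQL branch concrete, the sketch below extracts input and output tables from a single statement. It is deliberately a toy regular-expression stand-in and not one of the lineage tools named above; it only handles simple `INSERT ... SELECT` statements, and the function name and sample SQL are assumptions made for illustration.

```python
import re


def toy_sql_lineage(sql):
    """Rough stand-in for a lineage tool (assumption: simple INSERT ... SELECT only)."""
    # Table written by the statement: INSERT INTO t / INSERT OVERWRITE TABLE t.
    outputs = re.findall(
        r"insert\s+(?:into|overwrite)\s+(?:table\s+)?([\w.]+)", sql, flags=re.I
    )
    # Tables read by the statement: anything following FROM or JOIN.
    inputs = re.findall(r"(?:from|join)\s+([\w.]+)", sql, flags=re.I)
    return {
        "inputs": sorted(set(inputs) - set(outputs)),
        "outputs": sorted(set(outputs)),
    }


sql = (
    "INSERT OVERWRITE TABLE table5 "
    "SELECT a.id, b.v FROM table3 a JOIN table4 b ON a.id = b.id"
)
io_entry = toy_sql_lineage(sql)
print(io_entry)
```

A real lineage tool parses the SQL grammar and therefore also copes with subqueries, common table expressions and comments, which this sketch does not attempt.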
Optionally, the first relation matrix is determined according to the first relation table and the initialized matrix set, and the specific step flow is as shown in fig. 7:
S331, determining a first matrix and a second matrix according to the initialized matrix set.
Specifically, in this embodiment, the first relationship table includes an inter-task upstream and downstream dependency relationship table and a task input-output table, and the first relationship matrix corresponds to the first relationship table. The first relationship matrix includes a data table and task relationship matrix and an inter-task relationship matrix: the data table and task relationship matrix corresponds to the task input-output table, and the inter-task relationship matrix corresponds to the inter-task upstream and downstream dependency relationship table. Similar to the process of determining the first relationship table, determining the first relationship matrix requires first initializing it to obtain the initialized data table and task relationship matrix and the initialized inter-task relationship matrix, and then reconstructing both matrices according to the first relationship table obtained in steps S321-S328.
S332, updating the first matrix according to the first relation table to obtain the data table and task relation matrix, and updating the second matrix according to the first relation table to obtain the inter-task relation matrix.
Specifically, in this embodiment, the first matrix and the second matrix obtained from the initialized matrix set in step S331 are both zero matrices at this point, and their elements are updated according to the first relationship table obtained in steps S321-S328. For example, the task input-output table in the obtained first relationship table states that the data sources of task 1 are table 1 and table 2, the output source of task 1 is table 4, and there is no association between task 1 and tables 3 and 5; the data sources of task 2 are table 3 and table 4, the output source of task 2 is table 5, and there is no association between task 2 and tables 1 and 2. An element of 0 indicates that no association exists between the data table and the task, 1 indicates that the data table is a data source of the task, and 2 indicates that the data table is an output source of the task; the rows of the matrix represent data tables and the columns represent tasks. The matrix obtained according to the task input-output table is

$$\begin{pmatrix} 1 & 0 \\ 1 & 0 \\ 0 & 1 \\ 2 & 1 \\ 0 & 2 \end{pmatrix}$$

The inter-task upstream and downstream dependency relationship table in the obtained first relationship table states that task 1 and task 2 have no upstream task, the downstream task of task 1 is task 3, the downstream tasks of task 2 are task 3 and task 4, and the downstream task of task 3 is task 4. In the second matrix, an element of 1 indicates that an association exists between two tasks and 0 indicates that none exists. Consistent with the worked example, an element of 1 in row i and column j means that task j is a downstream task of task i, that is, task i is an upstream task of task j; for example, a 1 in the second row and third column of the inter-task relation matrix indicates that the downstream task of task 2 is task 3, or equivalently that the upstream task of task 3 is task 2. The inter-task relation matrix obtained according to the inter-task upstream and downstream dependency relation table is

$$\begin{pmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \end{pmatrix}$$
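The update of the two zero matrices can be sketched as follows, using the example values from the paragraph above. The function names and the dictionary layout are assumptions made for this illustration. The inter-task matrix follows the worked example in the text, in which a 1 in row i, column j means that task j is a downstream task of task i.

```python
def build_table_task_matrix(io_table, tables, tasks):
    """Rows are data tables, columns are tasks: 0 = no link, 1 = source, 2 = output."""
    matrix = [[0] * len(tasks) for _ in tables]  # start from the zero matrix
    for task, io in io_table.items():
        col = tasks.index(task)
        for table in io["inputs"]:
            matrix[tables.index(table)][col] = 1
        for table in io["outputs"]:
            matrix[tables.index(table)][col] = 2
    return matrix


def build_task_task_matrix(downstream, tasks):
    """Entry [i][j] is 1 when tasks[j] is a downstream task of tasks[i]."""
    matrix = [[0] * len(tasks) for _ in tasks]  # start from the zero matrix
    for task, children in downstream.items():
        for child in children:
            matrix[tasks.index(task)][tasks.index(child)] = 1
    return matrix


io_table = {
    "task1": {"inputs": ["table1", "table2"], "outputs": ["table4"]},
    "task2": {"inputs": ["table3", "table4"], "outputs": ["table5"]},
}
tables = ["table1", "table2", "table3", "table4", "table5"]
table_task = build_table_task_matrix(io_table, tables, ["task1", "task2"])

downstream = {"task1": ["task3"], "task2": ["task3", "task4"], "task3": ["task4"]}
task_task = build_task_task_matrix(downstream, ["task1", "task2", "task3", "task4"])

# The pair of matrices plays the role of the first relation matrix.
first_relation_matrix = {"table_task": table_task, "task_task": task_task}
```

Plain nested lists are used so the sketch has no dependencies; a sparse representation may be preferable when the task list is large.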
S333, determining a first relation matrix according to the relation matrix between the data table and the task and the relation matrix between the tasks.
Specifically, the data table and task relationship matrix and the inter-task relationship matrix obtained through steps S331 to S332 are taken together as a matrix set, and this matrix set is the first relationship matrix.
Optionally, the re-running task information is determined according to the first information table and the first relation matrix, and the specific step flow is as shown in fig. 8:
S341, determining table information according to the first information table; the table information comprises a data table name, position information of the data table, size information of the data table and update time information of the data table.
Specifically, after analyzing the NameNode editing log data of the system acquired in real time through steps S210-S220, a first information table containing data table information corresponding to the metadata of the change is obtained, the first information table is analyzed in real time, and the data table information corresponding to the metadata of the change is obtained, such as a data table name, a data table update time and the like.
S342, matching the table information with the first relation matrix, determining task nodes and corresponding associated task nodes, and taking the task nodes and the corresponding associated task nodes as the re-running task information.
Specifically, in this embodiment, the obtained data table information corresponding to the changed metadata is matched against the data table and task relationship matrix in the first relationship matrix to locate the specific data table, and the target task to be re-run is determined according to the input-output relationship between that data table and the tasks; for example, a task that uses the data table corresponding to the changed metadata as a data source is determined as a target task. After the target task is determined, it is matched against the inter-task relationship matrix, and tasks associated with the target task are also taken as target tasks to be re-run; for example, a task whose upstream task is the target task is also taken as a target task to be re-run. The re-running operations are then performed in sequence according to the upstream and downstream relationships between the target tasks to be re-run.
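The matching step can be sketched as a lookup in the data table and task matrix followed by a traversal of the inter-task matrix. This is only an illustrative sketch: the function name is hypothetical, both matrices are assumed to share one task list, the inter-task matrix is assumed to have a 1 in row i, column j when task j is downstream of task i, and a topological ordering is used as one plausible reading of "in sequence according to the upstream and downstream relationships".

```python
from collections import deque


def rerun_plan(changed_table, tables, tasks, table_task, task_task):
    """Return the tasks to re-run, ordered so that upstream tasks come first."""
    # 1. Tasks that read the changed table (element value 1 = data source).
    row = table_task[tables.index(changed_table)]
    selected = {j for j, value in enumerate(row) if value == 1}
    # 2. Add every task reachable downstream of those tasks.
    frontier = deque(selected)
    while frontier:
        i = frontier.popleft()
        for j, linked in enumerate(task_task[i]):
            if linked == 1 and j not in selected:
                selected.add(j)
                frontier.append(j)
    # 3. Order the selected tasks topologically (Kahn's algorithm).
    indegree = {j: sum(task_task[i][j] for i in selected) for j in selected}
    ready = deque(sorted(j for j in selected if indegree[j] == 0))
    order = []
    while ready:
        i = ready.popleft()
        order.append(tasks[i])
        for j in sorted(selected):
            if task_task[i][j] == 1:
                indegree[j] -= 1
                if indegree[j] == 0:
                    ready.append(j)
    return order


tables = ["table1", "table2", "table3", "table4", "table5"]
tasks = ["task1", "task2", "task3", "task4"]
# Hypothetical combined example: task3 and task4 have no configured tables.
table_task = [[1, 0, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0], [2, 1, 0, 0], [0, 2, 0, 0]]
task_task = [[0, 0, 1, 0], [0, 0, 1, 1], [0, 0, 0, 1], [0, 0, 0, 0]]

plan_for_table3 = rerun_plan("table3", tables, tasks, table_task, task_task)
plan_for_table1 = rerun_plan("table1", tables, tasks, table_task, task_task)
```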
In a specific embodiment, as shown in fig. 9, a Hadoop NameNode metadata information detection system is constructed. The NameNode edit log in the Hadoop system is collected in real time and parsed to determine whether any metadata has changed. If changed metadata exists, the detection system obtains the operated file object and the operation type recorded in the edit log. For example, the edit log records all operations on the HDFS file system, including adding, deleting and modifying data files in the file system; specifically, in the HDFS file system, OP_DELETE represents the deletion of a data file and OP_MKDIR represents the creation of a folder. When the NameNode edit log is parsed and it is detected, for example, that data table 2 has been operated on, the operation being the creation of data table 2, a mapping operation is performed on the operated file object to obtain its specific updated table information. For example, according to the HDFS file path of the operated file object obtained by parsing, the metadata table to which the file object maps is found in the Hive table set; this metadata table and its table information are taken as the update table and table information and are sent to a message queue.

The detection system sequentially acquires the update table and table information from the message queue and, for each acquired item, parses out the specific update table information, such as the table name and the update time of the update table. The re-running tasks are determined from the table name of the update table and the relationship between data tables and tasks, and the execution order of different re-running tasks is determined from the update time: a re-running task with an earlier update time is executed first and one with a later update time is executed later, which keeps the execution of re-running tasks real-time. Then, according to the parsed update table information and the relation matrix constructed by the detection system, the task node that needs to trigger re-running and the downstream task nodes that also need to be re-run are determined, so that the Hadoop system executes the task re-run.

Specifically, the relation matrix of the detection system is constructed through the step flows shown in fig. 10 and fig. 11. The detection system holds a task list, a data table list, an inter-task relation matrix and a data table and task relation matrix. To construct the inter-task relation matrix, the detection system first initializes the task list and initializes the inter-task relation matrix to a zero matrix. In this matrix, 0 indicates that no association exists between two tasks and 1 indicates that an association exists; consistent with the worked example, a 1 in row i and column j means that task j is a downstream task of task i, so a 1 in the second row and third column indicates that the downstream task of task 2 is task 3, or equivalently that the upstream task of task 3 is task 2. The detection system then obtains the offline tasks edited by developers, the upstream and downstream dependency relationships between the offline tasks are edited on a scheduling platform, and the inter-task relation matrix is constructed according to the edited dependency relationships. For example, the offline task list involves table A, table B, table C, table D and table E, and the input-output relationships between data tables and tasks are, as stated in the source example, that table A is input to task a, table B and table C are input to task c to obtain table B, and table D and table B are input to task c to obtain table D; the upstream and downstream dependency relationships between tasks obtained from the offline task list are that there are no upstream tasks, the downstream tasks of task a are task b and task c, the upstream tasks of task b are task a and task c, and the downstream task of task c is task a. The inter-task relation matrix is constructed from these relationships [matrix not reproduced in the source text].

To construct the data table and task relation matrix, the detection system first initializes the task list and the data table list, and initializes the data table and task relation matrix between the task list and the data table list to a zero matrix. In this matrix, 0 indicates that no association exists between the data table and the task, 1 indicates that the data table is a data source of the task, and 2 indicates that the data table is an output source of the task; the rows of the matrix represent data tables and the columns represent tasks. The detection system then acquires an edited offline task and judges whether it is of the SQL task type: if so, the task input-output table of the offline task is parsed with a lineage analysis tool; if not, the input-output table of the offline task is configured manually. The data table and task relation matrix is then configured according to the obtained task input-output table. For example, the offline task list involves table 1, table 2, table 3, table 4 and table 5 and includes task 1 and task 2; through manual configuration or the lineage analysis tool it is obtained that the data sources of task 1 are table 1 and table 2, the output source of task 1 is table 4, and no association exists between task 1 and tables 3 and 5, while the data sources of task 2 are table 3 and table 4, the output source of task 2 is table 5, and no association exists between task 2 and tables 1 and 2. The constructed data table and task relation matrix is

$$\begin{pmatrix} 1 & 0 \\ 1 & 0 \\ 0 & 1 \\ 2 & 1 \\ 0 & 2 \end{pmatrix}$$
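The collection side of this embodiment, from parsed edit-log records to ordered update-table messages, can be sketched as below. This is an assumption-laden illustration rather than the described system: the record fields (`opcode`, `path`, `timestamp`), the path-to-table mapping, the table names and the function names are all hypothetical, Python's in-process `queue.Queue` stands in for the message queue, and only the two operation codes named in the text are watched.

```python
import queue

# Hypothetical mapping from Hive table storage locations to table names.
HIVE_TABLE_LOCATIONS = {
    "/warehouse/demo.db/table1": "demo.table1",
    "/warehouse/demo.db/table2": "demo.table2",
}

# Operation codes mentioned in the text; a real deployment would watch more.
WATCHED_OPCODES = {"OP_DELETE", "OP_MKDIR"}


def map_record_to_table(record, locations):
    """Map one parsed edit-log record to update-table information, or None."""
    if record["opcode"] not in WATCHED_OPCODES:
        return None
    for location, table in locations.items():
        # A file object belongs to a table if it is, or lies under, its location.
        if record["path"] == location or record["path"].startswith(location + "/"):
            return {"table": table, "location": location,
                    "operation": record["opcode"],
                    "update_time": record["timestamp"]}
    return None


def publish_changes(records, locations, message_queue):
    """Send update-table information for every relevant record to the queue."""
    for record in records:
        info = map_record_to_table(record, locations)
        if info is not None:
            message_queue.put(info)


records = [
    {"opcode": "OP_MKDIR", "path": "/warehouse/demo.db/table2/dt=1", "timestamp": 200},
    {"opcode": "OP_DELETE", "path": "/warehouse/demo.db/table1/part-0", "timestamp": 100},
    {"opcode": "OP_OTHER", "path": "/warehouse/demo.db/table1/x", "timestamp": 50},
    {"opcode": "OP_MKDIR", "path": "/tmp/scratch", "timestamp": 10},
]
mq = queue.Queue()
publish_changes(records, HIVE_TABLE_LOCATIONS, mq)

# Consumer side: drain the queue and handle earlier updates first.
updates = []
while not mq.empty():
    updates.append(mq.get())
updates.sort(key=lambda u: u["update_time"])
```

Each drained message would then be matched against the relation matrices to select the tasks to re-run.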
The embodiment of the invention has the following beneficial effects: firstly, editing log data of a system is obtained in real time, then the editing log data is analyzed, a data file object which is updated in operation is determined, a specific data table and corresponding table information thereof are obtained according to the data file object which is updated in operation, the specific data table and the corresponding table information thereof comprise names, positions, sizes and the like of the updated data table, then the determined specific data table and the corresponding table information thereof are sent to a task queue, a first relation matrix is constructed through an edited offline task list, then the specific data table and the corresponding table information thereof are sequentially read from the task queue, the obtained specific data table and the corresponding table information thereof are matched with the constructed first relation matrix, and task nodes which need to trigger execution of running again and corresponding downstream associated task nodes in the task list are determined so as to execute task running again operation; constructing a first relation matrix through the edited offline task list, analyzing the edited log data, determining data table metadata information change caused by service caliber change or upstream data source change, determining a rerun task node and an associated downstream task node through the changed data table metadata information and the first relation matrix, and carrying out task rerun; the update table information is sent to the queue, and the execution of the task rerun is carried out by reading the queue, so that the real-time performance of the execution of the task rerun is improved; and a large amount of manual starting task execution is not needed, so that the labor and time cost caused by manual intervention is reduced, and the cost reduction and efficiency improvement are realized.
As shown in fig. 12, the embodiment of the present invention further provides a task automatic rerun system, which includes:
the first module is used for acquiring editing log data; wherein the edit log data includes an operation record of a data file;
the second module is used for analyzing the editing log data to obtain a plurality of first information tables and sending the plurality of first information tables to a task queue; wherein the first information table includes updated table metadata information;
the third module is used for constructing a first relation matrix, sequentially acquiring the first information table from the task queue, and determining the re-running task information according to the first information table and the first relation matrix so as to trigger task execution; the re-running task information comprises re-running task nodes and re-running sequences.
It can be seen that the foregoing method embodiments are applicable to the system embodiment, and the functions specifically implemented by the system embodiment are the same as those of the foregoing method embodiment, and the achieved beneficial effects are the same as those of the foregoing method embodiment.
As shown in fig. 13, the embodiment of the present invention further provides a task automatic rerun device, which includes:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the task automatic re-running method steps of the previous method embodiments.
Wherein the memory is operable as a non-transitory computer readable storage medium storing a non-transitory software program and a non-transitory computer executable program. The memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes remote memory provided remotely from the processor, the remote memory being connectable to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
It can be seen that the foregoing method embodiments are applicable to the apparatus embodiment, and the functions specifically implemented by the apparatus embodiment are the same as those of the foregoing method embodiment, and the advantages achieved by the apparatus embodiment are the same as those achieved by the foregoing method embodiment.
Furthermore, embodiments of the present application disclose a computer program product or a computer program, which is stored in a computer readable storage medium. The computer program may be read from a computer readable storage medium by a processor of a computer device, the processor executing the computer program causing the computer device to perform the method as described above. Similarly, the content in the above method embodiment is applicable to the present storage medium embodiment, and the specific functions of the present storage medium embodiment are the same as those of the above method embodiment, and the achieved beneficial effects are the same as those of the above method embodiment.
The present invention also provides a computer-readable storage medium storing a processor-executable program for implementing the method of the previous method embodiment when executed by a processor.
It is to be understood that all or some of the steps, systems, and methods disclosed above may be implemented in software, firmware, hardware, and suitable combinations thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, as known to those skilled in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. Furthermore, as is well known to those of ordinary skill in the art, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media.
While the preferred embodiment of the present invention has been described in detail, the invention is not limited to the embodiment, and various equivalent modifications and substitutions can be made by those skilled in the art without departing from the spirit of the invention, and these modifications and substitutions are intended to be included in the scope of the present invention as defined in the appended claims.

Claims (10)

1. The automatic task rerun method is characterized by comprising the following steps:
acquiring editing log data; wherein the edit log data includes an operation record of a data file;
analyzing the editing log data to obtain a plurality of first information tables, and sending the plurality of first information tables to a task queue; wherein the first information table includes metadata information of an update table;
constructing a first relation matrix, sequentially acquiring the first information table from the task queue, and determining re-running task information according to the first information table and the first relation matrix so as to trigger task execution; the re-running task information comprises re-running task nodes and re-running sequences.
2. The method according to claim 1, wherein the analyzing the edit log data to obtain a plurality of first information tables specifically includes:
determining operation information according to the editing log data; wherein the operation information comprises an operation data file object and an operation type, and the operation type comprises addition, deletion and modification;
matching the operation information with the first set to obtain a first information table; wherein the first set includes a number of Hive metadata tables.
3. The method according to claim 1, wherein said constructing a first relationship matrix comprises:
acquiring an offline task list and initializing a matrix set;
determining a first relation table according to the offline task list;
and determining a first relation matrix according to the first relation table and the initialized matrix set.
4. A method according to claim 3, wherein said determining a first relationship table from said offline task list comprises:
initializing a first list and a second list;
determining upstream and downstream dependency relationships among tasks according to the offline task list, and updating the initialized first list according to the upstream and downstream dependency relationships among the tasks to obtain an upstream and downstream dependency relationship table among the tasks;
determining a task type according to the offline task list, if the task type is a non-SQL task type, determining a first task input-output relationship according to the offline task list, and updating the initialized second list according to the first task input-output relationship to obtain a first task input-output table;
and determining a first relation table according to the inter-task upstream and downstream dependency relation table and the first task input and output table.
5. A method according to claim 3, wherein said determining a first relationship table from said offline task list comprises:
initializing a third list;
determining upstream and downstream dependency relationships among tasks according to the offline task list, and updating the initialized third list according to the upstream and downstream dependency relationships among the tasks to obtain an upstream and downstream dependency relationship table among the tasks;
determining a task type according to the offline task list, and if the task type is an SQL task type, analyzing the offline task list according to a blood-relationship (data lineage) analysis tool to obtain a second task input-output table;
and determining a first relation table according to the inter-task upstream and downstream dependency relation table and the second task input and output table.
6. A method according to claim 3, wherein said determining a first relationship matrix from said first relationship table and said initialized set of matrices comprises:
determining a first matrix and a second matrix according to the initialized matrix set;
updating the first matrix according to the first relation table to obtain a data table and a task relation matrix, and updating the second matrix according to the first relation table to obtain an inter-task relation matrix;
and determining a first relation matrix according to the relation matrix between the data table and the task and the relation matrix between the tasks.
7. The method according to claim 1, wherein the determining the rerun task information according to the first information table and the first relation matrix specifically includes:
determining table information according to the first information table; the table information comprises a data table name, position information of the data table, size information of the data table and update time information of the data table;
and matching the table information with the first relation matrix, determining task nodes and corresponding associated task nodes, and taking the task nodes and the corresponding associated task nodes as the re-running task information.
8. A task automatic rerun system, comprising:
the first module is used for acquiring editing log data; wherein the edit log data includes an operation record of a data file;
the second module is used for analyzing the editing log data to obtain a plurality of first information tables and sending the plurality of first information tables to a task queue; wherein the first information table includes updated table metadata information;
the third module is used for constructing a first relation matrix, sequentially acquiring the first information table from the task queue, and determining the re-running task information according to the first information table and the first relation matrix so as to trigger task execution; the re-running task information comprises re-running task nodes and re-running sequences.
9. A task automatic rerun device, comprising:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method of any of claims 1-7.
10. A computer readable storage medium, in which a processor executable program is stored, characterized in that the processor executable program is for performing the method according to any of claims 1-7 when being executed by a processor.
CN202311577562.8A 2023-11-23 2023-11-23 Automatic task rerun method, system, device and storage medium Pending CN117608780A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311577562.8A CN117608780A (en) 2023-11-23 2023-11-23 Automatic task rerun method, system, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311577562.8A CN117608780A (en) 2023-11-23 2023-11-23 Automatic task rerun method, system, device and storage medium

Publications (1)

Publication Number Publication Date
CN117608780A true CN117608780A (en) 2024-02-27

Family

ID=89945678

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311577562.8A Pending CN117608780A (en) 2023-11-23 2023-11-23 Automatic task rerun method, system, device and storage medium

Country Status (1)

Country Link
CN (1) CN117608780A (en)

Similar Documents

Publication Publication Date Title
CN111522816B (en) Data processing method, device, terminal and medium based on database engine
US10210233B2 (en) Automated identification of complex transformations and generation of subscriptions for data replication
CN106557470B (en) Data extraction method and device
US9043651B2 (en) Systematic failure remediation
CN109739828B (en) Data processing method and device and computer readable storage medium
CN110442371B (en) Method, device and medium for releasing codes and computer equipment
CN113204598B (en) Data synchronization method, system and storage medium
WO2022012327A1 (en) Code analysis method and system, and computing device
CN112214221A (en) Method and equipment for constructing Linux system
CN116126950A (en) Real-time materialized view system and method
CN114020840A (en) Data processing method, device, server, storage medium and product
US11657069B1 (en) Dynamic compilation of machine learning models based on hardware configurations
US11636124B1 (en) Integrating query optimization with machine learning model prediction
CN111753141B (en) Data management method and related equipment
JP6336919B2 (en) Source code review method and system
CN116821098A (en) Data warehouse management method, service system and storage medium
CN115391457B (en) Cross-database data synchronization method, device and storage medium
CN117608780A (en) Automatic task rerun method, system, device and storage medium
CN113792026B (en) Method and device for deploying database script and computer-readable storage medium
CN110221952B (en) Service data processing method and device and service data processing system
CN114490509A (en) Tracking change data capture log history
CN113127056B (en) Information processing method, device, equipment and readable storage medium
CN117407362B (en) Method and device for file migration among heterogeneous file systems
US11461355B1 (en) Ontological mapping of data
CN113741950A (en) Code synchronization method, server and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20240401

Address after: Unit 1, Building 1, China Telecom Zhejiang Innovation Park, No. 8 Xiqin Street, Wuchang Street, Yuhang District, Hangzhou City, Zhejiang Province, 311100

Applicant after: Tianyi Shilian Technology Co.,Ltd.

Country or region after: China

Address before: 200000 room 1423, No. 1256 and 1258, Wanrong Road, Jing'an District, Shanghai

Applicant before: Tianyi Digital Life Technology Co.,Ltd.

Country or region before: China
