CN116069539A - Fault processing method, device and computer readable storage medium - Google Patents

Fault processing method, device and computer readable storage medium

Info

Publication number
CN116069539A
CN116069539A (application number CN202310160864.9A)
Authority
CN
China
Prior art keywords
fault
task
person
responsibility
change
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310160864.9A
Other languages
Chinese (zh)
Inventor
汪盼
赵卫
顾超
张朝辉
陆刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Avatr Technology Chongqing Co Ltd
Original Assignee
Avatr Technology Chongqing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Avatr Technology Chongqing Co Ltd filed Critical Avatr Technology Chongqing Co Ltd
Priority to CN202310160864.9A
Publication of CN116069539A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/079Root cause analysis, i.e. error or fault diagnosis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0766Error or fault reporting or storing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0793Remedial or corrective actions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25Integrating or interfacing systems involving database management systems
    • G06F16/254Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses

Abstract

The application relates to the field of big data processing, and discloses a fault processing method, a fault processing device and a computer readable storage medium. The method comprises the following steps: when a data warehouse task fails, acquiring a tool log and a task running log, wherein the tool log is the work log of the big data tool device that executes the data warehouse task in a big data processing system, and the task running log is the work log of the big data platform that executes the data warehouse task in the big data processing system; invoking a target fault classification model to process the tool log and the task running log to obtain a fault category, the fault category being a tool-side fault, a bin-side fault (i.e. a fault on the data warehouse side) or a platform-side fault; and determining a fault responsible person according to the fault category and sending alarm information to the fault responsibility terminal, the fault responsibility terminal being the terminal corresponding to the fault responsible person and the alarm information instructing the fault responsible person to handle the fault of the data warehouse task. Applying the technical scheme of the invention can improve fault handling efficiency.

Description

Fault processing method, device and computer readable storage medium
Technical Field
The embodiment of the invention relates to the technical field of big data, in particular to a fault processing method, a fault processing device and a computer readable storage medium.
Background
With the development of the big data industry, each company has a large number of ETL (Extract-Transform-Load) tasks to maintain; in particular, when a data fault occurs, an alarm needs to be sent to the relevant responsible person for quick repair. However, the execution link of an ETL task is relatively long and involves the data warehouse, the big data platform and the big data tool. As a result, the fault alarm of an ETL task is often sent to a person who is neither responsible for the actual fault nor able to resolve it, and the possible fault responsible persons have to be judged layer by layer manually and then contacted, which wastes a great deal of manpower and communication cost and leads to low fault resolution efficiency.
Disclosure of Invention
In view of the above problems, embodiments of the present invention provide a fault processing method, apparatus and computer readable storage medium, which are used to solve the prior-art problems that the fault responsible person is difficult to determine and that fault resolution efficiency is low.
In a first aspect, the present application provides a fault handling method, the method comprising: when a data warehouse task fails, acquiring a tool log and a task running log, wherein the tool log is the work log of the big data tool device that executes the data warehouse task in a big data processing system, and the task running log is the work log of the big data platform that executes the data warehouse task in the big data processing system; inputting the tool log and the task running log into a target fault classification model to obtain a fault category, the fault category being any one of the following: a tool-side fault, a bin-side fault, a platform-side fault; and determining a fault responsible person according to the fault category and sending alarm information to a fault responsibility terminal, the fault responsibility terminal being the terminal corresponding to the fault responsible person, and the alarm information being used to instruct the fault responsible person to handle the fault of the data warehouse task.
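To make the first-aspect flow concrete, the sketch below chains the three steps together in Python. It is only an illustrative outline, not the claimed implementation: every helper name (collect_tool_log, collect_task_run_log, classify, resolve_responsible, notify) and the category labels 'tool'/'bin'/'platform' are assumptions introduced here.

```python
from typing import Callable, Dict, Tuple

def handle_warehouse_task_fault(
    task_id: str,
    collect_tool_log: Callable[[str], str],        # work log of the big data tool device
    collect_task_run_log: Callable[[str], str],    # work log of the big data platform
    classify: Callable[[str, str], str],           # returns 'tool', 'bin' or 'platform'
    resolve_responsible: Callable[[str, str], Tuple[str, str]],  # (category, task) -> (person, terminal)
    notify: Callable[[str, Dict], None],           # send alarm information to a terminal
) -> str:
    """Acquire the two logs, classify the fault, and alert the responsible terminal."""
    tool_log = collect_tool_log(task_id)
    task_run_log = collect_task_run_log(task_id)
    category = classify(tool_log, task_run_log)            # tool-side / bin-side / platform-side
    person, terminal = resolve_responsible(category, task_id)
    notify(terminal, {"task": task_id, "category": category, "person": person})
    return person
```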
In one possible design manner of the first aspect, according to the fault category, determining a fault responsibility person and sending alarm information to the fault responsibility terminal, including: if the fault type is a tool side fault, determining a tool responsibility person corresponding to the big data tool equipment and communication information of the tool responsibility person according to a responsibility table of the big data tool equipment; the responsibility table comprises communication information of a plurality of tool managers and the corresponding relation between the tool managers and time periods responsible for big data tool equipment; and determining the tool responsibility person as a fault responsibility person, and sending alarm information to the tool responsibility terminal by utilizing the communication information of the tool responsibility person.
In one possible design manner of the first aspect, determining the fault responsible person according to the fault category and sending alarm information to the fault responsibility terminal includes: if the fault category is a bin-side fault, acquiring a first change record of a first subtask of the data warehouse task, the first change record being a version change record or a table change record of the first subtask; and if it is determined from the first change record that the first subtask has a task version change or a task table change, determining the latest first change person corresponding to the first subtask as the fault responsible person and sending alarm information to the first change terminal corresponding to the first change person.
In one possible design manner of the first aspect, after the first change record of the first subtask of the data warehouse task is acquired, the method further includes: if it is determined from the first change record that the first subtask has no task version change or task table change, acquiring a second change record of a second subtask, the second subtask being an upstream task of the first subtask in the data warehouse task and the second change record being a version change record or a table change record of the second subtask; if it is determined from the second change record that the second subtask has a task version change or a task table change, determining the second change person corresponding to the latest version change of the second subtask as the fault responsible person and sending alarm information to the second change terminal corresponding to the second change person; and if it is determined from the second change record that the second subtask has no task version change or task table change, determining the latest first change person corresponding to the first subtask as the fault responsible person and sending alarm information to the first change terminal corresponding to the first change person.
In one possible design manner of the first aspect, determining the fault responsible person according to the fault category and sending alarm information to the fault responsibility terminal includes: if the fault category is a platform-side fault, acquiring cluster running state parameters of the big data platform; if the cluster running state parameters indicate that a target component of the big data platform is abnormal, determining the component responsible person corresponding to the target component as the fault responsible person and sending alarm information to the component responsibility terminal corresponding to the component responsible person; and if the cluster running state parameters indicate that the configuration parameters of the big data platform have changed, determining the platform responsible person corresponding to the big data platform as the fault responsible person and sending alarm information to the platform responsibility terminal corresponding to the platform responsible person.
In one possible design manner of the first aspect, after the cluster running state parameters of the big data platform are acquired, the method further includes: if the cluster running state parameters indicate that no component of the big data platform is abnormal and the configuration parameters of the big data platform have not changed, determining the latest first change person corresponding to the first subtask of the data warehouse task as the fault responsible person and sending alarm information to the first change terminal corresponding to the first change person.
In one possible design manner of the first aspect, before the target fault classification model is invoked to process the tool log and the task running log, the method further includes: acquiring a plurality of groups of sample data and sample categories corresponding to the plurality of groups of sample data one by one; the sample data are tool logs and task running logs when the big data processing system has a data warehouse task fault; the sample class corresponding to the sample data is the fault class of the sample data; and taking the sample data as training data, taking the sample category as supervision information, and iteratively training the initial fault classification model to obtain the target fault classification model.
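The design above leaves the model family open. As a hedged illustration only, the sketch below trains a simple text classifier (TF-IDF features plus logistic regression from scikit-learn) on concatenated tool logs and task running logs labelled with the three fault categories; the actual "initial fault classification model" of the application could be any supervised model trained iteratively in this way.

```python
# Illustrative training sketch only; the patent does not name a model family.
# TF-IDF + logistic regression stands in for the "initial fault classification model".
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_fault_classifier(samples, labels):
    """samples: list of (tool_log, task_run_log) pairs; labels: 'tool' | 'bin' | 'platform'."""
    texts = [tool + "\n" + run for tool, run in samples]   # merge the two logs of each sample
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(texts, labels)                               # supervised, iterative training
    return model

# Toy usage with made-up log snippets:
# clf = train_fault_classifier(
#     [("scheduler timeout while issuing task", ""), ("", "executor lost on worker node")],
#     ["tool", "platform"],
# )
# print(clf.predict(["SQL compile error in generated plan\n"]))
```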
Based on the technical scheme provided by the embodiments of the application, the work log of the big data tool device (i.e. the tool log) and the work log of the big data platform (i.e. the task running log) generated during the execution of the data warehouse task (i.e. the ETL task) can be acquired first. The tool log and the task running log can characterize the fault generated by the data warehouse task, and in particular which part of the big data processing system produced it. On this basis, the tool log and the task running log can be input into the target fault classification model to obtain the category of the fault generated by the data warehouse task, the fault category being a tool-side fault, a bin-side fault or a platform-side fault. That is, it can be determined at this point roughly which part of the big data processing system failed and caused the data warehouse task not to execute normally. Since the part of the big data processing system that produced the fault is now known, the fault responsible person can finally be determined accurately according to the fault category, and alarm information can be sent to the fault responsibility terminal corresponding to that person so that the fault of the data warehouse task is handled. In this way, the fault category of a failing data warehouse task can be determined automatically and the fault responsible person can be determined accurately. Compared with the prior art, the complicated layer-by-layer process of determining the fault responsible person is reduced, the communication cost consumed in determining the fault responsible person is reduced, and fault resolution efficiency is improved.
In a second aspect, the present application provides a fault handling apparatus, the apparatus comprising: an acquisition module and a processing module.
The acquisition module is used for acquiring a tool log and a task running log when a data warehouse task fault occurs; the tool log is a work log of big data tool equipment in a big data processing system for executing the data warehouse task; the task running log is a work log of a big data platform in a big data processing system for executing the tasks of the data warehouse; the processing module is used for inputting the tool logs and the task running logs acquired by the acquisition module into the target fault classification model to acquire fault categories; the fault category is any one of the following: tool side failure, bin side failure, platform side failure; the processing module is also used for determining a fault responsibility person and sending alarm information to the fault responsibility terminal according to the fault category; the fault responsibility terminal is a fault responsibility terminal corresponding to a fault responsibility person; the alarm information is used for indicating a fault responsibility person to process the fault of the data warehouse task.
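As an illustrative sketch of the second-aspect apparatus (class and method names are assumptions, not taken from the disclosure), the acquisition module and the processing module could be organised as follows.

```python
from typing import Callable, Tuple

class AcquisitionModule:
    """Acquires the tool log and the task running log when a data warehouse task fails."""
    def __init__(self, read_tool_log: Callable[[str], str], read_task_run_log: Callable[[str], str]):
        self.read_tool_log = read_tool_log
        self.read_task_run_log = read_task_run_log

    def acquire(self, task_id: str) -> Tuple[str, str]:
        return self.read_tool_log(task_id), self.read_task_run_log(task_id)

class ProcessingModule:
    """Classifies the fault and alerts the corresponding fault responsibility terminal."""
    def __init__(self, classify, resolve_responsible, notify):
        self.classify = classify                        # (tool_log, task_run_log) -> category
        self.resolve_responsible = resolve_responsible  # (category, task_id) -> (person, terminal)
        self.notify = notify                            # (terminal, payload) -> None

    def process(self, task_id: str, tool_log: str, task_run_log: str) -> str:
        category = self.classify(tool_log, task_run_log)
        person, terminal = self.resolve_responsible(category, task_id)
        self.notify(terminal, {"task": task_id, "category": category})
        return person
```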
In a third aspect, an electronic device is provided that includes a processor, a memory, a communication interface, and a communication bus. The processor, the memory and the communication interface complete communication with each other through a communication bus. The memory is used for storing computer instructions. The computer instructions, when run on a processor, cause the processor to perform the fault handling method as claimed in any one of the first aspects above.
In a fourth aspect, there is provided a computer readable storage medium having stored therein computer instructions which, when run on an electronic device, cause the electronic device to perform the fault handling method according to any of the first aspects above.
In a fifth aspect, there is provided a computer program product comprising computer instructions which, when run on an electronic device, cause the electronic device to perform the fault handling method according to any of the first aspects above.
It will be appreciated that the second to fifth aspects described above are all configured to perform the corresponding method provided in the first aspect; therefore, for the advantages they can achieve, reference may be made to the advantages of the corresponding method provided above, and details are not repeated here.
It should be understood that in this application, the names of the fault handling apparatus and the electronic device described above do not constitute limitations on the device or the functional module itself, and in actual implementation, these devices or functional modules may appear under other names. Insofar as the function of each device or function module is similar to the present invention, it is within the scope of the present disclosure and the equivalents thereof. In addition, it is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
The foregoing description is only an overview of the technical solutions of the embodiments of the present invention, and may be implemented according to the content of the specification, so that the technical means of the embodiments of the present invention can be more clearly understood, and the following specific embodiments of the present invention are given for clarity and understanding.
Drawings
The drawings are only for purposes of illustrating embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:
FIG. 1 is a schematic flow chart of a method for processing task faults in a data warehouse according to the prior art;
FIG. 2 is a schematic diagram of a big data processing system according to an embodiment of the present application;
fig. 3 is a schematic flow chart of a fault handling method according to an embodiment of the present application;
FIG. 4 is a schematic flow chart of another fault handling method according to an embodiment of the present disclosure;
FIG. 5 is a flow chart of another fault handling method according to an embodiment of the present disclosure;
fig. 6 is a flow chart of a training method of a target fault classification model according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a fault handling apparatus according to an embodiment of the present application;
Fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
The terms "first" and "second" are used below for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature. In the description of the present application, unless otherwise indicated, the meaning of "a plurality" is two or more. "A and/or B" includes the following three combinations: only a, only B, and combinations of a and B.
With the development of the big data industry, each company has a large number of ETL (Extract-Transform-Load) tasks (i.e. the data warehouse tasks in this application) that need maintenance; in particular, when a data fault occurs, the relevant responsible person needs to be alerted for quick repair. In practice, the execution link of a data ETL task is relatively long, and the big data processing system applied to the ETL task involves a data warehouse (abbreviated as the "bin" or warehouse side), a big data platform (abbreviated as the "platform") and a big data tool (abbreviated as the "tool").
The data warehouse is mainly responsible for connecting to a plurality of service systems and, when a certain service demand is received, extracting the service data from the corresponding service system to carry out the ETL task. The data warehouse is also responsible for developing all programs or functions related to the ETL task.
When implementing ETL tasks and developing the related programs or functions, the data warehouse needs to work together with the big data tool. After the data warehouse has developed the programs or functions related to an ETL task with the development environment provided by the big data tool, the programs or functions and the sequential logic in which they are executed in the ETL task are sent to the big data tool, so that the big data tool schedules these programs or functions.
The big data platform is a server cluster formed by a plurality of servers, wherein different servers can be used as different components of the big data platform to realize different functions. Such as a storage component, a computing component, a management component, a rights control component, and the like. When the data warehouse implements the ETL task by means of the big data tool, the data warehouse needs to rely on the computing resource and the storage resource provided by the big data platform, so that the ETL task can be implemented smoothly.
Referring to fig. 1, when an ETL task in the prior art fails, since the ETL task is mainly carried out by the warehouse (bin) side using the tool together with the computing resources and storage resources of the platform, the alarm information is first generated on the warehouse side.
After receiving the alarm, the person responsible for the warehouse side needs to judge whether there is a new version change (or release) of a program or function related to the ETL task. If there is such a change, an alarm is sent to the terminal of the corresponding change person to instruct that person to handle it.
If there is no new version change of a program or function related to the ETL task, it is judged whether the fault can be handled on the warehouse side; if so, alarm information is sent to the terminal corresponding to the task responsible person of the current subtask of the ETL task to instruct that person to handle the fault.
If it is determined that the warehouse side cannot handle the fault, first indication information is sent to the platform responsible person (it may be sent to the platform) so that the platform responsible person determines the specific fault responsible person.
After receiving the first indication information, the platform responsible person determines from the platform log (i.e. the work log of the platform) whether there is an abnormality on the tool side (i.e. whether the tool is abnormal). If the tool is abnormal, alarm information is sent to the terminal corresponding to the tool responsible person to instruct that person to handle it; after receiving the alarm information, the tool responsible person checks the tool log (also referred to as the application log) and handles the fault accordingly.
If the tool is not abnormal, it is judged whether the task configuration of the ETL task is abnormal. If the task configuration is abnormal, the platform provides a handling scheme for the specific abnormal situation and sends alarm information to the terminal corresponding to the task responsible person of the current subtask to instruct that person to handle the fault according to the scheme. If the task configuration is not abnormal, the problem can be considered to lie on the platform side, and the platform responsible person acquires the cluster state parameters and component logs of the platform's server cluster in order to handle the fault with reference to these logs.
In summary, the fault alarm of an existing ETL task is often sent to a person who is neither responsible for the actual fault nor able to resolve it, and in the whole process the possible fault responsible persons have to be judged layer by layer manually and then contacted, which wastes a great deal of manpower and communication cost and leads to low fault resolution efficiency.
In view of the above problems, the present application provides a fault handling method. In this method, when the data warehouse task (i.e. the ETL task) fails during execution, a tool log (acquired from the tool side) and a task running log (acquired from the platform side), which together can characterize all possible fault categories in the execution of the data warehouse task, are acquired. The tool log and the task running log are then input into a pre-trained target fault classification model to obtain the fault category. Because the fault category indicates in which part of the big data processing system the fault lies, i.e. on the tool side, the warehouse (bin) side or the platform side, the fault responsible person can be determined automatically and accurately according to the fault category, and alarm information can be sent to the corresponding fault responsibility terminal. In this way, the complicated process of determining the fault responsible person is reduced, the communication cost consumed in that process is reduced, and fault resolution efficiency is improved.
In the embodiments of the present application, the data warehouse task may specifically be an ETL task; the embodiments disclosed in the present application take the ETL task as the example of a data warehouse task, which will not be repeated in the following description.
Fig. 2 is a schematic diagram of a big data processing system to which a fault handling method is applied according to an exemplary embodiment. Referring to FIG. 2, the big data processing system may include a tool server 01 and a platform server 02. The tool server 01 and the platform server 02 can communicate with each other by a wired communication method or a wireless communication method.
The tool server 01 in the present disclosure may be a server, a server cluster formed by a plurality of servers, or a cloud computing service center, which is not limited in the present disclosure. Among these, the tool server 01 may mainly carry a data warehouse and big data tools that perform ETL tasks. The tool server 01 can also be in communication connection with a plurality of service servers 03 carrying different service systems, so that service data can be conveniently acquired from the service systems to implement ETL tasks.
For example, the platform server 02 in the present disclosure may be a server cluster formed by a plurality of servers, or a cloud computing service center. The platform server 02 is mainly used to provide the tool server 01 with storage resources and computing resources required to perform ETL tasks. The server cluster that it includes can be divided into multiple components, such as storage components, computing components, management components, entitlement control components, and the like.
In addition, the tool server 01 or the platform server 02 may also be used to train the target fault classification model. The fault type is determined by using the model, so that the fault processing method provided by the application is smoothly implemented.
The fault handling method provided in the embodiments of the present application is described in detail below with reference to specific service scenarios.
An embodiment of the present application provides a fault handling method applied to a fault handling apparatus. The fault handling apparatus may be an electronic device or part of an electronic device, and the electronic device may be part of a platform server or part of a tool server. Referring to fig. 3, the method may include S301 to S303:
s301, acquiring a tool log and a task running log when an ETL task fault occurs.
Since the warehouse (bin) side is the main execution body of the ETL task, when the ETL task fails and cannot be executed successfully, the warehouse side (for example, a client used by the operator who implements the ETL task) automatically generates a fault prompt for the developers to indicate that the ETL task has failed. Because the electronic device performing the fault handling method may be part of the platform server or part of the tool server, the fault prompt can be discovered in time by real-time detection or by periodic detection with a short period; the specific detection manner may be any feasible manner.
The tool log is a work log of big data tool equipment in a big data processing system for executing ETL tasks; the task execution log is a work log of a big data platform in a big data processing system executing ETL tasks.
In the actual execution of an ETL task, the task mainly comprises several core stages or subtasks such as task scheduling, task issuing, execution-plan generation and task submission/running. Exceptions in the task scheduling and task issuing stages are tool-side faults. An exception in execution-plan generation is typically caused by a code parsing or compilation error, which is a bin-side (data warehouse side) fault. An exception in the task submission/running stage is, with high probability, a platform-side fault. Task scheduling, task issuing and execution-plan generation are implemented in the tool server, because both the warehouse side and the tool side reside in the tool server to which the big data tool belongs; tool-side faults and bin-side faults can therefore be characterized by the content of the tool log of the tool server, i.e. whether either of these two faults has occurred can be determined from the tool log.
The task submission/running stage is mainly implemented on the platform side, so a platform-side fault can be characterized by the task running log generated by the platform server, i.e. whether a platform-side fault has occurred can be determined from the task running log.
Therefore, in order to determine which category the fault of the ETL task belongs to, the tool log and the task running log need to be acquired first.
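The stage-to-category correspondence described above can be summarised as a rough lookup. The table below is only an illustration of that description (the stage names are assumptions); in the method itself the category is produced by the trained classification model from the two logs.

```python
# Rough stage-to-category correspondence implied by the description above (illustrative only).
STAGE_TO_CATEGORY = {
    "task_scheduling": "tool",       # scheduling anomalies -> tool-side fault
    "task_issuing": "tool",          # issuing anomalies -> tool-side fault
    "plan_generation": "bin",        # code parsing/compilation anomalies -> bin-side fault
    "task_submit_run": "platform",   # submit/run anomalies -> most likely platform-side fault
}

def rough_category(stage: str) -> str:
    return STAGE_TO_CATEGORY.get(stage, "unknown")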
S302, calling a target fault classification model to process the tool log and the task running log to obtain fault types.
In connection with the foregoing, in the implementation of the present application, the fault category may be any of the following: tool side failure, bin side failure, platform side failure.
In some embodiments, after the electronic device invokes the target fault classification model to process the tool log and the task running log, three probabilities corresponding to the tool-side fault, the bin-side fault and the platform-side fault may be obtained, e.g. v1 for the tool-side fault, v2 for the bin-side fault and v3 for the platform-side fault. The final fault category is the one with the largest of the three probabilities. For example, if v1, v2 and v3 are 0.4, 0.5 and 0.1 respectively, the final fault category is the bin-side fault corresponding to v2.
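A minimal sketch of this selection step follows; the variable names mirror the example above and are otherwise assumptions.

```python
def pick_category(v1: float, v2: float, v3: float) -> str:
    """Return the fault category with the largest probability (v1: tool, v2: bin, v3: platform)."""
    probs = {"tool": v1, "bin": v2, "platform": v3}
    return max(probs, key=probs.get)

assert pick_category(0.4, 0.5, 0.1) == "bin"   # matches the example above
```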
S303, determining a fault responsibility person according to the fault category and sending alarm information to the fault responsibility terminal.
The fault responsibility terminal is the terminal corresponding to the fault responsible person; the alarm information is used to instruct the fault responsible person to handle the fault of the ETL task.
When the fault responsibility person is determined, the communication information of the fault responsibility person can be further obtained according to the association relation between the fault responsibility person and the communication information. Such as a cell phone number, a terminal account number, etc. Furthermore, the electronic equipment can send the alarm information to the fault responsibility terminal according to the communication information.
By way of example, the fault responsibility terminal in the embodiments of the present application may be any terminal capable of communicating wirelessly with the electronic device performing the fault handling method, such as a mobile phone, a tablet, a desktop computer, a laptop, a handheld computer, a notebook, an ultra-mobile personal computer (UMPC), a netbook, a cellular phone, a personal digital assistant (PDA), an augmented reality (AR)/virtual reality (VR) device or a personal computer (PC). The embodiments of the application do not limit the specific form of the terminal.
In some embodiments, to help the fault responsible person handle the fault faster, the alarm information may also include the fault category and the corresponding log information, such as the tool log or the task running log.
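A hedged sketch of such an alarm payload is shown below; the field names and the choice of which log to attach are assumptions made for illustration.

```python
def build_alarm(task_id: str, category: str, tool_log: str, task_run_log: str) -> dict:
    """Assemble alarm information carrying the fault category and the relevant log excerpt."""
    log = tool_log if category in ("tool", "bin") else task_run_log
    return {
        "task": task_id,
        "category": category,            # tool-side / bin-side / platform-side
        "log_excerpt": log[-2000:],      # tail of the relevant work log
        "action": "please handle the fault of this data warehouse task",
    }
```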
Based on the technical scheme provided by the embodiments of the application, the work log of the big data tool device (i.e. the tool log) and the work log of the big data platform (i.e. the task running log) generated during the execution of the ETL task (i.e. the data warehouse task) can be acquired first. The tool log and the task running log can characterize the fault generated by the ETL task, and in particular which part of the big data processing system produced it. On this basis, the tool log and the task running log can be input into the target fault classification model to obtain the category of the fault generated by the ETL task, the fault category being a tool-side fault, a bin-side fault or a platform-side fault. That is, it can be determined at this point roughly which part of the big data processing system failed and caused the ETL task not to execute normally. Since the part of the big data processing system that produced the fault is now known, the fault responsible person can finally be determined accurately according to the fault category, and alarm information can be sent to the fault responsibility terminal corresponding to that person so that the fault of the ETL task is handled. In this way, the fault category of a failing ETL task can be determined automatically and the fault responsible person can be determined accurately. Compared with the prior art, the complicated layer-by-layer process of determining the fault responsible person is reduced, the communication cost consumed in determining the fault responsible person is reduced, and fault resolution efficiency is improved.
A tool-side fault (a task scheduling/issuing anomaly) is mainly caused by an error of the tool itself, so the fault responsible person can be directly identified as the manager currently responsible for managing the tool.
A bin-side fault (a code parsing/compilation anomaly) is mainly caused by an exception in the code corresponding to the data warehouse. Such an exception usually arises because the subtask currently being executed by the ETL task has undergone a logic modification or a table structure modification (also called a task version change or a task table change) that was not strictly tested and verified, so that the new version of the subtask fails once it goes online. Likewise, if an upstream subtask of the current subtask has undergone a similar change, the current subtask may also go wrong. Therefore, for a bin-side fault, the specific fault responsible person can be determined from the change record of the current subtask or of its upstream subtask.
The subtasks of the ETL task are ultimately submitted by the warehouse side, through the tool side, to the server cluster of the platform for running. For a platform-side fault (a task running anomaly), if a component of the platform-side server cluster is abnormal, or its resources or configuration are abnormal, the whole task run may fail even if the code corresponding to the ETL task is normal. Therefore, for this fault category, the cluster running state parameters that reflect such abnormal situations need to be acquired in order to determine the specific fault responsible person.
Based on the above description, the embodiment of the application also provides another embodiment of a fault handling method. The method may be applied on a fault handling apparatus, which may be an electronic device or a part of an electronic device, which may be a part of a platform server or a part of a tool server. Referring to fig. 4, the fault handling method in this embodiment may include S401 to S413:
s401, acquiring a tool log and a task running log when an ETL task fault occurs.
The specific implementation of S401 may refer to the specific expression of S301 in the foregoing embodiment, and will not be described herein.
S402, calling a target fault classification model to process the tool log and the task running log to obtain fault types.
The specific implementation of S402 may refer to the description of S302 in the foregoing embodiment, and is not repeated here.
After S402, if the fault type is a tool side fault, the electronic device may determine, according to a shift schedule (may be referred to as a responsibility schedule), a tool responsibility person currently responsible for the tool side to determine a fault responsibility person, and send an alarm message to a corresponding tool responsibility terminal. I.e. the subsequent S403 and S404 are performed.
After S402, if the fault category is a bin-side fault, the electronic device may check the change record of the currently executed subtask of the ETL task; when the currently executed subtask has a task version change or a task table change, the corresponding change person is determined as the fault responsible person and alarm information is sent to the corresponding change terminal, i.e. the subsequent S405 and S406 are performed.
Alternatively, if the electronic device determines that the currently executed subtask has no task version change or task table change, the specific fault responsible person can be determined from the change record of the upstream subtask and alarm information can be sent to the corresponding change terminal, i.e. the subsequent S407 to S409 are performed.
After S402, if the fault category is a platform-side fault, the specific fault responsible person may be determined according to the cluster running state parameters of the big data platform. If the cluster running state parameters show that some part of the platform side is abnormal, the corresponding responsible person can be determined as the fault responsible person and alarm information is sent to the corresponding terminal, i.e. S410 to S412 are performed.
If no such abnormality is found, it can be considered that the task configuration parameters of the current subtask are abnormal and caused the running anomaly; in that case the responsible person of the current subtask is determined as the fault responsible person and alarm information is sent to the corresponding terminal, i.e. S413 is performed.
And S403, if the fault type is a tool side fault, determining a tool responsibility person corresponding to the big data tool equipment and communication information of the tool responsibility person according to a responsibility table of the big data tool equipment.
The responsibility table comprises communication information of a plurality of tool managers and corresponding relations between the tool managers and time periods responsible for big data tool equipment. Illustratively, the responsibility table may be as shown in table 1 below.
Table 1: Responsibility table (the table content is provided as an image in the original publication)
S404, determining the tool responsibility person as a fault responsibility person, and sending alarm information to the tool responsibility terminal by utilizing communication information of the tool responsibility person.
Based on the technical schemes corresponding to S403 and S404, the current tool responsibility person can be automatically determined as the fault responsibility person under the condition that the fault class is the tool side fault, and the alarm information is sent to the corresponding tool responsibility terminal. Therefore, faults of the ETL task are processed as soon as possible, and the fault processing efficiency is improved.
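Since Table 1 is only available as an image, the sketch below assumes a simple duty-table layout (time period, manager, contact) to illustrate the S403/S404 lookup of the tool responsible person currently on duty; the layout and names are assumptions.

```python
from datetime import datetime
from typing import Optional, Tuple

# Assumed layout of the responsibility (duty) table: (start_hour, end_hour, manager, terminal).
RESPONSIBILITY_TABLE = [
    (0, 8, "tool_manager_a", "terminal_a"),
    (8, 16, "tool_manager_b", "terminal_b"),
    (16, 24, "tool_manager_c", "terminal_c"),
]

def on_duty_tool_manager(now: Optional[datetime] = None) -> Tuple[Optional[str], Optional[str]]:
    """Return (manager, terminal) responsible for the big data tool device at the current time."""
    now = now or datetime.now()
    for start, end, manager, terminal in RESPONSIBILITY_TABLE:
        if start <= now.hour < end:
            return manager, terminal
    return None, None
```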
S405, if the fault category is a bin-side fault, acquiring a first change record of a first subtask of the ETL task.
The first change record is a first subtask version change record or a first subtask table change record. The first subtask may be a subtask currently being executed by the ETL task.
In practice, every time the code of a subtask is changed, there is a corresponding person who updates and develops that code. Since code development is based on the development environment provided by the tool, this person leaves a corresponding change record (i.e. the first change record) on the tool side (i.e. in the tool server) when developing the change. The electronic device can therefore determine from the first change record whether the first subtask has a task version change or a task table change and, if so, determine the corresponding change person, i.e. execute S406.
S406, if it is determined that the task version change or the task table change exists in the first subtask according to the first change record, determining the latest first change person corresponding to the first subtask as a fault responsible person and sending alarm information to a first change terminal corresponding to the first change person.
Specifically, if the first change record contains a change within a certain time range of the current time (for example, within the last 5 seconds), this indicates that the first subtask has a task version change or a task table change. In addition, to facilitate determining the responsible person, the first change record may contain the communication information of the change person corresponding to each change entry. On this basis, when the electronic device determines that the first subtask has a task version change or a task table change, it can determine the first change person corresponding to the latest change entry in the first change record as the fault responsible person and send the alarm information to the first change terminal using that person's communication information.
Based on the technical schemes corresponding to S405 and S406, when the fault category is a bin-side fault and it is determined that the currently executed first subtask has a task version change or a task table change, the first change person corresponding to that change can be determined as the fault responsible person and alarm information can be sent to the corresponding first change terminal. In this way the fault of the ETL task is handled as soon as possible and fault handling efficiency is improved.
S407, if it is determined that the task version change or the task table change does not exist in the first subtask according to the first change record, acquiring a second change record of the second subtask.
The second subtask is an upstream task of the first subtask in the ETL task. For example, taking the example that the ETL task includes an extraction subtask, a conversion subtask, and a loading subtask, the extraction subtask should be executed first, then the conversion subtask, and finally the loading subtask. It may then be determined that the extraction subtask is an upstream task of the conversion subtask, which is an upstream task of the loading subtask. For another example, if there are also a plurality of sub-tasks in the extraction sub-task, there is a similar upstream-downstream relationship between the plurality of sub-tasks in the extraction sub-task. This is not particularly limited in this application.
The second change record is a second subtask version change record or a second subtask table change record.
S408, if it is determined that the task version change or the task table change exists in the second subtask according to the second change record, determining a second change person corresponding to the new version change of the second subtask as a fault responsible person and sending alarm information to a second change terminal corresponding to the second change person.
Specifically, if the second change record contains a change within a certain time range of the current time (for example, within the last 5 seconds), this indicates that the second subtask has a task version change or a task table change. In addition, to facilitate determining the responsible person, the second change record may contain the communication information of the change person corresponding to each change entry. On this basis, when the electronic device determines that the second subtask has a task version change or a task table change, it can determine the second change person corresponding to the latest change entry in the second change record as the fault responsible person and send the alarm information to the second change terminal using that person's communication information.
S409, if it is determined that the task version change or the task table change does not exist in the second subtask according to the second change record, determining the latest first change person corresponding to the first subtask as a fault responsible person and sending alarm information to the first change terminal corresponding to the first change person.
When it is determined from the first change record that the first subtask has no current task version change or task table change, and it is determined from the second change record that the second subtask has none either, it can be considered with high probability that the most recent change made to the currently executed first subtask caused the anomaly. In that case, the last change person in the first change record, i.e. the latest first change person, can be determined as the fault responsible person and alarm information can be sent to the corresponding first change terminal.
Based on the technical scheme corresponding to S407 to S409, when the fault category is a bin-side fault and the currently executed first subtask has no current task version change or task table change, the actual fault responsible person can still be determined accurately from the second change record of the upstream second subtask, and alarm information is sent to the corresponding change terminal. In this way the fault of the ETL task is handled as soon as possible and fault handling efficiency is improved.
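The whole bin-side branch (S405 to S409) can be summarised in one routine. The record structure (a time, person and terminal per change entry) and the recency window are illustrative assumptions, not part of the disclosure.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Optional, Tuple

def _recent_changer(records: List[Dict], window: timedelta, now: datetime) -> Optional[Tuple[str, str]]:
    """Latest changer within the window, or None if there is no recent version/table change."""
    recent = [r for r in records if now - r["time"] <= window]
    if not recent:
        return None
    latest = max(recent, key=lambda r: r["time"])
    return latest["person"], latest["terminal"]

def route_bin_side_fault(first_records: List[Dict], second_records: List[Dict],
                         window: timedelta = timedelta(days=1),
                         now: Optional[datetime] = None) -> Tuple[str, str]:
    now = now or datetime.now()
    hit = _recent_changer(first_records, window, now)     # S405/S406: current subtask changed recently?
    if hit:
        return hit
    hit = _recent_changer(second_records, window, now)    # S407/S408: upstream subtask changed recently?
    if hit:
        return hit
    latest = max(first_records, key=lambda r: r["time"])  # S409: fall back to latest changer of current subtask
    return latest["person"], latest["terminal"]
```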
S410, if the fault category is a platform-side fault, acquiring the cluster running state parameters of the big data platform.
Specifically, the electronic device may obtain cluster operation state parameters of its server cluster from the big data platform in any feasible manner. The cluster operation state parameters may at least include: information about whether each component in the server cluster is abnormal, parameter configuration and hardware configuration information of the server cluster, and the like.
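For illustration, the parameters listed above could be carried in a small container like the one below; the field names are assumptions rather than part of the disclosure.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class ClusterState:
    component_healthy: Dict[str, bool] = field(default_factory=dict)  # component name -> healthy?
    config_params: Dict[str, str] = field(default_factory=dict)       # software/hardware configuration
    config_changed: bool = False                                      # have configuration parameters changed?
```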
S411, if the cluster running state parameter indicates that the target component of the big data platform is abnormal, determining a component responsible person corresponding to the target component as a fault responsible person, and sending alarm information to a component responsible terminal corresponding to the component responsible person.
Wherein each component of the big data platform has a corresponding component responsible person, and information (including communication information) of the component responsible person can be stored in the big data platform. When the electronic equipment determines that the target component of the big data platform is abnormal, the component responsible person corresponding to the target component can be determined from the big data platform, and alarm information is sent to the corresponding component responsible terminal according to the communication information.
S412, if the cluster running state parameter indicates that the configuration parameter of the big data platform is changed, determining a platform responsibility person corresponding to the big data platform as a fault responsibility person, and sending alarm information to a platform responsibility terminal corresponding to the platform responsibility person.
The configuration parameters of the big data platform may specifically include software configuration parameters and hardware configuration parameters, which mainly determine how many resources are allocated to each component of the big data platform. If the configuration parameters are changed while the content of the ETL task is not, a component of the big data platform that the ETL task needs may no longer have enough resources, so that execution fails or becomes abnormal.
Configuration parameters of the big data platform are adjusted and modified by a platform responsibility person of the big data platform, and information (including communication information) of the platform responsibility person can be stored in the big data platform. When the electronic equipment determines that the configuration parameters of the big data platform are abnormal, the platform responsible person can be determined from the big data platform, and alarm information is sent to the corresponding platform responsible terminal according to the communication information.
Based on the technical scheme corresponding to S410-S412, under the condition that the fault type is a platform side fault, determining which part of the big data platform generates the fault according to the cluster operation state parameters, determining a corresponding responsible person as a fault responsible person, and sending alarm information to a corresponding fault responsible terminal. Therefore, faults of the ETL task are processed as soon as possible, and the fault processing efficiency is improved.
S413, if the cluster running state parameters indicate that all components of the big data platform are not abnormal and the configuration parameters of the big data platform are not changed, determining the latest first change person corresponding to the first subtask of the ETL task as a fault responsible person and sending alarm information to a first change terminal corresponding to the first change person.
If it is determined from the cluster running state parameters that no component of the big data platform is abnormal and its configuration parameters have not changed, it can be considered with high probability that the task configuration parameters of the currently executed first subtask are abnormal. For example, the task configuration parameters of the first subtask may specify that it is executed by server A in the big data platform, but if server A is running other subtasks and its resources are insufficient, the first subtask runs abnormally. In this case, the latest first change person corresponding to the first subtask is determined as the fault responsible person and alarm information is sent to the first change terminal corresponding to that person.
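A hedged sketch of the platform-side branch (S410 to S413) follows; the input structure and owner tables are assumptions used only to show the routing order.

```python
from typing import Dict, Optional, Tuple

def route_platform_side_fault(
    abnormal_component: Optional[str],                  # name of an abnormal component, or None
    config_changed: bool,                               # have the platform configuration parameters changed?
    component_owners: Dict[str, Tuple[str, str]],       # component -> (person, terminal)
    platform_owner: Tuple[str, str],
    latest_first_changer: Tuple[str, str],              # latest changer of the currently executed subtask
) -> Tuple[str, str]:
    if abnormal_component is not None:                  # S411: component anomaly -> component owner
        return component_owners[abnormal_component]
    if config_changed:                                  # S412: configuration change -> platform owner
        return platform_owner
    return latest_first_changer                         # S413: otherwise -> latest first change person
```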
For how to determine the first change person and how to send the alarm information to the first change terminal, reference may be made to the relevant descriptions in the foregoing embodiments, which are not repeated here.
Based on the technical scheme corresponding to S413, when the fault category is a platform-side fault but it is determined from the cluster running state parameters that the fault was not caused by an abnormality of the big data platform itself, it can be considered that the task configuration parameters set by the last change person of the currently executed first subtask (i.e. the latest first change person) are abnormal; that person can therefore be determined as the fault responsible person and alarm information can be sent to the corresponding first change terminal. In this way the fault of the ETL task is handled as soon as possible and fault handling efficiency is improved.
It should be noted that, after the electronic device determines the fault responsible person and sends the alarm information to the corresponding fault responsibility terminal, the fault responsible person can, once the corresponding abnormality has been handled, re-run the subtask that currently failed so that the whole ETL task is carried out smoothly. If, after receiving the alarm information, the fault responsible person finds that he or she is not actually responsible for the fault, the handling of the fault is assigned to the actual responsible person according to the actual situation.
Based on the technical schemes corresponding to the S401-S413, specific responsible persons can be automatically and accurately determined according to different fault types, and alarm information can be sent to corresponding fault responsible terminals. Therefore, faults of the ETL task are processed as soon as possible, and the fault processing efficiency is improved.
In order to more clearly illustrate the technical solution provided by the embodiments of the present application, the embodiments of the present application further provide another embodiment of a fault handling method. In this embodiment, taking an electronic device as a part of a platform server as an example, referring to fig. 5, the method specifically may include S501-S511:
s501, acquiring a tool log and a task running log under the condition that an ETL task is in error (namely, fails).
The specific implementation of S501 may refer to S301 in the foregoing embodiment, and will not be described herein.
S502, determining fault categories according to the tool logs and the task running logs.
The specific implementation of S502 may refer to S302 in the foregoing embodiment, which is not described herein.
If the fault class is the tool side fault, executing S503; if the fault class is a bin side fault, executing S504; if the failure type is a platform-side failure, S509 is executed.
S503, alerting the tool responsible person to handle the fault.
Specifically, S503 consists in sending alarm information to the tool responsibility terminal of the tool responsible person. For the specific implementation, reference may be made to the description of S404 in the foregoing embodiment, which is not repeated here.
S504, judging whether the first subtask has a new version change.
Wherein the current subtask is the first subtask. Specifically, S504 determines whether a task version change or a task table change exists in the first subtask according to the first change record of the first subtask. For specific implementation, reference may be made to the relevant expression of S406 in the foregoing embodiment.
If it is determined that the current subtask has a new version change, S505 is executed; if it is determined that the current subtask does not have a new version change, S506 is performed.
S505, alarming the first change person to process.
Wherein the first change person is the most recent change person in the change record of the first subtask. The specific implementation of S505 may refer to the relevant expression in S406, and will not be described herein again.
S506, determining an upstream task of the first subtask.
Specifically, the electronic device may determine the upstream task of the first subtask according to the settings configured by the bin side for each subtask in the ETL task. The explanation of the upstream-downstream relationship between the subtasks may refer to the relevant expressions in the foregoing embodiments, and will not be repeated here.
S507, judging whether the second subtask has new version change.
Wherein the second subtask is an upstream task of the first subtask.
Specifically, S507 determines whether a task version change or a task table change exists in the second subtask according to the second change record of the second subtask. The specific implementation may refer to the relevant expressions of S407 and S408 in the foregoing embodiments, which are not repeated here.
If it is determined that the new version change exists in the second subtask, S508 is performed.
If it is determined that the second subtask does not have a new version change, it can be considered with high probability that the abnormality of the first subtask is caused by the last change made to the currently executed first subtask. At this time, the last change person in the first change record, that is, the latest first change person, can be determined as the fault responsibility person, and alarm information can be sent to the corresponding first change terminal; that is, S505 is executed.
S508, alarming the second change person to process.
Wherein the second change person is the most recent change person in the change record of the second subtask. The specific implementation of S508 may refer to the relevant expression in S408, which is not described herein again.
S509, acquiring cluster operation state parameters.
The specific implementation of S509 may refer to the related expression of S410 in the foregoing embodiment, which is not described herein.
S510, judging whether the target component is abnormal or the configuration parameters are changed in the big data platform.
If there is an abnormality of the target component or a change of the configuration parameter in the big data platform, S511 is executed.
If the target component of the big data platform is not abnormal and the configuration parameters are not changed, it can be considered with high probability that the task configuration parameters of the currently executed first subtask are abnormal. In this case, the latest first change person of the first subtask can be determined as the fault responsibility person and alarmed to process the fault, that is, S505 is executed.
S511, if the target component is abnormal, alarming the component responsibility person to process; if the configuration parameters are changed, alarming the platform responsibility person to process.
The specific implementation of S511 may refer to the relevant expressions of S411 and S412 in the foregoing embodiments, which are not repeated here.
The technical effects of the technical schemes corresponding to S501-S511 may refer to the technical effects of the technical schemes disclosed in the foregoing embodiments, and are not described herein again.
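As an aid to understanding, the S501-S511 flow described above can be summarized in the following minimal Python sketch. It is a sketch only: the data structures, helper names (for example Subtask, ClusterState, alarm) and the way change records are represented are assumptions introduced for illustration and are not part of the embodiments themselves.

```python
# Minimal sketch of the S501-S511 decision flow; all names are illustrative assumptions.
from dataclasses import dataclass, field
from typing import List, Optional

TOOL_SIDE, BIN_SIDE, PLATFORM_SIDE = "tool side", "bin side", "platform side"


@dataclass
class Subtask:
    name: str
    # Change record entries such as {"person": "...", "kind": "version" or "table"}.
    change_record: List[dict] = field(default_factory=list)
    upstream: Optional["Subtask"] = None


@dataclass
class ClusterState:
    abnormal_component: Optional[str] = None   # name of the abnormal target component, if any
    config_changed: bool = False               # whether platform configuration parameters changed


def latest_change_person(record: List[dict]) -> Optional[str]:
    return record[-1]["person"] if record else None


def has_new_change(record: List[dict]) -> bool:
    # A task version change or task table change exists in the record.
    return any(entry["kind"] in ("version", "table") for entry in record)


def alarm(person: Optional[str], reason: str) -> None:
    # Stand-in for sending alarm information to the person's responsibility terminal.
    print(f"alarm -> {person}: {reason}")


def handle_fault(category: str, first: Subtask, state: ClusterState,
                 tool_duty: str, component_owner: str, platform_owner: str) -> None:
    if category == TOOL_SIDE:                                       # S503
        alarm(tool_duty, "tool side fault")
    elif category == BIN_SIDE:                                      # S504-S508
        if has_new_change(first.change_record):                     # S505
            alarm(latest_change_person(first.change_record), "first subtask changed")
        elif first.upstream and has_new_change(first.upstream.change_record):  # S508
            alarm(latest_change_person(first.upstream.change_record), "upstream subtask changed")
        else:                                                       # fall back to S505
            alarm(latest_change_person(first.change_record), "first subtask suspected")
    elif category == PLATFORM_SIDE:                                 # S509-S511
        if state.abnormal_component:
            alarm(component_owner, f"component {state.abnormal_component} abnormal")
        elif state.config_changed:
            alarm(platform_owner, "platform configuration changed")
        else:                                                       # fall back to S505
            alarm(latest_change_person(first.change_record), "task configuration suspected")
```

A call such as handle_fault(BIN_SIDE, first_subtask, ClusterState(), "tool_admin", "component_admin", "platform_admin") would then exercise the bin side branch corresponding to S504-S508.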
In the embodiment of the application, in order to enable the electronic device to perform accurate fault classification, a target fault classification model needs to be trained in advance (at least before the electronic device inputs the tool log and the task running log into the target fault classification model). Based on this, the fault handling method provided in the present application should further include a method for training the target fault classification model, which may be specifically implemented before S302 and S402 in the foregoing embodiments. Referring to fig. 6, the training method of the target fault classification model may include S601 and S602:
S601, a plurality of groups of sample data and sample categories corresponding to the plurality of groups of sample data one by one are obtained.
Each group of sample data is a tool log and a task running log when the big data processing system has ETL task faults; and the sample category corresponding to the sample data is the fault category of the sample data.
The sample data acquired in S601 may be acquired from the big data processing system before the current time. Specifically, whenever the big data processing system fails while processing an ETL task, the fault category is determined manually, and the corresponding tool log and task running log are obtained. In each group of data obtained in this way, the tool log and the task running log constitute one piece of sample data, and the corresponding fault category is the sample category corresponding to that sample data.
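By way of illustration only, the collection of labelled historical failures described above could be organized into (sample data, sample category) pairs as in the following sketch; the field names tool_log, task_run_log and fault_category are assumptions introduced for this sketch rather than fields defined by the application.

```python
from typing import Dict, List, Tuple


def build_samples(historical_failures: List[Dict[str, str]]) -> Tuple[List[str], List[str]]:
    """Each historical failure carries its two logs plus a manually determined fault category."""
    samples, categories = [], []
    for failure in historical_failures:
        # One piece of sample data = the tool log together with the task running log.
        samples.append(failure["tool_log"] + "\n" + failure["task_run_log"])
        # The manually determined fault category is the corresponding sample category.
        categories.append(failure["fault_category"])  # "tool side" | "bin side" | "platform side"
    return samples, categories
```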
S602, using sample data as training data and sample categories as supervision information, and iteratively training an initial fault classification model to obtain a target fault classification model.
Illustratively, S602 may specifically include: initializing a fault classification model; inputting the sample data into the fault classification model to obtain a prediction category; determining a loss value according to the prediction category and the sample category; iteratively updating the fault classification model according to the loss value; and repeatedly executing the step of inputting the sample data into the fault classification model to obtain the prediction category, until a target fault classification model meeting a preset condition is obtained.
The initializing of the fault classification model may specifically refer to setting the corresponding hyperparameters according to the model framework selected in implementation, and initializing the weight parameters to be optimized in the training process.
Specifically, the loss value may be calculated according to any feasible loss function, for example the mean squared error (MSE), which is the mean of the squared differences between the predicted value (the prediction category) and the actual value (the sample category). This is not particularly limited in this application.
Then, based on the loss value, the parameters in the fault classification model can be adjusted in any feasible parameter adjustment manner, such as stochastic gradient descent (SGD) and the like. This is not particularly limited in this application.
In addition, the target fault classification model satisfying the preset condition may include: the number of iterations of the fault classification model reaches a preset number, or the loss value is smaller than a preset threshold. That is, in the process of training the fault classification model, if the number of iterations after a certain iteration is greater than or equal to the preset number, or the loss value corresponding to the fault classification model after a certain iteration is smaller than the preset threshold, the fault classification model at that point is the target fault classification model.
The preset times and the preset threshold value can be determined according to requirements and model training experience. The specific values may be determined according to actual requirements, and the present application is not particularly limited. Therefore, training of the target fault classification model can be timely terminated on the basis of reaching a training target according to user requirements.
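As a minimal sketch of the iterative training of S602 with an MSE loss, gradient-descent updates and the two stopping conditions just described, the following example may be considered; the simple linear model, the numeric feature vectors and the learning rate are assumptions made for illustration, not choices prescribed by the application.

```python
import numpy as np


def train_fault_classifier(features: np.ndarray, labels: np.ndarray, num_classes: int = 3,
                           lr: float = 0.01, max_iters: int = 1000, loss_threshold: float = 1e-3):
    """features: (n_samples, n_features) numeric vectors derived from the two logs;
    labels: (n_samples,) integer sample categories (0 = tool side, 1 = bin side, 2 = platform side)."""
    rng = np.random.default_rng(0)
    n_samples, n_features = features.shape
    # Initialize the fault classification model (weight parameters to be optimized).
    W = rng.normal(scale=0.01, size=(n_features, num_classes))
    b = np.zeros(num_classes)
    targets = np.eye(num_classes)[labels]          # one-hot encoded sample categories

    for _ in range(max_iters):                     # stop when the preset number of iterations is reached
        predictions = features @ W + b             # prediction categories (scores)
        errors = predictions - targets
        loss = np.mean(errors ** 2)                # mean squared error loss value
        if loss < loss_threshold:                  # or stop when the loss is below the preset threshold
            break
        # Gradient of the MSE loss followed by a gradient-descent parameter update
        # (full-batch here for brevity; SGD would update on sampled mini-batches).
        W -= lr * (2.0 * features.T @ errors / errors.size)
        b -= lr * (2.0 * errors.sum(axis=0) / errors.size)
    return W, b
```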
In addition, when training the target fault classification model in S602, any feasible decision tree algorithm may be used. During training, all sample data and the corresponding sample categories can be further divided into a training set and a validation (check) set. After a pending fault classification model is obtained by training on the training set, the accuracy, recall and precision of the pending fault classification model can be evaluated on the validation set. Then, according to the requirements on accuracy, recall and precision, the parameters of the pending fault classification model are adjusted to obtain the target fault classification model.
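Assuming scikit-learn is available, the decision-tree variant with a training/validation split and an accuracy, recall and precision evaluation could look like the sketch below; the TF-IDF featurization of the logs and the parameter values are illustrative assumptions rather than requirements of the application.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def train_decision_tree(samples, categories):
    """samples: list of log texts; categories: list of fault category labels (strings)."""
    # Turn the raw tool logs and task running logs into numeric features.
    vectorizer = TfidfVectorizer(max_features=2000)
    X = vectorizer.fit_transform(samples)

    # Divide the labelled data into a training set and a validation (check) set.
    X_train, X_val, y_train, y_val = train_test_split(X, categories, test_size=0.2, random_state=0)

    # Train a pending fault classification model on the training set.
    model = DecisionTreeClassifier(max_depth=8, random_state=0)
    model.fit(X_train, y_train)

    # Evaluate accuracy, recall and precision on the validation set; in practice the tree
    # parameters would be adjusted until the requirements are met, giving the target model.
    y_pred = model.predict(X_val)
    print("accuracy :", accuracy_score(y_val, y_pred))
    print("recall   :", recall_score(y_val, y_pred, average="macro"))
    print("precision:", precision_score(y_val, y_pred, average="macro"))
    return vectorizer, model
```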
It should be noted that the electronic device for training the target fault classification model may be the device implementing the technical solutions corresponding to the foregoing embodiments, or may be another device that can communicate with that device. When the electronic device is another device, it may, after training the target fault classification model, send the model to the device implementing the technical solutions corresponding to the foregoing embodiments. How this is specifically achieved is not particularly limited in the present application.
Based on the technical schemes corresponding to S601 and S602, a target fault classification model can be obtained through training in a machine learning manner, and the target fault classification model has the capability of predicting the fault category of the ETL task from the tool log and the task running log. Therefore, in the fault processing method provided by the application, the fault category can be conveniently and quickly determined by using the model.
It will be appreciated that, in order to achieve the above-mentioned functions, the electronic device includes corresponding hardware structures and/or software modules for performing the respective functions. Those of skill in the art will readily appreciate that the elements and algorithm steps of the examples described in connection with the embodiments disclosed herein may be implemented as hardware or combinations of hardware and computer software. Whether a function is implemented as hardware or computer software driven hardware depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the embodiments of the present application.
The embodiment of the application also provides a fault processing device under the condition that each functional module is divided by corresponding each function. Fig. 7 is a schematic structural diagram of a fault handling apparatus according to an embodiment of the present application. The apparatus may include: an acquisition module 701 and a processing module 702.
The acquiring module 701 is configured to acquire a tool log and a task running log when an ETL task fault occurs; the tool log is a work log of big data tool equipment in a big data processing system for executing ETL tasks; the task running log is a work log of a big data platform in a big data processing system for executing ETL tasks; the processing module 702 is configured to invoke the target fault classification model to process the tool log and the task running log acquired by the acquisition module 701, so as to obtain a fault class; the fault category is any one of the following: tool side failure, bin side failure, platform side failure; the processing module 702 is further configured to determine a fault responsibility person according to the fault category and send alarm information to the fault responsibility terminal; the fault responsibility terminal is a fault responsibility terminal corresponding to a fault responsibility person; the alarm information is used for indicating a fault responsibility person to process the fault of the ETL task.
In one possible implementation, the processing module 702 is specifically configured to: if the fault category is a tool side fault, determine a tool responsibility person corresponding to the big data tool equipment and communication information of the tool responsibility person according to a responsibility table of the big data tool equipment; the responsibility table comprises communication information of a plurality of tool managers and the correspondence between the tool managers and the time periods for which they are responsible for the big data tool equipment; and determine the tool responsibility person as the fault responsibility person, and send alarm information to the tool responsibility terminal by using the communication information of the tool responsibility person.
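A possible in-memory form of such a responsibility table, and the on-duty lookup it supports, is sketched below; the table layout, the example names and the contact format are assumptions made for illustration, not a structure defined by the application.

```python
from datetime import datetime
from typing import Optional, Tuple

# Hypothetical responsibility table: each row maps a duty period to a tool manager
# and that manager's communication information.
RESPONSIBILITY_TABLE = [
    {"manager": "manager_a", "contact": "manager_a@example.com",
     "start": datetime(2023, 2, 20, 0, 0), "end": datetime(2023, 2, 20, 12, 0)},
    {"manager": "manager_b", "contact": "manager_b@example.com",
     "start": datetime(2023, 2, 20, 12, 0), "end": datetime(2023, 2, 21, 0, 0)},
]


def tool_responsibility_person(fault_time: datetime) -> Tuple[Optional[str], Optional[str]]:
    """Return the tool manager on duty when the fault occurred, with their contact information."""
    for row in RESPONSIBILITY_TABLE:
        if row["start"] <= fault_time < row["end"]:
            return row["manager"], row["contact"]
    return None, None
```

The contact information returned by such a lookup would then be used to send the alarm information to the corresponding tool responsibility terminal.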
In one possible implementation, the processing module 702 is specifically configured to: if the fault category is a bin side fault, acquire a first change record of a first subtask of the ETL task; the first change record is a first subtask version change record or a first subtask table change record; if it is determined according to the first change record that a task version change or a task table change exists in the first subtask, determine the latest first change person corresponding to the first subtask as the fault responsibility person, and send alarm information to the first change terminal corresponding to the first change person.
In one possible implementation, after obtaining the first change record of the first subtask of the ETL task, the processing module 702 is further configured to: if the first subtask is determined to have no task version change or task table change according to the first change record, a second change record of a second subtask is obtained; the second subtask is an upstream task of the first subtask in the ETL task; the second change record is a second subtask version change record or a second subtask table change record; if the task version change or the task table change exists in the second subtask according to the second change record, determining a second change person corresponding to the new version change of the second subtask as a fault responsible person and sending alarm information to a second change terminal corresponding to the second change person; if the second subtask is determined to have no task version change or task table change according to the second change record, the latest first change person corresponding to the first subtask is determined to be a fault responsible person, and alarm information is sent to a first change terminal corresponding to the first change person.
In one possible implementation, the processing module 702 is specifically configured to: if the fault type is a platform fault, acquiring cluster running state parameters of the big data platform; if the cluster running state parameters indicate that the target assembly of the big data platform is abnormal, determining an assembly responsibility person corresponding to the target assembly as a fault responsibility person, and sending alarm information to an assembly responsibility terminal corresponding to the assembly responsibility person; if the cluster running state parameters indicate that the configuration parameters of the big data platform are changed, determining a platform responsibility person corresponding to the big data platform as a fault responsibility person, and sending alarm information to a platform responsibility terminal corresponding to the platform responsibility person.
In one possible implementation, after acquiring the cluster operation state parameters of the big data platform, the processing module 702 is further configured to: if the cluster running state parameters indicate that all components of the big data platform are not abnormal and the configuration parameters of the big data platform are not changed, determining the latest first change person corresponding to the first subtask of the ETL task as a fault responsible person and sending alarm information to a first change terminal corresponding to the first change person.
In one possible implementation, the apparatus further comprises a training module 703. The training module 703 is specifically configured to: acquiring a plurality of groups of sample data and sample categories corresponding to the plurality of groups of sample data one by one; the sample data are tool logs and task running logs when the ETL task fault occurs in the big data processing system; the sample class corresponding to the sample data is the fault class of the sample data; and taking the sample data as training data, taking the sample category as supervision information, and iteratively training the initial fault classification model to obtain the target fault classification model.
The specific manner in which each module performs the operation and the corresponding beneficial effects of the fault handling apparatus in the foregoing embodiments are described in detail in the foregoing embodiments of the fault handling method, and will not be described herein again.
Fig. 8 is a schematic diagram of a possible structure of an electronic device according to an exemplary embodiment, where the electronic device may be the fault handling apparatus described above, or may be a terminal or a server including the fault handling apparatus. As shown in fig. 8, the electronic device includes a processor 81 and a memory 82. The memory 82 is configured to store instructions executable by the processor 81, and the processor 81 may implement the functions of each module in the fault handling apparatus in the foregoing embodiment. Wherein the memory 82 stores at least one instruction that is loaded and executed by the processor 81 to implement the methods provided by the various method embodiments described above.
In a specific implementation, as an embodiment, the processor 81 may include one or more CPUs, such as CPU0 and CPU1 shown in fig. 8. As an example, the electronic device may include a plurality of processors 81, such as the processor 81-1 and the processor 81-2 shown in fig. 8. Each of these processors 81 may be a single-core processor (Single-CPU) or a multi-core processor (Multi-CPU). The processor 81 herein may refer to one or more devices, circuits, and/or processing cores for processing data (e.g., computer program instructions).
The memory 82 may be, but is not limited to, a read-only memory (ROM) or other type of static storage device that can store static information and instructions, a random access memory (RAM) or other type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. The memory 82 may be stand-alone and coupled to the processor 81 via a communication bus 83. The memory 82 may also be integrated with the processor 81.
The communication bus 83 may be an industry standard architecture (ISA) bus, a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. The communication bus 83 may be classified into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in fig. 8, but this does not mean that there is only one bus or only one type of bus.
In addition, to facilitate information interaction between the electronic device and other devices (e.g., with a server when the electronic device is a terminal, or with a terminal when the electronic device is a server), the electronic device includes a communication interface 84. The communication interface 84 uses any transceiver-like means for communicating with other devices or communication networks, such as a control system, a radio access network (radio access network, RAN), a wireless local area network (wireless local area networks, WLAN), etc. The communication interface 84 may include a receiving unit to implement a receiving function and a transmitting unit to implement a transmitting function. The communication interface 84 is connected to the processor 81 and the memory 82 via the communication bus 83 to perform mutual communication.
Embodiments of the present application also provide a computer-readable storage medium storing computer instructions. When executed on an electronic device, the computer instructions cause the electronic device to perform the fault handling method in the method embodiments described above.
For example, the computer readable storage medium may be Read-Only Memory (ROM), random-access Memory (Random Access Memory, RAM), compact disc Read-Only Memory (CD-ROM), magnetic tape, floppy disk, optical data storage device, and the like.
The present application also provides a computer program product containing computer instructions which, when run on an electronic device, cause the electronic device to perform the fault handling method in the above method embodiments.
The electronic device, the computer readable storage medium, or the computer program product provided in the embodiments of the present application are configured to perform the corresponding methods provided above, and therefore, the advantages achieved by the electronic device, the computer readable storage medium, or the computer program product may refer to the advantages of the corresponding methods provided above, which are not described herein.
From the foregoing description of the embodiments, it will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of functional modules is illustrated, and in practical applications, the above-described functional allocation may be performed by different functional modules, that is, the internal structure of the apparatus (e.g., electronic device) is divided into different functional modules, so as to perform all or part of the functions described above. The specific operation of the above-described system, apparatus (e.g., electronic device) and unit may refer to the corresponding process in the foregoing method embodiment, which is not described herein.
In several embodiments provided herein, it should be understood that the disclosed systems, apparatuses (e.g., electronic devices) and methods may be implemented in other ways. For example, the above-described embodiments of an apparatus (e.g., an electronic device) are merely illustrative, e.g., the division of the modules or units is merely a logical functional division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be embodied essentially or in part or all of the technical solution or in part in the form of a software product stored in a storage medium, including instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) or a processor to perform all or part of the steps of the methods described in the various embodiments of the present application. And the aforementioned storage medium includes: flash memory, removable hard disk, read-only memory, random access memory, magnetic or optical disk, and the like.
The foregoing is merely specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily think about changes or substitutions within the technical scope of the present application, and the changes and substitutions are intended to be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A method of fault handling, the method comprising:
when a data warehouse task fault occurs, acquiring a tool log and a task running log; the tool log is a work log of big data tool equipment in a big data processing system for executing the data warehouse task; the task running log is a work log of a big data platform in a big data processing system for executing the task of the data warehouse;
calling a target fault classification model to process the tool log and the task running log to obtain a fault class; the fault category is any one of the following: tool side failure, bin side failure, platform side failure;
according to the fault category, determining a fault responsibility person and sending alarm information to a fault responsibility terminal; the fault responsibility terminal is a fault responsibility terminal corresponding to the fault responsibility person; and the alarm information is used for indicating the fault responsibility person to process the fault of the data warehouse task.
2. The method of claim 1, wherein said determining a fault responsibility person and sending an alarm message to a fault responsibility terminal based on the fault category comprises:
if the fault type is a tool side fault, determining a tool responsibility person corresponding to the big data tool equipment and communication information of the tool responsibility person according to a responsibility table of the big data tool equipment; the responsibility table comprises communication information of a plurality of tool managers and the corresponding relation between the tool managers and time periods responsible for the big data tool equipment;
and determining the tool responsibility person as the fault responsibility person, and sending the alarm information to the tool responsibility terminal by utilizing the communication information of the tool responsibility person.
3. The method of claim 1, wherein said determining a fault responsibility person and sending an alarm message to a fault responsibility terminal based on the fault category comprises:
if the fault class is a bin side fault, acquiring a first change record of a first subtask of the data warehouse task; the first change record is a first subtask version change record or a first subtask table change record;
And if the task version change or the task table change exists in the first subtask according to the first change record, determining the latest first change person corresponding to the first subtask as the fault responsible person and sending the alarm information to a first change terminal corresponding to the first change person.
4. The method of claim 3, wherein after the obtaining the first change log for the first subtask of the data warehouse task, the method further comprises:
if the first subtask is determined to have no task version change or task table change according to the first change record, a second change record of a second subtask is obtained; the second subtask is an upstream task of the first subtask in the data warehouse task; the second change record is a second subtask version change record or a second subtask table change record;
if the task version change or the task table change exists in the second subtask according to the second change record, determining a second change person corresponding to the new version change of the second subtask as the fault responsible person and sending the alarm information to a second change terminal corresponding to the second change person;
And if the second subtask is determined to have no task version change or task table change according to the second change record, determining the latest first change person corresponding to the first subtask as the fault responsible person and sending the alarm information to a first change terminal corresponding to the first change person.
5. The method of claim 1, wherein said determining a fault responsibility person and sending an alarm message to a fault responsibility terminal based on the fault category comprises:
if the fault type is a platform fault, acquiring cluster running state parameters of the big data platform;
if the cluster running state parameters indicate that the target component of the big data platform is abnormal, determining a component responsible person corresponding to the target component as the fault responsible person, and sending the alarm information to a component responsible terminal corresponding to the component responsible person;
if the cluster running state parameters indicate that the configuration parameters of the big data platform are changed, determining a platform responsibility person corresponding to the big data platform as the fault responsibility person, and sending the alarm information to a platform responsibility terminal corresponding to the platform responsibility person.
6. The method of claim 5, wherein after the obtaining the cluster operation state parameters of the big data platform, the method further comprises:
if the cluster running state parameters indicate that all components of the big data platform are not abnormal and the configuration parameters of the big data platform are not changed, determining the latest first change person corresponding to the first subtask of the data warehouse task as the fault responsible person and sending the alarm information to a first change terminal corresponding to the first change person.
7. The method of claim 1, wherein before the invoking the target fault classification model processes the tool log and the task execution log, the method further comprises:
obtaining a plurality of groups of sample data and sample categories corresponding to the plurality of groups of sample data one by one; the sample data are tool logs and task running logs when the big data processing system has a data warehouse task fault; the sample category corresponding to the sample data is a fault category of the sample data;
and taking the sample data as training data, taking the sample category as supervision information, and iteratively training an initial fault classification model to obtain the target fault classification model.
8. A fault handling apparatus, the apparatus comprising:
the acquisition module is used for acquiring a tool log and a task running log when the data warehouse task fails; the tool log is a work log of big data tool equipment in a big data processing system for executing the data warehouse task; the task running log is a work log of a big data platform in a big data processing system for executing the task of the data warehouse;
the processing module is used for inputting the tool logs and the task running logs acquired by the acquisition module into a target fault classification model to acquire fault categories; the fault category is any one of the following: tool side failure, bin side failure, platform side failure;
the processing module is also used for determining a fault responsibility person and sending alarm information to a fault responsibility terminal according to the fault category; the fault responsibility terminal is a fault responsibility terminal corresponding to the fault responsibility person; and the alarm information is used for indicating the fault responsibility person to process the fault of the data warehouse task.
9. An electronic device, comprising: the device comprises a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface complete communication with each other through the communication bus;
The memory is configured to store at least one executable instruction that causes the processor to perform the operations of the fault handling method as claimed in any one of claims 1 to 7.
10. A computer readable storage medium, characterized in that at least one executable instruction is stored in the storage medium, which executable instructions, when run on an electronic device, cause the electronic device to perform the operations of the fault handling method according to any of claims 1-7.
CN202310160864.9A 2023-02-23 2023-02-23 Fault processing method, device and computer readable storage medium Pending CN116069539A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310160864.9A CN116069539A (en) 2023-02-23 2023-02-23 Fault processing method, device and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310160864.9A CN116069539A (en) 2023-02-23 2023-02-23 Fault processing method, device and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN116069539A true CN116069539A (en) 2023-05-05

Family

ID=86176813

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310160864.9A Pending CN116069539A (en) 2023-02-23 2023-02-23 Fault processing method, device and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN116069539A (en)

Similar Documents

Publication Publication Date Title
US20180285247A1 (en) Systems, methods, and apparatus for automated code testing
US8938648B2 (en) Multi-entity test case execution workflow
US8434053B2 (en) Package review process workflow
CN112199355B (en) Data migration method and device, electronic equipment and storage medium
CN110955715A (en) ERP system, data conversion method and device of target platform and electronic equipment
CN113312341A (en) Data quality monitoring method and system and computer equipment
US9612944B2 (en) Method and system for verifying scenario based test selection, execution and reporting
US11962456B2 (en) Automated cross-service diagnostics for large scale infrastructure cloud service providers
CN113094081B (en) Software release method, device, computer system and computer readable storage medium
CN114168471A (en) Test method, test device, electronic equipment and storage medium
CN112379913B (en) Software optimization method, device, equipment and storage medium based on risk identification
US8380729B2 (en) Systems and methods for first data capture through generic message monitoring
CN116069539A (en) Fault processing method, device and computer readable storage medium
CN109471787B (en) Software quality evaluation method and device, storage medium and electronic equipment
CN115016321A (en) Hardware-in-loop automatic testing method, device and system
CN113656239A (en) Monitoring method and device for middleware and computer program product
Huang et al. Kubebench: A benchmarking platform for ml workloads
CN111400191A (en) Webpage security testing method and device and computer readable storage medium
CN111552631A (en) System testing method, device and computer readable storage medium
KR101478017B1 (en) Method and system for processing simulation data
CN115022317B (en) Cloud platform-based application management method and device, electronic equipment and storage medium
CN112950138B (en) Collaborative development state management method, device and server
US20220374328A1 (en) Advanced simulation management tool for a medical records system
US20220237021A1 (en) Systems and methods of telemetry diagnostics
US11651154B2 (en) Orchestrated supervision of a cognitive pipeline

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination