CN107204868B

CN107204868B - Task operation monitoring information acquisition method and device

Info

Publication number: CN107204868B
Application number: CN201610158804.3A
Authority: CN
Inventors: 卢山
Original assignee: China Mobile Group Shanxi Co Ltd
Current assignee: China Mobile Group Shanxi Co Ltd
Priority date: 2016-03-18
Filing date: 2016-03-18
Publication date: 2020-08-18
Anticipated expiration: 2036-03-18
Also published as: CN107204868A

Abstract

The invention discloses a method for acquiring task operation monitoring information, in which task identifiers are respectively set according to the operation information of each task node where a task is located. The method further comprises: according to the fault point task identifier, obtaining from the monitoring information the task monitoring information, platform monitoring information and/or equipment monitoring information of the task node corresponding to the content of the fault point task identifier, and determining the fault information. The invention also discloses a device for acquiring the task operation monitoring information.

Description

Task operation monitoring information acquisition method and device

Technical Field

The invention relates to a big data operation management technology, in particular to a method and a device for acquiring task operation monitoring information.

Background

At present, big data applications are developing rapidly and are widely used in many fields. Each technical platform has its own advantages in running tasks, but each forms its own system and is managed separately. The advantages of different technical systems need to be integrated to meet service requirements at different levels, so a distinguishing feature of a big data platform, compared with a traditional platform, is an operation mode built on heterogeneity, mixed technologies, converged deployment and multi-service linkage. In this situation, when a fault occurs, troubleshooting basically proceeds as follows:

firstly, fault location: because the tasks of a big data platform are scattered across the individual platforms, and the monitoring information management systems of those platforms are independent of one another and follow their own standards and rules, when a fault occurs the operation and maintenance personnel must manually cross-check the fault information reported on each platform's management console, manually eliminate secondary and associated alarms, and confirm the fault alarm and the cause of the fault;

secondly, fault analysis: because the data of each system is independent, most fault analysis is currently carried out within a single system, and the data of the various systems is summarized and correlated manually; when the platform runs many tasks with complex logical relationships among them, their cross-dependencies essentially cannot be determined, so the analysis can only be refined step by step starting from the whole system, which is time-consuming and means that the severity and scope of the fault's impact cannot be confirmed promptly;

thirdly, fault resolution: after the information from each vendor has been summarized and the fault point has been determined manually, the vendors must be coordinated to resolve it; each vendor is responsible only for faults in its own system and does not consider how its fix affects other systems, so no vendor takes an overall view of the fault at the system architecture level; after the fix is completed, the result must be checked on the management and control platform of each system, and whether the fault has been resolved and whether new problems have been introduced must be judged manually.

Under existing conditions, a big data platform has the following shortcomings in evaluating the service impact of faults, in fault analysis efficiency and in alarm accuracy. First, fault location is currently analyzed separately by different vendors; particularly where a big data platform is built from a mix of technologies, as more and more tasks go online, the logical associations among platform tasks cannot really be known and the operational dependencies among tasks are hard to sort out, so comprehensive fault analysis capability is lacking and the impact of a fault on services is difficult to evaluate accurately. Second, because vendors' levels of operation management for big data technologies are uneven, existing fault analysis cannot sort out how faults propagate among components, particularly when several technical components such as Spark, Storm, Sqoop, Hive and HBase are used together; during fault location each component can only be checked from scratch, which lengthens the time needed to resolve the fault. Third, existing fault monitoring is fragmented: task faults, platform faults and equipment faults each have their own interface and information, and this information is neither associated nor fused; when a fault occurs, a large amount of alarm information appears at different layers and cannot be effectively correlated and filtered, so an alarm storm forms and operation and maintenance personnel are overwhelmed. The alarm information of each platform must then be gathered and analyzed manually to eliminate secondary and associated alarms and find the cause of the fault, which depends heavily on experts and makes fault handling inefficient.

Therefore, how to locate faults of a big data platform more quickly, analyze fault impact automatically, and resolve faults in a more timely manner is a problem to be solved urgently.

Disclosure of Invention

In view of this, embodiments of the present invention aim to provide a method and a device for acquiring task operation monitoring information, so as to locate faults of a big data platform more quickly, analyze fault impact automatically, and resolve faults in a more timely manner.

In order to achieve the purpose, the technical scheme of the invention is realized as follows:

the embodiment of the invention provides a method for acquiring task operation monitoring information, which comprises the following steps: respectively setting task identifiers according to the running information of each task node where the task is located; the method further comprises the following steps:

according to the fault point task identification, task monitoring information, platform monitoring information and/or equipment monitoring information of the task node corresponding to the fault point task identification content are obtained from the monitoring information, and fault information is determined.

In the above scheme, the operation information includes: a run platform type, and/or a run platform component type;

the content of the task identification comprises: the operating platform type, and/or the operating platform component type, and/or a task serial number;

the task serial number includes: a unique identification number preset for the task, or a unique identification number preset for the task node.

In the above scheme, acquiring, according to the fault point task identifier, the task monitoring information of the task node corresponding to the content of the fault point task identifier from the monitoring information, and determining the fault information, comprises:

pre-associating the task serial number with the task monitoring information;

and determining task monitoring information of the task node corresponding to the fault point task identifier according to the task serial number in the fault point task identifier.

In the above scheme, acquiring, according to the fault point task identifier, the platform monitoring information of the task node corresponding to the content of the fault point task identifier from the monitoring information, and determining the fault information, comprises:

determining the platform type and the platform component type of a task node corresponding to the fault point task identifier according to the operation platform type and the operation platform component type in the fault point task identifier;

and determining, according to the platform type and the platform component type of the task node corresponding to the fault point task identifier, the platform monitoring information corresponding to that platform type and platform component type.

In the above scheme, acquiring, according to the fault point task identifier, the equipment monitoring information of the task node corresponding to the content of the fault point task identifier from the monitoring information, and determining the fault information, comprises:

searching a task execution log for the host name of the equipment that runs the operation platform type and the operation platform component type given in the fault point task identifier;

and determining the equipment monitoring information of the task node corresponding to the fault point task identifier according to the equipment host name.

The embodiment of the invention also provides a device for acquiring the task operation monitoring information, which comprises: setting means and determining means, wherein,

the setting device is used for respectively setting task identifiers according to the running information of each task node where the task is located;

and the determining device acquires the task monitoring information, and/or platform monitoring information and/or equipment monitoring information of the task node corresponding to the content of the fault point task identifier from the monitoring information according to the fault point task identifier, and determines the fault information.

In the foregoing solution, the determining apparatus is specifically configured to:

pre-associating the task serial number with the task monitoring information;

the determining of the fault information includes: determining the task monitoring information of the task node corresponding to the fault point task identifier according to the task serial number in the fault point task identifier.

According to the method and the device for acquiring task operation monitoring information provided by the embodiments of the invention, task identifiers are respectively set according to the operation information of each task node where the task is located; then, according to the fault point task identifier, the task monitoring information, platform monitoring information and/or equipment monitoring information of the task node corresponding to the content of the fault point task identifier are obtained from the monitoring information, and the fault information is determined. In this way, the task monitoring information, the platform monitoring information and the equipment monitoring information of the fault point's task node can be accurately acquired through the task identifier of that node; when a fault occurs, the fault monitoring information can be acquired according to the task identifier of the faulty node, so that the fault is located quickly, fault impact is analyzed automatically, and faults are resolved in a more timely manner.

Drawings

FIG. 1 is a schematic flow chart of a task operation monitoring information acquisition method according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of task identifier components according to an embodiment of the present invention;

FIG. 3 is a schematic diagram illustrating a sequence of a process for implementing fault location by task identification according to an embodiment of the present invention;

FIG. 4 is a schematic diagram illustrating a task identifier association principle according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of a business process of an application example according to an embodiment of the present invention;

FIG. 6 is a schematic diagram of a business process operation record of an application example according to an embodiment of the present invention;

FIG. 7 is a schematic diagram illustrating a service task identifier operation result of an application example according to an embodiment of the present invention;

FIG. 8 is a schematic structural diagram of the components of a task operation monitoring information acquisition device according to an embodiment of the present invention.

Detailed Description

In the embodiment of the invention, task identifiers are respectively set according to the running information of each task node where a task is located; the method further comprises the following steps: according to the fault point task identification, task monitoring information, platform monitoring information and/or equipment monitoring information of the task node corresponding to the fault point task identification content are obtained from the monitoring information, and fault information is determined.

The present invention will be described in further detail with reference to examples.

As shown in fig. 1, a method for acquiring task operation monitoring information according to an embodiment of the present invention includes:

step 101: respectively setting task identifiers according to the running information of each task node where the task is located;

Generally, one big data application comprises a plurality of tasks; the relationship between the application and its tasks is one-to-many, and this association is maintained by the system. A single task is also called a workflow, and a single workflow is composed of a plurality of task nodes. In the prior art, each task is assigned a task identifier or task serial number that is unrelated to the task's running information and is used to track the execution state of the task; the drawbacks are that the execution state of each individual task node cannot be obtained accurately, and that running information cannot be obtained directly from the execution state.

In the technical scheme of the invention, a different task identifier is set for each task node during task running; the task identifier contains the running information of the task at that task node, and the running information comprises: an operation platform type, an operation platform component type, and/or a task serial number. The operation platform type is the type of the platform on which the task node runs, such as Java, Storm, Hadoop or Spark; the operation platform component type is the type of the component of that platform on which the task node runs, such as the Java-Node component of the Java platform; the task serial number is a unique serial number assigned to the task before the task runs.

In practical applications, the form of the task identifier may be as shown in fig. 2, where the task identifier may further include: task type and task name, used for identifying the task more quickly and conveniently; according to different specific task component types, in the form of fig. 2, the task identifier may be set as follows:

for a synchronous task component, which runs directly on the Oozie server and yields only success or failure information with nothing else of note, the task identifier can be set as: oozie:none;

for a single map/Reduce task component, running is triggered by submitting a map-only MapReduce job in which the user-defined action component is encapsulated in the map; the task identifier of this task can be defined as: oozie:launcher:T={0}:W={1}:A={2}:ID={3}; wherein {0} represents the component type, {1} represents the platform type, {2} represents the task name, and {3} represents the task serial number, so that the task identifier reflects the relationship between the task and the platform;

for a dual map/Reduce task component, on top of the single map/Reduce job described above, the action defined by the user in the map is itself a job of a map/Reduce nature, and the task identifier of that job can be defined as: oozie:action:T={0}:W={1}:A={2}:ID={3}; wherein {0} represents the component type, {1} represents the platform type, {2} represents the task name, and {3} represents the task serial number, so that the task identifier reflects the relationship between the task and the platform;

Single-map/Reduce task components and dual-map/Reduce task components are referred to herein as asynchronous task components.
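The identifier forms described above can be illustrated with a minimal Python sketch. It assumes the colon-separated layout oozie:<kind>:T={0}:W={1}:A={2}:ID={3} with the field meanings given in the text, and it assumes that field values themselves contain no colon. The class and function names (TaskIdentifier, parse_identifier) and the sample values are hypothetical and are not part of the described scheme.

```python
from dataclasses import dataclass
from typing import Optional

# Identifier used for synchronous components, which carry no further fields.
SYNC_IDENTIFIER = "oozie:none"


@dataclass(frozen=True)
class TaskIdentifier:
    kind: str            # "launcher" or "action" (asynchronous components)
    component_type: str  # field {0}, written after T=
    platform_type: str   # field {1}, written after W=
    task_name: str       # field {2}, written after A=
    serial_number: str   # field {3}, written after ID=

    def render(self) -> str:
        """Format the identifier in the colon-separated layout."""
        return (
            f"oozie:{self.kind}:T={self.component_type}:W={self.platform_type}"
            f":A={self.task_name}:ID={self.serial_number}"
        )


def parse_identifier(text: str) -> Optional[TaskIdentifier]:
    """Parse an identifier; return None for a synchronous component."""
    if text == SYNC_IDENTIFIER:
        return None
    prefix, kind, *fields = text.split(":")
    if prefix != "oozie" or kind not in ("launcher", "action") or len(fields) != 4:
        raise ValueError(f"unrecognized task identifier: {text!r}")
    values = []
    for field, expected_key in zip(fields, ("T", "W", "A", "ID")):
        key, sep, value = field.partition("=")
        if key != expected_key or not sep:
            raise ValueError(f"unexpected field {field!r} in {text!r}")
        values.append(value)
    return TaskIdentifier(kind, *values)
```

Because the platform type and component type are carried in the identifier itself, a parser of this kind is all that later lookups need in order to know where a faulty node was running.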

Step 102: according to the fault point task identification, acquiring task monitoring information, platform monitoring information and/or equipment monitoring information of a task node corresponding to the content of the fault point task identification from monitoring information, and determining fault information;

The existing big data monitoring information comprises: application monitoring information, task monitoring information, platform monitoring information and equipment monitoring information; each kind of monitoring information contains various running information of the task during its running. The task monitoring information, the platform monitoring information and the equipment monitoring information are independent of and unrelated to one another; the association between a big data application and its tasks is maintained by the system, so the association between task monitoring information and application monitoring information can be completed through the relationship of task ownership. The task identifier of the technical scheme of the invention associates the task monitoring information, the platform monitoring information and the equipment monitoring information with one another, so that the application monitoring information, the task monitoring information, the platform monitoring information and the equipment monitoring information all become associated. The platform monitoring information includes: the platform name, the platform type, the platform state, and the execution state and logs of tasks on the platform; the task monitoring information includes: task flow information, flow links, the current link, the time used by each link, node changes and task output logs; the equipment monitoring information includes: the equipment platform, equipment host information and the running condition of the equipment host. Because the task identifier associates the task monitoring information, the platform monitoring information and the equipment monitoring information, the associated information can be merged into the application monitoring information, so that the application monitoring information can provide: the tasks contained in the application, the execution condition of each task, the running condition of each task on the platform, and the running condition of the equipment host where the task is located; the information of every task of the whole application and the running information of every task node can thus be obtained.

Specifically, before a task runs, a unique task serial number may be assigned to it, and the task monitoring information may be made to correspond to that serial number; for example, the task monitoring information can be named after the task serial number, so that the corresponding task monitoring information can be obtained through the task serial number in the task identifier. Alternatively, a unique task node identifier can be appended to the serial number and task monitoring information established separately under the serial number of each task node, so that the task monitoring information of the corresponding task node can be obtained through the task serial number in the task identifier. During the running of a big data task, if a fault occurs, the task monitoring information corresponding to the faulty task node can be obtained through the task identifier of that node.
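A minimal Python sketch of this pre-association follows. It assumes the task serial number is the value after ID= in the task identifier and that an optional task node identifier is appended after an "@" separator; both the separator and the names TaskMonitoringStore and lookup_by_identifier are hypothetical choices made only for illustration.

```python
class TaskMonitoringStore:
    """Task monitoring records filed under (serial number, node id)."""

    def __init__(self):
        self._records = {}

    def record(self, serial_number, entry, node_id=None):
        # Pre-association: every entry is stored under the task serial number,
        # optionally refined by the identifier of the task node.
        self._records.setdefault((serial_number, node_id), []).append(entry)

    def lookup(self, serial_number, node_id=None):
        return list(self._records.get((serial_number, node_id), []))


def lookup_by_identifier(store, identifier):
    """Return the task monitoring entries for a fault point task identifier."""
    _, sep, id_field = identifier.rpartition(":ID=")
    if not sep:
        return []  # for example "oozie:none", which carries no serial number
    serial_number, at, node_id = id_field.partition("@")
    return store.lookup(serial_number, node_id if at else None)
```

Keying the store on a (serial number, node id) pair rather than on a concatenated string avoids ambiguity when a serial number happens to contain the separator character.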

The platform type and the platform component type of the task node where the current task runs can be determined through the operation platform type and the operation platform component type in the task identifier; generally, the platform monitoring information is classified according to a platform type and a platform component type, so that the platform monitoring information of the task node can be associated through the platform type and the platform component type; in the running process of the big data task, if a fault occurs, the platform monitoring information of the fault task node can be associated through the platform type and the platform component type in the task identifier of the fault task node.
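The association by platform type and platform component type can be sketched in Python as follows. The sketch assumes that platform monitoring records are dictionaries with platform_type and component_type keys, and that the identifier carries the component type after T= and the platform type after W=; the record layout and the function names are hypothetical.

```python
from collections import defaultdict


def index_platform_monitoring(records):
    """Group platform monitoring records by (platform type, component type)."""
    index = defaultdict(list)
    for record in records:
        index[(record["platform_type"], record["component_type"])].append(record)
    return index


def platform_key(identifier):
    """Extract (platform type, component type) from a task identifier."""
    fields = dict(
        part.split("=", 1) for part in identifier.split(":") if "=" in part
    )
    return fields.get("W"), fields.get("T")


def platform_monitoring_for(index, identifier):
    """Return the platform monitoring records matching a fault point identifier."""
    return list(index.get(platform_key(identifier), []))
```

An identifier without these fields, such as the one used for synchronous components, yields the key (None, None) and therefore an empty result.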

Through the operation platform type and the operation platform component type in the task identifier, the host name of the equipment running that platform type and component type can be retrieved from the task execution log in the platform monitoring information; the equipment monitoring information of the task node can then be retrieved from the equipment monitoring information through that host name. During the running of a big data task, if a fault occurs, the host name of the equipment can be determined in the task execution log through the platform type and the platform component type in the task identifier of the faulty task node, and the equipment monitoring information of the task node can then be obtained.
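The two-step retrieval (task execution log to host name, host name to equipment monitoring information) might look like the following Python sketch. The text does not specify a log format, so the key=value line layout matched below is purely an assumption, as are the function names and the sample host names.

```python
import re

# Hypothetical log line layout, for example:
#   2016-03-18 10:00:01 platform=hadoop component=map-reduce host=dn01 status=FAILED
LOG_LINE = re.compile(
    r"platform=(?P<platform>\S+)\s+component=(?P<component>\S+)\s+host=(?P<host>\S+)"
)


def hosts_for(log_lines, platform_type, component_type):
    """Return host names, in order of first appearance, that ran the given
    platform type and component type according to the task execution log."""
    hosts = []
    for line in log_lines:
        match = LOG_LINE.search(line)
        if (
            match
            and match["platform"] == platform_type
            and match["component"] == component_type
            and match["host"] not in hosts
        ):
            hosts.append(match["host"])
    return hosts


def equipment_monitoring_for(equipment_records, hosts):
    """Select the equipment monitoring records of the given hosts."""
    return {host: equipment_records[host] for host in hosts if host in equipment_records}
```

In this sketch the equipment monitoring information is assumed to be a dictionary keyed by host name, which mirrors the statement that it is retrieved through the host name of the equipment.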

Therefore, the application monitoring information, the task monitoring information, the platform monitoring information and the equipment monitoring information are all associated through the task identifier, achieving seamless fusion of the monitoring information. In practical applications, this fusion makes it possible to build a user interface for daily maintenance that gathers the associated items of each kind of monitoring information and presents the associated application, task, platform and equipment monitoring information to maintenance personnel at the same time, so that the various monitoring information for every task node of the tasks in an application is directly available, which facilitates operation assurance and fault location. When a fault occurs, the platform component type or faulty equipment involved can be determined through the fault point task identifier, which greatly improves the efficiency of fault location.

The present invention is described in further detail below with reference to example 1.

As shown in fig. 3, (a) and (b) respectively show the flow sequence for the synchronous task component and the asynchronous task component when fault location is implemented through the task identifier. The principle by which the task identifier associates tasks, platforms, equipment and applications is shown in fig. 4: an application comprises a plurality of tasks, the relationship between the application and its tasks is one-to-many, and this association is maintained by the system; a single task is a workflow composed of a plurality of task nodes, and each task node is given a unique task identifier each time the flow runs; the task identifier associates the node with the job running on the platform; and because a platform job executes on specific equipment, the platform job log or job status can in turn be associated with the corresponding equipment information. The task identifier thus associates tasks, platforms, equipment and applications; further, when a fault occurs, service processing such as fault location, fault analysis and fault monitoring can be completed through the task log.
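The one-to-many relationships described here (application to tasks, task to task nodes, each node carrying its own task identifier) can be modelled with a short Python sketch. The class names and the locate function are hypothetical; the sketch only shows how a fault point task identifier leads back to the affected task and application.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class TaskNode:
    name: str
    identifier: str  # unique task identifier set for this run of the flow


@dataclass
class Task:  # a single task is a workflow made of task nodes
    name: str
    nodes: List[TaskNode] = field(default_factory=list)


@dataclass
class Application:  # one application owns many tasks
    name: str
    tasks: List[Task] = field(default_factory=list)


def locate(applications, fault_identifier) -> Optional[Tuple[str, str, str]]:
    """Return (application, task, node) names for a fault point identifier,
    or None when no task node carries that identifier."""
    for application in applications:
        for task in application.tasks:
            for node in task.nodes:
                if node.identifier == fault_identifier:
                    return application.name, task.name, node.name
    return None
```

Because each identifier is unique per node and per run, the first match is the only match, which is what allows the impact of a fault to be traced to a specific application.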

The present invention will be described in further detail with reference to example 2.

In the embodiment, the task identifier is set at the task node of each task of the specific service application, so that the task running in the specific service application is well monitored;

here, the function implemented by the specific service application is the calculation and analysis of user interaction circle behavior: the voice call detail records of users are used to discover and analyze each user's interaction circle; the user's interaction behavior is analyzed by evaluating indicators of the user's communication with other users, such as communication frequency, communication duration and communication time period, and it is determined whether the user is the most influential user in the interaction circle; the specific business process is shown in fig. 5, and includes:

step 501: collecting detail record files from the interface machine into the Hadoop Distributed File System (HDFS);

step 502: cleaning, filtering and sorting the detail records using a Map/Reduce program;

step 503: loading the result of step 502 into the Hive database, aggregating it by calling number and called number, and calculating indicators such as the number of calls, the call duration and the call time period;

step 504: exporting the analysis result to a relational database through a Sqoop script.
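As an illustration of setting a task identifier at each node of this flow, the following Python sketch assigns one identifier per step for a single run. The component and platform labels, the step names, the use of the oozie:action: form for every node, and the "@" suffix that appends the step number to the task serial number are all assumptions made for the example, not values taken from the figures.

```python
# (step number, task name, component type, platform type); labels are illustrative
STEPS = [
    (501, "collect_detail_records", "hdfs", "hadoop"),
    (502, "clean_detail_records", "map-reduce", "hadoop"),
    (503, "summarize_calls", "hive", "hadoop"),
    (504, "export_results", "sqoop", "hadoop"),
]


def assign_identifiers(steps, serial_number):
    """Return {step number: task identifier} for one run of the flow.

    The step number is appended to the task serial number so that every
    task node receives an identifier that is unique within the run.
    """
    identifiers = {}
    for step, task_name, component_type, platform_type in steps:
        identifiers[step] = (
            f"oozie:action:T={component_type}:W={platform_type}"
            f":A={task_name}:ID={serial_number}@{step}"
        )
    return identifiers
```

If step 503 fails in a given run, its identifier alone indicates that the Hive component on the Hadoop platform is the place to look, and the serial number leads to the monitoring records of that run.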

The technical scheme of the invention is adopted in the business process, task identifiers are set at each node, the business process running records are shown in figure 6, the business process node states are shown in figure 7(a), and the business process running logs are shown in figure 7 (b); the task identifier in the service flow node corresponds to the task identifier in the platform according to the name, as shown in fig. 7 (c); the operation state of the task identifier in the service flow node corresponding to the operation in the platform is shown in fig. 7 (d); the operation situation and the corresponding device situation of the task identifier in the service flow node in the platform are shown in fig. 7 (e);

it can be seen from fig. 7 that in the whole service flow, the task monitoring information, the platform monitoring information, and the device monitoring information in the task node have been associated by the task identifier; therefore, maintenance personnel can conveniently acquire required information; when a fault occurs, information of a platform, a platform component or equipment and the like of the fault can be conveniently determined.

As shown in fig. 8, the device for acquiring task operation monitoring information according to an embodiment of the present invention includes: a setting module 81, a determination module 82, wherein,

the setting module 81 is configured to set task identifiers according to the operation information of each task node where the task is located;

In the technical scheme of the invention, a different task identifier is set for each task node during task running; the task identifier contains the running information of the task at that task node, and the running information comprises: an operation platform type, an operation platform component type, and/or a task serial number. The operation platform type is the type of the platform on which the task node runs, such as Java, Storm, Hadoop or Spark; the operation platform component type is the type of the component of that platform on which the task node runs, such as the Java-Node component of the Java platform; the task serial number is a unique serial number assigned to the task before the task runs.

The determining module 82 is configured to obtain, according to each task identifier, the monitoring information on the task operation of the task node corresponding to the content of that task identifier from the monitoring information.

In practical applications, the setting module 81 and the determining module 82 may be implemented by a Central Processing Unit (CPU), a microprocessor unit (MPU), a Digital Signal Processor (DSP), a Field Programmable Gate Array (FPGA), or the like of a big data server system.

The above description is only exemplary of the present invention and should not be taken as limiting the scope of the present invention, which is intended to cover any modifications, equivalents, improvements, etc. within the spirit and scope of the present invention.

Claims

1. A task operation monitoring information acquisition method is characterized by comprising the following steps: respectively setting task identifiers according to the running information of each task node where the task is located; the method further comprises the following steps:

according to the fault point task identification, acquiring task monitoring information, platform monitoring information and/or equipment monitoring information of a task node corresponding to the content of the fault point task identification from monitoring information, and determining fault information;

the content of the task identification comprises: a run platform type, and/or a run platform component type, and/or a task serial number;

wherein acquiring, according to the fault point task identifier, the task monitoring information of the task node corresponding to the content of the fault point task identifier from the monitoring information, and determining the fault information, comprises: pre-associating the task serial number with the task monitoring information; and determining the task monitoring information of the task node corresponding to the fault point task identifier according to the task serial number in the fault point task identifier;

acquiring, according to the fault point task identifier, the platform monitoring information of the task node corresponding to the content of the fault point task identifier from the monitoring information, and determining the fault information, comprises: determining the platform type and the platform component type of the task node corresponding to the fault point task identifier according to the operation platform type and the operation platform component type in the fault point task identifier; and determining, according to the platform type and the platform component type of the task node corresponding to the fault point task identifier, the platform monitoring information corresponding to that platform type and platform component type;

acquiring, according to the fault point task identifier, the equipment monitoring information of the task node corresponding to the content of the fault point task identifier from the monitoring information, and determining the fault information, comprises: searching a task execution log for the host name of the equipment that runs the operation platform type and the operation platform component type given in the fault point task identifier; and determining the equipment monitoring information of the task node corresponding to the fault point task identifier according to the equipment host name.

2. The method of claim 1,

the operation information includes: a run platform type, and/or a run platform component type;

3. A task operation monitoring information acquisition apparatus, characterized in that the apparatus comprises: setting means and determining means, wherein,

the determining device acquires task monitoring information, platform monitoring information and/or equipment monitoring information of a task node corresponding to the content of the fault point task identifier from monitoring information according to the fault point task identifier, and determines fault information;

the determining means is specifically configured to:

determining task monitoring information of a task node corresponding to the fault point task identifier according to the task serial number in the fault point task identifier;

determining the platform type and the platform component type of the task node corresponding to the fault point task identifier according to the operation platform type and the operation platform component type in the fault point task identifier; and determining, according to the platform type and the platform component type of the task node corresponding to the fault point task identifier, the platform monitoring information corresponding to that platform type and platform component type;

searching a task execution log for the host name of the equipment that runs the operation platform type and the operation platform component type given in the fault point task identifier; and determining the equipment monitoring information of the task node corresponding to the fault point task identifier according to the equipment host name.

4. The apparatus of claim 3,