CN112579685A - State monitoring and health degree evaluation method and device for big data operation - Google Patents

State monitoring and health degree evaluation method and device for big data operation

Info

Publication number
CN112579685A
Authority
CN
China
Prior art keywords
data
job
analysis
monitoring
parameters
Prior art date
Legal status
Withdrawn
Application number
CN202011644280.1A
Other languages
Chinese (zh)
Inventor
温秋荣
沈鹏
王圣玉
张程
Current Assignee
Shanghai Wiwide Network Technology Co ltd
Original Assignee
Shanghai Wiwide Network Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Wiwide Network Technology Co ltd filed Critical Shanghai Wiwide Network Technology Co ltd
Priority to CN202011644280.1A
Publication of CN112579685A
Legal status: Withdrawn

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 - Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/25 - Integrating or interfacing systems involving database management systems
    • G06F 16/254 - Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 - Error detection; Error correction; Monitoring
    • G06F 11/30 - Monitoring
    • G06F 11/3003 - Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F 11/3034 - Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is a storage system, e.g. DASD based or network based
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 - Error detection; Error correction; Monitoring
    • G06F 11/30 - Monitoring
    • G06F 11/3051 - Monitoring arrangements for monitoring the configuration of the computing system or of the computing system component, e.g. monitoring the presence of processing resources, peripherals, I/O links, software programs
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 - Error detection; Error correction; Monitoring
    • G06F 11/30 - Monitoring
    • G06F 11/34 - Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F 11/3409 - Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment for performance assessment

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Quality & Reliability (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Hardware Design (AREA)
  • Mathematical Physics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

A method and device for state monitoring and health degree evaluation of big data jobs, comprising the following steps: configuring monitoring parameters for each type of job in advance, and configuring analysis parameters for each analysis item in advance; collecting job running data about a job according to the monitoring parameters while the job is running; performing ETL processing on the job running data to obtain data to be evaluated about the job; and analyzing the data to be evaluated according to the analysis parameters to obtain a score of the job corresponding to the analyzed data to be evaluated. The invention establishes unified log rules, log management configuration and a unified log format for each big data job, so that the running state of a job can be acquired quickly, problems can be found in time, and anomalies can be located quickly and accurately.

Description

State monitoring and health degree evaluation method and device for big data operation
Technical Field
The invention relates to the technical field of big data jobs, and in particular to a method and device for state monitoring and health degree evaluation of big data jobs.
Background
In big data jobs, when a job runs into a problem, the problem is often difficult to detect and handle in time. Specifically, because of factors such as the data source, program logic, and running environment, a server can rarely obtain a comprehensive view of a running job's working state, and job-related information is typically recorded in fragments across the individual scheduling links, so maintainers often cannot discover job problems in advance.
Even after a problem occurs, a maintainer has to log in to the scheduling node, locate the corresponding log file, and then sift the relevant exception information out of the large amount of log data recorded there, which makes troubleshooting cumbersome and time-consuming.
Therefore, how to quickly acquire the running state of a job, find problems in time, and locate anomalies quickly and accurately in big data jobs is a problem to be solved urgently in this field.
ETL (Extract-Transform-Load) refers to extracting data from a business system (or another source), transforming it, and loading it into a data warehouse (or another destination), so as to integrate data that is scattered, disordered, and non-uniform across an enterprise and provide a basis for analysis in enterprise decision making.
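For illustration only, the three ETL stages can be sketched roughly as follows in Python; the SQLite backend, table names, and field names are assumptions made for this example and are not part of this disclosure:

    # Minimal ETL sketch (illustrative only; tables and fields are hypothetical).
    import sqlite3

    def extract(conn):
        # Extract raw job records from a source (business-system) table.
        return conn.execute("SELECT job_id, status, duration_ms FROM job_log").fetchall()

    def transform(rows):
        # Clean and normalize: drop malformed rows, convert units, unify case.
        return [
            {"job_id": j, "status": s.upper(), "duration_s": d / 1000.0}
            for j, s, d in rows
            if d is not None
        ]

    def load(conn, records):
        # Load the cleaned records into a destination (warehouse) table.
        conn.executemany(
            "INSERT INTO job_metrics (job_id, status, duration_s) "
            "VALUES (:job_id, :status, :duration_s)",
            records,
        )
        conn.commit()

    if __name__ == "__main__":
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE job_log (job_id TEXT, status TEXT, duration_ms INTEGER)")
        conn.execute("CREATE TABLE job_metrics (job_id TEXT, status TEXT, duration_s REAL)")
        conn.execute("INSERT INTO job_log VALUES ('job-1', 'success', 1200)")
        load(conn, transform(extract(conn)))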
Disclosure of Invention
The technical problem solved by the invention is: in big data jobs, how to quickly acquire the running state of a job, find problems in time, and locate anomalies quickly and accurately.
In order to solve the above technical problem, an embodiment of the present invention provides a method for state monitoring and health degree evaluation of big data jobs, including:
configuring monitoring parameters for each type of job in advance, and configuring analysis parameters for each analysis item in advance;
when a job is started, acquiring job information of the job, and acquiring the monitoring parameters corresponding to the job type of the job according to the job information;
collecting job running data about the job according to the monitoring parameters while the job is running;
performing ETL processing on the job running data to obtain data to be evaluated about the job;
and analyzing the data to be evaluated according to the analysis parameters to obtain a score of the job corresponding to the analyzed data to be evaluated.
Optionally, the method for state monitoring and health degree evaluation of big data jobs further includes: generating an identifier for each job during its run according to the job information and the monitoring/analysis parameters, wherein different jobs have different identifiers, and the identifier is applied at the start position and the end position of the job scheduling node so as to associate the job with its scheduling environment.
Optionally, the job information includes: basic information and supplementary information of the job, wherein the supplementary information is determined by the job type and the business needs of the job.
Optionally, the monitoring parameters include: the status reporting frequency, fault tolerance mechanism, and alarm mechanism of the job type.
Optionally, after the monitoring parameters corresponding to the job type of the job are obtained, the method further includes: creating an information acquisition thread for the job according to the acquired job information and the monitoring parameters corresponding to the job type.
Optionally, collecting job running data about the job according to the monitoring parameters while the job is running includes: starting a data acquisition module according to preset start-up parameters and the monitoring parameters, the data acquisition module periodically checking whether any job is running and, when a running job is found, obtaining the monitoring parameters corresponding to that job's type and collecting job running data about the job according to those monitoring parameters.
Optionally, performing ETL processing on the job running data to obtain data to be evaluated about the job includes: acquiring the parameters of the relevant index model and performing ETL processing on the job running data according to those parameters.
Optionally, the analysis includes a plurality of analysis items, each analysis item having its own index data, weighting coefficient, and analysis rule; after an item is analyzed, an itemized score of the analyzed node with respect to that analysis item is obtained.
Optionally, the total score of the analyzed node is calculated from the itemized scores of the analyzed node on the analysis items, the weighting coefficients of the analysis items, and the analysis rules of the analysis items, using the following formula:
Score = A1·W1·X1 + A2·W2·X2 + … + An·Wn·Xn
where Score denotes the total score, An denotes the analysis rule parameter of the n-th analysis item, Wn denotes the weighting coefficient of the n-th analysis item, and Xn denotes the itemized score of the n-th analysis item.
Optionally, the job running data and the data to be evaluated of a node are bound to the node identifier of that node and transmitted together, and each node's identifier uses a secret code.
In order to solve the above technical problem, an embodiment of the present invention further provides a device for state monitoring and health degree evaluation of big data jobs, including:
a processor adapted to load and execute the instructions of a software program;
a memory adapted to store a software program comprising instructions for performing the following steps:
configuring monitoring parameters for each type of job in advance, and configuring analysis parameters for each analysis item in advance;
when a job is started, acquiring job information of the job, and acquiring the monitoring parameters corresponding to the job type of the job according to the job information;
collecting job running data about the job according to the monitoring parameters while the job is running;
performing ETL processing on the job running data to obtain data to be evaluated about the job;
and analyzing the data to be evaluated according to the analysis parameters to obtain a score of the job corresponding to the analyzed data to be evaluated.
Optionally, the device for state monitoring and health degree evaluation of big data jobs further includes a configuration module 201, a data acquisition module 202, an extraction service module 203, and an analysis service module 204, wherein:
the configuration module 201 is adapted to configure monitoring parameters for each type of job in advance and to configure analysis parameters for each analysis item in advance;
the data acquisition module 202 is adapted to collect job running data about a job according to the monitoring parameters while the job is running;
the extraction service module 203 is adapted to perform ETL processing on the job running data to obtain data to be evaluated about the job;
the analysis service module 204 is adapted to analyze the data to be evaluated according to the analysis parameters, so as to obtain a score of the job corresponding to the analyzed data to be evaluated.
In order to solve the above technical problem, an embodiment of the present invention further provides a server including the above device for state monitoring and health degree evaluation of big data jobs.
Compared with the prior art, the technical scheme of the invention has the following beneficial effects:
Monitoring parameters are configured for each type of job in advance, and analysis parameters are configured for each analysis item in advance; job running data about a job is collected according to the monitoring parameters while the job is running; ETL processing is performed on the job running data to obtain data to be evaluated about the job; and the data to be evaluated are analyzed according to the analysis parameters to obtain a score of the job corresponding to the analyzed data. In this way, unified log rules, log management configuration and a unified log format are established for every big data job, so that the running state of a job can be acquired quickly, problems can be found in time, and anomalies can be located quickly and accurately.
Furthermore, a specific scoring rule is disclosed in which a total score is derived from the itemized score of each analysis item, so that a unified running-state evaluation mechanism is established for all big data jobs.
Drawings
FIG. 1 is a flow chart of a method for state monitoring and health degree evaluation of big data jobs according to an embodiment of the present invention;
FIG. 2 is a block diagram of a device for state monitoring and health degree evaluation of big data jobs according to an embodiment of the present invention.
Detailed Description
As noted in the background, in big data jobs it is often difficult to detect and handle a job problem in time when one occurs. For example, when a problem with a software compiler causes an error in application software, locating the problem is very time-consuming.
How to quickly acquire the running state of a job, find problems in time, and locate anomalies quickly and accurately is therefore a problem to be solved urgently in this field.
After research, the inventors found that big data jobs mainly involve data acquisition, data preprocessing, and data storage. Specifically:
Data acquisition mainly covers database acquisition and file acquisition. In general, scripts must be written and corresponding tools called to extract and convert data and transfer it to a designated big data storage system. This process is usually scheduled and executed by a scheduling system, and during scheduling a job may fail because of data-rule problems, network connection problems, and the like. Error information about data acquisition is usually saved in the log file of the node where the script runs; to inspect exception information, one must log in to that node and use system commands to view the log information in the corresponding file.
Data preprocessing mainly covers data cleaning, data conversion, data integration, and data reduction. This stage is mostly handled by a computing engine: during job execution the data must be parsed, filtered, converted, classified, and so on, and a job may fail because of insufficient node resource allocation, disordered data sources, and similar issues. Such an error may affect a single record or many records, or even cause the whole job to fail. Exception information about data preprocessing is typically stored in the log directory of the compute engine cluster, and viewing the log information for a job requires the command and the job ID provided by the compute engine.
Data storage mainly encapsulates the storage-interface capability of the underlying layer so as to store and update data; during operation, a job may fail because of the data volume, the data type, the node network state, and other causes. Exception information about data storage is usually stored in the log directory of the storage engine cluster. Because storage engines of different types suit different business scenarios, their log directories also differ, so troubleshooting requires logging in to the storage cluster and using system commands to view the log information in the corresponding file.
The above analysis shows that log information is key information about a job: it reveals whether the job's running logic and job state meet expectations.
In the prior art, the program that schedules a job can provide a way to record and configure log information, and even a log display interface, which makes it somewhat easier to view job running information.
In practice, however, the log recording and configuration methods provided by the job-scheduling programs are not uniform, log context cannot be effectively associated across the scheduling links of the same job, and the recorded log information is interleaved and fragmented. Maintainers therefore often have to sift repeatedly through the log information of multiple job scheduling nodes, which makes troubleshooting job problems cumbersome and time-consuming and adds to the cost of system maintenance.
For example, in the prior art, when software errors occur, the conventional approach is to locate the problem manually with a debugger, but locating compiler problems this way is inefficient.
To this end, the invention configures monitoring parameters for each type of job in advance and configures analysis parameters for each analysis item in advance; collects job running data about a job according to the monitoring parameters while the job is running; performs ETL processing on the job running data to obtain data to be evaluated about the job; and analyzes the data to be evaluated according to the analysis parameters to obtain a score of the job corresponding to the analyzed data. In this way, unified log rules, log management configuration and a unified log format are established for every big data job, so that the running state of a job can be acquired quickly, problems can be found in time, and anomalies can be located quickly and accurately.
In order that those skilled in the art will better understand and realize the present invention, the following detailed description is given by way of specific embodiments with reference to the accompanying drawings.
Example one
As described below, embodiments of the present invention provide a method for state monitoring and health degree evaluation of big data jobs.
Referring to the flow chart of the method for state monitoring and health degree evaluation of big data jobs shown in FIG. 1, the method is described in detail through the following specific steps:
s101, monitoring parameters are configured for various types of jobs in advance, and analysis parameters are configured for various analysis items in advance.
Further, based on pre-configured information such as the configuration of the designated big data cluster, the job acquisition configuration, the extraction configuration, and the calculation-model configuration, a data acquisition module, an extraction service module, and a determination service module are deployed on the servers of the big data cluster, with the servers serving as the running environment.
S102, when a job is started, job information of the job is obtained, and the monitoring parameters corresponding to the job type of the job are obtained according to the job information.
With respect to the job information, in some embodiments it may include basic information and supplementary information of the job.
The supplementary information is determined by the job type and the business needs of the job.
With respect to the monitoring parameters, in some embodiments they may include the status reporting frequency, fault tolerance mechanism, and alarm mechanism of the job type.
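For illustration only, the monitoring parameters of one job type might be held in a structure like the following sketch; the job type name, field names, and values are assumptions made for this example and are not part of this disclosure:

    # Hypothetical monitoring parameters for one job type (illustrative only).
    MONITORING_PARAMS = {
        "etl_batch": {
            "report_interval_s": 30,   # status reporting frequency
            "max_retries": 3,          # fault tolerance: retries before the job is marked failed
            "retry_backoff_s": 60,     # fault tolerance: wait between retries
            "alert": {                 # alarm mechanism
                "channel": "email",
                "failure_threshold": 2,
            },
        },
    }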
In some embodiments, after the monitoring parameters corresponding to the job type are obtained, the method may further include: creating an information acquisition thread for the job according to the acquired job information and the monitoring parameters corresponding to the job type.
Further, in some embodiments, the method may also generate an identifier for each job during its run according to the job information and the monitoring/analysis parameters; different jobs have different identifiers, and the identifier is applied at the start position and the end position of the job scheduling node so as to associate the job with its scheduling environment.
S103, job running data about the job is collected according to the monitoring parameters while the job is running.
An acquisition thread is created based on the acquired configuration information, the running data of the corresponding job is collected according to the configured job information, and the collected job running data is submitted periodically to the acquisition server according to the monitoring parameters.
Specifically, in some embodiments, collecting job running data about the job according to the monitoring parameters while the job is running may include: starting a data acquisition module according to preset start-up parameters and the monitoring parameters; the data acquisition module periodically checks whether any job is running, and when a running job is found, it obtains the monitoring parameters corresponding to that job's type and collects job running data about the job according to those monitoring parameters.
When a job is started, it is first initialized; during initialization, the job information and the monitoring parameters configured for the current job type are fetched first, so that the latest version of the configuration is obtained.
Likewise, when the data acquisition module is started, it first performs an initialization operation and launches the start-up module on each node based on the preset start-up parameters and monitoring parameters. Once started, the module periodically monitors the job running state; if a running big data job is found, the corresponding data is collected according to the monitoring parameters and reported synchronously to the preprocessing module.
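A minimal Python sketch of such a polling collector is given below; list_running_jobs, collect_metrics, and report are hypothetical stand-ins for the cluster-specific calls and are not part of this disclosure:

    # Polling collector sketch (illustrative only).
    import threading

    def list_running_jobs():
        # Placeholder: in practice this would query the scheduler or compute engine.
        return [{"job_id": "job-1", "job_type": "etl_batch"}]

    def collect_metrics(job):
        # Placeholder: gather runtime data (log excerpts, counters) for one job.
        return {"job_id": job["job_id"], "cpu_pct": 42.0, "rows_processed": 10000}

    def report(record):
        # Placeholder: push the record to the preprocessing / extraction service.
        print(record)

    def run_collector(stop_event, poll_interval_s=30):
        # Periodically check for running jobs, collect their data, and report it.
        while not stop_event.is_set():
            for job in list_running_jobs():
                report(collect_metrics(job))
            stop_event.wait(poll_interval_s)

    stop = threading.Event()
    threading.Thread(target=run_collector, args=(stop,), daemon=True).start()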
Furthermore, because current big data clusters typically run in a distributed mode, a job is usually executed by multiple service nodes. The module therefore generates a unique key for each job based on the index-model configuration and the job-related information of the service node, collects the relevant index data while the job runs, maps and associates the job's key with its index data, and synchronizes the data to the extraction module. This solves the problem that multiple sets of index data exist when a job is split across several service nodes: the index data of each job can still be retrieved even when several service nodes are involved.
The "key" can be understood as follows: in a distributed system, the final result is assembled from the partial results of many jobs computed on many nodes, so it is necessary to know which job a partial result belongs to. The role of the "key" is to map the results processed on multiple nodes back onto one job, which makes the distributed calculation possible. The "key" is then reported as part of the data.
With respect to this "key", in some embodiments a job's key may be formed from the identifier of the job, the ID of the server node associated with the job, and the serial number of the job run.
Further, on this basis, the key may also incorporate information such as the ID of the machine room associated with the job, the ID of the rack associated with the job, and the time at which the job program is executed.
For example, the key may be:
"key" = JobName_RoomId_RackId_NodeId_Time_No
where JobName denotes the identifier of the job, RoomId the ID of the machine room associated with the job, RackId the ID of the rack associated with the job, NodeId the ID of the server node associated with the job, Time the time at which the job program is executed, and No the serial number of the job run.
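Under this key format, generating a key might look like the following sketch; the helper name and the timestamp format are assumptions made for this example:

    # Key generation following the JobName_RoomId_RackId_NodeId_Time_No pattern (illustrative only).
    from datetime import datetime

    def make_job_key(job_name, room_id, rack_id, node_id, run_no, now=None):
        # Compose a unique key so that index data reported by several service
        # nodes can be mapped back to the same job run.
        ts = (now or datetime.now()).strftime("%Y%m%d%H%M%S")  # assumed time format
        return f"{job_name}_{room_id}_{rack_id}_{node_id}_{ts}_{run_no}"

    # e.g. make_job_key("daily_etl", "room01", "rack07", "node23", 3)
    # -> "daily_etl_room01_rack07_node23_20210330120000_3"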
S104, ETL processing is performed on the job running data to obtain data to be evaluated about the job.
Specifically, in some embodiments, performing ETL processing on the job running data to obtain data to be evaluated about the job may include: acquiring the parameters of the relevant index model and performing ETL processing on the job running data according to those parameters.
After the extraction service module receives the job running data reported by the data acquisition module, it loads the index-model data and performs ETL processing on the synchronized job running data based on that model, finally producing calculation data that the determination service module can use directly, and synchronizes it to the determination service module for job health scoring.
The extraction service module exists because changes to the metadata in the data-warehouse design cannot be consumed directly by downstream services, so the data has to be processed in layers. In some embodiments, the metadata is preprocessed; the preprocessing includes structuring and filtering the health data, and finally yields a data model (the data to be evaluated) that can be analyzed directly downstream. The specific preprocessing method is not the focus of the present invention and is not described further here.
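For illustration only, this layered preprocessing might reduce a raw running-data record to an evaluation-ready record as in the following sketch; the index-model structure and field names are assumptions made for this example:

    # Preprocessing raw running data into evaluation-ready records (illustrative only).
    INDEX_MODEL = {
        "fields": ["cpu_pct", "rows_processed", "error_count"],
        "required": ["job_key"],
    }

    def to_evaluation_record(raw):
        # Keep only the fields named in the index model; reject incomplete records.
        if any(raw.get(k) is None for k in INDEX_MODEL["required"]):
            return None
        record = {"job_key": raw["job_key"]}
        record.update({field: raw.get(field, 0) for field in INDEX_MODEL["fields"]})
        return record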
S105, the data to be evaluated are analyzed according to the analysis parameters to obtain a score of the job corresponding to the analyzed data to be evaluated.
After the determination service module receives the ETL-processed data from the extraction service module, it loads the configuration data and performs the analysis on that basis. The analysis service module supports multiple analysis modes; different analysis modes evaluate the health score of a given cluster node based on different index data, the corresponding weighting coefficients, and the analysis parameters. For ease of understanding, the health score may for example be:
Score = A1·W1·X1 + A2·W2·X2 + … + An·Wn·Xn
where An is the analysis parameter of a given analysis mode, Wn is the coefficient of the corresponding evaluation index parameter (which may be a weighting coefficient), and Xn is the value of the analysis parameter, judged by rules that differ between analysis modules; in this way the health condition of the cluster nodes can be calculated more accurately and comprehensively.
The determination service module then analyzes the health scores (Score) provided by all the analysis service modules, classifies the index data of the same job according to the provided "key", and finally determines the (health state) score of the job run.
The above description of the technical solution shows that in this embodiment, monitoring parameters are configured for each type of job in advance, and analysis parameters are configured for each analysis item in advance; job running data about a job is collected according to the monitoring parameters while the job is running; ETL processing is performed on the job running data to obtain data to be evaluated about the job; and the data to be evaluated are analyzed according to the analysis parameters to obtain a score of the job corresponding to the analyzed data. In this way, unified log rules, log management configuration and a unified log format are established for every big data job, so that the running state of a job can be acquired quickly, problems can be found in time, and anomalies can be located quickly and accurately.
Specifically, in some embodiments, the analysis may include a plurality of analysis items, each analysis item having its own index data, weighting coefficient, and analysis rule; after an item is analyzed, an itemized score of the analyzed node with respect to that analysis item is obtained.
Further, the total score of the analyzed node can be calculated from the itemized scores of the analyzed node on the analysis items, the weighting coefficients of the analysis items, and the analysis rules of the analysis items, using the following formula:
Score = A1·W1·X1 + A2·W2·X2 + … + An·Wn·Xn
where Score denotes the total score, An denotes the analysis rule parameter of the n-th analysis item, Wn denotes the weighting coefficient of the n-th analysis item, and Xn denotes the itemized score of the n-th analysis item.
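For illustration only, the weighted sum above can be computed as in the following sketch; the numbers in the usage comment are made up for the example:

    # Total score as the weighted sum over analysis items (illustrative only).
    def total_score(items):
        # items: one (analysis_rule_param, weighting_coefficient, itemized_score) tuple per analysis item.
        return sum(a * w * x for a, w, x in items)

    # e.g. three analysis items:
    # total_score([(1.0, 0.5, 90), (1.0, 0.3, 80), (0.8, 0.2, 100)])  # -> 85.0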
The above description shows that this embodiment further discloses a specific scoring rule in which a total score is derived from the itemized score of each analysis item, so that a unified running-state evaluation mechanism is established for all big data jobs.
In some embodiments, the job running data and the data to be evaluated of a node may be bound to the node identifier of that node and transmitted together, and each node's identifier uses a secret code.
Example two
As described below, embodiments of the present invention provide a device for state monitoring and health degree evaluation of big data jobs.
The device for state monitoring and health degree evaluation of big data jobs includes:
a processor adapted to load and execute the instructions of a software program;
a memory adapted to store a software program comprising instructions for performing the following steps:
configuring monitoring parameters for each type of job in advance, and configuring analysis parameters for each analysis item in advance;
when a job is started, acquiring job information of the job, and acquiring the monitoring parameters corresponding to the job type of the job according to the job information;
collecting job running data about the job according to the monitoring parameters while the job is running;
performing ETL processing on the job running data to obtain data to be evaluated about the job;
and analyzing the data to be evaluated according to the analysis parameters to obtain a score of the job corresponding to the analyzed data to be evaluated.
In some embodiments, as shown in FIG. 2, the device for state monitoring and health degree evaluation of big data jobs may further include a configuration module, a data acquisition module, an extraction service module, and an analysis service module, wherein:
the configuration module is adapted to configure monitoring parameters for each type of job in advance and to configure analysis parameters for each analysis item in advance;
the data acquisition module is adapted to collect job running data about a job according to the monitoring parameters while the job is running;
the extraction service module is adapted to perform ETL processing on the job running data to obtain data to be evaluated about the job;
the analysis service module is adapted to analyze the data to be evaluated according to the analysis parameters, so as to obtain a score of the job corresponding to the analyzed data to be evaluated.
The above description of the technical solution shows that in this embodiment, monitoring parameters are configured for each type of job in advance, and analysis parameters are configured for each analysis item in advance; job running data about a job is collected according to the monitoring parameters while the job is running; ETL processing is performed on the job running data to obtain data to be evaluated about the job; and the data to be evaluated are analyzed according to the analysis parameters to obtain a score of the job corresponding to the analyzed data. In this way, unified log rules, log management configuration and a unified log format are established for every big data job, so that the running state of a job can be acquired quickly, problems can be found in time, and anomalies can be located quickly and accurately.
Example three
As described below, embodiments of the present invention provide a server.
Unlike the prior art, the server includes the device for state monitoring and health degree evaluation of big data jobs provided in the embodiments of the present invention. The server can therefore configure monitoring parameters for each type of job in advance and analysis parameters for each analysis item in advance; collect job running data about a job according to the monitoring parameters while the job is running; perform ETL processing on the job running data to obtain data to be evaluated about the job; and analyze the data to be evaluated according to the analysis parameters to obtain a score of the job corresponding to the analyzed data. In this way, unified log rules, log management configuration and a unified log format are established for every big data job, so that the running state of a job can be acquired quickly, problems can be found in time, and anomalies can be located quickly and accurately.
Those skilled in the art will understand that all or part of the steps in the methods of the above embodiments can be carried out by hardware under the control of program instructions, and that the program can be stored in a computer-readable storage medium, which may include ROM, RAM, magnetic disks, optical disks, and the like.
Although the present invention is disclosed above, the present invention is not limited thereto. Various changes and modifications may be effected therein by one skilled in the art without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (13)

1. A method for state monitoring and health degree evaluation of big data jobs, characterized by comprising the following steps:
configuring monitoring parameters for each type of job in advance, and configuring analysis parameters for each analysis item in advance;
when a job is started, acquiring job information of the job, and acquiring the monitoring parameters corresponding to the job type of the job according to the job information;
collecting job running data about the job according to the monitoring parameters while the job is running;
performing ETL processing on the job running data to obtain data to be evaluated about the job;
and analyzing the data to be evaluated according to the analysis parameters to obtain a score of the job corresponding to the analyzed data to be evaluated.
2. The method for state monitoring and health degree evaluation of big data jobs according to claim 1, wherein the method further comprises: generating an identifier for each job during its run according to the job information and the monitoring/analysis parameters, wherein different jobs have different identifiers, and the identifier is applied at the start position and the end position of the job scheduling node so as to associate the job with its scheduling environment.
3. The method for state monitoring and health degree evaluation of big data jobs according to claim 1, wherein the job information comprises: basic information and supplementary information of the job, wherein the supplementary information is determined by the job type and the business needs of the job.
4. The method for state monitoring and health degree evaluation of big data jobs according to claim 1, wherein the monitoring parameters comprise: the status reporting frequency, fault tolerance mechanism, and alarm mechanism of the job type.
5. The method for state monitoring and health degree evaluation of big data jobs according to claim 1, wherein after the monitoring parameters corresponding to the job type of the job are obtained, the method further comprises: creating an information acquisition thread for the job according to the acquired job information and the monitoring parameters corresponding to the job type.
6. The method for state monitoring and health degree evaluation of big data jobs according to claim 1, wherein collecting job running data about the job according to the monitoring parameters while the job is running comprises: starting a data acquisition module according to preset start-up parameters and the monitoring parameters, the data acquisition module periodically checking whether any job is running and, when a running job is found, obtaining the monitoring parameters corresponding to that job's type and collecting job running data about the job according to those monitoring parameters.
7. The method for state monitoring and health degree evaluation of big data jobs according to claim 1, wherein performing ETL processing on the job running data to obtain data to be evaluated about the job comprises: acquiring the parameters of the relevant index model and performing ETL processing on the job running data according to those parameters.
8. The method for state monitoring and health degree evaluation of big data jobs according to claim 1, wherein the analysis comprises a plurality of analysis items, each analysis item having its own index data, weighting coefficient, and analysis rule, and wherein, after an item is analyzed, an itemized score of the analyzed node with respect to that analysis item is obtained.
9. The method for state monitoring and health degree evaluation of big data jobs according to claim 8, wherein the total score of the analyzed node is calculated from the itemized scores of the analyzed node on the analysis items, the weighting coefficients of the analysis items, and the analysis rules of the analysis items, using the following formula:
Score = A1·W1·X1 + A2·W2·X2 + … + An·Wn·Xn
where Score denotes the total score, An denotes the analysis rule parameter of the n-th analysis item, Wn denotes the weighting coefficient of the n-th analysis item, and Xn denotes the itemized score of the n-th analysis item.
10. The method for state monitoring and health degree evaluation of big data jobs according to claim 1, wherein the job running data and the data to be evaluated of a node are bound to the node identifier of that node and transmitted together, and each node's identifier uses a secret code.
11. A device for state monitoring and health degree evaluation of big data jobs, characterized by comprising:
a processor adapted to load and execute the instructions of a software program;
a memory adapted to store a software program comprising instructions for performing the following steps:
configuring monitoring parameters for each type of job in advance, and configuring analysis parameters for each analysis item in advance;
when a job is started, acquiring job information of the job, and acquiring the monitoring parameters corresponding to the job type of the job according to the job information;
collecting job running data about the job according to the monitoring parameters while the job is running;
performing ETL processing on the job running data to obtain data to be evaluated about the job;
and analyzing the data to be evaluated according to the analysis parameters to obtain a score of the job corresponding to the analyzed data to be evaluated.
12. The device for state monitoring and health degree evaluation of big data jobs according to claim 11, further comprising: a configuration module, a data acquisition module, an extraction service module, and an analysis service module, wherein:
the configuration module is adapted to configure monitoring parameters for each type of job in advance and to configure analysis parameters for each analysis item in advance;
the data acquisition module is adapted to collect job running data about a job according to the monitoring parameters while the job is running;
the extraction service module is adapted to perform ETL processing on the job running data to obtain data to be evaluated about the job;
the analysis service module is adapted to analyze the data to be evaluated according to the analysis parameters, so as to obtain a score of the job corresponding to the analyzed data to be evaluated.
13. A server, characterized by comprising the device for state monitoring and health degree evaluation of big data jobs according to any one of claims 11 to 12.
CN202011644280.1A 2020-12-31 2020-12-31 State monitoring and health degree evaluation method and device for big data operation Withdrawn CN112579685A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011644280.1A CN112579685A (en) 2020-12-31 2020-12-31 State monitoring and health degree evaluation method and device for big data operation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011644280.1A CN112579685A (en) 2020-12-31 2020-12-31 State monitoring and health degree evaluation method and device for big data operation

Publications (1)

Publication Number Publication Date
CN112579685A true CN112579685A (en) 2021-03-30

Family

ID=75145539

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011644280.1A Withdrawn CN112579685A (en) 2020-12-31 2020-12-31 State monitoring and health degree evaluation method and device for big data operation

Country Status (1)

Country Link
CN (1) CN112579685A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113240396A (en) * 2021-05-20 2021-08-10 北京明略昭辉科技有限公司 Method, device and equipment for analyzing working state of employee and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106411609A (en) * 2016-11-08 2017-02-15 上海新炬网络信息技术有限公司 IT software and hardware running state monitoring system
CN110378564A (en) * 2019-06-18 2019-10-25 中国平安财产保险股份有限公司 Monitoring model generation method, device, terminal device and storage medium
CN111274087A (en) * 2020-01-15 2020-06-12 国网湖南省电力有限公司 Health degree evaluation method of IT centralized monitoring business system
CN111552607A (en) * 2020-03-27 2020-08-18 深圳壹账通智能科技有限公司 Health evaluation method, device and equipment of application program and storage medium


Similar Documents

Publication Publication Date Title
Mayer et al. An approach to extract the architecture of microservice-based software systems
US11968264B2 (en) Systems and methods for operation management and monitoring of bots
US8667334B2 (en) Problem isolation in a virtual environment
CN108521339B (en) Feedback type node fault processing method and system based on cluster log
Ghoshal et al. Provenance from log files: a BigData problem
US8661125B2 (en) System comprising probe runner, monitor, and responder with associated databases for multi-level monitoring of a cloud service
US20070282470A1 (en) Method and system for capturing and reusing intellectual capital in IT management
CN102143008A (en) Method and device for diagnosing fault event in data center
CN111124806B (en) Method and system for monitoring equipment state in real time based on distributed scheduling task
CN111563130A (en) Data credible data management method and system based on block chain technology
CN111858251B (en) Data security audit method and system based on big data computing technology
CN111552556A (en) GPU cluster service management system and method
US7162390B2 (en) Framework for collecting, storing, and analyzing system metrics
CN116009428A (en) Industrial data monitoring system and method based on stream computing engine and medium
CN112579685A (en) State monitoring and health degree evaluation method and device for big data operation
KR101830936B1 (en) Performance Improving System Based Web for Database and Application
CN117422434A (en) Wisdom fortune dimension dispatch platform
CN112579552A (en) Log storage and calling method, device and system
EP4174596A1 (en) System and method for collecting mes information
Iuhasz et al. Monitoring of exascale data processing
CN112181759A (en) Method for monitoring micro-service performance and diagnosing abnormity
US20060167923A1 (en) Method and a system for process discovery
CN112667469A (en) Method, system and readable medium for automatically generating diversified big data statistical report
US20230055902A1 (en) Method for representing a distributed computing system by graph embedding
Ren et al. Anomaly analysis and diagnosis for co-located datacenter workloads in the alibaba cluster

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication (application publication date: 20210330)