CN108255661A - Method and system for implementing Hadoop cluster monitoring - Google Patents

Method and system for implementing Hadoop cluster monitoring

Info

Publication number
CN108255661A
Authority
CN
China
Prior art keywords
data
monitoring
hadoop
alarm
cluster
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201611242909.3A
Other languages
Chinese (zh)
Inventor
李冬峰
刘荣明
孙明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd and Beijing Jingdong Shangke Information Technology Co Ltd
Priority to CN201611242909.3A
Publication of CN108255661A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3006 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/34 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3457 Performance evaluation by simulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2201/00 Indexing scheme relating to error detection, to error correction, and to monitoring
    • G06F2201/865 Monitoring of software
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2201/00 Indexing scheme relating to error detection, to error correction, and to monitoring
    • G06F2201/875 Monitoring of systems including the internet

Abstract

The present invention relates to a method for implementing Hadoop cluster monitoring. The method includes: setting monitoring rules; obtaining operation data of the Hadoop cluster; determining, according to the monitoring rules and the operation data, whether to raise an alarm; and setting trigger conditions for automated operation and maintenance (O&M) commands, generating monitoring information after the alarm determination, and selecting and running an automated O&M command according to the monitoring information and the trigger conditions. By extending the monitoring rules, the method allows the hardware, software, and data storage of a Hadoop cluster to be monitored at the same time, which solves the problem that conventional monitoring systems cannot monitor a Hadoop distributed system. The invention also relates to a system for implementing Hadoop cluster monitoring.

Description

Method and system for implementing Hadoop cluster monitoring
Technical field
The present invention relates to the technical field of computer networks, and in particular to a method and system for implementing Hadoop cluster monitoring.
Background
The big data era has quietly arrived, bringing with it the work of computing, transferring, storing, and distributing massive amounts of data. The traditional server model cannot cope, and the era of distributed system clusters has begun. Hadoop, as the first choice for distributed systems, is now in wide use across the industry. However, there is not yet a Hadoop monitoring system oriented to the big data business model that covers the collection, storage, computation, and application of Hadoop system data and ultimately achieves automated O&M of cluster management.
A traditional monitoring system can only monitor a single target system on a computer. By target system, such systems fall roughly into three kinds: O&M monitoring systems that monitor computer hardware, for example the operation of hard disks, memory, and CPUs; software monitoring systems that monitor how computer software is running, for example the execution of scheduled tasks; and data monitoring systems that monitor the condition of data, for example data quality and fluctuation. In the big data era, however, an enterprise needs an integrated monitoring system that covers the software, hardware, data, and operating condition of a Hadoop cluster together, and such traditional monitoring systems are clearly not up to the task.
Summary of the invention
In view of this, the present invention provides a method and system for implementing Hadoop cluster monitoring. By extending the monitoring rules, the hardware, software, and data storage of a Hadoop cluster can be monitored at the same time, which solves the problem that conventional monitoring systems cannot monitor a Hadoop distributed system.
To achieve the above object, according to one aspect of the present invention, a method for implementing Hadoop cluster monitoring is provided.
The method of the present invention includes:
setting monitoring rules; obtaining operation data of the Hadoop cluster; determining, according to the monitoring rules and the operation data, whether to raise an alarm; and setting trigger conditions for automated O&M commands, generating monitoring information after the alarm determination, and selecting and running an automated O&M command according to the monitoring information and the trigger conditions.
Optionally, the monitoring rules include a hardware monitoring rule, a software monitoring rule, or a data monitoring rule, and the operation data includes hardware operation data, software operation data, or data operation data.
Optionally, the method further includes: after the monitoring rules are set, performing a stress test on the Hadoop cluster, wherein the stress test includes: inputting one or more sets of stress test data; obtaining stress test operation data of the Hadoop cluster; obtaining load-bearing data of the Hadoop cluster according to the monitoring rules and the stress test operation data; and generating a performance bottleneck value of the Hadoop cluster according to the load-bearing data.
Optionally, automated O&M commands are set according to the performance bottleneck value. The automated O&M commands include but are not limited to: restarting the current server, stopping the sending of tasks to the server, activating a backup cluster, rerunning the current task, killing the threads currently running on the Hadoop cluster to release cluster memory, or sending operation information to the system administrator.
Optionally, the hardware monitoring rule is: a hardware threshold condition and a count condition are set, so that an alarm is determined if the number of consecutive times the hardware operation data reaches the hardware threshold condition is not less than the count condition.
The software monitoring rule is: an effective time and a time interval are set, so that an alarm is determined if, within the effective time, the gap between consecutively received software operation data reaches the time interval. The software monitoring rule further includes: a deadline is set, so that an alarm is determined if the software operation data has not been obtained before the deadline.
The data monitoring rule is: a comparison value type, a comparison value range, and a comparison relation are set; the comparison value range and comparison value type of the data operation data are determined; and an alarm is determined if the data operation data satisfies the comparison relation.
According to another aspect of the present invention, a system for implementing Hadoop cluster monitoring is provided.
The system of the present invention includes: a setup module for setting monitoring rules; an acquisition module for obtaining operation data of the Hadoop cluster; and a determining module for determining, according to the monitoring rules and the operation data, whether to raise an alarm. The system further includes an automated O&M module for setting trigger conditions for automated O&M commands, generating monitoring information after the determining module makes the alarm determination, and selecting and running an automated O&M command according to the monitoring information and the trigger conditions.
Optionally, the monitoring rules include a hardware monitoring rule, a software monitoring rule, or a data monitoring rule, and the operation data includes hardware operation data, software operation data, or data operation data.
Optionally, the system further includes a stress test module for performing a stress test on the Hadoop cluster after the monitoring rules are set. The stress test module is also used for: inputting one or more sets of stress test data; obtaining stress test operation data of the Hadoop cluster; obtaining load-bearing data of the Hadoop cluster according to the monitoring rules and the stress test operation data; and generating a performance bottleneck value of the Hadoop cluster according to the load-bearing data.
Optionally, the stress test module is also used for setting automated O&M commands according to the performance bottleneck value. The automated O&M commands include but are not limited to: restarting the current server, stopping the sending of tasks to the server, activating a backup cluster, rerunning the current task, killing the threads currently running on the Hadoop cluster to release cluster memory, or sending operation information to the system administrator.
Optionally, the hardware monitoring rule is: the setup module sets a hardware threshold condition and a count condition, so that the determining module determines an alarm if the number of consecutive times the hardware operation data reaches the hardware threshold condition is not less than the count condition.
The software monitoring rule is: the setup module sets an effective time and a time interval, so that the determining module determines an alarm if, within the effective time, the gap between software operation data consecutively received by the acquisition module reaches the time interval. The software monitoring rule further includes: the setup module sets a deadline, so that the determining module determines an alarm if the acquisition module has not obtained the software operation data before the deadline.
The data monitoring rule is: the setup module sets a comparison value type, a comparison value range, and a comparison relation; the determining module determines the comparison value range and comparison value type of the data operation data; and the determining module determines an alarm if the data operation data satisfies the comparison relation.
According to yet another aspect of the present invention, a device for implementing Hadoop cluster monitoring is provided.
The device of the present invention includes a memory and a processor, wherein the memory is used to store instructions, and the processor performs the method of any of the above according to the instructions.
According to the technical solution of the present invention, by setting monitoring rules and obtaining the operation data of the Hadoop cluster, the various operating conditions of the cluster's hardware, software, and data storage are determined, and the overall state of the Hadoop cluster is monitored, including its operation, load, name node (NameNode), and data nodes (DataNode). In addition, through stress testing and the setting of trigger conditions for automated O&M commands, the invention collects Hadoop cluster operation data, analyzes performance bottlenecks, and ultimately automates part of the maintenance work.
Description of the drawings
The drawings are provided for a better understanding of the present invention and do not constitute an undue limitation on it. In the drawings:
Fig. 1 is a schematic diagram of the main steps of a method for implementing Hadoop cluster monitoring according to an embodiment of the present invention;
Fig. 2 is a flowchart of the stress test in a method for implementing Hadoop cluster monitoring according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of the main modules of a system for implementing Hadoop cluster monitoring according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of a device for implementing Hadoop cluster monitoring according to an embodiment of the present invention.
Detailed description
Exemplary embodiments of the present invention are explained below with reference to the drawings. Various details of the embodiments are included to aid understanding and should be regarded as merely exemplary. Those of ordinary skill in the art should therefore recognize that various changes and modifications can be made to the embodiments described here without departing from the scope and spirit of the present invention. Likewise, for clarity and conciseness, descriptions of well-known functions and structures are omitted below.
Fig. 1 is a schematic diagram of the main steps of a method for implementing Hadoop cluster monitoring according to an embodiment of the present invention.
As shown in Fig. 1, the main steps of the method for implementing Hadoop cluster monitoring according to the embodiment include:
S1: Set monitoring rules. The monitoring rules include a hardware monitoring rule, a software monitoring rule, or a data monitoring rule.
S2: Obtain the operation data of the Hadoop cluster. The operation data includes hardware operation data, software operation data, or data operation data.
S3: Determine, according to the monitoring rules and the operation data, whether to raise an alarm.
S4: Generate monitoring information, then select and run an automated O&M command according to the monitoring information and the configured trigger conditions of the automated O&M commands.
In the method of the present invention, the above steps can be decomposed and/or recombined. The series of steps can naturally be performed in chronological order as described, but this is not required: some steps can be performed in parallel or independently of one another.
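The flow of steps S1 to S4 can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: the names `Rule`, `Trigger`, and `monitor_once`, and the representation of a sample as a dictionary of metric values, are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Sample = Dict[str, float]  # one snapshot of operation data (hypothetical shape)


@dataclass
class Rule:
    """S1: a monitoring rule, reduced here to a named predicate."""
    name: str
    should_alarm: Callable[[Sample], bool]


@dataclass
class Trigger:
    """A trigger condition that binds an alarm name to an O&M command."""
    rule_name: str
    command: Callable[[], str]


def monitor_once(rules: List[Rule], triggers: List[Trigger],
                 sample: Sample) -> Tuple[dict, List[str]]:
    # S2 is assumed to have happened already: `sample` is the operation data.
    # S3: decide which rules raise an alarm for this sample.
    alarms = [r.name for r in rules if r.should_alarm(sample)]
    # S4: build the monitoring information, then select and run the
    # commands whose trigger condition matches an active alarm.
    info = {"sample": sample, "alarms": alarms}
    executed = [t.command() for t in triggers if t.rule_name in info["alarms"]]
    return info, executed


rules = [Rule("cpu_high", lambda s: s["cpu_pct"] >= 90)]
triggers = [Trigger("cpu_high", lambda: "restart current server")]
info, executed = monitor_once(rules, triggers, {"cpu_pct": 95.0})
```

Here the command only returns a label. In a real deployment it would call whatever restart or scheduling mechanism the operator has in place.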
Further, the hardware monitoring rule set in S1 is: a hardware threshold condition and a count condition are set, so that an alarm is determined if the number of consecutive times the hardware operation data reaches the hardware threshold condition is not less than the count condition. A Hadoop cluster is formed by hundreds or thousands of servers working together to provide a unified external service, so hardware monitoring of a Hadoop cluster is essentially real-time monitoring of the operation, usage, and abnormal conditions of the CPUs, memory, hard disks, network cards, and other hardware of those servers. Because the operating pressure of a cluster made up of many servers is shared among them, what needs to be monitored is the overall operation of the Hadoop cluster while it runs. Through load balancing, a server cluster can allocate resources dynamically and share system pressure, so its operation shows linear fluctuation over a period of time. Under the extended hardware monitoring rule of the present invention, an alarm is sent only when the hardware threshold condition (warning threshold) has been reached several times in a row in the Hadoop cluster, which reflects the actual operating pattern of a server cluster. As a result, when a single server in the monitored Hadoop cluster stops working, the cluster can tolerate the fault gracefully and its normal operation is not affected. The level of an alarm can also be set to reflect the severity of the problem detected. For example, three levels can be defined: notice, warning, and serious.
The hardware threshold condition is a numeric value, which can take two forms: a fixed value or an interval. Hardware threshold conditions include but are not limited to: CPU usage, remaining hard disk space, and free memory. In the prior art, the hardware operation data is simply compared with this value. For example: when monitoring the CPU percentage, an alarm is determined when the current server CPU usage is greater than or equal to 90%; when monitoring the hard disk, an alarm is determined when the remaining hard disk space equals 30 GB; when monitoring memory, an alarm is determined when the remaining memory is less than or equal to 256 MB. The present embodiment uses a hardware monitoring rule that sets both a hardware threshold condition and a count condition, where the count condition is a number of occurrences, that is, the condition occurs N times within a certain period. For example: to monitor CPU fluctuation, an alarm is determined when server CPU usage is greater than or equal to 90% for 3 consecutive times; to monitor memory fluctuation, an alarm is determined when the remaining memory is less than or equal to 256 MB for 5 consecutive times; to monitor liveness, an alarm is determined when the current value equals 0 for 3 consecutive times.
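The consecutive-count rule described above can be sketched as a small stateful checker. This is an illustrative sketch under assumptions: the class name `ConsecutiveThresholdRule` and its fields are hypothetical, and only the fixed-value form of the threshold is shown, not the interval form.

```python
import operator
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ConsecutiveThresholdRule:
    """Alarm only after `times` consecutive samples reach the threshold."""
    compare: Callable[[float, float], bool]  # e.g. operator.ge or operator.le
    threshold: float                         # fixed-value threshold
    times: int                               # the count condition
    _streak: int = field(default=0, init=False)

    def observe(self, value: float) -> bool:
        # A sample that reaches the threshold extends the streak;
        # any other sample resets it, so isolated spikes never alarm.
        if self.compare(value, self.threshold):
            self._streak += 1
        else:
            self._streak = 0
        return self._streak >= self.times


# CPU usage >= 90% for 3 consecutive samples, as in the example above.
cpu_rule = ConsecutiveThresholdRule(operator.ge, 90.0, 3)
results = [cpu_rule.observe(v) for v in [95, 92, 50, 91, 93, 97]]
# results == [False, False, False, False, False, True]
```

The dip to 50% resets the streak, so the alarm fires only on the sixth sample, which is the behavior that lets a cluster ride out short load spikes without alerting.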
Further, the software monitoring rule is: an effective time and a time interval are set, so that an alarm is determined if, within the effective time, the gap between consecutively received software operation data reaches the time interval. The software monitoring rule further includes: a deadline is set, so that an alarm is determined if the software operation data has not been obtained before the deadline. No judgment is made outside the effective time range, although the sending of alarm messages is not subject to this time restriction. A default effective time can be set, for example 24 hours: if no other effective time is configured, the software operating condition is monitored and judged over 24 hours by default.
A Hadoop cluster system is essentially a software system (software platform) built on distributed hardware servers, so ensuring that it runs normally requires monitoring the operation of the Hadoop system itself. The Hadoop software system is a software architecture made up of many cooperating subsystems such as MapReduce, HDFS, Zookeeper, Hive, Hbase, Common, and Avro. Each subsystem is functionally relatively independent but depends on the others at the business level, so monitoring a Hadoop cluster system also means monitoring its subsystems. To prevent the failure of one subsystem from causing an exception in the entire Hadoop cluster system, the running time of each subsystem must be clearly divided so that other subsystems are not affected.
Take the scheduled task, the most common case in Hadoop cluster software monitoring, as an example: (1) To monitor whether a task is alive (that is, whether it is running), with the time interval set to 10 minutes, a check is made every 10 minutes for whether software operation data has arrived; if it has, the task is alive, otherwise an alarm is raised. (2) To monitor the size of a message queue, the time interval is set to 5 minutes, the deadline to 24:00:00, and the period type to daily; the number of tasks completed so far is then counted at 24:00 every day.
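The two software rules (interval within an effective window, and deadline) can be sketched as two small predicates. The function names `heartbeat_alarm` and `deadline_alarm`, their parameters, and the default full-day window are assumptions chosen for illustration. Since Python's `datetime` cannot express 24:00:00, a daily deadline is written as 00:00 of the following day.

```python
from datetime import datetime, time, timedelta


def heartbeat_alarm(last_seen: datetime, now: datetime, interval: timedelta,
                    window_start: time = time(0, 0, 0),
                    window_end: time = time(23, 59, 59)) -> bool:
    """Alarm if no software operation data arrived for `interval`,
    but only while `now` falls inside the effective time window."""
    if not (window_start <= now.time() <= window_end):
        return False  # outside the effective time: no judgment is made
    return now - last_seen >= interval


def deadline_alarm(received: bool, now: datetime, deadline: datetime) -> bool:
    """Alarm if the deadline has passed and no data was ever obtained."""
    return now >= deadline and not received


# Task liveness checked every 10 minutes, as in example (1).
last = datetime(2016, 6, 7, 1, 0)
late = heartbeat_alarm(last, last + timedelta(minutes=10), timedelta(minutes=10))
# Daily deadline of 24:00:00 on June 7, written as 00:00 on June 8.
missed = deadline_alarm(False, datetime(2016, 6, 8, 0, 0), datetime(2016, 6, 8, 0, 0))
```

Keeping the two checks separate mirrors the text: the interval rule watches for gaps between arrivals, while the deadline rule covers the case where nothing arrives at all.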
Further, the data monitoring rule is: a comparison value type, a comparison value range, and a comparison relation are set; the comparison value range and comparison value type of the data operation data are determined; and an alarm is determined if the data operation data satisfies the comparison relation.
Compared with the hardware and software monitoring rules, the rule for Hadoop data monitoring is more complex. Data monitoring covers not only the operation of the data warehouses and data marts running on Hadoop but also the fluctuation of specific fields in data tables. The most basic requirement of data monitoring is that, after multiple steps such as collection, computation, aggregation, and transfer, the data is stored safely in the database. Beyond that, multiple values of the data operation data, such as the current value, maximum, minimum, and average, need to be compared, and year-over-year and period-over-period comparisons need to be made across different data ranges. Because a single software monitoring rule cannot meet this need, monitoring rules with several additional dimensions are introduced: comparison value, comparison value range, and comparison relation. Comparison value types include but are not limited to: average, maximum, and minimum. The maximum (minimum) type takes the largest (smallest) value among the obtained data operation data as the object of comparison. The comparison value range is combined with the comparison value type to set the range over which the comparison value is computed. For example, to monitor the sale_ord_id field of table gdm_m04_ord_sum_test, the range is set to the last 7 unit periods, the comparison value type to average, and the collected value type to record row count; yesterday's row count for sale_ord_id is then compared with the average row count of the sale_ord_id field over the last seven days.
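The row-count example above can be sketched as follows. The text does not state which comparison relation triggers the alarm, so this sketch assumes one: a relative deviation from the comparison value that exceeds a tolerance. The function names, the `tolerance` parameter, and the sample numbers are all hypothetical.

```python
from statistics import mean
from typing import Sequence

# Comparison value types named in the text.
COMPARISON_TYPES = {"average": mean, "max": max, "min": min}


def comparison_value(history: Sequence[float], kind: str, window: int) -> float:
    """Compute the comparison value over the last `window` unit periods."""
    return COMPARISON_TYPES[kind](history[-window:])


def data_alarm(current: float, history: Sequence[float], kind: str = "average",
               window: int = 7, tolerance: float = 0.2) -> bool:
    """Assumed relation: alarm when `current` deviates from the comparison
    value by more than `tolerance` (as a fraction of that value)."""
    base = comparison_value(history, kind, window)
    if base == 0:
        return current != 0
    return abs(current - base) / base > tolerance


# Made-up row counts of sale_ord_id for the last seven days (average is 100).
last_seven_days = [100, 110, 90, 100, 105, 95, 100]
spike = data_alarm(130, last_seven_days)   # 30% above the average
steady = data_alarm(110, last_seven_days)  # 10% above the average
```

Other relations, such as "less than the minimum" or a year-over-year ratio, would slot into the same structure by replacing the final comparison.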
Fig. 2 is a flowchart of the stress test in a method for implementing Hadoop cluster monitoring according to an embodiment of the present invention.
The method for implementing Hadoop cluster monitoring according to the embodiment can also perform a stress test on the Hadoop cluster. To cope with massive-data trials such as the 11.11 e-commerce promotion, an enterprise needs to stress test its overall operating environment in advance. Data volumes of 5 times, 10 times, or even 50 times the normal operating data are injected step by step, and the rehearsal is used to locate and resolve the system's performance bottlenecks, so that the system runs normally when it faces the extreme test. During a stress test, however, monitoring the running state of the system in real time and pinpointing its bottlenecks has always been a problem for enterprises. The traditional approach is to split the overall system into several subsystems and test them one by one. This approach is time-consuming and laborious, and the information it feeds back is incomplete and imprecise.
As shown in Fig. 2, after the extended monitoring rules are set, one or more sets of stress test data can be input and the stress test operation data of the Hadoop cluster obtained; the load-bearing data of the Hadoop cluster is then obtained according to the monitoring rules and the stress test operation data. A performance bottleneck value of the Hadoop cluster can be generated from the load-bearing data. This gives a complete picture of how every aspect of the Hadoop cluster system behaves in an extreme environment, reveals the performance bottleneck value, and helps raise the limits of the whole Hadoop cluster. For example, a combination of values, counts, and time limits can be used to stress test a server cluster. Specifically, the monitoring rules of the server cluster are set as follows:
During 00:00-04:00 on June 7, 2016, alarm if the CPU usage of the cluster exceeds 90%;
During 00:00-04:00 on June 7, 2016, alarm if the hard disk utilization of the cluster exceeds 90%;
During 00:00-04:00 on June 7, 2016, alarm if the memory usage of the cluster exceeds 80%;
During 00:00-04:00 on June 7, 2016, alarm if the cluster task backlog exceeds 9000.
The stress test operation data of the cluster is then collected:
During 00:00-04:00, data volumes of 5 times, 10 times, or even 50 times the normal operating data are injected step by step, and the core data fed back by the monitoring system is observed in real time, as follows:
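The time-windowed stress test rules listed above can be written down as data and evaluated against a sample. This is a sketch under assumptions: the class `WindowedLimit`, the metric key names, and the `check` helper are illustrative and do not come from the patent.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List


@dataclass(frozen=True)
class WindowedLimit:
    """Alarm when `metric` exceeds `limit` between `start` and `end`."""
    metric: str
    limit: float
    start: datetime
    end: datetime

    def alarm(self, ts: datetime, value: float) -> bool:
        return self.start <= ts <= self.end and value > self.limit


START = datetime(2016, 6, 7, 0, 0)
END = datetime(2016, 6, 7, 4, 0)

# The rules from the list above; metric names are hypothetical.
STRESS_RULES = [
    WindowedLimit("cpu_usage_pct", 90, START, END),
    WindowedLimit("disk_usage_pct", 90, START, END),
    WindowedLimit("memory_usage_pct", 80, START, END),
    WindowedLimit("task_backlog", 9000, START, END),
]


def check(ts: datetime, sample: Dict[str, float]) -> List[str]:
    """Return the names of the metrics that alarm for this sample."""
    return [r.metric for r in STRESS_RULES
            if r.metric in sample and r.alarm(ts, sample[r.metric])]


alarms = check(datetime(2016, 6, 7, 2, 0),
               {"cpu_usage_pct": 95, "memory_usage_pct": 70, "task_backlog": 9500})
# alarms == ["cpu_usage_pct", "task_backlog"]
```

Expressing the rules as data rather than code makes it straightforward to widen the window or adjust limits between test runs.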
Cluster node liveness, including: number of dead DataNodes, number of live DataNodes, and number of dead NodeManagers;
Cluster resource usage, including: running-state statistics for cluster Applications, running-state statistics for cluster Containers, and cluster Memory usage;
Cluster HDFS disk usage, including: HDFS disk utilization, total number of disk files, and total number of blocks;
Cluster heap memory usage, including: NameNode heap memory usage;
RPC processing time, including: NameNode average processing time;
Cluster thread state, including: number of running NameNode threads, number of blocked NameNode threads, number of waiting NameNode threads, and number of timed-out NameNode threads.
According to the configured monitoring rules, the stress test operation data obtained from the Hadoop cluster is analyzed:
Analyzing cluster node liveness shows the overall running state of the cluster and reveals its weak points under pressure. A dead node ratio of 1%-2% is normal fluctuation; a ratio that reaches 5% is a peak state; a ratio above 5% means the cluster is overloaded;
Analyzing cluster resource, HDFS, and memory usage shows the linear fluctuation of the cluster. For example, memory usage that shows regular linear fluctuation is normal; if it instead shows rectangular fluctuation, memory is either not being released properly or is insufficient, and optimization is needed;
Analyzing cluster RPC processing time and thread state allows the operating indicators of the cluster to be analyzed more precisely. The shorter the RPC processing time and the fewer the blocked threads, the better the cluster performance; otherwise optimization is needed.
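The dead node analysis above can be sketched as a small classifier. The text leaves the band between 2% and 5% unspecified, so this sketch assumes that anything below 5% counts as normal. The function name and the returned labels are hypothetical. Integer arithmetic is used so that "reaches 5%" can be tested exactly.

```python
def classify_dead_nodes(dead: int, total: int) -> str:
    """Classify cluster state from the share of dead nodes.

    Assumption: the 2%-5% band, which the text does not cover,
    is treated as normal.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if dead * 100 < 5 * total:
        return "normal"    # below 5%, including the 1%-2% normal fluctuation
    if dead * 100 == 5 * total:
        return "peak"      # exactly 5%: the cluster is at its peak state
    return "overload"      # above 5%: the cluster is overloaded


states = [classify_dead_nodes(d, 100) for d in (1, 5, 6)]
# states == ["normal", "peak", "overload"]
```

In practice a small band around 5% would likely be more useful than an exact match. The exact comparison is kept here only because it follows the wording of the text.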
By defining operations that the monitoring system triggers automatically, users can achieve unattended, automated O&M.
Using the monitoring data collected in the stress test above, the following automated O&M commands are set:
When the CPU and memory usage of a server stay above 90%, trigger the automated O&M command (restart): restart the current server;
When the task backlog of a server exceeds 9000, trigger the automated O&M command (pause task): stop sending tasks to the server;
When the share of dead DataNodes in the cluster is greater than or equal to 5% and the share of dead NodeManagers is greater than or equal to 2%, trigger the automated O&M command (setUp New Hadoop): activate the backup cluster to share the pressure on the current cluster, and send the current operation to the system administrator by SMS;
When the average RPC processing time of the cluster is greater than or equal to 10 seconds, trigger the automated O&M command (rerun current task): rerun the current task, and send the current operation to the system administrator by SMS;
When the linear fluctuation of overall cluster memory stays high (that is, memory usage is above 90% at all times), trigger the automated O&M command (kill current project): kill the threads currently running on the cluster to release cluster memory, and send the current operation to the system administrator by SMS.
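The trigger list above maps naturally onto a table of conditions and command names. The sketch below only selects commands and does not execute anything. The metric key names are hypothetical, and the values passed in are assumed to be already-sustained readings, since how "stay above" is measured is not specified in the text.

```python
from typing import Callable, Dict, List, Tuple

Metrics = Dict[str, float]

# (command name from the text, trigger condition). Metric keys are assumptions.
TRIGGERS: List[Tuple[str, Callable[[Metrics], bool]]] = [
    ("restart",
     lambda m: m["server_cpu_pct"] > 90 and m["server_mem_pct"] > 90),
    ("pause task",
     lambda m: m["server_task_backlog"] > 9000),
    ("setUp New Hadoop",
     lambda m: m["dead_datanode_pct"] >= 5 and m["dead_nodemanager_pct"] >= 2),
    ("rerun current task",
     lambda m: m["rpc_avg_seconds"] >= 10),
    ("kill current project",
     lambda m: m["cluster_mem_pct"] > 90),
]


def select_commands(metrics: Metrics) -> List[str]:
    """Return the automated O&M commands whose trigger condition holds."""
    return [name for name, condition in TRIGGERS if condition(metrics)]


calm: Metrics = {
    "server_cpu_pct": 40, "server_mem_pct": 50, "server_task_backlog": 100,
    "dead_datanode_pct": 1, "dead_nodemanager_pct": 0,
    "rpc_avg_seconds": 0.5, "cluster_mem_pct": 60,
}
busy = dict(calm, server_task_backlog=9500, rpc_avg_seconds=12)
selected = select_commands(busy)
# selected == ["pause task", "rerun current task"]
```

Separating selection from execution keeps the risky actions, such as restarting a server or killing threads, behind whatever approval or notification step the operator chooses to add.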
Fig. 3 is a schematic diagram of the main modules of a system for implementing Hadoop cluster monitoring according to an embodiment of the present invention.
As shown in Fig. 3, the system for implementing Hadoop cluster monitoring according to the embodiment mainly includes: a setup module for setting monitoring rules; an acquisition module for obtaining operation data of the Hadoop cluster; and a determining module for determining, according to the monitoring rules and the operation data, whether to raise an alarm. The monitoring rules include a hardware monitoring rule, a software monitoring rule, or a data monitoring rule, and the operation data includes hardware operation data, software operation data, or data operation data.
The hardware monitoring rule is: the setup module sets a hardware threshold condition and a count condition, so that the determining module determines an alarm if the number of consecutive times the hardware operation data reaches the hardware threshold condition is not less than the count condition. Hardware threshold conditions include but are not limited to: CPU usage, remaining hard disk space, and free memory.
The software monitoring rule is: the setup module sets an effective time and a time interval, so that the determining module determines an alarm if, within the effective time, the gap between software operation data consecutively received by the acquisition module reaches the time interval. The rule further includes: the setup module sets a deadline, so that the determining module determines an alarm if the acquisition module has not obtained the software operation data before the deadline.
The data monitoring rule is: the setup module sets a comparison value type, a comparison value range, and a comparison relation; the determining module determines the comparison value range and comparison value type of the data operation data; and the determining module determines an alarm if the data operation data satisfies the comparison relation. Comparison value types include but are not limited to: average, maximum, and minimum.
A kind of system for realizing Hadoop cluster monitorings of the embodiment of the present invention further includes pressure test module, for setting After putting monitoring rules, pressure test is carried out to Hadoop clusters.Pressure test module is additionally operable to:Input more than one pressure Test data;Obtain the pressure test operation data of Hadoop clusters;It is run according to the pressure test of monitoring rules and acquisition Data obtain the pressure-bearing data of Hadoop clusters.Moreover, pressure test module can also generate Hadoop clusters according to pressure-bearing data Performance bottleneck value.Pressure test module is additionally operable to set automation O&M order, automation O&M life according to performance bottleneck value Order includes but is not limited to:Restart current server, stop sending task to server, activation backup cluster, be run again as predecessor Business kills the currently running thread of Hadoop clusters to discharge cluster memory or operation information is sent to system manager.
The system for monitoring a Hadoop cluster according to the embodiment of the present invention further includes an automated O&M module, configured to set trigger conditions for the automated O&M commands; further configured to generate monitoring information after the determining module determines whether to alarm; and further configured to select and run an automated O&M command according to the monitoring information and the trigger conditions. Based on the system bottleneck values obtained from stress testing the Hadoop cluster and on the ongoing accumulation of failure records, automated O&M commands and trigger conditions can be set in advance for the various failures that may occur. The ultimate goal of operations management can thus be achieved step by step: unattended, automated operations management.
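As a non-limiting illustration, selecting and running automated O&M commands from monitoring information could be sketched as follows. The class `OpsRule`, the function `run_automation`, and the metric keys used in the usage note are hypothetical; the commands here only return a description and do not touch a real cluster.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class OpsRule:
    """Pairs a trigger condition with an automated O&M command."""
    name: str
    trigger: Callable[[Dict[str, float]], bool]
    command: Callable[[], str]


def run_automation(monitoring_info: Dict[str, float], rules: List[OpsRule]) -> List[str]:
    """Run every command whose trigger condition matches; return the names run."""
    executed: List[str] = []
    for rule in rules:
        if rule.trigger(monitoring_info):
            rule.command()  # placeholder for e.g. restarting a server
            executed.append(rule.name)
    return executed
```

A rule set might pair "memory utilization above the bottleneck value" with killing running threads, and "three missed heartbeats" with restarting the server, for example `OpsRule("kill_threads", lambda info: info.get("memory_pct", 0) > 90, lambda: "kill threads")`.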
Fig. 4 is a schematic diagram of a device for monitoring a Hadoop cluster according to an embodiment of the present invention. As shown in Fig. 4, the device 4 for monitoring a Hadoop cluster according to the embodiment of the present invention includes a memory 41 and a processor 42. The memory 41 is configured to store instructions, and the processor 42 executes, according to the instructions, any of the methods for monitoring a Hadoop cluster described above.
The above specific embodiments do not limit the scope of protection of the present invention. Those skilled in the art should understand that, depending on design requirements and other factors, various modifications, combinations, sub-combinations, and substitutions may occur. Any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (11)

  1. A method for monitoring a Hadoop cluster, characterized by comprising:
    setting monitoring rules;
    obtaining operation data of the Hadoop cluster;
    determining, according to the monitoring rules and the operation data, whether to alarm; and
    setting trigger conditions for automated operations and maintenance (O&M) commands, generating monitoring information after determining whether to alarm, and selecting and running an automated O&M command according to the monitoring information and the trigger conditions.
  2. The method according to claim 1, characterized in that the monitoring rules comprise a hardware monitoring rule, a software monitoring rule, or a data monitoring rule, and the operation data comprise hardware operation data, software operation data, or data operation data.
  3. The method according to claim 1, characterized in that the method further comprises: after the monitoring rules are set, performing a stress test on the Hadoop cluster,
    wherein the stress test comprises the following steps:
    inputting more than one piece of stress test data;
    obtaining stress test operation data of the Hadoop cluster;
    obtaining load-bearing data of the Hadoop cluster according to the monitoring rules and the stress test operation data; and
    generating a performance bottleneck value of the Hadoop cluster according to the load-bearing data.
  4. The method according to claim 3, characterized in that automated O&M commands are set according to the performance bottleneck value, the automated O&M commands including, but not limited to: restarting the current server, stopping the dispatch of tasks to the server, activating a backup cluster, re-running the current task, killing threads currently running on the Hadoop cluster to release cluster memory, or sending operation information to the system administrator.
  5. The method according to claim 2, characterized in that:
    the hardware monitoring rule is: setting a hardware threshold condition and a count condition, so that if the number of consecutive times the hardware operation data reaches the hardware threshold condition is not less than the count condition, an alarm is determined;
    the software monitoring rule is: setting an effective time and a time interval, so that if, within the effective time, the time between two consecutive receipts of the software operation data reaches the time interval, an alarm is determined;
    the data monitoring rule is: setting a comparison value type, a comparison value range, and a comparison relationship, determining the comparison value range and the comparison value type for the data operation data, and if the data operation data satisfies the comparison relationship, determining an alarm.
  6. A system for monitoring a Hadoop cluster, characterized by comprising:
    a setting module, configured to set monitoring rules;
    an acquisition module, configured to obtain operation data of the Hadoop cluster;
    a determining module, configured to determine, according to the monitoring rules and the operation data, whether to alarm; and
    an automated O&M module, configured to set trigger conditions for automated O&M commands, to generate monitoring information after the determining module determines whether to alarm, and to select and run an automated O&M command according to the monitoring information and the trigger conditions.
  7. The system according to claim 6, characterized in that the monitoring rules comprise a hardware monitoring rule, a software monitoring rule, or a data monitoring rule, and the operation data comprise hardware operation data, software operation data, or data operation data.
  8. The system according to claim 6, characterized by further comprising a stress test module, configured to perform a stress test on the Hadoop cluster after the monitoring rules are set;
    and further configured to: input more than one piece of stress test data;
    obtain stress test operation data of the Hadoop cluster;
    obtain load-bearing data of the Hadoop cluster according to the monitoring rules and the stress test operation data; and generate a performance bottleneck value of the Hadoop cluster according to the load-bearing data.
  9. The system according to claim 8, characterized in that the stress test module is further configured to set automated O&M commands according to the performance bottleneck value, the automated O&M commands including, but not limited to: restarting the current server, stopping the dispatch of tasks to the server, activating a backup cluster, re-running the current task, killing threads currently running on the Hadoop cluster to release cluster memory, or sending operation information to the system administrator.
  10. The system according to claim 6, characterized in that:
    the hardware monitoring rule is: the setting module sets a hardware threshold condition and a count condition, so that if the number of consecutive times the hardware operation data reaches the hardware threshold condition is not less than the count condition, the determining module determines to alarm;
    the software monitoring rule is: the setting module sets an effective time and a time interval, so that if, within the effective time, the time between two consecutive receipts of the software operation data by the acquisition module reaches the time interval, the determining module determines to alarm;
    the data monitoring rule is: the setting module sets a comparison value type, a comparison value range, and a comparison relationship, the determining module determines the comparison value range and the comparison value type for the data operation data, and if the data operation data satisfies the comparison relationship, the determining module determines to alarm.
  11. A device for monitoring a Hadoop cluster, characterized by comprising a memory and a processor, wherein:
    the memory is configured to store instructions; and
    the processor executes the method according to any one of claims 1-5 according to the instructions.
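
As a non-limiting illustration of the hardware monitoring rule recited in claim 5, the consecutive-count check could be sketched as follows. The function name `hardware_rule_alarm`, its parameters, and the reading of "reaches the hardware threshold condition" as "greater than or equal to a numeric threshold" are assumptions made for this example.

```python
from typing import Iterable


def hardware_rule_alarm(samples: Iterable[float], threshold: float, min_consecutive: int) -> bool:
    """Return True if `threshold` is reached at least `min_consecutive` times in a row."""
    streak = 0
    for value in samples:
        if value >= threshold:
            streak += 1
            if streak >= min_consecutive:
                return True
        else:
            # A sample below the threshold breaks the run, so counting restarts.
            streak = 0
    return False
```

Requiring consecutive hits rather than a single one means a brief spike, for example one CPU utilization sample above 90%, would not by itself raise an alarm.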
CN201611242909.3A 2016-12-29 2016-12-29 A kind of method and system for realizing Hadoop cluster monitorings Pending CN108255661A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611242909.3A CN108255661A (en) 2016-12-29 2016-12-29 A kind of method and system for realizing Hadoop cluster monitorings

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611242909.3A CN108255661A (en) 2016-12-29 2016-12-29 A kind of method and system for realizing Hadoop cluster monitorings

Publications (1)

Publication Number Publication Date
CN108255661A true CN108255661A (en) 2018-07-06

Family

ID=62719840

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611242909.3A Pending CN108255661A (en) 2016-12-29 2016-12-29 A kind of method and system for realizing Hadoop cluster monitorings

Country Status (1)

Country Link
CN (1) CN108255661A (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101093462A (en) * 2006-06-22 2007-12-26 上海全成通信技术有限公司 Automatization method for testing schooling pressure on database application
US7434104B1 (en) * 2005-03-31 2008-10-07 Unisys Corporation Method and system for efficiently testing core functionality of clustered configurations
CN103207804A (en) * 2013-04-07 2013-07-17 杭州电子科技大学 MapReduce load simulation method based on cluster job logging
US20150006716A1 (en) * 2013-06-28 2015-01-01 Pepperdata, Inc. Systems, methods, and devices for dynamic resource monitoring and allocation in a cluster system
CN104461856A (en) * 2013-09-22 2015-03-25 阿里巴巴集团控股有限公司 Performance test method, device and system based on cloud computing platform
CN104866619A (en) * 2015-06-09 2015-08-26 北京京东尚科信息技术有限公司 Data monitoring method and system for data warehouse
CN105337765A (en) * 2015-10-10 2016-02-17 上海新炬网络信息技术有限公司 Distributed hadoop cluster fault automatic diagnosis and restoration system
CN105426290A (en) * 2015-11-18 2016-03-23 北京京东尚科信息技术有限公司 Intelligent abnormal information processing method and system
CN105718351A (en) * 2016-01-08 2016-06-29 北京汇商融通信息技术有限公司 Hadoop cluster-oriented distributed monitoring and management system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Yang Shengli (杨胜利): "Software Testing Technology" (《软件测试技术》), 31 August 2015 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109408325A (en) * 2018-09-29 2019-03-01 华为技术有限公司 The method and apparatus for carrying out alarm operation
CN109408325B (en) * 2018-09-29 2020-11-03 华为技术有限公司 Method and device for performing alarm operation
CN111694705A (en) * 2019-03-15 2020-09-22 北京沃东天骏信息技术有限公司 Monitoring method, device, equipment and computer readable storage medium
CN110971483A (en) * 2019-11-08 2020-04-07 苏宁云计算有限公司 Pressure testing method and device and computer system
CN110971483B (en) * 2019-11-08 2021-11-09 苏宁云计算有限公司 Pressure testing method and device and computer system
WO2022161100A1 (en) * 2021-01-29 2022-08-04 苏州浪潮智能科技有限公司 Edge computing server resetting method and device

Similar Documents

Publication Publication Date Title
CN110297711B (en) Batch data processing method, device, computer equipment and storage medium
CN108874640B (en) Cluster performance evaluation method and device
CN105095056B (en) A kind of method of data warehouse data monitoring
CN108255661A (en) A kind of method and system for realizing Hadoop cluster monitorings
US8826286B2 (en) Monitoring performance of workload scheduling systems based on plurality of test jobs
CN103246592B (en) A kind of monitoring acquisition system and method
US10116534B2 (en) Systems and methods for WebSphere MQ performance metrics analysis
US20200319935A1 (en) System and method for automatically scaling a cluster based on metrics being monitored
CN111539633A (en) Service data quality auditing method, system, device and storage medium
US10447565B2 (en) Mechanism for analyzing correlation during performance degradation of an application chain
CN106126403B (en) Oracle database failure analysis methods and device
KR20080044508A (en) System and method for management of performance fault using statistical analysis
CN101297536A (en) A method and system for preparing execution of systems management tasks on endpoints
Jassas et al. Failure analysis and characterization of scheduling jobs in google cluster trace
CN109344189A (en) Big data calculation method and device based on NiFi
US20110239050A1 (en) System and Method of Collecting and Reporting Exceptions Associated with Information Technology Services
CN108509313A (en) A kind of business monitoring method, platform and storage medium
Kim et al. Towards hpc i/o performance prediction through large-scale log analysis
CN111858251A (en) Big data computing technology-based data security audit method and system
CN111124830A (en) Monitoring method and device for micro-service
CN105069029B (en) A kind of real-time ETL system and method
CN113656245A (en) Data inspection method and device, storage medium and processor
GB2514584A (en) Methods and apparatus for monitoring conditions prevailing in a distributed system
Khan et al. Modeling the autoscaling operations in cloud with time series data
US11722558B2 (en) Server-side resource monitoring in a distributed data storage environment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180706