CN104156296A

CN104156296A - System and method for intelligently monitoring large-scale data center cluster computing nodes

Info

Publication number: CN104156296A
Application number: CN201410377856.0A
Authority: CN
Inventors: 刘羽; 吕文静; 金莲; 陈博文; 于涛
Original assignee: Inspur Beijing Electronic Information Industry Co Ltd
Current assignee: Inspur Beijing Electronic Information Industry Co Ltd
Priority date: 2014-08-01
Filing date: 2014-08-01
Publication date: 2014-11-19
Anticipated expiration: 2034-08-01
Also published as: CN104156296B

Abstract

The invention provides a system and a method for intelligently monitoring large-scale data center cluster computing nodes. Hardware microarchitecture data indexes of the computing nodes and data indexes related to processes of running applications are acquired through monitoring nodes in the system, the data indexes are transmitted to monitoring equipment in the system, big data analysis is executed via the monitoring equipment, and results are sent to customer premise equipment to be displayed to a user. By the system and the method, the microarchitecture data indexes of the computing nodes and the data indexes of the processes of the running applications are acquired, so that intelligent big data analysis is realized, the faulted computing nodes are positioned automatically, and fault causes are provided.

Description

System and method for intelligently monitoring the computing nodes of a large-scale data center cluster

Technical field

The present invention relates to the field of computer technology, and specifically to a system and method for intelligently monitoring the computing nodes of a large-scale data center cluster.

Background technology

With the continuous progress of human society and the development of science and technology, humanity's understanding of nature has grown ever broader, and the demand to explore new frontiers has become ever more urgent. As a result, the amount of information data that humanity holds has grown explosively, and this massive data must be analyzed and processed in a timely manner. For example, a large astronomical radio telescope array can produce more than 100 GB of cosmic microwave data per second, all of which must be analyzed promptly. As another example, in particle physics research, the data from a single LHC collision is also measured in terabytes. In addition, fields such as the Human Genome Project, petroleum exploration, and weather forecasting place ever higher demands on computing power. Against this background, numerical computation has long since become the third major means of scientific exploration, alongside experiment and theoretical analysis. For this reason, the world's leading science and technology powers are all sparing no effort to develop supercomputers. For example, in the TOP500 list issued in December 2013, China's first-ranked Tianhe-2 (TH-2) reached a peak speed of 54.9 PFlops and used more than 16,000 computing nodes in total.

In addition, with the development of new technologies such as cloud computing, big data, and the Internet of Things, more and more large-scale data centers and cloud computing centers have appeared, often with tens of thousands of computing nodes. For example, Google's data center in The Dalles, Oregon has approximately 150,000 server nodes. In data centers of this scale, performance monitoring of computing nodes, fault localization, fault recovery, and overall efficiency statistics for the center all face unprecedented challenges. How to manage and use large-scale and even ultra-large-scale data centers efficiently is therefore an active area of exploration worldwide.

For a long time, data center monitoring and management has been carried out in a manual or semi-automatic manner. Operation and maintenance (O&M) personnel must check the running status of the cluster in real time. When a problem occurs, they can sometimes locate the affected node but often cannot pinpoint the faulty device, and must rely on experience to diagnose and troubleshoot, which is time-consuming and laborious. Cluster users can learn the status of their own jobs through various job scheduling software, but can seldom obtain a historical analysis of those jobs. Moreover, cluster decision makers often cannot obtain decision-related information such as expenditure, utilization efficiency, staff efficiency, and cost-effectiveness directly from the cluster, and can only make decisions through laborious manual analysis of large amounts of data. In addition, application developers often cannot obtain from the cluster the information they urgently need to optimize application software, such as hardware micro-architecture data, system process and stack information, and module error and crash statistics, and must instead obtain it empirically through extensive experiments, which is both time-consuming and laborious.

Summary of the invention

The present invention proposes a system and method for intelligently monitoring the computing nodes of a large-scale data center cluster, characterized by large scale, multiple functions, and support for multiple user groups. It has comprehensive intelligent analysis and statistics functions and can provide a data reference for decision-making by users at different levels.

The system comprises monitor nodes arranged on the computing nodes of the data center cluster, a monitoring device that communicates with each monitor node, and a user terminal device, characterized in that:

The monitor node is configured to collect hardware micro-architecture data indicators of the computing node by obtaining control of the hardware control registers of the computing node, to obtain data indicators related to the processes of the applications running on the computing node by obtaining control of the operating system kernel, and to send the data indicators to the monitoring device;

The monitoring device is configured to receive the data indicators, perform big data analysis based on them, and send the result of the analysis to the user terminal device;

The user terminal device is configured to receive the result and display it to the user.

The method comprises:

Starting the monitor node arranged on the computing node;

The monitor node collects hardware micro-architecture data indicators of the computing node by obtaining control of the hardware control registers of the computing node, obtains data indicators related to the processes of the applications running on the computing node by obtaining control of the operating system kernel, and sends the data indicators to the monitoring device;

The monitoring device receives the data indicators, performs big data analysis based on them, and sends the result of the analysis to the user terminal device;

The user terminal device receives the result and displays it to the user.

In particular, the analysis comprises: locating the faulty computing node and determining the cause of the fault according to the data indicators.

In particular, the hardware micro-architecture data indicators comprise one or more of the following: the real-time floating-point operation rate of the CPU, Streaming SIMD Extensions (SSE) unit utilization, Advanced Vector Extensions (AVX) unit utilization, vector instruction vectorization ratio, cycles per instruction (CPI), last-level cache (LLC) hit rate, memory bandwidth, PCI Express (PCI-E) device bandwidth, and cache hit/miss rate; the data indicators related to the processes of the applications running on the computing node comprise one or more of the following: number of process switches, stack information, and heap memory allocation.

In particular, the data indicator is the real-time floating-point operation rate of the CPU and/or the cycles per instruction (CPI), and the analysis comprises: when the data indicator remains below a preset threshold throughout a preset time period, determining that the processor has failed and that the cause of the fault is abnormal frequency reduction of the processor.

In particular, the monitor node also collects the CPU utilization, memory usage, local disk I/O data, and/or Ethernet throughput provided by the operating system.

In particular, the hardware control register of the computing node is an MSR control register in the performance monitoring unit (PMU) of the processor of the computing node.

The beneficial effects of the invention are as follows:

A performance monitoring apparatus on each computing node extracts the necessary system-level performance indicators and transmits them to a monitoring management node, which is responsible for maintaining them. The monitoring management node can identify anomalies and raise alarms; it also mines the recorded historical data separately for each user group and feeds the results back to users. In addition, the monitoring management node can, on demand and for a given time period, extract information on hardware micro-architecture characteristics, processes, and stacks from specified monitor nodes. Monitoring of a large-scale cluster thereby becomes multi-user, multifunctional, and intelligent.

To keep monitoring data timely, the monitoring client on each computing node refreshes once per second. At the same time, to reduce resource usage on the computing node, each computing node extracts only the minimum set of indicators necessary for data analysis, comprising a dozen or so indicators such as CPU utilization, memory usage, local disk reads and writes, and Ethernet throughput.

To provide multiple functions, the intelligent monitoring system also offers monitoring and analysis of indicators related to the hardware micro-architecture, such as floating-point operation rate, vectorization ratio, memory bandwidth, and IB bandwidth. Because monitoring these indicators consumes comparatively more system resources, it is started on demand according to user instructions.

To support multiple user groups, the intelligent monitoring system provides a hierarchical view with four levels: a management layer, an operation and maintenance (O&M) layer, an application user layer, and an application development layer.

To provide intelligence, the intelligent monitoring system uses a data mining analysis method that computes, from the basic performance monitoring data, the statistical indicators of greatest interest to users at each level.

Brief description of the drawings

Fig. 1 is a block diagram of the system for intelligently monitoring a large-scale data center cluster proposed by the present invention.

Fig. 2 is a flowchart of the method for intelligently monitoring a large-scale data center cluster proposed by the present invention.

Embodiment

To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings.

Referring to Fig. 1, the system for intelligently monitoring the computing nodes of a large-scale data center cluster proposed by the present invention comprises monitor nodes arranged on the computing nodes of the data center cluster, a monitoring device connected to each monitor node, and a user terminal device. Each computing node of the data center cluster has the corresponding hardware, such as a processor (CPU), memory, hard disks, and an Ethernet controller, and runs an operating system and application software. The monitoring device comprises a main monitor node and a database. The main monitor node communicates with each monitor node arranged on the computing nodes and can obtain the hardware and software operating data of the computing nodes, for example CPU utilization, memory usage, local disk I/O data, and Ethernet throughput, as well as the hardware micro-architecture data indicators of each computing node and the process-level data indicators of the applications running on it. The main monitor node writes the obtained data into the database, automatically performs big data mining, and saves the mining results. The user reads the results from the database through the user terminal device and views them. Through the user terminal device, the user can also submit a user-defined data mining program to the monitoring device; the monitoring device then extracts the corresponding data indicators of the cluster nodes, performs big data mining according to the user-defined program, and displays the result to the user.

Referring to Fig. 2, the method for intelligently monitoring the computing nodes of a large-scale data center cluster proposed by the present invention consists of several main steps: data collection, big data mining, hierarchical display, and fault localization and alarm. Data collection comprises basic data collection and advanced data collection. Basic data collection is performed automatically by the system and requires no user configuration; advanced data collection is configured according to the user's intent.

1. Data collection

Data collection means installing a monitor node on each computing node of the data center cluster to extract the CPU utilization, memory usage, local disk I/O data, and Ethernet throughput of that computing node, as well as its hardware micro-architecture data indicators and the process-level data indicators of the applications running on it. Collection of the hardware micro-architecture data indicators and the process-level data indicators is called advanced data collection; collection of the remaining indicators is called basic data collection. Basic data collection is enabled by default and runs without user intervention, whereas advanced data collection is performed according to user settings. Because the performance indicator data must remain timely, the monitor node must be able to refresh its collected data at second-level intervals while keeping its resource usage on the computing node extremely low.
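As an illustration of basic data collection, the following minimal sketch computes CPU utilization from two successive samples of the aggregate `cpu` line of the Linux `/proc/stat` file. The choice of Linux and of `/proc/stat` is an assumption made for this example; the function names `parse_cpu_line` and `cpu_utilization` are hypothetical and do not come from the original text.

```python
def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into (idle, total) jiffies.

    Only the first eight fields (user, nice, system, idle, iowait, irq,
    softirq, steal) are summed, because guest time is already included in user.
    """
    fields = [int(x) for x in line.split()[1:9]]
    idle = fields[3] + (fields[4] if len(fields) > 4 else 0)  # idle + iowait
    return idle, sum(fields)


def cpu_utilization(prev_line, curr_line):
    """Return the fraction of non-idle CPU time between two samples (0.0 to 1.0)."""
    prev_idle, prev_total = parse_cpu_line(prev_line)
    curr_idle, curr_total = parse_cpu_line(curr_line)
    total_delta = curr_total - prev_total
    if total_delta <= 0:
        return 0.0  # no time elapsed between samples
    return 1.0 - (curr_idle - prev_idle) / total_delta


def read_cpu_line(path="/proc/stat"):
    """Read the aggregate 'cpu' line (assumes a Linux host)."""
    with open(path) as f:
        return f.readline()
```

A monitor node could call `read_cpu_line()` once per second and pass consecutive samples to `cpu_utilization`, which keeps the per-sample cost to a single small file read.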

The data collection method proposed by the present invention differs from prior-art methods. In the prior art, data collection only gathers indicator data that the operating system itself provides; that is, collection depends on the operating system running on the computing node, and the monitor node cannot obtain indicators that the operating system does not provide. The data collection method proposed by the present invention can collect not only the above indicators provided by the operating system but also hardware micro-architecture data indicators, for example the real-time floating-point operation rate of the CPU, Streaming SIMD Extensions (SSE) unit utilization, Advanced Vector Extensions (AVX) unit utilization, vector instruction vectorization ratio, cycles per instruction (CPI), last-level cache (LLC) hit rate, translation lookaside buffer (TLB) parameters, memory bandwidth, PCI Express (PCI-E) device bandwidth, and cache hit/miss rate. In addition, it can collect process-level data indicators of applications, such as the number of process switches, stack information, and heap memory allocation. These indicators are of great value for analyzing application software performance, analyzing cluster characteristics, and locating software-level faults.

Because hardware-level and process-level data indicators must be collected, the monitor node proposed by the present invention is implemented as a software client. The monitor node performs basic data collection using prior-art methods, which are not repeated here. The process of advanced data collection is described as follows:

The above hardware micro-architecture data indicators are extracted by controlling the relevant registers in the hardware. For example, processor micro-architecture data indicators are obtained mainly by controlling the performance monitoring unit (PMU) in the processor. This requires the monitor node of the present invention to have root privileges. The control flow for the PMU is as follows:

S1: Obtain control of the MSR (Model-Specific Register) control register in the PMU of the processor of the computing node;

S2: Write the event code and mask of the relevant event into the MSR control register that is now under control, and set this control register to start counting the event. For example, to collect the LLC hit rate indicator, first write the event code and mask for LLC hits into the MSR control register, then set the register to start counting LLC hits; after counting ends, read the count and compute the LLC hit rate.
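The value written in step S2 can be sketched as follows, assuming an Intel processor with architectural performance monitoring. The bit positions and the two event encodings follow the Intel architectural event-select layout; the function names are hypothetical, and deriving the hit rate from reference and miss counts (rather than counting hits directly, as the text describes) is an assumption of this sketch. The sketch only builds the register value and does not touch hardware, since actual MSR access requires root privileges.

```python
# Bit layout of an IA32_PERFEVTSELx register (Intel architectural performance
# monitoring; IA32_PERFEVTSEL0 is MSR 0x186). Assumes an Intel processor.
USR_BIT = 1 << 16  # count events that occur in user mode
OS_BIT = 1 << 17   # count events that occur in kernel mode
EN_BIT = 1 << 22   # enable the counter

# Architectural event encodings as (event select, unit mask).
LLC_REFERENCE = (0x2E, 0x4F)
LLC_MISSES = (0x2E, 0x41)


def perfevtsel_value(event, umask, user=True, kernel=True):
    """Build the value for an event-select register: event code plus mask."""
    value = (event & 0xFF) | ((umask & 0xFF) << 8) | EN_BIT
    if user:
        value |= USR_BIT
    if kernel:
        value |= OS_BIT
    return value


def llc_hit_rate(references, misses):
    """Derive the LLC hit rate from counter readings; None if nothing was counted."""
    if references == 0:
        return None
    return 1.0 - misses / references
```

For instance, `hex(perfevtsel_value(*LLC_MISSES))` yields `0x43412e`, the value that would be written to an event-select register to count LLC misses in both user and kernel mode.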

System-kernel-level indicators are extracted by monitoring the relevant code in the kernel. For example, monitoring process switching requires monitoring the part of the kernel's process management code that controls processes. When the computing node boots, monitoring starts once the kernel has loaded successfully. The monitor node must therefore have kernel-level control. Because extracting system-kernel-level indicators may slightly affect system performance, it can be enabled on demand for the situations being monitored.
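The text describes instrumenting kernel code directly. As a simpler stand-in, and not the method described above, the following sketch reads the per-process context-switch counters that the Linux kernel already exposes in `/proc/<pid>/status`. The use of Linux and the function names are assumptions made for this example.

```python
CTXT_KEYS = ("voluntary_ctxt_switches", "nonvoluntary_ctxt_switches")


def parse_context_switches(status_text):
    """Extract context-switch counters from the text of /proc/<pid>/status."""
    counts = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key in CTXT_KEYS:
            counts[key] = int(value.strip())
    return counts


def read_context_switches(pid):
    """Read the counters for one process (assumes a Linux host)."""
    with open(f"/proc/{pid}/status") as f:
        return parse_context_switches(f.read())
```

Sampling these counters periodically and taking differences gives a process switching rate without modifying the kernel, at the cost of coarser detail than in-kernel monitoring would provide.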

2. Big data mining and hierarchical display

The monitor nodes arranged on the computing nodes can also send data to the monitoring device, which receives the data from, and manages, all monitor nodes in a unified manner. The main monitor node in the monitoring device receives the collected data indicators from each monitor node and sends control commands to each monitor node. The control commands comprise basic data collection commands, which the system generates by default, and advanced data collection commands, which are generated according to user settings; each monitor node collects the corresponding data indicators according to the control commands. The main monitor node also stores the received data indicators in the database in a defined storage format, as input for the subsequent data mining.
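The storage step can be sketched as follows. The text does not specify the database or the storage format, so the use of SQLite, the table layout (node, timestamp, indicator name, value), and all function names are assumptions made purely for illustration.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open the indicator database and create the (assumed) table layout."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS indicators ("
        "node TEXT NOT NULL, ts INTEGER NOT NULL, "
        "name TEXT NOT NULL, value REAL NOT NULL)"
    )
    return conn


def store_sample(conn, node, ts, indicators):
    """Store one report from a monitor node; indicators maps name to value."""
    conn.executemany(
        "INSERT INTO indicators VALUES (?, ?, ?, ?)",
        [(node, ts, name, value) for name, value in indicators.items()],
    )
    conn.commit()


def node_average(conn, node, name):
    """Average of one indicator for one node; None if no samples exist."""
    row = conn.execute(
        "SELECT AVG(value) FROM indicators WHERE node = ? AND name = ?",
        (node, name),
    ).fetchone()
    return row[0]
```

Storing one row per indicator keeps the schema unchanged when advanced indicators are switched on or off for a node, which suits the on-demand collection described above.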

To provide intelligence, the monitoring device also has big data mining capability: it processes the data indicators stored in the database according to preset statistics settings and, following a preset hierarchical display scheme, provides data statistics and analysis results to each class of user. In addition, the monitoring device has a user interface that can receive custom data mining algorithms and perform data mining according to them. The preset statistics settings comprise:

One, indicators for the management-level user group

1. Productivity (task throughput)

a. Number of tasks and applications running in real time

b. Number of tasks completed (failed) each day over one week (month, year) [bar chart, table]

c. Average number of tasks completed (failed) per day over one week (month, year)

d. Total number of tasks completed (failed) over one week (month, year)

e. Time per task

2. Operation and maintenance cost (energy consumption) (computing, storage, switching, machine room [cooling])

a. Real-time total power consumption

b. Daily energy consumption (kWh) over one week (month, year) [bar chart, table]

c. Average daily energy consumption (kWh) over one week (month, year)

d. Total energy consumption (kWh) over one week (month, year)

e. Equipment depreciation monitoring, overall machine-room amortization cost, statistics on the ratio between expense items, and computing performance per unit cost

3. Asset utilization efficiency

a. Daily cluster duty cycle over one week (month, year)

b. Average daily cluster duty cycle over one week (month, year)

c. Daily cluster peak periods over one week (month, year) (computed from the hourly cluster duty cycle)

d. Recurring peak periods over one week (month, year) (annual duty cycle over the 24-hour period)

e. Real-time number of online users (special authorization is required to view personal information)

f. Daily number of online users over one week (month, year) [bar chart, table]

g. Average daily number of online users over one week (month, year)

h. Average number of tasks completed per user each day over one week (month, year)

i. Average number of tasks completed per user over one week (month, year)

4. Equipment health

a. Real-time number of faulty nodes and failure rate

b. Daily number of faulty nodes and failure rate over one week (month, year) [bar chart, table]

c. Average daily number of faulty nodes and failure rate over one week (month, year)

Two, indicators for the cluster equipment management and maintenance personnel user group

1. Fault alarm and localization

a. Real-time number of faulty nodes and failure rate

b. Daily faulty-node records and failure rate over one week (month, year) [bar chart, table]

c. Average number of failures per node and per-node failure rate over one week (month, year) (to identify failure-prone nodes)

d. Real-time localization of faulty nodes

e. Real-time alarms for faulty nodes

f. Classification of failure types for faulty or failed nodes: reachable, unreachable, powered off, etc.

g. Precise localization of the faulty device for faults on reachable nodes: position of a failed disk, dropped memory (position), etc.

2. Equipment running status view

a. Real-time overall cluster CPU utilization and centralized storage I/O bandwidth

b. Daily overall average cluster CPU utilization and average centralized storage I/O bandwidth over one week (month, year)

c. Overall average cluster CPU utilization and average centralized storage I/O bandwidth over one week (month, year)

d. Real-time view of each node's running status: CPU, memory, local disk, network, and other indicators

e. Historical query of the daily running status of all nodes within one year

f. Resource bottleneck analysis (CPU, storage, memory, network [distinguishing storage traffic from data exchange])

3. Billing function

a. Statistics on user machine-hours

Three, indicators for the task user group

1. Current task information

a. Number of nodes and cores used by the current task, amount of memory occupied, etc.

b. Status information of the nodes used by the current task: CPU, memory, local disk, network, etc.

c. Number of tasks currently queued

d. Queuing time of the current task

2. Historical task statistics

a. Running time of this user's historical tasks

b. Average running time of this user's historical tasks

c. Number of historical tasks this user has completed (failed)

d. Task success rate (number of successful tasks / number of failed tasks)

e. Number of nodes and cores used by this user's historical tasks

f. Average number of nodes and cores used per historical task of this user

g. Average queuing time of historical tasks

Four, indicators for the application software developer user group

1. Program (module) usage statistics

a. Total number of modules processed (failed) each day over one week (month, year)

b. Module failure rate over one week (month, year)

c. Module usage popularity statistics and ranking over one week (month, year), and each module's share of total uses

d. Failed-module popularity statistics and ranking over one week (month, year), and each failed module's share of total failures

2. Performance trace indicators

a. Load of all application services (database, file system, job scheduling, intermediate acceleration layer, parallel framework, etc.)

b. Micro-architecture-level information: cache hit/miss rate, TLB

c. Operating-system-level information: number of processes, process switching, stack, heap memory allocation, etc.

3. Statistics on user usage habits

a. Data access latency, dwell time, I/O access patterns, etc. of interactive applications

Finally, the monitoring device displays the statistical analysis information mined as described above on the user terminal device, according to the specified user level.
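Two of the management-level statistics listed above, the number of tasks completed (failed) each day and the average per day, can be sketched as follows. The record format (a day label paired with a status string) and the function names are assumptions made for this example; the text does not specify how task records are represented.

```python
from collections import Counter


def daily_task_counts(records):
    """Count tasks per day and status.

    records: iterable of (day, status) pairs, where status is 'completed'
    or 'failed' (an assumed record format).
    """
    counts = {}
    for day, status in records:
        counts.setdefault(day, Counter())[status] += 1
    return counts


def average_per_day(counts, status):
    """Average number of tasks with the given status per recorded day."""
    if not counts:
        return 0.0
    return sum(c[status] for c in counts.values()) / len(counts)
```

The per-day counts feed the bar chart or table view directly, and the same dictionary can be reduced to the weekly, monthly, or yearly totals by summing over the relevant days.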

Data mining in the embodiments of the present invention is differentiated by user type. The mining items listed above were compiled after thorough analysis of the actual needs and concerns of each type of user. Such indicators are not available in conventional monitoring, where data must be exported and analyzed manually, whereas the embodiment proposed by the present invention completes this intelligently and automatically. In addition, the embodiment reserves a custom data mining interface through which user-defined data mining programs can be executed.
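One possible shape for such a reserved interface is a registry of named mining functions, sketched below. The text does not describe the interface itself, so the registry, the decorator, the row format (dictionaries with `node`, `name`, and `value` keys), and the example miner are all hypothetical.

```python
MINERS = {}  # registry of user-defined mining programs, keyed by name


def register_miner(name):
    """Decorator that registers a user-defined mining function under a name."""
    def decorator(func):
        MINERS[name] = func
        return func
    return decorator


def run_miner(name, rows):
    """Run a registered mining function on rows of stored data indicators."""
    return MINERS[name](rows)


@register_miner("busiest_node")
def busiest_node(rows):
    """Example miner: the node with the highest mean CPU utilization."""
    totals = {}
    for row in rows:
        if row["name"] == "cpu_util":
            total, count = totals.get(row["node"], (0.0, 0))
            totals[row["node"]] = (total + row["value"], count + 1)
    if not totals:
        return None
    return max(totals, key=lambda node: totals[node][0] / totals[node][1])
```

With this arrangement the monitoring device only needs to look up a program by name and pass it the extracted indicators, so preset and user-defined analyses share one execution path.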

3. Fault localization and alarm

Through the above data mining analysis, the current operating performance indicators of the equipment of a computing node can be obtained, and from these indicators it can be determined whether the equipment has failed and what caused the failure. On the one hand, the fault information can be shown to specific users through the intelligent display module of the user terminal device; on the other hand, a fault alarm module, for example an audio device or a light unit, can be installed on the user terminal device to issue an alarm when equipment fails, prompting maintenance personnel to attend to the faulty equipment and clear the fault quickly.

Abnormal conditions of equipment or application software are reflected in the collected performance data indicators. Put simply, the present invention locates faults by analyzing anomalies in the performance data indicators, in particular performance-related faults that cannot be identified by ordinary methods. For example, poor cooling of the cluster may cause processors to run at reduced frequency, which ordinary fault monitoring does not report. With the method proposed by the present invention, processor micro-architecture data indicators are collected, so the floating-point operation rate achieved by the processor and the cycles per instruction (CPI) can be monitored in real time. When the monitored node is heavily loaded and these two indicators remain below preset thresholds for an extended period, the monitoring device determines that a fault has occurred and raises an intelligent alarm, at the same time identifying the cause of the fault, namely abnormal frequency reduction of the processor.
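The detection rule in this example can be sketched as a sliding-window check: the fault is reported only when every sample in the most recent window was taken under heavy load and fell below the threshold. The class name, the parameter names, and the use of a fixed-length window of samples to represent the preset time period are assumptions of this sketch.

```python
from collections import deque


class SustainedLowDetector:
    """Flags an indicator that stays below a threshold while the node is loaded."""

    def __init__(self, threshold, window, load_threshold):
        self.threshold = threshold            # preset indicator threshold
        self.load_threshold = load_threshold  # load level that counts as heavy
        self.recent = deque(maxlen=window)    # one flag per recent sample

    def update(self, value, load):
        """Record one sample; return True when the fault condition holds."""
        suspicious = load >= self.load_threshold and value < self.threshold
        self.recent.append(suspicious)
        window_full = len(self.recent) == self.recent.maxlen
        return window_full and all(self.recent)
```

One detector instance per indicator (for example, one for the floating-point operation rate and one for CPI) would be updated with each per-second sample, and an alarm for abnormal frequency reduction raised when the detectors report the condition. Requiring the full window avoids alarms on brief dips or on lightly loaded nodes.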

Of course, the present invention may have various other embodiments. Without departing from the spirit and essence of the present invention, those of ordinary skill in the art may make various corresponding changes and modifications according to the present invention, and all such changes and modifications shall fall within the protection scope of the claims of the present invention.

Claims

1. A system for intelligently monitoring the computing nodes of a large-scale data center cluster, comprising monitor nodes arranged on the computing nodes of the data center cluster, a monitoring device that communicates with each monitor node, and a user terminal device, characterized in that:

2. The system as claimed in claim 1, characterized in that the analysis comprises: locating the faulty computing node and determining the cause of the fault according to the data indicators.

3. The system as claimed in claim 1 or 2, characterized in that: the hardware micro-architecture data indicators comprise one or more of the following: the real-time floating-point operation rate of the CPU, Streaming SIMD Extensions (SSE) unit utilization, Advanced Vector Extensions (AVX) unit utilization, vector instruction vectorization ratio, cycles per instruction (CPI), last-level cache (LLC) hit rate, memory bandwidth, PCI Express (PCI-E) device bandwidth, and cache hit/miss rate; the data indicators related to the processes of the applications running on the computing node comprise one or more of the following: number of process switches, stack information, and heap memory allocation.

4. The system as claimed in claim 3, characterized in that: the data indicator is the real-time floating-point operation rate of the CPU and/or the cycles per instruction (CPI), and the analysis comprises: when the data indicator remains below a preset threshold throughout a preset time period, determining that the processor has failed and that the cause of the fault is abnormal frequency reduction of the processor.

5. The system as claimed in claim 1, characterized in that: the monitor node also collects the CPU utilization, memory usage, local disk I/O data, and/or Ethernet throughput provided by the operating system.

6. The system as claimed in claim 1, characterized in that: the hardware control register of the computing node is an MSR control register in the performance monitoring unit (PMU) of the processor of the computing node.

7. A method for intelligently monitoring the computing nodes of a large-scale data center cluster, characterized by:

Starting the monitor node arranged on the computing node;

8. The method as claimed in claim 7, characterized in that the analysis comprises: locating the faulty computing node and determining the cause of the fault according to the data indicators.

9. The method as claimed in claim 7 or 8, characterized in that: the hardware micro-architecture data indicators comprise one or more of the following: the real-time floating-point operation rate of the CPU, Streaming SIMD Extensions (SSE) unit utilization, Advanced Vector Extensions (AVX) unit utilization, vector instruction vectorization ratio, cycles per instruction (CPI), last-level cache (LLC) hit rate, memory bandwidth, PCI Express (PCI-E) device bandwidth, and cache hit/miss rate; the data indicators related to the processes of the applications running on the computing node comprise one or more of the following: number of process switches, stack information, and heap memory allocation.

10. The method as claimed in claim 9, characterized in that: the data indicator is the real-time floating-point operation rate of the CPU and/or the cycles per instruction (CPI), and the analysis comprises: when the data indicator remains below a preset threshold throughout a preset time period, determining that the processor has failed and that the cause of the fault is abnormal frequency reduction of the processor.

11. The method as claimed in claim 10, characterized in that: the monitor node also collects the CPU utilization, memory usage, local disk I/O data, and/or Ethernet throughput provided by the operating system.

12. The method as claimed in claim 11, characterized in that: the hardware control register of the computing node is an MSR control register in the performance monitoring unit (PMU) of the processor of the computing node.