CN104156296A - System and method for intelligently monitoring large-scale data center cluster computing nodes - Google Patents


Info

Publication number
CN104156296A
Authority
CN
China
Prior art keywords
computing node
data target
data
node
rate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410377856.0A
Other languages
Chinese (zh)
Other versions
CN104156296B (en)
Inventor
刘羽
吕文静
金莲
陈博文
于涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Beijing Electronic Information Industry Co Ltd
Original Assignee
Inspur Beijing Electronic Information Industry Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Beijing Electronic Information Industry Co Ltd
Priority to CN201410377856.0A
Publication of CN104156296A
Application granted
Publication of CN104156296B
Status: Active

Landscapes

  • Debugging And Monitoring (AREA)

Abstract

The invention provides a system and a method for intelligently monitoring the computing nodes of a large-scale data center cluster. Monitoring nodes in the system collect hardware micro-architecture data metrics of the computing nodes and data metrics related to the processes of running applications, and transmit them to a monitoring device in the system; the monitoring device performs big data analysis and sends the results to a user terminal device for display to the user. Because both the micro-architecture metrics of the computing nodes and the process metrics of the running applications are collected, the system and method enable intelligent big data analysis, automatically locate faulty computing nodes, and report the cause of each fault.

Description

System and method for intelligently monitoring large-scale data center cluster computing nodes
Technical field
The present invention relates to the field of computer technology, and in particular to a system and method for intelligently monitoring the computing nodes of a large-scale data center cluster.
Background
As society and technology advance, human understanding of the natural world keeps broadening, and the need to explore the unknown grows more urgent. The volume of information that must be held has grown sharply as a result, and this mass of data has to be analyzed and processed promptly. For example, a large astronomical radio telescope array can produce more than 100 GB of cosmic microwave data per second, all of which must be analyzed in time; in particle physics research, the data from a single LHC collision is measured in terabytes; and fields such as the human genome project, petroleum exploration, and weather forecasting place ever higher demands on computing power. Against this background, numerical computation has become the third essential means of scientific exploration, alongside experiment and theoretical analysis. This reality has driven the world's leading technological powers to develop supercomputers vigorously. In the TOP500 list published in December 2013, for instance, China's first-ranked Tianhe-2 (TH-2) reached a peak speed of 54.9 PFlops using more than 16,000 computing nodes.
In addition, with the development of new technologies such as cloud computing, big data, and the Internet of Things, more and more large-scale data centers and cloud computing centers have appeared, often with tens of thousands of computing nodes. Google's data center in The Dalles, Oregon, for example, has approximately 150,000 server nodes. In data centers of this scale, performance monitoring of computing nodes, fault localization, fault recovery, and center-wide efficiency statistics all pose unprecedented challenges. How to manage and use large-scale and even ultra-large-scale data centers efficiently is therefore an active area of research worldwide.
For a long time, data center monitoring and management have been carried out manually or semi-automatically. Operations staff must check the running state of the cluster in real time. When a problem occurs they can sometimes locate the affected node, but often cannot pinpoint the faulty device, and must rely on experience to diagnose and troubleshoot, which is time-consuming and laborious. Cluster users can follow their own jobs through various job scheduling software, but can rarely obtain historical analysis of those jobs. Cluster decision makers often cannot obtain decision-relevant information such as expenditure, utilization efficiency, staff productivity, and cost-effectiveness directly from the cluster, and must base decisions on manual analysis of large amounts of data. Application developers likewise often cannot obtain from the cluster the information they urgently need to optimize application software, such as hardware micro-architecture data, system process and stack information, and module error and crash statistics, and must instead obtain it through extensive experiments guided by experience, which costs both time and effort.
Summary of the invention
The present invention proposes a system and method for intelligently monitoring the computing nodes of a large-scale data center cluster. The system is designed for large scale, offers multiple functions, and serves multiple user groups. It provides comprehensive intelligent analysis and statistics functions and can supply reference data for decision-making by users at different levels.
The system comprises monitoring nodes installed on the computing nodes of the data center cluster, a monitoring device that communicates with each monitoring node, and a user terminal device, characterized in that:
The monitoring node is configured to collect hardware micro-architecture data metrics of the computing node by obtaining control of the computing node's hardware control registers, to obtain data metrics related to the processes of the applications running on the computing node by obtaining control of the operating system kernel, and to send the data metrics to the monitoring device;
The monitoring device is configured to receive the data metrics, perform big data analysis based on them, and send the analysis results to the user terminal device;
The user terminal device is configured to receive the results and display them to the user.
The method comprises:
Starting the monitoring node installed on a computing node;
The monitoring node collects hardware micro-architecture data metrics of the computing node by obtaining control of the computing node's hardware control registers, obtains data metrics related to the processes of the applications running on the computing node by obtaining control of the operating system kernel, and sends the data metrics to the monitoring device;
The monitoring device receives the data metrics, performs big data analysis based on them, and sends the analysis results to the user terminal device;
The user terminal device receives the results and displays them to the user.
In particular, the analysis comprises: locating the faulty computing node and determining the cause of the fault from the data metrics.
In particular, the hardware micro-architecture data metrics comprise one or more of: the CPU's real-time floating-point operation rate, Streaming SIMD Extensions (SSE) unit utilization, Advanced Vector Extensions (AVX) unit utilization, the vectorization ratio of vector instructions, cycles per instruction (CPI), last-level cache (LLC) hit rate, memory bandwidth, PCI Express (PCI-E) device bandwidth, and cache hit/miss rate. The data metrics related to the processes of the applications running on the computing node comprise one or more of: the number of process switches, stack information, and heap memory allocation.
In particular, where the data metric is the CPU's real-time floating-point operation rate and/or the cycles per instruction (CPI), the analysis comprises: when the data metric remains below a preset threshold throughout a preset time period, determining that the processor is faulty and that the cause of the fault is abnormal frequency throttling of the processor.
In particular, the monitoring node also collects the CPU utilization, memory usage, local disk I/O data, and/or Ethernet throughput provided by the operating system.
In particular, the hardware control register of the computing node is an MSR control register in the performance monitoring unit (PMU) of the computing node's processor.
The beneficial effects of the invention are as follows:
The performance monitoring component on each computing node extracts the necessary system-level performance metrics and transmits them to the monitoring management node, which is responsible for maintaining them. The monitoring management node can identify anomalies and raise alarms; it also mines the recorded historical data separately for each user group and feeds the results back to users. In addition, the monitoring management node can, on demand and for a given time period, have designated monitoring nodes extract information on hardware micro-architecture features, processes, and stacks. Large-scale cluster monitoring thereby becomes multi-user, multi-functional, and intelligent.
To keep the monitoring timely, the monitoring client on each computing node refreshes once per second. To reduce resource usage on the computing node, each node extracts only the minimum set of metrics needed for data analysis: a dozen or so metrics including CPU utilization, memory usage, local disk reads and writes, and Ethernet throughput.
To be multi-functional, the intelligent monitoring system also supports monitoring and analysis of metrics related to the hardware micro-architecture, such as floating-point operation rate, vectorization ratio, memory bandwidth, and IB bandwidth. Because monitoring these metrics consumes comparatively more system resources, it is started on demand according to user instructions.
To be multi-user, the intelligent monitoring system provides a hierarchical view with four levels: the management layer, the operations and maintenance layer, the application user layer, and the application development layer.
To be intelligent, the intelligent monitoring system includes a data mining analysis method that computes, from the basic performance monitoring data, the statistics of greatest interest to users at each level.
Brief description of the drawings
Fig. 1 is a block diagram of the system for intelligently monitoring a large-scale data center cluster proposed by the present invention.
Fig. 2 is a flow chart of the method for intelligently monitoring a large-scale data center cluster proposed by the present invention.
Detailed description of the embodiments
To make the objects, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings.
Referring to Fig. 1, the system for intelligently monitoring the computing nodes of a large-scale data center cluster proposed by the present invention comprises monitoring nodes installed on the computing nodes of the data center cluster, a monitoring device connected to each monitoring node, and a user terminal device. Each computing node has the usual hardware, such as a processor (CPU), memory, hard disks, and an Ethernet controller, and runs an operating system and application software. The monitoring device comprises a master monitoring node and a database. The master monitoring node communicates with each monitoring node installed on the computing nodes and can obtain the hardware and software operating data of the computing nodes, for example CPU utilization, memory usage, local disk I/O data, and Ethernet throughput, as well as the hardware micro-architecture data metrics of each computing node and the process-level data metrics of its running applications. The master monitoring node writes the acquired data to the database, automatically performs big data mining, and saves the mining results. The user reads the results from the database and views them through the user terminal device. Through the user terminal device, the user can also submit a user-defined data mining program to the monitoring device; the monitoring device then extracts the corresponding data metrics of the cluster nodes, performs big data mining according to the user-defined program, and shows the results to the user.
Referring to Fig. 2, the method for intelligently monitoring the computing nodes of a large-scale data center cluster proposed by the present invention consists of several main steps: data acquisition, big data mining, hierarchical display, and fault localization and alarm. Data acquisition comprises basic data acquisition and advanced data acquisition. Basic data acquisition is performed automatically by the system and requires no user configuration; advanced data acquisition is configured according to the user's needs.
1. Data acquisition
Data acquisition means installing a monitoring node on each computing node of the data center cluster to extract that node's CPU utilization, memory usage, local disk I/O data, and Ethernet throughput, together with the hardware micro-architecture data metrics of the node and the process-level data metrics of its running applications. Collecting the hardware micro-architecture metrics and the process-level metrics is called advanced data acquisition; collecting the remaining metrics is called basic data acquisition. Basic data acquisition is enabled by default and runs without user intervention, whereas advanced data acquisition runs according to user settings. Because the performance metrics must be timely, the monitoring node must be able to refresh its samples at one-second granularity while keeping its resource usage on the computing node extremely low.
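As an illustration of basic data acquisition, the following is a minimal sketch of how a monitoring node might derive CPU utilization from two samples taken one second apart. It assumes a Linux computing node whose aggregate counters are read from the first line of `/proc/stat`; the patent does not name a data source, and the function names are hypothetical.

```python
from typing import List, Sequence


def parse_cpu_line(line: str) -> List[int]:
    """Parse the aggregate 'cpu' line of /proc/stat into its tick counters."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line of /proc/stat")
    return [int(value) for value in fields[1:]]


def cpu_utilization(prev: Sequence[int], curr: Sequence[int]) -> float:
    """Return the busy fraction between two samples.

    Only the first eight counters (user, nice, system, idle, iowait, irq,
    softirq, steal) are summed, because guest time is already included in
    user time. Idle and iowait are both treated as idle time.
    """
    deltas = [c - p for p, c in zip(prev, curr)][:8]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    idle = deltas[3] + (deltas[4] if len(deltas) > 4 else 0)
    return (total - idle) / total
```

In a real client the two samples would come from reading `/proc/stat` once per second, which keeps the cost of basic acquisition to one small file read per refresh.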
The data acquisition method proposed by the present invention differs from the prior art. In the prior art, data acquisition only collects the metrics that the operating system itself provides; that is, collection depends on the operating system running on the computing node, and the monitoring node cannot obtain any metric that the operating system does not provide. The method proposed by the present invention can collect the metrics provided by the operating system and can also collect hardware micro-architecture data metrics, for example the CPU's real-time floating-point operation rate, Streaming SIMD Extensions (SSE) unit utilization, Advanced Vector Extensions (AVX) unit utilization, the vectorization ratio of vector instructions, cycles per instruction (CPI), last-level cache (LLC) hit rate, translation lookaside buffer (TLB) parameters, memory bandwidth, PCI Express (PCI-E) device bandwidth, and cache hit/miss rate. It can also collect process-level data metrics, such as the number of process switches, stack information, and heap memory allocation. These metrics are of great value for tuning application performance, analyzing cluster characteristics, and locating software-level faults.
Because both hardware-level and process-level data metrics must be collected, the monitoring node proposed by the present invention is implemented as a software client. The monitoring node performs basic data acquisition in the same way as the prior art, so that is not repeated here; advanced data acquisition is described in detail below.
The hardware micro-architecture data metrics are extracted by controlling the relevant hardware registers. Processor micro-architecture metrics, for example, are obtained mainly by controlling the performance monitoring unit (PMU) in the processor. The monitoring node of the present invention therefore requires the highest (root) privileges. The PMU control flow is as follows:
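Several of the micro-architecture metrics named above are ratios of raw counter values. A minimal sketch, assuming the monitoring node has already obtained the counter deltas for one sampling interval (the `CounterSample` type and its field names are hypothetical, not taken from the patent):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CounterSample:
    """Counter deltas observed during one sampling interval (hypothetical)."""
    cycles: int        # core clock cycles elapsed
    instructions: int  # instructions retired
    fp_ops: int        # floating-point operations completed
    seconds: float     # length of the interval


def cycles_per_instruction(sample: CounterSample) -> Optional[float]:
    """CPI = cycles / retired instructions; None if nothing retired."""
    if sample.instructions <= 0:
        return None
    return sample.cycles / sample.instructions


def gflops(sample: CounterSample) -> Optional[float]:
    """Floating-point operation rate in GFLOP/s; None for an empty interval."""
    if sample.seconds <= 0:
        return None
    return sample.fp_ops / sample.seconds / 1e9
```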
S1: Obtain control of the MSR (Model-Specific Register) control registers in the PMU of the computing node's processor;
S2: Write the code and mask of the event of interest into the MSR control register obtained in S1, and set the register to start counting the event. For example, to collect the LLC hit rate metric, first write the event code and mask for LLC hits into the MSR control register, then set the register to start counting LLC hits; when counting ends, read the count from the register and compute the LLC hit rate.
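Steps S1 and S2 can be sketched as follows. This is an illustration under stated assumptions, not the patent's implementation: it assumes an Intel processor with architectural performance monitoring, where the event-select register `IA32_PERFEVTSEL0` is at MSR address 0x186, the matching counter `IA32_PMC0` is at 0xC1, and the architectural LLC reference and LLC miss events use event code 0x2E with masks 0x4F and 0x41. It also assumes a Linux node with the `msr` driver loaded, which exposes `/dev/cpu/N/msr` to root. On such hardware the count is read from the counter register rather than from the event-select register, and the hit rate is derived from references and misses.

```python
import os

IA32_PERFEVTSEL0 = 0x186  # event-select (control) MSR for counter 0
IA32_PMC0 = 0x0C1         # general-purpose counter 0

FLAG_USR = 1 << 16  # count while in user mode
FLAG_OS = 1 << 17   # count while in kernel mode
FLAG_EN = 1 << 22   # enable the counter

LLC_REFERENCE = (0x2E, 0x4F)  # (event code, unit mask)
LLC_MISS = (0x2E, 0x41)


def encode_perfevtsel(event: int, umask: int, usr: bool = True,
                      os_mode: bool = True) -> int:
    """Build the value written in step S2: event code, mask, and enable bits."""
    value = (event & 0xFF) | ((umask & 0xFF) << 8) | FLAG_EN
    if usr:
        value |= FLAG_USR
    if os_mode:
        value |= FLAG_OS
    return value


def write_msr(cpu: int, register: int, value: int) -> None:
    """Write one 64-bit MSR through the Linux msr driver (requires root)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_WRONLY)
    try:
        os.pwrite(fd, value.to_bytes(8, "little"), register)
    finally:
        os.close(fd)


def read_msr(cpu: int, register: int) -> int:
    """Read one 64-bit MSR through the Linux msr driver (requires root)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        return int.from_bytes(os.pread(fd, 8, register), "little")
    finally:
        os.close(fd)


def llc_hit_rate(references: int, misses: int):
    """Hit rate = 1 - misses / references; None if there were no references."""
    if references <= 0:
        return None
    return 1.0 - misses / references
```

A collector would call `write_msr(cpu, IA32_PERFEVTSEL0, encode_perfevtsel(*LLC_REFERENCE))`, wait for the sampling interval, and then call `read_msr(cpu, IA32_PMC0)`. On processors that implement a global counter-enable register, the counter must also be enabled there before it counts.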
System kernel-level metrics are extracted by monitoring the relevant code in the kernel. To monitor process switching, for example, the part of the kernel's process management code that controls processes must be monitored. Monitoring starts when the computing node boots, once the kernel has loaded successfully. The monitoring node must therefore have kernel-level control. Extracting kernel-level metrics may slightly affect system performance, so it can be enabled on demand for the occasions that need it.
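The patent describes instrumenting the kernel's process management code directly. As a lighter-weight stand-in for illustration only, the sketch below reads the per-process context-switch counters that a Linux kernel already exports in `/proc/<pid>/status`; this is an assumption about the platform and is not the mechanism described in the patent.

```python
from typing import Dict

_FIELDS = ("voluntary_ctxt_switches", "nonvoluntary_ctxt_switches")


def parse_context_switches(status_text: str) -> Dict[str, int]:
    """Extract the context-switch counters from /proc/<pid>/status text."""
    counts: Dict[str, int] = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key in _FIELDS:
            counts[key] = int(value.strip())
    return counts


def read_context_switches(pid: int) -> Dict[str, int]:
    """Read the counters for one process (Linux only)."""
    with open(f"/proc/{pid}/status", "r", encoding="ascii") as handle:
        return parse_context_switches(handle.read())
```

Sampling these counters once per interval and taking the difference gives the number of process switches in that interval.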
2. Big data mining and hierarchical display
The monitoring node installed on each computing node can also send data to the monitoring device, which receives data from and manages all monitoring nodes centrally. The master monitoring node in the monitoring device receives the collected data metrics from each monitoring node and sends control commands to each monitoring node. The control commands comprise basic data acquisition commands generated by system default and advanced data acquisition commands generated from user settings; each monitoring node collects the corresponding data metrics according to the commands it receives. The master monitoring node also stores the received data metrics in the database in a defined storage format, as input for the subsequent data mining.
To be intelligent, the monitoring device also has big data mining capability: it processes the data metrics saved in the database according to preset statistics, and provides statistics and analysis results to the different users according to a preset hierarchical display scheme. The monitoring device also has a user interface through which it can receive a custom data mining algorithm and perform data mining according to it. The preset statistics comprise:
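The storage step can be sketched as follows. The patent only says that metrics are stored "in a defined storage format"; the SQLite database, the table layout, and the function names below are assumptions chosen to keep the example self-contained.

```python
import sqlite3
from typing import Dict, Optional


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the metrics database and create the (hypothetical) table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS metrics ("
        "node TEXT NOT NULL, metric TEXT NOT NULL, "
        "ts REAL NOT NULL, value REAL NOT NULL)"
    )
    return conn


def store_samples(conn: sqlite3.Connection, node: str, ts: float,
                  samples: Dict[str, float]) -> None:
    """Persist one report received from a monitoring node."""
    rows = [(node, name, ts, float(value)) for name, value in samples.items()]
    with conn:  # commit on success, roll back on error
        conn.executemany("INSERT INTO metrics VALUES (?, ?, ?, ?)", rows)


def average(conn: sqlite3.Connection, node: str, metric: str,
            since: float) -> Optional[float]:
    """Average of one metric on one node from a given timestamp onward."""
    row = conn.execute(
        "SELECT AVG(value) FROM metrics WHERE node = ? AND metric = ? AND ts >= ?",
        (node, metric, since),
    ).fetchone()
    return row[0]
```

One narrow row per sample keeps the schema independent of which metrics a given monitoring node has been told to collect.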
I. Management-layer user group metrics
1. Throughput (task throughput)
a. Number of tasks and applications running in real time
b. Number of tasks completed (failed) each day over one week (month, year) [chart, table]
c. Average number of tasks completed (failed) per day over one week (month, year)
d. Total number of tasks completed (failed) over one week (month, year)
e. Time per task
2. Operations cost (energy consumption) (computing, storage, switching, machine room [cooling])
a. Real-time total power consumption
b. Daily energy consumption (kWh) over one week (month, year) [chart, table]
c. Average daily energy consumption (kWh) over one week (month, year)
d. Total energy consumption (kWh) over one week (month, year)
e. Monitoring of equipment amortization and overall machine-room amortization costs, ratio statistics across cost items, and job performance per unit cost
3. Asset utilization efficiency
a. Daily cluster duty cycle over one week (month, year)
b. Average daily cluster duty cycle over one week (month, year)
c. Daily cluster peak periods over one week (month, year) (computed from the hourly cluster duty cycle)
d. Consistently busy periods over one week (month, year) (annual duty cycle for each hour of the 24-hour day)
e. Real-time number of online users (with special authorization, personal information can be viewed)
f. Daily number of online users over one week (month, year) [chart, table]
g. Average daily number of online users over one week (month, year)
h. Average number of tasks completed per user each day over one week (month, year)
i. Average number of tasks completed per user over one week (month, year)
4. Equipment health
a. Real-time number of faulty nodes and failure rate
b. Daily number of faulty nodes and failure rate over one week (month, year) [chart, table]
c. Average daily number of faulty nodes and failure rate over one week (month, year)
II. Cluster equipment management and maintenance staff user group metrics
1. Fault alarm and localization
a. Real-time number of faulty nodes and failure rate
b. Daily faulty-node records and failure rate over one week (month, year) [chart, table]
c. Average number of failures per node and per-node failure rate over one week (month, year) (to identify failure-prone nodes)
d. Real-time localization of faulty nodes
e. Real-time alarms for faulty nodes
f. Classification of faulty and failed nodes by fault type: reachable, unreachable, power loss, etc.
g. Precise localization of the faulty device for reachable faults: position of a failed disk, dropped memory module (position), etc.
2. Equipment running status
a. Real-time overall cluster CPU utilization and centralized storage I/O bandwidth
b. Daily average overall cluster CPU utilization and average centralized storage I/O bandwidth over one week (month, year)
c. Average overall cluster CPU utilization and average centralized storage I/O bandwidth over one week (month, year)
d. Real-time view of per-node running status: CPU, memory, local disk, network, and other metrics
e. Historical query of the daily running status of all nodes within one year
f. Resource bottleneck analysis (CPU, storage, memory, network [distinguishing storage traffic from data exchange])
3. Billing
a. User machine-time statistics
III. Task user group metrics
1. Current task information
a. Number of nodes and cores used by the current task, memory occupied, etc.
b. Status of the nodes used by the current task: CPU, memory, local disk, network, etc.
c. Number of tasks currently queued
d. Queuing time of the current task
2. Historical task statistics
a. Running time of this user's historical tasks
b. Average running time of this user's historical tasks
c. Number of this user's historical tasks completed (failed)
d. Task success rate (number of successful tasks / number of failed tasks)
e. Number of nodes and cores used by this user's historical tasks
f. Average number of nodes and cores used by this user's historical tasks
g. Average queuing time of historical tasks
IV. Application software developer user group metrics
1. Program (module) usage statistics
a. Total number of modules processed (failed) each day over one week (month, year)
b. Module failure rate over one week (month, year)
c. Module usage popularity statistics and ranking over one week (month, year), and each module's share of total uses
d. Failed-module popularity statistics and ranking over one week (month, year), and each failed module's share of total failures
2. Performance tracing metrics
a. Load of all application services (database, file system, job scheduling, intermediate acceleration layer, parallel framework, etc.)
b. Micro-architecture-level information: cache hit/miss rate, TLB
c. Operating-system-level information: number of processes, process switches, stacks, heap memory allocation, etc.
3. User usage-habit statistics
a. Data access latency, dwell time, I/O access patterns, etc. of interactive applications
Finally, the monitoring device presents the statistics mined as described above on the user terminal devices, separately for each designated user layer.
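One of the listed statistics (tasks completed and failed per day) and the per-layer display can be sketched as follows. The input record shape, the `VIEWS` mapping, and the statistic names are hypothetical; the patent specifies which statistics each user layer sees but not how the mapping is represented.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical mapping from user layer to the statistics shown to it.
VIEWS = {
    "management": ["daily_task_counts", "daily_energy_kwh"],
    "operations": ["faulty_nodes", "cluster_cpu_utilization"],
    "task_user": ["queued_tasks", "task_history"],
    "developer": ["module_failure_rate", "cache_miss_rate"],
}


def daily_task_counts(tasks: Iterable[Tuple[str, bool]]) -> Dict[str, Dict[str, int]]:
    """Count completed and failed tasks per day from (day, succeeded) pairs."""
    counts = defaultdict(Counter)
    for day, succeeded in tasks:
        counts[day]["completed" if succeeded else "failed"] += 1
    return {day: dict(counter) for day, counter in counts.items()}


def view_for(layer: str, stats: Dict[str, object]) -> Dict[str, object]:
    """Select only the statistics that the given user layer should see."""
    return {name: stats[name] for name in VIEWS[layer] if name in stats}
```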
In the embodiments of the present invention, data mining is differentiated by user type. The mining items listed above were arrived at after thorough analysis of the actual needs and concerns of each type of user. Ordinary monitoring does not provide such metrics; the data must be exported and analyzed manually, whereas the embodiments of the present invention do this intelligently and automatically. The embodiments also reserve a custom data mining interface through which user-defined data mining programs can be executed.
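The patent does not specify the form of the custom data mining interface. One plausible shape, shown here purely as an assumption, is a registry in which user-defined mining functions are registered by name and then run against stored records; `register_miner`, `run_miner`, and the record fields are all hypothetical.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List, Tuple

Record = Dict[str, object]
Miner = Callable[[Iterable[Record]], object]

_MINERS: Dict[str, Miner] = {}


def register_miner(name: str) -> Callable[[Miner], Miner]:
    """Decorator that registers a user-defined mining function under a name."""
    def decorator(func: Miner) -> Miner:
        _MINERS[name] = func
        return func
    return decorator


def run_miner(name: str, records: Iterable[Record]) -> object:
    """Run a registered mining function over the supplied records."""
    if name not in _MINERS:
        raise KeyError(f"no mining program registered as {name!r}")
    return _MINERS[name](records)


@register_miner("failure_prone_nodes")
def failure_prone_nodes(records: Iterable[Record]) -> List[Tuple[object, int]]:
    """Example user-defined program: rank nodes by number of failure events."""
    failures = Counter(r["node"] for r in records if r.get("event") == "failure")
    return failures.most_common()
```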
3. Fault localization and alarm
The data mining analysis described above yields the current working performance metrics of the computing node's devices, from which it can be determined whether a device has failed and why. The fault information can be shown to specific users through the intelligent display module of the user terminal device. In addition, a fault alarm module, for example an audio device or a warning light, can be installed on the user terminal device to raise an alarm when a device fails, prompting maintenance staff to attend to the faulty device promptly so that the fault is cleared quickly.
Fault and anomaly conditions in devices or application software are reflected in the collected performance data metrics. In short, the present invention locates faults by analyzing anomalies in the performance data metrics; some faults, performance faults in particular, cannot be found by the usual methods. For example, poor cooling in the cluster may cause processors to run at reduced frequency, which ordinary fault monitoring does not report. With the method proposed by the present invention, processor micro-architecture data metrics are collected, so the floating-point operation rate achieved by the processor and the cycles per instruction (CPI) can be monitored in real time. When a monitored node is under heavy load and both metrics remain below preset thresholds for an extended period, the monitoring device determines that a fault has occurred and raises an intelligent alarm, and in doing so also identifies the cause of the fault: abnormal frequency throttling of the processor.
The present invention may of course have various other embodiments. Those of ordinary skill in the art may make corresponding changes and modifications in accordance with the present invention without departing from its spirit and essence, and all such changes and modifications shall fall within the scope of protection of the claims of the present invention.
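The throttling rule described above (heavy load, with both metrics below preset thresholds for a sustained period) can be sketched as a sliding-window check. The class name, the parameters, and the use of consecutive samples to represent the "preset time period" are assumptions; threshold values would be chosen per deployment.

```python
from collections import deque


class ThrottlingDetector:
    """Flags a node when, under heavy load, both the floating-point rate and
    the CPI stay below their preset thresholds for `window` consecutive samples."""

    def __init__(self, flops_threshold: float, cpi_threshold: float,
                 load_threshold: float, window: int) -> None:
        if window < 1:
            raise ValueError("window must be at least 1 sample")
        self.flops_threshold = flops_threshold
        self.cpi_threshold = cpi_threshold
        self.load_threshold = load_threshold
        self._flags = deque(maxlen=window)

    def update(self, load: float, gflops: float, cpi: float) -> bool:
        """Record one sample; return True once the whole window is suspicious."""
        suspicious = (
            load >= self.load_threshold
            and gflops < self.flops_threshold
            and cpi < self.cpi_threshold
        )
        self._flags.append(suspicious)
        return len(self._flags) == self._flags.maxlen and all(self._flags)
```

With one sample per second, `window=300` would correspond to a five-minute preset period; a single healthy or lightly loaded sample anywhere in the window suppresses the alarm.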

Claims (12)

1. a system for intelligent monitoring large-scale data center cluster computing node, comprises the monitor node being arranged on data center's cluster computing node, watch-dog and the subscriber terminal equipment of communicating by letter with each monitor node, it is characterized in that:
Described monitor node, for the control of the hardware controls register by obtaining computing node, gather the hardware micro-architecture data target of described computing node, by obtaining the control of operating system nucleus, obtain the data target relevant to the process of the application program of moving on described computing node, and described data target is sent to watch-dog;
Described watch-dog, for receiving described data target, carries out large data analysis based on described data target, and the result of described analysis is sent to subscriber terminal equipment;
Described subscriber terminal equipment, for receiving described result and being shown to user.
2. the system as claimed in claim 1, is characterized in that, described analysis comprises: locate the computing node breaking down and definite failure cause according to described data target.
3. system as claimed in claim 1 or 2, is characterized in that: described hardware micro-architecture data target comprise CPU real-time floating-point travelling speed, stream SIMD instruction extension collection SSE unit by using rate, senior vectorial superset AVX unit by using rate, vector instruction vectorization ratio, complete one or more the combination in the required clock number CPI of every instruction, afterbody buffer memory LLC hit rate, memory bandwidth, PCI high-speed bus interface PCI-E device bandwidth, cache hit/miss rate; The relevant data target of process of the application program of moving on described and described computing node comprises one or more the combination in process switching number of times, stack information, heap memory distribution condition.
4. system as claimed in claim 3, it is characterized in that: the real-time floating-point travelling speed that described data target is CPU and/or complete every clock number CPI that instruction is required, described analysis comprises: when described data target continues lower than default threshold value in Preset Time section, decision processor breaks down, and definite fault is former because the abnormal frequency reducing of processor.
5. the system as claimed in claim 1, is characterized in that: the cpu busy percentage, memory usage, local disk IO data and/or the Ethernet handling capacity that are provided by operating system are also provided described monitor node.
6. The system as claimed in claim 1, characterized in that: the hardware control registers of the computing node are the model-specific register (MSR) control registers in the performance monitoring unit (PMU) of the processor of the computing node.
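The claim names the MSR registers of the PMU but not the access mechanism. One way to read them, shown here purely as an assumption, is the Linux `msr` driver, which exposes each logical CPU's registers as `/dev/cpu/<n>/msr`; reading 8 bytes at a file offset equal to the register address returns the 64-bit register value. The register addresses below are the Intel architectural fixed-function counters, and the `read_msr` helper and its `path_template` parameter are hypothetical names for this sketch.

```python
import os
import struct

# Intel architectural fixed-function performance counters.
IA32_FIXED_CTR0 = 0x309  # instructions retired
IA32_FIXED_CTR1 = 0x30A  # unhalted core clock cycles


def read_msr(register: int, cpu: int = 0,
             path_template: str = "/dev/cpu/{cpu}/msr") -> int:
    """Read one 64-bit MSR through the Linux msr device of the given CPU.

    Needs the `msr` kernel module and sufficient privileges (typically root).
    """
    fd = os.open(path_template.format(cpu=cpu), os.O_RDONLY)
    try:
        raw = os.pread(fd, 8, register)  # the file offset selects the register
    finally:
        os.close(fd)
    if len(raw) != 8:
        raise OSError(f"short read for MSR {register:#x}")
    return struct.unpack("<Q", raw)[0]
```

With the counters enabled, CPI over an interval would be the change in `IA32_FIXED_CTR1` divided by the change in `IA32_FIXED_CTR0`. The fixed counters only advance after they have been enabled through `IA32_FIXED_CTR_CTRL` (0x38D) and `IA32_PERF_GLOBAL_CTRL` (0x38F), which this sketch leaves out.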
7. A method for intelligently monitoring large-scale data center cluster computing nodes, characterized by:
Starting the monitoring node arranged in the computing node;
The monitoring node acquiring the hardware micro-architecture data indexes of the computing node by obtaining control of the hardware control registers of the computing node, acquiring the data indexes related to the processes of the applications running on the computing node by obtaining control of the operating system kernel, and sending the data indexes to the monitoring equipment;
The monitoring equipment receiving the data indexes, performing big data analysis based on the data indexes, and sending the result of the analysis to the user terminal equipment;
The user terminal equipment receiving the result and displaying it to the user.
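The flow of this method (collect on the node, send, analyze, display) can be sketched in a single process, with a queue standing in for the network between the monitoring node and the monitoring equipment. Everything here is illustrative: the function names, the JSON record layout and the pluggable `analyze` callback are assumptions, and a real deployment would replace the queue with a network transport.

```python
import json
import queue


def monitoring_node(node_id: str, read_hw, read_proc, channel: queue.Queue) -> None:
    """Collect hardware and process-related indexes and send them as one record."""
    record = {"node": node_id, "hw": read_hw(), "proc": read_proc()}
    channel.put(json.dumps(record))


def monitoring_equipment(channel: queue.Queue, analyze) -> list:
    """Receive every pending record and run the supplied analysis on each one."""
    results = []
    while not channel.empty():
        results.append(analyze(json.loads(channel.get())))
    return results


def user_terminal(results: list) -> str:
    """Format the analysis results for display to the user."""
    return "\n".join(f"{r['node']}: {r['status']}" for r in results)
```

Passing the readers and the analysis in as callables keeps the three roles separate, so the same skeleton works with stub readers in a test and with real collectors on a computing node.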
8. The method as claimed in claim 7, characterized in that the analysis comprises: locating the faulty computing node and determining the cause of the fault according to the data indexes.
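The claim does not specify how a faulty node is located. One simple possibility, offered only as an assumption and not as the patented analysis, is to compare each node's reading with the cluster median and flag the nodes that fall well below it. The function name and the `min_fraction` parameter are hypothetical.

```python
from statistics import median


def locate_faulty_nodes(gflops_by_node: dict, min_fraction: float = 0.5) -> list:
    """Return the nodes whose floating-point rate is below `min_fraction`
    of the cluster median, sorted by node name."""
    if not gflops_by_node:
        return []
    baseline = median(gflops_by_node.values())
    return sorted(node for node, value in gflops_by_node.items()
                  if value < min_fraction * baseline)
```

A median baseline is used so that one or two degraded nodes do not drag the reference value down with them. This sketch only locates candidates; determining the cause would need a further rule such as the sustained-threshold check described in the dependent claims.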
9. The method as claimed in claim 7 or 8, characterized in that: the hardware micro-architecture data indexes comprise one or a combination of more of: the real-time floating-point operation speed of the CPU, the utilization rate of the Streaming SIMD Extensions (SSE) units, the utilization rate of the Advanced Vector Extensions (AVX) units, the vectorization ratio of vector instructions, the clock cycles required to complete each instruction (CPI), the last-level cache (LLC) hit rate, the memory bandwidth, the PCI Express (PCI-E) device bandwidth, and the cache hit/miss rate; and the data indexes related to the processes of the applications running on the computing node comprise one or a combination of more of: the number of process switches, stack information, and heap memory allocation.
10. The method as claimed in claim 9, characterized in that: the data index is the real-time floating-point operation speed of the CPU and/or the clock cycles required to complete each instruction (CPI), and the analysis comprises: when the data index remains below a preset threshold throughout a preset time period, determining that the processor is faulty and that the cause of the fault is abnormal frequency reduction of the processor.
11. The method as claimed in claim 10, characterized in that: the monitoring node also acquires the CPU utilization, memory usage, local disk I/O data and/or Ethernet throughput provided by the operating system.
12. The method as claimed in claim 11, characterized in that: the hardware control registers of the computing node are the model-specific register (MSR) control registers in the performance monitoring unit (PMU) of the processor of the computing node.
CN201410377856.0A 2014-08-01 2014-08-01 System and method for intelligently monitoring large-scale data center cluster computing nodes Active CN104156296B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410377856.0A CN104156296B (en) 2014-08-01 2014-08-01 System and method for intelligently monitoring large-scale data center cluster computing nodes

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410377856.0A CN104156296B (en) 2014-08-01 2014-08-01 System and method for intelligently monitoring large-scale data center cluster computing nodes

Publications (2)

Publication Number Publication Date
CN104156296A true CN104156296A (en) 2014-11-19
CN104156296B CN104156296B (en) 2017-06-30

Family

ID=51881801

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410377856.0A Active CN104156296B (en) 2014-08-01 2014-08-01 System and method for intelligently monitoring large-scale data center cluster computing nodes

Country Status (1)

Country Link
CN (1) CN104156296B (en)

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104407959A (en) * 2014-12-12 2015-03-11 深圳中兴网信科技有限公司 Application based monitoring method and monitoring device
CN106325200A (en) * 2016-08-30 2017-01-11 江苏永冠给排水设备有限公司 Realization method of sodium hypochlorite generator equipment group control system based on online self-service
CN107205243A (en) * 2017-06-05 2017-09-26 柳州市盛景科技有限公司 Intelligent gateway with monitoring function
CN107257305A (en) * 2017-08-02 2017-10-17 郑州云海信息技术有限公司 Monitoring method and device for multi-node system
CN108108282A (en) * 2017-12-07 2018-06-01 联想(北京)有限公司 Information processing method and device and electronic equipment
CN108319538A (en) * 2018-02-02 2018-07-24 世纪龙信息网络有限责任公司 The monitoring method and system of big data platform operating status
CN108845878A (en) * 2018-05-08 2018-11-20 南京理工大学 The big data processing method and processing device calculated based on serverless backup
CN109040478A (en) * 2018-08-31 2018-12-18 北京云迹科技有限公司 The overload alarm method and device of phone box
CN109660537A (en) * 2018-12-20 2019-04-19 武汉钢铁工程技术集团通信有限责任公司 A method of real time monitoring and maintenance cloud platform physical resource service operation state
CN110928750A (en) * 2018-09-19 2020-03-27 阿里巴巴集团控股有限公司 Data processing method, device and equipment
CN110928738A (en) * 2018-09-19 2020-03-27 阿里巴巴集团控股有限公司 Performance analysis method, device and equipment
CN112148316A (en) * 2020-09-29 2020-12-29 联想(北京)有限公司 Information processing method and information processing device
CN112306802A (en) * 2020-10-29 2021-02-02 平安科技(深圳)有限公司 Data acquisition method, device, medium and electronic equipment of system
CN113574502A (en) * 2020-02-12 2021-10-29 深圳元戎启行科技有限公司 Data acquisition method and device for unmanned vehicle operating system
WO2023279815A1 (en) * 2021-07-08 2023-01-12 华为技术有限公司 Performance monitoring system and related method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102945198A (en) * 2012-10-19 2013-02-27 浪潮电子信息产业股份有限公司 Method for characterizing application characteristics of high performance computing
CN103246569A (en) * 2013-05-20 2013-08-14 浪潮(北京)电子信息产业有限公司 Method and device for representing high-performance calculation application characteristics
CN103501253A (en) * 2013-10-18 2014-01-08 浪潮电子信息产业股份有限公司 Monitoring organization method for high-performance computing application characteristics

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
易昭华: "大规模机群监控系统信息采集与存储技术研究", 《中国优秀博硕士学位论文全文数据库》 *

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104407959A (en) * 2014-12-12 2015-03-11 深圳中兴网信科技有限公司 Application based monitoring method and monitoring device
CN106325200A (en) * 2016-08-30 2017-01-11 江苏永冠给排水设备有限公司 Realization method of sodium hypochlorite generator equipment group control system based on online self-service
CN107205243A (en) * 2017-06-05 2017-09-26 柳州市盛景科技有限公司 Intelligent gateway with monitoring function
CN107257305A (en) * 2017-08-02 2017-10-17 郑州云海信息技术有限公司 Monitoring method and device for multi-node system
CN107257305B (en) * 2017-08-02 2020-05-15 苏州浪潮智能科技有限公司 Monitoring method and device for multi-node system
CN108108282A (en) * 2017-12-07 2018-06-01 联想(北京)有限公司 Information processing method and device and electronic equipment
CN108108282B (en) * 2017-12-07 2020-06-23 联想(北京)有限公司 Information processing method and device and electronic equipment
CN108319538A (en) * 2018-02-02 2018-07-24 世纪龙信息网络有限责任公司 The monitoring method and system of big data platform operating status
CN108845878A (en) * 2018-05-08 2018-11-20 南京理工大学 The big data processing method and processing device calculated based on serverless backup
CN109040478A (en) * 2018-08-31 2018-12-18 北京云迹科技有限公司 The overload alarm method and device of phone box
CN110928738A (en) * 2018-09-19 2020-03-27 阿里巴巴集团控股有限公司 Performance analysis method, device and equipment
CN110928750A (en) * 2018-09-19 2020-03-27 阿里巴巴集团控股有限公司 Data processing method, device and equipment
CN110928750B (en) * 2018-09-19 2023-04-18 阿里巴巴集团控股有限公司 Data processing method, device and equipment
CN110928738B (en) * 2018-09-19 2023-04-18 阿里巴巴集团控股有限公司 Performance analysis method, device and equipment
CN109660537A (en) * 2018-12-20 2019-04-19 武汉钢铁工程技术集团通信有限责任公司 A method of real time monitoring and maintenance cloud platform physical resource service operation state
CN113574502A (en) * 2020-02-12 2021-10-29 深圳元戎启行科技有限公司 Data acquisition method and device for unmanned vehicle operating system
CN112148316A (en) * 2020-09-29 2020-12-29 联想(北京)有限公司 Information processing method and information processing device
CN112148316B (en) * 2020-09-29 2022-04-22 联想(北京)有限公司 Information processing method and information processing device
CN112306802A (en) * 2020-10-29 2021-02-02 平安科技(深圳)有限公司 Data acquisition method, device, medium and electronic equipment of system
WO2023279815A1 (en) * 2021-07-08 2023-01-12 华为技术有限公司 Performance monitoring system and related method

Also Published As

Publication number Publication date
CN104156296B (en) 2017-06-30

Similar Documents

Publication Publication Date Title
CN104156296A (en) System and method for intelligently monitoring large-scale data center cluster computing nodes
CN104113585B (en) The method and apparatus that hardware level for producing instruction load balanced state interrupts
CN102254016B (en) Cloud-computing-environment-oriented fault-tolerant parallel Skyline inquiry method
WO2016101638A1 (en) Operation management method for electric power system cloud simulation platform
Xhafa et al. Processing and analytics of big data streams with yahoo! s4
CN104915793A (en) Public information intelligent analysis platform based on big data analysis and mining
CN105843182A (en) Power dispatching accident handling scheme preparing system and power dispatching accident handling scheme preparing method based on OMS
CN104156810A (en) Power dispatching production management system based on cloud computing and realization method of power dispatching production management system
CN104008443A (en) Mission planning and scheduling system of land observation satellite data ground receiving station network
CN108063699A (en) Network performance monitoring method, apparatus, electronic equipment, storage medium
CN104486255A (en) Service resource dispatching method and device
CN103618652A (en) Audit and depth analysis system and audit and depth analysis method of business data
CN107645410A (en) A kind of virtual machine management system and method based on OpenStack cloud platforms
CN102571499A (en) Monitoring method of cloud database server cluster
CN102945198B (en) Method for characterizing application characteristics of high performance computing
Ma et al. Review of power spatio-temporal big data technologies for mobile computing in smart grid
CN106874067A (en) Parallel calculating method, apparatus and system based on lightweight virtual machine
CN103246569A (en) Method and device for representing high-performance calculation application characteristics
CN103501253A (en) Monitoring organization method for high-performance computing application characteristics
Ouyang et al. Mitigating stragglers to avoid QoS violation for time-critical applications through dynamic server blacklisting
CN110168503A (en) Timeslice inserts facility
CN115471215B (en) Business process processing method and device
CN111984301A (en) Micro-service data management framework based on Spring Cloud and Kubernetes
CN106649034A (en) Visual intelligent operation and maintenance method and platform
CN110837970A (en) Regional health platform quality control method and system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant