CN104156296B - System and method for intelligently monitoring the compute nodes of a large-scale data center cluster - Google Patents


Publication number
CN104156296B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410377856.0A
Other languages
Chinese (zh)
Other versions
CN104156296A (en)
Inventor
刘羽
吕文静
金莲
陈博文
于涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Beijing Electronic Information Industry Co Ltd
Original Assignee
Inspur Beijing Electronic Information Industry Co Ltd

Abstract

A system and method for intelligently monitoring the compute nodes of a large-scale data center cluster are proposed. A monitor node in the system collects hardware micro-architecture metrics of a compute node, together with metrics related to the processes of the applications running on it, and sends these metrics to a monitoring device in the system. The monitoring device performs big data analysis and sends the result to a user terminal device, which displays it to the user. The system and method can collect both the micro-architecture metrics of the compute nodes and the process-level metrics of the running applications, perform intelligent big data analysis, automatically locate a faulty compute node, and report the cause of the fault.

Description

System and method for intelligently monitoring the compute nodes of a large-scale data center cluster
Technical field
The present invention relates to the field of computer technology, and in particular to a system and method for intelligently monitoring the compute nodes of a large-scale data center cluster.
Background
With the continuous progress of human society and the development of science and technology, people's understanding of nature has become ever broader, and the demand for further exploration ever more urgent. As a result, the volume of information data that humanity holds has grown sharply, and these massive data sets need to be analyzed and processed promptly. For example, a large astronomical radio telescope array can produce more than 100 GB of cosmic microwave data per second, all of which must be analyzed in time; likewise, in particle physics research, the data from a single collision in the Large Hadron Collider (LHC) is measured in TB. In addition, fields such as the Human Genome Project, oil exploration, and weather forecasting place ever higher demands on computing power. Against this background, numerical computation has become a third crucially important means of scientific exploration alongside experiment and theoretical analysis. This reality has driven the world's scientific and technological powers to develop supercomputers vigorously. For example, in the world TOP500 list issued in December 2013, China's first-ranked "Tianhe-2" (TH-2) reached a peak speed of 54.9 PFlops, using more than 16,000 compute nodes in total.
In addition, with the development of new technologies such as cloud computing, big data, and the Internet of Things, more and more large data centers and cloud computing centers have appeared. They often have tens of thousands of computer nodes. For example, Google's data center in The Dalles, Oregon, USA has about 150,000 server nodes. In a data center of this scale, performance monitoring, fault location, and fault recovery of compute nodes, as well as statistics on the overall efficiency of the center, all pose unprecedented challenges. How to manage and use a large-scale or even ultra-large-scale data center efficiently is therefore a popular field that countries around the world are working to explore.
For a long time, monitoring and management of data centers has been done manually or semi-automatically. Operations staff need to check the running status of the cluster in real time. When a problem occurs, they can sometimes locate the node, but often cannot pinpoint the faulty device and must rely on staff experience to diagnose and troubleshoot, which is time-consuming and laborious. Cluster users can follow the status of their own jobs through various job scheduling software, but can rarely obtain historical statistics on their jobs. Cluster decision makers often cannot obtain decision-relevant information such as expenditure, usage efficiency, staff productivity, and cost effectiveness directly from the cluster, and can only decide by manually analyzing large amounts of data, which is again time-consuming and laborious. In addition, application developers often cannot obtain from the cluster the information they urgently need to optimize application software, such as hardware micro-architecture metrics, system process and stack information, and module error and crash statistics, and must obtain it empirically through extensive experiments, which is likewise time-consuming and laborious.
Summary of the invention
The present invention proposes a system and method for intelligently monitoring the compute nodes of a large-scale data center cluster, characterized by large scale, multiple functions, and support for multiple user groups. It provides comprehensive intelligent analysis and statistics functions and can supply data to support the decisions of users at different levels.
The system comprises monitor nodes installed on the compute nodes of the data center cluster, a monitoring device that communicates with each monitor node, and a user terminal device, wherein:
The monitor node is configured to collect hardware micro-architecture metrics of the compute node by obtaining control of the compute node's hardware control registers, to obtain metrics related to the processes of the applications running on the compute node by obtaining control of the operating system kernel, and to send the metrics to the monitoring device;
The monitoring device is configured to receive the metrics, perform big data analysis based on the metrics, and send the result of the analysis to the user terminal device;
The user terminal device is configured to receive the result and display it to the user.
The method comprises:
Starting the monitor node on the compute node;
The monitor node collects hardware micro-architecture metrics of the compute node by obtaining control of the compute node's hardware control registers, obtains metrics related to the processes of the applications running on the compute node by obtaining control of the operating system kernel, and sends the metrics to the monitoring device;
The monitoring device receives the metrics, performs big data analysis based on the metrics, and sends the result of the analysis to the user terminal device;
The user terminal device receives the result and displays it to the user.
In particular, the analysis includes: locating the faulty compute node according to the metrics, and determining the cause of the fault.
In particular, the hardware micro-architecture metrics include one or more of: the real-time floating-point operation speed of the CPU, Streaming SIMD Extensions (SSE) unit utilization, Advanced Vector Extensions (AVX) unit utilization, the vectorization ratio of vector instructions, clock cycles per instruction (CPI), last-level cache (LLC) hit rate, memory bandwidth, PCI Express (PCI-E) device bandwidth, and cache hit/miss rate. The metrics related to the processes of the applications running on the compute node include one or more of: the number of process switches, stack information, and heap memory allocation.
In particular, when the metric is the real-time floating-point operation speed of the CPU and/or the clock cycles per instruction (CPI), the analysis includes: when the metric remains below a preset threshold throughout a preset time period, determining that the processor has failed and that the cause of the failure is abnormal processor frequency reduction.
In particular, the monitor node also collects CPU utilization, memory utilization, local disk I/O data, and/or Ethernet throughput provided by the operating system.
In particular, the hardware control register of the compute node is an MSR control register in the performance monitoring unit (PMU) of the compute node's processor.
The beneficial effects of the invention are as follows:
The performance monitoring component on each compute node extracts the necessary system-level performance metrics and sends them to the monitoring management node, which maintains them. The monitoring management node can identify anomalies and raise alarms; it also mines the recorded historical data separately for each user group and feeds the results back to the users. In addition, the monitoring management node can, on demand and for a specified time period, extract hardware micro-architecture features and process and stack information from a specified monitor node. Monitoring of large-scale clusters thereby becomes multi-user, multi-functional, and intelligent.
To keep the monitoring data timely, the monitoring client on each compute node refreshes once per second. To reduce resource usage on the compute node, each compute node extracts only the minimum set of metrics necessary for data analysis: about a dozen metrics, including CPU utilization, memory utilization, local disk read/write, and Ethernet throughput.
To provide multiple functions, the intelligent monitoring system also monitors and analyzes metrics related to the hardware micro-architecture, such as floating-point operation speed, vectorization ratio, memory bandwidth, and IB bandwidth. Because monitoring these metrics consumes more system resources, it is started on demand according to user instructions.
To support multiple users, the intelligent monitoring system provides hierarchical views for four levels: the management level, the operations level, the application user level, and the application development level.
To provide intelligence, the intelligent monitoring system uses a data mining analysis method that computes, from the basic performance monitoring data, the statistical indicators of most interest to users at each level.
Brief description of the drawings
Fig. 1 is a system block diagram of the system for intelligently monitoring a large-scale data center cluster proposed by the invention.
Fig. 2 is a flow chart of the method for intelligently monitoring a large-scale data center cluster proposed by the invention.
Detailed description of the embodiments
To make the objects, technical solutions, and advantages of the invention clearer, the invention is described in further detail below with reference to the accompanying drawings.
Referring to Fig. 1, the system for intelligently monitoring the compute nodes of a large-scale data center cluster proposed by the invention comprises monitor nodes installed on the compute nodes of the data center cluster, a monitoring device connected to each monitor node, and a user terminal device. Each compute node has the corresponding hardware, such as a processor (CPU), memory, hard disk, and Ethernet controller, and runs an operating system and application software. The monitoring device includes a main monitor node and a database. The main monitor node communicates with each monitor node on the compute nodes and can obtain hardware and software operating data for the compute nodes, such as CPU utilization, memory utilization, local disk I/O data, and Ethernet throughput, as well as micro-architecture metrics of the compute node hardware and process-level metrics of the running applications. The main monitor node writes the data it obtains into the database, automatically performs big data mining, and saves the mining results. The user reads the results from the database through the user terminal device, which displays them. The user can also submit a user-defined data mining program to the monitoring device through the user terminal device; the monitoring device then extracts the corresponding metrics of the cluster nodes, performs big data mining according to the user-defined program, and shows the results to the user.
Referring to Fig. 2, the method for intelligently monitoring the compute nodes of a large-scale data center cluster proposed by the invention consists of several main steps: data acquisition, big data mining, classified display, and fault location and alarm. Data acquisition comprises basic data acquisition and advanced data acquisition. Basic data acquisition is performed automatically by the system and requires no user configuration; advanced data acquisition is configured according to the user's needs.
1. Data acquisition
Data acquisition means installing a monitor node on each compute node of the data center cluster to extract the compute node's CPU utilization, memory utilization, local disk I/O data, and Ethernet throughput, as well as the micro-architecture metrics of the compute node hardware and the process-level metrics of the running applications. Collection of the hardware micro-architecture metrics and the process-level application metrics is called advanced data acquisition; collection of the remaining metrics is called basic data acquisition. Basic data acquisition is a system default step and runs without user intervention; advanced data acquisition is configured and executed according to user needs. Because the performance metrics must be timely, the monitor node must be able to refresh its measurements every second while keeping its resource usage on the compute node extremely low.
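As an illustration of basic data acquisition only, the sketch below derives CPU utilization from two samples of the aggregate `cpu` line that Linux exposes in `/proc/stat`. The patent does not name an operating system or an interface, so the use of Linux, the file path, and the function names `parse_cpu_line`, `cpu_utilization`, and `read_sample` are all assumptions.

```python
def parse_cpu_line(line):
    """Return (busy, total) jiffies from the aggregate 'cpu' line of /proc/stat."""
    fields = [int(x) for x in line.split()[1:]]
    # Field order on Linux: user nice system idle iowait irq softirq steal ...
    idle = fields[3] + (fields[4] if len(fields) > 4 else 0)
    # Guest time is already included in user/nice, so only the first 8 fields count.
    total = sum(fields[:8])
    return total - idle, total


def cpu_utilization(before, after):
    """Utilization in [0, 1] between two (busy, total) samples."""
    busy = after[0] - before[0]
    total = after[1] - before[1]
    return busy / total if total > 0 else 0.0


def read_sample(path="/proc/stat"):
    """Read one sample from the running system (Linux only, assumed path)."""
    with open(path) as f:
        return parse_cpu_line(f.readline())
```

A monitor node that refreshes every second could call `read_sample()` once per second and pass consecutive samples to `cpu_utilization`; a single short file read per interval keeps the collection overhead small.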
The data acquisition method proposed by the invention differs from prior-art methods. In the prior art, data acquisition only collects metrics that the operating system itself provides; that is, metric collection depends on the operating system running on the compute node, and the monitor node cannot obtain metrics that the operating system does not provide. The data acquisition method proposed by the invention can collect not only the operating-system-provided metrics above but also hardware micro-architecture metrics, such as the real-time floating-point operation speed of the CPU, Streaming SIMD Extensions (SSE) unit utilization, Advanced Vector Extensions (AVX) unit utilization, the vectorization ratio of vector instructions, clock cycles per instruction (CPI), last-level cache (LLC) hit rate, translation lookaside buffer (TLB) parameters, memory bandwidth, PCI Express (PCI-E) device bandwidth, and cache hit/miss rate. It can also collect process-level metrics of applications, such as the number of process switches, stack information, and heap memory allocation. These metrics are of great value for tuning application performance, analyzing cluster characteristics, and locating software-level faults.
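Two of the micro-architecture metrics named above follow directly from raw counter differences over a sampling interval. The sketch below shows one plausible derivation, assuming the monitor node already has the counter deltas; the function name `derived_metrics` and its parameters are hypothetical and not taken from the patent.

```python
def derived_metrics(cycles, instructions, fp_ops, elapsed_s):
    """Derive CPI and floating-point rate from counter deltas over one interval.

    cycles, instructions, fp_ops: differences between two counter readings.
    elapsed_s: length of the sampling interval in seconds.
    """
    if instructions <= 0 or elapsed_s <= 0:
        raise ValueError("need a non-empty sampling interval")
    return {
        # Clock cycles needed per completed instruction.
        "cpi": cycles / instructions,
        # Floating-point operations per second, expressed in GFLOPS.
        "gflops": fp_ops / elapsed_s / 1e9,
    }
```

For example, 2 billion cycles and 1 billion instructions in a one-second interval give a CPI of 2.0.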
Because both hardware-level and process-level metrics must be collected, the monitor node proposed by the invention is implemented as a software client. The monitor node collects basic data in the same way as the prior art, which is not repeated here. Advanced data acquisition is described in detail below:
Extracting the hardware micro-architecture metrics above requires controlling the relevant hardware registers. For example, processor micro-architecture metrics are obtained mainly by controlling the performance monitoring unit (PMU) in the processor. This requires the monitor node in this scheme to have the highest (root) privilege. The control flow for the PMU is as follows:
S1: Obtain control of the MSR (Model Specific Register) control registers in the PMU of the compute node's processor;
S2: Write the code and mask of the relevant event into the MSR control register now under control, and set the control register to start counting the event. For example, to collect the LLC hit rate, first write the code and mask of the LLC hit event into the MSR control register, then set the register to start counting LLC hits; after counting ends, read the count from the register and compute the LLC hit rate.
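The encoding step in S2 can be sketched as follows. The patent does not name a processor family, so this example assumes the Intel architectural performance monitoring layout, in which an event-select register holds the event code in bits 0-7, the unit mask in bits 8-15, the user and kernel mode flags in bits 16 and 17, and the enable flag in bit 22. The register addresses, the event codes, and the helper names `evtsel` and `llc_hit_rate` are assumptions made for illustration. The hit rate is computed here from reference and miss counts, which is one way to obtain it, not necessarily the one the inventors used.

```python
# Assumed Intel architectural PMU constants; not stated in the patent.
IA32_PERFEVTSEL0 = 0x186  # event-select MSR for general-purpose counter 0
IA32_PMC0 = 0x0C1         # general-purpose counter 0

# (event code, unit mask) pairs for the architectural LLC events.
LLC_REFERENCE = (0x2E, 0x4F)
LLC_MISS = (0x2E, 0x41)


def evtsel(event, umask, user=True, kernel=True, enable=True):
    """Build the value to write into an event-select MSR (step S2)."""
    value = (event & 0xFF) | ((umask & 0xFF) << 8)
    if user:
        value |= 1 << 16  # count while in user mode
    if kernel:
        value |= 1 << 17  # count while in kernel mode
    if enable:
        value |= 1 << 22  # start the counter
    return value


def llc_hit_rate(references, misses):
    """Hit rate from counted LLC references and misses; None if nothing was counted."""
    if references == 0:
        return None
    return (references - misses) / references
```

Actually writing the value requires privileged access to the MSRs, which matches the root requirement stated above; on Linux this is commonly done through the `msr` kernel module's per-CPU device files, but that mechanism is also an assumption here.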
Extracting system kernel-level metrics requires monitoring the relevant code in the kernel. For example, monitoring process switches requires monitoring the code in the process management part of the kernel that controls processes. Monitoring starts after the kernel has loaded successfully when the compute node boots. The monitor node must therefore have kernel-level control. Extracting kernel-level metrics may slightly affect system performance, so it can be enabled on demand for specific monitoring scenarios.
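The text describes instrumenting kernel code to count process switches. As a simpler stand-in, and not the mechanism the patent describes, the sketch below reads the per-process context-switch counters that a Linux kernel already exports in `/proc/<pid>/status`. The choice of Linux and the function names `parse_context_switches` and `read_context_switches` are assumptions.

```python
def parse_context_switches(status_text):
    """Extract context-switch counters from the text of /proc/<pid>/status."""
    wanted = ("voluntary_ctxt_switches", "nonvoluntary_ctxt_switches")
    counts = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key in wanted:
            counts[key] = int(value.strip())
    return counts


def read_context_switches(pid):
    """Read the counters for one running process (Linux only, assumed path)."""
    with open(f"/proc/{pid}/status") as f:
        return parse_context_switches(f.read())
```

Sampling these counters at two points in time and subtracting gives the number of switches in the interval, at a much lower cost than tracing every switch inside the kernel.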
2. Big data mining and classified display
The monitor nodes on the compute nodes can also send data to the monitoring device, which receives data from and manages all monitor nodes centrally. The main monitor node in the monitoring device receives the metrics collected by each monitor node and sends control commands to each monitor node. The control commands include basic data acquisition commands generated by system default and advanced data acquisition commands generated from user settings; each monitor node collects the corresponding metrics according to the commands it receives. The main monitor node also stores the received metrics in the database in a defined storage format, as input for the subsequent data mining.
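The storage step can be pictured with a small sketch. The patent does not specify the database or the storage format, so the SQLite backend, the single `metrics` table, and the function names `open_store`, `store_report`, and `node_average` below are purely illustrative assumptions.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open the metric store; one row per (node, timestamp, metric name)."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS metrics ("
        " node TEXT NOT NULL, ts REAL NOT NULL,"
        " name TEXT NOT NULL, value REAL NOT NULL)"
    )
    return db


def store_report(db, node, ts, metrics):
    """Persist one report received from a monitor node."""
    db.executemany(
        "INSERT INTO metrics (node, ts, name, value) VALUES (?, ?, ?, ?)",
        [(node, ts, name, value) for name, value in metrics.items()],
    )
    db.commit()


def node_average(db, node, name):
    """Average of one metric for one node; None if no samples are stored."""
    row = db.execute(
        "SELECT AVG(value) FROM metrics WHERE node = ? AND name = ?",
        (node, name),
    ).fetchone()
    return row[0]
```

Keeping every metric in the same long-format table lets later mining queries aggregate by node, by metric, or by time window without schema changes when a new metric is added.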
To provide intelligence, the monitoring device also has big data mining capability: it processes the metrics stored in the database according to preset statistics settings and, following a preset classified display scheme, provides statistics and analysis results to each type of user. The monitoring device also has a user interface through which it can receive a custom data mining algorithm and perform data mining according to that algorithm. The preset statistics include:
I. Management-level user group indicators
1. Throughput (task throughput)
A. Real-time number of running tasks and applications
B. Number of tasks completed (failed) per day over one week (month, year) [chart, table]
C. Average number of tasks completed (failed) per day over one week (month, year)
D. Total number of tasks completed (failed) over one week (month, year)
E. Time per task
2. Operating cost (energy consumption) (compute, storage, switching, machine room [cooling])
A. Real-time total power consumption
B. Energy consumption per day (kWh) over one week (month, year) [chart, table]
C. Average energy consumption per day (kWh) over one week (month, year)
D. Total energy consumption (kWh) over one week (month, year)
E. Equipment depreciation, monitoring of overall machine-room amortized cost, ratio statistics between cost items, and jobs completed per unit cost
3. Asset utilization efficiency
A. Daily cluster duty cycle over one week (month, year)
B. Average daily cluster duty cycle over one week (month, year)
C. Daily cluster peak hours over one week (month, year) (cluster duty cycle computed per hour)
D. Consistently busy time periods over one week (month, year) (annual duty cycle over the 24-hour cycle)
E. Real-time number of online users (with special authorization, personal information can be viewed)
F. Number of online users per day over one week (month, year) [chart, table]
G. Average number of online users per day over one week (month, year)
H. Average number of tasks completed per user per day over one week (month, year)
I. Average number of tasks completed per user over one week (month, year)
4. Equipment health
A. Real-time number of faulty nodes and fault rate
B. Number of faulty nodes and fault rate per day over one week (month, year) [chart, table]
C. Average number of faulty nodes and fault rate per day over one week (month, year)
II. Cluster equipment management and maintenance staff user group indicators
1. Fault alarm and location
A. Real-time number of faulty nodes and fault rate
B. Record of faulty nodes and fault rate per day over one week (month, year) [chart, table]
C. Average number of faults per node and fault rate per node over one week (month, year) (to identify fault-prone nodes)
D. Real-time location of faulty nodes
E. Real-time alarms for faulty nodes
F. Classification of fault types for faulty or failed nodes: reachable, unreachable, powered off, etc.
G. For reachable faults, pinpointing of the faulty device: position of a failed disk, failed memory (position), etc.
2. Equipment running status
A. Real-time overall cluster CPU utilization and centralized storage I/O bandwidth
B. Daily average overall cluster CPU utilization and average centralized storage I/O bandwidth over one week (month, year)
C. Average overall cluster CPU utilization and average centralized storage I/O bandwidth over one week (month, year)
D. Real-time view of each node's running status: CPU, memory, local disk, network, and other metrics
E. Historical query of the daily running status of all nodes over one year
F. Resource bottleneck analysis (CPU, storage, memory, network [distinguishing storage and data exchange])
3. Billing
A. Statistics on users' machine time
III. Task user group indicators
1. Current task information
A. Number of nodes and cores used by the current task, memory occupied, etc.
B. Status of the nodes used by the current task: CPU, memory, local disk, network, etc.
C. Number of tasks currently queued
D. Queuing time of the current task
2. Historical task statistics
A. The user's total historical run time
B. The user's average historical run time
C. Number of historical tasks the user completed (failed)
D. Task success rate (number of successful tasks / number of failed tasks)
E. Number of nodes and cores used by the user's historical tasks
F. Average number of nodes and cores used per historical task of the user
G. Average queuing time of historical tasks
IV. Application software developer user group indicators
1. Program (module) usage statistics
A. Total number of modules processed (failed) per day over one week (month, year)
B. Module failure rate over one week (month, year)
C. Module usage popularity statistics and ranking over one week (month, year), and each module's share of total uses
D. Failed-module statistics and ranking over one week (month, year), and each failed module's share of failures
2. Performance tracing indicators
A. Load of the services used by all applications (database, file system, job scheduling, intermediate acceleration layer, parallel framework, etc.)
B. Micro-architecture-level information: cache hit/miss rates, TLB
C. Operating-system-level information: number of processes, process switches, stack, heap memory allocation, etc.
3. User usage-habit statistics
A. Data access latency, dwell time, I/O access patterns, etc. of interactive applications
Finally, the monitoring device presents the statistics mined above on the user terminal device, separated by the specified user level.
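As a minimal sketch of how preset statistics such as the daily and average completed (failed) task counts in the management-level list might be computed, the example below aggregates a list of task records. The record layout `(day, status)`, the status labels, and the function names `daily_task_counts` and `average_per_day` are hypothetical; the patent does not describe the mining computation at this level of detail.

```python
from collections import defaultdict


def daily_task_counts(tasks):
    """Count tasks per day.

    tasks: iterable of (day, status) pairs, where status is either
    'completed' or 'failed' (assumed labels).
    """
    counts = defaultdict(lambda: {"completed": 0, "failed": 0})
    for day, status in tasks:
        counts[day][status] += 1
    return dict(counts)


def average_per_day(counts, status):
    """Average number of tasks with the given status per recorded day."""
    if not counts:
        return 0.0
    return sum(c[status] for c in counts.values()) / len(counts)
```

Called with the records of one week, month, or year, the two functions yield the per-day series suitable for a chart or table and the corresponding daily average.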
In the embodiments of the invention, data mining is differentiated by user type. The mining items listed here were compiled after thorough analysis of the actual needs and concerns of each type of user. Common monitoring tools do not provide such indicators, so the data must be exported and analyzed manually, whereas the implementation proposed by the invention performs the analysis intelligently and automatically. The proposed implementation also reserves an interface for custom data mining, so that user-defined data mining programs can be executed.
3. Fault location and alarm
Through the data mining analysis above, the current performance metrics of a compute node's devices can be obtained, and from them it can be determined whether a device has failed and why. Fault information can be shown to specific users through the intelligent display module of the user terminal device. In addition, a fault alarm module, such as an audio device or a light unit, can be installed on the user terminal device to issue a warning when a device fails, so that maintenance staff notice the faulty device quickly and clear the fault promptly.
Faults in devices or application software are reflected in the collected performance metrics. For simplicity and ease of use, the invention locates faults by analyzing anomalies in these metrics; some faults, particularly performance-related ones, cannot be detected by conventional methods. For example, poor cooling in the cluster may cause processors to run at reduced frequency, which does not trigger an alarm under normal fault monitoring. With the method proposed by the invention, processor micro-architecture metrics are collected, so the processor's floating-point operation speed and clock cycles per instruction (CPI) can be monitored in real time. When a monitored node is under heavy load and these two metrics remain below the preset threshold for an extended period, the monitoring device determines that a fault has occurred and raises an intelligent alarm; the cause of the fault, namely abnormal processor frequency reduction, is located at the same time.
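The detection rule described above (heavy load combined with a metric that stays below a preset threshold for an extended period) might be sketched as follows. The one-sample-per-second layout, the `busy_load` cutoff of 0.9, and the function name `sustained_below` are assumptions for illustration; the patent does not give concrete thresholds or window lengths.

```python
def sustained_below(samples, threshold, window, busy_load=0.9):
    """Return True if the node looks abnormally throttled.

    samples: list of (cpu_util, metric_value) pairs, oldest first,
             assumed to be taken once per second.
    threshold: preset lower bound for the metric (e.g. floating-point rate).
    window: number of most recent samples that must all violate the bound.
    busy_load: minimum CPU utilization that counts as heavy load (assumed).
    """
    if len(samples) < window:
        return False  # not enough history to call it sustained
    recent = samples[-window:]
    return all(util >= busy_load and value < threshold for util, value in recent)
```

Requiring every sample in the window to violate the bound avoids alarms on short dips, and the load check keeps an idle node, whose floating-point rate is naturally low, from being reported as throttled.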
Of course, the invention may have various other embodiments. Without departing from the spirit and essence of the invention, those skilled in the art can make corresponding changes and modifications according to the invention, but all such changes and modifications shall fall within the scope of the claims of the invention.

Claims (12)

1. A system for intelligently monitoring the compute nodes of a large-scale data center cluster, comprising monitor nodes installed on the compute nodes of the data center cluster, a monitoring device that communicates with each monitor node, and a user terminal device, characterized in that:
The monitor node is configured to collect hardware micro-architecture metrics of the compute node by obtaining control of the compute node's hardware control registers, to obtain metrics related to the processes of the applications running on the compute node by obtaining control of the operating system kernel, and to send the metrics to the monitoring device;
The monitoring device is configured to receive the metrics, perform big data analysis based on the metrics, and send the result of the analysis to the user terminal device; the user terminal device is configured to receive the result and display it to the user;
The metrics related to the processes of the applications running on the compute node include one or more of: the number of process switches, stack information, and heap memory allocation;
The monitoring device is further configured to perform big data processing on the metrics stored in a database according to preset statistics settings and, following a preset classified display scheme, to provide statistics and analysis results to each type of user;
The monitoring device also has a user interface, specifically for receiving a custom data mining algorithm, and performs data mining according to that data mining algorithm.
2. the system as claimed in claim 1, it is characterised in that the analysis includes:Positioned according to the data target and occurred The calculate node of failure, and determine failure cause.
3. system as claimed in claim 1 or 2, it is characterised in that:The hardware micro-architecture data target includes that CPU's is real-time The floating-point speed of service, stream SIMD instruction superset SSE unit utilization rate, high-level vector superset AVX units utilization rate, vector refer to Clock number CPI, afterbody caching LLC hit rate needed for making vectorization ratio, every instruction of completion, memory bandwidth, PCI are quick One or more in EBI PCI-E device bandwidth, cache hit/miss rate of combination.
4. system as claimed in claim 3, it is characterised in that:The data target for CPU the real-time floating-point speed of service and/ Or every required clock number CPI of instruction is completed, the analysis includes:When the data target is persistently low in preset time period In default threshold value, then decision processor breaks down, and is processor exception frequency reducing the reason for determine failure.
5. The system according to claim 1, characterized in that: the monitoring node also collects the CPU utilization, memory utilization, local disk I/O data and/or Ethernet throughput provided by the operating system.
6. The system according to claim 1, characterized in that: the hardware control register of the compute node is an MSR control register in the performance monitoring unit (PMU) of the processor of the compute node.
7. A method for intelligently monitoring compute nodes of a large-scale data center cluster, characterized by:
starting a monitoring node on a compute node;
the monitoring node collects the hardware micro-architecture data indicators of the compute node by obtaining control of the hardware control register of the compute node, obtains the data indicators related to the processes of the application running on the compute node by obtaining control of the operating system kernel, and sends the data indicators to a monitoring device;
the monitoring device receives the data indicators, performs big data analysis based on the data indicators, and sends the result of the analysis to a user terminal device;
the user terminal device receives the result and displays it to the user;
the data indicators related to the processes of the application running on the compute node include one or a combination of more of: the number of process switches, stack information, and heap memory allocation;
the monitoring device performs big data processing on the data indicators stored in a database according to preset statistical settings, and provides data statistics and analysis results to different users according to a preset classified display scheme;
the monitoring device further has a user interface, receives a user-defined data mining algorithm, and performs data mining according to that data mining algorithm.
8. The method according to claim 7, characterized in that the analysis includes: locating a faulty compute node according to the data indicators, and determining the cause of the fault.
9. The method according to claim 7 or 8, characterized in that: the hardware micro-architecture data indicators include one or a combination of more of: the real-time floating-point operation speed of the CPU, Streaming SIMD Extensions (SSE) unit utilization, Advanced Vector Extensions (AVX) unit utilization, vector instruction vectorization ratio, clock cycles required per completed instruction (CPI), last-level cache (LLC) hit rate, memory bandwidth, PCI Express (PCI-E) device bandwidth, and cache hit/miss rate.
10. The method according to claim 9, characterized in that: the data indicator is the real-time floating-point operation speed of the CPU and/or the clock cycles required per completed instruction (CPI), and the analysis includes: when the data indicator remains below a preset threshold throughout a preset time period, determining that the processor has failed, and determining that the cause of the fault is abnormal processor frequency reduction.
11. The method according to claim 10, characterized in that: the monitoring node also collects the CPU utilization, memory utilization, local disk I/O data and/or Ethernet throughput provided by the operating system.
12. The method according to claim 11, characterized in that: the hardware control register of the compute node is an MSR control register in the performance monitoring unit (PMU) of the processor of the compute node.
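The fault rule in claims 4 and 10 (an indicator staying below a preset threshold for a preset time period) can be illustrated with a minimal Python sketch. This is not the patented implementation. The names `Sample`, `find_sustained_low` and `diagnose`, the units, and the shape of the returned report are all assumptions introduced here for illustration only.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass
class Sample:
    """One reading of a data indicator for a single compute node (hypothetical type)."""
    timestamp: float  # seconds; any monotonically increasing clock
    value: float      # e.g. measured floating-point operation speed


def find_sustained_low(samples: Iterable[Sample], threshold: float,
                       window_seconds: float) -> Optional[Tuple[float, float]]:
    """Return (start, end) of the first run in which every sample is below
    `threshold` and the run spans at least `window_seconds`, else None."""
    run_start = None
    for s in samples:
        if s.value < threshold:
            if run_start is None:
                run_start = s.timestamp  # a new low run begins here
            if s.timestamp - run_start >= window_seconds:
                return (run_start, s.timestamp)
        else:
            run_start = None  # any sample at or above the threshold resets the run
    return None


def diagnose(node_id: str, samples: Iterable[Sample], threshold: float,
             window_seconds: float) -> Optional[dict]:
    """Apply the rule of claims 4 and 10: a sustained low indicator is reported
    as a processor fault caused by abnormal frequency reduction."""
    hit = find_sustained_low(samples, threshold, window_seconds)
    if hit is None:
        return None
    return {
        "node": node_id,
        "fault": "processor",
        "cause": "abnormal processor frequency reduction",
        "from": hit[0],
        "to": hit[1],
    }
```

In this sketch a single sample at or above the threshold resets the run, which is one reading of "remains below" in the claims. A real monitoring device might instead tolerate isolated spikes or use a moving average.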
CN201410377856.0A 2014-08-01 2014-08-01 The system and method for intelligent monitoring large-scale data center cluster calculate node Active CN104156296B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410377856.0A CN104156296B (en) 2014-08-01 2014-08-01 The system and method for intelligent monitoring large-scale data center cluster calculate node

Publications (2)

Publication Number Publication Date
CN104156296A CN104156296A (en) 2014-11-19
CN104156296B true CN104156296B (en) 2017-06-30

Family

ID=51881801

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410377856.0A Active CN104156296B (en) 2014-08-01 2014-08-01 The system and method for intelligent monitoring large-scale data center cluster calculate node

Country Status (1)

Country Link
CN (1) CN104156296B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104407959A (en) * 2014-12-12 2015-03-11 深圳中兴网信科技有限公司 Application based monitoring method and monitoring device
CN106325200B (en) * 2016-08-30 2019-04-23 江苏永冠给排水设备有限公司 A kind of implementation method based on self-service hypochlorite generator's group control of equipment system of networking
CN107205243A (en) * 2017-06-05 2017-09-26 柳州市盛景科技有限公司 A kind of intelligent gateway for possessing monitoring function
CN107257305B (en) * 2017-08-02 2020-05-15 苏州浪潮智能科技有限公司 Monitoring method and device for multi-node system
CN108108282B (en) * 2017-12-07 2020-06-23 联想(北京)有限公司 Information processing method and device and electronic equipment
CN108319538B (en) * 2018-02-02 2019-11-08 世纪龙信息网络有限责任公司 The monitoring method and system of big data platform operating status
CN108845878A (en) * 2018-05-08 2018-11-20 南京理工大学 The big data processing method and processing device calculated based on serverless backup
CN109040478A (en) * 2018-08-31 2018-12-18 北京云迹科技有限公司 The overload alarm method and device of phone box
CN110928738B (en) * 2018-09-19 2023-04-18 阿里巴巴集团控股有限公司 Performance analysis method, device and equipment
CN110928750B (en) * 2018-09-19 2023-04-18 阿里巴巴集团控股有限公司 Data processing method, device and equipment
CN109660537A (en) * 2018-12-20 2019-04-19 武汉钢铁工程技术集团通信有限责任公司 A method of real time monitoring and maintenance cloud platform physical resource service operation state
CN112148316B (en) * 2020-09-29 2022-04-22 联想(北京)有限公司 Information processing method and information processing device
CN112306802A (en) * 2020-10-29 2021-02-02 平安科技(深圳)有限公司 Data acquisition method, device, medium and electronic equipment of system
WO2023279815A1 (en) * 2021-07-08 2023-01-12 华为技术有限公司 Performance monitoring system and related method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102945198A (en) * 2012-10-19 2013-02-27 浪潮电子信息产业股份有限公司 Method for characterizing application characteristics of high performance computing
CN103246569A (en) * 2013-05-20 2013-08-14 浪潮(北京)电子信息产业有限公司 Method and device for representing high-performance calculation application characteristics
CN103501253A (en) * 2013-10-18 2014-01-08 浪潮电子信息产业股份有限公司 Monitoring organization method for high-performance computing application characteristics

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
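Research on Information Collection and Storage Technology for Large-Scale Cluster Monitoring Systems (大规模机群监控系统信息采集与存储技术研究); Yi Zhaohua (易昭华); China Excellent Doctoral and Master's Theses Full-text Database (《中国优秀博硕士学位论文全文数据库》); 2006-06-15 (No. 01); pp. I140-86 *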

Also Published As

Publication number Publication date
CN104156296A (en) 2014-11-19

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant