CN107943668B

CN107943668B - Computer server cluster log monitoring method and monitoring platform

Info

Publication number: CN107943668B
Application number: CN201711353494.1A
Authority: CN
Inventors: 尤福宝; 汤成辉; 徐文渊; 黄云辉
Original assignee: Jiangsu Shenwei Cloud Technology Co Ltd
Current assignee: Jiangsu Shenwei Cloud Technology Co Ltd
Priority date: 2017-12-15
Filing date: 2017-12-15
Publication date: 2019-02-26
Anticipated expiration: 2037-12-15
Also published as: CN107943668A

Abstract

The present invention relates to the field of computer technology, and in particular to a computer server cluster log monitoring method and monitoring platform. The monitoring method comprises the following steps: A. monitoring the massive machine code instructions generated by the mainboard of each server in a computer server cluster during operation, and acquiring the machine code instructions generated by the mainboard in real time using a real-time streaming data acquisition framework from big data technology; B. performing data classification and translation on the collected data, including real-time filtering, processing and storage; C. analyzing and displaying the data processed in step B in real time, and obtaining and storing a real-time analysis result; D. analyzing the real-time analysis result together with the stored historical analysis results to obtain potential abnormality results for the computer servers, and issuing early warnings and notifications. The present invention has the advantages of being real-time, scalable, automated and highly available, and of offering rich API extensions.

Description

Computer server cluster log monitoring method and monitoring platform

Technical field

The present invention relates to the field of computer technology, and in particular to a computer server cluster log monitoring method and monitoring platform.

Background art

With the arrival of the information age, technologies such as big data, cloud computing and machine learning have become research hotspots in the computer field. These technologies share one characteristic: they require complex and large-scale computation. A large number of computer servers are therefore often combined into one or more clusters that compute in parallel and jointly complete one or more computing tasks. As demand has grown in recent years, the scale of computer server clusters has developed from the original dozens of machines to hundreds, thousands or even tens of thousands.

As computer server clusters grow larger, managing cluster performance indicators (such as each server's CPU, memory and network usage and its disk I/O read/write status) and discovering and handling cluster problems in time have become a major difficulty for machine room operation and maintenance personnel.

In the prior art, the main methods used by machine room operation and maintenance personnel are:

(1) regularly inspecting the machine room and checking whether the indicator lights of the servers show an alarm;

(2) using free server monitoring software from the Internet to assist management.

However, as the number of servers keeps increasing, relying only on machine room managers to inspect thousands of servers regularly and to judge and troubleshoot problems by eye not only involves an excessive workload but also easily leads to false detections and missed detections. Using monitoring software from the Internet to assist management also poses serious security problems: because the internal structure of such software is unknown, using it rashly carries a risk of Trojan, virus or hacker attacks. In addition, such monitoring software is usually suitable only for clusters with a small number of computer servers. When there are a few dozen servers the software performs well, but once the number of servers reaches hundreds, thousands or even tens of thousands, its performance declines significantly, to the point where the software may be unable to support the cluster at all.

In view of the above problems, researchers in China have also developed some log collection or monitoring methods specifically for computer server clusters. For example, the invention patent application with Chinese Patent Publication No. CN105095502A discloses a log collection method for a cluster storage system. The method comprises a log management module, a log collection module and a log agent module. The log management module runs on the monitoring node in the cluster, is responsible for managing and coordinating the log collection module and the log agent module, and is integrated into the service flow of the cluster storage system in an embedded manner. The log collection module runs on the monitoring node in the cluster, is responsible for collecting and managing the data pushed by multiple log agent modules, and sorts and stores the data under a designated directory; the size of the log collection module is configured dynamically according to the scale of the cluster. The log agent module runs on every node in the cluster, is responsible for collecting the logs of the cluster storage system on its own node, and pushes the logs to the log collection module. Each log agent module can monitor 1024 files, and the log transmission security level attributes in the configuration file of the log agent module are E2E and SendOnly. Although this method can collect logs, it has disadvantages such as not being real-time, not using distributed storage and providing no abnormality warning, and it cannot directly visualize the information of the servers in the computer cluster through a system platform, which is unfavorable for real-time monitoring by machine room operation and maintenance personnel.

The invention patent application with Chinese Patent Publication No. CN106326008 also discloses a monitoring method for cluster systems, whose technical solution mainly comprises the following steps: step 1, collecting the detailed attributes and basic working state of each node in the cluster system, and generating a report log of the basic working state of each node; step 2, judging, according to the basic working state of each node obtained in step 1, whether any node exceeds the node threshold or has stopped working due to a fault; if the basic working state of several nodes exceeds the preset threshold or is in a stopped state, scanning and compiling the resource usage of the entire cluster system, judging whether that resource usage exceeds the system threshold, and generating a resource usage report log for the entire cluster system; step 3, if the resource usage of the entire cluster system in step 2 does not exceed the system threshold, scanning for idle nodes in the cluster system and having the idle nodes take over part of the jobs of the nodes whose basic working state exceeds the node threshold; step 4, if the resource usage of the entire cluster system in step 2 exceeds the system threshold, having the system determine the priority of each job, stopping the lowest-priority tasks and placing them in a queue to wait. That patent obtains information about the computers in the cluster by having a control-terminal node scan every computer in the cluster. This approach cannot achieve real-time monitoring: particularly when the cluster contains many computers, scanning takes a long time, and the network scanning consumes network resources in the cluster and degrades the cluster's network resource quality.

There is therefore an urgent need for a computer server cluster log monitoring method and platform that is real-time, visualizes the monitoring, and does not affect the network resource quality of the cluster.

Summary of the invention

In view of the problems in the prior art, the present invention provides a computer server cluster log monitoring method and monitoring platform that monitor in real time, visualize the monitoring results and do not affect the network resource quality of the cluster.

To achieve the above technical purpose, the technical solution of the present invention is as follows:

A computer server cluster log monitoring method, whose specific steps comprise:

A. monitoring the massive machine code instructions generated by the mainboard of each server in the computer server cluster during operation, and acquiring the machine code instruction data generated by the mainboard in real time using a real-time streaming data acquisition framework from big data technology, the machine code instructions including at least memory instructions, CPU instructions, disk I/O instructions, network traffic instructions, TCP connection count instructions and application process parameter instructions;

B. performing data classification and translation on the collected machine code instructions, including real-time filtering, processing and storage;

C. analyzing and displaying the data processed in step B in real time, and obtaining and storing a real-time analysis result;

D. analyzing the real-time analysis result together with the stored historical analysis results to obtain potential abnormality results for the computer servers, and issuing early warnings and notifications.

As an improvement, in step A the monitoring service of the master monitoring node is deployed on 2 servers by means of RHCS technology, and the monitoring service automatically deploys collection agents on the servers in the computer server cluster, so as to keep the monitoring service in a highly available state. The master monitoring node can automatically scan for and discover newly added servers in the cluster, add them to the monitoring list, and automatically add the monitoring items of the monitored servers. The 2 servers work in master-slave mode: the master server works while the slave server is in a standby monitoring state; when the master server goes down, the slave server takes over all work of the master server; after the master server returns to normal, the service is switched back to run on the master server automatically or manually, according to the user's setting.
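For illustration only, the master-slave behaviour described above can be sketched as a small state machine. This is a minimal sketch and not the RHCS mechanism itself; the class name `HaPair`, the `auto_failback` flag and the method names are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass


@dataclass
class HaPair:
    """Two monitoring servers: one master, one standby (hypothetical model)."""
    auto_failback: bool = True   # user setting: switch back automatically?
    master_up: bool = True
    active: str = "master"       # which server currently runs the service

    def master_failed(self) -> None:
        # The standby takes over all work when the master goes down.
        self.master_up = False
        self.active = "slave"

    def master_recovered(self) -> None:
        # After recovery, switch back only if the user chose automatic mode.
        self.master_up = True
        if self.auto_failback:
            self.active = "master"

    def manual_failback(self) -> None:
        # Manual mode: an operator moves the service back to the master.
        if self.master_up:
            self.active = "master"


pair = HaPair(auto_failback=False)
pair.master_failed()
pair.master_recovered()   # service stays on the slave in manual mode
pair.manual_failback()    # operator switches it back
```

The `auto_failback` flag stands in for "the user's setting" in the text: with it disabled, the service remains on the slave until an operator calls `manual_failback`.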

Preferably, the streaming data acquisition framework comprises N collection agent modules, a collection service module, a data filtering module, a dynamic performance balancing module and a distributed stream computing module. The N collection agent modules run on N monitored nodes and acquire the massive machine code instructions on the monitored nodes in real time. The collection service module runs on the master monitoring node, collects in real time the machine code instructions sent by each collection agent module and sends them to the data filtering module. The data filtering module runs on the master monitoring node, receives the machine code instructions sent by the collection service module, performs preliminary filtering on them, and sends the filtered machine code instructions to the distributed stream computing module. The dynamic performance balancing module dynamically balances the service performance between the collection agent modules and the collection service module, between the collection service module and the data filtering module, and between the data filtering module and the distributed stream computing module.

As an improvement, the acquisition items of the collection agent modules in step A can be added and configured, and the thresholds associated with the acquisition items can be set.
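As a hedged illustration of configurable acquisition items and thresholds, the following Python sketch models an agent configuration. The item names (`cpu_usage_percent`, `tcp_connections`), the sampling interval field and the classes `AcquisitionItem` and `AgentConfig` are assumptions made for the example; the patent does not specify a configuration format.

```python
from dataclasses import dataclass, field


@dataclass
class AcquisitionItem:
    name: str          # hypothetical item name, e.g. "cpu_usage_percent"
    interval_s: int    # how often the agent samples this item
    threshold: float   # value above which a sample is flagged


@dataclass
class AgentConfig:
    items: dict = field(default_factory=dict)

    def add_item(self, item: AcquisitionItem) -> None:
        # Acquisition items can be added at any time.
        self.items[item.name] = item

    def set_threshold(self, name: str, threshold: float) -> None:
        # The threshold of an existing item can be changed.
        self.items[name].threshold = threshold

    def exceeds(self, name: str, value: float) -> bool:
        return value > self.items[name].threshold


config = AgentConfig()
config.add_item(AcquisitionItem("cpu_usage_percent", interval_s=5, threshold=90.0))
config.add_item(AcquisitionItem("tcp_connections", interval_s=10, threshold=5000))
config.set_threshold("cpu_usage_percent", 85.0)
```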

Preferably, the real-time filtering, processing and storage in step B specifically comprise: a distributed stream computing service based on the Flume+Kafka+Storm framework performs preliminary filtering on the semi-structured machine code data generated by the servers and retains the data to be used for analysis; that data is then translated into readable, regular structured and semi-structured data; the translated data is stored in the HBase high-performance columnar database, and real-time read/write access to the data is provided externally in combination with Phoenix; HBase uses the HDFS distributed file system for persistent data storage, while Hive provides queries over static structured data using an SQL-like language that is compiled at the bottom layer into MapReduce programs run on Hadoop; when the utilization of the data storage nodes is too high, horizontal scaling is performed by adding new storage nodes to guarantee the normal operation of step B.

Preferably, step C specifically comprises: displaying the data stream obtained in step B as real-time graphs and reports through a Web front end based on Nginx+PHP, while performing data analysis on the data stream, marking data values that are abnormal, and obtaining an availability report of the monitored computer server cluster according to the data analysis result.

Preferably, the graphs and reports are displayed using the front-end data visualization framework EChart and can be customized into various chart sets and network topology diagrams; a set of APIs is also extended using the PHP language for managing monitored servers, reading collected monitoring logs, developing custom monitoring views and data visualization interfaces, and so on.

Preferably, step D specifically comprises: by analyzing real-time data, stored trend data and historical data, predicting failures and abnormalities that may occur, and matching and providing recommended solutions; during prediction, the deep learning framework Deeplearning4j is used to learn autonomously from the historical data, improving the accuracy and timeliness of event prediction; mail, SMS or WeChat alarm notification services are connected through an open alarm API.

As an improvement, a massive alarm event library SDK is established, and the analysis in step D is performed in combination with the massive alarm event library.

A monitoring platform based on the above computer server cluster log monitoring method comprises host equipment, storage equipment and network communication equipment. The host equipment comprises a monitoring system, a message system, a storage system, an analysis system, a display system and an alarm system; the storage equipment comprises a file system and a database system; the network communication equipment comprises a modem, a router and a network switch. The host equipment adopts a high-availability design in master-slave server mode: the master server works while the slave server is in a standby monitoring state; when the master server goes down, the slave server takes over all work of the master server; after the master server returns to normal, the service is switched back to run on the master server automatically or manually, according to the user's setting;

The monitoring system uses the real-time streaming data acquisition framework to acquire in real time the machine code instructions generated by the mainboards of the monitored computer servers during operation, and sends them to the message system;

The message system performs data classification and translation on the collected data, including real-time filtering and processing;

The storage system uses HBase high-performance columnar database technology to read and write the data processed by the message system quickly and to store it in the file system;

The analysis system performs real-time computation and trend prediction analysis on the stored data, obtains processing results and sends them to the display system and the database system respectively, and sends those processing results that indicate an abnormal predicted trend to the alarm system;

The display system displays the received processing results in the form of images and reports;

The alarm system issues alarms to operation and maintenance personnel according to the results it receives;

The file system is the HDFS distributed file system and stores the data processed by the message system;

The database system stores the processing results received from the analysis system;

The network communication equipment provides communication between the host equipment and the monitored computer servers, and between the host equipment and the storage equipment.

As can be seen from the above description, the present invention has the following advantages:

1. Good real-time performance. The present invention uses the stream computing of big data technology combined with HBase, a database technology offering high concurrency and fast reads and writes, so that data can not only be displayed in real time but also written to HDFS in Hadoop for storage and offline search.

2. Scalability. Because the present invention uses big data technology and processes and stores log data with the Hadoop ecosystem tools Flume, Kafka, Storm, HBase and HDFS, horizontal scaling can be performed by adding server nodes when the utilization of the data nodes is too high, without affecting the normal operation of the system. Compared with some common monitoring tools on the Internet (which use a single traditional database such as MySQL or Oracle), the present invention is easier to scale and supports access by more monitored devices.

3. High degree of automation. Whether for small-scale clusters or for large-scale and very large-scale clusters, the present invention supports automated deployment, and the deployment time of each node can be kept within 1 second. By setting automatic discovery rules, the system can automatically monitor nodes newly added to the cluster. Through various powerful monitoring templates, the system can automatically collect the CPU data, memory data, network data, application data and so on of each server, and automatically generate data curves and abnormality reports.

4. High availability. The monitoring server adopts a high-availability design in master-slave server mode: the master server works while the slave server is in a standby monitoring state; when the master server goes down, the slave server takes over all work of the master server; after the master server returns to normal, the service is switched back to run on the master server automatically or manually, according to the user's setting.

5. Rich API extensions. Communication between the services in the system takes the form of RESTful APIs, and the system can also communicate with platforms or applications outside the system through these APIs. Whether adding servers, adding monitoring templates, reading monitoring data or raising abnormality alarms, all of these can be handled efficiently through the open APIs of the system, which also provide a convenient route for application scenarios such as secondary development and integration with other service platforms.
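The automatic discovery mentioned in advantage 3 amounts to comparing the hosts found by a scan with the current monitoring list. The following Python sketch is illustrative only: the function name `discover`, the IP addresses and the default item names are hypothetical, and how the scan itself is performed is outside the sketch.

```python
def discover(scanned_hosts, watch_list, default_items):
    """Add newly seen hosts to the watch list with default monitoring items.

    Returns the hosts that were added, in sorted order.
    """
    new_hosts = sorted(set(scanned_hosts) - set(watch_list))
    for host in new_hosts:
        # Copy the template so each host can be tuned independently later.
        watch_list[host] = list(default_items)
    return new_hosts


watch_list = {"10.0.0.1": ["cpu", "memory"]}
added = discover(
    scanned_hosts=["10.0.0.1", "10.0.0.2", "10.0.0.3"],
    watch_list=watch_list,
    default_items=["cpu", "memory", "disk_io", "network"],
)
```

Hosts already on the list keep their existing items; only the two new addresses receive the default template.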

Description of the drawings

Fig. 1 is a system block diagram of the monitoring platform of the present invention;

Fig. 2 is a system block diagram of the streaming data acquisition framework of the present invention.

Detailed description of the embodiments

Specific embodiments of the present invention are described in detail below with reference to Fig. 1 and Fig. 2, but they do not limit the claims of the present invention in any way.

A. The massive machine code instructions generated by the mainboard of each server in the computer server cluster during operation are monitored, and the machine code instruction data generated by the mainboard is acquired in real time using a real-time streaming data acquisition framework from big data technology, wherein:

The machine code instructions include at least memory instructions, CPU instructions, disk I/O instructions, network traffic instructions, TCP connection count instructions and application process parameter instructions;

By means of RHCS technology, the monitoring service of the master monitoring node is deployed on 2 servers, and the monitoring service automatically deploys collection agents on the servers in the computer server cluster, so as to keep the monitoring service in a highly available state. The master monitoring node can automatically scan for and discover newly added servers in the cluster, add them to the monitoring list, and automatically add the monitoring items of the monitored servers. The 2 servers adopt a high-availability (HA) design and work in master-slave mode: the master server works while the slave server is in a standby monitoring state; when the master server goes down, the slave server takes over all work of the master server; after the master server returns to normal, the service is switched back to run on the master server automatically or manually, according to the user's setting;

As shown in Fig. 2, the streaming data acquisition framework comprises N collection agent modules, a collection service module, a data filtering module, a dynamic performance balancing module and a distributed stream computing module. The N collection agent modules run on N monitored nodes (that is, on the monitored servers) and acquire the massive machine code instructions on the monitored nodes in real time; the acquisition items of the collection agent modules can be added and configured, and the thresholds associated with the acquisition items can be set. The collection service module runs on the master monitoring node, collects in real time the machine code instructions sent by each collection agent module and sends them to the data filtering module. The data filtering module runs on the master monitoring node, receives the machine code instructions sent by the collection service module, performs preliminary filtering on them, and sends the filtered machine code instructions to the distributed stream computing module. The dynamic performance balancing module dynamically balances the service performance between the collection agent modules and the collection service module, between the collection service module and the data filtering module, and between the data filtering module and the distributed stream computing module. The dynamic performance balancing module uses Flume and Kafka technology to balance the performance between the data collection service and the data filtering service, ensuring that both maintain high throughput.
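The staged flow of Fig. 2 (agents, then collection and filtering, then stream computing) can be approximated in a few lines of Python. This sketch deliberately swaps in bounded in-process queues from the standard library in place of Flume, Kafka and Storm, so it only illustrates the idea that a full buffer slows the upstream stage down; the record layout, the node names and the per-node peak computation are assumptions made for the example.

```python
import queue
import threading

SENTINEL = None  # end-of-stream marker


def agent(node, samples, raw_q):
    """Collection agent: sends raw samples from one monitored node."""
    for cpu in samples:
        raw_q.put({"node": node, "cpu": cpu})  # blocks while the queue is full


def filter_stage(raw_q, filtered_q):
    """Data filtering: drop malformed samples, forward the rest."""
    while True:
        record = raw_q.get()
        if record is SENTINEL:
            filtered_q.put(SENTINEL)
            return
        if isinstance(record["cpu"], (int, float)):
            filtered_q.put(record)


def compute_stage(filtered_q, peaks):
    """Stream computing stand-in: track the peak CPU value per node."""
    while True:
        record = filtered_q.get()
        if record is SENTINEL:
            return
        node = record["node"]
        peaks[node] = max(peaks.get(node, 0), record["cpu"])


raw_q = queue.Queue(maxsize=4)       # small bounds force back-pressure
filtered_q = queue.Queue(maxsize=4)
peaks = {}

stages = [
    threading.Thread(target=filter_stage, args=(raw_q, filtered_q)),
    threading.Thread(target=compute_stage, args=(filtered_q, peaks)),
]
agents = [
    threading.Thread(target=agent, args=("node-1", [10, 55, None, 42], raw_q)),
    threading.Thread(target=agent, args=("node-2", [70, "bad", 91], raw_q)),
]
for t in stages + agents:
    t.start()
for t in agents:
    t.join()
raw_q.put(SENTINEL)                  # all agents have finished
for t in stages:
    t.join()
```

Because `put` blocks on a full queue, a slow downstream stage automatically throttles the stage feeding it, which is one simple way to keep adjacent services in balance.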

B. Data classification and translation are performed on the collected machine code instructions, including real-time filtering, processing and storage. A distributed stream computing service based on the Flume+Kafka+Storm framework performs preliminary filtering on the semi-structured machine code data generated by the servers and retains the data to be used for analysis; that data is then translated into readable, regular structured and semi-structured data. The translated data is stored in the HBase high-performance columnar database, and real-time read/write access to the data is provided externally in combination with Phoenix. HBase uses the HDFS distributed file system for persistent data storage, while Hive (a data warehouse tool based on Hadoop) provides queries over static structured data using an SQL-like language that is compiled at the bottom layer into MapReduce programs run on Hadoop. When the utilization of the data storage nodes is too high, horizontal scaling is performed by adding new storage nodes to guarantee the normal operation of step B.
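A minimal sketch of the filter-and-translate part of step B is given below. The patent does not disclose the raw record format, so the pipe-delimited layout, the metric names in `WANTED` and the row-key scheme are all hypothetical; the sketch also stops short of any real HBase or Phoenix call and only builds the structured rows that such a store would receive.

```python
WANTED = {"cpu", "memory", "disk_io", "network", "tcp"}  # assumed metric names


def translate(raw_line):
    """Turn one raw record into a structured row, or None if filtered out."""
    parts = raw_line.strip().split("|")
    if len(parts) != 4:
        return None                      # malformed: drop
    ts, node, metric, payload = parts
    if metric not in WANTED or not ts.isdigit():
        return None                      # not needed for analysis: drop
    fields = {}
    for pair in payload.split(";"):
        key, sep, value = pair.partition("=")
        if sep:
            fields[key] = value
    # The row key groups one node's samples for one metric in time order.
    row_key = f"{node}#{metric}#{int(ts):010d}"
    return {"row_key": row_key, "node": node, "metric": metric,
            "timestamp": int(ts), "fields": fields}


raw = [
    "1513296000|node-17|cpu|usage=87.5;cores=32",
    "1513296001|node-17|debug|msg=heartbeat",
    "garbage line",
]
rows = [row for row in map(translate, raw) if row is not None]
```

Putting the node and metric before the timestamp in the row key is a common choice for columnar stores when most queries ask for one node's recent history; it is an assumption here, not something the patent states.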

C. The data processed in step B is analyzed and displayed in real time, and a real-time analysis result is obtained and stored. That is, the data stream obtained in step B is displayed as real-time graphs and reports through a Web front end based on Nginx+PHP, while data analysis is performed on the data stream, data values that are abnormal are marked, and an availability report of the monitored computer server cluster is obtained according to the data analysis result. The graphs and reports can be displayed using the front-end data visualization framework EChart and can be customized into various chart sets and network topology diagrams; a set of APIs is also extended using the PHP language for managing monitored servers, reading collected monitoring logs, developing custom monitoring views and data visualization interfaces, and so on.
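The two analysis outputs named in step C, marked abnormal values and an availability report, can be illustrated with the following Python sketch. The threshold rule and the definition of availability as the share of successful status checks are assumptions chosen for the example; the patent does not define either calculation.

```python
def mark_anomalies(samples, threshold):
    """Return (timestamp, value, is_abnormal) for each sample."""
    return [(ts, value, value > threshold) for ts, value in samples]


def availability(status_checks):
    """Percentage of checks in which the server answered."""
    if not status_checks:
        return 0.0
    return 100.0 * sum(status_checks) / len(status_checks)


cpu_samples = [(0, 35.0), (5, 97.5), (10, 40.0), (15, 99.0)]
marked = mark_anomalies(cpu_samples, threshold=90.0)
abnormal_times = [ts for ts, _, bad in marked if bad]

report = {
    "node-1": availability([True, True, True, False]),
    "node-2": availability([True, True, True, True]),
}
```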

D. The real-time analysis result is analyzed together with the stored historical analysis results to obtain potential abnormality results for the computer servers, and early warnings and notifications are issued. That is, by analyzing real-time data, stored trend data and historical data, failures and abnormalities that may occur are predicted, and recommended solutions are matched and provided. During prediction, the deep learning framework Deeplearning4j is used to learn autonomously from the historical data, improving the accuracy and timeliness of event prediction. Mail, SMS or WeChat alarm notification services are connected through an open alarm API. To improve early-warning performance, a massive alarm event library SDK can also be established, so that the analysis of the real-time analysis result together with the stored historical analysis results is additionally performed in combination with the massive alarm event library.
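To make the predict-then-notify flow of step D concrete, the sketch below substitutes a plain least-squares line for the Deeplearning4j model named in the text; it is not the patent's prediction method. The channel registry `NOTIFIERS`, the function names and the disk-usage figures are hypothetical, and the notifiers only record messages instead of contacting a real mail or SMS service.

```python
def linear_forecast(history, steps_ahead):
    """Least-squares line through evenly spaced samples, extrapolated."""
    n = len(history)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(history) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(history))
    slope = sxy / sxx
    return mean_y + slope * ((n - 1 + steps_ahead) - mean_x)


sent = []                                  # stands in for real delivery
NOTIFIERS = {                              # hypothetical channel registry
    "mail": lambda msg: sent.append(("mail", msg)),
    "sms": lambda msg: sent.append(("sms", msg)),
}


def warn_if_needed(node, history, limit, steps_ahead, channels):
    """Forecast the metric and notify every channel if it will reach the limit."""
    predicted = linear_forecast(history, steps_ahead)
    if predicted >= limit:
        msg = f"{node}: disk usage predicted to reach {predicted:.1f}%"
        for channel in channels:
            NOTIFIERS[channel](msg)
    return predicted


disk_usage = [60.0, 65.0, 70.0, 75.0]      # stored history plus latest sample
predicted = warn_if_needed("node-1", disk_usage, limit=90.0,
                           steps_ahead=4, channels=["mail", "sms"])
```

With usage rising by 5 points per sample, the forecast four samples ahead is 95%, which crosses the 90% limit and triggers both notifiers before the fault actually occurs.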

Based on the above computer server cluster log monitoring method, a computer server cluster log monitoring platform is established; the system architecture of the platform is shown in Fig. 1.

The computer server cluster log monitoring platform comprises host equipment, storage equipment and network communication equipment. The host equipment comprises a monitoring system, a message system, a storage system, an analysis system, a display system and an alarm system; the storage equipment comprises a file system and a database system; the network communication equipment comprises a modem, a router and a network switch. The host equipment adopts a high-availability design in master-slave server mode: the master server works while the slave server is in a standby monitoring state; when the master server goes down, the slave server takes over all work of the master server; after the master server returns to normal, the service is switched back to run on the master server automatically or manually, according to the user's setting;

The monitoring system uses the real-time streaming data acquisition framework (shown in Fig. 2) to acquire in real time the machine code instructions generated by the mainboards of the monitored computer servers during operation, and sends them to the message system;

The message system performs data classification and translation on the collected data, including real-time filtering and processing;

The storage system uses HBase high-performance columnar database technology to read and write the data processed by the message system quickly and to store it in the file system;

The analysis system performs real-time computation and trend prediction analysis on the stored data, obtains processing results and sends them to the display system and the database system respectively, and sends those processing results that indicate an abnormal predicted trend to the alarm system;

The display system displays the received processing results in the form of images and reports;

The alarm system issues alarms to operation and maintenance personnel according to the results it receives;

The file system is the HDFS distributed file system and stores the data processed by the message system;

The database system stores the processing results received from the analysis system;

The network communication equipment provides communication between the host equipment and the monitored computer servers, and between the host equipment and the storage equipment.
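The way the analysis system distributes its results among the display system, the database system and the alarm system can be sketched as follows. Plain Python lists stand in for the three receiving systems, and the result fields, including the `trend_abnormal` flag, are hypothetical.

```python
def route(result, display, database, alarm):
    """Send every result to display and database; abnormal ones also to alarm."""
    display.append(result)
    database.append(result)
    if result["trend_abnormal"]:
        alarm.append(result)


display, database, alarm = [], [], []
results = [
    {"node": "node-1", "metric": "memory", "trend_abnormal": False},
    {"node": "node-2", "metric": "disk_io", "trend_abnormal": True},
]
for r in results:
    route(r, display, database, alarm)
```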

In the computer server cluster log monitoring method of the present invention and the system platform developed on the basis of this method:

(1) The stream computing of big data technology is used in combination with HBase, a database technology offering high concurrency and fast reads and writes, so that data can not only be displayed in real time but also written to HDFS in Hadoop for storage and offline search, giving the system good real-time performance.

(2) Because big data technology is used and log data is processed and stored with the Hadoop ecosystem tools Flume, Kafka, Storm, HBase and HDFS, horizontal scaling can be performed by adding server nodes when the utilization of the data nodes is too high, without affecting the normal operation of the system. Compared with some common monitoring tools on the Internet (which use a single traditional database such as MySQL or Oracle), the present invention is easier to scale and supports access by more monitored devices, so its scalability is good.

(3) Whether for small-scale clusters or for large-scale and very large-scale clusters, the present invention supports automated deployment, and the deployment time of each monitored node can be kept within 1 second. By setting automatic discovery rules, the system can automatically monitor nodes newly added to the cluster. Through various powerful monitoring templates, the system can automatically collect the CPU data, memory data, network data, application data and so on of each server, and automatically generate data curves and abnormality reports, giving a high degree of automation.

(4) The monitoring server adopts a high-availability design in master-slave server mode: the master server works while the slave server is in a standby monitoring state; when the master server goes down, the slave server takes over all work of the master server; after the master server returns to normal, the service is switched back to run on the master server automatically or manually, according to the user's setting.

(5) Communication between the services in the system takes the form of RESTful APIs, and the system can also communicate with platforms or applications outside the system through these APIs. Whether adding servers, adding monitoring templates, reading monitoring data or raising abnormality alarms, all of these can be handled efficiently through the open APIs of the system, which also provide a convenient route for application scenarios such as secondary development and integration with other service platforms, so that the system has rich API extensions.

In conclusion the invention has the following advantages that

1. Good real-time performance;

2. Scalability;

3. High degree of automation;

4. High availability;

5. Rich API extensions.

It should be understood that the above specific description of the present invention is intended only to illustrate the invention and is not limited to the technical solutions described in the embodiments of the invention. Those of ordinary skill in the art should understand that the present invention may still be modified or equivalently replaced to achieve the same technical effect; as long as the needs of use are met, such changes all fall within the protection scope of the present invention.

Claims

1. a kind of computer server cluster log monitoring method, specific steps include:

A. each server magnanimity machine code instruction that mainboard generates in the process of running in computer server cluster is monitored, The machine code instruction data that mainboard generates are acquired in real time using the real-time streaming data acquisition frame in big data technology, institute Machine code instruction is stated to instruct, answer including at least memory instruction, cpu instruction, disk I/O instruction, network flow instruction, TCP connection number With process parameter instruction；Wherein:

Using RHCS technology, the monitoring service of the main monitoring node is deployed on 2 servers, and the monitoring service automatically deploys collection agents on the servers in the computer server cluster, so as to keep the monitoring service in a highly available state; the main monitoring node can automatically scan for and discover servers newly added to the cluster, add them to the watch list, and automatically add monitored items for the monitored servers; the 2 servers use the master-slave working mode: the master server works and the slave server is in a monitoring standby state; when the master server goes down, the slave server takes over all work of the master server, and after the master server returns to normal, the service is switched back to the master server automatically or manually according to the user's setting;

The streaming data acquisition framework includes N collection agent modules, an acquisition service module, a data filtering module, a dynamic performance balancing module, and a distributed stream computing module; the N collection agent modules run on N monitored nodes and collect in real time the massive machine code instructions on the monitored nodes; the acquisition service module runs on the main monitoring node, collects in real time the machine code instructions transmitted by each collection agent module, and sends them to the data filtering module; the data filtering module runs on the main monitoring node, receives the machine code instructions sent by the acquisition service module, performs preliminary filtering on them, and sends the filtered machine code instructions to the distributed stream computing module; the dynamic performance balancing module dynamically balances service performance between the collection agent modules and the acquisition service module, between the acquisition service module and the data filtering module, and between the data filtering module and the distributed stream computing module;

B. Performing data classification and transfer operations on the collected machine code instructions, including real-time filtering, processing, and storage; wherein:

The real-time filtering, processing, and storage specifically include: a distributed stream computing service based on the Flume+Kafka+Storm framework performs preliminary filtering on the semi-structured machine code data generated by the servers and retains the data to be used for analysis; this data is then translated into readable, regular structured and semi-structured data; the translated data is stored in HBase, a high-performance columnar database, which in combination with Phoenix provides real-time external read/write access to the data; HBase uses the HDFS distributed file system for persistent data storage; Hive is used to provide queries over static structured data using an SQL-like language, which at the bottom layer is compiled into MapReduce programs that run on Hadoop; when the utilization of the data storage nodes is too high, horizontal scaling is performed by adding new storage nodes to guarantee the normal operation of step B;

D. Analyzing the real-time analysis results together with the stored historical analysis results to obtain potential abnormal results of the computer servers, and issuing early warnings and notifications.
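As an informal illustration of the collect, filter, and translate stages recited in steps A and B, the following sketch chains three plain Python generators. It is not the Flume+Kafka+Storm pipeline named in the claim; the line format (`<kind> <value>`), the metric kinds in `KEEP`, and all function names are assumptions made for the example.

```python
from typing import Dict, Iterable, Iterator

# Hypothetical metric kinds that are retained for analysis.
KEEP = {"mem", "cpu", "disk_io", "net", "tcp_conn", "proc"}


def agent(node: str, raw_lines: Iterable[str]) -> Iterator[str]:
    """Collection agent: tag each raw sample with the node it came from."""
    for line in raw_lines:
        yield f"{node} {line}"


def _is_number(text: str) -> bool:
    try:
        float(text)
        return True
    except ValueError:
        return False


def filter_stage(lines: Iterable[str]) -> Iterator[str]:
    """Preliminary filtering: keep well-formed samples of a known kind only."""
    for line in lines:
        fields = line.split()
        if len(fields) == 3 and fields[1] in KEEP and _is_number(fields[2]):
            yield line


def translate(lines: Iterable[str]) -> Iterator[Dict[str, object]]:
    """Translate each retained sample into a readable, structured record."""
    for line in lines:
        node, kind, value = line.split()
        yield {"node": node, "metric": kind, "value": float(value)}
```

Chaining the stages as `translate(filter_stage(agent("node-1", raw)))` yields records that a columnar store could accept as rows; lines that are malformed or of an unknown kind are dropped before translation.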

2. The computer server cluster log monitoring method according to claim 1, characterized in that: the acquisition items of the collection agent modules in step A can be added and configured, and the thresholds associated with the acquisition items can be set.
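One possible shape for configurable acquisition items with settable thresholds is sketched below. The class name `AcquisitionConfig`, its methods, and the "value above threshold is a breach" rule are illustrative assumptions; the claim does not define how items or thresholds are represented.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AcquisitionConfig:
    """Hypothetical per-agent configuration: acquisition item name -> threshold."""
    thresholds: Dict[str, float] = field(default_factory=dict)

    def add_item(self, name: str, threshold: float) -> None:
        """Add a new acquisition item, or change the threshold of an existing one."""
        self.thresholds[name] = threshold

    def breaches(self, sample: Dict[str, float]) -> List[str]:
        """Return the configured items whose sampled value exceeds its threshold."""
        return sorted(
            name for name, value in sample.items()
            if name in self.thresholds and value > self.thresholds[name]
        )
```

Values for items that were never configured are ignored, so an agent can report more than it is asked to check.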

3. The computer server cluster log monitoring method according to claim 1, characterized in that step C specifically includes: displaying the data stream obtained in step B as real-time graphs and reports through a Web front end based on Nginx+PHP, while analyzing the data in the incoming stream, marking abnormal data values, and obtaining an availability report of the monitored computer server cluster from the data analysis results.

4. The computer server cluster log monitoring method according to claim 3, characterized in that: the front-end data visualization framework EChart is used to display the graphs and reports, and the graphs and reports can be customized as various chart sets and network topology diagrams; at the same time, a set of APIs is extended using the PHP language for managing monitored servers, reading collected monitoring logs, and independently developing monitoring views and data visualization interfaces.

5. The computer server cluster log monitoring method according to claim 1, characterized in that step D specifically includes: by analyzing real-time data, stored trend data, and historical data, predicting failures and abnormalities that may occur, and matching and providing recommended solutions; during prediction, the deep learning framework Deeplearning4j is used to learn autonomously from historical data, improving the accuracy and timeliness of event prediction; and alarm notification services via email, SMS, or WeChat are connected through an open alarm API.
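The idea of predicting a fault from trend data can be illustrated with a deliberately simple stand-in: a least-squares line fitted to recent samples and extrapolated to a threshold. This is not the Deeplearning4j-based learning recited in the claim; the function name `steps_until_breach` and the linear model are assumptions chosen to keep the example small.

```python
from typing import Optional, Sequence


def steps_until_breach(history: Sequence[float], threshold: float) -> Optional[float]:
    """Fit y = a + b*t by least squares over t = 0..n-1 and return how many
    sampling steps after the last sample the fitted line reaches `threshold`.
    Returns None when there are fewer than two samples or the trend is not rising."""
    n = len(history)
    if n < 2:
        return None
    t_mean = (n - 1) / 2
    y_mean = sum(history) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(history))
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = y_mean - slope * t_mean
    t_cross = (threshold - intercept) / slope
    # A result of 0.0 means the fitted line has already reached the threshold.
    return max(0.0, t_cross - (n - 1))
```

A caller could raise an early warning whenever the returned value falls below a chosen lead time, and then hand the event to whichever notification channel is configured.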

6. The computer server cluster log monitoring method according to claim 1, characterized in that: a mass alarm event library SDK is established, and the analysis in step D is performed in combination with the mass alarm event library.

7. A monitoring platform based on the computer server cluster log monitoring method according to claim 1, characterized in that: it includes a host device, memory devices, and network communication devices; the host device includes a monitoring system, a message system, a storage system, an analysis system, a display system, and an alarm system; the memory devices include a file system and a database system; the network communication devices include a modem, a router, and a network switch; the host device uses a high-availability design in master-slave server mode: the master server works and the slave server is in a monitoring standby state; when the master server goes down, the slave server takes over all work of the master server, and after the master server returns to normal, the service is switched back to the master server automatically or manually according to the user's setting;

The monitoring system uses the real-time streaming data acquisition framework to collect in real time the machine code instructions generated by the mainboards of the monitored computer servers during operation and sends them to the message system;

The storage system uses HBase high-performance columnar database technology to perform fast reads and writes of the data processed by the message system and stores the data in the file system;

The analysis system performs real-time computation and trend prediction analysis on the stored data, obtains processing results and sends them to the display system and the database system respectively, and sends the results that indicate a trend prediction abnormality to the alarm system;

The network communication devices are used for communication between the host device and the monitored computer servers and between the host device and the memory devices.