CN107943668B - Computer server cluster log monitoring method and monitoring platform - Google Patents

Computer server cluster log monitoring method and monitoring platform

Info

Publication number
CN107943668B
Authority
CN
China
Prior art keywords
data
server
real time
module
monitoring
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201711353494.1A
Other languages
Chinese (zh)
Other versions
CN107943668A (en)
Inventor
尤福宝
汤成辉
徐文渊
黄云辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Shenwei Cloud Technology Co Ltd
Original Assignee
Jiangsu Shenwei Cloud Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu Shenwei Cloud Technology Co Ltd
Priority to CN201711353494.1A
Publication of CN107943668A
Application granted
Publication of CN107943668B
Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3006 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/34 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3452 Performance evaluation by statistical analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/34 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3466 Performance evaluation by tracing or monitoring
    • G06F11/3476 Data logging

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Physics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The present invention relates to the field of computer technology, and specifically to a computer server cluster log monitoring method and monitoring platform. The monitoring method comprises the following steps: A. monitoring the massive machine code instructions generated by the mainboard of each server in the computer server cluster during operation, and collecting these instructions in real time using a real-time streaming data collection framework from big data technology; B. classifying and forwarding the collected data, including real-time filtering, processing and storage; C. analyzing and displaying the data processed in step B in real time, and obtaining and storing the real-time analysis results; D. analyzing the real-time analysis results together with the stored historical analysis results to identify potential anomalies in the computer servers, and issuing early warnings and notifications. The invention offers real-time operation, scalability, automation, high availability and rich API extension.

Description

Computer server cluster log monitoring method and monitoring platform
Technical field
The present invention relates to the field of computer technology, and specifically to a computer server cluster log monitoring method and monitoring platform.
Background art
With the arrival of the information age, technologies such as big data, cloud computing and machine learning have increasingly become research hotspots in the computer field. These technologies share one characteristic: they require complex, large-scale computation. They therefore often rely on a large number of computer servers organized into one or more clusters that compute in parallel and jointly complete one or more computing tasks. As demand has grown in recent years, computer server clusters have expanded from the original few dozen machines to hundreds, thousands or even tens of thousands.
As computer server clusters grow larger, how to manage cluster performance indicators (such as each server's CPU, memory and network usage and its disk I/O read/write activity), and how to discover and handle problems in the cluster promptly, have become a major challenge for machine room operation and maintenance (O&M) personnel.
In the prior art, the main methods used by machine room O&M personnel are:
(1) regular machine room inspections, checking whether any server's indicator light shows an alarm;
(2) using free server monitoring software available on the Internet to assist management.
However, as the number of servers keeps growing, relying solely on machine room staff to inspect thousands of servers periodically and to judge and troubleshoot problems by eye not only entails an excessive workload but also easily leads to false and missed detections. Using monitoring software from the Internet to assist management also poses serious security problems: because the internal structure of such software is unknown, using it rashly carries a risk of Trojan or hacker attacks. Moreover, such software is usually suitable only for small computer server clusters. It runs well with a few dozen servers, but once the number of servers reaches hundreds, thousands or even tens of thousands, its performance drops markedly, and it may become unable to support the cluster at all.
In view of the above problems, researchers in China have also developed log collection or monitoring methods specifically for computer server clusters. For example, the invention patent application with China Patent Publication No. CN105095502A discloses a log collection method for a cluster storage system. Its technical solution is as follows: the method comprises a log management module, a log collection module and log agent modules. The log management module runs on a monitor node in the cluster and is responsible for managing and coordinating the log collection module and the log agent modules; it is integrated into the business flow of the cluster storage system in an embedded manner. The log collection module runs on a monitor node in the cluster, is responsible for collecting and managing the data pushed by the multiple log agent modules, and classifies and stores the data in a designated directory; the size of the log collection module is configured dynamically according to the scale of the cluster. A log agent module runs on each node in the cluster, is responsible for collecting the logs of the cluster storage system on the node where it resides, and pushes them to the log collection module. Each log agent module can monitor 1024 files, and in the log agent module's configuration file the security level attributes for log transmission are E2E and SendOnly. Although this method can collect logs, it suffers from drawbacks such as a lack of real-time capability, non-distributed storage and no anomaly early warning, and it cannot directly visualize the information of the servers in the cluster through a system platform, which hinders real-time monitoring by machine room O&M personnel.
The invention patent application with China Patent Publication No. CN106326008 also discloses a monitoring method for cluster systems, whose technical solution mainly comprises the following steps: Step 1, collect the detailed attributes and basic working state of each node in the cluster system, and generate a report log of each node's basic working state; Step 2, based on the basic working state of each node obtained in Step 1, determine whether any node exceeds the node threshold or has stopped working due to a fault; if the basic working state of several nodes exceeds the preset threshold or those nodes have stopped working, scan and tally the resource usage of the entire cluster system, determine whether it exceeds the system threshold, and generate a resource usage report log for the entire cluster system; Step 3, if the resource usage of the entire cluster system in Step 2 does not exceed the system threshold, scan the cluster system for idle nodes and have the idle nodes take over part of the jobs of the nodes whose basic working state exceeds the node threshold; Step 4, if the resource usage of the entire cluster system in Step 2 exceeds the system threshold, the system determines the priority of each job, and the lowest-priority tasks are stopped and placed in a queue to wait. That patent obtains information about the computers in the cluster by having a control terminal node scan every computer in the cluster. This approach cannot achieve real-time monitoring: particularly when the cluster contains many computers, scanning takes a long time, and the network scans consume network resources in the cluster, degrading the quality of the cluster's network resources.
There is therefore an urgent need for a computer server cluster log monitoring method and platform that is real-time, provides visualized monitoring and does not affect the quality of the cluster's network resources.
Summary of the invention
To address the problems of the prior art, the present invention provides a computer server cluster log monitoring method and monitoring platform that monitors in real time, visualizes the monitoring results and does not affect the quality of the cluster's network resources.
To achieve the above technical purpose, the technical solution of the present invention is as follows:
A computer server cluster log monitoring method, comprising the following steps:
A. Monitor the massive machine code instructions generated by the mainboard of each server in the computer server cluster during operation, and collect the machine code instruction data generated by the mainboard in real time using a real-time streaming data collection framework from big data technology. The machine code instructions include at least memory instructions, CPU instructions, disk I/O instructions, network traffic instructions, TCP connection count instructions and application process parameter instructions;
B. Classify and forward the collected machine code instructions, including real-time filtering, processing and storage;
C. Analyze and display the data processed in step B in real time, and obtain and store the real-time analysis results;
D. Analyze the real-time analysis results together with the stored historical analysis results to identify potential anomalies in the computer servers, and issue early warnings and notifications.
As an improvement, in step A the monitoring service of the main monitoring node is deployed on two servers by means of RHCS technology so that the monitoring service remains highly available, and the monitoring service automatically deploys collection agents on the servers in the computer server cluster. The main monitoring node can automatically scan for, discover and add newly added servers in the cluster to the watch list, and automatically add monitoring items for the monitored servers. The two servers work in master-slave mode: the master server does the work while the slave server stands by and monitors it; when the master server goes down, the slave server takes over all of the master server's work; after the master server returns to normal, the service is switched back to the master server automatically or manually, according to the user's settings.
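The master-slave switching rule described above can be pictured as a small state machine. The following is only a minimal sketch under stated assumptions: the class `HaPair`, its fields and the heartbeat-driven update are hypothetical names chosen for illustration, and the sketch does not use or model RHCS itself.

```python
from dataclasses import dataclass


@dataclass
class HaPair:
    """Master/standby pair; every name here is hypothetical."""
    master_up: bool = True
    active: str = "master"        # which server currently runs the service
    auto_failback: bool = True    # the user's setting for switching back

    def on_heartbeat(self, master_alive: bool) -> str:
        """Update the pair from one heartbeat check and return the active server."""
        self.master_up = master_alive
        if not master_alive:
            # Master is down: the standby takes over all work.
            self.active = "standby"
        elif self.active == "standby" and self.auto_failback:
            # Master has recovered and automatic switch-back is enabled.
            self.active = "master"
        return self.active

    def manual_failback(self) -> str:
        """Operator-triggered switch back; only honoured once the master is healthy."""
        if self.master_up:
            self.active = "master"
        return self.active
```

With `auto_failback=False` the service stays on the standby after the master recovers until `manual_failback()` is called, which mirrors the "automatic or manual" choice left to the user.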
Preferably, the streaming data collection framework comprises N collection agent modules, a collection service module, a data filtering module, a dynamic performance balancing module and a distributed stream computing module. The N collection agent modules run on the N monitored nodes and collect the massive machine code instructions on those nodes in real time. The collection service module runs on the main monitoring node, collects in real time the machine code instructions sent by each collection agent module and forwards them to the data filtering module. The data filtering module runs on the main monitoring node, receives the machine code instructions sent by the collection service module, performs preliminary filtering on them and sends the filtered machine code instructions to the distributed stream computing module. The dynamic performance balancing module dynamically balances service performance between the collection agent modules and the collection service module, between the collection service module and the data filtering module, and between the data filtering module and the distributed stream computing module.
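The data path through these modules (agents, then the collection service, then the filter, then stream computation) can be sketched in a single process. This is an illustrative stand-in only: the function and class names, the record layout and the per-metric maximum used as the "computation" are assumptions, and nothing here touches Flume, Kafka or Storm.

```python
from collections import deque
from typing import Any, Deque, Dict, List, Set

Record = Dict[str, Any]


class CollectionService:
    """Runs on the main monitoring node and buffers records sent by agents."""

    def __init__(self) -> None:
        self.buffer: Deque[Record] = deque()

    def receive(self, record: Record) -> None:
        self.buffer.append(record)

    def drain(self) -> List[Record]:
        """Hand everything buffered so far to the next stage."""
        records = list(self.buffer)
        self.buffer.clear()
        return records


def collection_agent(node: str, readings: Dict[str, float], service: CollectionService) -> None:
    """Runs on one monitored node and forwards each reading to the service."""
    for metric, value in readings.items():
        service.receive({"node": node, "metric": metric, "value": value})


def data_filter(records: List[Record], wanted: Set[str]) -> List[Record]:
    """Preliminary filtering: keep only the metrics needed for analysis."""
    return [r for r in records if r["metric"] in wanted]


def stream_compute(records: List[Record]) -> Dict[str, float]:
    """Stand-in for the distributed stream computation: per-metric maximum."""
    result: Dict[str, float] = {}
    for r in records:
        metric, value = str(r["metric"]), float(r["value"])
        result[metric] = max(result.get(metric, value), value)
    return result


service = CollectionService()
collection_agent("node-1", {"cpu": 0.42, "debug": 1.0}, service)
collection_agent("node-2", {"cpu": 0.91, "mem": 0.55}, service)
peaks = stream_compute(data_filter(service.drain(), {"cpu", "mem"}))
```

In a real deployment each stage would be a separate service connected by a message channel; the sketch only shows the order of the stages and what each one passes on.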
As an improvement, the collection items of the collection agent modules in step A can be added and configured, and the thresholds associated with each collection item can be set.
Preferably, the real-time filtering, processing and storage in step B specifically comprise: a distributed stream computing service based on the Flume+Kafka+Storm framework performs preliminary filtering on the semi-structured machine code data generated by the servers and retains the data needed for analysis, then translates that data into readable, regular structured and semi-structured data. The translated data is stored in HBase, a high-performance column-oriented database, and Phoenix is combined with it to provide real-time external read/write access to the data. HBase uses the HDFS distributed file system for persistent data storage, while Hive provides queries over static structured data using a SQL-like language; under the hood, the queries are compiled into MapReduce programs that run on Hadoop. When the utilization of the data storage nodes becomes too high, new storage nodes are added to scale horizontally and keep step B running normally.
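The "filter, then translate into regular structured records" part of step B can be illustrated as follows. The source does not specify the raw data format, so the line layout `host metric=value ts=seconds`, the metric names and both function names are assumptions made purely for this sketch.

```python
from typing import Dict, Iterable, List, Optional

# Hypothetical set of metrics that are worth keeping for analysis.
ANALYSIS_METRICS = {"cpu", "mem", "disk_io", "net", "tcp_conn"}


def translate(raw: str) -> Optional[Dict[str, object]]:
    """Turn one raw 'host metric=value ts=seconds' line into a structured record.

    Returns None for lines that are malformed or not needed for analysis.
    """
    parts = raw.split()
    if len(parts) != 3:
        return None
    host, reading, stamp = parts
    metric, sep, value = reading.partition("=")
    key, sep2, ts = stamp.partition("=")
    if not sep or not sep2 or key != "ts" or metric not in ANALYSIS_METRICS:
        return None
    try:
        return {"host": host, "metric": metric, "value": float(value), "ts": int(ts)}
    except ValueError:
        # Non-numeric value or timestamp: treat the line as noise.
        return None


def filter_and_translate(lines: Iterable[str]) -> List[Dict[str, object]]:
    """Preliminary filter plus translation, preserving arrival order."""
    records = (translate(line) for line in lines)
    return [r for r in records if r is not None]
```

Records in this shape map naturally onto rows of a column-oriented store, which is one reason to normalize them before the storage stage.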
Preferably, step C specifically comprises: the data stream obtained in step B is displayed as real-time charts and reports through a Web front end based on Nginx+PHP; at the same time the data stream is analyzed, abnormal data values are flagged, and an availability report for the monitored computer server cluster is produced from the analysis results.
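Flagging abnormal values and deriving an availability figure from them might look like the sketch below. The source does not define how availability is calculated, so the definition used here (the share of a host's samples that were not flagged) is an assumption, as are the per-metric upper limits and all names.

```python
from typing import Dict, List, Tuple

Sample = Tuple[str, str, float]  # (host, metric, value)


def mark_anomalies(samples: List[Sample], limits: Dict[str, float]) -> List[Tuple[Sample, bool]]:
    """Pair each sample with a flag that is True when it exceeds its limit."""
    return [(s, s[1] in limits and s[2] > limits[s[1]]) for s in samples]


def availability_report(marked: List[Tuple[Sample, bool]]) -> Dict[str, float]:
    """Per-host share of samples that were not flagged as abnormal."""
    total: Dict[str, int] = {}
    normal: Dict[str, int] = {}
    for (host, _metric, _value), abnormal in marked:
        total[host] = total.get(host, 0) + 1
        normal[host] = normal.get(host, 0) + (0 if abnormal else 1)
    return {host: normal[host] / total[host] for host in total}
```

Metrics without a configured limit are never flagged, so adding a new collection item does not change the report until a threshold is set for it.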
Preferably, the charts and reports are displayed using the front-end data visualization framework EChart and can be customized into various chart sets and network topology diagrams; in addition, a set of APIs extended in the PHP language is provided for managing monitored servers, reading collected monitoring logs, and developing custom monitoring views and data visualization interfaces.
Preferably, step D specifically comprises: by analyzing the real-time data together with the stored trend data and historical data, possible failures and anomalies are predicted, and a recommended solution is matched and provided. During prediction, the deep learning framework Deeplearning4j learns autonomously from the historical data, improving the accuracy and timeliness of event prediction. Alarm notification services such as e-mail, SMS or WeChat are connected through an open alarm API.
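To make the idea of trend-based early warning concrete, the sketch below extrapolates a metric with an ordinary least-squares line and reports when the trend would cross a limit. This is deliberately a much simpler technique than the Deeplearning4j model named above and is swapped in only for illustration; the function names, the evenly spaced samples and the limit are all assumptions.

```python
from typing import List, Optional


def linear_forecast(history: List[float], steps_ahead: int) -> float:
    """Extrapolate an evenly spaced series with a least-squares line."""
    n = len(history)
    if n < 2:
        raise ValueError("need at least two points to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(history) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(history))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # The last observed sample sits at x = n - 1.
    return intercept + slope * (n - 1 + steps_ahead)


def early_warning(history: List[float], limit: float, horizon: int) -> Optional[int]:
    """Return the first future step at which the trend reaches the limit, if any."""
    for step in range(1, horizon + 1):
        if linear_forecast(history, step) >= limit:
            return step
    return None
```

A learned model would replace `linear_forecast`, while the surrounding logic (forecast, compare with a limit, warn ahead of time) could stay the same.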
As an improvement, a mass alarm event library SDK is established, and the analysis in step D can additionally draw on the mass alarm event library.
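One way to picture how an alarm event library could supply a recommended solution is a lookup over known events. The source gives no details of the SDK, so the entry layout (`metric`, `min_value`, `solution`), the matching rule and the function name below are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class AlarmEvent:
    """One entry of a hypothetical alarm event library."""
    metric: str
    min_value: float
    solution: str


def recommend(library: List[AlarmEvent], metric: str, value: float) -> Optional[str]:
    """Return the solution of the most specific matching event, or None."""
    matches = [e for e in library if e.metric == metric and value >= e.min_value]
    if not matches:
        return None
    # The entry with the highest satisfied floor is treated as most specific.
    return max(matches, key=lambda e: e.min_value).solution
```

For example, with entries for disk usage at 0.8 and 0.95, a reading of 0.97 selects the 0.95 entry because it is the tighter match.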
A monitoring platform based on the above computer server cluster log monitoring method comprises a host device, a storage device and a network communication device. The host device comprises a monitoring system, a message system, a storage system, an analysis system, a display system and an alarm system; the storage device comprises a file system and a database system; the network communication device comprises a modem, a router and a network switch. The host device has a high-availability design and uses master-slave server mode: the master server does the work while the slave server stands by and monitors it; when the master server goes down, the slave server takes over all of the master server's work; after the master server returns to normal, the service is switched back to the master server automatically or manually, according to the user's settings;
The monitoring system uses the real-time streaming data collection framework to collect, in real time, the machine code instructions generated by the mainboard of each monitored computer server during operation and sends them to the message system;
The message system classifies and forwards the collected data, including real-time filtering and processing;
The storage system uses HBase high-performance column-oriented database technology to read and write the data processed by the message system quickly and stores it in the file system;
The analysis system performs real-time computation and trend prediction analysis on the stored data, obtains the processing results and sends them to the display system and the database system, and also sends any processing result that indicates a predicted trend anomaly to the alarm system;
The display system presents the received processing results as charts and reports;
The alarm system issues alarms to O&M personnel according to the results it receives;
The file system is an HDFS distributed file system and stores the data processed by the message system;
The database system stores the processing results received from the analysis system;
The network communication device provides communication between the host device and the monitored computer servers, and between the host device and the storage device.
From the above, it can be seen that the present invention has the following advantages:
1. Good real-time performance. The invention uses the stream computing of big data technology combined with HBase, a database technology offering high concurrency and fast reads and writes, so it can not only display data in real time but also write the data to HDFS in Hadoop for storage and offline search.
2. Scalability. Because the invention uses big data technology and processes and stores log data with the Hadoop ecosystem tools Flume, Kafka, Storm, HBase and HDFS, the system can be scaled horizontally by adding server nodes when data node utilization becomes too high, without affecting normal operation. Compared with some common monitoring tools on the Internet, which use a single traditional database such as MySQL or Oracle, the invention is easier to extend and supports access by more monitored devices.
3. High degree of automation. Whether for small clusters or for large and very large clusters, the invention supports automated deployment, and the deployment time for each node can be kept within 1 second. By setting auto-discovery rules, the system can automatically monitor nodes newly added to the cluster. Through its various monitoring templates, the system can automatically collect each server's CPU data, memory data, network data, application data and so on, and automatically generate data graphs and anomaly reports.
4. High availability. The monitoring server has a high-availability design and uses master-slave server mode: the master server does the work while the slave server stands by and monitors it; when the master server goes down, the slave server takes over all of the master server's work; after the master server returns to normal, the service is switched back to the master server automatically or manually, according to the user's settings.
5. Rich API extension. Communication between the services within the system takes the form of RESTful APIs. The system can also communicate with external platforms or applications through these APIs: adding servers, adding monitoring templates, reading monitoring data and raising anomaly alarms can all be handled efficiently through the system's open APIs, which also provide a convenient path for application scenarios such as secondary development and integration into other service platforms.
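The kind of open API described in item 5 can be illustrated with an in-process dispatcher that maps a method and path to a handler. The routes `/servers`, the status codes returned and the class name are assumptions made for this sketch; the source does not list the platform's actual endpoints, and no HTTP server is started here.

```python
from typing import Callable, Dict, List, Tuple

Handler = Callable[[dict], Tuple[int, object]]


class MonitorApi:
    """In-process stand-in for a RESTful API; the routes are hypothetical."""

    def __init__(self) -> None:
        self.servers: List[str] = []
        self.routes: Dict[Tuple[str, str], Handler] = {
            ("POST", "/servers"): self._add_server,
            ("GET", "/servers"): self._list_servers,
        }

    def _add_server(self, body: dict) -> Tuple[int, object]:
        host = body.get("host")
        if not host:
            return 400, {"error": "host is required"}
        self.servers.append(host)
        return 201, {"host": host}

    def _list_servers(self, body: dict) -> Tuple[int, object]:
        return 200, list(self.servers)

    def handle(self, method: str, path: str, body: dict) -> Tuple[int, object]:
        """Dispatch one request to its handler, or answer 404."""
        handler = self.routes.get((method, path))
        if handler is None:
            return 404, {"error": "not found"}
        return handler(body)
```

Keeping the routing table as plain data makes it straightforward to add further operations, such as registering a monitoring template, without changing `handle`.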
Brief description of the drawings
Fig. 1 is a system block diagram of the monitoring platform of the present invention;
Fig. 2 is a system block diagram of the streaming data collection framework of the present invention.
Detailed description of the embodiments
The specific embodiments of the present invention are described in detail below with reference to Fig. 1 and Fig. 2, without limiting the claims of the present invention in any way.
A computer server cluster log monitoring method, comprising the following steps:
A. Monitor the massive machine code instructions generated by the mainboard of each server in the computer server cluster during operation, and collect the machine code instruction data generated by the mainboard in real time using a real-time streaming data collection framework from big data technology, wherein:
The machine code instructions include at least memory instructions, CPU instructions, disk I/O instructions, network traffic instructions, TCP connection count instructions and application process parameter instructions;
By means of RHCS technology, the monitoring service of the main monitoring node is deployed on two servers so that the monitoring service remains highly available, and the monitoring service automatically deploys collection agents on the servers in the computer server cluster. The main monitoring node can automatically scan for, discover and add newly added servers in the cluster to the watch list, and automatically add monitoring items for the monitored servers. The two servers have a high-availability (High Availability, HA) design and work in master-slave mode: the master server does the work while the slave server stands by and monitors it; when the master server goes down, the slave server takes over all of the master server's work; after the master server returns to normal, the service is switched back to the master server automatically or manually, according to the user's settings;
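The automatic discovery step (scan, find new servers, add them to the watch list with their monitoring items) can be sketched as a set difference between the scan result and the current watch list. The class name, the `sync` method and the default item names are hypothetical, and how the scan itself is performed is outside the sketch.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical default monitoring items attached to every new server.
DEFAULT_ITEMS = ["cpu", "mem", "disk_io", "net", "tcp_conn", "process"]


class WatchList:
    """Watch list kept by the main monitoring node."""

    def __init__(self) -> None:
        self.items: Dict[str, List[str]] = {}

    def sync(self, scanned_hosts: Iterable[str]) -> Set[str]:
        """Add hosts seen in a scan that are not yet watched; return the new ones."""
        new_hosts = set(scanned_hosts) - set(self.items)
        for host in new_hosts:
            # Each new server gets its own copy of the default items,
            # so later per-server changes do not affect other servers.
            self.items[host] = list(DEFAULT_ITEMS)
        return new_hosts
```

Calling `sync` repeatedly with overlapping scan results is harmless: hosts that are already watched keep their existing monitoring items.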
As shown in Fig. 2, the streaming data collection framework comprises N collection agent modules, a collection service module, a data filtering module, a dynamic performance balancing module and a distributed stream computing module. The N collection agent modules run on the N monitored nodes (that is, on the monitored servers) and collect the massive machine code instructions on those nodes in real time; the collection items of the collection agent modules can be added and configured, and the thresholds associated with each collection item can be set. The collection service module runs on the main monitoring node, collects in real time the machine code instructions sent by each collection agent module and forwards them to the data filtering module. The data filtering module runs on the main monitoring node, receives the machine code instructions sent by the collection service module, performs preliminary filtering on them and sends the filtered machine code instructions to the distributed stream computing module. The dynamic performance balancing module dynamically balances service performance between the collection agent modules and the collection service module, between the collection service module and the data filtering module, and between the data filtering module and the distributed stream computing module; it uses Flume and Kafka technology to balance performance between the data collection service and the data filtering service, ensuring that both can maintain high throughput.
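Balancing a fast stage against a slower one usually comes down to a bounded buffer between them. The sketch below shows that idea with Python's standard `queue` module; it is an in-memory stand-in under stated assumptions, not a description of how Flume or Kafka are configured in the platform, and the class and method names are hypothetical.

```python
import queue
from typing import List


class BalancedStage:
    """Bounded buffer between a fast producer and a slower consumer."""

    def __init__(self, capacity: int) -> None:
        self._buffer: "queue.Queue[str]" = queue.Queue(maxsize=capacity)
        self.rejected = 0

    def offer(self, item: str) -> bool:
        """Try to enqueue; False tells the producer to slow down and retry."""
        try:
            self._buffer.put_nowait(item)
            return True
        except queue.Full:
            self.rejected += 1
            return False

    def take_batch(self, max_items: int) -> List[str]:
        """Drain up to max_items so the consumer can work in batches."""
        batch: List[str] = []
        while len(batch) < max_items:
            try:
                batch.append(self._buffer.get_nowait())
            except queue.Empty:
                break
        return batch
```

The `rejected` counter gives a simple signal of sustained imbalance: if it keeps rising, the consumer side needs more capacity.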
B. Classify and forward the collected machine code instructions, including real-time filtering, processing and storage. A distributed stream computing service based on the Flume+Kafka+Storm framework performs preliminary filtering on the semi-structured machine code data generated by the servers and retains the data needed for analysis, then translates that data into readable, regular structured and semi-structured data. The translated data is stored in HBase, a high-performance column-oriented database, and Phoenix is combined with it to provide real-time external read/write access to the data. HBase uses the HDFS distributed file system for persistent data storage, while Hive (a Hadoop-based data warehouse tool) provides queries over static structured data using a SQL-like language; under the hood, the queries are compiled into MapReduce programs that run on Hadoop. When the utilization of the data storage nodes becomes too high, new storage nodes are added to scale horizontally and keep step B running normally.
C. Analyze and display the data processed in step B in real time, and obtain and store the real-time analysis results. That is, the data stream obtained in step B is displayed as real-time charts and reports through a Web front end based on Nginx+PHP; at the same time the data stream is analyzed, abnormal data values are flagged, and an availability report for the monitored computer server cluster is produced from the analysis results. The charts and reports can be displayed using the front-end data visualization framework EChart and can be customized into various chart sets and network topology diagrams; in addition, a set of APIs extended in the PHP language is provided for managing monitored servers, reading collected monitoring logs, and developing custom monitoring views and data visualization interfaces.
D. Analyze the real-time analysis results together with the stored historical analysis results to identify potential anomalies in the computer servers, and issue early warnings and notifications. That is, by analyzing the real-time data together with the stored trend data and historical data, possible failures and anomalies are predicted, and a recommended solution is matched and provided. During prediction, the deep learning framework Deeplearning4j learns autonomously from the historical data, improving the accuracy and timeliness of event prediction. Alarm notification services such as e-mail, SMS or WeChat are connected through an open alarm API. To improve early-warning performance, a mass alarm event library SDK can also be established, so that the analysis of the real-time analysis results and the stored historical analysis results can additionally draw on the mass alarm event library.
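Connecting several notification services behind one alarm entry point can be sketched with pluggable callables. The class `AlarmApi`, its methods and the channel names are assumptions for illustration; real e-mail, SMS or WeChat delivery would sit inside the registered callables and is not shown.

```python
from typing import Callable, Dict, List

Notifier = Callable[[str], None]


class AlarmApi:
    """Tiny stand-in for an open alarm API with pluggable channels."""

    def __init__(self) -> None:
        self._channels: Dict[str, Notifier] = {}

    def register(self, name: str, notifier: Notifier) -> None:
        """Attach a delivery function under a channel name such as 'mail'."""
        self._channels[name] = notifier

    def raise_alarm(self, message: str, channels: List[str]) -> List[str]:
        """Send the message to each known channel; return the channels used."""
        delivered = []
        for name in channels:
            notifier = self._channels.get(name)
            if notifier is not None:
                notifier(message)
                delivered.append(name)
        return delivered
```

Unknown channel names are skipped rather than raising an error, so a missing integration does not block delivery through the channels that are configured.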
Based on the above computer server cluster log monitoring method, a computer server cluster log monitoring platform is established; the system architecture of the platform is shown in Fig. 1.
The computer server cluster log monitoring platform comprises a host device, a storage device and a network communication device. The host device comprises a monitoring system, a message system, a storage system, an analysis system, a display system and an alarm system; the storage device comprises a file system and a database system; the network communication device comprises a modem, a router and a network switch. The host device has a high-availability design and uses master-slave server mode: the master server does the work while the slave server stands by and monitors it; when the master server goes down, the slave server takes over all of the master server's work; after the master server returns to normal, the service is switched back to the master server automatically or manually, according to the user's settings;
The monitoring system uses the real-time streaming data collection framework (shown in Fig. 2) to collect, in real time, the machine code instructions generated by the mainboard of each monitored computer server during operation and sends them to the message system;
The message system classifies and forwards the collected data, including real-time filtering and processing;
The storage system uses HBase high-performance column-oriented database technology to read and write the data processed by the message system quickly and stores it in the file system;
The analysis system performs real-time computation and trend prediction analysis on the stored data, obtains the processing results and sends them to the display system and the database system, and also sends any processing result that indicates a predicted trend anomaly to the alarm system;
The display system presents the received processing results as charts and reports;
The alarm system issues alarms to O&M personnel according to the results it receives;
The file system is an HDFS distributed file system and stores the data processed by the message system;
The database system stores the processing results received from the analysis system;
The network communication device provides communication between the host device and the monitored computer servers, and between the host device and the storage device.
The computer server cluster log monitoring method of the present invention, and the system platform developed from it, have the following features:
(1) The method uses the stream computing of big data technology together with HBase, a database offering high concurrency and fast reads and writes. Data can therefore be displayed in real time and also written to HDFS in Hadoop for offline search, which gives the system good real-time performance.
(2) Because big data technology is used, log data is processed and stored with the Hadoop ecosystem tools Flume, Kafka, Storm, HBase, and HDFS. When data node utilization is too high, the system can be scaled horizontally by adding server nodes without affecting normal operation. Compared with some common monitoring tools on the Internet that use a single traditional database such as MySQL or Oracle, the present invention is easier to extend and supports access by more monitored devices, so its scalability is good.
(3) Whether for small clusters or for large and very large clusters, the present invention supports automated deployment, and the deployment time for each monitored node can be kept within 1 second. By setting automatic discovery rules, the system can automatically monitor nodes newly added to the cluster. Through its various monitoring templates, the system can automatically collect each server's CPU data, memory data, network data, application data, and so on, and automatically generate data curves and anomaly reports, giving a high degree of automation.
(4) The monitoring server uses a high-availability design with a master-slave server mode: the master server provides the service while the slave server stands by and monitors it. When the master server goes down, the slave server takes over all of its work; after the master server returns to normal, the service is switched back to the master server automatically or manually, according to the user's setting.
(5) The services within the system communicate with one another through RESTful APIs, and the system can also use these APIs to communicate with external platforms or applications. Adding servers, adding monitoring templates, reading monitoring data, and raising anomaly alerts can all be handled efficiently through the system's open APIs, which also provide a convenient route for secondary development and for integration with other service platforms, giving the system rich API extensibility.
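As a non-limiting illustration of the master-slave behaviour in feature (4), the takeover and the automatic or manual switch-back may be sketched as a small state holder. This is not the RHCS mechanism named in the claims; the class `MonitorPair`, the enum `Failback`, and all method names are assumptions introduced only for this sketch.

```python
from enum import Enum


class Failback(Enum):
    """User setting that decides how service returns to the master server."""
    AUTOMATIC = "automatic"
    MANUAL = "manual"


class MonitorPair:
    """Tracks which of the two monitoring servers is currently serving."""

    def __init__(self, failback: Failback) -> None:
        self.failback = failback
        self.master_up = True
        self.active = "master"

    def master_failed(self) -> None:
        # The slave takes over all work as soon as the master goes down.
        self.master_up = False
        self.active = "slave"

    def master_recovered(self) -> None:
        self.master_up = True
        # With automatic failback the service returns to the master at once;
        # with manual failback it stays on the slave until an operator acts.
        if self.failback is Failback.AUTOMATIC:
            self.active = "master"

    def operator_switch_back(self) -> None:
        # A manual switch-back is only allowed once the master is healthy again.
        if not self.master_up:
            raise RuntimeError("master server is still down")
        self.active = "master"


if __name__ == "__main__":
    pair = MonitorPair(Failback.MANUAL)
    pair.master_failed()
    pair.master_recovered()
    print(pair.active)  # still on the slave until the operator switches back
    pair.operator_switch_back()
    print(pair.active)
```

Keeping the failback policy as an explicit setting mirrors the text's statement that the switch-back follows the user's configuration rather than being hard-wired.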
In conclusion the invention has the following advantages that
1. having good real-time;
2. having scalability;
3. high degree of automation;
4. High Availabitity;
5. API extension abundant.
It should be understood that the above specific description is intended only to illustrate the present invention and is not limited to the technical solutions described in the embodiments of the invention. Those skilled in the art should understand that the present invention may still be modified or equivalently replaced to achieve the same technical effect; as long as the needs of use are met, such variations fall within the protection scope of the present invention.

Claims (7)

1. A computer server cluster log monitoring method, comprising the following specific steps:
A. Monitoring the massive volume of machine code instructions generated by the mainboard of each server in the computer server cluster during operation, and collecting the machine code instruction data generated by the mainboard in real time using a real-time streaming data acquisition framework from big data technology, the machine code instructions including at least memory instructions, CPU instructions, disk I/O instructions, network traffic instructions, TCP connection count instructions, and application process parameter instructions; wherein:
Using RHCS technology, the monitoring service of the main monitoring node is deployed on 2 servers so as to keep the monitoring service in a high-availability state, and the monitoring service automatically deploys collection agents on the servers in the computer server cluster; the main monitoring node can automatically scan for and discover servers newly added to the cluster, add them to the monitoring list, and automatically add the monitored items of each monitored server; the 2 servers operate in master-slave mode: the master server does the work while the slave server stands by and monitors it; when the master server goes down, the slave server takes over all of its work; after the master server returns to normal, the service is switched back to the master server automatically or manually, according to the user's setting;
The streaming data acquisition framework includes N collection agent modules, a collection service module, a data filtering module, a dynamic performance balancing module, and a distributed stream computing module; the N collection agent modules run on N monitored nodes and collect the massive volume of machine code instructions on the monitored nodes in real time; the collection service module runs on the main monitoring node, gathers in real time the machine code instructions transmitted by each collection agent module, and sends them to the data filtering module; the data filtering module runs on the main monitoring node, receives the machine code instructions sent by the collection service module, performs preliminary filtering on them, and sends the filtered machine code instructions to the distributed stream computing module; the dynamic performance balancing module dynamically balances service performance between the collection agent modules and the collection service module, between the collection service module and the data filtering module, and between the data filtering module and the distributed stream computing module;
B. Classifying and forwarding the collected machine code instructions, including real-time filtering, processing, and storage; wherein:
The real-time filtering, processing, and storage specifically comprise: a distributed stream computing service based on the Flume+Kafka+Storm framework performs preliminary filtering on the semi-structured machine code data generated by the servers and retains the data to be used for analysis; that data is then translated into readable, regular structured and semi-structured data; the translated data is stored in the HBase high-performance column-oriented database, which, combined with Phoenix, provides real-time reads and writes of the data to external consumers; HBase uses the HDFS distributed file system for persistent data storage; Hive is used to provide queries over static structured data in an SQL-like language, which is compiled into MapReduce programs that run on Hadoop; when the utilization of the data storage nodes is too high, horizontal scaling is performed by adding new storage nodes to guarantee the normal operation of step B;
C. Analyzing and displaying the data processed in step B in real time, and obtaining and storing real-time analysis results;
D. Analyzing the real-time analysis results together with the stored historical analysis results to obtain potential abnormal results for the computer servers, and issuing early warnings and notifications.
2. The computer server cluster log monitoring method according to claim 1, characterized in that: the collection items of the collection agent module described in step A can be added and configured, and the thresholds associated with the collection items can be set.
3. The computer server cluster log monitoring method according to claim 1, characterized in that step C specifically comprises: displaying the data stream obtained in step B as real-time charts and reports through a Web front end based on Nginx+PHP, while performing data analysis on the data stream, marking abnormal data values, and obtaining an availability report for the monitored computer server cluster according to the data analysis results.
4. The computer server cluster log monitoring method according to claim 3, characterized in that: the front-end data visualization framework EChart is used to display the charts and reports, which can be customized into various chart collections and network topology diagrams; in addition, a set of APIs is extended using the PHP language for managing monitored servers, reading collected monitoring logs, and independently developing monitoring views and data visualization interfaces, among others.
5. The computer server cluster log monitoring method according to claim 1, characterized in that step D specifically comprises: analyzing the real-time data, the stored trend data, and the historical data to predict faults and anomalies that may occur, and matching and providing recommended solutions; during prediction, the deep learning framework Deeplearning4j is used to learn autonomously from the historical data so as to improve the accuracy and timeliness of event prediction; and, through an open alarm API, connecting to email, SMS, or WeChat alarm notification services.
6. The computer server cluster log monitoring method according to claim 1, characterized in that: a massive alarm event library SDK is established, and the analysis in step D is performed in combination with the massive alarm event library.
7. A monitoring platform based on the computer server cluster log monitoring method according to claim 1, characterized in that: it comprises host equipment, storage equipment, and network communication equipment; the host equipment includes a monitoring system, a message system, a storage system, an analysis system, a display system, and an alarm system; the storage equipment includes a file system and a database system; the network communication equipment includes a modem, a router, and a network switch; the host equipment uses a high-availability design with a master-slave server mode: the master server does the work while the slave server stands by and monitors it; when the master server goes down, the slave server takes over all of its work; after the master server returns to normal, the service is switched back to the master server automatically or manually, according to the user's setting;
The monitoring system uses a real-time streaming data acquisition framework to collect, in real time, the machine code instructions generated by the mainboard of each monitored computer server during operation, and sends them to the message system;
The message system classifies and forwards the collected data, including real-time filtering and processing;
The storage system uses HBase, a high-performance column-oriented database, to read and write the data processed by the message system quickly, and stores the data in the file system;
The analysis system performs real-time computation and trend prediction analysis on the stored data, obtains processing results, and sends them to both the display system and the database system; results that indicate an abnormal trend prediction are also sent to the alarm system;
The display system presents the received processing results as charts and reports;
The alarm system issues alerts to operations and maintenance personnel according to the results it receives;
The file system is the HDFS distributed file system and stores the data processed by the message system;
The database system stores the processing results received from the analysis system;
The network communication equipment provides communication between the host equipment and the monitored computer servers, and between the host equipment and the storage equipment.
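
As a non-limiting illustration of the early warning in step D combined with the settable thresholds of claim 2, the sketch below fits a straight line to recent samples of one collected item and estimates how many sampling intervals remain before the configured threshold is reached. A least-squares line is used here purely as a simple stand-in; it is not the Deeplearning4j-based learning recited in claim 5. The function names `linear_trend` and `steps_until_threshold` and the sample values are assumptions made for this sketch.

```python
import math
from typing import List, Optional, Tuple


def linear_trend(values: List[float]) -> Tuple[float, float]:
    """Least-squares slope and intercept for samples taken at t = 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = (n - 1) / 2
    mean_v = sum(values) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in enumerate(values))
    var = sum((t - mean_t) ** 2 for t in range(n))
    slope = cov / var
    return slope, mean_v - slope * mean_t


def steps_until_threshold(values: List[float], threshold: float) -> Optional[int]:
    """Sampling intervals until the fitted line reaches the threshold.

    Returns 0 if the latest sample already meets the threshold, and None if
    the fitted trend is flat or falling (no crossing is predicted).
    """
    if values[-1] >= threshold:
        return 0
    slope, intercept = linear_trend(values)
    if slope <= 0:
        return None
    crossing_t = (threshold - intercept) / slope
    last_t = len(values) - 1
    return max(0, math.ceil(crossing_t - last_t))


if __name__ == "__main__":
    # Hypothetical disk utilization samples (percent) and a configured threshold.
    samples = [50.0, 60.0, 70.0, 80.0]
    remaining = steps_until_threshold(samples, threshold=100.0)
    if remaining is not None:
        print(f"early warning: threshold expected in {remaining} interval(s)")
```

A returned value other than `None` could be what triggers the early warning and notification, while the threshold itself stays a per-item configuration value.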
CN201711353494.1A 2017-12-15 2017-12-15 Computer server cluster log monitoring method and monitor supervision platform Expired - Fee Related CN107943668B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711353494.1A CN107943668B (en) 2017-12-15 2017-12-15 Computer server cluster log monitoring method and monitor supervision platform

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711353494.1A CN107943668B (en) 2017-12-15 2017-12-15 Computer server cluster log monitoring method and monitor supervision platform

Publications (2)

Publication Number Publication Date
CN107943668A CN107943668A (en) 2018-04-20
CN107943668B true CN107943668B (en) 2019-02-26

Family

ID=61943544

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711353494.1A Expired - Fee Related CN107943668B (en) 2017-12-15 2017-12-15 Computer server cluster log monitoring method and monitor supervision platform

Country Status (1)

Country Link
CN (1) CN107943668B (en)

Families Citing this family (51)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108923952B (en) * 2018-05-31 2021-11-30 北京百度网讯科技有限公司 Fault diagnosis method, equipment and storage medium based on service monitoring index
CN108804679A (en) * 2018-06-12 2018-11-13 云南电网有限责任公司信息中心 A kind of operation system user's operation monitoring data method for visualizing
CN109144765B (en) * 2018-08-21 2024-02-02 平安科技(深圳)有限公司 Report generation method, report generation device, computer equipment and storage medium
CN109344180A (en) * 2018-08-21 2019-02-15 中国平安人寿保险股份有限公司 Method, apparatus, computer equipment and the storage medium that display data obtains
CN109034423B (en) * 2018-08-29 2023-04-18 郑州云海信息技术有限公司 Fault early warning judgment method, device, equipment and storage medium
CN109408320A (en) * 2018-09-03 2019-03-01 中国平安人寿保险股份有限公司 Abnormality eliminating method, device, computer equipment and storage medium are developed in front end
CN109359014A (en) * 2018-09-04 2019-02-19 武汉华信联创技术工程有限公司 A kind of computer operation condition monitoring method, system and storage medium
CN109117345A (en) * 2018-09-05 2019-01-01 深圳市木瓜移动科技有限公司 Log monitoring method and distributed data system
CN109189847A (en) * 2018-09-11 2019-01-11 国网山东省电力公司莱芜供电公司 A kind of distribution transforming decreasing loss detection prompt system and method
CN109522287B (en) * 2018-09-18 2023-08-18 平安科技(深圳)有限公司 Monitoring method, system, equipment and medium for distributed file storage cluster
CN110928740A (en) * 2018-09-20 2020-03-27 中国石油化工股份有限公司 Centralized visualization method and system for operation and maintenance data of cloud computing center
CN109542946A (en) * 2018-10-26 2019-03-29 贵州斯曼特信息技术开发有限责任公司 It is a kind of to calculate big data system and method in real time
CN109684161B (en) * 2018-11-02 2022-05-03 深圳壹账通智能科技有限公司 Data analysis method, data analysis device, server and storage medium
CN109408448A (en) * 2018-12-05 2019-03-01 江苏恒创软件有限公司 One kind can carry out centralized processing integration data to data and show platform
CN111382042A (en) * 2018-12-29 2020-07-07 上海北塔软件股份有限公司 Log management method based on big data stream type calculation
CN109739828B (en) * 2018-12-29 2021-06-29 咪咕文化科技有限公司 Data processing method and device and computer readable storage medium
CN109617750A (en) * 2019-01-31 2019-04-12 国网电子商务有限公司 A kind of service method for early warning and gateway
CN110059140A (en) * 2019-03-29 2019-07-26 国网福建省电力有限公司 A method of data storage is carried out based on Oracle data and Hbase data
CN110119421A (en) * 2019-04-03 2019-08-13 昆明理工大学 A kind of electric power stealing user identification method based on Spark flow sorter
CN110113386A (en) * 2019-04-16 2019-08-09 苏州浪潮智能科技有限公司 A kind of power of MDC data center and environmental monitoring system method of data synchronization
CN110287079A (en) * 2019-05-14 2019-09-27 中山大学 A kind of cluster Automatic monitoring systems and method
CN110287081A (en) * 2019-06-21 2019-09-27 腾讯科技(成都)有限公司 A kind of service monitoring system and method
CN110297745A (en) * 2019-07-04 2019-10-01 中山大学 A kind of Fault Locating Method and system storing monitoring system
CN110309030A (en) * 2019-07-05 2019-10-08 亿玛创新网络(天津)有限公司 Log analysis monitoring system and method based on ELK and Zabbix
CN110377488A (en) * 2019-07-15 2019-10-25 福建威盾科技集团有限公司 A kind of method and system for unifying O&M and dynamic expansion
CN112445674A (en) * 2019-08-30 2021-03-05 中国石油化工股份有限公司 Data processing method and storage medium of computer cluster
CN110659182A (en) * 2019-09-12 2020-01-07 无锡江南计算技术研究所 High-performance computer monitoring method and system
CN110765189A (en) * 2019-09-18 2020-02-07 苏宁云计算有限公司 Exception management method and system for Internet products
CN110677304A (en) * 2019-10-11 2020-01-10 广州趣丸网络科技有限公司 Distributed problem tracking system and equipment
CN110890988B (en) * 2019-12-02 2022-04-22 安徽三实信息技术服务有限公司 Server cluster operation monitoring system
CN111049898A (en) * 2019-12-10 2020-04-21 杭州东方通信软件技术有限公司 Method and system for realizing cross-domain architecture of computing cluster resources
CN110912786B (en) * 2019-12-27 2021-07-16 深圳市星砺达科技有限公司 Gateway pressure testing method and device, computer equipment and storage medium
CN111225045B (en) * 2019-12-31 2022-12-27 苏州浪潮智能科技有限公司 HIVE high-availability early warning method, device and computer readable storage medium
CN111339142A (en) * 2020-02-26 2020-06-26 广州信安数据有限公司 Data monitoring response method, computer readable storage medium and data driving platform
CN111324513B (en) * 2020-02-29 2022-12-27 苏州浪潮智能科技有限公司 Monitoring management method and system for artificial intelligence development platform
CN111600856B (en) * 2020-03-07 2023-03-31 浙江齐治科技股份有限公司 Safety system of operation and maintenance of data center
CN112115026B (en) * 2020-09-15 2022-09-16 招商局金融科技有限公司 Server cluster monitoring method and device, electronic equipment and readable storage medium
CN112437145A (en) * 2020-11-18 2021-03-02 北京浪潮数据技术有限公司 Server cluster management method and device and related components
CN114661538B (en) * 2020-12-23 2023-04-28 金篆信科有限责任公司 Distributed database monitoring method and device, electronic equipment and storage medium
CN112769622A (en) * 2021-01-18 2021-05-07 孙冬英 Cluster service fault early warning system based on RPC service monitoring
CN113282559A (en) * 2021-06-04 2021-08-20 青岛海尔科技有限公司 Computer log classification method and device, storage medium and electronic device
WO2023279815A1 (en) * 2021-07-08 2023-01-12 华为技术有限公司 Performance monitoring system and related method
CN113342621A (en) * 2021-07-14 2021-09-03 芯河半导体科技(无锡)有限公司 System for monitoring and testing idle time of machine host and giving alarm
CN113688005B (en) * 2021-08-09 2022-08-26 山东亚泽信息技术有限公司 Operation and maintenance monitoring method and system
CN113900898B (en) * 2021-10-19 2024-09-03 北京金山云网络技术有限公司 Data processing system, equipment and medium
CN114500232A (en) * 2022-01-24 2022-05-13 上海华力微电子有限公司 Factory network middleware monitoring system
CN114553732A (en) * 2022-03-08 2022-05-27 北京月新时代科技股份有限公司 Technology for automatically acquiring equipment performance based on equipment
CN114448831B (en) * 2022-03-18 2023-09-01 以萨技术股份有限公司 Method and system for monitoring state of servers to which clusters belong
CN115834696B (en) * 2022-10-20 2023-08-01 北京新数科技有限公司 Database performance monitoring platform data acquisition device
CN116112407A (en) * 2022-12-28 2023-05-12 上海学登信息科技有限公司 Network flow data acquisition system
CN116911807B (en) * 2023-09-13 2023-12-05 成都秦川物联网科技股份有限公司 Intelligent gas data center flow visual management method and Internet of things system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101707632A (en) * 2009-10-28 2010-05-12 浪潮电子信息产业股份有限公司 Method for dynamically monitoring performance of server cluster and alarming real-timely
CN102938710A (en) * 2012-11-14 2013-02-20 北京奇虎科技有限公司 Monitoring system and method for large-scale servers
CN104618343A (en) * 2015-01-06 2015-05-13 中国科学院信息工程研究所 Method and system for detecting website threat based on real-time log
CN105868075A (en) * 2016-03-31 2016-08-17 浪潮通信信息系统有限公司 System and method for monitoring and analyzing large amount of logs in real time

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014147699A1 (en) * 2013-03-18 2014-09-25 富士通株式会社 Management device, method, and program

Also Published As

Publication number Publication date
CN107943668A (en) 2018-04-20

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20190226