CN107943668B - Computer server cluster log monitoring method and monitor supervision platform - Google Patents
- Publication number: CN107943668B
- Application number: CN201711353494.1A
- Authority: CN (China)
- Prior art keywords: data, server, real time, module, monitoring
- Legal status: Expired - Fee Related (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3003—Monitoring arrangements specially adapted to the computing system or computing system component being monitored
- G06F11/3006—Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
- G06F11/34—Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
- G06F11/3452—Performance evaluation by statistical analysis
- G06F11/3466—Performance evaluation by tracing or monitoring
- G06F11/3476—Data logging
Abstract
The present invention relates to the field of computer technology, and specifically to a computer server cluster log monitoring method and monitoring platform. The monitoring method comprises the following steps: A. monitoring the massive volume of machine code instructions generated by the mainboard of each server in the computer server cluster while it runs, and collecting those instructions in real time using a real-time streaming data acquisition framework from big data technology; B. classifying and forwarding the collected data, including real-time filtering, processing and storage; C. analyzing and displaying the data processed in step B in real time, and obtaining and storing the real-time analysis results; D. analyzing the real-time analysis results together with the stored historical analysis results to identify potential abnormalities of the computer servers, and issuing early warnings and notifications. The invention offers real-time operation, scalability, automation, high availability and rich API extension.
Description
Technical field
The present invention relates to the field of computer technology, and specifically to a computer server cluster log monitoring method and monitoring platform.
Background
With the arrival of the information age, technologies such as big data, cloud computing and machine learning have become research hotspots in the computer field. These technologies share a common characteristic: they require complex, large-scale computation. A large number of computer servers are therefore often organized into one or more clusters that compute in parallel and jointly complete one or more computing tasks. As demand has grown in recent years, computer server clusters have expanded from a few dozen machines to hundreds, thousands or even tens of thousands.
As clusters grow, how to manage cluster performance indicators (for example, each server's CPU, memory and network usage and its disk I/O read/write activity) and how to discover and resolve cluster problems promptly have become a major challenge for machine-room operations and maintenance personnel.
In the prior art, operations and maintenance personnel mainly rely on two methods:
(1) regular machine-room inspections, checking whether any server's indicator light shows an alarm;
(2) free server monitoring software from the Internet, used to assist management.
However, as the number of servers grows, having machine-room staff periodically inspect thousands of servers and judge and troubleshoot problems by eye is not only too much work but also prone to false and missed detections. Using monitoring software from the Internet to assist management also carries serious security risks: because the internal structure of such software is unknown, using it rashly exposes the cluster to Trojan horse or hacker attacks. Moreover, such software is usually suitable only for clusters with a small number of servers. With a few dozen servers it performs well, but with hundreds, thousands or even tens of thousands of servers its performance drops significantly, to the point where it may be unable to support the cluster at all.
To address these problems, researchers in China have developed several log collection or monitoring methods specifically for computer server clusters. For example, the invention patent application with China Patent Publication No. CN105095502A discloses a log collection method for a cluster storage system. The method comprises a log management module, a log collection module and log agent modules. The log management module runs on the transmission monitoring node of the cluster, manages and coordinates the log collection module and the log agent modules, and is embedded in the business process of the cluster storage system. The log collection module also runs on the transmission monitoring node; it collects and manages the data pushed by multiple log agent modules and sorts and stores that data under a designated directory, and the size of the log collection module is configured dynamically according to the scale of the cluster. A log agent module runs on every node in the cluster, collects the logs of the cluster storage system on its node and pushes them to the log collection module. Each log agent module can monitor 1024 files, and the log transmission security level attributes in the log agent module's configuration file are E2E and SendOnly. Although this method can collect logs, it is not real-time, does not use distributed storage and provides no abnormality early warning. It also cannot directly visualize information about the servers in the cluster through a system platform, which hinders real-time monitoring by operations and maintenance personnel.
The invention patent application with China Patent Publication No. CN106326008 discloses a monitoring method for cluster systems, which mainly comprises the following steps. Step 1: collect the detailed attributes and basic working state of each node in the cluster system, and generate a report log of each node's basic working state. Step 2: from the basic working states obtained in step 1, judge whether any node exceeds the node threshold or has stopped working because of a fault; if the basic working state of several nodes exceeds the preset threshold or those nodes have stopped working, scan and tally the resource usage of the entire cluster system, judge whether that usage exceeds the system threshold, and generate a resource usage report log for the entire cluster system. Step 3: if the resource usage of the entire cluster system in step 2 does not exceed the system threshold, scan the cluster system for idle nodes and have them take over part of the jobs of the nodes whose basic working state exceeds the node threshold. Step 4: if the resource usage of the entire cluster system in step 2 exceeds the system threshold, the system ranks the priority of each job, and the lowest-priority tasks stop and wait in a queue. In that patent, a control terminal node scans every computer in the cluster to obtain its information. This approach cannot achieve real-time monitoring; in particular, when the cluster contains many computers, scanning takes a long time, and the network scans consume the cluster's network resources and degrade their quality. There is therefore an urgent need for a computer server cluster log monitoring method and platform that operates in real time, visualizes monitoring results and does not degrade the quality of the cluster's network resources.
Summary of the invention
In view of the problems in the prior art, the present invention provides a computer server cluster log monitoring method and monitoring platform that monitor in real time, visualize the monitoring results and do not degrade the quality of the cluster's network resources.
To achieve the above technical purpose, the technical solution of the present invention is as follows.
A computer server cluster log monitoring method, comprising the following steps:
A. monitoring the massive volume of machine code instructions generated by the mainboard of each server in the computer server cluster while it runs, and collecting the machine code instruction data generated by the mainboard in real time using a real-time streaming data acquisition framework from big data technology, the machine code instructions including at least memory instructions, CPU instructions, disk I/O instructions, network traffic instructions, TCP connection count instructions and application process parameter instructions;
B. classifying and forwarding the collected machine code instructions, including real-time filtering, processing and storage;
C. analyzing and displaying the data processed in step B in real time, and obtaining and storing the real-time analysis results;
D. analyzing the real-time analysis results together with the stored historical analysis results to identify potential abnormalities of the computer servers, and issuing early warnings and notifications.
As an improvement, in step A the monitoring service of the master monitoring node is deployed on two servers using RHCS technology, so that the monitoring service remains highly available. The monitoring service automatically deploys collection agents on the servers in the computer server cluster. The master monitoring node can automatically scan for and discover servers newly added to the cluster, add them to the watch list and automatically add the monitoring items for each monitored server. The two servers work in master-slave mode: the master server does the work while the slave server stands by and monitors it. When the master server goes down, the slave server takes over all of its work; after the master server returns to normal, the service is switched back to the master server automatically or manually, according to the user's settings.
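The master-slave takeover and switch-back behaviour described above can be illustrated with a small state model. This is only a sketch under stated assumptions: the class `MonitorPair`, the flag `auto_failback` and the heartbeat-style health check are hypothetical names chosen for illustration, and the sketch does not use or reproduce any RHCS configuration or API.

```python
from dataclasses import dataclass


@dataclass
class MonitorPair:
    """Two monitoring servers; exactly one of them runs the service."""

    auto_failback: bool = True   # assumed name for the user's switch-back setting
    active: str = "master"       # which server currently runs the service
    master_healthy: bool = True  # result of the most recent health check

    def on_heartbeat(self, master_alive: bool) -> str:
        """Apply one health-check result and return the active server."""
        self.master_healthy = master_alive
        if not master_alive:
            self.active = "slave"    # the slave takes over all work
        elif self.active == "slave" and self.auto_failback:
            self.active = "master"   # automatic switch back after recovery
        return self.active

    def manual_failback(self) -> str:
        """Operator-triggered switch back, allowed only if the master is healthy."""
        if self.master_healthy:
            self.active = "master"
        return self.active
```

With `auto_failback=False`, the service stays on the slave after the master recovers until `manual_failback()` is called, which corresponds to the manual mode mentioned in the text.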
Preferably, the streaming data acquisition framework comprises N collection agent modules, an acquisition service module, a data filtering module, a dynamic performance balancing module and a distributed stream computing module. The N collection agent modules run on the N monitored nodes and collect the massive volume of machine code instructions on those nodes in real time. The acquisition service module runs on the master monitoring node, collects in real time the machine code instructions sent by each collection agent module and forwards them to the data filtering module. The data filtering module runs on the master monitoring node, receives the machine code instructions sent by the acquisition service module, performs preliminary filtering on them and sends the filtered machine code instructions to the distributed stream computing module. The dynamic performance balancing module dynamically balances service performance between the collection agent modules and the acquisition service module, between the acquisition service module and the data filtering module, and between the data filtering module and the distributed stream computing module.
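The data path through these modules (agents, acquisition service, filter, stream computation) can be sketched with plain Python generators. This is a single-process illustration, not the distributed implementation: the function names, the record fields `node`, `metric` and `value`, and the "maximum per node" aggregation are all assumptions made for the example, and the dynamic performance balancing module is left out.

```python
from typing import Dict, Iterable, Iterator, List

Record = Dict[str, object]


def collection_agent(node: str, samples: Iterable[Record]) -> Iterator[Record]:
    """Runs on one monitored node: tag every raw sample with its node name."""
    for sample in samples:
        yield {"node": node, **sample}


def acquisition_service(agents: List[Iterator[Record]]) -> Iterator[Record]:
    """Runs on the master monitoring node: merge the streams of all agents."""
    for agent in agents:
        yield from agent


def data_filter(records: Iterable[Record]) -> Iterator[Record]:
    """Preliminary filter: keep only records that carry a numeric value."""
    for record in records:
        if isinstance(record.get("value"), (int, float)):
            yield record


def stream_compute(records: Iterable[Record]) -> Dict[str, float]:
    """Stand-in for the distributed stream computation: maximum value per node."""
    result: Dict[str, float] = {}
    for record in records:
        node = str(record["node"])
        value = float(record["value"])  # safe: the filter kept numeric values only
        result[node] = max(result.get(node, float("-inf")), value)
    return result
```

Chaining the stages as `stream_compute(data_filter(acquisition_service(agents)))` mirrors the order in which the text passes data from one module to the next.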
As an improvement, the collection items of the collection agent modules in step A can be added and configured, and the thresholds associated with each collection item can be set.
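A minimal sketch of configurable collection items with adjustable thresholds is given here. The class names `CollectionItem` and `AgentConfig` and the item names such as `cpu_percent` are hypothetical; the patent does not specify a configuration format.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CollectionItem:
    name: str            # hypothetical item name, e.g. "cpu_percent"
    threshold: float     # a sample above this value counts as a breach
    enabled: bool = True


class AgentConfig:
    """Holds the collection items of one collection agent."""

    def __init__(self) -> None:
        self._items: Dict[str, CollectionItem] = {}

    def add_item(self, name: str, threshold: float) -> None:
        """Add a new collection item, or replace an existing one."""
        self._items[name] = CollectionItem(name, threshold)

    def set_threshold(self, name: str, threshold: float) -> None:
        """Change the threshold of an existing item."""
        self._items[name].threshold = threshold

    def breaches(self, sample: Dict[str, float]) -> List[str]:
        """Return the sorted names of enabled items whose threshold is exceeded."""
        return sorted(
            name
            for name, item in self._items.items()
            if item.enabled and name in sample and sample[name] > item.threshold
        )
```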
Preferably, the real-time filtering, processing and storage in step B specifically comprise the following. A distributed stream computing service based on the Flume + Kafka + Storm framework performs preliminary filtering on the semi-structured machine code data generated by the servers and retains the data to be used for analysis. The retained data is then translated into readable, regular structured and semi-structured data. The translated data is stored in the HBase high-performance columnar database, which, combined with Phoenix, provides real-time external read/write access to the data. HBase uses the HDFS distributed file system for persistent data storage. Hive is used to query static structured data with an SQL-like language that is compiled into MapReduce programs and run on Hadoop at the bottom layer. When the utilization of the data storage nodes becomes too high, new storage nodes are added to scale horizontally and keep step B running normally.
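The filter-then-translate step can be illustrated on a single record. The sketch below assumes a raw format of whitespace-separated `key=value` tokens, a fixed set of metric names and a `node#timestamp` row key; all three are assumptions made for the example, since the patent does not describe the raw format or the HBase schema, and no Flume, Kafka, Storm or HBase API is used.

```python
from typing import Dict, Optional

# Assumed metric names that are worth keeping for analysis.
KEEP = ("cpu", "mem", "disk_io", "net", "tcp_conn")


def translate(raw: str) -> Optional[Dict[str, object]]:
    """Turn 'key=value' tokens into a structured row, or None if nothing is usable."""
    fields = dict(tok.split("=", 1) for tok in raw.split() if "=" in tok)
    if "ts" not in fields or "node" not in fields:
        return None                      # filtered out: cannot be attributed
    metrics: Dict[str, float] = {}
    for key in KEEP:
        if key in fields:
            try:
                metrics[key] = float(fields[key])
            except ValueError:
                continue                 # drop values that are not numeric
    if not metrics:
        return None                      # filtered out: no metric to analyze
    # Hypothetical row key layout: node name, then timestamp.
    return {"row_key": f"{fields['node']}#{fields['ts']}", "metrics": metrics}
```

Records that return `None` correspond to data dropped by the preliminary filter; the returned dictionary stands for the regular, structured form that would be written to storage.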
Preferably, step C specifically comprises: displaying the data stream obtained in step B as real-time charts and reports through a Web front end based on Nginx + PHP, while analyzing the data stream, marking abnormal data values and producing an availability report for the monitored computer server cluster from the analysis results.
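Marking abnormal values and deriving an availability figure can be sketched as follows. The fixed upper limit used as the abnormality rule and the definition of availability as the share of successful health checks are assumptions for illustration; the patent does not state how either is computed.

```python
from typing import Dict, List, Tuple


def mark_abnormal(values: List[float], limit: float) -> List[Tuple[float, bool]]:
    """Pair every value with a flag that is True when it exceeds the limit."""
    return [(value, value > limit) for value in values]


def availability_report(status: Dict[str, List[bool]]) -> Dict[str, float]:
    """Availability per server: percentage of checks in which the server was up."""
    return {
        node: round(100.0 * sum(checks) / len(checks), 2)
        for node, checks in status.items()
        if checks  # skip servers without any recorded check
    }
```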
Preferably, the charts and reports are displayed with the front-end data visualization framework ECharts and can be customized into various chart sets and network topology diagrams. In addition, a set of APIs is extended in the PHP language for managing monitored servers, reading collected monitoring logs, developing custom monitoring views and data visualization interfaces, and so on.
Preferably, step D specifically comprises: analyzing the real-time data, the stored trend data and the historical data to estimate failures and abnormalities that may occur, and matching and providing recommended solutions. During estimation, the deep learning framework Deeplearning4j learns autonomously from the historical data, which improves the accuracy and timeliness of event prediction. Alarm notification services over e-mail, SMS or WeChat are connected through an open alarm API.
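To make the idea of trend-based early warning concrete, the sketch below fits a least-squares line to equally spaced historical samples and estimates how many sampling intervals remain before a threshold is reached. This deliberately substitutes a simple linear fit for the Deeplearning4j model named in the text; it is not the patent's prediction method, and the function names are hypothetical.

```python
import math
from typing import List, Optional, Tuple


def linear_trend(values: List[float]) -> Tuple[float, float]:
    """Least-squares slope and intercept for samples taken at t = 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = (n - 1) / 2
    mean_v = sum(values) / n
    var_t = sum((t - mean_t) ** 2 for t in range(n))
    cov = sum((t - mean_t) * (v - mean_v) for t, v in enumerate(values))
    slope = cov / var_t
    return slope, mean_v - slope * mean_t


def steps_until(values: List[float], threshold: float) -> Optional[int]:
    """Intervals until the fitted line reaches the threshold, or None if it never does."""
    slope, intercept = linear_trend(values)
    last_t = len(values) - 1
    if intercept + slope * last_t >= threshold:
        return 0                         # the fitted value is already at the threshold
    if slope <= 0:
        return None                      # flat or falling trend: no crossing expected
    t_cross = (threshold - intercept) / slope
    return math.ceil(t_cross - last_t)
```

A result of `0` or a small positive number would be the trigger for an early warning, while `None` means the current trend gives no reason to alert.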
As an improvement, a mass alarm event library SDK is established, and the analysis in step D is performed in combination with the mass alarm event library.
A monitoring platform based on the above computer server cluster log monitoring method comprises host equipment, storage equipment and network communication equipment. The host equipment comprises a monitoring system, a message system, a storage system, an analysis system, a display system and an alarm system. The storage equipment comprises a file system and a database system. The network communication equipment comprises a modem, a router and a network switch. The host equipment uses a high-availability design in master-slave mode: the master server does the work while the slave server stands by and monitors it. When the master server goes down, the slave server takes over all of its work; after the master server returns to normal, the service is switched back to the master server automatically or manually, according to the user's settings;
The monitoring system uses the real-time streaming data acquisition framework to collect, in real time, the machine code instructions generated by the mainboard of each monitored computer server while it runs, and sends them to the message system;
The message system classifies and forwards the collected data, including real-time filtering and processing;
The storage system uses HBase high-performance columnar database technology to read and write the data processed by the message system quickly, and stores it in the file system;
The analysis system performs real-time computation and trend prediction analysis on the stored data, obtains the processing results and sends them to the display system and the database system, and sends any result that indicates a predicted abnormality to the alarm system;
The display system presents the received processing results as charts and reports;
The alarm system issues alarms to operations and maintenance personnel according to the results it receives;
The file system is the HDFS distributed file system and stores the data processed by the message system;
The database system stores the processing results received from the analysis system;
The network communication equipment carries the communication between the host equipment and the monitored computer servers, and between the host equipment and the storage equipment.
From the above description, it can be seen that the present invention has the following advantages:
1. Good real-time performance. The invention uses the stream computing of big data technology together with HBase, a database technology offering high concurrency and fast reads and writes, so it can not only display data in real time but also write the data to HDFS in Hadoop for storage and offline search.
2. Scalability. Because the invention uses big data technology and processes and stores log data with the Hadoop ecosystem tools Flume, Kafka, Storm, HBase and HDFS, the system can be scaled horizontally by adding server nodes when data node utilization becomes too high, without affecting normal operation. Compared with some common monitoring tools on the Internet, which use a single traditional database such as MySQL or Oracle, the invention is easier to extend and supports access by more monitored devices.
3. High degree of automation. For small clusters as well as large and very large ones, the invention supports automatic deployment, and the deployment time for each node can be kept within 1 second. With automatic discovery rules configured, the system automatically monitors nodes newly added to the cluster. Through its various monitoring templates, the system automatically collects each server's CPU data, memory data, network data, application data and so on, and automatically generates data curves and abnormality reports.
4. High availability. The monitoring server uses a high-availability design in master-slave mode: the master server does the work while the slave server stands by and monitors it. When the master server goes down, the slave server takes over all of its work; after the master server returns to normal, the service is switched back to the master server automatically or manually, according to the user's settings.
5. Rich API extension. The services within the system communicate with one another through RESTful APIs, and through the same APIs the system can communicate with external platforms or applications. Adding servers, adding monitoring templates, reading monitoring data and raising abnormality alarms can all be handled efficiently through the system's open APIs, which also provide a convenient route for secondary development and for integration into other service platforms.
Brief description of the drawings
Fig. 1 is a system block diagram of the monitoring platform of the present invention;
Fig. 2 is a system block diagram of the streaming data acquisition framework of the present invention.
Detailed description of the embodiments
Specific embodiments of the present invention are described in detail below with reference to Fig. 1 and Fig. 2; they do not limit the claims of the present invention in any way.
A computer server cluster log monitoring method comprises the following steps:
A. Monitor the massive volume of machine code instructions generated by the mainboard of each server in the computer server cluster while it runs, and collect the machine code instruction data generated by the mainboard in real time using a real-time streaming data acquisition framework from big data technology, wherein:
The machine code instructions include at least memory instructions, CPU instructions, disk I/O instructions, network traffic instructions, TCP connection count instructions and application process parameter instructions;
Using RHCS technology, the monitoring service of the master monitoring node is deployed on two servers, so that the monitoring service remains highly available. The monitoring service automatically deploys collection agents on the servers in the computer server cluster. The master monitoring node can automatically scan for and discover servers newly added to the cluster, add them to the watch list and automatically add the monitoring items for each monitored server. The two servers use a high-availability (HA) design and work in master-slave mode: the master server does the work while the slave server stands by and monitors it. When the master server goes down, the slave server takes over all of its work; after the master server returns to normal, the service is switched back to the master server automatically or manually, according to the user's settings;
As shown in Fig. 2, the streaming data acquisition framework comprises N collection agent modules, an acquisition service module, a data filtering module, a dynamic performance balancing module and a distributed stream computing module. The N collection agent modules run on the N monitored nodes (that is, on the monitored servers) and collect the massive volume of machine code instructions on those nodes in real time. The collection items of the collection agent modules can be added and configured, and the thresholds associated with each collection item can be set. The acquisition service module runs on the master monitoring node, collects in real time the machine code instructions sent by each collection agent module and forwards them to the data filtering module. The data filtering module runs on the master monitoring node, receives the machine code instructions sent by the acquisition service module, performs preliminary filtering on them and sends the filtered machine code instructions to the distributed stream computing module. The dynamic performance balancing module dynamically balances service performance between the collection agent modules and the acquisition service module, between the acquisition service module and the data filtering module, and between the data filtering module and the distributed stream computing module. It uses Flume and Kafka to balance the performance of the data acquisition service against that of the data filtering service, ensuring that both can sustain high throughput.
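One way to picture the balancing between a fast acquisition stage and a slower filtering stage is a bounded buffer placed between them. The sketch below uses Python's standard `queue.Queue` purely as an analogy: it is not Flume or Kafka, it does not reproduce how those tools buffer data, and the function name `run_buffered` and the integer records are hypothetical.

```python
import queue
import threading
from typing import List


def run_buffered(records: List[int], capacity: int) -> List[int]:
    """Pass records from a producer to a consumer through a bounded buffer."""
    buffer: "queue.Queue[object]" = queue.Queue(maxsize=capacity)
    done = object()                  # sentinel that marks the end of the stream
    received: List[int] = []

    def producer() -> None:
        # Stands in for the acquisition service.
        for record in records:
            buffer.put(record)       # blocks while the buffer is full
        buffer.put(done)

    def consumer() -> None:
        # Stands in for the data filtering service.
        while True:
            item = buffer.get()      # blocks while the buffer is empty
            if item is done:
                break
            received.append(item)    # type: ignore[arg-type]

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return received
```

Because `put` blocks when the buffer is full, the faster side is slowed to the pace of the slower side instead of overwhelming it, and no record is lost or reordered with a single producer and a single consumer.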
B. Classify and forward the collected machine code instructions, including real-time filtering, processing and storage. A distributed stream computing service based on the Flume + Kafka + Storm framework performs preliminary filtering on the semi-structured machine code data generated by the servers and retains the data to be used for analysis. The retained data is then translated into readable, regular structured and semi-structured data. The translated data is stored in the HBase high-performance columnar database, which, combined with Phoenix, provides real-time external read/write access to the data. HBase uses the HDFS distributed file system for persistent data storage. Hive (a data warehouse tool based on Hadoop) is used to query static structured data with an SQL-like language that is compiled into MapReduce programs and run on Hadoop at the bottom layer. When the utilization of the data storage nodes becomes too high, new storage nodes are added to scale horizontally and keep step B running normally.
C. Analyze and display the data processed in step B in real time, and obtain and store the real-time analysis results. That is, the data stream obtained in step B is displayed as real-time charts and reports through a Web front end based on Nginx + PHP, while the data stream is analyzed, abnormal data values are marked and an availability report for the monitored computer server cluster is produced from the analysis results. The charts and reports can be displayed with the front-end data visualization framework ECharts and can be customized into various chart sets and network topology diagrams. In addition, a set of APIs is extended in the PHP language for managing monitored servers, reading collected monitoring logs, developing custom monitoring views and data visualization interfaces, and so on.
D. Analyze the real-time analysis results together with the stored historical analysis results to identify potential abnormalities of the computer servers, and issue early warnings and notifications. That is, the real-time data, the stored trend data and the historical data are analyzed to estimate failures and abnormalities that may occur, and recommended solutions are matched and provided. During estimation, the deep learning framework Deeplearning4j learns autonomously from the historical data, which improves the accuracy and timeliness of event prediction. Alarm notification services over e-mail, SMS or WeChat are connected through an open alarm API. To improve early-warning performance, a mass alarm event library SDK can also be established, so that the analysis of the real-time results and the stored historical results is performed in combination with the mass alarm event library.
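Matching a detected condition against an alarm event library to obtain a recommended solution can be sketched as a simple lookup. The record type `AlarmEvent`, the matching rule and the two library entries are hypothetical and exist only for illustration; the patent does not describe the contents or interface of its event library SDK.

```python
from typing import List, NamedTuple, Optional


class AlarmEvent(NamedTuple):
    metric: str        # metric the entry applies to
    min_value: float   # entry matches when the observed value is at least this
    solution: str      # recommended action shown to the operator


def recommend(metric: str, value: float, library: List[AlarmEvent]) -> Optional[str]:
    """Return the solution of the matching entry with the highest min_value, or None."""
    matches = [e for e in library if e.metric == metric and value >= e.min_value]
    if not matches:
        return None
    return max(matches, key=lambda e: e.min_value).solution


# Hypothetical library entries, for illustration only.
LIBRARY = [
    AlarmEvent("disk_used_percent", 80.0, "archive old logs"),
    AlarmEvent("disk_used_percent", 95.0, "add a storage node"),
]
```

Choosing the entry with the highest matching `min_value` means that the most severe applicable rule wins when several entries match the same observation.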
Based on the above computer server cluster log monitoring method, a computer server cluster log monitoring platform is established; its system architecture is shown in Fig. 1.
The computer server cluster log monitoring platform comprises host equipment, storage equipment and network communication equipment. The host equipment comprises a monitoring system, a message system, a storage system, an analysis system, a display system and an alarm system. The storage equipment comprises a file system and a database system. The network communication equipment comprises a modem, a router and a network switch. The host equipment uses a high-availability design in master-slave mode: the master server does the work while the slave server stands by and monitors it. When the master server goes down, the slave server takes over all of its work; after the master server returns to normal, the service is switched back to the master server automatically or manually, according to the user's settings;
The monitoring system uses a real-time streaming data acquisition framework (shown in Figure 2) to collect, in real time, the machine code instructions generated by the mainboard of each monitored computer server during operation, and sends them to the message system;
The message system classifies and relays the collected data, including real-time filtering and processing;
The storage system uses HBase, a high-performance column-oriented database technology, to read and write the data processed by the message system quickly and to store it in the file system;
The analysis system performs real-time computation and trend prediction analysis on the stored data, obtains processing results, and sends them to the display system and the database system respectively; results that indicate an abnormal trend prediction are also sent to the warning system;
The display system presents the received processing results as images and reports;
The warning system issues alerts to operations and maintenance personnel according to the results it receives;
The file system is the HDFS distributed file system and stores the data processed by the message system;
The database system stores the processing results received from the analysis system;
The network communication apparatus provides communication between the host equipment and the monitored computer servers, and between the host equipment and the memory devices.
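The data flow among the message, analysis, display, database, and warning systems described above can be sketched roughly as follows. This is an in-memory Python illustration only: the storage stage (HBase and HDFS) is omitted, a simple per-metric average stands in for the real-time computation and trend analysis, and `KEPT_METRICS`, `Sinks`, `message_stage`, `analysis_stage`, and `route` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Assumed set of metric classes that the message system keeps.
KEPT_METRICS = {"cpu", "memory", "disk_io", "network"}


@dataclass
class Sinks:
    """In-memory stand-ins for the display, database, and warning systems."""
    display: List[dict] = field(default_factory=list)
    database: List[dict] = field(default_factory=list)
    alarm: List[dict] = field(default_factory=list)


def message_stage(raw_records: List[dict]) -> Dict[str, List[dict]]:
    """Filter out records with an unknown metric or no value,
    then classify the rest by metric."""
    classified: Dict[str, List[dict]] = {}
    for rec in raw_records:
        if rec.get("metric") in KEPT_METRICS and "value" in rec:
            classified.setdefault(rec["metric"], []).append(rec)
    return classified


def analysis_stage(classified: Dict[str, List[dict]],
                   thresholds: Dict[str, float]) -> List[dict]:
    """Average each metric and flag it as abnormal above its threshold."""
    results = []
    for metric, recs in classified.items():
        avg = sum(r["value"] for r in recs) / len(recs)
        limit = thresholds.get(metric, float("inf"))
        results.append({"metric": metric, "average": avg, "abnormal": avg > limit})
    return results


def route(results: List[dict], sinks: Sinks) -> None:
    """Send every result to display and database; abnormal ones also to alarm."""
    for res in results:
        sinks.display.append(res)
        sinks.database.append(res)
        if res["abnormal"]:
            sinks.alarm.append(res)
```

The point of the sketch is the fan-out at the end: every result reaches the display and database sinks, while only abnormal results reach the warning sink.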
In the computer server cluster log monitoring method of the present invention and the system platform developed based on this method:
(1) Stream computing from big data technology is combined with HBase, a database technology with high concurrency and fast reads and writes. Data can be displayed in real time and can also be written to HDFS in Hadoop for storage and offline search, which gives the system good real-time performance.
(2) Because big data technology is used, log data is processed and stored with the Hadoop ecosystem tools Flume, Kafka, Storm, HBase, and HDFS. When data node utilization becomes too high, the system can be scaled horizontally by adding server nodes without affecting normal operation. Compared with some common monitoring tools on the Internet, which use a single traditional database such as MySQL or Oracle, the present invention is easier to extend and supports access by more monitored devices, so its scalability is good.
(3) Whether for small clusters or for large and very large clusters, the present invention supports automated deployment, and the deployment time of each monitored node can be kept within 1 second. By setting automatic discovery rules, the system automatically monitors nodes newly added to the cluster. Through various monitoring templates, the system automatically collects each server's CPU data, memory data, network data, application data, and so on, and automatically generates data curve graphs and anomaly reports, giving a high degree of automation.
(4) The monitoring server uses a high-availability design in master-slave server mode: the master server provides the service while the slave server monitors it in standby. When the master server goes down, the slave server takes over all of the master server's work; after the master server returns to normal, the service is switched back to the master server automatically or manually, according to the user's settings.
(5) The services within the system communicate with one another through RESTful APIs, and the system can also use these APIs to communicate with external platforms or applications. Adding servers, adding monitoring templates, reading monitoring data, and raising anomaly alarms can all be handled efficiently through the system's open APIs, which also provide a convenient route for secondary development, integration with other service platforms, and similar application scenarios, so the system offers rich API extensions.
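The master-slave switchover in item (4) can be pictured as a small state machine. The Python sketch below is an assumption-laden illustration, not the RHCS mechanism itself: `Role`, `MonitorPair`, `on_heartbeat`, and `manual_failback` are hypothetical names, and a boolean heartbeat stands in for real health checking.

```python
from enum import Enum


class Role(Enum):
    MASTER = "master"
    SLAVE = "slave"


class MonitorPair:
    """Tracks which of the two monitoring servers is currently serving."""

    def __init__(self, auto_failback: bool = True):
        # auto_failback models the user's automatic/manual setting.
        self.auto_failback = auto_failback
        self.master_healthy = True
        self.active = Role.MASTER

    def on_heartbeat(self, master_alive: bool) -> Role:
        """Called by the standby slave each time it checks the master."""
        self.master_healthy = master_alive
        if not master_alive:
            # Master is down: the slave takes over all work.
            self.active = Role.SLAVE
        elif self.active is Role.SLAVE and self.auto_failback:
            # Master recovered and automatic switch-back is enabled.
            self.active = Role.MASTER
        return self.active

    def manual_failback(self) -> Role:
        """Operator-triggered switch-back; ignored while the master is down."""
        if self.master_healthy:
            self.active = Role.MASTER
        return self.active
```

With `auto_failback=False`, the slave keeps serving after the master recovers until `manual_failback()` is called, which mirrors the manual option in the text.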
In summary, the present invention has the following advantages:
1. Good real-time performance;
2. Scalability;
3. A high degree of automation;
4. High availability;
5. Rich API extensions.
It should be understood that the above specific description of the present invention serves only to illustrate the invention and is not limited to the technical solutions described in the embodiments of the invention. Those of ordinary skill in the art should understand that the present invention may still be modified or equivalently replaced to achieve the same technical effect; as long as the needs of use are met, such variations fall within the protection scope of the present invention.
Claims (7)
1. A computer server cluster log monitoring method, whose specific steps comprise:
A. Monitoring the massive machine code instructions generated by the mainboard of each server in the computer server cluster during operation, and collecting the machine code instruction data generated by the mainboard in real time using a real-time streaming data acquisition framework from big data technology, the machine code instructions including at least memory instructions, CPU instructions, disk I/O instructions, network traffic instructions, TCP connection count instructions, and application process parameter instructions; wherein:
Using RHCS technology, the monitoring service of the main monitoring node is deployed on 2 servers, and the monitoring service automatically deploys collection agents on the servers in the computer server cluster, so as to keep the monitoring service in a highly available state; the main monitoring node can automatically scan for, discover, and add servers newly added to the cluster to the monitoring list, and automatically add the monitored items of each monitored server; the 2 servers work in master-slave server mode: the master server does the work while the slave server monitors it in standby; when the master server goes down, the slave server takes over all of the master server's work; after the master server returns to normal, the service is switched back to the master server automatically or manually, according to the user's settings;
The streaming data acquisition framework comprises N collection agent modules, a collection service module, a data filtering module, a dynamic performance balancing module, and a distributed stream computing module; the N collection agent modules run on N monitored nodes and collect the massive machine code instructions on the monitored nodes in real time; the collection service module runs on the main monitoring node, collects in real time the machine code instructions transmitted by each collection agent module, and sends them to the data filtering module; the data filtering module runs on the main monitoring node, receives the machine code instructions sent by the collection service module, performs preliminary filtering on them, and sends the filtered machine code instructions to the distributed stream computing module; the dynamic performance balancing module dynamically balances service performance between the collection agent modules and the collection service module, between the collection service module and the data filtering module, and between the data filtering module and the distributed stream computing module;
B. Performing data classification and relay operations on the collected machine code instructions, including real-time filtering, processing, and storage; wherein:
The real-time filtering, processing, and storage specifically comprise: a distributed stream computing service based on the Flume+Kafka+Storm framework performs preliminary filtering on the machine-code semi-structured data generated by the servers and retains the data to be used for analysis; that portion of the data is then translated into readable, regular structured and semi-structured data; the translated data is stored in the HBase high-performance column-oriented database, which is combined with Phoenix to provide external real-time read/write access to the data; HBase uses the HDFS distributed file system for persistent data storage, while Hive provides queries over static structured data using an SQL-like language, which is compiled into MapReduce programs that run on Hadoop underneath; when data storage node utilization is too high, horizontal scaling is carried out by adding new storage nodes to ensure the normal operation of step B;
C. Analyzing and displaying in real time the data processed in step B, obtaining real-time analysis results, and storing them;
D. Analyzing the real-time analysis results together with the stored historical analysis results to obtain potential anomaly results for the computer servers, and issuing early warnings and notifications.
2. The computer server cluster log monitoring method according to claim 1, characterized in that: in step A, the collection items of the collection agent modules can be added and configured, and the relevant thresholds of the collection items can be set.
3. The computer server cluster log monitoring method according to claim 1, characterized in that step C specifically comprises: displaying the data stream obtained in step B as real-time curves and reports through a Web front end based on Nginx+PHP, while performing data analysis on the data stream, marking data values that are abnormal, and obtaining an availability report for the monitored computer server cluster from the data analysis results.
4. The computer server cluster log monitoring method according to claim 3, characterized in that: the front-end data visualization framework EChart is used to display the curves and reports, which can be customized into various chart sets and network topology diagrams; at the same time, a set of APIs is extended using the PHP language for managing monitored servers, reading collected monitoring logs, and independently developing monitoring views and data visualization interfaces, among others.
5. The computer server cluster log monitoring method according to claim 1, characterized in that step D specifically comprises: by analyzing the real-time data, the stored trend data, and the historical data, predicting failures and anomalies that may occur, and matching and providing recommended solutions; during prediction, the deep learning framework Deeplearning4j learns autonomously from the historical data, improving the accuracy and timeliness of event prediction; through an open alarm API, mail, SMS, or WeChat alarm notification services are connected.
6. The computer server cluster log monitoring method according to claim 1, characterized in that: a mass alarm event library SDK is established, and the analysis in step D additionally draws on the mass alarm event library.
7. A monitoring platform based on the computer server cluster log monitoring method according to claim 1, characterized in that: it comprises host equipment, memory devices, and network communication apparatus; the host equipment includes a monitoring system, a message system, a storage system, an analysis system, a display system, and a warning system; the memory devices include a file system and a database system; the network communication apparatus includes a modem, a router, and a network switch; the host equipment uses a high-availability design in master-slave server mode: the master server does the work while the slave server monitors it in standby; when the master server goes down, the slave server takes over all of the master server's work; after the master server returns to normal, the service is switched back to the master server automatically or manually, according to the user's settings;
The monitoring system uses a real-time streaming data acquisition framework to collect, in real time, the machine code instructions generated by the mainboard of each monitored computer server during operation, and sends them to the message system;
The message system classifies and relays the collected data, including real-time filtering and processing;
The storage system uses HBase, a high-performance column-oriented database technology, to read and write the data processed by the message system quickly and to store it in the file system;
The analysis system performs real-time computation and trend prediction analysis on the stored data, obtains processing results, and sends them to the display system and the database system respectively, while results that indicate an abnormal trend prediction are also sent to the warning system;
The display system presents the received processing results as images and reports;
The warning system issues alerts to operations and maintenance personnel according to the results it receives;
The file system is the HDFS distributed file system and stores the data processed by the message system;
The database system stores the processing results received from the analysis system;
The network communication apparatus provides communication between the host equipment and the monitored computer servers, and between the host equipment and the memory devices.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711353494.1A CN107943668B (en) | 2017-12-15 | 2017-12-15 | Computer server cluster log monitoring method and monitor supervision platform |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107943668A CN107943668A (en) | 2018-04-20 |
CN107943668B true CN107943668B (en) | 2019-02-26 |
Family
ID=61943544
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711353494.1A Expired - Fee Related CN107943668B (en) | 2017-12-15 | 2017-12-15 | Computer server cluster log monitoring method and monitor supervision platform |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107943668B (en) |
Families Citing this family (51)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108923952B (en) * | 2018-05-31 | 2021-11-30 | 北京百度网讯科技有限公司 | Fault diagnosis method, equipment and storage medium based on service monitoring index |
CN108804679A (en) * | 2018-06-12 | 2018-11-13 | 云南电网有限责任公司信息中心 | A kind of operation system user's operation monitoring data method for visualizing |
CN109144765B (en) * | 2018-08-21 | 2024-02-02 | 平安科技(深圳)有限公司 | Report generation method, report generation device, computer equipment and storage medium |
CN109344180A (en) * | 2018-08-21 | 2019-02-15 | 中国平安人寿保险股份有限公司 | Method, apparatus, computer equipment and the storage medium that display data obtains |
CN109034423B (en) * | 2018-08-29 | 2023-04-18 | 郑州云海信息技术有限公司 | Fault early warning judgment method, device, equipment and storage medium |
CN109408320A (en) * | 2018-09-03 | 2019-03-01 | 中国平安人寿保险股份有限公司 | Abnormality eliminating method, device, computer equipment and storage medium are developed in front end |
CN109359014A (en) * | 2018-09-04 | 2019-02-19 | 武汉华信联创技术工程有限公司 | A kind of computer operation condition monitoring method, system and storage medium |
CN109117345A (en) * | 2018-09-05 | 2019-01-01 | 深圳市木瓜移动科技有限公司 | Log monitoring method and distributed data system |
CN109189847A (en) * | 2018-09-11 | 2019-01-11 | 国网山东省电力公司莱芜供电公司 | A kind of distribution transforming decreasing loss detection prompt system and method |
CN109522287B (en) * | 2018-09-18 | 2023-08-18 | 平安科技(深圳)有限公司 | Monitoring method, system, equipment and medium for distributed file storage cluster |
CN110928740A (en) * | 2018-09-20 | 2020-03-27 | 中国石油化工股份有限公司 | Centralized visualization method and system for operation and maintenance data of cloud computing center |
CN109542946A (en) * | 2018-10-26 | 2019-03-29 | 贵州斯曼特信息技术开发有限责任公司 | It is a kind of to calculate big data system and method in real time |
CN109684161B (en) * | 2018-11-02 | 2022-05-03 | 深圳壹账通智能科技有限公司 | Data analysis method, data analysis device, server and storage medium |
CN109408448A (en) * | 2018-12-05 | 2019-03-01 | 江苏恒创软件有限公司 | One kind can carry out centralized processing integration data to data and show platform |
CN111382042A (en) * | 2018-12-29 | 2020-07-07 | 上海北塔软件股份有限公司 | Log management method based on big data stream type calculation |
CN109739828B (en) * | 2018-12-29 | 2021-06-29 | 咪咕文化科技有限公司 | Data processing method and device and computer readable storage medium |
CN109617750A (en) * | 2019-01-31 | 2019-04-12 | 国网电子商务有限公司 | A kind of service method for early warning and gateway |
CN110059140A (en) * | 2019-03-29 | 2019-07-26 | 国网福建省电力有限公司 | A method of data storage is carried out based on Oracle data and Hbase data |
CN110119421A (en) * | 2019-04-03 | 2019-08-13 | 昆明理工大学 | A kind of electric power stealing user identification method based on Spark flow sorter |
CN110113386A (en) * | 2019-04-16 | 2019-08-09 | 苏州浪潮智能科技有限公司 | A kind of power of MDC data center and environmental monitoring system method of data synchronization |
CN110287079A (en) * | 2019-05-14 | 2019-09-27 | 中山大学 | A kind of cluster Automatic monitoring systems and method |
CN110287081A (en) * | 2019-06-21 | 2019-09-27 | 腾讯科技(成都)有限公司 | A kind of service monitoring system and method |
CN110297745A (en) * | 2019-07-04 | 2019-10-01 | 中山大学 | A kind of Fault Locating Method and system storing monitoring system |
CN110309030A (en) * | 2019-07-05 | 2019-10-08 | 亿玛创新网络(天津)有限公司 | Log analysis monitoring system and method based on ELK and Zabbix |
CN110377488A (en) * | 2019-07-15 | 2019-10-25 | 福建威盾科技集团有限公司 | A kind of method and system for unifying O&M and dynamic expansion |
CN112445674A (en) * | 2019-08-30 | 2021-03-05 | 中国石油化工股份有限公司 | Data processing method and storage medium of computer cluster |
CN110659182A (en) * | 2019-09-12 | 2020-01-07 | 无锡江南计算技术研究所 | High-performance computer monitoring method and system |
CN110765189A (en) * | 2019-09-18 | 2020-02-07 | 苏宁云计算有限公司 | Exception management method and system for Internet products |
CN110677304A (en) * | 2019-10-11 | 2020-01-10 | 广州趣丸网络科技有限公司 | Distributed problem tracking system and equipment |
CN110890988B (en) * | 2019-12-02 | 2022-04-22 | 安徽三实信息技术服务有限公司 | Server cluster operation monitoring system |
CN111049898A (en) * | 2019-12-10 | 2020-04-21 | 杭州东方通信软件技术有限公司 | Method and system for realizing cross-domain architecture of computing cluster resources |
CN110912786B (en) * | 2019-12-27 | 2021-07-16 | 深圳市星砺达科技有限公司 | Gateway pressure testing method and device, computer equipment and storage medium |
CN111225045B (en) * | 2019-12-31 | 2022-12-27 | 苏州浪潮智能科技有限公司 | HIVE high-availability early warning method, device and computer readable storage medium |
CN111339142A (en) * | 2020-02-26 | 2020-06-26 | 广州信安数据有限公司 | Data monitoring response method, computer readable storage medium and data driving platform |
CN111324513B (en) * | 2020-02-29 | 2022-12-27 | 苏州浪潮智能科技有限公司 | Monitoring management method and system for artificial intelligence development platform |
CN111600856B (en) * | 2020-03-07 | 2023-03-31 | 浙江齐治科技股份有限公司 | Safety system of operation and maintenance of data center |
CN112115026B (en) * | 2020-09-15 | 2022-09-16 | 招商局金融科技有限公司 | Server cluster monitoring method and device, electronic equipment and readable storage medium |
CN112437145A (en) * | 2020-11-18 | 2021-03-02 | 北京浪潮数据技术有限公司 | Server cluster management method and device and related components |
CN114661538B (en) * | 2020-12-23 | 2023-04-28 | 金篆信科有限责任公司 | Distributed database monitoring method and device, electronic equipment and storage medium |
CN112769622A (en) * | 2021-01-18 | 2021-05-07 | 孙冬英 | Cluster service fault early warning system based on RPC service monitoring |
CN113282559A (en) * | 2021-06-04 | 2021-08-20 | 青岛海尔科技有限公司 | Computer log classification method and device, storage medium and electronic device |
WO2023279815A1 (en) * | 2021-07-08 | 2023-01-12 | 华为技术有限公司 | Performance monitoring system and related method |
CN113342621A (en) * | 2021-07-14 | 2021-09-03 | 芯河半导体科技(无锡)有限公司 | System for monitoring and testing idle time of machine host and giving alarm |
CN113688005B (en) * | 2021-08-09 | 2022-08-26 | 山东亚泽信息技术有限公司 | Operation and maintenance monitoring method and system |
CN113900898B (en) * | 2021-10-19 | 2024-09-03 | 北京金山云网络技术有限公司 | Data processing system, equipment and medium |
CN114500232A (en) * | 2022-01-24 | 2022-05-13 | 上海华力微电子有限公司 | Factory network middleware monitoring system |
CN114553732A (en) * | 2022-03-08 | 2022-05-27 | 北京月新时代科技股份有限公司 | Technology for automatically acquiring equipment performance based on equipment |
CN114448831B (en) * | 2022-03-18 | 2023-09-01 | 以萨技术股份有限公司 | Method and system for monitoring state of servers to which clusters belong |
CN115834696B (en) * | 2022-10-20 | 2023-08-01 | 北京新数科技有限公司 | Database performance monitoring platform data acquisition device |
CN116112407A (en) * | 2022-12-28 | 2023-05-12 | 上海学登信息科技有限公司 | Network flow data acquisition system |
CN116911807B (en) * | 2023-09-13 | 2023-12-05 | 成都秦川物联网科技股份有限公司 | Intelligent gas data center flow visual management method and Internet of things system |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101707632A (en) * | 2009-10-28 | 2010-05-12 | 浪潮电子信息产业股份有限公司 | Method for dynamically monitoring performance of server cluster and alarming real-timely |
CN102938710A (en) * | 2012-11-14 | 2013-02-20 | 北京奇虎科技有限公司 | Monitoring system and method for large-scale servers |
CN104618343A (en) * | 2015-01-06 | 2015-05-13 | 中国科学院信息工程研究所 | Method and system for detecting website threat based on real-time log |
CN105868075A (en) * | 2016-03-31 | 2016-08-17 | 浪潮通信信息系统有限公司 | System and method for monitoring and analyzing large amount of logs in real time |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2014147699A1 (en) * | 2013-03-18 | 2014-09-25 | 富士通株式会社 | Management device, method, and program |
Also Published As
Publication number | Publication date |
---|---|
CN107943668A (en) | 2018-04-20 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107943668B (en) | Computer server cluster log monitoring method and monitor supervision platform | |
CN111984499B (en) | Fault detection method and device for big data cluster | |
Muniswamaiah et al. | Big data in cloud computing review and opportunities | |
CN108353090B (en) | Method for improving processing of sensor stream data in a distributed network | |
US10129168B2 (en) | Methods and systems providing a scalable process for anomaly identification and information technology infrastructure resource optimization | |
US8423638B2 (en) | Performance monitoring of a computer resource | |
CN114500250B (en) | System linkage comprehensive operation and maintenance system and method in cloud mode | |
CN104881352A (en) | System resource monitoring device based on mobile terminal | |
CN111046022A (en) | Database auditing method based on big data technology | |
JP6457777B2 (en) | Automated generation and dynamic update of rules | |
CN105071954A (en) | Resource pool fault diagnosis and positioning processing method based on probe technology | |
CN110598051A (en) | Power industry monitoring system, method and device | |
Samak et al. | Scalable analysis of network measurements with Hadoop and Pig | |
CN113835918A (en) | Server fault analysis method and device | |
Solaimani et al. | Online anomaly detection for multi‐source VMware using a distributed streaming framework | |
WO2023224764A1 (en) | Multi-modality root cause localization for cloud computing systems | |
CN112306820A (en) | Log operation and maintenance root cause analysis method and device, electronic equipment and storage medium | |
Shao et al. | Griffon: Reasoning about job anomalies with unlabeled data in cloud-based platforms | |
CN108055152B (en) | Communication network information system abnormity detection method based on distributed service log | |
CN112579552A (en) | Log storage and calling method, device and system | |
CN114221997A (en) | Interface monitoring system based on micro-service gateway | |
Shih et al. | Implementation and visualization of a netflow log data lake system for cyberattack detection using distributed deep learning | |
Yamnual et al. | Failure detection through monitoring of the scientific distributed system | |
KR101878291B1 (en) | Big data management system and management method thereof | |
Kumar et al. | A pragmatic approach to predict hardware failures in storage systems using MPP database and big data technologies |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20190226 |