CN109947615A - Monitoring method and device for a distributed system - Google Patents

Monitoring method and device for a distributed system

Info

Publication number
CN109947615A
CN109947615A CN201910027660.1A CN201910027660A CN109947615A CN 109947615 A CN109947615 A CN 109947615A CN 201910027660 A CN201910027660 A CN 201910027660A CN 109947615 A CN109947615 A CN 109947615A
Authority
CN
China
Prior art keywords
high availability
indicator
indicator data
time period
multiple
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910027660.1A
Other languages
Chinese (zh)
Inventor
倪军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Advanced New Technologies Co Ltd
Advantageous New Technologies Co Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN201910027660.1A priority Critical patent/CN109947615A/en
Publication of CN109947615A publication Critical patent/CN109947615A/en
Pending legal-status Critical Current

Landscapes

  • Debugging And Monitoring (AREA)

Abstract

Embodiments of this specification provide a monitoring method and device for a distributed system, where the distributed system is a cluster composed of multiple single machines. The method includes: first obtaining multiple items of high-availability indicator data of the distributed system within the current preset time period, the indicator data including multiple items of single-machine high-availability indicator data for each single machine and multiple items of cluster high-availability indicator data for the cluster; then obtaining the anomaly measurement function corresponding to each high-availability indicator in the current preset time period; and then using each corresponding anomaly measurement function to evaluate each item of high-availability indicator data in the current preset time period, yielding a result indicating whether each item needs an alert. High-availability problems can thus be found promptly and located precisely, which facilitates rapid emergency recovery.

Description

Monitoring method and device for a distributed system
Technical field
One or more embodiments of this specification relate to the computer field, and in particular to a monitoring method and device for a distributed system.
Background
For a system, software, website, or platform, high availability is one of the most important robustness indicators during operation: continuously providing a highly available service to the outside is a key measure of a system's quality. In practice, various objective and subjective causes (such as code defects, network failures, and server hardware failures) lead to high-availability problems in the system (or software, etc.), such as central processing unit (CPU) usage spikes and frequent garbage collection (GC). What is most needed at such times is to identify the problem immediately and pinpoint which aspect of high availability is affected, so that emergency recovery can begin at once and the system's availability does not continue to degrade.
Distributed system: a software system that supports distributed processing, that is, a system that executes tasks on a multiprocessor architecture interconnected by a communication network. It includes distributed operating systems, distributed programming languages and their compilation (interpretation) systems, distributed file systems, distributed database systems, and so on. A distributed system is a cluster composed of multiple single machines.
Current monitoring methods for distributed systems fall into two classes:
One class provides only business anomaly monitoring. This kind of solution can raise an alert when a business anomaly occurs, but it perceives anomalies mainly from the perspective of business changes. If the anomaly is caused by a high-availability problem, the problem category cannot be located accurately; after receiving the alert, system personnel must still combine other information to analyze the source of the problem, which increases the time spent on troubleshooting and emergency response.
The other class monitors some of the system's high-availability indicators, but what it monitors is essentially the average fluctuation across the cluster. Often not all single machines become abnormal together; only a few individual machines have a problem. Viewed through the cluster-wide average, the magnitude of the anomaly is then very small and does not trigger an alert, so the problem goes undetected. Moreover, because there are many single machines and the call chains are complex, once a problem occurs it is difficult to determine exactly which high-availability indicator of which single machine is abnormal, so rapid localization is impossible.
An improved solution is therefore desired that can find high-availability problems promptly and locate them precisely, so as to facilitate rapid emergency recovery.
Summary of the invention
One or more embodiments of this specification describe a monitoring method and device for a distributed system that can find high-availability problems promptly and locate them precisely, so as to facilitate rapid emergency recovery.
In a first aspect, a monitoring method for a distributed system is provided, the distributed system being a cluster composed of multiple single machines. The method includes:
obtaining multiple items of high-availability indicator data of the distributed system within the current preset time period, the multiple items of high-availability indicator data including multiple items of single-machine high-availability indicator data for each single machine and multiple items of cluster high-availability indicator data for the cluster;
obtaining the anomaly measurement function corresponding to each high-availability indicator in the current preset time period;
using each corresponding anomaly measurement function to evaluate each item of high-availability indicator data in the current preset time period, and obtaining a result indicating whether each item of high-availability indicator data needs an alert.
In one possible implementation, obtaining the multiple items of high-availability indicator data of the distributed system within the current preset time period includes:
obtaining the logs of the distributed system within the current preset time period;
parsing the logs that share the same source address according to a preset model, to obtain the multiple items of single-machine high-availability indicator data for each single machine.
Further, obtaining the multiple items of high-availability indicator data of the distributed system within the current preset time period also includes:
performing a computation on the multiple items of single-machine high-availability indicator data of each single machine according to a preset algorithm, to determine the multiple items of cluster high-availability indicator data for the cluster.
Further, the logs include a runtime performance log and/or a basic service log.
Further, the runtime performance log includes at least one of CPU usage data, load data, memory usage data, and GC count data; the high-availability indicators include at least one of a CPU usage parameter, a load parameter, a memory usage parameter, and a GC parameter.
Further, the basic service log includes at least one of the time taken by interface method calls, the results of interface method calls, the time taken by database operation interface methods, and the results of database operation interface methods; the high-availability indicators include at least one of an interface call time parameter, an interface call result parameter, a database operation interface time parameter, and a database operation interface result parameter.
In one possible implementation, the anomaly measurement function is determined as follows:
performing statistical analysis separately on each of the multiple items of high-availability indicator data obtained in at least one preset time period before the current preset time period, to determine the indicator baseline formula corresponding to each high-availability indicator for the current preset time period;
determining, according to the indicator baseline formula corresponding to each high-availability indicator for the current preset time period, the anomaly measurement function corresponding to each high-availability indicator for the current preset time period.
Further, performing statistical analysis separately on each of the multiple items of high-availability indicator data obtained in at least one preset time period before the current preset time period, to determine the indicator baseline formula corresponding to each high-availability indicator for the current preset time period, includes:
assuming that the multiple items of high-availability indicator data obtained in at least one preset time period before the current preset time period follow a normal distribution, and determining the indicator baseline formula corresponding to each high-availability indicator for the current preset time period according to the probability distribution of the values.
Further, determining the anomaly measurement function corresponding to each high-availability indicator for the current preset time period according to the corresponding indicator baseline formula includes:
determining the anomaly measurement function corresponding to each high-availability indicator for the current preset time period according to the corresponding indicator baseline formula, together with the period-over-period ratio between each high-availability indicator in the current preset time period and the same indicator in the previous preset time period, and/or the same-period ratio.
In one possible implementation, the method further includes:
fusing the multiple items of high-availability indicator data and the results indicating whether each item needs an alert, separately along the cluster dimension and the single-machine dimension, and assembling them into alert messages;
sending each alert message, in a preset manner, to the preset terminal associated with the single machine or cluster to which the alert message corresponds.
Further, the preset manner includes one or more of the following:
instant messaging (IM) notification, short message (SMS), and phone call.
In a second aspect, a monitoring device for a distributed system is provided, the distributed system being a cluster composed of multiple single machines. The device includes:
a first acquisition unit, configured to obtain multiple items of high-availability indicator data of the distributed system within the current preset time period, the multiple items of high-availability indicator data including multiple items of single-machine high-availability indicator data for each single machine and multiple items of cluster high-availability indicator data for the cluster;
a second acquisition unit, configured to obtain the anomaly measurement function corresponding to each high-availability indicator in the current preset time period;
an evaluation unit, configured to use each corresponding anomaly measurement function obtained by the second acquisition unit to evaluate each item of high-availability indicator data in the current preset time period, and to obtain a result indicating whether each item of high-availability indicator data needs an alert.
In a third aspect, a computer-readable storage medium is provided that stores a computer program which, when executed in a computer, causes the computer to perform the method of the first aspect.
In a fourth aspect, a computing device is provided, including a memory and a processor, where the memory stores executable code and the processor, when executing the executable code, implements the method of the first aspect.
With the method and device provided by the embodiments of this specification, where the distributed system is a cluster composed of multiple single machines, multiple items of high-availability indicator data of the distributed system within the current preset time period are first obtained, including multiple items of single-machine high-availability indicator data for each single machine and multiple items of cluster high-availability indicator data for the cluster. The anomaly measurement function corresponding to each high-availability indicator in the current preset time period is then obtained, and each corresponding function is used to evaluate each item of high-availability indicator data in the current preset time period, yielding a result indicating whether each item needs an alert. The embodiments thus obtain not only the cluster-level indicator data but also the indicator data of each single machine, and use the anomaly measurement functions to judge whether each item needs an alert (that is, whether an anomaly has occurred). Different preset time periods, or different indicators, may correspond to different anomaly measurement functions. Through this fine-grained monitoring of the high-availability indicators of the distributed system, high-availability problems can be found promptly and located precisely, which facilitates rapid emergency recovery.
Brief description of the drawings
To describe the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below are evidently only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic diagram of an implementation scenario of an embodiment disclosed in this specification;
Fig. 2 shows a flow chart of a monitoring method for a distributed system according to one embodiment;
Fig. 3 shows an implementation schematic of a monitoring method for a distributed system according to one embodiment;
Fig. 4 shows a schematic block diagram of a monitoring device for a distributed system according to one embodiment.
Detailed description
The solutions provided in this specification are described below with reference to the drawings.
Fig. 1 is a schematic diagram of an implementation scenario of an embodiment disclosed in this specification. The scenario involves monitoring a distributed system, where the distributed system is a cluster composed of multiple single machines. Referring to Fig. 1, distributed system 11 is a cluster composed of single machine A, single machine B, single machine C, and single machine D, and a user accesses or uses distributed system 11 through terminal 12 (for example, a mobile phone, tablet, or PC). The number of single machines in the cluster can be set according to actual needs; the number shown in the figure is only an example.
In a distributed system, a group of independent computers (that is, single machines) is presented to the user as one unified whole, as if it were a single system. The system has a variety of general-purpose physical and logical resources and can assign tasks dynamically; the dispersed physical and logical resources exchange information through a computer network. The system has a distributed operating system that manages computer resources globally. In general, a distributed system presents only one model or paradigm to the user. A layer of software middleware on top of the operating system is responsible for implementing this model.
A distributed operating system manages system resources globally. It can schedule network resources for the user arbitrarily, and the scheduling process is "transparent". When the user submits a job, the distributed operating system selects the most suitable processor (that is, single machine) in the system as needed, submits the user's job to that processor, and returns the result to the user after the processor completes the job. Throughout this process, the user is unaware that multiple processors exist; the system behaves as if it were a single processor.
It should be noted that, in the embodiments of this specification, a distributed system is a software system that supports distributed processing, that is, a system that executes tasks on a multiprocessor architecture interconnected by a communication network. It includes not only distributed operating systems but also distributed programming languages and their compilation (interpretation) systems, distributed file systems, distributed database systems, and so on.
In the embodiments of this specification, monitoring of a distributed system is mainly monitoring of its high availability. Multiple indicators of both the cluster and the single machines are monitored to judge whether each indicator is anomalous, that is, whether each indicator needs an alert. Through this fine-grained monitoring of the high-availability indicators of the distributed system, high-availability problems can be found promptly and located precisely, which facilitates rapid emergency recovery.
Fig. 2 shows a flow chart of a monitoring method for a distributed system according to one embodiment, where the distributed system is a cluster composed of multiple single machines, such as distributed system 11 shown in Fig. 1. As shown in Fig. 2, the monitoring method in this embodiment includes the following steps. Step 21: obtain multiple items of high-availability indicator data of the distributed system within the current preset time period, including multiple items of single-machine high-availability indicator data for each single machine and multiple items of cluster high-availability indicator data for the cluster. Step 22: obtain the anomaly measurement function corresponding to each high-availability indicator in the current preset time period. Step 23: use each corresponding anomaly measurement function to evaluate each item of high-availability indicator data in the current preset time period, and obtain a result indicating whether each item needs an alert. The specific way each step is performed is described below.
First, in step 21, multiple items of high-availability indicator data of the distributed system within the current preset time period are obtained, including multiple items of single-machine high-availability indicator data for each single machine and multiple items of cluster high-availability indicator data for the cluster. The current preset time period can be understood as the alert period, which can be set as needed, for example to 5 minutes, half an hour, or one day.
In one example, the logs of the distributed system within the current preset time period are obtained, and the logs that share the same source address are parsed according to a preset model to obtain the multiple items of single-machine high-availability indicator data for each single machine. The logs are those recorded by each single machine in the distributed system, and the source address of a log identifies the single machine that produced it. Single-machine indicator data can therefore be mapped to the corresponding single machine, which makes it possible later to analyze, for each single machine, whether its indicator data is anomalous and whether an alert is needed.
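The grouping of logs by source address can be sketched as follows. This is a minimal illustration only: the patent does not specify the log format or the "preset model", so the line layout (`<source_address> name=value ...`), the function name `parse_logs`, and the indicator names are all assumptions made for this example.

```python
from collections import defaultdict

# Assumed (hypothetical) log line layout: "<source_address> name=value name=value ..."
def parse_logs(lines):
    """Group indicator samples by the source address that wrote each log line."""
    per_machine = defaultdict(lambda: defaultdict(list))
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        source, *fields = line.split()
        for field in fields:
            name, _, raw = field.partition("=")
            per_machine[source][name].append(float(raw))
    return per_machine

# Illustrative sample lines, not taken from the patent.
logs = [
    "10.0.0.1 cpu_usage=35.0 load=1.2",
    "10.0.0.2 cpu_usage=80.5 load=3.4",
    "10.0.0.1 cpu_usage=40.0 load=1.5",
]
parsed = parse_logs(logs)
```

Keying the result by source address first keeps every sample attached to the single machine that produced it, which is what later per-machine evaluation relies on.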
In one example, a computation is performed on the multiple items of single-machine high-availability indicator data of each single machine according to a preset algorithm to determine the multiple items of cluster high-availability indicator data for the cluster. The preset algorithm may be, but is not limited to, taking the average, minimum, or maximum of one item of single-machine indicator data across the single machines. Table one shows a mapping, provided by an embodiment of this specification, between the single-machine high-availability indicator data and the cluster high-availability indicator data.
Table one: mapping between single-machine high-availability indicator data and cluster high-availability indicator data
Referring to Table one, take a distributed system that is a cluster composed of single machine A, single machine B, single machine C, and single machine D. After the indicator data of one high-availability indicator is obtained for each single machine, the cluster's indicator data for that indicator is obtained by taking the maximum. Table one uses the maximum as the preset algorithm; other algorithms work similarly and are not described again here.
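A minimal sketch of deriving a cluster-level value from per-machine values follows. The three aggregators mirror the average, minimum, and maximum options named in the text; the function name `cluster_indicator` and the CPU figures are made up for illustration, since the contents of Table one are not available here.

```python
from statistics import mean

# The three preset algorithms named in the text.
AGGREGATORS = {"max": max, "min": min, "mean": mean}

def cluster_indicator(per_machine_values, algorithm="max"):
    """Reduce one indicator's per-machine values to a single cluster value."""
    return AGGREGATORS[algorithm](list(per_machine_values.values()))

# Illustrative values for single machines A to D (not from the patent).
cpu_by_machine = {"A": 35.0, "B": 80.0, "C": 42.0, "D": 51.0}
cluster_cpu_max = cluster_indicator(cpu_by_machine, "max")
cluster_cpu_mean = cluster_indicator(cpu_by_machine, "mean")
```

With these numbers the maximum (80.0) exposes machine B's high usage, whereas the mean (52.0) largely hides it, which is the weakness of average-only monitoring that the background section describes.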
In one example, the logs include a runtime performance log and/or a basic service log.
Further, the runtime performance log includes at least one of CPU usage data, load data, memory usage data, and GC count data; the high-availability indicators include at least one of a CPU usage parameter, a load parameter, a memory usage parameter, and a GC parameter.
Further, the basic service log includes at least one of the time taken by interface method calls, the results of interface method calls, the time taken by database operation interface methods, and the results of database operation interface methods; the high-availability indicators include at least one of an interface call time parameter, an interface call result parameter, a database operation interface time parameter, and a database operation interface result parameter.
It should be noted that the alert period is usually longer than the log recording period. For example, if the log records once per minute and the alert period is 5 minutes, the log records multiple data points for one high-availability indicator within one alert period. To simplify analysis, these data points can be processed further, for example by taking the maximum, minimum, or average, and the single value obtained is used as the indicator data for that indicator in the alert period. The subsequent anomaly judgment is then made on this single processed value. Table two shows a mapping, provided by an embodiment of this specification, between the indicator data recorded in the log and the indicator data to be evaluated for the alert period.
Table two: mapping between logged high-availability indicator data and high-availability indicator data to be evaluated
Referring to Table two, take the CPU usage parameter values recorded in the log of single machine A within one alert period. After the multiple data points for that indicator within the alert period are obtained, the indicator data to be evaluated is obtained by averaging them. Table two uses the average as the preset algorithm; other algorithms work similarly and are not described again here.
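The reduction of per-minute log samples to one value per alert period can be sketched as follows. The `(minute_offset, value)` sample layout, the function name `reduce_period`, and the sample values are assumptions for illustration; the 5-minute period and the averaging follow the example in the text.

```python
from statistics import mean

def reduce_period(samples, period_minutes=5, reducer=mean):
    """Bucket (minute_offset, value) samples into alert periods and reduce each bucket.

    Returns {period_index: reduced_value}.
    """
    buckets = {}
    for minute, value in samples:
        buckets.setdefault(minute // period_minutes, []).append(value)
    return {index: reducer(values) for index, values in sorted(buckets.items())}

# Illustrative per-minute CPU usage samples for one machine (not from the patent).
samples = [(0, 30.0), (1, 32.0), (2, 34.0), (3, 36.0), (4, 38.0), (5, 90.0)]
per_period = reduce_period(samples)
```

Passing `reducer=max` or `reducer=min` gives the other two reductions mentioned in the text.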
Next, in step 22, the anomaly measurement function corresponding to each high-availability indicator in the current preset time period is obtained. In the embodiments of this specification, the anomaly measurement function can be obtained by analyzing historical data; this analysis can be performed either offline or online.
Offline computation means that all input data is known before the computation starts and does not change, and that the result is obtained as soon as the problem is solved. It belongs to the data computation part of big data, where its counterpart is real-time computation.
It can be understood that an anomaly measurement function needs to be determined separately for each high-availability indicator of each single machine or of the cluster.
In one example, the anomaly measurement function is determined as follows: statistical analysis is performed separately on each of the multiple items of high-availability indicator data obtained in at least one preset time period before the current preset time period, to determine the indicator baseline formula corresponding to each high-availability indicator for the current preset time period; the anomaly measurement function corresponding to each high-availability indicator for the current preset time period is then determined from the corresponding indicator baseline formula.
Specifically, the multiple items of high-availability indicator data obtained in at least one preset time period before the current preset time period can be assumed to follow a normal distribution, and the indicator baseline formula corresponding to each high-availability indicator for the current preset time period is determined according to the probability distribution of the values, for example by using the 3-sigma algorithm.
The indicator baseline formula can be understood as an interval threshold that bounds the range to which normal values of the high-availability indicator belong. In the embodiments of this specification, whether an item of high-availability indicator data is anomalous, and therefore whether an alert is needed, can be determined by checking whether the data falls within this range.
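A 3-sigma baseline of this kind can be sketched as below. The interval is the historical mean plus or minus three standard deviations; using the population standard deviation (`pstdev`) rather than the sample one is a choice made here, and the function names and history values are hypothetical.

```python
from statistics import mean, pstdev

def baseline_interval(history, k=3.0):
    """Return (low, high) = mean -/+ k * standard deviation of the historical values."""
    mu = mean(history)
    sigma = pstdev(history)  # population standard deviation; an assumption of this sketch
    return mu - k * sigma, mu + k * sigma

def is_anomalous(value, interval):
    """A value is treated as anomalous when it falls outside the baseline interval."""
    low, high = interval
    return not (low <= value <= high)

# Illustrative history of one indicator over earlier preset time periods.
history = [40.0, 42.0, 38.0, 41.0, 39.0]
interval = baseline_interval(history)
```

Under the normal-distribution assumption stated in the text, about 99.7% of normal values fall inside the 3-sigma interval, so a value outside it is a reasonable anomaly candidate.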
In addition to the indicator baseline formula, other conditions can be combined to determine whether an item of high-availability indicator data is anomalous. In one example, the anomaly measurement function corresponding to each high-availability indicator for the current preset time period is determined from the corresponding indicator baseline formula together with the period-over-period ratio between each high-availability indicator in the current preset time period and the same indicator in the previous preset time period, and/or the same-period ratio.
The period-over-period ratio is the ratio of the indicator data of the same high-availability indicator in two adjacent time segments. A time segment may be a preset time period; for example, the ratio of each high-availability indicator in the current preset time period to the same indicator in the previous preset time period.
The same-period ratio is the ratio of the indicator data of the same high-availability indicator in the same sub-interval of two adjacent time spans. The time span may be a day and the sub-interval may be a preset time period; for example, the ratio of each high-availability indicator in today's current preset time period to the same indicator in the preset time period covering the same time of day yesterday.
In one example, a period-over-period threshold and a same-period threshold can be preset. Whether an item of high-availability indicator data is anomalous, and whether an alert is needed, is then determined jointly from three comparisons: the indicator data against the indicator baseline formula, the period-over-period ratio against the period-over-period threshold, and the same-period ratio against the same-period threshold.
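One way to combine the three comparisons into a single anomaly measurement function is sketched below. The text does not say how the comparisons are combined, so requiring all three to hold (a logical AND) is an assumption of this sketch, as are the function names and the threshold values.

```python
def ratio(current, reference):
    """Ratio of current to reference, guarding against a zero reference."""
    if reference == 0:
        return float("inf") if current > 0 else 1.0
    return current / reference

def make_measure_function(interval, pop_threshold, same_period_threshold):
    """Build a measure function from a baseline interval and two ratio thresholds."""
    low, high = interval

    def measure(current, previous_period, same_period_yesterday):
        outside_baseline = not (low <= current <= high)
        pop_exceeded = ratio(current, previous_period) > pop_threshold
        same_exceeded = ratio(current, same_period_yesterday) > same_period_threshold
        # Assumption: an alert is needed only when all three conditions hold.
        return outside_baseline and pop_exceeded and same_exceeded

    return measure

# Hypothetical baseline interval and thresholds.
measure = make_measure_function((35.0, 45.0), pop_threshold=1.5, same_period_threshold=1.5)
spike = measure(90.0, 40.0, 42.0)   # outside baseline, both ratios above 1.5
drift = measure(46.0, 44.0, 45.0)   # just outside baseline, ratios close to 1
```

Requiring the ratio conditions as well as the baseline condition suppresses alerts for values that sit slightly outside the baseline but have barely changed; a looser policy (for example, any two of the three) would be an equally valid reading of the text.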
Finally, in step 23, each corresponding anomaly measurement function is used to evaluate each item of high-availability indicator data in the current preset time period, and a result indicating whether each item needs an alert is obtained. It can be understood that when an item of high-availability indicator data is judged to be anomalous, that item needs an alert; otherwise no alert is needed.
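The evaluation step amounts to looking up the function that belongs to each indicator and applying it, as in this minimal sketch. The `(owner, indicator)` key layout, the simple threshold lambdas standing in for real anomaly measurement functions, and all values are hypothetical.

```python
def evaluate_all(indicator_data, measure_functions):
    """Apply each indicator's own measure function; return {key: needs_alert}."""
    results = {}
    for key, value in indicator_data.items():
        measure = measure_functions[key]  # one function per (owner, indicator) pair
        results[key] = bool(measure(value))
    return results

# Keys are (single machine or "cluster", indicator name); values are illustrative.
indicator_data = {
    ("A", "cpu_usage"): 95.0,
    ("A", "gc_count"): 3.0,
    ("cluster", "cpu_usage"): 60.0,
}
# Placeholder functions; in practice each would come from the baseline analysis.
measure_functions = {
    ("A", "cpu_usage"): lambda v: v > 90.0,
    ("A", "gc_count"): lambda v: v > 10.0,
    ("cluster", "cpu_usage"): lambda v: v > 75.0,
}
alerts = evaluate_all(indicator_data, measure_functions)
```

Because the result is keyed by both owner and indicator, an alert directly names which indicator of which single machine (or of the cluster) is anomalous.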
In one example, after step 23, the multiple items of high-availability indicator data and the results indicating whether each item needs an alert are fused separately along the cluster dimension and the single-machine dimension and assembled into alert messages. Each alert message is then sent, in a preset manner, to the preset terminal associated with the single machine or cluster to which the message corresponds.
The preset manner includes one or more of the following:
instant messaging (IM) notification, short message (SMS), and phone call.
It can be understood that each single machine or cluster may be maintained or managed by different personnel. In the embodiments of this specification, alert messages can be distributed differentially, which makes them more likely to get the attention of the relevant user and improves user experience.
Furthermore, the alert messages may be assembled and distributed only when some high-availability indicator data needs an alert, which serves as a fault reminder. Alternatively, the alert messages may be assembled and distributed regardless of whether any indicator data needs an alert, so that users can monitor all indicators continuously.
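Assembling one message per single machine or cluster, with the choice between "only on alert" and "always send", can be sketched as follows. The message text format, the function name `assemble_messages`, and the sample results are assumptions; actual delivery over IM, SMS, or phone is outside the scope of this sketch.

```python
from collections import defaultdict

def assemble_messages(results, only_on_alert=True):
    """Fuse results per owner into one message text.

    results: {(owner, indicator): (value, needs_alert)} -> {owner: message}
    """
    grouped = defaultdict(list)
    for (owner, indicator), (value, needs_alert) in sorted(results.items()):
        grouped[owner].append((indicator, value, needs_alert))
    messages = {}
    for owner, rows in grouped.items():
        if only_on_alert and not any(alert for _, _, alert in rows):
            continue  # fault-reminder mode: skip owners with nothing to report
        parts = [f"{name}={value} {'ALERT' if alert else 'ok'}" for name, value, alert in rows]
        messages[owner] = f"[{owner}] " + "; ".join(parts)
    return messages

# Illustrative evaluation results for single machines A and B.
results = {
    ("A", "cpu_usage"): (95.0, True),
    ("A", "gc_count"): (3.0, False),
    ("B", "cpu_usage"): (40.0, False),
}
messages = assemble_messages(results)
all_messages = assemble_messages(results, only_on_alert=False)
```

Grouping by owner is what enables differentiated distribution: each message can be routed to the terminal of whoever maintains that single machine or cluster.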
In the method provided by the embodiments of this specification, the distributed system is a cluster composed of multiple single machines. First, multiple items of high-availability indicator data of the distributed system within the current preset time period are obtained, including multiple items of single-machine high-availability indicator data for each single machine and multiple items of cluster high-availability indicator data for the cluster. Then, the anomaly measurement function corresponding to each high-availability indicator in the current preset time period is obtained. Finally, the corresponding anomaly measurement functions are used to evaluate each item of high-availability indicator data in the current preset time period, yielding a result on whether each item needs an early warning. The embodiments therefore obtain not only the cluster's indicator data but also that of each single machine, and judge by the anomaly measurement functions whether each item needs an early warning (that is, whether an anomaly has occurred). Different preset time periods or different high-availability indicators may correspond to different anomaly measurement functions. This fine-grained monitoring of the distributed system's high-availability indicators makes it possible to discover high-availability problems promptly and locate them precisely, which facilitates rapid emergency recovery.
Fig. 3 shows an implementation schematic of the monitoring method of a distributed system according to one embodiment. The distributed system is a cluster composed of multiple single machines, such as the distributed system 11 shown in Fig. 1. As shown in Fig. 3, the monitoring method in this embodiment is mainly implemented by the following modules:
Module 31 is used for real-time metadata collection.
The metadata includes cluster high-availability metadata and single-machine high-availability metadata.
On the one hand, high-availability metadata is collected and integrated through a real-time data platform to obtain the single-machine high-availability metadata.
In the embodiments of this specification, any of the various real-time data platforms available in the industry can be used: a client is installed on each server, and updated logs are then transmitted from the server to the data analysis platform at regular, second-level intervals. The real-time data platform collects the raw log information. This includes the server's runtime performance logs, such as CPU, load, memory, and GC status, as well as the basic service logs of the running system, such as the time consumed by and results of interface method calls, and the time consumed by and results of database-operation (DAO) interface methods.
In one example, the cluster high-availability metadata can be obtained by performing a preset operation on the single-machine high-availability metadata.
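As a rough sketch of such a preset operation, the cluster value of each indicator can be derived from the per-machine values. The text does not say which operation is used; the choice below (mean for CPU, sum for GC count) and the names `AGGREGATORS` and `cluster_metrics` are assumptions made purely for illustration.

```python
from statistics import mean
from typing import Callable, Dict, Iterable

# Assumed "preset operation" per indicator: average CPU usage across
# machines, total GC count across machines.
AGGREGATORS: Dict[str, Callable[[Iterable[float]], float]] = {
    "cpu": mean,
    "gc": sum,
}


def cluster_metrics(per_machine: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Derive cluster-level indicator values from single-machine values."""
    out: Dict[str, float] = {}
    for indicator, aggregate in AGGREGATORS.items():
        values = [m[indicator] for m in per_machine.values() if indicator in m]
        if values:  # skip indicators that no machine reported
            out[indicator] = aggregate(values)
    return out


machines = {
    "appname-1-1": {"cpu": 20.0, "gc": 1},
    "appname-1-2": {"cpu": 40.0, "gc": 3},
}
cluster = cluster_metrics(machines)
```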
On the other hand, besides collecting logs, part of the cluster-level high-availability data can also be obtained from other existing monitoring platforms, which reduces the cost of acquiring new data.
Module 32 is used for real-time data modeling.
In the embodiments of this specification, module 32 includes a data analysis platform. The data stream obtained from module 31 undergoes one or more of data filtering, data aggregation, and data modeling. The processed data is passed to the offline data warehouse for offline backup, or to the production database (DB) in module 34 so that module 34 can perform indicator anomaly detection on the real-time data.
Because the log information collected by module 31 is generally raw log information, basic modeling of these logs is needed at this point to obtain the time-axis-based data stream corresponding to each high-availability indicator.
For example, the raw log data may look like this:
2018-03-30T14:55:15.538+0800: 2.987: [CMS-concurrent-mark-start]
2018-03-30T14:55:15.541+0800: 2.991: [CMS-concurrent-mark: 0.003/0.003 secs] [Times: user=0.02 sys=0.00, real=0.00 secs]
2018-03-30T14:55:15.541+0800: 2.991: [CMS-concurrent-preclean-start]
2018-03-30T14:55:15.559+0800: 3.009: [CMS-concurrent-preclean: 0.018/0.018 secs] [Times: user=0.11 sys=0.00, real=0.02 secs]
2018-03-30T14:55:15.559+0800: 3.009: [CMS-concurrent-abortable-preclean-start]
Data modeling then combines the source address of the log with this portion of the log to produce the following modeled record:
appname-1-1, 2018-03-30 14:55, gc, 1
Here, appname-1-1 is the server name, 2018-03-30 14:55 is the time, gc is the high-availability indicator (here, the number of GCs), and 1 is the value of the indicator, i.e., the high-availability indicator data.
Then, every minute, the logs of each server are parsed according to the above model, yielding a time-axis-based stream of high-availability indicator data.
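The parsing step can be sketched as below for the GC example. The specification does not state how a GC is counted from these lines; the sketch assumes, purely for illustration, that each `CMS-concurrent-mark-start` event marks one GC cycle. The function name `gc_counts_per_minute` and the regular expression are likewise hypothetical.

```python
import re
from collections import Counter
from typing import Iterable, List

# Captures the date, the hour:minute part, and the event name inside the
# first bracket; whitespace after the colons is optional.
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2})T(\d{2}:\d{2}):\d{2}\.\d+[+-]\d{4}:\s*[\d.]+:\s*\[([\w-]+)"
)


def gc_counts_per_minute(server: str, lines: Iterable[str]) -> List[str]:
    """Turn raw GC log lines into 'server,minute,gc,count' records."""
    counts: Counter = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        # Assumption: one CMS cycle is counted once, at its mark-start event.
        if m and m.group(3) == "CMS-concurrent-mark-start":
            counts[f"{m.group(1)} {m.group(2)}"] += 1
    return [f"{server},{minute},gc,{n}" for minute, n in sorted(counts.items())]


sample = [
    "2018-03-30T14:55:15.538+0800: 2.987: [CMS-concurrent-mark-start]",
    "2018-03-30T14:55:15.541+0800: 2.991: [CMS-concurrent-mark: 0.003/0.003 secs]",
    "2018-03-30T14:55:15.541+0800: 2.991: [CMS-concurrent-preclean-start]",
]
records = gc_counts_per_minute("appname-1-1", sample)
```

Other indicators (interface time consumed, success rate, and so on) would need their own line patterns, but would produce records of the same four-field shape.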
By collecting information and performing data modeling and parsing in this way for all high-availability indicators of a system, complete fine-grained high-availability indicator data can be obtained.
Data filtering and data aggregation are terms commonly used in the art and are not explained further here.
Module 33 is used to obtain the high-availability indicator baselines.
The data trend and magnitude of each indicator differ, and for the same indicator, different systems also produce different data results and trends. Setting anomaly warning thresholds from manual experience would therefore involve a very large workload, and the thresholds would become inaccurate over time. For this reason, algorithm-based analysis of historical data is used here to obtain the anomaly measurement function for the next moment, so that the final result on whether to issue an early warning is obtained intelligently. In implementation, all data obtained in real time is written to the offline data warehouse; the offline data is analyzed with a common anomaly detection algorithm to obtain the indicator baseline formula, and this dynamic formula is written back to the production DB in real time. When choosing the anomaly measurement function, a major characteristic of high-availability indicators is taken into account, namely that the overall data trend is relatively stable, so the classic 3-sigma anomaly detection algorithm can be selected to calculate baselines from the offline data.
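A minimal sketch of a 3-sigma baseline follows. It treats the historical values of one indicator as roughly normally distributed and flags any value more than three standard deviations from the mean. The function name `three_sigma_baseline` and the sample history are illustrative assumptions; the specification does not give the exact baseline formula.

```python
from statistics import mean, pstdev
from typing import Callable, Sequence


def three_sigma_baseline(history: Sequence[float]) -> Callable[[float], bool]:
    """Build an anomaly function from historical values of one indicator."""
    mu = mean(history)
    sigma = pstdev(history)  # population standard deviation of the history
    lower, upper = mu - 3 * sigma, mu + 3 * sigma

    def is_anomalous(value: float) -> bool:
        # Values outside [mu - 3*sigma, mu + 3*sigma] are treated as anomalies.
        return value < lower or value > upper

    return is_anomalous


# Illustrative CPU history from earlier time periods.
cpu_history = [20, 22, 21, 23, 19, 21, 20, 22]
cpu_is_anomalous = three_sigma_baseline(cpu_history)
```

Recomputing the function from fresh history for each period corresponds to the dynamic baseline that is written back to the production DB.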
Module 34, which may be called the high-availability anomaly detection engine, performs indicator anomaly detection based on the real-time data and dynamic baselines in the production DB, and issues early warnings through a warning module.
After obtaining the time-axis-based high-availability indicator data modeled by the real-time data platform and the high-availability indicator baseline formulas obtained from offline data modeling, the engine uses all high-availability indicator data of a single system as input to the baseline formulas to obtain the final early-warning flag (yes or no), and thus obtains all high-availability indicator data together with the result of whether each needs an early warning. The engine then aggregates the results for each high-availability indicator by cluster dimension and by single-machine dimension, assembles them into an early-warning message, and pushes the message to the corresponding personnel by IM notification, SMS, phone call, or similar means.
The aggregated notification looks similar to the following:
High-availability fine-grained monitoring platform - aappname details
Cluster indicators:
(table)
Indicator | Current anomaly value | Stable reference value | Anomalous?
CPU | 81 | 23 | No
load1...
FGC
Tair-xx-success rate
Tair-xx-time consumed
Interface-method-success rate
Interface-method-time consumed
Dal-time consumed
Dao-data source-success rate
Dao-data source-time consumed
Based on the above mechanism, when a high-availability anomaly occurs in the system, the system administrator accurately receives fine-grained result warnings across the cluster and single-machine dimensions.
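The aggregated notification can be rendered from the evaluation results along the following lines. This is a sketch only: the `render_report` function, the `Row` tuple layout, and the exact column wording are assumptions, and the real message layout may differ.

```python
from typing import List, Tuple

# (indicator name, current value, stable reference value, anomalous?)
Row = Tuple[str, float, float, bool]


def render_report(app: str, scope: str, rows: List[Row]) -> str:
    """Render one pipe-separated report for a cluster or a single machine."""
    lines = [
        f"{app} details",
        f"{scope} indicators:",
        "Indicator | Current value | Stable reference value | Anomalous?",
    ]
    for name, current, reference, anomalous in rows:
        flag = "Yes" if anomalous else "No"
        # The 'g' format drops a trailing ".0" from whole numbers.
        lines.append(f"{name} | {current:g} | {reference:g} | {flag}")
    return "\n".join(lines)


report = render_report("appname", "Cluster", [("CPU", 81, 23, False)])
```

Sending one such report per cluster or machine, rather than one message per indicator, is what lets the recipient see all dimensions together.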
This scheme uses the advantages of a real-time online data platform to obtain the source data of all high-availability indicators of the system in real time; after each indicator is modeled, all indicator information for the cluster and the single machines is available. In addition, applying an anomaly detection algorithm lowers the cost of setting warning thresholds and improves precision. Finally, through information aggregation, when a high-availability anomaly occurs, system personnel accurately receive a fine-grained aggregated result across the cluster and single-machine dimensions rather than multiple scattered single-item warnings, which helps locate problems precisely and reduces the cost of early warning.
According to an embodiment of another aspect, a monitoring device for a distributed system is also provided, the distributed system being a cluster composed of multiple single machines. Fig. 4 shows a schematic block diagram of the monitoring device of a distributed system according to one embodiment. As shown in Fig. 4, the device 400 includes:
a first acquisition unit 41, configured to obtain multiple items of high-availability indicator data of the distributed system within the current preset time period, the multiple items of high-availability indicator data including multiple items of single-machine high-availability indicator data for each single machine and multiple items of cluster high-availability indicator data for the cluster;
a second acquisition unit 42, configured to obtain the anomaly measurement function corresponding to each high-availability indicator in the current preset time period;
an assessment unit 43, configured to use the corresponding anomaly measurement functions obtained by the second acquisition unit 42 to evaluate each item of high-availability indicator data in the current preset time period, yielding a result on whether each item of high-availability indicator data needs an early warning.
Optionally, as one embodiment, the first acquisition unit 41 is specifically configured to obtain the logs of the distributed system within the current preset time period, and to parse the logs from the same source address according to a preset model to obtain the multiple items of single-machine high-availability indicator data for each single machine.
Further, the first acquisition unit 41 is also configured to perform an operation on the multiple items of single-machine high-availability indicator data of each single machine according to a preset algorithm, to determine the multiple items of cluster high-availability indicator data of the cluster.
Further, the logs obtained by the first acquisition unit 41 include runtime performance logs and/or basic service logs.
Further, the runtime performance log includes at least one of CPU usage data, load data, memory usage data, and GC count data; the high-availability indicators include at least one of a CPU usage parameter, a load parameter, a memory usage parameter, and a GC parameter.
Further, the basic service log includes at least one of the time consumed by interface method calls, the results of interface method calls, the time consumed by database-operation interface methods, and the results of database-operation interface methods; the high-availability indicators include at least one of an interface method call time parameter, an interface method call result parameter, a database-operation interface method time parameter, and a database-operation interface method result parameter.
Optionally, as one embodiment, the device further includes:
a determination unit, configured to determine the anomaly measurement function obtained by the second acquisition unit 42 in the following manner:
performing statistical analysis separately on the multiple items of high-availability indicator data obtained in at least one preset time period before the current preset time period, to determine the indicator baseline formula corresponding to each high-availability indicator of the current preset time period;
determining, according to the indicator baseline formula corresponding to each high-availability indicator of the current preset time period, the anomaly measurement function corresponding to each high-availability indicator of the current preset time period.
Further, the determination unit is specifically configured to assume that the multiple items of high-availability indicator data obtained in at least one preset time period before the current preset time period follow a normal distribution, and to determine, according to the probability of the numerical distribution, the indicator baseline formula corresponding to each high-availability indicator of the current preset time period.
Further, the determination unit is specifically configured to determine the anomaly measurement function corresponding to each high-availability indicator of the current preset time period according to the indicator baseline formula corresponding to that indicator, together with the period-over-period ratio between each high-availability indicator of the current preset time period and the same indicator of the previous preset time period, and/or the year-over-year ratio.
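One possible way to combine a baseline with a period-over-period ratio is sketched below. The text does not specify how the two are combined; the rule used here (a value is anomalous only if it lies outside the baseline band and also changed sharply relative to the previous period) is an assumption, as are the names `make_anomaly_function` and `max_ratio` and the threshold 1.5.

```python
from typing import Callable


def make_anomaly_function(lower: float, upper: float,
                          max_ratio: float = 1.5) -> Callable[[float, float], bool]:
    """Combine a baseline band with a period-over-period ratio check."""

    def is_anomalous(current: float, previous: float) -> bool:
        outside_band = current < lower or current > upper
        if previous == 0:
            # No ratio can be computed; fall back to the baseline band alone.
            return outside_band
        ratio = current / previous  # period-over-period ratio
        changed_sharply = ratio > max_ratio or ratio < 1 / max_ratio
        return outside_band and changed_sharply

    return is_anomalous


# Baseline band for CPU usage, e.g. taken from a 3-sigma calculation.
cpu_check = make_anomaly_function(lower=17.0, upper=25.0)
```

A year-over-year ratio could be checked the same way by passing the value from the corresponding earlier period as `previous`.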
Optionally, as one embodiment, the device further includes:
an early-warning unit, configured to aggregate, by cluster dimension and by single-machine dimension respectively, the multiple items of high-availability indicator data obtained by the first acquisition unit 41 and the result, obtained by the assessment unit 43, of whether each item of high-availability indicator data needs an early warning, and to assemble them into an early-warning message; and, according to the single machine or cluster to which the early-warning message corresponds, to send the message in a preset manner to the preset terminal corresponding to that single machine or cluster.
Further, the preset manner includes one or more of the following:
IM notification, short message (SMS), and phone call.
In the device provided by the embodiments of this specification, the distributed system is a cluster composed of multiple single machines. First, the first acquisition unit 41 obtains multiple items of high-availability indicator data of the distributed system within the current preset time period, including multiple items of single-machine high-availability indicator data for each single machine and multiple items of cluster high-availability indicator data for the cluster. Then, the second acquisition unit 42 obtains the anomaly measurement function corresponding to each high-availability indicator in the current preset time period. Finally, the assessment unit 43 uses the corresponding anomaly measurement functions to evaluate each item of high-availability indicator data in the current preset time period, yielding a result on whether each item needs an early warning. The embodiments therefore obtain not only the cluster's indicator data but also that of each single machine, and judge by the anomaly measurement functions whether each item needs an early warning (that is, whether an anomaly has occurred). Different preset time periods or different high-availability indicators may correspond to different anomaly measurement functions. This fine-grained monitoring of the distributed system's high-availability indicators makes it possible to discover high-availability problems promptly and locate them precisely, which facilitates rapid emergency recovery.
According to an embodiment of another aspect, a computer-readable storage medium is also provided, on which a computer program is stored; when the computer program is executed in a computer, the computer is caused to perform the method described with reference to Fig. 2 or Fig. 3.
According to an embodiment of yet another aspect, a computing device is also provided, including a memory and a processor; executable code is stored in the memory, and when the processor executes the executable code, the method described with reference to Fig. 2 or Fig. 3 is implemented.
Those skilled in the art will appreciate that, in one or more of the above examples, the functions described in the present invention can be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions can be stored in a computer-readable medium or transmitted as one or more instructions or code on a computer-readable medium.
The specific embodiments described above further explain the objectives, technical solutions, and beneficial effects of the present invention in detail. It should be understood that the foregoing describes only specific embodiments of the present invention and is not intended to limit its scope of protection; any modification, equivalent substitution, or improvement made on the basis of the technical solutions of the present invention shall fall within the scope of protection of the present invention.

Claims (24)

1. A monitoring method for a distributed system, the distributed system being a cluster composed of multiple single machines, the method comprising:
obtaining multiple items of high-availability indicator data of the distributed system within a current preset time period, the multiple items of high-availability indicator data including multiple items of single-machine high-availability indicator data for each single machine and multiple items of cluster high-availability indicator data for the cluster;
obtaining an anomaly measurement function corresponding to each high-availability indicator in the current preset time period;
using the corresponding anomaly measurement functions to evaluate each item of high-availability indicator data in the current preset time period, to obtain a result on whether each item of high-availability indicator data needs an early warning.
2. The method of claim 1, wherein obtaining the multiple items of high-availability indicator data of the distributed system within the current preset time period comprises:
obtaining logs of the distributed system within the current preset time period;
parsing the logs from the same source address according to a preset model, to obtain the multiple items of single-machine high-availability indicator data for each single machine.
3. The method of claim 2, wherein obtaining the multiple items of high-availability indicator data of the distributed system within the current preset time period further comprises:
performing an operation on the multiple items of single-machine high-availability indicator data of each single machine according to a preset algorithm, to determine the multiple items of cluster high-availability indicator data of the cluster.
4. The method of claim 2, wherein the logs include runtime performance logs and/or basic service logs.
5. The method of claim 4, wherein the runtime performance log includes at least one of central processing unit (CPU) usage data, load data, memory usage data, and garbage collection (GC) count data; and the high-availability indicators include at least one of a CPU usage parameter, a load parameter, a memory usage parameter, and a GC parameter.
6. The method of claim 4, wherein the basic service log includes at least one of the time consumed by interface method calls, the results of interface method calls, the time consumed by database-operation interface methods, and the results of database-operation interface methods; and the high-availability indicators include at least one of an interface method call time parameter, an interface method call result parameter, a database-operation interface method time parameter, and a database-operation interface method result parameter.
7. The method of claim 1, wherein the anomaly measurement function is determined in the following manner:
performing statistical analysis separately on the multiple items of high-availability indicator data obtained in at least one preset time period before the current preset time period, to determine the indicator baseline formula corresponding to each high-availability indicator of the current preset time period;
determining, according to the indicator baseline formula corresponding to each high-availability indicator of the current preset time period, the anomaly measurement function corresponding to each high-availability indicator of the current preset time period.
8. The method of claim 7, wherein performing statistical analysis separately on the multiple items of high-availability indicator data obtained in at least one preset time period before the current preset time period, to determine the indicator baseline formula corresponding to each high-availability indicator of the current preset time period, comprises:
assuming that the multiple items of high-availability indicator data obtained in at least one preset time period before the current preset time period follow a normal distribution, and determining, according to the probability of the numerical distribution, the indicator baseline formula corresponding to each high-availability indicator of the current preset time period.
9. The method of claim 7, wherein determining, according to the indicator baseline formula corresponding to each high-availability indicator of the current preset time period, the anomaly measurement function corresponding to each high-availability indicator of the current preset time period comprises:
determining the anomaly measurement function corresponding to each high-availability indicator of the current preset time period according to the indicator baseline formula corresponding to that indicator, together with the period-over-period ratio between each high-availability indicator of the current preset time period and the same indicator of the previous preset time period, and/or the year-over-year ratio.
10. The method of any one of claims 1 to 9, wherein the method further comprises:
aggregating, by cluster dimension and by single-machine dimension respectively, the multiple items of high-availability indicator data and the result of whether each item of high-availability indicator data needs an early warning, and assembling them into an early-warning message;
sending, according to the single machine or cluster to which the early-warning message corresponds, the early-warning message in a preset manner to the preset terminal corresponding to that single machine or cluster.
11. The method of claim 10, wherein the preset manner includes one or more of the following:
instant messaging (IM) notification, short message (SMS), and phone call.
12. A monitoring device for a distributed system, the distributed system being a cluster composed of multiple single machines, the device comprising:
a first acquisition unit, configured to obtain multiple items of high-availability indicator data of the distributed system within a current preset time period, the multiple items of high-availability indicator data including multiple items of single-machine high-availability indicator data for each single machine and multiple items of cluster high-availability indicator data for the cluster;
a second acquisition unit, configured to obtain an anomaly measurement function corresponding to each high-availability indicator in the current preset time period;
an assessment unit, configured to use the corresponding anomaly measurement functions obtained by the second acquisition unit to evaluate each item of high-availability indicator data in the current preset time period, to obtain a result on whether each item of high-availability indicator data needs an early warning.
13. The device of claim 12, wherein the first acquisition unit is specifically configured to obtain logs of the distributed system within the current preset time period, and to parse the logs from the same source address according to a preset model to obtain the multiple items of single-machine high-availability indicator data for each single machine.
14. The device of claim 13, wherein the first acquisition unit is also configured to perform an operation on the multiple items of single-machine high-availability indicator data of each single machine according to a preset algorithm, to determine the multiple items of cluster high-availability indicator data of the cluster.
15. The device of claim 13, wherein the logs obtained by the first acquisition unit include runtime performance logs and/or basic service logs.
16. The device of claim 15, wherein the runtime performance log includes at least one of central processing unit (CPU) usage data, load data, memory usage data, and garbage collection (GC) count data; and the high-availability indicators include at least one of a CPU usage parameter, a load parameter, a memory usage parameter, and a GC parameter.
17. The device of claim 15, wherein the basic service log includes at least one of the time consumed by interface method calls, the results of interface method calls, the time consumed by database-operation interface methods, and the results of database-operation interface methods; and the high-availability indicators include at least one of an interface method call time parameter, an interface method call result parameter, a database-operation interface method time parameter, and a database-operation interface method result parameter.
18. The device of claim 12, wherein the device further comprises:
a determination unit, configured to determine the anomaly measurement function obtained by the second acquisition unit in the following manner:
performing statistical analysis separately on the multiple items of high-availability indicator data obtained in at least one preset time period before the current preset time period, to determine the indicator baseline formula corresponding to each high-availability indicator of the current preset time period;
determining, according to the indicator baseline formula corresponding to each high-availability indicator of the current preset time period, the anomaly measurement function corresponding to each high-availability indicator of the current preset time period.
19. The device of claim 18, wherein the determination unit is specifically configured to assume that the multiple items of high-availability indicator data obtained in at least one preset time period before the current preset time period follow a normal distribution, and to determine, according to the probability of the numerical distribution, the indicator baseline formula corresponding to each high-availability indicator of the current preset time period.
20. The device of claim 18, wherein the determination unit is specifically configured to determine the anomaly measurement function corresponding to each high-availability indicator of the current preset time period according to the indicator baseline formula corresponding to that indicator, together with the period-over-period ratio between each high-availability indicator of the current preset time period and the same indicator of the previous preset time period, and/or the year-over-year ratio.
21. The device of any one of claims 12 to 20, wherein the device further comprises:
an early-warning unit, configured to aggregate, by cluster dimension and by single-machine dimension respectively, the multiple items of high-availability indicator data obtained by the first acquisition unit and the result, obtained by the assessment unit, of whether each item of high-availability indicator data needs an early warning, and to assemble them into an early-warning message; and, according to the single machine or cluster to which the early-warning message corresponds, to send the message in a preset manner to the preset terminal corresponding to that single machine or cluster.
22. The device of claim 21, wherein the preset manner includes one or more of the following:
instant messaging (IM) notification, short message (SMS), and phone call.
23. A computer-readable storage medium on which a computer program is stored, wherein, when the computer program is executed in a computer, the computer is caused to perform the method of any one of claims 1 to 11.
24. A computing device, including a memory and a processor, wherein executable code is stored in the memory, and when the processor executes the executable code, the method of any one of claims 1 to 11 is implemented.
CN201910027660.1A 2019-01-11 2019-01-11 The monitoring method and device of distributed system Pending CN109947615A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910027660.1A CN109947615A (en) 2019-01-11 2019-01-11 The monitoring method and device of distributed system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910027660.1A CN109947615A (en) 2019-01-11 2019-01-11 The monitoring method and device of distributed system

Publications (1)

Publication Number Publication Date
CN109947615A true CN109947615A (en) 2019-06-28

Family

ID=67007291

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910027660.1A Pending CN109947615A (en) 2019-01-11 2019-01-11 The monitoring method and device of distributed system

Country Status (1)

Country Link
CN (1) CN109947615A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113419925A (en) * 2021-08-25 2021-09-21 天津南大通用数据技术股份有限公司 Monitoring method and system for monitoring and alarming multiple distributed MPP clusters
WO2021184554A1 (en) * 2020-03-18 2021-09-23 平安科技(深圳)有限公司 Database exception monitoring method and device, computer device, and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140032379A1 (en) * 2012-07-27 2014-01-30 Wolfgang Schuetz On-shelf availability system and method
CN105072201A (en) * 2015-08-28 2015-11-18 北京奇艺世纪科技有限公司 Distributed storage system and storage quality control method and device thereof
CN108365985A (en) * 2018-02-07 2018-08-03 深圳壹账通智能科技有限公司 A kind of cluster management method, device, terminal device and storage medium

Similar Documents

Publication Publication Date Title
CN105868373B (en) Method and device for processing key data of power business information system
CN105095747B (en) Java application health degree assessment method and system
CN105830037A (en) Process for displaying test coverage data during code reviews
CN108197261A (en) Smart traffic operating system
CN110287081A (en) Service monitoring system and method
CN101632093A (en) System and method for managing performance faults using statistical analysis
CN111865407B (en) Intelligent early warning method, device, equipment and storage medium for optical channel performance degradation
CN106940677A (en) Application log data alarm method and device
CN110162445A (en) Host health assessment method and device based on host logs and performance indicators
CN113051147A (en) Database cluster monitoring method, device, system and equipment
CN107330080A (en) Data processing method and device, and computer equipment applying the same
CN111552607A (en) Health evaluation method, device and equipment of application program and storage medium
CN110457371A (en) Data managing method, device, storage medium and system
CN111367747B (en) Index abnormal detection early warning device based on time annotation
CN112633542A (en) System performance index prediction method, device, server and storage medium
US11243951B2 (en) Systems and methods for automated analysis, screening, and reporting of group performance
CN114021971A (en) Comprehensive evaluation system, method and storage medium for expressway operation and maintenance management
CN109947615A (en) The monitoring method and device of distributed system
EP2866174A1 (en) Automated generation and dynamic update of rules
CN109829615B (en) Target task multistage monitoring device and method based on proprietary cloud
CN106951360B (en) Data statistical integrity calculation method and system
CN110182871A (en) Water treatment method and terminal based on a fully automatic dosing system
CN113379230A (en) Inspection regulation and control system and method based on big data
CN110517731A (en) Genetic test quality monitoring data processing method and system
CN113656452B (en) Method and device for detecting call chain index abnormality, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20201014

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant after: Innovative advanced technology Co.,Ltd.

Address before: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant before: Advanced innovation technology Co.,Ltd.

Effective date of registration: 20201014

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant after: Advanced innovation technology Co.,Ltd.

Address before: Fourth floor, P.O. Box 847, Capital Building, Grand Cayman, Cayman Islands

Applicant before: Alibaba Group Holding Ltd.

RJ01 Rejection of invention patent application after publication

Application publication date: 20190628
