CN108984320A

CN108984320A - Split-brain prevention method and device for a message queue cluster

Info

Publication number: CN108984320A
Application number: CN201810682010.6A
Authority: CN
Inventors: 苏志远
Original assignee: Zhengzhou Yunhai Information Technology Co Ltd
Current assignee: Zhengzhou Yunhai Information Technology Co Ltd
Priority date: 2018-06-27
Filing date: 2018-06-27
Publication date: 2018-12-11

Abstract

The embodiments of the present application disclose a method and device for preventing split-brain in a message queue cluster. The method comprises: constructing an alarm model according to the status information of a node when the message queue is abnormal, the alarm model including N elements that characterize the status information of the node; performing time series analysis on each of the N elements at the node's current time to obtain the predicted value of each of the N elements at the next time; and raising an alarm if the predicted value of a first element is greater than or equal to the alarm threshold corresponding to the first element, the first element being any one of the N elements. In the embodiments of the present application, the causes of message queue split-brain are modeled and analyzed to obtain a threshold for each element that can cause a message queue abnormality, and these thresholds are used as alarm conditions. In addition, each alarm element is predicted and an alarm is raised according to the predicted value, which effectively prevents message queue split-brain.

Description

Split-brain prevention method and device for a message queue cluster

Technical field

The present application relates to the field of computer technology, and in particular to a method and device for preventing split-brain in a message queue cluster.

Background

OpenStack is a combination of open source tools (or open source projects) that mainly uses pooled virtual resources to build and manage private and public clouds. It provides core cloud computing services including compute, networking, storage, identity authentication, image, and object storage services, which users can bundle together to create unique, deployable cloud architectures.

HA (High Availability, as in a high-availability cluster) is an effective solution for guaranteeing business continuity. Usually, high availability within a message queue cluster is achieved through mirrored queues (mirror mode): the data of a mirrored queue is replicated to all nodes, so the failure of any single node does not affect the use of the whole cluster. In terms of implementation, mirrored queues have an internal election algorithm that elects one master (primary node) and several slaves (secondary nodes). The master and the slaves check whether the connection has been lost by continuously sending heartbeats to each other. In an HA system, when the "heartbeat" connecting 2 nodes is lost, the HA system that originally acted as a coordinated whole splits into 2 independent individuals. Because they have lost contact with each other, each side assumes that the other has failed, and the HA software on the two nodes competes for "shared resources" and "application services". As a result, either the shared resources are carved up and the "service" on neither side can start, or the "services" on both sides start but read and write the "shared storage" at the same time, which corrupts data; a common example is errors in the online logs being polled by a database.

To counter split-brain in message queue clusters, the prior art generally uses the following schemes. First, add redundant heartbeats, for example dual heartbeat lines, to reduce the probability of split-brain as far as possible. Second, set up an arbitration mechanism, such as a reference IP (for example the gateway IP). When the heartbeat is completely lost, each of the 2 nodes pings the reference IP. If the ping fails, the break is on the local side: not only the "heartbeat" but also the local network link that provides the external "service" is down, so starting (or continuing) the application service would be useless. That node therefore voluntarily gives up the competition and lets the side that can ping the reference IP provide the service. To be safer, the side that cannot ping the reference IP reboots itself, to thoroughly release any shared resources it may still hold.
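The reference-IP arbitration described above can be sketched as a small decision function. This is a minimal illustration, not the logic of any particular HA product: the names `Action` and `arbitrate` are hypothetical, and the ping result is passed in as a boolean so that no real network call is assumed.

```python
from enum import Enum


class Action(Enum):
    KEEP_RUNNING = "keep_running"  # heartbeat is fine, nothing to decide
    TAKE_OVER = "take_over"        # this node may provide the service
    GIVE_UP = "give_up"            # this node abandons the competition


def arbitrate(heartbeat_alive: bool, can_ping_reference_ip: bool) -> Action:
    """Decide what a node does, following the reference-IP rule in the text."""
    if heartbeat_alive:
        return Action.KEEP_RUNNING
    # Heartbeat is lost: the node that still reaches the reference IP
    # (for example the gateway) serves; the other one gives up.
    if can_ping_reference_ip:
        return Action.TAKE_OVER
    return Action.GIVE_UP
```

In this sketch, two nodes that have lost the heartbeat but can both still reach the reference IP both return `TAKE_OVER`, which illustrates why such arbitration lowers the probability of split-brain without ruling it out.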

However, neither adding redundant heartbeats nor setting up an arbitration mechanism can avoid split-brain completely; these schemes only reduce the probability that it occurs. A better split-brain prevention scheme for message queue clusters is therefore urgently needed.

Summary of the invention

The embodiments of the present application provide a method and device for preventing split-brain in a message queue cluster, to help solve the problem that message queue clusters in the prior art cannot avoid split-brain completely.

In a first aspect, an embodiment of the present application provides a method for preventing split-brain in a message queue cluster, comprising:

Constructing an alarm model according to the status information of a node when the message queue is abnormal, the alarm model including N elements that characterize the status information of the node, wherein each element is configured with an alarm threshold and N >= 1;

Performing time series analysis on each of the N elements at the node's current time to obtain the predicted value of each of the N elements at the next time;

Raising an alarm if the predicted value of a first element is greater than or equal to the alarm threshold corresponding to the first element, the first element being any one of the N elements.

Optionally, the method further comprises:

Closing the message queue on the node if the number of alarms of the node exceeds a preset alarm count threshold.

Optionally, the alarm model is a triple whose elements are CPU utilization, memory utilization, and a network fluctuation coefficient.

Optionally, the alarm threshold of the CPU utilization is the average of the CPU utilization historical data; the alarm threshold of the memory utilization is the average of the memory utilization historical data; and the alarm threshold of the network fluctuation coefficient is the average of the network fluctuation coefficient historical data.

Optionally, performing time series analysis on each of the N elements at the node's current time to obtain the predicted value of each of the N elements at the next time comprises:

Performing, based on the ARIMA algorithm, time series analysis on each of the N elements at the node's current time to obtain the predicted value of each of the N elements at the next time.

In a second aspect, an embodiment of the present application provides a device for preventing split-brain in a message queue cluster, comprising:

A construction module, configured to construct an alarm model according to the status information of a node when the message queue is abnormal, the alarm model including N elements that characterize the status information of the node, wherein each element is configured with an alarm threshold and N >= 1;

A prediction module, configured to perform time series analysis on each of the N elements at the node's current time to obtain the predicted value of each of the N elements at the next time;

An alarm module, configured to raise an alarm if the predicted value of a first element is greater than or equal to the alarm threshold corresponding to the first element, the first element being any one of the N elements.

Optionally, the device further comprises:

A closing module, configured to close the message queue on the node if the number of alarms of the node exceeds a preset alarm count threshold.

Optionally, the prediction module is specifically configured to perform, based on the ARIMA algorithm, time series analysis on each of the N elements at the node's current time to obtain the predicted value of each of the N elements at the next time.

In the embodiments of the present application, the causes of message queue split-brain are modeled and analyzed to obtain a threshold for each element that can cause a message queue abnormality, and these thresholds are used as alarm conditions. In addition, each alarm element is predicted and an alarm is raised according to the predicted value, so that operation and maintenance personnel are notified before the message queue becomes abnormal. After alarms are generated, the process is killed automatically, which effectively prevents message queue split-brain.

Brief description of the drawings

To explain the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, those of ordinary skill in the art can obtain other drawings from these drawings without creative effort.

Fig. 1 is a schematic flowchart of a split-brain prevention method for a message queue cluster provided by an embodiment of the present application;

Fig. 2 is a schematic structural diagram of a split-brain prevention device for a message queue cluster provided by an embodiment of the present application.

Detailed description of the embodiments

To help those skilled in the art better understand the technical solutions of the present application, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art from the embodiments of the present application without creative effort shall fall within the protection scope of the present application.

The concepts involved in the embodiments of the present application are explained first.

OpenStack: an open source cloud computing management platform project in which several main components combine to accomplish specific work. The project aims to provide a cloud computing platform that is simple to implement, massively scalable, feature-rich, and based on unified standards. OpenStack delivers an infrastructure as a service (IaaS) solution through a variety of complementary services, each of which provides an API for integration.

Hypervisor: a hypervisor is an intermediate software layer that runs between the physical server and the operating systems and allows multiple operating systems and applications to share one set of underlying physical hardware. It can therefore be regarded as the "meta" operating system of a virtual environment: it coordinates access to all physical devices and virtual machines on the server, and is also called a virtual machine monitor (Virtual Machine Monitor). The hypervisor is the core of all virtualization technologies, and the ability to migrate multiple workloads without interruption is one of its basic functions. When a server starts and runs the hypervisor, the hypervisor allocates an appropriate amount of memory, CPU, network, and disk to each virtual machine and loads the guest operating systems of all virtual machines.

Message queue (RabbitMQ): the architecture of OpenStack requires a message queue mechanism for communication between different modules. Through an architecture of message validation, message transformation, and message routing, modules can be decoupled to the greatest extent: a client does not need to care where the server is or whether it exists, and only needs to exchange information through the message queue. RabbitMQ is suitable for deployment in system environments with a flexible, scalable topology, and effectively guarantees the timeliness of message communication between different modules, different nodes, and different processes. It can effectively support the OpenStack cloud platform's requirements for large-scale deployment, elastic expansion, flexible architecture, and information security.

Message queue network partition (split-brain): in a high-availability (HA) system, when the "heartbeat" connecting 2 nodes is lost, the HA system that originally acted as a coordinated whole splits into 2 independent individuals. Because they have lost contact with each other, each side assumes that the other has failed. The HA software on the two nodes, like a "split-brain" patient, competes for "shared resources" and "application services". As a result, either the shared resources are carved up and the "service" on neither side can start, or the "services" on both sides start but read and write the "shared storage" at the same time, which corrupts data; a common example is errors in the online logs being polled by a database.

The HA design of an OpenStack cluster has two aspects: stateless and stateful. Stateless means that the running instances of a service do not store any data locally that needs to be persisted, and that multiple instances return exactly the same result for the same request. After such a service is created on the NetEase Honeycomb cloud platform, it is load-balanced inside k8s: when a request for the service arrives, the load balancer selects an instance to respond to the request (currently by round-robin polling). Instances of such a service may be stopped or re-created for various reasons (for example during scale-out), and all information in the stopped instances (except logs and monitoring data) is then lost (restarting a container loses it). So if important information must be retained in a container instance and backed up so that it can be restored at any time, a stateful service should be created instead. Stateful service (Stateful Service): an instance of the service can back up part of its data at any time, and when a new stateful service is created, that data can be restored from a backup, which achieves data persistence. A stateful service can have only one instance and therefore does not support "automatic service scaling". Generally, database services, or applications that need to store configuration files or other persistent data in the local file system, can be created as stateful services. Creating a stateful service has several prerequisites: the image of the service to be created must define a storage volume (Volume) in its Dockerfile, because only the data in the directory where the storage volume is located can be backed up; when the service is created, the disk space allocated to the storage volume must be specified; and if data is to be restored from a previous backup when the service is created, the backup from which the storage volume is restored must also be specified.

In general, stateless modules are fairly simple to handle: the basic idea is to run multiple nodes or service modules in parallel and load-balance across them. A typical example is the web server cluster of a website, where front-end load-balancing servers such as LVS or Nginx are often added to solve the HA problem; the high availability of LVS and Nginx themselves is mainly achieved with tools such as Keepalived and Heartbeat, based on the Virtual Router Redundancy Protocol (VRRP) or a heartbeat arbitration mechanism. For stateful modules, there are two main ways to implement HA. In the first, multiple nodes maintain identical state through a distributed consensus protocol (such as Paxos or Raft); typical representatives are Zookeeper and RabbitMQ. In the second, identical state is maintained through synchronous or asynchronous replication in master-slave mode, as in MySQL and Redis. Of the two, the former is more complex and can perform very poorly in some scenarios, while the latter falls short in data consistency and scalability.

However, adding redundant heartbeats or setting up an arbitration mechanism cannot avoid split-brain completely. These measures only reduce the probability, and they only prevent heartbeat loss caused by network fluctuation. For example, dual heartbeats can at best guard against network fluctuation, whereas a hang caused by memory or CPU has nothing to do with the network. When a message queue node responds very slowly because of memory or CPU pressure and the message queue service appears dead, dual heartbeat lines cannot solve the problem.

On this basis, an embodiment of the present application provides a split-brain prevention method for a message queue cluster. Fig. 1 is a schematic flowchart of this method; as shown in Fig. 1, it mainly includes the following steps.

Step S101: construct an alarm model according to the status information of a node when the message queue is abnormal, the alarm model including N elements that characterize the status information of the node, wherein each element is configured with an alarm threshold and N >= 1.

In the embodiments of the present application, the status information may include CPU utilization, memory utilization, network state, and so on. Specifically, the causes of message queue cluster abnormality are analyzed and all factors that may cause a message queue abnormality are enumerated, for example the node's CPU utilization, memory utilization, and network state. Correspondingly, the alarm model may be a triple S = {C, M, N}, where C is the CPU utilization, M is the memory utilization, and N is the network fluctuation coefficient.

In addition, an alarm threshold must be set for each element. Specifically, the average value of each element of the triple over a past period of time can be computed: S(avg) = {C(avg), M(avg), N(avg)}.
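The triple and its average-based thresholds can be sketched as follows. This is a minimal illustration under assumptions not stated in the source: the names `NodeState` and `average_thresholds` are hypothetical, the values are treated as plain floats, and how the history window is chosen and how the network fluctuation coefficient is measured are left open.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Sequence


@dataclass(frozen=True)
class NodeState:
    """One sample of the triple S = {C, M, N}."""
    cpu: float      # C: CPU utilization
    memory: float   # M: memory utilization
    network: float  # N: network fluctuation coefficient


def average_thresholds(history: Sequence[NodeState]) -> NodeState:
    """Return S(avg) = {C(avg), M(avg), N(avg)} over the given history."""
    if not history:
        raise ValueError("history must contain at least one sample")
    return NodeState(
        cpu=mean(s.cpu for s in history),
        memory=mean(s.memory for s in history),
        network=mean(s.network for s in history),
    )
```

The returned `NodeState` holds one alarm threshold per element, matching the requirement that each of the N elements has its own threshold.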

Step S102: perform time series analysis on each of the N elements at the node's current time to obtain the predicted value of each of the N elements at the next time.

In the embodiments of the present application, time series analysis can be performed on each element of the triple based on the ARIMA algorithm to obtain the predicted value of each element of the triple at the next time. Specifically, time series analysis is performed on the CPU utilization, the memory utilization, and the network fluctuation coefficient to obtain a predicted CPU utilization, a predicted memory utilization, and a predicted network fluctuation coefficient, respectively.
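The source names the ARIMA algorithm but gives no model orders or fitting procedure. As a hedged illustration only, the sketch below uses a deliberately simplified stand-in: a zero-mean ARIMA(1,1,0) model whose single coefficient is estimated by conditional least squares. The function name `forecast_next` and the choice of orders are assumptions; a real deployment would more likely rely on a full ARIMA implementation from a statistics library.

```python
from typing import Sequence


def forecast_next(series: Sequence[float]) -> float:
    """One-step forecast with a zero-mean ARIMA(1,1,0) fitted by least squares.

    Model:    d_t = phi * d_(t-1) + e_t, where d_t = y_t - y_(t-1).
    Forecast: y_(T+1) = y_T + phi * d_T.
    """
    if len(series) < 3:
        raise ValueError("need at least 3 observations")
    # First-order differencing (the "I" part with d = 1).
    diffs = [b - a for a, b in zip(series, series[1:])]
    # Conditional least-squares estimate of the AR(1) coefficient.
    num = sum(cur * prev for prev, cur in zip(diffs, diffs[1:]))
    den = sum(prev * prev for prev in diffs[:-1])
    phi = num / den if den else 0.0  # flat history: fall back to "no change"
    return series[-1] + phi * diffs[-1]
```

The function would be called once per element, for example on the recent CPU utilization samples, and its result compared with that element's alarm threshold.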

Step S103: raise an alarm if the predicted value of a first element is greater than or equal to the alarm threshold corresponding to the first element, the first element being any one of the N elements.

In the embodiments of the present application, the predicted value may be the predicted CPU utilization, the predicted memory utilization, or the predicted network fluctuation coefficient. For example, an alarm is raised when the predicted CPU utilization is greater than or equal to the alarm threshold set for it.

Of course, those skilled in the art can set the alarm policy according to actual needs. For example, the policy can be set so that an alarm is raised only when all elements exceed their alarm thresholds; such variations shall all fall within the protection scope of the present application.
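The comparison in step S103 and the alternative "all elements" policy can be sketched as one function. The name `should_alarm`, the dictionary-based representation, and the `require_all` flag are hypothetical, and the sketch uses the same "greater than or equal to" comparison for both policies, which is an assumption for the alternative policy.

```python
from typing import Mapping


def should_alarm(
    predicted: Mapping[str, float],
    thresholds: Mapping[str, float],
    require_all: bool = False,
) -> bool:
    """Compare each predicted element value against its alarm threshold."""
    if not thresholds:
        raise ValueError("at least one element is required (N >= 1)")
    exceeded = [predicted[name] >= thresholds[name] for name in thresholds]
    # Default policy: alarm if ANY element reaches its threshold (step S103).
    # Alternative policy from the text: alarm only if ALL elements do.
    return all(exceeded) if require_all else any(exceeded)
```

With the default policy a single overloaded resource is enough to raise an alarm, while `require_all=True` raises it only when every element has reached its threshold.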

In addition, to prevent split-brain from occurring, the embodiments of the present application also set an alarm count threshold. When the number of alarms on a node exceeds the configured alarm count threshold, the message queue on that node is closed first, that is, killed, regardless of the state the message queue cluster is in, to prevent split-brain.
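The alarm count rule can be sketched as a small counter. The class name `AlarmCounter` and the `stop_queue` callback are hypothetical; the source does not say how the queue process is killed, nor whether the count is ever reset or limited to a time window, so this sketch simply counts all alarms since creation.

```python
from typing import Callable


class AlarmCounter:
    """Counts alarms for one node and stops its queue past a threshold."""

    def __init__(self, max_alarms: int, stop_queue: Callable[[], None]) -> None:
        self.max_alarms = max_alarms
        self.stop_queue = stop_queue  # hypothetical callback that kills the queue process
        self.count = 0
        self.stopped = False

    def record_alarm(self) -> bool:
        """Register one alarm; return True if this call stopped the queue."""
        self.count += 1
        # "Exceeds" is read strictly: stop only when count > max_alarms.
        if self.count > self.max_alarms and not self.stopped:
            self.stop_queue()
            self.stopped = True
            return True
        return False
```

Stopping the suspect node's queue before its heartbeat is actually lost is what distinguishes this approach from the reactive arbitration schemes of the prior art.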

Corresponding to the above method embodiment, the present application also provides a device for preventing split-brain in a message queue cluster. Fig. 2 is a schematic structural diagram of this device; as shown in Fig. 2, it mainly includes the following modules.

A construction module 201, configured to construct an alarm model according to the status information of a node when the message queue is abnormal, the alarm model including N elements that characterize the status information of the node, wherein each element is configured with an alarm threshold and N >= 1;

A prediction module 202, configured to perform time series analysis on each of the N elements at the node's current time to obtain the predicted value of each of the N elements at the next time;

An alarm module 203, configured to raise an alarm if the predicted value of a first element is greater than or equal to the alarm threshold corresponding to the first element, the first element being any one of the N elements.

In an optional embodiment, the device further includes a closing module 204, configured to close the message queue on the node if the number of alarms of the node exceeds a preset alarm count threshold.

In an optional embodiment, the alarm model is a triple whose elements are CPU utilization, memory utilization, and a network fluctuation coefficient.

In an optional embodiment, the alarm threshold of the CPU utilization is the average of the CPU utilization historical data; the alarm threshold of the memory utilization is the average of the memory utilization historical data; and the alarm threshold of the network fluctuation coefficient is the average of the network fluctuation coefficient historical data.

In an optional embodiment, the prediction module is specifically configured to perform, based on the ARIMA algorithm, time series analysis on each of the N elements at the node's current time to obtain the predicted value of each of the N elements at the next time.
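One way the modules 201 to 204 could fit together is sketched below. This is an architectural illustration only: the class `AntiSplitBrainDevice`, its method names, and the injected `predict` and `stop_queue` callables are all hypothetical, and the prediction step is left pluggable because the source does not specify the ARIMA configuration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Sequence

Series = Dict[str, List[float]]  # element name -> observed values over time


@dataclass
class AntiSplitBrainDevice:
    predict: Callable[[Sequence[float]], float]  # prediction module 202
    stop_queue: Callable[[], None]               # action of closing module 204
    max_alarms: int
    thresholds: Dict[str, float] = field(default_factory=dict)
    alarm_count: int = 0

    def build_model(self, history: Series) -> None:
        """Construction module 201: one average-based threshold per element."""
        self.thresholds = {k: sum(v) / len(v) for k, v in history.items()}

    def step(self, recent: Series) -> bool:
        """Run modules 202 to 204 once; return True if an alarm was raised."""
        predicted = {k: self.predict(recent[k]) for k in self.thresholds}
        # Alarm module 203: any element at or above its threshold.
        alarm = any(predicted[k] >= self.thresholds[k] for k in self.thresholds)
        if alarm:
            self.alarm_count += 1
            if self.alarm_count > self.max_alarms:
                self.stop_queue()
        return alarm
```

For a quick trial, `predict` can be a placeholder such as `lambda s: s[-1]` (repeat the last observation), to be swapped for a real time series model later.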

In a specific implementation, the present application also provides a computer storage medium. The computer storage medium may store a program which, when executed, may perform some or all of the steps of the embodiments provided by the present application. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.

Those skilled in the art can clearly understand that the technology in the embodiments of the present application can be implemented by software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solutions in the embodiments of the present application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a storage medium, such as a ROM/RAM, a magnetic disk, or an optical disc, and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments of the present application or in certain parts of the embodiments.

Identical and similar parts of the embodiments in this specification may be referred to each other. In particular, because the terminal embodiment is substantially similar to the method embodiment, it is described relatively briefly; for relevant details, refer to the description of the method embodiment.

The embodiments of the present application described above do not limit the protection scope of the present application.

Claims

1. A method for preventing split-brain in a message queue cluster, characterized by comprising:

Constructing an alarm model according to the status information of a node when the message queue is abnormal, the alarm model including N elements that characterize the status information of the node, wherein each element is configured with an alarm threshold and N >= 1;

Performing time series analysis on each of the N elements at the node's current time to obtain the predicted value of each of the N elements at the next time;

Raising an alarm if the predicted value of a first element is greater than or equal to the alarm threshold corresponding to the first element, the first element being any one of the N elements.

2. the method for the anti-fissure of message queue cluster according to claim 1, which is characterized in that further include:

If the alarm number of the node is more than preset alarm frequency threshold value, the message queue on the node is closed.

3. the method for the anti-fissure of message queue cluster according to claim 1 or 2, which is characterized in that the alarm model For a trinary data group, the element of the trinary data group is cpu busy percentage, memory usage and network fluctuation coefficient.

4. the method for the anti-fissure of message queue cluster according to claim 3, which is characterized in that the cpu busy percentage Alarm threshold is cpu busy percentage historical data average value；The alarm threshold of the memory usage is memory usage history number According to average value；The alarm threshold of the network fluctuation coefficient is network fluctuation coefficient historical data average value.

5. the method for the anti-fissure of message queue cluster according to claim 1, which is characterized in that described to work as to the node N number of element at preceding moment carries out time series analysis respectively, obtains the corresponding element prediction of N number of element described in subsequent time Value, comprising:

Based on Arima algorithm, time series analysis is carried out respectively to N number of element at the node current time, under acquisition The corresponding element prediction value of N number of element described in one moment.

6. A device for preventing split-brain in a message queue cluster, characterized by comprising:

A construction module, configured to construct an alarm model according to the status information of a node when the message queue is abnormal, the alarm model including N elements that characterize the status information of the node, wherein each element is configured with an alarm threshold and N >= 1;

A prediction module, configured to perform time series analysis on each of the N elements at the node's current time to obtain the predicted value of each of the N elements at the next time;

An alarm module, configured to raise an alarm if the predicted value of a first element is greater than or equal to the alarm threshold corresponding to the first element, the first element being any one of the N elements.

7. The device for preventing split-brain in a message queue cluster according to claim 6, characterized by further comprising:

A closing module, configured to close the message queue on the node if the number of alarms of the node exceeds a preset alarm count threshold.

8. The device for preventing split-brain in a message queue cluster according to claim 6 or 7, characterized in that the alarm model is a triple whose elements are CPU utilization, memory utilization, and a network fluctuation coefficient.

9. The device for preventing split-brain in a message queue cluster according to claim 8, characterized in that the alarm threshold of the CPU utilization is the average of the CPU utilization historical data; the alarm threshold of the memory utilization is the average of the memory utilization historical data; and the alarm threshold of the network fluctuation coefficient is the average of the network fluctuation coefficient historical data.

10. The device for preventing split-brain in a message queue cluster according to claim 6, characterized in that the prediction module is specifically configured to perform, based on the ARIMA algorithm, time series analysis on each of the N elements at the node's current time to obtain the predicted value of each of the N elements at the next time.