CN103995901B

CN103995901B - Method for determining data node failure

Info

Publication number: CN103995901B
Application number: CN201410254980.8A
Authority: CN
Inventors: 赵晓平; 唐超; 马丽伟; 秦波; 王�锋
Original assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Current assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Priority date: 2014-06-10
Filing date: 2014-06-10
Publication date: 2018-01-12
Anticipated expiration: 2034-06-10
Also published as: CN103995901A

Abstract

The invention discloses a method for determining data node failure in a distributed database. The method includes: among all application nodes that access the distributed database, when any application node cannot connect to a data node in the distributed database, that application node broadcasts to the other application nodes that it cannot connect to the data node; after receiving the broadcast, the other application nodes send connection requests to the data node to determine whether they can connect to it; and when the number of application nodes that cannot connect to the data node reaches a set threshold, the data node is determined to have failed. Because each application node has a different IP address, the method avoids the effect that network fluctuation on a single IP has when synchronization requests are sent to the data node from that IP alone, and can therefore judge the cause of a data node failure more accurately.

Description

A method for determining data node failure

Technical field

The present invention relates to the field of distributed databases, and in particular to a method for determining data node failure.

Background art

With the continuous development of network technology, the requirements for data storage and access keep rising, and distributed databases have emerged in response. The high scalability and high availability of distributed databases address the problems of many websites that need to run without interruption.

A distributed database is made up of sub-databases distributed across multiple computer nodes. Each sub-database on a computer node is called a data node; the data nodes are logically related and have equal status. To ensure normal operation of the whole distributed database, the running state of each data node must be known in real time in order to determine whether it can provide service normally, that is, whether the data node is valid. Causes such as network fluctuation and hardware faults can all lead to data node failure: for example, network fluctuation can cause temporary failure of a data node, while a hardware fault causes permanent failure. An effective means is therefore needed to determine whether a data node has failed.

Cassandra is an open-source distributed NoSQL database system. Because of its good scalability, Cassandra has been adopted by many well-known websites and has become a popular distributed structured data storage scheme. In Cassandra, node failure is judged using accrual failure detection, a detection method based on a suspicion level. The basic idea is that, in a distributed environment, a value representing the failure suspicion level is used to judge whether a data node has failed. Within a certain time window, synchronization requests are sent continuously to the data node; each time the data node fails to respond to a synchronization message, its failure suspicion value is increased by 1, and once the failure suspicion value reaches a set threshold, the data node is determined to have failed permanently.

Because the suspicion-based detection method sends synchronization requests to the data node from a single IP, it cannot adequately avoid the effect of network fluctuation on those requests. For a period of time, network fluctuation may cause the loss of synchronization requests and/or of the data node's responses to them. During that period the failure suspicion value of the data node may increase sharply and may even reach the set threshold, so that the data node is judged to have failed permanently, although after that period the data node is still available and has not truly failed permanently. The existing suspicion-based detection method can therefore misjudge data node failure in use.
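The misjudgment can be illustrated with a minimal sketch of the counting scheme as this document describes it. It is a simplification for illustration only, not Cassandra's actual detector; the class name `SuspicionCounter`, the threshold of 3 and the response sequence are all hypothetical.

```python
class SuspicionCounter:
    """Counts missed synchronization responses for one data node (illustrative only)."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold
        self.suspicion = 0

    def record(self, responded: bool) -> None:
        # Each missed response raises the suspicion value by 1.
        if not responded:
            self.suspicion += 1

    def is_failed(self) -> bool:
        # The node is judged failed once the suspicion value reaches the threshold.
        return self.suspicion >= self.threshold


# Hypothetical sequence seen from a single IP: a short burst of network
# fluctuation drops three responses, after which the node answers again.
counter = SuspicionCounter(threshold=3)
for responded in [True, False, False, False, True]:
    counter.record(responded)
```

In this run the counter reaches its threshold during the burst, so the data node is marked as failed even though it responds again afterwards.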

Summary of the invention

In view of this, the present invention provides a method for determining data node failure, so as to accurately judge whether a data node has failed temporarily because of the network or permanently because of hardware.

The technical scheme of the present application is realized as follows:

A method for determining data node failure, for use with a distributed database, the method including:

Among all application nodes that access the distributed database, when any application node cannot connect to a data node in the distributed database, it broadcasts to the other application nodes that it cannot connect to the data node;

After receiving the broadcast, the other application nodes send connection requests to the data node to determine whether they can connect to it;

When the number of application nodes that cannot connect to the data node reaches a set threshold, the data node is determined to have failed.
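The threshold test in the last step can be sketched as follows. This is a minimal illustration, assuming that the outcome of each application node's connection request has already been collected and that the broadcasting node's own failed attempt is included in the count; the function name `data_node_failed` and the IP addresses are hypothetical.

```python
from typing import Dict


def data_node_failed(reachable_from: Dict[str, bool], threshold: float) -> bool:
    """Return True when enough application nodes cannot reach the data node.

    reachable_from maps an application node's IP to the outcome of its own
    connection request (True means the connection succeeded).
    """
    unreachable = sum(1 for ok in reachable_from.values() if not ok)
    return unreachable >= threshold


# Hypothetical outcomes gathered after a broadcast from 10.0.0.1.
jitter_on_one_ip = {"10.0.0.1": False, "10.0.0.2": True, "10.0.0.3": True, "10.0.0.4": True}
node_really_down = {"10.0.0.1": False, "10.0.0.2": False, "10.0.0.3": False, "10.0.0.4": True}
```

With a threshold of 2, `data_node_failed(jitter_on_one_ip, 2)` is false because only one IP is affected, while `data_node_failed(node_really_down, 2)` is true.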

Further, among all application nodes that access the distributed database, any one application node is selected as an arbitration node to count the number of application nodes that cannot connect to the data node.

Further:

A decision value is set in the arbitration node and initialized to 0;

After the other application nodes send connection requests to the data node, they send information on whether they can connect to the data node to the arbitration node;

The arbitration node receives the information sent by all application nodes on whether they can connect to the data node, and each time it receives a message from an application node that cannot connect to the data node, it adds 1 to the decision value;

After the arbitration node has received the information from all application nodes on whether they can connect to the data node:

If the decision value reaches the set threshold, the data node is determined to have failed;

If the decision value does not reach the set threshold, the data node is determined to be valid.

Further, the threshold is half the number of all application nodes that access the distributed database.
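The arbitration logic, with the threshold set to half the number of application nodes, might be sketched like this. The class `Arbiter` and its method names are hypothetical, and the sketch assumes that every application node, including the one that sent the broadcast, reports exactly once.

```python
class Arbiter:
    """Hypothetical arbitration node that tallies 'cannot connect' reports."""

    def __init__(self, total_app_nodes: int) -> None:
        self.total_app_nodes = total_app_nodes
        self.threshold = total_app_nodes / 2  # half of all application nodes
        self.decision_value = 0               # initialized to 0
        self.reports_received = 0

    def report(self, can_connect: bool) -> None:
        self.reports_received += 1
        if not can_connect:
            self.decision_value += 1          # add 1 per "cannot connect" message

    def decide(self) -> str:
        # The verdict is only formed once all application nodes have reported.
        if self.reports_received < self.total_app_nodes:
            raise RuntimeError("not all application nodes have reported yet")
        return "failed" if self.decision_value >= self.threshold else "valid"


arbiter = Arbiter(total_app_nodes=5)
for can_connect in [False, False, False, True, True]:
    arbiter.report(can_connect)
verdict = arbiter.decide()
```

Here three of five application nodes cannot connect, which reaches the threshold of 2.5, so the verdict is that the data node has failed.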

Further, after the data node is determined to have failed, the method also includes:

Deleting the data node from the distributed database;

Enabling the backup node of the data node.
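The replacement of a failed data node by its backup could look like the sketch below. It assumes the set of active data nodes and a mapping from each data node to its backup are available in memory; the function name `replace_failed_node` and the node names are hypothetical.

```python
from typing import Dict, Set


def replace_failed_node(active: Set[str], backups: Dict[str, str], failed: str) -> Set[str]:
    """Return the new set of active data nodes after swapping in the backup."""
    if failed not in active:
        raise ValueError(f"{failed} is not an active data node")
    updated = set(active)
    updated.remove(failed)           # delete the failed data node
    updated.add(backups[failed])     # enable its backup node
    return updated


active_nodes = {"data-1", "data-2", "data-3"}
backup_of = {"data-2": "data-2-backup"}  # assumed backup registry
active_nodes = replace_failed_node(active_nodes, backup_of, "data-2")
```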

Further, after the data node is determined to be valid, the method also includes:

Resetting the decision value to its initial value of 0;

The application node that could not connect to the data node periodically sends connection requests to the data node while waiting for the data node to recover its connection.
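The periodic reconnection can be sketched as a simple retry loop. The function name `wait_for_recovery`, the interval and the attempt limit are assumptions; the document does not specify an interval or an upper bound, and the bound is added here only so that the sketch terminates.

```python
import time
from typing import Callable


def wait_for_recovery(try_connect: Callable[[], bool], interval_s: float, max_attempts: int) -> bool:
    """Send connection requests at a fixed interval until one succeeds."""
    for attempt in range(max_attempts):
        if try_connect():
            return True
        if attempt < max_attempts - 1:
            time.sleep(interval_s)   # wait before the next connection request
    return False


# Stand-in for a real connection request: fails twice, then succeeds.
outcomes = iter([False, False, True])
recovered = wait_for_recovery(lambda: next(outcomes), interval_s=0.01, max_attempts=5)
```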

Further, when any application node cannot connect to a data node in the distributed database, that application node's connection to the data node is masked.
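One way to picture masking is a per-application-node list of usable data nodes from which the unreachable one is excluded. The class `DataNodeRouter` and its methods are hypothetical, and the `unmask` method is an added assumption for when the connection recovers.

```python
from typing import Iterable, List, Set


class DataNodeRouter:
    """Hypothetical per-application-node view of which data nodes may be used."""

    def __init__(self, data_nodes: Iterable[str]) -> None:
        self.data_nodes: List[str] = list(data_nodes)
        self.masked: Set[str] = set()

    def mask(self, data_node: str) -> None:
        # Stop routing ordinary requests to a data node that could not be reached.
        self.masked.add(data_node)

    def unmask(self, data_node: str) -> None:
        # Assumed counterpart, used once the data node is reachable again.
        self.masked.discard(data_node)

    def usable_nodes(self) -> List[str]:
        return [n for n in self.data_nodes if n not in self.masked]


router = DataNodeRouter(["data-1", "data-2", "data-3"])
router.mask("data-2")
```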

Further, each application node has a different IP address.

As can be seen from the above scheme, in the method for determining data node failure according to the present invention, after an application node fails to connect to a data node, multiple application nodes send connection requests to that data node to determine whether they can connect to it, and thereby whether the data node has failed. Because each application node has a different IP address, the method avoids the effect that network fluctuation on a single IP has in the prior art, where synchronization requests are sent to the data node from that IP alone. Compared with the prior art, the present invention judges more accurately whether a data node has failed temporarily because of the network or permanently because of hardware.

Brief description of the drawings

Fig. 1 is a flowchart of the method for determining data node failure according to the present invention;

Fig. 2 is a flowchart of an embodiment of the present invention.

Detailed description

To make the purpose, technical scheme and advantages of the present invention clearer, the present invention is described in further detail below with reference to the drawings and embodiments.

The method for determining data node failure according to the present invention is used with a distributed database. As shown in Fig. 1, the method includes:

The number of application nodes that cannot connect to the data node is counted in an arbitration node. The arbitration node is selected as follows: among all application nodes that access the distributed database, one arbitrarily selected application node serves as the arbitration node.

The arbitration node counts the application nodes that cannot connect to the data node by the following method:

Unlike the prior art, in the method of the present invention, after an application node fails to connect to a data node, multiple application nodes send connection requests to that data node to determine whether they can connect to it, and thereby whether the data node has failed. Because each application node has a different IP address, the method avoids the effect that network fluctuation on a single IP has in the prior art, where synchronization requests are sent to the data node from that IP alone. It therefore judges more accurately than the prior art whether a data node has failed temporarily because of the network or permanently because of hardware.

In the above method of the present invention, after the data node is determined to have failed, the method also includes:

Deleting the data node from the distributed database;

Enabling the backup node of the data node.

The failed data node is thereby replaced.

After the data node is determined to be valid, the method of the present invention also includes:

Resetting the decision value to its initial value of 0;

In real network deployments, a large number of application nodes access the distributed database, each with a different IP address, and the distributed database contains a large number of data nodes. The method of the present invention is illustrated below with a specific embodiment. In this embodiment, assume that N application nodes access the distributed database (N > 1), that the distributed database has M data nodes (M > 1), and that among the N application nodes, application node i (1 ≤ i ≤ N) cannot connect to data node j in the distributed database (data node j is any one of the M data nodes). As shown in Fig. 2, the embodiment comprises the following steps:

Step 1, one application node is arbitrarily selected from the N application nodes as the arbitration node, a decision value is set in the arbitration node and initialized to 0, and a threshold is set to N/2; then go to step 2.

Step 2, when application node i cannot connect to data node j in the distributed database, it broadcasts to the other application nodes that it cannot connect to data node j; then go to step 3.

When any one of the application nodes cannot connect to a data node in the distributed database, the method may further include masking that application node's connection to the data node. In step 2, for example, when application node i cannot connect to data node j, application node i masks its connection to data node j. This avoids the network resource overhead that would result if application node i kept initiating connections to data node j without being able to connect.

Step 3, after receiving the broadcast that data node j cannot be connected, the other application nodes send connection requests to data node j; then go to step 4.

Step 4, the other application nodes send information on whether they can connect to data node j to the arbitration node; then go to step 5.

Step 5, the arbitration node receives the information sent by all application nodes on whether they can connect to data node j, and each time it receives a message from an application node that cannot connect to data node j, it adds 1 to the decision value; then go to step 6.

Step 6, the arbitration node judges whether the accumulated decision value reaches the set threshold N/2. If it does, data node j is determined to have failed; go to step 7. If it does not, data node j is determined to be valid; go to step 9.

Step 7, data node j is deleted from the distributed database; then go to step 8.

Step 8, backup node j' of data node j is enabled to replace data node j.

Step 9, the arbitration node resets the decision value to its initial value of 0 and notifies application node i that data node j is valid; then go to step 10.

Step 10, after receiving the message from the arbitration node that data node j is valid, application node i periodically sends connection requests to data node j while waiting for data node j to recover its connection.
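The two branches of the embodiment can be traced with a small simulation. This is a sketch under stated assumptions rather than the patented implementation: the function name `run_embodiment` is hypothetical, connection outcomes are supplied as a dictionary instead of coming from real network requests, and application node i's own failed attempt is counted by the arbitration node.

```python
from typing import Dict, List


def run_embodiment(can_connect: Dict[int, bool], i: int) -> List[str]:
    """Trace the steps for N application nodes, keyed 1..N, and one data node j."""
    n = len(can_connect)
    threshold = n / 2                      # step 1: threshold N/2
    decision_value = 0                     # step 1: decision value starts at 0
    trace = [f"step 2: node {i} broadcasts that it cannot connect to j"]
    for ok in can_connect.values():
        # steps 3 to 5: each node probes j and reports; the arbiter tallies failures
        if not ok:
            decision_value += 1
    if decision_value >= threshold:        # step 6
        trace.append("step 7: delete j from the distributed database")
        trace.append("step 8: enable backup node j'")
    else:
        decision_value = 0                 # step 9: reset for the next round
        trace.append(f"step 9: reset decision value, tell node {i} that j is valid")
        trace.append(f"step 10: node {i} retries j periodically")
    return trace


# N = 5. Only node 1 (node i) is cut off by network fluctuation.
jitter = run_embodiment({1: False, 2: True, 3: True, 4: True, 5: True}, i=1)
# N = 5. Data node j has a hardware fault, so no application node can connect.
hardware = run_embodiment({k: False for k in range(1, 6)}, i=1)
```

In the first scenario the tally is 1, below the threshold of 2.5, so the trace ends with steps 9 and 10; in the second the tally is 5, so it ends with steps 7 and 8.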

With the method for determining data node failure according to the present invention, after an application node fails to connect to a data node, multiple application nodes send connection requests to that data node to determine whether they can connect to it, and thereby whether the data node has failed. Because each application node has a different IP address, the method avoids the effect that network fluctuation on a single IP has in the prior art, where synchronization requests are sent to the data node from that IP alone. Compared with the prior art, the present invention judges more accurately whether a data node has failed temporarily because of the network or permanently because of hardware.

The foregoing describes only preferred embodiments of the present invention and is not intended to limit it. Any modification, equivalent substitution or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A method for determining data node failure, for use with a distributed database, the method comprising:

among all application nodes that access the distributed database, when any application node cannot connect to a data node in the distributed database, broadcasting to the other application nodes that it cannot connect to the data node;

after receiving the broadcast, the other application nodes sending connection requests to the data node to determine whether they can connect to the data node;

when the number of application nodes that cannot connect to the data node reaches a set threshold, determining that the data node has failed.

2. The method for determining data node failure according to claim 1, characterized in that:

among all application nodes that access the distributed database, any one application node is selected as an arbitration node to count the number of application nodes that cannot connect to the data node.

3. The method for determining data node failure according to claim 2, characterized in that:

after the other application nodes send connection requests to the data node, they send information on whether they can connect to the data node to the arbitration node;

the arbitration node receives the information sent by all application nodes on whether they can connect to the data node, and each time the arbitration node receives a message from an application node that cannot connect to the data node, it adds 1 to the decision value;

4. The method for determining data node failure according to claim 1, characterized in that the threshold is half the number of all application nodes that access the distributed database.

5. The method for determining data node failure according to claim 1, characterized in that, after the data node is determined to have failed, the method further comprises:

deleting the data node from the distributed database;

enabling the backup node of the data node.

6. The method for determining data node failure according to claim 3, characterized in that, after the data node is determined to be valid, the method further comprises:

resetting the decision value to its initial value of 0;

the application node that could not connect to the data node periodically sending connection requests to the data node while waiting for the data node to recover its connection.

7. The method for determining data node failure according to claim 1, characterized in that, when any application node cannot connect to a data node in the distributed database, that application node's connection to the data node is masked.

8. The method for determining data node failure according to claim 1, characterized in that each application node has a different IP address.