CN103995901B - Method for determining data node failure - Google Patents

Publication number
CN103995901B
Authority
CN
China
Prior art keywords
data node
node
application node
application
failure
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410254980.8A
Other languages
Chinese (zh)
Other versions
CN103995901A (en)
Inventor
赵晓平
唐超
马丽伟
秦波
王�锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2014-06-10
Filing date
2014-06-10
Publication date
2018-01-12
Application filed by Beijing Jingdong Century Trading Co Ltd and Beijing Jingdong Shangke Information Technology Co Ltd
Priority to CN201410254980.8A
Publication of CN103995901A
Application granted
Publication of CN103995901B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27: Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer And Data Communications (AREA)

Abstract

The invention discloses a method for determining data node failure in a distributed database. The method includes: among all application nodes that access the distributed database, when any application node cannot connect to a data node in the distributed database, that application node broadcasts to the other application nodes that it cannot connect to the data node; after receiving the broadcast, each of the other application nodes sends a connection request to the data node to determine whether it can connect to the data node; and when the number of application nodes that cannot connect to the data node reaches a set threshold, the data node is determined to have failed. Because each application node belongs to a different IP, the method avoids the effect that network fluctuation has on a single IP when synchronization requests are sent to the data node from that one IP alone, and can therefore judge the cause of a data node failure more accurately.

Description

Method for determining data node failure
Technical field
The present invention relates to the field of distributed databases, and more particularly to a method for determining data node failure.
Background
With the continuous development of network technology, the requirements for data storage and access keep rising, and distributed databases have emerged in response. Their high scalability and high availability address the needs of many websites that must run without interruption.
A distributed database is made up of sub-databases distributed across multiple computer nodes. Each sub-database on a computer node is called a data node; the data nodes are logically related and have equal status. To keep the whole distributed database running normally, the running state of each data node must be known promptly so as to determine whether it can provide service normally, that is, whether the data node is valid. Causes such as network fluctuation and hardware faults can all lead to data node failure: for example, network fluctuation can cause a data node to fail temporarily, whereas a hardware fault can cause it to fail permanently. An effective means is therefore needed to determine whether a given data node has failed.
Cassandra is an open-source distributed NoSQL database system. Because of its good scalability, it has been adopted by many well-known websites and has become a popular distributed structured data storage solution. In Cassandra, node failure is judged using suspicion-based detection (Accrual Failure Detection). The basic idea is that, in a distributed environment, a value representing the suspicion of failure is used to judge whether a data node has failed. Within a given time window, synchronization requests are sent to the data node continuously; each time the data node fails to respond to a synchronization message, the failure suspicion value of that data node is increased by 1, and once the value reaches a set threshold the data node is determined to have failed permanently.
Because the suspicion-based detection method sends synchronization requests to the data node from a single IP, it cannot adequately avoid the effect of network fluctuation on those requests. Network fluctuation may cause synchronization requests, and/or the data node's responses to them, to be lost for a period of time. During that period the failure suspicion value of the data node may rise sharply and even reach the set threshold, so that the data node is judged to have failed permanently, although after that period the data node is still available and has not actually failed permanently. The existing suspicion-based detection method can therefore misjudge data node failure in use.
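The misjudgment described above can be reproduced with a small sketch. It models the detector only as this background section describes it (a counter that increases by 1 for each unanswered synchronization request), not Cassandra's actual implementation. The class name `SuspicionCounter`, the threshold of 3, and the response sequence are hypothetical values chosen for illustration.

```python
class SuspicionCounter:
    """Counter-style failure detector as described in the text (illustrative only)."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.suspicion = 0

    def record(self, responded):
        """Record one synchronization request; add 1 when no response arrived."""
        if not responded:
            self.suspicion += 1
        return self.is_failed()

    def is_failed(self):
        return self.suspicion >= self.threshold


# A single observer loses three responses during a network fluctuation,
# then receives normal responses again once the network recovers.
responses = [True, False, False, False, True, True]
detector = SuspicionCounter(threshold=3)
verdicts = [detector.record(r) for r in responses]
```

In this run the node is marked as failed at the fourth request and stays marked even though it answers the fifth and sixth, which is the kind of false verdict a single observing IP can produce.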
Summary of the invention
In view of this, the present invention provides a method for determining data node failure, so as to judge accurately whether a data node has failed temporarily because of the network or permanently because of hardware.
The technical solution of the present application is implemented as follows:
A method for determining data node failure, for use with a distributed database, the method including:
Among all application nodes that access the distributed database, when any application node cannot connect to a data node in the distributed database, sending the other application nodes a broadcast indicating that the data node cannot be connected;
After the other application nodes receive the broadcast, sending connection requests to the data node to determine whether the data node can be connected;
When the number of application nodes that cannot connect to the data node reaches a set threshold, determining that the data node has failed.
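The three steps above can be sketched as a simple count over probe results. This is a minimal illustration, not the patented implementation: the function names, the node names, and the `reachable` table are hypothetical, and the sketch assumes that the failed attempt of the application node that sent the broadcast is counted together with the others.

```python
def count_unreachable(app_nodes, data_node, can_connect):
    """Return how many application nodes cannot connect to data_node."""
    return sum(1 for app in app_nodes if not can_connect(app, data_node))


def data_node_failed(app_nodes, data_node, can_connect, threshold):
    """Declare failure when the unreachable count reaches the threshold."""
    return count_unreachable(app_nodes, data_node, can_connect) >= threshold


# Hypothetical probe results: which application node reaches which data node.
reachable = {
    ("app1", "db1"): False,  # app1 noticed the problem and sent the broadcast
    ("app2", "db1"): True,
    ("app3", "db1"): True,
    ("app4", "db1"): False,
}
apps = ["app1", "app2", "app3", "app4"]


def probe(app, node):
    """Stand-in for a real connection attempt from app to node."""
    return reachable[(app, node)]


failed = data_node_failed(apps, "db1", probe, threshold=3)
```

With two of four application nodes unable to connect, a threshold of 3 leaves `db1` classified as valid, while a threshold of 2 would classify it as failed.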
Further, among all application nodes that access the distributed database, any one application node is selected as an arbitration node to count the number of application nodes that cannot connect to the data node.
Further:
A decision value is set in the arbitration node and initialized to 0;
After the other application nodes send connection requests to the data node, they send the arbitration node information on whether they can connect to the data node;
The arbitration node receives the information sent by all application nodes on whether they can connect to the data node, and each time it receives a message from an application node that cannot connect to the data node, it adds 1 to the decision value;
After the arbitration node has received the information sent by all application nodes on whether they can connect to the data node:
If the decision value reaches the set threshold, the data node is determined to have failed;
If the decision value does not reach the set threshold, the data node is determined to be valid.
Further, the threshold is half the number of all application nodes that access the distributed database.
Further, after the data node is determined to have failed, the method also includes:
Deleting the data node from the distributed database;
Enabling the backup node of the data node.
Further, after the data node is determined to be valid, the method also includes:
Resetting the decision value to its initial value of 0;
The application node that cannot connect to the data node periodically sends connection requests to the data node while waiting for the data node to recover the connection.
Further, when any application node cannot connect to a data node in the distributed database, that application node's connection to the data node is masked.
Further, each application node belongs to a different IP.
As can be seen from the above scheme, in the method of the present invention for determining data node failure, after an application node fails to connect to a data node, multiple application nodes send connection requests to that data node to determine whether it can be connected, and thereby determine whether it has failed. Because each application node belongs to a different IP, this avoids the effect that network fluctuation has on a single IP when, as in the prior art, synchronization requests are sent to the data node from one IP. Compared with the prior art, the present invention judges more accurately whether a data node has failed temporarily because of the network or permanently because of hardware.
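One way to picture masking is a per-application-node set of data nodes that are skipped when choosing where to connect. The sketch below is an assumption about how this could look; the class `AppNodeConnections` and its methods are hypothetical, and `unmask` is added only to show how the connection could be restored once the data node recovers.

```python
class AppNodeConnections:
    """Tracks which data nodes one application node is willing to use."""

    def __init__(self, data_nodes):
        self.data_nodes = list(data_nodes)
        self.masked = set()

    def mask(self, data_node):
        """Stop routing requests to a data node that could not be connected."""
        self.masked.add(data_node)

    def unmask(self, data_node):
        """Resume using a data node after its connection recovers."""
        self.masked.discard(data_node)

    def usable_nodes(self):
        """Data nodes that are not masked, in their original order."""
        return [n for n in self.data_nodes if n not in self.masked]


conns = AppNodeConnections(["db1", "db2", "db3"])
conns.mask("db2")
usable = conns.usable_nodes()
```

Here the application node keeps serving requests through `db1` and `db3` while `db2` is masked.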
Brief description of the drawings
Fig. 1 is a flow chart of the method of the present invention for determining data node failure;
Fig. 2 is a flow chart of an embodiment of the present invention.
Detailed description of the embodiments
To make the purpose, technical solution, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and an embodiment.
The method of the present invention for determining data node failure is used with a distributed database. As shown in Fig. 1, the method includes:
Among all application nodes that access the distributed database, when any application node cannot connect to a data node in the distributed database, sending the other application nodes a broadcast indicating that the data node cannot be connected;
After the other application nodes receive the broadcast, sending connection requests to the data node to determine whether the data node can be connected;
When the number of application nodes that cannot connect to the data node reaches a set threshold, determining that the data node has failed.
The number of application nodes that cannot connect to the data node is counted in an arbitration node. The arbitration node is selected as follows: among all application nodes that access the distributed database, one arbitrarily selected application node serves as the arbitration node.
The arbitration node counts the application nodes that cannot connect to the data node as follows:
A decision value is set in the arbitration node and initialized to 0;
After the other application nodes send connection requests to the data node, they send the arbitration node information on whether they can connect to the data node;
The arbitration node receives the information sent by all application nodes on whether they can connect to the data node, and each time it receives a message from an application node that cannot connect to the data node, it adds 1 to the decision value;
After the arbitration node has received the information sent by all application nodes on whether they can connect to the data node:
If the decision value reaches the set threshold, the data node is determined to have failed;
If the decision value does not reach the set threshold, the data node is determined to be valid.
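The counting procedure above can be sketched as a small arbitration object. The class name `Arbiter`, its methods, and the node identifiers are hypothetical; ignoring duplicate or unknown reports is an added assumption that the text does not address.

```python
class Arbiter:
    """Collects connectivity reports and judges one data node (sketch)."""

    def __init__(self, app_node_ids, threshold):
        self.expected = set(app_node_ids)
        self.threshold = threshold
        self.decision_value = 0  # initialized to 0, as in the text
        self.reported = set()

    def report(self, app_node_id, connected):
        """Record one report; add 1 to the decision value if it failed."""
        if app_node_id not in self.expected or app_node_id in self.reported:
            return  # assumption: ignore unknown or duplicate reports
        self.reported.add(app_node_id)
        if not connected:
            self.decision_value += 1

    def verdict(self):
        """Return 'failed', 'valid', or None while reports are still missing."""
        if self.reported != self.expected:
            return None
        return "failed" if self.decision_value >= self.threshold else "valid"


arbiter = Arbiter(["a", "b", "c", "d"], threshold=2)
arbiter.report("a", False)
partial = arbiter.verdict()  # not all reports received yet
for node_id, connected in [("b", True), ("c", False), ("d", True)]:
    arbiter.report(node_id, connected)
final = arbiter.verdict()
```

The verdict is withheld until every application node has reported, which mirrors the requirement that the comparison with the threshold happens after all reports are received.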
Unlike the prior art, in the method of the present invention, after an application node fails to connect to a data node, multiple application nodes send connection requests to that data node to determine whether it can be connected, and thereby determine whether it has failed. Because each application node belongs to a different IP, this avoids the effect that network fluctuation has on a single IP when, as in the prior art, synchronization requests are sent to the data node from one IP, so the method judges more accurately than the prior art whether a data node has failed temporarily because of the network or permanently because of hardware.
In the above method, after the data node is determined to have failed, the method also includes:
Deleting the data node from the distributed database;
Enabling the backup node of the data node.
The failed data node is thereby replaced.
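The replacement of a failed data node can be sketched as a membership update. The function `replace_failed_node`, the list of active nodes, and the backup mapping are hypothetical; the text does not describe how backups are configured or how data is moved to them.

```python
def replace_failed_node(active_nodes, backups, failed_node):
    """Return a new active-node list with failed_node swapped for its backup."""
    if failed_node not in active_nodes:
        raise ValueError(f"{failed_node} is not an active data node")
    backup = backups[failed_node]  # raises KeyError if no backup is configured
    remaining = [n for n in active_nodes if n != failed_node]
    return remaining + [backup]


active = ["db1", "db2", "db3"]
backups = {"db1": "db1-backup", "db2": "db2-backup", "db3": "db3-backup"}
active_after = replace_failed_node(active, backups, "db2")
```

The original list is left untouched, so a caller could still compare membership before and after the replacement.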
After the data node is determined to be valid, the method of the present invention also includes:
Resetting the decision value to its initial value of 0;
The application node that cannot connect to the data node periodically sends connection requests to the data node while waiting for the data node to recover the connection.
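The periodic reconnection can be sketched as a fixed-interval retry loop. The function `wait_for_recovery`, the interval, and the `max_attempts` bound are assumptions for illustration; the text only says that connection requests are sent at regular intervals until the data node recovers. The `sleep` parameter is injectable so the example runs without actually waiting.

```python
import time


def wait_for_recovery(try_connect, interval_seconds, max_attempts, sleep=time.sleep):
    """Retry at a fixed interval; return the attempt that succeeded, or None."""
    for attempt in range(1, max_attempts + 1):
        if try_connect():
            return attempt
        if attempt < max_attempts:
            sleep(interval_seconds)
    return None


# Simulated data node that becomes reachable again on the third attempt.
outcomes = iter([False, False, True])
waits = []
recovered_on = wait_for_recovery(
    lambda: next(outcomes),
    interval_seconds=5,
    max_attempts=10,
    sleep=waits.append,  # record the waits instead of sleeping
)
```

In a real deployment `try_connect` would wrap an actual connection attempt to the data node, and the default `time.sleep` would be used.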
In real network applications, the number of application nodes accessing a distributed database is large, each application node has a different IP address, and the distributed database contains a large number of data nodes. The method of the present invention is illustrated below with a specific embodiment. In this embodiment, assume that N application nodes access the distributed database (N > 1), that the distributed database contains M data nodes (M > 1), and that application node i (1 ≤ i ≤ N) among the N application nodes cannot connect to data node j in the distributed database (data node j being any one of the M data nodes). As shown in Fig. 2, the embodiment includes the following steps:
Step 1: arbitrarily select one application node from the N application nodes as the arbitration node, set a decision value in the arbitration node and initialize it to 0, set a threshold of N/2, and then go to step 2.
Step 2: when application node i cannot connect to data node j in the distributed database, it broadcasts to the other application nodes that it cannot connect to data node j, and then go to step 3.
When any application node cannot connect to a data node in the distributed database, the method may further include masking that application node's connection to the data node. For example, in step 2, when application node i cannot connect to data node j, application node i masks its connection to data node j. This avoids the network resource overhead of application node i repeatedly initiating connections to data node j that cannot succeed.
Step 3: after receiving the broadcast that data node j cannot be connected, the other application nodes send connection requests to data node j, and then go to step 4.
Step 4: the other application nodes send the arbitration node information on whether they can connect to data node j, and then go to step 5.
Step 5: the arbitration node receives the information sent by all application nodes on whether they can connect to data node j, and each time it receives a message from an application node that cannot connect to data node j, it adds 1 to the decision value, and then go to step 6.
Step 6: the arbitration node judges whether the accumulated decision value has reached the set threshold N/2. If it has, data node j is determined to have failed, and the method goes to step 7; if it has not, data node j is determined to be valid, and the method goes to step 9.
Step 7: delete data node j from the distributed database, and then go to step 8.
Step 8: enable the backup node j’ of data node j to replace data node j.
Step 9: the arbitration node resets the decision value to its initial value of 0 and notifies application node i that data node j is valid, and then go to step 10.
Step 10: after receiving the message from the arbitration node that data node j is valid, application node i periodically sends connection requests to data node j while waiting for data node j to recover the connection.
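Steps 1 to 6 of the embodiment can be tied together in a single simulation. This is a sketch under stated assumptions: `run_detection`, the node names, and the two scenarios are hypothetical, the first reporter's own failed attempt is counted, and the follow-up actions (steps 7 to 10) are left to the caller.

```python
import random


def run_detection(app_nodes, first_reporter, can_connect, rng=random):
    """Simulate steps 1 to 6 for one data node and return the outcome."""
    arbiter = rng.choice(app_nodes)   # step 1: arbitrary arbitration node
    decision_value = 0                # step 1: decision value starts at 0
    threshold = len(app_nodes) / 2    # step 1: threshold N/2

    # Step 2: first_reporter could not connect and broadcasts that fact.
    reports = {first_reporter: False}
    # Steps 3 and 4: every other application node probes and reports.
    for app in app_nodes:
        if app != first_reporter:
            reports[app] = can_connect(app)

    # Step 5: add 1 for each "cannot connect" report.
    for connected in reports.values():
        if not connected:
            decision_value += 1

    # Step 6: compare with the threshold once all reports are in.
    verdict = "failed" if decision_value >= threshold else "valid"
    return {"arbiter": arbiter, "verdict": verdict, "unreachable": decision_value}


apps = ["app1", "app2", "app3", "app4"]
# Scenario A: only app1's network path is disturbed; the data node is healthy.
fluctuation = run_detection(apps, "app1", lambda app: True)
# Scenario B: the data node has a hardware fault; nobody can connect.
hardware_fault = run_detection(apps, "app1", lambda app: False)
```

Scenario A yields a "valid" verdict, after which the caller would follow steps 9 and 10; scenario B yields "failed", after which the caller would follow steps 7 and 8.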
With the method of the present invention for determining data node failure, after an application node fails to connect to a data node, multiple application nodes send connection requests to that data node to determine whether it can be connected, and thereby determine whether it has failed. Because each application node belongs to a different IP, this avoids the effect that network fluctuation has on a single IP when, as in the prior art, synchronization requests are sent to the data node from one IP. Compared with the prior art, the present invention judges more accurately whether a data node has failed temporarily because of the network or permanently because of hardware.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit it. Any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (8)

1. A method for determining data node failure, for use with a distributed database, the method including:
Among all application nodes that access the distributed database, when any application node cannot connect to a data node in the distributed database, sending the other application nodes a broadcast indicating that the data node cannot be connected;
After the other application nodes receive the broadcast, sending connection requests to the data node to determine whether the data node can be connected;
When the number of application nodes that cannot connect to the data node reaches a set threshold, determining that the data node has failed.
2. The method for determining data node failure according to claim 1, characterized in that:
Among all application nodes that access the distributed database, any one application node is selected as an arbitration node to count the number of application nodes that cannot connect to the data node.
3. The method for determining data node failure according to claim 2, characterized in that:
A decision value is set in the arbitration node and initialized to 0;
After the other application nodes send connection requests to the data node, they send the arbitration node information on whether they can connect to the data node;
The arbitration node receives the information sent by all application nodes on whether they can connect to the data node, and each time it receives a message from an application node that cannot connect to the data node, it adds 1 to the decision value;
After the arbitration node has received the information sent by all application nodes on whether they can connect to the data node:
If the decision value reaches the set threshold, the data node is determined to have failed;
If the decision value does not reach the set threshold, the data node is determined to be valid.
4. The method for determining data node failure according to claim 1, characterized in that the threshold is half the number of all application nodes that access the distributed database.
5. The method for determining data node failure according to claim 1, characterized in that, after the data node is determined to have failed, the method also includes:
Deleting the data node from the distributed database;
Enabling the backup node of the data node.
6. The method for determining data node failure according to claim 3, characterized in that, after the data node is determined to be valid, the method also includes:
Resetting the decision value to its initial value of 0;
The application node that cannot connect to the data node periodically sends connection requests to the data node while waiting for the data node to recover the connection.
7. The method for determining data node failure according to claim 1, characterized in that, when any application node cannot connect to a data node in the distributed database, that application node's connection to the data node is masked.
8. The method for determining data node failure according to claim 1, characterized in that each application node belongs to a different IP.
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410254980.8A 2014-06-10 2014-06-10 Method for determining data node failure (CN103995901B, Active)

Publications (2)

Publication Number Publication Date
CN103995901A CN103995901A (en) 2014-08-20
CN103995901B true CN103995901B (en) 2018-01-12

Family

ID=51310066

Country Status (1)

Country Link
CN (1) CN103995901B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105306545B (en) * 2015-09-28 2018-09-07 浪潮(北京)电子信息产业有限公司 A kind of method and system of the external service node Takeover of cluster
CN105975212A (en) * 2016-04-29 2016-09-28 深圳市永兴元科技有限公司 Failure detection processing method and device for distributed data system
CN108616566B (en) * 2018-03-14 2021-02-23 华为技术有限公司 Main selection method of raft distributed system, related equipment and system
CN112860799A (en) * 2021-02-22 2021-05-28 浪潮云信息技术股份公司 Management method for data synchronization of distributed database
CN113783735A (en) * 2021-09-24 2021-12-10 小红书科技有限公司 Method, device, equipment and medium for identifying fault node in Redis cluster

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102231681A (en) * 2011-06-27 2011-11-02 中国建设银行股份有限公司 High availability cluster computer system and fault treatment method thereof

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120101987A1 (en) * 2010-10-25 2012-04-26 Paul Allen Bottorff Distributed database synchronization
US10103949B2 (en) * 2012-03-15 2018-10-16 Microsoft Technology Licensing, Llc Count tracking in distributed environments
US9239749B2 (en) * 2012-05-04 2016-01-19 Paraccel Llc Network fault detection and reconfiguration
CN102882792B (en) * 2012-06-20 2015-05-13 杜小勇 Method for simplifying internet propagation path diagram

Also Published As

Publication number Publication date
CN103995901A (en) 2014-08-20

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant