CN107547252A - Network failure processing method and device - Google Patents


Info

Publication number: CN107547252A
Authority: CN (China)
Prior art keywords: osd, router, network, information, place
Legal status: Granted (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Application number: CN201710515775.6A
Other languages: Chinese (zh)
Other versions: CN107547252B (en)
Inventors: 马春燕, 陈杰
Current Assignee: New H3C Technologies Co Ltd (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Original Assignee: New H3C Technologies Co Ltd
Application filed by: New H3C Technologies Co Ltd
Priority to: CN201710515775.6A
Publication of CN107547252A
Application granted
Publication of CN107547252B
Status: Active

Landscapes

  • Computer And Data Communications (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a network failure processing method and device. A distributed storage system includes a service network and a storage network that are deployed separately. A cluster monitor node (MON) and multiple object storage devices (OSDs) are connected to the service network for communication, and the OSDs are also connected to the storage network for communication. The method is applied to an OSD and includes: detecting the storage network link between the OSD and the OSD paired with it, to obtain link state information for this OSD's side; and, if the link state on this OSD's side is detected to be abnormal, actively terminating the OSD's daemon process, to prevent the OSD from sending the MON, over the service network, information that its paired OSD has failed. Because an OSD in the storage system terminates its own daemon process when it detects that the link state on its own side is abnormal, it cannot falsely report other OSDs as failed, which solves the problem of OSD flapping in the storage network.

Description

Network failure processing method and device
Technical field
The present invention relates to the field of communication technology, and in particular to a network failure processing method and device.
Background
Ceph is an open-source distributed storage system. It has become one of the most widely used storage systems and is currently one of the most popular open-source storage projects. Ceph offers high performance, high reliability, and high scalability, and comprises object storage devices (Object Storage Device, OSD) and cluster monitor nodes (Monitor, MON). An OSD provides storage resources: it can serve storage when its state is normal (up state) and cannot be read or written normally when its state is abnormal (down state). Each OSD has its own daemon process (OSD daemon), which carries out all of the OSD's logic functions, including communicating with the MON and with other OSDs to maintain and update system state. The MON receives the state reports sent by OSDs and updates and distributes the OSD state information (OSDMap) in order to maintain the global state of the whole Ceph cluster.
When a Ceph cluster is used in a production environment, however, network connectivity is one of the most important factors affecting its operation. The network structure recommended for Ceph clusters deploys the service network (public network) and the storage network (cluster network) separately. The service network mainly carries user data, together with the state information and heartbeat traffic exchanged between OSD and MON, between MON and MON, and between OSDs. The storage network mainly carries heartbeat traffic between OSDs and cluster-internal data such as recovery, replication, and scrubbing. In robustness and reliability testing, a transient network interruption or another network problem can trigger OSD flapping in the storage network: some or all OSDs are repeatedly set to the up or down state, which interrupts service.
Summary of the invention
The application provides a network failure processing method and device, to solve the problem of OSD flapping in the Ceph storage network caused by transient network interruptions or other network problems.
According to one aspect of the application, a network failure processing method is provided. A distributed storage system includes a service network and a storage network that are deployed separately; a cluster monitor node (MON) and multiple object storage devices (OSDs) are connected to the service network for communication, and the OSDs are also connected to the storage network for communication. The method is applied to an OSD and includes:
detecting the storage network link between the OSD and the OSD paired with it, to obtain link state information for this OSD's side;
if the link state on this OSD's side is detected to be abnormal, actively terminating the OSD's daemon process, to prevent the OSD from sending the MON, over the service network, information that its paired OSD has failed.
According to another aspect of the application, a network failure processing device is provided. A distributed storage system includes a service network and a storage network that are deployed separately; a cluster monitor node (MON) and multiple object storage devices (OSDs) are connected to the service network for communication, and the OSDs are also connected to the storage network for communication. The device is applied to an OSD and includes:
a link detection unit, configured to detect the storage network link between the OSD on which the device resides (the host OSD) and the OSD paired with the host OSD, to obtain link state information for the host OSD's side;
a processing unit, configured to actively terminate the host OSD's daemon process when the link state on the host OSD's side is detected to be abnormal, to prevent the host OSD from sending the MON, over the service network, information that its paired OSD has failed.
The beneficial effect of the application is as follows. The network failure processing method of the application is applied in the OSDs of the Ceph storage network. By detecting the storage network link between an OSD and the OSD paired with it, the method obtains link state information for this OSD's side. When the link state on this OSD's side is detected to be abnormal, the OSD's daemon process is actively terminated, so that the OSD's state is no longer recovered. This prevents the OSD from falsely reporting its paired OSD as failed and solves the problem of OSD flapping in the storage network.
Brief description of the drawings
Fig. 1 is a schematic diagram of the Ceph network architecture and communication;
Fig. 2 is a schematic diagram of the communication process between peer OSDs;
Fig. 3 is a schematic flowchart of the network failure processing method of one embodiment of the application;
Fig. 4 is a schematic diagram of a network path on which peer OSDs are connected through multiple levels of routers;
Fig. 5 is a schematic structural diagram of the network failure processing device of one embodiment of the application;
Fig. 6 is a schematic structural diagram of the network failure processing device of another embodiment of the application;
Fig. 7 is a schematic structural diagram of the network failure processing device of yet another embodiment of the application;
Fig. 8 is a schematic structural diagram of OSD hardware according to one embodiment of the application.
Detailed description
Exemplary embodiments are described in detail here, and examples of them are shown in the accompanying drawings. Where the following description refers to the drawings, the same numbers in different drawings denote the same or similar elements unless otherwise indicated. The embodiments described below do not represent all embodiments consistent with the application. Rather, they are only examples of devices and methods consistent with some aspects of the application as detailed in the appended claims.
The terms used in this application are for the purpose of describing specific embodiments only and are not intended to limit the application. The singular forms "a", "said", and "the" used in this application and the appended claims are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and covers any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, and so on may be used in this application to describe various information, the information should not be limited to these terms. The terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of the application, first information may also be called second information, and similarly second information may also be called first information. Depending on the context, the word "if" as used here may be interpreted as "when", "upon", or "in response to determining".
To help those skilled in the art better understand the technical solution of the application, the structure and working principle of Ceph, the open-source distributed storage system that forms the background of the application, are introduced first. As shown in Fig. 1, the service network of a Ceph cluster (Front in the figure) and the storage network (Back in the figure) are deployed separately. The service network mainly carries user data, together with the map information and heartbeat traffic exchanged between OSD and MON, between MON and MON, and between OSDs. The storage network mainly carries heartbeat traffic between OSDs and cluster-internal data such as recovery, replication, and scrubbing.
The OSD state checking process is as follows:
1. The OSD actively reports its own state to the MON
For example, by default an OSD reports once every 900 s, so after an OSD fails, waiting for it to report to the MON by itself takes too long. If the reporting interval is shortened, the load on the MON of processing OSD state reports grows linearly with cluster size. In general only the OSDs that have a problem matter, and most of the time these reports serve little practical purpose, so shortening the interval is not an effective remedy.
2. The OSD detects the state of the OSDs paired with it (its peers)
Heartbeats can be established between OSDs. For example, OSDs that carry the same placement group (Placement Group, abbreviated PG) establish a peer relationship, or an OSD establishes a peer relationship with the OSDs whose IDs immediately precede and follow its own. Based on the state of the heartbeat communication, an OSD reports the state of its peer OSDs.
Specifically, the detection strategy between peer OSDs is as follows:
Each OSD starts a thread that sends a heartbeat message to each peer OSD at a certain interval (0.5 to 5.9 s). If the cluster is configured with both public_network and cluster_network, each heartbeat message is sent on both links at the same time.
As shown in Fig. 2, OSD.a sends its peer OSD, OSD.b, a MOSDPing::PING heartbeat message (carrying OSD.a's send timestamp and osdmap version number). Under normal circumstances OSD.b replies to OSD.a with a MOSDPing::PING_REPLY message (the timestamp carried in the reply is still the one OSD.a sent, and the reply also carries OSD.b's own osdmap version). When OSD.a receives OSD.b's reply, it records OSD.b's heartbeat information.
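The PING and PING_REPLY exchange described above can be sketched in a few lines of Python. This is an illustrative model only, not Ceph's actual message classes: the names `Ping`, `PingReply`, `make_reply`, and `record_reply` are hypothetical, and only the fields named in the text (sender timestamp and osdmap version) are modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ping:
    sender: str          # e.g. "osd.a" (hypothetical naming)
    stamp: float         # sender's send timestamp
    osdmap_version: int  # sender's osdmap version number


@dataclass(frozen=True)
class PingReply:
    sender: str          # the replying OSD, e.g. "osd.b"
    stamp: float         # echoed unchanged from the Ping
    osdmap_version: int  # the replier's own osdmap version


def make_reply(ping, replier, replier_osdmap_version):
    """Build the reply: keep the original timestamp, attach the replier's map version."""
    return PingReply(sender=replier, stamp=ping.stamp,
                     osdmap_version=replier_osdmap_version)


def record_reply(last_reply_at, reply, received_at):
    """Record when the peer last answered; return the observed round-trip time."""
    last_reply_at[reply.sender] = received_at
    return received_at - reply.stamp
```

For example, OSD.a would call `record_reply` each time a reply from OSD.b arrives, and the stored time is what a later timeout check compares against.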
If OSD.a does not receive OSD.b's REPLY message within a preset time (20 s by default, configurable), it adds OSD.b to a failure_queue and reports it to the MON. Once an OSD has been reported 3 times by its peer OSDs, the MON updates the osdmap and sets that OSD's state to down.
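The timeout and reporting rule above can be modeled as follows. This is a minimal sketch under stated assumptions, not Ceph's monitor code: `MonitorSketch`, `report_failure`, and `heartbeat_timed_out` are hypothetical names, the 20 s grace period and the threshold of 3 reports are taken from the text, and resetting the counter after marking an OSD down is an assumption of this sketch.

```python
from collections import defaultdict

GRACE_SECONDS = 20.0       # default reply timeout described in the text
REPORTS_TO_MARK_DOWN = 3   # number of peer reports described in the text


def heartbeat_timed_out(last_reply_at, now, grace=GRACE_SECONDS):
    """True if no reply has arrived within the grace period."""
    return now - last_reply_at > grace


class MonitorSketch:
    """Toy stand-in for a MON that counts failure reports per OSD."""

    def __init__(self):
        self.osd_state = {}                      # osd id -> "up" or "down"
        self.failure_reports = defaultdict(int)  # osd id -> reports so far

    def report_failure(self, reporter, target):
        """Record that `reporter` timed out on `target`; return target's state."""
        self.osd_state.setdefault(target, "up")
        self.failure_reports[target] += 1
        if self.failure_reports[target] >= REPORTS_TO_MARK_DOWN:
            self.osd_state[target] = "down"
            self.failure_reports[target] = 0  # assumption: start counting afresh
        return self.osd_state[target]
```

The sketch makes the weakness discussed next easy to see: the MON only counts reports, so it has no way to tell whether the reporter or the reported OSD is the one with the broken link.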
Afterwards, the OSD that was set to down runs a self-check and tries to rebind its network interface card (NIC). If binding succeeds, its state is set to up. If binding fails, the OSD is further set to down and out, which indicates that the OSD has failed completely and no longer carries any PG.
However, this detection strategy has a problem: Ceph can detect a hardware fault in the local NIC, but it cannot detect abnormal conditions on the cluster network links, such as an unplugged network cable or a transient network interruption. Take a network fault at node B in Fig. 1 as an example. Because they cannot receive PING_REPLY messages from their peer OSDs, OSD.a and OSD.b, and likewise OSD.b and OSD.c, report each other as down. When, for example, the MON has received three consecutive reports from OSD.b's peers that OSD.b is down, it updates the OSDMap and sets OSD.b to down. After being set to down, OSD.b runs its self-check and tries to rebind its NIC. If the NIC itself has failed, binding fails and OSD.b's state does not change. But if the cause is an unplugged cable, a transient network interruption, a faulty port at the far end (the switch end of the connection), or a cable that is not connected, OSD.b rebinds its NIC successfully and its state returns to up. Likewise, OSD.a and OSD.c are reported as down by OSD.b, then rebind their NICs and return to up. As a result, multiple OSDs in the cluster go up and down repeatedly, causing prolonged OSD flapping, which inevitably interrupts upper-layer services.
Based on this, the technical idea of the application is as follows. To address the OSD flapping that transient network interruptions or other network problems cause in the prior-art Ceph storage network, a link detection mechanism is added to the OSDs of the Ceph storage network. The mechanism detects the link state on the OSD's own side and obtains link state information. When the link state on the OSD's own side is detected to be abnormal, the OSD actively terminates its own daemon process. The added link detection mechanism identifies the OSD on the side where the fault actually lies and terminates its daemon process, so that this OSD's state is no longer recovered. This prevents it from falsely reporting its paired OSDs as failed and solves the problem of OSD flapping in the storage network.
The implementation of the network failure processing scheme of the application is described in detail below with reference to embodiments.
Fig. 3 shows a schematic flowchart of the network failure processing method of one embodiment of the application. A distributed storage system includes a service network and a storage network that are deployed separately; a cluster monitor node (MON) and multiple object storage devices (OSDs) are connected to the service network for communication, and the OSDs are also connected to the storage network for communication. The method is applied to an OSD and includes the following steps:
Step S110: detect the storage network link between the OSD and the OSD paired with it, to obtain link state information for this OSD's side.
Step S120: if the link state on this OSD's side is detected to be abnormal, actively terminate the OSD's daemon process, to prevent the OSD from sending the MON, over the service network, information that its paired OSD has failed.
Through this self-check, once the OSD detects that its own link state is abnormal, it actively terminates its daemon process and is prevented from recovering its state. This avoids falsely reporting a network fault at the OSD's peers, avoids OSD flapping, and ensures that upper-layer services are delivered normally and that data flows effectively.
In the scheme of the application, link detection includes detection of the network port and detection of the routers on the network path.
For simpler networks, for example where there is only one level of router between OSDs and all of them are connected to the same router, checking the network port state is enough to detect the link state. In this case, detecting the storage network link between the OSD and the OSD paired with it, to obtain link state information for this OSD's side, includes:
detecting the OSD's network port to obtain the OSD's network port state information.
Correspondingly, actively terminating the OSD's daemon process if the link state on this OSD's side is detected to be abnormal, to prevent the OSD from sending the MON, over the service network, information that its paired OSD has failed, includes:
through a self-check of the OSD's network port state, if the OSD's network port state is detected to be abnormal, actively terminating the OSD's daemon process, to prevent the OSD from sending the MON, over the service network, information that its paired OSD has failed.
Specifically, a periodic network check can be added to the OSD: every 6 seconds, before checking for heartbeat timeouts, the OSD first checks whether its network port is normal (abnormal cases include, for example, a NIC driver failure, a network cable that is not plugged in or is damaged, or a cable pulled out at the switch end). When the OSD detects that its own network port is abnormal, that is, the state of its directly connected link is abnormal, it exits its own process. It therefore does not falsely report a heartbeat failure for its peer OSDs, which solves the OSD flapping problem.
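One way to picture the port check that runs before the heartbeat timeout check is the sketch below. It is not the patent's implementation: it assumes a Linux host where the interface state can be read from the sysfs file `/sys/class/net/<iface>/operstate`, and the names `port_is_up`, `check_port_then_heartbeat`, `check_heartbeats`, and `exit_daemon` are hypothetical. The caller would invoke the round once per check interval (6 seconds in the text).

```python
import os


def port_is_up(iface, sysfs_root="/sys/class/net"):
    """Return True if the interface's operstate file reads 'up'."""
    path = os.path.join(sysfs_root, iface, "operstate")
    try:
        with open(path) as f:
            return f.read().strip() == "up"
    except OSError:
        # Missing interface or unreadable file: treat the port as abnormal.
        return False


def check_port_then_heartbeat(iface, check_heartbeats, exit_daemon,
                              sysfs_root="/sys/class/net"):
    """One round of the periodic check: port first, heartbeats second.

    If the local port is abnormal, the daemon exits and the heartbeat
    timeout check (which could report peers as down) is never reached.
    """
    if not port_is_up(iface, sysfs_root):
        exit_daemon("storage-network port %s is abnormal" % iface)
        return False
    check_heartbeats()
    return True
```

Passing `exit_daemon` as a callback keeps the sketch testable; in a real daemon it might log the reason and terminate the process. Note that some interfaces (loopback, for instance) report `unknown` rather than `up`, so a production check would need a more careful policy than this one.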
In addition, large-scale clusters need to use multi-level routing in their networking in order to isolate faults. Heartbeats between OSDs then no longer pass through a single router and may cross several routers. As shown in Fig. 4, OSD1 is connected to center router C through router A, and OSD2 is connected to center router C through router B. That is, there are three levels of routers on the network path between OSD1 and OSD2, and router C is the center router to which OSD1 and OSD2 are both connected. When router A fails, OSD1 cannot receive OSD2's heartbeat response and, after the timeout, reports to the MON that OSD2 is down. On receiving the message, the MON publishes the OSDMap and sets OSD2 to down. OSD2 then finds that its own network port is normal, reports to the MON, has its state updated to up, and sends heartbeat messages to OSD1. Because router A is still faulty, OSD2 cannot receive OSD1's heartbeat response, so after the timeout OSD2 reports to the MON that OSD1 is down, and the MON sets OSD1 to down on receiving the message. The cycle repeats and produces the flapping problem.
The key to the problem is finding the side where the fault actually lies. If the number of router levels is known, and it is known at which router the heartbeat message is blocked, the OSD that has actually failed can be determined and the false-report problem is solved. Based on this, in some embodiments of the application, the method further includes:
storing cluster network topology information on the OSD. The cluster network topology information includes the number of router levels in the cluster network and the position information of the center router. Based on the cluster network topology information, the OSD obtains the position of the center router to which it and its paired OSD are both connected, and thereby determines which router levels are located on its own side.
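The topology lookup described above could be represented as in the following sketch. The text only says that the topology information holds the number of router levels and the position of the center router; the class name `ClusterTopology`, its field names, and the use of 1-based hop numbers counted from the local OSD are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterTopology:
    router_levels: int  # routers on the path between the two paired OSDs
    center_hop: int     # 1-based hop number of the shared center router

    def local_hops(self):
        """Hop numbers of the routers between this OSD and the center router."""
        return list(range(1, self.center_hop))

    def is_local(self, hop):
        """True if the router at `hop` lies on this OSD's side of the center."""
        if not 1 <= hop <= self.router_levels:
            raise ValueError("hop %d is outside 1..%d" % (hop, self.router_levels))
        return hop < self.center_hop
```

For the path in Fig. 4 as seen from OSD1 (router A, then center router C, then router B), the topology would be `ClusterTopology(router_levels=3, center_hop=2)`, so only hop 1 (router A) is on OSD1's side.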
Detecting the storage network link between the OSD and the OSD paired with it, to obtain link state information for this OSD's side, further includes:
after the heartbeat communication between the OSD and its paired OSD becomes abnormal, the OSD sends IP messages hop by hop and receives the messages returned by the routers at each level. If the message returned by the router at some level is not received, that router is judged to have failed.
Correspondingly, actively terminating the OSD's daemon process if the link state on this OSD's side is detected to be abnormal, to prevent the OSD from sending the MON, over the service network, information that its paired OSD has failed, further includes:
based on the cluster network topology information, if the OSD detects that the failed router lies between the OSD and the center router, the OSD actively terminates its daemon process, to prevent it from sending the MON, over the service network, information that its paired OSD has failed. With the newly added cluster network topology information, the OSD on the side where the fault actually lies is identified and removed from the working cluster, which prevents it from falsely reporting other OSDs as down and solves the OSD flapping problem.
The cluster network topology information is created when the network is set up. After the router hierarchy of the cluster network changes, for example when an OSD is newly added to the cluster or an OSD is replaced, the cluster network topology information is updated. For this case, the method further includes:
receiving, over the service network, the updated cluster network topology information sent by the OSD paired with this OSD and storing it, and sending the updated cluster network topology information to the OSD paired with this OSD; and/or receiving, over the service network, the updated cluster network topology information sent by the MON and storing it.
Preferably, in some embodiments of the application, the Internet Control Message Protocol (ICMP) can be used to discover the routers on the network path hop by hop. Specifically, the IP messages are IP messages that follow the ICMP protocol. The time-to-live (Time To Live, abbreviated TTL) field of an IP message specifies the maximum number of devices the message may pass through in transit before it is discarded.
Sending IP messages hop by hop and receiving the messages returned by the routers at each level, as described above, then includes:
setting the initial TTL value of the IP message to 1 and sending it, and, each time a message returned by a router is received, increasing the TTL value of the IP message by 1 and sending it again.
Take OSD1 in Fig. 4 as an example. OSD1 sends an IP message with an initial TTL of 1 to the destination OSD2 (in practice, three 40-byte messages are sent each time, each including the source address, the destination address, and the time at which the message was sent). When router A, the first router on the path, receives the message, it decrements the TTL by 1. The TTL is now 0, so router A discards the message and sends back an ICMP time exceeded message (containing the source address of the IP message, the full contents of the IP message, and the router's IP address). On receiving this message, OSD1 knows that router A is on the path. It then sends another IP message with a TTL of 2 and discovers the second router, center router C. Route tracing continues in this way: each time, the TTL of the outgoing IP message is increased by 1 to discover the next router.
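The TTL-stepping loop described above can be sketched without touching real sockets by injecting the probe as a function. This is an assumption-laden illustration: `trace_until_timeout` and `make_fake_probe` are hypothetical names, and the fake probe simulates a router path in memory instead of sending ICMP traffic (which would normally require raw-socket privileges).

```python
def make_fake_probe(path, failed):
    """Build a probe(ttl) over a simulated router path.

    path   -- router names in hop order from the sending OSD
    failed -- set of router names that are down
    """
    def probe(ttl):
        hops = path[:ttl]
        # The probe is lost if the path is shorter than ttl or if any
        # router up to and including hop `ttl` is down.
        if len(hops) < ttl or any(r in failed for r in hops):
            return None      # simulated timeout
        return hops[-1]      # router that would answer "time exceeded"
    return probe


def trace_until_timeout(probe, max_hops):
    """Send probes with TTL 1, 2, 3, ... until one gets no answer.

    Returns (routers discovered so far, hop number that timed out or None).
    """
    discovered = []
    ttl = 1
    while ttl <= max_hops:
        reply = probe(ttl)
        if reply is None:
            return discovered, ttl
        discovered.append(reply)
        ttl += 1
    return discovered, None
```

With the Fig. 4 path and router A down, `trace_until_timeout(make_fake_probe(["A", "C", "B"], {"A"}), 3)` stops at hop 1 for OSD1, while the reverse path `["B", "C", "A"]` seen from OSD2 discovers B and C and stops at hop 3.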
Suppose router A fails. When the first IP message OSD1 sends (TTL of 1) times out, OSD1 knows from the cluster network topology information that the router that timed out is router A, which is on OSD1's own side. OSD1 therefore cancels the report that OSD2 is down and exits its own daemon process. Meanwhile, OSD2 sends OSD1 an IP message with a TTL of 1, and router B returns a message. OSD2 then sends an IP message with a TTL of 2, and router C returns successfully. OSD2 goes on to send a message with a TTL of 3, which times out because router A is faulty. From the cluster network topology information, OSD2 knows that router C is the center router and is not faulty, so the fault lies on OSD1's side. OSD2 reports to the MON that OSD1 is down; on receiving the message, the MON publishes the OSDMap and sets OSD1 to down. In this way OSD1 actively exits its daemon process when the link on its own side fails and avoids falsely reporting its peer OSD as down, which solves the OSD flapping problem.
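The decision each OSD makes in this walk-through can be written as a small function. It is a sketch, not the patent's code: `decide_action` and its return labels are hypothetical, hops are numbered from the local OSD, and the case where the center router itself is the one that times out is marked as undetermined because the text does not say what should happen then.

```python
def decide_action(failed_hop, center_hop):
    """Decide what an OSD does after hop-by-hop probing.

    failed_hop -- 1-based hop number whose probe timed out, or None
    center_hop -- 1-based hop number of the shared center router
    """
    if failed_hop is None:
        return "no-router-fault"
    if failed_hop < center_hop:
        # Fault between this OSD and the center router: our own side.
        return "exit-daemon"
    if failed_hop > center_hop:
        # Center router answered, so the fault is on the peer's side.
        return "report-peer-down"
    # Assumption: a failed center router is not covered by the text.
    return "undetermined"
```

In the example above both OSDs see the center router at hop 2. OSD1's probe fails at hop 1, so `decide_action(1, 2)` gives `"exit-daemon"`; OSD2's probe fails at hop 3, so `decide_action(3, 2)` gives `"report-peer-down"`.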
Corresponding to the above method, the application also discloses a network failure processing device. A distributed storage system includes a service network and a storage network that are deployed separately; a cluster monitor node (MON) and multiple object storage devices (OSDs) are connected to the service network for communication, and the OSDs are also connected to the storage network for communication. The device is applied to an OSD. Referring to Fig. 5, in terms of functional division the network failure processing device 200 includes:
a link detection unit 210, configured to detect the storage network link between the OSD on which the device resides (the host OSD) and the OSD paired with the host OSD, to obtain link state information for the host OSD's side.
a processing unit 220, configured to actively terminate the host OSD's daemon process when the link state on the host OSD's side is detected to be abnormal, to prevent the host OSD from sending the MON, over the service network, information that its paired OSD has failed.
Further, referring to Fig. 6, in another embodiment of the application the link detection unit 210 includes:
a network port detection unit 211, configured to detect whether the host OSD's network port is normal and obtain the host OSD's network port state information.
The processing unit 220 is specifically configured to actively terminate the host OSD's daemon process when the host OSD's network port state is detected to be abnormal, to prevent the host OSD from sending the MON, over the service network, information that its paired OSD has failed.
Referring further to Fig. 7, in another embodiment of the application the device also includes:
a storage unit 230, configured to store cluster network topology information. The cluster network topology information includes the number of router levels in the cluster network and the position information of the center router. Based on the cluster network topology information, the device obtains the position of the center router to which the host OSD and its paired OSD are both connected, and thereby determines which router levels are located on the host OSD's side.
The link detection unit 210 further includes:
a router detection unit 212, configured to send IP messages hop by hop and receive the messages returned by the routers at each level after the heartbeat communication between the host OSD and its paired OSD becomes abnormal, and to judge that a router has failed if the message returned by that router is not received.
The processing unit 220 is further configured to determine the position of the failed router based on the cluster network topology information and, when the failed router is detected to lie between the host OSD and the center router, to actively terminate the host OSD's daemon process, to prevent the host OSD from sending the MON, over the service network, information that its paired OSD has failed.
Referring again to Fig. 7, in some embodiments of the application the device further includes:
an updating unit 240, configured, after the router hierarchy of the cluster network changes, to receive over the service network the updated cluster network topology information sent by the OSD paired with the host OSD, pass it to the storage unit 230 for storage, and send the updated cluster network topology information to the OSD paired with the host OSD; and/or to receive over the service network the updated cluster network topology information sent by the MON and pass it to the storage unit 230 for storage.
Specifically, the IP messages sent by the router detection unit 212 are IP messages that follow the Internet Control Message Protocol (ICMP). The router detection unit 212 sets the initial TTL value of the IP message to 1 and sends it, and, each time a message returned by a router is received, increases the TTL value of the IP message by 1 and sends it again. When the message returned by the router at some level is not received, that router is judged to have failed.
The network failure processing device provided by the application can be implemented in software, in hardware, or in a combination of software and hardware. Taking a software implementation as an example, the processor 810 can read the machine-executable instructions corresponding to the network failure processing device 200 from the non-volatile memory 850 into the memory 840 and run them. At the hardware level, Fig. 8 shows a hardware structure diagram of the device of the application. In addition to the processor 810, internal bus 820, network interface 830, memory 840, and non-volatile memory 850 shown in Fig. 8, the OSD may include other hardware according to its actual functions, which is not described further here.
In various embodiments, the non-volatile memory 850 can be a storage drive (such as a hard disk drive), a solid-state drive, any type of storage disc (such as a CD or DVD), a similar storage medium, or a combination of these. The memory 840 can be RAM (Random Access Memory), volatile memory, non-volatile memory, or flash memory.
Further, the non-volatile memory 850 and the memory 840 serve as machine-readable storage media on which the machine-executable instructions corresponding to the network failure processing device 200, executed by the processor 810, can be stored.
Because the device embodiments essentially correspond to the method embodiments, the relevant parts of the description of the method embodiments apply to them. The device embodiments described above are only illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of an embodiment. Those of ordinary skill in the art can understand and implement this without creative effort.
It should be noted that, in this document, relational terms such as first and second are used only to distinguish one entity or operation from another and do not necessarily require or imply any actual relationship or order between these entities or operations. The terms "comprise", "include", and any of their variants are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to the process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article, or device that includes the element.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit its scope of protection. Any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention falls within the scope of protection of the present invention.

Claims (10)

  1. A network fault processing method, characterized in that a distributed storage system includes a service network and a storage network that are deployed separately, a cluster control node MON and multiple object storage devices (OSDs) are connected to the service network for communication, and the multiple OSDs are also connected to the storage network for communication; the method is applied to an OSD and comprises:
    detecting the storage network link between the OSD and the OSD paired with it, to obtain link state information for the side of the OSD;
    if an abnormality is detected in the link state on the side of the OSD, actively terminating the daemon process of the OSD, to prevent the OSD from sending to the MON, over the service network, information indicating that the OSD paired with it has failed.
  2. The method according to claim 1, characterized in that detecting the storage network link between the OSD and the OSD paired with it, to obtain link state information for the side of the OSD, comprises:
    detecting the network interface of the OSD to obtain network interface state information of the OSD;
    correspondingly, if an abnormality is detected in the link state on the side of the OSD, actively terminating the daemon process of the OSD, to prevent the OSD from sending to the MON, over the service network, information indicating that the OSD paired with it has failed, comprises:
    if the network interface state of the OSD is detected to be abnormal, actively terminating the daemon process of the OSD, to prevent the OSD from sending to the MON, over the service network, information indicating that the OSD paired with it has failed.
  3. The method according to claim 1 or 2, characterized in that the method further comprises:
    storing cluster network topology information, where the cluster network topology information includes the number of router levels in the cluster network and the position information of the central router; the OSD obtains, from the cluster network topology information, the position of the central router to which the OSD and the OSD paired with it are both connected, and thereby determines which router levels are on the side of the OSD;
    detecting the storage network link between the OSD and the OSD paired with it, to obtain link state information for the side of the OSD, further comprises:
    after heartbeat communication between the OSD and the OSD paired with it becomes abnormal, sending IP packets hop by hop and receiving the message returned by the router at each level; if the message returned by the router at a given level is not received, determining that this router has failed;
    correspondingly, if an abnormality is detected in the link state on the side of the OSD, actively terminating the daemon process of the OSD, to prevent the OSD from sending to the MON, over the service network, information indicating that the OSD paired with it has failed, further comprises:
    according to the cluster network topology information, if the failed router is detected to be located between the OSD and the central router, actively terminating the daemon process of the OSD, to prevent the OSD from sending to the MON, over the service network, information indicating that the OSD paired with it has failed.
  4. The method according to claim 3, characterized in that the method further comprises: after the router hierarchy of the cluster network changes,
    receiving over the service network, and storing, the updated cluster network topology information sent by the OSD paired with the OSD, and sending the updated cluster network topology information to the OSD paired with the OSD;
    and/or receiving over the service network, and storing, the updated cluster network topology information sent by the MON.
  5. The method according to claim 3, characterized in that the IP packets are IP packets that conform to the Internet Control Message Protocol (ICMP); sending IP packets hop by hop and receiving the message returned by the router at each level comprises:
    setting the initial time-to-live (TTL) value of the IP packet to 1 and sending the packet; each time the message returned by the router at one level is received, incrementing the TTL value of the IP packet by 1 and sending the packet again.
  6. A network fault processing device, characterized in that a distributed storage system includes a service network and a storage network that are deployed separately, a cluster control node MON and multiple object storage devices (OSDs) are connected to the service network for communication, and the multiple OSDs are also connected to the storage network for communication; the device is applied to an OSD and comprises:
    a link detection unit, configured to detect the storage network link between the OSD where the device resides and the OSD paired with that OSD, to obtain link state information for the side of the OSD where the device resides;
    a processing unit, configured to, when an abnormality is detected in the link state on the side of the OSD where the device resides, actively terminate the daemon process of that OSD, to prevent that OSD from sending to the MON, over the service network, information indicating that the OSD paired with it has failed.
  7. The device according to claim 6, characterized in that the link detection unit includes:
    a network interface detection unit, configured to detect whether the network interface of the OSD where the device resides is normal, to obtain network interface state information of that OSD;
    the processing unit is specifically configured to, when the network interface state of the OSD where the device resides is detected to be abnormal, actively terminate the daemon process of that OSD, to prevent that OSD from sending to the MON, over the service network, information indicating that the OSD paired with it has failed.
  8. The device according to claim 6 or 7, characterized in that the device further comprises:
    a storage unit, configured to store cluster network topology information, where the cluster network topology information includes the number of router levels in the cluster network and the position information of the central router; the device obtains, from the cluster network topology information, the position of the central router to which the OSD where the device resides and the OSD paired with that OSD are both connected, and thereby determines which router levels are on the side of the OSD where the device resides;
    the link detection unit further includes:
    a router detection unit, configured to, after heartbeat communication between the OSD where the device resides and the OSD paired with that OSD becomes abnormal, send IP packets hop by hop and receive the message returned by the router at each level; if the message returned by the router at a given level is not received, determine that this router has failed;
    the processing unit is further configured to, according to the cluster network topology information, when the failed router is detected to be located between the OSD where the device resides and the central router, actively terminate the daemon process of that OSD, to prevent that OSD from sending to the MON, over the service network, information indicating that the OSD paired with it has failed.
  9. The device according to claim 8, characterized in that the device further comprises:
    an updating unit, configured to, after the router hierarchy of the cluster network changes,
    receive over the service network the updated cluster network topology information sent by the OSD paired with the OSD where the device resides, pass it to the storage unit for storage, and send the updated cluster network topology information to the OSD paired with the OSD where the device resides;
    and/or receive over the service network the updated cluster network topology information sent by the MON, and pass it to the storage unit for storage.
  10. The device according to claim 8, characterized in that the IP packets sent by the router detection unit are IP packets that conform to the Internet Control Message Protocol (ICMP); specifically, the router detection unit sets the initial time-to-live (TTL) value of the IP packet to 1 and sends the packet, and each time the message returned by the router at one level is received, increments the TTL value of the IP packet by 1 and sends the packet again.
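
The decision logic of claims 1 to 3 and 5 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `first_silent_hop`, `should_stop_daemon`, `probe`, and `center_hop` are hypothetical, the real ICMP send/receive is replaced by a caller-supplied `probe(ttl)` function, and "between the OSD and the central router" is assumed to mean hops strictly before the central router.

```python
from typing import Callable, Optional


def first_silent_hop(probe: Callable[[int], bool], max_hops: int) -> Optional[int]:
    """Probe hop by hop, starting with TTL = 1 and adding 1 after each reply.

    `probe(ttl)` is a hypothetical callback that sends one packet with the
    given TTL and returns True if the router at that hop answered.
    Returns the first hop that did not answer, or None if every hop replied.
    """
    for ttl in range(1, max_hops + 1):
        if not probe(ttl):
            return ttl
    return None


def should_stop_daemon(interface_up: bool,
                       failed_hop: Optional[int],
                       center_hop: int) -> bool:
    """Decide whether this OSD should terminate its own daemon process.

    Assumption: `center_hop` is the hop count of the central router taken
    from the stored topology information, and a failed hop smaller than
    `center_hop` counts as a fault on this OSD's side.
    """
    if not interface_up:
        # Claim 2: the local network interface itself is abnormal.
        return True
    # Claim 3: a router between this OSD and the central router has failed.
    return failed_hop is not None and failed_hop < center_hop


if __name__ == "__main__":
    # Simulated replies: the router at hop 2 is silent, the central router is hop 3.
    replies = {1: True, 2: False, 3: True}
    hop = first_silent_hop(lambda ttl: replies.get(ttl, False), max_hops=3)
    print(hop, should_stop_daemon(True, hop, center_hop=3))
```

In this sketch the OSD stops its own daemon only when the fault is on its own side; a silent hop at or beyond the central router leaves the daemon running, so the peer failure can still be reported in the normal way.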
CN201710515775.6A 2017-06-29 2017-06-29 Network fault processing method and device Active CN107547252B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710515775.6A CN107547252B (en) 2017-06-29 2017-06-29 Network fault processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710515775.6A CN107547252B (en) 2017-06-29 2017-06-29 Network fault processing method and device

Publications (2)

Publication Number Publication Date
CN107547252A true CN107547252A (en) 2018-01-05
CN107547252B CN107547252B (en) 2020-12-04

Family

ID=60970312

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710515775.6A Active CN107547252B (en) 2017-06-29 2017-06-29 Network fault processing method and device

Country Status (1)

Country Link
CN (1) CN107547252B (en)

Also Published As

Publication number Publication date
CN107547252B (en) 2020-12-04

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant