CN107547252A - Network failure processing method and device
- Publication number: CN107547252A
- Application number: CN201710515775.6A
- Authority: CN (China)
- Prior art keywords: osd, router, network, information, place
- Legal status: Granted (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Landscapes
- Computer And Data Communications (AREA)
- Debugging And Monitoring (AREA)
Abstract
The invention discloses a network failure processing method and device. A distributed storage system includes a separately deployed service network and storage network; a cluster control node (MON) and multiple object storage devices (OSDs) are connected to the service network and communicate over it, and the OSDs are also connected to the storage network and communicate over it. The method is applied to an OSD and includes: detecting the storage-network link between the OSD and its peer OSD to obtain link state information for the OSD's own side; and, if the link state on the OSD's own side is detected to be abnormal, actively terminating the OSD's daemon, so that the OSD does not send the MON, over the service network, a report that its peer OSD has failed. Because an OSD in the storage system that detects an abnormal link state on its own side terminates its own daemon, it cannot falsely report other OSDs as failed, which solves the problem of OSD flapping in the storage network.
Description
Technical field
The present invention relates to the field of communication technology, and in particular to a network failure processing method and device.
Background
Ceph is an open-source distributed storage system. It has become one of the most widely used storage systems and one of the most popular open-source storage projects. Ceph offers high performance, high reliability, and high scalability, and includes object storage devices (Object Storage Device, OSD) and cluster monitor nodes (Monitor, MON). An OSD provides storage resources: in the normal state (up) it can serve storage, and in the abnormal state (down) it cannot be read or written normally. Each OSD has its own daemon (OSD daemon), which carries out all of the OSD's logical functions, including communicating with the MON and with other OSDs to maintain and update the system state. The MON receives the state reports sent by OSDs and updates and distributes the OSD state information (OSDMap) in order to maintain the global state of the whole Ceph cluster.
However, when a Ceph cluster is used in a production environment, network connectivity is one of the most important factors affecting its operation. The network structure recommended for Ceph clusters deploys the service network (public network) and the storage network (cluster network) separately. The service network mainly carries actual user data, as well as OSD state information and heartbeat traffic between OSDs and MONs and between MONs. The storage network is mainly used for heartbeat traffic between OSDs and for cluster-internal data such as recovery, replication, and scrubbing. In robustness and reliability testing, transient network interruptions or other network problems can trigger OSD flapping in the storage network: some or all OSDs are repeatedly set to the up or down state, which interrupts service.
Summary of the invention
This application provides a network failure processing method and device to solve the storage-network OSD flapping problem that transient network interruptions or other network problems cause in Ceph.
According to one aspect of this application, a network failure processing method is provided. A distributed storage system includes a separately deployed service network and storage network; a cluster control node (MON) and multiple object storage devices (OSDs) are connected to the service network and communicate over it, and the OSDs are also connected to the storage network and communicate over it. The method is applied to an OSD and includes:
detecting the storage-network link between the OSD and its peer OSD to obtain link state information for the OSD's own side;
if the link state on the OSD's own side is detected to be abnormal, actively terminating the OSD's daemon, so that the OSD does not send the MON, over the service network, a report that its peer OSD has failed.
According to another aspect of this application, a network failure processing device is provided. A distributed storage system includes a separately deployed service network and storage network; a cluster control node (MON) and multiple object storage devices (OSDs) are connected to the service network and communicate over it, and the OSDs are also connected to the storage network and communicate over it. The device is applied to an OSD and includes:
a link detection unit, configured to detect the storage-network link between the OSD on which the device resides (the host OSD) and the peer OSD of the host OSD, to obtain link state information for the host OSD's side;
a processing unit, configured to actively terminate the host OSD's daemon when the link state on the host OSD's side is detected to be abnormal, so that the host OSD does not send the MON, over the service network, a report that its peer OSD has failed.
The beneficial effect of this application is as follows. The network failure processing method of this application is applied in the OSDs of the Ceph storage network. An OSD detects the storage-network link between itself and its peer OSD to obtain link state information for its own side. When it detects that the link state on its own side is abnormal, it actively terminates its own daemon, so that its state is not recovered again. This prevents the OSD from falsely reporting its peer OSD as failed and solves the problem of OSD flapping in the storage network.
Brief description of the drawings
Fig. 1 is a schematic diagram of the Ceph network architecture and communication;
Fig. 2 is a schematic diagram of the communication process between peer OSDs;
Fig. 3 is a flow chart of the network failure processing method of one embodiment of this application;
Fig. 4 is a schematic diagram of a network path in which peer OSDs are connected through multi-level routing;
Fig. 5 is a structural diagram of the network failure processing device of one embodiment of this application;
Fig. 6 is a structural diagram of the network failure processing device of another embodiment of this application;
Fig. 7 is a structural diagram of the network failure processing device of yet another embodiment of this application;
Fig. 8 is a structural diagram of the OSD hardware of one embodiment of this application.
Detailed description
Exemplary embodiments are described in detail here, and examples of them are shown in the accompanying drawings. When the following description refers to the drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with this application. Rather, they are merely examples of devices and methods consistent with some aspects of this application as detailed in the appended claims.
The terms used in this application are for the purpose of describing particular embodiments only and are not intended to limit this application. The singular forms "a", "said", and "the" used in this application and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and covers any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, and so on may be used in this application to describe various pieces of information, the information should not be limited by these terms. The terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of this application, first information may also be called second information, and similarly, second information may also be called first information. Depending on the context, the word "if" as used here may be interpreted as "when", "at the time that", or "in response to determining".
To help those skilled in the art better understand the technical solution of this application, the structure and operating principle of Ceph, the open-source distributed storage system that forms the background of this application, are introduced first. As shown in Fig. 1, the service network of a Ceph cluster (Front in the figure) and the storage network (Back in the figure) are deployed separately. The service network mainly carries actual user data, as well as OSD Map information and heartbeat traffic between OSDs and MONs and between MONs. The storage network is mainly used for heartbeat traffic between OSDs and for cluster-internal data such as recovery, replication, and scrubbing.
The OSD status check flow is as follows:
1. An OSD actively reports its own state to the MON
By default, for example, an OSD reports once every 900 s, so after an OSD fails it takes too long to wait for the OSD itself to report to the MON. If this reporting interval is shortened, the load on the MON from processing OSD state reports grows linearly with cluster size. In general only the OSDs that have problems are of interest, and most of the time these reports serve little practical purpose, so shortening the interval does not work well.
2. An OSD checks the state of its peer OSDs
OSDs establish heartbeats with one another. For example, OSDs that carry the same placement group (Placement Group, PG) establish a peer relationship, or an OSD establishes a peer relationship with the OSDs whose IDs immediately precede and follow its own. Based on the state of the heartbeat communication, an OSD reports the state of its peer OSDs.
Specifically, the detection strategy between peer OSDs is as follows:
Each OSD starts a thread that sends a heartbeat message to each peer OSD at intervals (0.5 to 5.9 s). If the cluster is configured with both public_network and cluster_network, each heartbeat message is sent on both links at the same time.
As shown in Fig. 2, OSD.a sends a MOSDPing::PING heartbeat message (carrying OSD.a's send timestamp and osdmap version number) to its peer OSD, OSD.b. Under normal conditions OSD.b replies to OSD.a with a MOSDPing::PING_REPLY message (the reply still carries the timestamp that OSD.a sent, and also carries OSD.b's own osdmap version). When OSD.a receives OSD.b's reply, it records OSD.b's heartbeat information.
If OSD.a does not receive OSD.b's PING_REPLY message within a preset time (20 s by default, configurable), it adds OSD.b to a failure_queue and reports it to the MON. Once an OSD has been reported three times by its peer OSDs, the MON updates the osdmap and sets that OSD's state to down.
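The timeout-and-report rule described above can be sketched in Python. This is a simplified model of the behaviour as this text describes it, not Ceph's actual code: the class names `HeartbeatTracker` and `MonitorTally` are hypothetical, and the 20 s grace period and the three-report threshold are simply the values quoted above.

```python
from collections import Counter

HEARTBEAT_GRACE_S = 20.0   # default timeout quoted in the text (configurable)
REPORTS_TO_MARK_DOWN = 3   # number of reports quoted in the text


class HeartbeatTracker:
    """OSD-side view: remembers when each peer last answered a PING."""

    def __init__(self, peers, now):
        self.last_reply = {peer: now for peer in peers}

    def record_reply(self, peer, now):
        # Called when a PING_REPLY from this peer arrives.
        self.last_reply[peer] = now

    def overdue_peers(self, now):
        """Peers whose last reply is older than the grace period."""
        return sorted(peer for peer, seen in self.last_reply.items()
                      if now - seen > HEARTBEAT_GRACE_S)


class MonitorTally:
    """MON-side view: counts failure reports and marks OSDs down."""

    def __init__(self):
        self.reports = Counter()
        self.down = set()

    def report_failure(self, target):
        """Register one report; return True once the target is marked down."""
        self.reports[target] += 1
        if self.reports[target] >= REPORTS_TO_MARK_DOWN:
            self.down.add(target)
        return target in self.down
```

In this model an OSD would call `overdue_peers` on each heartbeat cycle and pass every returned peer to the MON, which is exactly the step that produces false reports when the reporting OSD's own link is the one at fault.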
Afterwards, the OSD that was set to down performs a self-check and tries to rebind its network card. If the binding succeeds, its state is set to up. If it fails, the OSD is further set to down and out, which indicates that the OSD has failed completely and no longer carries any PG.
However, this detection strategy has a problem: Ceph can detect a hardware fault in the local network card, but it cannot detect abnormal conditions on the cluster network links, such as a pulled network cable or a transient network interruption. Take a network failure at node B in Fig. 1 as an example. Because they cannot receive each other's PING_REPLY messages, OSD.a and OSD.b, and OSD.b and OSD.c, report one another as down. When the MON has received, for example, three consecutive reports from peer OSDs that OSD.b is down, it updates the OSDMap and sets OSD.b to down. After being set to down, OSD.b starts its self-check and tries to rebind its network card. If the network card has failed, the binding fails and OSD.b's state does not change. But if the cause is a pulled cable, a transient network interruption, a failed port at the far end (the switch end of the connection), or a cable that is not connected, OSD.b rebinds its network card successfully and its state becomes up again. Similarly, OSD.a and OSD.c are reported down by OSD.b, then rebind their network cards and become up again. As a result, multiple OSDs in the cluster go up and down repeatedly, which produces prolonged OSD flapping, and OSD flapping inevitably interrupts upper-layer services.
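The difference between the two fault types described above can be made concrete with a toy state model. It is only an illustration of the cycle in the text, under the assumption that every round consists of either one "reported down" event or one rebind attempt; the function name `simulate_osd_state` and the fault labels are hypothetical.

```python
def simulate_osd_state(fault, rounds):
    """Return the sequence of states recorded for one OSD.

    fault: "nic"  -> the local network card is broken, so rebinding fails
           "link" -> cable pulled or link interrupted; the card itself is
                     fine, so rebinding succeeds although heartbeats still fail
    """
    states = []
    state = "up"
    for _ in range(rounds):
        if state == "up":
            # Peers miss heartbeat replies and report the OSD; MON marks it down.
            state = "down"
        elif state == "down":
            # The OSD self-checks and tries to rebind its network card.
            state = "up" if fault == "link" else "down+out"
        # "down+out" is terminal: the OSD no longer carries any PG.
        states.append(state)
    return states


if __name__ == "__main__":
    print(simulate_osd_state("nic", 4))
    print(simulate_osd_state("link", 4))
```

With a card fault the OSD settles in a terminal state, whereas with a link fault the state alternates for as long as the link problem lasts, which is the flapping that the scheme below is meant to stop.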
Based on this, the technical concept of this application is as follows. To address the OSD flapping that transient network interruptions or other network problems cause in the prior-art Ceph storage network, a link detection mechanism is added to the OSDs of the Ceph storage network. The mechanism detects the link state on the OSD's own side and obtains link state information. When an OSD detects that the link state on its own side is abnormal, it actively terminates its own daemon. The added link detection mechanism identifies the OSD on the side where the fault actually exists and terminates its daemon, so that this OSD no longer undergoes state recovery. This prevents it from falsely reporting its peer OSDs as failed and solves the problem of OSD flapping in the storage network.
The implementation of the network failure processing scheme of this application is described in detail below with reference to embodiments.
Fig. 3 shows the flow chart of the network failure processing method of one embodiment of this application. A distributed storage system includes a separately deployed service network and storage network; a cluster control node (MON) and multiple object storage devices (OSDs) are connected to the service network and communicate over it, and the OSDs are also connected to the storage network and communicate over it. The method is applied to an OSD and includes the following steps:
Step S110: detect the storage-network link between the OSD and its peer OSD to obtain link state information for the OSD's own side.
Step S120: if the link state on the OSD's own side is detected to be abnormal, actively terminate the OSD's daemon, so that the OSD does not send the MON, over the service network, a report that its peer OSD has failed.
Through this self-check, once the OSD detects that its own link state is abnormal, it actively terminates its daemon, which prevents the OSD from recovering its state. This avoids falsely reporting a network anomaly at the OSD's peer OSDs, avoids OSD flapping, and ensures that upper-layer services are delivered normally and data flows effectively.
In the scheme of this application, link detection includes detection of the network interface and detection of the routers on the network path.
For simpler networks, for example where there is only one level of router between the OSDs and all of them are connected to the same router, checking the network interface state is enough to detect the link state. In this case, detecting the storage-network link between the OSD and its peer OSD to obtain link state information for the OSD's own side includes:
detecting the OSD's network interface to obtain the OSD's network interface state information.
Correspondingly, actively terminating the OSD's daemon if the link state on the OSD's own side is detected to be abnormal, so that the OSD does not send the MON, over the service network, a report that its peer OSD has failed, includes:
performing a self-check of the OSD's network interface state and, if the OSD's network interface state is detected to be abnormal, actively terminating the OSD's daemon, so that the OSD does not send the MON, over the service network, a report that its peer OSD has failed.
Specifically, a periodic network check can be added to the OSD: every 6 seconds, before it checks for heartbeat timeouts, the OSD first checks whether its network interface is normal (abnormal cases include a network card driver fault, a cable that is unplugged or damaged, or a cable pulled out at the switch end). When the OSD detects that its own network interface is abnormal, that is, that its directly connected link is in an abnormal state, it exits its own process. It therefore never falsely reports a heartbeat failure for a peer OSD, which solves the OSD flapping problem.
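One way to picture the interface check is the sketch below. It assumes a Linux host, where the kernel exposes each interface's operational state in `/sys/class/net/<iface>/operstate`; the text does not say how the check is implemented, so this mechanism, the function names, and the decision to treat only the value `up` as normal are all assumptions made for illustration.

```python
from pathlib import Path


def interface_is_up(iface, sysfs_root="/sys/class/net"):
    """Return True only if the kernel reports the interface as operational.

    Reads <sysfs_root>/<iface>/operstate. A missing file (for example, no
    such interface) is treated as abnormal.
    """
    try:
        state = (Path(sysfs_root) / iface / "operstate").read_text().strip()
    except OSError:
        return False
    return state == "up"


def heartbeat_tick(iface_ok, overdue_peers):
    """Decide what one 6-second cycle does, checking the interface first."""
    if not iface_ok:
        # Own direct link is abnormal: exit instead of blaming any peer.
        return ("exit_daemon", [])
    # Interface is fine, so heartbeat timeouts are reported as usual.
    return ("report_peers", list(overdue_peers))
```

The ordering is the point of the design: because the interface check runs before the heartbeat-timeout check, an OSD whose own link is down never reaches the step that reports its peers.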
In addition, a large cluster needs multi-level routing in its network in order to isolate faults. The heartbeat between OSDs then no longer passes through a single router and may cross several routers. As shown in Fig. 4, OSD1 is connected to center router C through router A, and OSD2 is connected to center router C through router B. That is, there are three levels of router on the network path between OSD1 and OSD2, and router C is the center router to which OSD1 and OSD2 are both connected. When router A fails, OSD1 cannot receive OSD2's heartbeat response, and after the timeout it reports to the MON that OSD2 is down. On receiving the message, the MON publishes an OSDMap that sets OSD2 to down. OSD2 then finds that its own network interface is normal, reports to the MON, and has its state updated to up. It sends heartbeat messages to OSD1, but because router A has failed, OSD2 cannot receive OSD1's heartbeat response. After the timeout OSD2 reports to the MON that OSD1 is down, and the MON sets OSD1 to down on receiving the message. The cycle repeats and produces the flapping problem.
The key is to find the side that has actually failed. If the number of router levels is known, and it is known at which router the heartbeat message is blocked, the OSD on the side that has actually failed can be determined and the false-report problem solved. On this basis, in some embodiments of this application, the method further includes:
storing cluster network topology information on the OSD. The cluster network topology information includes the number of router levels in the cluster network and the position of the center router. Based on the cluster network topology information, the OSD obtains the position of the center router to which it and its peer OSD are both connected, and thereby determines which levels of router are on its own side.
Detecting the storage-network link between the OSD and its peer OSD to obtain link state information for the OSD's own side further includes:
after the heartbeat communication between the OSD and its peer OSD becomes abnormal, the OSD sends IP packets level by level and receives the messages returned by the routers at each level; if it fails to receive the message returned by the router at some level, it determines that this router has failed.
Correspondingly, actively terminating the OSD's daemon if the link state on the OSD's own side is detected to be abnormal, so that the OSD does not send the MON, over the service network, a report that its peer OSD has failed, further includes:
based on the cluster network topology information, if the OSD detects that the failed router lies between the OSD and the center router, the OSD actively terminates its daemon, so that the OSD does not send the MON, over the service network, a report that its peer OSD has failed. With the newly added cluster network topology information, the OSD that has actually failed is identified and removed from cluster operation, which prevents that OSD from falsely reporting other OSDs as down and solves the OSD flapping problem.
The cluster network topology information is created when the network is set up. When the router hierarchy of the cluster network changes, for example when an OSD is added to the cluster or an OSD is replaced, the cluster network topology information is updated. For this case, the method further includes:
receiving over the service network, and storing, the updated cluster network topology information sent by a peer OSD, and sending the updated cluster network topology information to the OSD's peer OSDs; and/or receiving over the service network, and storing, the updated cluster network topology information sent by the MON.
Preferably, in some embodiments of this application, the Internet Control Message Protocol (ICMP) can be used to discover the routers on the network path level by level. Specifically, the IP packets conform to the Internet Control Message Protocol (ICMP). The time to live (Time To Live, TTL) field of an IP packet gives the maximum number of devices the packet may pass through in transit before it is discarded.
Sending IP packets level by level and receiving the messages returned by the routers at each level, as described above, then includes:
setting the initial TTL of the IP packet to 1 and sending it; each time the message returned by a router level is received, increasing the TTL of the IP packet by 1 and sending it again.
Take OSD1 in Fig. 4 as an example. OSD1 sends an IP packet with an initial TTL of 1 to the destination OSD2 (in practice it sends three 40-byte packets each time, containing the source address, the destination address, and the timestamp at which the packet was sent). When router A, the first router on the path, receives the packet, it decrements the TTL by 1. The TTL is now 0, so router A discards the packet and sends back an ICMP time exceeded message (containing the source address of the IP packet, all the content of the IP packet, and the router's IP address). On receiving this message, OSD1 knows that router A is on the path. OSD1 then sends another IP packet with a TTL of 2 and discovers the second router, center router C. Route tracing proceeds in this way: the TTL of each outgoing IP packet is increased by 1 to discover the next router.
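The TTL-stepping loop described above can be sketched as follows. Sending real ICMP probes normally needs raw sockets and elevated privileges, so this sketch takes the probe as a caller-supplied function instead; `send_probe`, `find_failed_hop`, and the test stand-in `fake_path_probe` are hypothetical names, not part of any library or of the patent.

```python
def find_failed_hop(send_probe, destination, max_hops=30):
    """Probe the path hop by hop, increasing the TTL by 1 each time.

    send_probe(ttl) must send one probe with the given TTL and return the
    address of whoever answered, or None on timeout.
    Returns (discovered_routers, failed_hop), where failed_hop is the 1-based
    hop that did not answer, or None if the destination was reached.
    """
    routers = []
    for ttl in range(1, max_hops + 1):
        responder = send_probe(ttl)
        if responder is None:
            return routers, ttl        # the router at this level did not answer
        if responder == destination:
            return routers, None       # the whole path is healthy
        routers.append(responder)      # a router answered with "time exceeded"
    raise RuntimeError("destination not reached within max_hops")


def fake_path_probe(path, dead=()):
    """Build a stand-in send_probe for a fixed path; dead nodes never answer."""
    def send_probe(ttl):
        reached = path[:ttl]
        if any(node in dead for node in reached):
            return None                # the probe is lost at or before this hop
        return reached[-1]
    return send_probe


if __name__ == "__main__":
    path = ["routerA", "routerC", "routerB", "osd2"]   # Fig. 4, seen from OSD1
    print(find_failed_hop(fake_path_probe(path), "osd2"))
    print(find_failed_hop(fake_path_probe(path, dead={"routerA"}), "osd2"))
```

Keeping the probe injectable separates the stepping logic from the packet handling, so the same loop can be exercised against a simulated path or wired to a real ICMP sender.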
Suppose router A fails. When the first IP packet that OSD1 sends (TTL 1) times out, OSD1 knows from the cluster network topology information that the router that timed out is router A, which is on OSD1's own side. OSD1 therefore cancels its report that OSD2 is down and exits its own daemon. Meanwhile, OSD2 sends an IP packet with a TTL of 1 toward OSD1, and router B returns a message. OSD2 then sends an IP packet with a TTL of 2, and router C returns successfully. OSD2 goes on to send a packet with a TTL of 3, which times out because router A has failed. From the cluster network topology information, OSD2 knows that router C is the center router and has not failed, so the fault must be on OSD1's side. OSD2 reports to the MON that OSD1 is down, and on receiving the message the MON publishes an OSDMap that sets OSD1 to down. In this way, when the link on its own side fails, OSD1 actively exits its daemon and avoids falsely reporting its peer OSD as down, which solves the OSD flapping problem.
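The router A scenario can be replayed from both ends with a few lines of Python. This is a sketch of the decision rule only: the router names come from Fig. 4, while the function names and the returned strings are hypothetical. Two cases are not spelled out in the text and are marked as assumptions in the comments: a path on which every router answers, and a failure of the center router itself.

```python
def first_silent_hop(path, failed):
    """1-based index of the first router on `path` that is in `failed`, else None."""
    for hop, router in enumerate(path, start=1):
        if router in failed:
            return hop
    return None


def decide(path, center, failed):
    """Apply the rule from the text to one OSD after a heartbeat timeout."""
    silent = first_silent_hop(path, failed)
    if silent is None:
        # Assumption: all routers answer, so the ordinary peer report goes ahead.
        return "report peer down"
    center_hop = path.index(center) + 1
    if silent < center_hop:
        return "exit own daemon"       # fault lies between this OSD and the center
    if silent > center_hop:
        return "report peer down"      # fault lies on the peer's side
    # Assumption: the text does not cover a failed center router.
    return "undetermined"


if __name__ == "__main__":
    failed = {"A"}
    print("OSD1:", decide(["A", "C", "B"], center="C", failed=failed))
    print("OSD2:", decide(["B", "C", "A"], center="C", failed=failed))
```

The two calls use the same set of failed routers but opposite path orders, which is why only the OSD behind the failed router terminates itself while its peer's report to the MON stays truthful.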
Corresponding to the above method, this application also discloses a network failure processing device. A distributed storage system includes a separately deployed service network and storage network; a cluster control node (MON) and multiple object storage devices (OSDs) are connected to the service network and communicate over it, and the OSDs are also connected to the storage network and communicate over it. The device is applied to an OSD. As shown in Fig. 5, the network failure processing device 200 is divided by function into:
a link detection unit 210, configured to detect the storage-network link between the OSD on which the device resides (the host OSD) and the peer OSD of the host OSD, to obtain link state information for the host OSD's side.
a processing unit 220, configured to actively terminate the host OSD's daemon when the link state on the host OSD's side is detected to be abnormal, so that the host OSD does not send the MON, over the service network, a report that its peer OSD has failed.
Further, as shown in Fig. 6, in another embodiment of this application, the link detection unit 210 includes:
a network interface detection unit 211, configured to detect whether the host OSD's network interface is normal and to obtain the host OSD's network interface state information.
The processing unit 220 is specifically configured to actively terminate the host OSD's daemon when the host OSD's network interface state is detected to be abnormal, so that the host OSD does not send the MON, over the service network, a report that its peer OSD has failed.
With further reference to Fig. 7, in another embodiment of this application, the device also includes:
a storage unit 230, configured to store cluster network topology information. The cluster network topology information includes the number of router levels in the cluster network and the position of the center router. Based on the cluster network topology information, the device obtains the position of the center router to which the host OSD and its peer OSD are both connected, and thereby determines which levels of router are on the host OSD's side.
The link detection unit 210 further includes:
a router detection unit 212, configured to, after the heartbeat communication between the host OSD and its peer OSD becomes abnormal, send IP packets level by level and receive the messages returned by the routers at each level, and to determine that a router has failed if the message returned by the router at that level is not received.
The processing unit 220 is further configured to determine the position of the failed router based on the cluster network topology information and, when the failed router is detected to lie between the host OSD and the center router, to actively terminate the host OSD's daemon, so that the host OSD does not send the MON, over the service network, a report that its peer OSD has failed.
Referring again to Fig. 7, in some embodiments of this application, the device further includes:
an update unit 240, configured to, after the router hierarchy of the cluster network changes, receive over the service network the updated cluster network topology information sent by a peer OSD of the host OSD, pass it to the storage unit 230 for storage, and send the updated cluster network topology information to the peer OSDs of the host OSD; and/or receive over the service network the updated cluster network topology information sent by the MON and pass it to the storage unit 230 for storage.
Specifically, the IP packets sent by the router detection unit 212 conform to the Internet Control Message Protocol (ICMP). The router detection unit 212 sets the initial TTL of the IP packet to 1 and sends it; each time the message returned by a router level is received, it increases the TTL of the IP packet by 1 and sends it again. If the message returned by the router at some level is not received, it determines that this router has failed.
The network failure processing device provided by this application can be implemented in software, in hardware, or in a combination of software and hardware. Taking a software implementation as an example, the processor 810 reads the machine-executable instructions corresponding to the network failure processing device 200 from the non-volatile memory 850 into the memory 840 and runs them. At the hardware level, Fig. 8 shows a hardware structure diagram of the device of this application. In addition to the processor 810, internal bus 820, network interface 830, memory 840, and non-volatile memory 850 shown in Fig. 8, the OSD may include other hardware according to its actual functions, which is not described further here.
In various embodiments, the non-volatile memory 850 may be a storage drive (such as a hard disk drive), a solid-state drive, any type of storage disc (such as a CD or DVD), a similar storage medium, or a combination of these. The memory 840 may be RAM (Random Access Memory), volatile memory, non-volatile memory, or flash memory.
Further, the non-volatile memory 850 and the memory 840 serve as machine-readable storage media on which the machine-executable instructions corresponding to the network failure processing device 200, executed by the processor 810, can be stored.
Because the device embodiments essentially correspond to the method embodiments, refer to the description of the method embodiments for the relevant parts. The device embodiments described above are only illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network elements. Some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of this embodiment. Those of ordinary skill in the art can understand and implement it without creative effort.
It should be noted that relational terms such as first and second are used herein only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. The terms "include", "comprise", and any other variant are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to the process, method, article, or device. Without further limitation, an element introduced by "including a ..." does not exclude the presence of other identical elements in the process, method, article, or device that includes the element.
The above are only preferred embodiments of the present invention and are not intended to limit its scope of protection. Any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention falls within the scope of protection of the present invention.
Claims (10)
- A kind of 1. network failure processing method, it is characterised in that in distributed memory system, including the service network being deployed separately With storage net, clustered control node M ON and multiple object storage device OSD are connected to the service network and communicated, the multiple OSD is also connected to the storage net and communicated, and this method is applied to the OSD, including:The OSD and the storage net connecting link between the OSD of OSD pairings are detected, obtains the Link State letter of the OSD sides Breath;If detecting there is abnormal, active termination OSD finger daemon, to prevent the OSD from leading in the Link State of the OSD sides Cross the information for the OSD failures that service network is sent to MON and the OSD is matched.
- 2. The method according to claim 1, characterized in that detecting the storage-network link between the OSD and the OSD paired with the OSD, to obtain the link state information on the side of the OSD, comprises:
  detecting the network interface of the OSD, to obtain network interface state information of the OSD;
  correspondingly, if an abnormality is detected in the link state on the side of the OSD, actively terminating the daemon process of the OSD, so as to prevent the OSD from sending, to the MON over the service network, information indicating that the OSD paired with the OSD has failed, comprises:
  if the network interface state of the OSD is detected to be abnormal, actively terminating the daemon process of the OSD, so as to prevent the OSD from sending, to the MON over the service network, information indicating that the OSD paired with the OSD has failed.
- 3. The method according to claim 1 or 2, characterized in that the method further comprises:
  storing cluster network topology information, the cluster network topology information including the number of router levels of the cluster network and the position information of the center router; the OSD obtaining, according to the cluster network topology information, the position of the center router to which the OSD and the OSD paired with the OSD are both connected, thereby determining which levels of routers are located on the side of the OSD;
  said detecting the storage-network link between the OSD and the OSD paired with the OSD, to obtain the link state information on the side of the OSD, further comprises:
  after heartbeat communication between the OSD and the OSD paired with the OSD becomes abnormal, sending IP packets level by level and receiving the messages returned by the routers at each level, and, if the message returned by the router at a certain level is not received, determining that this router has failed;
  correspondingly, if an abnormality is detected in the link state on the side of the OSD, actively terminating the daemon process of the OSD, so as to prevent the OSD from sending, to the MON over the service network, information indicating that the OSD paired with the OSD has failed, further comprises:
  according to the cluster network topology information, if the failed router is detected to be located between the OSD and the center router, actively terminating the daemon process of the OSD, so as to prevent the OSD from sending, to the MON over the service network, information indicating that the OSD paired with the OSD has failed.
- 4. The method according to claim 3, characterized in that the method further comprises: after the router hierarchy of the cluster network changes,
  receiving, over the service network, updated cluster network topology information sent by the OSD paired with the OSD and storing it, and sending the updated cluster network topology information to the OSD paired with the OSD; and/or
  receiving, over the service network, updated cluster network topology information sent by the MON and storing it.
- 5. The method according to claim 3, characterized in that the IP packets are IP packets conforming to the Internet Control Message Protocol (ICMP);
  said sending IP packets level by level and receiving the messages returned by the routers at each level comprises:
  setting the initial time-to-live (TTL) value of the IP packet to 1 and sending it; each time a message returned by the router at one level is received, increasing the TTL value of the IP packet by 1 and sending it again.
- 6. A network fault processing device, characterized in that a distributed storage system comprises a service network and a storage network that are deployed separately; a cluster control node (MON) and a plurality of object storage devices (OSDs) are connected to the service network for communication, and the plurality of OSDs are further connected to the storage network for communication; the device is applied to an OSD and comprises:
  a link detection unit, configured to detect the storage-network link between the OSD where the device is located and an OSD paired with that OSD, to obtain link state information on the side of the OSD where the device is located; and
  a processing unit, configured to, when an abnormality is detected in the link state on the side of the OSD where the device is located, actively terminate the daemon process of that OSD, so as to prevent that OSD from sending, to the MON over the service network, information indicating that the OSD paired with it has failed.
- 7. The device according to claim 6, characterized in that the link detection unit comprises:
  a network interface detection unit, configured to detect whether the network interface of the OSD where the device is located is normal, to obtain network interface state information of that OSD;
  the processing unit is specifically configured to, when the network interface state of the OSD where the device is located is detected to be abnormal, actively terminate the daemon process of that OSD, so as to prevent that OSD from sending, to the MON over the service network, information indicating that the OSD paired with it has failed.
- 8. The device according to claim 6 or 7, characterized in that the device further comprises:
  a storage unit, configured to store cluster network topology information, the cluster network topology information including the number of router levels of the cluster network and the position information of the center router; the device obtaining, according to the cluster network topology information, the position of the center router to which the OSD where the device is located and the OSD paired with that OSD are both connected, thereby determining which levels of routers are located on the side of the OSD where the device is located;
  the link detection unit further comprises:
  a router detection unit, configured to, after heartbeat communication between the OSD where the device is located and the OSD paired with it becomes abnormal, send IP packets level by level and receive the messages returned by the routers at each level, and, if the message returned by the router at a certain level is not received, determine that this router has failed;
  the processing unit is further configured to, according to the cluster network topology information, when the failed router is detected to be located between the OSD where the device is located and the center router, actively terminate the daemon process of that OSD, so as to prevent that OSD from sending, to the MON over the service network, information indicating that the OSD paired with it has failed.
- 9. The device according to claim 8, characterized in that the device further comprises:
  an updating unit, configured to, after the router hierarchy of the cluster network changes,
  receive, over the service network, updated cluster network topology information sent by the OSD paired with the OSD where the device is located, pass it to the storage unit for storage, and send the updated cluster network topology information to the OSD paired with the OSD where the device is located; and/or
  receive, over the service network, updated cluster network topology information sent by the MON, and pass it to the storage unit for storage.
- 10. The device according to claim 8, characterized in that the IP packets sent by the router detection unit are IP packets conforming to the Internet Control Message Protocol (ICMP); specifically, the router detection unit sets the initial time-to-live (TTL) value of the IP packet to 1 and sends it, and, each time a message returned by the router at one level is received, increases the TTL value of the IP packet by 1 and sends it again.
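
For illustration only, the level-by-level check recited in claims 3 and 5 can be sketched as below. This is a minimal sketch under stated assumptions, not the implementation of the invention: the names `find_failed_hop`, `failure_is_on_local_side`, and `on_heartbeat_failure` are hypothetical, the `probe` callable stands in for sending one ICMP-conformant IP packet with a given TTL and waiting for the router's reply, and the center router itself is assumed not to count as being on the OSD's own side.

```python
from typing import Callable, Optional

# Hypothetical probe: probe(ttl) returns True if the router at hop `ttl`
# answered the packet sent with that TTL, and False if nothing came back.
Probe = Callable[[int], bool]


def find_failed_hop(probe: Probe, max_hops: int) -> Optional[int]:
    """Probe with TTL = 1, 2, ... and return the first hop that stays silent."""
    for ttl in range(1, max_hops + 1):   # initial TTL is 1, then add 1 per reply
        if not probe(ttl):
            return ttl                    # no reply: treat this router as failed
    return None                           # every probed hop answered


def failure_is_on_local_side(failed_hop: Optional[int], center_hop: int) -> bool:
    """True if the failed router lies between this OSD and the center router.

    Assumption: hops are counted from this OSD, and the center router itself
    (hop == center_hop) is not counted as being on this OSD's side.
    """
    return failed_hop is not None and failed_hop < center_hop


def on_heartbeat_failure(probe: Probe, center_hop: int, max_hops: int,
                         terminate_daemon: Callable[[], None]) -> str:
    """Decide what to do after heartbeats to the paired OSD stop arriving."""
    failed_hop = find_failed_hop(probe, max_hops)
    if failure_is_on_local_side(failed_hop, center_hop):
        terminate_daemon()                # stop before reporting the peer to MON
        return "terminated"
    return "keep_running"
```

In this sketch, `center_hop` and `max_hops` would be derived from the stored cluster network topology information. With the center router at hop 3, a silent router at hop 2 leads to the daemon being terminated, whereas a silent router at hop 4 leaves the daemon running.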
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710515775.6A CN107547252B (en) | 2017-06-29 | 2017-06-29 | Network fault processing method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710515775.6A CN107547252B (en) | 2017-06-29 | 2017-06-29 | Network fault processing method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107547252A (en) | 2018-01-05 |
CN107547252B (en) | 2020-12-04 |
Family
ID=60970312
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710515775.6A Active CN107547252B (en) | 2017-06-29 | 2017-06-29 | Network fault processing method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107547252B (en) |
Cited By (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108519927A (en) * | 2018-04-12 | 2018-09-11 | 郑州云海信息技术有限公司 | A kind of OSD Fault Locating Methods and system based on ICFS systems |
CN108769098A (en) * | 2018-04-03 | 2018-11-06 | 郑州云海信息技术有限公司 | A kind of method, apparatus and system for establishing distributed memory system network connection |
CN108924195A (en) * | 2018-06-20 | 2018-11-30 | 郑州云海信息技术有限公司 | A kind of unidirectional heartbeat mechanism implementation method, device, equipment and system |
CN108958970A (en) * | 2018-05-29 | 2018-12-07 | 新华三技术有限公司 | A kind of data reconstruction method, server and computer-readable medium |
CN109101357A (en) * | 2018-07-20 | 2018-12-28 | 广东浪潮大数据研究有限公司 | A kind of detection method and device of OSD failure |
CN109597689A (en) * | 2018-12-10 | 2019-04-09 | 浪潮(北京)电子信息产业有限公司 | A kind of distributed file system Memory Optimize Method, device, equipment and medium |
CN111142801A (en) * | 2019-12-26 | 2020-05-12 | 星辰天合(北京)数据科技有限公司 | Distributed storage system network sub-health detection method and device |
CN111385296A (en) * | 2020-03-04 | 2020-07-07 | 深信服科技股份有限公司 | Business process restarting method, device, storage medium and system |
CN111510338A (en) * | 2020-03-09 | 2020-08-07 | 苏州浪潮智能科技有限公司 | Distributed block storage network sub-health test method, device and storage medium |
CN111614477A (en) * | 2019-02-22 | 2020-09-01 | 华为技术有限公司 | Method and device for positioning network fault |
CN111756571A (en) * | 2020-05-28 | 2020-10-09 | 苏州浪潮智能科技有限公司 | Cluster node fault processing method, device, equipment and readable medium |
CN111817926A (en) * | 2020-09-11 | 2020-10-23 | 中国人民解放军国防科技大学 | Method for realizing reachability monitoring based on net-ping under RubyGems |
CN111913667A (en) * | 2020-08-06 | 2020-11-10 | 平安科技(深圳)有限公司 | OSD blocking detection method, system, terminal and storage medium based on Ceph |
CN112000500A (en) * | 2020-07-29 | 2020-11-27 | 新华三大数据技术有限公司 | Communication fault determining method, processing method and storage device |
CN112306815A (en) * | 2020-11-16 | 2021-02-02 | 新华三大数据技术有限公司 | Method, device, equipment and medium for monitoring IO (input/output) information between OSD side master and slave in Ceph |
CN112596935A (en) * | 2020-11-16 | 2021-04-02 | 新华三大数据技术有限公司 | OSD fault processing method and device |
CN113472553A (en) * | 2020-03-30 | 2021-10-01 | 中国移动通信集团浙江有限公司 | Fault injection system and method |
CN113542001A (en) * | 2021-05-26 | 2021-10-22 | 新华三大数据技术有限公司 | OSD fault heartbeat detection method, device, equipment and storage medium |
CN114095341A (en) * | 2021-11-19 | 2022-02-25 | 深信服科技股份有限公司 | Network recovery method and device, computer equipment and storage medium |
WO2024113832A1 (en) * | 2022-11-29 | 2024-06-06 | 华为技术有限公司 | Node abnormality event processing method, network interface card, and storage cluster |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7362702B2 (en) * | 2001-10-18 | 2008-04-22 | Qlogic, Corporation | Router with routing processors and methods for virtualization |
CN101252603A (en) * | 2008-04-11 | 2008-08-27 | 清华大学 | Cluster distributed type lock management method based on storage area network SAN |
US20090198793A1 (en) * | 2008-01-31 | 2009-08-06 | Thanabalan Thavittupitchai Paul | Systems and methods for dynamically reporting a boot process in content/service receivers |
CN102402395A (en) * | 2010-09-16 | 2012-04-04 | 上海中标软件有限公司 | Quorum disk-based non-interrupted operation method for high availability system |
Cited By (32)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108769098B (en) * | 2018-04-03 | 2021-04-13 | 郑州云海信息技术有限公司 | Method, device and system for establishing network connection of distributed storage system |
CN108769098A (en) * | 2018-04-03 | 2018-11-06 | 郑州云海信息技术有限公司 | A kind of method, apparatus and system for establishing distributed memory system network connection |
CN108519927A (en) * | 2018-04-12 | 2018-09-11 | 郑州云海信息技术有限公司 | A kind of OSD Fault Locating Methods and system based on ICFS systems |
CN108958970A (en) * | 2018-05-29 | 2018-12-07 | 新华三技术有限公司 | A kind of data reconstruction method, server and computer-readable medium |
CN108958970B (en) * | 2018-05-29 | 2021-05-07 | 新华三技术有限公司 | Data recovery method, server and computer readable medium |
CN108924195A (en) * | 2018-06-20 | 2018-11-30 | 郑州云海信息技术有限公司 | A kind of unidirectional heartbeat mechanism implementation method, device, equipment and system |
CN109101357A (en) * | 2018-07-20 | 2018-12-28 | 广东浪潮大数据研究有限公司 | A kind of detection method and device of OSD failure |
CN109597689B (en) * | 2018-12-10 | 2022-06-10 | 浪潮(北京)电子信息产业有限公司 | Distributed file system memory optimization method, device, equipment and medium |
CN109597689A (en) * | 2018-12-10 | 2019-04-09 | 浪潮(北京)电子信息产业有限公司 | A kind of distributed file system Memory Optimize Method, device, equipment and medium |
US11876700B2 (en) | 2019-02-22 | 2024-01-16 | Huawei Technologies Co., Ltd. | Network fault locating method and apparatus |
CN111614477A (en) * | 2019-02-22 | 2020-09-01 | 华为技术有限公司 | Method and device for positioning network fault |
CN111142801B (en) * | 2019-12-26 | 2021-05-04 | 星辰天合(北京)数据科技有限公司 | Distributed storage system network sub-health detection method and device |
CN111142801A (en) * | 2019-12-26 | 2020-05-12 | 星辰天合(北京)数据科技有限公司 | Distributed storage system network sub-health detection method and device |
CN111385296B (en) * | 2020-03-04 | 2022-06-21 | 深信服科技股份有限公司 | Business process restarting method, device, storage medium and system |
CN111385296A (en) * | 2020-03-04 | 2020-07-07 | 深信服科技股份有限公司 | Business process restarting method, device, storage medium and system |
CN111510338A (en) * | 2020-03-09 | 2020-08-07 | 苏州浪潮智能科技有限公司 | Distributed block storage network sub-health test method, device and storage medium |
CN111510338B (en) * | 2020-03-09 | 2022-04-26 | 苏州浪潮智能科技有限公司 | Distributed block storage network sub-health test method, device and storage medium |
CN113472553A (en) * | 2020-03-30 | 2021-10-01 | 中国移动通信集团浙江有限公司 | Fault injection system and method |
CN111756571A (en) * | 2020-05-28 | 2020-10-09 | 苏州浪潮智能科技有限公司 | Cluster node fault processing method, device, equipment and readable medium |
US11750437B2 (en) | 2020-05-28 | 2023-09-05 | Inspur Suzhou Intelligent Technology Co., Ltd. | Cluster node fault processing method and apparatus, and device and readable medium |
CN111756571B (en) * | 2020-05-28 | 2022-02-18 | 苏州浪潮智能科技有限公司 | Cluster node fault processing method, device, equipment and readable medium |
CN112000500A (en) * | 2020-07-29 | 2020-11-27 | 新华三大数据技术有限公司 | Communication fault determining method, processing method and storage device |
CN111913667A (en) * | 2020-08-06 | 2020-11-10 | 平安科技(深圳)有限公司 | OSD blocking detection method, system, terminal and storage medium based on Ceph |
CN111913667B (en) * | 2020-08-06 | 2023-03-14 | 平安科技(深圳)有限公司 | OSD blocking detection method, system, terminal and storage medium based on Ceph |
CN111817926A (en) * | 2020-09-11 | 2020-10-23 | 中国人民解放军国防科技大学 | Method for realizing reachability monitoring based on net-ping under RubyGems |
CN112596935B (en) * | 2020-11-16 | 2022-08-30 | 新华三大数据技术有限公司 | OSD fault processing method and device |
CN112306815B (en) * | 2020-11-16 | 2023-07-25 | 新华三大数据技术有限公司 | Method, device, equipment and medium for monitoring IO information between OSD side master and slave in Ceph |
CN112596935A (en) * | 2020-11-16 | 2021-04-02 | 新华三大数据技术有限公司 | OSD fault processing method and device |
CN112306815A (en) * | 2020-11-16 | 2021-02-02 | 新华三大数据技术有限公司 | Method, device, equipment and medium for monitoring IO (input/output) information between OSD side master and slave in Ceph |
CN113542001A (en) * | 2021-05-26 | 2021-10-22 | 新华三大数据技术有限公司 | OSD fault heartbeat detection method, device, equipment and storage medium |
CN114095341A (en) * | 2021-11-19 | 2022-02-25 | 深信服科技股份有限公司 | Network recovery method and device, computer equipment and storage medium |
WO2024113832A1 (en) * | 2022-11-29 | 2024-06-06 | 华为技术有限公司 | Node abnormality event processing method, network interface card, and storage cluster |
Also Published As
Publication number | Publication date |
---|---|
CN107547252B (en) | 2020-12-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107547252A (en) | | A kind of network failure processing method and device |
US10200279B1 (en) | | Tracer of traffic trajectories in data center networks |
US7761743B2 (en) | | Fault tolerant routing in a non-hot-standby configuration of a network routing system |
CN101707537B (en) | | Positioning method of failed link and alarm root cause analyzing method, equipment and system |
CN1826776B (en) | | Method and apparatus for processing duplicate packets |
US5949759A (en) | | Fault correlation system and method in packet switching networks |
US10868709B2 (en) | | Determining the health of other nodes in a same cluster based on physical link information |
US7330889B2 (en) | | Network interaction analysis arrangement |
CN101313280A (en) | | Pool-based network diagnostic systems and methods |
JP2001249856A (en) | | Method for processing error in storage area network(san) and data processing system |
CN103795570B (en) | | The unicast message restoration methods and device of the stacked switchboard system of ring topology |
CN106959820A (en) | | A kind of data extraction method and system |
CN110784373A (en) | | Virtual network convergence method and device |
CN106330531A (en) | | Node fault recording and processing method and device |
US8559317B2 (en) | | Alarm threshold for BGP flapping detection |
US10277484B2 (en) | | Self organizing network event reporting |
Dozier et al. | | Vulnerability analysis of AIS-based intrusion detection systems via genetic and particle swarm red teams |
CN112769653B (en) | | Network detection and switching method, system and medium based on network port binding |
Borokhovich et al. | | The show must go on: Fundamental data plane connectivity services for dependable SDNs |
CN109257268A (en) | | A kind of network attack test system and method across vlan |
CN111130813B (en) | | Information processing method based on network and electronic equipment |
CN113132140B (en) | | Network fault detection method, device, equipment and storage medium |
CN101102231A (en) | | An automatic discovery method and device of PPP link routing device |
Cisco | | Cisco WAN Switching Software Release Notes, Release 8.5.05 |
Cisco | | 9.1.10 Software Release Notes Cisco StrataView Plus for AIX |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||