CN111666231B - Method for maintaining memory sharing in clustered system

Publication number
CN111666231B
Authority
CN
China
Prior art keywords
computer node
memory
transparent bridge
driver
computer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910164001.2A
Other languages
Chinese (zh)
Other versions
CN111666231A (en)
Inventor
泰清秀
林宏达
许瀞文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Mitac Computer Shunde Ltd
Mitac Computing Technology Corp
Original Assignee
Mitac Computer Shunde Ltd
Mitac Computing Technology Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mitac Computer Shunde Ltd, Mitac Computing Technology Corp
Priority to CN201910164001.2A
Publication of CN111666231A
Application granted
Publication of CN111666231B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/14 Handling requests for interconnection or transfer
    • G06F13/16 Handling requests for interconnection or transfer for access to memory bus
    • G06F13/1605 Handling requests for interconnection or transfer for access to memory bus based on arbitration
    • G06F13/1652 Handling requests for interconnection or transfer for access to memory bus based on arbitration in a multiprocessor architecture
    • G06F13/1663 Access to shared memory
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/38 Information transfer, e.g. on bus
    • G06F13/40 Bus structure
    • G06F13/4004 Coupling between buses
    • G06F13/4027 Coupling between buses using bus bridges
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/38 Information transfer, e.g. on bus
    • G06F13/42 Bus transfer protocol, e.g. handshake; Synchronisation
    • G06F13/4282 Bus transfer protocol, e.g. handshake; Synchronisation on a serial bus, e.g. I2C bus, SPI bus

Abstract

The invention provides a method for maintaining memory sharing in a clustered system. The method is implemented by a clustered system that comprises a computer node and another computer node connected to the computer node through a non-transparent bridge to share memory. The method comprises the following steps: (A) the computer node verifies whether the connection state of the non-transparent bridge between the computer node and the other computer node is a disconnected state; (B) when the connection state is verified to be the disconnected state, the computer node transmits to a driver a reset request for reinitializing the memory; (C) the computer node, through the executed driver, executes a reset procedure in response to the reset request to generate an initialization result; and (D) when the computer node receives the initialization result from the other computer node, the computer node executes a sharing configuration procedure.

Description

Method for maintaining memory sharing in clustered system
[ technical field ]
The invention relates to a method for maintaining memory sharing in a clustered system, and in particular to a method that automatically detects whether a peer computer node in the clustered system has been reset and automatically restores the connection state between the two computer nodes to maintain memory sharing.
[ background of the invention ]
A clustered system is a collection of two or more computer nodes that share storage devices and are tightly connected so that they work together to perform computing tasks. In a clustered system, Non-Transparent Bridge (NTB) technology can be used between two computer nodes to share memory between them, enabling fast, high-volume communication between the computer nodes. NTB is implemented through the cooperation of the firmware running in the Peripheral Component Interconnect Express (PCIe) switches of the computer nodes, the kernels of the operating systems running on the computer nodes, and the drivers associated with the PCIe switches; together these establish memory sharing among the computer nodes during system initialization.
When one of the computer nodes is reset (that is, when it becomes abnormal and is replaced with a new computer node, is rebooted because of the abnormality, or is hot-plugged and reinstalled into the clustered system), the prior art does not specify how this exceptional condition should be handled. The shared-memory settings established by the computer nodes are therefore lost when one of them is reset, which causes communication between the computer nodes to fail (that is, the connection state between the computer nodes becomes the disconnected state), and the computer nodes cannot subsequently repair the connection state on their own.
[ summary of the invention ]
The present invention provides a method for maintaining memory sharing in a clustered system, which automatically detects whether a peer computer node in the clustered system is reset and automatically restores the connection status between the two computer nodes to maintain memory sharing.
To solve the above technical problem, the method for maintaining memory sharing in a clustered system is implemented by a clustered system that includes a computer node and another computer node connected to the computer node via a non-transparent bridge to share memory. Each computer node includes a central processing unit (CPU), a memory, an application executed by the CPU and running in user space, a driver executed by the CPU and running in kernel space, and a peripheral component interconnect (PCI) switch electrically connected to the CPU and executing firmware. The method comprises:
(A) the computer node verifies, through the executed application, whether the connection state of the non-transparent bridge between the computer node and the other computer node is a disconnected state;
(B) when the computer node verifies that the connection state of the non-transparent bridge between the computer node and the other computer node is the disconnected state, the computer node transmits to the driver, through the executed application, a reset request instructing the driver to reinitialize the memory;
(C) the computer node, through the executed driver, executes a reset procedure associated with reinitializing the memory in response to the reset request, to generate an initialization result associated with the memory; and
(D) when the computer node receives the initialization result that comes from the other computer node and is associated with the other computer node's own memory, the computer node executes, through the executed driver, a sharing configuration procedure associated with sharing the memories.
Compared with the prior art, the method has the following advantage: the computer node actively verifies, through the executed application, whether the connection state of the non-transparent bridge between the computer node and the other computer node is the disconnected state. When the connection state is verified to be the disconnected state, the application automatically requests the driver to execute the reset procedure associated with reinitializing the memory, and the sharing configuration procedure is executed once the initialization result from the other computer node is received. The connection state between the two computer nodes is thereby repaired automatically and memory sharing is maintained.
[ description of the drawings ]
FIG. 1 is a block diagram illustrating a clustered system implementing a first embodiment of a method for maintaining memory sharing in a clustered system in accordance with the present invention.
FIG. 2 is a flowchart illustrating a memory sharing operation procedure of the first embodiment of the method for maintaining memory sharing in a clustered system according to the present invention.
FIG. 3 is a flowchart illustrating a memory sharing maintaining procedure according to a first embodiment of the method for maintaining memory sharing in a clustered system.
FIG. 4 is a flowchart illustrating how a computer node verifies whether the connection status of a non-transparent bridge between the computer node and another computer node is disconnected.
FIG. 5 is a flowchart illustrating how, in the first embodiment, the computer node detects via the driver whether the connection status of the non-transparent bridge to the other computer node is disconnected.
FIG. 6 is a flowchart illustrating how, in a second embodiment, the computer node detects via the firmware whether the connection status of the non-transparent bridge to the other computer node is disconnected.
FIG. 7 is a block diagram illustrating the hardware in each computer node that executes the application, the driver, and the firmware.
[ detailed description ]
Referring to FIG. 1, a first embodiment of the method for maintaining memory sharing in a clustered system according to the present invention is implemented by a clustered system. The clustered system includes a computer node 1 and another computer node 1 connected to the computer node 1 via a non-transparent bridge to share memories 12. Each computer node 1 includes a central processing unit 11, a memory 12 electrically connected to the central processing unit 11, an application executed by the central processing unit 11 and running in the user space of an operating system, a driver executed by the central processing unit 11 and running in the kernel space of the operating system, and a peripheral component interconnect (PCI) switch 13 electrically connected to the central processing unit 11 and executing firmware. Each PCI switch 13 includes a processor 131 for executing the firmware, a register 132 electrically connected to the processor 131, and a serializer/deserializer (SerDes) 133 electrically connected to the processor 131 and the register 132. The serializer/deserializer 133 of the PCI switch 13 of the computer node 1 communicates with the serializer/deserializer 133 of the PCI switch 13 of the other computer node 1 via a lane. Note that the other computer node 1 is the peer computer node 1 of the computer node 1, and vice versa. In the present embodiment, each memory 12 is, for example, a main memory. FIG. 7 illustrates the hardware in each computer node 1 that executes the application, the driver, and the firmware.
The operation of each component in the clustered system is described below with reference to the first embodiment of the method for maintaining memory sharing in a clustered system according to the present invention. The present embodiment includes, in sequence, a memory sharing operation procedure and a memory sharing maintenance procedure.
It is worth mentioning that, at the initial stage of the clustered system, the computer nodes 1 can establish sharing of the memories 12 between them through the non-transparent bridge in a conventional manner. When one of the computer nodes 1 in the clustered system is reset, however, communication between the computer nodes 1 fails. Because the technical feature of the present invention lies in how to maintain sharing of the memories 12 among the computer nodes 1, the steps for configuring memory sharing at the initial stage are not described in detail herein. After the clustered system completes the initial configuration of memory sharing among the computer nodes 1, it performs the memory sharing operation procedure.
Referring to FIGS. 1 and 2, the clustered system implements the memory sharing operation procedure of the method for maintaining memory sharing in a clustered system according to the present invention. The memory sharing operation procedure describes the steps performed by the computer nodes 1 once they have established sharing of the memories 12 with each other via the non-transparent bridge.
In step 21, while the connection state of the non-transparent bridge between the computer node 1 and the other computer node 1 is not the disconnected state, the other computer node 1 periodically, once per update interval, changes a piece of data to be updated that is stored in its memory 12. In this embodiment, the data to be updated is, for example, a predetermined number, and the other computer node 1 changes it by incrementing the number by one each time. In practice, the update step may be programmed into the application or the driver of the other computer node 1, so that the other computer node 1 periodically updates the data to be updated through the executed application or driver.
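As a minimal sketch of step 21, the periodic update can be modeled as a counter held in the shared memory window. The `SharedRegion` class, the 32-bit wrap-around, and the function names below are illustrative assumptions; the patent only states that a predetermined number is incremented by one at each update interval.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SharedRegion:
    """Hypothetical stand-in for the part of memory 12 visible through the NTB."""
    heartbeat: int = 0  # the "data to be updated" (a predetermined number)


def update_heartbeat(region: SharedRegion) -> int:
    """Step 21: change the data to be updated by incrementing it by one."""
    # Wrapping at 32 bits is an assumption; any width works as long as
    # consecutive updates always produce a different value.
    region.heartbeat = (region.heartbeat + 1) & 0xFFFFFFFF
    return region.heartbeat


def run_updates(region: SharedRegion, ticks: int) -> List[int]:
    """Simulate `ticks` update intervals; a real node would sleep between them."""
    return [update_heartbeat(region) for _ in range(ticks)]
```

In a real node this update would run from the application or the driver on a timer whose period is the update interval.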
Referring to FIGS. 1 and 3, the clustered system implements the memory sharing maintenance procedure of the method for maintaining memory sharing in a clustered system according to the present invention, which includes the following steps.
In step 30, the computer node 1 verifies, through the executed application, whether the connection state of the non-transparent bridge between the computer node 1 and the other computer node 1 is the disconnected state. When the computer node 1 verifies that the connection state is not the disconnected state, the process proceeds to step 31; when the computer node 1 verifies that the connection state is the disconnected state, the process proceeds to step 32.
Referring to FIGS. 1 and 4, step 30 includes the following sub-steps.
In sub-step 301, the computer node 1 detects, through the executed application, whether the connection state of the non-transparent bridge between the computer node 1 and the other computer node 1 is the disconnected state. When the computer node 1 detects that the connection state is the disconnected state, the process proceeds to sub-step 302; when the computer node 1 detects that the connection state is not the disconnected state, the process proceeds to sub-step 305. In one implementation of this embodiment, the computer node 1, through the executed application, determines whether the data to be updated currently read from the memory 12 of the other computer node 1 differs from the data to be updated previously read from the memory 12 of the other computer node 1. If the two values differ, the computer node 1 detects that the connection state of the non-transparent bridge between the computer node 1 and the other computer node 1 is not the disconnected state; if the two values are the same, the computer node 1 detects that the connection state is the disconnected state.
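The comparison described in sub-step 301 can be sketched as follows. The `HeartbeatMonitor` class and the `read_peer_value` callable are hypothetical names; how the very first read is treated is not stated in the source, so the sketch assumes it counts as "not disconnected".

```python
from typing import Callable, Optional


class HeartbeatMonitor:
    """Compares the current read of the peer's data with the previous read."""

    def __init__(self, read_peer_value: Callable[[], int]) -> None:
        # read_peer_value is a hypothetical accessor for the data to be
        # updated in the peer's memory 12, reached through the NTB window.
        self._read = read_peer_value
        self._previous: Optional[int] = None

    def is_disconnected(self) -> bool:
        current = self._read()
        previous, self._previous = self._previous, current
        if previous is None:
            # Assumption: with nothing to compare against, report "connected".
            return False
        # Same value twice in a row means the peer stopped updating.
        return current == previous
```

For example, with successive reads of 5, 6, 6, 7 the monitor reports connected, connected, disconnected, connected.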
Alternatively, referring to FIGS. 1 and 5, in another implementation of this embodiment, sub-step 301 includes the following sub-steps.
In sub-step 501, the computer node 1, through the executed application, transmits to the driver a query request relating to the connection state of the non-transparent bridge between the computer node 1 and the other computer node 1, so that the driver, in response to the query request, returns to the application connection data detected by the driver that indicates the connection state of the non-transparent bridge between the computer node 1 and the other computer node 1. The computer node 1, through the executed driver, detects the connection data from the data to be updated currently read from the memory 12 of the other computer node 1 and the data to be updated previously read from the memory 12 of the other computer node 1. If the two values differ, the computer node 1 detects that the connection state of the non-transparent bridge between the computer node 1 and the other computer node 1 is not the disconnected state and generates connection data indicating that it is not the disconnected state; if the two values are the same, the computer node 1 detects that the connection state is the disconnected state and generates connection data indicating that it is the disconnected state.
In this implementation, the computer node 1, through the executed driver, detects the connection data periodically, once per update interval, from the currently read and previously read data to be updated, so that the connection data is kept up to date and the computer node 1 can accurately verify whether the connection state of the non-transparent bridge between the computer node 1 and the other computer node 1 is the disconnected state. Alternatively, the driver may detect the connection data at a detection interval that is greater than or equal to the update interval; that is, each time the driver receives the query request from the application, it detects the connection data from the data to be updated currently read from the memory 12 of the other computer node 1 and the data to be updated previously read from the memory 12 of the other computer node 1.
In sub-step 502, after receiving the connection data from the driver, the computer node 1 determines, through the executed application, whether the connection state of the non-transparent bridge between the computer node 1 and the other computer node 1 is the disconnected state according to the connection data.
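Sub-steps 501 and 502 can be sketched as a query interface between the application and the driver. Everything here is an illustrative assumption: the `PollingDriver` class, the dictionary used as connection data, and the plain method call that stands in for whatever user-to-kernel mechanism (for example an ioctl) a real driver would expose.

```python
from typing import Callable, Dict, Optional


class PollingDriver:
    """Hypothetical kernel-space driver that caches the connection data."""

    def __init__(self, read_peer_value: Callable[[], int]) -> None:
        self._read = read_peer_value
        self._previous: Optional[int] = None
        self._disconnected = False

    def detect(self) -> None:
        """Refresh the connection data, once per update interval or per query."""
        current = self._read()
        self._disconnected = (
            self._previous is not None and current == self._previous
        )
        self._previous = current

    def query_connection(self) -> Dict[str, bool]:
        """Sub-step 501: answer the application's query request."""
        return {"disconnected": self._disconnected}


def application_check(driver: PollingDriver) -> bool:
    """Sub-step 502: judge the link state from the returned connection data."""
    return driver.query_connection()["disconnected"]
```

Keeping the comparison in the driver means the application never touches the peer's memory directly; it only interprets the connection data.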
Alternatively, in yet another implementation of this embodiment, the computer node 1, through the executed driver, detects the connection data periodically, once per update interval, from the data to be updated currently read from the memory 12 of the other computer node 1 and the data to be updated previously read from the memory 12 of the other computer node 1. When the driver detects connection data indicating that the connection state between the computer node 1 and the other computer node 1 is the disconnected state, it transmits to the application a disconnection notification indicating that the non-transparent bridge between the computer node 1 and the other computer node 1 is disconnected. The computer node 1, through the executed application, then detects whether the connection state of the non-transparent bridge between the computer node 1 and the other computer node 1 is the disconnected state by determining whether the disconnection notification has been received from the driver.
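The notification-based variant can be sketched with a callback in place of the query. The `NotifyingDriver` class and the `on_disconnect` callback are hypothetical; the source does not say how the driver delivers the notification to user space.

```python
from typing import Callable, List, Optional


class NotifyingDriver:
    """Hypothetical driver that pushes a disconnection notification."""

    def __init__(self, read_peer_value: Callable[[], int],
                 on_disconnect: Callable[[], None]) -> None:
        self._read = read_peer_value
        self._on_disconnect = on_disconnect  # stands in for the app-side handler
        self._previous: Optional[int] = None

    def detect(self) -> None:
        """Run once per update interval; notify only when the value stalls."""
        current = self._read()
        stalled = self._previous is not None and current == self._previous
        self._previous = current
        if stalled:
            self._on_disconnect()


def make_application_inbox() -> List[str]:
    """The application treats a non-empty inbox as 'disconnected detected'."""
    return []
```

A usage sketch: create an inbox, pass `lambda: inbox.append("disconnected")` as `on_disconnect`, and have the application check whether the inbox is non-empty instead of polling the driver.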
In sub-step 302, the computer node 1, through the executed application, increments a count value and determines whether the count value is greater than or equal to a predetermined value. When the computer node 1 determines that the count value is smaller than the predetermined value, the process proceeds to sub-step 303; when the computer node 1 determines that the count value is greater than or equal to the predetermined value, the process proceeds to sub-step 304. In this embodiment the predetermined value is 3; in other embodiments of the present invention it may be 2 or a value greater than 3, without limitation. Requiring the count value to reach the predetermined value means that the connection state between the computer node 1 and the other computer node 1 must be detected as disconnected several consecutive times before it is confirmed as truly disconnected rather than momentarily disconnected because of hardware instability. A momentary disconnection of that kind recovers quickly, so when the computer node 1 checks again after the detection interval, it can find that the connection has been restored. This prevents the detection from being overly sensitive, treating a momentary disconnection as a real one, and needlessly re-running the memory sharing configuration of the computer node 1.
In sub-step 303, the computer node 1, through the executed application, waits for the detection interval to elapse and then returns to sub-step 301.
In sub-step 304, the computer node 1, through the executed application, resets the count value and verifies that the connection state of the non-transparent bridge between the computer node 1 and the other computer node 1 is the disconnected state, and the process then proceeds to step 32.
In sub-step 305, the computer node 1, through the executed application, resets the count value, and the process then proceeds to step 31.
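Sub-steps 301 to 305 amount to a debounce loop, which might be sketched as below. The function and parameter names are illustrative assumptions; the threshold of 3 is the predetermined value given for this embodiment. Because both exits reset the count value, a local variable is enough in this sketch.

```python
from typing import Callable


def verify_disconnected(detect_once: Callable[[], bool],
                        wait_detection_interval: Callable[[], None],
                        threshold: int = 3) -> bool:
    """Return True only after `threshold` consecutive disconnected detections."""
    count = 0
    while True:
        if not detect_once():          # sub-step 301: link looks alive
            return False               # sub-step 305: count reset, go to step 31
        count += 1                     # sub-step 302: one more disconnected sample
        if count >= threshold:
            return True                # sub-step 304: count reset, go to step 32
        wait_detection_interval()      # sub-step 303: wait, then detect again
```

A glitch that clears before the third sample returns False, so the reset and sharing configuration procedures are not triggered by a momentary disconnection.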
In step 31, the computer node 1 waits for a predetermined time to elapse and then returns to step 30, wherein the predetermined time is greater than or equal to the update interval.
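One way to see why the waiting time should be at least the update interval is to simulate a healthy peer and count how often two consecutive reads would look identical. This is an illustrative model with integer time units, not part of the patent; `false_alarms` and its parameters are hypothetical names.

```python
def false_alarms(update_interval: int, detection_interval: int,
                 duration: int) -> int:
    """Count samples at which a live, healthy peer would look disconnected."""
    alarms = 0
    previous = None
    for t in range(0, duration, detection_interval):
        # Value a healthy peer holds at time t: one increment per update interval.
        current = t // update_interval
        if previous is not None and current == previous:
            alarms += 1  # identical consecutive reads despite a live link
        previous = current
    return alarms
```

With an update interval of 2 and a detection interval of 1 over 8 time units, the function counts 4 spurious detections; with a detection interval of 2 or 3 it counts none, because the peer is guaranteed to have updated at least once between reads.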
In step 32, the computer node 1, through the executed application, transmits to the driver a reset request instructing the driver to reinitialize the memory 12.
In step 33, the computer node 1, through the executed driver, executes a reset procedure associated with reinitializing the memory 12 in response to the reset request, to generate an initialization result associated with the memory 12 and transmit it to the other computer node 1.
In step 34, when the computer node 1 receives the initialization result that comes from the other computer node 1 and is associated with the other computer node's own memory 12, the computer node 1 executes, through the executed driver, a sharing configuration procedure associated with sharing the memories 12. Note that the reset procedure and the sharing configuration procedure executed by the computer node 1 through the executed driver are the same as those executed when sharing of the memories 12 was configured at the initial stage of the clustered system.
Note that after the other computer node 1 is reset (that is, after it is replaced with a new computer node, or is hot-plugged and then installed back into the clustered system), the other computer node 1 re-executes the reset procedure to generate another initialization result associated with its own memory 12 and transmits that result to the computer node 1. When the other computer node 1 receives the initialization result that comes from the computer node 1 and is associated with the memory 12 of the computer node 1, the other computer node 1 likewise executes, through its executed driver, the sharing configuration procedure associated with sharing the memories 12, which completes the sharing of the memories 12 between the two computer nodes 1. Both computer nodes 1 must execute the reset procedure and the sharing configuration procedure for the memories 12 to be shared successfully; if only one computer node 1 executes the reset procedure, memory sharing cannot be achieved.
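The two-sided exchange in steps 33 and 34 might be sketched as follows. The `InitResult` fields (a window base address and size), the `NodeDriver` class, and the assumption that results are exchanged only after both resets have completed are illustrative; the patent does not specify what the initialization result contains or how it is transported.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class InitResult:
    node: str
    window_base: int  # hypothetical content of an initialization result
    window_size: int


class NodeDriver:
    """Hypothetical driver-side state for one computer node."""

    def __init__(self, name: str, window_base: int, window_size: int) -> None:
        self.name = name
        self._base = window_base
        self._size = window_size
        self.local_result: Optional[InitResult] = None
        self.peer_result: Optional[InitResult] = None
        self.sharing = False

    def reset(self) -> InitResult:
        """Step 33: reinitialize the memory and produce an initialization result."""
        self.sharing = False
        self.peer_result = None
        self.local_result = InitResult(self.name, self._base, self._size)
        return self.local_result  # to be transmitted to the peer node

    def receive(self, result: InitResult) -> None:
        """Step 34: on receiving the peer's result, configure sharing."""
        if self.local_result is None:
            return  # this node has not run its own reset procedure yet
        self.peer_result = result
        self.sharing = True
```

A usage sketch: call `reset()` on both drivers, then pass each result to the other driver's `receive()`. If only one side resets, neither `sharing` flag becomes true, which mirrors the requirement that both nodes run both procedures.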
A second embodiment of the method for maintaining memory sharing in a clustered system according to the present invention is also implemented by the clustered system. The second embodiment is substantially the same as the first embodiment, and the common parts are not repeated here. The differences are that the second embodiment does not perform step 21, and that sub-step 301 includes a sub-step 601 and a sub-step 602.
Referring to FIGS. 1 and 6, sub-step 301 of the second embodiment of the method for maintaining memory sharing in a clustered system includes the following sub-steps.
In sub-step 601, the computer node 1, through the executed application, transmits a query request relating to the connection state of the non-transparent bridge between the computer node 1 and the other computer node 1 to the firmware via the driver, so that the firmware, in response to the query request, returns to the application via the driver connection data detected by the firmware that indicates the connection state of the non-transparent bridge between the computer node 1 and the other computer node 1. In this embodiment, the PCI switch 13 of the computer node 1, through the executed firmware, detects the connection data from status data that is held in the register 132 of the PCI switch 13 and relates to the connection state of the non-transparent bridge between the PCI switch 13 and the other computer node 1: the PCI switch 13 directly reads the status data from the register 132 and returns it to the application via the driver as the connection data. Note that the register 132 of the PCI switch 13 is physically connected to the serializer/deserializer 133. When communication between the computer nodes 1 is disconnected, the status data in the register 132 reflects that the connection state between the computer nodes 1 is the disconnected state; when communication between the computer nodes 1 is connected, the status data reflects that the connection state is not the disconnected state. The status data stored in the register 132 is maintained by the processor 131 electrically connected to the register 132.
In sub-step 602, after receiving the connection data from the firmware, the computer node 1 determines, through the executed application, whether the connection state of the non-transparent bridge between the computer node 1 and the other computer node 1 is the disconnected state according to the connection data.
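Sub-steps 601 and 602 can be sketched as a query that the driver passes through to the firmware, which returns the raw register contents. The bit layout is an assumption made purely for illustration: `LINK_UP_BIT`, the class names, and the callable that reads register 132 are all hypothetical, and a real PCIe switch defines its own status register format.

```python
from typing import Callable

LINK_UP_BIT = 0x1  # hypothetical: bit 0 set means the NTB link is up


class SwitchFirmware:
    """Hypothetical firmware running on processor 131 of PCI switch 13."""

    def __init__(self, read_register: Callable[[], int]) -> None:
        self._read_register = read_register  # stands in for reading register 132

    def connection_data(self) -> int:
        # The status data is returned unchanged as the connection data.
        return self._read_register()


class PassThroughDriver:
    """Hypothetical driver that relays the query to the firmware."""

    def __init__(self, firmware: SwitchFirmware) -> None:
        self._firmware = firmware

    def query_connection(self) -> int:
        """Sub-step 601: forward the query and return the firmware's answer."""
        return self._firmware.connection_data()


def is_disconnected(connection_data: int) -> bool:
    """Sub-step 602: the application interprets the connection data."""
    return (connection_data & LINK_UP_BIT) == 0
```

Unlike the first embodiment, this path needs no periodic update by the peer, which is consistent with the second embodiment omitting step 21.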
Alternatively, in another implementation of this embodiment, the PCI switch 13 of the computer node 1, through the executed firmware, periodically reads the status data at the detection interval or the update interval. When the firmware reads status data indicating that the connection state between the computer node 1 and the other computer node 1 is the disconnected state, the PCI switch 13 transmits to the application, via the driver, a disconnection notification indicating that the non-transparent bridge between the computer node 1 and the other computer node 1 is disconnected. The computer node 1, through the executed application, then detects whether the connection state of the non-transparent bridge between the computer node 1 and the other computer node 1 is the disconnected state by determining whether the disconnection notification has been received from the firmware via the driver.
In summary, in the method for maintaining memory sharing in a clustered system according to the present invention, the computer node 1 actively verifies, through the executed application, whether the connection state of the non-transparent bridge between the computer node 1 and the other computer node 1 is the disconnected state. When the connection state is verified to be the disconnected state, the application automatically requests the driver to execute the reset procedure associated with reinitializing the memory 12, and the sharing configuration procedure is executed after the initialization result from the other computer node 1 is received. Therefore, even if the other computer node 1 is reset (that is, it becomes abnormal and is replaced with a new computer node, is rebooted because of the abnormality, or is hot-plugged and then installed back into the clustered system), the computer node 1 can automatically determine the connection state between itself and the other computer node 1 and reconnect, thereby automatically repairing the connection state between the two computer nodes 1 and maintaining sharing of the memories 12.
In addition, the computer node 1 can verify the connection state between itself and the other computer node 1 in any of several ways: the executed application can directly and periodically compare the currently read data to be updated with the previously read data to be updated; the executed application can periodically query the connection data detected by the driver; the driver can actively transmit the disconnection notification to the application, which determines the connection state by whether the notification has been received; the executed application can periodically query, through the driver, the connection data detected by the firmware; or the firmware can actively transmit the disconnection notification to the application through the driver, and the application determines the connection state by whether the notification has been received. In each case the object of the present invention is achieved.
The above describes only specific embodiments of the present invention, but the scope of protection of the present invention is not limited thereto. Any change or substitution that a person skilled in the art can readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention. Therefore, the scope of protection of the present invention shall be determined by the scope of the claims.

Claims (9)

1. A method for maintaining memory sharing in a clustered system, implemented by the clustered system, the clustered system including a computer node and another computer node connected to the computer node via a non-transparent bridge for sharing memory, each computer node including a central processing unit (CPU), a memory, an application program executed by the CPU and running in user space, a driver executed by the CPU and running in kernel space, and a peripheral component interconnect switch electrically connected to the CPU and executing firmware, the method comprising the following steps:
(A) the computer node verifies, through the executed application program, whether the connection state of the non-transparent bridge between the computer node and the other computer node is a disconnected state;
(B) when the computer node verifies that the connection state of the non-transparent bridge between the computer node and the other computer node is the disconnected state, the computer node transmits, through the executed application program, a reset request to the driver of the computer node, the reset request instructing the driver to reinitialize the memory of the computer node;
(C) the computer node, through its executed driver and in response to the reset request, executes a reset program related to reinitializing the memory, so as to generate an initialization result related to the memory; and
(D) when the computer node receives, from the other computer node, an initialization result related to the memory of the other computer node, the computer node executes, through its executed driver, a sharing configuration program related to sharing the memories.
2. The method as claimed in claim 1, further comprising, after step (A), a step (E): when the computer node verifies that the connection state of the non-transparent bridge between the computer node and the other computer node is not the disconnected state, the computer node waits for a preset time and then returns to step (A).
3. The method of claim 1, wherein step (A) comprises the following sub-steps:
(A-1) the computer node detects, through the executed application program, whether the connection state of the non-transparent bridge between the computer node and the other computer node is a disconnected state;
(A-2) when the computer node detects that the connection state of the non-transparent bridge between the computer node and the other computer node is the disconnected state, the computer node increments a count value through the executed application program and determines whether the count value is greater than or equal to a preset value;
(A-3) when the computer node determines that the count value is less than the preset value, the computer node repeats step (A-1) after waiting for a detection interval time through the executed application program;
(A-4) when the computer node determines that the count value is greater than or equal to the preset value, the computer node resets the count value through the executed application program and verifies that the connection state of the non-transparent bridge between the computer node and the other computer node is the disconnected state; and
(A-5) when the computer node detects that the connection state of the non-transparent bridge between the computer node and the other computer node is not the disconnected state, the computer node resets the count value through the executed application program and repeats step (A-1) after waiting for a preset time.
4. The method as claimed in claim 3, further comprising the step of:
(F) when the connection state of the non-transparent bridge between the computer node and the other computer node is not the disconnected state, the other computer node periodically changes data to be updated that is stored in its memory, at an update interval time that is less than or equal to the detection interval time;
wherein, in step (A-1), the computer node determines, through the executed application program, whether the data to be updated currently read from the memory of the other computer node differs from the data to be updated last read from the memory of the other computer node, so as to detect whether the connection state of the non-transparent bridge between the computer node and the other computer node is the disconnected state.
5. The method of claim 3, wherein the step (A-1) comprises the following sub-steps:
(A-1-1) the computer node transmits, through its executed application program, a query request regarding the connection state of the non-transparent bridge with the other computer node to the driver of the computer node, so that the driver, in response to the query request, returns to the application program connection data that is detected by the driver and indicates the connection state of the non-transparent bridge with the other computer node; and
(A-1-2) after receiving the connection data from the driver through the executed application program, the computer node determines, according to the connection data, whether the connection state of the non-transparent bridge between the computer node and the other computer node is the disconnected state.
6. The method of claim 5, further comprising the step of:
(F) when the connection state of the non-transparent bridge between the computer node and the other computer node is not the disconnected state, the other computer node periodically changes data to be updated that is stored in its memory, at an update interval time that is less than or equal to the detection interval time;
wherein, in step (A-1-1), the computer node obtains the connection data, through the executed driver, according to the data to be updated currently read from the memory of the other computer node and the data to be updated last read from the memory of the other computer node.
7. The method as claimed in claim 3, wherein in step (A-1), the computer node determines whether it has received, from its own driver, a disconnection notification that is detected by the driver and indicates that the non-transparent bridge with the other computer node is disconnected, so as to detect whether the connection state of the non-transparent bridge with the other computer node is the disconnected state.
8. The method of claim 3, wherein the step (A-1) comprises the following sub-steps:
(A-1-1) the computer node transmits, through the executed application program and via its driver, a query request to the firmware regarding the connection state of the non-transparent bridge with the other computer node, so that the firmware, in response to the query request, returns to the application program, via the driver, connection data that is detected by the firmware and indicates the connection state of the non-transparent bridge with the other computer node; and
(A-1-2) after receiving the connection data from the firmware through the executed application program, the computer node determines, according to the connection data, whether the connection state of the non-transparent bridge with the other computer node is the disconnected state.
9. The method as claimed in claim 3, wherein in step (A-1), the computer node determines, through the executed application program, whether a disconnection notification indicating that the non-transparent bridge with the other computer node is disconnected has been received via the driver of the computer node, so as to detect whether the connection state of the non-transparent bridge with the other computer node is the disconnected state.
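For illustration only, the counting logic in the sub-steps of claim 3 can be sketched as a small debounce loop. The function name `verify_disconnected` and its parameters are hypothetical; `detect_down` stands for whichever single detection the node uses, `threshold` for the preset value, and the injectable `sleep` is an assumption made so the sketch can be exercised without real delays. Waiting the preset time before the next verification is left to the caller.

```python
import time
from typing import Callable


def verify_disconnected(detect_down: Callable[[], bool],
                        threshold: int,
                        detection_interval: float,
                        sleep: Callable[[float], None] = time.sleep) -> bool:
    """Return True only after `threshold` consecutive 'down' detections."""
    count = 0
    while True:
        if not detect_down():
            # Link is up: the local counter is discarded on return, which
            # plays the role of resetting the count value.
            return False
        count += 1                  # one more consecutive 'down' detection
        if count >= threshold:
            # Enough consecutive detections: treat the link as disconnected.
            return True
        sleep(detection_interval)   # wait, then detect again
```

Requiring several consecutive detections keeps a single transient read failure from triggering a memory reset.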
CN201910164001.2A 2019-03-05 2019-03-05 Method for maintaining memory sharing in clustered system Active CN111666231B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910164001.2A CN111666231B (en) 2019-03-05 2019-03-05 Method for maintaining memory sharing in clustered system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910164001.2A CN111666231B (en) 2019-03-05 2019-03-05 Method for maintaining memory sharing in clustered system

Publications (2)

Publication Number Publication Date
CN111666231A CN111666231A (en) 2020-09-15
CN111666231B true CN111666231B (en) 2023-02-10

Family

ID=72381252

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910164001.2A Active CN111666231B (en) 2019-03-05 2019-03-05 Method for maintaining memory sharing in clustered system

Country Status (1)

Country Link
CN (1) CN111666231B (en)

Also Published As

Publication number Publication date
CN111666231A (en) 2020-09-15

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant