WO2020088351A1 - Method for sending device information, computer device and distributed computer device system - Google Patents

Publication number
WO2020088351A1
Authority
WO
WIPO (PCT)
Prior art keywords
computer device
reset
information
node
power
Prior art date
Application number
PCT/CN2019/113147
Other languages
French (fr)
Chinese (zh)
Inventor
岑月宁
Original Assignee
华为技术有限公司
Application filed by 华为技术有限公司

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L65/00: Network arrangements, protocols or services for supporting real-time applications in data packet communication
    • H04L65/40: Support for services or applications

Definitions

  • a system composed of a group of computers using distributed computing is called a distributed computing system.
  • the distributed computing system divides the project data that needs a large amount of calculation into small pieces, which are calculated by multiple computing nodes, such as a server with a computing function, and then the results of the calculation are unified and merged to obtain the data conclusion.
  • the message sending unit is configured to generate, based on the reset information of the computer device, a reset notification message that contains the reset information, and to send the reset notification message to other computer devices in the distributed system.
  • the message sending unit may send the notification message through a private network or a public network.
  • the operating system in the computer device includes a notification chain for reset events; before the computer device is reset, the processor obtains the reset information of the computer device from the notification chain through a preset function.
  • the message sending unit is a baseboard management controller BMC, and the BMC further includes a notification module;
  • the method further includes:
  • FIG. 1 is a schematic structural diagram of a distributed system provided by an embodiment of this application.
  • a notification message is sent to the notification chain, and the notification chain may be a notification chain about resetting.
  • the reset detection module 1011 registered on the notification chain can obtain the information that the node 100 is about to be reset by calling a callback function.
  • the reset detection module 1011 may transmit the reset information of the node 100 to the microcode module 1021 in the interface unit 102 through a communication channel between the control unit 101 and the interface unit 102, for example, a PCIE 3.0 communication channel.
  • the notification message sent by the reset detection module 1011 to the microcode module 1021 includes the identifier of the node 100 and the information that the node 100 is about to be reset. After the reset detection module 1011 transmits the reset information of the node 100 to the microcode module 1021 in the interface unit 102, it can notify the reset module of the operating system of the node 100 to start the reset.
  • microcode module 1021 may send a reset notification message to node 200 and node 300 through the public network.
  • the reset notification message may be a directional message
  • the reset notification message sent by the microcode module 1021 may carry the IP address of the node 100, the IP addresses of the node 200 and the node 300, and so on. It can be understood that the transmission of the notification message through the private network is more efficient and real-time than the transmission of the notification message through the public network.
  • the power module 103 can trigger the pin transition of the interface unit 102 to transmit the information that the node 100 is powered off to the interface unit 102.
  • the power module 103 generates the PS_OK signal by triggering the transition of the pin, and triggers the transition of the pin of the interface unit 102 by the PS_OK signal.
  • the power module 103 may also trigger the pins defined by the microcode module 1021 to implement the transmission of power-down information.
  • the interface unit 102 generates a reset notification message according to the reset information of the node 100 acquired from the BMC 104, and sends it to the node 200 and the node 300.
  • the manner in which the interface unit 102 sends the reset information of the node 100 to the node 200 and the node 300 is similar to the manner in which the interface unit 102 in FIG. 2 and FIG. 3 is sent, and will not be described repeatedly.
  • the reset detection module 1011 can be registered on the notification chain provided by the Linux operating system. When a piece of software running in the node 100 is to be reset, a notification will be sent to the notification chain.
  • the reset detection module 1011 registered on the notification chain learns through a callback function that the node 100 is about to be reset, and transmits the reset information of the node 100 to the notification module 1041 in the BMC 104 over the communication channel between the control unit and the BMC, which may be, for example, a PCIE 3.0 communication channel.
  • when the notification module 1041 obtains the power-down information of the node 100, the node 100 is about to lose power. If the notification module 1041 cannot generate a power-down notification message quickly, the message may fail to be sent because the node 100 has already lost power. To speed up sending, the information required by the power-down notification message may be configured in the notification module 1041 when the node 100 is initialized after power-on. In this way, the notification module 1041 can quickly generate and send a power-down notification message from the configured information when it obtains the power-down information of the node 100.
  • the power-off method can improve the efficiency of computer equipment reset or power-off information acquisition in a distributed system, and can avoid misjudgment.
  • Step S100 The processor in the computer device obtains the reset information of the computer device before resetting the computer device, and transmits the reset information of the computer device to the message sending unit in the computer device;
  • Step S300 The message sending unit generates, according to the reset information of the computer device, a reset notification message containing the reset information, and sends the reset notification message to other computer devices in the distributed system in which the computer device is located.
  • the above method may be implemented by a computer device in a distributed system.
  • a computer device in a distributed system.
  • reference may be made to the implementation manner of the node 100 in FIG. 2 to FIG. 8 described above, and details are not described herein again.
  • the preset function is a callback function, and the callback function is registered on the notification chain;
  • the message sending unit is a BMC in the computer device.
  • the above method can obtain the power-down information of the device and generate a power-down notification message containing the power-down information according to the obtained power-down information of the device, and send it to other devices in the distributed system.
  • the device power-off information notifies other devices in the distributed system. Compared with the prior art method of detecting whether other devices are powered off by heartbeat, not only can the efficiency of power-off information transmission be improved, but also the occurrence of misjudgment can be avoided.

Abstract

Provided in the present application are a computer device, a distributed computer device system and a method for sending device information so as to solve the problem of real-time performance being poor when using heartbeat to detect failures in distributed nodes. The method provided in the present application comprises: acquiring information for resetting a computer device; according to the acquired information for resetting the computer device, generating a reset notification message which contains the resetting information; and sending the reset notification message to other devices in a distributed system so as to rapidly notify the other devices in the distributed system of the resetting information of the present device. Compared with the prior art, a manner of using heartbeat to detect whether other devices have been reset may improve efficiency in the transmission of resetting information, and may also prevent the occurrence of misjudgment.

Description

Method for sending device information, computer device and distributed computer device system
This application claims priority to Chinese patent application No. 201811632716.8, filed with the China National Intellectual Property Administration on December 29, 2018 and entitled "Method for sending device information, computer device and distributed computer device system", which in turn claims priority to Chinese patent application No. 201811294576.8, filed with the Chinese Patent Office on November 1, 2018 under the same title, the entire contents of which are incorporated in this application by reference.
Technical field
This application relates to the field of information technology, and in particular to a method for sending device information, a computer device, and a distributed computer device system.
Background
Distributed systems generally include distributed computing systems and distributed storage systems.
A system composed of a group of computers that perform distributed computing is called a distributed computing system. Such a system splits project data that requires a large amount of computation into small pieces, has multiple computing nodes (for example, servers with computing capability) process the pieces separately, and then merges the uploaded results to reach a conclusion.
A distributed storage system stores data across multiple independent devices. It adopts a scalable architecture in which multiple storage nodes, such as storage servers, share the storage load, and a location server is used to locate stored information. A distributed storage system not only improves reliability, availability and access efficiency, but is also easy to scale.
Computing nodes in a distributed computing system and storage nodes in a distributed storage system are collectively referred to as distributed nodes.
At present, the industry can detect a reset fault or power-down fault of a distributed node only through heartbeats between nodes. Heartbeat-based detection is prone to misjudgment and has poor real-time performance, so it cannot meet the service switchover requirements of high-end scenarios such as banking.
Summary of the invention
Embodiments of this application provide a computer device, a distributed computer device system, and a method for sending device information, to address the poor real-time performance of heartbeat-based detection of distributed node faults.
According to a first aspect, an embodiment of this application provides a computer device, where the computer device is a computer device in a distributed system.
The computer device includes a processor and further includes a message sending unit, where the message sending unit is connected to the processor through a bus.
The processor is configured to: when the computer device is reset, obtain reset information of the computer device and transmit the reset information to the message sending unit.
The message sending unit is configured to generate, based on the reset information, a reset notification message that contains the reset information of the computer device, and to send the reset notification message to other computer devices in the distributed system.
The computer device obtains its own reset information, generates a reset notification message containing that information, and sends the message to the other devices in the distributed system, so the other devices are notified of the reset quickly. Compared with the prior-art approach of probing with heartbeats whether another device has been reset, this improves the efficiency with which reset information is delivered. Furthermore, because no preset threshold needs to be configured for heartbeat detection, misjudgments caused by an improperly set threshold are avoided.
Optionally, the computer device may be a computing server device or a storage server device.
Optionally, the computer device may further include a main memory, an auxiliary memory, or the like.
Optionally, the computer device may communicate with other devices in the distributed system over a network such as Gigabit Ethernet (GE) or InfiniBand (IB).
Optionally, the message sending unit may send the notification message over a private network or a public network.
Optionally, the message sending unit may send the notification message to the other computer devices in the distributed system as a directed message. Alternatively, the message sending unit may send the notification message as a broadcast message.
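The two delivery modes described above can be sketched as follows. This is a minimal illustration, not the message format used in the application: the JSON payload, the names ResetNotification and plan_delivery, the peer addresses, and the use of the IPv4 limited-broadcast address are all assumptions made for the example. The function only plans (destination, payload) pairs; actual transmission is left out.

```python
import json
from dataclasses import dataclass

# Assumption: IPv4 limited-broadcast address stands in for "broadcast message".
BROADCAST_ADDR = "255.255.255.255"


@dataclass(frozen=True)
class ResetNotification:
    """Hypothetical reset notification: who is resetting and what happened."""
    node_id: str
    event: str = "reset"

    def encode(self) -> bytes:
        # Assumption: JSON is used only to keep the example readable.
        body = {"node_id": self.node_id, "event": self.event}
        return json.dumps(body, sort_keys=True).encode("utf-8")


def plan_delivery(msg, peers, broadcast=False):
    """Return (destination, payload) pairs for a directed or broadcast send."""
    payload = msg.encode()
    if broadcast:
        # One packet addressed to everyone on the segment.
        return [(BROADCAST_ADDR, payload)]
    # Directed mode: one copy per known peer in the distributed system.
    return [(peer, payload) for peer in peers]
```

In directed mode the sender must already know the addresses of its peers, whereas broadcast mode needs no peer list but reaches only devices on the same broadcast domain.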
Optionally, the processor may be a central processing unit (CPU), and the CPU may be an x86 CPU, an ARM (advanced reduced instruction set computing machines) CPU, or the like.
Optionally, the message sending unit may be a Peripheral Component Interconnect Express (PCIe) smart network interface card. The CPU and the message sending unit may be connected through a PCIe bus.
In a possible implementation of the first aspect, the operating system of the computer device includes a notification chain for reset events; before the computer device is reset, the processor obtains the reset information of the computer device from the notification chain through a preset function.
In a possible implementation of the first aspect:
the operating system of the computer device includes a reset detection module;
the reset detection module is registered on a reset notification chain in the operating system of the computer device, and obtains, through a callback function, the information on the reset notification chain about the reset of the computer device; and
the processor obtains the reset information of the computer device through the reset detection module.
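The registration-and-callback pattern described above can be illustrated with a small user-space simulation. It is only a sketch under stated assumptions: NotifierChain and ResetDetector are hypothetical names, the event string "reset" is invented for the example, and a real implementation would hook into the operating system's own notification chain rather than a Python list.

```python
class NotifierChain:
    """Toy notification chain: registered callbacks run when an event fires."""

    def __init__(self):
        self._callbacks = []

    def register(self, callback):
        self._callbacks.append(callback)

    def notify(self, event, data=None):
        # Every registered callback sees every event, in registration order.
        for callback in self._callbacks:
            callback(event, data)


class ResetDetector:
    """Hypothetical reset detection module registered on the chain."""

    def __init__(self, node_id, forward):
        self.node_id = node_id
        self.forward = forward  # hands reset info to the message sending unit

    def on_event(self, event, data):
        # Only reset events are forwarded; other events are ignored.
        if event == "reset":
            self.forward({"node_id": self.node_id, "event": "reset"})


# Usage: the list `sent` stands in for the message sending unit.
sent = []
chain = NotifierChain()
detector = ResetDetector("node-100", sent.append)
chain.register(detector.on_event)
chain.notify("reset")    # the OS announces that a reset is about to start
chain.notify("suspend")  # unrelated event, not forwarded
```

The point of the pattern is that the detector does not poll: it is told about the reset before the reset happens, which is what lets the notification leave the device in time.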
In a possible implementation of the first aspect:
the message sending unit includes a microcode module; and
the microcode module is configured to generate the reset notification message and send it to the other computer devices in the distributed system.
Optionally, the microcode module may be implemented in firmware (FW).
Optionally, the reset detection module is a module in the operating system run by the computer device.
Optionally, the reset detection module sends the reset information of the computer device to the microcode module using a private protocol between the two modules.
In a possible implementation of the first aspect, the computer device further includes a baseboard management controller (BMC), and the processor transmits the reset information of the computer device to the message sending unit through the BMC.
In a possible implementation of the first aspect, the computer device further includes a power module, where the power module is connected to the message sending unit through a general-purpose input/output (GPIO) pin;
the power module is configured to: when the computer device loses power, convey power-down information of the computer device to the message sending unit by triggering a transition on the pin; and
the message sending unit is further configured to generate, based on the power-down information, a power-down notification message containing the power-down information of the computer device, and to send the power-down notification message to the other computer devices in the distributed system. The computer device can thus generate a power-down notification message from its own power-down information and send it to the other devices in the distributed system, so that they are notified of the power-down quickly. Compared with the prior-art approach of probing with heartbeats whether another device has lost power, this improves the efficiency with which power-down information is delivered. Furthermore, because no preset threshold needs to be configured for heartbeat detection, misjudgments caused by an improperly set threshold are avoided.
Optionally, the power module may trigger the transition on the GPIO pin by means of a power-down indication signal.
Optionally, the power-down indication signal may be a PS_OK signal or another signal that indicates loss of mains power.
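The pin-transition mechanism can be sketched as an edge detector that fires a pre-built message. The sketch rests on several assumptions that the text does not state: PS_OK is treated as high while power is good, a high-to-low transition is taken to mean power loss, pin levels arrive as sampled integers, and the class name PowerDownNotifier and the payload format are hypothetical. Building the payload at initialization reflects the idea that little time remains once power starts to fail.

```python
class PowerDownNotifier:
    """Hypothetical message sending unit reacting to a PS_OK-style pin."""

    def __init__(self, node_id, send):
        self._send = send
        # Payload is prepared up front so nothing has to be built
        # in the short window after power begins to fail.
        self._prebuilt = f"power-down:{node_id}".encode("utf-8")
        # Assumption: the pin reads 1 while power is good.
        self._last_level = 1

    def on_pin_sample(self, level):
        """Feed one sampled pin level; send on a 1 -> 0 transition."""
        falling = self._last_level == 1 and level == 0
        self._last_level = level
        if falling:
            self._send(self._prebuilt)
        return falling
```

Only the transition triggers a send, so a pin that stays low does not produce repeated notifications.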
In a possible implementation of the first aspect:
the message sending unit is a baseboard management controller (BMC), and the BMC further includes a notification module; and
the notification module generates the reset notification message based on the reset information of the computer device obtained by the BMC, and sends the reset notification message to the other computer devices in the distributed system.
Optionally, the notification module in the BMC sends the reset notification message to the BMCs in the other computer devices of the distributed system.
Optionally, the notification module in the BMC may send the reset notification message to the BMCs in the other computer devices through an out-of-band system.
In a possible implementation of the first aspect:
the computer device further includes a power module;
the power module is further configured to convey power-down information of the computer device to the BMC by triggering a pin transition when the computer device loses power; and
the notification module is further configured to generate, based on the obtained power-down information, a power-down notification message containing the power-down information of the computer device, and to send the power-down notification message to the other computer devices in the distributed system.
According to a second aspect, an embodiment of this application provides a distributed computer device system that includes at least two computer devices according to the first aspect.
Optionally, the distributed device system may be a distributed computing system, a distributed storage system, a distributed hybrid system, or the like, where a distributed hybrid system is a system that includes both computing devices and storage devices.
According to a third aspect, an embodiment of this application provides a method for sending device information, the method including:
a processor in a computer device obtains reset information of the computer device before the computer device is reset, and transmits the reset information to a message sending unit in the computer device;
the message sending unit receives the reset information of the computer device; and
the message sending unit generates, based on the reset information, a reset notification message containing the reset information of the computer device, and sends the reset notification message to other computer devices in the distributed system in which the computer device is located.
The method obtains the device's own reset information, generates a reset notification message containing that information, and sends the message to the other devices in the distributed system, so the other devices are notified of the reset quickly. Compared with the prior-art approach of probing with heartbeats whether another device has been reset or has lost power, this improves the efficiency with which reset information is delivered. Furthermore, because no preset threshold needs to be configured for heartbeat detection, misjudgments caused by an improperly set threshold are avoided.
Optionally, in a possible implementation of the third aspect, the processor obtains the reset information of the computer device through a preset function and a notification chain for reset events in the operating system of the computer device.
In a possible implementation of the third aspect, the method further includes:
the preset function is a callback function, and the callback function is registered on the notification chain; and
the processor obtaining the reset information of the computer device through the preset function and the notification chain for reset events in the operating system of the computer device includes:
the processor obtains the reset information of the computer device from the notification chain through the callback function.
In a possible implementation of the third aspect, the processor transmits the reset information of the computer device to the message sending unit through the BMC in the computer device.
In a possible implementation of the third aspect, the message sending unit is a BMC in the computer device.
In a possible implementation of the third aspect, when the computer device loses power, the message sending unit obtains power-down information of the computer device through a pin transition, generates a power-down notification message containing the power-down information, and sends the power-down notification message to the other computer devices in the distributed system.
The method generates a power-down notification message from the device's own power-down information and sends it to the other devices in the distributed system, so the other devices are notified of the power-down quickly. Compared with the prior-art approach of probing with heartbeats whether another device has lost power, this improves the efficiency with which power-down information is delivered. Furthermore, because no preset threshold needs to be configured for heartbeat detection, misjudgments caused by an improperly set threshold are avoided.
According to a fourth aspect, an embodiment of this application provides a computer program product, where the computer program product includes a computer program stored in a computer-readable storage medium, and the computer program is loaded by a controller to implement the method of the third aspect or of any possible implementation of the third aspect.
According to a fifth aspect, an embodiment of this application provides a non-volatile computer-readable storage medium for storing a computer program, where the computer program is loaded by a processor to execute the instructions of the method of the third aspect or of any possible implementation of the third aspect.
According to a sixth aspect, an embodiment of this application provides a chip, where the chip includes programmable logic circuits and/or program instructions and, when running, implements the method of the third aspect or of any possible implementation of the third aspect.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the accompanying drawings needed for describing the embodiments are briefly introduced below. Clearly, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
FIG. 1 is a schematic structural diagram of a distributed system according to an embodiment of this application;
FIG. 2 is a schematic structural diagram of an implementation of a distributed system according to an embodiment of this application;
FIG. 3 is a schematic structural diagram of a specific implementation of the distributed system shown in FIG. 2;
FIG. 4 is a schematic structural diagram of another specific implementation of the distributed system shown in FIG. 2;
FIG. 5 is a schematic structural diagram of another implementation of a distributed system according to an embodiment of this application;
FIG. 6 is a schematic structural diagram of another implementation of a distributed system according to an embodiment of this application;
FIG. 7 is a schematic structural diagram of a specific implementation of the distributed system shown in FIG. 6;
FIG. 8 is a schematic structural diagram of another specific implementation of the distributed system shown in FIG. 6;
FIG. 9A is a schematic structural diagram of a computer device 900 according to an embodiment of this application;
FIG. 9B is a schematic structural diagram of another implementation of the computer device 900 according to an embodiment of this application;
FIG. 10 is a schematic structural diagram of another implementation of the computer device 900 according to an embodiment of this application;
FIG. 11 is a schematic structural diagram of another implementation of the computer device 900 according to an embodiment of this application;
FIG. 12 is a schematic flowchart of a method for sending device information according to an embodiment of this application.
具体实施方式detailed description
下面结合附图,对本发明的实施例进行描述。The following describes the embodiments of the present invention with reference to the drawings.
图1为一种分布式系统的一种结构示意图,该分布式系统包括节点1、节点2和节点3。该分布式系统可以是分布式计算系统,也可以分布式存储系统。相应的,节点1、节点2和节点3可以是计算节点,也可以是存储节点。本申请实施例中,节点可以是计算机设备,例如可以是计算服务器或存储服务器等;节点也可以是其它具有电子信息处理能力的设备,例如具有信息通信能力的设备等。FIG. 1 is a schematic structural diagram of a distributed system including node 1, node 2 and node 3. The distributed system may be a distributed computing system or a distributed storage system. Correspondingly, node 1, node 2 and node 3 may be computing nodes or storage nodes. In the embodiment of the present application, the node may be a computer device, for example, a computing server or a storage server; the node may also be another device with electronic information processing capabilities, such as a device with information communication capabilities.
Optionally, the distributed system in the embodiments of this application may also be a cluster system. The cluster system includes two or more nodes, and each node may run cluster management software for managing the nodes in the cluster. Cluster management software is software that manages the nodes of a distributed system, for example by reporting the status of each node and isolating faulty nodes. The cluster management software can collect the heartbeat detection results of all nodes in the distributed system and, taking them together, determine whether a node is faulty or abnormal and whether a service switchover is required.
While a distributed system is running, heartbeat detection is typically used to determine whether a node is abnormal. For example, each node sends probe packets (such as ping packets) to the other nodes and determines whether a node is abnormal from the delay of the ping responses. Taking the system shown in FIG. 1 as an example, node 2 sends ping packets to node 1. Each packet may be 1024 bytes or 64 bytes in size. Node 2 measures the delay with which node 1 responds to each ping packet. If a delay exceeds a preset threshold, for example 2 seconds, node 2 then counts the packets within a preset period whose delay exceeded that threshold; if this count exceeds a preset limit, for example 5, node 2 determines that node 1 is abnormal. Likewise, node 3 determines whether node 1 is abnormal by sending ping packets to node 1. If node 3 also determines that node 1 is abnormal, node 1 is confirmed to be abnormal, which completes the heartbeat detection of node 1.
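The two-stage heartbeat decision described above can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the names `HeartbeatPolicy`, `peer_considers_abnormal`, and `node_is_abnormal` are hypothetical, the default values are the example figures from the text (2 seconds, 5 packets), and treating a missing response (`None`) as a slow packet is an added assumption.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class HeartbeatPolicy:
    delay_threshold_s: float = 2.0  # example delay threshold from the text
    slow_packet_limit: int = 5      # example count limit from the text


def peer_considers_abnormal(delays_s: List[Optional[float]],
                            policy: HeartbeatPolicy = HeartbeatPolicy()) -> bool:
    """Judge one observer's view from the delays seen in one preset period.

    A delay of None stands for "no response" and is counted as slow
    (an assumption of this sketch).
    """
    slow = sum(1 for d in delays_s
               if d is None or d > policy.delay_threshold_s)
    return slow > policy.slow_packet_limit


def node_is_abnormal(delays_by_observer: Dict[str, List[Optional[float]]],
                     policy: HeartbeatPolicy = HeartbeatPolicy()) -> bool:
    """The node is confirmed abnormal only if every observer agrees."""
    return bool(delays_by_observer) and all(
        peer_considers_abnormal(delays, policy)
        for delays in delays_by_observer.values())
```

With these example values, an observer needs at least six slow responses in a period before it flags the node, which shows why this style of detection takes several seconds to converge.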
In FIG. 1, if node 1 resets, the heartbeat detection between node 2 and node 1 becomes abnormal, as does the heartbeat detection between node 3 and node 1. The cluster management software in node 2 and node 3 can then determine from the heartbeat detection results that node 1 has failed and perform management operations such as service switchover. Heartbeat detection of this kind requires probe packets to be sent periodically and needs a relatively long period to decide whether an abnormality exists. Typically the detection takes 5.5 seconds or more (an end-to-end service switchover takes 6 to 8 seconds). Such detection has poor real-time performance and cannot meet the service switchover requirements of high-end scenarios (such as banking).
The embodiments of this application provide a computer device, a distributed computer device system, and a method for sending device information, to solve the problem that detecting distributed node failures through heartbeat detection has poor real-time performance.
FIG. 2 is a schematic structural diagram of an implementation of a distributed system according to an embodiment of this application. As shown in FIG. 2, the distributed system includes a node 100, a node 200, and a node 300. The node 100 includes a control unit 101, an interface unit 102, and a power module 103; the node 200 includes a control unit 201, an interface unit 202, and a power module 203; the node 300 includes a control unit 301, an interface unit 302, and a power module 303. It can be understood that FIG. 2 illustrates the number of nodes and the components of each node only to make the technical solution of this application easier to describe. In a specific implementation, the system may include more nodes, and a node may include other components, for example a main memory (such as random access memory (RAM)) and an auxiliary memory (such as a hard disk), which are not listed one by one here.
The nodes shown in FIG. 2 may communicate over a network such as GE (Gigabit Ethernet) or IB (InfiniBand). The embodiments of this application do not limit the specific network protocol or network form. Nodes communicate with each other through their interface units; for example, the node 100 and the node 200 communicate through the interface unit 102 and the interface unit 202.
Taking the node 100 as an example, the control unit 101 may be a processor, for example a CPU, including but not limited to an x86 CPU or an ARM CPU; the control unit 101 may also be a heterogeneous processor or the like. The interface unit 102 may be a PCIe smart network interface card or another network interface card device. The power module 103 is a device that supplies power to the node 100 and may be the power supply of the node 100. For example, the power module 103 may be a hardware module that converts the 220 V mains input into a 12 V supply usable by the other components of the node 100. The control unit 101 and the interface unit 102 may be connected by a bus (for example, a PCIe bus), and the power module 103 may be connected to a GPIO pin of the interface unit 102. When the node 100 resets, for example when the node 100 is preparing to reset, the control unit 101 may send an indication to the interface unit 102 to pass the information that the node 100 is resetting to the interface unit 102. Once the control unit 101 has sent the indication to the interface unit 102, the node 100 can start the reset.
It can be understood that the indication sent by the control unit 101 to the interface unit 102 may take the form of a command, a message, a packet, a hardware signal, or the like. The embodiments of this application do not limit the specific form of this indication; any manner in which the control unit 101 can pass the reset information of the node 100 to the interface unit 102 falls within the scope of the embodiments of this application.
Optionally, a reset of the node 100 in the embodiments of this application may be a restart of the node 100; correspondingly, the reset information of the node 100 may be information that the node 100 is restarting.
Based on the received indication that the node 100 is resetting, the interface unit 102 sends the reset information of the node 100 to the node 200 and the node 300 over the network. For example, the interface unit 102 may send over the network a reset notification message that includes the identifier of the node 100 and the information that the node 100 is resetting. After the node 200 and the node 300 obtain the reset notification message sent by the node 100, they can perform the corresponding management operations through the cluster management software. Taking the node 200 as an example, after the node 200 receives the reset notification message sent by the node 100, the management module in the node 200 (for example, the cluster management software) starts the corresponding processing, including but not limited to starting the procedure for isolating the node 100, so as to avoid faults caused by continuing to access the node 100.
With the above method, when the node 100 resets, the interface unit 102, which is responsible for sending the reset notification message, obtains the reset information of the node 100 and sends it to the other nodes in the distributed system, so that the node 200 and the node 300 quickly learn that the node 100 is resetting. Compared with detecting and notifying through heartbeat detection, this improves the efficiency with which other nodes obtain the reset information of the node 100 and provides better real-time performance.
In addition, because the interface unit 102 in the node 100 obtains the reset information directly, this application avoids the misjudgments that heartbeat detection can make when its preset thresholds and limits are set unreasonably, and therefore avoids nodes being isolated by mistake.
When the node 100 loses power, the power module 103 in the node 100 triggers a power-down indication signal through a pin transition. The power-down indication signal can trigger a transition on a pin of the interface unit 102, for example from high level to low level or from low level to high level. From this pin transition, the interface unit 102 learns that the node 100 is losing power.
Based on the power-down information of the node 100, the interface unit 102 may generate a power-down notification message indicating that the node 100 is losing power, and send the power-down notification message to the other nodes in the distributed system, such as the node 200 and the node 300.
Because the power module 103 converts the mains voltage into a voltage usable by the other components of the node 100, the power module 103 triggers the pin transition as soon as it senses that the mains power has been lost. After the mains power is lost, the power module 103 continues for a short time to convert the energy it received before the loss into a voltage usable by the node 100. During this interval, the pin transition quickly passes the power-down information of the node 100 to the interface unit 102, and the interface unit 102 generates a power-down notification message for the node 100 and sends it to the node 200 and the node 300.
In this way, when the node 100 loses power, the node 100 can notify the interface unit 102 through the power module, and the interface unit 102 sends the power-down information of the node 100 to the other nodes in the distributed system, so that the node 200 and the node 300 quickly learn that the node 100 has lost power. Compared with detecting and notifying through heartbeat detection, the power-down notification of this application improves notification efficiency and real-time performance. It also avoids the misjudgments that heartbeat detection can make when its preset thresholds and limits are set unreasonably, and therefore avoids nodes being isolated by mistake.
The following describes in detail, through specific examples, how the distributed system shown in FIG. 2 transmits node reset or power-down information. In a specific implementation, the control unit in each node of the distributed system shown in FIG. 2 may further include a reset detection module, and the interface unit may further include a microcode module. The microcode module may be firmware (FW). As shown in FIG. 3, the control unit 101 includes a reset detection module 1011, and the interface unit 102 includes a microcode module 1021; the control unit 201 includes a reset detection module 2011, and the interface unit 202 includes a microcode module 2021; the control unit 301 includes a reset detection module 3011, and the interface unit 302 includes a microcode module 3021.
It can be understood that saying the control unit includes a reset detection module means, specifically, that the operating system run by the control unit includes the reset detection module. For brevity, the embodiments and drawings of this application describe the operating system run by the control unit including the reset detection module as the control unit including the reset detection module, and describe the functions that the control unit implements by executing the code of the reset detection module as functions implemented by the reset detection module.
Taking the node 100 as an example, the reset detection module 1011 detects whether the node 100 is about to reset and, before the node 100 resets, sends a notification message to the interface unit 102. The reset detection module 1011 may send the notification message to the microcode module 1021 in the interface unit 102, thereby passing the reset information of the node 100 to the interface unit 102. After receiving the notification message from the reset detection module 1011, the microcode module 1021 sends the reset information of the node 100 to the node 200 and the node 300 over the network.
The following uses the case in which the node 100 runs the Linux operating system and software in the node 100 requires a reset as an example to describe how the reset detection module 1011 obtains the reset information and notifies the microcode module 1021.
The reset detection module 1011 may register on a notification chain provided by the Linux operating system. The notification chain is a notification mechanism provided by Linux. The Linux operating system contains multiple kernel subsystems. Most kernel subsystems are independent of one another, and a subsystem can learn of events in other subsystems through notification chains. A notification chain can be used only between kernel subsystems; it cannot be used to notify events between the kernel and user space. A notification chain is a linked list of functions, with one function registered at each list node. When a given event occurs, the functions of all list nodes on the chain are executed. A notification chain therefore has a notifier and receivers. A receiver registers a function on the notification chain, and that function is executed when the event occurs. The receiver defines the handler to run when the event occurs, that is, a callback function. The callback function must be registered on the notification chain in advance; when the event occurs and the notifier issues the notification, the receiver obtains the event through its callback function.
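The register-then-callback pattern of a notification chain can be modelled in a few lines. The sketch below is a user-space Python analogy, not kernel code: `NotifierChain`, `RESTART_EVENT`, and `ResetDetection` are hypothetical names chosen for illustration. In the Linux kernel itself, reboot-related callbacks are registered with `register_reboot_notifier()`; whether the reset detection module described here uses that particular chain is not stated in the source.

```python
from typing import Any, Callable, List, Optional, Tuple

RESTART_EVENT = "restart"  # hypothetical event name for this sketch


class NotifierChain:
    """A list of callbacks that are all invoked when an event is raised."""

    def __init__(self) -> None:
        self._callbacks: List[Callable[[str, Any], None]] = []

    def register(self, callback: Callable[[str, Any], None]) -> None:
        self._callbacks.append(callback)

    def call_chain(self, event: str, data: Optional[Any] = None) -> None:
        # Iterate over a copy so a callback may safely register others.
        for callback in list(self._callbacks):
            callback(event, data)


class ResetDetection:
    """Stand-in for the reset detection module (the receiver)."""

    def __init__(self, node_id: str) -> None:
        self.node_id = node_id
        self.pending: List[Tuple[str, str]] = []  # (node id, reset info)

    def on_event(self, event: str, data: Any) -> None:
        if event == RESTART_EVENT:
            self.pending.append((self.node_id, "about-to-reset"))


chain = NotifierChain()
detector = ResetDetection("node-100")
chain.register(detector.on_event)   # registered in advance
chain.call_chain(RESTART_EVENT)     # the notifier raises the event
```

After the last line, `detector.pending` holds one entry for node-100, which corresponds to the information the reset detection module would then forward to the microcode module.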
Before a piece of software running in the node 100 resets the node, it sends a notification message to a notification chain, which may be a notification chain dedicated to resets. The reset detection module 1011, being registered on the notification chain, obtains through its callback function the information that the node 100 is about to reset. The reset detection module 1011 may transmit the reset information of the node 100 to the microcode module 1021 in the interface unit 102 over the communication channel between the control unit 101 and the interface unit 102, which may be, for example, a PCIe 3.0 channel. Optionally, the notification message sent by the reset detection module 1011 to the microcode module 1021 includes the identifier of the node 100 and the information that the node 100 is about to reset. After passing the reset information of the node 100 to the microcode module 1021 in the interface unit 102, the reset detection module 1011 can notify the reset module of the operating system of the node 100 to start the reset.
Optionally, the reset detection module 1011 may send the reset notification of the node 100 according to a private protocol shared with the microcode module 1021. For example, the reset detection module 1011 places the information that the node 100 is about to reset and the identifier of the node 100 in a private interface command predefined with the microcode module 1021, and sends the private interface command to the microcode module 1021. After receiving the message sent by the reset detection module 1011 over the private protocol, the microcode module 1021 generates a reset notification message from the identifier of the node 100 and the information that the node 100 is about to reset, and sends the generated reset notification message to the node 200 and the node 300 over the network.
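As an illustration of this hand-off, the sketch below turns a private interface command into a reset notification message. The source does not define the command layout or the message encoding, so everything here is an assumption made for the example: the `PrivateCommand` fields, the `"reset-pending"` event string, and the JSON encoding are all hypothetical.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class PrivateCommand:
    """Hypothetical private interface command from the reset detection module."""
    node_id: str
    event: str  # this sketch only understands "reset-pending"


def build_reset_notification(cmd: PrivateCommand) -> bytes:
    """Build the payload the microcode module would send to the other nodes."""
    if cmd.event != "reset-pending":
        raise ValueError(f"unsupported event: {cmd.event!r}")
    message = {
        "type": "reset-notification",
        "node_id": cmd.node_id,        # identifier of the resetting node
        "info": "about-to-reset",      # the reset information itself
    }
    return json.dumps(message, sort_keys=True).encode("utf-8")


payload = build_reset_notification(PrivateCommand("node-100", "reset-pending"))
```

A real microcode module would more plausibly use a compact binary layout; JSON is used here only to keep the example readable.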
The microcode module 1021 can send the generated reset notification message to the node 200 and the node 300 over the network in several ways: the message may be a directed (unicast) message or a broadcast message, and it may be sent over a private network or over a public network. This application does not limit the specific implementation. For example, when the distributed system formed by the node 100, the node 200, and the node 300 has its own private network, the microcode module 1021 may send the reset notification message over the private network, as either a directed message or a broadcast message. If the distributed system formed by the node 100, the node 200, and the node 300 has no private network of its own, the microcode module 1021 may send the reset notification message to the node 200 and the node 300 over the public network as a directed message; in this case, the reset notification message may carry the IP address of the node 100 and the IP addresses of the node 200 and the node 300. It can be understood that transmitting the notification message over a private network offers higher transmission efficiency and better real-time performance than transmitting it over a public network.
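The delivery choices above can be summarised as a small planning function. This is a sketch under stated assumptions: `plan_transmissions` is a hypothetical helper, the addresses are documentation-range placeholders, and refusing to broadcast outside a private network is a simplification of this example rather than a restriction stated in the source.

```python
from typing import Dict, List, Tuple


def plan_transmissions(payload: str,
                       source_ip: str,
                       peer_ips: List[str],
                       private_network: bool,
                       broadcast: bool = False,
                       broadcast_ip: str = "255.255.255.255",
                       ) -> List[Tuple[str, Dict[str, str]]]:
    """Return (destination, message) pairs for one notification."""
    if broadcast and not private_network:
        # Simplifying assumption of this sketch, not a rule from the source.
        raise ValueError("this sketch only broadcasts on a private network")
    if broadcast:
        return [(broadcast_ip, {"src": source_ip, "payload": payload})]
    # Directed messages carry both the sender's and the receiver's address.
    return [(ip, {"src": source_ip, "dst": ip, "payload": payload})
            for ip in peer_ips]


directed = plan_transmissions("reset", "192.0.2.1",
                              ["192.0.2.2", "192.0.2.3"],
                              private_network=False)
```

Here `directed` contains one message per peer, each carrying the source and destination addresses, which matches the public-network case described in the text.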
The other nodes (the node 200 and the node 300) receive the reset notification message sent by the node 100, either by listening for such messages or by receiving it directly, obtain from it the information that the node 100 is about to reset, and perform management operations such as service switchover or isolation through the cluster management software.
In another implementation of the embodiments of this application, the node 100 may also become abnormal because of a power failure. In this case, the information that the power supply of the node 100 is abnormal likewise needs to be delivered quickly to the node 200 and the node 300. Specifically, as shown in FIG. 3, when the power supply in the node 100 becomes abnormal, the power module 103 triggers a power-down indication signal to notify the interface unit 102. The power module 103 may do so by using a pin to indicate the power-down, thereby generating the power-down indication signal. In a specific implementation, the power-down indication signal may be a PS_OK signal or another signal used to indicate loss of mains power.
The power module 103 may trigger a transition on a pin of the interface unit 102 to pass the power-down information of the node 100 to the interface unit 102. For example, the power module 103 generates the PS_OK signal by triggering a transition on its own pin, and the PS_OK signal in turn triggers a transition on the pin of the interface unit 102, for example from high level to low level. Optionally, the power module 103 may instead trigger a pin defined by the microcode module 1021 to pass the power-down information.
The microcode module 1021 in the interface unit 102 can detect signal transitions on this pin. When it detects that the pin has transitioned, the microcode module 1021 obtains the information that the node 100 is losing power. For example, when the pin transitions from high level to low level, the microcode module 1021 learns that the node 100 is losing power.
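The edge detection described above amounts to comparing consecutive samples of the pin level. The sketch below is illustrative only: real firmware would normally use a GPIO interrupt rather than polling a list of samples, and the function name `first_power_down_edge` and the `active_edge` parameter are hypothetical.

```python
from typing import Optional, Sequence


def first_power_down_edge(samples: Sequence[int],
                          active_edge: str = "falling") -> Optional[int]:
    """Return the index of the first sample at which the configured edge
    is observed, or None if the pin never makes that transition.

    samples holds successive pin levels (1 = high, 0 = low).
    """
    if active_edge not in ("falling", "rising"):
        raise ValueError("active_edge must be 'falling' or 'rising'")
    before, after = (1, 0) if active_edge == "falling" else (0, 1)
    for i in range(1, len(samples)):
        if samples[i - 1] == before and samples[i] == after:
            return i
    return None
```

For instance, `first_power_down_edge([1, 1, 0, 0])` reports index 2, the sample at which the high-to-low transition is first visible. The `active_edge` parameter reflects the text's point that either polarity may be used.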
It should be noted that the pin transition triggered by the PS_OK signal must occur before the power module 103 actually loses power, by at least 200 microseconds. This lead time can be obtained from the energy stored in the capacitors of the power module 103 and is not described in further detail.
After confirming that the node 100 is losing power, the microcode module 1021 in the interface unit 102 generates a power-down notification message from information such as the identifier of the node 100, the power-down state, and the power-down time, and sends the generated power-down notification message to the node 200 and the node 300 over the network.
In a specific implementation, by the time the microcode module 1021 obtains the power-down information of the node 100, the node 100 is about to lose power. If the microcode module 1021 cannot generate the power-down notification message quickly, the message may fail to be sent because the node 100 loses power first. To increase the speed at which the microcode module 1021 sends the power-down notification message, the information required by the message can be configured in the microcode module 1021 when the node 100 is powered on and initialized. Then, when the microcode module 1021 obtains the power-down information of the node 100, it can quickly generate and send the power-down notification message from the information already configured. For example, the identifier of the node 100 is configured in the microcode module 1021 when the node 100 is powered on and initialized. Optionally, when the power-down notification message needs to be sent as a directed message, the IP addresses of the destination nodes may also be configured in the microcode module 1021. When the microcode module 1021 obtains the power-down information of the node 100, it then only needs to add a timestamp to the preconfigured information to generate and send the power-down notification message for the node 100.
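The preconfiguration idea can be sketched as a template that is prepared at power-on and only stamped with a time when the power-down is detected. The class name `PowerDownNotifier`, the message fields, and the `now` parameter (added so the behaviour can be checked deterministically) are hypothetical and not taken from the source.

```python
import time
from typing import Dict, Iterable, List, Optional, Tuple


class PowerDownNotifier:
    """Stand-in for the microcode module's power-down path."""

    def __init__(self, node_id: str, peer_addrs: Iterable[str] = ()) -> None:
        # Done once at power-on initialisation, while time is not critical.
        self._template = {
            "type": "power-down-notification",
            "node_id": node_id,
            "state": "power-down",
        }
        self._peers = tuple(peer_addrs)

    def on_power_down(self, now: Optional[float] = None
                      ) -> List[Tuple[str, Dict[str, object]]]:
        # Only the timestamp is filled in on the time-critical path.
        message = dict(self._template)
        message["timestamp"] = time.time() if now is None else now
        return [(peer, message) for peer in self._peers]


notifier = PowerDownNotifier("node-100", ["192.0.2.2", "192.0.2.3"])
outgoing = notifier.on_power_down(now=1000.0)
```

Keeping the per-event work to a single field update is what makes it plausible to finish sending within the short hold-up interval provided by the power module.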
The other nodes (the node 200 and the node 300) receive the power-down notification message sent by the node 100, either by listening for such messages or by receiving it directly, obtain from it the information that the node 100 has lost power, and perform management operations such as service switchover or isolation through the cluster management software.
It can be understood that the reset notification message or power-down notification message in the embodiments of this application may be an ordinary message that simply carries the necessary information, such as the reset information or the power-down information. It nevertheless differs from an ordinary heartbeat message: an ordinary heartbeat message carries no such information about a reset or power-down and is used only to detect communication timeouts and connectivity.
The above describes how the other nodes are notified quickly when the node 100 resets or loses power. When another node in the distributed system, such as the node 200 or the node 300, resets or loses power, the implementation is similar to that of the node 100 and is not repeated here.
In another implementation of the distributed system provided by the embodiments of this application, each node in the distributed system further includes a message monitoring module. The message monitoring module listens for reset or power-down information sent by other nodes. As shown in FIG. 4, the control unit 101 of the node 100 further includes a message monitoring module 1012, the control unit 201 of the node 200 further includes a message monitoring module 2012, and the control unit 301 of the node 300 further includes a message monitoring module 3012. Taking the node 100 as an example, the message monitoring module 1012 in the node 100 listens for reset or power-down notification messages sent by the node 200 or the node 300, so that the node 100 can perform management operations such as service switchover or isolation through the cluster management software.
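On the receiving side, the message monitoring module essentially filters incoming messages and hands reset or power-down notifications to the cluster management software. The sketch below illustrates that dispatch; the class `ClusterManager`, the message `type` strings, and the choice to model isolation as adding the node to a set are all assumptions of this example.

```python
from typing import Dict, Set

NOTIFICATION_TYPES = ("reset-notification", "power-down-notification")


class ClusterManager:
    """Stand-in for the cluster management software on a receiving node."""

    def __init__(self) -> None:
        self.isolated: Set[str] = set()

    def handle_message(self, message: Dict[str, str]) -> bool:
        """Isolate the sender if the message is a reset or power-down
        notification; return False for anything else (e.g. a heartbeat)."""
        if message.get("type") not in NOTIFICATION_TYPES:
            return False
        self.isolated.add(message["node_id"])
        return True


manager = ClusterManager()
manager.handle_message({"type": "heartbeat", "node_id": "node-200"})
manager.handle_message({"type": "reset-notification", "node_id": "node-200"})
```

After these two calls only the explicit notification has caused `node-200` to be isolated; the heartbeat was ignored by this path, in line with the distinction the text draws between notification messages and ordinary heartbeat messages.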
It can be understood that saying the control unit includes a message monitoring module means, specifically, that the operating system run by the control unit includes the message monitoring module. For brevity, the embodiments and drawings of this application describe the operating system run by the control unit including the message monitoring module as the control unit including the message monitoring module, and describe the functions that the control unit implements by executing the code of the message monitoring module as functions implemented by the message monitoring module.
FIG. 5 is a schematic structural diagram of another implementation of a distributed system according to an embodiment of this application. FIG. 5 differs from FIG. 2 in that each node includes a baseboard management controller (BMC). For example, the node 100 further includes a BMC 104, the node 200 further includes a BMC 204, and the node 300 further includes a BMC 304.
Taking the node 100 as an example, the BMC 104 is connected to the control unit 101, the interface unit 102, and the power module 103. When the node 100 resets, the BMC 104 obtains the reset information of the node 100 from the control unit 101 and sends a notification message to the interface unit 102, thereby passing the reset information of the node 100 to the interface unit 102. After the interface unit 102 obtains the message that the node 100 is resetting, the node 100 can start the reset. The control unit 101 obtains the reset information of the node 100 in the same way as the control unit 101 in FIG. 2 and FIG. 3 above, and the details are not repeated here.
Based on the reset information of node 100 obtained from BMC 104, interface unit 102 generates a reset notification message and sends it to node 200 and node 300. Interface unit 102 sends the reset information of node 100 to node 200 and node 300 in a manner similar to that of interface unit 102 in FIG. 2 and FIG. 3, so the details are not repeated here.
When power module 103 in node 100 becomes abnormal, power module 103 triggers a power-down indication signal to notify BMC 104. Power module 103 may indicate the power-down through a pin, which generates the power-down indication signal. In a specific implementation, the power-down indication signal may be a PS_OK signal or another signal that indicates loss of mains power. Through the power-down indication signal, power module 103 may trigger a transition on a pin of BMC 104 and thereby pass the power-down information of node 100 to BMC 104. BMC 104 passes the obtained power-down information to interface unit 102, for example by sending interface unit 102 a message that contains the power-down information of node 100. Interface unit 102 generates a power-down notification message from the power-down information of node 100 and sends it. Interface unit 102 notifies node 200 and node 300 of the power-down of node 100 in the same way that interface unit 102 in FIG. 2 or FIG. 3 notifies them through a power-down notification message, so the details are not repeated here.
Having BMC 104 relay the reset information of node 100 to node 200 and node 300 through interface unit 102 is slightly slower than having control unit 101 notify them directly through interface unit 102. Likewise, having BMC 104 relay the power-down information of node 100 through interface unit 102 is slightly slower than having power module 103 notify interface unit 102 directly. Even so, the reset or power-down information of node 100 still reaches node 200 and node 300 quickly, and the notification is more efficient and faster than detection and notification by heartbeat. In addition, because no preset threshold has to be configured for heartbeat detection, misjudgments caused by a poorly chosen threshold are avoided.
FIG. 6 is a schematic structural diagram of another implementation of a distributed system provided by an embodiment of this application. Each node includes a control unit, a power module, and a BMC, and each node sends reset or power-down notification messages to the other nodes through its BMC. As shown in FIG. 6, node 100 includes control unit 101, power module 103, and BMC 104; node 200 includes control unit 201, power module 203, and BMC 204; and node 300 includes control unit 301, power module 303, and BMC 304.
Taking node 100 as an example, control unit 101 is implemented in a manner similar to the control unit in FIG. 2 or FIG. 3, so the details are not repeated here. BMC 104 obtains the reset information of node 100 from control unit 101, generates a reset notification message from that information, and sends it to node 200 and node 300. BMC 104 obtains the reset information of node 100 from control unit 101 in a manner similar to the way interface unit 102 in FIG. 2 or FIG. 3 obtains it from the control unit. Similarly, BMC 104 obtains the power-down information of node 100 from power module 103 in a manner similar to the way interface unit 102 in FIG. 2 or FIG. 3 obtains it. After BMC 104 has obtained the reset or power-down information of node 100, it may notify node 200 and node 300 by sending either a directed notification message or a broadcast notification message. For example, BMC 104 in node 100 sends a reset notification message that contains the reset information of node 100 to BMC 204 in node 200 and to BMC 304 in node 300.
In a specific implementation, the control unit in each node of the distributed system shown in FIG. 6 further includes a reset detection module, and the BMC further includes a notification module. As shown in FIG. 7, control unit 101 includes reset detection module 1011 and BMC 104 includes notification module 1041; control unit 201 includes reset detection module 2011 and BMC 204 includes notification module 2041; control unit 301 includes reset detection module 3011 and BMC 304 includes notification module 3041. The reset detection module in the control unit shown in FIG. 7 is similar to the reset detection module in the control unit shown in FIG. 3 and may be a reset detection module in the operating system run by the control unit, so the details are not repeated here.
The following uses an example in which node 100 runs the Linux operating system and software in node 100 needs a reset, to describe how reset detection module 1011 obtains the reset information and passes it to notification module 1041. Reset detection module 1011 can register on a notification chain provided by the Linux operating system. When a piece of software running in node 100 is about to trigger a reset, it sends a notification to the notification chain. Reset detection module 1011, being registered on the notification chain, learns through its callback function that node 100 is about to be reset, and transmits the reset information of node 100 to notification module 1041 in BMC 104 over the communication channel between the control unit and the BMC, which may be, for example, a PCIe 3.0 channel.
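The registration-and-callback flow described above can be modelled in user space as follows. This is only an illustrative sketch: `NotifierChain`, `ResetDetectionModule`, `EVENT_RESTART`, and the dictionary passed to the BMC are hypothetical names, not taken from the source. In a real Linux kernel module the reboot notifier chain (registered with `register_reboot_notifier`) would likely play this role, but whether the application intends that specific chain is an assumption.

```python
from typing import Callable, List

# Hypothetical event code; a real kernel defines its own event constants.
EVENT_RESTART = 1


class NotifierChain:
    """User-space stand-in for an operating-system notification chain."""

    def __init__(self) -> None:
        self._callbacks: List[Callable[[int], None]] = []

    def register(self, callback: Callable[[int], None]) -> None:
        self._callbacks.append(callback)

    def notify(self, event: int) -> None:
        # Invoke every registered callback in registration order.
        for callback in self._callbacks:
            callback(event)


class ResetDetectionModule:
    """Models reset detection module 1011: forwards reset events to the BMC."""

    def __init__(self, node_id: int, send_to_bmc: Callable[[dict], None]) -> None:
        self.node_id = node_id
        self._send_to_bmc = send_to_bmc

    def on_event(self, event: int) -> None:
        if event == EVENT_RESTART:
            # Carry both the reset information and the node identifier.
            self._send_to_bmc({"node_id": self.node_id, "event": "reset"})


# A plain list stands in for the control-unit-to-BMC channel (e.g. PCIe).
bmc_inbox: List[dict] = []
chain = NotifierChain()
detector = ResetDetectionModule(node_id=100, send_to_bmc=bmc_inbox.append)
chain.register(detector.on_event)
chain.notify(EVENT_RESTART)  # software on node 100 announces a reset
```

After the last line runs, `bmc_inbox` holds one entry describing the pending reset of node 100, which corresponds to the information that notification module 1041 would receive.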
Optionally, reset detection module 1011 may send the notification that node 100 is being reset according to a private protocol between itself and notification module 1041. For example, reset detection module 1011 places the reset information of node 100 and the identifier of node 100 in a private interface command predefined between itself and notification module 1041, and sends that command to notification module 1041.
After receiving the message sent by reset detection module 1011, notification module 1041 generates a reset notification message from the identifier of node 100 and the reset information of node 100, and sends the generated message over the network to node 200 and node 300 in a directed or broadcast manner.
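As a rough illustration of how a notification module might turn a node identifier and reset information into a compact message, consider the sketch below. The wire layout (a 1-byte message type followed by a 4-byte node identifier in network byte order) and the names `MSG_RESET`, `build_notification`, and `parse_notification` are assumptions made for this example; the source does not specify a message format.

```python
import struct
from typing import Tuple

# Hypothetical layout: 1-byte message type, 4-byte node ID, network byte order.
_LAYOUT = struct.Struct("!BI")

MSG_RESET = 1       # hypothetical type code for a reset notification
MSG_POWER_DOWN = 2  # hypothetical type code for a power-down notification


def build_notification(msg_type: int, node_id: int) -> bytes:
    """Pack the message type and the sender's identifier into bytes."""
    return _LAYOUT.pack(msg_type, node_id)


def parse_notification(payload: bytes) -> Tuple[int, int]:
    """Return (msg_type, node_id) from a received payload."""
    return _LAYOUT.unpack(payload)


reset_message = build_notification(MSG_RESET, 100)
```

A receiver such as BMC 204 could call `parse_notification` on the payload to learn which node is resetting.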
Notification module 1041 can send the generated notification message to node 200 and node 300 over the network in several ways, and this application does not limit the specific implementation. For example, when the distributed system formed by node 100, node 200, and node 300 has its own private network, notification module 1041 may send the reset notification message over the private network. If the distributed system does not have its own private network, notification module 1041 may send the reset notification message to node 200 and node 300 over the public network; in that case, the reset notification message may carry the IP address of node 100 as well as the IP addresses of node 200 and node 300. It can be understood that transmitting the reset notification message over a private network is more efficient and more timely than transmitting it over the public network.
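The choice between directed and broadcast delivery can be sketched as below. The use of UDP, the port number `NOTIFY_PORT`, the limited broadcast address, and the function names are all assumptions for illustration; the source leaves the transport unspecified.

```python
import socket
from typing import Iterable, List, Tuple

NOTIFY_PORT = 50000                 # hypothetical UDP port
BROADCAST_ADDR = "255.255.255.255"  # limited broadcast on the local network


def destinations(peers: Iterable[str], broadcast: bool) -> List[Tuple[str, int]]:
    """Return the (address, port) pairs a notification should be sent to."""
    if broadcast:
        return [(BROADCAST_ADDR, NOTIFY_PORT)]
    return [(ip, NOTIFY_PORT) for ip in peers]


def send_notification(sock, payload: bytes, peers: Iterable[str],
                      broadcast: bool = False) -> int:
    """Send payload to each destination; return the number of datagrams sent."""
    targets = destinations(peers, broadcast)
    for target in targets:
        sock.sendto(payload, target)
    return len(targets)


def open_udp_socket(broadcast: bool = False) -> socket.socket:
    """Create a UDP socket, enabling broadcast when requested."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    if broadcast:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
    return sock
```

For directed delivery, a caller would pass the configured peer addresses, for example `send_notification(open_udp_socket(), payload, ["10.0.0.2", "10.0.0.3"])`, where the addresses are placeholders.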
The other nodes (node 200 and node 300) receive the reset notification message either by listening for it or by receiving it directly. Specifically, BMC 204 in node 200 may receive the reset notification message that node 100 sends through BMC 104, or BMC 304 in node 300 may receive the reset notification message that node 100 sends through BMC 104.
Optionally, BMC 204 of node 200 may include a receiving module (not shown in the figure) for receiving the reset notification message sent by notification module 1041 in BMC 104. Optionally, the receiving module in BMC 204 and notification module 2041 may be the same module. BMC 304 in node 300 is implemented in a manner similar to BMC 204 in node 200, so the details are not repeated here.
After the other nodes (node 200 and node 300) have obtained the reset information of node 100, they can perform management operations such as service switchover or isolation through cluster management software.
When the power supply in node 100 becomes abnormal or node 100 loses power, a power-down notification message can likewise be sent through BMC 104. Power module 103 may trigger a power-down indication signal to pass the power-down information of node 100 to BMC 104. For example, power module 103 may indicate the power-down through a pin, which generates the power-down indication signal. In a specific implementation, the power-down indication signal may be a PS_OK signal or another signal that indicates loss of mains power. The power-down indication signal triggers a transition on a pin of BMC 104, and BMC 104 obtains the power-down information of node 100 from that transition. Optionally, the power-down information of node 100 may also be conveyed by triggering a pin defined by notification module 1041.
It should be noted that the pin transition triggered by the PS_OK signal must occur before power module 103 actually loses power, at least 200 microseconds in advance. This can be achieved through the energy stored in the capacitors of power module 103 and is not described in detail here.
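To give a feel for the 200-microsecond margin, the following sketch estimates how long a capacitor can keep a load running after the input is lost, using the standard stored-energy relation E = ½·C·V². The constant-power load model and every component value below are illustrative assumptions, not figures from the source.

```python
REQUIRED_LEAD_S = 200e-6  # minimum warning time stated in the text


def holdup_time_s(capacitance_f: float, v_start: float,
                  v_min: float, load_w: float) -> float:
    """Time a capacitor can supply a constant-power load while its voltage
    falls from v_start to v_min (the lowest voltage the load tolerates)."""
    usable_energy_j = 0.5 * capacitance_f * (v_start ** 2 - v_min ** 2)
    return usable_energy_j / load_w


# Assumed values: 2200 uF capacitor, 12 V rail, 10 V minimum, 200 W load.
estimated_s = holdup_time_s(2200e-6, 12.0, 10.0, 200.0)
meets_requirement = estimated_s >= REQUIRED_LEAD_S
```

With these assumed values the estimate is about 242 microseconds, which would satisfy the margin; a 470 µF capacitor under the same assumptions would give only about 52 microseconds.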
In a specific implementation, by the time notification module 1041 obtains the power-down information of node 100, node 100 is about to lose power. If notification module 1041 cannot generate the power-down notification message quickly, sending the message may fail because node 100 loses power first. To speed up the sending of the power-down notification message, the information that the message requires can be configured in notification module 1041 when node 100 is powered on and initialized. Notification module 1041 can then generate and send the power-down notification message quickly from the preconfigured information as soon as it obtains the power-down information of node 100. For example, the identifier of node 100 is configured in notification module 1041 during power-on initialization. When the power-down notification message is to be sent in a directed manner, the IP addresses of the destination nodes can also be configured in notification module 1041. In this way, on obtaining the power-down information of node 100, notification module 1041 only needs to add a timestamp to the preconfigured information to generate and send the power-down notification message.
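The preconfiguration idea can be sketched as a small class that builds the fixed part of the message once at power-on and appends only a timestamp when the power-down signal arrives. The class name `PowerDownNotifier`, the byte layout, and the microsecond timestamp are assumptions for illustration only.

```python
import struct
import time
from typing import Iterable, List, Optional, Tuple


class PowerDownNotifier:
    """Models a notification module whose message prefix is built at power-on."""

    # Hypothetical layout: 1-byte type + 4-byte node ID, then 8-byte timestamp.
    _HEADER = struct.Struct("!BI")
    _TIMESTAMP = struct.Struct("!Q")
    MSG_POWER_DOWN = 2

    def __init__(self, node_id: int, peer_ips: Iterable[str]) -> None:
        # Done once during power-on initialization, off the critical path.
        self._prefix = self._HEADER.pack(self.MSG_POWER_DOWN, node_id)
        self._peers = list(peer_ips)

    def on_power_down(self, now_us: Optional[int] = None) -> List[Tuple[str, bytes]]:
        """Return (destination, payload) pairs; only the timestamp is added here."""
        if now_us is None:
            now_us = time.time_ns() // 1000
        payload = self._prefix + self._TIMESTAMP.pack(now_us)
        return [(ip, payload) for ip in self._peers]


notifier = PowerDownNotifier(node_id=100, peer_ips=["10.0.0.2", "10.0.0.3"])
outgoing = notifier.on_power_down(now_us=5)
```

Keeping the per-event work to one small pack and a concatenation is the point of the design: the less that must happen between the pin transition and transmission, the more likely the message leaves before power is gone.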
The other nodes (node 200 and node 300) receive the power-down notification message either by listening for it or by receiving it directly, obtain the power-down information of node 100 from the received message, and perform management operations such as service switchover or isolation through cluster management software.
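On the receiving side, the reaction to a reset or power-down notification might look like the sketch below. `ClusterManager` is a hypothetical stand-in for cluster management software, and marking a node as "isolated" is just one possible management action.

```python
from typing import Dict, Iterable, List


class ClusterManager:
    """Hypothetical stand-in for cluster management software on a peer node."""

    def __init__(self, node_ids: Iterable[int]) -> None:
        self.state: Dict[int, str] = {node_id: "active" for node_id in node_ids}

    def handle_notification(self, node_id: int, event: str) -> None:
        """Isolate a node as soon as it announces a reset or power-down."""
        if node_id not in self.state:
            return  # ignore notifications from unknown nodes
        if event in ("reset", "power_down"):
            self.state[node_id] = "isolated"

    def active_nodes(self) -> List[int]:
        return sorted(n for n, s in self.state.items() if s == "active")


manager = ClusterManager([100, 200, 300])
manager.handle_notification(100, "power_down")
remaining = manager.active_nodes()
```

Because the failing node announces itself, the peers do not have to wait for a heartbeat timeout before redirecting services to the remaining nodes.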
The above describes how the other nodes are notified quickly when node 100 is reset or loses power. When another node in the distributed system, such as node 200 or node 300, is reset or loses power, the implementation is similar to that of node 100 and is not repeated here.
In another implementation of the distributed system provided by the embodiments of this application, the nodes in the distributed system further include a message monitoring module, which listens for reset or power-down information sent by other nodes. As shown in FIG. 8, control unit 101 of node 100 further includes message monitoring module 1012, control unit 201 of node 200 further includes message monitoring module 2012, and control unit 301 of node 300 further includes message monitoring module 3012. Taking node 100 as an example, message monitoring module 1012 in node 100 listens for reset or power-down notification messages sent by node 200 or node 300, and management operations such as service switchover or isolation are then performed through cluster management software. The message monitoring module in FIG. 8 is implemented in a manner similar to the message monitoring module in FIG. 4, so the details are not repeated here.
In the foregoing embodiments, when node 100 is reset, the reset information of node 100 is sent to the other nodes in the distributed system either through control unit 101 and interface unit 102 or through the BMC. In a specific implementation, other software or hardware may also send the reset information. For example, the operating system of node 100 may send the reset information of node 100 directly to the other nodes in the distributed system through some interface; another chip or logic unit may send the reset information of node 100 to the other nodes; or microcode module 1021 in interface unit 102 may obtain the reset information of node 100 from another chip or logic unit and send it to the other nodes. Any approach in which node 100, when it is reset, actively sends its own reset information to the other nodes in the distributed system, and which thereby improves the efficiency and timeliness of notification compared with heartbeat detection, falls within the scope of the embodiments of this application.
Similarly, when node 100 loses power, the power-down information of node 100 is sent to the other nodes in the distributed system either through power module 103 and interface unit 102 or through the BMC. In a specific implementation, other hardware or software in node 100 may also send the power-down information; for example, another chip or logic unit of node 100 may send the power-down information of node 100 to the other nodes in the distributed system. Any approach in which node 100, when it loses power, actively sends its own power-down information to the other nodes in the distributed system, and which thereby improves the efficiency and timeliness of notification compared with heartbeat detection, falls within the scope of the embodiments of this application.
FIG. 9A is a schematic structural diagram of a computer device 900 according to an embodiment of this application. As shown in FIG. 9A, computer device 900 includes a processor 901 and a message sending unit 902. Computer device 900 is a computer device in a distributed system, and the distributed system may include two or more computer devices.
The processor 901 is configured to obtain the reset information of the computer device 900 when the computer device 900 is reset, and to transmit the reset information of the computer device 900 to the message sending unit 902;
The message sending unit 902 is configured to generate, from the reset information of the computer device 900, a reset notification message that contains the reset information of the computer device 900, and to send the reset notification message to the other computer devices in the distributed system.
The computer device 900 obtains its own reset information through the message sending unit 902, generates a reset notification message that contains the reset information, and sends it to the other devices in the distributed system, so that the other devices are informed of the reset quickly. Compared with the prior-art approach of using heartbeats to detect whether another device has been reset, this not only makes the delivery of reset information more efficient but also avoids misjudgments.
Specifically, the computer device 900 may be implemented with reference to the implementation of node 100 in FIG. 2 to FIG. 4. For example, the message sending unit 902 is implemented in a manner similar to interface unit 102 in FIG. 2 to FIG. 4, and the processor 901 is implemented in a manner similar to control unit 101 in FIG. 2 to FIG. 4, so the details are not repeated here.
Optionally, as shown in FIG. 9B, the computer device 900 further includes a power module 903. The power module 903 may be connected to the message sending unit 902 through a general-purpose input/output (GPIO) pin;
The power module 903 is configured to, when the computer device 900 loses power, pass the power-down information of the computer device 900 to the message sending unit 902 by triggering a transition on the pin;
The message sending unit 902 is further configured to generate, from the power-down information of the computer device 900, a power-down notification message that contains the power-down information of the computer device 900, and to send the power-down notification message to the other computer devices in the distributed system.
The power module 903 is implemented in a manner similar to power module 103 in FIG. 2 to FIG. 4, so the details are not repeated here.
The computer device 900 obtains its own power-down information through the message sending unit 902, generates a power-down notification message that contains the power-down information, and sends it to the other devices in the distributed system, so that the other devices are informed of the power-down quickly. Compared with the prior-art approach of using heartbeats to detect whether another device has lost power, this not only makes the delivery of power-down information more efficient but also avoids misjudgments.
FIG. 10 is a schematic structural diagram of another implementation of the computer device 900 provided by an embodiment of this application. As shown in FIG. 10, the computer device 900 further includes a BMC 904. The BMC 904 is configured to obtain the reset information of the computer device 900 from the operating system of the computer device 900 and to send the reset information to the message sending unit 902; or the BMC 904 obtains the power-down information of the computer device 900 from the power module and sends the power-down information to the message sending unit 902.
Obtaining the reset or power-down information of the computer device 900 through the BMC 904 and sending it to the other computer devices in the distributed system through the message sending unit 902 makes the acquisition of reset or power-down information in the distributed system more efficient than learning of another node's reset or power-down through heartbeats, and avoids misjudgments.
Specifically, the BMC 904 in FIG. 10 may be implemented with reference to the implementation of BMC 104 in FIG. 5, so the details are not repeated here.
In another implementation of the embodiments of this application, the message sending unit 902 in the computer device 900 is implemented by the BMC 904. As shown in FIG. 11, the computer device 900 includes a central processing unit 901, a BMC 904, and a power module 903, and the BMC 904 includes the message sending unit 902. Specifically, the computer device 900 shown in FIG. 11 may be implemented with reference to the implementation of node 100 in FIG. 7 or FIG. 8. For example, the BMC 904 may be implemented with reference to BMC 104 in FIG. 7 or FIG. 8, and the message sending unit 902 in the BMC 904 may be implemented with reference to notification module 1041 in FIG. 7 or FIG. 8, so the details are not repeated here.
FIG. 12 is a schematic flowchart of a method for sending device information according to an embodiment of this application. As shown in FIG. 12, the method includes:
Step S100: A processor in a computer device obtains the reset information of the computer device before the computer device is reset, and transmits the reset information of the computer device to a message sending unit in the computer device;
Step S200: The message sending unit receives the reset information of the computer device;
Step S300: The message sending unit generates, from the reset information of the computer device, a reset notification message that contains the reset information of the computer device, and sends the reset notification message to the other computer devices in the distributed system in which the computer device is located.
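Steps S100 to S300 can be traced in a few lines of code. The sketch below is a minimal model under stated assumptions: the class names, the dictionary-based message, and the `transmit` callback are hypothetical, and a real device would send the message over a network rather than append it to a list.

```python
from typing import Callable, List, Optional


class MessageSendingUnit:
    """Models the message sending unit (for example an interface unit or a BMC)."""

    def __init__(self, node_id: int, transmit: Callable[[dict], None]) -> None:
        self.node_id = node_id
        self._transmit = transmit
        self._pending: Optional[dict] = None

    def receive_reset_info(self, info: dict) -> None:
        # Step S200: receive the reset information from the processor.
        self._pending = info

    def send_reset_notification(self) -> dict:
        # Step S300: build the reset notification message and send it to peers.
        message = {
            "type": "reset_notification",
            "node_id": self.node_id,
            "reset_info": self._pending,
        }
        self._transmit(message)
        return message


class Processor:
    """Models the processor that learns of the reset before it happens."""

    def __init__(self, unit: MessageSendingUnit) -> None:
        self._unit = unit

    def before_reset(self, reason: str) -> None:
        # Step S100: obtain the reset information and pass it to the unit.
        self._unit.receive_reset_info({"reason": reason})


sent_to_peers: List[dict] = []
unit = MessageSendingUnit(node_id=900, transmit=sent_to_peers.append)
Processor(unit).before_reset("software upgrade")
unit.send_reset_notification()
```

The split between the two classes mirrors the method's division of labour: the processor only detects and forwards, while the message sending unit owns message construction and delivery.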
The above method obtains the reset information of the computer device, generates a reset notification message that contains the reset information, and sends it to the other devices in the distributed system, so that the other devices are informed of the reset quickly. Compared with the prior-art approach of using heartbeats to detect whether another device has been reset, this not only makes the delivery of reset information more efficient but also avoids misjudgments.
The above method may be implemented by a computer device in a distributed system. For a specific implementation, refer to the implementation of node 100 in FIG. 2 to FIG. 8; the details are not repeated here.
Optionally, the method further includes: the processor obtains the reset information of the computer device through a preset function and a reset-related notification chain in the operating system of the computer device.
Optionally, the preset function is a callback function, and the callback function is registered on the notification chain;
That the processor obtains the reset information of the computer device through a preset function and a reset-related notification chain in the operating system of the computer device includes:
The processor obtains the reset information of the computer device from the notification chain through the callback function.
Optionally, the processor transmits the reset information of the computer device to the message sending unit through the BMC in the computer device.
Optionally, the message sending unit is the BMC in the computer device.
Optionally, the method further includes:
When the computer device loses power, the message sending unit obtains the power-down information of the computer device through a pin transition, generates a power-down notification message that contains the power-down information of the computer device, and sends the power-down notification message to the other computer devices in the distributed system.
The above method obtains the power-down information of the device, generates a power-down notification message that contains the power-down information, and sends it to the other devices in the distributed system, so that the other devices are informed of the power-down quickly. Compared with the prior-art approach of using heartbeats to detect whether another device has lost power, this not only makes the delivery of power-down information more efficient but also avoids misjudgments.
A person of ordinary skill in the art will appreciate that the units, modules, and steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To illustrate the interchangeability of hardware and software clearly, the composition and steps of each example have been described above in general terms according to function. Whether these functions are performed in hardware or in software depends on the particular application and the design constraints of the technical solution. Skilled practitioners may use different methods to implement the described functions for each particular application, but such implementations should not be considered to go beyond the scope of the present invention.
In the several embodiments provided in this application, it should be understood that the disclosed system, device, and method may be implemented in other ways. For example, the node 100 described above is merely illustrative; the division into units or modules is merely a division by logical function, and other divisions are possible in an actual implementation. For example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual connections, direct connections, or communication connections shown or discussed may be connections or communication connections through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments of the present invention.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above are only specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any person skilled in the art can readily conceive of various equivalent modifications or replacements within the technical scope disclosed by the present invention, and these modifications or replacements shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (15)

  1. A computer device, the computer device being a computer device in a distributed system and comprising a processor, characterized in that the computer device further comprises a message sending unit, and the message sending unit is connected to the processor through a bus;
    the processor is configured to, when the computer device is reset, obtain reset information of the computer device and transmit the reset information to the message sending unit; and
    the message sending unit is configured to generate, according to the reset information of the computer device, a reset notification message containing the reset information, and to send the reset notification message to other computer devices in the distributed system.
  2. The computer device according to claim 1, characterized in that:
    an operating system in the computer device comprises a notification chain for resets; and
    before the computer device is reset, the processor obtains the reset information of the computer device from the notification chain through a preset function.
  3. The computer device according to claim 1, characterized in that an operating system in the computer device comprises a reset detection module;
    the reset detection module is registered on a reset notification chain in the operating system of the computer device and obtains, through a callback function, information on the reset notification chain about the reset of the computer device; and
    the processor obtains the reset information of the computer device through the reset detection module.
  4. The computer device according to any one of claims 1-3, characterized in that:
    the message sending unit comprises a microcode module; and
    the microcode module is configured to generate the reset notification message and send the reset notification message to the other computer devices in the distributed system.
  5. The computer device according to any one of claims 1-4, characterized in that the computer device further comprises a baseboard management controller (BMC);
    the processor transmits the reset information of the computer device to the message sending unit through the BMC.
  6. The computer device according to any one of claims 1-5, characterized in that the computer device further comprises a power supply module, and the power supply module is connected to the message sending unit through a general-purpose input/output (GPIO) pin;
    the power supply module is configured to, when the computer device is powered off, pass power-down information of the computer device to the message sending unit by triggering a transition of the pin; and
    the message sending unit is further configured to generate, according to the power-down information of the computer device, a power-down notification message containing the power-down information, and to send the power-down notification message to the other computer devices in the distributed system.
  7. The computer device according to any one of claims 1-3, characterized in that the message sending unit is a baseboard management controller (BMC), and the BMC further comprises a notification module;
    the notification module generates the reset notification message according to the reset information of the computer device obtained by the BMC, and sends the reset notification message to the other computer devices in the distributed system.
  8. The computer device according to claim 7, characterized in that the computer device further comprises a power supply module;
    the power supply module is further configured to, when the computer device is powered off, pass power-down information of the computer device to the BMC by triggering a transition of a pin; and
    the notification module is further configured to generate, according to the obtained power-down information of the computer device, a power-down notification message containing the power-down information, and to send the power-down notification message to the other computer devices in the distributed system.
  9. A distributed computer device system, characterized by comprising at least two computer devices as claimed in claims 1-8.
  10. A method for sending device information, characterized in that the method comprises:
    obtaining, by a processor in a computer device, reset information of the computer device before the computer device is reset, and transmitting the reset information of the computer device to a message sending unit in the computer device;
    receiving, by the message sending unit, the reset information of the computer device; and
    generating, by the message sending unit according to the reset information of the computer device, a reset notification message containing the reset information, and sending the reset notification message to other computer devices in the distributed system in which the computer device is located.
  11. The method according to claim 10, characterized in that the method further comprises:
    obtaining, by the processor, the reset information of the computer device through a preset function and a notification chain for resets in an operating system of the computer device.
  12. The method according to claim 11, characterized in that the preset function is a callback function, and the callback function is registered on the notification chain;
    the obtaining, by the processor, of the reset information of the computer device through the preset function and the notification chain for resets in the operating system of the computer device comprises:
    obtaining, by the processor, the reset information of the computer device from the notification chain through the callback function.
  13. The method according to any one of claims 10-12, characterized in that the method further comprises:
    transmitting, by the processor, the reset information of the computer device to the message sending unit through a baseboard management controller (BMC) in the computer device.
  14. The method according to any one of claims 10-12, characterized in that
    the message sending unit is a baseboard management controller (BMC) in the computer device.
  15. The method according to any one of claims 11-14, characterized in that the method further comprises:
    when the computer device is powered off, obtaining, by the message sending unit, power-down information of the computer device through a pin transition, generating a power-down notification message containing the power-down information, and sending the power-down notification message to the other computer devices in the distributed system.
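
As an informal illustration of the reset path recited in claims 10-12 (it forms no part of the claims), the following sketch simulates a reset notification chain with a registered callback that forwards the reset information to a message sending unit. Every name here is a hypothetical stand-in: `ResetNotifierChain` models the operating system's notification chain, `make_reset_detection_callback` models the preset callback function, and the `outbox` list stands in for the network.

```python
from typing import Callable, Dict, List, Tuple

ResetInfo = Dict[str, str]


class ResetNotifierChain:
    """Hypothetical stand-in for the operating system's reset notification chain."""

    def __init__(self) -> None:
        self._callbacks: List[Callable[[ResetInfo], None]] = []

    def register(self, callback: Callable[[ResetInfo], None]) -> None:
        self._callbacks.append(callback)

    def notify(self, reset_info: ResetInfo) -> None:
        # Assumed to be invoked by the OS just before the reset proceeds.
        for callback in self._callbacks:
            callback(reset_info)


class MessageSendingUnit:
    """Hypothetical message sending unit; outbox records (peer, message) pairs."""

    def __init__(self, node_id: str, peers: List[str]) -> None:
        self.node_id = node_id
        self.peers = peers
        self.outbox: List[Tuple[str, dict]] = []

    def handle_reset_info(self, reset_info: ResetInfo) -> None:
        # Build the reset notification message and queue one copy per peer.
        message = {"type": "RESET", "node": self.node_id, **reset_info}
        for peer in self.peers:
            self.outbox.append((peer, message))


def make_reset_detection_callback(unit: MessageSendingUnit) -> Callable[[ResetInfo], None]:
    """Callback run on the processor side: pass reset info on to the sending unit."""
    def on_reset(reset_info: ResetInfo) -> None:
        unit.handle_reset_info(reset_info)
    return on_reset


chain = ResetNotifierChain()
unit = MessageSendingUnit("node-1", ["node-2", "node-3"])
chain.register(make_reset_detection_callback(unit))  # callback registered on the chain
chain.notify({"reason": "reboot"})                   # the OS announces the pending reset
```

After `chain.notify` runs, `unit.outbox` holds one reset notification per peer, each carrying the node identifier and the reset information passed along the chain.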
PCT/CN2019/113147 2018-11-01 2019-10-25 Method for sending device information, computer device and distributed computer device system WO2020088351A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN201811294576.8 2018-11-01
CN201811294576 2018-11-01
CN201811632716.8A CN109831350A (en) 2018-11-01 2018-12-29 Method, computer equipment and the distributed computer device systems that facility information is sent
CN201811632716.8 2018-12-29

Publications (1)

Publication Number Publication Date
WO2020088351A1 (en) 2020-05-07

Family

ID=66860602

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/113147 WO2020088351A1 (en) 2018-11-01 2019-10-25 Method for sending device information, computer device and distributed computer device system

Country Status (2)

Country Link
CN (1) CN109831350A (en)
WO (1) WO2020088351A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114257492A (en) * 2021-12-09 2022-03-29 北京天融信网络安全技术有限公司 Fault processing method and device of intelligent network card, computer equipment and medium

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109831350A (en) * 2018-11-01 2019-05-31 华为技术有限公司 Method, computer equipment and the distributed computer device systems that facility information is sent
CN111338914A (en) * 2020-02-10 2020-06-26 华为技术有限公司 Fault notification method and related equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102970167A (en) * 2012-11-26 2013-03-13 华为技术有限公司 Method for detecting faults of network nodes in cluster system, network node and system
US20130151714A1 (en) * 2011-12-13 2013-06-13 Motorola Mobility, Inc. Method and apparatus for adaptive network heartbeat message for tcp channel
JP2016220131A (en) * 2015-05-25 2016-12-22 三菱電機株式会社 Ring network relay device, ring network system and ring network relay method
CN108121571A (en) * 2017-12-21 2018-06-05 郑州云海信息技术有限公司 A kind of individual reset design and realization based on system hardware module
CN109831350A (en) * 2018-11-01 2019-05-31 华为技术有限公司 Method, computer equipment and the distributed computer device systems that facility information is sent

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102307367A (en) * 2011-08-18 2012-01-04 大唐移动通信设备有限公司 Communication equipment and power failure alarming method
CN102412994A (en) * 2011-11-23 2012-04-11 福建星网锐捷网络有限公司 Receiving equipment, transmitting equipment, and line fault processing method and system
CN106504514A (en) * 2016-11-04 2017-03-15 重庆世纪之光科技实业有限公司 Communication equipment and its power down alarm method

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114257492A (en) * 2021-12-09 2022-03-29 北京天融信网络安全技术有限公司 Fault processing method and device of intelligent network card, computer equipment and medium
CN114257492B (en) * 2021-12-09 2023-11-28 北京天融信网络安全技术有限公司 Fault processing method and device for intelligent network card, computer equipment and medium

Also Published As

Publication number Publication date
CN109831350A (en) 2019-05-31

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19879941

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19879941

Country of ref document: EP

Kind code of ref document: A1