CN113839810B - HPC-based server out-of-band data transmission method, device and system - Google Patents

HPC-based server out-of-band data transmission method, device and system

Info

Publication number
CN113839810B
Authority
CN
China
Prior art keywords
node server
state
server
node
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110998462.7A
Other languages
Chinese (zh)
Other versions
CN113839810A (en)
Inventor
赵阳阳
段谊海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Jinan data Technology Co ltd
Original Assignee
Inspur Jinan data Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Jinan data Technology Co ltd
Priority to CN202110998462.7A
Publication of CN113839810A
Application granted
Publication of CN113839810B
Legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/069 Management of faults, events, alarms or notifications using logs of notifications; Post-processing of notifications
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/02 Standardisation; Integration
    • H04L41/0246 Exchanging or transporting network management information using the Internet; Embedding network management web servers in network elements; Web-services-based protocols
    • H04L41/0266 Exchanging or transporting network management information using the Internet; Embedding network management web servers in network elements; Web-services-based protocols using meta-data, objects or commands for formatting management information, e.g. using eXtensible markup language [XML]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/50 Network services
    • H04L67/56 Provisioning of proxy services
    • H04L67/568 Storing data temporarily at an intermediate stage, e.g. caching
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)
  • Computer And Data Communications (AREA)

Abstract

The invention provides an HPC-based method, device and system for transmitting server out-of-band data. The method comprises the following steps: sending different collection instructions to each node server according to the running state of that node server; when a collection instruction is received, acquiring the corresponding asset information and alarm information; synchronizing the acquired asset information and alarm information to the node server cache; and when an upload instruction is received, reading the corresponding asset information and alarm information from the node server cache and uploading them. Compared with traditional Ethernet, the IB (InfiniBand) network can transmit data to the monitoring management terminal quickly and effectively. The invention achieves fast and effective collection of the out-of-band asset and alarm data of each HPC node server, effectively reduces the load on the monitoring server, and improves the data transmission rate and stability.

Description

HPC-based server out-of-band data transmission method, device and system
Technical Field
The invention relates to the technical field of server out-of-band data transmission in an HPC scene, in particular to a method, a device and a system for server out-of-band data transmission based on HPC.
Background
With the rapid development of social informatization and data mining, the demands placed on data processing capacity keep rising. Demand for high-performance computers is growing rapidly not only in fields such as petroleum exploration, weather forecasting, aerospace, national defense and scientific research, but also in broader fields such as finance, government informatization, education, enterprise and online gaming. HPC (High Performance Computing) here refers to a high-performance computer cluster, and the health and performance of the HPC hardware assets directly affect the overall operating speed of the system. Fast and efficient transmission of out-of-band asset and alarm data in HPC scenarios is therefore important.
The related art mainly monitors the out-of-band assets of servers on a local area network and lacks monitoring designed for the special HPC scenario, even though the service nodes in an HPC scenario are configured with an IB (InfiniBand) network and can transmit data efficiently.
Disclosure of Invention
The invention provides an HPC-based server out-of-band data transmission method, device and system. It aims to solve the following problems of the traditional approach, in which IPMI is used only over Ethernet to call the node servers remotely and acquire out-of-band asset and alarm data: data transmission efficiency is low because of the network bottleneck, and data is easily lost because of the limits of the server and the network bandwidth.
The technical scheme of the invention is as follows:
In a first aspect, the technical solution of the present invention provides an HPC-based server out-of-band data transmission method, including the following steps:
sending different collection instructions to each node server according to the running state of that node server;
when a collection instruction is received, acquiring the corresponding asset information and alarm information;
synchronizing the acquired asset information and alarm information to the node server cache;
and when an upload instruction is received, reading the corresponding asset information and alarm information from the node server cache and uploading them.
The method achieves fast and effective collection of the out-of-band asset information and alarm information of each node server, effectively reduces the load on the monitoring server, and improves the data transmission rate and stability.
Further, the step of sending different collection instructions to each node server according to the running state of each node server includes:
periodically polling the operating system state and the network connectivity state of each node server;
judging whether the network connectivity state of the node server is normal;
if the network connectivity state is normal, judging whether the operating system state of the node server is normal;
if the operating system state of the node server is abnormal, issuing a collection instruction to the node BMC (baseboard management controller);
if the operating system state of the node server is normal, issuing a collection instruction to the operating system of the node server;
and if the network connectivity state is abnormal, issuing a collection instruction to the node BMC.
The out-of-band data to be collected is determined according to the user's configuration, and the collection instruction sent to each node server is chosen according to the system running state of that node.
Two collection modes are provided according to this state. When the node server system is running normally, the node operating system acquires the out-of-band asset information and alarm information. When the node server has failed, or the network between the node server and the monitoring management server is unreachable for some other reason, a remote IPMI (Intelligent Platform Management Interface) call is made directly to acquire the out-of-band asset information and alarm information.
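The branching described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `Target`, `choose_collection_target` and `plan_instructions` are hypothetical and do not appear in the patent, and the two boolean inputs stand in for whatever state checks an implementation actually performs.

```python
from enum import Enum


class Target(Enum):
    """Where a collection instruction is sent (hypothetical names)."""
    NODE_OS = "node_os"    # node operating system collects locally
    NODE_BMC = "node_bmc"  # node BMC is queried by a remote IPMI call


def choose_collection_target(network_ok: bool, os_ok: bool) -> Target:
    """Pick the recipient of the collection instruction for one node."""
    # The network state is judged first; the OS state is only
    # examined when the node is reachable.
    if not network_ok:
        return Target.NODE_BMC
    if not os_ok:
        return Target.NODE_BMC
    return Target.NODE_OS


def plan_instructions(states: dict[str, tuple[bool, bool]]) -> dict[str, Target]:
    """Map each node name to its target, given (network_ok, os_ok) pairs."""
    return {
        node: choose_collection_target(network_ok, os_ok)
        for node, (network_ok, os_ok) in states.items()
    }
```

Only the case in which both states are normal sends the instruction to the node operating system; every other combination falls back to the BMC.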
Further, the step of periodically polling the operating system state and the network connectivity state of each node server includes:
creating and initializing a node server state linked list and a node server cache. The state linked list holds the acquired state of each node server, so that the states can be conveniently updated in real time.
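A minimal sketch of such a state linked list and cache initialization follows, assuming one list entry per node. The names `NodeState`, `build_state_list`, `refresh` and `init_caches`, and the `probe` callback that reports the two states, are hypothetical; the patent does not specify the data layout.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class NodeState:
    """One entry of the state linked list (hypothetical layout)."""
    name: str
    network_ok: bool = False
    os_ok: bool = False
    next: Optional["NodeState"] = None


def build_state_list(names: list[str]) -> Optional[NodeState]:
    """Create the list in the given order, every state initialized to False."""
    head: Optional[NodeState] = None
    for name in reversed(names):
        head = NodeState(name=name, next=head)
    return head


def init_caches(names: list[str]) -> dict[str, dict]:
    """Create one empty cache per node server."""
    return {name: {} for name in names}


def refresh(head: Optional[NodeState],
            probe: Callable[[str], tuple[bool, bool]]) -> None:
    """Walk the list once and overwrite each entry with freshly probed state."""
    node = head
    while node is not None:
        node.network_ok, node.os_ok = probe(node.name)
        node = node.next
```

Calling `refresh` from a periodic loop keeps the list current, which is the role the text assigns to the linked list.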
Further, the step of reading the corresponding asset information and alarm information from the node server cache and uploading them when an upload instruction is received includes:
issuing a data upload instruction at regular intervals.
To improve transmission efficiency and save resources, the data is uploaded periodically on a timer.
Further, the step of acquiring the corresponding asset information and alarm information after a collection instruction is received includes:
after the collection instruction is received, calling the IPMI local data interface and acquiring the corresponding asset information and alarm information from the BMC.
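As an illustration of local versus remote IPMI collection, the sketch below builds command lines for the `ipmitool` utility without running them. The patent names no tool, so the use of `ipmitool` is an assumption, as is the mapping of asset information to the FRU inventory (`fru print`) and of alarm information to the system event log (`sel list`). The function name `ipmi_commands` and its parameters are hypothetical.

```python
from typing import Optional


def ipmi_commands(bmc_host: Optional[str] = None,
                  user: str = "",
                  password: str = "") -> dict[str, list[str]]:
    """Build ipmitool argument lists for asset and alarm collection.

    With no bmc_host the commands use the local interface, as a node
    operating system would. With a bmc_host they address the BMC over
    the network through the lanplus interface, as a monitoring server
    would for a failed or unreachable node.
    """
    base = ["ipmitool"]
    if bmc_host is not None:
        base += ["-I", "lanplus", "-H", bmc_host, "-U", user, "-P", password]
    return {
        "asset": base + ["fru", "print"],  # assumption: assets = FRU inventory
        "alarm": base + ["sel", "list"],   # assumption: alarms = event log
    }
```

The returned lists can be passed to `subprocess.run` on a machine where `ipmitool` is installed; the sketch stops short of executing them so that it stays independent of hardware.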
Further, the step of reading the corresponding asset information and alarm information from the node server cache and uploading them when an upload instruction is received includes:
when the upload instruction is received, reading the corresponding asset information and alarm information from the node server cache and uploading them over the IB network. Compared with traditional Ethernet, the IB network allows the data to be uploaded quickly and effectively.
In a second aspect, the technical solution of the present invention further provides an HPC-based server out-of-band data transmission apparatus, which includes a control instruction issuing module, an out-of-band data acquisition module, a system data cache module, and a data transmission control module;
the control instruction issuing module is used for sending different collection instructions to each node server according to the running state of each node server, and is also used for sending upload instructions at regular intervals;
the out-of-band data acquisition module is used for acquiring the corresponding asset information and alarm information when a collection instruction is received;
the system data cache module is used for storing the collected asset information and alarm information in a cache in a fixed format and continuously updating the cache at the configured frequency;
and the data transmission control module is used for reading the corresponding asset information and alarm information from the node server cache and uploading them when an upload instruction is received.
The control instruction issuing module is mainly used for determining the out-of-band data to be collected according to the user's configuration, choosing the collection instruction sent to each node according to the system running state of that node, and issuing an upload instruction at regular intervals;
the system data cache module stores the data acquired locally by the node server operating system in a cache in a fixed format, and continuously updates the cache at the configured frequency.
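The caching behaviour described for the system data cache module might look like the following sketch. The patent does not define the fixed format, so JSON with the keys `asset`, `alarm` and `collected_at` is an assumption, and the class name `NodeCache`, the `collect` callback and the injectable `clock` are hypothetical.

```python
import json
import time
from typing import Callable, Optional


class NodeCache:
    """Holds the latest collected record and refreshes it at a set interval."""

    def __init__(self,
                 collect: Callable[[], dict],
                 refresh_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._collect = collect
        self._refresh_seconds = refresh_seconds
        self._clock = clock
        self._record: Optional[str] = None
        self._updated_at = 0.0

    def refresh_if_due(self) -> bool:
        """Re-collect when the cache is empty or the interval has elapsed."""
        now = self._clock()
        if self._record is not None and now - self._updated_at < self._refresh_seconds:
            return False
        data = self._collect()
        # Assumed fixed format: one JSON object with stable key order.
        self._record = json.dumps(
            {
                "asset": data.get("asset", {}),
                "alarm": data.get("alarm", []),
                "collected_at": now,
            },
            sort_keys=True,
        )
        self._updated_at = now
        return True

    def read(self) -> Optional[str]:
        """Return the cached record, or None if nothing was collected yet."""
        return self._record
```

Passing the clock as a parameter keeps the refresh rule testable without waiting for real time to pass.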
Further, the apparatus also includes a state monitoring module and a state judgment module;
the state monitoring module is used for periodically polling the operating system state and the network connectivity state of each node server;
the state judgment module is used for judging whether the network connectivity state of the node server is normal and, if the network connectivity state is normal, judging whether the operating system state of the node server is normal;
the control instruction issuing module is used for issuing a collection instruction to the node BMC when the state judgment module judges that the operating system state of the node server is abnormal or the network connectivity state is abnormal, and is also used for issuing a collection instruction to the operating system of the node server when both the operating system state and the network connectivity state of the node server are normal.
The out-of-band data acquisition module provides two collection modes. When the node server system is running normally, the node operating system acquires the out-of-band asset and alarm information through a local IPMI call. When the node server has failed, or the network between the node server and the monitoring management terminal is unreachable for some other reason, a remote IPMI call is made directly to acquire the out-of-band asset and alarm information.
Further, the apparatus also includes a preprocessing module, which is used for creating and initializing the node server state linked list and the node server cache.
Further, the apparatus also includes a timing module, which is used for timing and for sending trigger information to the control instruction issuing module when the set time is reached;
the control instruction issuing module is used for issuing a data upload instruction to the data transmission control module when the trigger information is received.
The out-of-band data acquisition module is used for calling the IPMI local data interface after a collection instruction is received, and acquiring the corresponding asset information and alarm information from the BMC.
The control instruction issuing module is used for sending different collection instructions to each node server according to the running state of each node server.
The data transmission control module is used for reading the corresponding asset information and alarm information from the local cache of the node server and uploading them over the IB network when an upload instruction is received, so that the data is transmitted quickly.
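The upload path of the data transmission control module can be sketched as follows. The patent does not say how data is carried over the IB network; this sketch assumes an ordinary stream socket, which on an InfiniBand fabric could run over IP over InfiniBand (IPoIB), and an assumed framing of a 4-byte big-endian length followed by the record. All function names are hypothetical.

```python
import socket
import struct
from typing import Callable, Optional


def send_record(sock: socket.socket, payload: bytes) -> None:
    """Send one record with an assumed 4-byte big-endian length prefix."""
    sock.sendall(struct.pack("!I", len(payload)) + payload)


def _recv_exact(sock: socket.socket, size: int) -> bytes:
    """Read exactly `size` bytes or raise if the peer closes early."""
    buf = bytearray()
    while len(buf) < size:
        chunk = sock.recv(size - len(buf))
        if not chunk:
            raise ConnectionError("peer closed before the full record arrived")
        buf.extend(chunk)
    return bytes(buf)


def recv_record(sock: socket.socket) -> bytes:
    """Receive one length-prefixed record (monitoring server side)."""
    (length,) = struct.unpack("!I", _recv_exact(sock, 4))
    return _recv_exact(sock, length)


def handle_upload(read_cache: Callable[[], Optional[bytes]],
                  sock: socket.socket) -> bool:
    """On an upload instruction, send the cached record if one exists."""
    payload = read_cache()
    if payload is None:
        return False  # nothing collected yet, so nothing is sent
    send_record(sock, payload)
    return True
```

Because only the cached record is read at upload time, the upload does not wait on an IPMI query, which matches the stated purpose of caching on the node.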
In a third aspect, the technical solution of the present invention further provides an HPC-based server out-of-band data transmission system, which includes a monitoring management server and a plurality of node servers; the plurality of node servers are each in communication connection with the monitoring management server;
the monitoring management server includes a control instruction issuing module;
each node server is provided with a node operating system and a BMC;
the node operating system includes an out-of-band data acquisition module, a system data cache module and a data transmission control module;
the control instruction issuing module is used for sending different collection instructions to each node server according to the running state of each node server, and for sending upload instructions at regular intervals;
the out-of-band data acquisition module is used for acquiring the corresponding asset information and alarm information when a collection instruction is received;
the system data cache module is used for storing the collected asset information and alarm information in a cache in a fixed format and continuously updating the cache at the configured frequency;
and the data transmission control module is used for reading the corresponding asset information and alarm information from the node server cache and uploading them to the monitoring management server when an upload instruction is received.
The control instruction issuing module is mainly used for determining the out-of-band data to be collected according to the user's configuration, and for choosing the collection instruction sent to each node according to the system running state of that node. The out-of-band data acquisition module provides two collection modes: when the node server system is running normally, the node OS acquires the out-of-band asset and alarm information through a local IPMI call; when the node server has failed, or the network between the node server and the monitoring management server is unreachable for some other reason, the monitoring management server makes a remote IPMI call directly to acquire the out-of-band asset and alarm information. The system data cache module applies caching on the node: it stores the data acquired locally by the node server OS in a cache in a fixed format and continuously updates the cache at the configured frequency. The data transmission control module is mainly used for taking the corresponding asset and alarm data out of the local cache after receiving a data reporting instruction from the monitoring management server, and transmitting the data directly to the monitoring management server over the IB network, so that the data is transmitted quickly.
Further, the monitoring management server also includes a state monitoring module and a state judgment module;
the state monitoring module is used for periodically polling the operating system state and the network connectivity state of each node server;
the state judgment module is used for judging whether the network connectivity state of the node server is normal and, if the network connectivity state is normal, judging whether the operating system state of the node server is normal;
the control instruction issuing module is used for issuing a collection instruction to the node BMC when the state judgment module judges that the operating system state of the node server is abnormal or the network connectivity state is abnormal, and is also used for issuing a collection instruction to the operating system of the node server when both the operating system state and the network connectivity state of the node server are normal.
Further, the monitoring management server also includes a preprocessing module, which is used for creating and initializing a node server state linked list and a node server cache.
Further, the monitoring management server also includes a timing module, which is used for timing and for sending trigger information to the control instruction issuing module when the set time is reached;
the control instruction issuing module is used for issuing a data upload instruction to the data transmission control module when the trigger information is received.
The control instruction issuing module is used for sending different collection instructions to each node server according to the running state of each node server.
The out-of-band data acquisition module is used for calling the IPMI local data interface after a collection instruction is received, and acquiring the corresponding asset information and alarm information from the BMC.
The data transmission control module is used for reading the corresponding asset information and alarm information from the local cache of the node server when an upload instruction is received, and uploading them to the monitoring management server over the IB network.
The invention creates a cache within the system of each service node under the HPC, calls local IPMI commands to collect local asset and alarm data periodically, stores the data in the local cache of the node server, and sends it directly to the monitoring management server over the high-speed IB (InfiniBand) network. The data is thus transmitted quickly and accurately, and the load and network bandwidth occupation of the monitoring server are effectively reduced.
According to the above technical solution, the invention has the following advantages. First, it collects server out-of-band data in the HPC scenario, where many computing nodes exist, and acquires the asset and alarm data of each node while the different nodes are executing their tasks. Second, it makes use of the environment provided by in-band operation: the out-of-band data is periodically stored in an in-band cache and continuously updated, which keeps the out-of-band data current and valid. Third, for data transmission it uses the IB network, which, compared with traditional Ethernet, can transmit data to the monitoring management terminal quickly and effectively. The invention thereby achieves fast and effective collection of the out-of-band asset and alarm data of each HPC node server, effectively reduces the load on the monitoring server, and improves the data transmission rate and stability.
In addition, the invention has a reliable design principle, a simple structure and a very wide application prospect.
Therefore, compared with the prior art, the invention has prominent substantive features and represents notable progress, and the beneficial effects of its implementation are evident.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. It is obvious that those skilled in the art can obtain other drawings from these drawings without creative effort.
FIG. 1 is a schematic flow diagram of a method of one embodiment of the invention.
FIG. 2 is a schematic flow diagram of a method of another embodiment of the invention.
FIG. 3 is a schematic block diagram of an apparatus of one embodiment of the present invention.
FIG. 4 is a schematic block diagram of a system of one embodiment of the present invention.
In the figures: 1 - monitoring management server; 2 - node server; 11 - control instruction issuing module; 12 - state monitoring module; 13 - state judgment module; 14 - preprocessing module; 15 - timing module; 20 - node operating system; 21 - out-of-band data acquisition module; 22 - system data cache module; 23 - data transmission control module; 30 - BMC.
Detailed Description
In order to enable those skilled in the art to better understand the technical solution of the present invention, the technical solution in the embodiments of the present invention is described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present invention.
As shown in FIG. 1, an embodiment of the present invention provides an HPC-based server out-of-band data transmission method, including the following steps:
step 11: sending different collection instructions to each node server according to the running state of that node server;
step 12: when a collection instruction is received, acquiring the corresponding asset information and alarm information;
step 13: synchronizing the acquired asset information and alarm information to the node server cache;
step 14: when an upload instruction is received, reading the corresponding asset information and alarm information from the node server cache and uploading them.
The method achieves fast and effective collection of the out-of-band asset information and alarm information of each node server, effectively reduces the load on the monitoring server, and improves the data transmission rate and stability.
As shown in FIG. 2, in some embodiments, step 11, sending different collection instructions to each node server according to the running state of each node server, includes:
step 111: periodically polling the operating system state and the network connectivity state of each node server;
step 112: judging whether the network connectivity state of the node server is normal; if yes, going to step 113; if not, going to step 114;
step 113: judging whether the operating system state of the node server is normal; if not, going to step 114; if yes, going to step 115;
step 114: issuing a collection instruction to the node BMC;
step 115: issuing a collection instruction to the operating system of the node server.
The out-of-band data to be collected is determined according to the user's configuration, and the collection instruction sent to each node server is chosen according to the system running state of that node.
Two collection modes are provided according to this state. When the node server system is running normally, the node operating system acquires the out-of-band asset information and alarm information. When the node server has failed, or the network between the node server and the monitoring management server is unreachable for some other reason, a remote IPMI (Intelligent Platform Management Interface) call is made directly to acquire the out-of-band asset information and alarm information.
In some embodiments, step 111, periodically polling the operating system state and the network connectivity state of each node server, includes:
step 110: creating and initializing a node server state linked list and a node server cache. The state linked list holds the acquired state of each node server, so that the states can be conveniently updated in real time.
In step 14, the step of reading the corresponding asset information and alarm information from the node server cache and uploading them when an upload instruction is received includes:
step 134: issuing a data upload instruction at regular intervals.
To improve transmission efficiency and save resources, the data is uploaded periodically on a timer.
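The timed issuing of upload instructions can be sketched as a small interval trigger. This is a sketch under assumptions: the class name `UploadTimer`, the `tick` method driven by an external loop, and the `on_trigger` callback that stands in for issuing the upload instruction are all hypothetical.

```python
from typing import Callable


class UploadTimer:
    """Fires a callback each time the configured interval has elapsed."""

    def __init__(self,
                 interval_seconds: float,
                 on_trigger: Callable[[], None],
                 start: float = 0.0) -> None:
        self._interval = interval_seconds
        self._on_trigger = on_trigger
        self._last_fired = start

    def tick(self, now: float) -> bool:
        """Call with the current time; returns True when the trigger fired."""
        if now - self._last_fired < self._interval:
            return False
        self._last_fired = now
        self._on_trigger()  # stands in for issuing the upload instruction
        return True
```

A monitoring loop would call `tick(time.monotonic())` on each pass, so uploads occur at most once per interval regardless of how often the loop runs.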
In some embodiments, in step 12, the step of acquiring the corresponding asset information and alarm information after a collection instruction is received includes:
after the collection instruction is received, calling the IPMI local data interface and acquiring the corresponding asset information and alarm information from the BMC.
In step 14, the step of reading the corresponding asset information and alarm information from the node server cache and uploading them when an upload instruction is received includes:
when the upload instruction is received, reading the corresponding asset information and alarm information from the node server cache and uploading them over the IB network. Compared with traditional Ethernet, the IB network allows the data to be uploaded quickly and effectively.
As shown in FIG. 3, the technical solution of the present invention further provides an HPC-based server out-of-band data transmission apparatus, which includes a control instruction issuing module 11, an out-of-band data acquisition module 21, a system data cache module 22, and a data transmission control module 23;
the control instruction issuing module 11 is configured to send different collection instructions to each node server according to the running state of each node server, and is further configured to send an upload instruction at regular intervals;
the out-of-band data acquisition module 21 is configured to acquire the corresponding asset information and alarm information when a collection instruction is received;
the system data cache module 22 is configured to store the collected asset information and alarm information in a cache in a fixed format and to continuously update the cache at the configured frequency;
the data transmission control module 23 is configured to read the corresponding asset information and alarm information from the node server cache and upload them when an upload instruction is received.
The control instruction issuing module 11 is mainly used for determining the out-of-band data to be collected according to the user's configuration, choosing the collection instruction sent to each node according to the system running state of that node, and issuing an upload instruction at regular intervals;
the system data cache module 22 stores the data acquired locally by the node server operating system in a cache in a fixed format, and continuously updates the cache at the configured frequency.
In some embodiments, the apparatus further comprises a status monitoring module 12, a status determining module 13;
the state monitoring module 12 is configured to periodically and circularly acquire an operating system state and a network connectivity state of each node server;
a state judgment module 13, configured to judge whether a network connection state of the node server is normal; if the network communication state is normal, judging whether the operating system state of the node server is normal;
the control instruction issuing module 11 is configured to issue a collection instruction to the node BMC when the state determination module determines that the operating system state of the node server is abnormal or the network connectivity state is abnormal; and the method is also used for issuing a collection instruction to the operating system of the node server when the operating system of the node server is in a normal state and the network communication state is normal.
The out-of-band data acquisition module 21 includes two collection modes, when the node server system operates normally, the node operating system uses IPMI local call to obtain out-of-band assets and alarm information, and when the node server fails or other reasons cause the network between the node server and the monitoring management terminal to be disconnected, remote IPMI call is directly performed to obtain out-of-band assets and alarm information.
In some embodiments, the apparatus further comprises a pre-processing module 14 and a timing module 15;
and the preprocessing module 14 is configured to create a node server state linked list and a node server cache, and initialize the node server state linked list and the node server cache.
The timing module 15 is configured to keep time and to send trigger information to the control instruction issuing module when the set time is reached;
the control instruction issuing module 11 is configured to determine to send different collection instructions to each node server according to the operating state of each node server, and is also configured to issue a data upload instruction to the data transmission control module 23 when receiving the trigger information.
The out-of-band data acquisition module 21 is configured to call the IPMI local data interface after receiving the collection instruction, and to acquire the corresponding asset information and alarm information from the BMC.
The data transmission control module 23 is configured to, when receiving the upload instruction, read the corresponding asset information and alarm information from the local cache of the node server and upload the information through the IB network, so that the data can be transmitted quickly.
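The initialization performed by the preprocessing module might look like the following sketch. The patent states only that a node server state linked list and a node server cache are created and initialized, so the fields of each list entry, the use of `None` for a state that has not been probed yet, and the node names are assumptions.

```python
from dataclasses import dataclass
from typing import Iterator, List, Optional


@dataclass
class NodeState:
    """One entry of the node server state linked list."""
    node_id: str
    os_ok: Optional[bool] = None        # None means "not probed yet"
    network_ok: Optional[bool] = None
    next: Optional["NodeState"] = None  # link to the following entry


class NodeStateList:
    """Singly linked list holding one state entry per node server."""

    def __init__(self, node_ids: List[str]) -> None:
        self.head: Optional[NodeState] = None
        # Build back to front so iteration follows the configured order.
        for node_id in reversed(node_ids):
            self.head = NodeState(node_id, next=self.head)

    def __iter__(self) -> Iterator[NodeState]:
        cur = self.head
        while cur is not None:
            yield cur
            cur = cur.next

    def update(self, node_id: str, os_ok: bool, network_ok: bool) -> bool:
        """Record a fresh probe result; return False for an unknown node."""
        for entry in self:
            if entry.node_id == node_id:
                entry.os_ok, entry.network_ok = os_ok, network_ok
                return True
        return False


# Initialization: every node starts with an unknown state and an empty cache.
states = NodeStateList(["node-01", "node-02", "node-03"])
caches = {entry.node_id: {} for entry in states}
```

The periodic state monitoring would then call `update` for each node on every polling cycle.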
As shown in fig. 4, the present invention further provides an HPC-based server out-of-band data transmission system, which includes a monitoring management server 1 and a plurality of node servers 2; the node servers 2 are respectively in communication connection with the monitoring management server 1;
the monitoring management server 1 comprises a control instruction issuing module 11;
each node server 2 is provided with a node operating system 20 and a BMC30;
the node operating system 20 comprises an out-of-band data acquisition module 21, a system data cache module 22 and a data transmission control module 23;
the control instruction issuing module 11 is configured to determine to send different collection instructions to each node server according to the operating state of each node server, and send an upload instruction at regular time;
the out-of-band data acquisition module 21 is used for acquiring corresponding asset information and alarm information when receiving a collection instruction;
the system data cache module 22 is used for storing the collected asset information and the alarm information into a cache according to a fixed format and continuously updating the cache according to the configured frequency;
and the data transmission control module 23 is configured to, when receiving the upload instruction, read corresponding asset information and alarm information from the node server cache and upload the asset information and the alarm information to the monitoring management server 1.
The control instruction issuing module 11 mainly determines the out-of-band data information to be acquired according to the configuration of a user, and determines to send different collection instructions to each node according to the system running state of each node;
the out-of-band data acquisition module 21 includes two collection modes: when the node server 2 system operates normally, the node OS acquires the out-of-band asset and alarm information by using a local IPMI call; when the node server 2 fails, or the network between the node server 2 and the monitoring management server 1 is disconnected for other reasons, the monitoring management server 1 directly performs a remote IPMI call to acquire the out-of-band asset and alarm information;
the system data cache module 22 applies a cache technology on the node, stores the data locally acquired by the node server OS into the cache according to a fixed format, and continuously updates the cache according to the configured frequency;
and the data transmission control module 23 mainly takes the corresponding asset and alarm data out of the local cache after receiving a data reporting instruction from the monitoring management server, and transmits the data directly to the monitoring management server 1 through the IB network, so that the data is transmitted quickly.
In some embodiments, the monitoring management server 1 further includes a status monitoring module 12 and a status determining module 13;
the state monitoring module 12 is configured to periodically and cyclically acquire an operating system state and a network connectivity state of each node server;
a state judgment module 13, configured to judge whether a network connectivity state of the node server is normal; if the network communication state is normal, judging whether the operating system state of the node server is normal;
the control instruction issuing module 11 is configured to issue a collection instruction to the node BMC when the state judgment module judges that the operating system state of the node server is abnormal or the network connectivity state is abnormal, and is also configured to issue a collection instruction to the operating system of the node server when the operating system state of the node server is normal and the network connectivity state is normal.
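The judgment order described here (network connectivity first, then operating system state) reduces to a small decision function. The sketch below is one possible reading; the enum and function names are hypothetical.

```python
from enum import Enum


class Target(Enum):
    """Where the collection instruction is sent."""
    NODE_OS = "node_os"    # the node collects through a local IPMI call
    NODE_BMC = "node_bmc"  # the manager queries the BMC remotely


def choose_target(network_ok: bool, os_ok: bool) -> Target:
    """Pick the recipient of the collection instruction for one node."""
    # The network connectivity state is judged first.
    if not network_ok:
        return Target.NODE_BMC
    # The operating system state is judged only when the network is normal.
    if not os_ok:
        return Target.NODE_BMC
    return Target.NODE_OS
```

Only a node whose network and operating system are both normal receives the instruction through its operating system; every other combination falls back to the BMC.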
In some embodiments, the apparatus further comprises a pre-processing module 14 and a timing module 15;
and the preprocessing module 14 is configured to create a node server state linked list and a node server cache, and initialize the node server state linked list and the node server cache.
The timing module 15 is used for timing and sending triggering information to the control instruction issuing module when the set timing time is reached;
the control instruction issuing module 11 is configured to determine to send different collection instructions to each node server according to the operating state of each node server; and is configured to issue a data uploading instruction to the data transmission control module 23 when receiving the trigger information.
The out-of-band data acquisition module 21 is configured to call the IPMI local data interface after receiving the collection instruction, and to acquire the corresponding asset information and alarm information from the BMC.
The data transmission control module 23 is configured to, when receiving the upload instruction, read the corresponding asset information and alarm information from the local cache of the node server and upload the information through the IB network, so that the data can be transmitted quickly.
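The interplay between the timing module and the control instruction issuing module can be sketched as a polled timer. The interval value, the `"UPLOAD"` marker, and the choice to skip missed periods rather than fire them in a burst are assumptions; the patent states only that trigger information is sent when the set time is reached.

```python
from typing import List, Tuple


class TimingModule:
    """Signals when the configured upload interval has elapsed."""

    def __init__(self, interval: float, start: float = 0.0) -> None:
        if interval <= 0:
            raise ValueError("interval must be positive")
        self.interval = interval
        self.next_due = start + interval

    def poll(self, now: float) -> bool:
        """Return True at most once per elapsed interval."""
        if now < self.next_due:
            return False
        # Skip any missed periods so a stalled loop does not fire a burst.
        missed = int((now - self.next_due) // self.interval) + 1
        self.next_due += missed * self.interval
        return True


def issue_uploads(timer: TimingModule, now: float,
                  node_ids: List[str]) -> List[Tuple[str, str]]:
    """Issue one upload instruction per node when the timer triggers."""
    if not timer.poll(now):
        return []
    return [(node_id, "UPLOAD") for node_id in node_ids]
```

A monitoring loop would call `issue_uploads` with the current time on every pass and forward the returned instructions to the data transmission control module of each node.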
Firstly, the monitoring management server determines the out-of-band indexes and alarm information to be acquired according to the configuration of a user, determines the acquisition mode for each node according to the operating system state of that node server, and sends a collection instruction to the node server;
secondly, after receiving the collection instruction sent by the monitoring management server, each node server calls the IPMI local data interface to acquire the corresponding asset information and alarm information, and stores the corresponding data in a local cache.
Then, the monitoring management server starts a timer and issues an uploading instruction to the node servers at a fixed frequency; each node server reads the asset and alarm information from the cache and pushes the information to the monitoring management server through the IB network for further processing.
The specific implementation algorithm is as follows:
(1) a node server state linked list and a node server cache are created and initialized;
(2) the operating system state and the IB network connectivity state of each node server are acquired periodically and cyclically, and the state is continuously updated;
(3) if the operating system state of the node server is abnormal, a collection instruction is issued to the node BMC, the data returned by the BMC is stored directly, and step (2) continues to be executed;
(4) if the IB network connectivity state of the node server is normal, the monitoring management server issues a collection instruction to the node OS, and step (5) is executed;
(5) the node server calls the IPMI local interface to acquire the out-of-band asset and alarm information from the BMC;
(6) the acquired data is synchronized to the local cache, and step (4) is repeated;
(7) a timer is started, an uploading instruction is issued to the node server at a fixed frequency, and the node server reads the data from the cache and sends it to the monitoring management server through the IB network.
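Steps (1) to (7) can be traced with a small simulation of one polling cycle. Everything below is a stand-in under stated assumptions: the IPMI calls and the IB transfer are replaced by plain Python functions, the payloads are made up, and a node with an abnormal network is also routed to the BMC, which follows the claims rather than the wording of step (3).

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    """A node server with its probed state and its local cache (step 1)."""
    node_id: str
    os_ok: bool = True
    network_ok: bool = True
    cache: Dict[str, dict] = field(default_factory=dict)

    def collect_local(self) -> None:
        """Steps (5)-(6): stand-in for a local IPMI call filling the cache."""
        self.cache = {"asset": {"node": self.node_id}, "alarm": {"count": 0}}


def read_via_bmc(node: Node) -> Dict[str, dict]:
    """Step (3): stand-in for a remote IPMI call to the node BMC."""
    return {"asset": {"node": node.node_id}, "alarm": {"count": 1}}


def run_cycle(nodes: List[Node], upload_due: bool) -> Dict[str, dict]:
    """Run one cycle and return what the monitoring management server holds."""
    received: Dict[str, dict] = {}
    for node in nodes:  # step (2): states are assumed to be freshly probed
        if not node.os_ok or not node.network_ok:
            # Step (3): query the BMC and store the returned data directly.
            received[node.node_id] = {"via": "bmc", **read_via_bmc(node)}
        else:
            # Step (4): instruct the node OS, which performs steps (5)-(6).
            node.collect_local()
    if upload_due:  # step (7): the timer has fired
        for node in nodes:
            if node.os_ok and node.network_ok and node.cache:
                # Stand-in for pushing the cached data over the IB network.
                received[node.node_id] = {"via": "ib", **node.cache}
    return received
```

For example, with one healthy node and one node whose operating system is down, a cycle without an upload trigger returns only the BMC data, while a cycle with the trigger also returns the healthy node's cached data.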
A cache is created in each node server under the HPC, the local IPMI command is called to regularly acquire local asset information and alarm information, the information is stored in the local cache of the node server, and it is then sent directly to the monitoring management server through the high-speed IB network. In this way, data can be transmitted quickly and accurately, and the load and the network bandwidth occupation of the monitoring server are effectively reduced.
Although the present invention has been described in detail in connection with the preferred embodiments with reference to the accompanying drawings, the present invention is not limited thereto. Various equivalent modifications or substitutions can be made to the embodiments of the present invention by those skilled in the art without departing from the spirit and scope of the present invention, and these modifications or substitutions are within the scope of the present invention; any person skilled in the art can easily conceive of changes or substitutions within the technical scope disclosed by the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims (8)

1. An HPC-based server out-of-band data transmission method is characterized by comprising the following steps:
sending different collecting instructions to each node server according to the running state of each node server;
when each node server receives a collection instruction, acquiring corresponding asset information and alarm information;
each node server synchronizes the acquired asset information and the acquired alarm information to the node server for caching;
when each node server receives an uploading instruction, reading corresponding asset information and alarm information from a node server cache for uploading;
the step of deciding to send different collection instructions to each node server according to the operation state of each node server comprises the following steps:
periodically and circularly acquiring the operating system state and the network communication state of each node server;
judging whether the network communication state of the node server is normal or not;
if the network communication state is normal, judging whether the operating system state of the node server is normal;
if the operating system state of the node server is abnormal, a collection instruction is issued to the node BMC;
if the operating system state of the node server is normal, issuing a collection instruction to the operating system of the node server;
if the network communication state is abnormal, a collection instruction is sent to the node BMC;
when an uploading instruction is received, the step of reading corresponding asset information and alarm information from the node server cache for uploading comprises the following steps: and when the uploading instruction is received, reading corresponding asset information and alarm information from the node server cache, and uploading the information through an IB network.
2. The HPC-based server out-of-band data transfer method of claim 1, wherein the step of periodically cycling the operating system and network connectivity status of each node server comprises:
and establishing a node server state linked list and a node server cache, and initializing.
3. The HPC-based server out-of-band data transmission method of claim 2, wherein the step of reading the corresponding asset information and alarm information from the node server cache for uploading upon receiving the upload instruction comprises:
and issuing a data uploading instruction at regular time.
4. The HPC-based server out-of-band data transmission method of claim 3, wherein the step of obtaining corresponding asset information and alarm information upon receiving the collection instruction comprises:
and after receiving the collection instruction, calling the IPMI local data interface to acquire corresponding asset information and alarm information from the BMC.
5. An HPC-based server out-of-band data transmission apparatus configured to perform the server out-of-band data transmission method according to any one of claims 1 to 4, comprising a control instruction issuing module, an out-of-band data acquisition module, a system data caching module, and a data transmission control module;
the control instruction issuing module is used for determining to send different collecting instructions to each node server according to the running state of each node server and is also used for sending uploading instructions at regular time;
the out-of-band data acquisition module is used for acquiring corresponding asset information and alarm information when receiving the collection instruction;
the system data cache module is used for storing the collected asset information and the alarm information into a cache according to a fixed format and continuously updating the cache according to the configured frequency;
and the data transmission control module is used for reading corresponding asset information and alarm information from the node server cache for uploading when receiving the uploading instruction.
6. The HPC-based server out-of-band data transmission apparatus of claim 5, further comprising a status monitoring module, a status determination module;
the state monitoring module is used for periodically and circularly acquiring the operating system state and the network communication state of each node server;
the state judgment module is used for judging whether the network communication state of the node server is normal or not; if the network communication state is normal, judging whether the operating system state of the node server is normal or not;
the control instruction issuing module is used for issuing a collection instruction to the node BMC when the state judgment module judges that the operating system state of the node server is abnormal or the network communication state is abnormal; and the system is also used for issuing a collection instruction to the operating system of the node server when the operating system of the node server is in a normal state and the network communication state is normal.
7. An HPC-based server out-of-band data transmission system configured to perform the server out-of-band data transmission method of any one of claims 1 to 4, comprising a monitoring management server and a number of node servers; the plurality of node servers are respectively in communication connection with the monitoring management server;
the monitoring management server comprises a control instruction issuing module;
each node server is provided with a node operating system and a BMC;
the node operating system comprises an out-of-band data acquisition module, a system data cache module and a data transmission control module;
the control instruction issuing module is used for determining to send different collecting instructions to each node server according to the running state of each node server and sending uploading instructions at regular time;
the out-of-band data acquisition module is used for acquiring corresponding asset information and alarm information when receiving a collection instruction;
the system data cache module is used for storing the collected asset information and the alarm information into a cache according to a fixed format and continuously updating the cache according to the configured frequency;
and the data transmission control module is used for reading corresponding asset information and alarm information from the node server cache to upload the asset information and the alarm information to the monitoring management server when receiving the uploading instruction.
8. The HPC-based server out-of-band data transmission system of claim 7, the monitoring management server further comprising a status monitoring module, a status determination module;
the state monitoring module is used for periodically and circularly acquiring the operating system state and the network communication state of each node server;
the state judgment module is used for judging whether the network communication state of the node server is normal or not; if the network communication state is normal, judging whether the operating system state of the node server is normal;
the control instruction issuing module is used for issuing a collection instruction to the node BMC when the state judgment module judges that the operating system state of the node server is abnormal or the network communication state is abnormal; and the module is also used for issuing a collection instruction to the operating system of the node server when the operating system of the node server is in a normal state and the network communication state is normal.
CN202110998462.7A 2021-08-27 2021-08-27 HPC-based server out-of-band data transmission method, device and system Active CN113839810B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110998462.7A CN113839810B (en) 2021-08-27 2021-08-27 HPC-based server out-of-band data transmission method, device and system

Publications (2)

Publication Number Publication Date
CN113839810A CN113839810A (en) 2021-12-24
CN113839810B true CN113839810B (en) 2023-04-07

Family

ID=78961360

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110998462.7A Active CN113839810B (en) 2021-08-27 2021-08-27 HPC-based server out-of-band data transmission method, device and system

Country Status (1)

Country Link
CN (1) CN113839810B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101577698A (en) * 2008-05-09 2009-11-11 中兴通讯股份有限公司 System with external intelligent management server and method for monitoring server and processing commands
CN102681900A (en) * 2012-04-28 2012-09-19 浪潮电子信息产业股份有限公司 Method for managing assets of server node
CN105373899A (en) * 2015-12-03 2016-03-02 广州云新信息技术有限公司 Server asset management method and apparatus
WO2016101638A1 (en) * 2014-12-23 2016-06-30 国家电网公司 Operation management method for electric power system cloud simulation platform
CN112286755A (en) * 2020-09-24 2021-01-29 曙光信息产业股份有限公司 Cluster server out-of-band data acquisition method and device and computer equipment

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080209031A1 (en) * 2007-02-22 2008-08-28 Inventec Corporation Method of collecting and managing computer device information
US20140208133A1 (en) * 2013-01-23 2014-07-24 Dell Products L.P. Systems and methods for out-of-band management of an information handling system
US9467464B2 (en) * 2013-03-15 2016-10-11 Tenable Network Security, Inc. System and method for correlating log data to discover network vulnerabilities and assets
US11405293B2 (en) * 2018-09-21 2022-08-02 Kyndryl, Inc. System and method for managing IT asset inventories using low power, short range network technologies
JP7396616B2 (en) * 2019-06-28 2023-12-12 i-PRO株式会社 Asset management system and asset management method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
基于流量感知的动态网络资产监测研究;李憧等;《信息安全研究》;20200604(第06期);全文 *
面向云化网络的资产安全管理方案;张小梅等;《邮电设计技术》;20190420(第04期);全文 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant