CN117992270A

CN117992270A - Memory resource management system, method, device, equipment and storage medium

Info

Publication number: CN117992270A
Application number: CN202410372719.1A
Authority: CN
Inventors: 马晓宇; 王兴隆
Original assignee: Suzhou Metabrain Intelligent Technology Co Ltd
Current assignee: Suzhou Metabrain Intelligent Technology Co Ltd
Priority date: 2024-03-29
Filing date: 2024-03-29
Publication date: 2024-05-07
Anticipated expiration: 2044-03-29
Also published as: CN117992270B

Abstract

The embodiments of the application relate to the technical field of storage, and in particular to a memory resource management system, method, device, equipment and storage medium, which aim to manage and maintain memory resources effectively. The system comprises a computing node module, a high-speed interconnection exchange chip module, a memory resource module and a network exchange chip module. The computing node module comprises a first baseboard management controller and a central processing unit. The high-speed interconnection exchange chip module comprises a high-speed interconnection exchange chip, a second baseboard management controller and a memory resource management processor, and is used for managing memory resources, receiving fault information and carrying out fault positioning. The memory resource module comprises a third baseboard management controller, a memory expansion controller and a memory, wherein the third baseboard management controller is used for controlling the memory expansion controller, and the memory expansion controller is used for monitoring and managing the memory. The network exchange chip module is used for realizing network interconnection among the modules.

Description

Memory resource management system, method, device, equipment and storage medium

Technical Field

The embodiment of the application relates to the technical field of storage, in particular to a memory resource management system, a method, a device, equipment and a storage medium.

Background

With the continuous development of computer technology, demand for memory resources keeps increasing, and memory resource pooling technology has emerged in response. A memory pooling architecture mainly comprises a computing resource pool and a memory pool; it allows large-scale memory in the memory pool to be allocated flexibly and greatly improves the utilization of server hardware resources. How to effectively manage and maintain the memory resources in the memory pool, so as to ensure normal operation of services under the memory pooling architecture, is a key research problem in memory pooling technology.

In the related art, sensors are installed in the memory pool to monitor the memory in the pool, and maintenance personnel maintain the memory pool.

In the related art, faults of each hardware in the computing resource pool and the memory pool cannot be summarized in time, and the faulty hardware cannot be rapidly positioned, so that effective maintenance and management of the memory resources cannot be performed.

Disclosure of Invention

The embodiment of the application provides a memory resource management system, a method, a device, equipment and a storage medium, which aim at effectively managing and maintaining memory resources.

The first aspect of the present application provides a memory resource management system, the system comprising:

a computing node module, a high-speed interconnection exchange chip module, a memory resource module and a network exchange chip module;

The computing node module comprises a first baseboard management controller and a central processing unit, wherein the first baseboard management controller is used for controlling the central processing unit;

The high-speed interconnection exchange chip module comprises a high-speed interconnection exchange chip, a second baseboard management controller and a memory resource management processor, wherein the high-speed interconnection exchange chip is used for managing memory resources, the memory resource management processor is used for allocating corresponding memory to the computing node module through the high-speed interconnection exchange chip, and the second baseboard management controller is used for receiving fault information sent by the memory resource management processor, receiving the fault information sent by the computing node module and the memory resource module through the network exchange chip, and determining the corresponding faulty hardware according to the fault information;

The memory resource module comprises a third baseboard management controller, a memory expansion controller and a memory, wherein the third baseboard management controller is used for controlling the memory expansion controller, and the memory expansion controller is used for monitoring and managing the memory;

The network exchange chip module is used for realizing network interconnection among the computing node module, the high-speed interconnection exchange chip module and the memory resource module.

Optionally, the high-speed interconnection exchange chip is connected with the central processing unit and the memory resource management processor, and the memory resource management processor is connected with the second baseboard management controller;

the third baseboard management controller is connected with the memory expansion controller, and the memory expansion controller is connected with the high-speed interconnection exchange chip and the memory;

The network exchange chip is connected with the computing node module, the high-speed interconnection exchange chip module and the memory resource module.

Optionally, the third baseboard management controller obtains the state information of the memory through the memory expansion controller;

The third baseboard management controller generates first fault information when any numerical value in the state information of the memory exceeds a preset threshold value;

The third baseboard management controller sends the first fault information to the network switching chip module;

The network switching chip module sends the first fault information to the second baseboard management controller.

Optionally, the third baseboard management controller polls and acquires the generated alarm information in the running process;

generating second fault information when the alarm information is any one of preset alarm information;

The third baseboard management controller sends the second fault information to the network switching chip module;

And the network exchange chip module sends the second fault information to the second baseboard management controller.

Optionally, the first baseboard management controller generates third fault information under the condition that the fault alarm information exists in the memory expansion controller;

the first baseboard management controller sends the third fault information to the network switching chip module;

and the network exchange chip module sends the third fault information to the second baseboard management controller.

Optionally, the memory resource management processor performs fault identification on the high-speed interconnection switching chip;

the memory resource management processor sends the identified fourth fault information to a second baseboard management controller;

And the second baseboard management controller performs fault aggregation on all received fault information.

Optionally, under the condition that the second baseboard management controller receives the fault information, determining a computing node corresponding to the fault information and the memory according to a memory topology interconnection relationship read by a memory resource management processor from the high-speed interconnection exchange chip;

the second baseboard management controller records the fault information into a fault log;

The second baseboard management controller triggers the first baseboard management controller to check the running state of the central processing unit;

the second baseboard management controller controls the computing node to be powered off under the condition that the first baseboard management controller detects that the central processing unit cannot execute the computing task;

The second baseboard management controller adds an abnormal memory mark for the memory;

the second baseboard management controller sends the memory information of the memory to the memory resource management processor;

The memory resource management processor stops configuring tasks for the memory under the condition that the memory information is received;

The second baseboard management controller sends a resource allocation command to the memory resource management processor;

the memory resource management processor performs memory resource allocation under the condition of receiving the resource allocation command;

the second baseboard management controller controls the computing node to restart.

Optionally, the second baseboard management controller eliminates the abnormal memory mark when detecting that the memory repair is successful.
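
The handling sequence described above can be sketched in Python as follows. This is only an illustrative sketch, not the implementation of the application: the function names, the dictionary-based topology and the action tuples are hypothetical. The sketch also assumes that the abnormal memory mark is added for every fault, while power-off, reallocation and restart apply only when the central processing unit check fails; the text lists the steps in order without stating this dependency.

```python
def handle_fault(fault, topology, cpu_can_run, abnormal_marks, fault_log):
    """Walk the handling steps for one fault and return the actions in order."""
    memory = fault["memory"]
    # Locate the computing node bound to the faulty memory via the topology.
    node = next((n for n, mems in topology.items() if memory in mems), None)
    fault_log.append(fault)  # record the fault information in the fault log
    actions = []
    # Assumption: the node is powered off only if its CPU cannot run tasks.
    node_down = node is not None and not cpu_can_run(node)
    if node_down:
        actions.append(("power_off", node))
    abnormal_marks.add(memory)  # abnormal memory mark
    actions.append(("mark_abnormal", memory))
    actions.append(("stop_configuring", memory))
    if node_down:
        actions.append(("allocate_memory", node))
        actions.append(("restart", node))
    return actions


def on_repair(memory, abnormal_marks):
    """Remove the abnormal memory mark once the memory is repaired."""
    abnormal_marks.discard(memory)


# Hypothetical topology: computing node -> memories it currently uses.
topology = {"node0": {"mem0"}, "node1": {"mem1"}}
marks, log = set(), []
actions = handle_fault({"memory": "mem0"}, topology, lambda node: False, marks, log)
```

Returning the actions as data rather than performing them keeps the ordering easy to inspect; a real controller would issue the corresponding commands instead.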

A second aspect of the embodiment of the present application provides a memory resource management method, where the method includes:

during the running of the memories in the memory pool, polling and monitoring the memory state information of each memory in the memory pool;

Generating first fault information when any numerical value in the memory state information exceeds a preset threshold value;

and sending the first fault information to a second baseboard management controller.

Optionally, the method further comprises:

During the running of the memories in the memory pool, polling and monitoring alarm information sent by each memory in the memory pool;

Generating second fault information under the condition that the alarm information is any one piece of preset alarm information;

And sending the second fault information to the second baseboard management controller.

Optionally, the method further comprises:

Determining whether first fault alarm information exists in the memory expansion controller or not in the starting process of the computing node;

Generating third fault information when the first fault alarm information exists in the memory expansion controller;

And sending the third fault information to the second baseboard management controller.

Optionally, the method further comprises:

During the starting process of the memory resource management processor, identifying second fault alarm information in the high-speed interconnection exchange control chip;

Generating fourth fault information under the condition that the second fault alarm information is identified;

and sending the fourth fault information to the second baseboard management controller.

Optionally, the method further comprises:

When fault information is received, determining the computing node and the memory corresponding to the fault information according to a memory topology interconnection relationship;

Detecting whether a central processing unit corresponding to the computing node operates normally or not;

closing the computing node under the condition that the CPU is detected to be incapable of operating normally;

Adding an operation abnormality mark for the memory;

Transmitting the memory information of the memory to the memory resource management processor;

distributing new memory for the computing node;

Restarting the computing node under the condition that the memory of the computing node is distributed.

Optionally, the allocating the new memory for the computing node includes:

screening a plurality of memories which are not added with the operation abnormality marks from the memory pool;

determining any one of the memories in an idle state in the memories to which the operation abnormality mark is not added;

And under the condition that the memory is in a normal running state, the memory is distributed to be the memory corresponding to the computing node.

Optionally, the method further comprises:

and deleting the operation abnormality mark corresponding to the memory under the condition that the memory added with the operation abnormality mark is detected to be repaired.
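
The allocation and mark-deletion steps above can be illustrated with a short Python sketch. The `Memory` class and its fields, and the function names, are assumptions made for this example; the text only requires that any one idle, unmarked memory in a normal running state be chosen, and the sketch simply takes the first such memory.

```python
from dataclasses import dataclass


@dataclass
class Memory:
    name: str
    idle: bool = True
    healthy: bool = True    # "normal running state"
    abnormal: bool = False  # operation abnormality mark
    owner: str = ""


def allocate_memory(pool, node):
    """Allocate the first unmarked, idle, healthy memory in the pool to node."""
    unmarked = [m for m in pool if not m.abnormal]  # screening step
    for memory in unmarked:
        if memory.idle and memory.healthy:
            memory.idle = False
            memory.owner = node
            return memory
    return None  # nothing suitable is available


def clear_mark_if_repaired(memory, repaired):
    """Delete the operation abnormality mark once the memory is repaired."""
    if memory.abnormal and repaired:
        memory.abnormal = False


pool = [Memory("mem0", abnormal=True), Memory("mem1", idle=False), Memory("mem2")]
chosen = allocate_memory(pool, "node0")
```

Here `mem0` is skipped because it carries the mark and `mem1` because it is busy, so `mem2` is allocated; after `clear_mark_if_repaired(pool[0], True)` the repaired `mem0` becomes eligible again.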

A third aspect of an embodiment of the present application provides a memory resource management device, where the device includes:

The memory state information determining module is used for polling and monitoring the memory state information of each memory in the memory pool during the running of the memory in the memory pool;

The first fault information generation module is used for generating first fault information when any numerical value in the memory state information exceeds a preset threshold value;

And the first fault information sending module is used for sending the first fault information to the second baseboard management controller.

Optionally, the apparatus further comprises:

the memory alarm monitoring module is used for polling and monitoring alarm information sent by each memory in the memory pool during the operation of the memory in the memory pool;

the second fault information generation module is used for generating second fault information under the condition that the alarm information is any one of preset alarm information;

And the second fault information sending module is used for sending the second fault information to the second baseboard management controller.

Optionally, the apparatus further comprises:

the first fault alarm information detection module is used for determining whether first fault alarm information exists in the memory expansion controller or not in the starting process of the computing node;

the third fault information generation module is used for generating third fault information when the first fault alarm information exists in the memory expansion controller;

And the third fault information sending module is used for sending the third fault information to the second baseboard management controller.

Optionally, the apparatus further comprises:

the second fault alarm information detection module is used for identifying second fault alarm information in the high-speed interconnection exchange control chip in the starting process of the memory resource management processor;

a fourth fault information generating module, configured to generate fourth fault information when the second fault alarm information is identified;

and the fourth fault information sending module is used for sending the fourth fault information to the second baseboard management controller.

Optionally, the apparatus further comprises:

The hardware determining module is used for determining the computing node and the memory corresponding to the fault information according to the memory topology interconnection relation when the fault information is received;

The running state detection module is used for detecting whether the central processing unit corresponding to the computing node runs normally or not;

The computing node closing module is used for closing the computing node under the condition that the CPU is detected to be incapable of operating normally;

The operation abnormal mark adding module is used for adding an operation abnormal mark for the memory;

The memory information sending module is used for sending the memory information of the memory to the memory resource management processor;

The memory allocation module is used for allocating the new memory for the computing node;

Optionally, the memory allocation module includes:

the memory screening submodule is used for screening a plurality of memories which are not added with the operation abnormality marks from the memory pool;

A memory determination submodule, configured to determine any one of the memories in an idle state from among the plurality of memories to which the operation exception flag is not added;

And the memory allocation sub-module is used for allocating the memory to the memory corresponding to the computing node under the condition that the memory is in a normal running state.

Optionally, the apparatus further comprises:

And the operation abnormal mark deleting module is used for deleting the operation abnormal mark corresponding to the memory under the condition that the memory added with the operation abnormal mark is detected to be repaired.

A fourth aspect of the embodiments of the present application provides a readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method according to the second aspect of the present application.

A fifth aspect of the embodiments of the present application provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the method according to the second aspect of the present application when executing the computer program.

The memory resource management system provided by the application comprises a computing node module, a high-speed interconnection exchange chip module, a memory resource module and a network exchange chip module. The computing node module comprises a first baseboard management controller and a central processing unit, wherein the first baseboard management controller is used for controlling the central processing unit. The high-speed interconnection exchange chip module comprises a high-speed interconnection exchange chip, a second baseboard management controller and a memory resource management processor, wherein the high-speed interconnection exchange chip is used for managing memory resources, the memory resource management processor is used for allocating corresponding memory to the computing node module through the high-speed interconnection exchange chip, and the second baseboard management controller is used for receiving fault information sent by the memory resource management processor, receiving the fault information sent by the computing node module and the memory resource module through the network exchange chip, and determining the corresponding faulty hardware according to the fault information. The memory resource module comprises a third baseboard management controller, a memory expansion controller and a memory, wherein the third baseboard management controller is used for controlling the memory expansion controller, and the memory expansion controller is used for monitoring and managing the memory. The network exchange chip module is used for realizing network interconnection among the computing node module, the high-speed interconnection exchange chip module and the memory resource module.

In this system, the computing node module, the high-speed interconnection exchange chip module, the memory resource module and the network exchange chip module together form a complete memory resource management system. The central processing unit on the computing node module uses the memory on the memory resource module when executing a computing task, and the memory expansion controller on the memory resource module monitors and manages the memory connected to it. The memory resource management processor can allocate corresponding memory to each computing node through the high-speed interconnection exchange chip, which realizes flexible allocation of memory resources. The second baseboard management controller receives, over the network, the fault information sent by the other modules, summarizes the faults and locates the faulty hardware, which facilitates maintenance of the memory resource pool and thereby achieves effective management and maintenance of the memory resource system.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments of the present application will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a schematic diagram of a memory resource management system according to an embodiment of the present application;

FIG. 2 is a schematic diagram illustrating an interconnection of memory resource management systems according to an embodiment of the present application;

FIG. 3 is a flowchart of a memory resource management method according to an embodiment of the present application;

FIG. 4 is a schematic diagram of a memory resource management device according to an embodiment of the present application;

FIG. 5 is a schematic diagram of an electronic device according to an embodiment of the present application.

Detailed Description

The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without any inventive effort, are intended to be within the scope of the application.

Referring to fig. 1, fig. 1 is a schematic diagram of a memory resource management system according to an embodiment of the application. As shown in fig. 1, the system includes:

a computing node module, a high-speed interconnection exchange chip module, a memory resource module and a network exchange chip module.

In this embodiment, the modules are all modules formed by components and circuits integrated on a circuit board.

In this embodiment, the computing node module is the component that executes computing tasks in the memory resource management system. The high-speed interconnection (CXL, Compute Express Link) exchange chip module is configured to manage and configure memory resources. The memory resource module is configured to monitor and manage the memory, and to connect the high-speed interconnection bus of the memory, through the high-speed interconnection exchange chip, to the computing nodes that share it. The network exchange chip module is connected to each board, and the boards are interconnected through the network.

The computing node module comprises a first baseboard management controller and a central processing unit, wherein the first baseboard management controller is used for controlling the central processing unit, and the central processing unit is used for executing computing tasks.

In this embodiment, as shown in fig. 1, the computing node module (CPU Board) includes a first baseboard management controller (CPU Board BMC) and a central processing unit (CPU). The central processing unit is connected to the first baseboard management controller through an LPC (Low Pin Count) bus, and is connected externally to the high-speed interconnection exchange chip through PCIe (a high-speed serial bus). The central processing unit uses memory resources in the memory pool when executing a computing task.

The high-speed interconnection exchange chip module comprises a high-speed interconnection exchange chip, a second baseboard management controller and a memory resource management processor, wherein the high-speed interconnection exchange chip is used for managing memory resources, the memory resource management processor is used for allocating corresponding memory to the computing node module through the high-speed interconnection exchange chip, and the second baseboard management controller is used for receiving fault information sent by the memory resource management processor, receiving the fault information sent by the computing node module and the memory resource module through the network exchange chip, and determining the corresponding faulty hardware according to the fault information.

In this embodiment, as shown in fig. 1, the high-speed interconnection switch chip module includes a high-speed interconnection switch chip (CXL SW Board), a second baseboard management controller (CXL SW BMC), and a memory resource management processor (mCPU, management CPU).

The high-speed interconnection exchange chip is a chip based on a high-speed interconnection protocol and is used for managing the memory in the memory pool. One end of the chip is connected to the central processing unit through a PCIe interface, and the other end is connected to an MXC (Memory Expander Controller, i.e. the memory expansion controller) through a corresponding interface. The chip is connected to the memory resource management processor through a PCIe bus, and is also connected through a UART (universal asynchronous receiver/transmitter)/I2C (Inter-Integrated Circuit bus) interface.

The second baseboard management controller receives, through the network exchange chip module, the third fault information sent by the first baseboard management controller on the computing node module and the first and second fault information sent by the third baseboard management controller, and receives the fourth fault information sent by the memory resource management processor. It gathers the fault information, performs fault location according to a pre-stored memory topology interconnection relationship, and locates the faulty hardware. As shown in fig. 1, the second baseboard management controller is connected to the memory resource management processor through an LPC interface and an SGMII (Serial Gigabit Media Independent Interface) interface.
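
The fault-gathering step performed by the second baseboard management controller can be sketched as follows. This is a hedged illustration only: the message format (a dictionary with `kind` and `hardware` keys) and the hardware identifiers are assumptions, and the application does not specify how fault information is encoded.

```python
from collections import defaultdict


def aggregate_faults(fault_messages):
    """Group fault kinds by the hardware they name, preserving arrival order."""
    by_hardware = defaultdict(list)
    for message in fault_messages:
        by_hardware[message["hardware"]].append(message["kind"])
    return dict(by_hardware)


# Hypothetical fault messages as they might arrive at the second BMC.
messages = [
    {"kind": "first", "hardware": "mem0"},
    {"kind": "third", "hardware": "mxc0"},
    {"kind": "second", "hardware": "mem0"},
]
summary = aggregate_faults(messages)
```

Grouping by hardware identifier makes it straightforward to see that several reports point at the same component before the topology is consulted to locate the affected computing node.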

The memory resource management processor controls the distribution of memory resources through the high-speed interconnection exchange chip, and distributes corresponding memory resources for the plurality of computing node modules, wherein the memory in the memory resource modules can be distributed to the corresponding computing node modules at will, so that the flexible distribution of the memory resources is realized.

The memory resource module comprises a third baseboard management controller, a memory expansion controller and a memory, wherein the third baseboard management controller is used for controlling the memory expansion controller, the memory expansion controller is used for monitoring and managing the memory, the third baseboard management controller is connected with the memory expansion controller, and the memory expansion controller is connected with the high-speed interconnection exchange chip and the memory.

In this embodiment, as shown in fig. 1, the third baseboard management controller (DIMM BMC Board) is connected to the memory expansion controller. The third baseboard management controller is configured to monitor and manage the memory; when it detects that the running state of the memory is abnormal, or that the memory issues alarm information during operation, it generates corresponding fault information and sends it to the second baseboard management controller.

The memory expansion controller is used for controlling and managing the memory. It can acquire hardware sensor information of the memory during operation, such as temperature, voltage and power consumption, and transmits this information to the third baseboard management controller through the SMBus (System Management Bus) protocol. The memory resource module comprises a plurality of memory expansion controllers, each of which is connected to a plurality of memories, and the memory expansion controllers interact with the memories through the SMBus protocol.

The network exchange chip module is used for realizing network interconnection among the computing node module, the high-speed interconnection exchange chip module and the memory resource module, and the network exchange chip is connected with the computing node module, the high-speed interconnection exchange chip module and the memory resource module.

In this embodiment, as shown in fig. 1, the network switching chip module is connected to the high-speed interconnection switching chip module, connected to the computing node module, and connected to the memory resource module. The network exchange chip module can realize interconnection among the three modules.

Referring to fig. 2, fig. 2 is a network interconnection diagram of a memory resource management system according to an embodiment of the present application. As shown in fig. 2, the first baseboard management controller, the second baseboard management controller and the third baseboard management controller are each connected to the network switch chip module.

In this embodiment, the high-speed interconnection switching chip is topologically connected to upstream and downstream hardware: its upstream ports connect to the computing node modules and its downstream ports connect to the memory resource modules. The memory resource management processor configures the interconnection between the upstream and downstream ports of the chip and manages the slicing of memory resources. The second baseboard management controller interacts with the memory resource management processor to learn which downstream ports correspond to the memory used by each upstream computing node of the chip, which yields the connection topology between computing resources and memory resources.
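
One simple way to picture this topology is as a mapping from each upstream computing node to the downstream memory ports bound to it. The Python sketch below is illustrative only; the mapping layout, the port names and both helper functions are assumptions, not part of the application.

```python
# Hypothetical topology: upstream port (computing node) -> downstream ports (memories).
TOPOLOGY = {
    "node0": {"mem0", "mem1"},
    "node1": {"mem2"},
}


def node_for_memory(topology, memory_port):
    """Return the computing node bound to memory_port, or None if unbound."""
    for node, memory_ports in topology.items():
        if memory_port in memory_ports:
            return node
    return None


def memories_for_node(topology, node):
    """Return the downstream memory ports bound to a computing node, sorted."""
    return sorted(topology.get(node, ()))
```

With such a mapping, a fault reported against `mem2` resolves to `node1`, which is the lookup the second baseboard management controller needs when it locates the computing node affected by a memory fault.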

And the third baseboard management controller acquires the state information of the memory through the memory expansion controller.

In this embodiment, the state information of the memory consists of the values of the individual metrics of the memory during operation, obtained by sensors deployed in advance.

In this embodiment, the third baseboard management controller interacts with the memory expansion controller through the SMBus (System Management Bus) protocol, and obtains the memory information of the memory in the running process from the memory expansion controller.

The third baseboard management controller generates first fault information when any numerical value in the state information of the memory exceeds a preset threshold value.

In this embodiment, the state information of the memory includes a plurality of values, each value has a preset threshold, and when any value in the state information of the memory exceeds the preset threshold, the first fault information is generated.

For example, the temperature information in the memory state information indicates that the current memory temperature is 90 degrees celsius, and the preset temperature threshold is 80 degrees celsius, so that the first fault information is generated at this time.
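
The threshold check can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the metric names (`temperature_c`, `voltage_v`, `power_w`), the voltage and power limits, and the shape of the fault record are hypothetical; only the 80 degree Celsius temperature limit and the 90 degree reading come from the example above.

```python
from dataclasses import dataclass

# Hypothetical thresholds; only the temperature limit is taken from the text.
THRESHOLDS = {"temperature_c": 80.0, "voltage_v": 1.3, "power_w": 15.0}


@dataclass(frozen=True)
class FaultInfo:
    kind: str        # "first" marks a threshold violation
    memory_id: str
    metric: str
    value: float
    threshold: float


def check_memory_state(memory_id, state, thresholds=THRESHOLDS):
    """Return one first-fault record per metric whose value exceeds its threshold."""
    faults = []
    for metric, limit in thresholds.items():
        value = state.get(metric)
        if value is not None and value > limit:
            faults.append(FaultInfo("first", memory_id, metric, value, limit))
    return faults


# The example from the text: 90 degrees Celsius against an 80 degree limit.
faults = check_memory_state("dimm0", {"temperature_c": 90.0, "voltage_v": 1.2})
```

Metrics missing from a reading are skipped rather than treated as violations, which is a design choice of this sketch and not something the application prescribes.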

And the third baseboard management controller sends the first fault information to the network switching chip module.

In this embodiment, the third baseboard management controller sends the first failure information to the network switch chip module after generating the first failure information.

In this embodiment, after receiving the first failure information, the network switching chip module sends the first failure information to the second baseboard management controller.

The third baseboard management controller polls for the alarm information generated while the memory is running.

In this embodiment, the third baseboard management controller interacts with the memory expansion controller through the MCTP over SMBus protocol (MCTP: Management Component Transport Protocol), and polls for the alarm information (Mailbox Event Record) arising during operation.

And generating second fault information when the alarm information is any one of preset alarm information.

In this embodiment, the preset alarm information is preset important alarm information.

In this embodiment, when the alarm information is any one of preset alarm information, the second fault information is generated.

For example, the preset alarm information may be a General Media Event Record, DRAM Event Record (dynamic random access memory event record), Memory Module Event Record, Physical Switch Event Record, Virtual Switch Event Record, MLD Port Event Record (multi-logical device port event record), Dynamic Capacity Event Record, and the like.
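The matching of polled alarm information against the preset alarm information can be sketched as a simple set membership test. The function name `second_faults` and the dictionary layout of the fault entries are assumptions made for illustration; the record type names are the ones listed in the text.

```python
from typing import Dict, Iterable, List

# Preset alarm types, taken from the list in the text.
PRESET_ALARMS = frozenset({
    "General Media Event Record",
    "DRAM Event Record",
    "Memory Module Event Record",
    "Physical Switch Event Record",
    "Virtual Switch Event Record",
    "MLD Port Event Record",
    "Dynamic Capacity Event Record",
})


def second_faults(memory_id: str, polled_records: Iterable[str]) -> List[Dict[str, str]]:
    """Return one second-fault entry for each polled record whose type is preset."""
    return [
        {"kind": "second", "memory_id": memory_id, "record_type": record}
        for record in polled_records
        if record in PRESET_ALARMS  # records outside the preset list are ignored
    ]


# Hypothetical polling result: one preset record and one unrelated record.
faults = second_faults("DIMM0", ["DRAM Event Record", "Informational Record"])
```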

And the third baseboard management controller sends the second fault information to the network switching chip module.

In this embodiment, the third baseboard management controller sends the second fault information to the network switching chip module.

In this embodiment, after receiving the second fault information, the network switching chip module sends the second fault information to the second baseboard management controller.

And the first baseboard management controller generates third fault information under the condition that the fault alarm information exists in the memory expansion controller.

In this embodiment, in the computing node module, the BIOS (basic input/output system) interacts with the first baseboard management controller by sending IPMI (Intelligent Platform Management Interface) commands over the LPC bus.

In this embodiment, while a computing node in the computing node module starts up, the BIOS checks whether fault alarm information (PCIE alarm information) exists for the memory expansion controller. If it does, the computing node cannot normally use the corresponding memory resource and therefore cannot execute computing tasks. In that case the BIOS sends an IPMI command to the first baseboard management controller, which generates the third fault information after identifying the fault alarm information of the corresponding memory expansion controller.

And the first baseboard management controller sends the third fault information to the network switching chip module.

In this embodiment, the first baseboard management controller sends the third fault information to the network switching chip module.

In this embodiment, after receiving the third fault information, the network switching chip module sends the third fault information to the second baseboard management controller.

And the memory resource management processor performs fault identification on the high-speed interconnection switching chip.

In this embodiment, in the high-speed interconnection switching chip module, the BIOS interacts with the second baseboard management controller by sending IPMI commands over the LPC bus.

In this embodiment, during the startup process of the memory resource management processor, the BIOS of the memory resource management processor identifies PCIE alarm information sent by the high-speed interconnect switching chip.

The memory resource management processor sends the identified fourth fault information to the second baseboard management controller.

In this embodiment, the memory resource management processor sends the identified fourth fault information to the second baseboard management controller in the form of an IPMI command.

In this embodiment, the second baseboard management controller gathers all the received fault information including the first fault information, the second fault information, the third fault information, and the fourth fault information.

In this embodiment, the second baseboard management controller serves as the chassis management controller (CMC) for the whole machine and performs unified fault summarization. The first baseboard management controller and the third baseboard management controller connect to the second baseboard management controller through a network hardware link and a command interface, and report fault information of hardware such as the memory, the memory expansion controller, and interfaces.
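The unified fault summary at the second baseboard management controller can be pictured as a small collector keyed by fault kind. This is a sketch under assumptions: the class `FaultAggregator`, the `FaultInfo` fields, and the reporter labels such as `BMC3` are hypothetical names, and the real transport (network link and command interface) is not modeled.

```python
from dataclasses import dataclass
from typing import Dict, List

KINDS = ("first", "second", "third", "fourth")


@dataclass(frozen=True)
class FaultInfo:
    kind: str         # one of KINDS
    reporter: str     # hypothetical label of the reporting controller
    hardware_id: str  # number of the hardware the fault refers to


class FaultAggregator:
    """Collects fault information from all reporters in one place."""

    def __init__(self) -> None:
        self._by_kind: Dict[str, List[FaultInfo]] = {kind: [] for kind in KINDS}

    def receive(self, fault: FaultInfo) -> None:
        if fault.kind not in self._by_kind:
            raise ValueError(f"unknown fault kind: {fault.kind}")
        self._by_kind[fault.kind].append(fault)

    def summary(self) -> Dict[str, int]:
        """Return the number of faults received per kind."""
        return {kind: len(items) for kind, items in self._by_kind.items()}


aggregator = FaultAggregator()
aggregator.receive(FaultInfo("first", "BMC3", "DIMM0"))      # threshold violation
aggregator.receive(FaultInfo("third", "BMC1", "EXPANDER0"))  # boot-time alarm
counts = aggregator.summary()
```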

When the second baseboard management controller receives the fault information, it determines the computing node and the memory corresponding to the fault information according to the memory topology interconnection relationship read by the memory resource management processor from the high-speed interconnection switching chip.

In this embodiment, the second baseboard management controller determines, under the condition of receiving the fault information, the computing node and the memory corresponding to the fault information according to the memory topology interconnection relationship read from the high-speed interconnection switching chip by the memory resource management processor.

In this embodiment, the memory resource management processor reads from the high-speed interconnection switching chip which computing node or memory expansion controller is connected to each interface, thereby obtaining the topology interconnection relationship of the whole system. The second baseboard management controller can obtain this relationship from the memory resource management processor and use it to determine the computing node and the memory corresponding to the fault information. Given the number of one piece of hardware, the number of the corresponding hardware can be looked up, or the memory module group configured with the memory slice can be located. One memory module consists of a plurality of memories, and one computing node may correspond to one memory module.

For example, the fault information is sent by DIMM0 (memory 0), and if the computing node corresponding to DIMM0 in the topology interconnection structure is CPU0, it is determined that the memory corresponding to the fault information is DIMM0 and the computing node is CPU0.
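The lookup in the DIMM0 example can be sketched as a two-way search over the topology interconnection relationship. The table contents beyond the DIMM0 to CPU0 pair, and the function name `locate`, are assumptions for illustration only.

```python
from typing import Dict, List, Tuple

# Hypothetical topology: memory number -> computing node number.
# Only the DIMM0 -> CPU0 pair comes from the example in the text.
TOPOLOGY: Dict[str, str] = {"DIMM0": "CPU0", "DIMM1": "CPU0", "DIMM2": "CPU1"}


def locate(hardware_id: str, topology: Dict[str, str]) -> Tuple[str, List[str]]:
    """Return (computing node, memories) for the number of either kind of hardware."""
    if hardware_id in topology:
        # The fault came from a memory: look up its computing node.
        return topology[hardware_id], [hardware_id]
    # Otherwise treat the number as a computing node and collect its memories.
    memories = sorted(mem for mem, node in topology.items() if node == hardware_id)
    if memories:
        return hardware_id, memories
    raise KeyError(f"{hardware_id} is not present in the topology")


node, memories = locate("DIMM0", TOPOLOGY)
```

Knowing the number of either side is enough: passing `"CPU0"` instead returns the same node together with every memory mapped to it.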

The second baseboard management controller records the fault information into a fault log.

In this embodiment, after determining the computing node and the memory corresponding to the fault information, the second baseboard management controller records the fault information in the fault log.

The second baseboard management controller triggers the first baseboard management controller to check the running state of the central processing unit.

In this embodiment, the second baseboard management controller sends a command to the first baseboard management controller through the network switching chip module, and triggers the first baseboard management controller to check the operation state of the central processing unit.

And the second baseboard management controller controls the computing node to be powered off under the condition that the first baseboard management controller detects that the central processing unit cannot execute the computing task.

In this embodiment, when the first baseboard management controller detects that the central processing unit cannot execute the computing task, the second baseboard management controller controls the computing node where the central processing unit is located to be powered off through the network.

The second baseboard management controller adds an abnormal memory mark for the memory.

In this embodiment, the second baseboard management controller adds an abnormal memory mark to the failed memory. That is, among all the memory resources, the memory or memory module that has triggered an abnormality and has not yet been repaired is marked.

The second baseboard management controller sends the memory information of the memory to the memory resource management processor.

In this embodiment, the second baseboard management controller sends the memory information of the memory to which the abnormal memory mark has been added to the memory resource management processor. The memory resource management processor will not use this portion of the memory resources in subsequent memory allocation or slice configuration tasks.

When the memory resource management processor receives the memory information, it stops configuring tasks for the memory.

In this embodiment, when the memory resource management processor receives the memory information of the failed memory, it no longer uses that memory resource in subsequent memory allocation or memory slice configuration tasks, and stops the tasks configured for that memory.

The second baseboard management controller sends a resource allocation command to the memory resource management processor.

In this embodiment, the second baseboard management controller sends a resource allocation command that instructs the memory resource management processor to reallocate memory for the corresponding computing node.

The memory resource management processor performs memory resource allocation upon receiving the resource allocation command.

In this embodiment, upon receiving the resource allocation command, the memory resource management processor performs memory resource allocation and reallocates normally usable memory to the computing node.

In this embodiment, the second baseboard management controller restarts the computing node after new, normally usable memory has been allocated to the computing node.

And the second baseboard management controller eliminates the abnormal memory mark under the condition that the memory repair is detected to be successful.

In this embodiment, when the second baseboard management controller detects that the memory repair is successful, the memory can be used normally again. The abnormal memory mark added to the memory is then eliminated, and the memory resource management processor can use this portion of the memory resources in subsequent configuration.

In this embodiment, the memory resource management system flexibly allocates the memory in the memory resource pool through the high-speed interconnection switching chip module. The second baseboard management controller collects and summarizes the fault information of each piece of hardware, locates the faulty hardware, and promptly notifies maintenance personnel to repair it. Use of a memory is suspended during repair and resumed after the repair is completed, so the running of computing tasks is not affected. Because the memory resources of the whole machine are distributed to different computing resource service nodes, the utilization rate of server hardware resources is greatly improved and the operation and maintenance cost is reduced.

Referring to fig. 3, fig. 3 is a flowchart of a memory resource management method according to an embodiment of the present application, where the method is applied to a memory resource management system, and specific steps are as follows:

S11: during the running of the memories in the memory pool, the memory state information of each memory in the memory pool is monitored in a polling mode.

In this embodiment, the memory pool is a memory cluster formed by a plurality of memories in the memory resource module.

In this embodiment, while the memories in the memory pool are running, the third baseboard management controller polls and monitors the memory state information of each memory in the memory pool.

For example, there are 10 memories, namely, memories 0 to 9, and the third baseboard management controller obtains the memory status information of each memory from memories 0 to 9 through the corresponding memory expansion controllers.
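One polling pass over memories 0 to 9 can be sketched as below. The reader callback `read_state` stands in for the access through the memory expansion controller and is an assumption, as are the function name `poll_pass`, the metric name, and the simulated readings.

```python
from typing import Callable, Dict, Iterable, List, Tuple

State = Dict[str, float]


def poll_pass(memory_ids: Iterable[int],
              read_state: Callable[[int], State],
              thresholds: Dict[str, float]) -> List[Tuple[int, str, float]]:
    """Visit each memory once and return (memory, metric, value) for every violation."""
    violations = []
    for memory_id in memory_ids:
        state = read_state(memory_id)  # stand-in for the expansion controller access
        for metric, value in state.items():
            if metric in thresholds and value > thresholds[metric]:
                violations.append((memory_id, metric, value))
    return violations


def simulated_read(memory_id: int) -> State:
    # Hypothetical readings: memory 3 runs hot, all others are normal.
    return {"temperature_c": 90.0 if memory_id == 3 else 45.0}


result = poll_pass(range(10), simulated_read, {"temperature_c": 80.0})
```

In a running system this pass would be repeated at an interval chosen by the implementer; the source does not state a polling period.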

S12: and when any numerical value in the memory state information exceeds a preset threshold value, generating first fault information.

In this embodiment, when any value in the memory state information exceeds a preset threshold, the first fault information is generated.

S13: and sending the first fault information to a second baseboard management controller.

In this embodiment, the third baseboard management controller sends the first fault information to the second baseboard management controller.

In this embodiment, the method further includes:

S14: and during the running of the memories in the memory pool, polling and monitoring alarm information sent by each memory in the memory pool.

In this embodiment, during the operation of the memories in the memory pool, the memories may send various alarm information, and the third baseboard management controller polls and monitors the alarm information sent by each memory in the memory pool.

S15: and generating second fault information under the condition that the alarm information is any one piece of preset alarm information.

In this embodiment, the preset alarm information is preset alarm information that is serious and affects the operation of the memory.

In this embodiment, when the alarm information sent by the memory is any one of the preset alarm information, it is indicated that the memory has a problem of affecting normal operation, and at this time, the second fault information is generated.

S16: and sending the second fault information to the second baseboard management controller.

In this embodiment, after the second fault information is generated, it is sent to the second baseboard management controller.

In this embodiment, the method further includes:

S17: and in the process of starting the computing node, determining whether first fault alarm information exists in the memory expansion controller.

In this embodiment, the first fault alert information is PCIE alert information sent by a memory expansion controller corresponding to the computing node.

In this embodiment, during the starting process of the computing node, whether the first fault alarm information exists in the memory expansion controller is determined by the first baseboard management controller.

S18: and generating third fault information when the first fault alarm information exists in the memory expansion controller.

In this embodiment, when the first fault alarm information exists in the memory expansion controller, the third fault information is generated.

S19: and sending the third fault information to the second baseboard management controller.

In this embodiment, the first baseboard management controller sends the third fault information to the second baseboard management controller through the network switching chip module.

In this embodiment, the method further includes:

S110: and in the starting process of the memory resource management processor, identifying second fault alarm information in the high-speed interconnection exchange control chip.

In this embodiment, the second fault alert information is PCIE alert information sent by the high-speed interconnection switching control chip.

In this embodiment, during the starting process of the memory resource management processor, the second fault alarm information in the high-speed interconnection exchange control chip is identified.

S111: and generating fourth fault information under the condition that the second fault alarm information is identified.

In this embodiment, when the memory resource management processor identifies the second fault alarm information, the fourth fault information is generated.

S112: and sending the fourth fault information to the second baseboard management controller.

In this embodiment, the memory resource management processor sends the fourth fault information to the second baseboard management controller.

In this embodiment, the method further includes:

S21: when fault information is received, determining the computing node and the memory corresponding to the fault information according to the memory topology interconnection relation.

In this embodiment, when the second baseboard management controller receives the fault information, the second baseboard management controller obtains the memory interconnection topological relation from the memory resource management processor, and determines the computing node and the memory corresponding to the fault information according to the memory interconnection topological relation.

In this embodiment, the fault information includes the number of the hardware sending the fault information, and each computing node corresponding to each memory is recorded in the memory interconnection topology relationship, so that the computing node and the memory corresponding to the fault information can be determined according to the memory interconnection topology relationship.

S22: and detecting whether the central processing unit corresponding to the computing node operates normally or not.

In this embodiment, the second baseboard management controller instructs the first baseboard management controller, through the network, to detect whether the central processing unit on the computing node is operating normally.

S23: and closing the computing node under the condition that the CPU is detected to be incapable of operating normally.

In this embodiment, the computing node is turned off when it is detected that the central processing unit cannot operate normally.

S24: and adding an operation abnormality mark for the memory.

In this embodiment, the operation abnormality mark is a mark indicating that an abnormal event has occurred in the memory and that the memory cannot operate normally.

S25: and sending the memory information of the memory to the memory resource management processor.

In this embodiment, after an operation exception flag is added to the failed memory, the memory information of the memory is sent to the memory resource management processor.

S26: and allocating the new memory for the computing node.

In this embodiment, the memory resource management processor allocates a new memory for the computing node, so that the computing node can normally execute the computing task.

S27: restarting the computing node once the new memory has been allocated to the computing node.

In this embodiment, the computing node is restarted if new memory is allocated for the computing node.
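Steps S21 to S27 can be read as one handling sequence, sketched below with injected callbacks in place of real hardware control. All names (`handle_fault`, `cpu_ok`, `allocate`) are hypothetical. The sketch also assumes that S24 to S27 run whether or not the node was powered off in S23, which the source lists in order but does not state explicitly.

```python
from typing import Callable, Dict, List, Optional, Set


def handle_fault(memory_id: str,
                 topology: Dict[str, str],
                 cpu_ok: Callable[[str], bool],
                 allocate: Callable[[str, Set[str]], Optional[str]],
                 marked: Set[str]) -> List[str]:
    """Run S21 to S27 for one faulty memory and return the actions taken, in order."""
    actions = []
    node = topology[memory_id]                     # S21: locate node and memory
    actions.append(f"locate {memory_id}->{node}")
    if not cpu_ok(node):                           # S22: check the CPU
        actions.append(f"power_off {node}")        # S23: close the node
    marked.add(memory_id)                          # S24: add the abnormality mark
    actions.append(f"mark {memory_id}")
    # S25: the marked set passed to allocate() stands in for the memory
    # information sent to the memory resource management processor.
    new_memory = allocate(node, marked)            # S26: allocate new memory
    if new_memory is not None:
        actions.append(f"allocate {new_memory}->{node}")
        actions.append(f"restart {node}")          # S27: restart the node
    return actions


marked: Set[str] = set()
actions = handle_fault(
    "DIMM0",
    {"DIMM0": "CPU0", "DIMM1": "CPU0"},
    cpu_ok=lambda node: False,                # simulate a CPU that cannot run tasks
    allocate=lambda node, excluded: "DIMM1",  # simulate a successful allocation
    marked=marked,
)
```

If no replacement memory is available, the sketch leaves the node without a restart action; the source does not describe that case.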

In this embodiment, the allocating the new memory for the computing node includes:

S27-1: and screening out, from the memory pool, a plurality of memories to which the operation abnormality mark has not been added.

In this embodiment, when allocating memory for a computing node, first, a plurality of memories to which an operation anomaly flag is not added are screened out from a memory pool.

S27-2: and determining any one of the memories in an idle state in the memories to which the operation abnormality marks are not added.

In this embodiment, after a plurality of memories are determined, any one of the memories in an idle state is determined from the plurality of memories.

S27-3: and under the condition that the memory is in a normal running state, the memory is distributed to be the memory corresponding to the computing node.

In this embodiment, the memory is allocated as the memory corresponding to the computing node when the memory is in a normal running state.
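The three screening sub-steps can be sketched as successive filters over the memory pool. The `Memory` fields and the function `allocate_memory` are hypothetical. The source does not say what happens when the chosen idle memory is not running normally; this sketch assumes the next candidate is tried.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Memory:
    memory_id: str
    abnormal_mark: bool = False    # operation abnormality mark
    idle: bool = True
    running_normally: bool = True


def allocate_memory(pool: Iterable[Memory]) -> Optional[Memory]:
    """Pick an unmarked, idle, normally running memory, or None if there is none."""
    unmarked = [m for m in pool if not m.abnormal_mark]  # screen out marked memories
    for candidate in unmarked:
        if not candidate.idle:                           # must be in an idle state
            continue
        if candidate.running_normally:                   # must be running normally
            candidate.idle = False                       # now assigned to the node
            return candidate
    return None


pool = [
    Memory("DIMM0", abnormal_mark=True),
    Memory("DIMM1", idle=False),
    Memory("DIMM2", running_normally=False),
    Memory("DIMM3"),
]
chosen = allocate_memory(pool)
```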

In this embodiment, the method further includes:

S31: and deleting the operation abnormality mark corresponding to the memory under the condition that the memory added with the operation abnormality mark is detected to be repaired.

In this embodiment, the memory resource management controller periodically checks the memories to which the operation abnormality mark has been added, and deletes the mark of a memory when that memory is detected to have been repaired. The memory then returns to the memory pool as normally running memory, and the memory resource management processor can use it when allocating memory.
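The periodic repair check can be sketched as one pass over the set of marked memories. The probe callback `is_repaired` is an assumption, since the source does not describe how a repair is detected, and the function name `clear_repaired` is hypothetical.

```python
from typing import Callable, List, Set


def clear_repaired(marked: Set[str], is_repaired: Callable[[str], bool]) -> List[str]:
    """Delete the mark of every marked memory that the probe reports as repaired."""
    repaired = sorted(mem for mem in marked if is_repaired(mem))
    marked.difference_update(repaired)  # these memories rejoin the usable pool
    return repaired


marked_memories = {"DIMM0", "DIMM3"}
# Hypothetical probe: only DIMM3 has been repaired so far.
released = clear_repaired(marked_memories, lambda mem: mem == "DIMM3")
```

Memories that remain in `marked_memories` stay excluded from allocation until a later pass finds them repaired.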

In this embodiment, the method is used in the memory resource management system, so that memory resources in the memory pool can be flexibly allocated, and in the case of a memory failure, normal memory can be allocated to the computing node, so that normal operation of computing tasks is ensured, and failed hardware can be rapidly located, so that operation and maintenance efficiency of the whole system is improved.

Based on the same inventive concept, an embodiment of the present application provides a memory resource management device. Referring to fig. 4, fig. 4 is a schematic diagram of a memory resource management device 400 according to an embodiment of the application. As shown in fig. 4, the apparatus includes:

a memory status information determining module 401, configured to, during operation of the memories in the memory pool, poll and monitor memory status information of each of the memories in the memory pool;

a first fault information generating module 402, configured to generate first fault information when any value in the memory status information exceeds a preset threshold;

And a first fault information sending module 403, configured to send the first fault information to the second baseboard management controller.

Optionally, the apparatus further comprises:

Optionally, the method further comprises:

Optionally, the memory allocation module includes:

Based on the same inventive concept, another embodiment of the present application provides a readable storage medium, on which a computer program is stored, which when executed by a processor, implements the steps of the memory resource management method according to any of the above embodiments of the present application.

Based on the same inventive concept, another embodiment of the present application provides an electronic device. Referring to fig. 5, fig. 5 is a schematic diagram of an electronic device 500 according to an embodiment of the present application, including a memory 502, a processor 501, and a computer program stored in the memory and capable of running on the processor, where the processor implements the steps of the memory resource management method according to any one of the foregoing embodiments of the present application when executing the computer program.

For the device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments for relevant points.

In this specification, the embodiments are described progressively; each embodiment focuses on its differences from the others, and identical or similar parts of the embodiments may be understood by reference to one another.

It will be apparent to those skilled in the art that embodiments of the present application may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the application may take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.

Embodiments of the present application are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal device to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal device, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiment and all such alterations and modifications as fall within the scope of the embodiments of the application.

Finally, it is further noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or terminal device that comprises the element.

The memory resource management method, device, equipment and storage medium provided by the present application have been described in detail above. Specific examples are used herein to illustrate the principles and implementations of the application, and the description of the above embodiments is only intended to help in understanding the method and core ideas of the application. Meanwhile, those skilled in the art may, in accordance with the ideas of the application, make changes to the specific embodiments and the scope of application. In summary, the content of this description should not be construed as limiting the application.

Claims

1. A memory resource management system, the system comprising:

2. The memory resource management system of claim 1, wherein the high-speed interconnect switch chip is coupled to the central processor and the memory resource management processor, the memory resource management processor being coupled to the second baseboard management controller;

3. The memory resource management system according to claim 1, wherein the third baseboard management controller obtains the status information of the memory through the memory expansion controller;

4. The memory resource management system according to claim 1, wherein the third baseboard management controller polls to acquire the alarm information generated in the running process of the memory;

5. The memory resource management system of claim 1, wherein the first baseboard management controller generates third failure information if it is identified that the memory expansion controller has failure alarm information;

6. The memory resource management system of claim 1, wherein the memory resource management processor performs fault identification on the high-speed interconnect switching chip;

7. The memory resource management system according to claim 1, wherein the second baseboard management controller determines the computing node and the memory corresponding to the fault information according to the memory topology interconnection relationship read by the memory resource management processor from the high-speed interconnection switching chip when the fault information is received;

8. The memory resource management system of claim 7, wherein the second baseboard management controller eliminates the abnormal memory tag if the memory repair is detected to be successful.

9. A memory resource management method, wherein the method is applied to the memory resource management system of any one of claims 1 to 8, and comprises:

10. The memory resource management method according to claim 9, wherein the method further comprises:

11. The memory resource management method according to claim 10, wherein the method further comprises:

12. The memory resource management method of claim 11, further comprising:

13. The memory resource management method of claim 12, further comprising:

Adding an operation abnormality mark for the memory;

distributing new memory for the computing node;

14. The memory resource management method of claim 13, wherein said allocating new memory for the computing node comprises:

15. The memory resource management method of claim 13, further comprising:

16. A memory resource management device, the device comprising:

17. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method according to any of claims 9 to 15.

18. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the method according to one of claims 9 to 15 when executing the computer program.