CN117370063A - Cloud server memory fault feature extraction method, system and related device

Info

Publication number: CN117370063A
Application number: CN202311389223.7A
Authority: CN
Inventors: 赵磊; 谢涛涛; 宋伟; 田雨; 尹萍
Original assignee: Inspur Cloud Information Technology Co Ltd
Current assignee: Inspur Cloud Information Technology Co Ltd
Priority date: 2023-10-24
Filing date: 2023-10-24
Publication date: 2024-01-09

Abstract

The application provides a method for extracting memory fault features of a server, comprising the following steps: when a feature extraction instruction is detected, calling an application interface module to acquire memory-related features of five dimensions in the server, the memory-related features comprising public information features, static information features, alarm information features, log information features and state information features; encoding the memory-related features into character strings; and transmitting the memory-related features, in character string form, to a feature receiver through a message queue for feature analysis. By collecting data in multiple dimensions, the application captures more memory usage patterns and fault features, improves data quality, and provides more sufficient and extensive data support for machine learning or deep learning techniques, thereby enabling more accurate and reliable memory fault prediction. The application also provides a cloud server memory fault feature extraction system, a computer-readable storage medium and an electronic device, which have the same beneficial effects.

Description

Cloud server memory fault feature extraction method, system and related device

Technical Field

The present invention relates to the field of servers, and in particular, to a method, a system, and a related device for extracting a memory failure feature of a server.

Background

In cloud services, memory is used to temporarily store programs and data, providing fast access and processing. However, memory failure is one of the most common computer hardware failures. To discover memory failures in a timely manner, conventional approaches are typically based on thresholds or rules, such as triggering an alarm when memory usage exceeds a certain threshold or when certain memory access patterns occur. Determining such thresholds or rules generally requires experience and expertise, and they may need constant adjustment and updating. They also cope poorly with complex nonlinear relationships, such as changes in memory usage patterns and interactions between different memory modules, and therefore produce a large number of false positives or false negatives.

Disclosure of Invention

The invention aims to provide a cloud server memory fault feature extraction method, a cloud server memory fault feature extraction system, a computer readable storage medium and electronic equipment, which can acquire memory related features of a plurality of cloud centers to execute memory fault analysis and prediction, and improve the accuracy and reliability of memory fault prediction.

In order to solve the technical problems, the application provides a method for extracting the memory fault characteristics of a server, which comprises the following specific technical scheme:

when a feature extraction instruction is detected, an application interface module is called to acquire memory related features of five dimensions in a server; the memory related features comprise public information features, static information features, alarm information features, log information features and state information features;

encoding the memory related features into character strings;

and transmitting the memory related features in the form of character strings to a feature receiver through a message queue so as to execute feature analysis.

Optionally, before calling the application interface module to acquire the memory-related features of five dimensions in the server, the method further includes:

encapsulating any one, or a combination of any several, of a resource management interface, a physical machine management interface, a server configuration management interface, a remote data communication management interface, a baseboard management controller interface, a system and service supervision interface and a document management interface to obtain the application interface module.

Optionally, calling the application interface module to acquire the memory-related features of five dimensions in the server includes:

calling the application interface module to generate the public information features from the current acquisition time, the cloud server name, the node name and the server parameters;

calling the application interface module to obtain the static information features from the node static information, the memory topology information, the error detection and correction topology information and the processor topology information;

calling the application interface module to acquire an advanced error report, an external device error log, a memory abnormal event, a machine record check exception data table and an intelligent management controller notification event in the server, and determining the alarm information features;

calling the application interface module to acquire an in-band log and an out-of-band log in the server to obtain the log information features;

acquiring a memory index, a processor index, a disk index and a file system index at a set acquisition frequency to obtain the state information features; the memory metrics include available memory size, free memory size, total memory size, free swap space size, cache memory size, available memory ratio, free swap space ratio and cache memory ratio.

The application provides a server memory fault feature extraction system, which comprises:

the feature acquisition module is used for calling the application interface module to acquire memory related features of five dimensions in the server when the feature extraction instruction is detected; the memory related features comprise public information features, static information features, alarm information features, log information features and state information features;

the feature coding module is used for coding the memory related features into character strings;

and the feature transmission module is used for transmitting the memory related features in the form of the character strings to a feature receiver through the message queue so as to execute feature analysis.

The present application also provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method as described above.

The application also provides a server comprising a memory in which a computer program is stored and a processor which when calling the computer program in the memory implements the steps of the method as described above.

The application provides a method for extracting memory fault characteristics of a server, which comprises the following steps: when a feature extraction instruction is detected, an application interface module is called to acquire memory related features of five dimensions in a server; the memory related features comprise public information features, static information features, alarm information features, log information features and state information features; encoding the memory related features into character strings; and transmitting the memory related features in the form of character strings to a feature receiver through a message queue so as to execute feature analysis.

According to the method and the device, through collecting data of multiple dimensions such as public information features, static information features, alarm information features, log information features and state information features, more memory use modes and fault features can be captured, the data quality is improved, more sufficient and extensive data support is provided for machine learning or deep learning technology, and therefore more accurate and reliable memory fault prediction is achieved.

The application further provides a cloud server memory fault feature extraction system, a computer-readable storage medium and an electronic device, which have the same beneficial effects; these are not repeated here.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present application, and that other drawings may be obtained according to the provided drawings without inventive effort to a person skilled in the art.

Fig. 1 is a flowchart of a method for extracting a memory failure feature of a cloud server according to an embodiment of the present application;

Fig. 2 is a schematic structural diagram of an extraction system of a memory failure feature of a cloud server according to an embodiment of the present application.

Detailed Description

For the purposes of making the objects, technical solutions and advantages of the embodiments of the present application more clear, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.

Referring to Fig. 1, which is a flowchart of a method for extracting a memory failure feature of a cloud server according to an embodiment of the present application, the method includes:

S101: when a feature extraction instruction is detected, an application interface module is called to acquire memory-related features of five dimensions in a server; the memory-related features comprise public information features, static information features, alarm information features, log information features and state information features;

S102: encoding the memory-related features into character strings;

S103: transmitting the memory-related features in the form of character strings to a feature receiver through a message queue so as to execute feature analysis.
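Steps S101 to S103 can be illustrated with a minimal Python sketch. It is not the implementation of the application: the function names, the placeholder feature values, the use of JSON as the string encoding and the in-process `queue.Queue` standing in for a real message queue are all assumptions made for illustration.

```python
import json
import queue
from datetime import datetime, timezone

# The five feature dimensions named in the text.
DIMENSIONS = ("public", "static", "alarm", "log", "state")


def collect_features(node_name, cloud_name):
    """Stand-in for S101: return one dict per dimension (placeholder values)."""
    public = {
        "collect_time": datetime.now(timezone.utc).isoformat(),
        "cloud_name": cloud_name,
        "node_name": node_name,
    }
    features = {"public": public}
    for dim in DIMENSIONS[1:]:
        # A real collector would call the wrapped API module here.
        features[dim] = {}
    return features


def encode_features(features):
    """S102: encode the features into a character string (JSON is an assumption)."""
    return json.dumps(features, sort_keys=True)


def send_features(message, mq):
    """S103: hand the string to a message queue (in-process stand-in)."""
    mq.put(message)


mq = queue.Queue()
send_features(encode_features(collect_features("node-01", "cloud-a")), mq)

# The feature receiver takes the string off the queue and decodes it.
received = json.loads(mq.get())
print(sorted(received))
```

Keeping collection, encoding and transmission as separate functions mirrors the three steps, so any one of them could be swapped (for example, a different encoding) without touching the others.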

The specific content and source of the feature extraction instruction are not limited herein: it may originate from a periodic feature extraction setting of the server, or it may be generated in response to a user operation. The feature extraction instruction is used to extract memory-related features in five dimensions, namely public information features, static information features, alarm information features, log information features and state information features.

Specifically, the application interface module may be called to generate the public information features from the current acquisition time, the cloud server name, the node name and the server parameters. The purpose of the public information features is that every extracted piece of static information, alarm information, log information or state information carries the public information, so that each record is explicitly associated with its cloud center, node name and so on. Table 1 is the public information feature table:

Table 1 Public information feature table

The static information features can be obtained from node static information, memory topology information, error detection and correction topology information and processor topology information. Static information refers to features that remain unchanged during the time interval of the prediction task, or that characterize server and memory attributes. It is divided into node static information, memory topology information, error detection and correction topology information and processor topology information. Table 2 is the node static information table, which lists common node static information features:

The memory topology information is recorded per memory module of a server in a cloud center: the relevant static information of each memory module is recorded. All of this information is defined by the manufacturer and exposed through dmidecode for users to query. Table 3 is the memory topology information table:

Table 3 Memory topology information table

The error detection and correction topology information, namely EDAC (Error Detection and Correction) topology information, is likewise recorded per memory module of a server in a cloud center, and records the relevant static information of each memory module. The difference between the two is that the EDAC topology information is parsed by the EDAC driver of the Linux kernel and covers the memory controller and the related DIMM information, whereas the memory topology information is provided by the manufacturer and exposed through the dmidecode software for users to query. Table 4 is the error detection and correction topology information table:

Table 4 Error detection and correction topology information table

The processor topology information is recorded per processor chip of a server in a cloud center, and records the relevant static information of each processor chip. Table 5 is the processor topology information table:

Table 5 Processor topology information table

For the alarm information feature, an advanced error report, an external device error log, a memory exception event, a machine record check exception data table and an intelligent management controller notification event in the server can be acquired, and the alarm information feature is determined.

The alarm information refers to what is recorded, when a server in a cloud center has a memory fault, in places such as the rasdaemon service, the rasdaemon database, the EDAC kernel driver, the abnormal indexes monitored by Prometheus (including various in-band and out-of-band indexes) and the out-of-band system event log.

Advanced Error Reporting (AER) is a feature of PCI Express (PCIe) that provides more powerful reporting and handling of device errors. It covers the various hardware errors that a device, driver or operating system may encounter, including but not limited to data transmission errors, device failures, and invalid or illegal requests. AER is primarily concerned with PCIe device and bus errors that may affect the correct transfer of data, rather than with memory events specifically. However, AER may also record memory-related events (for example, if a PCIe device attempts to access an invalid memory address), so the alarm information includes the advanced error report in order to ensure feature completeness. Table 6 is the advanced error reporting information table:

Table 6 Advanced error reporting information table

External device error log (Extlog) events are system external log events that are typically related to errors of external devices, such as power supply errors and temperature anomalies. The feature names and meanings describe the attributes of an external log event, such as the unique identifier of the event, time of occurrence, type, error count, severity, address, FRU ID, FRU text information and CPER data. An administrator can use these features to diagnose errors in external devices, such as power failures and temperature anomalies, and so improve the reliability and stability of the system. Likewise, to ensure feature completeness, the alarm information includes the external device error log. Table 7 is the external device error log information table:

Table 7 External device error log information table

The machine record check exception table is a table that records Memory Controller Event events (which may be referred to simply as mc_event). The memory controller is a hardware component in a computer that manages and controls memory access. Memory Controller Event events generally refer to errors or exception events associated with the memory controller, such as memory read-write errors and memory check errors. In this application, the EDAC records and the IPMI system event log are merged with mc_event, which makes it easy to see how each mc_event relates to the EDAC records and whether it caused a node restart. See Table 8:

Table 8 Machine record check exception data table

The intelligent management controller notification event is a data table that records machine check exceptions (Machine Check Exception, MCE), which may be referred to simply as mce_record, and includes the values of various registers. Machine check exceptions generally refer to errors occurring in the processor or system hardware, and also include memory errors. Likewise, in this application the EDAC records and the IPMI system event log are merged with mce_record; from the perspective of memory failure, mc_event and mce_record are strongly correlated. See Table 9:

Table 9 Intelligent management controller notification event table

For the log information features, the application interface module can be called to acquire the in-band log and the out-of-band log in a server. The log information is thus classified into in-band logs and out-of-band logs.

Sources of in-band logs include the syslog log, the kern.log log, the dmesg log, the rasdaemon service log and the messages log. Sources of out-of-band logs include the ipmitool system event log, the audit log and the blackbox PECI log. In the embodiment of the application, the in-band and out-of-band logs can be automatically imported into Elasticsearch. Elasticsearch allows a large amount of log data to be searched and analyzed quickly, supports query modes such as full-text search, filtering and aggregation, and provides powerful data analysis and visualization functions.

For the state information features, the memory index, the processor index, the disk index and the file system index can be acquired at a set acquisition frequency. The acquisition frequency is not limited here; for example, collection may take place once a day, and those skilled in the art may set it as required. The state information is time-series data that changes frequently, possibly on the scale of minutes.

The memory metrics may include available memory size, free memory size, total memory size, free swap space size, cache memory size, available memory ratio, free swap space ratio and cache memory ratio.
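As a rough illustration of how the ratio metrics relate to the size metrics, the sketch below derives them from text in the format of Linux `/proc/meminfo`. The data source, the function names and the choice of denominators (total memory for the memory ratios, total swap for the swap ratio) are assumptions; the application does not specify them.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer kB values."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            values[key.strip()] = int(fields[0])
    return values


def memory_ratios(info):
    """Derive the ratio metrics from the size metrics (assumed denominators)."""
    total = info["MemTotal"]
    swap_total = info.get("SwapTotal", 0)
    return {
        "available_ratio": info["MemAvailable"] / total,
        "cached_ratio": info["Cached"] / total,
        # Guard against hosts with no swap configured.
        "swap_free_ratio": info["SwapFree"] / swap_total if swap_total else 0.0,
    }


# Hypothetical sample values, not measurements from a real server.
SAMPLE = """\
MemTotal:       16000000 kB
MemFree:         4000000 kB
MemAvailable:    8000000 kB
Cached:          2000000 kB
SwapTotal:       2000000 kB
SwapFree:        1000000 kB
"""

ratios = memory_ratios(parse_meminfo(SAMPLE))
print(ratios)
```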

The processor metrics may include processor usage time, guest processor usage time, lower-priority guest processor usage time, CPU idle time, CPU time spent waiting for I/O, processor time spent servicing hardware interrupts, processor usage time (lower priority), processor time spent servicing software interrupts, processor time taken by other virtual machines in a virtualized environment, processor usage time (system level) and processor usage time (user level).

The disk index may include a disk I/O operation time, a weighted disk I/O operation time, a number of disk read bytes, a number of disk read operations completed, a number of disk write bytes, and a number of disk write operations completed.

The file system metrics may include metrics such as the size of available space for the file system, the total size of space for the file system, the total number of files for the file system, the number of files available for the file system, whether the file system is read-only, and errors in the file system devices.

After the data of multiple dimensions are obtained, data fusion and data encoding can be performed and a corresponding machine learning data set created, so that a model can be trained to predict memory failures. Data cleansing and data analysis may also be performed in this process. In one possible implementation, a rolling window is set to extract statistical features and to determine the data trend and pattern over a certain period of time, in order to judge whether a memory failure is likely to occur or to estimate its probability; feature processing may also combine periodic features and frequency-domain features. A smoothed memory availability value, referred to simply as the smoothed value, may also be obtained with an exponentially weighted moving average. The smoothed values capture the trend in the memory availability values and reduce the effect of noise between observations. When creating the model, a variety of common machine learning models may be tried, including decision trees, support vector machines and neural networks, and the most appropriate model selected by comparison.
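The rolling-window statistics and the exponentially weighted moving average mentioned above can be sketched as follows. This is an illustrative example only: the window size, the smoothing factor `alpha`, the choice of mean and standard deviation as the window statistics, and the sample series are assumptions, not values taken from the application.

```python
from statistics import mean, pstdev


def rolling_stats(values, window):
    """Return (mean, population std-dev) for each full trailing window."""
    stats = []
    for end in range(window, len(values) + 1):
        chunk = values[end - window:end]
        stats.append((mean(chunk), pstdev(chunk)))
    return stats


def ewma(values, alpha):
    """Exponentially weighted moving average:
    s_t = alpha * x_t + (1 - alpha) * s_{t-1}, seeded with the first value."""
    smoothed = []
    for x in values:
        prev = smoothed[-1] if smoothed else x
        smoothed.append(alpha * x + (1 - alpha) * prev)
    return smoothed


# Hypothetical series of available memory in GiB, sampled at a fixed interval.
available_gib = [8.0, 6.0, 4.0, 2.0]

print(rolling_stats(available_gib, window=2))
print(ewma(available_gib, alpha=0.5))
```

A larger `alpha` makes the smoothed value follow recent observations more closely, while a smaller one suppresses more noise at the cost of reacting later to a genuine downward trend.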

According to the method and the device for predicting the memory faults, through collecting data of multiple dimensions such as public information features, static information features, alarm information features, log information features and state information features, more memory use modes and fault features can be captured, data quality is improved, more sufficient and extensive data support is provided for machine learning or deep learning technologies, and therefore more accurate and reliable memory fault prediction is achieved.

On the basis of the above embodiment, in order to obtain the features of each dimension, the API (Application Program Interface) module may be redesigned to ensure that features of the different dimensions in the server can be obtained.

Specifically, any one or a combination of any several of a resource management interface, a physical machine management interface, a server configuration management interface, a remote data communication management interface, a motherboard management controller interface, a system and service supervision interface, and a document management interface may be encapsulated, so as to obtain the application interface module in the above embodiment.

For the resource management interface: in a Kubernetes cluster, physical machine resources can be automatically managed and scheduled through the Kubernetes API server and various controllers. Here custom_api is used for custom resources and core_v1 for official Kubernetes resources. The result is a simple, easy-to-use interface through which the IPMI (Intelligent Platform Management Interface) information of the nodes can be conveniently acquired from the Kubernetes cluster for remote management and operation.

For the physical machine management interface, the encapsulation can be based on MAAS (Metal-as-a-Service), an automated physical machine management tool that can manage and schedule large-scale physical machine resources. By encapsulating the MAAS API, the IPMI information of the corresponding node can likewise be conveniently acquired for remote management and operation.

The server configuration management interface can be based on Salt, an automated operation and maintenance tool written in Python and built on the SaltStack framework, which allows a user to perform configuration management, software deployment, task scheduling and other operations across multiple servers.

For the remote data communication management interface, the SSH protocol may be applied. SSH is an encryption-based network protocol for secure data communication and remote management between remote hosts. The embodiment of the application can package the Salt API module and the SSH module together, allowing commands to be executed and files to be copied on batches of nodes.

The BMC interface is based on the BMC (Baseboard Management Controller), an independent, embedded hardware device for managing the hardware resources of a computer system. BMC usage and configuration differ between server manufacturers, so the BMCs of the different manufacturers are wrapped into a single API, which makes it convenient to obtain the out-of-band logs of servers of different models.

For the system and service supervision interface, an API wrapped around Prometheus may be applied. Prometheus is a powerful monitoring system and time series database that helps users monitor and analyze systems and services comprehensively, thereby improving system reliability and stability. Collecting the alarm information and the state information requires calling the Prometheus API, so the Prometheus API is wrapped to allow abnormal index values and their corresponding points in time to be checked, and to make it convenient to acquire the daily index values of the state information.

The document management interface is mainly based on Elasticsearch, which provides efficient document storage and retrieval, supports real-time data query and analysis, offers scalability and high availability, and is widely used in full-text search, log analysis, data mining and similar fields. By wrapping the Elasticsearch module, the memory-fault-related logs of all nodes in each cloud center can be obtained in batches.

The feature extraction process acquires the memory-related features by calling the API module. The acquired features can then be exported to a database and log files in the local cloud center. To facilitate feature aggregation and feature analysis, a feature transmission module may be deployed at each cloud center or server to send the databases and log files to a common message queue, such as a RabbitMQ message queue. For transmission and storage, the binary data are encoded into printable ASCII characters using the base64 module. Specifically, the base64.b64encode() method converts the file data from binary format into base64-encoded form, and the decode() method then converts the encoded data into a character string for transmission in the message queue. Base64-encoded data can be transmitted and stored across different computers, operating systems and network environments without the data corruption that character set or byte order problems can cause. At the receiving end, the base64-encoded data can be decoded back into the original binary format by the corresponding decoding method for subsequent processing.
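The encoding round trip described above can be sketched with the Python standard library. The helper names and the sample payload are assumptions for illustration; publishing the resulting string to RabbitMQ would require a client library and is not shown.

```python
import base64


def encode_file_bytes(data: bytes) -> str:
    """Base64-encode binary data and return it as an ASCII string."""
    return base64.b64encode(data).decode("ascii")


def decode_message(message: str) -> bytes:
    """Recover the original bytes on the receiving side."""
    return base64.b64decode(message)


# Hypothetical payload mixing arbitrary bytes with UTF-8 text.
payload = bytes([0, 255, 16, 32]) + "内存".encode("utf-8")

message = encode_file_bytes(payload)

# The string contains only printable ASCII and decodes back losslessly.
print(message.isascii())
print(decode_message(message) == payload)
```

Base64 output is roughly one third larger than its input, which is the price paid for a transport-safe text representation.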

A corresponding database and configuration files may also be used for storage and management. The following is one feasible database and configuration file design:

Python object-relational mapping (ORM), implemented with the SQLAlchemy library, is used to define the structure of multiple data tables for storing and managing data in the database. Specifically, the code defines a number of classes, each corresponding to one data table. Each class defines the fields and data types of its table, along with other attributes and methods such as the table name, primary keys, foreign keys and indexes. The designed tables include:

1) ComputeNodeStaticInfoDB: stores static information of the computing node, including the operating system, kernel, CPU and so on.

2) MemoryTopologyInfoDB: stores memory topology information, including memory addresses, sizes, types and so on.

3) EdacTopologyInfoDB: stores EDAC (Error Detection and Correction) topology information, including controller names, memory labels and so on.

4) CpuTopologyInfoDB: stores CPU topology information, including the CPU model, core count, thread count and so on.

5) AerEventAlertDB: stores error information related to AER (Advanced Error Reporting) events.

6) ExtlogEventAlertDB: stores error information related to Extlog (External log) events.

7) McEventAlertDB: stores error information related to Mc (Memory Controller Event) events.

8) MceRecordAlertDB: stores error information related to Mce (Machine Check Exception) events.

9) Bmc_Trap_Mem_AlertDB: stores error information related to BMC (Baseboard Management Controller) trap events.

10) ComputeNodeStatusInfoDB: stores node state information.

The configuration file module is designed as follows:

Configuration items are managed with oslo.config and can be divided into the following categories:

1) Default: defines default configuration items, such as the storage location for exported data.

2) Database: defines configuration items associated with the database, such as the database connection address.

3) K8s: defines configuration items used in the Kubernetes environment, such as the Kubernetes URL.

4) Lma: defines configuration items associated with LMA (Log Management and Analysis), such as the Prometheus username and password.

5) Outofband: defines configuration items related to out-of-band management, such as the MAAS user.

6) Rabbitmq: defines configuration items associated with the message queue, such as the RabbitMQ host and port.

7) Remote: defines the remote management mode of the environment, such as SSH or Salt user information.
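The grouping above can be pictured as an INI-style file with one section per category. The sketch below parses such a file with Python's standard `configparser` rather than oslo.config itself, purely so that it runs without extra dependencies. Every option name and value shown is hypothetical; only the seven category names come from the text.

```python
import configparser

# Hypothetical configuration; option names and values are placeholders.
SAMPLE_CONFIG = """
[DEFAULT]
export_dir = /tmp/mem-features

[database]
connection = sqlite:///features.db

[k8s]
url = https://k8s.example.invalid:6443

[lma]
prometheus_user = monitor

[outofband]
maas_user = admin

[rabbitmq]
host = mq.example.invalid
port = 5672

[remote]
mode = ssh
"""

parser = configparser.ConfigParser()
parser.read_string(SAMPLE_CONFIG)

# Typed access to the message queue settings.
mq_host = parser["rabbitmq"]["host"]
mq_port = parser.getint("rabbitmq", "port")

# sections() lists the named groups; DEFAULT is handled separately.
print(parser.sections())
print(mq_host, mq_port)
```

Grouping options by subsystem keeps credentials and endpoints for each wrapped interface in one place, so a deployment in a different cloud center only needs a different file.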

The following describes a system for extracting a memory failure feature of a server provided in an embodiment of the present application, where the extraction system described below and the method for extracting a memory failure feature of a server described above may be referred to correspondingly with each other.

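Referring to Fig. 2, which is a schematic structural diagram of an extraction system of a memory failure feature of a cloud server according to an embodiment of the present application, the present application provides an extraction system of a memory failure feature of a server, including:

the feature acquisition module, used for calling the application interface module to acquire memory-related features of five dimensions in the server when the feature extraction instruction is detected; the memory-related features comprise public information features, static information features, alarm information features, log information features and state information features;

the feature coding module, used for encoding the memory-related features into character strings;

the feature transmission module, used for transmitting the memory-related features in the form of character strings to a feature receiver through the message queue so as to execute feature analysis.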

The present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed, implements the steps provided by the above embodiments. The storage medium may include a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.

The present application also provides a server, which may include a memory and a processor, where the memory stores a computer program, and the processor may implement the steps provided in the foregoing embodiments when calling the computer program in the memory. The server may of course also include various network interfaces, power supplies, etc.

In the description, the embodiments are described in a progressive manner: each embodiment focuses on its differences from the others, and the parts that are the same or similar can be cross-referenced between embodiments. Because the system provided by the embodiment corresponds to the method provided by the embodiment, its description is relatively brief; for the relevant points, refer to the description of the method.

Specific examples are set forth herein to illustrate the principles and embodiments of the present application, and the description of the examples above is only intended to assist in understanding the methods of the present application and their core ideas. It should be noted that it would be obvious to those skilled in the art that various improvements and modifications can be made to the present application without departing from the principles of the present application, and such improvements and modifications fall within the scope of the claims of the present application.

It should also be noted that in this specification, relational terms such as first and second are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.

Claims

1. The extraction method of the memory fault characteristics of the cloud server is characterized by comprising the following steps of:

encoding the memory related features into character strings;

2. The extraction method according to claim 1, wherein before the calling the application interface module to obtain the memory related features of five dimensions in the server, the method further comprises:

3. The extraction method according to claim 1, wherein the calling the application interface module to obtain the memory related features of five dimensions in the server comprises:

4. The extraction method according to claim 1, wherein the calling the application interface module to obtain the memory related features of five dimensions in the server comprises:

5. The method of claim 4, wherein the invoking the application interface module to obtain the five-dimensional memory-related features in the server comprises:

6. The extraction method according to claim 5, wherein the calling the application interface module to obtain the memory related features of five dimensions in the server comprises:

7. The extraction method according to claim 5, wherein the calling the application interface module to obtain the memory related features of five dimensions in the server comprises:

8. The extraction system of the cloud server memory fault characteristics is characterized by comprising the following components:

9. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method according to any of claims 1-7.

10. A server comprising a memory in which a computer program is stored and a processor which, when calling the computer program in the memory, implements the steps of the method according to any of claims 1-7.