CN116595756A

CN116595756A - Digital twinning-based intelligent operation and maintenance method and device for data center

Info

Publication number: CN116595756A
Application number: CN202310553209.XA
Authority: CN
Inventors: 程铄; 徐一坚; 孙钊
Original assignee: Ping An Bank Co Ltd
Current assignee: Ping An Bank Co Ltd
Priority date: 2023-05-16
Filing date: 2023-05-16
Publication date: 2023-08-15

Abstract

The invention provides a digital twin-based intelligent operation and maintenance method and device for a data center, relating to the technical field of data operation and maintenance. The method comprises: responding to an abnormal occurrence event and acquiring a detection time range corresponding to it; selecting all monitoring index values within the detection time range and calculating their fluctuation rates; determining each monitoring index value whose fluctuation rate is greater than a preset fluctuation rate threshold as a preliminary screening abnormality index; calculating the fluctuation similarity between the preliminary screening abnormality index and the alarm index, and selecting the target monitoring index values whose fluctuation pattern is similar; sorting the target monitoring index values along the time dimension to draw a first abnormal relation map; and determining the monitoring object corresponding to the target monitoring index value at the first moment in the first abnormal relation map as the root cause of the abnormal occurrence event, and generating a root cause analysis so that the abnormal occurrence event can be maintained according to it. This both reduces labor cost and effectively improves the operation and maintenance efficiency of the data center.

Description

Digital twinning-based intelligent operation and maintenance method and device for data center

Technical Field

The invention relates to the technical field of data operation and maintenance, in particular to an intelligent operation and maintenance method and device for a data center based on digital twinning.

Background

A data center generally refers to the network of facilities and devices that transfer, accelerate, display, compute, and store data on Internet infrastructure. As data centers have become widespread, applications such as artificial intelligence and network security have continued to emerge, and as the number of servers and the volume of data grow, data center maintenance is receiving increasing attention.

In the related art, the daily operation and maintenance work of the data center is mainly implemented in the following matters:

1. Server delivery: server warehousing, racking, system deployment, network IP allocation, handover of the server for use, and the like;

2. Monitoring of servers and the machine room: monitoring of voltage, current, and the surrounding environment, such as machine room temperature and humidity;

3. Server maintenance: routine replacement of failed hardware (e.g., disk, CPU, memory);

4. Lifecycle management of servers: management of server-related lifecycle data, such as usage information, reassignment information, maintenance information, and repair records.

At present, most of the above work in data centers is done manually, which consumes considerable manpower, is inefficient, and prevents an effective information transmission channel from forming between the links.

In addition, because server lifecycle data is currently passed downstream mostly by manual entry into the relevant systems, the quality of the transmitted data depends entirely on the proficiency and care of the staff involved. Inaccurate, damaged, or lost data therefore occurs from time to time, and repairing it requires substantial manpower to search mail system records and the like, which severely limits the operation and maintenance efficiency of the data center.

Disclosure of Invention

Accordingly, the present invention is directed to a method and apparatus for intelligent operation and maintenance of a data center based on digital twinning, so as to alleviate the above technical problems.

In a first aspect, an embodiment of the present invention provides a digital twin-based intelligent operation and maintenance method for a data center, which is applied to a simulation platform, where the simulation platform is obtained by modeling the constituent elements of the data center at a preset proportion through three-dimensional simulation software and includes a physical model of the data center. The method includes: responding to an abnormal occurrence event and acquiring a detection time range corresponding to the abnormal occurrence event, where the abnormal occurrence event carries an alarm index; selecting all monitoring index values within the detection time range and calculating the fluctuation rate of all the monitoring index values within the detection time range; determining each monitoring index value whose fluctuation rate is greater than a preset fluctuation rate threshold as a preliminary screening abnormality index; calculating the fluctuation similarity between the preliminary screening abnormality index and the alarm index, and, based on the fluctuation similarity, selecting the preliminary screening abnormality index whose fluctuation pattern is similar to that of the alarm index as a target monitoring index value; sorting the target monitoring index values along the time dimension to draw a first abnormal relation map; and determining the monitoring object corresponding to the target monitoring index value at the first moment in the first abnormal relation map as the root cause of the abnormal occurrence event, and generating a root cause analysis based on the root cause so that the abnormal occurrence event can be maintained according to the root cause analysis.
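The final step of the first aspect, sorting the target monitoring index values along the time dimension and treating the earliest one as pointing at the root cause, might be sketched as follows. This is only an illustration: the `TargetIndex` record, its field names, the sample object labels, and the notion of an "onset time" for each index are assumptions introduced here and are not defined in the disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TargetIndex:
    """Hypothetical record for one target monitoring index (names are assumptions)."""
    monitored_object: str  # e.g. "rack-3/pdu"
    index_name: str        # e.g. "current"
    onset_time: float      # assumed: time at which this index first fluctuated abnormally


def locate_root_cause(targets):
    """Sort target indexes by onset time; the earliest one identifies the root cause."""
    if not targets:
        raise ValueError("no target monitoring indexes to rank")
    ordered = sorted(targets, key=lambda t: t.onset_time)
    return ordered[0].monitored_object, ordered


targets = [
    TargetIndex("server-17", "cpu_temperature", 105.0),
    TargetIndex("rack-3/pdu", "current", 100.0),
    TargetIndex("app-orders", "latency", 110.0),
]
root, ordered = locate_root_cause(targets)
```

In this sample data the power distribution unit's current deviates first, so its monitoring object is reported as the root cause, and the ordered list gives the propagation sequence.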

With reference to the first aspect, an embodiment of the present invention provides a first possible implementation manner of the first aspect, where the step of drawing the first abnormal relationship map includes: acquiring a pre-constructed abnormal relation map, wherein the abnormal relation map is constructed by taking all monitoring data of the data center as monitoring indexes; screening the target monitoring index value in the abnormal relation map, and drawing the first abnormal relation map according to the internal relation of the screened target monitoring index value in the abnormal relation map.
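One way to read the screening step above is as extracting, from the pre-constructed abnormal relation map, only the target monitoring indexes and the relations that connect them. The sketch below assumes the map is stored as an adjacency dictionary from each index to the indexes it can propagate a fault to; that representation, the function name, and the sample index names are illustrative assumptions rather than details from the disclosure.

```python
def induced_subgraph(relation_map, target_indexes):
    """Keep only the target indexes and the edges that connect two of them."""
    keep = set(target_indexes)
    return {
        node: [nbr for nbr in neighbours if nbr in keep]
        for node, neighbours in relation_map.items()
        if node in keep
    }


# Hypothetical pre-constructed map: index -> indexes that it can propagate a fault to.
relation_map = {
    "pdu_current": ["server_voltage"],
    "server_voltage": ["cpu_temperature", "disk_io"],
    "cpu_temperature": ["app_latency"],
    "disk_io": [],
    "app_latency": [],
}
first_map = induced_subgraph(
    relation_map, ["pdu_current", "server_voltage", "app_latency"]
)
```

Under this reading, indirect relations that pass through a non-target index (here `cpu_temperature`) are dropped; an implementation that wants to preserve them would need a path-based rule instead.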

With reference to the first aspect, an embodiment of the present invention provides a second possible implementation manner of the first aspect, wherein the step of generating root cause analysis based on the root cause includes: judging whether object analysis corresponding to the root cause causing the abnormal occurrence event is recorded in a pre-established root cause analysis library; if yes, generating root cause analysis of the abnormal occurrence event based on object analysis recorded in the root cause analysis library; if not, adding the object analysis of the monitoring object corresponding to the root cause in the root cause analysis library, and generating the root cause analysis of the abnormal occurrence event according to the added object analysis.
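The lookup-or-add behaviour of the root cause analysis library described above can be sketched as follows. The dictionary-backed library, the function name `get_or_create_analysis`, the `build_analysis` callback, and the sample entries are all hypothetical; the disclosure does not specify how the library is stored or how a new object analysis is produced.

```python
def get_or_create_analysis(library, monitored_object, build_analysis):
    """Return the recorded analysis for the object, recording a new one first if needed."""
    if monitored_object not in library:
        # Not recorded yet: add the object analysis to the library.
        library[monitored_object] = build_analysis(monitored_object)
    return library[monitored_object]


# Hypothetical library keyed by monitoring object.
library = {"rack-3/pdu": "Check PDU load balancing and breaker state."}

known = get_or_create_analysis(library, "rack-3/pdu", lambda obj: f"Inspect {obj}.")
new = get_or_create_analysis(library, "server-17/disk-2", lambda obj: f"Inspect {obj}.")
```

After the second call the library holds an entry for the new object, so a later abnormal occurrence event with the same root cause would take the "already recorded" branch.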

With reference to the first aspect, an embodiment of the present invention provides a third possible implementation manner of the first aspect, where the step of calculating the fluctuation rate of all the monitoring index values within the detection time range includes: calculating the fluctuation rate of all the monitoring index values within the detection time range according to the following formula:

where E denotes the fluctuation rate of the monitoring index value; t is the target time of the abnormal occurrence event, with the detection time range expressed as T(t-Δt, t+Δt); s_i is the i-th monitoring index value; and the remaining symbol in the formula denotes the average value of the monitoring index values.

With reference to the first aspect, an embodiment of the present invention provides a fourth possible implementation manner of the first aspect, where the step of calculating the fluctuation similarity of the preliminary screening abnormality index and the alarm index includes: calculating the similarity between the preliminary screening abnormality index and the alarm index by using a pre-established dynamic time warping (DTW) algorithm, where the calculation formula of the DTW algorithm is as follows:

where one of H and K is the data sequence of the preliminary screening abnormality index and the other is the data sequence of the alarm index; w_u denotes the u-th element value of the target path in the data matrix constructed from the data sequences H and K; and U denotes the number of element values in the target path.
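For orientation, the textbook form of dynamic time warping can be sketched as below. It builds the pairwise cost matrix for sequences H and K and accumulates the cheapest warping path through it. This is a generic DTW with an absolute-difference cost and no path-length normalization, offered as an assumption-laden illustration; it is not claimed to reproduce the exact formula of this implementation manner.

```python
def dtw_distance(h, k):
    """Classic dynamic time warping distance between two numeric sequences.

    Uses |h[i] - k[j]| as the local cost and returns the accumulated cost
    of the cheapest warping path (smaller means more similar).
    """
    if not h or not k:
        raise ValueError("both sequences must be non-empty")
    n, m = len(h), len(k)
    inf = float("inf")
    # acc[i][j] = minimal accumulated cost of aligning h[:i] with k[:j]
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(h[i - 1] - k[j - 1])
            acc[i][j] = cost + min(
                acc[i - 1][j],      # step in h only
                acc[i][j - 1],      # step in k only
                acc[i - 1][j - 1],  # step in both
            )
    return acc[n][m]


alarm = [1.0, 2.0, 3.0]
stretched = [1.0, 2.0, 2.0, 3.0]  # same shape, one sample repeated
offset = [2.0, 3.0, 4.0]          # same shape, shifted by a constant
same_shape = dtw_distance(alarm, stretched)
shifted = dtw_distance(alarm, offset)
```

Because the warping path may repeat samples, a candidate index whose fluctuation has the same shape as the alarm index but is stretched in time still scores as close, which is the property that makes DTW suitable for comparing fluctuation patterns.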

With reference to the first possible implementation manner of the first aspect, an embodiment of the present invention provides a fifth possible implementation manner of the first aspect, where the method further includes: acquiring monitoring data of the data center; drawing an abnormal relation map containing the monitoring data according to a preset fault propagation internal relation; the monitoring data comprises data collected by physical sensors arranged in the data center, monitoring data of a server layer in the data center and monitoring data of an application program layer in the data center.

With reference to the first aspect, an embodiment of the present invention provides a sixth possible implementation manner of the first aspect, where the data center is further configured with a maintenance robot, and the method further includes: and if the root cause analysis carries a maintenance instruction acting on the maintenance robot, sending the maintenance instruction to the maintenance robot so that the maintenance robot executes a maintenance task on the data center based on the maintenance instruction.

In a second aspect, an embodiment of the present invention further provides a digital twin-based intelligent operation and maintenance device for a data center, which is applied to a simulation platform, where the simulation platform is obtained by modeling the constituent elements of the data center at a preset proportion through three-dimensional simulation software and includes a physical model of the data center. The device includes: a response module, configured to respond to an abnormal occurrence event and acquire a detection time range corresponding to the abnormal occurrence event, where the abnormal occurrence event carries an alarm index; a first calculation module, configured to select all monitoring index values within the detection time range and calculate the fluctuation rate of all the monitoring index values within the detection time range; a determining module, configured to determine each monitoring index value whose fluctuation rate is greater than a preset fluctuation rate threshold as a preliminary screening abnormality index; a second calculation module, configured to calculate the fluctuation similarity between the preliminary screening abnormality index and the alarm index and, based on the fluctuation similarity, select the preliminary screening abnormality index whose fluctuation pattern is similar to that of the alarm index as a target monitoring index value; a sorting module, configured to sort the target monitoring index values along the time dimension to draw a first abnormal relation map; and an analysis module, configured to determine the monitoring object corresponding to the target monitoring index value at the first moment in the first abnormal relation map as the root cause of the abnormal occurrence event, and to generate a root cause analysis based on the root cause so that the abnormal occurrence event can be maintained according to the root cause analysis.

In a third aspect, the embodiment of the invention also provides a simulation platform, wherein the simulation platform is obtained by modeling the constituent elements of the data center in a preset proportion through three-dimensional simulation software, and the simulation platform comprises a physical model of the data center; the simulation kernel of the simulation platform is configured with the intelligent operation and maintenance device of the second aspect so as to execute the intelligent operation and maintenance method of the digital twin-based data center of the first aspect.

In a fourth aspect, an embodiment of the present invention further provides an electronic device, including a memory, a processor, and a computer program stored in the memory and capable of running on the processor, where the processor executes the computer program to implement the steps of the method described in the first aspect.

In a fifth aspect, embodiments of the present invention also provide a computer-readable storage medium, on which a computer program is stored, which when being executed by a processor performs the steps of the method according to the first aspect.

The embodiment of the invention has the following beneficial effects:

The digital twin-based intelligent operation and maintenance method and device for a data center provided by the embodiments of the invention can respond to an abnormal occurrence event and acquire the detection time range corresponding to it; select all monitoring index values within the detection time range and calculate their fluctuation rates; determine each monitoring index value whose fluctuation rate is greater than a preset fluctuation rate threshold as a preliminary screening abnormality index; calculate the fluctuation similarity between the preliminary screening abnormality index and the alarm index of the abnormal occurrence event; and, based on the fluctuation similarity, select the preliminary screening abnormality index whose fluctuation pattern is similar to that of the alarm index as a target monitoring index value. The target monitoring index values are then sorted along the time dimension to draw a first abnormal relation map, the monitoring object corresponding to the target monitoring index value at the first moment in the first abnormal relation map is determined as the root cause of the abnormal occurrence event, and a root cause analysis is generated based on the root cause so that the abnormal occurrence event can be maintained according to it. Because the simulation platform is obtained by modeling the constituent elements of the data center at a preset proportion through three-dimensional simulation software, physical operation data of the data center can be acquired through the virtual-real linked simulation platform, and, driven by twin data consisting of the physical operation data and the operation data of the simulation platform, the information processing model of the constructed simulation platform can be fully used for intelligent operation and maintenance of the data center. This reduces labor cost, effectively improves the operation and maintenance efficiency of the data center, and reduces the number of times personnel enter the data center machine room, which greatly improves both the physical safety of the machine room and the security of its data.

Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.

In order to make the above objects, features and advantages of the present invention more comprehensible, preferred embodiments accompanied with figures are described in detail below.

Drawings

In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are needed in the description of the embodiments or the prior art will be briefly described, it being obvious that the drawings in the description below are some embodiments of the invention and that other drawings may be obtained from these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a basic framework diagram of an intelligent operation and maintenance method of a data center based on digital twinning, which is provided by an embodiment of the invention;

FIG. 2 is a flow chart of an intelligent operation and maintenance method of a data center based on digital twinning provided by an embodiment of the invention;

FIG. 3 is a schematic diagram of an anomaly relationship graph according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of a first anomaly relationship graph according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of a matrix according to an embodiment of the present invention;

FIG. 6 is a schematic structural diagram of an intelligent operation and maintenance device of a data center based on digital twinning according to an embodiment of the present invention;

FIG. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

At present, the daily operation and maintenance work of the data center is mostly realized manually, and the data transmission is mostly dependent on the form of manual form entry, so that an effective information transmission channel is difficult to form.

After servers go online, monitoring data such as server voltage, current, and power, rack and machine room temperature, and machine room humidity are scattered across independent monitoring modules and are often displayed only as instantaneous values. Operation and maintenance staff therefore cannot perceive the monitoring indexes of the servers and of the machine room environment as a whole, nor can they promptly discover abnormal trends in the monitoring indexes or how potential risks in the machine room relate to them.

The data center is the most important carrier of a modern software system, and its stability is the cornerstone of an enterprise's ability to provide services normally; an abnormal breakdown of the data center causes great losses to the enterprise. Daily inspection and operation and maintenance after the servers are delivered is therefore extremely important, in particular the timely discovery and replacement of faulty hardware. In the traditional approach, a server-level hardware fault is traced back only after the service system becomes abnormal, and a data center engineer then replaces the hardware manually, so timeliness is poor and fault location is costly. When several servers fail at the same time, repairs may have to queue, which seriously prolongs service recovery time.

Furthermore, when server-related lifecycle data, such as usage information, reassignment information, and maintenance records, is not recorded in time, dirty data is often produced, and automated workflows that depend on that data frequently fail.

In addition, because on-site operations such as hardware maintenance and bringing servers online and offline still have to be performed manually, the intelligent operation and maintenance system of the data center as a whole cannot form a complete closed loop. Poor information flow between links often leads to serious information silos, reduced data quality, scattered monitoring data, and weak linked display of historical data. This makes it hard for operation and maintenance engineers to manage and control every link of the data center globally. When an abnormality occurs, fault location is costly and slow, and having to switch between multiple systems makes it difficult to find the abnormally correlated indexes in the data, which severely limits the operation and maintenance efficiency of the data center.

Based on the above, the data center intelligent operation and maintenance method and device based on digital twin provided by the embodiment of the invention can effectively alleviate the technical problems.

For the sake of understanding the present embodiment, first, a detailed description is given of an intelligent operation and maintenance method for a data center based on digital twinning disclosed in the present embodiment.

In a possible implementation manner, the embodiment of the invention provides an intelligent operation and maintenance method for a data center based on digital twinning, which is hereinafter referred to as an intelligent operation and maintenance method.

In practical use, the simulation platform in the embodiment of the invention is obtained by modeling the constituent elements of the data center at a preset proportion, for example 1:1, through three-dimensional simulation software, and it includes a physical model of the data center. The intelligent operation and maintenance method built on this simulation platform is therefore a digital twin-based intelligent operation and maintenance method for the data center, with the following characteristic: the simulation platform can be used to build a simulation model of the data center that covers the machine room, the servers, the hardware of the data center (disk, CPU, memory, and the like), the maintenance robot that carries and replaces hardware in place of workers, and so on.

Specifically, the simulation platform provided by the embodiment of the invention is mainly embodied in the following links:

(1) Delivery link of server: server warehouse entry, server shelf, system deployment, network IP distribution, server delivery and use and the like;

Specifically, in this link, tracks can be laid in advance on the machine room floor, and an automated guided vehicle (AGV) equipped with a mechanical arm, together with automatic identification tags such as two-dimensional codes and radio frequency identification (RFID) tags, can be used to travel automatically, identify machine and rack positions, transport rack servers, and fetch and replace the related hardware.

(2) Monitoring of a server and a machine room: monitoring the surrounding environments such as the voltage, the current, the temperature, the humidity and the like of a server;

In this link, the voltage, current, power, temperature, humidity, and the like of the servers in the machine room of the data center can be obtained in real time through smart electricity meters, environment sensors (temperature, humidity), and the like.

(3) Maintenance of the server: routine replacement of failed hardware (disk, CPU, memory, etc.).

Because most hardware is connected in a pluggable manner, in this link the tracks laid in advance on the machine room floor and the AGV equipped with a mechanical arm make it possible for the mechanical arm to replace hardware (including disks, CPUs, memory, and the like) automatically according to instructions issued by the simulation platform.

(4) Lifecycle management of servers: server-related lifecycle data management for usage information, reassignment information, maintenance information, repair records, and the like.

For example, from the time a server is purchased and shipped until it arrives at the data center warehouse, server data can be synchronously collected into the simulation platform through RFID technology, and the simulation platform can subsequently be synchronized with the robot's automatic racking information and hardware replacement information, so that the lifecycle data of the server is managed automatically across the whole process.

(5) Through digital twin technology, the entire operation and maintenance scenario can be connected end to end across the whole lifecycle, from system software to data center machine room hardware and from the virtual world to the physical world, with a higher degree of information integration and stronger linkage. The operation and maintenance system of the whole data center forms an effective closed loop: operation and maintenance staff only need to operate or monitor the relevant interfaces in a unified monitoring center to complete operations such as racking and unracking servers, maintaining and replacing hardware, and adjusting machine room temperature and humidity. With a unified scheduling algorithm, on-site tasks can be arranged efficiently, which further saves manpower and improves efficiency.

Therefore, the digital twin-based intelligent operation and maintenance method for a data center can acquire physical operation data through the various sensors of the machine room and servers of the data center, and acquire simulation operation data through the virtual-real linked simulation platform. Driven by twin data consisting of the physical operation data and the simulation operation data, and relying on synchronous mapping and real-time interaction between the physical machine room and servers and their virtual counterparts, it makes full use of the constructed information processing model to perform intelligent operation and maintenance on the physical environment of the data center.

For ease of understanding, FIG. 1 shows a basic framework diagram of the digital twin-based intelligent operation and maintenance method for a data center. As shown in FIG. 1, the framework includes a data center in a physical environment and a simulation platform in a simulation environment.

The data center includes a plurality of servers deployed in a physical machine room. To realize the digital twin technology, a monitoring system capable of data acquisition is deployed in the data center, and the simulation platform models a simulated machine room corresponding to the physical machine room of the data center and deploys a simulation kernel. To implement the intelligent operation and maintenance method, the simulation kernel of the simulation platform generally covers the following: mechanism analysis, simulation platform construction, data processing, algorithm design, and the discovery, location, and automatic handling of abnormal events.

Mechanism analysis generally covers the coupling relationships between hardware and the mechanism by which abnormalities propagate; it draws an abnormal relation map of the monitoring indexes mainly by analyzing the internal relationships of fault propagation.

Simulation platform construction models the constituent elements of the data center through three-dimensional simulation software, for example at 1:1. The modeling covers all physical objects in the data center, such as racks, servers, network cables, electricity meters, hard disks, CPUs, and memory, and data transmission and instruction issuing are realized through a dedicated communication channel.

Data processing refers to processing the data collected in the data center of FIG. 1, which is typically used as the monitoring index values of the data center. Algorithm design covers designing and writing the intelligent operation and maintenance method as a whole. The discovery, location, and automatic handling of abnormal events is mainly used to handle abnormal events automatically, including fault set analysis, fault location and handling, and the like.

Thus, as FIG. 1 shows, the data center provides the data that drives the simulation platform: the collected monitoring data is input into the simulation model, and the simulation platform in turn provides operation and maintenance support for the data center according to the operation and maintenance strategy.

In daily operation and maintenance, the occurrence of an abnormal event is often found to be transmissible: after a root cause becomes abnormal, the abnormal fluctuation pattern of its monitoring index value propagates to other related monitoring indexes. The monitoring indexes are affected in a time order, although some of them may become abnormal simultaneously. The intelligent operation and maintenance method provided by the embodiment of the invention mainly aims to locate the root cause quickly during operation and maintenance.

Based on the basic framework diagram of the digital twin-based intelligent operation and maintenance method for a data center shown in FIG. 1, FIG. 2 shows a flow chart of the method, which includes the following steps:

Step S202, responding to an abnormal occurrence event, and acquiring a detection time range corresponding to the abnormal occurrence event;

The abnormal occurrence event carries an alarm index, which is the index that triggered the abnormal occurrence event. For example, if the current of a server becomes abnormal and exceeds the normal current range, the current value of that server is the alarm index, and the event that reports the abnormal current is the abnormal occurrence event.

In addition, the abnormal occurrence event refers to an abnormality of the data center in the actual physical environment. For example, when the monitoring system deployed in the data center detects that the value of a monitoring index is abnormal, it can send an abnormality prompt to the simulation platform; the simulation platform receives the prompt, determines that an abnormal occurrence event has occurred, and responds to it by executing the method provided by the embodiment of the invention.

Furthermore, because the simulation platform is constructed with digital twin technology, the data in the simulation platform and the data of the data center in the actual physical environment are shared in real time, so abnormal data is reproduced in the simulation platform in real time and raises an alarm there. The simulation platform can then respond to the abnormal occurrence event raised by that alarm and execute the method provided by the embodiment of the invention.

In addition, the simulation platform can also receive an abnormal event processing operation triggered by the operation and maintenance personnel, namely the operation and maintenance personnel find that the data center is abnormal, the simulation platform is triggered to process the abnormal event, and at the moment, the simulation platform can also respond to the abnormal event, so that the method provided by the embodiment of the invention is executed.

Further, the detection time range is a time range that includes the occurrence time of the abnormal occurrence event. In general, it covers a period before and a period after that occurrence time. For example, if the occurrence time of the abnormal occurrence event is denoted by T and called the target time, the detection time range may be denoted by (T − Δt, T + Δt), where Δt is the length of the period before or after the target time.
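
The window selection described above can be sketched as follows. This is a minimal illustration, assuming that samples are stored as (timestamp, value) pairs with timestamps in seconds; the names `detection_window` and `select_in_window` are hypothetical and are not taken from the embodiment.

```python
from typing import List, Tuple

# Assumption: one sample is (timestamp in seconds, monitoring index value).
Sample = Tuple[float, float]


def detection_window(target_time: float, delta_t: float) -> Tuple[float, float]:
    """Return the open interval (T - delta_t, T + delta_t) around target time T."""
    if delta_t <= 0:
        raise ValueError("delta_t must be positive")
    return (target_time - delta_t, target_time + delta_t)


def select_in_window(samples: List[Sample], window: Tuple[float, float]) -> List[Sample]:
    """Keep only the samples whose timestamp lies strictly inside the window."""
    start, end = window
    return [(t, v) for (t, v) in samples if start < t < end]
```

Whether the interval endpoints are included is a design choice; the sketch treats the range as open, matching the notation (T − Δt, T + Δt).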

Step S204, selecting all monitoring index values in the detection time range, and calculating the fluctuation rate of all the monitoring index values in the detection time range;

in actual use, the monitoring index value refers to a specific value of a monitoring index for monitoring the data center by the current simulation platform, and the monitoring index in the application is usually part of monitoring data or all of the monitoring data, and the monitoring data refers to data collected by a physical sensor of the data center, monitoring data at a server level in the data center and monitoring data at an application program level in the data center.

The above monitoring data are finally reflected in the simulation kernel of the simulation platform shown in fig. 1, specifically, the data processing part of the simulation kernel realizes the collection of the monitoring data, so that the above steps can be executed to select the monitoring index value and calculate the fluctuation rate when responding to the abnormal occurrence event.

Step S206, determining a monitoring index value with the fluctuation rate larger than a preset fluctuation rate threshold value as a primary screening abnormality index;

step S208, calculating the fluctuation similarity of the primary screening abnormal index and the alarm index, and selecting the primary screening abnormal index with a similar fluctuation rule with the alarm index as a target monitoring index value based on the fluctuation similarity;

Step S210, sorting the target monitoring index values according to the time dimension so as to draw a first abnormal relation map;

step S212, determining the monitoring object corresponding to the target monitoring index value corresponding to the first moment in the first abnormal relation map as the root cause of the abnormal occurrence event, and generating root cause analysis based on the root cause so as to maintain the abnormal occurrence event according to the root cause analysis.

The monitoring object refers to a software or hardware object deployed in the data center. For example, if the target monitoring index value is data collected by a physical sensor, such as the current of a certain server collected by a smart meter, the monitoring object can be located directly to that server; that is, the server is the root cause of the abnormal occurrence event.

In actual use, because abnormal events are transmissible, once an abnormality occurs the corresponding monitoring index exhibits an abnormal fluctuation pattern, and this abnormal fluctuation is transmitted to other related monitoring indexes, which then show a similar fluctuation pattern. The abnormal fluctuations therefore occur in sequence along the time dimension. The first abnormal relation map obtained in step S210 by sorting along the time dimension thus allows the root cause to be located effectively and rapidly in the operation and maintenance process, so that a root cause analysis can be generated and the abnormal occurrence event can be maintained and handled.
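
The time ordering in steps S210 and S212 can be sketched as follows. The sketch assumes that, for each target monitoring index, the time at which its abnormal fluctuation was first observed is already known; the function name, the index names, and the mapping from index to monitoring object are hypothetical.

```python
from typing import Dict, List


def root_cause_candidates(onset_times: Dict[str, float],
                          index_to_object: Dict[str, str]) -> List[str]:
    """Return the monitoring objects whose index became abnormal first.

    onset_times maps a target monitoring index name to the (assumed known)
    time at which its abnormal fluctuation was first observed.
    """
    if not onset_times:
        return []
    # Sort the target indexes along the time dimension (step S210).
    ordered = sorted(onset_times.items(), key=lambda item: item[1])
    first_time = ordered[0][1]
    # The index or indexes at the first moment point to the root cause (step S212).
    objects: List[str] = []
    for name, onset in ordered:
        if onset != first_time:
            break
        obj = index_to_object[name]
        if obj not in objects:
            objects.append(obj)
    return objects
```

Returning a list rather than a single object is a choice made for the sketch, so that several indexes sharing the earliest moment are all reported.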

Therefore, the digital twin-based intelligent operation and maintenance method for a data center provided by the embodiment of the invention uses a simulation platform with virtual-real linkage to acquire the physical operation data of the data center. Driven by twin data consisting of this physical operation data and the operation data of the simulation platform, the servers of the physical machine room of the data center are synchronously mapped to, and interact in real time with, the virtual machine room and servers of the simulation platform, so that the information processing model built into the simulation platform can be fully used for intelligent operation and maintenance of the physical environment of the data center. This effectively improves the operation and maintenance efficiency of the data center, reduces the number of times personnel enter the machine room, and greatly improves machine room safety and data security.

In practical use, in order to facilitate mechanism analysis of an abnormal occurrence event, an abnormal relation map depicting a monitoring index may be pre-established for all monitoring data of a data center according to a fault propagation internal relation, so that the embodiment of the invention further includes a process of constructing the abnormal relation map, and the process is usually implemented in mechanism analysis in a simulation kernel, and specifically includes the following processes:

Acquiring monitoring data of a data center; and drawing an abnormal relation map containing monitoring data according to the preset fault propagation internal relation.

The monitoring data comprise data collected by physical sensors arranged in the data center, monitoring data of a server layer in the data center and monitoring data of an application program layer in the data center.

Specifically, as shown in fig. 1, the data collected by the physical sensor generally includes data collected by the smart meter, such as current, voltage, power, etc. of the server; the temperature sensor collects the temperature of the machine room, the humidity sensor collects the humidity of the machine room, and the like; frame vibration data collected by the frame vibration sensor and identification information of hardware equipment such as a server and the like collected by the automatic tag.

The monitoring data of the server layer mainly refers to CPU utilization, disk busy rate, system load, memory utilization, and the like; the monitoring data of the application program layer refers to data such as the request volume, success rate, and request latency of the server.

And, based on the pre-established abnormal relation map, in the step S210, when the first abnormal relation map is drawn, the pre-established abnormal relation map may be obtained, where the abnormal relation map is constructed by using all monitoring data of the data center as monitoring indexes; and then screening out the target monitoring index value from the abnormal relation map, and drawing a first abnormal relation map according to the internal relation of the screened target monitoring index value in the abnormal relation map.

For ease of understanding, fig. 3 shows a schematic diagram of an abnormal relation map. Taking the time range from t1 to t2 as an example, the map includes a plurality of monitoring indexes A to G, and the fault propagation internal relation among the monitoring indexes is represented by arrows. The fault propagation internal relation is generally constructed or generated based on experience. For example, a disk read-write fault inevitably causes abnormal CPU usage and also causes application programs to block. Faults thus propagate with a certain transmissibility and similarity, that is, a certain fault propagation internal relation exists. By analyzing this relation, the abnormal relation map shown in fig. 3 can be constructed, so that the root cause can be located rapidly in the operation and maintenance process according to the time sequence along the whole abnormal link.
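
One way to hold such a map in memory is a directed adjacency list, from which the part relevant to the target monitoring indexes is cut out. The sketch below is illustrative only: the index names and edges are hypothetical (the disk-related edges mirror the example above), and `induced_subgraph` is not a name used by the embodiment.

```python
from typing import Dict, List, Set

# Hypothetical fault-propagation relations: source index -> affected indexes.
PROPAGATION: Dict[str, List[str]] = {
    "disk.read_write_errors": ["server.cpu_usage", "app.request_latency"],
    "server.cpu_usage": ["app.request_latency"],
    "app.request_latency": [],
    "room.temperature": ["server.cpu_usage"],
}


def induced_subgraph(graph: Dict[str, List[str]], targets: Set[str]) -> Dict[str, List[str]]:
    """Keep only target indexes and the propagation edges between them."""
    return {
        src: [dst for dst in dsts if dst in targets]
        for src, dsts in graph.items()
        if src in targets
    }
```

Applied to the screened target monitoring indexes, the result keeps the internal relations that the first abnormal relation map is drawn from.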

Further, fig. 4 also shows a schematic diagram of a first abnormal relationship map, taking a time period from t1 to t4 as an example, and showing a schematic diagram of occurrence of a part of monitoring indexes in a time dimension. Although a part of monitoring indexes may have simultaneous occurrence, that is, several monitoring indexes shown by a dashed line at a time t2 and several monitoring indexes shown by a dashed line at a time t4, the whole abnormal link still has time sequence, so that a monitoring object corresponding to a target monitoring index value corresponding to a first time in the first abnormal relation map can be determined as a root cause of an abnormal occurrence event.

Further, in the embodiment of the present invention, a pre-established root cause analysis library may be used when generating the root cause analysis. The root cause analysis library is generally constructed from previous operation and maintenance experience: for a given root cause, it directly provides an object analysis of the monitoring object corresponding to that root cause, for example the fault types the monitoring object may produce, how those faults manifest, and the specific handling measures. Therefore, when generating the root cause analysis, it may first be determined whether the pre-established root cause analysis library records an object analysis corresponding to the root cause of the abnormal occurrence event. If yes, the root cause analysis of the abnormal occurrence event is generated directly based on the object analysis recorded in the root cause analysis library. If not, an object analysis of the monitoring object corresponding to the root cause is added to the root cause analysis library according to actual operation and maintenance handling experience, the root cause analysis library is updated, and the root cause analysis of the abnormal occurrence event is generated from the added object analysis.
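
The lookup-or-update logic of the root cause analysis library can be sketched as follows. The sketch assumes the library is a simple in-memory mapping from a root cause to a textual object analysis, and that a callback (here called `ask_operator`, a hypothetical name) supplies a new object analysis from operation and maintenance experience when none is recorded.

```python
from typing import Callable, Dict


def generate_root_cause_analysis(library: Dict[str, str],
                                 root_cause: str,
                                 ask_operator: Callable[[str], str]) -> str:
    """Return a root cause analysis, extending the library when needed."""
    if root_cause not in library:
        # No recorded object analysis: add one and thereby update the library.
        library[root_cause] = ask_operator(root_cause)
    # Generate the analysis from the recorded (or newly added) object analysis.
    return f"Root cause: {root_cause}. {library[root_cause]}"
```

A persistent store would normally replace the dictionary; the mapping is used here only to keep the example self-contained.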

Further, in the step S204, when calculating the fluctuation ratio, the fluctuation ratio of all the monitor index values in the detection time range is generally calculated according to the following formula:

Wherein E represents the fluctuation rate of the monitoring index value, T is the target time of the abnormal occurrence event, the detection time range is represented by (T − Δt, T + Δt), Δt represents the time difference before and after the target time, s_i is the i-th monitoring index value, and the mean term is the average of the monitoring index values within the detection time range.

In practical use, monitoring index values are generally time-series data, that is, each record contains a collection time point and the corresponding monitoring index value, so the above formula can be evaluated within the detection time range. The fluctuation rate E obtained in this way narrows the scope of abnormality positioning: monitoring indexes whose fluctuation rate E within the detection time range does not exceed the preset fluctuation rate threshold do not participate in the calculations of the subsequent steps.
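
The screening in steps S204 and S206 can be sketched as follows. The sketch assumes, purely for illustration, that the fluctuation rate is the population standard deviation of the values inside the detection time range; this may differ from the formula of the embodiment. The function names and the threshold value are hypothetical.

```python
from statistics import pstdev
from typing import Dict, List


def fluctuation_rate(values: List[float]) -> float:
    """Assumed fluctuation measure: population standard deviation around the window mean."""
    if not values:
        raise ValueError("at least one value is required")
    return pstdev(values)


def primary_screening(windowed: Dict[str, List[float]], threshold: float) -> List[str]:
    """Return the names of indexes whose fluctuation rate exceeds the threshold."""
    return [
        name
        for name, values in windowed.items()
        if values and fluctuation_rate(values) > threshold
    ]
```

Indexes with no samples in the window are skipped rather than treated as abnormal, which is a choice made for the sketch.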

Further, when calculating the fluctuation similarity in step S208, a pre-established DTW (Dynamic Time Warping) algorithm is generally adopted to calculate the similarity between the primary screening abnormality index and the alarm index. Because all monitoring indexes of the data center are time-series data, the DTW algorithm suits the characteristics of the monitoring data: it measures the similarity between two given time sequences and tolerates a certain amount of stretching and compression of the two sequences along the time axis.

The algorithm may include the steps of:

(1) Given the time-series data sequences of two monitoring indexes:

$$H = h_1, h_2, \ldots, h_i, \ldots, h_m$$

$$K = k_1, k_2, \ldots, k_j, \ldots, k_n$$

(2) To calculate the similarity of the two sequences, an $m \times n$ matrix is first constructed, in which the value of element $(i, j)$ is the distance $d(h_i, k_j)$ between the two points $h_i$ and $k_j$, that is, the distance from each point of sequence H to each point of sequence K; the smaller the distance, the higher the similarity. The distance is typically calculated from the Euclidean distance as $d(h_i, k_j) = (h_i - k_j)^2$.

(3) A path through the matrix is sought from the upper left corner to the lower right corner. For ease of understanding, fig. 5 shows a schematic diagram of the matrix. As shown in fig. 5, the path is denoted by W, and the u-th element of W is defined as $w_u = (w_h(u), w_k(u))$, where $w_h(u)$ may take the values $1, 2, \ldots, m$ and $w_k(u)$ may take the values $1, 2, \ldots, n$.

This defines a mapping between the sequences H and K:

$$W = w_1, w_2, w_3, \ldots, w_u, \ldots, w_U, \qquad \max(m, n) \le U \le m + n - 1$$

(4) Establishing related constraint conditions:

Monotonicity: $w_h(u+1) \ge w_h(u)$ and $w_k(u+1) \ge w_k(u)$;

Continuity: $w_h(u+1) - w_h(u) \le 1$ and $w_k(u+1) - w_k(u) \le 1$;

Boundary conditions: $w_1 = (1, 1)$ and $w_U = (m, n)$;

When the constraint conditions are satisfied, the DTW algorithm may be expressed as finding the path with the minimum warping cost:

At this time, the calculation formula of the DTW algorithm is expressed as:

In the embodiment of the invention, one of H and K is the data sequence of a primary screening abnormality index and the other is the data sequence of the alarm index; w_u represents the value of the u-th element of the target path in the data matrix constructed from the data sequences H and K; U represents the number of elements in the target path and is used to compensate for warping paths of different lengths.
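
The matrix construction and path constraints above can be sketched with the usual dynamic-programming recurrence for DTW. This is a minimal illustration, not the formula of the embodiment: it assumes the squared point distance, returns the unnormalized minimum accumulated cost, and omits the division by the path length U. The name `dtw_cost` is hypothetical.

```python
from typing import Sequence


def dtw_cost(h: Sequence[float], k: Sequence[float]) -> float:
    """Minimum accumulated squared distance over all admissible warping paths."""
    m, n = len(h), len(k)
    if m == 0 or n == 0:
        raise ValueError("both sequences must be non-empty")
    inf = float("inf")
    # acc[i][j] is the minimum cost of a path from (1, 1) to (i, j);
    # row 0 and column 0 are padding that enforces the boundary condition.
    acc = [[inf] * (n + 1) for _ in range(m + 1)]
    acc[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = (h[i - 1] - k[j - 1]) ** 2
            # Monotonicity and continuity allow only these three predecessors.
            acc[i][j] = d + min(acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1])
    return acc[m][n]
```

A smaller cost indicates more similar fluctuation. Selecting target monitoring indexes by comparing this cost against a cut-off would be one possible rule, but the cut-off itself is an assumption and is not specified here.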

After the target monitoring index values with a fluctuation pattern similar to that of the alarm index are obtained through the above calculation, they can be arranged in order along the time axis to obtain the first abnormal relation map. The monitoring object corresponding to the monitoring index at the first moment, that is, the earliest position, is then the root cause of the abnormal occurrence event.

Further, as can be seen from fig. 1, the simulation platform according to the embodiment of the present invention further includes a part for discovering, locating, and automatically handling abnormal events. This part discovers abnormal events (for example, the response to an abnormal occurrence event described above may be completed here) and transmits the data in real time to the simulation platform built with digital twin technology, while the locating of abnormal events is performed by the algorithm module built into the simulation kernel.

Meanwhile, the corresponding data center can be further provided with a maintenance robot, so that if the root cause analysis carries a maintenance instruction acting on the maintenance robot, the maintenance instruction is sent to the maintenance robot, and the maintenance robot executes a maintenance task on the data center based on the maintenance instruction.

Specifically, the execution process of the maintenance robot is the automatic handling process in the data center. After the abnormality has been located in the simulation center, a maintenance task can be issued through the digital twin simulation platform, and the maintenance robot of the data center in the physical environment and its twin robot in the data center of the simulation platform act synchronously in real time.

For example, an abnormal event a caused by a hard disk failure is located to the failure of disk sde on host EQ1 in rack A1 of machine room SZ203. At the interactive interface of the simulation center, a worker only needs to click to issue a maintenance task, and the data center in the physical environment automatically schedules an AGV trolley with a mechanical arm, which takes a new disk of the same specification from the warehouse shelf and carries it to host EQ1 in rack A1 of machine room SZ203 for replacement. The whole process requires no contact with operation and maintenance engineers and no application to enter the machine room; the personnel in front of the monitor only need to interact with the simulation platform to issue the maintenance task. The simulation platform then monitors the progress of the maintenance task dynamically in real time, so that the real-time state of the data center is known throughout the process. This greatly reduces the time spent on hardware maintenance, ensures that the service is restored to its pre-fault condition, reduces the number of times personnel enter the machine room, and greatly improves machine room safety and data security.

Further, the embodiment of the invention also provides a digital twin-based intelligent operation and maintenance device for a data center, which is applied to a simulation platform, wherein the simulation platform is obtained by modeling the preset proportion of the constituent elements of the data center through three-dimensional simulation software, and comprises a physical model of the data center, and the digital twin-based intelligent operation and maintenance device for the data center is shown in a structural schematic diagram in fig. 6 and comprises:

the response module 60 is configured to respond to an abnormal occurrence event, and obtain a detection time range corresponding to the abnormal occurrence event, where the abnormal occurrence event carries an alarm indicator;

a first calculation module 62, configured to select all the monitoring index values within the detection time range, and calculate the fluctuation rates of all the monitoring index values within the detection time range;

a determining module 64, configured to determine the monitor indicator value with the volatility greater than a preset volatility threshold value as a preliminary screening abnormality indicator;

a second calculation module 66, configured to calculate a fluctuation similarity between the prescreening anomaly index and the alarm index, and select, based on the fluctuation similarity, the prescreening anomaly index having a similar fluctuation rule with the alarm index as a target monitoring index value;

A ranking module 68, configured to rank the target monitor index values according to a time dimension, so as to draw a first abnormal relationship map;

and an analysis module 69, configured to determine a monitored object corresponding to the target monitoring index value corresponding to the first moment in the first abnormal relationship map as a root cause of the abnormal event, and generate a root cause analysis based on the root cause, so as to facilitate maintenance of the abnormal event according to the root cause analysis.

The embodiment of the invention further provides a simulation platform which is obtained by modeling the constituent elements of the data center in a preset proportion through three-dimensional simulation software, and the simulation platform comprises a physical model of the data center; the simulation kernel of the simulation platform is configured with the intelligent operation and maintenance device so as to execute the intelligent operation and maintenance method.

The intelligent operation and maintenance device and the simulation platform provided by the embodiment of the invention have the same technical characteristics as the intelligent operation and maintenance method provided by the embodiment, so that the same technical problems can be solved, and the same technical effects can be achieved.

The embodiment of the invention also provides electronic equipment, which comprises a memory, a processor and a computer program stored in the memory and capable of running on the processor, wherein the processor realizes the steps of the method when executing the computer program.

Embodiments of the present invention also provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the above method.

Further, an embodiment of the present invention provides a schematic structural diagram of an electronic device, as shown in fig. 7, where the electronic device includes a processor 71 and a memory 70, where the memory 70 stores computer executable instructions that can be executed by the processor 71, and the processor 71 executes the computer executable instructions to implement the method shown in fig. 1.

In the embodiment shown in fig. 7, the electronic device further comprises a bus 72 and a communication interface 73, wherein the processor 71, the communication interface 73 and the memory 70 are connected by the bus 72.

The memory 70 may include a high-speed random access memory (RAM, Random Access Memory), and may further include a non-volatile memory, such as at least one magnetic disk memory. The communication connection between the system network element and at least one other network element is achieved via at least one communication interface 73 (which may be wired or wireless), and may use the internet, a wide area network, a local network, a metropolitan area network, etc. Bus 72 may be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus 72 may be classified as an address bus, a data bus, a control bus, or the like. For ease of illustration, only one bi-directional arrow is shown in fig. 7, but this does not mean that there is only one bus or one type of bus.

The processor 71 may be an integrated circuit chip with signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuitry in hardware or instructions in software in the processor 71. The processor 71 may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), etc.; but also digital signal processors (Digital Signal Processor, DSP for short), application specific integrated circuits (Application Specific Integrated Circuit, ASIC for short), field-programmable gate arrays (Field-Programmable Gate Array, FPGA for short) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present invention may be embodied directly in the execution of a hardware decoding processor, or in the execution of a combination of hardware and software modules in a decoding processor. The software modules may be located in a random access memory, flash memory, read only memory, programmable read only memory, or electrically erasable programmable memory, registers, etc. as well known in the art. The storage medium is located in a memory and the processor 71 reads the information in the memory and in combination with its hardware performs the steps of the method described above.

The computer program product of the digital twin-based data center intelligent operation and maintenance method and device provided by the embodiment of the invention comprises a computer readable storage medium storing program codes, wherein the instructions included in the program codes can be used for executing the method described in the method embodiment, and specific implementation can be referred to the method embodiment and will not be repeated here.

It will be clear to those skilled in the art that, for convenience and brevity of description, reference may be made to the corresponding process in the foregoing method embodiment for the specific working process of the apparatus described above, which is not described herein again.

In addition, in the description of embodiments of the present invention, unless explicitly stated and limited otherwise, the terms "mounted," "coupled," and "connected" are to be construed broadly. For example, a connection may be fixed, detachable, or integral; it may be mechanical or electrical; it may be direct or indirect through an intermediate medium; and it may be an internal communication between two elements. The specific meaning of the above terms in the present invention will be understood by those skilled in the art according to the specific case.

The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.

In the description of the present invention, it should be noted that the directions or positional relationships indicated by the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc. are based on the directions or positional relationships shown in the drawings, are merely for convenience of describing the present invention and simplifying the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and thus should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.

Finally, it should be noted that: the above examples are only specific embodiments of the present invention for illustrating the technical solution of the present invention, but not for limiting the scope of the present invention, and although the present invention has been described in detail with reference to the foregoing examples, it will be understood by those skilled in the art that the present invention is not limited thereto: any person skilled in the art may modify or easily conceive of the technical solution described in the foregoing embodiments, or perform equivalent substitution of some of the technical features, while remaining within the technical scope of the present disclosure; such modifications, changes or substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and are intended to be included in the scope of the present invention. Therefore, the protection scope of the invention is subject to the protection scope of the claims.

Claims

1. The intelligent operation and maintenance method of the data center based on digital twinning is characterized by being applied to a simulation platform, wherein the simulation platform is obtained by modeling the preset proportion of constituent elements of the data center through three-dimensional simulation software, and comprises a physical model of the data center, and the method comprises the following steps:

Responding to an abnormal occurrence event, and acquiring a detection time range corresponding to the abnormal occurrence event, wherein the abnormal occurrence event carries an alarm index;

selecting all monitoring index values within the detection time range, and calculating the fluctuation rate of all the monitoring index values within the detection time range;

determining the monitoring index value of which the fluctuation rate is larger than a preset fluctuation rate threshold value as a preliminary screening abnormality index;

calculating the fluctuation similarity of the primary screening abnormal index and the alarm index, and selecting the primary screening abnormal index with a similar fluctuation rule with the alarm index as a target monitoring index value based on the fluctuation similarity;

sequencing the target monitoring index values according to the time dimension to draw a first abnormal relation map;

and determining a monitoring object corresponding to the target monitoring index value corresponding to the first moment in the first abnormal relation map as a root cause of the abnormal occurrence event, and generating root cause analysis based on the root cause so as to maintain the abnormal occurrence event according to the root cause analysis.

2. The method of claim 1, wherein the step of drawing the first abnormal relation map comprises:

Acquiring a pre-constructed abnormal relation map, wherein the abnormal relation map is constructed by taking all monitoring data of the data center as monitoring indexes;

screening the target monitoring index value in the abnormal relation map, and drawing the first abnormal relation map according to the internal relation of the screened target monitoring index value in the abnormal relation map.

3. The method of claim 1, wherein generating root cause analysis based on the root cause comprises:

judging whether object analysis corresponding to the root cause causing the abnormal occurrence event is recorded in a pre-established root cause analysis library;

if yes, generating root cause analysis of the abnormal occurrence event based on object analysis recorded in the root cause analysis library;

if not, adding the object analysis of the monitoring object corresponding to the root cause in the root cause analysis library, and generating the root cause analysis of the abnormal occurrence event according to the added object analysis.

4. The method according to claim 1, wherein the step of calculating the fluctuation rate of all the monitor index values in the detection time range includes:

Calculating the fluctuation rate of all the monitoring index values within the detection time range according to the following formula:

5. The method of claim 1, wherein the step of calculating a fluctuating similarity of the prescreening anomaly index to the alert index comprises:

calculating the similarity between the preliminary screening abnormal index and the alarm index by adopting a pre-established DTW algorithm;

wherein, the calculation formula of the DTW algorithm is as follows:

6. The method according to claim 2, wherein the method further comprises:

acquiring monitoring data of the data center;

drawing an abnormal relation map containing the monitoring data according to a preset fault propagation internal relation; the monitoring data comprises data collected by physical sensors arranged in the data center, monitoring data of a server layer in the data center and monitoring data of an application program layer in the data center.

7. The method of claim 1, wherein the data center is further configured with a maintenance robot, the method further comprising:

if the root cause analysis carries a maintenance instruction directed to the maintenance robot, sending the maintenance instruction to the maintenance robot, so that the maintenance robot executes a maintenance task on the data center based on the maintenance instruction.

8. A digital twin-based intelligent operation and maintenance device for a data center, characterized in that the device is applied to a simulation platform, the simulation platform being a platform obtained by modeling the constituent elements of the data center in a preset proportion through three-dimensional simulation software, and the simulation platform containing a physical model of the data center, the device comprising:

the response module is used for responding to an abnormal occurrence event and acquiring a detection time range corresponding to the abnormal occurrence event, wherein the abnormal occurrence event carries an alarm index;

the first calculation module is used for selecting all the monitoring index values in the detection time range and calculating the fluctuation rate of all the monitoring index values in the detection time range;

the determining module is used for determining that the monitoring index value with the fluctuation rate larger than a preset fluctuation rate threshold value is a primary screening abnormality index;

the second calculation module is used for calculating the fluctuation similarity of the primary screening abnormality index and the alarm index, and selecting, based on the fluctuation similarity, the primary screening abnormality index having a fluctuation rule similar to that of the alarm index as a target monitoring index value;

the sequencing module is used for sequencing the target monitoring index values according to the time dimension so as to draw a first abnormal relation map;

and the analysis module is used for determining a monitoring object corresponding to the target monitoring index value corresponding to the first moment in the first abnormal relation map as the root cause of the abnormal occurrence event, generating root cause analysis based on the root cause, and maintaining the abnormal occurrence event according to the root cause analysis.
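The flow through the modules above can be sketched as a single function. This is an illustration under stated assumptions, not the claimed implementation: the fluctuation rate is assumed to be the value range divided by the absolute mean, the similarity measure is passed in as a `distance` callable (smaller means more similar), each index is assumed to have a known onset time used for the time ordering, and the alarm index itself is excluded from the candidates. All names are hypothetical.

```python
from math import inf
from statistics import mean
from typing import Callable, Dict, List, Optional, Sequence, Tuple


def fluctuation_rate(values: Sequence[float]) -> float:
    """Assumed definition: (max - min) / |mean|."""
    spread = max(values) - min(values)
    if spread == 0:
        return 0.0
    m = abs(mean(values))
    return spread / m if m else inf


def locate_root_cause(
    series: Dict[str, Sequence[float]],
    onset: Dict[str, float],
    alarm: str,
    rate_threshold: float,
    distance: Callable[[Sequence[float], Sequence[float]], float],
    max_distance: float,
) -> Tuple[Optional[str], List[str]]:
    """Return (root cause index name, time-ordered target index names)."""
    # Determining module: primary screening by fluctuation rate threshold.
    screened = [
        name for name, values in series.items()
        if name != alarm and fluctuation_rate(values) > rate_threshold
    ]
    # Second calculation module: keep indexes that fluctuate like the alarm index.
    targets = [
        name for name in screened
        if distance(series[name], series[alarm]) <= max_distance
    ]
    # Sequencing module: order the targets along the time dimension.
    ordered = sorted(targets, key=lambda name: onset[name])
    # Analysis module: the earliest target is taken as the root cause.
    return (ordered[0] if ordered else None), ordered
```

The time-ordered list plays the role of the first abnormal relation map in this sketch, and its first element corresponds to the target monitoring index value at the first moment.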

9. A simulation platform, characterized in that the simulation platform is obtained by modeling the constituent elements of a data center in a preset proportion through three-dimensional simulation software, and comprises a physical model of the data center;

the simulation kernel of the simulation platform is configured with the intelligent operation and maintenance device according to claim 8, so as to execute the digital twin-based intelligent operation and maintenance method for a data center according to any one of claims 1 to 7.

10. A computer-readable storage medium, characterized in that a computer program is stored thereon which, when executed by a processor, performs the steps of the method according to any one of claims 1 to 7.