CN115576736A

CN115576736A - Refined intelligent monitoring method for data center

Info

Publication number: CN115576736A
Application number: CN202211562132.4A
Authority: CN
Inventors: 李超成; 高鸿波; 刘毅; 康凯; 刘大维; 高雷; 饶智斌
Original assignee: Beijing Tongniu Information Technology Co ltd
Current assignee: Beijing Tongniu Information Technology Co ltd
Priority date: 2022-12-07
Filing date: 2022-12-07
Publication date: 2023-01-06

Abstract

The invention relates to the technical field of data security, and in particular to a refined intelligent monitoring method for a data center. The method comprises: establishing a database, acquiring fault data, and recording the fault data in the database; grouping the data center into working groups, and acquiring temperature and humidity data and hardware telemetry data for each working group and for each device; generating a fault event from the fault data, and locating the faulty working group and faulty device that produced the fault event according to the fault event information; performing hardware troubleshooting on the faulty working group and faulty device, and judging from their temperature and humidity data and hardware telemetry data whether the fault is a hardware fault; when a hardware fault is determined, generating maintenance information and reporting it automatically; and when the hardware of the faulty working group and faulty device is running normally, troubleshooting the faulty system and repairing it automatically.

Description

Refined intelligent monitoring method for data center

Technical Field

The invention relates to the technical field of data security, in particular to a refined intelligent monitoring method for a data center.

Background

A data center is a globally collaborative network of devices used to deliver, accelerate, present, compute, and store data over Internet infrastructure. As data centers develop, they will become competitive assets for enterprises, and business models will change accordingly. With the spread of data center applications, artificial intelligence, network security, and related technologies have appeared in succession, and more users have been drawn to network and mobile applications. As the number of computers and the volume of data grow, people can also improve their own abilities through continuous learning and accumulation, which is an important mark of the advance into the information age. The data center is one of the more popular research topics in the computer field, so the related technology is relatively mature. Data center network topologies mainly include Tree, Fat-Tree, BCube, and FiConn; they mainly adopt modular, hierarchical, and flattened design ideas together with virtualized partition management, so that thousands of devices are divided into units and managed unit by unit. A data center is made up of a large amount of computer hardware, and hardware problems can prevent some functions from being performed or from operating properly.

To keep the data center operating safely, most data centers are monitored by video monitoring and software monitoring. However, existing monitoring suffers from delayed messages and untimely handling. When a hardware problem occurs in the data center and the affected equipment is not powered off in time, the equipment may overheat, catch fire, or explode while it remains energized.

Disclosure of Invention

The technical problem to be solved by the invention is to provide a refined intelligent monitoring method for a data center that is highly intelligent, can quickly locate abnormal equipment, repairs it automatically and reports it in time, and improves maintenance efficiency and operational stability.

In order to solve the technical problems, the invention adopts the following technical scheme:

A refined intelligent monitoring method for a data center comprises:

establishing a database, collecting log and operation-record data of the data center and recording it in the database, and periodically scanning each device of the data center to acquire fault data and record the fault data in the database;

placing the equipment of the data center together according to function and purpose, grouping the data center into a plurality of working groups, acquiring temperature and humidity data and hardware telemetry data of each working group and of each device, and supplying power to each working group separately;

generating a fault event from the fault data, and locating the faulty working group and faulty device that produced the fault event according to the fault event information;

performing hardware troubleshooting on the faulty working group and faulty device, and judging from their temperature and humidity data and hardware telemetry data whether the fault is a hardware fault;

when the fault is determined to be a hardware fault, powering off the faulty working group and faulty device, generating maintenance information, and reporting it automatically;

when the hardware of the faulty working group and faulty device is running normally, troubleshooting the faulty system and repairing it automatically.

Further, the method includes a log analysis tool, and the logs recorded in the database are analyzed by the log analysis tool.
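As an illustration only, the database and log analysis steps could be sketched as follows. The log line format, the table name `fault_data`, and the function names are assumptions made for this example; the source does not specify them. SQLite is used here purely as a stand-in database.

```python
import re
import sqlite3

# Hypothetical log format (not given in the source):
#   "<timestamp> <device_id> <LEVEL> <message>"
LOG_PATTERN = re.compile(r"^(\S+)\s+(\S+)\s+(INFO|WARN|ERROR)\s+(.*)$")


def create_database() -> sqlite3.Connection:
    """Create an in-memory database with one table for fault data."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE fault_data (ts TEXT, device_id TEXT, message TEXT)"
    )
    return conn


def record_faults(conn: sqlite3.Connection, log_lines) -> int:
    """Parse log lines and store ERROR entries as fault data.

    Lines that do not match the assumed format are skipped.
    Returns the number of fault rows written.
    """
    rows = []
    for line in log_lines:
        match = LOG_PATTERN.match(line.strip())
        if match and match.group(3) == "ERROR":
            ts, device_id, _level, message = match.groups()
            rows.append((ts, device_id, message))
    conn.executemany("INSERT INTO fault_data VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)
```

In this sketch, treating only `ERROR` lines as fault data is a simplification; a real log analysis tool would likely apply richer rules.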

Further, the step of grouping the data center specifically includes:

dividing the data center into a plurality of working groups according to the cabinet distribution of the data center, and supplying power to each working group separately;

performing video monitoring of the data center and acquiring a monitoring picture;

and dividing the monitoring picture according to the division of the working groups, and displaying it visually.
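The cabinet-based grouping described above might look like the following minimal sketch. The `Device` type, the cabinet labels, and the function names are hypothetical; the source states only that working groups follow the cabinet distribution.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    device_id: str
    cabinet: str  # hypothetical cabinet label, e.g. "A1"


def group_by_cabinet(devices):
    """Map each cabinet label to the device IDs it holds (one working group per cabinet)."""
    groups = defaultdict(list)
    for device in devices:
        groups[device.cabinet].append(device.device_id)
    return dict(groups)


def locate_group(groups, device_id):
    """Return the working group containing device_id, or None if it is not found."""
    for name, members in groups.items():
        if device_id in members:
            return name
    return None
```

Keeping one working group per cabinet is an assumption of this example; it makes it straightforward to associate each group with its own power feed and its own region of the monitoring picture.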

Further, video monitoring of the data center and acquisition of the monitoring picture use a high-definition camera, and a video picture divider and a large-screen display device are used to divide the monitoring picture according to the division of the working groups and to display it visually.

Further, the hardware telemetry data includes data streams generated by the CPU, the memory, and the PCIe interface.

Further, the temperature and humidity data and hardware telemetry data of each working group and of each device are acquired by temperature and humidity sensors and registers, and the registers are used to monitor cache, CPU frequency, memory bandwidth, and input/output access.

The invention also provides a refined intelligent monitoring device for a data center, which comprises:

a database module, used for establishing a database, collecting log and operation-record data of the data center and recording it in the database, and periodically scanning each device of the data center to acquire fault data and record the fault data in the database;

a data center grouping and operation data monitoring module, used for grouping the data center into a plurality of working groups, acquiring temperature and humidity data and hardware telemetry data of each working group and of each device, and supplying power to each working group separately;

a fault locating module, used for generating a fault event from the fault data and locating the faulty working group and faulty device that produced the fault event according to the fault event information;

a hardware troubleshooting module, used for performing hardware troubleshooting on the faulty working group and faulty device and judging from their temperature and humidity data and hardware telemetry data whether the fault is a hardware fault;

a hardware fault operation module, used for powering off the faulty working group and faulty device when the fault is determined to be a hardware fault, generating maintenance information, and reporting it automatically;

and a system fault operation module, used for troubleshooting the faulty system and repairing it automatically when the hardware of the faulty working group and faulty device is running normally.
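To make the role of the fault locating module more concrete, here is a minimal sketch of turning fault data into located fault events. The `FaultEvent` structure, the `(device_id, message)` row shape, and the device-to-group mapping are assumptions for illustration, not structures defined by the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FaultEvent:
    device_id: str
    working_group: str
    messages: tuple


def generate_fault_events(fault_rows, device_to_group):
    """Build one fault event per device from (device_id, message) fault rows.

    device_to_group maps device IDs to working-group names; devices that are
    not in the mapping are assigned the placeholder group "unknown".
    """
    by_device = {}
    for device_id, message in fault_rows:
        by_device.setdefault(device_id, []).append(message)
    return [
        FaultEvent(device_id, device_to_group.get(device_id, "unknown"), tuple(messages))
        for device_id, messages in sorted(by_device.items())
    ]
```

Grouping all fault rows of one device into a single event is a design choice of this sketch; the source does not say how fault data is aggregated into events.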

The invention also provides a computer device comprising a memory and a processor, wherein computer-readable instructions are stored in the memory, and the processor executes the computer-readable instructions to implement the steps of the refined intelligent monitoring method for a data center.

The invention also provides a computer-readable storage medium on which computer-readable instructions are stored; when the computer-readable instructions are executed by a processor, the steps of the refined intelligent monitoring method for a data center are implemented.

The beneficial effects of the invention are as follows. In actual use, a database is established, log and operation-record data of the data center are collected and recorded in the database, and each device of the data center is periodically scanned to acquire fault data, which is also recorded in the database. The data center is then divided into a plurality of working groups; temperature and humidity data and hardware telemetry data are acquired for each working group and each device, and power is supplied to each working group separately. A fault event is generated from the fault data, and the faulty working group and faulty device that produced it are located according to the fault event information. Hardware troubleshooting is then performed on the faulty working group and faulty device, because all application services depend on the normal operation of the physical hardware, and a hardware problem prevents some functions from working or running properly. When the fault is determined to be a hardware fault, the faulty working group and faulty device are powered off, maintenance information is generated, and it is reported automatically. When the hardware of the faulty working group and faulty device is running normally, the faulty system is checked and repaired automatically. The refined intelligent monitoring method for a data center is highly intelligent, can quickly locate abnormal equipment, repairs it automatically and reports it in time, and improves maintenance efficiency and operational stability.

Drawings

FIG. 1 is a block flow diagram of the present invention;

FIG. 2 is a schematic structural diagram of a data center refined intelligent monitoring device according to the present invention;

FIG. 3 is a schematic diagram of the structure of the computer device of the present invention;

FIG. 4 is a schematic diagram of the monitoring pictures of the grouped data center according to the present invention.

Detailed Description

To facilitate understanding by those skilled in the art, the present invention is further described below with reference to examples and the accompanying drawings; the content of the embodiments does not limit the present invention.

As shown in FIG. 1, a refined intelligent monitoring method for a data center includes:

establishing a database, collecting log and operation-record data of the data center and recording it in the database, and periodically scanning each device of the data center to acquire fault data and record the fault data in the database;

grouping the data center into a plurality of working groups, and acquiring temperature and humidity data and hardware telemetry data of each working group and of each device;

performing hardware troubleshooting on the faulty working group and faulty device: all application services depend on the normal operation of the physical hardware, and a hardware problem prevents some functions from working or running properly, so whether a hardware fault exists is judged from the temperature and humidity data and hardware telemetry data of the faulty working group and faulty device;

when the fault is determined to be a hardware fault, generating maintenance information and reporting it automatically.

When the database is established, it can be run on an independent server. When an independent server is used, care should be taken not to open files or mail attachments from strangers or of unknown origin, not to click Office macro execution prompts, and not to double-click files with .js or .vbs suffixes; in addition, security signature databases such as anti-virus definitions should be updated promptly, and important data and files should be backed up off-site periodically. When the data center is grouped, grouping is based mainly on where the working groups are placed; therefore, when laying out and building the working groups, the components of the same working group are placed together as far as possible to ease later maintenance and use. Log and operation-record data of the data center are collected and periodically recorded in the database, and every device of the data center is scanned to acquire fault data, so that all working groups and devices are monitored systematically. By collecting this data together with the temperature and humidity data and hardware telemetry data of each working group and device, the working groups and devices can be monitored automatically and remotely while they operate, so that abnormal operation can be discovered in time and repaired, with manual intervention where needed.

The specific working principle and implementation of the invention are as follows. In operation, the data center is grouped according to function, purpose, and placement. Fault events are generated from the fault data collected in the database, and the faulty working group and faulty device that produced each fault event are located according to the fault event information. The cause of the fault event is then judged to be either a hardware fault or a system fault. When a system fault occurs, the faulty system can be checked and repaired automatically. When a hardware fault occurs, the faulty working group cannot be restored to operation by system repair. When the temperature and humidity data and hardware telemetry data of the faulty working group reach the set thresholds, the faulty device is at risk of fire and explosion; when the hardware telemetry data is abnormal, the faulty device is running abnormally and cannot work normally. Powering off the faulty working group and faulty device in advance prevents the faulty device from catching fire or exploding while energized, so cutting off the power supply at this point effectively prevents fire and explosion accidents. The fault is then reported automatically so that maintenance personnel are contacted to carry out repairs.
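The decision between the hardware-fault path and the system-fault path can be sketched as below. This is a simplified illustration under stated assumptions: the threshold values, the `Reading` fields, and the single `telemetry_ok` flag are hypothetical, since the source refers only to "set thresholds" and "abnormal" telemetry without giving values.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    POWER_OFF_AND_REPORT = "power_off_and_report"  # hardware-fault path
    AUTO_REPAIR = "auto_repair"                    # system-fault path


@dataclass
class Reading:
    temperature_c: float
    humidity_pct: float
    telemetry_ok: bool  # hypothetical summary of CPU/memory/PCIe telemetry


# Hypothetical thresholds; the source does not state concrete values.
MAX_TEMPERATURE_C = 45.0
MAX_HUMIDITY_PCT = 80.0


def is_hardware_fault(reading: Reading) -> bool:
    """Treat a threshold breach or abnormal telemetry as a hardware fault."""
    return (
        reading.temperature_c >= MAX_TEMPERATURE_C
        or reading.humidity_pct >= MAX_HUMIDITY_PCT
        or not reading.telemetry_ok
    )


def triage(reading: Reading) -> Action:
    """Choose the handling path for a located faulty device."""
    if is_hardware_fault(reading):
        return Action.POWER_OFF_AND_REPORT
    return Action.AUTO_REPAIR
```

For example, under these assumed thresholds, `triage(Reading(50.0, 40.0, True))` selects the power-off-and-report path, while a reading within limits and with normal telemetry selects automatic repair.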

In this embodiment: video monitoring of the data center and acquisition of the monitoring picture use a high-definition camera.

In the above structure: the high-definition camera is installed on the ceiling inside the data center machine room and can cover the whole data center; the data center is monitored through the high-definition camera so that back-end equipment can acquire the monitoring picture.

As shown in FIG. 4, in this embodiment: dividing the monitoring picture according to the division of the working groups and displaying it visually uses a video picture divider and a large-screen display device.

In the above structure: the monitoring data from the high-definition camera is split into pictures by the video picture divider and then displayed on the large-screen display device. With the split video pictures, staff can directly see the working condition of each working group and quickly locate the working group that has a problem, which improves the efficiency of maintenance personnel.
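One way to picture the division of the large-screen display among working groups is a simple grid layout, sketched below. The near-square grid rule, the pixel-based screen size, and the function name are assumptions of this example; the source does not describe how the video picture divider arranges the tiles.

```python
import math


def tile_layout(group_names, screen_w, screen_h):
    """Assign each working group a rectangular tile (x, y, width, height).

    Tiles are laid out row by row on a near-square grid. This layout rule is
    an assumption for illustration only.
    """
    count = len(group_names)
    if count == 0:
        return {}
    cols = math.ceil(math.sqrt(count))
    rows = math.ceil(count / cols)
    tile_w, tile_h = screen_w // cols, screen_h // rows
    layout = {}
    for index, name in enumerate(group_names):
        row, col = divmod(index, cols)
        layout[name] = (col * tile_w, row * tile_h, tile_w, tile_h)
    return layout
```

With four working groups on a 1920 by 1080 display, this sketch yields a 2 by 2 grid of 960 by 540 tiles, so a faulty working group maps to a fixed region of the screen.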

In this embodiment: the temperature and humidity data and hardware telemetry data of each working group and of each device are acquired by temperature and humidity sensors and registers, and the registers are used to monitor cache, CPU frequency, memory bandwidth, and input/output access.

In the above structure: the temperature and humidity sensors monitor temperature and humidity while the data center operates. There are multiple sensors, installed on the equipment of each working group, so the temperature and humidity of each device can be monitored accurately. The registers monitor operating conditions of the equipment such as cache, CPU frequency, memory bandwidth, and input/output access. When a problem occurs, it can be detected in time and maintenance personnel can be contacted for repair.
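A check of register-derived telemetry against normal operating ranges might be sketched as follows. The metric names and the ranges are invented for illustration; the source lists the monitored quantities (cache, CPU frequency, memory bandwidth, input/output access) but gives no units or limits.

```python
# Hypothetical normal ranges per metric; the source gives no concrete values.
NORMAL_RANGES = {
    "cpu_freq_mhz": (800.0, 4000.0),
    "mem_bandwidth_gbps": (0.0, 200.0),
    "cache_miss_ratio": (0.0, 0.5),
    "io_ops_per_s": (0.0, 1_000_000.0),
}


def abnormal_metrics(sample, ranges=NORMAL_RANGES):
    """Return the names of metrics that are missing or outside their normal range.

    sample is a dict mapping metric names to the latest values read for one device.
    """
    flagged = []
    for name, (low, high) in ranges.items():
        value = sample.get(name)
        if value is None or not (low <= value <= high):
            flagged.append(name)
    return flagged
```

Treating a missing metric as abnormal is a deliberate choice in this sketch: a device that stops reporting telemetry is flagged rather than silently passed.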

As shown in FIG. 2, a refined intelligent monitoring device for a data center includes:

the database module is used for establishing a database, collecting information data of logs and operation records of the data center, collecting and recording the data into the database, regularly scanning each device of the data center, acquiring fault data and recording the fault data into the database;

and the system fault operation module is used for checking a fault system and automatically repairing the fault system when the hardware of the fault working group and the fault equipment normally runs.

In the above structure: the database module collects log and operation-record data of the data center and records it in the database, periodically scans each device of the data center to acquire fault data, and records the fault data in the database. The data center grouping and operation data monitoring module divides the data center into a plurality of working groups, acquires temperature and humidity data and hardware telemetry data of each working group and of each device, and supplies power to each working group separately. The fault locating module then generates a fault event from the fault data and locates the faulty working group and faulty device that produced the fault event according to the fault event information. Hardware troubleshooting is then performed on the faulty working group and faulty device using the temperature and humidity data and hardware telemetry data of each working group and each device, and after this check the hardware troubleshooting module judges whether the fault is a hardware fault. When the fault is determined to be a hardware fault, the hardware fault operation module powers off the faulty working group and faulty device, generates maintenance information, and reports it automatically. When the hardware of the faulty working group and faulty device is running normally, the system fault operation module troubleshoots the faulty system and repairs it automatically.

As shown in FIG. 3, the computer device includes a memory, a processor, and a network interface communicatively connected to each other via a system bus. It should be noted that FIG. 3 shows only a computer device having a memory, a processor, and a network interface, but it should be understood that not all of the illustrated components are required, and that more or fewer components may be implemented instead. As will be understood by those skilled in the art, the computer device here is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes but is not limited to a microprocessor, an application-specific integrated circuit, a programmable gate array, a digital processor, an embedded device, and the like. The memory includes at least one type of readable storage medium, including a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, etc. In some embodiments, the processor may be a central processing unit, controller, microcontroller, microprocessor, or other data processing chip. The processor is typically used to control the overall operation of the computer device. In this embodiment, the processor is configured to run program code stored in the memory or to process data, for example to run the program code of the refined intelligent monitoring method for a data center.

All the technical features in the embodiment can be freely combined according to actual needs.

The above embodiments are preferred implementations of the present invention; the invention may also be implemented in other ways, and any obvious substitution that does not depart from the concept of the invention falls within its scope of protection.

Claims

1. A refined intelligent monitoring method for a data center is characterized by comprising the following steps:

performing hardware troubleshooting on the faulty working group and faulty device, and judging from the temperature and humidity data and hardware telemetry data of the faulty working group and faulty device whether the fault is a hardware fault;

2. The refined intelligent monitoring method for a data center according to claim 1, characterized in that: the method further comprises a log analysis tool, and the logs recorded in the database are analyzed by the log analysis tool.

3. The refined intelligent monitoring method for a data center according to claim 1, characterized in that: the step of grouping the data center specifically includes:

dividing the data center into a plurality of working groups according to the cabinet distribution of the data center, and supplying power to each working group separately;

and dividing the monitoring picture according to the division of the working groups, and displaying it visually.

4. The refined intelligent monitoring method for a data center according to claim 3, characterized in that: video monitoring of the data center and acquisition of the monitoring picture use a high-definition camera;

the step of dividing the monitoring picture according to the division of the working groups and displaying it visually uses a video picture divider and a large-screen display device.

5. The refined intelligent monitoring method for a data center according to claim 1, characterized in that: the hardware telemetry data includes data streams generated by the CPU, the memory, and the PCIe interface.

6. The refined intelligent monitoring method for a data center according to claim 5, characterized in that: the acquired temperature and humidity data and hardware telemetry data of each working group and of each device comprise temperature and humidity sensor data and register data, and the registers are used to monitor cache, CPU frequency, memory bandwidth, and input/output access.