KR101783201B1 - System and method for managing servers totally - Google Patents

System and method for managing servers totally

Info

Publication number
KR101783201B1
Authority
KR
South Korea
Prior art keywords
server
managed
failure
battery
managed server
Prior art date
Application number
KR1020150178246A
Other languages
Korean (ko)
Other versions
KR20170070568A (en
Inventor
유세권
Original Assignee
주식회사 이스턴생명과학
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 주식회사 이스턴생명과학 filed Critical 주식회사 이스턴생명과학
Priority to KR1020150178246A priority Critical patent/KR101783201B1/en
Publication of KR20170070568A publication Critical patent/KR20170070568A/en
Application granted granted Critical
Publication of KR101783201B1 publication Critical patent/KR101783201B1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F1/00 Details not covered by groups G06F3/00 - G06F13/00 and G06F21/00
    • G06F1/26 Power supply means, e.g. regulation thereof
    • G06F1/30 Means for acting in the event of power-supply failure or interruption, e.g. power-supply fluctuations
    • G06Q50/32
    • H04L51/22
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W4/00 Services specially adapted for wireless communication networks; Facilities therefor
    • H04W4/12 Messaging; Mailboxes; Announcements
    • H04W4/14 Short messaging services, e.g. short message services [SMS] or unstructured supplementary service data [USSD]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Tourism & Hospitality (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Economics (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The present invention relates to a server integrated management system and method for managing servers in an integrated manner. In a server integrated management system that integrates and manages two or more managed servers, the system comprises: a management server that collects hardware information and software information from the two or more managed servers and identifies and manages the status of each server; a database that stores the collected hardware information and software information and provides the stored information to the management server; and an administrator terminal, used by an administrator who manages the server integrated management system, that communicates with the management server, displays the status of the managed servers on a screen, and transmits commands input by the administrator to the management server. To prevent similar failures from recurring, the management server analyzes the failure patterns of the managed servers and, when a predetermined event occurs in a managed server, transmits to that managed server a predicted failure occurrence message indicating that a failure may occur as a result of the event, together with a solution to the predicted failure. According to the present invention, failures can be prevented in advance by predicting server failures, issuing warnings, and providing solutions, thereby reducing the damage caused by server failures.

Description

[0001] The present invention relates to a server integrated management system and method for managing servers in an integrated manner. More particularly, it relates to a server integrated management system and method that analyze failure patterns occurring in servers, predict failures in advance, issue warnings, and provide solutions.

BACKGROUND ART [0002] Recently, as computers have grown in capacity and speed, computer problems caused by system errors and viruses have become frequent. On large-capacity servers in particular, problems can arise frequently from the many operations they perform, such as running application programs and storing, reading, and transmitting data. Each company therefore employs a dedicated server manager who manages these servers and handles failures when they occur.

However, specialized skills are required for server management, and a considerable expense is required to employ such skilled personnel. Therefore, especially in a small-sized enterprise, a suitable person is selected as a server manager instead of employing a professional engineer as a server manager. In such a case, it is difficult to manage the server smoothly, and it is almost impossible to smoothly cope with a server failure.

In addition, even when a server manager with specialized server management skills is employed, if the manager is away from the server, for example on a business trip, it is difficult for the manager to be notified promptly of the server's condition. Even when the manager is informed of a server failure, it is difficult to deal with it immediately because the manager is at a remote location. As a result, the server may be seriously damaged.

Conventionally, when a server failure occurs in a server integrated management system that manages a plurality of servers, the system detects the failure and recovers from it afterward. However, this post-failure recovery method has the problems that operation of the failed server is interrupted while it is being recovered, that losses arise from the interruption of server use, and that the labor and cost of recovery are large.

Korean Patent Publication No. 10-2015-0124642

SUMMARY OF THE INVENTION The present invention has been conceived to solve the above-mentioned problems, and its object is to provide a server integrated management system and method that prevent failures by detecting failure patterns occurring in servers in advance, predicting failures, issuing warnings, and providing solutions.

The objects of the present invention are not limited to the above-mentioned objects, and other objects not mentioned can be clearly understood by those skilled in the art from the following description.

In order to achieve the above object, according to the present invention, a server integrated management system that integrates and manages two or more managed servers includes: a management server that collects hardware information and software information from the two or more managed servers and identifies and manages the status of each server; a database that stores the collected hardware information and software information and provides the stored information to the management server; and an administrator terminal, used by an administrator who manages the server integrated management system, that communicates with the management server, displays the status of the managed servers on a screen, and transmits commands input by the administrator to the management server. To prevent similar failures from recurring, the management server analyzes the failure patterns of the managed servers and, when a predetermined event occurs in a managed server, transmits to that managed server a predicted failure occurrence message describing the failure that may occur as a result of the event, together with a solution to the predicted failure.

The management server may transmit the predicted failure occurrence message to the manager in charge of the corresponding managed server, as registered in the database, through short message service (SMS) and e-mail, together with a solution to the predicted failure.

The management server may check the backup battery unit (BBU) cycle of the managed server and notify the managed server when a predetermined period has elapsed.

The management server may check the BBU charge capacity of the managed server and notify the managed server when the charging efficiency of the battery falls to a predetermined value or less, for example 40% or less.

The management server may check the remaining capacity of the BBU of the managed server and notify the managed server when the remaining capacity of the battery is a predetermined value or less, for example 10% or less.

The management server may check the BBU write policy of the managed server and notify the managed server of the change when the write policy is changed.

The managed servers may include a Dell server. When the management server detects an abnormal operation of the operating system (OS) after a kernel update on the Dell server, it may transmit a predicted failure occurrence message to the corresponding managed server together with a solution to the predicted failure.

The management server may diagnose the memory production cycle of the managed server, judge memory from a predetermined production cycle to be defective, and notify the managed server.

In a server integrated management method for a server integrated management system that integrates and manages two or more managed servers according to the present invention, the server integrated management system collects hardware information and software information from the two or more managed servers, analyzes the failure patterns of the managed servers, and, when a predetermined event occurs in a managed server, transmits to that managed server a predicted failure occurrence message indicating that a failure may occur as a result of the event, together with a solution to the predicted failure.

The server integrated management system may transmit the predicted failure occurrence message to the manager in charge of the registered managed server through short message service (SMS) and e-mail, together with a solution to the predicted failure.

The server integrated management system may check the backup battery unit (BBU) cycle of the managed server and notify the managed server when a predetermined period has elapsed.

The server integrated management system may check the BBU charge capacity of the managed server and notify the managed server when the charging efficiency of the battery falls to a predetermined value or less.

The server integrated management system may check the BBU charge capacity of the managed server and notify the managed server when the charging efficiency of the battery falls to 40% or less.

The server integrated management system may check the remaining capacity of the BBU of the managed server and notify the managed server when the remaining capacity of the battery is a predetermined value or less.

The server integrated management system may check the remaining capacity of the BBU of the managed server and notify the managed server when the remaining capacity of the battery is 10% or less.

The server integrated management system may check the BBU write policy of the managed server and notify the managed server of the change when the write policy is changed.

The managed servers may include a Dell server. When the server integrated management system detects an abnormal operation of the operating system (OS) after a kernel update on the Dell server, it may transmit a predicted failure occurrence message to the corresponding managed server together with a solution to the predicted failure.

The server integrated management system may diagnose the memory production cycle of the managed server, judge memory from a predetermined production cycle to be defective, and notify the managed server.

According to the present invention, a failure occurring in the server can be prevented in advance by predicting and alerting a failure occurring in the server and providing a solution, thereby reducing the damage caused by the server failure.

In addition, according to the present invention, the failure pattern generated in the server is analyzed and updated to actively cope with various server failures.

In addition, according to the present invention, not only a server failure is notified in advance, but also a solution method thereof is presented, thereby providing convenience in that the server manager can manage the server more easily.

FIG. 1 is a diagram illustrating a network configuration of a server integrated management system according to an embodiment of the present invention.
FIG. 2 is a block diagram illustrating an internal configuration of a server integrated management system according to an embodiment of the present invention.
FIGS. 3 to 12 are screen examples of a server integrated management system according to an embodiment of the present invention.
FIGS. 13 to 16 are exemplary report screens of the server integrated management system according to an embodiment of the present invention.
FIG. 17 is an example of a screen when an event occurs in the server according to an embodiment of the present invention.
FIG. 18 is a flowchart illustrating a server integrated management method according to an embodiment of the present invention.

While the invention is susceptible to various modifications and alternative forms, specific embodiments thereof are shown by way of example in the drawings and will herein be described in detail. It should be understood, however, that the invention is not intended to be limited to the particular embodiments, but includes all modifications, equivalents, and alternatives falling within the spirit and scope of the invention.

The terminology used in this application is used only to describe specific embodiments and is not intended to limit the invention. Singular expressions include plural expressions unless the context clearly dictates otherwise. In the present application, terms such as "comprises" or "having" specify the presence of a feature, number, step, operation, element, component, or combination thereof described in the specification, but do not preclude the presence or addition of one or more other features, numbers, steps, operations, elements, components, or combinations thereof.

Unless defined otherwise, all terms used herein, including technical or scientific terms, have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Terms such as those defined in commonly used dictionaries are to be interpreted as having a meaning consistent with their meaning in the context of the related art, and are not to be interpreted in an ideal or overly formal sense unless expressly so defined in the present application.

In the following description of the present invention with reference to the accompanying drawings, the same components are denoted by the same reference numerals regardless of the figure number, and redundant explanations thereof are omitted. DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS Hereinafter, exemplary embodiments of the present invention will be described in detail with reference to the accompanying drawings. In the following description, well-known functions or constructions are not described in detail where doing so would obscure the invention in unnecessary detail.

The present invention relates to a server integration management system that integrates and manages two or more managed servers.

FIG. 1 is a diagram illustrating a network configuration of a server integrated management system according to an embodiment of the present invention.

Referring to FIG. 1, the server integrated management system of the present invention includes a management server 110, a database 120, and an administrator terminal 130.

The server integrated management system integrates and manages a plurality of servers 10, 20, 30, and 40. The server to be managed in the present invention may be various x86 servers, for example, a Dell server 10, an HP server 20, an IBM server 30, and an X86 server 40.

The managed servers 10, 20, 30, and 40 and the management server 110 communicate with each other through various wired and wireless communication methods, for example HTTP communication or POST transmission of JSON-formatted data.
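As an illustration of the JSON-over-HTTP POST transmission mentioned above, the following minimal sketch builds such a request with the Python standard library. The endpoint URL and the payload field names are assumptions made for this example; the source does not define a message format.

```python
import json
import urllib.request


def build_report_request(url: str, report: dict) -> urllib.request.Request:
    """Build (but do not send) an HTTP POST that carries a JSON status report."""
    body = json.dumps(report).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical endpoint and payload fields, for illustration only.
request = build_report_request(
    "http://management-server.example/api/reports",
    {"server_id": 10, "vendor": "Dell", "bbu_remaining_pct": 85},
)
# urllib.request.urlopen(request) would transmit it to a real endpoint.
```

Building the request separately from sending it keeps the sketch runnable without a network connection.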

In addition, the servers 10, 20, 30, and 40 automatically execute scripts according to predetermined scheduling in various x86 servers in a large-scale computing environment.

The manager accesses the management server 110 through the administrator terminal 130, executes a batch program according to the schedule determined by the management server 110, and manages the change history by comparing the results with existing data. In the present invention, the administrator terminal 130 may be a desktop computer, a laptop computer, a tablet PC, a mobile phone, a smartphone, or the like.
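Managing a change history by comparison with existing data can be sketched as a diff between two inventory snapshots. The function name and the snapshot keys below are hypothetical; the source does not describe how the comparison is implemented.

```python
def diff_inventory(previous: dict, current: dict) -> dict:
    """Return {item: (old_value, new_value)} for every item that changed."""
    changes = {}
    for key in sorted(previous.keys() | current.keys()):
        old, new = previous.get(key), current.get(key)
        if old != new:
            changes[key] = (old, new)
    return changes


# Hypothetical snapshots from two consecutive batch runs.
before = {"bios": "1.1.4", "os": "Windows 2008"}
after = {"bios": "1.2.10", "os": "Windows 2008", "hotfix": "KB3064209"}
history_entry = diff_inventory(before, after)
```

Items that appear only in one snapshot show up with `None` on the other side, so additions and removals are recorded as well as modifications.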

The management server 110 automatically collects hardware information and software information from the servers 10, 20, 30, and 40, identifies the status of each server based on that information, and provides the management service.

The database 120 stores data necessary for the management of the servers 10, 20, 30, and 40 and provides data at the request of the management server 110. That is, the database 120 stores the collected hardware information and software information, and provides the stored information to the management server 110.

The administrator terminal 130 is a terminal used by an administrator who manages the server integrated management system. It communicates with the management server 110, displays the status of the managed servers 10, 20, 30, and 40 on the screen, and transmits commands input by the administrator to the management server 110.

In the present invention, to prevent similar failures from recurring, the management server 110 analyzes the failure patterns of the managed servers 10, 20, 30, and 40 and diagnoses them. When a predetermined event occurs in a managed server, it transmits to that managed server a predicted failure occurrence message indicating that a failure may occur as a result of the event, together with a solution to the predicted failure.

The management server 110 transmits the predicted failure occurrence message to the manager in charge of the corresponding managed server, as registered in the database 120, through short message service (SMS) and e-mail, and delivers the solution to the predicted failure together with detailed information.
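A minimal sketch of composing the two notifications described above follows. The sender address, the message wording, and the 160-character SMS limit are assumptions for illustration; the source states only that the message and its solution are delivered by SMS and e-mail.

```python
from email.message import EmailMessage


def build_failure_email(manager_addr: str, server_name: str,
                        predicted_failure: str, solution: str) -> EmailMessage:
    """Compose an e-mail carrying the predicted failure and its solution."""
    msg = EmailMessage()
    msg["To"] = manager_addr
    msg["From"] = "management-server@example.com"  # hypothetical sender
    msg["Subject"] = f"Predicted failure on {server_name}"
    msg.set_content(
        f"Server: {server_name}\n"
        f"Predicted failure: {predicted_failure}\n"
        f"Recommended solution: {solution}\n"
    )
    return msg


def build_sms_text(server_name: str, predicted_failure: str, limit: int = 160) -> str:
    """Compose a short text; the 160-character limit is an assumed default."""
    return f"{server_name}: {predicted_failure}"[:limit]


email_msg = build_failure_email(
    "manager@example.com", "dell-10",
    "BBU remaining capacity at or below 10%", "Charge or replace the battery",
)
sms_text = build_sms_text("dell-10", "BBU remaining capacity at or below 10%")
```

Actually sending the messages (for example through an SMTP relay or an SMS gateway) is left out because the source does not name a delivery mechanism.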

The management server 110 checks the backup battery unit (BBU) cycle of the managed server and informs the managed server of the contents of the backup battery unit cycle when the predetermined period has elapsed.

The management server 110 also checks the charge capacity of the BBU of the managed server and notifies the managed server when the charging efficiency of the battery falls to a predetermined value or less, for example 40% or less.

The management server 110 checks the remaining capacity of the BBU of the managed server and notifies the managed server of the remaining capacity of the battery when the remaining capacity of the battery is less than a predetermined value. For example, the management server 110 may check the remaining capacity of the BBU of the managed server, and notify the managed server of the remaining capacity of the battery when the remaining battery capacity is 10% or less.

In addition, the management server 110 can check the BBU write policy of the managed server and notify the managed server of the write policy when the write policy is changed.

FIG. 2 is a block diagram illustrating an internal configuration of a server integrated management system according to an embodiment of the present invention.

Referring to FIG. 2, the managed server 10 includes an information collecting unit 11, an information processing unit 12, an information transmitting unit 13, a command receiving unit 14, and a command executing unit 15.

The management server 110 includes an information receiving unit 111, an information analyzing unit 112, an information storing unit 113, a command receiving unit 114, and a command transmitting unit 115.

The administrator terminal 130 includes a command transmitting unit 131.

The information collecting unit 11 of the server 10 to be managed collects information necessary for the diagnosis of the server 10.

The information processing unit 12 processes the collected information according to the transmission format.

The information transmitting unit 13 transmits the processed information to the management server 110.

The command receiving unit 14 receives commands transmitted from the management server 110.

The command executing unit 15 executes the received commands.

The information receiving unit 111 of the management server 110 receives the information transmitted from the server 10.

The information analysis unit 112 analyzes the received information.

The information storage unit 113 serves to store the analyzed information in the database 120.

The command receiving unit 114 receives commands transmitted from the administrator terminal 130.

The command transmitting unit 115 transmits the received commands to the server 10.

The command transmitting unit 131 of the administrator terminal 130 transmits commands input by the manager to the management server 110.
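The information path through the units described above (collect, process, transmit, receive, analyze, store) can be sketched as a chain of small functions. Everything here is a hypothetical stand-in: the field names, the use of JSON as the transmission format, and the in-memory list that takes the place of the database 120.

```python
import json

database = []  # in-memory stand-in for the database 120


def collect_info() -> dict:
    """Information collecting unit 11: gather diagnostic data (placeholder values)."""
    return {"hostname": "dell-10", "bbu_remaining_pct": 8}


def process_info(info: dict) -> str:
    """Information processing unit 12: convert to the transmission format (JSON assumed)."""
    return json.dumps(info, sort_keys=True)


def receive_info(payload: str) -> dict:
    """Information receiving unit 111: decode what the managed server sent."""
    return json.loads(payload)


def analyze_info(info: dict) -> dict:
    """Information analyzing unit 112: flag a BBU at or below 10% remaining."""
    return {**info, "bbu_low": info["bbu_remaining_pct"] <= 10}


def store_info(record: dict) -> None:
    """Information storing unit 113: persist the analyzed record."""
    database.append(record)


# The transmitting unit 13 is represented by simply passing the payload along.
store_info(analyze_info(receive_info(process_info(collect_info()))))
```

The command path (units 131, 114, 115, 14, and 15) runs in the opposite direction and could be sketched the same way.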

The server integrated management system of the present invention integrates and manages a plurality of servers, diagnoses various functions of the servers, predicts failures and issues warnings in advance, and presents solutions.

First, among the various functions of a server, the backup battery unit (BBU) is described as an example.

Taking a Dell server as an example, the battery status of the BBU must be checked and the battery replaced preemptively to prevent loss of cache data caused by a battery controller failure. To do this, the full charging efficiency (%) of the battery is read from the Dell server's log, equipment whose full charging efficiency is below 50% is identified, and its battery is replaced. Battery charging efficiency naturally falls to about 70% after 36 months, so a battery that has lost about a further 20% can be judged to have poor charging efficiency.
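The 50% replacement threshold follows from the two figures in the paragraph above: about 70% after natural aging, less a further drop of about 20%. A small sketch of that arithmetic, with hypothetical names:

```python
NATURAL_EFFICIENCY_PCT = 70   # about 70% after 36 months, per the text
ADDITIONAL_DROP_PCT = 20      # further loss that marks a battery as poor
REPLACE_BELOW_PCT = NATURAL_EFFICIENCY_PCT - ADDITIONAL_DROP_PCT  # 50


def needs_replacement(full_charge_efficiency_pct: float) -> bool:
    """True when the full charging efficiency read from the log is below 50%."""
    return full_charge_efficiency_pct < REPLACE_BELOW_PCT
```

For example, `needs_replacement(48.0)` is true, while a battery at exactly 50% is not yet flagged because the text says "less than 50%".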

FIG. 3 to FIG. 6 show an example of a BBU management function according to an embodiment of the present invention.

Referring to FIGS. 3 to 6, the server integrated management system of the present invention performs a BBU cycle check, a charge capacity check, a remaining capacity check, and a write policy check, thereby preventing risk factors such as cache data loss in advance.

FIG. 3 shows an example of a BBU cycle check screen. As a symptom, while the battery is charging, the disk write policy changes from WriteBack to WriteThrough, which reduces speed and can cause data loss. When the cycle approaches 90 days, the system sends the related information to the corresponding server.

FIG. 4 shows an example of a screen for checking the BBU charge capacity. As a symptom, the charging efficiency of the battery drops and charging is required frequently. When the charging efficiency of the battery falls to 40% or less, the system sends the related information to the corresponding server.

FIG. 5 shows an example of a screen for checking the remaining capacity of the BBU. As a symptom, the remaining battery capacity may fall to a dangerous level and the disk write policy may change. The remedy is to charge the battery. When the remaining battery capacity is 10% or less, the system sends the related information to the corresponding server.

FIG. 6 shows an example of the BBU write policy check screen. As a symptom, the write policy of the RC card changes and speed drops. The remedy is to check the RC card and the battery. Servers whose write policy has changed are identified through the notification function.
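The four BBU checks described for FIGS. 3 to 6 can be combined into one routine. This is a sketch under stated assumptions: the field names are hypothetical, and the 7-day margin used to interpret "near 90 days" is not given in the source. The 90-day, 40%, and 10% values come from the text.

```python
from dataclasses import dataclass


@dataclass
class BbuStatus:
    days_since_last_cycle: int
    charge_efficiency_pct: float
    remaining_pct: float
    write_policy: str
    previous_write_policy: str


def bbu_notifications(s: BbuStatus, cycle_days: int = 90, margin_days: int = 7) -> list:
    """Return the notifications that the four BBU checks would raise."""
    notes = []
    if s.days_since_last_cycle >= cycle_days - margin_days:   # FIG. 3: cycle check
        notes.append("BBU cycle is near 90 days")
    if s.charge_efficiency_pct <= 40:                         # FIG. 4: charge capacity
        notes.append("BBU charging efficiency is 40% or less")
    if s.remaining_pct <= 10:                                 # FIG. 5: remaining capacity
        notes.append("BBU remaining capacity is 10% or less")
    if s.write_policy != s.previous_write_policy:             # FIG. 6: write policy
        notes.append(
            f"write policy changed: {s.previous_write_policy} -> {s.write_policy}"
        )
    return notes


degraded = BbuStatus(85, 38.0, 9.0, "WriteThrough", "WriteBack")
healthy = BbuStatus(10, 95.0, 80.0, "WriteBack", "WriteBack")
```

Calling `bbu_notifications(degraded)` raises all four notifications, and `bbu_notifications(healthy)` returns an empty list.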

FIGS. 7 to 13 are views showing functions of a server integrated management system according to an embodiment of the present invention.

FIG. 7 is an example of a screen for displaying various OS information such as Windows, Linux, and VMware.

Referring to FIG. 7, the physical system, OS, and software information of the managed server can be retrieved at a time.

FIG. 8 is an example of a screen for allowing a software status and a specific software version to be viewed for the entire managed server.

Referring to FIG. 8, in the server integrated management system of the present invention, the list of software installed on each system across all managed servers can be queried from the integrated system, without accessing the individual systems of the managed servers.

FIG. 9 is an example of a screen for identifying a job history of a specific equipment through a condition search, and it is possible to quickly grasp job history information of a specific equipment because it supports condition search through accumulated data.

FIG. 10 illustrates an example of a prediction model in which the patterns of similar failures are analyzed so that they can be prevented and handled.

Referring to FIG. 10, similar failures can be prevented by selecting the risk group for a specific failure pattern.

The server integrated management system of the present invention can identify failure information by searching on date conditions such as the failure occurrence date, the work date and time, and the completion date.

FIG. 11 is an example of a screen in which failures are searched by month using a date condition.

FIG. 12 is an example of a screen for analyzing a failure pattern to prevent similar failures.

FIG. 12 shows a case in which the search conditions are that the model includes PE 6850, the BIOS does not include A06, and the OS type includes Windows 2003; for matching equipment, an M/B fatal error may occur on reboot.
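The search condition shown for FIG. 12 can be sketched as a predicate over inventory records. The record keys (`model`, `bios`, `os`) and the sample hosts are hypothetical; the three conditions are those given in the text.

```python
def matches_risk_pattern(server: dict) -> bool:
    """Model includes PE 6850, BIOS does not include A06, OS includes Windows 2003."""
    return (
        "PE 6850" in server["model"]
        and "A06" not in server["bios"]
        and "Windows 2003" in server["os"]
    )


# Hypothetical inventory records.
inventory = [
    {"host": "db-01", "model": "Dell PE 6850", "bios": "A04", "os": "Windows 2003 R2"},
    {"host": "db-02", "model": "Dell PE 6850", "bios": "A06", "os": "Windows 2003 R2"},
    {"host": "web-01", "model": "Dell R720", "bios": "2.5.2", "os": "Windows 2008"},
]
risk_group = [s["host"] for s in inventory if matches_risk_pattern(s)]
```

Only `db-01` lands in the risk group here: `db-02` already has BIOS A06 and `web-01` is a different model.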

FIGS. 13 to 16 are exemplary report screens of the server integrated management system according to an embodiment of the present invention.

FIG. 13 shows an example of the risk group management screen of the preventive check report.

As shown in FIG. 13, the risk group management screen presents a graph and a chart at the top of the screen so that the contents can be grasped easily, and lists the risk group name, description, target equipment, normal count, abnormal count, and other items in a table so that the details of each risk group can be identified easily.

FIG. 14 shows an example of the job management screen of the preventive check report, displayed as a chart together with a table with the items fault name, job classification, group, model, operator, and status.

FIG. 15 shows an example of the inventory management screen of the preventive check report, displayed as a chart together with a table with the items host name, change number, model, change date and time, and status.

FIG. 16 shows an example of the system management screen of the preventive check report, displayed as a chart together with a table with the items template, total number, model, person in charge, and registration date.

According to the present invention, when an event occurs, the system diagnoses that a failure may occur in the server through the event, warns the system of the server in advance, and transmits information about the solution. In this regard, there are a variety of events occurring in the server, and new events may occur that have not occurred before. Hereinafter, some events among the events that may occur in the server will be exemplified in the present invention.

1. Fan noise on Dell R720 servers after applying the latest iDRAC7 version, 1.51.51 (readings over 12,000 RPM).

The recommended solution is to downgrade to iDRAC7 version 1.46.45.

2. Power usage across rack PDU #1 and PDU #2 is skewed toward PDU #1.

Referring to FIG. 17, on both Dell and HP servers the power supplies operate in active-standby mode by default, so the load on the rack PDUs becomes unbalanced. To balance the load, the ratio of primary PSUs must be adjusted.

3. Operating system error after a kernel update on a Dell R620 server.

In this case, when the management server 110 detects an abnormal operation of the operating system (OS) after a kernel update on the Dell server, it transmits a predicted failure occurrence message to the corresponding managed server, together with a solution to the predicted failure.

4. Service becomes unavailable due to a shortage of TCP/IP ports.

On Windows 2008, when uptime exceeds 497 days, network sessions in the TIME_WAIT state are not closed. The problem arises when the ports they occupy are exhausted and no more ports are available. Windows 2008 and Windows 2012 servers are affected, and the failure can be resolved by removing the updated patches.
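An uptime-based check for this event could look like the sketch below. The 30-day warning margin and the function name are assumptions. The arithmetic comment reflects one possible reading, not stated in the source: 497 days is roughly the time a 32-bit counter of 10 ms ticks takes to wrap around.

```python
SECONDS_PER_DAY = 86_400
TICK_SECONDS = 0.01  # assumed 10 ms tick

# A 32-bit tick counter would wrap after roughly 497.1 days.
WRAP_DAYS = (2 ** 32) * TICK_SECONDS / SECONDS_PER_DAY


def uptime_at_risk(uptime_days: float, limit_days: float = 497,
                   warn_days: float = 30) -> bool:
    """True when uptime is within warn_days of the 497-day limit, or beyond it."""
    return uptime_days >= limit_days - warn_days
```

With these defaults a server is flagged from day 467 onward, which leaves time to act before TIME_WAIT sessions begin to accumulate.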

5. Event log generation on Windows 2003 and 2008.

6. Memory production cycle diagnosis.

The failure targets are the R730, R930, and R630, and the affected OS is Windows 2012 R2 Server with hotfix KB3064209 installed. The workaround is to remove the hotfix.

In the present invention, the management server 110 diagnoses the memory production cycle of the managed server, judges memory from a predetermined production cycle to be defective, and notifies the managed server.

7. When a PCIe-type SSD is used, device setup stops responding.

The workaround is to update the BIOS from version 1.1.4 to 1.2.10.
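Deciding whether a server still needs this BIOS update requires comparing dotted version strings numerically rather than as text. A minimal sketch, with hypothetical function names:

```python
def parse_version(version: str) -> tuple:
    """Turn '1.2.10' into (1, 2, 10) so that fields compare as numbers."""
    return tuple(int(part) for part in version.split("."))


def needs_bios_update(installed: str, fixed: str = "1.2.10") -> bool:
    """True when the installed BIOS is older than the fixed version."""
    return parse_version(installed) < parse_version(fixed)
```

Comparing tuples matters here: as plain strings, "1.2.10" sorts before "1.2.9", which would wrongly treat version 1.2.9 as newer than the fix.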

8. After an update on a 12G server, alerts continue to occur due to a temperature sensor error.

The solution is to check for BIOS version 2.5.2 and update to the latest firmware.

9. A BSOD occurs at boot after a patch update.

This is caused by Windows update KB2982791 from the August 2014 Patch Tuesday release.

The failure target is Windows 2008 servers, and the failure can be resolved through a patch update.

10. DNS connection error on clients using Windows 2012 Active Directory.

When logging in to a domain account on the server, the error "Username or password is incorrect" occurs even though the account and password are correct.

From Windows Server 2008 R2 onward, DES-CBC-MD5 and DES-CBC-CRC encryption are not used; only AES256-CTS-HMAC-SHA1-96, AES128-CTS-HMAC-SHA1-96, and RC4-HMAC encryption are used. If the AD server is Windows Server 2012 R2 and the domain member is Windows Server 2008 R2 or Windows 7, the phenomenon is caused by a product issue in which AES key generation fails when the password for the computer account is updated.

11. Vulnerabilities in the GNU Bash 4.3 Shell.

Attackers exploiting the Bash vulnerability are known to be able to alter content and code on web servers, tamper with websites, leak user data, and launch DDoS attacks. In addition, attack scenarios that exploit the Bash code injection vulnerability in various environments, such as over the SSH and DHCP protocols, have been raised.

Red Hat Enterprise Linux 5, 6, and 7 servers are affected, and the solution is to update Bash.

12. Buffer overflow vulnerability in the GNU C library (glibc).

The vulnerable function is called through gethostbyname() and gethostbyname2(), which are frequently used when connecting to a network. A remote attacker can execute arbitrary code on a vulnerable server.

Red Hat Enterprise Linux 5, 6, and 7 servers are affected, and the solution is to update glibc.

13. Bugs in Red Hat V5 and V6 series operating systems.

All versions of Red Hat Enterprise Linux 5 and 6 running on Intel CPUs reboot after 208.5 days of uptime.

Red Hat Enterprise Linux 5 and 6 servers are affected, and the solution is a kernel update.
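As a rough sketch of how the 208.5-day limit could be turned into a predictive check, the snippet below lists affected servers whose uptime is approaching the limit so that a kernel update can be scheduled. The note that 208.5 days is about 2**54 nanoseconds is an assumption about the underlying cause, not something stated in the patent, and the field names and the 14-day lead time are hypothetical.

```python
SECONDS_PER_DAY = 86_400

# Assumption: the limit corresponds to a nanosecond counter reaching 2**54.
OVERFLOW_DAYS = 2**54 / 1e9 / SECONDS_PER_DAY

LIMIT_DAYS = 208.5            # uptime limit named in the text
AFFECTED_MAJOR = (5, 6)       # Red Hat Enterprise Linux 5 and 6, per the text

def days_until_reboot_risk(uptime_days: float) -> float:
    """Days left before the uptime limit is reached (never negative)."""
    return max(0.0, LIMIT_DAYS - uptime_days)

def needs_kernel_update_soon(servers, lead_days: float = 14.0):
    """Names of affected servers whose uptime is within lead_days of the limit."""
    return [
        s["name"]
        for s in servers
        if s["rhel_major"] in AFFECTED_MAJOR
        and days_until_reboot_risk(s["uptime_days"]) <= lead_days
    ]

# Hypothetical inventory.
inventory = [
    {"name": "db01", "rhel_major": 6, "uptime_days": 200.0},
    {"name": "web01", "rhel_major": 6, "uptime_days": 30.0},
    {"name": "app01", "rhel_major": 7, "uptime_days": 205.0},
]
print(round(OVERFLOW_DAYS, 1))
print(needs_kernel_update_soon(inventory))
```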

14. RAID controller battery failure.

I/O performance is degraded because the RAID controller cache becomes unavailable. The RAID controller batteries of the Dell PERC 5i and 6i are affected, and the remedy is to replace them in advance every 4 to 5 years.

15. System down due to a CPU IERR error.

Servers using Intel iBridge V2 CPUs (PE R720, PE R920) are affected, and the remedy is to change the BIOS settings.

For example, in System Profile Settings, set System Profile to Custom, CPU Power Management to Maximum Performance, C1E to Disabled, C States to Disabled, and Monitor/Mwait to Disabled.
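The recommended settings above lend themselves to a simple compliance check. The sketch below compares a server's current BIOS settings with the values listed in the text; how the current settings are read from the server is not covered by the patent, so the input dictionary and the function name are hypothetical.

```python
# Recommended values as listed in the text above.
RECOMMENDED_BIOS = {
    "System Profile": "Custom",
    "CPU Power Management": "Maximum Performance",
    "C1E": "Disabled",
    "C States": "Disabled",
    "Monitor/Mwait": "Disabled",
}

def bios_deviations(current: dict) -> dict:
    """Map each non-compliant setting to a (current value, recommended value) pair."""
    return {
        key: (current.get(key), wanted)
        for key, wanted in RECOMMENDED_BIOS.items()
        if current.get(key) != wanted
    }

# Hypothetical settings read from a PE R720.
current_settings = {
    "System Profile": "Custom",
    "CPU Power Management": "Maximum Performance",
    "C1E": "Disabled",
    "C States": "Enabled",
    "Monitor/Mwait": "Disabled",
}
print(bios_deviations(current_settings))
```

An empty result means the server already matches the recommended profile, so no message would need to be sent.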

16. Use of iDRAC firmware (F/W) 1.50.50 (search for servers running this version).

Upgrade to iDRAC firmware (F/W) 1.51.51.

1) F/W upgrade from the OS

2) Upgrade through Lifecycle Controller media
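Finding servers that still run an affected firmware version requires comparing dotted version strings numerically rather than as text. The following sketch shows one way to do that; the function names are hypothetical, and the only values taken from the text are the versions 1.50.50 and 1.51.51.

```python
def parse_version(text: str) -> tuple:
    """Turn a dotted version such as '1.50.50' into a tuple of integers."""
    return tuple(int(part) for part in text.split("."))

def needs_upgrade(installed: str, fixed: str = "1.51.51") -> bool:
    """True when the installed firmware is older than the fixed version."""
    return parse_version(installed) < parse_version(fixed)

print(needs_upgrade("1.50.50"))
print(needs_upgrade("1.51.51"))
# Numeric comparison matters: as plain strings, "1.2.10" would sort before "1.2.9".
print(parse_version("1.2.10") > parse_version("1.2.9"))
```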

FIG. 18 is a flowchart illustrating a server integrated management method according to an embodiment of the present invention.

Referring to FIG. 18, a server integrated management method in a server integrated management system that integrates and manages two or more managed servers is as follows.

First, hardware information and software information are collected from two or more managed servers, and the status of each server is acquired and managed (S210).

Then, the failure pattern of the managed server is analyzed (S220).

If the analysis of the failure pattern shows that a predetermined event has occurred (S230), a predicted failure occurrence message describing that a failure may occur according to the event is transmitted to the corresponding managed server (S240).

At the same time, a solution to the expected failure is transmitted to the corresponding managed server (S250).

In the present invention, the server integrated management system may also transmit the predicted failure occurrence message, together with a solution to the expected failure, to the registered manager in charge of the corresponding managed server through a short message service (SMS) and e-mail.
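Steps S210 to S250 can be sketched as a small rule-driven loop. This is an illustrative outline under stated assumptions, not the patented implementation: the `Rule` class, the status field names, and the message format are hypothetical, and actual delivery by SMS or e-mail is left out.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    event: str                       # predetermined event
    expected_failure: str            # failure the event is expected to cause
    solution: str                    # solution sent along with the message
    matches: Callable[[dict], bool]  # test applied to a server's collected status

# Two rules paraphrased from the description; the matching logic is assumed.
RULES = [
    Rule("TIME_WAIT sessions not closing",
         "service interruption due to a lack of ports",
         "remove the updated patch",
         lambda s: s.get("os") in {"Windows 2008", "Windows 2012"}
         and s.get("uptime_days", 0) >= 497),
    Rule("RAID controller battery failure",
         "I/O performance degradation",
         "replace the RAID controller battery",
         lambda s: s.get("raid_battery") == "failed"),
]

def collect(servers):
    """S210: gather the status of every managed server, keyed by name."""
    return {s["name"]: s for s in servers}

def analyze(status):
    """S220/S230: return the rules whose predetermined event has occurred."""
    return [rule for rule in RULES if rule.matches(status)]

def build_message(name, rule):
    """S240/S250: predicted failure occurrence message plus its solution."""
    return (f"[{name}] event: {rule.event}; expected failure: "
            f"{rule.expected_failure}; solution: {rule.solution}")

def run(servers):
    messages = []
    for name, status in collect(servers).items():
        for rule in analyze(status):
            messages.append(build_message(name, rule))
    return messages

managed = [
    {"name": "web01", "os": "Windows 2008", "uptime_days": 500, "raid_battery": "ok"},
    {"name": "db01", "os": "Red Hat Enterprise Linux 6", "uptime_days": 40,
     "raid_battery": "failed"},
]
for line in run(managed):
    print(line)
```

Keeping the rules in a table separate from the loop means a newly analyzed failure pattern can be added without touching the collection or notification steps.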

The server integrated management system checks the BBU (Backup Battery Unit) cycle of the managed server and notifies the managed server when a predetermined period has elapsed.

Further, the server integrated management system checks the BBU charge capacity of the managed server and notifies the managed server when the charging efficiency of the battery falls to a predetermined value or less. For example, the server integrated management system checks the BBU charge capacity of the managed server and notifies the managed server when the charging efficiency of the battery falls to 40% or less.

The server integrated management system checks the remaining BBU capacity of the managed server and notifies the managed server when the remaining battery capacity is at or below a predetermined value. For example, the server integrated management system checks the remaining BBU capacity of the managed server and notifies the managed server when the remaining battery capacity is 10% or less.

The server integrated management system checks the BBU write policy of the managed server and notifies the managed server when the write policy has changed.
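The four BBU checks above can be combined into one function, sketched below. The 40% and 10% thresholds come from the text; the 4-year replacement period is an assumed default based on the 4 to 5 year figure given for the PERC batteries, and the field names are hypothetical.

```python
def bbu_alerts(bbu: dict, previous_write_policy: str,
               max_age_years: float = 4.0,
               min_efficiency_pct: float = 40.0,
               min_remaining_pct: float = 10.0) -> list:
    """Return the notifications that the four BBU checks would raise."""
    alerts = []
    if bbu["age_years"] >= max_age_years:
        alerts.append("BBU replacement period has elapsed")
    if bbu["charge_efficiency_pct"] <= min_efficiency_pct:
        alerts.append("BBU charging efficiency is at or below the threshold")
    if bbu["remaining_pct"] <= min_remaining_pct:
        alerts.append("BBU remaining capacity is at or below the threshold")
    if bbu["write_policy"] != previous_write_policy:
        alerts.append(
            f"write policy changed from {previous_write_policy} to {bbu['write_policy']}"
        )
    return alerts

# Hypothetical reading from a managed server.
reading = {
    "age_years": 4.5,
    "charge_efficiency_pct": 38.0,
    "remaining_pct": 55.0,
    "write_policy": "WriteThrough",
}
for alert in bbu_alerts(reading, previous_write_policy="WriteBack"):
    print(alert)
```

A change of write policy is worth reporting on its own because a controller that falls back from write-back to write-through caching typically loses the performance benefit of its cache.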

In one embodiment of the present invention, a Dell server is included among the managed servers. When the server integrated management system detects an abnormal operation of the operating system (OS) after a kernel update on the Dell server, it may transmit to the corresponding managed server a predicted failure occurrence message describing the failure that may result, together with a solution for the expected failure.

The server integrated management system diagnoses the memory production cycle of the managed server, determines that memory from a predetermined production cycle is defective, and informs the managed server of this.
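One way to picture the memory production cycle diagnosis is a lookup of each module's production cycle against a list of cycles known to be defective. The patent does not say how production cycles are encoded or where the defective list comes from, so the year-week labels, the field names, and the sample data below are all hypothetical.

```python
# Hypothetical production cycles (year-week) flagged as defective.
DEFECTIVE_CYCLES = {"2014-W32", "2014-W33"}

def defective_dimms(dimms, defective=DEFECTIVE_CYCLES):
    """Return the slots of memory modules produced in a flagged cycle."""
    return [d["slot"] for d in dimms if d["production_cycle"] in defective]

# Hypothetical hardware information collected from a managed server.
installed = [
    {"slot": "A1", "production_cycle": "2014-W32"},
    {"slot": "A2", "production_cycle": "2015-W02"},
    {"slot": "B1", "production_cycle": "2014-W33"},
]
print(defective_dimms(installed))
```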

While the present invention has been described with reference to several preferred embodiments, these embodiments are illustrative and not restrictive. It will be understood by those skilled in the art that various changes and modifications may be made therein without departing from the spirit of the invention and the scope of the appended claims.

110: management server
120: database
130: manager terminal
10: managed server
11: information collecting unit
12: information processing unit
13: information transmitting unit
14: command receiving unit
15: command executing unit
111: information receiving unit
112: information analysis unit
113: information storage unit
114: command receiving unit
115: command transmitting unit
131: command transmitting unit

Claims (20)

A server integrated management system for integrally managing two or more managed servers, comprising:
A management server that collects hardware information and software information from the two or more managed servers to identify and manage the status of each server;
A database that stores the collected hardware information and software information and provides the stored information to the management server; and
A manager terminal that communicates with the management server, displays the status of the managed servers on a screen, and transmits commands input by the manager to the management server,
Wherein, in order to prevent similar failures by analyzing the failure pattern of the managed servers, the management server, when a predetermined event occurs in a managed server, transmits to that managed server a predicted failure occurrence message describing that a failure may occur according to the event, together with a solution to the expected failure,
Wherein, if the predetermined event is a phenomenon in which network TIME_WAIT sessions remain unclosed when the uptime is equal to or greater than a predetermined number of days on a specific OS version, the expected failure is a service interruption due to a lack of ports, and the solution to the expected failure is to remove an updated patch of the specific OS version,
Wherein, if the predetermined event is a RAID controller battery failure, the expected failure is a degradation of I/O performance due to the RAID controller cache being unusable, and the solution to the expected failure is to replace the RAID controller battery, and
Wherein, if the predetermined event is a phenomenon in which power is concentrated on one of a plurality of rack PDUs in a managed server whose power supplies are set by default to operate in active-standby mode, the expected failure is a collapse of the balance of power usage, and the solution to the expected failure is to adjust the ratio of PSUs designated as primary.
The system according to claim 1,
Wherein the management server transmits the predicted failure occurrence message, together with a solution to the expected failure, to the manager in charge of the corresponding managed server registered in the database through a short message service (SMS) and e-mail.
The system according to claim 1,
Wherein the management server checks the backup battery unit (BBU) cycle of the managed server and notifies the managed server when a predetermined period has elapsed.
The system according to claim 1,
Wherein the management server checks the BBU charge capacity of the managed server and notifies the managed server when the charging efficiency of the battery falls to a predetermined value or less.
The system according to claim 4,
Wherein the management server checks the BBU charge capacity of the managed server and notifies the managed server when the charging efficiency of the battery falls to 40% or less.
The system according to claim 1,
Wherein the management server checks the remaining BBU capacity of the managed server and notifies the managed server when the remaining battery capacity is at or below a predetermined value.
The system according to claim 6,
Wherein the management server checks the remaining BBU capacity of the managed server and notifies the managed server when the remaining battery capacity is 10% or less.
The system according to claim 1,
Wherein the management server checks the BBU write policy of the managed server and notifies the managed server when the write policy has changed.
The system according to claim 1,
Wherein, when the management server detects an abnormal operation of the operating system (OS) after a kernel update on the managed server, the management server transmits to the corresponding managed server a predicted failure occurrence message describing the failure that may result, together with a solution to the expected failure.
The system according to claim 1,
Wherein the management server diagnoses the memory production cycle of the managed server, determines that memory from a predetermined production cycle is defective, and informs the managed server of this.
A server integrated management method in a server integrated management system for integrally managing two or more managed servers, the method comprising:
Collecting, by the server integrated management system, hardware information and software information from the two or more managed servers to identify and manage the status of each server;
Analyzing a failure pattern of the managed servers; and
When the analysis of the failure pattern shows that a predetermined event has occurred, transmitting to the corresponding managed server a predicted failure occurrence message describing that a failure may occur according to the event, together with a solution to the expected failure,
Wherein, if the predetermined event is a phenomenon in which network TIME_WAIT sessions remain unclosed when the uptime is equal to or greater than a predetermined number of days on a specific OS version, the expected failure is a service interruption due to a lack of ports, and the solution to the expected failure is to remove an updated patch of the specific OS version,
Wherein, if the predetermined event is a RAID controller battery failure, the expected failure is a degradation of I/O performance due to the RAID controller cache being unusable, and the solution to the expected failure is to replace the RAID controller battery, and
Wherein, if the predetermined event is a phenomenon in which power is concentrated on one of a plurality of rack PDUs in a managed server whose power supplies are set by default to operate in active-standby mode, the expected failure is a collapse of the balance of power usage, and the solution to the expected failure is to adjust the ratio of PSUs designated as primary.
The method of claim 11,
Wherein the server integrated management system transmits the predicted failure occurrence message, together with a solution to the expected failure, to the registered manager in charge of the corresponding managed server through a short message service (SMS) and e-mail.
The method of claim 11,
Wherein the server integrated management system checks the backup battery unit (BBU) cycle of the managed server and notifies the managed server when a predetermined period has elapsed.
The method of claim 11,
Wherein the server integrated management system checks the BBU charge capacity of the managed server and notifies the managed server when the charging efficiency of the battery falls to a predetermined value or less.
The method of claim 14,
Wherein the server integrated management system checks the BBU charge capacity of the managed server and notifies the managed server when the charging efficiency of the battery falls to 40% or less.
The method of claim 11,
Wherein the server integrated management system checks the remaining BBU capacity of the managed server and notifies the managed server when the remaining battery capacity is at or below a predetermined value.
The method of claim 16,
Wherein the server integrated management system checks the remaining BBU capacity of the managed server and notifies the managed server when the remaining battery capacity is 10% or less.
The method of claim 11,
Wherein the server integrated management system checks the BBU write policy of the managed server and notifies the managed server when the write policy has changed.
The method of claim 11,
Wherein, when the server integrated management system detects an abnormal operation of the operating system (OS) after a kernel update on the managed server, the server integrated management system transmits to the corresponding managed server a predicted failure occurrence message describing the failure that may result, together with a solution to the expected failure.
The method of claim 11,
Wherein the server integrated management system diagnoses the memory production cycle of the managed server, determines that memory from a predetermined production cycle is defective, and informs the managed server of this.



KR1020150178246A 2015-12-14 2015-12-14 System and method for managing servers totally KR101783201B1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
KR1020150178246A KR101783201B1 (en) 2015-12-14 2015-12-14 System and method for managing servers totally

Publications (2)

Publication Number Publication Date
KR20170070568A KR20170070568A (en) 2017-06-22
KR101783201B1 true KR101783201B1 (en) 2017-10-13

Family

ID=59282914

Country Status (1)

Country Link
KR (1) KR101783201B1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102176028B1 (en) 2020-08-24 2020-11-09 (주)에오스와이텍 System for Real-time integrated monitoring and method thereof

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102139058B1 (en) * 2019-05-10 2020-07-29 (주)비앤에스컴 Cloud computing system for zero client device using cloud server having device for managing server and local server

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2010526352A (en) 2006-11-16 2010-07-29 サムスン エスディーエス カンパニー リミテッド Performance fault management system and method using statistical analysis
US20150095718A1 (en) 2013-09-30 2015-04-02 Fujitsu Limited Locational Prediction of Failures

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Watanabe and 4 others. 'Online failure prediction in cloud datacenters by real-time message pattern learning'. IEEE 4th International Conference on Cloud Computing Technology and Science, 2012, pp.504-511.

Also Published As

Publication number Publication date
KR20170070568A (en) 2017-06-22

Legal Events

Date Code Title Description
A201 Request for examination
E902 Notification of reason for refusal
E701 Decision to grant or registration of patent right