CN110502390B - Automatic operation and maintenance management system of colleges and universities cloud computing center - Google Patents

Info

Publication number
CN110502390B
Authority
CN
China
Prior art keywords
maintenance
information
cloud computing
training
abnormal event
Legal status
Active
Application number
CN201910611693.0A
Other languages
Chinese (zh)
Other versions
CN110502390A (en)
Inventor
宋焘
Current Assignee
China University of Geosciences
Original Assignee
China University of Geosciences
Application filed by China University of Geosciences
Priority to CN201910611693.0A
Publication of CN110502390A
Application granted
Publication of CN110502390B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/30: Monitoring
    • G06F11/3055: Monitoring arrangements for monitoring the status of the computing system or of the computing system component, e.g. monitoring if the computing system is on, off, available, not available

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Testing And Monitoring For Control Systems (AREA)

Abstract

The invention provides an automatic operation and maintenance management system for a college or university cloud computing center. The system comprises a monitoring system, an operation and maintenance system and a management system. The monitoring system monitors state information of the cloud computing center and determines whether an abnormal event has occurred; when the monitoring system detects an abnormal event, the operation and maintenance system analyzes the event data and matches the corresponding problem information from a problem library; the management system generates a corresponding operation and maintenance task from the problem information, synchronizes the task information to one or more user terminals associated with the task, and updates the task information according to feedback from those terminals. The invention discovers problems in the cloud computing center promptly, allocates operation and maintenance roles at a fine granularity, makes operation and maintenance tasks traceable, and thereby improves the operation and maintenance management of the cloud computing center.

Description

Automatic operation and maintenance management system of colleges and universities cloud computing center
Technical Field
The invention relates to the technical field of operation and maintenance of cloud computing centers, in particular to an automatic operation and maintenance management system of a cloud computing center in colleges and universities.
Background
With the rapid development of educational informatization, operation and maintenance management has become one of the main information-technology tasks of colleges and universities. New technologies such as virtualization and cloud computing are changing how campus information systems are built: traditional computing platforms are being replaced by cloud computing platforms, which improves the operating efficiency of data centers. These benefits also bring new problems, because the number, scale and complexity of the objects under operation and maintenance management increase greatly. In the traditional "whoever builds it maintains it" mode, operation and maintenance is carried out solely by the responsible personnel, and outsiders can neither see, supervise, nor give feedback on the process; this mode clearly cannot meet current service, technical and management requirements.
Disclosure of Invention
To address these problems, the invention aims to provide an automatic operation and maintenance management system for a college or university cloud computing center.
The purpose of the invention is realized by adopting the following technical scheme:
an automated operation and maintenance management system of a college cloud computing center, comprising: a monitoring system, an operation and maintenance system and a management system, wherein,
the monitoring system is used for monitoring the state information of the cloud computing center and judging whether the cloud computing center has an abnormal event or not;
the operation and maintenance system is used for analyzing the data of the abnormal event when the monitoring system detects the abnormal event and matching corresponding problem information from the problem library;
and the management system is used for generating a corresponding operation and maintenance task according to the problem information, synchronizing the operation and maintenance task information to one or more user terminals corresponding to the operation and maintenance task, and updating the operation and maintenance task information according to the feedback information of the user terminals.
In one embodiment, the operation and maintenance system further processes operation and maintenance tasks automatically, specifically by:
calling, according to the problem information in the operation and maintenance task, the corresponding operation and maintenance script from the knowledge base and executing it to handle the abnormal event.
In one embodiment, the state information includes computing resource information, network resource information, storage resource information, and the like of the cloud computing center;
in one embodiment, the operation and maintenance task information includes: abnormal event information, operation and maintenance task types, related personnel information, operation and maintenance task processing information and the like;
the abnormal event information comprises problem information of the abnormal event, the occurrence time of the abnormal event, the range influenced by the abnormal event and the like;
the operation and maintenance task types comprise automatic maintenance, manual maintenance, daily maintenance, fault treatment and the like;
the related personnel information comprises user information corresponding to the influence range of the abnormal event, agent maintenance personnel information of the influence range, operation and maintenance personnel information, administrator information and the like;
the operation and maintenance task processing information comprises a processing flow, a processing result, a processing log and the like.
In one embodiment, the system further comprises a database module, which comprises an operation and maintenance task database, a history database, a problem library and a knowledge base;
the operation and maintenance task database is used for storing the operation and maintenance task data generated by the management system;
the history database is used for recording historical state information of the cloud computing center and the processing procedures and results of abnormal events, and for synchronizing into the knowledge base the operation and maintenance scripts generated from the operations recorded during processing;
the problem library is used for storing common problems in the operation and maintenance process and abnormal event data characteristics corresponding to the problems;
and the knowledge base is used for storing the operation and maintenance script corresponding to the problem.
In one embodiment, the management system further comprises a work order module,
and the work order module is used for receiving an operation and maintenance feedback work order submitted by a user and generating a corresponding operation and maintenance task according to the work order information, wherein the work order information includes the problem information reported by the user.
In one embodiment, the operation and maintenance system comprises a problem library entry module and a knowledge base entry module,
the problem library entry module is used by operation and maintenance personnel to enter problems, together with the abnormal-event data characteristics corresponding to each problem, into the problem library;
and the knowledge base entry module is used by operation and maintenance personnel to enter operation and maintenance scripts, together with the problem information each script resolves, into the knowledge base.
In one embodiment, the monitoring system comprises an anomaly monitoring module,
the anomaly monitoring module is used for monitoring the state information of the cloud computing center and determining whether the state information is abnormal, which specifically comprises:
acquiring state information of a cloud computing center within a period of time;
and inputting the state information in a period of time into the trained anomaly detection model, and acquiring the running state detection result output by the model.
The beneficial effects of the invention are as follows: a monitoring system is provided in the cloud computing center to monitor its state information, detect abnormal events and discover problems promptly; an operation and maintenance system analyzes the abnormal events and matches the corresponding problem information from a problem library; and a management system generates the corresponding operation and maintenance tasks, turning them automatically into work orders, allocating operation and maintenance roles at a fine granularity and making the tasks traceable. Together these establish a complete operation and maintenance system for the cloud computing center and improve its operation and maintenance management.
In addition, the operation and maintenance system can process operation and maintenance tasks automatically, calling the processing script that corresponds to each problem to handle the abnormal event. This achieves automatic monitoring and automatic operation and maintenance of the cloud computing center and effectively reduces the workload of operation and maintenance personnel.
Drawings
The invention is further illustrated by means of the attached drawings, but the embodiments in the drawings do not constitute any limitation to the invention, and for a person skilled in the art, other drawings can be obtained on the basis of the following drawings without inventive effort.
Fig. 1 is a block diagram of the framework of the present invention.
Reference numerals:
the monitoring system 1, the alarm module 11, the anomaly monitoring module 12, the anomaly detection model training unit 121, the operation and maintenance system 2, the automatic operation and maintenance module 21, the problem library entry module 22, the knowledge base entry module 23, the management system 3, the task generation module 31, the work order module 32, the database module 4, the operation and maintenance task database 41, the history database 42, the problem library 43, and the knowledge base 44
Detailed Description
The invention is further described in connection with the following application scenarios.
Referring to Fig. 1, an automated operation and maintenance management system for a college cloud computing center is shown, including: a monitoring system 1, an operation and maintenance system 2 and a management system 3, wherein,
the monitoring system 1 is used for monitoring the state information of the cloud computing center and judging whether the cloud computing center has an abnormal event or not;
the operation and maintenance system 2 is used for analyzing the data of the abnormal event when the monitoring system 1 detects one, and matching the corresponding problem information from the problem library 43;
and the management system 3 is used for generating a corresponding operation and maintenance task according to the problem information, synchronizing the operation and maintenance task information to one or more user terminals corresponding to the operation and maintenance task, and updating the operation and maintenance task information according to the feedback information of the user terminals.
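The text does not say how the monitoring system 1 decides that a state is abnormal or how the operation and maintenance system 2 matches an event against the problem library 43. The following is a minimal sketch, assuming a fixed per-metric threshold for detection and nearest-neighbour matching on a characteristic state vector; the metric limits, the library entries and all function names are hypothetical. In practice the features would be normalized first, since IO latency in milliseconds would otherwise dominate the distance.

```python
from math import dist
from typing import Optional, Sequence

# Hypothetical problem library: problem information -> characteristic state
# vector (cpu_util, mem_util, io_latency_ms). Entries are illustrative only.
PROBLEM_LIBRARY = {
    "CPU saturation": (0.98, 0.60, 10.0),
    "disk contention": (0.40, 0.50, 120.0),
}

def is_abnormal(state: Sequence[float], limits=(0.90, 0.90, 50.0)) -> bool:
    """Stand-in for monitoring system 1: any metric above its (assumed) limit."""
    return any(value > limit for value, limit in zip(state, limits))

def match_problem(state: Sequence[float], library, max_distance: float = 30.0) -> Optional[str]:
    """Stand-in for O&M system 2: nearest library entry by Euclidean distance,
    or None when nothing is close enough (the event then needs manual labeling)."""
    name, features = min(library.items(), key=lambda kv: dist(state, kv[1]))
    return name if dist(state, features) <= max_distance else None

def handle(state: Sequence[float], library) -> Optional[str]:
    """Return the problem information that management system 3 would turn into a task."""
    if not is_abnormal(state):
        return None
    return match_problem(state, library) or "unmatched abnormal event"
```

For example, `handle((0.97, 0.55, 12.0), PROBLEM_LIBRARY)` flags the high CPU utilization and matches it to the "CPU saturation" entry, while a normal state returns `None` and no task is created.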
In this embodiment of the invention, the monitoring system 1 deployed in the cloud computing center monitors its state information, detects abnormal events and discovers problems promptly; the operation and maintenance system 2 analyzes the abnormal events and matches their problem information from the problem library 43; and the management system 3 generates the corresponding operation and maintenance tasks, turning them automatically into work orders, allocating operation and maintenance roles at a fine granularity and making the tasks traceable. This establishes a complete operation and maintenance system for the cloud computing center and improves its operation and maintenance management.
In one embodiment, the state information includes computing resource information, network resource information, storage resource information, and the like of the cloud computing center.
In one scenario, the state information monitored by the monitoring system 1 mainly includes CPU utilization, memory utilization and IO latency.
In one embodiment, the operation and maintenance task information includes: abnormal event information, operation and maintenance task types, related personnel information, operation and maintenance task processing information and the like;
the abnormal event information comprises problem information of the abnormal event, the occurrence time of the abnormal event, the range influenced by the abnormal event and the like;
the operation and maintenance task types comprise automatic maintenance, manual maintenance, daily maintenance and the like;
the related personnel information comprises user information corresponding to the influence range of the abnormal event, agent maintenance personnel information of the influence range, operation and maintenance personnel information, administrator information and the like;
the operation and maintenance task processing information comprises a processing flow, a processing result, a processing log and the like.
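The task information listed above maps naturally onto a small record type. The sketch below is one possible shape, assuming Python dataclasses; the class and field names are hypothetical, and "fault handling" is taken from the task types listed in the summary of the invention.

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import List, Optional

class TaskType(Enum):
    AUTOMATIC = "automatic maintenance"
    MANUAL = "manual maintenance"
    ROUTINE = "daily maintenance"
    FAULT = "fault handling"

@dataclass
class AbnormalEvent:
    problem: str            # problem information matched from the problem library
    occurred_at: datetime   # occurrence time of the abnormal event
    scope: List[str]        # range affected by the event (hosts, services, departments)

@dataclass
class RelatedPersonnel:
    users: List[str] = field(default_factory=list)      # users in the affected range
    agents: List[str] = field(default_factory=list)     # agent maintenance personnel
    operators: List[str] = field(default_factory=list)  # O&M personnel
    admins: List[str] = field(default_factory=list)     # administrators

    def everyone(self) -> List[str]:
        """All people whose terminals should receive the task."""
        return self.admins + self.users + self.agents + self.operators

@dataclass
class ProcessingInfo:
    flow: List[str] = field(default_factory=list)  # processing flow (steps taken)
    result: Optional[str] = None                   # processing result, once known
    log: List[str] = field(default_factory=list)   # processing log

@dataclass
class OMTaskInfo:
    event: AbnormalEvent
    task_type: TaskType
    personnel: RelatedPersonnel
    processing: ProcessingInfo = field(default_factory=ProcessingInfo)
```

`field(default_factory=...)` gives every task its own empty lists, so appending to one task's processing log never affects another task.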
In one scenario, once an abnormal event occurs and an operation and maintenance task is generated, the task information is sent simultaneously to the terminals of the administrator, the affected users, the agent maintenance personnel and the operation and maintenance personnel, so that everyone involved in the task is notified immediately, which improves operation and maintenance efficiency. The personnel involved can then each act on the task, so that several roles work on it in parallel: for example, a user supplements the problem description, agent maintenance personnel carry out simple operations directly, operation and maintenance personnel handle complex tasks, and the administrator sets the operation and maintenance permissions, progress and time limits. This further improves the efficiency and quality of task processing.
After the operation and maintenance task is completed, the operation and maintenance personnel or the agent maintenance personnel upload the processing result to the management system, which updates the task status. Managers can thus track the progress of operation and maintenance tasks as needed, compile workload statistics for operation and maintenance personnel, and adjust allocations flexibly.
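The fan-out to several roles and the feedback-driven status update described above can be sketched with an in-memory task board. This is an illustration under assumptions: the `TaskBoard` class, the per-person inbox standing in for user terminals, and the rule that only assigned people may give feedback are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class Task:
    task_id: int
    description: str
    roles: Dict[str, str]  # role -> person, e.g. {"admin": "alice", "operator": "dave"}
    status: str = "open"
    history: List[Tuple[str, str]] = field(default_factory=list)  # (person, note)

class TaskBoard:
    """Hypothetical in-memory stand-in for management system 3."""

    def __init__(self) -> None:
        self.tasks: Dict[int, Task] = {}
        self.inbox: Dict[str, List[int]] = {}   # person -> task ids (their terminal)
        self.completed_by: Counter = Counter()  # workload statistics per person

    def publish(self, task: Task) -> None:
        """Synchronize a new task to every related person's terminal at once."""
        self.tasks[task.task_id] = task
        for person in task.roles.values():
            self.inbox.setdefault(person, []).append(task.task_id)

    def feedback(self, task_id: int, person: str, note: str, done: bool = False) -> None:
        """Record feedback from a terminal and update the task state."""
        task = self.tasks[task_id]
        if person not in task.roles.values():
            raise PermissionError(f"{person} is not assigned to task {task_id}")
        task.history.append((person, note))
        if done:
            task.status = "done"
            self.completed_by[person] += 1
```

A user can add a problem description and an operator can close the task through the same `feedback` call; the `completed_by` counter is what a manager would read for workload statistics.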
In the traditional "whoever builds it maintains it" mode, other users and managers cannot see what happens during the operation and maintenance process. The invention instead defines operation and maintenance roles at a fine granularity according to actual conditions and adds the different roles to an operation and maintenance task at the same time to process and manage it, so that the users and managers concerned can follow the progress and status of the work in real time. This makes operation and maintenance information transparent and improves the quality of operation and maintenance management.
In one embodiment, the operation and maintenance system 2 further includes an automatic operation and maintenance module 21, which processes operation and maintenance tasks automatically, specifically:
calling, according to the problem information in the operation and maintenance task, the corresponding operation and maintenance script from the knowledge base 44 and executing it to handle the abnormal event.
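A minimal sketch of the automatic operation and maintenance module 21, assuming the knowledge base 44 is a mapping from problem information to a script. To keep the example self-contained, scripts are modeled as Python callables; real deployments might store shell or configuration-management scripts instead. The script names, the task dictionary layout and the history list are hypothetical.

```python
from typing import Callable, Dict, List

Script = Callable[[dict], str]  # takes task context, returns a processing result

def restart_service(ctx: dict) -> str:
    return f"restarted {ctx['service']} on {ctx['host']}"

def clean_temp_files(ctx: dict) -> str:
    return f"removed temporary files on {ctx['host']}"

# Hypothetical knowledge base: problem information -> O&M script.
KNOWLEDGE_BASE: Dict[str, Script] = {
    "service unresponsive": restart_service,
    "disk nearly full": clean_temp_files,
}

def auto_process(task: dict, kb: Dict[str, Script], history: List[dict]) -> bool:
    """Run the script matching the task's problem information.

    Returns True if the task was handled automatically; False means it stays
    with O&M personnel for manual processing. Either way the outcome is
    appended to `history` (a stand-in for history database 42)."""
    script = kb.get(task["problem"])
    if script is None:
        history.append({"task": task["id"], "result": "no script; manual processing"})
        return False
    history.append({"task": task["id"], "result": script(task["context"])})
    return True
```

Falling back to manual processing when no script exists matches the embodiment in which operation and maintenance personnel handle an event by hand and the history database records it.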
According to the embodiment of the invention, the operation and maintenance system 2 can also automatically process the operation and maintenance tasks, call the corresponding processing scripts to process the abnormal events according to different problems, realize automatic monitoring and automatic operation and maintenance of the cloud computing center, and effectively reduce the workload of the operation and maintenance personnel.
In one embodiment, the system further comprises a database module 4, which comprises an operation and maintenance task database 41, a history database 42, a problem library 43 and a knowledge base 44;
the operation and maintenance task database 41 is used for storing the operation and maintenance task data generated by the management system 3;
the history database 42 is used for recording historical state information of the cloud computing center and the processing procedures and results of abnormal events, and for synchronizing into the knowledge base 44 the operation and maintenance scripts generated from the operations recorded during processing;
the problem library 43 is used for storing common problems in the operation and maintenance process and abnormal event data characteristics corresponding to the problems;
and the knowledge base 44 is used for storing the operation and maintenance script corresponding to the problem.
In one embodiment, when an abnormal event is detected, the operation and maintenance personnel can also process the abnormal event through manual operation, and the processing procedure and the processing result information of the manual operation are recorded in the historical database 42.
In one scenario, the operation and maintenance task database 41 stores both completed (historical) operation and maintenance tasks and ongoing ones, which the system and users can retrieve when needed.
In the above embodiment of the present invention, the operation and maintenance task database 41 manages all generated operation and maintenance task data in one place for users to retrieve when needed, and the history database 42 stores the historical state information of the cloud computing center together with the processing procedures and results of abnormal events, so that the operation and maintenance process is recorded proactively and operation and maintenance information becomes traceable. Meanwhile, the problem library 43 and the knowledge base 44 organize the common problems encountered in operation and maintenance and their solutions, and managing operation and maintenance scripts as resources in the knowledge base 44 allows repetitive operations to run automatically, which effectively improves operation and maintenance efficiency and reduces the workload of operation and maintenance personnel.
In one embodiment, the management system 3 further comprises a task generation module 31,
the task generation module 31 is configured to generate a corresponding operation and maintenance task according to the problem information.
In one embodiment, the management system 3 further comprises a work order module 32,
and the work order module 32 is configured to receive an operation and maintenance feedback work order submitted by a user and generate a corresponding operation and maintenance task according to the work order information, where the work order information includes the problem information reported by the user.
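The two entry points above, tasks from matched problem information and tasks from user work orders, can share one task format and one id sequence. A minimal sketch under that assumption; the class names, the `source` field and the way the work order detail is appended to the problem text are hypothetical.

```python
import itertools
from dataclasses import dataclass

_task_ids = itertools.count(1)  # shared id sequence for both task sources

@dataclass
class WorkOrder:
    submitter: str    # user who filed the feedback work order
    problem: str      # problem information reported by the user
    detail: str = ""  # optional free-text description

@dataclass
class OMTask:
    task_id: int
    problem: str
    source: str       # "monitoring" or "work order"
    reporter: str

def task_from_event(problem: str) -> OMTask:
    """Task generation module 31: task from problem information matched by monitoring."""
    return OMTask(next(_task_ids), problem, "monitoring", "system")

def task_from_work_order(order: WorkOrder) -> OMTask:
    """Work order module 32: task from a user's feedback work order."""
    problem = f"{order.problem}: {order.detail}" if order.detail else order.problem
    return OMTask(next(_task_ids), problem, "work order", order.submitter)
```

Keeping the reporter on the task lets the management system route feedback and results back to the user who filed the work order.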
In one embodiment, the operation and maintenance system 2 includes a problem library entry module 22 and a knowledge base entry module 23,
the problem library entry module 22 is used by operation and maintenance personnel to enter problems, together with the abnormal-event data characteristics corresponding to each problem, into the problem library 43;
and the knowledge base entry module 23 is used by operation and maintenance personnel to enter operation and maintenance scripts, together with the problem information each script resolves, into the knowledge base 44.
In traditional operation and maintenance, many of the problems encountered are repetitive, require little technical skill, or could even be resolved without operation and maintenance personnel. The problem library 43 collects and organizes such problems continuously, and it keeps growing as abnormal events are discovered automatically and labeled with problems, laying the foundation for automatic operation and maintenance. At the same time, processing scripts developed from past operation and maintenance work and solved problems keep extending the knowledge base 44, so the degree of automation can be raised continuously.
In one embodiment, operation and maintenance personnel can also label (calibrate) abnormal event information from the history database 42 with the corresponding problem, and the labeled abnormal event information and problem information are synchronized into the problem library 43.
Similarly, operation and maintenance personnel can label an abnormal event handling process from the history database 42; the labeled handling process is turned into an operation and maintenance script and synchronized into the knowledge base 44 together with the labeled problem information.
In the above embodiment of the present invention, when operation and maintenance personnel handle an abnormal event manually or for the first time, the history database 42 records the process. After the work is completed, the personnel can manually label the recorded operations and the resolved abnormal event, helping the operation and maintenance system 2 capture those operations, generate the corresponding operation and maintenance script and store it in the knowledge base 44, so that the operation and maintenance system 2 can handle the same abnormal event automatically the next time it occurs. This builds up the operation and maintenance knowledge effectively and relieves operation and maintenance personnel of repetitive work.
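The labeling step can be sketched as a single function that writes to both the problem library and the knowledge base. This assumes a history record holds the event's feature vector and the list of commands run while it was handled, and that a script is simply those commands replayed in order; the record shape, the function name and the example commands are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class HistoryRecord:
    """A manually handled abnormal event as kept in history database 42 (hypothetical shape)."""
    event_features: Tuple[float, ...]  # data characteristics of the abnormal event
    commands: List[str]                # operations recorded while it was handled
    problem: str = ""                  # empty until O&M personnel label it

def label_record(record: HistoryRecord, problem: str,
                 problem_library: Dict[str, Tuple[float, ...]],
                 knowledge_base: Dict[str, str]) -> str:
    """Label a history record, then sync it into the problem library and knowledge base."""
    record.problem = problem
    problem_library[problem] = record.event_features
    # Turn the recorded operations into a replayable shell script; `set -e`
    # stops the script at the first failing command.
    script = "#!/bin/sh\nset -e\n" + "\n".join(record.commands) + "\n"
    knowledge_base[problem] = script
    return script
```

A real implementation would need review of the generated script before it runs unattended; the sketch only shows the data flow from history to the two libraries.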
In one embodiment, the monitoring system 1 comprises an alarm module 11,
and the alarm module 11 is configured to send a corresponding alarm message to the user terminal when the abnormal event is detected.
In one embodiment, according to the area where the abnormal event occurs, the abnormal event information is sent to the management terminal corresponding to the area, and the corresponding operation and maintenance personnel can process the abnormal event.
According to the above embodiment of the invention, after the abnormal event is detected, the specified operation and maintenance personnel can be notified to process the abnormal event according to the area where the abnormal event occurs and the preset rules (such as the operation and maintenance role, the operation and maintenance responsibility, the operation and maintenance requirement and the like), so that the operation and maintenance role and the operation and maintenance task can be clearly allocated.
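Area-based notification can be expressed as a lookup in a rule table. A minimal sketch, assuming rules keyed by area that list operators and administrators, a hypothetical severity level that decides whether administrators are also notified, and an on-duty fallback for areas without rules; none of these specifics come from the patent.

```python
from typing import Dict, List, Sequence

# Hypothetical routing rules: area -> role -> contacts.
ROUTING_RULES: Dict[str, Dict[str, List[str]]] = {
    "campus-east": {"operators": ["op-east"], "admins": ["admin-east"]},
    "library-dc": {"operators": ["op-lib-1", "op-lib-2"], "admins": ["admin-lib"]},
}

def route_event(area: str, severity: str,
                rules: Dict[str, Dict[str, List[str]]],
                fallback: Sequence[str] = ("duty-admin",)) -> List[str]:
    """Return the contacts to notify for an abnormal event in `area`."""
    entry = rules.get(area)
    if entry is None:
        return list(fallback)  # unknown area: notify the on-duty administrator
    recipients = list(entry.get("operators", []))
    if severity == "high":     # hypothetical escalation rule
        recipients += entry.get("admins", [])
    return recipients
```

Keeping the rules as data rather than code lets administrators change responsibilities per area without redeploying the monitoring system.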
In one embodiment, the monitoring system 1 comprises an anomaly monitoring module 12,
the anomaly monitoring module 12 is configured to monitor state information of the cloud computing center, and determine whether the state information is anomalous, specifically including:
acquiring state information of a cloud computing center within a period of time;
and inputting the state information in a period of time into the trained anomaly detection model, and acquiring the operation state detection result output by the model.
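The "state information within a period of time" can be collected with a bounded sliding window. A minimal sketch, assuming the trained model is available as a per-sample classifier and that the window is reported abnormal when at least a set fraction of its samples are flagged; the window length, the fraction and the class name are hypothetical.

```python
from collections import deque
from typing import Callable, Deque, Optional, Tuple

Sample = Tuple[float, float, float]  # (cpu_util, mem_util, io_latency_ms)

class WindowMonitor:
    """Feed the most recent `window` samples to a detector (hypothetical helper)."""

    def __init__(self, detector: Callable[[Sample], bool],
                 window: int = 5, min_fraction: float = 0.6) -> None:
        self.detector = detector
        self.samples: Deque[Sample] = deque(maxlen=window)  # oldest sample drops out
        self.min_fraction = min_fraction

    def push(self, sample: Sample) -> Optional[bool]:
        """Add a sample; return None until the window is full, then True if abnormal."""
        self.samples.append(sample)
        if len(self.samples) < self.samples.maxlen:
            return None  # not enough history yet
        flagged = sum(self.detector(s) for s in self.samples)
        return flagged / len(self.samples) >= self.min_fraction
```

Requiring several flagged samples in the window, rather than one, is a design choice that suppresses alarms on single spikes at the cost of a short detection delay.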
In this embodiment of the invention, the state information of the cloud computing center is monitored with an anomaly detection model trained by machine learning, so the running state of the cloud computing center can be detected intelligently and anomalies found promptly, improving the efficiency and effectiveness of operation and maintenance.
In one embodiment, the anomaly monitoring module 12 further includes an anomaly detection model training unit 121, which trains the anomaly detection model on historical state information; the anomaly detection model is built on a self-organizing map (SOM) network, and the anomaly detection model training unit 121 includes:
preparation layer:
acquiring historical state information of the cloud computing center over a period of time as training samples, and taking each historical state information vector $g(t) = [g_1(t), g_2(t), \ldots, g_m(t)]$ as an input vector, where m denotes the dimension of the input vector and t denotes its time index; initializing the input vectors and the neuron weight vectors $\delta_{xy}(0)$, where $x, y = 1, 2, \ldots, S$ give the position of a neuron in the SOM network, each weight vector $\delta_{xy}(0)$ has the same dimension as the input vector $g(t)$, and S denotes the size of the SOM neural network;
matching layer:
inputting an input vector from the training samples into the SOM network and obtaining the position of the neuron that matches it: the Euclidean distance between the input vector and each neuron weight vector is used as the criterion, and the neuron with the smallest Euclidean distance, $\Phi = \arg\min_{(x, y)} \lVert g(t) - \delta_{xy}(t-1) \rVert$, is selected as the matching neuron;
training layer:
taking the neuron matched with the input vector g(t) of the current time index as the training region center Φ, and obtaining the update control factor of each neuron in the training region, which consists of the training region center and its neighborhood:
[Formula image not reproduced in the source text: update control factor of neuron (x, y)]
where [symbol not reproduced] denotes the update control factor of neuron (x, y), $I_\Phi$ denotes the coordinate vector of the training region center Φ in the SOM network, $I(x, y)$ denotes the coordinate vector of neuron node (x, y) in the SOM network, λ(t) denotes the update adjustment factor, d(t) denotes the set neighborhood width, and $S_e$ denotes the size of the SOM network; the training region consists of the neurons whose distance from the training region center is less than the set neighborhood width;
updating each neuron in the training region according to the input vector, using the following update function:
[Formula image not reproduced in the source text: weight update of neuron (x, y)]
where $\delta_{xy}(t)$ denotes the weight vector of neuron (x, y) at the current time index t, $\delta_{xy}(t-1)$ denotes the weight vector of neuron (x, y) at the previous time index t - 1, g(t) denotes the input vector at time index t, and [symbol not reproduced] denotes the update control factor of neuron (x, y);
after training for the current time index t is finished, training for the next time index t + 1 begins, and the training layer is repeated until the SOM network converges or the maximum time index is exceeded.
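The three layers above can be sketched as one training loop. Because the patent's formulas for the update control factor, λ(t) and d(t) are not reproduced in the text, this sketch swaps in common textbook choices and should not be read as the patented functions: a Gaussian neighborhood factor $\lambda(t)\exp(-r^2 / 2d(t)^2)$ on grid distance r, the standard update $\delta_{xy}(t) = \delta_{xy}(t-1) + h\,(g(t) - \delta_{xy}(t-1))$, and exponential decay for both λ(t) and d(t). Grid indices are 0-based, inputs are assumed scaled to [0, 1], and `train_som` and its parameters are hypothetical.

```python
import math
import random
from math import dist

def train_som(samples, size, max_epochs=50, lam0=0.5, tol=1e-4, seed=0):
    """Train a size x size SOM on `samples` (lists of floats scaled to [0, 1])."""
    rng = random.Random(seed)
    dim = len(samples[0])
    # Preparation layer: random weight vectors with the same dimension as g(t).
    w = [[[rng.random() for _ in range(dim)] for _ in range(size)] for _ in range(size)]
    cells = [(x, y) for x in range(size) for y in range(size)]
    d0 = size / 2                  # assumed initial neighborhood width
    T = max_epochs * len(samples)  # assumed maximum time index
    t = 0
    for _ in range(max_epochs):
        max_shift = 0.0
        for g in samples:
            t += 1
            # Assumed schedules (the patent's custom functions are not reproduced).
            lam = lam0 * math.exp(-t / T)
            d = max(d0 * math.exp(-t / T), 0.5)  # floor keeps the center in the region
            # Matching layer: neuron with the smallest Euclidean distance to g(t).
            phi = min(cells, key=lambda c: dist(g, w[c[0]][c[1]]))
            # Training layer: update neurons closer than d(t) to the center on the grid.
            for x, y in cells:
                r = dist(phi, (x, y))
                if r >= d:
                    continue
                h = lam * math.exp(-(r * r) / (2 * d * d))  # assumed Gaussian factor
                for i in range(dim):
                    step = h * (g[i] - w[x][y][i])
                    w[x][y][i] += step
                    max_shift = max(max_shift, abs(step))
        if max_shift < tol:  # converged: no weight moved noticeably in a full pass
            break
    return w
```

Since every update is a convex step toward an input in [0, 1], the weights stay in [0, 1]; training on a single repeated sample drives the closest neuron onto that sample.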
In one embodiment, the training sample further includes running state detection information corresponding to the historical state information.
In one scenario, if the running state detection information indicates normal or abnormal, the anomaly detection model outputs normal or abnormal;
in another scenario, the running state detection information includes normal or the problem information corresponding to different abnormal events, and the anomaly detection model outputs normal or the corresponding type of problem information.
In one scenario, the input vector consists of the state information at the same time.
In one scenario, the SOM neural network has size S × S, i.e. it contains S × S neuron nodes in total.
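The text does not say how the trained SOM produces a "normal" or problem-type output. One common approach, assumed here, is to label each neuron by majority vote of the labeled training samples it wins and then classify a new state vector by its nearest labeled neuron. The function names and the example weights and labels are hypothetical.

```python
from collections import Counter, defaultdict
from math import dist
from typing import Dict, List, Sequence, Tuple

Cell = Tuple[int, int]
Weights = List[List[List[float]]]  # weights[x][y] is the weight vector of neuron (x, y)

def bmu(g: Sequence[float], weights: Weights) -> Cell:
    """Grid position of the neuron closest to g (Euclidean distance)."""
    cells = [(x, y) for x in range(len(weights)) for y in range(len(weights[0]))]
    return min(cells, key=lambda c: dist(g, weights[c[0]][c[1]]))

def label_neurons(weights: Weights, samples, labels) -> Dict[Cell, str]:
    """Give each winning neuron the most frequent label among the samples it wins."""
    votes: Dict[Cell, Counter] = defaultdict(Counter)
    for g, label in zip(samples, labels):
        votes[bmu(g, weights)][label] += 1
    return {cell: counts.most_common(1)[0][0] for cell, counts in votes.items()}

def classify(g: Sequence[float], weights: Weights, neuron_labels: Dict[Cell, str]) -> str:
    """Label of the nearest labeled neuron; neurons that won no sample are skipped."""
    cell = min(neuron_labels, key=lambda c: dist(g, weights[c[0]][c[1]]))
    return neuron_labels[cell]
```

With binary labels this gives the normal/abnormal output of the first scenario; with problem names as labels it gives the problem-type output of the second.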
In this embodiment of the invention, training the anomaly detection model in this way allows the historical running state information of the cloud computing center to serve as training samples, which improves the adaptability and accuracy of model training; building the model on an SOM network also suits the very large data volumes of a cloud computing center and improves the reliability of abnormal event detection.
In one embodiment, in the anomaly detection model training unit 121, the preset neighborhood width d(t) is obtained from the following custom function:
Figure BDA0002122579880000081
where d(t) denotes the neighborhood width at time index t, L denotes the size of the SOM network, b denotes a preset control factor, λ(1) denotes the update adjustment factor at time index t = 1, λ(t) denotes the update adjustment factor at time index t, the symbol shown in Figure BDA0002122579880000082 denotes the preset neighborhood width adjustment factor, and t = 1, 2, …, T is the time index of the input vector, where T is the total number of input vectors in the training sample.
According to this embodiment of the invention, as more input vectors are processed, training of the anomaly detection model converges further and the neuron nodes of the SOM network become clearly differentiated. Reducing the neighborhood width appropriately as the number of iterations increases therefore improves the accuracy and stability of model training.
In one embodiment, in the anomaly detection model training unit 121, the update adjustment factor λ(t) is obtained from the following custom function:
Figure BDA0002122579880000083
where λ(t) denotes the update adjustment factor at time index t.
According to the above embodiment of the invention, the SOM network becomes more stable as training of the anomaly detection model converges further. Adjusting the update adjustment factor in step with this process, as described above, prevents the anomaly detection model from being "over-trained" and improves the stability of its training.
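The custom functions for d(t) and λ(t), like the update control factor in claim 1, are published only as images and cannot be reproduced here. The sketch below only illustrates the behavior described in the two embodiments above: a training region that narrows and an adjustment factor that shrinks as t grows. To do so it substitutes conventional exponential decay and a Gaussian neighborhood. These forms, and the parameters `lam0` and `d_min`, are assumptions and are not the patent's formulas.

```python
import math
from typing import Dict, Tuple

def update_adjustment(t: int, T: int, lam0: float = 0.5) -> float:
    """STAND-IN for lambda(t): decays exponentially as t approaches T."""
    return lam0 * math.exp(-t / T)

def neighborhood_width(t: int, T: int, S: int, d_min: float = 1.0) -> float:
    """STAND-IN for d(t): starts near S/2 and shrinks with t, never below d_min."""
    return max(d_min, (S / 2) * math.exp(-t / T))

def update_control_factors(S: int, center: Tuple[int, int], t: int, T: int) -> Dict[Tuple[int, int], float]:
    """STAND-IN update control factor for every neuron (x, y).
    Neurons whose grid distance from the training region center is >= d(t) lie
    outside the training region and get 0; inside it, a Gaussian weighting scaled by lambda(t)."""
    lam, d = update_adjustment(t, T), neighborhood_width(t, T, S)
    factors = {}
    for x in range(1, S + 1):
        for y in range(1, S + 1):
            r = math.dist((x, y), center)
            factors[(x, y)] = lam * math.exp(-(r * r) / (2 * d * d)) if r < d else 0.0
    return factors
```

Late in training, fewer neurons receive a nonzero factor and each receives a smaller one. That is the qualitative effect the two embodiments attribute to the patent's own functions.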
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit its protection scope. Although the present invention has been described in detail with reference to preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions may be made to the technical solutions of the present invention without departing from their spirit and scope.

Claims (7)

1. An automated operation and maintenance management system for a university cloud computing center, characterized by comprising a monitoring system, an operation and maintenance system and a management system, wherein
the monitoring system is used for monitoring the state information of the cloud computing center and judging whether an abnormal event has occurred in it;
the operation and maintenance system is used for analyzing the data of the abnormal event when the monitoring system detects one, and for matching the corresponding problem information from a problem library;
the management system is used for generating a corresponding operation and maintenance task according to the problem information, synchronizing the operation and maintenance task information to one or more user terminals corresponding to the operation and maintenance task, and updating the operation and maintenance task information according to feedback information of the user terminals;
the monitoring system comprises an anomaly monitoring module, wherein
the anomaly monitoring module is used for monitoring the state information of the cloud computing center and judging whether the state information is abnormal, which specifically comprises:
acquiring the state information of the cloud computing center over a period of time;
inputting the state information for that period into a trained anomaly detection model, and acquiring the running state detection result output by the model;
the anomaly monitoring module further comprises an anomaly detection model training unit, which trains the anomaly detection model on historical state information, the anomaly detection model being built on an SOM network; the anomaly detection model training unit comprises:
preparation layer:
acquiring historical state information of the cloud computing center over a period of time as training samples, and taking the historical state information vector g(t) = [g1(t), g2(t), …, gm(t)] as the input vector, where m denotes the dimension of the input vector and t denotes the time index of the input vector; initializing the input vectors and the neuron weight vectors δxy(0), where x, y = 1, 2, …, S denote the position of a neuron in the SOM network, each neuron weight vector δxy(0) has the same dimension as the input vector g(t), and S denotes the size of the SOM network;
matching layer:
inputting an input vector of the training samples into the SOM network and acquiring the position of the neuron that matches it, using the Euclidean distance between the input vector and each neuron weight vector as the criterion and selecting the neuron with the smallest Euclidean distance as the neuron matching the input vector;
training layer:
taking the neuron matched with the input vector g(t) at the current time index as the training region center Φ, and acquiring the update control factor of each neuron in the training region formed by the training region center and its neighborhood:
Figure FDA0002914094080000021
where the symbol shown in Figure FDA0002914094080000022 denotes the update control factor of neuron (x, y), IΦ denotes the coordinate vector of the training region center Φ in the SOM network, I(x, y) denotes the coordinate vector of neuron node (x, y) in the SOM network, λ(t) denotes the update adjustment factor, d(t) denotes the preset neighborhood width, the training region being the region whose distance from the training region center is less than the preset neighborhood width, and S denotes the size of the SOM network;
updating each neuron in the training region according to the input vector, using the following update function:
Figure FDA0002914094080000023
where δxy(t) denotes the weight vector of neuron (x, y) at the current time index t, δxy(t-1) denotes the weight vector of neuron (x, y) at the previous time index t-1, g(t) denotes the input vector at time index t, and the symbol shown in Figure FDA0002914094080000024 denotes the update control factor of neuron (x, y);
after training at the current time index t is complete, training proceeds to the next time index t+1, and the training layer is repeated until the SOM network converges or the maximum time index is exceeded;
in the anomaly detection model training unit, the preset neighborhood width d(t) is obtained from the following custom function:
Figure FDA0002914094080000025
where d(t) denotes the neighborhood width at time index t, S denotes the size of the SOM network, b denotes a preset control factor, λ(1) denotes the update adjustment factor at time index t = 1, λ(t) denotes the update adjustment factor at time index t, the symbol shown in Figure FDA0002914094080000026 denotes the preset neighborhood width adjustment factor, and t = 1, 2, …, T is the time index of the input vector, where T is the total number of input vectors in the training sample.
2. The automated operation and maintenance management system according to claim 1, further comprising:
automatically processing the operation and maintenance task, which specifically comprises:
retrieving from a knowledge base, and executing, the operation and maintenance script corresponding to the problem information in the operation and maintenance task, so as to handle the abnormal event.
3. The automated operation and maintenance management system according to claim 1, wherein the state information comprises computing resource information, network resource information and storage resource information of the cloud computing center; and/or
the operation and maintenance task information comprises: abnormal event information, operation and maintenance task types, related personnel information and operation and maintenance task processing information;
the abnormal event information comprises the problem information of the abnormal event, the time at which the abnormal event occurred and the range it affects;
the operation and maintenance task types comprise automatic maintenance and manual maintenance;
the related personnel information comprises information on the users within the range affected by the abnormal event, information on the agent maintenance personnel for that range, operation and maintenance personnel information, and administrator information;
the operation and maintenance task processing information comprises a processing flow, a processing result and a processing log.
4. The automated operation and maintenance management system according to claim 2, further comprising a database module, the database module comprising an operation and maintenance task database, a history database, the problem library and the knowledge base;
the operation and maintenance task database is used for storing the operation and maintenance task data generated by the management system;
the historical database is used for recording historical state information of the cloud computing center, the processing process and the processing result information of the abnormal events, and is also used for synchronizing operation and maintenance scripts generated by operation records in the processing process into the knowledge base;
the problem library is used for storing common problems in the operation and maintenance process and abnormal event data characteristics corresponding to the problems;
and the knowledge base is used for storing the operation and maintenance script corresponding to the problem.
5. The automated operation and maintenance management system according to claim 1, wherein the management system further comprises a work order module,
the work order module is used for receiving an operation and maintenance feedback work order sent by a user and generating a corresponding operation and maintenance task according to the work order information, wherein the work order information comprises problem information fed back by the user.
6. The automated operation and maintenance management system according to claim 2, wherein the operation and maintenance system comprises a problem library input module and a knowledge base input module,
the problem library input module is used by operation and maintenance personnel to enter problems, together with the abnormal event data characteristics corresponding to them, into the problem library;
and the knowledge base input module is used by operation and maintenance personnel to enter operation and maintenance scripts, together with the problem information each script resolves, into the knowledge base.
7. The automated operation and maintenance management system according to claim 1, wherein the monitoring system comprises an alarm module,
and the alarm module is used for sending a corresponding alarm message to the user terminal when the abnormal event is detected.
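The claims describe the operation and maintenance flow only at system level. The sketch below is one possible reading of claims 1 to 3. It matches an abnormal event's data characteristics against a problem library and creates a task record holding part of the task information listed in claim 3. It then runs a script from the knowledge base when one exists (automatic maintenance), and otherwise leaves the task for manual maintenance. The overlap-count matching rule, the class and field names, and the example problems are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, FrozenSet, List, Optional, Set

@dataclass(frozen=True)
class Problem:
    name: str
    features: FrozenSet[str]  # abnormal event data characteristics stored in the problem library

@dataclass
class OMTask:
    """Hypothetical subset of the operation and maintenance task information in claim 3."""
    problem: str          # problem information of the abnormal event
    occurred_at: str      # occurrence time of the abnormal event
    affected_range: str   # range affected by the abnormal event
    task_type: str        # "automatic" or "manual" maintenance
    log: List[str] = field(default_factory=list)  # processing log

def match_problem(event_features: Set[str], library: List[Problem]) -> Optional[Problem]:
    """Assumed rule: the problem sharing the most characteristics with the event wins;
    None if no problem shares any characteristic."""
    best, best_score = None, 0
    for p in library:
        score = len(p.features & event_features)
        if score > best_score:
            best, best_score = p, score
    return best

def create_and_process(event_features: Set[str], occurred_at: str, affected_range: str,
                       library: List[Problem],
                       knowledge_base: Dict[str, Callable[[], str]]) -> OMTask:
    """Create a task for the event and run the matching knowledge-base script, if any."""
    problem = match_problem(event_features, library)
    name = problem.name if problem else "unclassified"
    script = knowledge_base.get(name)
    task = OMTask(name, occurred_at, affected_range, "automatic" if script else "manual")
    if script:
        task.log.append(f"ran script for {name}: {script()}")
    else:
        task.log.append("no matching script; assigned for manual maintenance")
    return task
```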
CN201910611693.0A 2019-07-08 2019-07-08 Automatic operation and maintenance management system of colleges and universities cloud computing center Active CN110502390B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910611693.0A CN110502390B (en) 2019-07-08 2019-07-08 Automatic operation and maintenance management system of colleges and universities cloud computing center

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910611693.0A CN110502390B (en) 2019-07-08 2019-07-08 Automatic operation and maintenance management system of colleges and universities cloud computing center

Publications (2)

Publication Number Publication Date
CN110502390A CN110502390A (en) 2019-11-26
CN110502390B true CN110502390B (en) 2021-06-01

Family

ID=68586159

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910611693.0A Active CN110502390B (en) 2019-07-08 2019-07-08 Automatic operation and maintenance management system of colleges and universities cloud computing center

Country Status (1)

Country Link
CN (1) CN110502390B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111815273A (en) * 2020-07-03 2020-10-23 远光软件股份有限公司 Configuration method of document approval process, storage medium and electronic equipment
CN111934934B (en) * 2020-08-17 2023-06-16 浪潮通信信息系统有限公司 Method for realizing remote automatic operation and maintenance through short message transmission
CN113194297B (en) * 2021-04-30 2023-05-23 重庆市科学技术研究院 Intelligent monitoring system and method
CN113269522B (en) * 2021-05-19 2021-11-30 江苏星月测绘科技股份有限公司 Building intelligent management method and system based on BIM technology
CN114064400B (en) * 2021-11-01 2023-03-24 江苏新希望科技有限公司 IT equipment operation and maintenance perception monitoring system

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5712896A (en) * 1995-08-05 1998-01-27 Electronics And Telecommunications Research Institute Method for diagnosing a fault of digital exchanger
CN102195813A (en) * 2011-05-04 2011-09-21 成都勤智数码科技有限公司 Method and device for intelligently creating operation and maintenance worksheet
CN106921526A (en) * 2017-04-13 2017-07-04 湖南森纳信息科技有限公司 Intelligent campus network O&M system
CN107070720A (en) * 2017-04-26 2017-08-18 深圳市神云科技有限公司 The monitoring of cloud platform anomalous event and the method automatically processed and framework
CN107862393A (en) * 2017-10-31 2018-03-30 广西宜州市联森网络科技有限公司 A kind of IT operation management system
CN108062586A (en) * 2017-11-30 2018-05-22 中国船舶工业系统工程研究院 Marine main engine associated member state monitoring method and system based on decline contribution degree

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
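"New Algorithms for Self-Organizing Neural Networks and Their Applications" (自组织神经网络的新算法以及应用); Xia Wenwen (夏文文); China Master's Theses Full-text Database (Information Science and Technology); 20090315; page I140-75 *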

Also Published As

Publication number Publication date
CN110502390A (en) 2019-11-26

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant