CN107992392B

CN107992392B - Automatic monitoring and repairing system and method for cloud rendering system

Info

Publication number: CN107992392B
Application number: CN201711165385.7A
Authority: CN
Inventors: 都政; 秦莉兰; 井革新; 陈远磊; 陈聪梅; 刘昭; 靳绍巍
Original assignee: NATIONAL SUPERCOMPUTING CENTER IN SHENZHEN (SHENZHEN CLOUD COMPUTING CENTER)
Current assignee: NATIONAL SUPERCOMPUTING CENTER IN SHENZHEN (SHENZHEN CLOUD COMPUTING CENTER)
Priority date: 2017-11-21
Filing date: 2017-11-21
Publication date: 2021-03-23
Anticipated expiration: 2037-11-21
Also published as: CN107992392A

Abstract

The invention provides an automatic monitoring and repairing system for a cloud rendering system, wherein a user client is used by a producer to set task parameters and upload a required rendering task to a main transfer server; the main transfer server is used for verifying the account registration information of the uploaded rendering task and distributing the rendering task to the matched secondary transfer server; the secondary transfer server is used for distributing the rendering tasks to the matched rendering servers for rendering according to the running dynamic data and sending the running dynamic data to the management server and the main transfer server; the rendering server is used for executing rendering tasks; the management server is used for automatically detecting and repairing the rendering server according to the running dynamic data; and the management client is used for correcting the abnormal information in the management server. The invention can automatically monitor each rendering server, lets an administrator manage the rendering farm servers with less manual effort, improves management efficiency and optimizes the use of the rendering farm.

Description

Automatic monitoring and repairing system and method for cloud rendering system

Technical Field

The invention relates to the technical field of automatic monitoring and repair, in particular to an automatic monitoring and repair system and method for a cloud rendering system.

Background

Computer animation is one of the fastest-developing technical fields in the world. To obtain high-quality computer animation, the scene must be rendered after operations such as animation modeling and motion design are completed. High-quality rendering requires a large amount of material and consumes substantial CPU resources; for example, a single high-resolution picture usually takes 10 hours to render. Therefore, cloud rendering systems (also called cloud rendering platforms or rendering farms) are now generally adopted for producing large-scale animations, special-effects movies and the like.

The cloud rendering system is the most advanced rendering solution based on cloud computing services, and it overcomes the shortcomings of traditional rendering. With a cloud rendering system, a user can invoke thousands of cloud servers for parallel rendering within a few seconds. A rendering platform can consist of hundreds of rendering servers; for so many servers, management software can allocate and optimize resources across the whole network, manage the jobs submitted to the system, and implement cross-platform, multi-engine, multi-task large-scale rendering. Server maintenance, however, is still manual: the state of each server is checked by hand at irregular intervals, or a server is inspected only after a task has run abnormally. Consequently, two important questions must be addressed in managing a cloud rendering system:

1. whether the system can automatically monitor the running dynamic data of the rendering servers and promptly repair abnormal conditions or report them to an administrator;

2. whether the state of each rendering server is monitored in real time and its running condition analyzed, so that the cloud rendering system can be optimized in time (for example, whether a server needs to be replaced or whether a local hard disk needs to be added).

At present, these two problems are handled as follows. First, the running condition of the rendering servers is checked and tallied manually; more often, the faulty server is located and repaired by hand only after a task has run abnormally, so labor costs are high and rendering servers cannot be repaired in time. Second, server problems are diagnosed from experience or from the number of times a server has failed, with no monitoring records as evidence, so the cloud rendering system cannot be optimized in time.

Disclosure of Invention

To address the shortcomings of the existing handling mechanisms, the invention provides an automatic monitoring and repairing system and method for a cloud rendering system.

In one aspect, an embodiment of the present invention provides an automatic monitoring and repairing system for a cloud rendering system, including a user client, a management client, a main transfer server, a management server, a secondary transfer server, and a rendering server, wherein,

the user client is used for a producer to set task parameters and upload a required rendering task to the main transfer server;

the main transfer server is used for verifying the account registration information of the uploaded rendering task, automatically generating a task number after the verification is passed, distributing the rendering task to a matched secondary transfer server, and generating a rendering task distribution log;

the secondary transfer server is used for receiving the running dynamic data of the rendering server, distributing the rendering tasks to the matched rendering servers for rendering according to the running dynamic data, generating a secondary rendering task distribution log, and sending the running dynamic data to the management server and the main transfer server;

the rendering server is used for executing the rendering task and sending an execution result to the user client through the corresponding secondary transfer server and the main transfer server after the rendering task is completed;

the management server is used for automatically detecting and repairing the rendering server according to the running dynamic data and sending reminding information to the management client;

and the management client is used for correcting the abnormal information in the management server by checking the reminding information.

In the automatic monitoring and repairing system for the cloud rendering system, the rendering task allocation log comprises a task source user client ID, a client registration account number, a task number, a first allocation time and a matched secondary transfer server number, and the secondary rendering task allocation log comprises the task number, a second allocation time and a matched rendering server number.
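The two logs share the task number, so a task can in principle be traced from the submitting client down to the rendering server that ran it. The following is a minimal sketch of that idea, not the patented implementation: the class names, field names, the `trace_task` helper and the sample values are all hypothetical and chosen only to mirror the fields listed above.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class RenderTaskAllocationLog:
    """Record kept by the main transfer server (first-level allocation)."""
    source_client_id: str          # task source user client ID
    account: str                   # client registration account number
    task_number: str               # generated after account verification
    first_allocation_time: datetime
    secondary_server_number: str   # matched secondary transfer server


@dataclass(frozen=True)
class SecondaryRenderTaskAllocationLog:
    """Record kept by a secondary transfer server (second-level allocation)."""
    task_number: str
    second_allocation_time: datetime
    rendering_server_number: str   # matched rendering server


def trace_task(task_number, primary_logs, secondary_logs):
    """Join both logs on the task number; return (secondary, rendering) or None."""
    first = next((p for p in primary_logs if p.task_number == task_number), None)
    second = next((s for s in secondary_logs if s.task_number == task_number), None)
    if first is None or second is None:
        return None
    return first.secondary_server_number, second.rendering_server_number


# Illustrative sample records (made up for this sketch).
primary = [RenderTaskAllocationLog("client-01", "studio_a", "T0001",
                                   datetime(2017, 8, 1, 9, 0), "A")]
secondary = [SecondaryRenderTaskAllocationLog("T0001",
                                              datetime(2017, 8, 1, 9, 1), "A003")]
route = trace_task("T0001", primary, secondary)
print(route)  # ('A', 'A003')
```

Keeping the task number as the only shared key means the secondary log does not need to repeat the client or account information.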

In the automatic monitoring and repairing system for the cloud rendering system provided by the invention, the main transfer server comprises a receiving/returning module, an identification module, a monitoring module, a processing module, a storage module and a distribution module, wherein,

the receiving/returning module is used for receiving the rendering task from the user client and transmitting the rendering task to the identification module;

the identification module is used for identifying whether the rendering task belongs to a verified account according to a preset rule, if not, the rendering task is fed back to the user client, if so, a task form is created and stored in the storage module, and meanwhile, the rendering task is transmitted to the processing module;

the processing module is used for generating the rendering task distribution log according to the running dynamic data of the rendering server and storing the rendering task distribution log to the storage module;

the distribution module is used for distributing the rendering tasks to the matched secondary transfer servers according to the rendering task distribution logs;

the storage module is used for storing the task form and the rendering task distribution log.
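The module chain described above can be pictured as a small pipeline: receive, identify, process, distribute, with the storage module holding the task form and the log. The sketch below is an illustration under stated assumptions only. The "preset rule" is assumed to be a simple set of verified accounts, the matching criterion is assumed to be the least-loaded secondary transfer server, and every class, method and key name is hypothetical.

```python
from datetime import datetime
from itertools import count


class MainTransferServer:
    """Toy model of the chain: receive -> identify -> process -> distribute."""

    def __init__(self, verified_accounts, secondary_loads):
        self.verified_accounts = set(verified_accounts)  # assumed form of the preset rule
        self.secondary_loads = dict(secondary_loads)     # assumed metric: tasks per secondary server
        self.storage = {"task_forms": [], "distribution_logs": []}  # storage module
        self._task_numbers = count(1)

    def receive(self, client_id, account, task_file):
        """Receiving/returning module: accept the upload and pass it on."""
        return self._identify(client_id, account, task_file)

    def _identify(self, client_id, account, task_file):
        """Identification module: return unverified uploads, else store a task form."""
        if account not in self.verified_accounts:
            return {"status": "returned_to_client", "client_id": client_id}
        form = {
            "task_number": f"T{next(self._task_numbers):04d}",
            "client_id": client_id,
            "account": account,
            "task_file": task_file,
        }
        self.storage["task_forms"].append(form)
        return self._process(form)

    def _process(self, form):
        """Processing module: choose a secondary server and store the log entry."""
        target = min(self.secondary_loads, key=self.secondary_loads.get)
        log = {
            "client_id": form["client_id"],
            "account": form["account"],
            "task_number": form["task_number"],
            "first_distribution_time": datetime.now(),
            "secondary_server": target,
        }
        self.storage["distribution_logs"].append(log)
        return self._distribute(log)

    def _distribute(self, log):
        """Distribution module: route the task to the server named in the log."""
        self.secondary_loads[log["secondary_server"]] += 1
        return {"status": "distributed", "task_number": log["task_number"],
                "secondary_server": log["secondary_server"]}


server = MainTransferServer({"studio_a"}, {"A": 2, "B": 0})
accepted = server.receive("client-01", "studio_a", "scene.zip")
rejected = server.receive("client-02", "unknown", "scene.zip")
print(accepted["secondary_server"], rejected["status"])  # B returned_to_client
```

Note that the distribution step reads its target from the stored log rather than recomputing it, which matches the description that distribution happens "according to" the log.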

In the automatic monitoring and repairing system for the cloud rendering system provided by the invention, the management server comprises a data acquisition module, a data storage module and a trigger module, wherein,

the data acquisition module is used for acquiring the running dynamic data of the rendering server to form a running form and store the running form in the data storage module, and is also used for sending abnormal information to the trigger module when abnormal conditions occur;

the trigger module comprises an abnormal data model base, searches the abnormal information of the rendering server in the abnormal data model base, triggers a repairing or feedback behavior corresponding to the abnormal information, records the operation in a rendering server log list, and stores the operation in the data storage module.

In the automatic monitoring and repairing system for the cloud rendering system, if the abnormal information is that rendering server software is abnormal and a task is stopped, the triggering module automatically detects other matched rendering servers to continue rendering, restarts the abnormal rendering servers and records the operation in the rendering server log list;

if the abnormal information is that the rendering server is in an off-line state and no task is rendered, the triggering module restarts the rendering server and records the operation in the rendering server log list, and if the restart is invalid, the reminding information is sent to the management client;

if the abnormal information is that the memory of the rendering server overflows and the task stops, the triggering module automatically detects other matched rendering servers to continue rendering, restarts the abnormal rendering server, records the operation in the rendering server log list and sends the reminding information to the management client;

if the abnormal information is that the network of the rendering server is interrupted and the server cannot be reached, the triggering module automatically sends the reminding information to the management client and records the operation in the rendering server log list;

and if the abnormal information indicates that the rendering server frequently has the same abnormal condition, the triggering module automatically sends the reminding information to the management client and records the operation in the rendering server log list.
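The five cases above amount to a lookup from an abnormal condition to an ordered list of actions, which is one plausible reading of the "abnormal data model base". The sketch below encodes exactly those five rules as a table. The enum members, action names and the `handle` function are hypothetical labels invented for this illustration, and the actions are only recorded, not executed.

```python
from enum import Enum, auto


class Abnormality(Enum):
    SOFTWARE_ERROR_TASK_STOPPED = auto()
    OFFLINE_NO_TASK = auto()
    MEMORY_OVERFLOW_TASK_STOPPED = auto()
    NETWORK_INTERRUPTED = auto()
    REPEATED_SAME_ABNORMALITY = auto()


# Abnormal data model base: abnormality -> ordered repair/feedback actions.
MODEL_BASE = {
    Abnormality.SOFTWARE_ERROR_TASK_STOPPED: ("reassign_task", "restart_server"),
    Abnormality.OFFLINE_NO_TASK: ("restart_server",),  # admin is notified only if restart fails
    Abnormality.MEMORY_OVERFLOW_TASK_STOPPED: ("reassign_task", "restart_server", "notify_admin"),
    Abnormality.NETWORK_INTERRUPTED: ("notify_admin",),
    Abnormality.REPEATED_SAME_ABNORMALITY: ("notify_admin",),
}


def handle(server, abnormality, log_list, restart_ok=True):
    """Look up the actions for an abnormality and record them in the log list."""
    actions = list(MODEL_BASE[abnormality])
    if abnormality is Abnormality.OFFLINE_NO_TASK and not restart_ok:
        actions.append("notify_admin")
    log_list.append({"server": server, "abnormality": abnormality.name, "actions": actions})
    return actions


log_list = []
first = handle("A003", Abnormality.MEMORY_OVERFLOW_TASK_STOPPED, log_list)
second = handle("A007", Abnormality.OFFLINE_NO_TASK, log_list, restart_ok=False)
print(first)
print(second)
```

Because the rules live in a plain dictionary, adding, modifying or deleting an entry changes the behavior without touching `handle`.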

Correspondingly, the invention also provides an automatic monitoring and repairing method for the cloud rendering system, which comprises the following steps:

step S1: a producer sets task parameters through a user client and uploads a required rendering task to a main transfer server;

step S2: verifying the account registration information of the uploaded rendering task through the main transfer server, automatically generating a task number after the verification is passed, distributing the rendering task to a matched secondary transfer server, and generating a rendering task distribution log;

step S3: the rendering server sends running dynamic data to the corresponding secondary transfer server, and the secondary transfer server sends the running dynamic data to a management server and the main transfer server;

step S4: the secondary transfer server distributes the rendering tasks to matched rendering servers for rendering according to the running dynamic data to generate secondary rendering task distribution logs;

step S5: the rendering server executes the rendering task and sends an execution result to the user client through the corresponding secondary transfer server and the main transfer server after the rendering task is completed;

step S6: the management server automatically detects and repairs the rendering server according to the running dynamic data and sends reminding information to a management client;

step S7: and the management client checks the reminding information and corrects the abnormal information in the management server.
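Steps S3 and S4 say that the secondary transfer server picks a "matched" rendering server from the running dynamic data, without fixing the matching rule. As a hedged illustration, the sketch below assumes one possible rule: consider only idle, reachable servers with memory headroom, then take the one with the lowest CPU utilization. The function name, the dictionary keys, the 0.9 threshold and the sample reports are all assumptions made for this example.

```python
def pick_rendering_server(dynamic_data, max_memory_usage=0.9):
    """Return the address of the best idle server, or None if none qualifies."""
    candidates = [
        d for d in dynamic_data
        if d["running_state"] == "idle"
        and d["network_ok"]
        and d["memory_usage"] < max_memory_usage
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda d: d["cpu_usage"])["address"]


# Illustrative reports as a secondary transfer server might receive them in S3.
reports = [
    {"address": "A001", "running_state": "rendering", "network_ok": True,
     "cpu_usage": 0.95, "memory_usage": 0.70},
    {"address": "A002", "running_state": "idle", "network_ok": True,
     "cpu_usage": 0.10, "memory_usage": 0.95},
    {"address": "A003", "running_state": "idle", "network_ok": True,
     "cpu_usage": 0.20, "memory_usage": 0.30},
    {"address": "A004", "running_state": "idle", "network_ok": False,
     "cpu_usage": 0.05, "memory_usage": 0.10},
]
chosen = pick_rendering_server(reports)
print(chosen)  # A003
```

Here A002 is skipped for high memory use and A004 because it is unreachable, so A003 is chosen even though it is not the least busy server overall.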

In the automatic monitoring and repairing method for the cloud rendering system, the rendering task allocation log comprises a task source user client ID, a client registration account number, a task number, a first allocation time and a matched secondary transfer server number, and the secondary rendering task allocation log comprises the task number, a second allocation time and a matched rendering server number.

In the automatic monitoring and repairing method for the cloud rendering system provided by the present invention, the step S2 includes:

step S21: receiving the rendering task from the user client through a receiving/returning module, and transmitting the rendering task to an identification module;

step S22: identifying whether the rendering task belongs to a verified account or not through the identification module according to a preset rule, if not, feeding back the rendering task to the user client, and if so, creating a task form and storing the task form to a storage module, and meanwhile, transmitting the rendering task to a processing module;

step S23: generating the rendering task allocation log according to the running dynamic data of the rendering server through the processing module, and storing the rendering task allocation log to the storage module;

step S24: distributing the rendering task to the matched secondary transfer server through a distribution module according to the rendering task distribution log.

In the automatic monitoring and repairing method for the cloud rendering system provided by the present invention, the step S6 includes:

step S61: acquiring the running dynamic data of the rendering server through a data acquisition module, forming a running form, storing the running form in a data storage module, and sending abnormal information to a trigger module when an abnormal condition occurs;

step S62: the triggering module searches the abnormal information of the rendering server in the abnormal data model base, triggers the repairing or feedback behavior corresponding to the abnormal information, records the operation in a rendering server log list, and stores the operation in the data storage module.

In the automatic monitoring and repairing method for the cloud rendering system, if the abnormal information is that rendering server software is abnormal and a task is stopped, the triggering module automatically detects other matched rendering servers to continue rendering, restarts the abnormal rendering servers and records the operation in the rendering server log list;

The embodiment of the invention has the following beneficial effects: the automatic monitoring and repairing system and method for the cloud rendering system provided by the invention rely on the cloud rendering system to monitor rendering servers in real time, efficiently and intelligently manage thousands of rendering servers, analyze monitoring data, display the data in a more intuitive mode, reasonably allocate rendering servers, replace abnormal servers, increase local hard disks of the servers and the like, optimize rendering farms more conveniently and pertinently and improve rendering efficiency.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.

Fig. 1 is a schematic diagram of an automatic monitoring and repairing system for a cloud rendering system according to an embodiment of the present invention;

Fig. 2 is a schematic diagram of the main transfer server shown in Fig. 1;

Fig. 3 is a schematic diagram of the management server shown in Fig. 1;

Fig. 4 is a chart showing the number of memory overflows in August 2017 for the rendering servers under secondary transfer server A;

Fig. 5 is a flowchart illustrating an automatic monitoring and repairing method for a cloud rendering system according to an embodiment of the present invention;

Fig. 6 is a flowchart illustrating step S2 shown in Fig. 5;

Fig. 7 is a flowchart illustrating step S6 shown in Fig. 5.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Fig. 1 is a schematic diagram of an automatic monitoring and repairing system for a cloud rendering system according to an embodiment of the present invention. As shown in Fig. 1, the system includes a user client 10, a management client 20, a main transfer server 30, a management server 40, a secondary transfer server 50, and a rendering server 60.

In the invention, the user client is installed on the producer's computer, and the producer uploads the required rendering task after setting the task parameters through the client. The main transfer server checks the registration information of the uploading account, automatically generates a task number after verification is passed, distributes the task to a matched secondary transfer server according to the load of the secondary transfer servers and the dynamic task distribution information, and generates a rendering task distribution log. The rendering task distribution log comprises the task source user client ID, the client registration account, the task number, the first distribution time and the number of the matched secondary transfer server. The secondary transfer server receives the running dynamic data of the rendering servers in real time, distributes the tasks to matched rendering servers for rendering according to this monitoring data, and generates a secondary rendering task distribution log, which comprises the task number, the second distribution time and the number of the matched rendering server. After rendering is completed, the result automatically passes back through the secondary transfer server and the main transfer server and is automatically downloaded, named according to the set naming rule, to the computer that submitted the task.

In the present invention, a plurality of main transfer servers may be included. Fig. 2 is a schematic diagram of a main transfer server; as shown in Fig. 2, it includes a receiving/returning module 310, an identification module 320, a monitoring module 330, a processing module 340, a storage module 350 and a distribution module 360, wherein,

the identification module is used for identifying whether the rendering task belongs to a verified account according to a preset rule, if not, feeding back the rendering task to the user client, and if so, creating a task form and storing the task form to the storage module, and meanwhile, transmitting the rendering task to the processing module;

In the invention, the main transfer server can be connected to a plurality of producer computers over a network, and is connected at high speed to the management server and all secondary transfer servers over a local area network. The receiving/returning module has a port for connecting to external computers; it receives the rendering task file uploaded from the producer's computer and transmits it to the identification module. The identification module determines, according to a preset rule, whether the uploaded file belongs to a verified account. If not, the file is returned to the source computer; if so, a task form is created (containing information such as the time, the source computer or source network IP, the task priority, the task size and the number of frames) and stored in the storage module, and the task is passed on to the processing module. The processing module generates a task distribution log according to the dynamic data of each rendering server and stores it in the storage module. The distribution module distributes the tasks to the corresponding secondary transfer servers according to the task distribution log, and each secondary transfer server in turn distributes tasks according to task priority and the number of servers required. The storage module stores the task form and the distribution log.
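Distributing "according to the task priority and the number of the required servers" can be modeled as a priority queue in which a task is started only when enough free rendering servers are available. The sketch below shows one such scheme using Python's `heapq`; it is an assumption about how this could work, not the method the patent prescribes. The `SecondaryQueue` class, its method names and the sample tasks are hypothetical, and higher numbers are assumed to mean higher priority.

```python
import heapq
from itertools import count


class SecondaryQueue:
    """Toy scheduler for one secondary transfer server."""

    def __init__(self, free_servers):
        self.free_servers = list(free_servers)
        self._heap = []
        self._seq = count()  # tie-breaker keeps submission order within a priority

    def submit(self, task_number, priority, servers_needed):
        # heapq is a min-heap, so the priority is negated to pop the highest first.
        heapq.heappush(self._heap, (-priority, next(self._seq), task_number, servers_needed))

    def dispatch(self):
        """Assign free servers to the highest-priority tasks that currently fit."""
        assignments = {}
        waiting = []
        while self._heap:
            item = heapq.heappop(self._heap)
            _, _, task_number, needed = item
            if needed <= len(self.free_servers):
                assignments[task_number] = [self.free_servers.pop(0) for _ in range(needed)]
            else:
                waiting.append(item)  # not enough servers yet; keep for a later round
        for item in waiting:
            heapq.heappush(self._heap, item)
        return assignments


queue = SecondaryQueue(["A001", "A002", "A003"])
queue.submit("T0001", priority=1, servers_needed=2)
queue.submit("T0002", priority=5, servers_needed=2)
queue.submit("T0003", priority=3, servers_needed=1)
assigned = queue.dispatch()
print(assigned)  # {'T0002': ['A001', 'A002'], 'T0003': ['A003']}
```

In this run T0001 stays queued because no servers are left after the two higher-priority tasks are placed. A real scheduler would also need to return servers to the free list when a task finishes.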

Fig. 3 is a schematic diagram of a management server, which, as shown in fig. 3, includes a data acquisition module 410, a data storage module 420, and a trigger module 430, wherein,

To manage abnormal rendering servers efficiently, the management server must automatically detect and repair rendering servers and report to an administrator in time. The management server can be connected directly to a plurality of administrator computers over a network, and is connected at high speed to the main transfer server and all secondary transfer servers over a local area network. The dynamic data monitored by the application programs on the rendering servers is transmitted simultaneously to the secondary transfer server and the management server, so that the monitoring data held by the management server and by the main transfer server is the same. The data acquisition module acquires the dynamic monitoring data of each rendering server, including the server's address, the time, the computing task number, the computing duration, CPU utilization, memory utilization, network state, running state and the like, forms it into a form and stores it in the storage module; if an abnormal condition occurs, it sends the abnormal information to the trigger module. The trigger module comprises an abnormal data model base, which the administrator can add to, modify and delete from. According to the preset rules of the model base, the trigger module looks up the rendering server's abnormal information in the model base, triggers the corresponding repair or feedback behavior, records the operation in the rendering server log list, and stores it in the storage module. If new abnormal information appears, the abnormal data model base is updated in time.
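The data acquisition step can be sketched as follows: every monitoring record is appended to the running form, and only records that look abnormal are forwarded to the trigger module. The text does not say how an abnormal condition is recognized, so the classification rules here (the state names, the 0.98 memory limit and the order of checks) are assumptions made purely for illustration, as are all class, field and label names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunningRecord:
    """One monitoring sample, mirroring the fields listed in the text."""
    address: str
    time: str
    task_number: Optional[str]
    compute_seconds: int
    cpu_usage: float
    memory_usage: float
    network_ok: bool
    running_state: str  # assumed values: "rendering", "idle", "offline", "stopped"


def classify(record, memory_limit=0.98):
    """Return an abnormality label for a record, or None if it looks normal."""
    if not record.network_ok:
        return "network_interrupted"
    if record.running_state == "offline" and record.task_number is None:
        return "offline_no_task"
    if record.running_state == "stopped" and record.memory_usage >= memory_limit:
        return "memory_overflow_task_stopped"
    if record.running_state == "stopped":
        return "software_error_task_stopped"
    return None


class DataAcquisitionModule:
    def __init__(self):
        self.running_form = []   # stands in for the data storage module
        self.trigger_inbox = []  # stands in for the trigger module's input

    def collect(self, record):
        """Store every record; forward only abnormal ones to the trigger module."""
        self.running_form.append(record)
        label = classify(record)
        if label is not None:
            self.trigger_inbox.append((record.address, label))
        return label


acquisition = DataAcquisitionModule()
healthy = RunningRecord("A001", "2017-08-01T09:00", "T0001", 120, 0.80, 0.55, True, "rendering")
overflow = RunningRecord("A003", "2017-08-01T09:00", "T0002", 300, 0.10, 0.99, True, "stopped")
acquisition.collect(healthy)
acquisition.collect(overflow)
print(acquisition.trigger_inbox)  # [('A003', 'memory_overflow_task_stopped')]
```

Storing normal records as well as abnormal ones is what later makes per-server statistics possible.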

The abnormal information in the trigger module's model base includes (but is not limited to):

if the abnormal information is that the rendering server software is abnormal and the task is stopped, the triggering module automatically detects other matched rendering servers to continue rendering, restarts the abnormal rendering servers and records the operation in the rendering server log list;

In the invention, the management client provides reminders through a mobile phone app or SMS. By checking the reminder information, the administrator can add to or modify the abnormal data model base in the management server, thereby improving its content. The administrator can also set parameters (for example, to check how often rendering nodes under a secondary transfer server went offline during a given period), retrieve the matching abnormal data, and display it graphically. Based on the displayed data, the administrator can reallocate rendering servers, replace abnormal servers, add local hard disks to servers and so on, optimizing the rendering farm more conveniently and in a more targeted way and improving rendering efficiency.

Fig. 4 is a chart showing the number of memory overflows in August 2017 for the rendering servers under secondary transfer server A. The administrator retrieves statistics from the log by entering the parameters: rendering servers under secondary transfer server A, August 2017, and number of memory overflows. This yields the chart shown in Fig. 4. Based on this data, the administrator can check the memory of rendering server A003 and the status of all its tasks, and expand the memory or modify the task allocation information in time. The administrator can also reallocate rendering servers, replace abnormal servers, add local hard disks to servers and so on according to the displayed data, so that the rendering farm is optimized more conveniently and in a more targeted way and rendering efficiency is improved.
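A query of this kind reduces to filtering the rendering server log list by secondary transfer server, month and abnormality type, then counting entries per server. The sketch below shows that aggregation with `collections.Counter`. The log entry layout, the function name and the sample entries are hypothetical; the counts are made up for the example and are not the values plotted in Fig. 4.

```python
from collections import Counter
from datetime import date


def count_abnormalities(log_list, secondary_server, year, month, abnormality):
    """Count matching log entries per rendering server, most frequent first."""
    counts = Counter(
        entry["server"]
        for entry in log_list
        if entry["secondary_server"] == secondary_server
        and entry["date"].year == year
        and entry["date"].month == month
        and entry["abnormality"] == abnormality
    )
    return counts.most_common()


# Made-up sample entries for illustration only.
log_list = [
    {"secondary_server": "A", "server": "A003", "date": date(2017, 8, 2), "abnormality": "memory_overflow"},
    {"secondary_server": "A", "server": "A001", "date": date(2017, 8, 5), "abnormality": "memory_overflow"},
    {"secondary_server": "A", "server": "A003", "date": date(2017, 8, 9), "abnormality": "memory_overflow"},
    {"secondary_server": "A", "server": "A003", "date": date(2017, 8, 20), "abnormality": "memory_overflow"},
    {"secondary_server": "A", "server": "A003", "date": date(2017, 7, 30), "abnormality": "memory_overflow"},
    {"secondary_server": "B", "server": "B001", "date": date(2017, 8, 3), "abnormality": "memory_overflow"},
    {"secondary_server": "A", "server": "A002", "date": date(2017, 8, 4), "abnormality": "offline"},
]
stats = count_abnormalities(log_list, "A", 2017, 8, "memory_overflow")
print(stats)  # [('A003', 3), ('A001', 1)]
```

The resulting list of (server, count) pairs is already in the shape a bar chart needs, with the most affected server first.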

Fig. 5 is a flowchart of an automatic monitoring and repairing method for a cloud rendering system according to an embodiment of the present invention, and as shown in fig. 5, the automatic monitoring and repairing method for a cloud rendering system according to the present invention includes the following steps:

specifically, the step S2 includes:

specifically, the step S6 includes:

The automatic monitoring and repairing system and method for the cloud rendering system provided by the invention rely on the cloud rendering system to monitor rendering servers in real time, efficiently and intelligently manage thousands of rendering servers, analyze monitoring data, display the data in a more intuitive mode, reasonably allocate rendering servers, replace abnormal servers, increase local hard disks of the servers and the like, optimize rendering farms more conveniently and pertinently and improve rendering efficiency.

While the invention has been described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. An automatic monitoring and repairing system for a cloud rendering system is characterized by comprising a user client, a management client, a main transfer server, a management server, a secondary transfer server and a rendering server, wherein,

the management client is used for correcting the abnormal information in the management server by checking the reminding information;

the management server comprises a data acquisition module, a data storage module and a triggering module, wherein,

the trigger module comprises an abnormal data model base, searches the abnormal information of the rendering server in the abnormal data model base, triggers a repairing or feedback behavior corresponding to the abnormal information, records the operation in a rendering server log list and stores the operation in the data storage module;

2. The automatic monitoring and repairing system for the cloud rendering system of claim 1, wherein the rendering task allocation log comprises a task source user client ID, a client registration account number, a task number, a first allocation time and a matched secondary transfer server number, and the secondary rendering task allocation log comprises the task number, a second allocation time and a matched rendering server number.

3. The automatic monitoring and repairing system for the cloud rendering system of claim 1, wherein the main transfer server comprises a receiving/returning module, an identification module, a monitoring module, a processing module, a storage module and a distribution module, wherein,

4. An automatic monitoring and repairing method for a cloud rendering system is characterized by comprising the following steps:

step S3: the rendering server sends running dynamic data to the corresponding secondary transfer server, and the secondary transfer server sends the running dynamic data to the management server and the main transfer server;

step S7: the management client checks the reminding information and corrects the abnormal information in the management server;

the step S6 includes:

step S62: the triggering module searches the abnormal information of the rendering server in an abnormal data model base, triggers a repairing or feedback behavior corresponding to the abnormal information, records the operation in a rendering server log list and stores the operation in the data storage module;

5. The automatic monitoring and repairing method for the cloud rendering system according to claim 4, wherein the rendering task allocation log includes a task source user client ID, a client registration account number, a task number, a first allocation time and a matched secondary transfer server number, and the secondary rendering task allocation log includes the task number, a second allocation time and a matched rendering server number.

6. The automatic monitoring repair method for the cloud rendering system of claim 4, wherein the step S2 includes: