CN107992392B - Automatic monitoring and repairing system and method for cloud rendering system

Publication number
CN107992392B
CN107992392B CN201711165385.7A CN201711165385A CN107992392B CN 107992392 B CN107992392 B CN 107992392B CN 201711165385 A CN201711165385 A CN 201711165385A CN 107992392 B CN107992392 B CN 107992392B
Authority
CN
China
Prior art keywords
rendering
server
task
module
abnormal
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711165385.7A
Other languages
Chinese (zh)
Other versions
CN107992392A (en
Inventor
都政
秦莉兰
井革新
陈远磊
陈聪梅
刘昭
靳绍巍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NATIONAL SUPERCOMPUTING CENTER IN SHENZHEN (SHENZHEN CLOUD COMPUTING CENTER)
Original Assignee
NATIONAL SUPERCOMPUTING CENTER IN SHENZHEN (SHENZHEN CLOUD COMPUTING CENTER)
Application filed by NATIONAL SUPERCOMPUTING CENTER IN SHENZHEN (SHENZHEN CLOUD COMPUTING CENTER)
Priority to CN201711165385.7A
Publication of CN107992392A
Application granted
Publication of CN107992392B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3006 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/0709 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems
    • G06F11/0793 Remedial or corrective actions
    • G06F11/3055 Monitoring arrangements for monitoring the status of the computing system or of the computing system component, e.g. monitoring if the computing system is on, off, available, not available
    • G06F11/32 Monitoring with visual or acoustical indication of the functioning of the machine
    • G06F11/324 Display of status information
    • G06F11/327 Alarm or error message display
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F2209/00 Indexing scheme relating to G06F9/00
    • G06F2209/50 Indexing scheme relating to G06F9/50
    • G06F2209/508 Monitor

Abstract

The invention provides an automatic monitoring and repairing system for a cloud rendering system. A user client lets a producer set task parameters and upload a rendering task to a main transfer server. The main transfer server verifies the account registration information of the uploaded rendering task and distributes the task to a matching secondary transfer server. The secondary transfer server distributes the rendering task to a matching rendering server according to the running dynamic data, and sends the running dynamic data to a management server and to the main transfer server. The rendering server executes the rendering task. The management server automatically detects and repairs the rendering server according to the running dynamic data. A management client is used to correct the abnormal information in the management server. The invention monitors every rendering server automatically, lets an administrator manage the render-farm servers with less manual work, improves management efficiency and optimizes use of the render farm.

Description

Automatic monitoring and repairing system and method for cloud rendering system
Technical Field
The invention relates to the technical field of automatic monitoring and repair, in particular to an automatic monitoring and repair system and method for a cloud rendering system.
Background
Computer animation is one of the fastest-developing technical fields in the world. To obtain high-quality computer animation, the scene must be rendered after operations such as animation modeling and motion design are complete. High-quality rendering needs a large amount of material and consumes substantial CPU resources; for example, rendering a single high-resolution picture usually takes 10 hours. Cloud rendering systems (also called cloud rendering platforms or render farms) are therefore generally used today to produce large-scale animations, special-effects films and the like.
Compared with traditional rendering and its shortcomings, a cloud rendering system is an advanced rendering solution based on cloud computing services. With a cloud rendering system, a user can invoke thousands of cloud servers for parallel rendering within a few seconds. A single rendering platform may consist of hundreds of rendering servers. For that many servers, management software can allocate and optimize resources across the whole network, manage the jobs submitted to the system, and carry out cross-platform, multi-engine, multi-task rendering at scale. Server maintenance, however, is still manual: either the state of each server is checked by hand at irregular intervals, or a server fault is investigated only after a task has failed. Two important issues therefore need to be solved in managing a cloud rendering system:
1. whether the system can automatically monitor the running dynamic data of the rendering servers and promptly repair an abnormal condition or report it to an administrator;
2. whether the system can monitor the state of every rendering server in real time, analyze how the servers are running, and optimize the cloud rendering system in time (for example, decide whether a server needs to be replaced or a local hard disk needs to be added).
At present these two issues are handled as follows. First, the running condition of the rendering servers is checked and tallied by hand, and more often the faulty server is located and repaired manually only after a task has run abnormally; labor costs are high and the rendering server cannot be repaired promptly. Second, server problems are diagnosed from experience or from how often a server has failed, with no monitoring records as evidence, so the cloud rendering system cannot be optimized in time.
Disclosure of Invention
To address the shortcomings of the existing handling mechanism, the invention provides an automatic monitoring and repairing system and method for a cloud rendering system.
In one aspect, an embodiment of the present invention provides an automatic monitoring and repairing system for a cloud rendering system, including a user client, a management client, a main transfer server, a management server, a secondary transfer server, and a rendering server, wherein,
the user client is used by a producer to set task parameters and upload a required rendering task to the main transfer server;
the main transfer server is used for verifying the account registration information of the uploaded rendering task, automatically generating a task number after the verification is passed, distributing the rendering task to a matching secondary transfer server, and generating a rendering task allocation log;
the secondary transfer server is used for receiving the running dynamic data of the rendering server, distributing the rendering task to a matching rendering server for rendering according to the running dynamic data, generating a secondary rendering task allocation log, and sending the running dynamic data to the management server and the main transfer server;
the rendering server is used for executing the rendering task and, after the rendering task is completed, sending the execution result to the user client through the corresponding secondary transfer server and the main transfer server;
the management server is used for automatically detecting and repairing the rendering server according to the running dynamic data and sending reminding information to the management client;
and the management client is used for correcting the abnormal information in the management server after the reminding information has been checked.
In the automatic monitoring and repairing system for the cloud rendering system provided by the invention, the rendering task allocation log comprises the ID of the user client from which the task originated, the client's registered account, the task number, a first allocation time and the number of the matching secondary transfer server; the secondary rendering task allocation log comprises the task number, a second allocation time and the number of the matching rendering server.
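The two logs share the task number, so a task can be followed from the user client down to the rendering server by joining them on that field. The following is a minimal sketch of the two records and such a join; the class names, field names and the trace_task helper are hypothetical and chosen only for illustration, since the patent lists the log contents but not their representation.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class RenderTaskAllocationLog:
    """Entry written by the main transfer server (hypothetical field names)."""
    client_id: str
    account: str
    task_number: str
    first_allocation_time: datetime
    secondary_server_number: str


@dataclass(frozen=True)
class SecondaryAllocationLog:
    """Entry written by a secondary transfer server (hypothetical field names)."""
    task_number: str
    second_allocation_time: datetime
    rendering_server_number: str


def trace_task(task_number, primary_logs, secondary_logs):
    """Join both logs on the task number to follow one task through the system."""
    first = next(e for e in primary_logs if e.task_number == task_number)
    servers = [e.rendering_server_number for e in secondary_logs
               if e.task_number == task_number]
    return first.client_id, first.secondary_server_number, servers
```

Keeping the task number as the only shared key means either log can be extended independently without breaking the join.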
In the automatic monitoring and repairing system for the cloud rendering system provided by the invention, the main transfer server comprises a receiving/returning module, an identification module, a monitoring module, a processing module, a storage module and a distribution module, wherein,
the receiving/returning module is used for receiving the rendering task from the user client and passing the rendering task to the identification module;
the identification module is used for identifying, according to a preset rule, whether the rendering task belongs to a verified account; if not, the rendering task is returned to the user client; if so, a task form is created and stored in the storage module, and the rendering task is passed to the processing module;
the processing module is used for generating the rendering task allocation log according to the running dynamic data of the rendering server and storing the rendering task allocation log in the storage module;
the distribution module is used for distributing the rendering task to the matching secondary transfer server according to the rendering task allocation log;
the storage module is used for storing the task form and the rendering task allocation log.
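The receive, identify and store path through the main transfer server can be sketched as below. This is an illustration under stated assumptions, not the patented implementation: the MainTransferIntake class, the "T000001" task-number format and the task-form fields are hypothetical, and the "preset rule" is reduced to membership in a set of verified accounts.

```python
import itertools
from datetime import datetime


class MainTransferIntake:
    """Hypothetical stand-in for the receiving/returning, identification and storage modules."""

    def __init__(self, verified_accounts):
        self.verified_accounts = set(verified_accounts)
        self.task_forms = {}              # plays the role of the storage module
        self._counter = itertools.count(1)

    def receive(self, account, source_ip, priority, frame_count, now=None):
        """Return (accepted, task_number) or (False, reason) for unverified accounts."""
        if account not in self.verified_accounts:
            # Assumption: the task is returned to the client with a reason string.
            return False, "account not verified"
        task_number = f"T{next(self._counter):06d}"   # assumed numbering scheme
        self.task_forms[task_number] = {
            "time": now or datetime.now(),
            "source_ip": source_ip,
            "account": account,
            "priority": priority,
            "frame_count": frame_count,
        }
        return True, task_number
```

A rejected upload does not consume a task number, so the numbers of accepted tasks stay contiguous.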
In the automatic monitoring and repairing system for the cloud rendering system provided by the invention, the management server comprises a data acquisition module, a data storage module and a trigger module, wherein,
the data acquisition module is used for acquiring the running dynamic data of the rendering server, forming a running form from the data and storing the running form in the data storage module, and is also used for sending abnormal information to the trigger module when an abnormal condition occurs;
the trigger module comprises an abnormal data model base; it looks up the abnormal information of the rendering server in the abnormal data model base, triggers the repair or feedback behavior corresponding to the abnormal information, records the operation in a rendering server log list, and stores the log list in the data storage module.
In the automatic monitoring and repairing system for the cloud rendering system provided by the invention, if the abnormal information indicates that the rendering server software is abnormal and the task has stopped, the trigger module automatically finds another matching rendering server to continue the rendering, restarts the abnormal rendering server, and records the operation in the rendering server log list;
if the abnormal information indicates that the rendering server is offline and is not rendering any task, the trigger module restarts the rendering server and records the operation in the rendering server log list, and if the restart fails, sends the reminding information to the management client;
if the abnormal information indicates that the memory of the rendering server has overflowed and the task has stopped, the trigger module automatically finds another matching rendering server to continue the rendering, restarts the abnormal rendering server, records the operation in the rendering server log list, and sends the reminding information to the management client;
if the abnormal information indicates that the network connection of the rendering server is interrupted and the server cannot be reached, the trigger module automatically sends the reminding information to the management client and records the operation in the rendering server log list;
and if the abnormal information indicates that the rendering server frequently shows the same abnormal condition, the trigger module automatically sends the reminding information to the management client and records the operation in the rendering server log list.
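The five cases above amount to a lookup table from an abnormal condition to a list of actions, with every operation recorded in the log list. A minimal sketch of that table follows. The enum names and the handle function are hypothetical, and notifying the administrator for a condition missing from the table is an assumption made here; the patent only says the model base is updated when new abnormal information appears.

```python
from enum import Enum, auto


class Anomaly(Enum):
    SOFTWARE_FAULT_TASK_STOPPED = auto()
    OFFLINE_NO_TASK = auto()
    MEMORY_OVERFLOW_TASK_STOPPED = auto()
    NETWORK_INTERRUPTED = auto()
    REPEATED_SAME_ANOMALY = auto()


class Action(Enum):
    REASSIGN_TASK = auto()
    RESTART_SERVER = auto()
    NOTIFY_ADMIN = auto()


# One entry per case described in the text.
RULES = {
    Anomaly.SOFTWARE_FAULT_TASK_STOPPED: [Action.REASSIGN_TASK, Action.RESTART_SERVER],
    Anomaly.OFFLINE_NO_TASK: [Action.RESTART_SERVER],
    Anomaly.MEMORY_OVERFLOW_TASK_STOPPED: [
        Action.REASSIGN_TASK, Action.RESTART_SERVER, Action.NOTIFY_ADMIN],
    Anomaly.NETWORK_INTERRUPTED: [Action.NOTIFY_ADMIN],
    Anomaly.REPEATED_SAME_ANOMALY: [Action.NOTIFY_ADMIN],
}


def handle(server_id, anomaly, log, restart_ok=True):
    """Look up the actions for an anomaly and record the operation in the log list."""
    actions = list(RULES.get(anomaly, [Action.NOTIFY_ADMIN]))  # assumed fallback
    if anomaly is Anomaly.OFFLINE_NO_TASK and not restart_ok:
        actions.append(Action.NOTIFY_ADMIN)  # restart failed: escalate to the administrator
    log.append((server_id, anomaly.name, [a.name for a in actions]))
    return actions
```

Because the rules are plain data, an administrator-facing tool could add, modify or delete entries without touching the dispatch code.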
Correspondingly, the invention also provides an automatic monitoring and repairing method for the cloud rendering system, which comprises the following steps:
step S1: a producer sets task parameters through a user client and uploads a required rendering task to a main transfer server;
step S2: the main transfer server verifies the account registration information of the uploaded rendering task, automatically generates a task number after the verification is passed, distributes the rendering task to a matching secondary transfer server, and generates a rendering task allocation log;
step S3: the rendering server sends running dynamic data to the corresponding secondary transfer server, and the secondary transfer server sends the running dynamic data to a management server and the main transfer server;
step S4: the secondary transfer server distributes the rendering task to a matching rendering server for rendering according to the running dynamic data, and generates a secondary rendering task allocation log;
step S5: the rendering server executes the rendering task and, after the rendering task is completed, sends the execution result to the user client through the corresponding secondary transfer server and the main transfer server;
step S6: the management server automatically detects and repairs the rendering server according to the running dynamic data and sends reminding information to a management client;
step S7: the management client checks the reminding information and corrects the abnormal information in the management server.
In the automatic monitoring and repairing method for the cloud rendering system provided by the invention, the rendering task allocation log comprises the ID of the user client from which the task originated, the client's registered account, the task number, a first allocation time and the number of the matching secondary transfer server; the secondary rendering task allocation log comprises the task number, a second allocation time and the number of the matching rendering server.
In the automatic monitoring and repairing method for the cloud rendering system provided by the present invention, the step S2 includes:
step S21: receiving the rendering task from the user client through a receiving/returning module, and passing the rendering task to an identification module;
step S22: identifying, through the identification module and according to a preset rule, whether the rendering task belongs to a verified account; if not, returning the rendering task to the user client; if so, creating a task form, storing the task form in a storage module, and passing the rendering task to a processing module;
step S23: generating, through the processing module, the rendering task allocation log according to the running dynamic data of the rendering server, and storing the rendering task allocation log in the storage module;
step S24: distributing, through a distribution module, the rendering task to the matching secondary transfer server according to the rendering task allocation log.
In the automatic monitoring and repairing method for the cloud rendering system provided by the present invention, the step S6 includes:
step S61: acquiring the running dynamic data of the rendering server through a data acquisition module, forming a running form from the data, storing the running form in a data storage module, and sending abnormal information to a trigger module when an abnormal condition occurs;
step S62: the trigger module looks up the abnormal information of the rendering server in the abnormal data model base, triggers the repair or feedback behavior corresponding to the abnormal information, records the operation in a rendering server log list, and stores the log list in the data storage module.
In the automatic monitoring and repairing method for the cloud rendering system provided by the invention, if the abnormal information indicates that the rendering server software is abnormal and the task has stopped, the trigger module automatically finds another matching rendering server to continue the rendering, restarts the abnormal rendering server, and records the operation in the rendering server log list;
if the abnormal information indicates that the rendering server is offline and is not rendering any task, the trigger module restarts the rendering server and records the operation in the rendering server log list, and if the restart fails, sends the reminding information to the management client;
if the abnormal information indicates that the memory of the rendering server has overflowed and the task has stopped, the trigger module automatically finds another matching rendering server to continue the rendering, restarts the abnormal rendering server, records the operation in the rendering server log list, and sends the reminding information to the management client;
if the abnormal information indicates that the network connection of the rendering server is interrupted and the server cannot be reached, the trigger module automatically sends the reminding information to the management client and records the operation in the rendering server log list;
and if the abnormal information indicates that the rendering server frequently shows the same abnormal condition, the trigger module automatically sends the reminding information to the management client and records the operation in the rendering server log list.
The embodiments of the invention have the following beneficial effects. The automatic monitoring and repairing system and method provided by the invention rely on the cloud rendering system to monitor the rendering servers in real time and to manage thousands of rendering servers efficiently and intelligently. The monitoring data are analyzed and displayed in a more intuitive form, so that rendering servers can be allocated sensibly, abnormal servers replaced, local hard disks added to servers and so on. The render farm can thus be optimized more conveniently and in a more targeted way, and rendering efficiency is improved.
Drawings
To illustrate the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings used in describing them are briefly introduced below. The drawings described below show only some embodiments of the present invention; a person skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic diagram of an automatic monitoring and repairing system for a cloud rendering system according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of the main transfer server shown in Fig. 1;
Fig. 3 is a schematic diagram of the management server shown in Fig. 1;
Fig. 4 is a statistical chart of the number of memory overflows in August 2017 for the rendering servers under secondary transfer server A;
Fig. 5 is a flowchart of an automatic monitoring and repairing method for a cloud rendering system according to an embodiment of the present invention;
Fig. 6 is a flowchart of step S2 shown in Fig. 5;
Fig. 7 is a flowchart of step S6 shown in Fig. 5.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a schematic diagram of an automatic monitoring and repairing system for a cloud rendering system according to an embodiment of the present invention. As shown in Fig. 1, the system includes a user client 10, a management client 20, a main transfer server 30, a management server 40, a secondary transfer server 50, and a rendering server 60, wherein,
the user client is used by a producer to set task parameters and upload a required rendering task to the main transfer server;
the main transfer server is used for verifying the account registration information of the uploaded rendering task, automatically generating a task number after the verification is passed, distributing the rendering task to a matching secondary transfer server, and generating a rendering task allocation log;
the secondary transfer server is used for receiving the running dynamic data of the rendering server, distributing the rendering task to a matching rendering server for rendering according to the running dynamic data, generating a secondary rendering task allocation log, and sending the running dynamic data to the management server and the main transfer server;
the rendering server is used for executing the rendering task and, after the rendering task is completed, sending the execution result to the user client through the corresponding secondary transfer server and the main transfer server;
the management server is used for automatically detecting and repairing the rendering server according to the running dynamic data and sending reminding information to the management client;
and the management client is used for correcting the abnormal information in the management server after the reminding information has been checked.
In the invention, the user client is installed on the producer's computer, and the required rendering task is uploaded after the task parameters have been set through the client. The main transfer server checks the registration information of the uploading account and, once verification has passed, automatically generates a task number. It then distributes the task to a matching secondary transfer server according to the load of the secondary transfer servers and the current state of task distribution, and generates a rendering task allocation log comprising the ID of the user client from which the task originated, the client's registered account, the task number, the first allocation time and the number of the matching secondary transfer server. The secondary transfer server receives the running dynamic data of the rendering servers in real time, distributes the task to a matching rendering server for rendering according to this monitoring data, and generates a secondary rendering task allocation log comprising the task number, the second allocation time and the number of the matching rendering server. After rendering is completed, the result passes automatically through the secondary transfer server and the main transfer server and is downloaded automatically, under the configured naming rule, to the computer that submitted the task.
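The patent does not state the exact rule by which the main transfer server matches a task to a secondary transfer server, only that load and task-distribution state are taken into account. One plausible rule, shown here purely as an assumption, is to pick the server with the lowest ratio of running tasks to capacity; the function name and the (running, capacity) representation are hypothetical.

```python
def pick_secondary_server(loads):
    """Pick the least-loaded secondary transfer server.

    loads maps a server number to (running_tasks, capacity). Servers that are
    full or have no capacity are skipped. Ties are broken by server number so
    the choice is deterministic. Returns None when no server can take the task.
    """
    ratios = {
        number: running / capacity
        for number, (running, capacity) in loads.items()
        if capacity > 0 and running < capacity
    }
    if not ratios:
        return None
    return min(sorted(ratios), key=ratios.get)
```

For example, with loads of 8/10, 3/10 and 10/10 on servers A, B and C, the sketch selects B and never considers C, which is full.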
In the present invention, a plurality of main transfer servers may be included. Fig. 2 is a schematic diagram of a main transfer server. As shown in Fig. 2, the main transfer server includes a receiving/returning module 310, an identification module 320, a monitoring module 330, a processing module 340, a storage module 350 and a distribution module 360, wherein,
the receiving/returning module is used for receiving the rendering task from the user client and passing the rendering task to the identification module;
the identification module is used for identifying, according to a preset rule, whether the rendering task belongs to a verified account; if not, the rendering task is returned to the user client; if so, a task form is created and stored in the storage module, and the rendering task is passed to the processing module;
the processing module is used for generating the rendering task allocation log according to the running dynamic data of the rendering server and storing the rendering task allocation log in the storage module;
the distribution module is used for distributing the rendering task to the matching secondary transfer server according to the rendering task allocation log;
the storage module is used for storing the task form and the rendering task allocation log.
In the invention, the main transfer server can be connected to a plurality of producer computers over a network, and is connected to the management server and all secondary transfer servers over a high-speed local area network. The receiving/returning module has a port for connecting to external computers; it receives the rendering task file uploaded from the producer's computer and passes the file to the identification module. The identification module determines, according to a preset rule, whether the uploaded file belongs to a verified account. If not, the file is returned to the source computer. If so, a task form is created (containing information such as the time, the source computer or source network IP, the task priority, the task size and the number of frames) and stored in the storage module, and the task is passed on to the processing module. The processing module generates a task allocation log according to the dynamic data of each rendering server and stores it in the storage module. The distribution module distributes the task to the corresponding secondary transfer server according to the task allocation log. The secondary transfer server distributes tasks according to task priority and the number of servers required. The storage module stores the task form and the allocation log.
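Distribution by task priority and required server count, as performed by the secondary transfer server, can be sketched with a priority queue. The conventions below are assumptions rather than part of the patent: a lower number means a higher priority, a task receives all the servers it needs or none, and a smaller task may be served when a higher-priority one does not fit.

```python
import heapq


def dispatch(tasks, idle_servers):
    """Assign idle rendering servers to tasks in priority order.

    tasks: iterable of (priority, task_number, servers_needed) tuples.
    Returns (assignments, waiting): a dict of task number to server list,
    and the task numbers that could not be served in this round.
    """
    queue = list(tasks)
    heapq.heapify(queue)              # smallest priority value is popped first
    idle = list(idle_servers)
    assignments, waiting = {}, []
    while queue:
        _priority, task_number, needed = heapq.heappop(queue)
        if needed <= len(idle):
            assignments[task_number] = idle[:needed]
            idle = idle[needed:]
        else:
            waiting.append(task_number)
    return assignments, waiting
```

Letting small tasks run while a large one waits keeps servers busy, at the cost of possibly delaying the large task; a stricter policy would stop at the first task that does not fit.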
Fig. 3 is a schematic diagram of a management server. As shown in Fig. 3, the management server includes a data acquisition module 410, a data storage module 420, and a trigger module 430, wherein,
the data acquisition module is used for acquiring the running dynamic data of the rendering server, forming a running form from the data and storing the running form in the data storage module, and is also used for sending abnormal information to the trigger module when an abnormal condition occurs;
the trigger module comprises an abnormal data model base; it looks up the abnormal information of the rendering server in the abnormal data model base, triggers the repair or feedback behavior corresponding to the abnormal information, records the operation in a rendering server log list, and stores the log list in the data storage module.
In order to improve the efficient management of the abnormal rendering server, the management server is required to automatically detect and repair the rendering server and feed back the rendering server to an administrator in time. The management server can be directly connected with a plurality of administrator computer networks and is connected with the main transfer server and all secondary transfer server local area networks at high speed. And dynamic data monitored by the application programs on the rendering servers are simultaneously transmitted to the secondary transfer server and the management server, so that the monitoring data of the management server and the monitoring data of the main transfer server are the same. The data acquisition module acquires dynamic monitoring data of each rendering server, wherein the dynamic monitoring data comprises the address, time, calculation task number, calculation time length, CPU utilization rate, memory utilization rate, network state, running state and the like of the rendering server, forms a form and stores the form in the storage module, and sends abnormal information to the trigger module if abnormal conditions occur. The trigger module comprises an abnormal data model base, and the administrator can add, modify, delete and the like to the model base. According to the preset rule of the model library, the trigger module searches the abnormal information of the rendering server in the model library, triggers the corresponding repairing or feedback behavior, records the operation in a log list of the rendering server, and stores the operation in the storage module. And if new abnormal information appears, updating the abnormal data model base in time.
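How a running record might be turned into abnormal information is sketched below. The patent names the monitored fields but gives no detection rule, so everything here is assumed for illustration: the RunningRecord class, the state labels, the anomaly strings, the 95 % memory limit and the order of the checks.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class RunningRecord:
    """One sample of running dynamic data (hypothetical field names)."""
    address: str
    time: datetime
    task_number: Optional[str]
    cpu_percent: float
    memory_percent: float
    network_ok: bool
    state: str  # assumed labels: "rendering", "idle", "stopped", "offline"


def detect_anomaly(rec, memory_limit=95.0):
    """Return an anomaly label for the record, or None if it looks healthy."""
    if not rec.network_ok:
        return "network_interrupted"
    if rec.state == "offline" and rec.task_number is None:
        return "offline_no_task"
    if rec.state == "stopped" and rec.task_number is not None:
        if rec.memory_percent >= memory_limit:   # assumed threshold
            return "memory_overflow_task_stopped"
        return "software_fault_task_stopped"
    return None
```

Returning a label rather than acting directly keeps detection separate from the repair decision, which the text assigns to the trigger module.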
The abnormal information in the model base of the trigger module includes (but is not limited to):
if the abnormal information is that the rendering server software is abnormal and the task has stopped, the trigger module automatically detects another matched rendering server to continue rendering, restarts the abnormal rendering server, and records the operation in the rendering server log list;
if the abnormal information is that the rendering server is offline and is not rendering any task, the trigger module restarts the rendering server and records the operation in the rendering server log list, and if the restart fails, sends the reminding information to the management client;
if the abnormal information is that the memory of the rendering server has overflowed and the task has stopped, the trigger module automatically detects another matched rendering server to continue rendering, restarts the abnormal rendering server, records the operation in the rendering server log list, and sends the reminding information to the management client;
if the abnormal information is that the network connection of the rendering server is interrupted and the server cannot be reached, the trigger module automatically sends the reminding information to the management client and records the operation in the rendering server log list;
and if the abnormal information indicates that the same abnormal condition occurs frequently on the rendering server, the trigger module automatically sends the reminding information to the management client and records the operation in the rendering server log list.
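The rules listed above can be read as a lookup table from abnormal information to actions. The following sketch illustrates that reading under stated assumptions: the anomaly labels, the action names, the `restart_ok` callback and the threshold of three repeats for "frequently" are hypothetical, and the fallback of reminding the administrator about an unknown anomaly is a choice made here, not taken from the description.

```python
from collections import Counter

# Hypothetical abnormal data model base: anomaly label -> ordered actions.
MODEL_BASE = {
    "software_abnormal": ["reassign_task", "restart_server"],
    "offline_no_task": ["restart_server"],  # a reminder is added only if the restart fails
    "memory_overflow": ["reassign_task", "restart_server", "remind_admin"],
    "network_interrupted": ["remind_admin"],
}


class TriggerModule:
    def __init__(self, model_base, restart_ok=lambda server: True, frequent_threshold=3):
        self.model_base = dict(model_base)            # editable by the administrator
        self.restart_ok = restart_ok                  # assumed hook reporting restart success
        self.frequent_threshold = frequent_threshold  # assumed meaning of "frequently"
        self.counts = Counter()
        self.log = []                                 # stands in for the rendering server log list

    def handle(self, server, anomaly):
        """Look up the anomaly, decide the actions, and record the operation."""
        self.counts[(server, anomaly)] += 1
        # Assumption: an anomaly missing from the model base is escalated to the administrator.
        actions = list(self.model_base.get(anomaly, ["remind_admin"]))
        if anomaly == "offline_no_task" and not self.restart_ok(server):
            actions.append("remind_admin")
        repeated = self.counts[(server, anomaly)] >= self.frequent_threshold
        if repeated and "remind_admin" not in actions:
            actions.append("remind_admin")
        self.log.append({"server": server, "anomaly": anomaly, "actions": actions})
        return actions
```

Keeping the model base as plain data means an administrator can add, modify or delete rules without touching the dispatch logic, which matches the editable model base the description calls for.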
In the invention, the management client supports reminders through a mobile phone app or by short message. By checking the reminding information, the administrator can add to or modify the abnormal data model base in the management server, thereby improving the content of the model base. The administrator can also set parameters (for example, to check how often the rendering nodes under a given secondary transfer server were offline in a certain time period), retrieve the abnormal data matching those parameters, and display it graphically. Based on the displayed data, the administrator can reasonably allocate rendering servers, replace abnormal servers, add local hard disks to servers and so on, optimizing the rendering farm more conveniently and in a more targeted way and improving rendering efficiency.
Fig. 4 shows statistics of the number of memory overflows in August 2017 for the rendering servers under secondary transfer server A. The administrator retrieves statistical data from the log by entering the parameters: rendering servers belonging to secondary transfer server A, August 2017, and number of memory overflows, and obtains the chart shown in Fig. 4. Based on this data, the administrator can check the memory and all tasks of rendering server A003, and expand the memory or modify the task allocation information in time. The administrator can also reasonably allocate rendering servers, replace abnormal servers, add local hard disks to servers and so on according to the displayed data, so that the rendering farm is optimized more conveniently and in a more targeted way and the rendering efficiency is improved.
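A query of this kind amounts to filtering the rendering server log list and counting per server. The sketch below assumes a hypothetical log entry layout (keys `secondary`, `server`, `anomaly`, and an ISO-formatted `time` string); the description does not define the log format.

```python
from collections import Counter


def overflow_counts(log, secondary, year, month):
    """Count memory-overflow entries per rendering server for one month.

    `log` is a list of dicts with the assumed keys
    "secondary", "server", "anomaly" and "time" (ISO 8601 string).
    """
    prefix = f"{year:04d}-{month:02d}"
    counts = Counter()
    for entry in log:
        if (entry["secondary"] == secondary
                and entry["anomaly"] == "memory_overflow"
                and entry["time"].startswith(prefix)):
            counts[entry["server"]] += 1
    return counts
```

Calling `overflow_counts(log, "A", 2017, 8).most_common()` would give the per-server totals, highest first, ready to be handed to whatever charting tool the management client uses.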
Fig. 5 is a flowchart of an automatic monitoring and repairing method for a cloud rendering system according to an embodiment of the present invention. As shown in Fig. 5, the method includes the following steps:
step S1: a producer sets task parameters through a user client and uploads a required rendering task to a main transfer server;
step S2: verifying, through the main transfer server, the registration information of the account that uploads the rendering task, automatically generating a task number after the verification is passed, distributing the rendering task to a matched secondary transfer server, and generating a rendering task allocation log;
specifically, the step S2 includes:
step S21: receiving the rendering task from the user client through a receiving/returning module, and transmitting the rendering task to an identification module;
step S22: identifying whether the rendering task belongs to a verified account or not through the identification module according to a preset rule, if not, feeding back the rendering task to the user client, and if so, creating a task form and storing the task form to a storage module, and meanwhile, transmitting the rendering task to a processing module;
step S23: generating the rendering task allocation log according to the running dynamic data of the rendering server through the processing module, and storing the rendering task allocation log to the storage module;
step S24: distributing the rendering task to the matched secondary transfer server through a distribution module according to the rendering task allocation log.
Step S3: the rendering server sends running dynamic data to the corresponding secondary transfer server, and the secondary transfer server sends the running dynamic data to a management server and the main transfer server;
step S4: the secondary transfer server distributes the rendering task to a matched rendering server for rendering according to the running dynamic data, and generates a secondary rendering task allocation log;
step S5: the rendering server executes the rendering task and sends an execution result to the user client through the corresponding secondary transfer server and the main transfer server after the rendering task is completed;
step S6: the management server automatically detects and repairs the rendering server according to the running dynamic data and sends reminding information to a management client;
specifically, the step S6 includes:
step S61: acquiring the running dynamic data of the rendering server through a data acquisition module, forming a running form, storing the running form in a data storage module, and sending abnormal information to a trigger module when an abnormal condition occurs;
step S62: the triggering module searches the abnormal information of the rendering server in the abnormal data model base, triggers the repairing or feedback behavior corresponding to the abnormal information, records the operation in a rendering server log list, and stores the operation in the data storage module.
Step S7: the management client checks the reminding information and corrects the abnormal information in the management server.
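The task path in steps S1 to S4 can be sketched as two small matching functions. This is an illustration under stated assumptions only: the method does not say how a "matched" server is chosen, so the sketch picks the secondary transfer server with the most idle rendering servers and then the idle rendering server with the lowest CPU utilization. The class name, dictionary keys and task number format are likewise hypothetical.

```python
import itertools
from datetime import datetime


class MainTransferServer:
    def __init__(self, verified_accounts, secondaries):
        self.verified_accounts = set(verified_accounts)
        # secondaries: name -> list of rendering server dicts with the
        # assumed keys "address", "state" and "cpu" (running dynamic data)
        self.secondaries = secondaries
        self.allocation_log = []
        self._ids = itertools.count(1)

    def submit(self, client_id, account, task):
        """Steps S21 to S24: verify the account, number the task, allocate it."""
        if account not in self.verified_accounts:
            return None  # S22: the task is fed back to the user client
        task_number = f"T{next(self._ids):06d}"
        # Assumption: "matched" means the secondary with the most idle servers.
        secondary = max(
            self.secondaries,
            key=lambda name: sum(r["state"] == "idle" for r in self.secondaries[name]),
        )
        self.allocation_log.append({
            "client_id": client_id,
            "account": account,
            "task_number": task_number,
            "first_allocation_time": datetime.now().isoformat(),
            "secondary": secondary,
        })
        return task_number, secondary


def pick_rendering_server(servers):
    """Step S4: choose the idle rendering server with the lowest CPU utilization."""
    idle = [r for r in servers if r["state"] == "idle"]
    return min(idle, key=lambda r: r["cpu"])["address"] if idle else None
```

The fields written to `allocation_log` follow the rendering task allocation log described in the claims (task source client ID, registered account, task number, first allocation time, matched secondary transfer server); the secondary transfer server would keep a similar record with the second allocation time and the matched rendering server.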
The automatic monitoring and repairing system and method for a cloud rendering system provided by the invention rely on the cloud rendering system to monitor rendering servers in real time and to manage thousands of rendering servers efficiently and intelligently. The monitoring data is analyzed and displayed in a more intuitive form, so that rendering servers can be reasonably allocated, abnormal servers replaced, local hard disks added to servers and so on, optimizing the rendering farm more conveniently and in a more targeted way and improving rendering efficiency.
While the invention has been described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (6)

1. An automatic monitoring and repairing system for a cloud rendering system, characterized by comprising a user client, a management client, a main transfer server, a management server, a secondary transfer server and a rendering server, wherein,
the user client is used for a producer to set task parameters and upload a required rendering task to the main transfer server;
the main transfer server is used for verifying the registration information of the account that uploads the rendering task, automatically generating a task number after the verification is passed, distributing the rendering task to a matched secondary transfer server, and generating a rendering task allocation log;
the secondary transfer server is used for receiving the running dynamic data of the rendering server, distributing the rendering task to the matched rendering server for rendering according to the running dynamic data, generating a secondary rendering task allocation log, and sending the running dynamic data to the management server and the main transfer server;
the rendering server is used for executing the rendering task and sending an execution result to the user client through the corresponding secondary transfer server and the main transfer server after the rendering task is completed;
the management server is used for automatically detecting and repairing the rendering server according to the running dynamic data and sending reminding information to the management client;
the management client is used for correcting the abnormal information in the management server by checking the reminding information;
the management server comprises a data acquisition module, a data storage module and a triggering module, wherein,
the data acquisition module is used for acquiring the running dynamic data of the rendering server to form a running form and store the running form in the data storage module, and is also used for sending abnormal information to the trigger module when abnormal conditions occur;
the trigger module comprises an abnormal data model base, searches the abnormal information of the rendering server in the abnormal data model base, triggers a repairing or feedback behavior corresponding to the abnormal information, records the operation in a rendering server log list and stores the operation in the data storage module;
if the abnormal information is that the rendering server software is abnormal and the task is stopped, the triggering module automatically detects other matched rendering servers to continue rendering, restarts the abnormal rendering servers and records the operation in the rendering server log list;
if the abnormal information is that the rendering server is in an off-line state and no task is rendered, the triggering module restarts the rendering server and records the operation in the rendering server log list, and if the restart is invalid, the reminding information is sent to the management client;
if the abnormal information is that the memory of the rendering server overflows and the task stops, the triggering module automatically detects other matched rendering servers to continue rendering, restarts the abnormal rendering server, records the operation in the rendering server log list and sends the reminding information to the management client;
if the abnormal information is that the rendering server is interrupted in the network and cannot be connected, the triggering module automatically sends the reminding information to the management client and records the operation in the rendering server log list;
and if the abnormal information indicates that the rendering server frequently has the same abnormal condition, the triggering module automatically sends the reminding information to the management client and records the operation in the rendering server log list.
2. The automatic monitoring and repairing system for a cloud rendering system of claim 1, wherein the rendering task allocation log comprises a task source user client ID, a client registration account number, a task number, a first allocation time and a matched secondary transfer server number, and the secondary rendering task allocation log comprises the task number, a second allocation time and a matched rendering server number.
3. The automatic monitoring and repairing system for a cloud rendering system of claim 1, wherein the main transfer server comprises a receiving/returning module, an identification module, a monitoring module, a processing module, a storage module and a distribution module, wherein,
the receiving/returning module is used for receiving the rendering task from the user client and transmitting the rendering task to the identification module;
the identification module is used for identifying whether the rendering task belongs to a verified account according to a preset rule, if not, the rendering task is fed back to the user client, if so, a task form is created and stored in the storage module, and meanwhile, the rendering task is transmitted to the processing module;
the processing module is used for generating the rendering task allocation log according to the running dynamic data of the rendering server and storing the rendering task allocation log to the storage module;
the distribution module is used for distributing the rendering task to the matched secondary transfer server according to the rendering task allocation log;
the storage module is used for storing the task form and the rendering task allocation log.
4. An automatic monitoring and repairing method for a cloud rendering system is characterized by comprising the following steps:
step S1: a producer sets task parameters through a user client and uploads a required rendering task to a main transfer server;
step S2: verifying, through the main transfer server, the registration information of the account that uploads the rendering task, automatically generating a task number after the verification is passed, distributing the rendering task to a matched secondary transfer server, and generating a rendering task allocation log;
step S3: the rendering server sends running dynamic data to the corresponding secondary transfer server, and the secondary transfer server sends the running dynamic data to the management server and the main transfer server;
step S4: the secondary transfer server distributes the rendering task to a matched rendering server for rendering according to the running dynamic data, and generates a secondary rendering task allocation log;
step S5: the rendering server executes the rendering task and sends an execution result to the user client through the corresponding secondary transfer server and the main transfer server after the rendering task is completed;
step S6: the management server automatically detects and repairs the rendering server according to the running dynamic data and sends reminding information to a management client;
step S7: the management client checks the reminding information and corrects the abnormal information in the management server;
the step S6 includes:
step S61: acquiring the running dynamic data of the rendering server through a data acquisition module, forming a running form, storing the running form in a data storage module, and sending abnormal information to a trigger module when an abnormal condition occurs;
step S62: the triggering module searches the abnormal information of the rendering server in an abnormal data model base, triggers a repairing or feedback behavior corresponding to the abnormal information, records the operation in a rendering server log list and stores the operation in the data storage module;
if the abnormal information is that the rendering server software is abnormal and the task is stopped, the triggering module automatically detects other matched rendering servers to continue rendering, restarts the abnormal rendering servers and records the operation in the rendering server log list;
if the abnormal information is that the rendering server is in an off-line state and no task is rendered, the triggering module restarts the rendering server and records the operation in the rendering server log list, and if the restart is invalid, the reminding information is sent to the management client;
if the abnormal information is that the memory of the rendering server overflows and the task stops, the triggering module automatically detects other matched rendering servers to continue rendering, restarts the abnormal rendering server, records the operation in the rendering server log list and sends the reminding information to the management client;
if the abnormal information is that the rendering server is interrupted in the network and cannot be connected, the triggering module automatically sends the reminding information to the management client and records the operation in the rendering server log list;
and if the abnormal information indicates that the rendering server frequently has the same abnormal condition, the triggering module automatically sends the reminding information to the management client and records the operation in the rendering server log list.
5. The automatic monitoring and repairing method for a cloud rendering system according to claim 4, wherein the rendering task allocation log includes a task source user client ID, a client registration account number, a task number, a first allocation time and a matched secondary transfer server number, and the secondary rendering task allocation log includes the task number, a second allocation time and a matched rendering server number.
6. The automatic monitoring and repairing method for a cloud rendering system according to claim 4, wherein the step S2 includes:
step S21: receiving the rendering task from the user client through a receiving/returning module, and transmitting the rendering task to an identification module;
step S22: identifying whether the rendering task belongs to a verified account or not through the identification module according to a preset rule, if not, feeding back the rendering task to the user client, and if so, creating a task form and storing the task form to a storage module, and meanwhile, transmitting the rendering task to a processing module;
step S23: generating the rendering task allocation log according to the running dynamic data of the rendering server through the processing module, and storing the rendering task allocation log to the storage module;
step S24: distributing the rendering task to the matched secondary transfer server through a distribution module according to the rendering task allocation log.
CN201711165385.7A 2017-11-21 2017-11-21 Automatic monitoring and repairing system and method for cloud rendering system Active CN107992392B (en)

Publications (2)

Publication Number Publication Date
CN107992392A (en) 2018-05-04
CN107992392B (en) 2021-03-23

Family

ID=62031870

Country Status (1)

Country Link
CN (1) CN107992392B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant