CN111343219B - Computing service cloud platform - Google Patents


Info

Publication number
CN111343219B
CN111343219B
Authority
CN
China
Prior art keywords
platform
computing
task
service
container
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811551571.9A
Other languages
Chinese (zh)
Other versions
CN111343219A (en)
Inventor
徐斌
李博文
杨洪雪
童彬祥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nuctech Technology Jiangsu Co ltd
Nuctech Co Ltd
Original Assignee
Nuctech Technology Jiangsu Co ltd
Nuctech Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nuctech Technology Jiangsu Co ltd, Nuctech Co Ltd filed Critical Nuctech Technology Jiangsu Co ltd
Priority to CN201811551571.9A
Publication of CN111343219A
Application granted
Publication of CN111343219B
Active legal status (Current)
Anticipated expiration legal status

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 67/00 Network arrangements or protocols for supporting network services or applications
    • H04L 67/01 Protocols
    • H04L 67/10 Protocols in which an application is distributed across nodes in the network
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L 41/06 Management of faults, events, alarms or notifications
    • H04L 41/0654 Management of faults, events, alarms or notifications using network fault recovery
    • H04L 41/0668 Management of faults, events, alarms or notifications using network fault recovery by dynamic selection of recovery network elements, e.g. replacement by the most appropriate element after failure
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L 41/30 Decision processes by autonomous network management units using voting and bidding
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 67/00 Network arrangements or protocols for supporting network services or applications
    • H04L 67/50 Network services
    • H04L 67/60 Scheduling or organising the servicing of application requests, e.g. requests for application data transmissions using the analysis and optimisation of the required network resources

Abstract

The invention provides a computing service cloud platform comprising a platform front end and a platform back end. The platform front end is at least used for providing a display interface to a user and receiving user input. The platform back end comprises a back-end server cluster with a plurality of computing nodes, where each computing node comprises containers carrying images that simulate an actual operating system interface for various operating environments; a container is started after a user creates a computing service task, and the image provides the corresponding operating environment through the display interface according to the type of the computing service request. The operating environments of various computing services are pre-configured in the task boot images, images carrying different operating environments are provided, and an actual operating system interface is simulated after the task is started, thereby providing the user with a friendly in-container operating environment/computing service platform. Meanwhile, a region selection function meets the server requirements of platform users in different network environments and avoids the influence of network factors on the service.

Description

Computing service cloud platform
Technical Field
The invention relates to the technical field of big data and artificial intelligence, and in particular to a computing service cloud platform.
Background
With the advent of the big data era and the continuous development of artificial intelligence technology, more and more engineers are dedicated to research on machine learning algorithms. Deep learning in particular often requires high-performance GPU resources to accelerate model training and testing. At the same time, deep learning practitioners must perform complex environment setup before training a model in a new environment, which consumes unnecessary time and labor. The demand for a deep learning computing platform that provides a friendly development environment and efficiently manages hardware resources is therefore increasingly urgent. In actual work, occasional factors such as network anomalies and server failures sometimes cause service interruptions, and operation and maintenance personnel cannot always resolve them in real time. Providing a highly available platform service is therefore also essential.
At present, computing service platforms at home and abroad are still at a development stage. Machine learning computing services are special in that they depend on complex build environments and high-performance hardware resources. When deep learning practitioners need to train a model, they require hardware resources such as GPUs; when training finishes, those resources can be released for other tasks that need them. An efficient deep learning platform therefore needs to dynamically allocate hardware resources such as GPUs according to task requests, and also to isolate the resources of different tasks from one another.
When a new task is started, a general-purpose service platform starts a container according to the corresponding resource requirements and provides services externally in containerized form. In the actual training process, however, operations such as parameter tuning and optimization are often required depending on the training results; services provided in this form are not user-friendly, since users wish to work inside the container.
Therefore, a new computing service platform is needed.
The above information disclosed in this background section is only for enhancement of understanding of the background of the invention and therefore it may contain information that does not constitute prior art that is already known to a person of ordinary skill in the art.
Disclosure of Invention
It is an object of the present invention to provide a cloud platform for computing services, which overcomes, at least to some extent, the problems caused by the limitations and disadvantages of the related art.
Additional features and advantages of the invention will be set forth in the detailed description which follows, or may be learned by practice of the invention.
According to an example embodiment of the invention, a computing service cloud platform is disclosed, characterized by comprising a platform front end and a platform back end, wherein:
A platform front end at least for providing a display interface to a user and receiving user input; and
the platform back end comprises a back-end server cluster with a plurality of computing nodes, where each computing node comprises containers carrying images that simulate an actual operating system interface for various operating environments; a container is started after a user creates a computing service task, and the image provides the corresponding operating environment through the display interface according to the type of the computing service request.
According to an exemplary embodiment of the present invention, the platform front end includes a task configuration page, configured to obtain and display an area tag list of all computing nodes in the back-end server cluster, so that a user may select a corresponding computing node area for running a computing service task according to a network environment.
According to an example embodiment of the present invention, a computing node that meets both the resource requirements of the computing service and the regional requirements for running the computing service task is selected among the plurality of computing nodes to provide the computing service.
According to an example embodiment of the present invention, the cloud platform runs a plurality of platform backend instances simultaneously, one of which is the main service while the others are standby services; when the main service is interrupted, an available standby service is automatically selected and switched to be the main service to continue providing computing services.
According to an example embodiment of the present invention, the cloud platform further includes a scheduler at least for scheduling the plurality of platform backend instances.
According to an example embodiment of the present invention, wherein the scheduler schedules the plurality of platform backend instances through a ZooKeeper service.
According to an example embodiment of the present invention, the cloud platform further includes a computing node maintenance interface, at least configured to, when a certain computing node needs maintenance, add the computing node to a maintenance list, so that new computing services are prevented from starting on that node while existing computing services on it continue to run normally until they exit.
According to an example embodiment of the present invention, the cloud platform further includes a task snapshot module, which is at least used for saving task data of the user in real time and restoring the task by creating the task snapshot.
According to an example embodiment of the present invention, the cloud platform further includes a platform monitoring module, which is at least configured to perform real-time monitoring on the cloud platform.
According to an example embodiment of the present invention, wherein the computing service is machine learning.
According to an example embodiment of the present invention, wherein the actual operating system interface is simulated by a VNC service.
According to an exemplary embodiment of the invention, wherein the container is based on Mesos and/or Docker container technology.
According to some example embodiments of the present invention, operating environments of a plurality of computing services are pre-configured in a task boot image, and images carrying different operating environments are provided, so that an actual operating system interface is simulated after a task is booted, thereby providing a user with a friendly in-container operating environment/computing service platform.
According to some exemplary embodiments of the present invention, by providing the region selection function, the server requirements of the platform user for different network environments are met, and the influence of network factors on the service is avoided.
According to some example embodiments of the present invention, by providing a multi-copy mode, service interruption due to a single point service failure is avoided, and by automatically switching working nodes, a platform is ensured to continuously and stably provide services.
According to some exemplary embodiments of the present invention, by providing a node maintenance function, the conflict between platform users and administrators over tasks running on a node is resolved, satisfying both the users' use of running tasks and the administrator's management of the node under maintenance.
According to other exemplary embodiments of the present invention, by providing the task snapshot function, the problem that task data must be backed up before node maintenance is solved, the platform users' needs to back up and restore tasks are met, and the security of users' snapshot images is guaranteed.
According to still other exemplary embodiments of the present invention, by providing the platform monitoring module, not only the status of the platform cluster and the usage of the resource can be monitored, but also the running status of the task on the platform can be monitored, and corresponding log information is collected, so as to implement automatic alarm for abnormal situations, and meet the monitoring and management requirements of the platform manager for the cluster and the platform.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
Drawings
The above and other objects, features and advantages of the present invention will become more apparent by describing in detail exemplary embodiments thereof with reference to the attached drawings.
FIG. 1 illustrates a block diagram of a computing services cloud platform, according to an example embodiment of the present invention.
Fig. 2 shows a schematic diagram of a scheduler scheduling multiple platform backend instances through a ZooKeeper service.
Fig. 3 shows a snapshot function flow diagram.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these example embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The drawings are merely schematic illustrations of the invention and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus their repetitive description will be omitted.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more example embodiments. In the following description, numerous specific details are provided to give a thorough understanding of example embodiments of the invention. One skilled in the relevant art will recognize, however, that the invention may be practiced without one or more of the specific details, or with other methods, components, steps, and so forth. In other instances, well-known structures, methods, implementations, or operations are not shown or described in detail to avoid obscuring aspects of the invention.
Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
The invention aims to provide a computing service cloud platform comprising a platform front end and a platform back end, wherein the platform front end is at least used for providing a display interface to a user and receiving user input; and the platform back end comprises a back-end server cluster with a plurality of computing nodes, where each computing node comprises containers carrying images that simulate an actual operating system interface for various operating environments; a container is started after a user creates a computing service task, and the image provides the corresponding operating environment through the display interface according to the type of the computing service request. The operating environments of various computing services are pre-configured in the task boot images, images carrying different operating environments are provided, and an actual operating system interface is simulated after the task is started, thereby providing the user with a friendly in-container operating environment/computing service platform.
Meanwhile, a region selection function meets the server requirements of platform users in different network environments and avoids the influence of network factors on the service; a multi-copy mode avoids service interruption caused by single-point failures, with automatic switching of working nodes ensuring that the platform continuously and stably provides services; a node maintenance function resolves the conflict between platform users and administrators over tasks running on a node, satisfying both the users' use of running tasks and the administrator's management of the node under maintenance; a task snapshot function solves the problem that task data must be backed up before node maintenance, meets platform users' needs to back up and restore tasks, and guarantees the security of users' snapshot images; and a platform monitoring module monitors the state of the platform cluster and its resource usage as well as the running state of tasks on the platform, collects corresponding log information, raises automatic alarms for abnormal conditions, and meets the platform administrator's monitoring and management requirements for the cluster and the platform.
The computing services cloud platform of the present invention is described in detail below with reference to fig. 1-3, wherein fig. 1 illustrates a block diagram of a computing services cloud platform according to an exemplary embodiment of the present invention; FIG. 2 illustrates a schematic diagram of a scheduler scheduling multiple platform backend instances through a ZooKeeper service; fig. 3 shows a snapshot function flow diagram. It should be particularly noted that, in the following embodiments, the computing service cloud platform of the present invention is mainly described by taking a machine learning computing service as an example, but the computing service cloud platform of the present invention is not limited thereto, and may provide other types of computing services.
The computing service cloud platform of the invention is based on Mesos/Docker container technology (although not limited thereto), is oriented to computing services such as machine learning, provides computing resources on demand, provides a friendly development environment, and has high availability. The service provided by the platform mainly aims at providing a customized development environment for computing services such as deep learning. To improve the stability of the platform service, the invention designs a computing platform scheme that supports region selection, task snapshot saving and restoring, real-time resource monitoring, a computing node maintenance mode, and high service availability. Through these modules, a friendly deep learning development environment is provided to users, while service stability and disaster tolerance are also enhanced to a certain degree.
The platform uses Docker and Mesos containerization technologies. Owing to advantages such as easy deployment, light weight, and resource isolation, containerization is generally applied to scenarios that require continuous service provision. In deep learning development, however, operations such as parameter tuning and optimization are frequently required, and generic container services cannot meet this need. The container boot images provided by the platform have a built-in VNC (Virtual Network Computing) service, a lightweight remote desktop technology that allows access to and control of a desktop application regardless of operating system (for example, a Windows computer can control a Linux system or an Apple Mac OS, or vice versa). After a task is started, the user is given a visible operating interface inside the container, where operation is essentially the same as on an actual operating system. The VNC service provided by the invention can be accessed from any operating system through a browser.
As described in detail below in conjunction with fig. 1-3.
As shown in fig. 1, the computing service cloud platform of the present invention includes a platform front end 1 and a platform back end 2, where the platform front end 1 is at least used for providing a display interface to a user and receiving user input (such as a computing service request); and the platform back end 2 comprises a back-end server cluster 21 with a plurality of computing nodes, where each computing node comprises containers carrying images that simulate an actual operating system interface for various operating environments; a container is started after a user creates a computing service task, and the image provides the corresponding operating environment through the display interface according to the type of the computing service request. The operating environments of various computing services are pre-configured in the task boot images, images carrying different operating environments are provided, and an actual operating system interface is simulated after the computing service task is started, thereby providing the user with a friendly in-container operating environment/computing service platform.
According to an example embodiment of the present invention, the computing service is machine learning and may include a mainstream deep learning framework and a deep learning environment such as an integrated development environment (IDE).
According to an example embodiment of the present invention, wherein the actual operating system interface is simulated by a VNC service.
According to an exemplary embodiment of the present invention, wherein the container is based on Mesos and/or Docker container technology.
That is, the cloud platform solution designed by the invention pre-configures computing service environments, such as a mainstream deep learning framework and an integrated development environment (IDE), in the task boot images and provides images carrying different environments; after a task is started, a customized VNC (Virtual Network Computing) service automatically allocates an address and a port and simulates an actual operating system interface, thereby providing the user with a friendly in-container operating environment.
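As a hedged illustration only, the mapping from a request's computing-service type to a pre-built image, and the assembly of a container launch command exposing the in-container VNC port, might be sketched as follows. The patent discloses no code; the registry name, image names, port numbers, and function names below are all hypothetical.

```python
# Hypothetical sketch: choose a pre-built image by task type and build a
# container launch command exposing the in-container VNC service port.
# All image names and ports are illustrative, not taken from the patent.

TASK_IMAGES = {
    "tensorflow": "registry.example.com/dl/tensorflow-vnc:latest",
    "pytorch": "registry.example.com/dl/pytorch-vnc:latest",
    "caffe": "registry.example.com/dl/caffe-vnc:latest",
}

VNC_PORT_IN_CONTAINER = 5901  # conventional VNC display :1 port

def build_launch_command(task_type: str, host_port: int, gpus: int) -> str:
    """Return a docker run command string for the requested environment."""
    image = TASK_IMAGES[task_type]  # unknown task types raise KeyError
    return (
        f"docker run -d --gpus {gpus} "
        f"-p {host_port}:{VNC_PORT_IN_CONTAINER} {image}"
    )

cmd = build_launch_command("pytorch", host_port=31001, gpus=2)
print(cmd)
```

The user would then point a browser-based VNC client at the allocated host address and port, matching the browser-access behavior the description attributes to the platform's VNC service.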
According to an exemplary embodiment of the present invention, the platform front end includes a task configuration page 11, configured to obtain and display an area tag list of all computing nodes in the back-end server cluster, so that a user may select a corresponding computing node area for running a computing service task according to a network environment.
According to an example embodiment of the present invention, a computing node is selected among a plurality of computing nodes to provide computing services that meets the resource requirements needed for the computing services and meets the regional requirements of the computing nodes running the computing services task.
Specifically, a platform back-end server cluster may generally consist of servers in different network environments; when a user requests a job task, the platform assigns the task to a computing node in the cluster that meets the resource conditions. In actual deep learning development, training samples or models often exceed a gigabyte in size, so network limitations on data transmission between the user's local machine and the computing node are one of the factors affecting development efficiency. The invention provides a method for starting a task on a designated computing node region, specifically:
(1) when the platform back-end cluster is deployed, corresponding region labels, such as region A and region B, are set for all computing nodes;
(2) a task configuration page 11 is created at the platform front end, which obtains and displays the region label list of all computing nodes in the cluster so that the user can select a suitable region option according to the network environment;
(3) when the platform processes a request for a new task, it obtains the resource amounts and node region labels of multiple nodes through a sort strategy, selects a computing node that satisfies both the computing resource and region requirements to assign the task, and then starts the container and services on that node. The user can then remotely access the VNC service through the allocated network address and port;
through region selection, deep learning developers can efficiently access the corresponding remote public data resources, avoiding the efficiency loss caused by network transmission.
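The node-selection step described above can be sketched in a few lines. This is a minimal illustration under assumed data shapes (the field names and the "most free GPUs first" sort strategy are hypothetical; the patent only says a sort strategy over resource amounts and region labels is used):

```python
# Hypothetical sketch of the node-selection step: among resource offers,
# keep nodes whose region label matches the user's choice and whose free
# resources satisfy the request, then apply a simple sort strategy
# (here: most free GPUs first). Field names are illustrative.

def select_node(offers, region, need_gpu, need_mem_gb):
    candidates = [
        o for o in offers
        if o["region"] == region
        and o["free_gpu"] >= need_gpu
        and o["free_mem_gb"] >= need_mem_gb
    ]
    if not candidates:
        return None  # no node satisfies both region and resource needs
    return max(candidates, key=lambda o: o["free_gpu"])

offers = [
    {"ip": "10.0.0.1", "region": "A", "free_gpu": 2, "free_mem_gb": 64},
    {"ip": "10.0.0.2", "region": "B", "free_gpu": 4, "free_mem_gb": 128},
    {"ip": "10.0.0.3", "region": "B", "free_gpu": 1, "free_mem_gb": 32},
]
chosen = select_node(offers, region="B", need_gpu=2, need_mem_gb=64)
print(chosen["ip"])  # 10.0.0.2
```

Returning None when no candidate exists mirrors the case where no node in the chosen region has sufficient free resources, in which case the task would presumably wait or fail.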
According to an example embodiment of the present invention, a cloud platform runs a plurality of platform backend instances simultaneously, one of the platform backend instances is a main service, and the other platform backend instances are standby services, and when the main service is interrupted, an available standby service is automatically selected and switched to the main service to continue providing computing services.
According to an example embodiment of the present invention, the cloud platform further includes a scheduler 3 at least for scheduling the plurality of platform backend instances.
According to an example embodiment of the present invention, the scheduler 3 schedules multiple platform backend instances through the ZooKeeper service, as shown in fig. 2.
Specifically, a platform generally provides a single-point service, i.e., only a single service instance connects to the back-end cluster and serves the front end. Once the server hosting that service suffers a network anomaly or failure, the service becomes unavailable and requests such as user access or task creation can no longer be answered. The invention provides a scheme for the service interruption caused by single-point failure: when a single-point failure occurs, the platform automatically switches to another standby node to continue providing the service. Specifically, the method comprises the following steps:
(1) the cluster consists of several Master nodes and other Agent nodes; the Master nodes are responsible for collecting the computing resources of all Agent nodes and reporting the resource list to the platform, and the platform allocates tasks according to the computing resources reported by the Master nodes;
(2) there are usually several Master nodes in a cluster, one of which is the Leader while the others are Followers; normally the Leader reports resources to the platform and the others stand by. When the Leader is cut off from the cluster by a network anomaly, failure, or other factor, the cluster elects one of the Followers to be promoted to Leader and take over that role;
(3) similarly, the component responsible for communicating with the cluster, scheduling tasks, and providing an interface to the front end, hereinafter called the scheduler, may also suffer service interruption due to a node network anomaly or failure. The scheduler must register with the cluster before the Master node reports resources to it. If multiple schedulers were simply all registered with the cluster, the Master node would report resources to all of them, which is clearly infeasible: on the one hand, since a resource can only be allocated to one task, the schedulers would easily conflict when processing resources; on the other hand, the cluster Master node would report resources to each scheduler in turn, and schedulers would have to wait for resources, greatly reducing efficiency;
(4) to solve these problems, the invention uses the open-source ZooKeeper (a distributed application coordination service) and designs a reasonable multi-copy scheduling algorithm. Typically, N schedulers are started and registered with ZooKeeper, and each scheduler is temporarily assigned a sequentially increasing sequence number upon registration. ZooKeeper is responsible for checking whether every scheduler is still connected, and the service with the smallest sequence number is elected Leader while the others are Followers. When one of the services fails, i.e., loses its connection to ZooKeeper, a new Leader is re-elected among the remaining N-1 services. Under the scheduling algorithm of the invention, when a non-Leader loses its connection, re-election does not affect the status of the current Leader; only when the Leader's connection is lost is a new Leader elected, which avoids the repeated scheduler registration that frequent Leader switching would cause. When a scheduler that lost its connection reconnects to ZooKeeper, it is assigned a new sequence number.
(5) all schedulers connected to ZooKeeper start a Web service on the specified port at startup, i.e., they provide a RESTful API to the platform front end to respond to user requests. Only the scheduler holding the Leader role registers with the cluster to receive resource reports, process tasks, and so on; that is, only one scheduler registers with the cluster, effectively avoiding conflicts. The invention specifies that the remaining Follower schedulers do not interact with the cluster but can still handle front-end requests, forwarding any Web request they receive to the Leader node for processing.
(6) because several schedulers provide the Web service, service interruption caused by a single-point failure is effectively avoided: the platform front end can send a user request to any working scheduler, and after forwarding, the actual work is ultimately performed by the Leader scheduler, effectively avoiding conflicts.
(7) When the Follower is converted into a Leader, the information of the previous Leader scheduler needs to be inherited, mainly including data such as a front-end request received by the scheduler, tasks running in a cluster and the like, which relates to real-time synchronization of the database. Because when a Leader interrupt is often a loss of connection, the data of the Leader will not be acquired. However, if each Master stores data synchronously, the efficiency is also low due to factors such as network. The method reasonably utilizes the ZooKeeper characteristic, namely, the ZooKeeper plays a role in Leader election and executes the work of a high-availability database. The scheduler may keep the data in the ZooKeeper in the form of key-value pairs while maintaining the connection with the ZooKeeper, and all schedulers registered with the ZooKeeper may access the data therein.
(8) When a Follower is promoted to Leader, it obtains the current task information, including task data in the waiting, running, and finished states, by reading the data under an agreed path in ZooKeeper. When it re-registers with the cluster, it can thus inherit the original scheduler's data and continue working.
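The state-inheritance step in (7)-(8) amounts to reading task records back from a shared key-value store under an agreed path prefix. The following is a minimal Python sketch, not the patent's code: `SharedStore` stands in for ZooKeeper's znode storage, and the path prefix `/scheduler/tasks/` is an assumed, illustrative convention:

```python
# Hypothetical sketch of state inheritance: every scheduler writes task
# state to a shared key-value store (ZooKeeper in the text), so a
# Follower promoted to Leader can rebuild its task table by reading the
# waiting, running, and finished tasks under an agreed path prefix.

class SharedStore:
    """Stands in for ZooKeeper's key-value (znode) storage."""

    def __init__(self):
        self._data = {}

    def put(self, path, value):
        self._data[path] = value

    def read_tree(self, prefix):
        # Return every entry whose path begins with the given prefix.
        return {p: v for p, v in self._data.items() if p.startswith(prefix)}


def inherit_tasks(store, prefix="/scheduler/tasks/"):
    # A newly promoted Leader rebuilds its task table from the store,
    # grouping task IDs by their recorded state.
    tasks = {"waiting": [], "running": [], "finished": []}
    for path, state in store.read_tree(prefix).items():
        task_id = path[len(prefix):]
        tasks[state].append(task_id)
    return tasks
```

With the store doubling as the election service, no extra database replication machinery is needed: whichever scheduler wins the election reads the same task tree the old Leader wrote.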
According to an example embodiment of the present invention, the cloud platform further includes a computing node maintenance interface 4, at least configured to, when a computing node needs maintenance, add that node to a maintenance list, ensuring that the existing computing services on the node keep running normally until they have exited while preventing new computing services from being started on it.
In particular, during actual operation of the platform a computing node may fail or need to be taken down for maintenance. Generally, users do not want the computing service tasks running on that node to be terminated immediately, while the administrator does not want new computing service tasks created on it; the node should be restarted or removed from the cluster only after its computing service tasks have gradually exited. The present invention therefore provides a cluster node maintenance scheme, specifically:
(1) the platform provides the node maintenance interface 4 for the administrator; when a request is made to add a maintenance node, a record of that node's IP address is added to the database;
(2) when the scheduler 3 allocates resources to start a task, the IP address of the node offering the resources is matched against the records in the maintenance list; if the node is in the maintenance list, no new task is started on it;
(3) forcibly closing a task on a maintenance node, or letting the task exit normally, is not affected by maintenance;
(4) after the tasks on the node have exited one by one, the administrator begins maintaining it;
(5) after maintenance is finished, the corresponding interface is called to remove the node from the maintenance list, and the node accepts new tasks again.
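Steps (1)-(5) above reduce to two small checks in the scheduler: skip resource offers from nodes on the maintenance list, and allow maintenance to begin only once a node has drained. A minimal Python sketch (illustrative only; the function and field names are assumptions, not the patent's API):

```python
# Hypothetical sketch of the maintenance filter: resource offers from
# nodes whose IP is on the maintenance list are declined for new tasks,
# while tasks already running on those nodes are left untouched.

def usable_offers(offers, maintenance_list):
    """offers: list of (node_ip, resources) pairs.

    Returns only the offers from nodes not under maintenance, so the
    scheduler never starts a new task on a maintenance node.
    """
    return [o for o in offers if o[0] not in maintenance_list]


def node_drained(running_tasks, node_ip):
    # Maintenance may begin once no running task remains on the node;
    # existing tasks are allowed to exit (or be force-closed) normally.
    return all(t["node"] != node_ip for t in running_tasks)
```

After maintenance, removing the IP from `maintenance_list` is all it takes for the node's offers to pass the filter again.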
According to an example embodiment of the present invention, the cloud platform further includes a task snapshot module 5, at least configured to save a user's task data in real time and restore a task by creating a task snapshot; fig. 3 shows a flowchart of the snapshot function of the computing service cloud platform according to the present invention.
Specifically, during actual operation of the platform, a node may need to be shut down for maintenance, a user may need to save the task data inside a container, or a task may need to be restarted. In general, container data is non-persistent: when the server hosting the container reboots, the task ends, or similar events occur, the data in the container is deleted to free space. To solve this problem, the present invention designs a scheme for efficiently saving the data inside a user's container, specifically:
(1) A union file system is a lightweight, high-performance layered file system; it commits each set of modifications as a layer stacked on the layers below, and can mount different directories under the same virtual file system. This is the container file system adopted by the Mesos deployment the platform uses. A Docker image itself consists of multiple file layers, each with a unique ID. When Mesos starts a container from a Docker image, a new readable-writable layer is mounted for the container on top of the image file system, and all content updates inside the container take place on that readable-writable layer.
(2) Container data can be saved as an image file and uploaded to an image registry, from which the next task can conveniently be restored. When saving a container, packing all files in the working directory would not only lose the layer information of the original image but also take a long time to pack and upload to the registry. Based on this Docker image characteristic, the invention provides an efficient image-packing scheme: a process that packs the snapshot image is started on the computing node where the task runs, the image is built from the running container, only the modified files generated since the container started are copied, and the result is finally uploaded to the image registry used by the cluster. On the one hand, because the packing process runs on the computing node hosting the task, it can reuse the image cache pulled when the task started, avoiding the time needed to pull the base image again. On the other hand, building the image from the container means only the modified layer needs to be added to the image, i.e., the layer information of the original image is preserved, which speeds up packing the snapshot image file and reduces the amount of data submitted when uploading it. When a user later starts a new task from the snapshot image, the task may be assigned to a different computing node; at that point, pulling the snapshot image only requires pulling its modification layer, which speeds up the start of the new task;
(3) once the process that packs the snapshot image finishes, it automatically uploads the snapshot image to the registry by obtaining the registry address used when the task was started. At the same time, the corresponding information is updated in the user database, and the snapshot image is visible only to the task's creator, preventing unauthorized use by other users;
(4) when a user restores a task on the platform from the image file, resources in amounts different from those of the original task may be specified; when the platform creates the new container, it assigns the task to a new node that satisfies the conditions and provides the service address of the new task;
(5) the user may continue to save new snapshots from a task started from a snapshot image.
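The core saving in steps (1)-(5) is that an image is just an ordered list of layers, so a snapshot only has to ship the container's modification layer. A minimal Python sketch of that bookkeeping (illustrative only; layer IDs are modeled as plain strings, not real Docker digests):

```python
# Hypothetical sketch of the layer-based snapshot idea: a base image is
# a list of layer IDs, the running container adds one read-write layer
# on top, and a snapshot push transfers only the layers the registry
# does not already hold.

def snapshot_layers(base_layers, rw_layer):
    # The snapshot image keeps the base image's layer information intact
    # and appends the container's modified (read-write) layer on top.
    return base_layers + [rw_layer]


def layers_to_upload(snapshot, registry_layers):
    # Only layers missing from the registry are transferred; since the
    # registry already holds the base image, pushing (and later pulling)
    # the snapshot moves just the modification layer.
    return [layer for layer in snapshot if layer not in registry_layers]
```

The same asymmetry speeds up restore on a different node: that node pulls the cached base layers locally (or from the registry) and only fetches the small modification layer of the snapshot.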
According to an example embodiment of the present invention, the cloud platform further includes a platform monitoring module 6, at least configured to monitor the cloud platform in real time.
Specifically, the platform monitoring performed by the platform monitoring module 6 is divided into Mesos cluster container monitoring and host machine monitoring; the running platform system is monitored in real time so that abnormal behavior of user tasks can be diagnosed and early warnings issued.
(1) Host machines are monitored with Zabbix (an enterprise-grade open-source solution that provides distributed system monitoring and network monitoring through a Web interface), monitoring the platform's server hardware (disks, fans, mainboards, etc.) and host systems (CPU, GPU, memory, IO, etc.) in real time;
(2) the platform obtains index data for Docker containers and Mesos containers through the API provided by cAdvisor (an open-source tool developed by Google for analyzing the resource usage and performance indices of running containers; it runs as a daemon responsible for collecting, aggregating, processing, and exporting information about running containers). Using Zabbix's time-series data storage, visualization, and alarm functions, the Docker and Mesos containers are monitored in real time, and problematic container processes are traced and diagnosed from changes in the monitored indices;
(3) the platform collects and analyzes the logs of running tasks (containers) and of platform processes on the host machines, extracts the key information from the logs, and thereby quickly locates problems and raises early warnings during platform operation and maintenance.
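The alarm side of points (1)-(3) — comparing collected metric samples against thresholds and emitting alerts — can be sketched in a few lines of Python. This is an illustrative simplification, not the platform's Zabbix configuration; the metric names and threshold format are assumptions:

```python
# Hypothetical sketch of threshold-based alerting over monitoring data:
# metric samples (as cAdvisor/Zabbix would report them) are checked
# against per-metric limits and turned into alert messages.

def check_metrics(samples, thresholds):
    """samples: {metric: value}; thresholds: {metric: max_allowed}.

    Returns one human-readable alert string per metric whose value
    exceeds its configured limit; metrics without a limit are ignored.
    """
    alerts = []
    for metric, value in samples.items():
        limit = thresholds.get(metric)
        if limit is not None and value > limit:
            alerts.append(f"{metric}: {value} exceeds limit {limit}")
    return alerts
```

A real deployment would feed these alerts into Zabbix's notification pipeline and correlate them with the extracted log keywords to localize the failing container process.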
From the foregoing detailed description, those skilled in the art will readily appreciate that a computing services cloud platform in accordance with embodiments of the present invention has one or more of the following advantages.
According to some example embodiments of the present invention, the operating environments of multiple computing services are pre-configured in the task boot image, and images carrying different operating environments are provided, so that an actual operating-system interface is simulated after a task starts, giving the user a friendly in-container operating environment / computing service platform.
According to some exemplary embodiments of the present invention, by providing the region selection function, the server requirements of the platform user for different network environments are met, and the influence of network factors on the service is avoided.
According to some example embodiments of the present invention, by providing a multi-copy mode, service interruption due to a single point service failure is avoided, and by automatically switching working nodes, a platform is ensured to continuously and stably provide services.
According to some exemplary embodiments of the present invention, providing the node maintenance function resolves the conflict between platform users and the administrator over tasks running on a node, satisfying both the users' use of their running tasks and the administrator's management of the maintenance node.
According to other exemplary embodiments of the present invention, providing the task snapshot function solves the need to back up task data before node maintenance, satisfies platform users' requirements for backing up and restoring tasks, and also ensures the security of users' snapshot images.
According to still other exemplary embodiments of the present invention, providing the platform monitoring module makes it possible to monitor not only the status of the platform cluster and its resource usage but also the running status of tasks on the platform, collecting the corresponding log information to raise automatic alarms for abnormal situations and meet the administrator's monitoring and management requirements for the cluster and the platform.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This invention is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims (8)

1. A computing service cloud platform is characterized by comprising a platform front end and a platform back end, wherein
The platform front end is at least used for providing a display interface for a user and receiving user input; and
the platform back end comprises a back-end server cluster including a plurality of computing nodes, wherein any computing node comprises a container carrying images of various operating environments that simulate an actual operating-system interface, the container is started after a user creates a computing service task, and the images are used to provide the corresponding operating environment through the display interface according to the type of the computing service request;
the platform front end comprises a task configuration page, at least configured to acquire and display a region label list of all computing nodes in the back-end server cluster, so that a user can select, according to the network environment, the computing node region in which the computing service task is to run;
a computing node that satisfies both the resource requirement of the computing service and the region requirement for running the computing service task is selected from the plurality of computing nodes to provide the computing service;
wherein the computing service is machine learning;
the system also comprises a task snapshot module at least used for storing task data of a user in real time and restoring a task by creating a task snapshot;
and for container data, starting a process for executing a packed snapshot mirror image on a computing node where a task is located, copying only a modified file generated after the container is started based on the container start mirror image, and finally uploading the modified file to a mirror image warehouse used by a back-end server cluster.
2. The cloud platform of claim 1, wherein the cloud platform runs multiple platform backend instances simultaneously, one of the platform backend instances is a primary service and the others are backup services, and when the primary service is interrupted, the available backup services are automatically selected and switched to the primary service to continue providing computing services.
3. The cloud platform of claim 2, further comprising a scheduler at least to schedule the plurality of platform backend instances.
4. The cloud platform of claim 3, wherein the scheduler schedules the plurality of platform backend instances through a ZooKeeper service.
5. The cloud platform of claim 1, further comprising a computing node maintenance interface at least configured to, when a computing node needs maintenance, add the computing node to a maintenance list, ensuring that existing computing services on the node run normally until they have exited while restricting new computing services from starting on the node.
6. The cloud platform of claim 1, further comprising a platform monitoring module to at least monitor the cloud platform in real time.
7. The cloud platform of claim 1, wherein the actual operating system interface is simulated by a VNC service.
8. The cloud platform of claim 1, wherein the containers are based on Mesos and/or Docker container technology.
CN201811551571.9A 2018-12-18 2018-12-18 Computing service cloud platform Active CN111343219B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811551571.9A CN111343219B (en) 2018-12-18 2018-12-18 Computing service cloud platform

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811551571.9A CN111343219B (en) 2018-12-18 2018-12-18 Computing service cloud platform

Publications (2)

Publication Number Publication Date
CN111343219A CN111343219A (en) 2020-06-26
CN111343219B true CN111343219B (en) 2022-08-02

Family

ID=71186777

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811551571.9A Active CN111343219B (en) 2018-12-18 2018-12-18 Computing service cloud platform

Country Status (1)

Country Link
CN (1) CN111343219B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113032093B (en) * 2021-03-05 2024-01-09 北京百度网讯科技有限公司 Distributed computing method, device and platform
CN113051055A (en) * 2021-03-24 2021-06-29 北京沃东天骏信息技术有限公司 Task processing method and device
CN113282419B (en) * 2021-06-07 2022-07-08 国家超级计算天津中心 Resource scheduling method, electronic device, and computer-readable storage medium
CN115277457A (en) * 2022-07-28 2022-11-01 卡奥斯工业智能研究院(青岛)有限公司 Server control method, server and storage medium
CN116980346B (en) * 2023-09-22 2023-11-28 新华三技术有限公司 Container management method and device based on cloud platform

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102769615A (en) * 2012-07-02 2012-11-07 北京大学 Task scheduling method and system based on MapReduce mechanism
CN107450961A (en) * 2017-09-22 2017-12-08 济南浚达信息技术有限公司 A kind of distributed deep learning system and its building method, method of work based on Docker containers
CN107612736A (en) * 2017-09-21 2018-01-19 成都安恒信息技术有限公司 A kind of web browser operation audit method based on container
CN108388472A (en) * 2018-03-01 2018-08-10 吉林大学 A kind of elastic task scheduling system and method based on Docker clusters
CN108388460A (en) * 2018-02-05 2018-08-10 中国人民解放军战略支援部队航天工程大学 Long-range real-time rendering platform construction method based on graphics cluster

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105357296B (en) * 2015-10-30 2018-10-23 河海大学 Elastic caching system under a kind of Docker cloud platforms
CN105607954B (en) * 2015-12-21 2019-05-14 华南师范大学 A kind of method and apparatus that stateful container migrates online
CN105631196B (en) * 2015-12-22 2018-04-17 中国科学院软件研究所 A kind of container levels flexible resource feed system and method towards micro services framework
US10044640B1 (en) * 2016-04-26 2018-08-07 EMC IP Holding Company LLC Distributed resource scheduling layer utilizable with resource abstraction frameworks
CN107577496B (en) * 2017-09-15 2020-11-10 济南浚达信息技术有限公司 Docker-based desktop cloud management platform deployment system and working method and application thereof

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102769615A (en) * 2012-07-02 2012-11-07 北京大学 Task scheduling method and system based on MapReduce mechanism
CN107612736A (en) * 2017-09-21 2018-01-19 成都安恒信息技术有限公司 A kind of web browser operation audit method based on container
CN107450961A (en) * 2017-09-22 2017-12-08 济南浚达信息技术有限公司 A kind of distributed deep learning system and its building method, method of work based on Docker containers
CN108388460A (en) * 2018-02-05 2018-08-10 中国人民解放军战略支援部队航天工程大学 Long-range real-time rendering platform construction method based on graphics cluster
CN108388472A (en) * 2018-03-01 2018-08-10 吉林大学 A kind of elastic task scheduling system and method based on Docker clusters

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Performance optimization analysis based on Docker virtualization technology; Liu Shengqiang et al.; Automation & Instrumentation; 2018-11-25 (No. 11); full text *

Also Published As

Publication number Publication date
CN111343219A (en) 2020-06-26

Similar Documents

Publication Publication Date Title
CN111343219B (en) Computing service cloud platform
US11797395B2 (en) Application migration between environments
CN111966305B (en) Persistent volume allocation method and device, computer equipment and storage medium
CN111488241B (en) Method and system for realizing agent-free backup and recovery operation in container arrangement platform
US11663085B2 (en) Application backup and management
US7992032B2 (en) Cluster system and failover method for cluster system
US9870291B2 (en) Snapshotting shared disk resources for checkpointing a virtual machine cluster
US9851989B2 (en) Methods and apparatus to manage virtual machines
CN112437915A (en) Method for monitoring multiple clusters and application programs on cloud platform
US8260840B1 (en) Dynamic scaling of a cluster of computing nodes used for distributed execution of a program
US9069465B2 (en) Computer system, management method of computer resource and program
US8117641B2 (en) Control device and control method for information system
US11520506B2 (en) Techniques for implementing fault domain sets
CN112424750A (en) Multi-cluster supply and management method on cloud platform
CN112424751A (en) Cluster resource allocation and management method on cloud platform
CN113971095A (en) KUBERNETES application program interface in extended process
CN112424752A (en) Volume (storage) supply method of application program container on cloud platform
US11256576B2 (en) Intelligent scheduling of backups
CN113849137A (en) Visual block storage method and system for Shenwei container platform
CN115102851B (en) Fusion platform for HPC and AI fusion calculation and resource management method thereof
US11579780B1 (en) Volume remote copy based on application priority
CN116820686B (en) Physical machine deployment method, virtual machine and container unified monitoring method and device
CN115686802B (en) Cloud computing cluster scheduling system
CN116166343A (en) Cluster process arranging method, system, device and medium
GB2622918A (en) Device health driven migration of applications and its dependencies

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant