CN109961151B - System of computing services for machine learning and method for machine learning - Google Patents

System of computing services for machine learning and method for machine learning

Info

Publication number
CN109961151B
Authority
CN
China
Prior art keywords
machine learning
module
providing
training
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711391831.6A
Other languages
Chinese (zh)
Other versions
CN109961151A (en)
Inventor
李博文
杨洪雪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tongfang Vision Technology Jiangsu Co ltd
Original Assignee
Tongfang Vision Technology Jiangsu Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tongfang Vision Technology Jiangsu Co ltd filed Critical Tongfang Vision Technology Jiangsu Co ltd
Priority to CN201711391831.6A priority Critical patent/CN109961151B/en
Publication of CN109961151A publication Critical patent/CN109961151A/en
Application granted granted Critical
Publication of CN109961151B publication Critical patent/CN109961151B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Abstract

A system of computing services for machine learning and a method for machine learning are disclosed, relating to the field of computer information processing. The system comprises: an online experiment module for providing basic services for a user's machine learning code, the basic services including data maintenance, data editing and code running; a training and testing module, implemented with Docker containers, for training and testing machine learning tasks submitted by the user; and a remote debugging module for providing the user with a remote debugging function for the training and testing module. The system of computing services for machine learning and the method for machine learning allow the environment deployment and data selection of a machine learning task to be carried out simply and quickly, meet the special requirements of machine learning tasks, and ensure the isolation of the machine learning experiment environment.

Description

System of computing services for machine learning and method for machine learning
Technical Field
The invention relates to the field of computer information processing, and in particular to a system of computing services for machine learning and a method for machine learning.
Background
With the advent of the big data era and the continuous development of artificial intelligence technology, more and more engineers are devoting themselves to research on machine learning algorithms. Such research depends not only on an individual's theoretical background but also on the ability to operate high-performance hardware resources such as GPUs. Current machine learning research, and deep learning in particular, often involves relatively complex environment configuration and parameter tuning, which poses great challenges to inexperienced developers. Whenever the experimental environment changes, a developer has to repeat the process of building the environment and compiling code, which wastes time and reduces efficiency. On the other hand, high-performance GPU servers are often statically bound to individual developers and lack unified scheduling and management, so that idle computing resources coexist with severe shortages. To let developers enter machine learning algorithm research efficiently and quickly, the demand for a machine-learning-oriented computing cloud platform is becoming increasingly urgent.
At present, computing service platforms at home and abroad are still at an early stage of development. The AWS cloud computing platform offers general-purpose GPU computing products built around generic requirements, and does not provide computing services tailored to machine learning. Machine learning computing services are special in that they depend on complex compilation environments and high-performance hardware resources. Virtualization technology is often adopted to build the experimental environment. Virtual machine technology runs one or more independent machines virtually on physical hardware and has a high resource occupancy rate, whereas container virtualization is lightweight, isolates environments and deploys efficiently. In existing examples of using container virtualization for platform construction, most services run inside containers. For example, patent CN106569895A describes a container-based multi-tenant big data platform construction method that, based on container virtualization technology, independently encapsulates the different functional components of a big data platform, such as storage, computation, monitoring, caching and backup. This way of running services in containers is not suitable for machine learning computing platforms, because machine learning models are not uniform and model training and testing cannot be satisfied by modular services packaged in containers.
In the prior art, machine learning services can be provided through general-purpose cloud platforms or virtual-machine-based cloud platforms. A general-purpose cloud platform, however, merely gives users resource isolation and pooling and brings no particular convenience to machine learning algorithm research and development, while a virtual-machine-based cloud platform suffers from the high resource occupancy and large performance overhead of virtual machine technology.
Therefore, a new system of computing services for machine learning and a method for machine learning are needed.
The above information disclosed in this background section is only for enhancement of understanding of the background of the invention and therefore it may contain information that does not constitute prior art that is already known to a person of ordinary skill in the art.
Disclosure of Invention
In view of this, the invention provides a system of computing services for machine learning and a method for machine learning, which allow the environment deployment and data selection of a machine learning task to be carried out simply and quickly, meet the special requirements of machine learning tasks, and ensure the isolation of the machine learning experiment environment.
Additional features and advantages of the invention will be set forth in the detailed description which follows, or may be learned by practice of the invention.
According to an aspect of the invention, a system of computing services for machine learning is presented, the system comprising: an online experiment module for providing basic services for a user's machine learning code, the basic services including data maintenance, data editing and code running; a training and testing module, implemented with Docker containers, for training and testing machine learning tasks submitted by the user; and a remote debugging module for providing the user with a remote debugging function for the training and testing module.
In an exemplary embodiment of the present disclosure, the system further comprises: an authentication and authorization module for determining the validity of users and providing different levels of usage permission to different users; an image repository for providing a compilation environment for the user; a data center for storing public data sets; and a private repository for providing an image collection function.
In an exemplary embodiment of the present disclosure, the online experiment module manages user permissions through JupyterHub.
In an exemplary embodiment of the present disclosure, the online experiment module includes: an image start-up module for processing and starting images through Jupyter Notebook; and a resource module for enabling the basic services to obtain GPU resources through nvidia-docker.
In an exemplary embodiment of the disclosure, the start-up modes of the Jupyter Notebook program of the online experiment module include a container start-up mode.
In an exemplary embodiment of the disclosure, the remote debugging module uses the VNC remote display principle on top of the Mesos cluster management framework, so that the Docker containers have a remote debugging function.
In an exemplary embodiment of the present disclosure, the remote debugging module includes: a port submodule for providing an IP address and a port, so that the user can perform remote debugging by accessing the IP address and port returned by the backend.
In an exemplary embodiment of the disclosure, the scheduling policy of the training test module is a first-in-first-out policy.
In an exemplary embodiment of the present disclosure, the training and testing module includes: a parameter interface submodule for providing an interface for parameter transmission so that the machine learning task keeps running continuously; and an API interface module for providing an API (application programming interface) for the machine learning task.
In an exemplary embodiment of the present disclosure, the image repository includes: a creation submodule for creating an image according to the Dockerfile of the machine learning training and testing module.
According to an aspect of the invention, a method for machine learning is proposed, the method comprising: acquiring machine learning code; uploading the machine learning code to a predetermined system; and training and testing the machine learning code based on the predetermined system; wherein the predetermined system comprises: an online experiment module for providing basic services for a user's machine learning code, the basic services including data maintenance, data editing and code running; a training and testing module, implemented with Docker containers, for training and testing machine learning tasks submitted by the user; and a remote debugging module for providing the user with a remote debugging function for the training and testing module.
According to an aspect of the present invention, there is provided an electronic apparatus including: one or more processors; and storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method above.
According to an aspect of the invention, a computer-readable medium is proposed, on which a computer program is stored which, when being executed by a processor, carries out the method as above.
According to the system of computing services for machine learning and the method for machine learning of the invention, the environment deployment and data selection of a machine learning task can be carried out simply and quickly, the special requirements of machine learning tasks are met, and the isolation of the machine learning experiment environment is ensured.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
Drawings
The above and other objects, features and advantages of the present invention will become more apparent by describing in detail exemplary embodiments thereof with reference to the attached drawings. The drawings described below are only some embodiments of the invention and other drawings may be derived from those drawings by a person skilled in the art without inventive effort.
FIG. 1 is a block diagram illustrating a system of computing services for machine learning, according to an example embodiment.
FIG. 2 is a block diagram illustrating a system of computing services for machine learning, according to another example embodiment.
FIG. 3 is a flow chart illustrating a method for machine learning, according to an example embodiment.
FIG. 4 is a block diagram illustrating an electronic device in accordance with an example embodiment.
Fig. 5 schematically illustrates a computer-readable storage medium in an exemplary embodiment of the disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The same reference numerals denote the same or similar parts in the drawings, and thus, a repetitive description thereof will be omitted.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to provide a thorough understanding of embodiments of the invention. One skilled in the relevant art will recognize, however, that the invention may be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known methods, devices, implementations or operations have not been shown or described in detail to avoid obscuring aspects of the invention.
The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
It will be understood that, although the terms first, second, third, etc. may be used herein to describe various components, these components should not be limited by these terms. These terms are used to distinguish one element from another. Thus, a first component discussed below may be termed a second component without departing from the teachings of the disclosed concept. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
It will be appreciated by those skilled in the art that the drawings are merely schematic representations of exemplary embodiments, and that the blocks or flow charts in the drawings are not necessarily required to practice the present invention and are, therefore, not intended to limit the scope of the present invention.
FIG. 1 is a block diagram illustrating a system of computing services for machine learning, according to an example embodiment. The system 10 for a computing service for machine learning includes: an online experiment module 102, a training test module 104, and a remote debug module 106.
The online experiment module 102 is configured to provide basic services for a user's machine learning code, where the basic services include data maintenance, data editing and code running. The online experiment module 102 manages user permissions through JupyterHub. JupyterHub is a multi-user Jupyter Notebook server used to create, manage and proxy multiple Jupyter Notebook instances; it is extensible and customizable. Jupyter Notebook (formerly IPython Notebook) is an interactive notebook supporting more than 40 programming languages. In essence, Jupyter Notebook is a web application that makes it convenient to create and share literate-programming documents, with support for live code, mathematical equations, visualization and Markdown. Typical applications include data cleaning and transformation, numerical simulation, statistical modeling, machine learning, and the like.
In one embodiment, the online experiment module 102 includes: an image start-up module for processing and starting images through Jupyter Notebook; and a resource module for enabling the basic services to obtain GPU resources through nvidia-docker. NVIDIA Docker (nvidia-docker) is a container technology that exposes the host's GPUs to containers. The Jupyter Notebook process of the online experiment module is started in a container. A virtual machine runs one or more independent machines virtually on physical hardware, whereas a container runs directly in the user space of the operating system kernel, providing a lightweight, fast environment for running developers' programs. The platform provides the development environment required by machine learning in the form of images, and the computing resources required by developers can be scheduled effectively through cluster management.
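To make the resource module concrete, the following is a minimal sketch, not the platform's actual code, of how a GPU-enabled container could be started through the Docker SDK for Python on a host where nvidia-docker is installed; the image name and the command are placeholders.

```python
# Minimal sketch (assumed image name and command): start a detached container
# through the nvidia runtime installed by nvidia-docker so it can see the GPUs.
import docker

client = docker.from_env()

container = client.containers.run(
    "tensorflow/tensorflow:latest-gpu",   # hypothetical GPU-enabled base image
    command="nvidia-smi",
    runtime="nvidia",                     # route the container through nvidia-docker
    detach=True,
)
container.wait()               # block until the command finishes
print(container.logs().decode())
container.remove()             # destroy the container and release its resources
```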
The online experiment module 102 takes the Jupyter Notebook service as its core, manages user permissions with JupyterHub, and provides developers with basic services such as code and data maintenance, editing and running; aspects such as JupyterHub authentication, parameter delivery and service start-up can be customized according to the actual situation.
Through the above arrangement, the online experiment module 102 implements the following functions (a configuration sketch illustrating several of these points follows the list):
(1) the system 10 of computing services for machine learning uses a white-list authentication mechanism;
(2) a page for parameter transmission lets a developer request computing resources such as memory, CPU and GPU according to need;
(3) the Jupyter Notebook start-up image is customized, and the machine learning services obtain GPU resources through nvidia-docker;
(4) Jupyter Notebook is started in a container, and the basic services mount the public data sets and the developer's private data;
(5) a developer's operations on the basic services are persisted, and the data in the data center changes accordingly;
(6) when the developer exits the online basic services, the corresponding container is destroyed and its resources are released, so that computing resources are used effectively.
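As a rough illustration of points (1), (3), (4) and (6) above, the following jupyterhub_config.py sketch assumes JupyterHub is combined with the dockerspawner package; the user names, image name and mount paths are placeholders, and the white-list option is called allowed_users in recent JupyterHub releases (whitelist in older ones). It is a sketch of the configuration idea, not the platform's actual configuration.

```python
# jupyterhub_config.py -- minimal sketch under the assumptions stated above.
c = get_config()  # noqa: F821  (this name is provided by JupyterHub at load time)

# (1) white-list authentication: only listed users may log in
c.Authenticator.allowed_users = {"alice", "bob"}   # hypothetical user names

# (4) start each user's Jupyter Notebook in its own container,
#     mounting the public data set and the user's private data
c.JupyterHub.spawner_class = "dockerspawner.DockerSpawner"
c.DockerSpawner.image = "ml-platform/notebook-gpu:latest"       # hypothetical image
c.DockerSpawner.volumes = {
    "/datacenter/public": "/data/public",                       # shared data-center path
    "/datacenter/users/{username}": "/data/private",            # per-user private data
}

# (3) hand GPUs to the notebook container via the nvidia runtime
c.DockerSpawner.extra_host_config = {"runtime": "nvidia"}

# (6) destroy the container and release resources when the user stops the service
c.DockerSpawner.remove = True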
The training and testing module 104 trains and tests machine learning tasks submitted by users and is implemented with Docker containers. The scheduling policy of the training and testing module 104 is a first-in-first-out policy. Docker is an open-source application container engine that lets developers package an application and its dependencies into a portable container, distribute it to any popular Linux machine, and thereby achieve virtualization. Containers are fully sandboxed and have no interfaces to one another.
The training and testing module 104 includes: a parameter interface submodule for providing an interface for parameter transmission so that the machine learning task keeps running continuously; and an API interface module for providing an API (application programming interface) for the machine learning task.
Through the above arrangement, the training and testing module 104 can implement the following functions (a simplified scheduling sketch follows the list):
(1) the various machine learning computing tasks submitted by users are scheduled and run under a first-in-first-out scheduling policy while high availability is ensured, and the framework provides an API interface for task management;
(2) an interface for transferring code and data sets is provided: a user only needs to supply the corresponding data-center path to mount data into the container that runs the machine learning task; allowing data sets to be mapped into containers gives the platform great flexibility and is convenient for developers;
(3) an interface for parameter transmission is provided: as long as the system 10 of computing services for machine learning has sufficient resources and the number of the user's running tasks does not exceed the upper limit set by the usage permission, the machine learning task keeps running in the cloud platform until it completes or terminates with an error;
(4) while a task is running, the platform displays its running state and log in real time, and the logs produced at runtime are saved in the cloud platform;
(5) developers can share a configured development environment by committing an image, which saves other developers' time and helps core algorithms be implemented quickly.
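The following simplified sketch illustrates only the scheduling idea of points (1) to (3): tasks are dispatched in first-in-first-out order and a user may not exceed a fixed number of running tasks. The names, the limit and the run_in_container stub are illustrative, not part of the patent.

```python
# Simplified FIFO scheduling sketch with a per-user running-task limit.
from collections import deque, defaultdict
from dataclasses import dataclass

MAX_RUNNING_PER_USER = 2   # hypothetical limit taken from the user's usage permission

@dataclass
class Task:
    user: str
    code_path: str      # data-center path of the user's code
    data_path: str      # data-center path to mount into the container

pending = deque()              # FIFO queue of submitted tasks
running = defaultdict(int)     # user -> number of currently running tasks

def run_in_container(task):
    """Stub: here the platform would start a Docker container for the task."""
    print(f"run {task.code_path} for {task.user}, mounting {task.data_path}")

def submit(task):
    """Queue a task; the deque preserves first-in-first-out order."""
    pending.append(task)

def schedule_once():
    """Dispatch tasks in FIFO order, deferring users already at their limit."""
    deferred = deque()
    while pending:
        task = pending.popleft()
        if running[task.user] < MAX_RUNNING_PER_USER:
            running[task.user] += 1
            run_in_container(task)
        else:
            deferred.append(task)   # stays queued, relative order preserved
    pending.extend(deferred)

submit(Task("alice", "/datacenter/alice/train.py", "/datacenter/public/mnist"))
schedule_once()
```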
The remote debugging module 106 is used to provide the user with a remote debugging function for the training and testing module. The remote debugging module 106 uses the VNC remote display principle on top of the Mesos cluster management framework, so that Docker containers have a remote debugging function.
The remote debugging module 106 includes: a port submodule for providing an IP address and a port, so that the user can perform remote debugging by accessing the IP address and port returned by the backend.
In the embodiment of the present invention, the remote debugging module 106 is implemented on a Mesos cluster management framework. With the development of the internet, big data computing frameworks keep appearing: distributed computing frameworks such as MapReduce for offline processing, Storm for online processing, the iterative computing framework Spark and the stream processing framework S4 each solve problems in particular applications, and an internet company may employ several different frameworks at once. Considering resource utilization, operation and maintenance cost, data sharing and similar factors, different computing frameworks are often deployed in a common cluster to share cluster resources; because different tasks need different resources (CPU, memory, network I/O and so on), running them in the same cluster without unified management leads to interference, resource contention and low efficiency. Unified resource management and scheduling platforms were therefore born, with two typical representatives: Mesos and YARN. Mesos is an open-source distributed resource management framework under Apache and is referred to as the kernel of the distributed system; it was originally developed by AMPLab at the University of California, Berkeley, and was later widely used at Twitter.
The remote debugging module 106 provides container remote debugging by using the Mesos cluster management framework together with the VNC (Virtual Network Computing) remote display principle. Specifically, the following functions are provided (a port-mapping sketch follows the list).
(1) A Docker image with VNC software installed is used as the base image; the experimental environment is configured, compilation software is installed, and a remote debugging image is customized according to the user's requirements.
(2) An interface for data mounting is provided, so that data can be selected from the data center; downloading in git format is supported, giving users flexibility in operating and maintaining data.
(3) An interface for parameter transmission is provided: a user can request CPU, GPU and memory resources according to the computing resources needed, and the Mesos backend starts the corresponding remote debugging service when the backend computing resource permission is satisfied and the user's number of running tasks is below the upper limit set by the usage permission.
(4) A state display function is provided: different running states such as 'waiting', 'preparing' and 'running' are returned to the user according to the stage of the backend, and once the running state is reached, the user can carry out remote debugging experiments by accessing the IP address and port returned by the backend.
(5) Ports are isolated between containers on a single server, and dynamically allocated port numbers are exposed according to the number of ports the user needs to access, so that the user can reach the remote debugging desktop and whatever internal services need to be opened.
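The dynamic port mapping of points (4) and (5) could look roughly like the following Docker SDK sketch; the VNC-capable image name and the internal VNC port 5901 are assumptions, not values defined by the patent.

```python
# Sketch of the dynamic port-mapping idea only, using the Docker SDK for Python.
import docker

client = docker.from_env()

debug = client.containers.run(
    "ml-platform/vnc-debug:latest",      # hypothetical image with VNC software installed
    detach=True,
    runtime="nvidia",                    # GPU access for debugging, as in the other modules
    ports={"5901/tcp": None},            # None: Docker picks a free host port dynamically
    volumes={"/datacenter/public": {"bind": "/data", "mode": "ro"}},
)

debug.reload()                           # refresh attributes to see the assigned port
binding = debug.attrs["NetworkSettings"]["Ports"]["5901/tcp"][0]
print(f"connect your VNC client to {binding['HostIp']}:{binding['HostPort']}")
```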
According to the system of computing services for machine learning disclosed by the invention, the machine learning experiment environment is effectively isolated through virtualization and the pooling of computing resources, the limited computing resources in the cluster such as CPU, memory and GPU are used effectively, and operation and maintenance are simplified by uniformly monitoring the usage of cluster computing resources. The environment deployment and data selection of a machine learning task can be carried out simply and quickly, the special requirements of machine learning tasks are met, and the isolation of the machine learning experiment environment is guaranteed.
According to the system of computing services for machine learning of the invention, JupyterHub is used for multi-user instance management; a remote debugging service is provided, dynamically opening a virtual desktop so that developers can conveniently debug code with a visual IDE (integrated development environment); the platform provides an interface for running machine learning tasks in batches and can show task states and running results in real time; and a data center, an image repository and a private repository are provided, whose data can be called directly by experiments and tasks, which is convenient for developers.
FIG. 2 is a block diagram illustrating a system of computing services for machine learning, according to another example embodiment. The system 20 of computing services for machine learning includes the online experiment module 102, the training and testing module 104 and the remote debugging module 106 shown in FIG. 1, and further includes:
the authentication and authorization module 202 is used for determining the validity of the user and providing different levels of use rights for different users. The authentication authorization module 202 is an entrance for entering a system of computing services for machine learning, and using resources of the system of computing services for machine learning, and determines a valid user through authentication of information such as a user name, a password, a job number, and the like; in the authorization aspect, different use authorities are given according to different user levels, and the maximum computing resource application number, the maximum running task number and the like are utilized.
The image repository 204 is used to provide compilation environments for users. Images are created according to the Dockerfile of the machine learning training/testing model; these images provide the most basic compilation environment and form the foundation for the remote debugging module and the training/testing module.
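As an illustration only, an image could be built from such a Dockerfile and pushed to the repository with the Docker SDK for Python as sketched below; the build path, tag and registry address are placeholders, not the platform's real values.

```python
# Minimal sketch: build a training/testing image from a Dockerfile and push it.
import docker

client = docker.from_env()

image, build_logs = client.images.build(
    path="./train-env",                                   # directory containing the Dockerfile
    tag="registry.example.com/ml/train-base:1.0",         # hypothetical registry and tag
)
for entry in build_logs:                                  # streamed build output, line by line
    if "stream" in entry:
        print(entry["stream"], end="")

client.images.push("registry.example.com/ml/train-base", tag="1.0")
```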
The data center 206 is used to store public data sets. This service allows developers to select data according to the project category they belong to.
The private repository 208 is used to provide an image collection function. A developer can select development-related images from the image repository, and the selected images are synchronized into the image names selectable in the remote debugging and training/evaluation modules.
In addition, the system 20 of computing services for machine learning includes functional modules (not shown in the figure) for cluster resource scheduling, user management, resource monitoring, and the like. The modules are independent of one another and interact in corresponding ways, so modifying one module does not affect the platform as a whole.
The cluster resource scheduling module in the present application adopts a resource scheduling framework based on Mesos; Kubernetes and other similar open-source resource scheduling software can be used instead and bring the same effect, and the invention is not limited in this respect.
FIG. 3 is a flow chart illustrating a method for machine learning, according to an example embodiment.
As shown in fig. 3, in S302, machine learning code is acquired.
In S304, the machine learning code is uploaded to a predetermined system. The predetermined system comprises: an online experiment module for providing basic services for a user's machine learning code, the basic services including data maintenance, data editing and code running; a training and testing module, implemented with Docker containers, for training and testing machine learning tasks submitted by the user; and a remote debugging module for providing the user with a remote debugging function for the training and testing module.
In S306, the machine learning code is trained and tested based on the predetermined system.
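Seen from the client side, the S302 to S306 flow might look like the following sketch; the base URL, endpoints and JSON fields are hypothetical, since the patent does not define a concrete API.

```python
# Hypothetical client-side illustration of the S302-S306 flow.
import time
import requests

BASE = "http://ml-platform.example.com/api"     # hypothetical platform address

# S302/S304: acquire the machine learning code locally and upload it
with open("train.py", "rb") as code:
    resp = requests.post(f"{BASE}/code", files={"file": code})
code_path = resp.json()["path"]

# S306: submit a training/testing task and poll its state until it finishes
task = requests.post(f"{BASE}/tasks", json={
    "code_path": code_path,
    "data_path": "/datacenter/public/mnist",    # hypothetical data-center path
    "resources": {"cpu": 4, "memory_gb": 8, "gpu": 1},
}).json()

while True:
    state = requests.get(f"{BASE}/tasks/{task['id']}").json()["state"]
    print("task state:", state)
    if state in ("finished", "error"):
        break
    time.sleep(10)
```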
It should be clearly understood that the present disclosure describes how to make and use particular examples, but the principles of the present disclosure are not limited to any details of these examples. Rather, these principles can be applied to many other embodiments based on the teachings of the present disclosure.
Those skilled in the art will appreciate that all or part of the steps implementing the above embodiments are implemented as computer programs executed by a CPU. The computer program, when executed by the CPU, performs the functions defined by the method provided by the present invention. The program may be stored in a computer readable storage medium, which may be a read-only memory, a magnetic or optical disk, or the like.
Furthermore, it should be noted that the above-mentioned figures are only schematic illustrations of the processes involved in the method according to exemplary embodiments of the invention, and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.
FIG. 4 is a block diagram illustrating an electronic device in accordance with an example embodiment.
An electronic device 200 according to this embodiment of the invention is described below with reference to fig. 4. The electronic device 200 shown in fig. 4 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 4, the electronic device 200 is embodied in the form of a general purpose computing device. The components of the electronic device 200 may include, but are not limited to: at least one processing unit 210, at least one memory unit 220, a bus 230 connecting different system components (including the memory unit 220 and the processing unit 210), a display unit 240, and the like.
The storage unit stores program code executable by the processing unit 210, so that the processing unit 210 performs the steps according to various exemplary embodiments of the present invention described in the method sections of this specification. For example, the processing unit 210 may perform the steps shown in fig. 3.
The memory unit 220 may include readable media in the form of volatile memory units, such as a random access memory unit (RAM)2201 and/or a cache memory unit 2202, and may further include a read only memory unit (ROM) 2203.
The storage unit 220 may also include a program/utility 2204 having a set (at least one) of program modules 2205, such program modules 2205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Bus 230 may be one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 200 may also communicate with one or more external devices 300 (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 200, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 200 to communicate with one or more other computing devices. Such communication may occur via an input/output (I/O) interface 250. Also, the electronic device 200 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet) via the network adapter 260. The network adapter 260 may communicate with other modules of the electronic device 200 via the bus 230. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 200, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB flash drive, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, or a network device, etc.) to execute the above-mentioned method according to the embodiments of the present disclosure.
Fig. 5 schematically illustrates a computer-readable storage medium in an exemplary embodiment of the disclosure.
Referring to fig. 5, a program product 500 for implementing the above method according to an embodiment of the present invention is described, which may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on a terminal device, such as a personal computer. However, the program product of the present invention is not limited in this regard and, in the present document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium other than a readable storage medium that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java or C++ and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (for example, through the internet using an internet service provider).
The computer readable medium carries one or more programs which, when executed by a device, cause the device to perform the following functions: acquiring machine learning code; uploading the machine learning code to a predetermined system; and training and testing the machine learning code based on the predetermined system; wherein the predetermined system comprises: an online experiment module for providing basic services for the user's machine learning code, the basic services including data maintenance, data editing and code running; a training and testing module, implemented with Docker containers, for training and testing machine learning tasks submitted by the user; and a remote debugging module for providing the user with a remote debugging function for the training and testing module.
Those skilled in the art will appreciate that the modules described above may be distributed in the apparatus according to the description of the embodiments, or may be modified accordingly in one or more apparatuses unique from the embodiments. The modules of the above embodiments may be combined into one module, or further split into multiple sub-modules.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiment of the present invention can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which can be a personal computer, a server, a mobile terminal, or a network device, etc.) to execute the method according to the embodiment of the present invention.
Exemplary embodiments of the present invention are specifically illustrated and described above. It is to be understood that the invention is not limited to the precise construction, arrangements, or instrumentalities described herein; on the contrary, the invention is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.
In addition, the structures, proportions and sizes shown in the drawings of this specification are only used to match the content disclosed in the specification, so that it can be understood and read by those skilled in the art; they do not limit the conditions under which the disclosure can be implemented. Changes of structure, proportion or size that do not affect the technical effects and purposes achievable by the disclosure therefore still fall within the scope covered by the disclosed technical content. Likewise, terms such as "above", "first", "second" and "a" used in this specification are for clarity of description only and are not intended to limit the scope of the disclosure; changes or adjustments of their relative relationships, without substantial technical changes, are also regarded as within the implementable scope of the invention.

Claims (12)

1. A system of computing services for machine learning, comprising:
the online experiment module is used for providing basic service for machine learning codes of users, and the basic service comprises data maintenance, data editing and code operation;
the training and testing module is used for training and testing a machine learning task submitted by a user and is realized by adopting a Docker container; and
the remote debugging module is used for providing a remote debugging function for the training test module for a user;
the data center is used for storing public data sets;
the training test module comprises: the parameter interface submodule is used for providing an interface for parameter transmission, and the interface for parameter transmission keeps the number of the user's running machine learning tasks below a specified upper limit, so that the machine learning task keeps running continuously; the API interface module is used for providing an API interface for the machine learning task; and the code and data set transfer interface module is used for providing an interface that maps the public data set into a container of the machine learning task.
2. The system of claim 1, further comprising:
the authentication and authorization module is used for determining the legality of the user and providing different levels of use permission for different users;
the image repository is used for providing a compiling environment for the user; and
the private repository is used for providing an image collection function.
3. The system of claim 1, wherein the online experiment module manages the permissions of users through JupyterHub.
4. The system of claim 3, wherein the online experiment module comprises:
the image start-up module is used for processing and starting images through Jupyter Notebook; and
and the resource module is used for enabling the basic service to obtain GPU resources through the nvidia-docker.
5. The system of claim 4, wherein the manner in which the Jupyter Notebook program in the online experiment module is started comprises: a container start-up mode.
6. The system of claim 1, wherein the remote debugging module uses the VNC remote display principle through a Mesos cluster management framework to enable the Docker container to have a remote debugging function.
7. The system of claim 1, wherein the remote debugging module comprises:
and the port submodule is used for providing an IP address and a port so that the user can carry out remote debugging operation by accessing the IP address and the port returned by the background.
8. The system of claim 1, wherein the scheduling policy of the training test module is a first-in-first-out policy.
9. The system of claim 2, wherein the image repository comprises:
a creation submodule for creating an image according to the Dockerfile of the training test module.
10. A method for machine learning, comprising:
acquiring a machine learning code;
uploading the machine learning code to a predetermined system;
training and testing the machine learning code based on the predetermined system;
wherein the predetermined system comprises:
the online experiment module is used for providing basic service for machine learning codes of users, and the basic service comprises data maintenance, data editing and code operation;
the training and testing module is used for training and testing a machine learning task submitted by a user and is realized by adopting a Docker container; and
the remote debugging module is used for providing a remote debugging function for the training test module for a user;
the data center is used for storing public data sets;
wherein the training test module comprises: the parameter interface submodule is used for providing an interface for parameter transmission, and the interface for parameter transmission keeps the number of the user's running machine learning tasks below a specified upper limit, so that the machine learning task keeps running continuously; the API interface module is used for providing an API interface for the machine learning task; and the code and data set transfer interface module is used for providing an interface that maps the public data set into a container of the machine learning task.
11. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs;
when executed by the one or more processors, cause the one or more processors to implement the method as recited in claim 10.
12. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method as claimed in claim 10.
CN201711391831.6A 2017-12-21 2017-12-21 System of computing services for machine learning and method for machine learning Active CN109961151B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711391831.6A CN109961151B (en) 2017-12-21 2017-12-21 System of computing services for machine learning and method for machine learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711391831.6A CN109961151B (en) 2017-12-21 2017-12-21 System of computing services for machine learning and method for machine learning

Publications (2)

Publication Number Publication Date
CN109961151A CN109961151A (en) 2019-07-02
CN109961151B true CN109961151B (en) 2021-05-14

Family

ID=67018582

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711391831.6A Active CN109961151B (en) 2017-12-21 2017-12-21 System of computing services for machine learning and method for machine learning

Country Status (1)

Country Link
CN (1) CN109961151B (en)

Families Citing this family (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110516934A (en) * 2019-08-13 2019-11-29 湖南智擎科技有限公司 Intelligent big data practical training method and system based on scalable cluster
CN110990864B (en) * 2019-11-27 2023-01-10 支付宝(杭州)信息技术有限公司 Report authority management method, device and equipment
CN113033814A (en) * 2019-12-09 2021-06-25 北京中关村科金技术有限公司 Method, apparatus and storage medium for training machine learning model
CN111901294A (en) * 2020-06-09 2020-11-06 北京迈格威科技有限公司 Method for constructing online machine learning project and machine learning system
CN112434284B (en) * 2020-10-29 2022-05-17 格物钛(上海)智能科技有限公司 Machine learning training platform implementation based on sandbox environment
CN112311605B (en) * 2020-11-06 2023-12-22 北京格灵深瞳信息技术股份有限公司 Cloud platform and method for providing machine learning service
CN112417358A (en) * 2020-12-03 2021-02-26 合肥中科类脑智能技术有限公司 AI model training on-line practical training learning system and method
CN112560244B (en) * 2020-12-08 2021-12-10 河海大学 Virtual simulation experiment system and method based on Docker
CN112463389A (en) * 2020-12-10 2021-03-09 中国科学院深圳先进技术研究院 Resource management method and device for distributed machine learning task
CN112633501A (en) * 2020-12-25 2021-04-09 深圳晶泰科技有限公司 Development method and system of machine learning model framework based on containerization technology
CN113190238A (en) * 2021-03-26 2021-07-30 曙光信息产业(北京)有限公司 Framework deployment method and device, computer equipment and storage medium
CN112966833B (en) * 2021-04-07 2023-01-31 福州大学 Machine learning model platform based on Kubernetes cluster
CN113377529B (en) * 2021-05-24 2024-04-19 阿里巴巴创新公司 Intelligent acceleration card and data processing method based on intelligent acceleration card
CN114167748B (en) * 2021-10-26 2024-04-09 北京航天自动控制研究所 Flight control algorithm integrated training platform
CN114594893A (en) * 2022-01-17 2022-06-07 阿里巴巴(中国)有限公司 Performance analysis method and device, electronic equipment and computer readable storage medium
CN114996117B (en) * 2022-03-28 2024-02-06 湖南智擎科技有限公司 Client GPU application evaluation system and method for SaaS mode
CN114841298B (en) * 2022-07-06 2022-09-27 山东极视角科技有限公司 Method and device for training algorithm model, electronic equipment and storage medium
CN117234954B (en) * 2023-11-14 2024-02-06 杭银消费金融股份有限公司 Intelligent online testing method and system based on machine learning algorithm

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105824614A (en) * 2015-12-15 2016-08-03 广东亿迅科技有限公司 Building method and device for distributed development environment based on Docker
CN106873975A (en) * 2016-12-30 2017-06-20 武汉默联股份有限公司 Devops based on Docker persistently pays and automated system and method
CN106961351A (en) * 2017-03-03 2017-07-18 南京邮电大学 Intelligent elastic telescopic method based on Docker container clusters
CN107038482A (en) * 2017-04-21 2017-08-11 上海极链网络科技有限公司 Applied to AI algorithm engineerings, the Distributed Architecture of systematization
CN107066310A (en) * 2017-03-11 2017-08-18 郑州云海信息技术有限公司 It is a kind of to build and using the method and device in the privately owned warehouses of safe Docker
CN107229520A (en) * 2017-04-27 2017-10-03 北京数人科技有限公司 Data center operating system
CN107450961A (en) * 2017-09-22 2017-12-08 济南浚达信息技术有限公司 A kind of distributed deep learning system and its building method, method of work based on Docker containers

Also Published As

Publication number Publication date
CN109961151A (en) 2019-07-02

Similar Documents

Publication Publication Date Title
CN109961151B (en) System of computing services for machine learning and method for machine learning
US10552161B2 (en) Cluster graphical processing unit (GPU) resource sharing efficiency by directed acyclic graph (DAG) generation
Bentaleb et al. Containerization technologies: Taxonomies, applications and challenges
US10409654B2 (en) Facilitating event-driven processing using unikernels
US9886303B2 (en) Specialized micro-hypervisors for unikernels
US10310908B2 (en) Dynamic usage balance of central processing units and accelerators
Azab Enabling docker containers for high-performance and many-task computing
Scolati et al. A containerized big data streaming architecture for edge cloud computing on clustered single-board devices
US20130268638A1 (en) Mapping requirements to a system topology in a networked computing environment
US10394971B2 (en) Hybrid simulation of a computing solution in a cloud computing environment with a simplified computing solution and a simulation model
JP2016517120A (en) Control runtime access to application programming interfaces
US8938712B2 (en) Cross-platform virtual machine and method
US8825862B2 (en) Optimization of resource provisioning in a networked computing environment
CN116414518A (en) Data locality of big data on Kubernetes
US11010149B2 (en) Shared middleware layer containers
US20170262405A1 (en) Remote direct memory access-based on static analysis of asynchronous blocks
WO2022078060A1 (en) Tag-driven scheduling of computing resources for function execution
Carvalho et al. Towards a dataflow runtime environment for edge, fog and in-situ computing
US11163603B1 (en) Managing asynchronous operations in cloud computing environments
CN114860401A (en) Heterogeneous cloud desktop scheduling system, method, service system, device and medium
CN114579250A (en) Method, device and storage medium for constructing virtual cluster
Kumar et al. Machine translation system as virtual appliance: for scalable service deployment on cloud
Kim et al. RETRACTED ARTICLE: Simulator considering modeling and performance evaluation for high-performance computing of collaborative-based mobile cloud infrastructure
CN109271179A (en) Virtual machine application management method, device, equipment and readable storage medium storing program for executing
US11983561B2 (en) Configuring hardware multithreading in containers

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant