CN109961151B - System of computing services for machine learning and method for machine learning - Google Patents

System of computing services for machine learning and method for machine learning

Info

Publication number
CN109961151B
Authority
CN
China
Prior art keywords
machine learning
module
providing
training
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711391831.6A
Other languages
Chinese (zh)
Other versions
CN109961151A (en)
Inventor
李博文
杨洪雪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tongfang Vision Technology Jiangsu Co ltd
Original Assignee
Tongfang Vision Technology Jiangsu Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tongfang Vision Technology Jiangsu Co ltd filed Critical Tongfang Vision Technology Jiangsu Co ltd
Priority to CN201711391831.6A priority Critical patent/CN109961151B/en
Publication of CN109961151A publication Critical patent/CN109961151A/en
Application granted granted Critical
Publication of CN109961151B publication Critical patent/CN109961151B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Abstract

A system of computing services for machine learning and a method for machine learning are disclosed, relating to the field of computer information processing. The system comprises: an online experiment module for providing basic services for a user's machine learning code, the basic services including data maintenance, data editing and code running; a training and testing module, implemented with Docker containers, for training and testing machine learning tasks submitted by the user; and a remote debugging module for providing the user with a remote debugging function for the training and testing module. The system of computing services for machine learning and the method for machine learning allow the environment deployment and data selection of a machine learning task to be carried out simply and quickly, meet the special requirements of machine learning tasks, and ensure the isolation of the machine learning experiment environment.

Description

System of computing services for machine learning and method for machine learning
Technical Field
The invention relates to the field of computer information processing, and in particular to a system of computing services for machine learning and a method for machine learning.
Background
With the advent of the big data era and the continuous development of artificial intelligence technology, more and more engineers are devoting themselves to research on machine learning algorithms. Such research depends not only on an individual's theoretical background but also on the ability to operate high-performance hardware resources such as GPUs. Current machine learning research, and deep learning in particular, often involves relatively complex environment configuration and parameter tuning, which poses great challenges to inexperienced developers. Whenever the experimental environment changes, a developer has to repeat the process of building the environment and compiling code, which wastes time and reduces efficiency. On the other hand, high-performance GPU servers are often statically bound to individual developers and lack unified scheduling and management, so that idle computing resources coexist with severe shortages. To let developers enter machine learning algorithm research efficiently and quickly, the demand for a machine-learning-oriented computing cloud platform is becoming increasingly urgent.
At present, computing service platforms at home and abroad are still at an early stage of development. The AWS cloud computing platform offers general-purpose GPU computing products built around generic requirements, and does not provide computing services tailored to machine learning. Machine learning computing services are special in that they depend on complex compilation environments and high-performance hardware resources. Virtualization technology is often adopted to build the experimental environment. Virtual machine technology runs one or more independent machines virtually on physical hardware and has a high resource occupancy rate, whereas container virtualization is lightweight, isolates environments and deploys efficiently. In existing examples of using container virtualization for platform construction, most services run inside containers. For example, patent CN106569895A describes a container-based multi-tenant big data platform construction method that, based on container virtualization technology, independently encapsulates the different functional components of a big data platform, such as storage, computation, monitoring, caching and backup. This way of running services in containers is not suitable for machine learning computing platforms, because machine learning models are not uniform and model training and testing cannot be satisfied by modular services packaged in containers.
In the prior art, machine learning services can be provided through general-purpose cloud platforms or virtual-machine-based cloud platforms. A general-purpose cloud platform, however, merely gives users resource isolation and pooling and brings no particular convenience to machine learning algorithm research and development, while a virtual-machine-based cloud platform suffers from the high resource occupancy and large performance overhead of virtual machine technology.
Therefore, a new system of computing services for machine learning and a method for machine learning are needed.
The above information disclosed in this background section is only for enhancement of understanding of the background of the invention and therefore it may contain information that does not constitute prior art that is already known to a person of ordinary skill in the art.
Disclosure of Invention
In view of this, the invention provides a system of computing services for machine learning and a method for machine learning, which allow the environment deployment and data selection of a machine learning task to be carried out simply and quickly, meet the special requirements of machine learning tasks, and ensure the isolation of the machine learning experiment environment.
Additional features and advantages of the invention will be set forth in the detailed description which follows, or may be learned by practice of the invention.
According to an aspect of the invention, a system of computing services for machine learning is presented, the system comprising: an online experiment module for providing basic services for a user's machine learning code, the basic services including data maintenance, data editing and code running; a training and testing module, implemented with Docker containers, for training and testing machine learning tasks submitted by the user; and a remote debugging module for providing the user with a remote debugging function for the training and testing module.
In an exemplary embodiment of the present disclosure, the system further comprises: an authentication and authorization module for determining the validity of users and providing different levels of usage permission to different users; an image repository for providing a compilation environment for the user; a data center for storing public data sets; and a private repository for providing an image collection function.
In an exemplary embodiment of the present disclosure, the online experiment module manages user permissions through JupyterHub.
In an exemplary embodiment of the present disclosure, the online experiment module includes: an image start-up module for processing and starting images through Jupyter Notebook; and a resource module for enabling the basic services to obtain GPU resources through nvidia-docker.
In an exemplary embodiment of the disclosure, the start-up modes of the Jupyter Notebook program of the online experiment module include a container start-up mode.
In an exemplary embodiment of the disclosure, the remote debugging module uses the VNC remote display principle on top of the Mesos cluster management framework, so that the Docker containers have a remote debugging function.
In an exemplary embodiment of the present disclosure, the remote debugging module includes: a port submodule for providing an IP address and a port, so that the user can perform remote debugging by accessing the IP address and port returned by the backend.
In an exemplary embodiment of the disclosure, the scheduling policy of the training test module is a first-in-first-out policy.
In an exemplary embodiment of the present disclosure, the training and testing module includes: a parameter interface submodule for providing an interface for parameter transmission so that the machine learning task keeps running continuously; and an API interface module for providing an API (application programming interface) for the machine learning task.
In an exemplary embodiment of the present disclosure, the image repository includes: a creation submodule for creating an image according to the Dockerfile of the machine learning training and testing module.
According to an aspect of the invention, a method for machine learning is proposed, the method comprising: acquiring machine learning code; uploading the machine learning code to a predetermined system; and training and testing the machine learning code based on the predetermined system; wherein the predetermined system comprises: an online experiment module for providing basic services for a user's machine learning code, the basic services including data maintenance, data editing and code running; a training and testing module, implemented with Docker containers, for training and testing machine learning tasks submitted by the user; and a remote debugging module for providing the user with a remote debugging function for the training and testing module.
According to an aspect of the present invention, there is provided an electronic apparatus including: one or more processors; and storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method above.
According to an aspect of the invention, a computer-readable medium is proposed, on which a computer program is stored which, when being executed by a processor, carries out the method as above.
According to the system of computing services for machine learning and the method for machine learning of the invention, the environment deployment and data selection of a machine learning task can be carried out simply and quickly, the special requirements of machine learning tasks are met, and the isolation of the machine learning experiment environment is ensured.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
Drawings
The above and other objects, features and advantages of the present invention will become more apparent by describing in detail exemplary embodiments thereof with reference to the attached drawings. The drawings described below are only some embodiments of the invention and other drawings may be derived from those drawings by a person skilled in the art without inventive effort.
FIG. 1 is a block diagram illustrating a system of computing services for machine learning, according to an example embodiment.
FIG. 2 is a block diagram illustrating a system of computing services for machine learning, according to another example embodiment.
FIG. 3 is a flow chart illustrating a method for machine learning, according to an example embodiment.
FIG. 4 is a block diagram illustrating an electronic device in accordance with an example embodiment.
Fig. 5 schematically illustrates a computer-readable storage medium in an exemplary embodiment of the disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The same reference numerals denote the same or similar parts in the drawings, and thus, a repetitive description thereof will be omitted.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to provide a thorough understanding of embodiments of the invention. One skilled in the relevant art will recognize, however, that the invention may be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known methods, devices, implementations or operations have not been shown or described in detail to avoid obscuring aspects of the invention.
The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
It will be understood that, although the terms first, second, third, etc. may be used herein to describe various components, these components should not be limited by these terms. These terms are used to distinguish one element from another. Thus, a first component discussed below may be termed a second component without departing from the teachings of the disclosed concept. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
It will be appreciated by those skilled in the art that the drawings are merely schematic representations of exemplary embodiments, and that the blocks or flow charts in the drawings are not necessarily required to practice the present invention and are, therefore, not intended to limit the scope of the present invention.
FIG. 1 is a block diagram illustrating a system of computing services for machine learning, according to an example embodiment. The system 10 for a computing service for machine learning includes: an online experiment module 102, a training test module 104, and a remote debug module 106.
The online experiment module 102 is configured to provide basic services for a user's machine learning code, where the basic services include data maintenance, data editing and code running. The online experiment module 102 manages user permissions through JupyterHub. JupyterHub is a multi-user Jupyter Notebook server used to create, manage and proxy multiple Jupyter Notebook instances; it is extensible and customizable. Jupyter Notebook (formerly IPython Notebook) is an interactive notebook supporting more than 40 programming languages. In essence, Jupyter Notebook is a web application that makes it convenient to create and share literate-programming documents, with support for live code, mathematical equations, visualization and Markdown. Typical applications include data cleaning and transformation, numerical simulation, statistical modeling, machine learning, and the like.
In one embodiment, the online experiment module 102 includes: an image start-up module for processing and starting images through Jupyter Notebook; and a resource module for enabling the basic services to obtain GPU resources through nvidia-docker. NVIDIA Docker (nvidia-docker) is a container technology that exposes the host's GPUs to containers. The Jupyter Notebook process of the online experiment module is started in a container. A virtual machine runs one or more independent machines virtually on physical hardware, whereas a container runs directly in the user space of the operating system kernel, providing a lightweight, fast environment for running developers' programs. The platform provides the development environment required by machine learning in the form of images, and the computing resources required by developers can be scheduled effectively through cluster management.
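To make the resource module concrete, the following is a minimal sketch, not the platform's actual code, of how a GPU-enabled container could be started through the Docker SDK for Python on a host where nvidia-docker is installed; the image name and the command are placeholders.

```python
# Minimal sketch (assumed image name and command): start a detached container
# through the nvidia runtime installed by nvidia-docker so it can see the GPUs.
import docker

client = docker.from_env()

container = client.containers.run(
    "tensorflow/tensorflow:latest-gpu",   # hypothetical GPU-enabled base image
    command="nvidia-smi",
    runtime="nvidia",                     # route the container through nvidia-docker
    detach=True,
)
container.wait()               # block until the command finishes
print(container.logs().decode())
container.remove()             # destroy the container and release its resources
```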
The online experiment module 102 takes the Jupyter Notebook service as its core, manages user permissions with JupyterHub, and provides developers with basic services such as code and data maintenance, editing and running; aspects such as JupyterHub authentication, parameter delivery and service start-up can be customized according to the actual situation.
Through the above arrangement, the online experiment module 102 implements the following functions (a configuration sketch illustrating several of these points follows the list):
(1) the system 10 of computing services for machine learning uses a white-list authentication mechanism;
(2) a page for parameter transmission lets a developer request computing resources such as memory, CPU and GPU according to need;
(3) the Jupyter Notebook start-up image is customized, and the machine learning services obtain GPU resources through nvidia-docker;
(4) Jupyter Notebook is started in a container, and the basic services mount the public data sets and the developer's private data;
(5) a developer's operations on the basic services are persisted, and the data in the data center changes accordingly;
(6) when the developer exits the online basic services, the corresponding container is destroyed and its resources are released, so that computing resources are used effectively.
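As a rough illustration of points (1), (3), (4) and (6) above, the following jupyterhub_config.py sketch assumes JupyterHub is combined with the dockerspawner package; the user names, image name and mount paths are placeholders, and the white-list option is called allowed_users in recent JupyterHub releases (whitelist in older ones). It is a sketch of the configuration idea, not the platform's actual configuration.

```python
# jupyterhub_config.py -- minimal sketch under the assumptions stated above.
c = get_config()  # noqa: F821  (this name is provided by JupyterHub at load time)

# (1) white-list authentication: only listed users may log in
c.Authenticator.allowed_users = {"alice", "bob"}   # hypothetical user names

# (4) start each user's Jupyter Notebook in its own container,
#     mounting the public data set and the user's private data
c.JupyterHub.spawner_class = "dockerspawner.DockerSpawner"
c.DockerSpawner.image = "ml-platform/notebook-gpu:latest"       # hypothetical image
c.DockerSpawner.volumes = {
    "/datacenter/public": "/data/public",                       # shared data-center path
    "/datacenter/users/{username}": "/data/private",            # per-user private data
}

# (3) hand GPUs to the notebook container via the nvidia runtime
c.DockerSpawner.extra_host_config = {"runtime": "nvidia"}

# (6) destroy the container and release resources when the user stops the service
c.DockerSpawner.remove = True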
The training and testing module 104 trains and tests machine learning tasks submitted by users and is implemented with Docker containers. The scheduling policy of the training and testing module 104 is a first-in-first-out policy. Docker is an open-source application container engine that lets developers package an application and its dependencies into a portable container, distribute it to any popular Linux machine, and thereby achieve virtualization. Containers are fully sandboxed and have no interfaces to one another.
The training and testing module 104 includes: a parameter interface submodule for providing an interface for parameter transmission so that the machine learning task keeps running continuously; and an API interface module for providing an API (application programming interface) for the machine learning task.
Through the above arrangement, the training and testing module 104 can implement the following functions (a simplified scheduling sketch follows the list):
(1) the various machine learning computing tasks submitted by users are scheduled and run under a first-in-first-out scheduling policy while high availability is ensured, and the framework provides an API interface for task management;
(2) an interface for transferring code and data sets is provided: a user only needs to supply the corresponding data-center path to mount data into the container that runs the machine learning task; allowing data sets to be mapped into containers gives the platform great flexibility and is convenient for developers;
(3) an interface for parameter transmission is provided: as long as the system 10 of computing services for machine learning has sufficient resources and the number of the user's running tasks does not exceed the upper limit set by the usage permission, the machine learning task keeps running in the cloud platform until it completes or terminates with an error;
(4) while a task is running, the platform displays its running state and log in real time, and the logs produced at runtime are saved in the cloud platform;
(5) developers can share a configured development environment by committing an image, which saves other developers' time and helps core algorithms be implemented quickly.
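The following simplified sketch illustrates only the scheduling idea of points (1) to (3): tasks are dispatched in first-in-first-out order and a user may not exceed a fixed number of running tasks. The names, the limit and the run_in_container stub are illustrative, not part of the patent.

```python
# Simplified FIFO scheduling sketch with a per-user running-task limit.
from collections import deque, defaultdict
from dataclasses import dataclass

MAX_RUNNING_PER_USER = 2   # hypothetical limit taken from the user's usage permission

@dataclass
class Task:
    user: str
    code_path: str      # data-center path of the user's code
    data_path: str      # data-center path to mount into the container

pending = deque()              # FIFO queue of submitted tasks
running = defaultdict(int)     # user -> number of currently running tasks

def run_in_container(task):
    """Stub: here the platform would start a Docker container for the task."""
    print(f"run {task.code_path} for {task.user}, mounting {task.data_path}")

def submit(task):
    """Queue a task; the deque preserves first-in-first-out order."""
    pending.append(task)

def schedule_once():
    """Dispatch tasks in FIFO order, deferring users already at their limit."""
    deferred = deque()
    while pending:
        task = pending.popleft()
        if running[task.user] < MAX_RUNNING_PER_USER:
            running[task.user] += 1
            run_in_container(task)
        else:
            deferred.append(task)   # stays queued, relative order preserved
    pending.extend(deferred)

submit(Task("alice", "/datacenter/alice/train.py", "/datacenter/public/mnist"))
schedule_once()
```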
The remote debugging module 106 is used to provide the user with a remote debugging function for the training and testing module. The remote debugging module 106 uses the VNC remote display principle on top of the Mesos cluster management framework, so that Docker containers have a remote debugging function.
The remote debugging module 106 includes: a port submodule for providing an IP address and a port, so that the user can perform remote debugging by accessing the IP address and port returned by the backend.
In the embodiment of the present invention, the remote debugging module 106 is implemented on a Mesos cluster management framework. With the development of the internet, big data computing frameworks keep appearing: distributed computing frameworks such as MapReduce for offline processing, Storm for online processing, the iterative computing framework Spark and the stream processing framework S4 each solve problems in particular applications, and an internet company may employ several different frameworks at once. Considering resource utilization, operation and maintenance cost, data sharing and similar factors, different computing frameworks are often deployed in a common cluster to share cluster resources; because different tasks need different resources (CPU, memory, network I/O and so on), running them in the same cluster without unified management leads to interference, resource contention and low efficiency. Unified resource management and scheduling platforms were therefore born, with two typical representatives: Mesos and YARN. Mesos is an open-source distributed resource management framework under Apache and is referred to as the kernel of the distributed system; it was originally developed by AMPLab at the University of California, Berkeley, and was later widely used at Twitter.
The remote debugging module 106 provides container remote debugging by using the Mesos cluster management framework together with the VNC (Virtual Network Computing) remote display principle. Specifically, the following functions are provided (a port-mapping sketch follows the list).
(1) A Docker image with VNC software installed is used as the base image; the experimental environment is configured, compilation software is installed, and a remote debugging image is customized according to the user's requirements.
(2) An interface for data mounting is provided, so that data can be selected from the data center; downloading in git format is supported, giving users flexibility in operating and maintaining data.
(3) An interface for parameter transmission is provided: a user can request CPU, GPU and memory resources according to the computing resources needed, and the Mesos backend starts the corresponding remote debugging service when the backend computing resource permission is satisfied and the user's number of running tasks is below the upper limit set by the usage permission.
(4) A state display function is provided: different running states such as 'waiting', 'preparing' and 'running' are returned to the user according to the stage of the backend, and once the running state is reached, the user can carry out remote debugging experiments by accessing the IP address and port returned by the backend.
(5) Ports are isolated between containers on a single server, and dynamically allocated port numbers are exposed according to the number of ports the user needs to access, so that the user can reach the remote debugging desktop and whatever internal services need to be opened.
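The dynamic port mapping of points (4) and (5) could look roughly like the following Docker SDK sketch; the VNC-capable image name and the internal VNC port 5901 are assumptions, not values defined by the patent.

```python
# Sketch of the dynamic port-mapping idea only, using the Docker SDK for Python.
import docker

client = docker.from_env()

debug = client.containers.run(
    "ml-platform/vnc-debug:latest",      # hypothetical image with VNC software installed
    detach=True,
    runtime="nvidia",                    # GPU access for debugging, as in the other modules
    ports={"5901/tcp": None},            # None: Docker picks a free host port dynamically
    volumes={"/datacenter/public": {"bind": "/data", "mode": "ro"}},
)

debug.reload()                           # refresh attributes to see the assigned port
binding = debug.attrs["NetworkSettings"]["Ports"]["5901/tcp"][0]
print(f"connect your VNC client to {binding['HostIp']}:{binding['HostPort']}")
```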
According to the system of computing services for machine learning disclosed by the invention, the machine learning experiment environment is effectively isolated through virtualization and the pooling of computing resources, the limited computing resources in the cluster such as CPU, memory and GPU are used effectively, and operation and maintenance are simplified by uniformly monitoring the usage of cluster computing resources. The environment deployment and data selection of a machine learning task can be carried out simply and quickly, the special requirements of machine learning tasks are met, and the isolation of the machine learning experiment environment is guaranteed.
According to the system of computing services for machine learning of the invention, JupyterHub is used for multi-user instance management; a remote debugging service is provided, dynamically opening a virtual desktop so that developers can conveniently debug code with a visual IDE (integrated development environment); the platform provides an interface for running machine learning tasks in batches and can show task states and running results in real time; and a data center, an image repository and a private repository are provided, whose data can be called directly by experiments and tasks, which is convenient for developers.
FIG. 2 is a block diagram illustrating a system of computing services for machine learning, according to another example embodiment. The system 20 of computing services for machine learning includes the online experiment module 102, the training and testing module 104 and the remote debugging module 106 shown in FIG. 1, and further includes:
the authentication and authorization module 202 is used for determining the validity of the user and providing different levels of use rights for different users. The authentication authorization module 202 is an entrance for entering a system of computing services for machine learning, and using resources of the system of computing services for machine learning, and determines a valid user through authentication of information such as a user name, a password, a job number, and the like; in the authorization aspect, different use authorities are given according to different user levels, and the maximum computing resource application number, the maximum running task number and the like are utilized.
The image repository 204 is used to provide compilation environments for users. Images are created according to the Dockerfile of the machine learning training/testing model; these images provide the most basic compilation environment and form the foundation for the remote debugging module and the training/testing module.
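As an illustration only, an image could be built from such a Dockerfile and pushed to the repository with the Docker SDK for Python as sketched below; the build path, tag and registry address are placeholders, not the platform's real values.

```python
# Minimal sketch: build a training/testing image from a Dockerfile and push it.
import docker

client = docker.from_env()

image, build_logs = client.images.build(
    path="./train-env",                                   # directory containing the Dockerfile
    tag="registry.example.com/ml/train-base:1.0",         # hypothetical registry and tag
)
for entry in build_logs:                                  # streamed build output, line by line
    if "stream" in entry:
        print(entry["stream"], end="")

client.images.push("registry.example.com/ml/train-base", tag="1.0")
```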
The data center 206 is used to store public data sets. This service allows developers to select data according to the project category they belong to.
The private repository 208 is used to provide an image collection function. A developer can select development-related images from the image repository, and the selected images are synchronized into the image names selectable in the remote debugging and training/evaluation modules.
In addition, the system 20 of computing services for machine learning includes functional modules (not shown in the figure) for cluster resource scheduling, user management, resource monitoring, and the like. The modules are independent of one another and interact in corresponding ways, so modifying one module does not affect the platform as a whole.
The cluster resource scheduling module in the present application adopts a resource scheduling framework based on Mesos; Kubernetes and other similar open-source resource scheduling software can be used instead and bring the same effect, and the invention is not limited in this respect.
FIG. 3 is a flow chart illustrating a method for machine learning, according to an example embodiment.
As shown in fig. 3, in S302, machine learning code is acquired.
In S304, the machine learning code is uploaded to a predetermined system. The predetermined system comprises: an online experiment module for providing basic services for a user's machine learning code, the basic services including data maintenance, data editing and code running; a training and testing module, implemented with Docker containers, for training and testing machine learning tasks submitted by the user; and a remote debugging module for providing the user with a remote debugging function for the training and testing module.
In S306, the machine learning code is trained and tested based on the predetermined system.
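Seen from the client side, the S302 to S306 flow might look like the following sketch; the base URL, endpoints and JSON fields are hypothetical, since the patent does not define a concrete API.

```python
# Hypothetical client-side illustration of the S302-S306 flow.
import time
import requests

BASE = "http://ml-platform.example.com/api"     # hypothetical platform address

# S302/S304: acquire the machine learning code locally and upload it
with open("train.py", "rb") as code:
    resp = requests.post(f"{BASE}/code", files={"file": code})
code_path = resp.json()["path"]

# S306: submit a training/testing task and poll its state until it finishes
task = requests.post(f"{BASE}/tasks", json={
    "code_path": code_path,
    "data_path": "/datacenter/public/mnist",    # hypothetical data-center path
    "resources": {"cpu": 4, "memory_gb": 8, "gpu": 1},
}).json()

while True:
    state = requests.get(f"{BASE}/tasks/{task['id']}").json()["state"]
    print("task state:", state)
    if state in ("finished", "error"):
        break
    time.sleep(10)
```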
It should be clearly understood that the present disclosure describes how to make and use particular examples, but the principles of the present disclosure are not limited to any details of these examples. Rather, these principles can be applied to many other embodiments based on the teachings of the present disclosure.
Those skilled in the art will appreciate that all or part of the steps implementing the above embodiments are implemented as computer programs executed by a CPU. The computer program, when executed by the CPU, performs the functions defined by the method provided by the present invention. The program may be stored in a computer readable storage medium, which may be a read-only memory, a magnetic or optical disk, or the like.
Furthermore, it should be noted that the above-mentioned figures are only schematic illustrations of the processes involved in the method according to exemplary embodiments of the invention, and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.
FIG. 4 is a block diagram illustrating an electronic device in accordance with an example embodiment.
An electronic device 200 according to this embodiment of the invention is described below with reference to fig. 4. The electronic device 200 shown in fig. 4 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 4, the electronic device 200 is embodied in the form of a general purpose computing device. The components of the electronic device 200 may include, but are not limited to: at least one processing unit 210, at least one memory unit 220, a bus 230 connecting different system components (including the memory unit 220 and the processing unit 210), a display unit 240, and the like.
The storage unit stores program code executable by the processing unit 210, so that the processing unit 210 performs the steps according to various exemplary embodiments of the present invention described in the method sections of this specification. For example, the processing unit 210 may perform the steps shown in fig. 3.
The memory unit 220 may include readable media in the form of volatile memory units, such as a random access memory unit (RAM)2201 and/or a cache memory unit 2202, and may further include a read only memory unit (ROM) 2203.
The storage unit 220 may also include a program/utility 2204 having a set (at least one) of program modules 2205, such program modules 2205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Bus 230 may be one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 200 may also communicate with one or more external devices 300 (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 200, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 200 to communicate with one or more other computing devices. Such communication may occur via an input/output (I/O) interface 250. Also, the electronic device 200 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet) via the network adapter 260. The network adapter 260 may communicate with other modules of the electronic device 200 via the bus 230. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 200, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB flash drive, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, or a network device, etc.) to execute the above-mentioned method according to the embodiments of the present disclosure.
Fig. 5 schematically illustrates a computer-readable storage medium in an exemplary embodiment of the disclosure.
Referring to fig. 5, a program product 500 for implementing the above method according to an embodiment of the present invention is described, which may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on a terminal device, such as a personal computer. However, the program product of the present invention is not limited in this regard and, in the present document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium other than a readable storage medium that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java or C++ and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (for example, through the internet using an internet service provider).
The computer readable medium carries one or more programs which, when executed by a device, cause the device to perform the following functions: acquiring machine learning code; uploading the machine learning code to a predetermined system; and training and testing the machine learning code based on the predetermined system; wherein the predetermined system comprises: an online experiment module for providing basic services for the user's machine learning code, the basic services including data maintenance, data editing and code running; a training and testing module, implemented with Docker containers, for training and testing machine learning tasks submitted by the user; and a remote debugging module for providing the user with a remote debugging function for the training and testing module.
Those skilled in the art will appreciate that the modules described above may be distributed in the apparatus according to the description of the embodiments, or may be modified accordingly in one or more apparatuses unique from the embodiments. The modules of the above embodiments may be combined into one module, or further split into multiple sub-modules.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiment of the present invention can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which can be a personal computer, a server, a mobile terminal, or a network device, etc.) to execute the method according to the embodiment of the present invention.
Exemplary embodiments of the present invention are specifically illustrated and described above. It is to be understood that the invention is not limited to the precise construction, arrangements, or instrumentalities described herein; on the contrary, the invention is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.
In addition, the structures, proportions and sizes shown in the drawings of this specification are only used to match the content disclosed in the specification, so that it can be understood and read by those skilled in the art; they do not limit the conditions under which the disclosure can be implemented. Changes of structure, proportion or size that do not affect the technical effects and purposes achievable by the disclosure therefore still fall within the scope covered by the disclosed technical content. Likewise, terms such as "above", "first", "second" and "a" used in this specification are for clarity of description only and are not intended to limit the scope of the disclosure; changes or adjustments of their relative relationships, without substantial technical changes, are also regarded as within the implementable scope of the invention.

Claims (12)

1. A system of computing services for machine learning, comprising:
the online experiment module is used for providing basic service for machine learning codes of users, and the basic service comprises data maintenance, data editing and code operation;
the training and testing module is used for training and testing a machine learning task submitted by a user and is realized by adopting a Docker container; and
the remote debugging module is used for providing a remote debugging function for the training test module for a user;
the data center is used for storing public data sets;
the training test module comprises: the parameter interface submodule is used for providing an interface for parameter transmission, and the interface for parameter transmission keeps the number of the user's running machine learning tasks below a specified upper limit, so that the machine learning task keeps running continuously; the API interface module is used for providing an API interface for the machine learning task; and the code and data set transfer interface module is used for providing an interface that maps the public data set into a container of the machine learning task.
2. The system of claim 1, further comprising:
the authentication and authorization module is used for determining the legality of the user and providing different levels of use permission for different users;
the image repository is used for providing a compiling environment for the user; and
the private repository is used for providing an image collection function.
3. The system of claim 1, wherein the online experiment module manages the permissions of users through JupyterHub.
4. The system of claim 3, wherein the online experiment module comprises:
the image start-up module is used for processing and starting images through Jupyter Notebook; and
and the resource module is used for enabling the basic service to obtain GPU resources through the nvidia-docker.
5. The system of claim 4, wherein the manner in which the Jupyter Notebook program in the online experiment module is started comprises: a container start-up mode.
6. The system of claim 1, wherein the remote debugging module uses the VNC remote display principle through a Mesos cluster management framework to enable the Docker container to have a remote debugging function.
7. The system of claim 1, wherein the remote debugging module comprises:
and the port submodule is used for providing an IP address and a port so that the user can carry out remote debugging operation by accessing the IP address and the port returned by the background.
8. The system of claim 1, wherein the scheduling policy of the training test module is a first-in-first-out policy.
9. The system of claim 2, wherein the image repository comprises:
a creation submodule for creating an image according to the Dockerfile of the training test module.
10. A method for machine learning, comprising:
acquiring a machine learning code;
uploading the machine learning code to a predetermined system;
training and testing the machine learning code based on the predetermined system;
wherein the predetermined system comprises:
the online experiment module is used for providing basic service for machine learning codes of users, and the basic service comprises data maintenance, data editing and code operation;
the training and testing module is used for training and testing a machine learning task submitted by a user and is realized by adopting a Docker container; and
the remote debugging module is used for providing a remote debugging function for the training test module for a user;
the data center is used for storing public data sets;
wherein the training test module comprises: the parameter interface submodule is used for providing an interface for parameter transmission, and the interface for parameter transmission keeps the number of the user's running machine learning tasks below a specified upper limit, so that the machine learning task keeps running continuously; the API interface module is used for providing an API interface for the machine learning task; and the code and data set transfer interface module is used for providing an interface that maps the public data set into a container of the machine learning task.
11. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs;
when executed by the one or more processors, cause the one or more processors to implement the method as recited in claim 10.
12. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method as claimed in claim 10.
CN201711391831.6A 2017-12-21 2017-12-21 System of computing services for machine learning and method for machine learning Active CN109961151B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711391831.6A CN109961151B (en) 2017-12-21 2017-12-21 System of computing services for machine learning and method for machine learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711391831.6A CN109961151B (en) 2017-12-21 2017-12-21 System of computing services for machine learning and method for machine learning

Publications (2)

Publication Number Publication Date
CN109961151A CN109961151A (en) 2019-07-02
CN109961151B true CN109961151B (en) 2021-05-14

Family

ID=67018582

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711391831.6A Active CN109961151B (en) 2017-12-21 2017-12-21 System of computing services for machine learning and method for machine learning

Country Status (1)

Country Link
CN (1) CN109961151B (en)

Families Citing this family (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110516934A (en) * 2019-08-13 2019-11-29 湖南智擎科技有限公司 Intelligent big data practical training method and system based on scalable cluster
CN110990864B (en) * 2019-11-27 2023-01-10 支付宝(杭州)信息技术有限公司 Report authority management method, device and equipment
CN113033814A (en) * 2019-12-09 2021-06-25 北京中关村科金技术有限公司 Method, apparatus and storage medium for training machine learning model
CN111901294A (en) * 2020-06-09 2020-11-06 北京迈格威科技有限公司 Method for constructing online machine learning project and machine learning system
CN112434284B (en) * 2020-10-29 2022-05-17 格物钛(上海)智能科技有限公司 Machine learning training platform implementation based on sandbox environment
CN112311605B (en) * 2020-11-06 2023-12-22 北京格灵深瞳信息技术股份有限公司 Cloud platform and method for providing machine learning service
CN112417358A (en) * 2020-12-03 2021-02-26 合肥中科类脑智能技术有限公司 AI model training on-line practical training learning system and method
CN112560244B (en) * 2020-12-08 2021-12-10 河海大学 Virtual simulation experiment system and method based on Docker
CN112463389A (en) * 2020-12-10 2021-03-09 中国科学院深圳先进技术研究院 Resource management method and device for distributed machine learning task
CN112633501A (en) * 2020-12-25 2021-04-09 深圳晶泰科技有限公司 Development method and system of machine learning model framework based on containerization technology
CN113190238A (en) * 2021-03-26 2021-07-30 曙光信息产业(北京)有限公司 Framework deployment method and device, computer equipment and storage medium
CN112966833B (en) * 2021-04-07 2023-01-31 福州大学 Machine learning model platform based on Kubernetes cluster
CN113377529B (en) * 2021-05-24 2024-04-19 阿里巴巴创新公司 Intelligent acceleration card and data processing method based on intelligent acceleration card
CN114167748B (en) * 2021-10-26 2024-04-09 北京航天自动控制研究所 Flight control algorithm integrated training platform
CN114594893A (en) * 2022-01-17 2022-06-07 阿里巴巴(中国)有限公司 Performance analysis method and device, electronic equipment and computer readable storage medium
CN114996117B (en) * 2022-03-28 2024-02-06 湖南智擎科技有限公司 Client GPU application evaluation system and method for SaaS mode
CN114841298B (en) * 2022-07-06 2022-09-27 山东极视角科技有限公司 Method and device for training algorithm model, electronic equipment and storage medium
CN117234954B (en) * 2023-11-14 2024-02-06 杭银消费金融股份有限公司 Intelligent online testing method and system based on machine learning algorithm

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105824614A (en) * 2015-12-15 2016-08-03 广东亿迅科技有限公司 Building method and device for distributed development environment based on Docker
CN106873975A (en) * 2016-12-30 2017-06-20 武汉默联股份有限公司 Devops based on Docker persistently pays and automated system and method
CN106961351A (en) * 2017-03-03 2017-07-18 南京邮电大学 Intelligent elastic telescopic method based on Docker container clusters
CN107038482A (en) * 2017-04-21 2017-08-11 上海极链网络科技有限公司 Applied to AI algorithm engineerings, the Distributed Architecture of systematization
CN107066310A (en) * 2017-03-11 2017-08-18 郑州云海信息技术有限公司 It is a kind of to build and using the method and device in the privately owned warehouses of safe Docker
CN107229520A (en) * 2017-04-27 2017-10-03 北京数人科技有限公司 Data center operating system
CN107450961A (en) * 2017-09-22 2017-12-08 济南浚达信息技术有限公司 A kind of distributed deep learning system and its building method, method of work based on Docker containers

Also Published As

Publication number Publication date
CN109961151A (en) 2019-07-02

Similar Documents

Publication Publication Date Title
CN109961151B (en) System of computing services for machine learning and method for machine learning
US10552161B2 (en) Cluster graphical processing unit (GPU) resource sharing efficiency by directed acyclic graph (DAG) generation
Bentaleb et al. Containerization technologies: Taxonomies, applications and challenges
US10409654B2 (en) Facilitating event-driven processing using unikernels
US9886303B2 (en) Specialized micro-hypervisors for unikernels
US10310908B2 (en) Dynamic usage balance of central processing units and accelerators
Azab Enabling docker containers for high-performance and many-task computing
Scolati et al. A containerized big data streaming architecture for edge cloud computing on clustered single-board devices
US20130268638A1 (en) Mapping requirements to a system topology in a networked computing environment
US10394971B2 (en) Hybrid simulation of a computing solution in a cloud computing environment with a simplified computing solution and a simulation model
JP2016517120A (en) Control runtime access to application programming interfaces
US8938712B2 (en) Cross-platform virtual machine and method
US8825862B2 (en) Optimization of resource provisioning in a networked computing environment
CN116414518A (en) Data locality of big data on Kubernetes
US11010149B2 (en) Shared middleware layer containers
US20170262405A1 (en) Remote direct memory access-based on static analysis of asynchronous blocks
WO2022078060A1 (en) Tag-driven scheduling of computing resources for function execution
Carvalho et al. Towards a dataflow runtime environment for edge, fog and in-situ computing
US11163603B1 (en) Managing asynchronous operations in cloud computing environments
CN114860401A (en) Heterogeneous cloud desktop scheduling system, method, service system, device and medium
CN114579250A (en) Method, device and storage medium for constructing virtual cluster
Kumar et al. Machine translation system as virtual appliance: for scalable service deployment on cloud
Kim et al. RETRACTED ARTICLE: Simulator considering modeling and performance evaluation for high-performance computing of collaborative-based mobile cloud infrastructure
CN109271179A (en) Virtual machine application management method, device, equipment and readable storage medium storing program for executing
US11983561B2 (en) Configuring hardware multithreading in containers

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant