WO2022078280A1 - Artificial intelligence (ai) training method, system, and device - Google Patents

Info

Publication number
WO2022078280A1
WO2022078280A1 PCT/CN2021/123021 CN2021123021W WO2022078280A1 WO 2022078280 A1 WO2022078280 A1 WO 2022078280A1 CN 2021123021 W CN2021123021 W CN 2021123021W WO 2022078280 A1 WO2022078280 A1 WO 2022078280A1
Authority
WO
WIPO (PCT)
Prior art keywords
training
cloud
information
code
environment
Prior art date
Application number
PCT/CN2021/123021
Other languages
French (fr)
Chinese (zh)
Inventor
陈普
Original Assignee
华为云计算技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为云计算技术有限公司
Publication of WO2022078280A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/36Preventing errors by testing or debugging software
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/30Creation or generation of source code
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/60Software deployment
    • G06F8/61Installation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/70Software maintenance or management
    • G06F8/71Version control; Configuration management
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Definitions

  • the present application relates to the technical field of artificial intelligence (AI), and in particular, to a method, system and device for AI training.
  • Artificial intelligence (AI) is applied in all walks of life.
  • Typical AI technologies include deep learning technology, machine learning technology, etc.
  • The basic idea is to design an AI model and train it on training data, so that the trained AI model has certain functions, such as object detection or target recognition.
  • An AI model is an algorithm implemented using AI technology, such as a deep learning model based on deep learning technology.
  • To facilitate the development and training of AI models, various AI frameworks have emerged in the industry. Developers are usually accustomed to installing an AI framework on a local device and developing and training AI models based on that framework. Because training an AI model relies on strong computing power, an AI model developed on the local device based on the AI framework is usually limited by the computing power of the local device.
  • This application provides an artificial intelligence AI training method.
  • When the training code on the local device is running, the method uploads the training task information to the cloud training system and uses cloud resources to train the AI model, so that the development of AI models on the local device is no longer limited by the computing power of the local device.
  • the present application provides an AI training method, which is applied to a guidance system.
  • After a user of the cloud training system triggers the running of the training code on the local device, the guidance system performs the following steps: obtaining training task information according to the training code running on the local device, where the training code is used to train an AI model, and the AI model is developed by the user based on the AI framework installed on the local device; uploading the training task information to the cloud training system; and notifying the cloud training system to execute the training task corresponding to the training task information.
  • the guidance system in this application can perform the operations of acquiring training task information and uploading training task information to the cloud training system.
  • This method makes it possible to train the AI model with resources on the cloud while preserving the user's habit of developing and training models locally, which solves the problem of insufficient resources for training on local devices and brings convenience to users.
  • the aforementioned training task information includes the information obtained from the training code by using the obtaining component in the AI framework, and the information obtained from the local device according to the information in the training code.
  • The acquisition component in the AI framework can obtain information by reading the training code and by intercepting the API calls made in the training code while the training code is running. Obtaining the information in the training code through the acquisition component allows the guidance system to acquire the training task information faster.
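  • As a non-limiting sketch of the interception idea described above, a wrapper can record the arguments of a training API call before forwarding the call. The names `intercept`, `captured_calls` and `fit` are hypothetical and do not come from any particular AI framework; the application does not specify how interception is implemented.

```python
import functools

# Hypothetical store for information captured while the training code runs.
captured_calls = []


def intercept(func):
    """Record the name and arguments of each call, then run the original function."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        captured_calls.append({"api": func.__name__, "args": args, "kwargs": kwargs})
        return func(*args, **kwargs)
    return wrapper


@intercept
def fit(model_path, learning_rate=0.01, batch_size=32):
    """Stand-in for a framework training API (an assumption for illustration)."""
    return "training started"
```

  • In this sketch, calling `fit("model.bin", learning_rate=0.1)` leaves one record in `captured_calls` that holds the model address and the training parameter, which is the kind of initial information the acquisition component could pass on.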
  • the aforementioned training task information includes one or more of the following data: training parameters in the training code, the AI model to be trained, and the training code used for training the AI model
  • the uploaded training task information enables the cloud training system to smoothly prepare the cloud training environment and execute the training task in the cloud environment.
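  • For illustration only, the training task information could be grouped into a single record before upload. The class name, the field names and the JSON encoding below are assumptions; the application does not prescribe a data format.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class TrainingTaskInfo:
    """Hypothetical container for the items listed as training task information."""
    training_params: dict                             # e.g. learning rate, batch size
    model_ref: str                                    # AI model to be trained (reference or encoded content)
    training_logic: str                               # training program logic from the training code
    environment: dict = field(default_factory=dict)   # AI framework and language versions
    access_info: dict = field(default_factory=dict)   # cloud training access information


def to_payload(info: TrainingTaskInfo) -> str:
    """Serialize the record to a JSON string suitable for upload."""
    return json.dumps(asdict(info), sort_keys=True)
```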
  • the training environment information of the local device includes: version information of the AI framework, and version information of the programming language of the training code.
  • Uploading this training environment information enables the cloud training system, when preparing the cloud environment, to prepare in advance the versions of the AI framework and programming language that match the AI model to be trained and the training code.
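  • A minimal sketch of collecting such version information, assuming the training code is written in Python and the AI framework is installed as a Python package. The function name and the dictionary keys are hypothetical.

```python
import platform
from importlib import metadata


def collect_environment_info(framework_package: str) -> dict:
    """Return the language version and the installed version of the given package."""
    try:
        framework_version = metadata.version(framework_package)
    except metadata.PackageNotFoundError:
        # The package is not installed on this device.
        framework_version = None
    return {
        "language": "python",
        "language_version": platform.python_version(),
        "framework": framework_package,
        "framework_version": framework_version,
    }
```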
  • The method performed by the guidance system further includes: receiving a training data acquisition request sent by the cloud training system during the execution of the training task; and acquiring the training data and sending the training data to the cloud training system.
  • the above method makes it unnecessary for the guidance system to upload training data or upload all training data to the cloud training system before the cloud training system starts to execute the training task, avoids too long waiting for transmission before the cloud training task starts, and improves the user experience.
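  • The on-demand transfer described above can be sketched as a generator that hands over one chunk of training data per acquisition request, so nothing has to be uploaded before training starts. The function name and the chunking policy are assumptions for illustration.

```python
from typing import Iterator, List, Sequence


def serve_training_data(dataset: Sequence, chunk_size: int) -> Iterator[List]:
    """Yield the dataset in consecutive chunks, one chunk per acquisition request."""
    for start in range(0, len(dataset), chunk_size):
        yield list(dataset[start:start + chunk_size])
```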
  • the method before notifying the cloud training system to execute the training task corresponding to the training task information, the method further includes: receiving an environment preparation success response returned by the cloud training system.
  • the method further includes: receiving a trained AI model returned by the cloud training system.
  • The guidance system can store the trained AI model in the local device, and can also provide prompts to the user, such as where the trained AI model is stored on the local device, so that the user can obtain the trained AI model more conveniently, which facilitates the user's subsequent application of the trained AI model.
  • the guidance system may be obtained from a cloud training system and installed in the local device.
  • The cloud training system can provide an interface for downloading the guidance system.
  • The present application also provides an AI training method, which is applied to a cloud training system, including: acquiring training task information sent by the guidance system after a user triggers running of a training code on a local device, where the training task information includes training environment information of the local device; preparing a cloud training environment according to the training environment information; and executing the training task corresponding to the training task information based on the cloud training environment.
  • The training environment information of the local device includes: version information of the AI framework on which the AI model to be trained depends, and version information of the programming language used by the training code that trains the AI model.
  • Preparing the cloud training environment according to the training environment information includes: setting, according to the version information of the AI framework and the version information of the programming language of the training code, the AI framework and programming language that the cloud training environment uses to execute the training task.
  • the training task information further includes: training parameters in the training code, an AI model, and training program logic in the training code for training the AI model; based on the cloud training environment Performing the training task corresponding to the training task information includes: performing the training of the AI model in the prepared cloud training environment according to the training parameters and the training program logic.
  • The training task information further includes cloud training access information; before preparing the cloud training environment according to the training environment information, the method further includes: performing, according to the cloud training access information, authentication and/or a charging query on the training task corresponding to the training task information.
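  • The cloud-side order of operations (authenticate first, then prepare a matching environment) can be sketched as follows. The token check, the image lookup table and all key names are hypothetical simplifications, not the authentication or provisioning mechanism of any real cloud service.

```python
def handle_training_task(task_info: dict, valid_tokens: set, available_images: dict) -> dict:
    """Authenticate the task, then select an environment matching the local versions."""
    token = task_info["access_info"]["token"]
    if token not in valid_tokens:
        return {"status": "rejected", "reason": "authentication failed"}

    env = task_info["environment"]
    key = (env["framework"], env["framework_version"], env["language_version"])
    image = available_images.get(key)
    if image is None:
        return {"status": "rejected", "reason": "no matching environment"}

    # Environment preparation succeeded; training may be started on notification.
    return {"status": "environment_ready", "image": image}
```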
  • The present application further provides a guidance system, comprising: an acquisition module configured to obtain training task information according to the training code running on the local device after a user of the cloud training system triggers the running of the training code on the local device, where the training code is used to train an AI model, and the AI model is developed by the user based on the AI framework installed on the local device; and a sending module configured to upload the training task information to the cloud training system and notify the cloud training system to execute the training task corresponding to the training task information.
  • The training task information includes information obtained from the training code by using an obtaining component in the AI framework, and information obtained from the local device according to the information in the training code.
  • the training task information includes one or more of the following data: training parameters in the training code, the AI model, and data used in the training code for all
  • the training environment information of the local device includes: version information of the AI framework, and version information of the programming language of the training code.
  • The system further includes a receiving unit configured to receive a training data acquisition request sent by the cloud training system during the execution of the training task; the obtaining unit is further configured to obtain the training data; and the sending unit is further configured to send the training data to the cloud training system.
  • The system further includes a receiving unit configured to receive an environment preparation success response returned by the cloud training system before the sending unit notifies the cloud training system to execute the training task corresponding to the training task information.
  • system further includes a receiving unit, where the receiving unit is configured to receive the trained AI model returned by the cloud training system.
  • the guidance system is obtained from the cloud training system and installed in the local device.
  • The present application further provides a cloud training system, including: an environment preparation unit configured to acquire training task information sent by the guidance system after a user triggers running of a training code on a local device, where the training task information includes training environment information of the local device, and to prepare a cloud training environment according to the training environment information; and a training task execution unit configured to execute the training task corresponding to the training task information based on the cloud training environment.
  • The training environment information of the local device includes: version information of the AI framework on which the AI model to be trained depends, and version information of the programming language of the training code used to train the AI model.
  • The environment preparation unit is specifically configured to set, according to the version information of the AI framework and the version information of the programming language of the training code, the AI framework and programming language that the cloud training environment uses to execute the training task.
  • the training task information further includes: training parameters in the training code, an AI model, and training program logic in the training code for training the AI model; the training task The execution unit is specifically configured to execute the training of the AI model in the prepared cloud training environment according to the training parameters and the training program logic.
  • The training task information further includes cloud training access information; the environment preparation unit is further configured to perform, according to the cloud training access information, authentication and/or a billing query on the training task corresponding to the training task information.
  • The present application also provides a computing device, comprising a processor and a memory, the memory storing computer instructions, and the processor executing the computer instructions so that the computing device performs the method described in the aforementioned first aspect or possible implementations of the first aspect, or performs the method described in the aforementioned second aspect or possible implementations of the second aspect.
  • the present application further provides a computer-readable storage medium, where the computer-readable storage medium stores computer program codes, and when the computer program codes are executed by a computing device, the computing device executes the aforementioned first aspect or a method in a possible implementation of the first aspect, or perform the aforementioned second aspect or a method in a possible implementation of the second aspect.
  • The computer-readable storage medium includes, but is not limited to, volatile memory, such as random access memory, and non-volatile memory, such as flash memory, a hard disk drive (HDD), or a solid state drive (SSD).
  • the present application further provides a computer program product, the computer program product comprising computer program code, when the computer program code is executed by a computing device, the computing device may perform the aforementioned first aspect or the first aspect.
  • The computer program product may be a software installation package; when the method provided in the foregoing first aspect or possible implementations of the first aspect needs to be used, or the method provided in the foregoing second aspect or possible implementations of the second aspect needs to be used, the computer program product can be downloaded and executed on a computing device.
  • The present application further provides an artificial intelligence (AI) system, including the guidance system described in the foregoing third aspect and possible implementations of the third aspect, and the cloud training system described in the foregoing fourth aspect and possible implementations of the fourth aspect.
  • FIG. 1 is a schematic diagram of a system architecture provided by an embodiment of the present application.
  • FIG. 2 is a schematic structural diagram of a guidance system 106 and a cloud training system 120 provided by an embodiment of the present application;
  • FIG. 3 is a schematic flowchart of an AI training method provided by an embodiment of the present application.
  • FIG. 4 is a schematic flowchart of an authentication and charging query performed by a cloud training system according to training task information according to an embodiment of the present application;
  • FIG. 5 is a schematic flowchart of a cloud training system performing a training task according to an embodiment of the present application;
  • FIG. 6 is a schematic structural diagram of a computing device 300 according to an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of a computing device 400 according to an embodiment of the present application.
  • AI training: the process of using training data to update the parameters in an AI model, so that the AI model learns the features and regularities of the training data and can realize specific application functions.
  • Training code: the code that drives the training of the AI model.
  • The training code defines the training parameters and training program logic used in the AI training process. It can also include the name or address information of the AI model to be trained, the name or address information of the training data set, etc.
  • The training code also includes cloud training access information for connecting with the cloud training system in the cloud.
  • AI framework: a library (or tool) used for the development, training and testing of AI models.
  • The AI framework includes a variety of packaged code components. Each code component is a set of pre-built and optimized functions. In this application, the code components in the AI framework are also called functional components. Developers can use the AI framework to develop and train AI models that meet application goals more quickly, without having to understand in detail the underlying algorithm implementation for AI model construction and training.
  • Typical AI frameworks include deep learning frameworks. There are various types of deep learning frameworks in the industry. Different deep learning frameworks can provide different functions, but they all aim to provide developers with deep learning models.
  • AI technology is developing rapidly and is gradually applied to more and more complex scenarios.
  • the structure of AI models is becoming more and more complex, and many complex scenarios have higher and higher performance requirements for AI models. Therefore, it is often necessary to use a large amount of training data to train AI models with complex structures.
  • The AI development mode that developers are used to is: install the AI framework on the local device; use a local editor, for example a local integrated development environment (IDE), to call the AI framework to build the AI model and write the training code; then use local resources to perform training to obtain the trained AI model.
  • Although some AI frameworks currently support distributed training of AI models, developers still often find that locally developed AI models do not have sufficient training resources locally.
  • The local device in this application refers to the device used by the developer to develop the AI model (for example, a server or virtual machine used by the developer), and/or other devices or device clusters that belong to the same owner as that device.
  • each local device is in a relatively close physical environment, such as a computer room.
  • a remote virtual machine for example, a desktop cloud
  • Cloud service providers use a large number of basic resources (such as computing resources, storage resources, and communication resources) to build a cloud environment, and can provide cloud tenants with various basic resources, platforms or application capabilities, which enables cloud tenants to carry out their own business.
  • The present application provides an AI training method, which can use the basic resources on the cloud to train the locally constructed AI model for the developer while the developer maintains the habit of local AI development.
  • This method can solve the problem of insufficient local training resources, and makes it more convenient and quick for developers to train AI models.
  • The developer described in this application may also be referred to as a user, meaning a person who uses the AI training method of this application. Because the user needs to register on the cloud platform and purchase the cloud training service before using the AI training method of the present application, the user (or developer) in the present application is also a user of the cloud training system.
  • FIG. 1 is a schematic diagram of an exemplary system architecture provided by the present application.
  • the system architecture of the present application includes a local device 100 and a cloud training system 120 in a cloud environment.
  • the local device 100 may be a server, a virtual machine, or a server cluster or a virtual machine cluster owned by the developer, and the developer uses the local device 100 for AI development.
  • the cloud training system 120 in the cloud environment may be a background system corresponding to a cloud training service provided by a cloud service provider, and the cloud training system 120 may use the basic resources in the cloud environment to perform cloud training on AI models developed locally by developers .
  • An editor 102 and an AI framework 104 are installed in the local device 100 , and a guidance system 106 is also installed in the local device 100 .
  • the AI framework 104 is used as a library for AI development, and is used to provide functional components for developers in the process of building and training AI models.
  • the editor 102 is used by the developer to perform editing of the training code, and the editor 102 may also be used to compile the training code.
  • the developer can call the components in the AI framework 104 through the editor 102 in the local device 100 to construct and train the AI model.
  • developers usually develop training code in a local editor, write the training parameters required for training, and the training program logic used during training.
  • the training program logic in the training code instructs how to use the training data used during training and train the AI model to be trained according to the set training parameters.
  • a guidance system 106 is also installed on the above-mentioned local device.
  • the guidance system 106 is configured to acquire training task information when the training code is running, and upload the training task information to the cloud training system 120 in the cloud environment, so that the cloud training system 120 executes the training task corresponding to the training task information.
  • the training task information may include training parameters required to perform training on the AI model, training program logic, training environment information, cloud training access information, and the AI model to be trained.
  • the guidance system 106 described above may be a stand-alone software program installed in the local device 100 or a tool coupled with the AI framework 104 .
  • FIG. 1 of the present application takes the guidance system 106 as an independent software program as an example.
  • the cloud training system 120 in the cloud environment is configured to receive the training task information sent by the guidance system 106, and use the training task information to prepare the cloud training environment, where the set cloud training environment matches the AI model and the training program logic.
  • the cloud training system 120 is further configured to receive the training notification sent by the guidance system 106, and train the developer's AI model in the set cloud training environment. After the training task is completed, the cloud training system may also return a successful training response and/or return the trained AI model to the guidance system 106 .
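  • The interaction between the guidance system and the cloud training system (upload, environment-ready response, training notification, result) can be sketched as two cooperating objects. The class and method names are hypothetical, and real training is replaced by a placeholder string.

```python
class CloudTrainingSystem:
    """Toy stand-in for the cloud side: prepares an environment, then trains."""

    def __init__(self):
        self.environment_ready = False
        self.task_info = None

    def receive_task_info(self, task_info: dict) -> str:
        self.task_info = task_info
        self.environment_ready = True      # environment preparation is assumed to succeed
        return "environment_ready"

    def receive_training_notification(self) -> dict:
        if not self.environment_ready:
            raise RuntimeError("environment not prepared")
        # Placeholder for executing the training program logic.
        return {"status": "training_succeeded", "model": f"trained-{self.task_info['model']}"}


class GuidanceSystem:
    """Toy stand-in for the local side: uploads, waits for readiness, then notifies."""

    def __init__(self, cloud: CloudTrainingSystem):
        self.cloud = cloud

    def run(self, task_info: dict):
        if self.cloud.receive_task_info(task_info) != "environment_ready":
            return None
        return self.cloud.receive_training_notification()
```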
  • the present application mainly utilizes the guidance system 106 in the local device 100 to interact with the cloud training system 120 in the cloud environment, so as to realize the assignment of training tasks corresponding to the training codes developed in the local device 100 to the cloud training system 120 for execution.
  • the structure and functional division of the guidance system 106 and the cloud training system 120 are described below with reference to FIG. 2 .
  • an acquisition component 1042 is embedded in the AI framework 104 used for the construction and training of the AI model, and the acquisition component 1042 may be a tool, a plug-in or a patch in the AI framework 104.
  • the obtaining component 1042 is configured to communicate with the guidance system 106 and transmit the obtained initial information of the training task to the guidance system 106 . Specifically, the obtaining component 1042 is configured to obtain the initial information of the training task from the training code when the training code is running. In the embodiment of the present application, the information acquired by the acquisition component 1042 in the AI framework is collectively referred to as the initial information of the training task.
  • The initial information of the training task may include two types of information: one is the identification information or address information of the model or data that needs to be used when executing the training task, such as the address information of the AI model to be trained and the address information of the training data; the other is information that can be used directly when performing the training task, such as training parameters and training program logic.
  • the acquisition component 1042 acquires the initial information of the training task and sends it to the guidance system 106 .
  • The guidance system 106 may include a receiving unit 1062, an obtaining unit 1064 and a sending unit 1066. It should be understood that the unit division of the guidance system 106 described above is only an example, and does not constitute a limitation on the guidance system 106 of the present application.
  • The receiving unit 1062 is configured to receive the initial information of the training task sent by the obtaining component 1042.
  • The receiving unit 1062 may also receive instructions or information sent by the cloud training system 120.
  • The obtaining unit 1064 in the guidance system 106 can be configured to obtain training task information from the local device 100 according to the initial information of the training task received by the receiving unit 1062. For example, after the receiving unit 1062 receives the address information of the AI model obtained by the obtaining component 1042 of the AI framework, the obtaining unit 1064 obtains the AI model to be trained according to the address information of the AI model.
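  • A minimal sketch of how the two types of initial information might be turned into training task information: directly usable items are copied as they are, while address items are resolved by reading the referenced content from the local device. The dictionary layout (`direct`, `addresses`) and the function name are assumptions for illustration.

```python
from pathlib import Path
from typing import Callable, Dict


def build_task_info(
    initial_info: Dict,
    read: Callable[[str], bytes] = lambda address: Path(address).read_bytes(),
) -> Dict:
    """Copy directly usable items and replace each address with the content it points to."""
    task_info = dict(initial_info.get("direct", {}))
    for name, address in initial_info.get("addresses", {}).items():
        task_info[name] = read(address)
    return task_info
```

  • The `read` parameter defaults to reading a local file; passing a different callable makes it possible to exercise the logic without touching the file system.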
  • For initial information that can be used directly when performing the training task, the initial information obtained by the obtaining component 1042 is the same as the content of the corresponding training task information sent by the guidance system 106 to the cloud training system 120; this type of information can be used as the training task information to be sent by the guidance system 106 to the cloud training system 120.
  • In the embodiment of the present application, the data and information sent by the guidance system 106 to the cloud training system 120 for performing the training task are collectively referred to as training task information.
  • the sending unit 1066 is configured to send the training task information to the cloud training system 120 .
  • the sending unit 1066 is further configured to send training notifications and/or training data to the cloud training system 120 .
  • the cloud training system 120 includes an environment preparation unit 122 and a training task execution unit 124 .
  • The environment preparation unit 122 is configured to receive the training task information sent by the guidance system 106, and prepare the cloud training environment according to the training task information, for example, setting the resources to be used when executing the training task on the cloud and the libraries that the training depends on.
  • the environment preparation unit 122 may also be configured to return a response message indicating that the environment preparation is complete to the guidance system 106 .
  • the environment preparation unit 122 may also be configured to send partial training task information and/or training task execution instructions to the training task execution unit 124 .
  • the training task execution unit 124 may execute the training task according to the training task execution instruction sent by the environment preparation unit 122 or the guidance system 106 .
  • the training task execution unit 124 mainly executes the training program logic in the training task information to train the AI model to be trained.
  • The training task execution unit 124 is further configured to return a successful training response and/or the trained AI model to the guidance system 106.
  • the developer can purchase the cloud training service through the cloud platform of the cloud service provider and configure the cloud training.
  • The guidance system 106 and the acquisition component 1042 in the AI framework 104 may be software programs or tools developed by a cloud service provider to accompany the cloud service that provides cloud training. After the developer purchases and configures the cloud training service, the above-mentioned guidance system 106 and acquisition component 1042 can be installed on the local device.
  • the developer can run the training code locally to start the training of the AI model performed by the cloud training system 120 .
  • the function of the obtaining component 1042 in the AI framework 104 may also be provided by the guidance system 106 , that is, the aforementioned action of the obtaining component 1042 obtaining the initial information of the training task according to the training code may be performed by the guidance system 106 .
  • The guidance system 106 may include four functional units, namely: an acquisition unit that performs the actions of the acquisition component 1042, and the aforementioned receiving unit 1062, obtaining unit 1064, and sending unit 1066.
  • the foregoing division of the functional units of the guidance system 106 is only an example, and there may be different division manners, which will not be repeated here.
  • FIG. 3 is a schematic flowchart of an AI training method provided by an embodiment of the present application. The specific implementation of the AI training method of the present application will be described in detail below with reference to FIG. 3 .
  • the AI training method may be performed collaboratively by the aforementioned guidance system 106 , the cloud training system 120 , and the acquisition component 1042 in the AI framework 104 .
  • S201 The developer develops and runs the training code.
  • developers can develop the code for training the built AI model in the editor.
  • the editor can be various IDEs in the industry.
  • The training code can include the training parameters set by the developer, such as the learning rate, batch processing value, and batch size. It can also include the name of the training dataset, the name of the AI model to be trained, and the training program logic.
  • the training program logic may include the training program logic written by the developer, or the training program logic in the AI framework invoked by the developer.
  • the training program logic may include algorithm logic such as loss function and optimizer. It should be understood that the AI model built by the developer depends on the AI framework, and the execution of the training program logic in the training code also depends on the AI framework.
  • When developing the training code, the developer also needs to set the cloud training mode in the training code, for example by using code to indicate that the training mode is the cloud training mode. The developer can also write cloud training access information in the training code, such as cloud access address information, cloud authentication information, and account information.
  • After the training code developed by the developer is compiled by the compiler, it can be started and run on the local device. Because the cloud training mode is set in the training code, in some embodiments running the training code triggers the acquisition component in the AI framework to acquire the initial information of the training task.
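  • A minimal sketch of how training code might declare the cloud training mode and the cloud training access information. The class names, field names, and the endpoint below are hypothetical illustrations and are not defined by the application or by any real AI framework.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CloudAccess:
    # Hypothetical container for the cloud training access information.
    address: str   # cloud access address information
    account: str   # account information, e.g. the user name on the cloud platform
    auth_key: str  # cloud authentication information


@dataclass
class TrainingConfig:
    # Training parameters and names that the training code may contain.
    learning_rate: float
    batch_size: int
    dataset_name: str
    model_name: str
    mode: str = "local"  # "local" or "cloud" (assumed values)
    cloud_access: Optional[CloudAccess] = None

    def triggers_cloud_acquisition(self) -> bool:
        # In this sketch, cloud mode plus access information is what would
        # trigger the acquisition component when the code runs.
        return self.mode == "cloud" and self.cloud_access is not None


config = TrainingConfig(
    learning_rate=0.01,
    batch_size=32,
    dataset_name="my_dataset",
    model_name="my_model",
    mode="cloud",
    cloud_access=CloudAccess(
        "https://cloud-training.example.com", "dev_user", "placeholder-key"
    ),
)
```

  • In this sketch, leaving `mode` at its default keeps everything local, so existing training code would not need to change until cloud training is wanted.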
  • S202 The acquisition component in the AI framework acquires, according to the training code, the initial information of the training task corresponding to the training code, and sends the initial information of the training task to the guidance system.
  • the initial information of the training task that the acquisition component can acquire from the training code includes: address information of the AI model to be trained.
  • The acquisition component can intercept the application program interface (API) called when the model is loaded in the training code, thereby obtaining the path information of the AI model to be trained on the local device.
  • the initial information of the training task that the acquisition component can acquire from the training code further includes: address information of the training data.
  • When the acquisition component acquires the address information of the training data, it can likewise intercept the API called when the training data is loaded in the training code, and thereby obtain the path information of the training data on the local device.
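  • The interception described above can be sketched with a decorator that records the path passed to a loading function. This is only an illustration under stated assumptions: `load_model` and `load_dataset` are hypothetical stand-ins for a framework's loading APIs, and the application does not specify how the interception is implemented.

```python
import functools
import os
import tempfile

captured = {}  # initial training task information gathered by interception


def intercept(key):
    """Wrap a loader so the path it is called with is recorded under `key`."""
    def decorator(loader):
        @functools.wraps(loader)
        def wrapper(path, *args, **kwargs):
            captured[key] = os.path.abspath(path)  # record the local path
            return loader(path, *args, **kwargs)   # then load as usual
        return wrapper
    return decorator


@intercept("model_path")
def load_model(path):
    # Hypothetical model-loading API: returns the raw bytes of the model file.
    with open(path, "rb") as f:
        return f.read()


@intercept("data_path")
def load_dataset(path):
    # Hypothetical data-loading API: returns one sample per line.
    with open(path, "r", encoding="utf-8") as f:
        return f.read().splitlines()


# Demonstration with temporary files standing in for a model and a dataset.
with tempfile.TemporaryDirectory() as tmp:
    model_file = os.path.join(tmp, "model.bin")
    data_file = os.path.join(tmp, "train.txt")
    with open(model_file, "wb") as f:
        f.write(b"\x00\x01")
    with open(data_file, "w", encoding="utf-8") as f:
        f.write("sample-1\nsample-2\n")
    model_bytes = load_model(model_file)
    samples = load_dataset(data_file)
```

  • The training code calls the loaders exactly as before; the recorded paths in `captured` are what would be passed on as address information.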
  • the initial information of the training task that the acquisition component can acquire from the training code also includes: some training program logic, cloud training access information, training parameters, training environment information, and the like.
  • the cloud training access information may include: cloud access address information, cloud authentication information, and account information.
  • Training parameters can include: learning rate used during training, batch value, batch size, etc.
  • The training environment information may include: version information of the AI framework, version information of the programming language of the training code, information about plug-ins or libraries for that programming language version or AI framework version, the specifications and quantity of resources used for training, and the like. The training environment information can be divided into two categories.
  • One category is the training environment information of the local device, which describes the environment in which the AI model was built and the training code was developed on the local device. It includes: version information of the AI framework, version information of the programming language of the training code, and information about plug-ins or libraries for that programming language version or AI framework version.
  • The other category is the training environment information set on the local device, which describes the environment in which AI model training is to be performed. It includes the specifications and quantity of resources used to perform training. This type of information is usually set by the user in the training code.
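  • A minimal sketch of how the two categories of training environment information could be assembled, assuming the training code is written in Python. The dictionary layout, the function names, and the resource specification string are hypothetical; only `platform.python_version` and `importlib.metadata.version` are real standard-library calls.

```python
import platform
from importlib import metadata


def collect_local_environment(packages):
    """Category 1: environment of the local device, detected automatically."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed on the local device
    return {
        "language": "python",
        "language_version": platform.python_version(),
        "packages": versions,
    }


def build_environment_info(packages, resource_spec, resource_count):
    """Combine detected local information with user-set resources (category 2)."""
    return {
        "local": collect_local_environment(packages),
        "requested": {
            "resource_spec": resource_spec,    # set by the user in the training code
            "resource_count": resource_count,
        },
    }


env_info = build_environment_info(
    ["pip", "surely-not-installed-package-xyz"], "gpu-small", 2
)
```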
  • S203 The guidance system uploads training task information to the cloud training system.
  • the guidance system can obtain the training task information to be sent according to the initial information of the training task, and upload the training task information to the cloud training system. For example, read the AI model to be trained from the local device according to the address information of the AI model to be trained, and upload it to the cloud training system.
  • the guidance system can also actively detect and acquire some training task information, and this action can be performed by an acquisition unit in the guidance system. For example, when the acquisition component in the AI framework cannot acquire some training environment information (such as the programming language version of the training code), the guidance system can acquire the training environment information by actively detecting the training environment in the local device. Therefore, the training task information includes the information obtained from the training code by using the obtaining component in the AI framework, and the information obtained from the local device according to the information in the training code.
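  • As an illustration of combining information taken from the training code with information read from the local device, the sketch below replaces a model path with the model file's content. The key names (`model_path`, `model_bytes`, and so on) are assumptions made for this example only.

```python
import os
import tempfile


def build_training_task_info(initial_info):
    """Turn initial information into uploadable training task information."""
    # Keep everything that came from the training code except the local path.
    task_info = {k: v for k, v in initial_info.items() if k != "model_path"}
    # Read the AI model to be trained from the local device using that path.
    with open(initial_info["model_path"], "rb") as f:
        task_info["model_bytes"] = f.read()
    task_info["model_file_name"] = os.path.basename(initial_info["model_path"])
    return task_info


# Demonstration with a temporary file standing in for the local model.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.bin")
    with open(path, "wb") as f:
        f.write(b"weights")
    task_info = build_training_task_info(
        {
            "model_path": path,
            "training_parameters": {"learning_rate": 0.01, "batch_size": 32},
        }
    )
```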
  • Before the guidance system uploads the acquired training task information to the cloud training system, it can establish a connection with the cloud training system according to the cloud training access information in the training task information. Establishing the connection may include operations such as authentication and a billing query.
  • The flow in which the guidance system establishes a connection with the cloud training system is shown in FIG. 4, and specifically includes the following steps S2031-S2036:
  • S2031 The guidance system sends an upload request to the cloud training system.
  • the cloud training access information included in the above upload request includes cloud access information, account information and authentication information of the local device.
  • the cloud access information may be address information of the cloud training system, and an upload request may be sent to the cloud training system according to the address information of the cloud training system.
  • the account information and authentication information may be the information registered and acquired by the developer when purchasing the cloud training service in the cloud platform before using the solution of the present application.
  • the account information may be the user name of the developer on the cloud platform
  • the authentication information may be the key obtained from the cloud platform and corresponding to the cloud training service.
  • In some cases, the upload request of the guidance system may include only the above-mentioned cloud access information, account information, and authentication information of the local device, which are used for access, authentication, and the fee query. In other cases, the upload request may also include part or all of the foregoing training task information. If the upload request includes only the cloud access information, account information, and authentication information, the guidance system can upload the other training task information to the cloud training system after receiving the prompt that the authentication and billing query is passed.
  • S2032 The cloud training system receives the upload request, and sends the account information and authentication information to the cloud authentication center.
  • S2033 The cloud authentication center authenticates the training task requested by the upload request according to the acquired account information and authentication information, and returns an authentication result.
  • the specific authentication mode may adopt any feasible authentication mode in the industry, which is not limited in this application.
  • the authentication result is returned to the cloud training system.
  • S2034 The cloud training system sends the account information to the cloud billing center.
  • S2035 The cloud billing center confirms the fee information of the account corresponding to the account information, and returns the fee information to the cloud training system.
  • the present application does not limit the execution order of the above steps S2034-S2035 and steps S2032-S2033.
  • the execution of the above steps S2034-S2035 may also be optional.
  • S2036 The cloud training system returns a response that the authentication and billing query is passed to the guidance system, and receives training task information.
  • When the authentication is passed and the account fee corresponding to the account information is greater than or equal to the preset threshold, the cloud training system returns to the guidance system a response indicating that the authentication and billing query is passed, and receives the other training task information uploaded by the guidance system.
  • This training task information is the aforementioned training task information excluding the cloud training access information used for cloud access, authentication, and the fee query, for example: training parameters, training program logic, the AI model to be trained, and training environment information.
  • the cloud training system may directly receive the uploaded training task information without returning a response.
  • If the authentication fails or the account fee is below the preset threshold, the cloud training system can return an upload request failure response to the guidance system, and can also return the reason for the failure, such as failed authentication and/or insufficient prepaid fees.
  • the guidance system can successfully send the training task information to the cloud training system.
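  • The decision logic of steps S2031-S2036 can be sketched as follows. All class names, the in-memory key and balance tables, and the failure strings are hypothetical; the application leaves the actual authentication mode open, and a real system would call remote services rather than local objects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UploadResponse:
    accepted: bool
    reasons: tuple = ()


class AuthCenter:
    """Stand-in for the cloud authentication center."""

    def __init__(self, keys):
        self._keys = dict(keys)  # account -> key issued with the service

    def authenticate(self, account, key):
        return account in self._keys and self._keys[account] == key


class BillingCenter:
    """Stand-in for the cloud billing center."""

    def __init__(self, balances):
        self._balances = dict(balances)  # account -> prepaid fee

    def balance(self, account):
        return self._balances.get(account, 0.0)


class CloudTrainingSystem:
    def __init__(self, auth_center, billing_center, min_balance):
        self.auth_center = auth_center
        self.billing_center = billing_center
        self.min_balance = min_balance  # the preset threshold

    def handle_upload_request(self, account, key):
        reasons = []
        # Forward account and authentication information for authentication.
        if not self.auth_center.authenticate(account, key):
            reasons.append("authentication failed")
        # Forward account information for the fee query.
        if self.billing_center.balance(account) < self.min_balance:
            reasons.append("insufficient balance")
        # Pass only if both checks succeed; otherwise report the reasons.
        return UploadResponse(accepted=not reasons, reasons=tuple(reasons))


cloud = CloudTrainingSystem(
    AuthCenter({"dev_user": "k1", "poor_user": "k2"}),
    BillingCenter({"dev_user": 50.0, "poor_user": 1.0}),
    min_balance=10.0,
)
```

  • In this sketch both checks always run, so a failure response can carry both reasons at once, matching the "and/or" in the description.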
  • the above steps S202-S203 are described based on one of the aforementioned embodiments of the present application (that is, the embodiment in which the guidance system and the acquisition component in the AI framework cooperate to acquire the training task information).
  • the above-mentioned guidance system may include the function of the acquisition component in the above-mentioned AI framework, and the above-mentioned steps S202 and S203 are both performed by the guidance system.
  • After the cloud training system receives the training task information, it can prepare the cloud training environment on the cloud and execute the training task.
  • The following is a detailed description of step S204:
  • S204 The cloud training system executes, according to the received training task information, the training task corresponding to the training task information.
  • step S204 can be divided into the following steps:
  • S2041 The cloud training system prepares a cloud training environment according to the received training task information.
  • The training environment information in the training task information may include: version information of the AI framework, version information of the programming language of the training code, information about plug-ins or libraries for that programming language version or AI framework version, and the resource specifications used to perform training.
  • the cloud training system needs to ensure that the AI framework and programming language versions that cloud training depends on are ready in the cloud environment according to the version of the AI framework and the programming language of the training code.
  • the cloud environment includes versions of various mainstream AI frameworks and programming languages. Therefore, usually the cloud training system only needs to detect and confirm when preparing the cloud training environment, without temporarily installing these versions.
  • The cloud training system also needs to ensure, according to the information about plug-ins or libraries for the programming language version or AI framework version, that these plug-ins or libraries are installed in the cloud environment. Usually, the cloud environment also updates and downloads the plug-ins and required libraries of mainstream AI frameworks and programming software in a timely manner. If, in the cloud environment preparation stage, the cloud training system finds that the plug-ins and libraries required to perform the training task are not installed in the cloud environment, it can download and install them at that point.
  • the cloud training system also needs to prepare corresponding training resources on the cloud according to the information on the resource specification for performing training included in the training task information. For example, according to the required resource specification information, start the relevant virtual machines and containers in the cloud, and mount the corresponding hardware resources, such as graphics processing units (GPUs) or AI training chips.
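  • The detect-then-install behaviour of environment preparation can be sketched as a planning function. The exact-version matching rule and the action names `confirm` and `install` are simplifying assumptions for illustration; the application does not prescribe how versions are compared.

```python
def plan_environment(required, installed):
    """Return the actions needed so that `installed` satisfies `required`.

    Both arguments map a component name (framework, language, plug-in,
    or library) to a version string.
    """
    actions = []
    for name, version in sorted(required.items()):
        if installed.get(name) == version:
            # Already present in the cloud environment: only detect and confirm.
            actions.append(("confirm", name, version))
        else:
            # Missing or a different version: download and install it.
            actions.append(("install", name, version))
    return actions


plan = plan_environment(
    {"framework": "2.1", "libfoo": "1.0"},  # from the training task information
    {"framework": "2.1"},                    # already in the cloud environment
)
```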
  • the cloud training system can perform the following steps:
  • S2042 The cloud training system returns an environment preparation success response to the guidance system.
  • the above steps S2042 and S2043 may not be performed.
  • the guidance system can notify the cloud training system to perform the training task when uploading the training task information, and the cloud training system can start to execute the training task after performing the aforementioned step S2041, and the aforementioned steps S2042 and S2043 are omitted.
  • the cloud training system executes the training task corresponding to the training task information in the cloud training environment.
  • the cloud training system can start the training container with the relevant resources ready to execute the training task. Specifically, when the training task is executed, the functional components in the corresponding AI framework on the cloud are called according to the logic of the training program. Input the training data into the AI model to be trained, use each component in the model to calculate the training data based on the training resources, and update the values of the parameters in the model according to some training parameters and training program logic. When the training reaches the training stop condition, the training of the model is stopped, and the trained AI model is obtained.
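  • The training procedure described above (feed training data to the model, update parameters according to the training parameters and training program logic, stop at a stop condition) can be illustrated with a deliberately tiny example. The one-variable linear model, mean squared error loss, and mini-batch gradient descent are choices made for this sketch only; the application does not tie the method to any particular model, loss function, or optimizer.

```python
def train(data, learning_rate, batch_size, max_epochs, loss_threshold):
    """Fit y = w * x + b by mini-batch gradient descent on squared error."""
    w, b = 0.0, 0.0
    loss = float("inf")
    for _ in range(max_epochs):
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Gradients of the mean squared error over the current batch.
            grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
            # Parameter update controlled by the learning rate.
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
        loss = sum((w * x + b - y) ** 2 for x, y in data) / len(data)
        if loss < loss_threshold:  # training stop condition
            break
    return w, b, loss


# Noise-free samples of y = 2x + 1.
samples = [(float(x), 2.0 * x + 1.0) for x in range(4)]
w, b, final_loss = train(
    samples, learning_rate=0.05, batch_size=2, max_epochs=5000, loss_threshold=1e-6
)
```

  • Here the learning rate and batch size play the role of the training parameters carried in the training task information, and the loss threshold together with the epoch limit forms the stop condition.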
  • If, when uploading the training task information, the guidance system read the entire training data set from the local device and uploaded it to the cloud training system all at once, the transmission delay would be high and the environment preparation time of the cloud training system would be longer, which would affect the user experience.
  • To avoid this, the cloud training system may send a training data acquisition request to the guidance system at least once in the process of performing training.
  • the following steps may also be performed during the execution of the training task:
  • S2045 The cloud training system sends a training data acquisition request to the guidance system.
  • S2046 The guidance system reads the training data from the local device, and sends the training data to the cloud training system.
  • the training data used for training the AI model may also be pre-saved by the user in a place that can be read by the cloud training system, such as cloud storage.
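  • A minimal sketch of on-demand training data transfer, in which the cloud side requests one batch at a time instead of receiving the whole data set up front. The `GuidanceSystem` class, the offset and count request format, and the in-process call standing in for a network request are all assumptions made for illustration.

```python
class GuidanceSystem:
    """Stand-in for the guidance system on the local device."""

    def __init__(self, local_dataset):
        self._data = local_dataset

    def handle_data_request(self, offset, count):
        # Read the requested slice from the local device and send it back.
        return self._data[offset:offset + count]


def cloud_training_batches(guidance, batch_size):
    """Yield batches, issuing one training data acquisition request per batch."""
    offset = 0
    while True:
        batch = guidance.handle_data_request(offset, batch_size)
        if not batch:  # no data left on the local device
            return
        yield batch
        offset += len(batch)


batches = list(cloud_training_batches(GuidanceSystem(list(range(5))), 2))
```

  • Because the generator only requests the next batch when the training loop asks for it, training can begin as soon as the first batch arrives.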
  • the cloud billing center can also continuously charge according to the duration, resource specifications, resource quantity, etc. of the resources used during training.
  • the cloud training system can successfully train the AI model and obtain the trained AI model.
  • S205 The cloud training system returns the trained AI model to the guidance system.
  • step S205 is only a step performed in one case, and in other cases, the cloud training system may not return the trained AI model to the guidance system after performing the training task.
  • the cloud training system can return a successful training response to the guidance system, or return the address information of the trained AI model stored in the cloud environment to the guidance system. What the cloud training system returns to the guidance system after the training is completed can be determined by the developer through preset settings.
  • the cloud training system may also return the billing bill to the guidance system.
  • With the above method, the developer only needs to write and run the training code locally for the AI model to be trained using the resources of the cloud environment. This avoids the problem that local resources are insufficient to support training of the AI model.
  • the above method greatly facilitates developers, and developers do not need to change the habit of locally building AI models and developing training codes when faced with insufficient local resources.
  • the solution of the present application also does not require the developer to perform complex configuration and adaptation, and the cloud training can be quickly realized through the cooperation of the guidance system and the cloud training system.
  • Embodiments of the present application further provide the guidance system 106 shown in FIG. 2 .
  • The guidance system 106 is specifically configured to perform the steps performed by the guidance system shown in the foregoing FIGS. 3 to 5, and the functions of the units of the guidance system 106 are the same as those described above for FIG. 2, which will not be repeated here.
  • The guidance system 106 may also be specifically configured to perform the functions of both the guidance system and the acquisition component in the AI framework shown in the foregoing FIGS. 3-5.
  • the embodiment of the present application also provides the cloud training system 120 shown in FIG. 2 .
  • The cloud training system 120 can specifically be used to perform the steps performed by the cloud training system shown in FIG. 3 to FIG. 5. Its functions are the same as those described above for FIG. 2, which will not be repeated here.
  • This embodiment of the present application further provides a computing device 300 as shown in FIG. 6 , and the computing device 300 may be the aforementioned local device.
  • Computing device 300 includes memory 301 , processor 302 , communication interface 303 , and bus 304 .
  • the memory 301 , the processor 302 , and the communication interface 303 are connected to each other through the bus 304 for communication. It should be understood that the present application does not limit the number of processors and memories in the computing device 300 .
  • Computing device 300 may also represent a cluster of devices composed of multiple servers or virtual machines.
  • the memory 301 may be a read only memory (Read Only Memory, ROM), a static storage device, a dynamic storage device, or a random access memory (Random Access Memory, RAM).
  • The memory 301 may store computer instructions. When the computer instructions stored in the memory 301 are executed by the processor 302, the processor 302 and the communication interface 303 perform part or all of the AI training method performed by the guidance system as described in the aforementioned FIGS. 3-5. That is, the computer instructions of the aforementioned guidance system 106 may be stored in the memory 301.
  • The AI model to be trained and the training data can also be stored in the memory 301.
  • the processor 302 may adopt a general-purpose central processing unit (Central Processing Unit, CPU), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a graphics processor (graphics processing unit, GPU) or any combination thereof.
  • the processor 302 may include one or more chips, and the processor 302 may include an AI accelerator, such as a neural processing unit (NPU).
  • The communication interface 303 uses a transceiver module, such as but not limited to a transceiver, to enable communication between the computing device 300 and other devices or a communication network. For example, the response of successful training or the trained AI model can be obtained through the communication interface 303.
  • Bus 304 may include pathways for communicating information between the various components of computing device 300 (e.g., memory 301, processor 302, communication interface 303).
  • The embodiment of the present application further provides a computing device 400 as shown in FIG. 7. The computing device 400 may be a cloud server or a cloud server cluster provided by a cloud service provider, or a virtual machine or a virtual machine cluster.
  • Computing device 400 includes memory 401 , processor 402 , communication interface 403 , and bus 404 .
  • the possible hardware structures of the memory 401 , the processor 402 , the communication interface 403 and the bus 404 and the relationship between each part may be the same as or similar to the corresponding parts in the aforementioned computing device 300 , and will not be repeated here.
  • The memory 401 in the computing device 400 may store the computer instructions included in the environment preparation unit 122 and the training task execution unit 124 of the aforementioned cloud training system 120. When the computer instructions stored in the memory 401 are executed by the processor 402, the processor 402 and the communication interface 403 perform part or all of the AI training method performed by the cloud training system as described in the aforementioned FIGS. 3-5.
  • The above-mentioned embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof.
  • When implemented in software, they can be implemented in whole or in part in the form of a computer program product.
  • The computer program product implementing the above-mentioned AI training method includes one or more computer instructions. When these computer program instructions are loaded and executed on a computer, all or part of the AI training method flow described in the aforementioned FIGS. 3-5 of the present application is executed.
  • the computer may be a general purpose computer, special purpose computer, computer network, or other programmable device.
  • The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wirelessly (e.g., infrared, wireless, microwave, etc.).
  • The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device, such as a server or data center, that integrates one or more available media.
  • The available media may be magnetic media (e.g., floppy disk, hard disk, magnetic tape), optical media (e.g., digital versatile disc (DVD)), or semiconductor media (e.g., solid state disk (SSD)), etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Computer Security & Cryptography (AREA)
  • Stored Programmes (AREA)

Abstract

Provided is an artificial intelligence (AI) training method, applicable to a guidance system (106). After a user of a cloud training system (120) triggers training code to run on a local device (100), said method comprises: according to the training code run on the local device (100), obtaining information about a training task, the training code being used for training an AI model, the AI model being developed and obtained by the user on the basis of an AI framework (104) installed on the local device (100); uploading training task information to a cloud training system (120), and notifying the cloud training system (120) to execute the training task corresponding to the training task information. The method allows a developer performing AI model development and training on the local device (100) to be no longer constrained by the computing power of the local device (100).

Description

A method, system and device for artificial intelligence (AI) training
This application claims priority to Chinese patent application No. 202011096163.6, filed with the China Intellectual Property Office on October 14, 2020 and entitled "AI model training method and apparatus", and to Chinese patent application No. 202011626123.8, filed with the China Intellectual Property Office on December 30, 2020 and entitled "A method, system and device for artificial intelligence AI training", both of which are incorporated herein by reference in their entirety.
Technical Field
The present application relates to the technical field of artificial intelligence (AI), and in particular, to a method, system and device for AI training.
Background
With the development of AI technology, AI is applied in all walks of life. Typical AI technologies include deep learning and machine learning. Their basic idea is to design an AI model and train it on training data, so that the trained AI model has a certain function, such as object detection or target recognition. An AI model is an algorithm implemented using AI technology, for example a deep learning model based on deep learning technology.
To facilitate the development and training of AI models, various AI frameworks have emerged in the industry. Developers are usually accustomed to installing an AI framework on a local device and developing and training AI models based on that framework. However, training an AI model relies on strong computing power, so local training of an AI model that a developer has developed on a local device based on an AI framework is usually limited by the computing power of the local device.
Summary of the Invention
The present application provides an AI training method. When the training code on a local device runs, the method uploads training task information to a cloud training system and uses cloud resources to train the AI model, so that a developer's AI model development on the local device is no longer limited by the computing power of the local device.
In a first aspect, the present application provides an AI training method applied to a guidance system. After a user of a cloud training system triggers the running of training code on a local device, the guidance system performs the following steps: obtaining training task information according to the training code running on the local device, where the training code is used to train an AI model and the AI model is developed by the user based on an AI framework installed on the local device; and uploading the training task information to the cloud training system and notifying the cloud training system to execute the training task corresponding to the training task information.
With the above method, when a user of the cloud training system needs to train a locally built AI model, the user can write the corresponding training code in an editor on the local device. After the user runs the training code locally, the guidance system of the present application acquires the training task information and uploads it to the cloud training system. The method lets the user keep the habit of developing and training models locally while using cloud resources to train the AI model, which solves the problem of insufficient training resources on the local device and is convenient for the user.
In a possible implementation of the first aspect, the training task information includes information obtained from the training code by an acquisition component in the AI framework, and information obtained from the local device according to the information in the training code. The acquisition component in the AI framework can obtain information while the training code is running, for example by reading the training code and intercepting API calls in the training code. Obtaining the information in the training code through the acquisition component allows the guidance system to obtain the training task information more quickly.
In a possible implementation of the first aspect, the training task information includes one or more of the following: training parameters in the training code, the AI model to be trained, the training program logic in the training code used to train the AI model, training environment information of the local device, and cloud training access information used to connect to the cloud training system. The uploaded training task information enables the cloud training system to prepare the cloud training environment and execute the training task smoothly in the cloud environment.
In a possible implementation of the first aspect, the training environment information of the local device includes: version information of the AI framework and version information of the programming language of the training code.
Because the AI model and the training code are developed by the user based on the AI framework and programming language installed on the local device, uploading the AI framework version information and the programming language version information from the local training environment information to the cloud training system allows the cloud training system, when preparing the cloud environment, to prepare in advance the versions of the AI framework and programming language that match the AI model to be trained and the training code.
In a possible implementation of the first aspect, the method performed by the guidance system further includes: receiving a training data acquisition request sent by the cloud training system while executing the training task; and acquiring the training data and sending the training data to the cloud training system.
With this method, the guidance system does not need to upload the training data, or all of the training data, to the cloud training system before the cloud training system starts executing the training task. This avoids a long transmission wait before the cloud training task starts and improves the user experience.
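The on-demand transfer described above can be sketched as a request and response loop in which the cloud side asks for one batch at a time. This is an in-process illustration under stated assumptions: `DataRequest`, `LocalDataServer`, and `cloud_training_loop` are hypothetical names, and a real system would exchange these messages over a network.

```python
from dataclasses import dataclass

@dataclass
class DataRequest:
    """Hypothetical training data acquisition request."""
    offset: int
    count: int

class LocalDataServer:
    """Guidance-system side: answers data requests from the cloud."""
    def __init__(self, samples):
        self._samples = list(samples)

    def handle(self, request):
        return self._samples[request.offset:request.offset + request.count]

def cloud_training_loop(server, batch_size):
    """Cloud side: requests one batch at a time while the task is running."""
    offset, batches = 0, []
    while True:
        batch = server.handle(DataRequest(offset, batch_size))
        if not batch:          # no data left on the local device
            break
        batches.append(batch)  # a real system would train on the batch here
        offset += len(batch)
    return batches

batches = cloud_training_loop(LocalDataServer(range(10)), batch_size=4)
```

Training can begin as soon as the first batch arrives, rather than after the whole data set has been uploaded.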
In a possible implementation of the first aspect, before notifying the cloud training system to execute the training task corresponding to the training task information, the method further includes: receiving an environment preparation success response returned by the cloud training system.
In a possible implementation of the first aspect, the method further includes: receiving the trained AI model returned by the cloud training system. After receiving the trained AI model, the guidance system can store it on the local device and can also prompt the user, for example by indicating where on the local device the trained AI model is stored. This makes it easier for the user to obtain the trained AI model and to use it in subsequent applications.
In a possible implementation of the first aspect, the guidance system may be obtained from the cloud training system and installed on the local device. For example, the cloud training system may provide an interface for downloading the guidance system.
In a second aspect, the present application further provides an AI training method applied to a cloud training system, including: acquiring training task information sent by a guidance system after a user triggers the running of training code on a local device, where the training task information includes training environment information of the local device; preparing a cloud training environment according to the training environment information; and executing, based on the cloud training environment, the training task corresponding to the training task information.
In a possible implementation of the second aspect, the training environment information of the local device includes version information of the AI framework on which the AI model to be trained depends, and version information of the programming language used by the training code for training the AI model.
In a possible implementation of the second aspect, preparing the cloud training environment according to the training environment information includes: setting, according to the version information of the AI framework and the version information of the programming language of the training code, the AI framework and the programming language used to execute the training task in the cloud training environment.
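One possible way to set the framework and language in the cloud is to select a prebuilt environment whose versions match those reported by the local device. The sketch below assumes such a catalogue exists; `AVAILABLE_IMAGES`, the image names, and matching on major.minor versions are all assumptions made for illustration.

```python
# Hypothetical catalogue of prebuilt cloud environments, keyed by
# (framework name, framework major.minor, language major.minor).
AVAILABLE_IMAGES = {
    ("exampleframework", "2.1", "3.8"): "train-image:ef2.1-py3.8",
    ("exampleframework", "2.3", "3.9"): "train-image:ef2.3-py3.9",
}

def major_minor(version):
    """Reduce a version such as '3.9.7' to '3.9'."""
    return ".".join(version.split(".")[:2])

def select_image(env_info):
    """Pick the cloud environment that matches the local environment information."""
    key = (
        env_info["framework"],
        major_minor(env_info["framework_version"]),
        major_minor(env_info["language_version"]),
    )
    try:
        return AVAILABLE_IMAGES[key]
    except KeyError:
        raise LookupError(f"no cloud environment matches {key}") from None

image = select_image({
    "framework": "exampleframework",
    "framework_version": "2.3.1",
    "language_version": "3.9.7",
})
```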
In a possible implementation of the second aspect, the training task information further includes training parameters in the training code, the AI model, and the training program logic in the training code used to train the AI model; and executing, based on the cloud training environment, the training task corresponding to the training task information includes: training the AI model in the prepared cloud training environment according to the training parameters and the training program logic.
In a possible implementation of the second aspect, the training task information further includes cloud training access information, and before preparing the cloud training environment according to the training environment information, the method further includes: performing authentication and/or a billing query for the training task corresponding to the training task information according to the cloud training access information.
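The check that precedes environment preparation might look like the sketch below. The account table, the key names `access_key` and `secret_key`, and the rule that a positive balance is required are assumptions; the application only states that authentication and/or a billing query is performed.

```python
import hmac

# Hypothetical account records held by the cloud training system.
ACCOUNTS = {"ak-demo": {"secret_key": "sk-demo", "balance": 12.5}}

def check_access(access_info):
    """Authenticate the request, then query billing, before any environment is prepared."""
    account = ACCOUNTS.get(access_info.get("access_key"))
    if account is None:
        return False, "unknown account"
    supplied = access_info.get("secret_key", "")
    # Constant-time comparison avoids leaking how many characters matched.
    if not hmac.compare_digest(account["secret_key"], supplied):
        return False, "authentication failed"
    if account["balance"] <= 0:  # assumed billing rule
        return False, "insufficient balance"
    return True, "ok"

outcome = check_access({"access_key": "ak-demo", "secret_key": "sk-demo"})
```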
For the beneficial effects of the features of the second aspect and its possible implementations, refer to the beneficial effects of the corresponding features of the first aspect; they are not repeated here.
In a third aspect, the present application further provides a guidance system, including: an acquisition module, configured to acquire training task information according to training code running on a local device after a user of a cloud training system triggers the running of the training code on the local device, where the training code is used to train an AI model, and the AI model is developed by the user based on an AI framework installed on the local device; and a sending module, configured to upload the training task information to the cloud training system and notify the cloud training system to execute the training task corresponding to the training task information.
In a possible implementation of the third aspect, the training task information includes information obtained from the training code by an acquisition component in the AI framework, and information obtained from the local device according to the information in the training code.
In a possible implementation of the third aspect, the training task information includes one or more of the following: training parameters in the training code, the AI model, the training program logic in the training code used to train the AI model, training environment information of the local device, and cloud training access information used to connect to the cloud training system.
In a possible implementation of the third aspect, the training environment information of the local device includes version information of the AI framework and version information of the programming language of the training code.
In a possible implementation of the third aspect, the system further includes a receiving unit, configured to receive a training data acquisition request sent by the cloud training system while executing the training task; the acquisition unit is further configured to acquire the training data; and the sending unit is further configured to send the training data to the cloud training system.
In a possible implementation of the third aspect, the system further includes a receiving unit, configured to receive an environment preparation success response returned by the cloud training system before the sending unit notifies the cloud training system to execute the training task corresponding to the training task information.
In a possible implementation of the third aspect, the system further includes a receiving unit, configured to receive the trained AI model returned by the cloud training system.
In a possible implementation of the third aspect, the guidance system is obtained from the cloud training system and installed on the local device.
In a fourth aspect, the present application further provides a cloud training system, including: an environment preparation unit, configured to acquire training task information sent by a guidance system after a user triggers the running of training code on a local device, where the training task information includes training environment information of the local device, and to prepare a cloud training environment according to the training environment information; and a training task execution unit, configured to execute, based on the cloud training environment, the training task corresponding to the training task information.
In a possible implementation of the fourth aspect, the training environment information of the local device includes version information of the AI framework on which the AI model to be trained depends, and version information of the programming language of the training code used to train the AI model.
In a possible implementation of the fourth aspect, the environment preparation unit is specifically configured to set, according to the version information of the AI framework and the version information of the programming language of the training code, the AI framework and the programming language used to execute the training task in the cloud training environment.
In a possible implementation of the fourth aspect, the training task information further includes training parameters in the training code, the AI model, and the training program logic in the training code used to train the AI model; and the training task execution unit is specifically configured to train the AI model in the prepared cloud training environment according to the training parameters and the training program logic.
In a possible implementation of the fourth aspect, the training task information further includes cloud training access information, and the environment preparation unit is further configured to perform authentication and/or a billing query for the training task corresponding to the training task information according to the cloud training access information.
In a fifth aspect, the present application further provides a computing device, including a processor and a memory, where the memory stores computer instructions and the processor executes the computer instructions, so that the computing device performs the method of the first aspect or any possible implementation of the first aspect, or the method of the second aspect or any possible implementation of the second aspect.
In a sixth aspect, the present application further provides a computer-readable storage medium storing computer program code. When the computer program code is executed by a computing device, the computing device performs the method of the first aspect or any possible implementation of the first aspect, or the method of the second aspect or any possible implementation of the second aspect. The computer-readable storage medium includes, but is not limited to, volatile memory such as random access memory, and non-volatile memory such as flash memory, a hard disk drive (HDD), or a solid state drive (SSD).
In a seventh aspect, the present application further provides a computer program product including computer program code. When the computer program code is executed by a computing device, the computing device performs the method provided in the first aspect or any possible implementation of the first aspect, or the method provided in the second aspect or any possible implementation of the second aspect. The computer program product may be a software installation package that can be downloaded and executed on a computing device whenever either of these methods needs to be used.
In an eighth aspect, the present application further provides an artificial intelligence (AI) system, including the guidance system of the third aspect or any possible implementation of the third aspect, and the cloud training system of the fourth aspect or any possible implementation of the fourth aspect.
Description of Drawings
FIG. 1 is a schematic diagram of a system architecture provided by an embodiment of the present application;
FIG. 2 is a schematic structural diagram of a guidance system 106 and a cloud training system 120 provided by an embodiment of the present application;
FIG. 3 is a schematic flowchart of an AI training method provided by an embodiment of the present application;
FIG. 4 is a schematic flowchart of authentication and a billing query performed by a cloud training system according to training task information, provided by an embodiment of the present application;
FIG. 5 is a schematic flowchart of a cloud training system executing a training task, provided by an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a computing device 300 provided by an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a computing device 400 provided by an embodiment of the present application.
Detailed Description
To make the objectives, technical solutions, and advantages of the present application clearer, the embodiments of the present application are described in further detail below with reference to the accompanying drawings.
To describe the technical solutions of the embodiments of the present application more clearly, the technical terms used in the present application are explained first:
AI training: a process of using training data to update the parameters of an AI model, so that the AI model learns the features and patterns in the training data and can thereby perform a specific application function.
AI training involves a large amount of computation and consumes considerable computing resources (that is, computing power) and storage resources.
Training code: the code that drives the training of an AI model. The training code defines the training parameters and the training program logic of the AI training process, and may also include the name or address information of the AI model to be trained, the name or address information of the training data set, and so on. The training code also includes cloud training access information used to connect to the cloud training system in the cloud.
AI framework: a library (or tool) for developing, training, and testing AI models. An AI framework includes a variety of packaged code components, each of which is a pre-built and optimized set of functions; in the present application, the code components in an AI framework are also called functional components. With an AI framework, developers can develop and train AI models that meet their application goals more quickly, without needing to understand in detail the underlying algorithms used to build and train the models. Many AI frameworks are now available in the industry, and using one or more of them to develop and train AI models is a great convenience for developers. Typical AI frameworks include deep learning frameworks. Deep learning frameworks of various types exist, and the functions they offer differ, but all of them aim to provide developers with a library (or tool) for developing, training, and testing deep learning models.
AI technology is developing rapidly and is being applied to more and increasingly complex scenarios. The structure of AI models is becoming more complex, and many complex scenarios place ever higher performance requirements on AI models, so it is now often necessary to train structurally complex AI models with large amounts of training data. However, the AI development mode that developers are used to is as follows: install an AI framework on a local device; use a local editor, for example a local integrated development environment (IDE), to call the AI framework to build an AI model and write the training code; and then use local resources to perform the training and obtain the trained AI model. Although some AI frameworks support distributed training of AI models, developers still often find that there are not enough local training resources for a locally developed AI model.
It should be understood that "local" or "local device" in the present application refers to the device that the developer uses to develop the AI model (for example, a server or virtual machine used by the developer), and/or other devices or device clusters that belong to the same owner as that device. The local devices are usually physically close to one another, for example in the same equipment room. Note that when a client on a local terminal uses a remote virtual machine (for example, a desktop cloud) for AI development, the local terminal and the remote virtual machine can together also be called the local device.
In the cloud computing model, a cloud service provider uses a large quantity of basic resources (for example, computing resources, storage resources, and communication resources) to build a cloud environment, and can provide cloud tenants with various basic resources, platforms, or application capabilities so that the tenants can run their own services.
Based on the above background, the present application provides an AI training method that lets developers keep their local AI development habits while using basic resources on the cloud to train the AI models they build locally. The method solves the problem of insufficient local training resources and makes AI model training more convenient and faster for developers. It should be understood that a developer in the present application may also be called a user, that is, a person who uses the AI training method of the present application. Because a user needs to register on the cloud platform and purchase the cloud training service before using the AI training method of the present application, the user (or developer) in the present application is also a user of the cloud training system.
Before the specific solutions of the present application are described, the system architecture to which the present application applies is described first. FIG. 1 is a schematic diagram of an exemplary system architecture provided by the present application. As shown in FIG. 1, the system architecture includes a local device 100 and a cloud training system 120 in a cloud environment. The local device 100 may be a server, a virtual machine, a server cluster, or a virtual machine cluster owned by the developer, and the developer uses the local device 100 for AI development. The cloud training system 120 in the cloud environment may be the back-end system of a cloud training service offered by a cloud service provider; the cloud training system 120 can use the basic resources in the cloud environment to perform cloud training on AI models that developers have developed locally.
An editor 102, an AI framework 104, and a guidance system 106 are installed on the local device 100. The AI framework 104 is the library used for AI development and provides developers with functional components for building and training AI models. The editor 102 is used by the developer to edit the training code and may also be used to compile it. Through the editor 102 on the local device 100, the developer can call the components of the AI framework 104 to build and train an AI model. To train a built AI model, the developer usually develops the training code in the local editor, writing the training parameters required for training and the training program logic used during training. The training program logic in the training code specifies how the training data is used to train the AI model to be trained according to the configured training parameters. Training of the AI model starts when the training code edited by the developer is run.
So that the AI model built by the developer can be trained with resources on the cloud, the guidance system 106 is also installed on the local device. The guidance system 106 is configured to acquire training task information when the training code runs and to upload the training task information to the cloud training system 120 in the cloud environment, so that the cloud training system 120 executes the training task corresponding to the training task information. The training task information may include the training parameters, training program logic, training environment information, and cloud training access information needed to train the AI model, as well as the AI model to be trained. It should be understood that the guidance system 106 may be an independent software program installed on the local device 100, or a tool coupled with the AI framework 104. FIG. 1 of the present application takes the guidance system 106 as an independent software program as an example.
The cloud training system 120 in the cloud environment is configured to receive the training task information sent by the guidance system 106 and to use the training task information to prepare a cloud training environment that matches the AI model and the training program logic. The cloud training system 120 is further configured to receive the training notification sent by the guidance system 106 and to train the developer's AI model in the prepared cloud training environment. After the training task is completed, the cloud training system 120 may also return a training success response and/or the trained AI model to the guidance system 106.
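The interaction between the guidance system 106 and the cloud training system 120 described above can be sketched as the in-process simulation below. It is only an illustration under stated assumptions: the class and method names, the `"environment_ready"` response string, and the shape of the returned result are hypothetical, and a real deployment would communicate over a network.

```python
class CloudTrainingSystem:
    """Hypothetical stand-in for the cloud training system 120."""
    def __init__(self):
        self.task_info = None
        self.environment = None

    def upload(self, task_info):
        # Prepare the cloud training environment from the uploaded information.
        self.task_info = task_info
        self.environment = dict(task_info["environment"])
        return "environment_ready"

    def start_training(self):
        if self.environment is None:
            raise RuntimeError("cloud training environment has not been prepared")
        params = self.task_info["training_params"]
        # A real system would run the training program logic here.
        return {"model": self.task_info["model"], "trained_epochs": params["epochs"]}

class GuidanceSystem:
    """Hypothetical stand-in for the guidance system 106."""
    def __init__(self, cloud):
        self.cloud = cloud

    def run(self, task_info):
        response = self.cloud.upload(task_info)   # upload training task information
        if response != "environment_ready":
            raise RuntimeError("environment preparation failed")
        return self.cloud.start_training()        # training notification

task_info = {
    "environment": {"framework_version": "2.3.1", "language_version": "3.9.7"},
    "training_params": {"epochs": 3},
    "model": "demo-model",
}
trained = GuidanceSystem(CloudTrainingSystem()).run(task_info)
```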
As can be seen from the above, the present application relies mainly on interaction between the guidance system 106 on the local device 100 and the cloud training system 120 in the cloud environment to dispatch the training task corresponding to the training code developed on the local device 100 to the cloud training system 120 for execution.
The structure and functional division of the guidance system 106 and the cloud training system 120 are described below with reference to FIG. 2.
Because the structure of the AI model and the training code developed by the developer depend on functional components in the AI framework, the AI framework 104 needs to be called when the training code runs. In an embodiment of the present application, an acquisition component 1042 is embedded in the AI framework 104 used to build and train the AI model; the acquisition component 1042 may be a tool, a plug-in, or a patch in the AI framework 104.
The acquisition component 1042 is configured to communicate with the guidance system 106 and to pass the acquired initial information of the training task to the guidance system 106. Specifically, the acquisition component 1042 obtains the initial information of the training task from the training code while the training code is running. In the embodiments of the present application, the information acquired by the acquisition component 1042 in the AI framework is collectively called the initial information of the training task. It may include two types of information. The first type is identification or address information of a model or data needed to execute the training task, for example the address information of the AI model to be trained or the address information of the training data. The second type is information that can be used directly when executing the training task, for example the training parameters and the training program logic. After acquiring the initial information of the training task, the acquisition component 1042 sends it to the guidance system 106.
The guidance system 106 may include a receiving unit 1062, an acquisition unit 1064, and a sending unit 1066. It should be understood that this division of the guidance system 106 into units is only an example and does not limit the guidance system 106 of the present application.
The receiving unit 1062 is configured to receive the initial information of the training task sent by the acquisition component 1042. The receiving unit 1062 may also receive instructions or information sent by the cloud training system 120.
由于训练任务的初始信息包括一些执行训练任务时需要使用的模型或数据的标识信息或者地址信息,引导系统106中获取单元1064可以用于根据接收单元1062接收到的训练任务的初始信息从本地设备100中获取训练任务信息。例如:接收单元1062接收到AI框架的获取组件1042获取的AI模型的地址信息后,获取单元1064根据AI模型的地址信息获取到待训练的AI模型。应理解,对于执行训练任务时可以直接使用的这一类训练任务的初始信息,获取组件1042获取的这些训练任务的初始信息与引导系统106向云训练系统120发送的对应的训练任务信息的内容相同。换言之,引导系统106中的接收单元1062从获取组件1042接收到这类型的信息后,无需由获取单元1064进行进一步获取,即可将这类型的信息作为引导系统要发送给云训练系统120的训练任务信息。本申请实施例中将引导系统106向云训练系统120发送的用于执行训练任务的数据和信息统称为训练任务信息。Since the initial information of the training task includes identification information or address information of some models or data that need to be used when executing the training task, the obtaining unit 1064 in the guidance system 106 can be configured to obtain information from the local device according to the initial information of the training task received by the receiving unit 1062 100 to obtain training task information. For example, after the receiving unit 1062 receives the address information of the AI model obtained by the obtaining component 1042 of the AI framework, the obtaining unit 1064 obtains the AI model to be trained according to the address information of the AI model. It should be understood that, for the initial information of this type of training tasks that can be directly used when performing training tasks, the initial information of these training tasks obtained by the obtaining component 1042 and the content of the corresponding training task information sent by the guidance system 106 to the cloud training system 120 same. In other words, after the receiving unit 1062 in the guidance system 106 receives this type of information from the acquisition component 1042 , without further acquisition by the acquisition unit 1064 , this type of information can be used as the training information to be sent by the guidance system to the cloud training system 120 . task information. 
In the embodiment of the present application, the data and information sent by the guidance system 106 to the cloud training system 120 for performing the training task are collectively referred to as training task information.
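As an illustration only, the relationship between initial information and training task information can be sketched in Python. The function name, the dictionary layout, and the convention that keys ending in `_path` carry address information are all assumptions made for this sketch and do not come from the application.

```python
import tempfile
from pathlib import Path


def build_training_task_info(initial_info: dict) -> dict:
    """Turn initial information into training task information.

    Keys ending in "_path" (a hypothetical naming convention) are treated as
    address information and replaced by the bytes read from the local device.
    All other entries are passed through unchanged.
    """
    task_info = {}
    for key, value in initial_info.items():
        if key.endswith("_path"):
            # Address information: fetch the referenced object locally.
            task_info[key[: -len("_path")]] = Path(value).read_bytes()
        else:
            # Directly usable information: content stays the same.
            task_info[key] = value
    return task_info


# Usage: a throwaway file stands in for the locally stored AI model.
with tempfile.TemporaryDirectory() as tmp:
    model_file = Path(tmp) / "model.bin"
    model_file.write_bytes(b"weights")
    example_info = build_training_task_info(
        {"model_path": str(model_file), "learning_rate": 0.01}
    )
```

In this sketch the learning rate passes through untouched, while the model address is replaced by the model content, mirroring the two cases described above.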
The sending unit 1066 is configured to send the training task information to the cloud training system 120. The sending unit 1066 is further configured to send training notifications and/or training data to the cloud training system 120.
As shown in FIG. 2, the cloud training system 120 includes an environment preparation unit 122 and a training task execution unit 124.
The environment preparation unit 122 is configured to receive the training task information sent by the guidance system 106 and to prepare the cloud training environment according to that information, for example by setting up the cloud resources needed to execute the training task and the libraries on which training depends. Optionally, the environment preparation unit 122 may also return to the guidance system 106 a response message indicating that environment preparation is complete. Optionally, the environment preparation unit 122 may also send part of the training task information and/or a training task execution instruction to the training task execution unit 124.
The training task execution unit 124 may execute the training task according to a training task execution instruction sent by the environment preparation unit 122 or the guidance system 106. When executing the training task, the training task execution unit 124 mainly executes the training program logic in the training task information to train the AI model to be trained. The training task execution unit 124 is further configured to return a training-success response and/or the trained AI model to the guidance system 106.
Before using the cloud training system 120 described in this application to train a locally developed AI model with cloud resources, the developer can purchase the cloud training service through the cloud platform of a cloud service provider, configure the total amount of resources available to the cloud training system 120 for training, set authentication key information, and so on. In some embodiments, the guidance system 106 and the acquisition component 1042 in the AI framework 104 may be software programs or tools that the cloud service provider develops to accompany its cloud training service. After purchasing and configuring the cloud training service, the developer can install the guidance system 106 and the acquisition component 1042 on the local device. The developer can then run the training code locally to start training of the AI model by the cloud training system 120.
FIG. 2 described above shows only one possible embodiment of this application. In other embodiments, the function of the acquisition component 1042 in the AI framework 104 may instead be provided by the guidance system 106; that is, the action of obtaining the initial information of the training task according to the training code may be performed by the guidance system 106. In this case, the guidance system 106 may include four functional units: an acquisition unit that performs the actions of the acquisition component 1042, together with the receiving unit 1062, the acquisition unit 1064, and the sending unit 1066 described above. This division of the guidance system 106 into functional units is only an example; other divisions are possible and are not described further here.
The following embodiments are described using the example in which the guidance system 106 does not include the function of the acquisition component 1042 in the AI framework 104. It should be understood that the content of the following embodiments can also be adapted to scenarios in which the guidance system 106 includes that function.
FIG. 3 is a schematic flowchart of the AI training method provided by an embodiment of this application. The implementation of the AI training method is described in detail below with reference to FIG. 3. The method may be performed cooperatively by the guidance system 106, the cloud training system 120, and the acquisition component 1042 in the AI framework 104.
S201: The developer develops and runs the training code.
Specifically, the developer can write, in an editor, the code for training an AI model that has already been built. The editor can be any IDE used in the industry. The training code can include training parameters set by the developer, such as the learning rate, batch value, and batch size, and can also include the name of the training dataset, the name of the AI model to be trained, and the training program logic. The training program logic may include logic written by the developer as well as training program logic in the AI framework that the developer invokes. It may include algorithm logic such as a loss function and an optimizer. It should be understood that the AI model built by the developer depends on the AI framework, and the execution of the training program logic in the training code also depends on the AI framework.
Because in this application the training task for the AI model is executed in the cloud, the developer also needs to set the cloud training mode in the training code, for example by indicating in code that the training mode is the cloud training mode. The developer can also write cloud training access information into the training code, such as cloud access address information, cloud authentication information, and account information.
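The application does not prescribe how the cloud training mode is expressed in code. The following Python sketch shows one possible shape, under the assumption that the training parameters and the cloud training access information are grouped into a configuration object. The class names, field names, and the placeholder endpoint are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CloudAccess:
    """Cloud training access information (all values are placeholders)."""
    endpoint: str   # cloud access address information
    account: str    # account information
    auth_key: str   # cloud authentication information


@dataclass
class TrainingConfig:
    """Settings a developer might place in the training code."""
    model_name: str
    dataset_name: str
    learning_rate: float = 0.01
    batch_size: int = 32
    mode: str = "local"                      # "local" or "cloud"
    cloud_access: Optional[CloudAccess] = None

    def is_cloud_training(self) -> bool:
        # Cloud training needs both the mode flag and the access information.
        return self.mode == "cloud" and self.cloud_access is not None


config = TrainingConfig(
    model_name="my_model",
    dataset_name="my_dataset",
    mode="cloud",
    cloud_access=CloudAccess(
        endpoint="https://cloud-training.example.invalid",
        account="dev-user",
        auth_key="placeholder-key",
    ),
)
```

A check such as `config.is_cloud_training()` is the kind of condition that could trigger the acquisition step described in S202.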
After the training code is compiled by the compiler, it can be started on the local device. Once the training code is running, because the cloud training mode is set in it, in some embodiments the cloud training mode can trigger the acquisition component in the AI framework to obtain the initial information of the training task.
S202: The acquisition component in the AI framework obtains, according to the training code, the initial information of the training task corresponding to the training code, and sends this initial information to the guidance system.
The initial information that the acquisition component can obtain from the training code includes the address information of the AI model to be trained. To obtain it, the acquisition component can intercept the application program interface (API) call that loads the model in the training code, thereby obtaining the path of the AI model to be trained on the local device.
The initial information that the acquisition component can obtain from the training code also includes the address information of the training data. The acquisition component can likewise intercept the API call that loads the training data in the training code and thereby obtain the path of the training data on the local device.
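One way to picture the interception of load APIs is a wrapper that records the path argument before delegating to the real loader. The sketch below uses a Python decorator for this; the `load_model` and `load_dataset` functions are hypothetical stand-ins, not the API of any particular AI framework, and the paths are invented examples.

```python
import functools

# Paths recorded by the interceptor, keyed by what was being loaded.
captured_paths = {}


def intercept(kind):
    """Wrap a load function so that its path argument is recorded."""
    def decorator(load_fn):
        @functools.wraps(load_fn)
        def wrapper(path, *args, **kwargs):
            captured_paths[kind] = str(path)
            return load_fn(path, *args, **kwargs)
        return wrapper
    return decorator


@intercept("model")
def load_model(path):
    # Hypothetical stand-in for a framework's model-loading API.
    return {"loaded_from": path}


@intercept("training_data")
def load_dataset(path):
    # Hypothetical stand-in for a framework's data-loading API.
    return []


# The training code calls the load APIs as usual; the paths are captured.
load_model("/home/dev/models/net.bin")
load_dataset("/home/dev/data/train")
```

The training code itself is unchanged in this arrangement, which is consistent with the stated goal of not altering the developer's local habits.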
The initial information that the acquisition component can obtain from the training code further includes some training program logic, cloud training access information, training parameters, training environment information, and the like. The cloud training access information may include cloud access address information, cloud authentication information, and account information. The training parameters may include the learning rate, batch value, and batch size used during training. The training environment information may include the version of the AI framework, the version of the programming language of the training code, information about plug-ins or libraries for that programming language version or AI framework version, and the specification and quantity of the resources used for training. The training environment information falls into two categories. The first describes the local device's own environment, that is, the environment in which the AI model was built and the training code was developed: the AI framework version, the programming language version of the training code, and the plug-in or library information. The second is set on the local device but describes the environment in which AI model training is to be executed: the specification and quantity of the resources used for training. This second type of information is usually set by the user in the training code.
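The two categories of training environment information can be illustrated with a short Python sketch, assuming the training code is itself written in Python. The function names, the dictionary layout, and the resource specification string are assumptions for illustration; only `platform.python_version` and `importlib.metadata.version` are real standard-library calls.

```python
import platform
from importlib import metadata


def detect_local_environment(packages):
    """Category 1: environment detected on the local device."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed locally
    return {
        "language": "python",
        "language_version": platform.python_version(),
        "libraries": versions,
    }


def build_environment_info(packages, resource_spec, resource_count):
    """Combine detected information with the user-set resource request."""
    return {
        "local": detect_local_environment(packages),
        # Category 2: set by the user, describes the environment for training.
        "requested": {
            "resource_spec": resource_spec,
            "resource_count": resource_count,
        },
    }


env = build_environment_info(["definitely-not-installed-pkg-xyz"], "gpu.large", 2)
```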
S203: The guidance system uploads the training task information to the cloud training system.
After receiving the initial information of the training task sent by the acquisition component in the AI framework, the guidance system can derive from it the training task information to be sent and upload that information to the cloud training system. For example, it reads the AI model to be trained from the local device according to the model's address information and uploads the model to the cloud training system. Optionally, the guidance system can also actively detect and obtain some training task information; this action can be performed by the acquisition unit in the guidance system. For example, when the acquisition component in the AI framework cannot obtain some training environment information (such as the programming language version of the training code), the guidance system can obtain it by actively inspecting the training environment on the local device. The training task information therefore includes information obtained from the training code by the acquisition component in the AI framework and information obtained from the local device according to the information in the training code.
Before uploading the training task information, the guidance system can establish a connection with the cloud training system according to the cloud training access information in the training task information. Establishing the connection may include operations such as authentication and a billing query.
Specifically, the procedure by which the guidance system establishes a connection with the cloud training system may be as shown in FIG. 4 and includes the following steps S2031-S2036:
S2031: The guidance system sends an upload request to the cloud training system.
The cloud training access information included in the upload request comprises cloud access information and the account information and authentication information of the local device. The cloud access information may be the address of the cloud training system, according to which the upload request can be sent to the cloud training system. The account information and authentication information may be information that the developer registered and obtained when purchasing the cloud training service on the cloud platform before using the solution of this application. For example, the account information may be the developer's user name on the cloud platform, and the authentication information may be a key obtained from the cloud platform that corresponds to the cloud training service.
It is worth noting that, in some cases, the upload request may include only the cloud access information, account information, and authentication information used for access, authentication, and the fee query; in other cases, the upload request may also include some or all of the training task information described above. If the upload request includes only the cloud access information, account information, and authentication information, the guidance system can upload the remaining training task information to the cloud training system after receiving notice that authentication and the billing query have passed.
S2032: The cloud training system receives the upload request and sends the account information and authentication information to the cloud authentication center.
S2033: The cloud authentication center authenticates the training task requested by the upload request according to the account information and authentication information, and returns an authentication result.
Any feasible authentication method used in the industry may be adopted; this application does not limit it.
After authentication is complete, the cloud authentication center returns the authentication result to the cloud training system.
S2034: The cloud training system sends the account information to the cloud billing center.
S2035: The cloud billing center determines, according to the account information, the fee information of the corresponding account and returns the fee information to the cloud training system.
This application does not limit the order in which steps S2034-S2035 and steps S2032-S2033 are executed. Steps S2034-S2035 may also be optional.
S2036: The cloud training system returns to the guidance system a response indicating that authentication and the billing query have passed, and receives the training task information.
When authentication passes and the prepaid balance of the account corresponding to the account information is greater than or equal to a preset threshold, the cloud training system returns to the guidance system a response indicating that authentication and the billing query have passed, and receives the other training task information uploaded by the guidance system. The other training task information is the training task information described above apart from the cloud training access information used for cloud access, authentication, and the fee query, for example the training parameters, training program logic, AI model to be trained, and training environment information.
It should be understood that the above step is optional: when the upload request already includes the other training task information, the cloud training system may receive the uploaded training task information directly without returning a response.
When authentication fails or the prepaid balance of the account corresponding to the account information is less than the preset threshold, the cloud training system can return an upload-request-failure response to the guidance system and can also return the reason for the failure, for example that authentication failed and/or that the prepaid balance is insufficient.
After steps S2031-S2036 have been performed, the guidance system can successfully send the training task information to the cloud training system.
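The accept-or-reject decision that the cloud training system makes in steps S2032-S2036 can be sketched as follows. This is a minimal illustration under stated assumptions: the authentication result and the account balance are taken as already returned by the cloud authentication center and the cloud billing center, and the names `UploadResponse` and `handle_upload_request` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class UploadResponse:
    accepted: bool
    reasons: Tuple[str, ...] = ()


def handle_upload_request(
    auth_ok: bool, balance: Optional[float], min_balance: float
) -> UploadResponse:
    """Decide whether the upload request passes.

    `balance=None` models the case where the optional billing query
    (S2034-S2035) is skipped.
    """
    reasons = []
    if not auth_ok:
        reasons.append("authentication failed")
    if balance is not None and balance < min_balance:
        reasons.append("insufficient prepaid balance")
    # The request passes only when no failure reason was collected.
    return UploadResponse(accepted=not reasons, reasons=tuple(reasons))
```

Returning every failure reason, rather than only the first, matches the "and/or" wording used for the failure response.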
It should be understood that steps S202-S203 above are described on the basis of one of the foregoing embodiments of this application, namely the embodiment in which the guidance system and the acquisition component in the AI framework cooperate to obtain the training task information. In another embodiment, the guidance system may include the function of the acquisition component in the AI framework, in which case steps S202 and S203 are both performed by the guidance system.
After receiving the training task information, the cloud training system can prepare the cloud training environment and execute the training task on the cloud, as described in step S204 below:
S204: The cloud training system executes, according to the received training task information, the training task corresponding to that information.
As shown in FIG. 5, step S204 can be divided into the following steps:
S2041: The cloud training system prepares a cloud training environment according to the received training task information.
Before executing the training task, the cloud training system needs to prepare the cloud training environment so that it matches the local training code and the local AI model to be trained. Specifically, the cloud training system prepares the environment according to the training environment information in the training task information, which may include the version of the AI framework, the version of the programming language of the training code, information about plug-ins or libraries for that programming language version or AI framework version, and the specification of the resources used for training.
According to the versions of the AI framework and of the programming language of the training code, the cloud training system needs to ensure that the AI framework and programming language versions on which cloud training depends are available in the cloud environment. A cloud environment usually already includes the mainstream AI frameworks and programming language versions, so when preparing the cloud training environment the cloud training system usually only needs to detect and confirm them rather than install them on the spot.
According to the plug-in or library information for the programming language version or AI framework version, the cloud training system also needs to ensure that these plug-ins or libraries are installed in the cloud environment. The cloud environment usually keeps the plug-ins and libraries of mainstream AI frameworks and programming software up to date. If, during environment preparation, the cloud training system finds that a plug-in or library required for the training task is not installed, it can download and install it at that point.
According to the resource specification information included in the training task information, the cloud training system also needs to prepare the corresponding training resources on the cloud, for example by starting the relevant virtual machines or containers in the cloud and attaching the corresponding hardware resources, such as graphics processing units (GPUs) or AI training chips.
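The environment preparation described above amounts to comparing what the training task requires with what the cloud environment already provides. The following Python sketch computes such a plan without performing any installation. The function name, the dictionary keys, and the library names and versions in the usage example are all invented for illustration.

```python
def plan_environment(env_info: dict, cloud_installed: dict) -> dict:
    """Work out what must be installed and which resources to start.

    `env_info` is the training environment information from the training
    task information; `cloud_installed` maps library names to the versions
    already present in the cloud environment.
    """
    # Libraries that are missing or present in a different version.
    to_install = sorted(
        f"{name}=={version}"
        for name, version in env_info["libraries"].items()
        if cloud_installed.get(name) != version
    )
    return {
        "install": to_install,
        "containers": env_info["resource_count"],
        "accelerator": env_info["resource_spec"],
    }


plan = plan_environment(
    {
        "libraries": {"libA": "1.2", "libB": "0.9", "libC": "3.0"},
        "resource_spec": "gpu.large",
        "resource_count": 2,
    },
    {"libA": "1.2", "libB": "0.8"},
)
```

Here `libA` is only confirmed, while `libB` (wrong version) and `libC` (absent) would be downloaded and installed during preparation.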
After the cloud training environment has been prepared, the cloud training system can perform the following steps:
S2042: The cloud training system returns an environment-preparation-success response to the guidance system.
S2043: The guidance system sends a training notification to the cloud training system.
It is worth noting that, in other embodiments, steps S2042 and S2043 may be omitted. For example, the guidance system can notify the cloud training system to execute the training task at the same time as it uploads the training task information; the cloud training system can then start executing the training task as soon as step S2041 is finished, without steps S2042 and S2043.
S2044: The cloud training system executes, in the cloud training environment, the training task corresponding to the training task information.
The cloud training system can start a training container for which the relevant resources have been prepared and execute the training task in it. Specifically, when executing the training task, it calls the functional components of the corresponding AI framework on the cloud according to the training program logic. The training data is input to the AI model to be trained, the components of the model compute on the training data using the training resources, and the values of the model's parameters are updated according to the training parameters and the training program logic. This is repeated until a training stop condition is reached, at which point training stops and the trained AI model is obtained. Examples of a training stop condition are that the loss function converges to below a preset threshold, or that the number of training epochs reaches a preset value.
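The iterate-until-stop-condition structure of S2044 can be shown with a deliberately tiny example: fitting a single weight by gradient descent in plain Python. The one-parameter model, the data, and the hyperparameter values are assumptions chosen for brevity and are not taken from the application; a real training task would run the developer's training program logic on an AI framework.

```python
def train(data, learning_rate, loss_threshold, max_epochs):
    """Fit y = w * x by gradient descent on the mean squared error.

    Training stops when the loss falls below `loss_threshold` or when
    `max_epochs` epochs have run, whichever happens first.
    """
    w, loss, epoch = 0.0, float("inf"), 0
    for epoch in range(1, max_epochs + 1):
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= learning_rate * grad  # parameter update
        loss = sum((w * x - y) ** 2 for x, y in data) / len(data)
        if loss < loss_threshold:  # stop condition 1: loss converged
            break
    # Stop condition 2 is the loop bound on the number of epochs.
    return w, loss, epoch


samples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
weight, final_loss, epochs_run = train(samples, 0.05, 1e-6, 100)
```

With these samples the loss threshold is reached well before the epoch limit, so the first stop condition ends training.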
Because the amount of training data used to train an AI model is large (for example, tens of thousands of images or video clips), having the guidance system read the whole training dataset from the local device and upload it to the cloud training system in one go together with the training task information would cause high transmission latency and a long environment preparation time for the cloud training system, which would harm the user experience. In some embodiments, the cloud training system may instead send training data acquisition requests to the guidance system one or more times while training is being executed.
That is, optionally, the following steps may also be performed while the training task is being executed:
S2045: The cloud training system sends a training data acquisition request to the guidance system.
S2046: The guidance system reads the training data from the local device and sends the training data to the cloud training system.
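The request-driven transfer in S2045-S2046 can be modeled as a generator on the guidance-system side, where each request from the cloud training system pulls the next portion of the training data. The function name and the fixed-size batching policy are assumptions for this sketch; the application does not specify how the data is divided.

```python
from typing import Iterator, List, Sequence


def serve_training_data(samples: Sequence, batch_size: int) -> Iterator[List]:
    """Yield the training data in portions, one portion per request."""
    for start in range(0, len(samples), batch_size):
        yield list(samples[start:start + batch_size])


# Each call to next() stands for one training data acquisition request.
source = serve_training_data(list(range(5)), batch_size=2)
first_portion = next(source)
remaining_portions = list(source)
```

Because nothing is read until it is requested, the initial upload of training task information stays small and training can begin before all of the data has been transferred.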
In other embodiments, the training data used to train the AI model may also be saved by the user in advance in a location that the cloud training system can read, such as cloud storage.
It is worth noting that, while the training task is being executed, the cloud billing center can also bill continuously according to the usage duration, specification, and quantity of the resources used for training.
Through step S204, the cloud training system can successfully train the AI model and obtain the trained AI model.
S205: The cloud training system returns the trained AI model to the guidance system.
It is worth noting that step S205 is performed only in some cases. In other cases, the cloud training system need not return the trained AI model to the guidance system after executing the training task. For example, the cloud training system can return a training-success response to the guidance system, or return the address at which the trained AI model is stored in the cloud environment. What the cloud training system returns to the guidance system after training is complete can be determined by the developer through settings made in advance. Optionally, the cloud training system may also return a billing record to the guidance system.
Through steps S201-S205, a developer who writes and runs training code locally can train an AI model using the resources of the cloud environment. This avoids the problem of local resources being insufficient to support training of the AI model. The method is convenient for developers: when local resources are insufficient, they do not need to change their habits of building AI models and developing training code locally. The solution also does not require the developer to perform complex configuration or adaptation; cloud training is achieved quickly through the cooperation of the guidance system and the cloud training system.
An embodiment of this application further provides the guidance system 106 shown in FIG. 2. In some embodiments, the guidance system 106 is specifically configured to perform the steps performed by the guidance system in FIG. 3 to FIG. 5. The functions of the functional units of the guidance system 106 are as described above for FIG. 2 and are not repeated here. In other embodiments, the guidance system 106 may also be specifically configured to perform the functions of both the guidance system and the acquisition component in the AI framework shown in FIG. 3 to FIG. 5.
An embodiment of this application further provides the cloud training system 120 shown in FIG. 2. The cloud training system 120 may be specifically configured to perform the steps performed by the cloud training system in FIG. 3 to FIG. 5. The functions of the functional units of the cloud training system 120 are as described above for FIG. 2 and are not repeated here.
An embodiment of this application further provides a computing device 300 shown in FIG. 6. The computing device 300 may be the foregoing local device. The computing device 300 includes a memory 301, a processor 302, a communication interface 303, and a bus 304. The memory 301, the processor 302, and the communication interface 303 are communicatively connected to one another through the bus 304. It should be understood that this application does not limit the number of processors or memories in the computing device 300. The computing device 300 may also represent a device cluster composed of multiple servers or virtual machines.
The memory 301 may be a read-only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM). The memory 301 may store computer instructions. When the computer instructions stored in the memory 301 are executed by the processor 302, the processor 302 and the communication interface 303 perform some or all of the AI training method performed by the guidance system as described with reference to FIG. 3 to FIG. 5. In other words, the computer instructions of the foregoing guidance system 106 may be stored in the memory 301. The memory 301 may also store the AI module to be trained and the training data.
The processor 302 may be a general-purpose central processing unit (CPU), an application-specific integrated circuit (ASIC), a graphics processing unit (GPU), or any combination thereof. The processor 302 may include one or more chips, and may include an AI accelerator, for example, a neural processing unit (NPU).
The communication interface 303 uses a transceiver module, for example but not limited to a transceiver, to implement communication between the computing device 300 and other devices or a communication network. For example, a training success response or a trained AI model may be obtained through the communication interface 303.
The bus 304 may include a path for transferring information between the components of the computing device 300 (for example, the memory 301, the processor 302, and the communication interface 303).
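As an illustration only, the guidance-system instructions held in the memory 301 might gather training task information along the following lines. This is a minimal sketch under stated assumptions, not the implementation of this application: the names `TrainingTaskInfo`, `collect_task_info`, and `to_payload`, the JSON payload layout, and the example framework name are all hypothetical, and Python is assumed as the programming language of the training code.

```python
import json
import platform
from dataclasses import asdict, dataclass, field
from importlib import metadata
from typing import Any, Dict, Optional


@dataclass
class TrainingTaskInfo:
    """Hypothetical container for the training task information."""
    training_params: Dict[str, Any]
    framework_name: str
    framework_version: Optional[str]  # None if the framework is not installed
    language_version: str
    access_info: Dict[str, str] = field(default_factory=dict)


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def collect_task_info(training_params: Dict[str, Any],
                      framework: str,
                      access_info: Dict[str, str]) -> TrainingTaskInfo:
    """Combine parameters taken from the training code with environment
    information read from the local device."""
    return TrainingTaskInfo(
        training_params=dict(training_params),
        framework_name=framework,
        framework_version=installed_version(framework),
        language_version=platform.python_version(),
        access_info=dict(access_info),
    )


def to_payload(info: TrainingTaskInfo) -> bytes:
    """Serialize the task information for upload (assumed JSON encoding)."""
    return json.dumps(asdict(info), sort_keys=True).encode("utf-8")
```

In this sketch the version information is read from the local device rather than from the training code itself, which mirrors the distinction drawn between information obtained from the training code and information obtained from the local device.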
An embodiment of this application further provides a computing device 400 shown in FIG. 7. The computing device 400 may be a cloud server or a cloud server cluster provided by a cloud service provider, or may be a virtual machine or a virtual machine cluster. The computing device 400 includes a memory 401, a processor 402, a communication interface 403, and a bus 404. The possible hardware structures of the memory 401, the processor 402, the communication interface 403, and the bus 404, and the relationships between these parts, may be the same as or similar to those of the corresponding parts of the foregoing computing device 300, and are not repeated here. The memory 401 in the computing device 400 may store the computer instructions included in the environment preparation unit 122 and the training task execution unit 124 of the foregoing cloud training system 120. When the computer instructions stored in the memory 401 are executed by the processor 402, the processor 402 and the communication interface 403 perform some or all of the AI training method performed by the cloud training system as described with reference to FIG. 3 to FIG. 5.
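The division of work between the environment preparation unit 122 and the training task execution unit 124 can be sketched roughly as follows. This is a hedged illustration, not the implementation of this application: the function names, the dictionary layout of the task information, the set of available environments, and the status strings are all assumptions introduced for the example.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Set, Tuple

# (framework name, framework version, language version); hypothetical key.
EnvKey = Tuple[str, str, str]


@dataclass(frozen=True)
class CloudEnvironment:
    framework: str
    framework_version: str
    language_version: str


def prepare_environment(env_info: Dict[str, str],
                        available: Set[EnvKey]) -> CloudEnvironment:
    """Pick a cloud environment matching the local framework and language
    versions; raise LookupError if no matching environment exists."""
    key = (env_info["framework"],
           env_info["framework_version"],
           env_info["language_version"])
    if key not in available:
        raise LookupError(f"no cloud environment for {key}")
    return CloudEnvironment(*key)


def handle_training_task(
    task_info: Dict[str, Any],
    available: Set[EnvKey],
    is_authorized: Callable[[Dict[str, str]], bool],
    run_training: Callable[[CloudEnvironment, Dict[str, Any]], Any],
) -> Dict[str, Any]:
    """Authenticate first, then prepare the environment, then train."""
    if not is_authorized(task_info.get("access_info", {})):
        return {"status": "rejected"}
    try:
        env = prepare_environment(task_info["environment"], available)
    except LookupError:
        return {"status": "environment_failed"}
    result = run_training(env, task_info["training_params"])
    return {"status": "done", "result": result}
```

Passing `is_authorized` and `run_training` in as callables keeps the sketch independent of any particular authentication service or training backend, neither of which is specified here.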
The descriptions of the procedures corresponding to the foregoing figures each have a different emphasis. For a part that is not described in detail in one procedure, refer to the related descriptions of the other procedures.
The foregoing embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When software is used, the embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product implementing the foregoing AI training method includes one or more computer instructions. When these computer program instructions are loaded and executed on a computer, the AI training method procedures described with reference to FIG. 3 to FIG. 5 of this application are performed in whole or in part.
The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another. For example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (for example, over coaxial cable, optical fiber, or a digital subscriber line (DSL)) or a wireless manner (for example, by infrared, radio, or microwave). The computer-readable storage medium may be any usable medium that a computer can access, or a data storage device such as a server or a data center that integrates one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a digital versatile disc (DVD)), a semiconductor medium (for example, a solid state disk (SSD)), or the like.

Claims (28)

  1. An artificial intelligence (AI) training method, wherein the method is applied to a guidance system, and after a user of a cloud training system triggers running of training code on a local device, the method comprises:
    obtaining training task information according to the training code running on the local device, wherein the training code is used to train an AI model, and the AI model is developed by the user based on an AI framework installed on the local device; and
    uploading the training task information to the cloud training system, and notifying the cloud training system to execute a training task corresponding to the training task information.
  2. The method according to claim 1, wherein the training task information comprises information obtained from the training code by using an obtaining component in the AI framework, and information obtained from the local device according to the information in the training code.
  3. The method according to claim 1 or 2, wherein the training task information comprises one or more of the following: training parameters in the training code, the AI model, training program logic in the training code for training the AI model, training environment information of the local device, and cloud training access information for connecting to the cloud training system.
  4. The method according to claim 3, wherein the training environment information of the local device comprises: version information of the AI framework, and/or version information of a programming language of the training code.
  5. The method according to any one of claims 1 to 4, wherein the method further comprises:
    receiving a training data acquisition request sent by the cloud training system while executing the training task; and
    obtaining the training data according to the training data acquisition request, and sending the training data to the cloud training system.
  6. The method according to any one of claims 1 to 5, wherein before the notifying the cloud training system to execute the training task corresponding to the training task information, the method further comprises:
    receiving an environment preparation success response returned by the cloud training system.
  7. The method according to any one of claims 1 to 6, wherein the method further comprises: receiving a trained AI model returned by the cloud training system.
  8. The method according to any one of claims 1 to 7, wherein the guidance system is obtained from the cloud training system and installed on the local device.
  9. An AI training method, wherein the method is applied to a cloud training system and comprises:
    obtaining training task information sent by a guidance system after a user triggers running of training code on a local device, wherein the training task information comprises training environment information of the local device;
    preparing a cloud training environment according to the training environment information; and
    executing, based on the cloud training environment, a training task corresponding to the training task information.
  10. The method according to claim 9, wherein the training environment information of the local device comprises: version information of an AI framework on which an AI model to be trained depends, and/or version information of a programming language used by the training code for training the AI model.
  11. The method according to claim 10, wherein the preparing a cloud training environment according to the training environment information comprises:
    setting, according to the version information of the AI framework and the version information of the programming language of the training code, the AI framework and the programming language used to execute the training task in the cloud training environment.
  12. The method according to any one of claims 9 to 11, wherein the training task information further comprises: training parameters in the training code, an AI model, and training program logic in the training code for training the AI model; and
    the executing, based on the cloud training environment, a training task corresponding to the training task information comprises: training the AI model in the prepared cloud training environment according to the training parameters and the training program logic.
  13. The method according to any one of claims 9 to 12, wherein the training task information further comprises cloud training access information, and before the preparing a cloud training environment according to the training environment information, the method further comprises:
    performing authentication and/or a billing query on the training task corresponding to the training task information according to the cloud training access information.
  14. A guidance system, comprising:
    an obtaining module, configured to: after a user of a cloud training system triggers running of training code on a local device, obtain training task information according to the training code running on the local device, wherein the training code is used to train an AI model, and the AI model is developed by the user based on an AI framework installed on the local device; and
    a sending module, configured to upload the training task information to the cloud training system, and notify the cloud training system to execute a training task corresponding to the training task information.
  15. The system according to claim 14, wherein the training task information comprises information obtained from the training code by using an obtaining component in the AI framework, and information obtained from the local device according to the information in the training code.
  16. The system according to claim 14 or 15, wherein the training task information comprises one or more of the following: training parameters in the training code, the AI model, training program logic in the training code for training the AI model, training environment information of the local device, and cloud training access information for connecting to the cloud training system.
  17. The system according to claim 16, wherein the training environment information of the local device comprises: version information of the AI framework, and/or version information of a programming language of the training code.
  18. The system according to any one of claims 14 to 17, wherein the system further comprises a receiving unit;
    the receiving unit is configured to receive a training data acquisition request sent by the cloud training system while executing the training task;
    the obtaining unit is further configured to obtain the training data according to the training data acquisition request; and
    the sending unit is further configured to send the training data to the cloud training system.
  19. The system according to any one of claims 14 to 18, wherein the system further comprises a receiving unit;
    the receiving unit is configured to receive an environment preparation success response returned by the cloud training system before the sending unit notifies the cloud training system to execute the training task corresponding to the training task information.
  20. The system according to any one of claims 14 to 19, wherein the system further comprises a receiving unit;
    the receiving unit is configured to receive a trained AI model returned by the cloud training system.
  21. The system according to any one of claims 14 to 20, wherein the guidance system is obtained from the cloud training system and installed on the local device.
  22. A cloud training system, comprising:
    an environment preparation unit, configured to obtain training task information sent by a guidance system after a user triggers running of training code on a local device, wherein the training task information comprises training environment information of the local device, and to prepare a cloud training environment according to the training environment information; and
    a training task execution unit, configured to execute, based on the cloud training environment, a training task corresponding to the training task information.
  23. The system according to claim 22, wherein the training environment information of the local device comprises: version information of an AI framework on which an AI model to be trained depends, and/or version information of a programming language of the training code for training the AI model.
  24. The system according to claim 23, wherein the environment preparation unit is specifically configured to set, according to the version information of the AI framework and the version information of the programming language of the training code, the AI framework and the programming language used to execute the training task in the cloud training environment.
  25. The system according to any one of claims 22 to 24, wherein the training task information further comprises: training parameters in the training code, an AI model, and training program logic in the training code for training the AI model; and
    the training task execution unit is specifically configured to train the AI model in the prepared cloud training environment according to the training parameters and the training program logic.
  26. The system according to any one of claims 22 to 25, wherein the training task information further comprises cloud training access information, and the environment preparation unit is further configured to perform authentication and/or a billing query on the training task corresponding to the training task information according to the cloud training access information.
  27. A computing device, comprising a processor and a memory, wherein the memory stores computer instructions, and the processor executes the computer instructions to cause the computing device to perform the method according to any one of claims 1 to 8, or to perform the method according to any one of claims 9 to 13.
  28. A computer-readable storage medium, wherein the computer-readable storage medium stores computer program code, and when the computer program code is executed by a computing device, the computing device performs the method according to any one of claims 1 to 8, or performs the method according to any one of claims 9 to 13.
PCT/CN2021/123021 2020-10-14 2021-10-11 Artificial intelligence (ai) training method, system, and device WO2022078280A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN202011096163 2020-10-14
CN202011096163.6 2020-10-14
CN202011626123.8 2020-12-30
CN202011626123.8A CN114358302A (en) 2020-10-14 2020-12-30 Artificial intelligence AI training method, system and equipment

Publications (1)

Publication Number Publication Date
WO2022078280A1 true WO2022078280A1 (en) 2022-04-21

Family

ID=81089624

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/123021 WO2022078280A1 (en) 2020-10-14 2021-10-11 Artificial intelligence (ai) training method, system, and device

Country Status (2)

Country Link
CN (1) CN114358302A (en)
WO (1) WO2022078280A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117670340A (en) * 2022-08-25 2024-03-08 华为技术有限公司 Rights and interests distribution method and device
CN118377627B (en) * 2024-06-26 2024-09-13 成都天巡微小卫星科技有限责任公司 Space machine vision platform based on heterogeneous AI framework and calculation training method thereof

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013064903A2 (en) * 2011-11-04 2013-05-10 Furuno Electric Co., Ltd. Computer-aided training systems, methods and apparatuses
CN110378463A (en) * 2019-07-15 2019-10-25 北京智能工场科技有限公司 A kind of artificial intelligence model standardized training platform and automated system
CN110427983A (en) * 2019-07-15 2019-11-08 北京智能工场科技有限公司 A kind of whole process artificial intelligence matching system and its data processing method based on local model and cloud feedback
CN110795141A (en) * 2019-10-12 2020-02-14 广东浪潮大数据研究有限公司 Training task submitting method, device, equipment and medium
CN111625361A (en) * 2020-05-26 2020-09-04 华东师范大学 Joint learning framework based on cooperation of cloud server and IoT (Internet of things) equipment
CN111738404A (en) * 2020-05-08 2020-10-02 深圳市万普拉斯科技有限公司 Model training task processing method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN114358302A (en) 2022-04-15

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21879324

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21879324

Country of ref document: EP

Kind code of ref document: A1