CN116048812A - Design method and device for shared AI model training platform - Google Patents

Design method and device for shared AI model training platform

Info

Publication number
CN116048812A
Authority
CN
China
Prior art keywords
data
organization
user
management
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310120160.9A
Other languages
Chinese (zh)
Inventor
宋虎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong New Generation Information Industry Technology Research Institute Co Ltd
Original Assignee
Shandong New Generation Information Industry Technology Research Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong New Generation Information Industry Technology Research Institute Co Ltd filed Critical Shandong New Generation Information Industry Technology Research Institute Co Ltd
Priority to CN202310120160.9A
Publication of CN116048812A
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F 9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request, the resource being a machine, e.g. CPUs, servers, terminals
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5061 Partitioning or combining of resources
    • G06F 9/5072 Grid computing
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to the field of artificial intelligence, and in particular to a design method for a shared AI model training platform, which comprises the following steps: S1, an end user creates a user account in the cloud and declares the organization to which the user belongs; other users may join the organization, and the user who first created the organization is its administrator by default; after the account is created, a node access instruction containing the cloud service address and a registration verification key is generated for use when an edge node registers. S2, an edge machine with GPU computing resources runs the generated registration instruction and becomes a child node under the organization, available to all of its users; once registration is completed, the cloud can view the node's computing power usage and the processes occupying it. S3, the user uploads labeled data in a standardized format; all users in the same organization as the data set can view and use it, while users in other organizations cannot. Compared with the prior art, the method and device allow the platform's training resources to be shared while keeping each organization's data isolated to a certain extent, ensuring that the platform can be operated and used smoothly.

Description

Design method and device for shared AI model training platform
Technical Field
The invention relates to the field of artificial intelligence, and particularly provides a design method and device for a shared AI model training platform.
Background
Researchers in machine learning and deep learning must train models continuously in the hope of obtaining a model that performs well enough for practical production. Model training is a complex and often chaotic process, so how to build an orderly, shared training platform is a technical problem to be solved by those skilled in the art.
Disclosure of Invention
In view of the shortcomings of the prior art, the invention provides a highly practical design method for a shared AI model training platform.
A further aim of the invention is to provide a shared AI model training platform design device that is reasonably designed, safe and applicable.
The technical solution adopted by the invention to solve the above technical problem is as follows:
a design method of a shared AI model training platform comprises the following steps:
s1, a terminal user creates a user at a cloud end and declares an organization where the terminal user is located, other users join the organization, the user who creates the organization first defaults to be an administrator, and after the user creates, a node access instruction is generated to contain a cloud service address and a registration verification key for use when an edge node registers;
s2, a registration instruction generated by edge operation of the GPU computing resources becomes a child node under organization for all users to use, and the cloud can check the computing power use condition of the node and the related occupied process after registration is completed;
s3, uploading standardized format data which are marked and completed by the user, wherein all users in the same organization of the data set can view and use the standardized format data, and different organizations cannot view and use the standardized format data.
Further, in step S3, a training task is created by selecting the data set to use, checking node computing power, choosing an idle node and setting the various parameters; once the task is created it can be executed, and the result is automatically uploaded to the cloud after the training task is completed;
and when a task is executed, the cloud service transmits the data and parameter settings to the edge node for training, and all result data produced during training are transmitted back to the cloud management center in real time.
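By way of non-limiting illustration, the following Python sketch shows one possible form of the node access instruction of step S1 (cloud service address plus registration verification key) and of the registration request an edge node issues in step S2. The endpoint path, field names and the use of the requests library are assumptions made for this example and are not prescribed by the disclosure.

```python
# Illustrative sketch only: the instruction format and the registration endpoint
# are assumptions, not part of the disclosed platform design.
import secrets
import requests


def generate_node_access_instruction(cloud_service_address: str, organization: str) -> dict:
    """Cloud side (step S1): issue the instruction an edge node needs in order to register."""
    return {
        "cloud_service_address": cloud_service_address,
        "organization": organization,
        "registration_key": secrets.token_urlsafe(32),  # registration verification key
    }


def register_edge_node(instruction: dict, node_name: str, gpu_count: int) -> dict:
    """Edge side (step S2): run the registration instruction and become a child node."""
    resp = requests.post(
        f"{instruction['cloud_service_address']}/api/nodes/register",  # assumed endpoint
        json={
            "organization": instruction["organization"],
            "registration_key": instruction["registration_key"],
            "node_name": node_name,
            "gpu_count": gpu_count,
        },
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()  # e.g. the child-node identifier assigned by the cloud
```

In such a sketch, the cloud would associate the returned child-node identifier with the declared organization, so the node appears under that organization's node management module.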
Further, the cloud management center comprises user management, a data center, task management, log monitoring, node management, resource monitoring and model management;
the end user registers through user management;
the data center stores the standardized data sets uploaded by end users;
task management manages the model training tasks created by end users;
log monitoring records how each key index parameter changes during model training and how computing power resources are consumed while a task is training;
node management manages the child nodes with computing power resources in the same organization;
resource monitoring monitors computing power usage in the child nodes;
and model management stores the training results.
Further, in user management, the end user declares the organization they belong to when registering; each organization has an independent resource space, computing resources and data are strictly isolated between organizations, and users within the same organization can see one another's data and share all of the organization's computing resources.
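A minimal sketch of this organization-level isolation is given below, assuming every stored resource (data set, child node, training result) carries an organization tag assigned when it is uploaded or registered; the in-memory data structures are illustrative only.

```python
# Illustrative sketch of organization-scoped visibility; the structures below are
# assumptions made for the example, not a prescribed storage layout.
from dataclasses import dataclass


@dataclass
class Resource:
    resource_id: str
    organization: str  # owning organization, set at upload or registration time


@dataclass
class User:
    username: str
    organization: str
    is_admin: bool = False


def visible_resources(user: User, resources: list[Resource]) -> list[Resource]:
    """A user sees only resources owned by their own organization; resources of
    other organizations are never returned."""
    return [r for r in resources if r.organization == user.organization]


def can_access(user: User, resource: Resource) -> bool:
    return user.organization == resource.organization
```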
Further, in the data center, the standardized data sets consist of labeled picture or audio file data, and data storage is backed by external public cloud object storage or cloud disk storage.
Further, in task management, a created model training task can be started, paused, ended and viewed so that the entire training process is tracked in real time, and data set selection, hyper-parameter setting and hardware computing power resource selection are completed.
Further, node management manages the child nodes with computing power resources in the same organization, including generating child-node registration instructions and monitoring child-node status.
Further, the computing power usage monitored by resource monitoring in each child node includes GPU utilization, video memory occupancy, CPU utilization, memory occupancy and the resource occupation of each process, and this information is used to grasp computing power usage and schedule training tasks.
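The sketch below shows one way a child node could gather these metrics, assuming the third-party packages pynvml (NVIDIA GPUs only) and psutil are available; the reporting format and the ranking of processes by memory usage are illustrative choices, not part of the disclosure.

```python
# Illustrative metric collection on a child node: GPU utilization, video-memory
# occupancy, CPU and memory usage, and the top resource-occupying processes.
import psutil
import pynvml


def collect_node_metrics(top_n_processes: int = 5) -> dict:
    pynvml.nvmlInit()
    gpus = []
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        gpus.append({
            "gpu_utilization_percent": util.gpu,
            "video_memory_percent": 100.0 * mem.used / mem.total,
        })
    pynvml.nvmlShutdown()

    # Per-process occupation, here ranked by memory share as an example.
    processes = sorted(
        (p.info for p in psutil.process_iter(["pid", "name", "memory_percent"])),
        key=lambda p: p["memory_percent"] or 0.0,
        reverse=True,
    )[:top_n_processes]

    return {
        "gpus": gpus,
        "cpu_percent": psutil.cpu_percent(interval=1.0),
        "memory_percent": psutil.virtual_memory().percent,
        "top_processes": processes,
    }
```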
Further, model management stores the training results, and a model result can be downloaded offline or published directly to a cloud service store for other users to use.
A shared AI model training platform design apparatus, comprising: at least one memory and at least one processor;
the at least one memory for storing a machine readable program;
the at least one processor is configured to invoke the machine-readable program to execute the above shared AI model training platform design method.
Compared with the prior art, the design method and the device for the shared AI model training platform have the following outstanding beneficial effects:
the invention enables the related scientific research personnel platform to train the resource sharing use and the data of the platform to be isolated to a certain extent, thereby ensuring the smoothness of the platform operation use.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings required for describing the embodiments or the prior art are briefly introduced below. It is apparent that the drawings in the following description show only some embodiments of the present invention, and that a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a schematic diagram of a framework of a shared AI model training platform design methodology.
Detailed Description
In order to provide a better understanding of the aspects of the present invention, the present invention will be described in further detail with reference to specific embodiments. It will be apparent that the described embodiments are only some, but not all, embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
A preferred embodiment is given below:
As shown in FIG. 1, the design method for a shared AI model training platform in this embodiment includes the following steps:
S1, an end user creates a user account in the cloud and declares the organization to which the user belongs; other users may join the organization, and the user who first created the organization is its administrator by default; after the account is created, a node access instruction containing the cloud service address and a registration verification key is generated for use when an edge node registers;
S2, an edge machine with GPU computing resources runs the generated registration instruction and becomes a child node under the organization, available to all of its users; once registration is completed, the cloud can view the node's computing power usage and the processes occupying it;
S3, the user uploads labeled data in a standardized format; all users in the same organization as the data set can view and use it, while users in other organizations cannot; to create a training task, the user selects the data set to use, checks node computing power, chooses an idle node and sets the various parameters; once created, the task can be executed, and the result is automatically uploaded to the cloud after the training task is completed.
When a task is executed, the cloud service transmits the data and parameter settings to the edge node for training, and all result data produced during training are transmitted back to the cloud management center in real time.
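By way of illustration only, the sketch below models a training task and the two transfers just described: the cloud dispatching the data location and parameter settings to the selected edge node, and the edge node returning result data to the cloud management center in real time. The HTTP endpoints and field names are assumed for the example and are not prescribed by this embodiment.

```python
# Illustrative task dispatch and result reporting; endpoints are assumptions.
from dataclasses import dataclass, field

import requests


@dataclass
class TrainingTask:
    task_id: str
    dataset_uri: str           # standardized data set chosen from the data center
    node_address: str          # idle child node selected for the task
    hyperparameters: dict = field(default_factory=dict)


def dispatch_task(task: TrainingTask) -> None:
    """Cloud side: transmit the data location and parameter settings to the edge node."""
    requests.post(
        f"http://{task.node_address}/api/tasks",  # assumed edge-node endpoint
        json={
            "task_id": task.task_id,
            "dataset_uri": task.dataset_uri,
            "hyperparameters": task.hyperparameters,
        },
        timeout=10,
    ).raise_for_status()


def report_progress(cloud_address: str, task_id: str, epoch: int, metrics: dict) -> None:
    """Edge side: transmit result data back to the cloud management center as training runs."""
    requests.post(
        f"{cloud_address}/api/tasks/{task_id}/results",  # assumed cloud endpoint
        json={"epoch": epoch, "metrics": metrics},
        timeout=10,
    ).raise_for_status()
```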
Further, the cloud management center comprises user management, a data center, task management, log monitoring, node management, resource monitoring and model management.
When registering, the end user must declare the organization they belong to. Each organization has an independent resource space; computing resources and data are strictly isolated between organizations, while users within the same organization can see the data sets, training tasks, training results and so on uploaded by one another and share all of the organization's computing resources.
The data center is used to store the standardized training data sets uploaded by end users, usually labeled picture or audio file data; data storage can be backed by external media such as public cloud object storage or cloud disk storage.
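As one possible, purely illustrative backend, the sketch below uploads a labeled data file to an S3-compatible public cloud object store using boto3, prefixing the object key with the owning organization so that the data center listing can stay isolated per organization; the bucket name, endpoint and key scheme are placeholders, and cloud disk storage or another object store could be substituted.

```python
# Illustrative upload of a labeled data set to external object storage.
import boto3


def upload_dataset(local_path: str, organization: str, dataset_name: str,
                   endpoint_url: str, bucket: str = "training-datasets") -> str:
    s3 = boto3.client("s3", endpoint_url=endpoint_url)
    # Prefix objects with the organization so listings can be filtered per organization.
    key = f"{organization}/{dataset_name}"
    s3.upload_file(local_path, bucket, key)
    return f"s3://{bucket}/{key}"
```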
Task management is used to manage the model training tasks created by end users and includes sub-modules such as start, pause, end and view, so that the entire training process can be tracked in real time; it also completes data set selection, hyper-parameter setting, hardware computing power resource selection and the like.
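A compact way to picture the start, pause and end sub-modules is as a small state machine over task states, with the view sub-module simply reading the current state; the state names and allowed transitions below are illustrative assumptions rather than a prescribed design.

```python
# Illustrative task lifecycle; states and transitions are assumptions.
from enum import Enum


class TaskState(Enum):
    CREATED = "created"
    RUNNING = "running"
    PAUSED = "paused"
    FINISHED = "finished"


ALLOWED_TRANSITIONS = {
    TaskState.CREATED: {TaskState.RUNNING},
    TaskState.RUNNING: {TaskState.PAUSED, TaskState.FINISHED},
    TaskState.PAUSED: {TaskState.RUNNING, TaskState.FINISHED},
    TaskState.FINISHED: set(),
}


def transition(current: TaskState, target: TaskState) -> TaskState:
    """Apply a start/pause/end action by moving to the target state if allowed."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"cannot go from {current.value} to {target.value}")
    return target
```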
Log monitoring records how each key index parameter changes during model training and how computing power resources are consumed while a task is training.
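For illustration, the sketch below appends one JSON line per training step containing the key index parameters and the resource consumption reported by the node, so the cloud management center can later display their change over time; the record layout is an assumption.

```python
# Illustrative log-monitoring record writer; the record fields are assumptions.
import json
import time


class TrainingLogger:
    def __init__(self, log_path: str):
        self.log_path = log_path

    def log(self, task_id: str, epoch: int, metrics: dict, resources: dict) -> None:
        record = {
            "timestamp": time.time(),
            "task_id": task_id,
            "epoch": epoch,
            "metrics": metrics,       # e.g. {"loss": 0.42, "accuracy": 0.91}
            "resources": resources,   # e.g. the node metrics collected by resource monitoring
        }
        # One JSON object per line keeps the log easy to stream back to the cloud.
        with open(self.log_path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
```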
the node management is used for managing the child nodes with computing power resources in the same organization, and comprises the steps of generating a child node registration instruction, monitoring the state of the child nodes and the like.
The resource monitoring is used for monitoring the computing power use condition in the child node, and comprises GPU use rate, video memory occupancy rate, CPU use rate, memory occupancy rate and the resource occupation condition of each process, so that the computing power use condition can be controlled conveniently, and training tasks can be reasonably arranged.
Model management is used to store the training results; because the final model results differ considerably depending on the selected deep learning framework, a model result can be downloaded offline or published directly to a cloud service store for other users to use.
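The sketch below illustrates one way model management could keep results from different deep learning frameworks side by side and expose them either for offline download or for publication to a cloud service store; the directory layout and the publish hook are assumptions made for the example.

```python
# Illustrative model-management storage; paths and the publish hook are assumptions.
import shutil
from pathlib import Path


def store_model_result(artifact_path: str, task_id: str, framework: str,
                       model_repo: str = "/data/models") -> Path:
    """Keep results from different frameworks (e.g. PyTorch, TensorFlow) side by side."""
    target_dir = Path(model_repo) / task_id / framework
    target_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(artifact_path, target_dir))


def publish_to_store(model_path: Path, store_url: str) -> None:
    """Placeholder for pushing a stored model to a cloud service store for other users."""
    raise NotImplementedError(f"publishing {model_path} to {store_url} is deployment-specific")
```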
The edge service connects a machine with GPU computing power to the model training platform, registers the node with the cloud according to the instruction provided by the cloud, and declares the organization it belongs to during registration. The edge server can be a computer in a local area network or a virtual machine with GPU computing power resources in a public cloud.
Based on the above method, a shared AI model training platform design device comprises: at least one memory and at least one processor;
the at least one memory for storing a machine readable program;
the at least one processor is configured to invoke the machine-readable program to execute the above shared AI model training platform design method.
The above specific embodiments merely illustrate particular cases of the present invention, and the scope of the present invention includes, but is not limited to, these specific embodiments. Any suitable change or substitution made by a person of ordinary skill in the art that conforms to the claims of the design method and device for a shared AI model training platform of the present invention shall fall within the scope of the present invention.
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (10)

1. A design method for a shared AI model training platform, characterized by comprising the following steps:
S1, an end user creates a user account in the cloud and declares the organization to which the user belongs; other users may join the organization, and the user who first created the organization is its administrator by default; after the account is created, a node access instruction containing the cloud service address and a registration verification key is generated for use when an edge node registers;
S2, an edge machine with GPU computing resources runs the generated registration instruction and becomes a child node under the organization, available to all of its users; once registration is completed, the cloud can view the node's computing power usage and the processes occupying it;
S3, the user uploads labeled data in a standardized format; all users in the same organization as the data set can view and use it, while users in other organizations cannot.
2. The design method for a shared AI model training platform according to claim 1, characterized in that, in step S3, a training task is created by selecting the data set to use, checking node computing power, choosing an idle node and setting the various parameters; once the task is created it can be executed, and the result is automatically uploaded to the cloud after the training task is completed;
and when a task is executed, the cloud service transmits the data and parameter settings to the edge node for training, and all result data produced during training are transmitted back to the cloud management center in real time.
3. The design method for a shared AI model training platform according to claim 1 or 2, characterized in that the cloud management center comprises user management, a data center, task management, log monitoring, node management, resource monitoring and model management;
the end user registers through user management;
the data center stores the standardized data sets uploaded by end users;
task management manages the model training tasks created by end users;
log monitoring records how each key index parameter changes during model training and how computing power resources are consumed while a task is training;
node management manages the child nodes with computing power resources in the same organization;
resource monitoring monitors computing power usage in the child nodes;
and model management stores the training results.
4. The design method for a shared AI model training platform according to claim 3, characterized in that, in user management, the end user declares the organization they belong to when registering; each organization has an independent resource space, computing resources and data are strictly isolated between organizations, and users within the same organization can see one another's data and share all of the organization's computing resources.
5. The design method for a shared AI model training platform according to claim 4, characterized in that, in the data center, the standardized data sets consist of labeled picture or audio file data, and data storage is backed by external public cloud object storage or cloud disk storage.
6. The design method for a shared AI model training platform according to claim 5, characterized in that, in task management, a created model training task can be started, paused, ended and viewed so that the entire training process is tracked in real time, and data set selection, hyper-parameter setting and hardware computing power resource selection are completed.
7. The design method for a shared AI model training platform according to claim 6, characterized in that node management manages the child nodes with computing power resources in the same organization, including generating child-node registration instructions and monitoring child-node status.
8. The design method for a shared AI model training platform according to claim 7, characterized in that resource monitoring monitors computing power usage in the child nodes, including GPU utilization, video memory occupancy, CPU utilization, memory occupancy and the resource occupation of each process, and this information is used to grasp computing power usage and schedule training tasks.
9. The design method for a shared AI model training platform according to claim 8, characterized in that model management stores the training results, and a model result is downloaded offline or published directly to a cloud service store for other users to use.
10. A shared AI model training platform design apparatus, comprising: at least one memory and at least one processor;
the at least one memory for storing a machine readable program;
the at least one processor being configured to invoke the machine readable program to perform the method of any of claims 1 to 9.
CN202310120160.9A 2023-02-16 2023-02-16 Design method and device for shared AI model training platform Pending CN116048812A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310120160.9A CN116048812A (en) 2023-02-16 2023-02-16 Design method and device for shared AI model training platform

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310120160.9A CN116048812A (en) 2023-02-16 2023-02-16 Design method and device for shared AI model training platform

Publications (1)

Publication Number Publication Date
CN116048812A true CN116048812A (en) 2023-05-02

Family

ID=86122196

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310120160.9A Pending CN116048812A (en) 2023-02-16 2023-02-16 Design method and device for shared AI model training platform

Country Status (1)

Country Link
CN (1) CN116048812A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117112539A (en) * 2023-10-23 2023-11-24 北京万界数据科技有限责任公司 Machine learning-oriented data model management system
CN117112539B (en) * 2023-10-23 2024-01-05 北京万界数据科技有限责任公司 Machine learning-oriented data model management system

Similar Documents

Publication Publication Date Title
US10515000B2 (en) Systems and methods for performance testing cloud applications from multiple different geographic locations
CN108123994B (en) Industrial-field-oriented cloud platform architecture
CN104866374B (en) Discrete event parallel artificial and method for synchronizing time based on multitask
Schulte et al. Elastic Business Process Management: State of the art and open challenges for BPM in the cloud
CN104463492B (en) A kind of operation management method of power system cloud emulation platform
Gu et al. Liquid: Intelligent resource estimation and network-efficient scheduling for deep learning jobs on distributed GPU clusters
CN104737133B (en) Optimized using the Distributed Application of service group
US20220379202A1 (en) Data packet synchronization method and apparatus, device, and storage medium
CN105677836A (en) Big data processing and solving system simultaneously supporting offline data and real-time online data
CN108920948A (en) A kind of anti-fraud streaming computing device and method
CN107370796A (en) A kind of intelligent learning system based on Hyper TF
CN105471662B (en) Cloud Server, virtual network strategy centralized control system and method
CN113760180A (en) Storage resource management method, device, equipment and computer readable storage medium
CN116048812A (en) Design method and device for shared AI model training platform
CN103678892A (en) Role object management method and role object management device
CN103502939B (en) The method and system that virtual machine is managed
CN110162407A (en) A kind of method for managing resource and device
CN112395341B (en) Federal learning management method and system based on federal cloud cooperation network
CN111966585B (en) Execution method, device, equipment and system of test task
CN113902122A (en) Federal model collaborative training method and device, computer equipment and storage medium
Aldin et al. Strict timed causal consistency as a hybrid consistency model in the cloud environment
RU122505U1 (en) HARDWARE-COMPUTER COMPLEX FOR PROVIDING ACCESS TO THE SOFTWARE IN THE CONCEPT OF CLOUD COMPUTING
CN106254452A (en) The big data access method of medical treatment under cloud platform
Podolskiy et al. Practical education in iot through collaborative work on open-source projects with industry and entrepreneurial organizations
CN112037103A (en) Government affair management system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination