CN116048812A - Design method and device for shared AI model training platform - Google Patents

Design method and device for shared AI model training platform

Info

Publication number
CN116048812A
Authority
CN
China
Prior art keywords
data
organization
user
management
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310120160.9A
Other languages
Chinese (zh)
Inventor
宋虎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong New Generation Information Industry Technology Research Institute Co Ltd
Original Assignee
Shandong New Generation Information Industry Technology Research Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong New Generation Information Industry Technology Research Institute Co Ltd filed Critical Shandong New Generation Information Industry Technology Research Institute Co Ltd
Priority to CN202310120160.9A
Publication of CN116048812A
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F 9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request, the resource being a machine, e.g. CPUs, servers, terminals
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5061 Partitioning or combining of resources
    • G06F 9/5072 Grid computing
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to the field of artificial intelligence, and in particular to a design method for a shared AI model training platform, which comprises the following steps: S1, an end user creates a user account in the cloud and declares the organization to which the user belongs; other users may join the organization, and the user who first created the organization is its administrator by default; after the account is created, a node access instruction containing the cloud service address and a registration verification key is generated for use when an edge node registers. S2, an edge machine with GPU computing resources runs the generated registration instruction and becomes a child node under the organization, available to all of its users; once registration is completed, the cloud can view the node's computing power usage and the processes occupying it. S3, the user uploads labeled data in a standardized format; all users in the same organization as the data set can view and use it, while users in other organizations cannot. Compared with the prior art, the method and device allow the platform's training resources to be shared while keeping each organization's data isolated to a certain extent, ensuring that the platform can be operated and used smoothly.

Description

Design method and device for shared AI model training platform
Technical Field
The invention relates to the field of artificial intelligence, and particularly provides a design method and device for a shared AI model training platform.
Background
Researchers in machine learning and deep learning must train models continuously in the hope of obtaining a model that performs well enough for practical production. Model training is a complex and often chaotic process, so how to build an orderly, shared training platform is a technical problem to be solved by those skilled in the art.
Disclosure of Invention
In view of the shortcomings of the prior art, the invention provides a highly practical design method for a shared AI model training platform.
A further aim of the invention is to provide a shared AI model training platform design device that is reasonably designed, safe and applicable.
The technical solution adopted by the invention to solve the above technical problem is as follows:
a design method of a shared AI model training platform comprises the following steps:
s1, a terminal user creates a user at a cloud end and declares an organization where the terminal user is located, other users join the organization, the user who creates the organization first defaults to be an administrator, and after the user creates, a node access instruction is generated to contain a cloud service address and a registration verification key for use when an edge node registers;
s2, a registration instruction generated by edge operation of the GPU computing resources becomes a child node under organization for all users to use, and the cloud can check the computing power use condition of the node and the related occupied process after registration is completed;
s3, uploading standardized format data which are marked and completed by the user, wherein all users in the same organization of the data set can view and use the standardized format data, and different organizations cannot view and use the standardized format data.
Further, in step S3, a training task is created by selecting the data set to use, checking node computing power, choosing an idle node and setting the various parameters; once the task is created it can be executed, and the result is automatically uploaded to the cloud after the training task is completed;
and when a task is executed, the cloud service transmits the data and parameter settings to the edge node for training, and all result data produced during training are transmitted back to the cloud management center in real time.
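By way of non-limiting illustration, the following Python sketch shows one possible form of the node access instruction of step S1 (cloud service address plus registration verification key) and of the registration request an edge node issues in step S2. The endpoint path, field names and the use of the requests library are assumptions made for this example and are not prescribed by the disclosure.

```python
# Illustrative sketch only: the instruction format and the registration endpoint
# are assumptions, not part of the disclosed platform design.
import secrets
import requests


def generate_node_access_instruction(cloud_service_address: str, organization: str) -> dict:
    """Cloud side (step S1): issue the instruction an edge node needs in order to register."""
    return {
        "cloud_service_address": cloud_service_address,
        "organization": organization,
        "registration_key": secrets.token_urlsafe(32),  # registration verification key
    }


def register_edge_node(instruction: dict, node_name: str, gpu_count: int) -> dict:
    """Edge side (step S2): run the registration instruction and become a child node."""
    resp = requests.post(
        f"{instruction['cloud_service_address']}/api/nodes/register",  # assumed endpoint
        json={
            "organization": instruction["organization"],
            "registration_key": instruction["registration_key"],
            "node_name": node_name,
            "gpu_count": gpu_count,
        },
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()  # e.g. the child-node identifier assigned by the cloud
```

In such a sketch, the cloud would associate the returned child-node identifier with the declared organization, so the node appears under that organization's node management module.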
Further, the cloud management center comprises user management, a data center, task management, log monitoring, node management, resource monitoring and model management;
the end user registers through user management;
the data center stores the standardized data sets uploaded by end users;
task management manages the model training tasks created by end users;
log monitoring records how each key index parameter changes during model training and how computing power resources are consumed while a task is training;
node management manages the child nodes with computing power resources in the same organization;
resource monitoring monitors computing power usage in the child nodes;
and model management stores the training results.
Further, in user management, the end user declares the organization they belong to when registering; each organization has an independent resource space, computing resources and data are strictly isolated between organizations, and users within the same organization can see one another's data and share all of the organization's computing resources.
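A minimal sketch of this organization-level isolation is given below, assuming every stored resource (data set, child node, training result) carries an organization tag assigned when it is uploaded or registered; the in-memory data structures are illustrative only.

```python
# Illustrative sketch of organization-scoped visibility; the structures below are
# assumptions made for the example, not a prescribed storage layout.
from dataclasses import dataclass


@dataclass
class Resource:
    resource_id: str
    organization: str  # owning organization, set at upload or registration time


@dataclass
class User:
    username: str
    organization: str
    is_admin: bool = False


def visible_resources(user: User, resources: list[Resource]) -> list[Resource]:
    """A user sees only resources owned by their own organization; resources of
    other organizations are never returned."""
    return [r for r in resources if r.organization == user.organization]


def can_access(user: User, resource: Resource) -> bool:
    return user.organization == resource.organization
```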
Further, in the data center, the standardized data sets consist of labeled picture or audio file data, and data storage is backed by external public cloud object storage or cloud disk storage.
Further, in task management, a created model training task can be started, paused, ended and viewed so that the entire training process is tracked in real time, and data set selection, hyper-parameter setting and hardware computing power resource selection are completed.
Further, node management manages the child nodes with computing power resources in the same organization, including generating child-node registration instructions and monitoring child-node status.
Further, the computing power usage monitored by resource monitoring in each child node includes GPU utilization, video memory occupancy, CPU utilization, memory occupancy and the resource occupation of each process, and this information is used to grasp computing power usage and schedule training tasks.
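The sketch below shows one way a child node could gather these metrics, assuming the third-party packages pynvml (NVIDIA GPUs only) and psutil are available; the reporting format and the ranking of processes by memory usage are illustrative choices, not part of the disclosure.

```python
# Illustrative metric collection on a child node: GPU utilization, video-memory
# occupancy, CPU and memory usage, and the top resource-occupying processes.
import psutil
import pynvml


def collect_node_metrics(top_n_processes: int = 5) -> dict:
    pynvml.nvmlInit()
    gpus = []
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        gpus.append({
            "gpu_utilization_percent": util.gpu,
            "video_memory_percent": 100.0 * mem.used / mem.total,
        })
    pynvml.nvmlShutdown()

    # Per-process occupation, here ranked by memory share as an example.
    processes = sorted(
        (p.info for p in psutil.process_iter(["pid", "name", "memory_percent"])),
        key=lambda p: p["memory_percent"] or 0.0,
        reverse=True,
    )[:top_n_processes]

    return {
        "gpus": gpus,
        "cpu_percent": psutil.cpu_percent(interval=1.0),
        "memory_percent": psutil.virtual_memory().percent,
        "top_processes": processes,
    }
```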
Further, model management stores the training results, and a model result can be downloaded offline or published directly to a cloud service store for other users to use.
A shared AI model training platform design apparatus, comprising: at least one memory and at least one processor;
the at least one memory for storing a machine readable program;
the at least one processor is configured to invoke the machine-readable program to execute the above shared AI model training platform design method.
Compared with the prior art, the design method and the device for the shared AI model training platform have the following outstanding beneficial effects:
the invention enables the related scientific research personnel platform to train the resource sharing use and the data of the platform to be isolated to a certain extent, thereby ensuring the smoothness of the platform operation use.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings required for describing the embodiments or the prior art are briefly introduced below. It is apparent that the drawings in the following description show only some embodiments of the present invention, and that a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a schematic diagram of a framework of a shared AI model training platform design methodology.
Detailed Description
In order to provide a better understanding of the aspects of the present invention, the present invention will be described in further detail with reference to specific embodiments. It will be apparent that the described embodiments are only some, but not all, embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
A preferred embodiment is given below:
As shown in FIG. 1, the design method for a shared AI model training platform in this embodiment includes the following steps:
S1, an end user creates a user account in the cloud and declares the organization to which the user belongs; other users may join the organization, and the user who first created the organization is its administrator by default; after the account is created, a node access instruction containing the cloud service address and a registration verification key is generated for use when an edge node registers;
S2, an edge machine with GPU computing resources runs the generated registration instruction and becomes a child node under the organization, available to all of its users; once registration is completed, the cloud can view the node's computing power usage and the processes occupying it;
S3, the user uploads labeled data in a standardized format; all users in the same organization as the data set can view and use it, while users in other organizations cannot; to create a training task, the user selects the data set to use, checks node computing power, chooses an idle node and sets the various parameters; once created, the task can be executed, and the result is automatically uploaded to the cloud after the training task is completed.
When a task is executed, the cloud service transmits the data and parameter settings to the edge node for training, and all result data produced during training are transmitted back to the cloud management center in real time.
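By way of illustration only, the sketch below models a training task and the two transfers just described: the cloud dispatching the data location and parameter settings to the selected edge node, and the edge node returning result data to the cloud management center in real time. The HTTP endpoints and field names are assumed for the example and are not prescribed by this embodiment.

```python
# Illustrative task dispatch and result reporting; endpoints are assumptions.
from dataclasses import dataclass, field

import requests


@dataclass
class TrainingTask:
    task_id: str
    dataset_uri: str           # standardized data set chosen from the data center
    node_address: str          # idle child node selected for the task
    hyperparameters: dict = field(default_factory=dict)


def dispatch_task(task: TrainingTask) -> None:
    """Cloud side: transmit the data location and parameter settings to the edge node."""
    requests.post(
        f"http://{task.node_address}/api/tasks",  # assumed edge-node endpoint
        json={
            "task_id": task.task_id,
            "dataset_uri": task.dataset_uri,
            "hyperparameters": task.hyperparameters,
        },
        timeout=10,
    ).raise_for_status()


def report_progress(cloud_address: str, task_id: str, epoch: int, metrics: dict) -> None:
    """Edge side: transmit result data back to the cloud management center as training runs."""
    requests.post(
        f"{cloud_address}/api/tasks/{task_id}/results",  # assumed cloud endpoint
        json={"epoch": epoch, "metrics": metrics},
        timeout=10,
    ).raise_for_status()
```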
Further, the cloud management center comprises user management, a data center, task management, log monitoring, node management, resource monitoring and model management.
When registering, the end user must declare the organization they belong to. Each organization has an independent resource space; computing resources and data are strictly isolated between organizations, while users within the same organization can see the data sets, training tasks, training results and so on uploaded by one another and share all of the organization's computing resources.
The data center is used to store the standardized training data sets uploaded by end users, usually labeled picture or audio file data; data storage can be backed by external media such as public cloud object storage or cloud disk storage.
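As one possible, purely illustrative backend, the sketch below uploads a labeled data file to an S3-compatible public cloud object store using boto3, prefixing the object key with the owning organization so that the data center listing can stay isolated per organization; the bucket name, endpoint and key scheme are placeholders, and cloud disk storage or another object store could be substituted.

```python
# Illustrative upload of a labeled data set to external object storage.
import boto3


def upload_dataset(local_path: str, organization: str, dataset_name: str,
                   endpoint_url: str, bucket: str = "training-datasets") -> str:
    s3 = boto3.client("s3", endpoint_url=endpoint_url)
    # Prefix objects with the organization so listings can be filtered per organization.
    key = f"{organization}/{dataset_name}"
    s3.upload_file(local_path, bucket, key)
    return f"s3://{bucket}/{key}"
```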
Task management is used to manage the model training tasks created by end users and includes sub-modules such as start, pause, end and view, so that the entire training process can be tracked in real time; it also completes data set selection, hyper-parameter setting, hardware computing power resource selection and the like.
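A compact way to picture the start, pause and end sub-modules is as a small state machine over task states, with the view sub-module simply reading the current state; the state names and allowed transitions below are illustrative assumptions rather than a prescribed design.

```python
# Illustrative task lifecycle; states and transitions are assumptions.
from enum import Enum


class TaskState(Enum):
    CREATED = "created"
    RUNNING = "running"
    PAUSED = "paused"
    FINISHED = "finished"


ALLOWED_TRANSITIONS = {
    TaskState.CREATED: {TaskState.RUNNING},
    TaskState.RUNNING: {TaskState.PAUSED, TaskState.FINISHED},
    TaskState.PAUSED: {TaskState.RUNNING, TaskState.FINISHED},
    TaskState.FINISHED: set(),
}


def transition(current: TaskState, target: TaskState) -> TaskState:
    """Apply a start/pause/end action by moving to the target state if allowed."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"cannot go from {current.value} to {target.value}")
    return target
```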
Log monitoring records how each key index parameter changes during model training and how computing power resources are consumed while a task is training.
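For illustration, the sketch below appends one JSON line per training step containing the key index parameters and the resource consumption reported by the node, so the cloud management center can later display their change over time; the record layout is an assumption.

```python
# Illustrative log-monitoring record writer; the record fields are assumptions.
import json
import time


class TrainingLogger:
    def __init__(self, log_path: str):
        self.log_path = log_path

    def log(self, task_id: str, epoch: int, metrics: dict, resources: dict) -> None:
        record = {
            "timestamp": time.time(),
            "task_id": task_id,
            "epoch": epoch,
            "metrics": metrics,       # e.g. {"loss": 0.42, "accuracy": 0.91}
            "resources": resources,   # e.g. the node metrics collected by resource monitoring
        }
        # One JSON object per line keeps the log easy to stream back to the cloud.
        with open(self.log_path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
```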
the node management is used for managing the child nodes with computing power resources in the same organization, and comprises the steps of generating a child node registration instruction, monitoring the state of the child nodes and the like.
The resource monitoring is used for monitoring the computing power use condition in the child node, and comprises GPU use rate, video memory occupancy rate, CPU use rate, memory occupancy rate and the resource occupation condition of each process, so that the computing power use condition can be controlled conveniently, and training tasks can be reasonably arranged.
Model management is used to store the training results; because the final model results differ considerably depending on the selected deep learning framework, a model result can be downloaded offline or published directly to a cloud service store for other users to use.
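The sketch below illustrates one way model management could keep results from different deep learning frameworks side by side and expose them either for offline download or for publication to a cloud service store; the directory layout and the publish hook are assumptions made for the example.

```python
# Illustrative model-management storage; paths and the publish hook are assumptions.
import shutil
from pathlib import Path


def store_model_result(artifact_path: str, task_id: str, framework: str,
                       model_repo: str = "/data/models") -> Path:
    """Keep results from different frameworks (e.g. PyTorch, TensorFlow) side by side."""
    target_dir = Path(model_repo) / task_id / framework
    target_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(artifact_path, target_dir))


def publish_to_store(model_path: Path, store_url: str) -> None:
    """Placeholder for pushing a stored model to a cloud service store for other users."""
    raise NotImplementedError(f"publishing {model_path} to {store_url} is deployment-specific")
```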
The edge service connects a machine with GPU computing power to the model training platform, registers the node with the cloud according to the instruction provided by the cloud, and declares the organization it belongs to during registration. The edge server can be a computer in a local area network or a virtual machine with GPU computing power resources in a public cloud.
Based on the above method, a shared AI model training platform design device comprises: at least one memory and at least one processor;
the at least one memory for storing a machine readable program;
the at least one processor is configured to invoke the machine-readable program to execute the above shared AI model training platform design method.
The above specific embodiments merely illustrate particular cases of the present invention, and the scope of the present invention includes, but is not limited to, these specific embodiments. Any suitable change or substitution made by a person of ordinary skill in the art that conforms to the claims of the design method and device for a shared AI model training platform of the present invention shall fall within the scope of the present invention.
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (10)

1. A design method for a shared AI model training platform, characterized by comprising the following steps:
S1, an end user creates a user account in the cloud and declares the organization to which the user belongs; other users may join the organization, and the user who first created the organization is its administrator by default; after the account is created, a node access instruction containing the cloud service address and a registration verification key is generated for use when an edge node registers;
S2, an edge machine with GPU computing resources runs the generated registration instruction and becomes a child node under the organization, available to all of its users; once registration is completed, the cloud can view the node's computing power usage and the processes occupying it;
S3, the user uploads labeled data in a standardized format; all users in the same organization as the data set can view and use it, while users in other organizations cannot.
2. The design method for a shared AI model training platform according to claim 1, characterized in that, in step S3, a training task is created by selecting the data set to use, checking node computing power, choosing an idle node and setting the various parameters; once the task is created it can be executed, and the result is automatically uploaded to the cloud after the training task is completed;
and when a task is executed, the cloud service transmits the data and parameter settings to the edge node for training, and all result data produced during training are transmitted back to the cloud management center in real time.
3. The design method for a shared AI model training platform according to claim 1 or 2, characterized in that the cloud management center comprises user management, a data center, task management, log monitoring, node management, resource monitoring and model management;
the end user registers through user management;
the data center stores the standardized data sets uploaded by end users;
task management manages the model training tasks created by end users;
log monitoring records how each key index parameter changes during model training and how computing power resources are consumed while a task is training;
node management manages the child nodes with computing power resources in the same organization;
resource monitoring monitors computing power usage in the child nodes;
and model management stores the training results.
4. The design method for a shared AI model training platform according to claim 3, characterized in that, in user management, the end user declares the organization they belong to when registering; each organization has an independent resource space, computing resources and data are strictly isolated between organizations, and users within the same organization can see one another's data and share all of the organization's computing resources.
5. The design method for a shared AI model training platform according to claim 4, characterized in that, in the data center, the standardized data sets consist of labeled picture or audio file data, and data storage is backed by external public cloud object storage or cloud disk storage.
6. The design method for a shared AI model training platform according to claim 5, characterized in that, in task management, a created model training task can be started, paused, ended and viewed so that the entire training process is tracked in real time, and data set selection, hyper-parameter setting and hardware computing power resource selection are completed.
7. The design method for a shared AI model training platform according to claim 6, characterized in that node management manages the child nodes with computing power resources in the same organization, including generating child-node registration instructions and monitoring child-node status.
8. The design method for a shared AI model training platform according to claim 7, characterized in that resource monitoring monitors computing power usage in the child nodes, including GPU utilization, video memory occupancy, CPU utilization, memory occupancy and the resource occupation of each process, and this information is used to grasp computing power usage and schedule training tasks.
9. The design method for a shared AI model training platform according to claim 8, characterized in that model management stores the training results, and a model result is downloaded offline or published directly to a cloud service store for other users to use.
10. A shared AI model training platform design apparatus, comprising: at least one memory and at least one processor;
the at least one memory for storing a machine readable program;
the at least one processor being configured to invoke the machine readable program to perform the method of any of claims 1 to 9.
CN202310120160.9A 2023-02-16 2023-02-16 Design method and device for shared AI model training platform Pending CN116048812A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310120160.9A CN116048812A (en) 2023-02-16 2023-02-16 Design method and device for shared AI model training platform

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310120160.9A CN116048812A (en) 2023-02-16 2023-02-16 Design method and device for shared AI model training platform

Publications (1)

Publication Number Publication Date
CN116048812A true CN116048812A (en) 2023-05-02

Family

ID=86122196

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310120160.9A Pending CN116048812A (en) 2023-02-16 2023-02-16 Design method and device for shared AI model training platform

Country Status (1)

Country Link
CN (1) CN116048812A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117112539A (en) * 2023-10-23 2023-11-24 北京万界数据科技有限责任公司 Machine learning-oriented data model management system
CN117112539B (en) * 2023-10-23 2024-01-05 北京万界数据科技有限责任公司 Machine learning-oriented data model management system

Similar Documents

Publication Publication Date Title
US10515000B2 (en) Systems and methods for performance testing cloud applications from multiple different geographic locations
CN108123994B (en) Industrial-field-oriented cloud platform architecture
CN104866374B (en) Discrete event parallel artificial and method for synchronizing time based on multitask
Schulte et al. Elastic Business Process Management: State of the art and open challenges for BPM in the cloud
CN104463492B (en) A kind of operation management method of power system cloud emulation platform
Gu et al. Liquid: Intelligent resource estimation and network-efficient scheduling for deep learning jobs on distributed GPU clusters
CN104737133B (en) Optimized using the Distributed Application of service group
US20220379202A1 (en) Data packet synchronization method and apparatus, device, and storage medium
CN105677836A (en) Big data processing and solving system simultaneously supporting offline data and real-time online data
CN108920948A (en) A kind of anti-fraud streaming computing device and method
CN107370796A (en) A kind of intelligent learning system based on Hyper TF
CN105471662B (en) Cloud Server, virtual network strategy centralized control system and method
CN113760180A (en) Storage resource management method, device, equipment and computer readable storage medium
CN116048812A (en) Design method and device for shared AI model training platform
CN103678892A (en) Role object management method and role object management device
CN103502939B (en) The method and system that virtual machine is managed
CN110162407A (en) A kind of method for managing resource and device
CN112395341B (en) Federal learning management method and system based on federal cloud cooperation network
CN111966585B (en) Execution method, device, equipment and system of test task
CN113902122A (en) Federal model collaborative training method and device, computer equipment and storage medium
Aldin et al. Strict timed causal consistency as a hybrid consistency model in the cloud environment
RU122505U1 (en) HARDWARE-COMPUTER COMPLEX FOR PROVIDING ACCESS TO THE SOFTWARE IN THE CONCEPT OF CLOUD COMPUTING
CN106254452A (en) The big data access method of medical treatment under cloud platform
Podolskiy et al. Practical education in iot through collaborative work on open-source projects with industry and entrepreneurial organizations
CN112037103A (en) Government affair management system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination