CN118153649A - Soft and hard all-in-one machine integrating large model training and reasoning and large model training method - Google Patents

Publication number
CN118153649A
Authority
CN
China
Prior art keywords
model
training
model training
request
task
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410585108.5A
Other languages
Chinese (zh)
Other versions
CN118153649B (en
Inventor
刘维炜
纪志强
张涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Shengshi Tianan Technology Co ltd
Original Assignee
Beijing Shengshi Tianan Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Shengshi Tianan Technology Co ltd
Priority to CN202410585108.5A
Publication of CN118153649A
Application granted
Publication of CN118153649B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/092 Reinforcement learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/098 Distributed learning, e.g. federated learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/04 Inference or reasoning models

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Neurology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a soft and hard all-in-one machine integrating large model training and reasoning, and a large model training method. The machine provides strong local computing capacity and reduces dependence on external cloud resources. Through the AI platform control unit, a user can select and combine different model modules from a model algorithm library to customize a personalized model and configure its parameters. The AI platform control unit also receives a model training request initiated by the user, decomposes it into a plurality of model training tasks, adds corresponding security protection logic to the model training tasks and then inserts them into a task queue, and adds corresponding security protection logic to model reasoning requests initiated by the user.

Description

Soft and hard all-in-one machine integrating large model training and reasoning and large model training method
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a soft and hard integrated machine integrating large model training and reasoning and a large model training method.
Background
With the rapid development of artificial intelligence technology, large AI (artificial intelligence) models have shown great potential in many industries. However, existing large model training and deployment usually depends on a central server or cloud service, which introduces data transmission and network delay and lowers the efficiency of training and deploying AI models. Large models are also characterized by a huge number of parameters, high computational complexity and massive training samples, so training them requires substantial computing resources, and it is difficult to train several models efficiently when training requests for multiple large models arrive together. In addition, training and deploying AI models raises concerns about data security and privacy protection. In industries such as finance, telecommunications and healthcare in particular, processing large amounts of sensitive data places higher demands on data security. Enterprises also face technical challenges such as hardware adaptation and data security auditing throughout the full development flow of large models; these challenges raise the technical threshold and limit the wide application of AI technology.
Disclosure of Invention
The invention provides a soft and hard all-in-one machine integrating large model training and reasoning, and a large model training method, to address the low large model training efficiency and the hidden data security risks of the prior art.
The invention provides a soft and hard all-in-one machine integrating large model training and reasoning, which comprises:
An AI chip, a storage unit and an AI platform control unit;
The AI chip is used for carrying out task scheduling on model training tasks initiated by a plurality of users, allocating computing resources to the scheduled model training tasks, executing the scheduled model training tasks based on a hardware acceleration module, and executing model reasoning requests initiated by the users;
the storage unit comprises a memory for storing model parameters and a solid state disk for storing training data sets;
the AI platform control unit is used for providing a model algorithm library, so that a user can select and combine different model modules from the model algorithm library to customize a personalized model and perform parameter configuration on the personalized model; the AI platform control unit is also used for receiving a model training request initiated by a user, decomposing the model training request into a plurality of model training tasks, adding corresponding safety protection logic for the model training tasks, inserting the model training tasks into a task queue, and adding corresponding safety protection logic for model reasoning requests initiated by the user.
According to the soft and hard all-in-one machine integrating large model training and reasoning provided by the invention, the AI platform control unit further comprises:
an automatic model tuning tool for searching for the optimal hyperparameter configuration of the current model by using a reinforcement learning algorithm;
The distributed collaborative processing framework is used for providing a message transfer interface or a parameter server to share computing resources among a plurality of soft and hard all-in-one machines, and scheduling model training tasks corresponding to model training requests initiated by users to the corresponding soft and hard all-in-one machines for distributed training;
And the model compression and acceleration module is used for providing a model compression function and a model acceleration function.
According to the soft and hard all-in-one machine integrating large model training and reasoning provided by the invention, the model algorithm library further comprises:
the online learning interface is used for automatically updating the model module in the model algorithm library according to the latest data and/or model optimization trend;
And the custom algorithm interface is used for receiving a custom model module provided by a user and deploying and operating the custom model module based on container technology.
The invention also provides a large model training method based on the integrated large model training and reasoning soft and hard all-in-one machine, which comprises the following steps:
After receiving a model training request initiated by a user, determining the resource requirement of the model training request;
Determining a training mode of the model training request based on the resource requirement of the model training request and the available resources of the current soft and hard all-in-one machine; wherein the training mode is a single training mode or a distributed training mode;
Dividing the model training request into a plurality of model training tasks based on the training mode of the model training request, and inserting the model training tasks into task queues of corresponding soft and hard all-in-one machines based on the training mode of the model training request;
Determining a currently scheduled model training task in a current task queue based on a time sequence relation of each model training task inserted into the task queue in the current task queue of the current soft and hard integrated machine, a model corresponding to each model training task and available resources of the current soft and hard integrated machine; the current scheduled model training tasks are one or more;
and distributing computing resources for the currently scheduled model training task, and executing the currently scheduled model training task.
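The choice between the single training mode and the distributed training mode in the second step above can be sketched as follows. This is a minimal illustration only: the source does not state the decision rule, so the comparison used here (the request fits within the local machine's free resources) is an assumption, and the names `Resources` and `choose_training_mode` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    """Hypothetical resource description (the source names CPU cores and memory size)."""
    cpu_cores: int
    memory_gb: float

    def fits_within(self, other: "Resources") -> bool:
        # True when every requested quantity is covered by `other`.
        return self.cpu_cores <= other.cpu_cores and self.memory_gb <= other.memory_gb


def choose_training_mode(required: Resources, available: Resources) -> str:
    """Assumed rule: train locally if the request fits, otherwise train distributed."""
    return "single" if required.fits_within(available) else "distributed"


local_machine = Resources(cpu_cores=64, memory_gb=512.0)
print(choose_training_mode(Resources(32, 256.0), local_machine))
print(choose_training_mode(Resources(128, 256.0), local_machine))
```

A real system would likely compare more resource types (for example accelerator memory), but the shape of the decision stays the same.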
According to the large model training method provided by the invention, the resource requirement of the model training request is determined, and the method specifically comprises the following steps:
decomposing the training process of the model corresponding to the model training request into a plurality of training steps;
Combining the training description information corresponding to the plurality of training steps into training description time sequence data, based on the time order of the training steps of the model corresponding to the model training request; the training description information corresponding to any training step comprises the model structure, hyperparameters, tuning target parameters and sample data volume involved in that training step;
And inputting the training description time sequence data into a pre-trained training resource prediction model to obtain the resource requirement of the model training request output by the training resource prediction model.
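To make the notion of training description time sequence data concrete, the sketch below orders per-step descriptions by time and turns each into a numeric feature vector. The numeric encoding is an assumption made for illustration: the source lists the four kinds of information per step but not how they are represented, and `StepDescription` and `to_time_series` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StepDescription:
    # Hypothetical numeric stand-ins for the four kinds of information per step.
    step_index: int               # position of the step in the training process
    num_parameters: float         # stands in for the model structure
    learning_rate: float          # stands in for the hyperparameters
    num_tuned_parameters: float   # stands in for the tuning target parameters
    sample_count: float           # sample data volume


def to_time_series(steps: List[StepDescription]) -> List[List[float]]:
    """Sort the step descriptions by time and emit one feature vector per step."""
    ordered = sorted(steps, key=lambda s: s.step_index)
    return [
        [s.num_parameters, s.learning_rate, s.num_tuned_parameters, s.sample_count]
        for s in ordered
    ]


series = to_time_series([
    StepDescription(1, 2.0, 0.1, 1.0, 100.0),
    StepDescription(0, 1.0, 0.01, 1.0, 50.0),
])
print(series)
```

In practice the features would normally be scaled before being fed to a prediction model; that step is omitted here.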
According to the large model training method provided by the invention, the training resource prediction model is constructed based on a recurrent neural network and a fully connected layer; the input of the fully connected layer is connected to the output of the hidden layer of the recurrent neural network.
According to the large model training method provided by the invention, the training description time sequence data is input into a training resource prediction model trained in advance to obtain the resource requirement of the model training request output by the training resource prediction model, and the method specifically comprises the following steps:
Sequentially inputting, by time step, the training description information corresponding to the training steps in the training description time sequence data into the input layer of the recurrent neural network in the training resource prediction model, to obtain the embedded vector output at each time step by the hidden layer of the recurrent neural network;
And determining, based on the fully connected layer in the training resource prediction model, a global resource prediction result from the embedded vectors output by the hidden layer at each time step, as the resource requirement of the model training request.
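The forward pass described above, a recurrent (cyclic) neural network whose per-step hidden vectors feed a fully connected layer, can be sketched in plain Python. Several details are assumptions because the source does not give them: the recurrence is a simple Elman-style cell, the hidden vectors are mean-pooled before the fully connected layer, the weights are random and untrained, and all function names are hypothetical.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[Vector]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def rnn_hidden_states(series: List[Vector], w_in: Matrix, w_rec: Matrix) -> List[Vector]:
    """Elman-style recurrence: h_t = tanh(W_in x_t + W_rec h_{t-1}), with h_0 = 0."""
    hidden = [0.0] * len(w_rec)
    states = []
    for x in series:
        pre = [a + b for a, b in zip(matvec(w_in, x), matvec(w_rec, hidden))]
        hidden = [math.tanh(p) for p in pre]
        states.append(hidden)
    return states


def global_prediction(states: List[Vector], w_fc: Matrix, b_fc: Vector) -> Vector:
    """Mean-pool the per-step hidden vectors (an assumption), then apply the FC layer."""
    steps = len(states)
    pooled = [sum(h[i] for h in states) / steps for i in range(len(states[0]))]
    return [y + b for y, b in zip(matvec(w_fc, pooled), b_fc)]


rng = random.Random(0)
n_in, n_hidden, n_out = 4, 3, 2  # n_out: e.g. one value each for CPU cores and memory
w_in = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
w_rec = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_hidden)]
w_fc = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_out)]
b_fc = [0.0] * n_out

series = [[0.1, 0.2, 0.3, 0.4], [0.2, 0.1, 0.0, 0.5], [0.3, 0.3, 0.1, 0.2]]
states = rnn_hidden_states(series, w_in, w_rec)
requirement = global_prediction(states, w_fc, b_fc)
print(len(states), len(requirement))
```

Because the weights are untrained, the numbers produced here carry no meaning; the sketch only shows how the data flows from per-step inputs to a single global prediction.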
According to the large model training method provided by the invention, the training resource prediction model is obtained based on training of the following steps:
Decomposing a training process of a sample model into a plurality of sample training steps, and combining training description information corresponding to the plurality of sample training steps into sample training description time sequence data based on time sequence of the plurality of sample training steps of the sample model;
Sequentially inputting, by time step, the training description information corresponding to the sample training steps in the sample training description time sequence data into the input layer of the recurrent neural network in the training resource prediction model, to obtain the single-step resource prediction result output at each time step by the output layer of the recurrent neural network and the global resource prediction result output by the fully connected layer in the training resource prediction model;
And adjusting model parameters of the training resource prediction model based on the single-step resource prediction result and the global resource prediction result output by each time step, the actual occupied resources of each sample training step and the global occupied resources of the whole training process of the sample model.
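One plausible way to combine the two supervision signals named above is a weighted sum of a per-step error and a global error. The source does not specify the loss, so the mean squared error and the `step_weight` parameter below are assumptions, and the function names are hypothetical.

```python
from typing import Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equally long vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def combined_loss(
    step_preds: Sequence[Sequence[float]],
    step_actuals: Sequence[Sequence[float]],
    global_pred: Sequence[float],
    global_actual: Sequence[float],
    step_weight: float = 0.5,
) -> float:
    """Weighted sum of the average single-step error and the global error."""
    step_loss = sum(mse(p, a) for p, a in zip(step_preds, step_actuals)) / len(step_preds)
    return step_weight * step_loss + (1.0 - step_weight) * mse(global_pred, global_actual)


# One step predicted 1.0 against an actual 3.0; the global prediction is exact.
print(combined_loss([[1.0]], [[3.0]], [2.0], [2.0]))
```

Minimizing such a loss with gradient descent would adjust the recurrent layer through both terms and the fully connected layer through the global term.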
According to the large model training method provided by the invention, the model training request is divided into a plurality of model training tasks based on the training mode of the model training request, and the large model training method specifically comprises the following steps:
if the training mode of the model training request is a single training mode, dividing the model training request into a plurality of model training tasks based on a data parallel mechanism or a model parallel mechanism;
If the training mode of the model training request is a distributed training mode, dividing the model training request into a plurality of model training tasks based on a pipeline parallel mechanism.
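The two partitioning branches above can be illustrated with a small sketch. It shows data parallelism for the single training mode (each task sees all layers and a shard of the samples) and pipeline parallelism for the distributed training mode (each task holds a contiguous stage of layers). The task representation and the names `split_contiguous` and `partition_request` are hypothetical, and model parallelism is left out for brevity.

```python
from typing import Dict, List, Sequence, TypeVar

T = TypeVar("T")


def split_contiguous(items: Sequence[T], parts: int) -> List[List[T]]:
    """Split items into `parts` contiguous chunks whose sizes differ by at most one."""
    if parts <= 0:
        raise ValueError("parts must be positive")
    base, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        chunks.append(list(items[start:start + size]))
        start += size
    return chunks


def partition_request(
    mode: str, layers: Sequence[str], sample_ids: Sequence[int], workers: int
) -> List[Dict[str, list]]:
    if mode == "single":
        # Data parallelism: every task has the full model and a shard of the samples.
        return [{"layers": list(layers), "samples": shard}
                for shard in split_contiguous(sample_ids, workers)]
    if mode == "distributed":
        # Pipeline parallelism: every task holds one stage; all samples pass through it.
        return [{"layers": stage, "samples": list(sample_ids)}
                for stage in split_contiguous(layers, workers)]
    raise ValueError(f"unknown training mode: {mode}")


for task in partition_request("distributed", ["embed", "block1", "head"], list(range(5)), 2):
    print(task)
```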
The invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a large model training method as any one of the above.
The invention also provides a computer program product comprising a computer program which when executed by a processor implements a large model training method as any one of the above.
According to the soft and hard all-in-one machine integrating large model training and reasoning and the large model training method provided by the invention, the AI chip, the storage unit and the AI platform control unit are integrated, which provides strong local computing capacity, supports efficient training of large models, reduces dependence on external cloud resources and reduces data transmission delay. Through the AI platform control unit, a user can select and combine different model modules from a model algorithm library to customize a personalized model and configure its parameters. The AI platform control unit receives a model training request initiated by the user, decomposes it into a plurality of model training tasks, adds corresponding security protection logic to the model training tasks and then inserts them into a task queue, and adds corresponding security protection logic to model reasoning requests initiated by the user. On the one hand, decomposing the model training request and using the task scheduling and resource allocation capabilities of the AI chip improves the parallelism of task execution on the local soft and hard all-in-one machine and/or other networked soft and hard all-in-one machines; on the other hand, adding the security protection logic provides comprehensive security and privacy protection for sensitive data throughout model training and reasoning.
Drawings
In order to illustrate the technical solutions of the invention or of the prior art more clearly, the drawings used in the description of the embodiments or of the prior art are briefly introduced below. The drawings described below show some embodiments of the invention, and a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a schematic diagram of a soft and hard all-in-one machine integrating large model training and reasoning provided by the invention;
FIG. 2 is a flow chart of the large model training method provided by the invention;
FIG. 3 is a flow chart of a resource demand determination method provided by the present invention;
FIG. 4 is a schematic structural diagram of a training resource prediction model provided by the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
FIG. 1 is a schematic structural diagram of a soft and hard all-in-one machine integrating large model training and reasoning provided by the invention. As shown in FIG. 1, the soft and hard all-in-one machine includes:
AI chip 110, storage unit 120, and AI platform control unit 130;
The AI chip 110 is configured to perform task scheduling on model training tasks initiated by a plurality of users, allocate computing resources for the scheduled model training tasks, execute the scheduled model training tasks based on a hardware acceleration module, and execute a model reasoning request initiated by the users;
The storage unit 120 includes a memory for storing model parameters and a solid state disk for storing training data sets;
The AI platform control unit 130 is configured to provide a model algorithm library, so that a user can select and combine different model modules from the model algorithm library to customize a personalized model and perform parameter configuration on the personalized model; the AI platform control unit is also used for receiving a model training request initiated by a user, decomposing the model training request into a plurality of model training tasks, adding corresponding safety protection logic for the model training tasks, inserting the model training tasks into a task queue, and adding corresponding safety protection logic for model reasoning requests initiated by the user.
Specifically, the AI chip 110 may employ a high-performance AI processor to provide the computing capability needed to support multi-modal data processing and multi-agent collaboration mechanisms. Efficient configuration and performance optimization of the AI chip 110 may be achieved through programmable computing cores, dynamic resource management, multi-task parallel processing and adaptive hardware acceleration.
For programmable computing cores, a field programmable gate array (FPGA) or application specific integrated circuit (ASIC) may be employed as the core of the AI chip 110, which may be programmed and reconfigured according to user requirements to accommodate different AI models and algorithms; through hardware description languages such as VHDL or Verilog, a user can customize specific computational logic to optimize the processing flow of a specific AI task.
For dynamic resource management, dynamic resource management firmware can be adopted to intelligently allocate computing resources on the AI chip 110, such as the number of cores and memory bandwidth, according to the resource requirements of the currently scheduled model training task or of the model reasoning request. For model training tasks, the resource requirements can be predicted by a machine learning algorithm and the resource allocation strategy adjusted automatically. For a model reasoning request, the resource requirement can be evaluated by running the model reasoning process of the corresponding model on the computing node.
For multi-task parallel processing, the AI chip architecture allows model training tasks or model reasoning requests of a plurality of AI models to be executed simultaneously under a task scheduling algorithm, which improves processing efficiency.
For adaptive hardware acceleration, hardware acceleration technology can be used to dynamically adjust the operation mode of the AI chip 110 according to the characteristics of the model and of the data, for example by adaptively adjusting the calculation precision of an activation function in deep learning. Real-time performance monitoring and a feedback mechanism optimize the calculation flow, reduce unnecessary calculation and improve the energy efficiency ratio.
The storage unit 120 is configured with a high-speed memory to store model parameters, and a high-capacity solid state disk to store training data sets, so as to ensure quick reading, writing and processing of data.
The AI platform control unit 130 may provide a model algorithm library to the user. The model algorithm library takes a modular approach and provides algorithms (including model training algorithms and model reasoning algorithms) for various model modules, such as feature extractors, classifiers and optimizers. A user can select, combine and customize the model modules in the model algorithm library graphically to obtain a personalized model, and configure the parameters of that model, so that models can be customized and optimized quickly for different application scenarios and user requirements. The personalized model may be, among others, a text processing model, an image processing model, an audio processing model or a multimodal processing model. Multimodal models are integrated in the model algorithm library so that data from different domains such as text, images and audio can be processed and fused, which gives users a richer set of application scenarios. A cross-domain transfer learning framework is also implemented, which allows a model trained in one domain to be transferred to another domain and reduces the cost of retraining.
In addition, the model algorithm library comprises an online learning interface for automatically updating the model modules in the model algorithm library according to the latest data and/or model optimization trends, so that the library evolves automatically and the timeliness and accuracy of its models are maintained. By integrating an online learning algorithm, a model can be updated incrementally when new data is received instead of being retrained from scratch. An adaptive updating strategy automatically adjusts the learning rate and update frequency according to the trend and distribution of the data, so that the model captures the latest trends in the data. A version control system records each update of a model so that a previous version can be restored when needed. The model algorithm library also provides a custom algorithm interface: through an open API and a plug-in framework, it receives custom model modules that users develop against a unified interface specification, and deploys and runs them using container technology such as Docker, which keeps the deployment and operation of custom model modules compatible with the existing system and easy to manage.
In addition, the AI platform control unit is further used for receiving a model training request initiated by a user, decomposing the model training request into a plurality of model training tasks, adding corresponding safety protection logic for the model training tasks, and then inserting the model training tasks into task queues corresponding to the model training tasks. The AI platform control unit also adds corresponding security protection logic for the model reasoning request initiated by the user, and inserts the model reasoning request into a task queue corresponding to the model reasoning request. The security protection logic can be provided by a data security component and selected by a user, and the data security component can provide differential privacy technology, hardware-level security isolation, data encryption and key management, a data leakage protection system and security audit and compliance inspection, so that the overall security of data is ensured.
(1) Differential privacy is introduced to protect sensitive data used during model training and prevent leakage of personal information. A differential privacy algorithm is applied in the data preprocessing and model training stages so that the data output during model training does not reveal characteristics of the original data. A differential privacy library provides tools such as data perturbation, gradient updating and model evaluation, and supports differential privacy strategies with different complexity and precision requirements. Users are given configurable differential privacy parameters so that the balance between privacy and accuracy can be adjusted according to data sensitivity and compliance requirements.
(2) Hardware-level security isolation is implemented: an independent security domain is allocated to each model training task to prevent data cross-contamination. Hardware virtualization is used to create a separate security domain for each model training task, and an isolated model training instance runs in each security domain, which isolates data, memory and computing resources between tasks. Hardware-level access control ensures that only authenticated and authorized tasks can access the security domains assigned to them.
(3) End-to-end encryption is applied to stored and transmitted data, using industry-standard encryption algorithms to protect data throughout its life cycle. An integrated key management system automatically generates, distributes, rotates and destroys keys, ensuring that encryption measures remain effective and compliant. A visual configuration interface for data encryption lets non-technical users manage the data encryption policy easily.
(4) A data leakage protection (DLP) system is deployed to monitor and control the flow of sensitive data and prevent it from leaking through unauthorized channels. Content recognition techniques, including pattern matching and machine learning algorithms, identify and classify sensitive data regardless of its format or location. Integrated DLP policies tightly control access to, use of and transmission of sensitive data, including but not limited to email monitoring, cloud storage protection and terminal data protection.
(5) A security audit tool is integrated to record all data access and processing activities, including user operations, API calls and system events, to support post-hoc auditing and compliance checking. Role-based access control and the principle of least privilege ensure that users can access only the data and resources their tasks require. Data protection regulations are followed so that data processing activities meet legal requirements and compliance risk is reduced.
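As an illustration of the gradient perturbation mentioned in item (1), the sketch below clips each per-sample gradient and adds Gaussian noise before averaging, in the style of differentially private stochastic gradient descent. The patent does not name a specific algorithm, so this choice is an assumption; the function names and the `max_norm` and `noise_multiplier` parameters are hypothetical, and the sketch does not compute the resulting privacy budget.

```python
import math
import random
from typing import List


def clip_gradient(grad: List[float], max_norm: float) -> List[float]:
    """Scale the gradient down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def private_mean_gradient(
    per_sample_grads: List[List[float]],
    max_norm: float,
    noise_multiplier: float,
    rng: random.Random,
) -> List[float]:
    """Clip each sample's gradient, sum, add Gaussian noise, then average."""
    clipped = [clip_gradient(g, max_norm) for g in per_sample_grads]
    count = len(clipped)
    dim = len(clipped[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    sigma = noise_multiplier * max_norm  # noise scale tied to the clipping bound
    return [(s + rng.gauss(0.0, sigma)) / count for s in summed]


grads = [[3.0, 4.0], [0.1, -0.2], [1.0, 1.0]]
noisy = private_mean_gradient(grads, max_norm=1.0, noise_multiplier=1.1, rng=random.Random(0))
print(noisy)
```

A larger `noise_multiplier` gives stronger privacy at the cost of accuracy, which is the kind of configurable trade-off item (1) describes.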
In other embodiments, the AI platform control unit 130 further integrates an automatic model tuning tool, a distributed collaborative processing framework, and a model compression and acceleration module to improve model training and reasoning efficiency. The automatic model tuning tool uses a reinforcement learning algorithm to automate hyperparameter tuning: through interaction with the environment, an agent learns how to select optimal hyperparameters such as the learning rate, batch size and regularization parameters. The distributed collaborative processing framework allows a plurality of soft and hard all-in-one machines to share computing resources and accelerates large-scale data processing. Based on a message passing interface or a parameter server architecture, it lets the machines work cooperatively to complete training tasks on large-scale data together. It also introduces a fault tolerance mechanism so that training tasks can continue when a single point of failure occurs, which improves the robustness of the system, and integrates a resource scheduler that automatically manages cluster resources and dynamically allocates computing nodes according to task requirements to optimize resource utilization. The model compression and acceleration module integrates model compression algorithms such as knowledge distillation, network pruning and quantization to reduce model size and improve reasoning speed while maintaining high model performance, and provides a model acceleration tool set, including a model optimizer and hardware acceleration plug-ins, to meet the performance requirements of different hardware platforms.
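The idea of an agent learning which hyperparameters to pick can be sketched with an epsilon-greedy multi-armed bandit, one of the simplest reinforcement learning formulations. The source does not say which reinforcement learning algorithm the tuning tool uses, so this is a stand-in; `tune`, `toy_reward` and the candidate configurations are hypothetical, and `toy_reward` replaces what would really be a validation score from a training run.

```python
import random
from typing import Callable, List, Tuple

Config = Tuple[float, int]  # (learning rate, batch size)


def tune(
    configs: List[Config],
    evaluate: Callable[[Config], float],
    rounds: int,
    epsilon: float,
    rng: random.Random,
) -> Config:
    """Epsilon-greedy search: try each config once, then mostly exploit the best one."""
    counts = {c: 0 for c in configs}
    values = {c: 0.0 for c in configs}  # running mean reward per config
    for _ in range(rounds):
        untried = [c for c in configs if counts[c] == 0]
        if untried:
            choice = untried[0]
        elif rng.random() < epsilon:
            choice = rng.choice(configs)                       # explore
        else:
            choice = max(configs, key=lambda c: values[c])     # exploit
        reward = evaluate(choice)
        counts[choice] += 1
        values[choice] += (reward - values[choice]) / counts[choice]
    return max(configs, key=lambda c: values[c])


def toy_reward(config: Config) -> float:
    """Stand-in for a validation score; peaks at learning rate 0.01, batch size 64."""
    lr, batch = config
    return -abs(lr - 0.01) * 100 - abs(batch - 64) / 64


candidates = [(0.1, 32), (0.01, 64), (0.001, 128)]
best = tune(candidates, toy_reward, rounds=20, epsilon=0.2, rng=random.Random(0))
print(best)
```

With noisy real-world rewards, the exploration rate and the number of rounds control how reliably the best configuration is found.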
In summary, the soft and hard all-in-one machine provided by the embodiment of the invention integrates the AI chip, the storage unit and the AI platform control unit, which provides strong local computing capability, supports efficient training of large models, reduces dependence on external cloud resources and reduces data transmission delay. Through the AI platform control unit, a user can select and combine different model modules from a model algorithm library to customize a personalized model and configure its parameters. The AI platform control unit receives a model training request initiated by the user, decomposes it into a plurality of model training tasks, adds corresponding security protection logic to the model training tasks and then inserts them into a task queue, and adds corresponding security protection logic to model reasoning requests initiated by the user. On the one hand, decomposing the model training request makes efficient use of the computing resources of the local soft and hard all-in-one machine and/or other networked soft and hard all-in-one machines, which improves the efficiency of model training; on the other hand, adding the security protection logic provides comprehensive protection for sensitive data and keeps model training and reasoning secure.
Based on any of the above embodiments, Fig. 2 is a schematic flow chart of a large model training method provided by the present invention. The method is built on the soft and hard all-in-one machine described above and, as shown in Fig. 2, includes:
Step 210, after receiving a model training request initiated by a user, determining a resource requirement of the model training request;
Step 220, determining a training mode of the model training request based on the resource requirement of the model training request and the available resources of the current soft and hard all-in-one machine; wherein the training mode is a single training mode or a distributed training mode;
Step 230, dividing the model training request into a plurality of model training tasks based on the training mode of the model training request, and inserting the model training tasks into the task queues of the corresponding soft and hard all-in-one machines based on the training mode of the model training request;
Step 240, determining the currently scheduled model training task in the current task queue of the current soft and hard all-in-one machine based on the order in which the model training tasks were inserted into that task queue, the model corresponding to each model training task, and the available resources of the current soft and hard all-in-one machine; wherein there are one or more currently scheduled model training tasks;
Step 250, allocating computing resources to the currently scheduled model training task, and executing the currently scheduled model training task.
Specifically, after the AI platform control unit receives the model training request initiated by the user, the AI platform control unit may determine the resource requirement of the model training request, including the number of CPU cores, the memory size, and the like. In some embodiments, as shown in fig. 3, the resource requirements of the model training request may be determined based on the following steps:
Step 310, decomposing the training process of the model corresponding to the model training request into a plurality of training steps;
Step 320, combining the training description information corresponding to the plurality of training steps into training description time sequence data based on the time order of the plurality of training steps of the model corresponding to the model training request; wherein the training description information corresponding to any training step comprises the model structure, hyperparameters, tuning target parameters and sample data volume involved in that training step;
Step 330, inputting the training description time sequence data into a pre-trained training resource prediction model to obtain the resource requirement of the model training request output by the training resource prediction model.
Specifically, the training process of the model corresponding to the model training request may be decomposed into a plurality of training steps, for example a forward propagation step (forward), a backward propagation step (backward), and a parameter update step (update). Based on the time order of these training steps (for example, the forward propagation step precedes the backward propagation step, and the backward propagation step precedes the parameter update step), the training description information corresponding to the training steps is combined into training description time sequence data, for example <training description information of the forward propagation step, training description information of the backward propagation step, training description information of the parameter update step>. The training description information corresponding to any training step comprises the model structure, hyperparameters, tuning target parameters (such as the number of iterations and the accuracy) and sample data volume involved in that training step. The training description time sequence data is then input into a pre-trained training resource prediction model to obtain the resource requirement of the model training request output by the training resource prediction model.
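The assembly of the training description time sequence data can be sketched as follows. This is a minimal illustration, not the implementation of the invention: the class `StepDescription`, the functions `build_description_sequence` and `encode_step`, the field names, and the numeric encoding are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List

# Time order named in the text: forward, then backward, then parameter update.
STEP_ORDER = ["forward", "backward", "update"]


@dataclass
class StepDescription:
    """Training description information for one training step (illustrative fields)."""
    step: str                          # "forward", "backward" or "update"
    model_structure: List[int]         # e.g. widths of the layers involved in the step
    hyperparameters: Dict[str, float]  # e.g. learning rate, batch size
    tuning_targets: Dict[str, float]   # e.g. iteration count, target accuracy
    sample_count: int                  # sample data volume


def build_description_sequence(steps: List[StepDescription]) -> List[StepDescription]:
    """Order the per-step descriptions by the time order of the training steps."""
    return sorted(steps, key=lambda s: STEP_ORDER.index(s.step))


def encode_step(desc: StepDescription) -> List[float]:
    """Flatten one description into a numeric vector (a hypothetical encoding)."""
    one_hot = [1.0 if desc.step == name else 0.0 for name in STEP_ORDER]
    return one_hot + [
        float(len(desc.model_structure)),
        float(sum(desc.model_structure)),
        desc.hyperparameters.get("learning_rate", 0.0),
        desc.hyperparameters.get("batch_size", 0.0),
        desc.tuning_targets.get("iterations", 0.0),
        float(desc.sample_count),
    ]
```

Each encoded vector would then serve as the input for one time step of the training resource prediction model.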
In some embodiments, as shown in Fig. 4, the training resource prediction model is built based on a recurrent neural network and a fully connected layer. The input of the fully connected layer is connected to the output of the hidden layer of the recurrent neural network, and the output of the fully connected layer is the output of the whole training resource prediction model. It should be noted that, during inference with the training resource prediction model, the output layer of the recurrent neural network may be removed; that is, the output layer of the recurrent neural network is used only while the training resource prediction model is being trained.
In other embodiments, when training the training resource prediction model, the training process of a sample model is decomposed into a plurality of sample training steps in a manner similar to the above, and the training description information corresponding to the sample training steps is combined into sample training description time sequence data based on the time order of the sample training steps. The training description information for each sample training step is then input, one time step at a time, to the input layer of the recurrent neural network in the training resource prediction model. This yields a single-step resource prediction result output by the output layer of the recurrent neural network at each time step and a global resource prediction result output by the fully connected layer. A model loss is calculated from the single-step resource prediction result at each time step, the global resource prediction result, the resources actually occupied by each sample training step, and the global resources occupied by the whole training process of the sample model, and the model parameters of the training resource prediction model are adjusted based on the model loss.
Here, the model loss combines two differences: the difference between the single-step resource prediction result output by the output layer of the recurrent neural network at each time step and the resources actually occupied by the corresponding sample training step, and the difference between the global resource prediction result and the global resources occupied by the whole training process of the sample model. Adjusting the parameters of the hidden layer against the single-step term improves the accuracy of the output at each time step and therefore the quality of the hidden layer, which in turn helps the fully connected layer of the training resource prediction model output more accurate prediction results.
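A combined loss of this kind can be sketched as below. The text does not specify the loss function or how the two terms are weighted, so the use of mean squared error and the parameters `step_weight` and `global_weight` are assumptions made for illustration only.

```python
from typing import List


def mse(pred: List[float], target: List[float]) -> float:
    """Mean squared error between two equally long vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def resource_prediction_loss(
    step_preds: List[List[float]],    # single-step predictions, one vector per time step
    step_actuals: List[List[float]],  # resources actually occupied by each sample step
    global_pred: List[float],         # prediction of the fully connected layer
    global_actual: List[float],       # resources occupied by the whole training process
    step_weight: float = 1.0,         # assumed weighting, not given in the text
    global_weight: float = 1.0,       # assumed weighting, not given in the text
) -> float:
    """Sum of the averaged single-step error and the global error."""
    step_loss = sum(mse(p, a) for p, a in zip(step_preds, step_actuals)) / len(step_preds)
    global_loss = mse(global_pred, global_actual)
    return step_weight * step_loss + global_weight * global_loss
```

Each resource vector could hold, for example, a CPU core count and a memory size, provided the predictions and the measurements use the same units.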
After the training resource prediction model has been trained, in its inference stage the training description information corresponding to the plurality of training steps in the training description time sequence data can be input, one time step at a time, to the input layer of the recurrent neural network in the training resource prediction model to obtain the embedded vector output by the hidden layer of the recurrent neural network at each time step. The fully connected layer in the training resource prediction model then determines a global resource prediction result from the embedded vectors output by the hidden layer at each time step, and this result is used as the resource requirement of the model training request.
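The inference path can be sketched in plain Python as follows. The text does not fix the recurrent cell type or how the per-step embedded vectors are combined, so the Elman-style tanh cell, the concatenation of the hidden vectors (which assumes a fixed number of training steps), and all function and parameter names are assumptions made for illustration.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def rnn_hidden_states(inputs: List[Vector], w_in: Matrix, w_rec: Matrix,
                      bias: Vector) -> List[Vector]:
    """Run an Elman-style recurrent layer and return the hidden vector of every time step."""
    hidden = [0.0] * len(bias)
    states = []
    for x in inputs:
        pre = [a + b + c
               for a, b, c in zip(matvec(w_in, x), matvec(w_rec, hidden), bias)]
        hidden = [math.tanh(p) for p in pre]
        states.append(hidden)
    return states


def predict_global_resources(states: List[Vector], w_fc: Matrix,
                             b_fc: Vector) -> Vector:
    """Fully connected layer applied to the concatenated per-step hidden vectors."""
    features = [h for state in states for h in state]
    return [y + b for y, b in zip(matvec(w_fc, features), b_fc)]
```

The output layer of the recurrent network does not appear here, which matches the statement that it is only needed during training.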
Then, the AI platform control unit determines the training mode of the model training request based on the resource requirement of the model training request and the available resources of the current soft and hard all-in-one machine on which the AI platform control unit is located. The training mode may be a single training mode (the whole training process is completed on one soft and hard all-in-one machine) or a distributed training mode (the training process is completed jointly on different soft and hard all-in-one machines). If the available resources of the current soft and hard all-in-one machine are insufficient to meet the resource requirement of the model training request, or the ratio of the resource requirement to the available resources exceeds a preset threshold, the training mode is determined to be the distributed training mode, so as to avoid an excessively high resource occupancy rate on the current soft and hard all-in-one machine; otherwise, the training mode may be determined to be the single training mode.
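The mode decision described above can be sketched as a small function. The function name `choose_training_mode`, the resource keys, and the default threshold of 0.8 are hypothetical; the text only says that the threshold is preset.

```python
from typing import Dict


def choose_training_mode(demand: Dict[str, float],
                         available: Dict[str, float],
                         ratio_threshold: float = 0.8) -> str:
    """Return "distributed" if any resource is short or too heavily used, else "single"."""
    for resource, needed in demand.items():
        free = available.get(resource, 0.0)
        if needed > free:
            # The current machine cannot satisfy the requirement at all.
            return "distributed"
        if free > 0 and needed / free > ratio_threshold:
            # The request fits, but would occupy too large a share of the machine.
            return "distributed"
    return "single"
```

For example, a request for 120 GB of memory on a machine with 128 GB free fits, but its ratio of about 0.94 exceeds the assumed threshold, so the sketch selects the distributed training mode.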
The model training request may then be divided into a plurality of model training tasks based on its training mode. Choosing a task division scheme that suits the training mode increases the execution parallelism of the model training tasks and thus improves model training efficiency. Specifically, when the training mode is the single training mode, the whole training process of the corresponding model is completed on the current soft and hard all-in-one machine, so to improve parallelism the model training request can be divided into a plurality of model training tasks based on a data parallel mechanism or a model parallel mechanism. Under the data parallel mechanism, the sample data set is divided into a plurality of subsets, and each resulting model training task trains the corresponding model on one subset of the sample data set, so that each model training task can be executed by an independent thread, which improves the parallelism of task execution. Under the model parallel mechanism, the network layers of the corresponding model are divided into a plurality of substructures, and each resulting model training task trains one substructure of the model on the sample data set. When the training mode is the distributed training mode, the model training request may be divided into a plurality of model training tasks based on a pipeline parallel mechanism.
Under the pipeline parallel mechanism, the network layers of the corresponding model are divided into a plurality of substructures and the sample data set is divided into a plurality of subsets, and each resulting model training task trains one substructure of the model on one subset of the sample data set. This reduces the waiting time between model training tasks deployed on different soft and hard all-in-one machines and improves model training efficiency.
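The three division schemes can be sketched as follows. The task representation (a dictionary with `layers` and `samples`), the even contiguous split, and the function names are assumptions made for illustration; the text does not prescribe how subsets or substructures are chosen.

```python
from typing import List, Sequence


def split_evenly(items: Sequence, parts: int) -> List[list]:
    """Split items into `parts` contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        chunks.append(list(items[start:start + size]))
        start += size
    return chunks


def data_parallel_tasks(layers: Sequence, samples: Sequence, workers: int) -> List[dict]:
    """Data parallel: every task trains the whole model on one data subset."""
    return [{"layers": list(layers), "samples": subset}
            for subset in split_evenly(samples, workers)]


def model_parallel_tasks(layers: Sequence, samples: Sequence, workers: int) -> List[dict]:
    """Model parallel: every task trains one substructure on the full data set."""
    return [{"layers": sub, "samples": list(samples)}
            for sub in split_evenly(layers, workers)]


def pipeline_parallel_tasks(layers: Sequence, samples: Sequence,
                            stages: int, micro_batches: int) -> List[dict]:
    """Pipeline parallel: every task trains one substructure on one data subset."""
    return [{"stage": s, "layers": sub, "samples": batch}
            for s, sub in enumerate(split_evenly(layers, stages))
            for batch in split_evenly(samples, micro_batches)]
```

With two stages and two data subsets, the pipeline split produces four tasks, so a stage on one machine can start on the next subset while another machine is still processing the previous one.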
After the model training request is divided into a plurality of model training tasks, the AI platform control unit inserts the model training tasks into the task queues of the corresponding soft and hard all-in-one machines based on the training mode of the model training request. Specifically, when the training mode is the single training mode, the model training tasks are all inserted into the task queue of the current soft and hard all-in-one machine; when the training mode is the distributed training mode, the model training tasks are inserted into the task queues of different soft and hard all-in-one machines.
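The queue insertion can be sketched as below. The text does not say how tasks are assigned to machines in the distributed training mode, so the round-robin assignment, the function name `enqueue_tasks`, and the machine identifiers are assumptions made for illustration.

```python
from collections import deque
from typing import Deque, Dict, List


def enqueue_tasks(tasks: List[str], mode: str,
                  queues: Dict[str, Deque[str]], current_machine: str) -> None:
    """Insert tasks into per-machine task queues according to the training mode."""
    if mode == "single":
        # Single training mode: everything stays on the current machine.
        queues[current_machine].extend(tasks)
        return
    # Distributed training mode: spread tasks over all machines (assumed round-robin).
    machines = list(queues)
    for i, task in enumerate(tasks):
        queues[machines[i % len(machines)]].append(task)
```

A real placement policy would probably also consider the available resources of each machine; the round-robin rule only keeps the sketch short.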
For the current soft and hard all-in-one machine, the task scheduling algorithm in the AI chip can determine the currently scheduled model training task in the current task queue based on the order in which the model training tasks were inserted into the current task queue, the model corresponding to each model training task, and the available resources of the current soft and hard all-in-one machine. It should be noted that there may be one or more currently scheduled model training tasks. Here, the earlier a model training task was inserted into the task queue and the smaller the parameter count of its corresponding model, the more likely the task is to be scheduled, and the total amount of resources occupied by the currently scheduled model training tasks does not exceed a preset proportion of the available resources of the current soft and hard all-in-one machine. In some embodiments, weights may be set for the time at which a model training task was inserted into the task queue and for the parameter count of the model corresponding to the task. The scheduling score of the task is then t×wt + p×pt, where t is the insertion time, wt is its weight, p is the parameter count of the corresponding model, and pt is its weight, and the currently scheduled model training task is determined according to the scheduling score of each model training task (the lower the scheduling score, the more likely the task is to be scheduled). The more resources the current soft and hard all-in-one machine has available, the smaller the weight of the parameter count.
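The scheduling rule can be sketched as follows. The score t×wt + p×pt and the resource cap come from the text; the linear rule that shrinks the parameter weight as more resources are free, the greedy selection in score order, the default `max_share` of 0.8, and all names are assumptions made for illustration. The sketch also assumes that t and p have been normalized to comparable scales.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QueuedTask:
    name: str
    enqueue_time: float  # t: time at which the task was inserted into the queue
    param_count: float   # p: parameter count of the task's model
    demand: float        # predicted resource demand, same unit as `available`


def scheduling_score(task: QueuedTask, time_weight: float, param_weight: float) -> float:
    """Score t*wt + p*pt; a lower score means the task is more likely to be scheduled."""
    return task.enqueue_time * time_weight + task.param_count * param_weight


def select_tasks(queue: List[QueuedTask], available: float, total: float,
                 max_share: float = 0.8, time_weight: float = 1.0,
                 base_param_weight: float = 1.0) -> List[QueuedTask]:
    """Pick tasks in score order while staying within the preset share of resources."""
    # Assumed rule: the parameter weight shrinks linearly as the free share grows.
    param_weight = base_param_weight * (1.0 - available / total)
    budget = available * max_share
    chosen, used = [], 0.0
    ranked = sorted(queue, key=lambda t: scheduling_score(t, time_weight, param_weight))
    for task in ranked:
        if used + task.demand <= budget:
            chosen.append(task)
            used += task.demand
    return chosen
```

When the machine is completely free, the parameter weight in this sketch becomes zero and the selection degenerates to first-in, first-out order; as the machine fills up, small models are increasingly favored.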
After that, the dynamic resource management firmware in the AI chip allocates computing resources for the currently scheduled model training task and performs the currently scheduled model training task. The resource demand of any model training task can be predicted by utilizing the training resource prediction model, so that computing resources are allocated to the model training task according to the resource demand of the model training task.
In summary, the large model training method provided by the embodiment of the invention determines the resource requirement of the model training request; determines the training mode of the model training request based on that resource requirement and the available resources of the current soft and hard all-in-one machine; divides the model training request into a plurality of model training tasks based on the training mode and inserts them into the task queues of the corresponding soft and hard all-in-one machines; determines the currently scheduled model training task based on the order in which the model training tasks were inserted into the current task queue of the current soft and hard all-in-one machine, the model corresponding to each model training task, and the available resources of the current soft and hard all-in-one machine; and allocates computing resources to the currently scheduled model training task and executes it. This improves the efficiency of large model training.
In one aspect, the present invention provides a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to execute the large model training method provided above, the method comprising: after receiving a model training request initiated by a user, determining the resource requirement of the model training request; determining a training mode of the model training request based on the resource requirement of the model training request and the available resources of the current soft and hard all-in-one machine, wherein the training mode is a single training mode or a distributed training mode; dividing the model training request into a plurality of model training tasks based on the training mode of the model training request, and inserting the model training tasks into the task queues of the corresponding soft and hard all-in-one machines based on the training mode of the model training request; determining the currently scheduled model training task in the current task queue based on the order in which the model training tasks were inserted into the current task queue of the current soft and hard all-in-one machine, the model corresponding to each model training task, and the available resources of the current soft and hard all-in-one machine, wherein there are one or more currently scheduled model training tasks; and allocating computing resources to the currently scheduled model training task, and executing the currently scheduled model training task.
In another aspect, the present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the large model training method provided above, the method comprising: after receiving a model training request initiated by a user, determining the resource requirement of the model training request; determining a training mode of the model training request based on the resource requirement of the model training request and the available resources of the current soft and hard all-in-one machine, wherein the training mode is a single training mode or a distributed training mode; dividing the model training request into a plurality of model training tasks based on the training mode of the model training request, and inserting the model training tasks into the task queues of the corresponding soft and hard all-in-one machines based on the training mode of the model training request; determining the currently scheduled model training task in the current task queue based on the order in which the model training tasks were inserted into the current task queue of the current soft and hard all-in-one machine, the model corresponding to each model training task, and the available resources of the current soft and hard all-in-one machine, wherein there are one or more currently scheduled model training tasks; and allocating computing resources to the currently scheduled model training task, and executing the currently scheduled model training task.
The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the invention without creative effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A soft and hard all-in-one machine integrating large model training and inference, characterized by comprising:
An AI chip, a storage unit and an AI platform control unit;
The AI chip is used for performing task scheduling on model training tasks initiated by a plurality of users, allocating computing resources to the scheduled model training tasks, executing the scheduled model training tasks based on the hardware acceleration module, and executing model inference requests initiated by the users;
the storage unit comprises a memory for storing model parameters and a solid state disk for storing training data sets;
the AI platform control unit is used for providing a model algorithm library, so that a user can select and combine different model modules from the model algorithm library to customize a personalized model and perform parameter configuration on the personalized model; the AI platform control unit is also used for receiving a model training request initiated by a user, decomposing the model training request into a plurality of model training tasks, adding corresponding security protection logic for the model training tasks, inserting the model training tasks into a task queue, and adding corresponding security protection logic for model inference requests initiated by the user.
2. The soft and hard all-in-one machine integrating large model training and inference of claim 1, wherein the AI platform control unit further comprises:
an automatic model tuning tool for searching for the optimal model hyperparameter configuration for the current model by using a reinforcement learning algorithm;
a distributed collaborative processing framework for providing a message passing interface or a parameter server to share computing resources among a plurality of soft and hard all-in-one machines, and for scheduling the model training tasks corresponding to model training requests initiated by users to the corresponding soft and hard all-in-one machines for distributed training;
And the model compression and acceleration module is used for providing a model compression function and a model acceleration function.
3. The soft and hard all-in-one machine integrating large model training and inference of claim 1, wherein the model algorithm library further comprises:
the online learning interface is used for automatically updating the model module in the model algorithm library according to the latest data and/or model optimization trend;
And the custom algorithm interface is used for receiving a custom model module provided by a user and deploying and operating the custom model module based on container technology.
4. A large model training method based on the soft and hard all-in-one machine integrating large model training and inference as claimed in any one of claims 1 to 3, comprising:
After receiving a model training request initiated by a user, determining the resource requirement of the model training request;
Determining a training mode of the model training request based on the resource requirement of the model training request and the available resources of the current soft and hard all-in-one machine; wherein the training mode is a single training mode or a distributed training mode;
Dividing the model training request into a plurality of model training tasks based on the training mode of the model training request, and inserting the model training tasks into task queues of corresponding soft and hard all-in-one machines based on the training mode of the model training request;
Determining the currently scheduled model training task in the current task queue based on the order in which the model training tasks were inserted into the current task queue of the current soft and hard all-in-one machine, the model corresponding to each model training task, and the available resources of the current soft and hard all-in-one machine; wherein there are one or more currently scheduled model training tasks;
and distributing computing resources for the currently scheduled model training task, and executing the currently scheduled model training task.
5. The large model training method according to claim 4, wherein said determining the resource requirement of the model training request specifically comprises:
decomposing the training process of the model corresponding to the model training request into a plurality of training steps;
Combining the training description information corresponding to the plurality of training steps into training description time sequence data based on the time order of the plurality of training steps of the model corresponding to the model training request; wherein the training description information corresponding to any training step comprises the model structure, hyperparameters, tuning target parameters and sample data volume involved in that training step;
And inputting the training description time sequence data into a pre-trained training resource prediction model to obtain the resource requirement of the model training request output by the training resource prediction model.
6. The large model training method of claim 5, wherein the training resource prediction model is constructed based on a recurrent neural network and a fully connected layer; the input of the fully connected layer is connected to the output of the hidden layer of the recurrent neural network.
7. The large model training method according to claim 6, wherein the inputting the training description time sequence data into a training resource prediction model trained in advance, to obtain the resource requirement of the model training request output by the training resource prediction model, specifically includes:
Sequentially inputting, time step by time step, the training description information corresponding to the training steps in the training description time sequence data to the input layer of the recurrent neural network in the training resource prediction model to obtain the embedded vector output by the hidden layer of the recurrent neural network in the training resource prediction model at each time step;
And determining a global resource prediction result as a resource requirement of the model training request according to the embedded vector output by the hidden layer at each time step based on the fully connected layer in the training resource prediction model.
8. The large model training method of claim 7, wherein the training resource prediction model is trained based on the steps of:
Decomposing a training process of a sample model into a plurality of sample training steps, and combining training description information corresponding to the plurality of sample training steps into sample training description time sequence data based on time sequence of the plurality of sample training steps of the sample model;
Sequentially inputting, time step by time step, the training description information corresponding to the sample training steps in the sample training description time sequence data to the input layer of the recurrent neural network in the training resource prediction model to obtain a single-step resource prediction result output by the output layer of the recurrent neural network in the training resource prediction model at each time step and a global resource prediction result output by the fully connected layer in the training resource prediction model;
And adjusting model parameters of the training resource prediction model based on the single-step resource prediction result and the global resource prediction result output by each time step, the actual occupied resources of each sample training step and the global occupied resources of the whole training process of the sample model.
9. The large model training method according to claim 4, wherein the dividing the model training request into a plurality of model training tasks based on the training mode of the model training request specifically comprises:
if the training mode of the model training request is a single training mode, dividing the model training request into a plurality of model training tasks based on a data parallel mechanism or a model parallel mechanism;
If the training mode of the model training request is a distributed training mode, dividing the model training request into a plurality of model training tasks based on a pipeline parallel mechanism.
10. A non-transitory computer readable storage medium, having stored thereon a computer program, which when executed by a processor implements the large model training method according to any of claims 4 to 9.
CN202410585108.5A 2024-05-13 2024-05-13 Soft and hard all-in-one machine integrating large model training and reasoning and large model training method Active CN118153649B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410585108.5A CN118153649B (en) 2024-05-13 2024-05-13 Soft and hard all-in-one machine integrating large model training and reasoning and large model training method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410585108.5A CN118153649B (en) 2024-05-13 2024-05-13 Soft and hard all-in-one machine integrating large model training and reasoning and large model training method

Publications (2)

Publication Number Publication Date
CN118153649A true CN118153649A (en) 2024-06-07
CN118153649B CN118153649B (en) 2024-07-26

Family

ID=91285504

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410585108.5A Active CN118153649B (en) 2024-05-13 2024-05-13 Soft and hard all-in-one machine integrating large model training and reasoning and large model training method

Country Status (1)

Country Link
CN (1) CN118153649B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114035937A (en) * 2021-10-15 2022-02-11 北京潞晨科技有限公司 Distributed training and reasoning method, system, equipment and readable storage medium based on artificial intelligence
CN114154641A (en) * 2020-09-07 2022-03-08 华为云计算技术有限公司 AI model training method and device, computing equipment and storage medium
WO2023020355A1 (en) * 2021-08-20 2023-02-23 华为云计算技术有限公司 Distributed training method for ai model and related device
CN116821043A (en) * 2023-01-30 2023-09-29 杭州指令集智能科技有限公司 Soft and hard integrated application extension device of Internet of things operating system and application thereof
CN117350342A (en) * 2023-11-15 2024-01-05 杭州电子科技大学 FPGA-based hardware acceleration method for UNET liver cancer image cutting part
CN117742959A (en) * 2023-12-20 2024-03-22 北京百度网讯科技有限公司 Training method and device based on clusters, electronic equipment and storage medium
CN117971502A (en) * 2024-03-29 2024-05-03 南京认知物联网研究院有限公司 Method and device for carrying out online optimization scheduling on AI reasoning cluster
CN117971475A (en) * 2024-01-31 2024-05-03 酷标物联科技江苏有限公司 Intelligent management method and system for GPU computing force pool

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114154641A (en) * 2020-09-07 2022-03-08 华为云计算技术有限公司 AI model training method and device, computing equipment and storage medium
WO2022048557A1 (en) * 2020-09-07 2022-03-10 华为云计算技术有限公司 Ai model training method and apparatus, and computing device and storage medium
WO2023020355A1 (en) * 2021-08-20 2023-02-23 华为云计算技术有限公司 Distributed training method for ai model and related device
CN114035937A (en) * 2021-10-15 2022-02-11 北京潞晨科技有限公司 Distributed training and reasoning method, system, equipment and readable storage medium based on artificial intelligence
CN116821043A (en) * 2023-01-30 2023-09-29 杭州指令集智能科技有限公司 Soft and hard integrated application extension device of Internet of things operating system and application thereof
CN117350342A (en) * 2023-11-15 2024-01-05 杭州电子科技大学 FPGA-based hardware acceleration method for UNET liver cancer image cutting part
CN117742959A (en) * 2023-12-20 2024-03-22 北京百度网讯科技有限公司 Training method and device based on clusters, electronic equipment and storage medium
CN117971475A (en) * 2024-01-31 2024-05-03 酷标物联科技江苏有限公司 Intelligent management method and system for GPU computing power pool
CN117971502A (en) * 2024-03-29 2024-05-03 南京认知物联网研究院有限公司 Method and device for carrying out online optimization scheduling on AI reasoning cluster

Also Published As

Publication number Publication date
CN118153649B (en) 2024-07-26

Similar Documents

Publication Publication Date Title
CN107888669B (en) Deep learning neural network-based large-scale resource scheduling system and method
Guo et al. Cloud resource scheduling with deep reinforcement learning and imitation learning
US11715033B2 (en) Dynamically scaled training fleets for machine learning
CN110168578A (en) Multitask neural network with task particular path
US11740941B2 (en) Method of accelerating execution of machine learning based application tasks in a computing device
CN109992404A (en) PC cluster resource regulating method, device, equipment and medium
CN113821332B (en) Method, device, equipment and medium for optimizing efficiency of automatic machine learning system
CN113435998B (en) Loan overdue prediction method and device, electronic equipment and storage medium
CN111768004A (en) Model self-adaption method and system based on intelligent computing framework
CN111538852B (en) Multimedia resource processing method, device, storage medium and equipment
CN116057518A (en) Automatic query predicate selective prediction using machine learning model
CN115543577A (en) Kubernetes resource scheduling optimization method based on covariates, storage medium and equipment
CN116047934A (en) Real-time simulation method and system for unmanned aerial vehicle cluster and electronic equipment
CN115080248A (en) Scheduling optimization method for scheduling device, and storage medium
CN117076077A (en) Planning and scheduling optimization method based on big data analysis
Shanbhag et al. Investigating the application of transfer learning techniques in cloud-based AI systems for improved performance and reduced training time
Yadwadkar Machine learning for automatic resource management in the datacenter and the cloud
Shahoud et al. A meta learning approach for automating model selection in big data environments using microservice and container virtualization technologies
RU2411574C2 (en) Intellectual grid-system for highly efficient data processing
US12001174B2 (en) Determination of task automation using an artificial intelligence model
US20210110287A1 (en) Causal Reasoning and Counterfactual Probabilistic Programming Framework Using Approximate Inference
CN118153649B (en) Soft and hard all-in-one machine integrating large model training and reasoning and large model training method
CN117235527A (en) End-to-end containerized big data model construction method, device, equipment and medium
CN114490094B (en) GPU (graphics processing Unit) video memory allocation method and system based on machine learning
US20230141408A1 (en) Utilizing machine learning and natural language generation models to generate a digitized dynamic client solution

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP03 Change of name, title or address

Address after: Room 1003, 10th Floor, Building 1, No. 10 Kegu 1st Street, Beijing Economic and Technological Development Zone, Daxing District, Beijing 101111

Patentee after: Beijing Shengshi Tianan Technology Co.,Ltd.

Country or region after: China

Address before: 101100 Beijing Tongzhou District Canal Core Area IV-108 Multifunctional Land Project 8A Business Office Building 6th Floor 604

Patentee before: Beijing Shengshi Tianan Technology Co.,Ltd.

Country or region before: China