WO2023226284A1 - Training method, apparatus, device and storage medium for a deep learning model - Google Patents

Training method, apparatus, device and storage medium for a deep learning model Download PDF

Info

Publication number
WO2023226284A1
Authority
WO
WIPO (PCT)
Prior art keywords
training
deep learning
learning model
preset
model
Prior art date
Application number
PCT/CN2022/126231
Other languages
English (en)
French (fr)
Inventor
范高俊
曾炜
王晖
Original Assignee
鹏城实验室
Priority date
Filing date
Publication date
Application filed by 鹏城实验室
Publication of WO2023226284A1 publication Critical patent/WO2023226284A1/zh

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Definitions

  • the present invention relates to the technical field of model training, and specifically to a training method, device, equipment and storage medium for a deep learning model.
  • Deep learning is an increasingly popular machine learning method in the industry and can be used in a variety of scenarios such as images, voice, video, and machine translation. Taking machine translation as an example, the effect of machine translation based on neural networks has been significantly improved. Currently, in some languages and scenarios, the translation quality can even reach the level of manual translation.
  • Data parallelism is a form of distributed training of deep learning models. It divides the training data into multiple parts and trains them on different computing nodes. If the computing nodes do not have shared public memory and only have local memory with limited capacity, and the training data set is too large to be stored in local memory, the training data set needs to be divided and allocated to the computing nodes all at once, and then the computing nodes train the deep learning model based on their respective allocated local data. During the distributed training process, each computing node needs to communicate with other nodes to exchange gradient data.
  • the existing technology does not consider the capacity of the training data set when allocating the training data set to local nodes (computing nodes), resulting in low efficiency for the node to train the deep learning model based on the allocated training data set.
  • the present invention provides a deep learning model training method, device, equipment and storage medium, which solves the problem of low efficiency in training deep learning models in the existing technology.
  • the present invention provides a training method for a deep learning model, which includes:
  • storing the training data set to a local node based on the data capacity includes:
  • the entire training data set in the database is downloaded to the local node, and the database is located outside the local node.
  • completing the training of the deep learning model based on the training data set stored on the local node includes:
  • the training of the deep learning model is completed based on the model training accuracy and the model preset training accuracy corresponding to the deep learning model.
  • completing the training of the deep learning model based on the model training accuracy and the model preset training accuracy corresponding to the deep learning model includes:
  • when the model training accuracy is less than the model preset training accuracy, the local distributed storage method among the target storage methods is obtained, and the number of nodes corresponding to the local distributed storage method is greater than the number of nodes corresponding to the local storage method;
  • data is re-downloaded from the database to the local node according to the local distributed storage method, the training data set is updated with the re-downloaded data, and the updated training data set is used to continue training the pre-trained deep learning model to complete the training of the deep learning model.
  • storing the training data set to a local node based on the data capacity includes:
  • the training data set in the database is downloaded to the local node in a parallel manner according to the training progress of the deep learning model.
  • completing the training of the deep learning model based on the training data set stored on the local node includes:
  • a parallel training method corresponding to the local distributed storage method is obtained;
  • each acceleration card constituting the parallel training method is obtained, and the acceleration card is a hardware device required for training the deep learning model;
  • the target update method of the training data set is obtained
  • the training of the deep learning model is completed.
  • completing the training of the deep learning model based on the training data set stored on the local node includes:
  • training of the deep learning model after the single training pass is continued based on the adjusted gradient values and the training data set, to complete the training of the deep learning model.
  • adjusting the gradient value of the deep learning model according to the gradient adjustment coefficient includes:
  • a parallel training method corresponding to the local distributed storage method is obtained;
  • each acceleration card constituting the parallel training method is obtained, and the acceleration card is a hardware device required for training the deep learning model;
  • the maximum single training duration is obtained
  • the gradient value of the deep learning model is adjusted according to the maximum single training duration and the gradient adjustment coefficient.
  • adjusting the gradient value of the deep learning model based on the maximum single training duration and the gradient adjustment coefficient includes:
  • the updated gradient adjustment coefficient is obtained
  • the gradient value of the deep learning model is adjusted according to the updated gradient adjustment coefficient.
  • completing the training of the deep learning model based on the training data set stored on the local node further includes:
  • the deep learning model is retrained.
  • the training method further includes:
  • the training method further includes:
  • the number of nodes used to train the deep learning model is adjusted.
  • the training method further includes:
  • the data parallel weight coefficient and the model parallel weight coefficient are adjusted according to the training accuracy, the preset data parallel weight coefficient, and the preset model parallel weight coefficient.
  • embodiments of the present invention also provide a training device for a deep learning model, wherein the device includes the following components:
  • a capacity calculation module, used to obtain the data capacity of the training data set;
  • a data storage module configured to store the training data set to a local node according to the data capacity
  • a training module configured to complete training of the deep learning model based on the training data set stored on the local node.
  • embodiments of the present invention further provide a terminal device, wherein the terminal device includes a memory, a processor, and a training program for a deep learning model stored in the memory and executable on the processor; when the processor executes the training program of the deep learning model, the steps of the training method of the deep learning model described above are implemented.
  • embodiments of the present invention further provide a computer-readable storage medium.
  • the computer-readable storage medium stores a training program for a deep learning model.
  • when the training program for the deep learning model is executed by a processor, the steps of the training method of the deep learning model described above are implemented.
  • the present invention first selects the storage method for storing the training data set to the local node according to the capacity (data set size) of the training data set required by the deep learning model to be trained, then completes the storage of the training data set on the local node, and finally the local node uses the training data set to train the deep learning model.
  • the present invention stores the training data set on the local node according to the capacity of the training data set, which saves the time required for data storage and therefore the overall time required for training, thereby improving training efficiency.
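  • As a rough illustration of the selection logic described above, the following Python sketch chooses between the local storage method and the local distributed storage method from the data capacity of the training data set. The threshold value, the function name `select_storage_method`, and the `StoragePlan` structure are illustrative assumptions rather than part of the disclosed embodiment.

```python
from dataclasses import dataclass

# Illustrative threshold; the embodiment only states that a preset capacity E0 exists.
PRESET_CAPACITY_E0 = 50 * 1024**3  # 50 GiB, hypothetical value


@dataclass
class StoragePlan:
    method: str               # "local" or "local_distributed"
    download_all_at_once: bool


def select_storage_method(data_capacity_bytes: int) -> StoragePlan:
    """Pick a storage method from the training data set's capacity E.

    E < E0  -> local storage: download the whole data set to the local node once.
    E >= E0 -> local distributed storage: stage data on storage nodes (or stream it)
               and transfer it to the local node as training progresses.
    """
    if data_capacity_bytes < PRESET_CAPACITY_E0:
        return StoragePlan(method="local", download_all_at_once=True)
    return StoragePlan(method="local_distributed", download_all_at_once=False)


if __name__ == "__main__":
    print(select_storage_method(10 * 1024**3))   # small set -> local storage
    print(select_storage_method(200 * 1024**3))  # large set -> local distributed storage
```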
  • Figure 1 is an overall flow chart of the present invention
  • Figure 2 is a flow chart in an embodiment of the present invention.
  • Figure 3 is a functional block diagram of the internal structure of a terminal device provided by an embodiment of the present invention.
  • Deep learning is an increasingly popular machine learning method in the industry and can be used in a variety of scenarios such as images, voice, video, and machine translation.
  • Taking machine translation as an example, the effect of machine translation based on neural networks has been significantly improved; currently, in some languages and scenarios, the translation quality can even reach the level of manual translation.
  • Data parallelism is a form of distributed training of deep learning models. It divides the training data into multiple parts and trains them on different computing nodes. If the computing nodes do not have shared public memory and only have local memory with limited capacity, and the training data set is too large to be stored in local memory, the training data set needs to be divided and allocated to the computing nodes all at once, and then the computing nodes train the deep learning model based on their respective allocated local data.
  • each computing node needs to communicate with other nodes to exchange gradient data.
  • the existing technology does not consider the capacity of the training data set when allocating the training data set to local nodes (computing nodes), resulting in low efficiency for the node to train the deep learning model based on the allocated training data set.
  • the present invention provides a deep learning model training method, device, equipment and storage medium, which solves the problem of low efficiency in training deep learning models in the existing technology.
  • the data capacity of the training data set is obtained; the training data set is stored in the local node based on the data capacity; and the training of the deep learning model is completed based on the training data set stored in the local node.
  • the invention can improve the training efficiency of the model.
  • For example, there is a database outside the local node that stores all of the data.
  • different storage methods are selected according to the size (capacity) of the training data set required to train the deep learning model.
  • Different storage methods store training data sets on local nodes. For example, a training data set with capacity A corresponds to storage method a, and a training data set with capacity B corresponds to storage method b.
  • the training method of the deep learning model in this embodiment can be applied to a terminal device, which can be a terminal product with computing functions, such as a computer.
  • the training method of the deep learning model specifically includes the following steps:
  • the application goals of deep learning models are different, and the data capacity of the training data sets used to train deep learning models is also different. For example, if one deep learning model is used to recognize images, and another deep learning model is used to recognize sounds, then the data capacity of the training data sets required by the two deep learning models is different. There are also different training data sets required to train to different accuracies. Therefore, the corresponding data capacity can be obtained based on the deep learning model and the accuracy required for training.
  • S200 Store the training data set to a local node according to the data capacity.
  • Step S200 includes two situations.
  • The first situation: the data capacity E is less than the preset capacity E0 corresponding to the training data set, and the training data set is downloaded from the database to the local node using the local storage method.
  • Local memory is set up on the local node specifically for storing the training data set, and the entire training data set is downloaded into local memory in one pass. Because the capacity of the training data set is small, downloading it all to the local node at once does not occupy too much of the local node's memory and therefore does not affect the local node's training of the deep learning model; and since the entire training data set has been downloaded to the local node before training, no data needs to be downloaded from the database during the training process, which ultimately improves training efficiency.
  • The second situation: if the data capacity E is greater than or equal to the preset capacity E0 corresponding to the training data set, the local distributed storage method is used to first download the data in the database to local storage nodes (a local storage node is different from a local node in that it is only used to store data), and then the data on the local storage nodes is transferred to the local node during the training process.
  • the local node uses the training data set to complete the training of the deep learning model. Since the training data set is very large, placing it directly on the local node would occupy a large amount of the local node's memory, thus reducing training efficiency.
  • the training data set is first placed on the local storage node, which can reduce the training data set's occupation of the local node memory, thereby improving training efficiency.
  • the training data set can also be directly downloaded from the database to the memory of the local node. However, this is not a one-time download, but a step-by-step download as the training progresses.
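  • A minimal sketch of the step-by-step download just described is given below, assuming a hypothetical `fetch_shard` callable standing in for whatever database client is actually used; the embodiment does not prescribe an interface.

```python
from typing import Callable, Iterable, Iterator, List


def progressive_batches(
    shard_ids: Iterable[str],
    fetch_shard: Callable[[str], List[bytes]],
) -> Iterator[bytes]:
    """Yield training samples shard by shard instead of downloading everything first.

    Each shard is pulled from the external database right before it is consumed,
    so local memory only ever holds one shard at a time.
    """
    for shard_id in shard_ids:
        shard = fetch_shard(shard_id)   # download the next portion of the data set
        for sample in shard:
            yield sample                # hand samples to the training loop
        del shard                       # release local memory before the next shard


if __name__ == "__main__":
    # Hypothetical stand-in for a database client.
    def fetch_shard(shard_id: str) -> List[bytes]:
        return [f"{shard_id}-{i}".encode() for i in range(3)]

    for sample in progressive_batches(["shard-0", "shard-1"], fetch_shard):
        print(sample)
```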
  • S300 Complete the training of the deep learning model based on the training data set stored on the local node.
  • step S300 includes the following steps S301 to S306:
  • the method of the present invention can determine whether to perform training on locally distributed storage data by setting the preset capacity E0, the preset training duration T0, and the model preset training accuracy X0, realizing the pre-processing and prioritization of the distributed training method for the deep learning model and improving the efficiency of model training.
  • S304 Re-download data from the database to the local node according to the local distributed storage method.
  • S306 Continue to train the deep learning model after pre-training based on the updated training data set to complete the training of the deep learning model.
  • when the model is trained using the training data set obtained with the local storage method, the model training accuracy X is also monitored during the training process. If the training duration T reaches the preset training duration T0 but the model training accuracy X has not yet reached the model preset training accuracy X0, the training data set obtained with the local storage method cannot train the model well, and the local distributed storage method must be used to re-download data from the database to continue the subsequent training.
  • when step S200 is the second situation (local distributed storage method) and the training data set downloaded using the local distributed storage method cannot complete the model training, the training data set needs to be downloaded again to complete the training. In this case, step S300 includes the following steps S301a to S308a:
  • S301a According to the local distributed storage method, obtain the parallel training method corresponding to the local distributed storage method.
  • in this embodiment, the parallel training method is model parallelism (there are multiple local nodes, each of which trains the model, and together they constitute the parallel training method).
  • according to the parallel training method, each acceleration card constituting the parallel training method is obtained; the acceleration card is a hardware device required for training the deep learning model.
  • the accelerator card is the hardware device that the local node (server) relies on to train the model.
  • S303a Count the single training time required for a single training of the deep learning model in each of the accelerator cards.
  • S304a Count the number of accelerator cards corresponding to the single training duration greater than the preset timing duration t0 (preset calculation duration).
  • S305a Calculate the ratio of the number of accelerator cards corresponding to a time period greater than the preset time period to the total number of accelerator cards, and obtain the quantity ratio B.
  • For example, if 10 local nodes train the model in parallel, there are 10 acceleration cards; if 8 of them have a single training duration greater than the preset timing duration t0, the quantity ratio (proportion) B is 8/10.
  • S306a Obtain the target update method of the training data set based on the quantity ratio.
  • the central control module is provided with a first preset proportion B1 and a second preset proportion B2, where B1 < B2.
  • if B ≤ B1, the central control module determines to use the asynchronous update method for training, and the target update method at this time is the asynchronous update method.
  • if B1 < B ≤ B2, the central control module determines not to use asynchronous update training.
  • if B > B2, the central control module determines to continue using the synchronous update method, and the target update method at this time is synchronous update; the central control module calculates ΔB and reduces the gradient according to ΔB, where ΔB = B − B2.
  • Reducing the gradient is to reduce the size of the values corresponding to the parameters in the model.
  • S307a Update the training data set according to the target update method.
  • S308a Complete the training of the deep learning model based on the updated training data set.
  • the asynchronous update method is to update the training data set on each local node asynchronously
  • the synchronous update method is to synchronously download new data from the database to the training data set on the local node to update the training data set.
  • by setting the first preset proportion B1 and the second preset proportion B2, the synchronous and asynchronous update methods can be judged and selected between, further realizing the pre-processing and prioritization of the distributed training method for the deep learning model and improving the efficiency of model training.
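  • The following sketch illustrates one way the update-method decision above could be expressed, assuming the thresholds B1 and B2 are supplied by the central control module; the concrete values in the example are made up for illustration.

```python
def choose_update_method(
    single_training_durations: list[float],
    preset_timing_duration_t0: float,
    b1: float,
    b2: float,
) -> str:
    """Pick the target update method from the share of slow accelerator cards.

    B  = fraction of cards whose single training duration exceeds t0.
    B <= B1       -> asynchronous update
    B1 < B <= B2  -> do not switch to asynchronous update
    B  > B2       -> keep synchronous update (the caller may also compute dB = B - B2
                     and reduce the gradient accordingly)
    """
    assert b1 < b2, "the embodiment requires B1 < B2"
    total = len(single_training_durations)
    slow = sum(1 for t in single_training_durations if t > preset_timing_duration_t0)
    b = slow / total

    if b <= b1:
        return "asynchronous"
    if b <= b2:
        return "keep_current"   # asynchronous update is not adopted
    return "synchronous"


if __name__ == "__main__":
    durations = [1.0, 1.2, 3.5, 3.8, 0.9, 4.1, 1.1, 3.9, 4.0, 3.7]  # 6 of 10 exceed t0 = 2.0
    print(choose_update_method(durations, preset_timing_duration_t0=2.0, b1=0.3, b2=0.7))
```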
  • in one embodiment, after the quantity ratio B is obtained in step S306a, the model preset training accuracy X0 is adjusted according to the quantity ratio B to obtain the adjusted model preset training accuracy X'.
  • the next time the model is trained, X' is used to judge whether the model has completed training; adjusting the model preset training accuracy X0 specifically includes the following steps S3061a, S3062a, and S3063a:
  • S3061a Obtain the first preset ratio B′ and the second preset ratio B′′ corresponding to the quantity ratio, and the second preset ratio B′′ is greater than the first preset ratio B′.
  • the central control module adjusts the training duration in the process of increasing the gradient, and adjusts the training accuracy in the process of decreasing the gradient, to keep the accuracy within a reasonable range; the central control module is provided with a first preset proportion difference ΔB1 (B′ = ΔB1 + B2), a second preset proportion difference ΔB2 (B″ = ΔB2 + B2), a first preset accuracy adjustment coefficient β1, and a second preset accuracy adjustment coefficient β2, where 0 < ΔB1 < ΔB2 and 0 < β1 < β2.
  • if ΔB1 < ΔB < ΔB2, the central control module uses β1 to adjust the training accuracy, and the adjusted training accuracy is X′ = X0 × β1.
  • if ΔB > ΔB2, the central control module uses β2 to adjust the training accuracy, and the adjusted training accuracy is X′ = X0 × β2.
  • by setting the first preset proportion difference ΔB1, the second preset proportion difference ΔB2, the first preset accuracy adjustment coefficient β1, and the second preset accuracy adjustment coefficient β2, the training duration and training accuracy can be adjusted, further realizing the pre-processing and prioritization of distributed training methods for deep learning models and improving the efficiency of model training.
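  • A small sketch of the preset-accuracy adjustment above follows; the behaviour outside the two stated intervals (returning X0 unchanged) is an assumption, since the text does not specify it.

```python
def adjust_preset_accuracy(
    x0: float,        # current model preset training accuracy X0
    delta_b: float,   # dB = B - B2, computed while the gradient is being reduced
    delta_b1: float,
    delta_b2: float,
    beta1: float,
    beta2: float,
) -> float:
    """Return the adjusted preset training accuracy X' used to judge the next pass.

    dB1 < dB < dB2 -> X' = X0 * beta1
    dB  > dB2      -> X' = X0 * beta2
    otherwise      -> X0 is left unchanged (assumption)
    """
    assert 0 < delta_b1 < delta_b2 and 0 < beta1 < beta2
    if delta_b1 < delta_b < delta_b2:
        return x0 * beta1
    if delta_b > delta_b2:
        return x0 * beta2
    return x0


if __name__ == "__main__":
    print(adjust_preset_accuracy(x0=0.95, delta_b=0.15,
                                 delta_b1=0.1, delta_b2=0.2,
                                 beta1=0.95, beta2=0.98))
```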
  • in one embodiment, when step S200 is the second situation (local distributed storage method), the training of the model is completed by adjusting the gradient values (the values corresponding to the parameters in the model).
  • step S300 includes the following steps S301b to S3010b:
  • S301b Use the training data set to train the deep learning model, and obtain the model training accuracy of the deep learning model after a single training.
  • the gradient adjustment coefficient α specifies how much the gradient value needs to be increased or decreased when the next training pass is performed after a single training pass is completed. For example, after a single training pass, one of the parameter values (gradient values) in the model is h; before the next training pass, the parameter value is adjusted to h × α.
  • the central control module is provided with a model preset training accuracy X0 corresponding to the preset distributed storage data, a preset gradient adjustment coefficient α0 corresponding to the local distributed storage method, a preset first accuracy difference ΔX1, and a preset second accuracy difference ΔX2, where 0 < ΔX1 < ΔX2.
  • when the central control module determines that training on locally distributed data is needed, if during a single training pass the training accuracy X is lower than the preset training accuracy X0,
  • the central control module determines that the training gradient needs to be increased, and determines the gradient adjustment coefficient α from the accuracy difference ΔX (ΔX = X0 − X): if ΔX < ΔX1, α = α1 = α0 × 1.2; if ΔX1 < ΔX < ΔX2, α = α2 = α0 × 1.4; if ΔX > ΔX2, α = α3 = α0 × 1.6.
  • by setting the preset training accuracy, the preset gradient adjustment coefficient, the preset first accuracy difference, and the preset second accuracy difference, the gradient adjustment coefficient can be determined, further realizing the pre-processing and prioritization of the distributed training method for the deep learning model and improving the efficiency of model training.
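  • The piecewise choice of the gradient adjustment coefficient described above can be sketched as follows; the multipliers 1.2, 1.4 and 1.6 are the ones given for this embodiment, while the early-return branch for X ≥ X0 is an assumption.

```python
def gradient_adjustment_coefficient(
    training_accuracy_x: float,
    preset_accuracy_x0: float,
    alpha0: float,
    delta_x1: float,
    delta_x2: float,
) -> float:
    """Return the gradient adjustment coefficient alpha from the accuracy gap.

    dX = X0 - X (only relevant when X < X0, i.e. the gradient should be increased):
      dX < dX1         -> alpha = alpha0 * 1.2
      dX1 <= dX <= dX2 -> alpha = alpha0 * 1.4
      dX > dX2         -> alpha = alpha0 * 1.6
    """
    assert 0 < delta_x1 < delta_x2
    if training_accuracy_x >= preset_accuracy_x0:
        return 1.0  # accuracy already meets the preset; leave the gradient unchanged (assumption)
    dx = preset_accuracy_x0 - training_accuracy_x
    if dx < delta_x1:
        return alpha0 * 1.2
    if dx <= delta_x2:
        return alpha0 * 1.4
    return alpha0 * 1.6


if __name__ == "__main__":
    alpha = gradient_adjustment_coefficient(0.80, 0.95, alpha0=1.0,
                                            delta_x1=0.05, delta_x2=0.10)
    h = 0.37                 # one parameter (gradient) value after a single pass
    print(alpha, h * alpha)  # the value used for the next pass is h * alpha
```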
  • according to the parallel training method, each acceleration card constituting the parallel training method is obtained; the acceleration card is a hardware device required for training the deep learning model.
  • S305b Count each single training time required by each accelerator card to train the deep learning model.
  • if ΔB < ΔB1, the central control module determines not to adjust the gradient.
  • if ΔB > ΔB2, the central control module uses α3 to adjust the gradient, and the adjusted gradient is S′ = S0 × α3.
  • in one embodiment, when the maximum single training duration is greater than the set duration t′, step S306b is followed by the following steps S307b to S3010b:
  • S307b Update the training data set, train the deep learning model based on the updated training data set, and obtain the updated model training accuracy corresponding to the deep learning model.
  • S309b Adjust the gradient value of the deep learning model according to the updated gradient adjustment coefficient.
  • S3010b Based on the adjusted gradient value and the training data set, continue training the deep learning model after the single training to complete the training of the deep learning model.
  • if the maximum single training duration is greater than t″, the central control module uses the asynchronous update method for training.
  • by setting the preset timing duration t0, the first preset time difference Δt1, the second preset time difference Δt2, and the preset gradient S0, the gradient can be adjusted according to the accuracy difference, achieving gradient control of the training and further realizing the pre-processing and prioritization of distributed training methods for deep learning models and improving the efficiency of model training.
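  • The timing-based decision above can be sketched as follows; the returned labels are illustrative, and `check_slow_card_ratio` stands in for the comparison of the proportions B and B0 described in the text.

```python
def timing_based_decision(
    single_training_durations: list[float],
    t0: float,         # preset timing duration
    delta_t1: float,   # first preset time difference  (t'  = t0 + dt1)
    delta_t2: float,   # second preset time difference (t'' = t0 + dt2)
) -> str:
    """Decide the next action from the slowest accelerator card.

    tmax <= t'        -> scale the gradient (S = alpha * S0) and keep training
    t' < tmax <= t''  -> re-check the slow-card proportion before deciding on async update
    tmax > t''        -> switch to the asynchronous update method
    """
    assert 0 < delta_t1 < delta_t2
    t_prime = t0 + delta_t1
    t_double_prime = t0 + delta_t2
    tmax = max(single_training_durations)

    if tmax <= t_prime:
        return "scale_gradient"
    if tmax <= t_double_prime:
        return "check_slow_card_ratio"
    return "asynchronous_update"


if __name__ == "__main__":
    print(timing_based_decision([1.0, 1.4, 2.9], t0=1.5, delta_t1=0.5, delta_t2=1.0))
```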
  • step S300 adjusts the number of local nodes during the training of the model.
  • the specific process is: calculate the scalability of the deep learning model, and based on the scalability, adjust the number of nodes used to train the deep learning model.
  • the central control module determines the number of nodes according to the parallel scalability of the model.
  • the central control module is provided with a preset first scalability limit H1, a preset second scalability limit H2, and a preset node count W0. If H < H1, the central control module reduces the number of nodes, and the reduced number of nodes is 0.5 × W0; if H1 < H < H2, the central control module neither increases nor decreases the number of nodes; if H > H2, the central control module increases the number of nodes, and the increased number of nodes is 1.5 × W0.
  • by setting the preset first scalability limit, the preset second scalability limit, and the preset node count, the number of nodes can be determined and adjusted, further realizing the pre-processing and prioritization of the distributed training method for the deep learning model and improving the efficiency of model training.
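  • A sketch of the node-count adjustment follows. It assumes the intended reading of the H > H2 branch is to increase the node count to 1.5 × W0, and the scalability value H itself is taken as an input rather than computed.

```python
def adjust_node_count(scalability_h: float, h1: float, h2: float, preset_nodes_w0: int) -> int:
    """Adjust the number of training nodes from the model's parallel scalability H.

    H < H1        -> shrink to 0.5 * W0
    H1 <= H <= H2 -> keep W0
    H > H2        -> grow to 1.5 * W0
    """
    assert h1 < h2
    if scalability_h < h1:
        return max(1, int(0.5 * preset_nodes_w0))
    if scalability_h > h2:
        return int(1.5 * preset_nodes_w0)
    return preset_nodes_w0


if __name__ == "__main__":
    for h in (0.4, 1.0, 2.3):
        print(h, adjust_node_count(h, h1=0.6, h2=1.8, preset_nodes_w0=16))
```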
  • step S300 also adjusts the weight of each training data set and the weight of the model located on each local node.
  • the specific process includes: obtaining the preset data parallel weight coefficient corresponding to the data parallel weight coefficient of the local distributed storage method; obtaining the preset model parallel weight coefficient corresponding to the model parallel weight coefficient of the deep learning model; training the deep learning model using a parallel training method to obtain the trained deep learning model; calculating the training accuracy of the trained deep learning model; and adjusting the data parallel weight coefficient and the model parallel weight coefficient according to the training accuracy, the preset data parallel weight coefficient, and the preset model parallel weight coefficient.
  • when data parallelism and model parallelism are used together, the actual training accuracy is the weighted sum X′ = Ka × Xa + Kb × Xb; when X′ > Ka0 × Xa + Kb0 × Xb, the central control module determines that the training accuracy meets the standard and does not adjust the weight coefficients; otherwise, it adjusts the data parallel weight coefficient to D′ = D0 − 0.3D0 and the model parallel weight coefficient to A′ = A0 + 0.3D0.
  • by setting the preset data parallel weight coefficient, the preset training accuracy, and the preset model parallel weight coefficient, the weight coefficients can be adjusted to improve training accuracy, further realizing the pre-processing and prioritization of distributed training methods for deep learning models and improving the efficiency of model training.
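  • The weighted-accuracy check and weight adjustment above can be sketched as follows; the example values are made up, and the 0.3 × D0 shift is the one given for this embodiment.

```python
def adjust_parallel_weights(
    xa: float, xb: float,    # measured data-parallel / model-parallel training accuracy
    ka: float, kb: float,    # current accuracy weight coefficients (Ka + Kb = 1)
    ka0: float, kb0: float,  # preset accuracy weight coefficients
    d0: float, a0: float,    # current data/model parallel weight coefficients (D0 + A0 = 1)
) -> tuple[float, float]:
    """Check the combined accuracy X' = Ka*Xa + Kb*Xb against the preset-weighted
    target; if it falls short, shift 0.3*D0 of weight from data parallelism to
    model parallelism (D' = D0 - 0.3*D0, A' = A0 + 0.3*D0)."""
    x_combined = ka * xa + kb * xb
    x_target = ka0 * xa + kb0 * xb
    if x_combined > x_target:
        return d0, a0            # accuracy meets the standard; keep the weights
    d_new = d0 - 0.3 * d0
    a_new = a0 + 0.3 * d0
    return d_new, a_new


if __name__ == "__main__":
    print(adjust_parallel_weights(xa=0.90, xb=0.85, ka=0.4, kb=0.6,
                                  ka0=0.5, kb0=0.5, d0=0.6, a0=0.4))
```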
  • when performing local distributed storage training, the central control module increases the training gradient based on the accuracy of a single training pass; when the accuracy of a single pass of data training does not meet the requirements, the central control module gradually increases the training gradient across the different training nodes, further realizing the pre-processing and prioritization of distributed training methods for deep learning models and improving the efficiency of model training.
  • when training is completed using data parallelism, if the central control module determines that the training results do not meet the standard, the central control module determines, based on the training results of the data parallel method, whether re-training combined with model parallelism is necessary; for this purpose the central control module is provided with a preset training duration T0 and a preset overall training accuracy X'0 for the data parallel training method.
  • the present invention first selects the storage method for storing the training data set to the local node according to the capacity (data set size) of the training data set required by the deep learning model to be trained, then completes the storage of the training data set on the local node, and finally the local node uses the training data set to train the deep learning model.
  • the present invention stores the training data set to the local node according to the capacity of the training data set, which can save the time required to store the data, thereby saving the overall time required for training, thereby improving the training efficiency.
  • before training begins, the central control module determines whether to start local distributed storage training or local storage training based on the memory footprint of the data set to be trained, the preset training duration, the preset reading accuracy, and the platform's training task conditions, and increases the training gradient based on the training accuracy during a single training pass, thereby realizing the pre-processing and prioritization of distributed training methods for deep learning models and improving the efficiency of model training.
  • This embodiment also provides a deep learning model training device, which includes the following components:
  • a capacity calculation module, used to obtain the data capacity of the training data set;
  • a data storage module configured to store the training data set to a local node according to the data capacity
  • a training module used to complete the training of the deep learning model based on the training data set stored on the local node.
  • the present invention also provides a terminal device, the functional block diagram of which can be shown in Figure 3 .
  • the terminal device includes a processor, memory, network interface, display screen, and temperature sensor connected through a system bus.
  • the processor of the terminal device is used to provide computing and control capabilities.
  • the memory of the terminal device includes non-volatile storage media and internal memory.
  • the non-volatile storage medium stores operating systems and computer programs. This internal memory provides an environment for the execution of operating systems and computer programs in non-volatile storage media.
  • the network interface of the terminal device is used to communicate with an external terminal through a network connection.
  • the computer program when executed by the processor, implements a method for training a deep learning model.
  • the display screen of the terminal device may be a liquid crystal display screen or an electronic ink display screen.
  • the temperature sensor of the terminal device is pre-set inside the terminal device for detecting the operating temperature of the internal device.
  • a terminal device includes a memory, a processor, and a training program for a deep learning model that is stored in the memory and can be run on the processor.
  • when the processor executes the training program for the deep learning model, the following operation instructions are implemented:
  • the data capacity of the training data set is obtained; the training data set is stored to a local node according to the data capacity; and the training of the deep learning model is completed based on the training data set stored on the local node.
  • Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link (Synchlink) DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Cable Transmission Systems, Equalization Of Radio And Reduction Of Echo (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The present invention relates to the technical field of model training, and specifically to a training method, apparatus, device and storage medium for a deep learning model. The present invention first selects the storage method for storing the training data set to a local node according to the capacity of the training data set required by the deep learning model to be trained, then completes the storage of the training data set on the local node, and finally the local node uses the training data set to train the deep learning model. By storing the training data set to the local node according to its capacity, the present invention saves the time required to store the data, which in turn saves the overall time required for training and thereby improves training efficiency.

Description

Training Method, Apparatus, Device and Storage Medium for a Deep Learning Model
Technical Field
The present invention relates to the technical field of model training, and specifically to a training method, apparatus, device and storage medium for a deep learning model.
Background Art
Deep learning is a machine learning method that is becoming increasingly popular in industry and can be used in a variety of scenarios such as image, speech, video, and machine translation. Taking machine translation as an example, the effect of neural-network-based machine translation has improved significantly; currently, in some languages and scenarios, the translation quality can even reach the level of human translation. Data parallelism is a form of distributed training for deep learning models: it splits the training data into multiple parts and trains them on different computing nodes. If the computing nodes have no shared common memory but only local memory of limited capacity, and the training data set is too large to be stored in local memory, the training data set needs to be partitioned and allocated to the computing nodes all at once, after which each computing node trains the deep learning model on its allocated local data. During distributed training, each computing node needs to communicate with the other nodes to exchange gradient data.
The existing technology does not take the capacity of the training data set into account when allocating the training data set to local nodes (computing nodes), so the efficiency with which a node trains the deep learning model on its allocated training data set is low.
In summary, the existing technology trains deep learning models with low efficiency.
Therefore, the existing technology still needs to be improved.
Summary of the Invention
To solve the above technical problem, the present invention provides a training method, apparatus, device and storage medium for a deep learning model, which solves the problem of low efficiency in training deep learning models in the existing technology.
To achieve the above purpose, the present invention adopts the following technical solution:
In a first aspect, the present invention provides a training method for a deep learning model, which includes:
obtaining the data capacity of a training data set;
storing the training data set to a local node according to the data capacity;
completing the training of the deep learning model based on the training data set stored on the local node.
In one implementation, storing the training data set to a local node according to the data capacity includes:
when the data capacity is less than a preset capacity corresponding to the training data set, obtaining a local storage method;
according to the local storage method, downloading the entire training data set in a database to the local node, where the database is located outside the local node.
In one implementation, completing the training of the deep learning model based on the training data set stored on the local node includes:
training the deep learning model on the training data set until the training duration reaches a preset training duration, obtaining the pre-trained deep learning model;
calculating the model training accuracy of the pre-trained deep learning model;
completing the training of the deep learning model based on the model training accuracy and a model preset training accuracy corresponding to the deep learning model.
In one implementation, completing the training of the deep learning model based on the model training accuracy and the model preset training accuracy corresponding to the deep learning model includes:
when the model training accuracy is less than the model preset training accuracy, obtaining a local distributed storage method among the target storage methods, where the number of nodes corresponding to the local distributed storage method is greater than the number of nodes corresponding to the local storage method;
re-downloading data from the database to the local node according to the local distributed storage method;
updating the training data set with the re-downloaded data;
continuing to train the pre-trained deep learning model with the updated training data set to complete the training of the deep learning model.
In one implementation, storing the training data set to a local node according to the data capacity includes:
when the data capacity is greater than or equal to the preset capacity corresponding to the training data set, obtaining a local distributed storage method;
according to the local distributed storage method, downloading the training data set in the database to the local node in a parallel manner following the training progress of the deep learning model.
In one implementation, completing the training of the deep learning model based on the training data set stored on the local node includes:
obtaining, according to the local distributed storage method, a parallel training method corresponding to the local distributed storage method;
obtaining, according to the parallel training method, each acceleration card constituting the parallel training method, where the acceleration card is a hardware device required for training the deep learning model;
counting the single training duration required by each acceleration card for a single training pass of the deep learning model;
counting the number of acceleration cards whose single training duration is greater than a preset timing duration;
calculating the ratio of the number of acceleration cards whose duration exceeds the preset timing duration to the total number of acceleration cards, obtaining a quantity ratio;
obtaining a target update method of the training data set according to the quantity ratio;
updating the training data set according to the target update method;
completing the training of the deep learning model based on the updated training data set.
In one implementation, completing the training of the deep learning model based on the training data set stored on the local node includes:
training the deep learning model with the training data set to obtain the model training accuracy of the deep learning model after a single training pass;
obtaining a gradient adjustment coefficient of the deep learning model according to the model training accuracy and the model preset training accuracy corresponding to the deep learning model;
adjusting the gradient values of the deep learning model according to the gradient adjustment coefficient;
continuing to train the deep learning model after the single training pass based on the adjusted gradient values and the training data set, to complete the training of the deep learning model.
In one implementation, adjusting the gradient values of the deep learning model according to the gradient adjustment coefficient includes:
obtaining, according to the local distributed storage method, a parallel training method corresponding to the local distributed storage method;
obtaining, according to the parallel training method, each acceleration card constituting the parallel training method, where the acceleration card is a hardware device required for training the deep learning model;
counting each single training duration required by each acceleration card to train the deep learning model once;
obtaining the maximum single training duration from the single training durations;
adjusting the gradient values of the deep learning model according to the maximum single training duration and the gradient adjustment coefficient.
In one implementation, adjusting the gradient values of the deep learning model according to the maximum single training duration and the gradient adjustment coefficient includes:
when the maximum single training duration is less than or equal to a set duration, multiplying the gradient adjustment coefficient by a preset gradient to obtain a product;
adjusting the gradient values of the deep learning model according to the product;
or, when the maximum single training duration is greater than the set duration, updating the training data set;
training the deep learning model on the updated training data set to obtain the updated model training accuracy corresponding to the deep learning model;
obtaining the updated gradient adjustment coefficient according to the updated model training accuracy;
adjusting the gradient values of the deep learning model according to the updated gradient adjustment coefficient.
In one implementation, after completing the training of the deep learning model based on the training data set stored on the local node, the method further includes:
counting the total duration required to complete the training of the deep learning model;
calculating the model accuracy of the deep learning model after training is completed;
retraining the deep learning model when the total duration is greater than the preset training duration and the model accuracy is less than a preset model accuracy.
In one implementation, the training method further includes:
obtaining the first preset ratio and the second preset ratio corresponding to the quantity ratio, where the second preset ratio is greater than the first preset ratio;
obtaining a first preset accuracy adjustment coefficient and a second preset accuracy adjustment coefficient corresponding to the model preset training accuracy, where the second preset accuracy adjustment coefficient is greater than the first preset accuracy adjustment coefficient;
when the quantity ratio is greater than the first preset ratio and less than the second preset ratio, multiplying the model preset training accuracy by the first preset accuracy adjustment coefficient to obtain the adjusted model preset training accuracy;
or, when the quantity ratio is greater than the second preset ratio, multiplying the model preset training accuracy by the second preset accuracy adjustment coefficient to obtain the adjusted model preset training accuracy.
In one implementation, the training method further includes:
calculating the scalability of the deep learning model;
adjusting the number of nodes used to train the deep learning model according to the scalability.
In one implementation, the training method further includes:
obtaining a preset data parallel weight coefficient corresponding to the data parallel weight coefficient of the local distributed storage method;
obtaining a preset model parallel weight coefficient corresponding to the model parallel weight coefficient of the deep learning model;
training the deep learning model with a parallel training method to obtain the trained deep learning model;
calculating the training accuracy of the trained deep learning model;
adjusting the data parallel weight coefficient and the model parallel weight coefficient according to the training accuracy, the preset data parallel weight coefficient, and the preset model parallel weight coefficient.
In a second aspect, an embodiment of the present invention further provides a training apparatus for a deep learning model, where the apparatus includes the following components:
a capacity calculation module, configured to obtain the data capacity of a training data set;
a data storage module, configured to store the training data set to a local node according to the data capacity;
a training module, configured to complete the training of the deep learning model based on the training data set stored on the local node.
In a third aspect, an embodiment of the present invention further provides a terminal device, where the terminal device includes a memory, a processor, and a training program for a deep learning model that is stored in the memory and executable on the processor; when the processor executes the training program of the deep learning model, the steps of the training method of the deep learning model described above are implemented.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium on which a training program for a deep learning model is stored; when the training program of the deep learning model is executed by a processor, the steps of the training method of the deep learning model described above are implemented.
Beneficial effects: the present invention first selects the storage method for storing the training data set to the local node according to the capacity (data set size) of the training data set required by the deep learning model to be trained, then completes the storage of the training data set on the local node, and finally the local node uses the training data set to train the deep learning model. By storing the training data set to the local node according to its capacity, the present invention saves the time required to store the data, which in turn saves the overall time required for training and thereby improves training efficiency.
Brief Description of the Drawings
Figure 1 is an overall flow chart of the present invention;
[Corrected under Rule 91, 28.01.2023]
Figure 2 is a flow chart in an embodiment of the present invention;
Figure 3 is a functional block diagram of the internal structure of a terminal device provided by an embodiment of the present invention.
Detailed Description
The technical solution of the present invention is described clearly and completely below in conjunction with the embodiments and the accompanying drawings. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
Research has found that deep learning is a machine learning method that is becoming increasingly popular in industry and can be used in a variety of scenarios such as image, speech, video, and machine translation. Taking machine translation as an example, the effect of neural-network-based machine translation has improved significantly; currently, in some languages and scenarios, the translation quality can even reach the level of human translation. Data parallelism is a form of distributed training for deep learning models: it splits the training data into multiple parts and trains them on different computing nodes. If the computing nodes have no shared common memory but only local memory of limited capacity, and the training data set is too large to be stored in local memory, the training data set needs to be partitioned and allocated to the computing nodes all at once, after which each computing node trains the deep learning model on its allocated local data. During distributed training, each computing node needs to communicate with the other nodes to exchange gradient data. The existing technology does not take the capacity of the training data set into account when allocating the training data set to local nodes (computing nodes), so the efficiency with which a node trains the deep learning model on its allocated training data set is low.
To solve the above technical problem, the present invention provides a training method, apparatus, device and storage medium for a deep learning model, which solves the problem of low efficiency in training deep learning models in the existing technology. In a specific implementation, the data capacity of the training data set is obtained; the training data set is stored to a local node according to the data capacity; and the training of the deep learning model is completed based on the training data set stored on the local node. The present invention can improve the training efficiency of the model.
For example, there is a database outside the local node that stores all of the data. When a deep learning model needs to be trained, different storage methods are selected according to the size (capacity) of the training data set required to train the deep learning model, and the training data set is stored to the local node according to the selected storage method. For instance, a training data set with capacity A corresponds to storage method a, and a training data set with capacity B corresponds to storage method b.
Exemplary Method
The training method of the deep learning model in this embodiment can be applied to a terminal device, which may be a terminal product with computing capability, such as a computer. In this embodiment, as shown in Figure 1, the training method of the deep learning model specifically includes the following steps:
S100: Obtain the data capacity of the training data set.
Deep learning models are applied to different goals, and the data capacity of the training data sets used to train them also differs. For example, if one deep learning model is used to recognize images and another is used to recognize sounds, the data capacities of the training data sets required by the two models are different. The training data sets required to reach different accuracies also differ. Therefore, the corresponding data capacity can be obtained from the deep learning model and the accuracy that training is expected to reach.
S200: Store the training data set to a local node according to the data capacity.
Step S200 covers two situations. The first situation: the data capacity E is less than the preset capacity E0 corresponding to the training data set, and the training data set is downloaded from the database to the local node using the local storage method. Local memory dedicated to storing the training data set is set up on the local node, and the entire training data set is downloaded into local memory in one pass. Because the capacity of the training data set is small, downloading it all to the local node at once does not occupy too much of the local node's memory and therefore does not affect the local node's training of the deep learning model; and since the entire training data set has already been downloaded to the local node before training, no data needs to be downloaded from the database during training, which ultimately improves training efficiency.
The second situation: the data capacity E is greater than or equal to the preset capacity E0 corresponding to the training data set, and the local distributed storage method is used to first download the data in the database to local storage nodes (a local storage node differs from a local node in that it is only used to store data); during training, the data on the local storage nodes is then transferred to the local node, and the local node uses the training data set to complete the training of the deep learning model. Because the training data set is very large, placing it directly on the local node would occupy a large amount of the local node's memory and thus reduce training efficiency. In this embodiment, the training data set is first placed on the local storage nodes, which reduces the training data set's occupation of the local node's memory and thereby improves training efficiency. In one embodiment, the training data set can also be downloaded directly from the database into the local node's memory, not in a single download but step by step as training progresses.
S300: Complete the training of the deep learning model based on the training data set stored on the local node.
When step S200 is the first situation, step S300 includes the following steps S301 to S306:
S301: Train the deep learning model on the training data set until the training duration T reaches the preset training duration T0, obtaining the pre-trained deep learning model.
S302: Calculate the model training accuracy X of the pre-trained deep learning model.
S303: When the model training accuracy X is less than the model preset training accuracy X0, obtain the local distributed storage method among the target storage methods, where the number of nodes corresponding to the local distributed storage method is greater than the number of nodes corresponding to the local storage method.
By setting the preset capacity E0, the preset training duration T0 and the model preset training accuracy X0, the method of the present invention can determine whether to perform training on locally distributed storage data, realizing pre-selection of the overall training process, pre-processing and prioritization of the distributed training method for the deep learning model, and improved model training efficiency.
S304: Re-download data from the database to the local node according to the local distributed storage method.
S305: Update the training data set with the re-downloaded data.
S306: Continue to train the pre-trained deep learning model with the updated training data set to complete the training of the deep learning model.
In this embodiment, when the model is trained on the training data set obtained with the local storage method, the model training accuracy X is also monitored during training. If the training duration T reaches the preset training duration T0 but the model training accuracy X has not yet reached the model preset training accuracy X0, the training data set obtained with the local storage method cannot train the model well, and data needs to be re-downloaded from the database using the local distributed storage method to continue the subsequent training.
Of course, if X > X0 within the period T < T0, the model training can be considered complete, and the central control module can instruct the local node to stop training the model.
When step S200 is the second situation (local distributed storage method) and the training data set downloaded with the local distributed storage method cannot complete the model training, the training data set needs to be downloaded again to complete the training. In this case, step S300 includes the following steps S301a to S308a:
S301a: Obtain, according to the local distributed storage method, the parallel training method corresponding to the local distributed storage method.
In this embodiment, the parallel training method is model parallelism (there are multiple local nodes, each of which trains the model, and together they constitute the parallel training method).
S302a: Obtain, according to the parallel training method, each acceleration card constituting the parallel training method, where the acceleration card is a hardware device required for training the deep learning model.
The acceleration card is the hardware device that the local node (server) relies on to train the model.
S303a: Count the single training duration required by each acceleration card for a single training pass of the deep learning model.
S304a: Count the number of acceleration cards whose single training duration is greater than the preset timing duration t0 (the preset computation duration).
S305a: Calculate the ratio of the number of acceleration cards whose duration exceeds the preset timing duration to the total number of acceleration cards, obtaining the quantity ratio B.
For example, if 10 local nodes train the model in parallel, there are 10 acceleration cards; if the single training duration of 8 of them exceeds the preset timing duration t0, the quantity ratio (proportion) B is 8/10.
S306a: Obtain the target update method of the training data set according to the quantity ratio.
The central control module is provided with a first preset proportion B1 and a second preset proportion B2, where B1 < B2.
If B ≤ B1, the central control module determines to train with the asynchronous update method; the target update method in this case is the asynchronous update method.
If B1 < B ≤ B2, the central control module determines not to train with the asynchronous update method.
If B > B2, the central control module determines to continue using the synchronous update method; the target update method in this case is synchronous update. The central control module calculates ΔB and reduces the gradient according to ΔB, where ΔB = B − B2.
Reducing the gradient means reducing the magnitude of the values corresponding to the parameters in the model.
S307a: Update the training data set according to the target update method.
S308a: Complete the training of the deep learning model based on the updated training data set.
In this embodiment, the asynchronous update method means that the training data sets on the local nodes are not updated synchronously, whereas the synchronous update method synchronously downloads new data from the database to the training data sets on the local nodes to update them.
In this embodiment, by setting the first preset proportion B1 and the second preset proportion B2, a choice can be made between synchronous and asynchronous update, further realizing the pre-processing and prioritization of the distributed training method for the deep learning model and improving the efficiency of model training.
In one embodiment, after the quantity ratio B is obtained in step S306a, the model preset training accuracy X0 is adjusted according to B to obtain the adjusted model preset training accuracy X′, which is used the next time the model is trained to judge whether the model has completed training. Adjusting the model preset training accuracy X0 specifically includes the following steps S3061a, S3062a and S3063a:
S3061a: Obtain the first preset ratio B′ and the second preset ratio B″ corresponding to the quantity ratio, where the second preset ratio B″ is greater than the first preset ratio B′.
In one embodiment, the central control module adjusts the training duration while increasing the gradient and adjusts the training accuracy while reducing the gradient so that the accuracy stays within a reasonable range. The central control module is provided with a first preset proportion difference ΔB1 (B′ = ΔB1 + B2), a second preset proportion difference ΔB2 (B″ = ΔB2 + B2), a first preset accuracy adjustment coefficient β1 and a second preset accuracy adjustment coefficient β2, where 0 < ΔB1 < ΔB2 and 0 < β1 < β2.
S3062a: Obtain the first preset accuracy adjustment coefficient β1 and the second preset accuracy adjustment coefficient β2 corresponding to the model preset training accuracy, where the second preset accuracy adjustment coefficient is greater than the first.
S3063a: When the quantity ratio B is greater than the first preset ratio B′ and less than the second preset ratio B″, multiply the model preset training accuracy by the first preset accuracy adjustment coefficient β1 to obtain the adjusted model preset training accuracy X′.
That is, while the gradient is being reduced:
If ΔB1 < ΔB < ΔB2, the central control module uses β1 to adjust the training accuracy; the adjusted training accuracy is denoted X′, with X′ = X0 × β1.
Or, when the quantity ratio is greater than the second preset ratio, multiply the model preset training accuracy by the second preset accuracy adjustment coefficient to obtain the adjusted model preset training accuracy X′. While the gradient is being reduced, the central control module uses β2 to adjust the training accuracy; the adjusted training accuracy is denoted X′, with X′ = X0 × β2.
In this embodiment, by setting the first preset proportion difference ΔB1, the second preset proportion difference ΔB2, the first preset accuracy adjustment coefficient β1 and the second preset accuracy adjustment coefficient β2, the training duration and training accuracy can be adjusted, further realizing the pre-processing and prioritization of the distributed training method for the deep learning model and improving the efficiency of model training.
In one embodiment, when step S200 is the second situation (local distributed storage method), the training of the model is completed by adjusting the gradient values (the values corresponding to the parameters in the model). In this case, step S300 includes the following steps S301b to S3010b:
S301b: Train the deep learning model with the training data set to obtain the model training accuracy of the deep learning model after a single training pass.
S302b: Obtain the gradient adjustment coefficient α of the deep learning model according to the model training accuracy X and the model preset training accuracy X0 corresponding to the deep learning model.
The gradient adjustment coefficient α specifies how much the gradient value should be scaled up or down when the next training pass is performed after a single training pass. For example, if one of the parameter values (gradient values) in the model is h after a single training pass, it is adjusted to h × α before the next pass.
In this embodiment, the central control module is provided with a model preset training accuracy X0 corresponding to the preset distributed storage data, a preset gradient adjustment coefficient α0 corresponding to the local distributed storage method, a preset first accuracy difference ΔX1 and a preset second accuracy difference ΔX2, where 0 < ΔX1 < ΔX2. When the central control module determines that training on locally distributed data is needed, if during a single training pass the training accuracy X is lower than the preset training accuracy X0, the central control module determines that the training gradient needs to be increased and determines the gradient adjustment coefficient α from the accuracy difference ΔX (ΔX = X0 − X). The specific process is as follows:
If ΔX < ΔX1, the central control module records the gradient adjustment coefficient as α1, with α1 = α0 × 1.2; the gradient adjustment coefficient α is then α1.
If ΔX1 < ΔX < ΔX2, the central control module records the gradient adjustment coefficient as α2, with α2 = α0 × 1.4; the gradient adjustment coefficient α is then α2.
If ΔX > ΔX2, the central control module records the gradient adjustment coefficient as α3, with α3 = α0 × 1.6; the gradient adjustment coefficient α is then α3.
In this embodiment, by setting the preset training accuracy, the preset gradient adjustment coefficient, the preset first accuracy difference and the preset second accuracy difference, the gradient adjustment coefficient can be determined, further realizing the pre-processing and prioritization of the distributed training method for the deep learning model and improving the efficiency of model training.
S303b: Obtain, according to the local distributed storage method, the parallel training method corresponding to the local distributed storage method.
S304b: Obtain, according to the parallel training method, each acceleration card constituting the parallel training method, where the acceleration card is a hardware device required for training the deep learning model.
S305b: Count each single training duration required by each acceleration card to train the deep learning model once.
S306b: Obtain the maximum single training duration tmax from the single training durations. When the maximum single training duration tmax is less than or equal to the set duration t′, multiply the gradient adjustment coefficient α by the preset gradient S0 (greater than 0) to obtain the product S = α × S0, and use the product S as the adjusted gradient value. In this embodiment, the set duration t′ = t0 + Δt1, where Δt1 is the first preset time difference.
While the gradient is being reduced:
If ΔB < ΔB1, the central control module determines not to adjust the gradient.
If ΔB > ΔB2, the central control module uses α3 to adjust the gradient; the adjusted gradient is denoted S′, with S′ = S0 × α3.
In one embodiment, when the maximum single training duration is greater than the set duration t′, step S306b is followed by the following steps S307b to S3010b:
S307b: Update the training data set, train the deep learning model on the updated training data set, and obtain the updated model training accuracy corresponding to the deep learning model.
S308b: Obtain the updated gradient adjustment coefficient according to the updated model training accuracy.
S309b: Adjust the gradient values of the deep learning model according to the updated gradient adjustment coefficient.
S3010b: Continue to train the deep learning model after the single training pass based on the adjusted gradient values and the training data set, to complete the training of the deep learning model.
S307b to S3010b: If tmax is greater than t′ but less than or equal to another set duration t″ (t″ = t0 + Δt2, where Δt2 is the second preset time difference), the central control module checks the relationship between the proportion B (the number of GPU acceleration cards whose actual computation duration reaches t0 relative to the total number of GPU acceleration cards used in this training pass) and the proportion B0 (the number of GPU acceleration cards with the preset computation duration t0 relative to the total number of GPU acceleration cards used in this training pass), and determines from the result whether to select the asynchronous update method;
If tmax is greater than t″, the central control module selects the asynchronous update method for training.
In this embodiment, by setting the preset timing duration t0, the first preset time difference Δt1, the second preset time difference Δt2 and the preset gradient S0, the gradient can be adjusted according to the accuracy difference, thereby realizing gradient control of the training, further realizing the pre-processing and prioritization of the distributed training method for the deep learning model and improving the efficiency of model training.
In one embodiment, step S300 adjusts the number of local nodes during model training. The specific process is: calculate the scalability of the deep learning model; according to the scalability, adjust the number of nodes used to train the deep learning model.
The central control module determines the number of nodes according to the parallel scalability of the model. The central control module is provided with a preset first scalability limit H1, a preset second scalability limit H2 and a preset node count W0. If H < H1, the central control module reduces the number of nodes, and the reduced number of nodes is 0.5 × W0; if H1 < H < H2, the central control module neither increases nor decreases the number of nodes; if H > H2, the central control module increases the number of nodes, and the increased number of nodes is 1.5 × W0.
By setting the preset first scalability limit, the preset second scalability limit and the preset node count, the number of nodes can be determined and adjusted, further realizing the pre-processing and prioritization of the distributed training method for the deep learning model and improving the efficiency of model training.
In one embodiment, during model training, step S300 also adjusts the weight of each training data set and the weight of the model located on each local node. The specific process includes: obtaining the preset data parallel weight coefficient corresponding to the data parallel weight coefficient of the local distributed storage method; obtaining the preset model parallel weight coefficient corresponding to the model parallel weight coefficient of the deep learning model; training the deep learning model with a parallel training method to obtain the trained deep learning model; calculating the training accuracy of the trained deep learning model; and adjusting the data parallel weight coefficient and the model parallel weight coefficient according to the training accuracy, the preset data parallel weight coefficient and the preset model parallel weight coefficient.
When data parallelism and model parallelism are used together to train on the data, the actual training accuracy for the data is determined by weighted summation, and the weight coefficients can be adjusted according to the actual situation. The central control module is provided with a preset data parallel weight coefficient D0, a preset data parallel training accuracy Xa, a preset model parallel training accuracy Xb, a preset data parallel training accuracy weight coefficient Ka0, a preset model parallel training accuracy weight coefficient Kb0 and a preset model parallel weight coefficient A0, where D0 + A0 = 1 and Ka + Kb = 1. The training accuracy of data parallelism and model parallelism training together is calculated as X′ = Ka × Xa + Kb × Xb.
When X′ > Ka0 × Xa + Kb0 × Xb, the central control module determines that the training accuracy meets the standard and does not adjust the weight coefficients.
When X′ < Ka0 × Xa + Kb0 × Xb, the central control module determines that the training accuracy does not meet the standard and adjusts the data parallel weight coefficient and the model parallel weight coefficient respectively: the adjusted actual data parallel weight coefficient is denoted D′, with D′ = D0 − 0.3D0, and the adjusted actual model parallel weight coefficient is denoted A′, with A′ = A0 + 0.3D0.
By setting the preset data parallel weight coefficient, the preset training accuracy and the preset model parallel weight coefficient, the weight coefficients can be adjusted to improve training accuracy, further realizing the pre-processing and prioritization of the distributed training method for the deep learning model and improving the efficiency of model training.
In one embodiment, when local distributed storage training is performed, the central control module increases the training gradient based on the accuracy of a single training pass; when the accuracy of a single pass of data training does not meet the requirements, the central control module gradually increases the training gradient across the different training nodes. By increasing the training gradient according to the accuracy of a single training pass and, when that accuracy does not meet the requirements, gradually increasing the training gradient across the different training nodes, the pre-processing and prioritization of the distributed training method for the deep learning model is further realized and the efficiency of model training is improved.
As shown in Figure 2, when training is completed using data parallelism, if the central control module determines that the training result does not meet the standard, the central control module determines, based on the training result of the data parallel method, whether re-training combined with model parallelism is needed; the central control module is provided with a preset training duration T0 and a preset overall training accuracy X′0 for the data parallel training method.
In summary, the present invention first selects the storage method for storing the training data set to the local node according to the capacity (data set size) of the training data set required by the deep learning model to be trained, then completes the storage of the training data set on the local node, and finally the local node trains the deep learning model with the training data set. By storing the training data set to the local node according to its capacity, the present invention saves the time required to store the data, which in turn saves the overall time required for training and improves training efficiency. Before training begins, the central control module determines whether to start local distributed or local storage training based on the memory footprint of the data set to be trained, the preset training duration, the preset reading accuracy and the platform's training task conditions, and increases the training gradient according to the training accuracy during a single training pass, realizing the pre-processing and prioritization of the distributed training method for the deep learning model and improving the efficiency of model training.
Exemplary Apparatus
This embodiment also provides a training apparatus for a deep learning model, and the apparatus includes the following components:
a capacity calculation module, configured to obtain the data capacity of a training data set;
a data storage module, configured to store the training data set to a local node according to the data capacity;
a training module, configured to complete the training of the deep learning model based on the training data set stored on the local node.
Based on the above embodiments, the present invention also provides a terminal device, whose functional block diagram may be as shown in Figure 3. The terminal device includes a processor, a memory, a network interface, a display screen and a temperature sensor connected through a system bus. The processor of the terminal device is used to provide computing and control capabilities. The memory of the terminal device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The network interface of the terminal device is used to communicate with an external terminal through a network connection. When the computer program is executed by the processor, a training method for a deep learning model is implemented. The display screen of the terminal device may be a liquid crystal display screen or an electronic ink display screen, and the temperature sensor of the terminal device is provided in advance inside the terminal device to detect the operating temperature of the internal components.
Those skilled in the art can understand that the block diagram shown in Figure 3 is only a block diagram of part of the structure related to the solution of the present invention and does not constitute a limitation on the terminal device to which the solution of the present invention is applied; a specific terminal device may include more or fewer components than shown in the figure, or combine certain components, or have a different arrangement of components.
In one embodiment, a terminal device is provided, which includes a memory, a processor, and a training program for a deep learning model that is stored in the memory and executable on the processor; when the processor executes the training program of the deep learning model, the following operation instructions are implemented:
obtaining the data capacity of a training data set;
storing the training data set to a local node according to the data capacity;
completing the training of the deep learning model based on the training data set stored on the local node.
Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by instructing the relevant hardware through a computer program, and the computer program can be stored in a non-volatile computer-readable storage medium; when executed, the computer program may include the processes of the embodiments of the above methods. Any reference to memory, storage, database or other media used in the embodiments provided by the present invention may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link (Synchlink) DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM) and Rambus dynamic RAM (RDRAM).
Finally, it should be noted that the above embodiments are only used to illustrate the technical solution of the present invention and not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they can still modify the technical solutions recorded in the foregoing embodiments or make equivalent replacements for some of the technical features therein, and that these modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (16)

  1. A training method for a deep learning model, characterized by comprising:
    obtaining the data capacity of a training data set;
    storing the training data set to a local node according to the data capacity;
    completing the training of the deep learning model based on the training data set stored on the local node.
  2. The training method for a deep learning model according to claim 1, characterized in that storing the training data set to a local node according to the data capacity comprises:
    when the data capacity is less than a preset capacity corresponding to the training data set, obtaining a local storage method;
    according to the local storage method, downloading the entire training data set in a database to the local node, wherein the database is located outside the local node.
  3. The training method for a deep learning model according to claim 2, characterized in that completing the training of the deep learning model based on the training data set stored on the local node comprises:
    training the deep learning model on the training data set until the training duration reaches a preset training duration, obtaining the pre-trained deep learning model;
    calculating the model training accuracy of the pre-trained deep learning model;
    completing the training of the deep learning model based on the model training accuracy and a model preset training accuracy corresponding to the deep learning model.
  4. The training method for a deep learning model according to claim 3, characterized in that completing the training of the deep learning model based on the model training accuracy and the model preset training accuracy corresponding to the deep learning model comprises:
    when the model training accuracy is less than the model preset training accuracy, obtaining a local distributed storage method, wherein the number of nodes corresponding to the local distributed storage method is greater than the number of nodes corresponding to the local storage method;
    re-downloading data from the database to the local node according to the local distributed storage method;
    updating the training data set with the re-downloaded data;
    continuing to train the pre-trained deep learning model with the updated training data set to complete the training of the deep learning model.
  5. The training method for a deep learning model according to claim 1, characterized in that storing the training data set to a local node according to the data capacity comprises:
    when the data capacity is greater than or equal to a preset capacity corresponding to the training data set, obtaining a local distributed storage method;
    according to the local distributed storage method, downloading the training data set in the database to the local node in a parallel manner following the training progress of the deep learning model.
  6. The training method for a deep learning model according to claim 5, characterized in that completing the training of the deep learning model based on the training data set stored on the local node comprises:
    obtaining, according to the local distributed storage method, a parallel training method corresponding to the local distributed storage method;
    obtaining, according to the parallel training method, each acceleration card constituting the parallel training method, wherein the acceleration card is a hardware device required for training the deep learning model;
    counting the single training duration required by each acceleration card for a single training pass of the deep learning model;
    counting the number of acceleration cards whose single training duration is greater than a preset timing duration;
    calculating the ratio of the number of acceleration cards whose duration exceeds the preset timing duration to the total number of acceleration cards, obtaining a quantity ratio;
    obtaining a target update method of the training data set according to the quantity ratio;
    updating the training data set according to the target update method;
    completing the training of the deep learning model based on the updated training data set.
  7. The training method for a deep learning model according to claim 5, characterized in that completing the training of the deep learning model based on the training data set stored on the local node comprises:
    training the deep learning model with the training data set to obtain the model training accuracy of the deep learning model after a single training pass;
    obtaining a gradient adjustment coefficient of the deep learning model according to the model training accuracy and the model preset training accuracy corresponding to the deep learning model;
    adjusting the gradient values of the deep learning model according to the gradient adjustment coefficient;
    continuing to train the deep learning model after the single training pass based on the adjusted gradient values and the training data set, to complete the training of the deep learning model.
  8. The training method for a deep learning model according to claim 7, characterized in that adjusting the gradient values of the deep learning model according to the gradient adjustment coefficient comprises:
    obtaining, according to the local distributed storage method, a parallel training method corresponding to the local distributed storage method;
    obtaining, according to the parallel training method, each acceleration card constituting the parallel training method, wherein the acceleration card is a hardware device required for training the deep learning model;
    counting each single training duration required by each acceleration card to train the deep learning model once;
    obtaining the maximum single training duration from the single training durations;
    adjusting the gradient values of the deep learning model according to the maximum single training duration and the gradient adjustment coefficient.
  9. The training method for a deep learning model according to claim 8, characterized in that adjusting the gradient values of the deep learning model according to the maximum single training duration and the gradient adjustment coefficient comprises:
    when the maximum single training duration is less than or equal to a set duration, multiplying the gradient adjustment coefficient by a preset gradient to obtain a product;
    adjusting the gradient values of the deep learning model according to the product;
    or, when the maximum single training duration is greater than the set duration, updating the training data set;
    training the deep learning model on the updated training data set to obtain the updated model training accuracy corresponding to the deep learning model;
    obtaining the updated gradient adjustment coefficient according to the updated model training accuracy;
    adjusting the gradient values of the deep learning model according to the updated gradient adjustment coefficient.
  10. The training method for a deep learning model according to claim 1, characterized in that after completing the training of the deep learning model based on the training data set stored on the local node, the method further comprises:
    counting the total duration required to complete the training of the deep learning model;
    calculating the model accuracy of the deep learning model after training is completed;
    retraining the deep learning model when the total duration is greater than the preset training duration and the model accuracy is less than a preset model accuracy.
  11. The training method for a deep learning model according to claim 6, characterized in that the training method further comprises:
    obtaining a first preset ratio and a second preset ratio corresponding to the quantity ratio, wherein the second preset ratio is greater than the first preset ratio;
    obtaining the model preset training accuracy corresponding to the deep learning model;
    obtaining a first preset accuracy adjustment coefficient and a second preset accuracy adjustment coefficient corresponding to the model preset training accuracy, wherein the second preset accuracy adjustment coefficient is greater than the first preset accuracy adjustment coefficient;
    when the quantity ratio is greater than the first preset ratio and less than the second preset ratio, multiplying the model preset training accuracy by the first preset accuracy adjustment coefficient to obtain the adjusted model preset training accuracy;
    or, when the quantity ratio is greater than the second preset ratio, multiplying the model preset training accuracy by the second preset accuracy adjustment coefficient to obtain the adjusted model preset training accuracy.
  12. The training method for a deep learning model according to claim 1, characterized in that the training method further comprises:
    calculating the scalability of the deep learning model;
    adjusting the number of nodes used to train the deep learning model according to the scalability.
  13. The training method for a deep learning model according to claim 5, characterized in that the training method further comprises:
    obtaining a preset data parallel weight coefficient corresponding to the data parallel weight coefficient of the local distributed storage method;
    obtaining a preset model parallel weight coefficient corresponding to the model parallel weight coefficient of the deep learning model;
    training the deep learning model with a parallel training method to obtain the trained deep learning model;
    calculating the training accuracy of the trained deep learning model;
    adjusting the data parallel weight coefficient and the model parallel weight coefficient according to the training accuracy, the preset data parallel weight coefficient and the preset model parallel weight coefficient.
  14. A training apparatus for a deep learning model, characterized in that the apparatus comprises the following components:
    a capacity calculation module, configured to obtain the data capacity of a training data set;
    a data storage module, configured to store the training data set to a local node according to the data capacity;
    a training module, configured to complete the training of the deep learning model based on the training data set stored on the local node.
  15. A terminal device, characterized in that the terminal device comprises a memory, a processor, and a training program for a deep learning model that is stored in the memory and executable on the processor; when the processor executes the training program of the deep learning model, the steps of the training method for a deep learning model according to any one of claims 1 to 13 are implemented.
  16. A computer-readable storage medium, characterized in that a training program for a deep learning model is stored on the computer-readable storage medium; when the training program of the deep learning model is executed by a processor, the steps of the training method for a deep learning model according to any one of claims 1 to 13 are implemented.
PCT/CN2022/126231 2022-05-26 2022-10-19 一种深度学习模型的训练方法、装置、设备及存储介质 WO2023226284A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210582633.2 2022-05-26
CN202210582633.2A CN114676795B (zh) 2022-05-26 2022-05-26 一种深度学习模型的训练方法、装置、设备及存储介质

Publications (1)

Publication Number Publication Date
WO2023226284A1 true WO2023226284A1 (zh) 2023-11-30

Family

ID=82079923

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/126231 WO2023226284A1 (zh) 2022-05-26 2022-10-19 一种深度学习模型的训练方法、装置、设备及存储介质

Country Status (2)

Country Link
CN (1) CN114676795B (zh)
WO (1) WO2023226284A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117651024A (zh) * 2023-12-01 2024-03-05 北京基流科技有限公司 一种预测数据中心网络链路拥塞的方法

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114676795B (zh) * 2022-05-26 2022-08-23 鹏城实验室 一种深度学习模型的训练方法、装置、设备及存储介质

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111860835A (zh) * 2020-07-17 2020-10-30 苏州浪潮智能科技有限公司 一种神经网络模型训练方法和装置
CN112000473A (zh) * 2020-08-12 2020-11-27 中国银联股份有限公司 深度学习模型的分布式训练方法以及装置
CN112306623A (zh) * 2019-07-31 2021-02-02 株式会社理光 深度学习任务的处理方法、装置及计算机可读存储介质
CN113222118A (zh) * 2021-05-19 2021-08-06 北京百度网讯科技有限公司 神经网络训练方法、装置、电子设备、介质和程序产品
CN113344074A (zh) * 2021-06-02 2021-09-03 北京百度网讯科技有限公司 模型训练方法、装置、设备及存储介质
CN113792885A (zh) * 2021-08-20 2021-12-14 山东英信计算机技术有限公司 一种深度学习训练的执行方法及相关装置
CN114676795A (zh) * 2022-05-26 2022-06-28 鹏城实验室 一种深度学习模型的训练方法、装置、设备及存储介质

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109190450A (zh) * 2018-07-09 2019-01-11 中科遥感科技集团有限公司 基于分布式计算平台的人工智能遥感影像数据提取方法

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112306623A (zh) * 2019-07-31 2021-02-02 株式会社理光 深度学习任务的处理方法、装置及计算机可读存储介质
CN111860835A (zh) * 2020-07-17 2020-10-30 苏州浪潮智能科技有限公司 一种神经网络模型训练方法和装置
CN112000473A (zh) * 2020-08-12 2020-11-27 中国银联股份有限公司 深度学习模型的分布式训练方法以及装置
CN113222118A (zh) * 2021-05-19 2021-08-06 北京百度网讯科技有限公司 神经网络训练方法、装置、电子设备、介质和程序产品
CN113344074A (zh) * 2021-06-02 2021-09-03 北京百度网讯科技有限公司 模型训练方法、装置、设备及存储介质
CN113792885A (zh) * 2021-08-20 2021-12-14 山东英信计算机技术有限公司 一种深度学习训练的执行方法及相关装置
CN114676795A (zh) * 2022-05-26 2022-06-28 鹏城实验室 一种深度学习模型的训练方法、装置、设备及存储介质

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117651024A (zh) * 2023-12-01 2024-03-05 北京基流科技有限公司 一种预测数据中心网络链路拥塞的方法

Also Published As

Publication number Publication date
CN114676795A (zh) 2022-06-28
CN114676795B (zh) 2022-08-23

Similar Documents

Publication Publication Date Title
WO2023226284A1 (zh) 一种深度学习模型的训练方法、装置、设备及存储介质
CN110610242B (zh) 一种联邦学习中参与者权重的设置方法及装置
CN111160531B (zh) 神经网络模型的分布式训练方法、装置及电子设备
WO2019233226A1 (zh) 人脸识别方法、分类模型训练方法、装置、存储介质和计算机设备
CN110991652A (zh) 神经网络模型训练方法、装置及电子设备
JP2022501675A (ja) データ処理方法、装置、コンピュータデバイス、及び記憶媒体
JP2022501677A (ja) データ処理方法、装置、コンピュータデバイス、及び記憶媒体
WO2018133568A1 (zh) 复合模式神经元信息处理方法、系统及计算机设备
WO2022148272A1 (zh) 脉冲神经网络训练方法、数据处理方法、电子设备和介质
CN112862112A (zh) 联邦学习方法、存储介质、终端、服务器、联邦学习系统
CN114547917A (zh) 仿真预测方法、装置、设备及存储介质
CN114912022A (zh) 预测模型训练方法、系统、计算机设备及存储介质
WO2021151324A1 (zh) 基于迁移学习的医疗数据处理方法、装置、设备及介质
CN116863980B (zh) 一种门控信号的动态调节电路和方法
WO2021254498A1 (zh) 一种图像预测方法、设备和存储介质
TW202145078A (zh) 具有動態最小批次尺寸之運算方法,以及用於執行該方法之運算系統及電腦可讀儲存媒體
CN116185568A (zh) 一种容器扩容方法、装置、电子设备及存储介质
CN117009042A (zh) 物联网模式下的信息计算负载调度方法、装置、设备及介质
CN114912627A (zh) 推荐模型训练方法、系统、计算机设备及存储介质
CN112560541A (zh) 目标检测模型的训练装置及方法、电子设备
CN114584476A (zh) 一种流量预测方法、网络训练方法、装置及电子设备
CN110110853B (zh) 一种深度神经网络压缩方法、装置及计算机可读介质
CN112418480A (zh) 气象图像预测方法、装置、计算机设备和存储介质
CN110533158A (zh) 模型建构方法、系统及非易失性电脑可读取记录介质
CN114465957B (zh) 一种数据写入方法及装置

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22943472

Country of ref document: EP

Kind code of ref document: A1