WO2023221360A1 - Training method, apparatus, system, device and medium for a deep learning model - Google Patents
Training method, apparatus, system, device and medium for a deep learning model
- Publication number: WO2023221360A1 (PCT Application No. PCT/CN2022/121697)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- target
- parameter
- processor
- memory
- parameters
- Prior art date
Classifications
- G—PHYSICS; G06—COMPUTING; CALCULATING OR COUNTING; G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS; G06N3/00—Computing arrangements based on biological models; G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology; G06N3/045—Combinations of networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
Definitions
- the present disclosure relates to the field of artificial intelligence, specifically to the field of deep learning and intelligent recommendation, and in particular to a training method, device, system, electronic device and storage medium for a deep learning model.
- the present disclosure aims to provide a training method, device, system, electronic device and storage medium for a deep learning model that is conducive to reducing hardware requirements and enabling large-scale model training.
- a method for training a deep learning model, including: determining, according to the first training data of the current training round, a first target parameter that needs to be written into a target memory among the first network parameters required for embedding the first training data, wherein the target memory is a memory included in a target processor; determining the remaining storage slots in the target memory according to a first mapping relationship between the storage slots of the target memory and network parameters; and, in response to the remaining storage slots meeting the storage requirement of the first target parameter, writing the first target parameter into the target memory, so that a computing core included in the target processor adjusts the first network parameters according to the first training data.
- a method for training a deep learning model, including: a first processor determining, based on the first training data of the current training round, a first target parameter that needs to be written into a target memory among the first network parameters required for embedding the first training data, wherein the target memory is a memory included in a second processor; the first processor determining the remaining storage slots in the target memory based on a first mapping relationship between the storage slots of the target memory and network parameters; the first processor, in response to the remaining storage slots meeting the storage requirement of the first target parameter, writing the first target parameter into the target memory and sending training task information based on the first training data to the second processor; and a computing core of the second processor adjusting the first network parameters according to the first training data in response to receiving the training task information.
- a training device for a deep learning model, including: a target parameter determination module, configured to determine, according to the first training data of the current training round, a first target parameter that needs to be written into a target memory among the first network parameters required for embedding the first training data, wherein the target memory is a memory included in a target processor; a remaining slot determination module, configured to determine the remaining storage slots in the target memory according to a first mapping relationship between the storage slots of the target memory and network parameters; and a parameter writing module, configured to write the first target parameter into the target memory in response to the remaining storage slots meeting the storage requirement of the first target parameter, so that a computing core included in the target processor adjusts the first network parameters according to the first training data.
- a deep learning model training system including a first processor and a second processor.
- The second processor includes a target memory and a computing core. The first processor is configured to: determine, according to the first training data of the current training round, a first target parameter that needs to be written into the target memory among the first network parameters required for embedding the first training data; determine the remaining storage slots in the target memory according to a first mapping relationship between the storage slots of the target memory and network parameters; and, in response to the remaining storage slots meeting the storage requirement of the first target parameter, write the first target parameter into the target memory and send training task information based on the first training data to the second processor. The second processor is configured such that its computing core adjusts the first network parameters according to the first training data in response to receiving the training task information.
- an electronic device including: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to perform the training method of the deep learning model provided by the present disclosure.
- a non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause the computer to execute the training method of the deep learning model provided by the present disclosure.
- a computer program product including a computer program/instruction that, when executed by a processor, implements the training method of the deep learning model provided by the present disclosure.
- Figure 1 is an application scenario architecture diagram of a deep learning model training method, device and system according to an embodiment of the present disclosure
- Figure 2 is a schematic flowchart of a training method for a deep learning model according to an embodiment of the present disclosure
- Figure 3 is a schematic flowchart of a training method for a deep learning model according to another embodiment of the present disclosure
- Figure 4 is a schematic structural diagram of a processor cache according to an embodiment of the present disclosure.
- Figure 5 is an overall flow diagram of a training method for a deep learning model according to an embodiment of the present disclosure
- Figure 6 is a communication topology diagram of a single-machine multi-card processor according to an embodiment of the present disclosure
- Figure 7 is a schematic diagram of the principle of training a model in the form of an asynchronous pipeline according to an embodiment of the present disclosure
- Figure 8 is a structural block diagram of a training device for a deep learning model according to an embodiment of the present disclosure
- Figure 9 is a structural block diagram of a training system for a deep learning model according to an embodiment of the present disclosure.
- FIG. 10 is a block diagram of an electronic device used to implement the training method of a deep learning model according to an embodiment of the present disclosure.
- a CPU- or GPU-based parameter server architecture can be used to conduct distributed training of large-scale sparse parameters in order to improve training efficiency.
- Parameter server architectures may include, for example, HugeCTR, Paddle-GPUPS, and Persia.
- HugeCTR is a framework that uses GPUs to accelerate recommendation model training.
- the framework supports multi-machine and multi-card acceleration.
- the framework supports hybrid training of model parallel training for embedding layers with sparse parameter distribution and data parallel training for networks with dense parameter distribution.
- HugeCTR divides the embedding layer into multiple parts and assigns them to multiple machines and cards respectively.
- Each GPU saves a part of the global embedding layer, and each GPU has a complete network with densely distributed parameters.
- the global sample data can be randomly shuffled and divided, and different sample data can be allocated to each GPU for data parallel training.
- HugeCTR supports two methods of storing the embedding layer: one caches the sparse parameters belonging to the same slot in the graphics card memory of the same GPU; the other breaks up the entire set of sparse parameters and stores them in the graphics card memory of different GPUs. In these methods, some sparse parameters are cached repeatedly, which wastes graphics card memory to a certain extent. Moreover, HugeCTR requires multiple CPUs to participate in model training, which results in high training costs.
- HBM (High Bandwidth Memory)
- Before training begins, this architecture first loads the sparse parameters required for embedding the features of the data currently acquired in a pass from the CPU memory to the graphics card memory. When loading, the sparse parameters required for the same feature group are broken up and stored in different graphics card memories. In this way, when training a model based on a batch of data extracted from a pass, each GPU needs to copy the required sparse parameters from the memory of other graphics cards based on the feature identifier.
- In this architecture, the communication overhead between GPUs is relatively large, and because an HBM hash table is built and stored on each GPU, a large graphics card memory is required.
- Persia is a recommendation model training framework for large-scale heterogeneous cluster training.
- This framework allows the maximum number of trainable model parameters to reach the trillion level.
- the framework updates the embedding layer asynchronously and updates the network with densely distributed parameters synchronously. Through system optimization, part of the communication process and calculation process can overlap in time.
- This framework introduces the role of Embedding Worker into the traditional framework, splitting the training and update tasks of the embedding layer from the training task of the overall model and handing them over to the Embedding Worker for execution. To introduce the Embedding Worker, this framework requires more CPUs, which increases the training cost of the model.
- AI: artificial intelligence chips
- DPU: deep learning processor
- NPU: Neural Network Processing Unit
- TPU: Tensor Processing Unit
- the second-generation Kunlun Core chip is a general-purpose AI chip using GDDR6 video memory.
- the chip runs based on the XPU-R architecture, which can significantly improve the core computing power and enhance the chip's general computing capabilities.
- Figure 1 is an application scenario diagram of a deep learning model training method, device and system according to an embodiment of the present disclosure.
- the application scenario 100 includes electronic equipment, which may be a laptop computer, a desktop computer, a server, etc.
- the electronic device is provided with a processor CPU 110, an artificial intelligence chip 120, a memory 130 and a hard disk memory 140.
- Memory 130 refers to memory storage, which is the space used by the CPU 110 for direct addressing and storage. This memory can temporarily store operating data in the CPU, as well as data exchanged with external memories such as hard disks. As long as the computer is running, the CPU will transfer the data that needs to be calculated into the memory for calculation. When the calculation is completed, the CPU will send the results out.
- the memory 130 can be, for example, a random access memory, so that the CPU can read data from it and also write data.
- the hard disk memory 140 may be, for example, a solid state disk (SSD) with an NVMe interface, which is not limited in this disclosure.
- the artificial intelligence chip has data processing capabilities and can assist the CPU in its work and improve the overall running speed.
- the artificial intelligence chip may include, for example, the DPU, NPU or TPU described above.
- the artificial intelligence chip 120 may include a computing core, a video memory and related circuits.
- The video memory is the display memory 150, which is the dedicated memory of the artificial intelligence chip and is used to store the rendering data processed or to be fetched by the computing core. Similar to the memory 130, the display memory 150 is used to store information such as the model parameters to be processed and training samples.
- the computing core in the artificial intelligence chip 120 cannot directly read the data in the memory 130.
- the computing core can only read data from the display memory 150.
- the CPU can allocate computing tasks to the computing core.
- data interaction can occur between the memory 130 and the display memory 150 to transfer the data required by the computing core to perform the computing task.
- the data is copied from the memory 130 to the display memory 150, or the data in the memory 130 is directly transferred to the display memory 150.
- the CPU 110 can, for example, assign the training task to the artificial intelligence chip 120 and transfer the model from the memory 130 to the display memory 150.
- the model may be stored in a hard disk storage space provided by the hard disk memory 140 .
- a three-level cache space composed of display memory 150, memory 130 and hard disk storage 140 is established.
- The CPU 110 can read data from the hard disk memory 140 and cache it in the memory 130. While the CPU 110 allocates training tasks to the artificial intelligence chip 120, the model parameters involved in the computing task currently performed by the computing core are transferred from the memory 130 to the display memory 150, and data that has been processed by the computing core and is stored in the display memory 150 is transferred from the display memory 150 back to the memory 130, so as to avoid insufficient storage space in the display memory.
- the electronic device may be provided with multiple artificial intelligence chips, for example.
- the multiple artificial intelligence chips may execute model training tasks in parallel based on different training samples, thereby improving the training efficiency of the model.
- the training method of the deep learning model provided by the present disclosure can be executed by an electronic device, and specifically can be implemented by a CPU or an artificial intelligence chip calling the corresponding program code.
- the deep learning model training device and the deep learning model training system provided by the present disclosure can be installed in an electronic device.
- Figure 2 is a schematic flowchart of a training method for a deep learning model according to an embodiment of the present disclosure.
- the deep learning model training method 200 of this embodiment may include operations S210 to S230.
- the method 200 may, for example, be executed by a CPU in the electronic device described above.
- In operation S210, according to the first training data of the current training round, the first target parameter that needs to be written into the target memory is determined among the first network parameters required for embedding the first training data.
- In operation S220, the remaining storage slots in the target memory are determined according to the first mapping relationship between the storage slots of the target memory and the network parameters.
- In operation S230, in response to the remaining storage slots meeting the storage requirement of the first target parameter, the first target parameter is written into the target memory, so that the computing core included in the target processor adjusts the first network parameters based on the first training data.
- the target memory may be, for example, a memory included in the target processor.
- the target processor can be, for example, the artificial intelligence chip described above, or it can also be a graphics processor (GPU), etc.
- the target processor can receive the computing tasks assigned by the CPU and execute the assigned computing tasks based on the data stored in the target memory.
- the computing tasks may include, for example, model training tasks to train a deep learning model.
- Deep learning models may include, for example, image processing models, speech processing models, or text processing models.
- the deep learning model can be a recommendation model. This embodiment can train the recommendation model through methods such as gradient descent based on massive user interaction behavior information on recommended objects. After the model parameters of the recommendation model converge, personalized recommendations can be made to users.
- a deep learning model may include, for example, an embedding layer and a prediction network.
- the embedding layer is used to embed the data input to the deep learning model to project the input data from a high-dimensional sparse space to a low-dimensional dense feature space.
- the first network parameters required for embedding the first training data are the network parameters in the embedding layer.
- the first network parameter may be determined, for example, by calling a kernel function.
- For example, the determined first network parameters can be compared with the network parameters stored in the target memory, and the network parameters among the first network parameters that are not stored in the target memory can be determined as the first target parameters that need to be written into the target memory.
- Alternatively, the first network parameters can be compared with the network parameters stored in the memory and/or the hard disk memory, and the first network parameters that are stored in the memory and/or the hard disk memory can be used as the first target parameters. It can be understood that when comparing network parameters, the comparison can be performed based on the feature signature (Feasign for short) of the data embedded based on the network parameters.
- a training sample can include feature data of multiple objects, each object can include multiple feature data, and one feature data corresponds to a feature identifier.
- For feature data with a given feature identifier, fixed network parameters need to be used for embedding processing.
- embodiments of the present disclosure can store the network parameters of the embedding layer according to the corresponding relationship between the network parameters and the feature data, and add the feature identifier of the corresponding feature data to the network parameters.
- the CPU can maintain a mapping relationship table between the feature identifier of the feature data and the network parameters stored in the target memory in the cache or memory.
- The mapping relationship table uses the feature identifier as the Key and the identification information of the network parameters as the Value.
- This embodiment can query the mapping relationship table according to the feature identifiers of the feature data included in the first training data, determine the feature identifiers that do not exist in the mapping relationship table, and use, among the first network parameters, the network parameters for embedding the feature data identified by those absent feature identifiers as the first target parameters.
- the network parameters stored in the target memory may be stored in slots (Slots), for example, and the network parameters stored in each slot are all network parameters corresponding to one feature data. That is, network parameters can be stored in groups, and all network parameters corresponding to one feature data constitute a network parameter group. In this way, the target memory can be divided into multiple storage slots, and each storage slot is used to store a network parameter group.
- this embodiment may first determine whether the storage space in the target memory is sufficient, and only if the storage space is sufficient, write the first target parameter into the target memory.
- this embodiment may maintain a first mapping relationship between storage slots and network parameters in the cache or memory of the CPU.
- the first mapping relationship can be stored in the form of a mapping table. Since network parameters correspond to feature data one-to-one, this embodiment can use feature identifiers of feature data to represent network parameters and code storage slots in the target memory. In this way, the first mapping relationship can be expressed as a mapping table between Feasign and FId, using the feature identifier of the feature data as Key and the encoding of the storage slot (set as FId) as Value. In this way, this embodiment can determine the remaining storage slots in the target memory according to the first mapping relationship.
- For example, if the total storage space of the target memory is divided into 100 storage slots and the codes of the 100 storage slots are the integers 0 to 99, and the first mapping relationship only includes mapping information for the slots coded 0 to 49, it can be determined that there are 50 remaining storage slots.
- Specifically, this embodiment can compare the number of remaining storage slots with the number of network parameter groups in the first target parameter. If the number of network parameter groups in the first target parameter is less than the number of remaining storage slots, it is determined that the remaining storage slots meet the storage requirement of the first target parameter, and the CPU can transfer the first target parameter from the memory to the target memory.
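- The slot management described above can be illustrated with a minimal sketch; the class and method names (CacheManager, missing_feasigns, try_allocate) are hypothetical and not part of the disclosure, and the sketch only shows the CPU-side bookkeeping of the FeaSign-to-FId mapping and the capacity check.

```python
class CacheManager:
    """Sketch of the CPU-side manager for the first mapping (FeaSign -> FId)."""

    def __init__(self, num_slots):
        self.num_slots = num_slots       # total storage slots in the target memory
        self.feasign_to_fid = {}         # first mapping: feature identifier -> slot code

    def missing_feasigns(self, feasigns):
        """Feature identifiers whose parameter groups are not cached yet (the first target parameters)."""
        return [f for f in dict.fromkeys(feasigns) if f not in self.feasign_to_fid]

    def remaining_slots(self):
        return self.num_slots - len(self.feasign_to_fid)

    def try_allocate(self, target_feasigns):
        """Allocate slots for the target parameters only when the remaining slots are sufficient."""
        if len(target_feasigns) > self.remaining_slots():
            return None                  # caller must first transfer out unneeded parameters
        used = set(self.feasign_to_fid.values())
        free = [fid for fid in range(self.num_slots) if fid not in used]
        allocation = dict(zip(target_feasigns, free))
        self.feasign_to_fid.update(allocation)
        return allocation                # one slot per parameter group, driving the group-by-group copy
```

- Under this sketch, with 100 slots of which slots 0 to 49 are already mapped, remaining_slots() returns 50, matching the example above.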
- When writing, the first target parameter may be written into the target memory group by group according to the grouping described above.
- Embodiments of the present disclosure can manage the storage space of the graphics card memory by maintaining the first mapping relationship in the CPU, determining the remaining storage space of the target memory based on the first mapping relationship, and controlling the writing of network parameters accordingly. This avoids the huge pressure on the graphics card memory that would be caused by the large number of network parameters required for embedding processing during model training, which helps reduce the hardware requirements of large-scale model training and helps realize the training of large-scale models.
- Since the first mapping relationship in this embodiment is maintained in the memory or cache accessible to the CPU, compared with the technical solution in the related art in which a hash table representing the mapping relationship is stored in the graphics card memory, the graphics card memory can be fully used for model training, which also helps reduce the pressure on the graphics card memory and reduces the communication overhead between the CPU and the target processor.
- When writing the first target parameter into the target memory, this embodiment may first allocate storage slots among the remaining storage slots for the first target parameter, and write the first target parameter into the allocated storage slots. For example, if the first target parameter includes network parameters corresponding to 10 feature data, and the storage slots coded 0 to 49 in the target memory already store network parameters, then the storage slots coded 50 to 59 can be allocated to the first target parameter.
- After the storage slots are allocated, the first mapping relationship can be updated according to the coding of the allocated storage slots (that is, the identification information of the storage slots) and the identification information of the first target parameter (that is, the feature identifiers of the feature data corresponding to the first target parameter).
- This embodiment can also write the third network parameters required for predictive processing of the training data into the target memory, to be called by the computing core of the target processor, and adjust the third network parameters according to the calling result.
- the third network parameters may be, for example, network parameters included in the prediction network, and the prediction network may include, for example, a multilayer perceptron (MLP).
- the training process of deep learning models usually includes three parts.
- the first part is the forward calculation process to calculate the loss of the deep learning model
- the second part is the reverse calculation process to calculate the gradient
- the third part is the process of updating the network parameters of the deep learning model based on the gradient.
- the calculation core may adjust the first network parameter and the third network parameter according to the gradient obtained by reverse calculation, so that the network parameters of the deep learning model gradually converge.
- When the remaining storage slots do not meet the storage requirements of the first target parameter, the CPU can, for example, transfer out the temporarily unnecessary network parameters in the target memory to leave sufficient space for the first target parameter, thereby providing conditions for the subsequent training of the deep learning model.
- the size of the cache space in the target memory can be dynamically adjusted, and combined with the maintenance of the first mapping relationship in the memory, the communication overhead between the CPU and the target processor can be effectively reduced.
- the CPU may also maintain a second mapping relationship between the storage slot and the parameter status of the network parameter stored in the storage slot in the cache or memory, as a basis for determining the transferable network parameters.
- the parameter status of a network parameter may include a reference status.
- If the network parameters are required for the current training round, the reference status is set to the referenced status; if the network parameters are not required for the current training round, the reference status is set to the unreferenced status.
- For example, the reference status can be represented by a reference count (RefCount for short). If the value of the reference count is 1, it indicates the referenced status; if the value of the reference count is 0, it indicates the unreferenced status.
- the second mapping relationship is represented by a mapping table composed of the correspondence relationship between FId, FeaSign and RefCount described above.
- Each FeaSign corresponds to its own RefCount, which is used to represent whether the network parameters required for embedding the feature data identified by that FeaSign are referenced.
- the network parameter corresponding to the FeaSign whose RefCount value is 0 in the second mapping relationship can be used as the transferable network parameter.
- the parameter status of a network parameter may include the number of uses.
- Each time the network parameters are used, the number of uses is increased by 1, and the initial value of the number of uses can be 0.
- the number of uses can be represented by frequency (Frequency Count, FreqCount for short).
- the second mapping relationship is represented by a mapping table composed of the correspondence relationship between FId, FeaSign and FreqCount described above.
- Each FeaSign corresponds to its own FreqCount, which is used to represent the number of times the network parameters required for embedding the feature data identified by that FeaSign have been used.
- the network parameter corresponding to FeaSign whose value of FreqCount in the second mapping relationship is less than the threshold can be used as the transferable network parameter.
- the parameter status of a network parameter can include not only the reference status, but also the number of uses.
- the second mapping relationship is represented by a mapping table composed of the correspondence relationships between FId, FeaSign, RefCount and FreqCount described above. Each FeaSign corresponds to its own RefCount and FreqCount.
- network parameters corresponding to FeaSign whose reference status is not referenced and whose usage count is less than the threshold can be used as transferable network parameters.
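- A minimal sketch of how the parameter status could be tracked and used to pick transferable parameters follows; the FeaMeta and transferable_slots names are hypothetical, and the frequency threshold is an assumed tunable rather than a value given in this document.

```python
from dataclasses import dataclass

@dataclass
class FeaMeta:
    """Label information kept in the second mapping (slot code FId -> FeaMeta)."""
    feasign: int          # feature identifier of the cached parameter group
    ref_count: int = 0    # 1 if the group is referenced by the current training round, else 0
    freq_count: int = 0   # number of times the group has been used for embedding

def transferable_slots(fid_to_meta, freq_threshold):
    """Slots whose parameter groups are unreferenced and rarely used, i.e. candidates to transfer out."""
    return [fid for fid, meta in fid_to_meta.items()
            if meta.ref_count == 0 and meta.freq_count < freq_threshold]
```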
- unnecessary network parameters can be transferred out in a timely manner according to the needs, leaving enough storage slots for the training of the deep learning model, which is beneficial to improving the training efficiency of the deep learning model.
- this embodiment can also compare the first network parameter with the network parameters stored in the target memory, and use the network parameters that do not belong to the first network parameter and have a reference status of not referenced as transferable network parameters. For example, when determining the transferable network parameters, the number of groups of network parameters that need to be transferred out can also be determined as the number of target groups based on the number of characteristic data corresponding to the first target parameter. Then several network parameters of the target group that are in an unreferenced state and are used less frequently are determined as transferable network parameters.
- The transferable network parameters can be transferred from the target memory to the memory, and after the transferable network parameters are transferred out, the first target parameters are written into the target memory. It can be understood that, similar to what is described above, when the first target parameter is written into the target memory, the remaining storage slots in the target memory may be allocated to the first target parameter; the remaining storage slots here include the storage slots where the transferable network parameters were located. The first target parameter is then written into the allocated storage slots. After the storage slots are allocated, this embodiment can also update the first mapping relationship and the second mapping relationship described above based on the identification information of the feature data corresponding to the first target parameter and the coding of the storage slots allocated for the first target parameter.
- the parameter status of the first network parameter also needs to be updated.
- the reference state of the first network parameter can be changed to the referenced state, that is, the RefCount of FeaSign corresponding to the first network parameter is changed from 0 to 1.
- After the computing core completes the adjustment of the first network parameters, this embodiment may also update the second mapping relationship to update the reference status of the first network parameters; specifically, the RefCount of the FeaSign corresponding to the first network parameters can be changed from 1 back to 0.
- a three-level cache structure composed of the target memory, the memory, and the hard disk memory may be used to reduce the storage pressure on the memory and the target memory.
- the first target parameter can be read from the memory or the hard disk storage.
- the memory can be a cache of the hard disk memory.
- the CPU can write the data cached in the memory to the hard disk memory.
- the CPU when the CPU transfers the transferable network parameters from the target storage to the memory, it can first determine whether the remaining storage space of the memory is less than the space threshold. If it is less than the space threshold, the memory is used as a cache, and the transferable network parameters are written to the hard disk storage via the memory. That is, the transferable network parameters are cached in the memory, and the transferable network parameters cached in the memory are written into the hard disk storage.
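- The write-through behaviour of the three-level cache can be sketched as follows; the function and parameter names (transfer_out, space_threshold, ssd_dir) are hypothetical, and serializing each parameter group to its own file is only one possible layout.

```python
import os
import pickle

def transfer_out(evicted_groups, host_cache, free_host_bytes, space_threshold, ssd_dir):
    """Cache parameter groups evicted from the target memory in host memory,
    and spill them on to the SSD when host memory is nearly full."""
    host_cache.update(evicted_groups)               # groups copied out of the target memory
    if free_host_bytes < space_threshold:           # remaining host memory below the threshold
        for feasign, group in list(host_cache.items()):
            with open(os.path.join(ssd_dir, f"{feasign}.bin"), "wb") as f:
                pickle.dump(group, f)               # written to the hard disk storage via the memory
            del host_cache[feasign]
```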
- the first network parameters required for embedding the first training data may be determined in the manner described above. Specifically, the feature data included in the first training data may be determined first, and all network parameters corresponding to the feature data may be used as the first network parameters. Then, the first network parameters are deduplicated to obtain the deduplicated network parameters. For example, the first network parameter may be deduplicated according to the identification information of the feature data. Then, according to the first mapping relationship and the identification information of the deduplicated network parameters, the network parameters that are not stored in the target memory among the deduplicated network parameters are determined, and the determined network parameters are used as the first target parameters.
- For example, the feature data included in the first training data can first be deduplicated based on the identification information of the feature data, and the network parameters required for embedding the deduplicated feature data are then used as the deduplicated network parameters.
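- A small sketch of this deduplication step follows, assuming each training sample is represented as a list of feature identifiers (the representation is illustrative):

```python
def dedup_feasigns(batch):
    """Collect the distinct feature identifiers in the first training data, so that each
    embedding-layer parameter group is looked up and written at most once."""
    seen, ordered = set(), []
    for sample in batch:              # sample: list of feature identifiers (FeaSign) in one training datum
        for feasign in sample:
            if feasign not in seen:
                seen.add(feasign)
                ordered.append(feasign)
    return ordered
```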
- Since the first training data usually includes multiple training data, and different training data may include the same feature data, writing all the determined first network parameters into the target memory could cause the same network parameters to be written into multiple slots of the target memory.
- The embodiments of the present disclosure can avoid the above situation by deduplicating the first network parameters, thereby reducing the waste of storage space of the target memory. This is beneficial to improving the utilization of the storage space of the target memory, reducing the pressure that large-scale model training puts on the target memory, and thus facilitating the training of large-scale models.
- After the first target parameter is written into the target memory, the training task information based on the first training data can also be sent to the target processor, so that the computing core of the target processor can process the first training data according to the first network parameters stored in the target memory and adjust the first network parameters according to the processing results.
- The present disclosure also provides another model training method, which will be described in detail below with reference to Figure 3.
- Figure 3 is a schematic flowchart of a training method for a deep learning model according to another embodiment of the present disclosure.
- the deep learning model training method 300 of this embodiment may include operations S310 to S340.
- the model training method 300 may be performed by the electronic device described above.
- the first processor determines, based on the first training data of the current training round, the first target parameter that needs to be written into the target memory among the first network parameters required for embedding the first training data.
- the first processor may be the CPU described above
- the target memory may be a memory included in the second processor.
- the second processor is similar to the target processor described above.
- the implementation of operation S310 is similar to the operation S210 described above, which will not be described again.
- In operation S320, the first processor determines the remaining storage slots in the target memory according to the first mapping relationship between the storage slots of the target memory and the network parameters. This operation S320 is similar to the operation S220 described above and will not be described again here.
- In operation S330, in response to the remaining storage slots meeting the storage requirement of the first target parameter, the first processor writes the first target parameter into the target memory and sends training task information based on the first training data to the second processor.
- In operation S340, the computing core of the second processor adjusts the first network parameters according to the first training data in response to receiving the training task information.
- the implementation of writing the first target parameter into the target memory is similar to the implementation of operation S230 described above, and will not be described again here.
- the first processor may send the training task information based on the first training data to the second processor after writing the first target parameter into the target memory.
- After receiving the training task information, the computing core of the second processor can directly call the first network parameters stored in the target memory to process the first training data, perform reverse calculation according to the processing results to obtain gradient data for the first training data, and adjust the first network parameters according to the gradient data.
- the first processor may also send training task information based on the first training data to the second processor in the process of writing the first target parameter into the target memory.
- the computing core of the second processor can gradually call the network parameters stored in the target memory after receiving the training task information.
- If the required network parameters have not yet been written into the target memory, the computing core can temporarily suspend the training task until the required network parameters can be read from the target memory.
- the first processor may also write the first training data into the cache of the second processor.
- the training task information may include, for example, forward calculation task information, reverse calculation task information, parameter update task information, etc.
- the task information of the forward calculation may include, for example, the calling information of the first training data, the calling information of the network parameters, the calculation information of the loss, etc.
- the calling information of the network parameters may include identification information of the network parameters that need to be called, calling sequence information of the network parameters, etc.
- the task information of reverse calculation may include, for example, learning rate and other information
- the task information of parameter update may include, for example, adjustment step size, etc.
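- The kinds of fields listed above could be grouped into a single message; the following dataclass is a hypothetical illustration of such training task information, with the field names and default values chosen for the example only.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class TrainingTaskInfo:
    # forward-calculation task information
    batch_id: int                                            # which batch of the first training data to read
    feasign_order: List[int] = field(default_factory=list)   # calling order of the parameter groups
    loss_name: str = "cross_entropy"                         # how the loss is calculated (illustrative)
    # reverse-calculation task information
    learning_rate: float = 1e-3
    # parameter-update task information
    step_size: float = 1.0                                   # adjustment step size
```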
- Embodiments of the present disclosure can manage the storage space of the graphics card memory by maintaining the first mapping relationship in the CPU, determining the remaining storage space of the target memory based on the first mapping relationship, and controlling the writing of network parameters accordingly. This avoids the huge pressure on the graphics card memory that would be caused by the large number of network parameters required for embedding processing during model training, which helps reduce the hardware requirements of large-scale deep learning model training and facilitates the training of large-scale deep learning models.
- Since the first mapping relationship in this embodiment is maintained in the memory or cache accessible to the CPU, compared with the technical solution in the related art in which a hash table representing the mapping relationship is stored in the graphics card memory, the graphics card memory can be fully used for model training, which also helps reduce the pressure on the graphics card memory and saves communication overhead between the CPU and the target processor.
- This embodiment can also write the third network parameters required for predictive processing of the training data into the target memory, to be called by the computing core of the target processor, and adjust the third network parameters according to the calling result.
- Figure 4 is a schematic structural diagram of a processor cache according to an embodiment of the present disclosure.
- the structure of the processor cache may include a memory 410 and a target memory 420.
- This embodiment is described by taking the target memory 420 as the graphics card memory as an example. It can be understood that the target memory 420 can be any high bandwidth memory (High Bandwidth Memory, HBM).
- the first hash table 411 and the second hash table 412 may be maintained in the memory 410 .
- the first hash table 411 is used to represent the first mapping relationship described above
- the second hash table 412 is used to represent the second mapping relationship described above.
- the Key in the first hash table is the identification information FeaSign of the feature data
- the Value in the first hash table is the number of the storage slot in the graphics card memory 420 .
- the Key in the second hash table is the number of the storage slot in the graphics card memory 420
- the Value is the label information of the feature data (Feature Meta, referred to as FeaMeta).
- The label information may include the identification information FeaSign of the feature data, as well as the reference status RefCount and the number of uses FreqCount of the network parameters required for embedding processing.
- For example, the graphics card memory 420 in this embodiment is allowed to store at most 100 groups of network parameters for embedding 100 feature data. Accordingly, the graphics card memory 420 includes 100 storage slots, numbered 0, 1, 2, ..., 98, 99.
- the data cached in each storage slot may include a set of embedding layer network parameters and hyperparameters required to adjust the set of embedding layer network parameters.
- When performing the corresponding operations of the deep learning model training method described above, the processor CPU 430 can determine the number of available storage slots in the graphics card memory 420 by querying the first hash table, allocate storage slots for the target parameters that are required for embedding the training data and need to be written into the graphics card memory 420, and query, add or delete the information in the first hash table 411 and the second hash table 412 stored in the memory 410.
- When performing the corresponding operations of the training method of the deep learning model, the CPU 430 can also copy the data that needs to be cached into the graphics card memory 420 into the allocated storage slots, and copy the relevant network parameters out of the graphics card memory 420 after the target processor such as a GPU has completed the adjustment of the network parameters.
- the CPU 430 essentially plays the role of a cache manager.
- the graphics card memory 420 may be a memory in an artificial intelligence chip. Specifically, it can also be the memory in the Kunlun second-generation chip. In this way, this embodiment can make full use of the computing power of the Kunlun second-generation chip when executing the training method of the deep learning model, which is beneficial to the training of large-scale recommendation models.
- multiple target processors can be provided in an electronic device, so that the target processors can train the deep learning model in parallel based on different training data, thereby improving model training efficiency.
- the target processor described above includes multiple processors.
- multiple batches of data can be obtained, and the data of the multiple batches constitute the first training data.
- only the network parameters required for embedding the data of each batch can be written into the target memory of the processor corresponding to each batch, thereby reducing the cache pressure of the target memory in the target processor.
- Specifically, this embodiment may first determine, among the first target parameters, the parameters required for embedding the batch of data corresponding to each processor as the specific parameters of that processor. Subsequently, predetermined parameters are used to replace the parameters in the first target parameters other than the specific parameters, thereby obtaining the parameters to be written for each processor.
- the number of parameters in the parameters to be written is the same as the number of parameters of the first target parameter. Subsequently, the parameters to be written are written into the target memory included in each processor according to the storage slot allocated for the first target parameter. In this way, the number of network parameters stored in multiple target memories included in multiple target processors and the distribution of network parameters can be made the same. Predetermined parameters can be null. In this way, in addition to reducing the cache pressure of the target memory in the target processor, network parameters can be synchronized through communication between multiple target processors.
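- The replacement of non-specific parameters with a predetermined placeholder can be sketched as follows; the function name and the use of None as the predetermined (null) parameter are assumptions made for illustration.

```python
def build_write_buffers(target_feasigns, params, batch_feasigns_per_device, placeholder=None):
    """For each device, keep the parameter groups its own batch needs and replace the rest
    with the placeholder, so every target memory gets the same slot count and layout."""
    buffers = []
    for device_feasigns in batch_feasigns_per_device:
        needed = set(device_feasigns)
        buffers.append([params[f] if f in needed else placeholder
                        for f in target_feasigns])   # same length and order on every device
    return buffers
```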
- multiple target processors can synchronize the gradient data of the calculated network parameters based on the network parameters stored in their target memories and the slots where the network parameters are located. In this way, the communication overhead between the target processor and the CPU can be reduced.
- the computing core of each processor can perform forward calculation and reverse calculation based on the training data and network parameters of a batch corresponding to each processor to obtain gradient data for the first network parameter.
- the computing core can obtain the network parameters for embedding and predicting the feature data from the target memory based on the feature data in the corresponding training data of a batch, and process the feature data according to the network parameters to obtain the processing results. Then the loss of the deep learning model for the data of the batch is determined based on the processing results, thereby completing the forward calculation task. Subsequently, based on the loss and the network parameters that embedding and predicting the feature data, the back propagation algorithm is used to calculate the gradient data for the first network parameter, thereby completing the reverse calculation task.
- The gradient data for the first network parameters obtained by other target processors can be obtained through communication with those target processors, based on the storage slots where the first network parameters are located.
- gradient data for the third network parameter used for prediction processing obtained by other target processors can also be obtained through communication with other target processors.
- Figure 5 is an overall flow diagram of a deep learning model training method according to an embodiment of the present disclosure.
- the deep learning model training method 500 of this embodiment may include operations S501 to S518. Except that operations S509 to S512 are performed by the target processor described above, other operations are performed by the CPU.
- batches of data are obtained.
- a predetermined number of sample data can be obtained from the hard disk storage or an external database to train the deep learning model.
- The data is globally shuffled to improve the randomness of the training data obtained in batches.
- batch_size*cards of training data can be randomly obtained from the batch of data as the first training data described above.
- the number of cards refers to the number of target processors set in the electronic device.
- batch_size can be set according to actual needs.
- the batch_size can be determined according to the storage capacity of the target memory in the target processor.
- the number of network parameters required for embedding batch_size training data can be related to the storage capacity of the target memory.
- the number of storage slots in the target memory may be twice the number of sets of network parameters required for embedding processing.
- In operation S504, it is determined whether the remaining storage slots of the target memory are sufficient. If sufficient, operations S505 to S513 are performed; otherwise, operations S514 to S516 are performed. It can be understood that the multiple target processors can be processors of the same model, so that the storage capacities of the multiple target memories included in the multiple target processors are equal.
- a deduplication process is performed on the network parameters required for the embedding process of the first training data to obtain the deduplication network parameters described above.
- an increment relative to the cached parameter in the target memory is determined. That is, the network parameters after deduplication are compared with the network parameters stored in the target memory according to the first mapping relationship to determine the network parameters that need to be written into the target memory, and obtain the first target parameters described above.
- the newly added network parameters are copied (Pull) into the target memory.
- Specifically, the parameters to be written to each target memory can be determined according to the predetermined parameters as described above, and the parameters to be written are written into the allocated storage slots.
- each target processor can call the network parameters in the target memory and perform operations S509 to S512 based on a batch of training samples.
- the third network parameters of the prediction network may also be copied into a target memory included in each of the plurality of target processors.
- a forward calculation task is performed to obtain the loss of the deep learning model for the training samples of the batch.
- a reverse calculation task is performed to calculate gradient data for a batch of training samples according to the loss.
- the gradient data should include gradient data of the first network parameter and gradient data of the third network parameter.
- A global reduction algorithm (AllReduce) is used to aggregate the gradient data obtained by the multiple target processors. It can be understood that when aggregating the gradient data of the first network parameters, the storage slot where each first network parameter is located should be used as the reference, because the first network parameters stored in different target memories differ.
- the value of the network parameter stored in the target memory is updated according to the aggregation result.
- The aggregation may include, for example, calculating the average of all gradient data for each network parameter to obtain the final gradient, and the value of each network parameter is then updated based on the final gradient.
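- The slot-keyed aggregation can be sketched as follows; this is a simplified, single-process illustration of the reduction semantics rather than the actual AllReduce implementation, and averaging only over the devices that produced a gradient for a slot is an assumption.

```python
def aggregate_embedding_grads(grads_per_device):
    """Average embedding-layer gradients across devices, keyed by the storage slot (FId)."""
    summed, counts = {}, {}
    for device_grads in grads_per_device:        # device_grads: {fid: gradient value (float or NumPy array)}
        for fid, grad in device_grads.items():
            summed[fid] = summed.get(fid, 0.0) + grad
            counts[fid] = counts.get(fid, 0) + 1
    return {fid: summed[fid] / counts[fid] for fid in summed}
```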
- the RefCount value of the feature data corresponding to the network parameters used in the current batch is decremented by 1.
- At this point, the target processor has completed adjusting the network parameters based on the first training data.
- Transferable network parameters, whose RefCount is 0 and whose FreqCount is low, are filtered out.
- the RefCount of the feature data corresponding to the transferable network parameters is 0, and the value of FreqCount is lower than the frequency threshold.
- the transferable network parameters are copied out from the target memory, and the copied transferable network parameters are cached in the memory.
- The mapping between the FeaSign of the feature data corresponding to the transferable network parameters and the FId is deleted from the first mapping relationship.
- The process may then return to operation S504 to re-determine whether the remaining storage slots are sufficient.
- The CPU may, for example, perform operation S517 to determine whether training has been performed on all the acquired batch data, that is, whether all the obtained batch data have been used as training data for training the deep learning model. If so, operation S518 is performed to copy the updated network parameters stored in the target memory (for example, HBM) out of the target memory and write them into the memory or hard disk storage. If not, the process returns to operation S503 to start the next training round.
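- The loop of Figure 5 can be summarized with the following hedged pseudocode; all helper names are hypothetical, and the grouping of steps is an approximation of the operations described above.

```python
def train_deep_learning_model(passes, cache, devices):
    for pass_data in passes:                            # obtain a pass (several batches) of data
        shuffle(pass_data)                              # global shuffle
        for batch in iterate_batches(pass_data):        # batch_size * number-of-cards samples
            feasigns = dedup_feasigns(batch)            # deduplicate required parameter groups
            missing = cache.missing_feasigns(feasigns)  # increment relative to cached parameters
            while cache.try_allocate(missing) is None:  # remaining slots insufficient?
                evict_transferable(cache)               # transfer out RefCount==0, low-FreqCount groups
            pull_to_target_memory(missing, devices)     # copy new parameter groups to the target memory
            grads = [dev.forward_backward(batch) for dev in devices]   # forward and reverse calculation
            merged = aggregate_embedding_grads(grads)   # aggregate gradients keyed by storage slot
            update_parameters(devices, merged)          # update the cached network parameters
            cache.release(feasigns)                     # decrement RefCount for this batch
    flush_updated_parameters(devices, cache)            # copy final parameters to memory / hard disk
```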
- FIG. 6 is a communication topology diagram of a single-machine multi-card processor according to an embodiment of the present disclosure.
- the electronic device with a single-machine multi-card structure may include one CPU and four XPUs, such as XPU#0 to XPU#3.
- the CPU can communicate with four XPUs through the PCIe (Peripheral Component Interconnect Express) interface.
- a network interface controller (NIC) is used to connect the electronic device to the local area network.
- the NIC can be connected to an access switch (TOR switch) through Ethernet to allow the electronic device to access the LAN.
- XPU may refer to a Kunlun chip, and specifically to a second-generation Kunlun chip.
- XPU#0 and XPU#1, XPU#0 and XPU#3, XPU#1 and XPU#2, and XPU#2 and XPU#3 can be interconnected via the cache coherence protocol (CCIX), and these connections form a processor ring.
- CCIX is an inter-chip interconnect that allows two or more devices to share data through cache coherence.
- the structure of this inter-chip interconnect provides the basis for the use of global reduction algorithms.
- the topology shown in Figure 6 can be the communication topology of the second-generation Kunlun chip. Through this topology, AllReduce communication that supports some sparse parameters (i.e., the network parameters used for embedding processing) can be achieved.
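- for illustration only, the following sketch simulates a reduce-scatter ring AllReduce over a four-node ring such as the one formed in FIG. 6 (the ring order and the NumPy arrays are assumptions; a real implementation would exchange chunks over the CCIX links between XPUs rather than within one process):

```python
import numpy as np

def ring_allreduce(grads):
    """grads: one equal-length NumPy array per XPU, in ring order.
    Returns, for every XPU, the element-wise sum of all gradients."""
    n = len(grads)
    chunks = [list(np.array_split(np.asarray(g, dtype=float), n)) for g in grads]
    # Reduce-scatter: after n-1 steps, rank r holds the fully reduced chunk (r + 1) % n.
    for step in range(n - 1):
        for rank in range(n):
            src = (rank - 1) % n               # left neighbour in the ring
            idx = (rank - step - 1) % n        # chunk index received at this step
            chunks[rank][idx] = chunks[rank][idx] + chunks[src][idx]
    # All-gather, collapsed here: collect each fully reduced chunk from its owner.
    total = np.concatenate([chunks[(i - 1) % n][i] for i in range(n)])
    return [total.copy() for _ in range(n)]

# Hypothetical gradients from XPU#0..XPU#3; every node ends up with their sum.
out = ring_allreduce([np.full(8, float(i + 1)) for i in range(4)])
assert np.allclose(out[0], 10.0)   # 1 + 2 + 3 + 4
```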
- the network parameters can also be adjusted in a manner in which each XPU broadcasts all gradient data to other XPUs and receives all gradient data from other XPUs.
- the gradient data broadcast by XPU#0 may be forwarded to XPU#2 via XPU#3, XPU#1 or CPU#1, for example.
- two or more electronic devices can also be used.
- the multiple electronic devices can be connected via a local area network.
- CPUs can communicate via the Common System Interface (QPI), which is an architecture that implements interconnection between chips.
- an asynchronous pipeline method can also be used to train the deep learning model, thereby improving model training efficiency.
- Figure 7 is a schematic diagram of the principle of training a model in the form of an asynchronous pipeline according to an embodiment of the present disclosure.
- an asynchronous pipeline (Pipeline) design can be adopted.
- while the computing core executes the training task of the current training round, the CPU can preprocess (710) the training data of the next training round and, after completing the preprocessing, prepare the data that needs to be written into the target memory.
- then, storage slots are allocated for the target parameters and the target parameters are copied into the target memory, that is, the task (720) of allocating slots and copying data is executed.
- once the computing core completes the training task (730) of the current training round, it can directly execute the training task of the next training round.
- the efficiency of model training can be effectively improved, the interval between two adjacent rounds of iterative training can be reduced, and the utilization rate of the target processor can be improved.
- for example, the CPU may, in response to the computing core training the first network parameters based on the first training data, determine, according to the second training data of the next training round, the second target parameters that need to be written into the target memory among the second network parameters required for embedding the second training data.
- the remaining storage slots in the target memory are determined according to the first mapping relationship between the storage slots of the target memory and the network parameters. Subsequently, when the remaining storage slots meet the storage requirements of the second target parameter, a storage slot is allocated for the second target parameter and the second target parameter is written into the target memory.
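- the overlap between training and parameter preparation can be sketched with a background thread, as below (a simplified, hypothetical illustration; preprocess, allocate_and_copy and train_round are stand-ins for tasks 710, 720 and 730 and not functions defined by the disclosure):

```python
import threading, queue, time

def preprocess(batch):               # task 710 stand-in: CPU-side preprocessing
    return [x * 2 for x in batch]

def allocate_and_copy(data):         # task 720 stand-in: allocate slots, copy to target memory
    return {"slots": list(range(len(data))), "values": data}

def train_round(handle):             # task 730 stand-in: training on the computing core
    time.sleep(0.01)

def pipeline_train(batches):
    ready = queue.Queue(maxsize=1)   # at most the next round is prepared ahead of time

    def producer():                  # CPU prepares round i+1 while the core trains round i
        for batch in batches:
            ready.put(allocate_and_copy(preprocess(batch)))
        ready.put(None)              # signal that no rounds remain

    threading.Thread(target=producer, daemon=True).start()
    while (handle := ready.get()) is not None:
        train_round(handle)

pipeline_train([[1, 2, 3], [4, 5, 6]])
```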
- the present disclosure also provides a deep learning model training device, which will be described in detail below with reference to FIG. 8 .
- Figure 8 is a structural block diagram of a training device for a deep learning model according to an embodiment of the present disclosure.
- the deep learning model training device 800 of this embodiment may include a target parameter determination module 810 , a remaining slot determination module 820 and a parameter writing module 830 .
- the target parameter determination module 810 is configured to determine, according to the first training data of the current training round, the first target parameter that needs to be written into the target memory among the first network parameters required for embedding the first training data.
- the target memory is the memory included in the target processor.
- the target parameter determination module 810 may be configured to perform the above-described operation S210, which will not be described again here.
- the remaining slot determination module 820 is configured to determine the remaining storage slots in the target memory according to the first mapping relationship between the storage slots of the target memory and the network parameters. In an embodiment, the remaining slot determination module 820 may be used to perform the above-described operation S220, which will not be described again here.
- the parameter writing module 830 is configured to write the first target parameter into the target memory in response to the remaining storage slots meeting the storage requirements of the first target parameter, so that the computing core included in the target processor adjusts the first network parameters based on the first training data. In one embodiment, the parameter writing module 830 may be used to perform the above-described operation S230, which will not be described again here.
- the above-mentioned device 800 may further include: a slot allocation module, configured to allocate a storage slot among the remaining storage slots to the first target parameter in response to the remaining storage slots meeting the storage requirements of the first target parameter; and a first relationship update module, configured to update the first mapping relationship according to the identification information of the storage slot allocated for the first target parameter and the identification information of the first target parameter.
- the parameter writing module 830 is configured to write the first target parameter into the storage slot allocated for the first target parameter.
- the above-mentioned target parameter determination module 810 may include: a required parameter determination sub-module, for determining the first network parameters required for embedding the first training data; a deduplication sub-module, for performing deduplication processing on the first network parameters to obtain deduplicated network parameters; and a target parameter determination sub-module, for determining, based on the first mapping relationship and the identification information of the deduplicated network parameters, the network parameters among the deduplicated network parameters that are not stored in the target memory as the first target parameters.
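- a minimal sketch of these three sub-steps, assuming feature signs are plain integers and the first mapping relationship is a Python dict from FeaSign to FId (all names are illustrative assumptions):

```python
def determine_first_target_params(fea_signs, fea_to_fid):
    """fea_signs: FeaSigns of the features in the first training data, with repeats.
    fea_to_fid: the first mapping relationship (FeaSign -> storage slot FId).
    Returns the deduplicated FeaSigns that are not yet stored in the target memory."""
    deduped = set(fea_signs)                             # deduplication sub-module
    return [f for f in deduped if f not in fea_to_fid]   # not yet in the target memory

# Usage (hypothetical): features 7 and 9 already have slots; feature 3 appears twice.
fea_to_fid = {7: 0, 9: 1}
print(determine_first_target_params([7, 3, 3, 9, 11], fea_to_fid))  # -> [3, 11] (order may vary)
```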
- the above-mentioned device 800 may further include: a transfer parameter determination module, configured to determine, in response to the remaining storage slots not meeting the storage requirements of the first target parameter, the transferable network parameters among the network parameters already stored in the target memory; and a parameter transfer module, for transferring the transferable network parameters from the target memory to the memory.
- the parameter writing module 830 is also configured to write the first target parameter into the target memory in response to the transferable network parameter being transferred to the memory.
- the transfer parameter determination module is configured to determine, according to the second mapping relationship between the storage slots of the target memory and the parameter statuses of the network parameters stored in the storage slots, the network parameters whose parameter status is the target status as the transferable network parameters.
- the parameter state includes at least one of the following: reference state and number of uses; the target state includes at least one of the following: the reference state is an unreferenced state; the number of uses is less than the number threshold.
- the above-mentioned device 800 may further include: a slot allocation module, configured to allocate remaining storage slots in the target memory to the first target parameter in response to the transferable network parameters being transferred to the memory; and a second relationship update module, configured to update the second mapping relationship according to the storage slot allocated for the first target parameter and the storage slots where the parameters in the first network parameters other than the first target parameter are located, so as to update the parameter status of the first network parameters.
- the second relationship update module may also be configured to update the second mapping relationship in response to the computing core completing the adjustment of the first network parameter to update the reference status of the first network parameter.
- the above parameter transfer module may be configured to write the transferable network parameters into the hard disk memory via the memory in response to the remaining storage space of the memory being less than the space threshold.
- the target parameter determination module 810 is further configured to: in response to the computing core training the first network parameters according to the first training data, determine, according to the second training data of the next training round, the second target parameters that need to be written into the target memory among the second network parameters required for embedding the second training data.
- the above-mentioned remaining slot determination module 820 is also configured to determine the remaining storage slots in the target memory according to the first mapping relationship between the storage slots of the target memory and the network parameters.
- the above-mentioned parameter writing module 830 is also configured to write the second target parameter into the target memory in response to the remaining storage slots meeting the storage requirements of the second target parameter.
- the target processor includes multiple processors; the first training data includes multiple batches of data respectively corresponding to the multiple processors.
- the above-mentioned parameter writing module 830 may include: a specified parameter determination sub-module, configured to determine, for each processor among the plurality of processors, the specified parameters in the first target parameters that are required for embedding processing of a batch of data corresponding to each processor; a parameter replacement sub-module, used to replace the parameters in the first target parameters other than the specified parameters with predetermined parameter values to obtain the parameters to be written for each processor; and a writing sub-module, used to write the parameters to be written into the target memory included in each processor, so that the computing core included in each processor trains the specified parameters according to a batch of data corresponding to each processor.
- the number of network parameters required for embedding processing of each batch of data is related to the storage capacity of the target memory in the processor corresponding to each batch of data.
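- as an illustration of how the parameters to be written described above can be assembled for each processor (a sketch under the assumptions that parameters are fixed-width NumPy vectors keyed by FeaSign and that the predetermined placeholder value is zero; neither is specified by the disclosure):

```python
import numpy as np

def build_write_buffer(first_target_params, specified_feas, dim=4, placeholder=0.0):
    """first_target_params: dict FeaSign -> parameter vector (the first target parameters).
    specified_feas: FeaSigns actually needed for this processor's batch of data.
    Returns one entry per first target parameter; entries this processor does not
    need are filled with the predetermined placeholder value."""
    buffer = {}
    for fea, value in first_target_params.items():
        if fea in specified_feas:
            buffer[fea] = np.asarray(value, dtype=float)   # specified parameter, real value
        else:
            buffer[fea] = np.full(dim, placeholder)        # predetermined parameter value
    return buffer

# Usage (hypothetical): processor #0 only needs features 3 and 11 for its batch.
params = {3: [0.1] * 4, 11: [0.2] * 4, 15: [0.3] * 4}
to_write = build_write_buffer(params, specified_feas={3, 11})
```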
- the above-mentioned parameter writing module 830 is also configured to write the third network parameters required for predictive processing of the multiple batches of data into the target memory in each processor, so that the computing core included in each processor adjusts the third network parameters according to a batch of data corresponding to each processor.
- the present disclosure also provides a deep learning model training system.
- the system will be described in detail below with reference to FIG. 9 .
- Figure 9 is a structural block diagram of a training system for a deep learning model according to an embodiment of the present disclosure.
- the deep learning model training system 900 of this embodiment may include a first processor 910 and a second processor 920.
- the second processor includes a target memory and a computing core.
- the first processor 910 is configured to: according to the first training data of the current training round, determine the first target parameter that needs to be written into the target memory among the first network parameters required for embedding the first training data; according to the first mapping relationship between the storage slots of the target memory and the network parameters, determine the remaining storage slots in the target memory; and in response to the remaining storage slots meeting the storage requirements of the first target parameter, write the first target parameter into the target memory and send training task information based on the first training data to the second processor. It can be understood that the first processor may be configured to perform operations S310 to S330 described above, which will not be described again here.
- the second processor 920 is configured to: the computing core adjusts the first network parameters according to the first training data in response to receiving the training task information.
- the second processor includes a plurality of processors; the first training data includes a plurality of batches of data respectively corresponding to the plurality of processors.
- the above-mentioned first processor 910 is configured to write the first target parameters into the target memory in the following manner: for each processor in the plurality of processors, determine the specified parameters in the first target parameters that are required for embedding processing of a batch of data corresponding to each processor; use predetermined parameters to replace the parameters in the first target parameters other than the specified parameters to obtain the parameters to be written for each processor; and write the parameters to be written into the target memory included in each processor.
- multiple processors are connected via a cache coherent interconnect protocol to form a processor ring.
- each processor among the plurality of processors is configured to adjust the first network parameters in the following manner: the computing core performs forward calculation and reverse calculation according to a batch of data corresponding to each processor and the specified parameters to obtain gradient data for the first network parameters; and, according to the storage slots where the first network parameters are located, a global reduction algorithm is used to adjust the first network parameters based on the gradient data for the first network parameters and the gradient data obtained by the other processors among the plurality of processors.
- the second processor includes an artificial intelligence chip; and the artificial intelligence chip includes a second-generation Kunlun core chip.
- the collection, storage, use, processing, transmission, provision, disclosure and application of user personal information all comply with relevant laws and regulations, necessary confidentiality measures have been taken, and public order and good customs are not violated.
- the user's authorization or consent is obtained before obtaining or collecting the user's personal information.
- the present disclosure also provides an electronic device, a readable storage medium, and a computer program product.
- FIG. 10 shows a schematic block diagram of an example electronic device 1000 that may be used to implement the training method of a deep learning model according to an embodiment of the present disclosure.
- Electronic devices are intended to refer to various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers.
- Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices.
- the components shown herein, their connections and relationships, and their functions are examples only and are not intended to limit implementations of the disclosure described and/or claimed herein.
- the device 1000 includes a computing unit 1001 that can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 1002 or a computer program loaded from a storage unit 1008 into a random access memory (RAM) 1003. The RAM 1003 can also store various programs and data required for the operation of the device 1000.
- Computing unit 1001, ROM 1002 and RAM 1003 are connected to each other via bus 1004.
- An input/output (I/O) interface 1005 is also connected to bus 1004.
- multiple components in the device 1000 are connected to the I/O interface 1005, including: an input unit 1006, such as a keyboard or a mouse; an output unit 1007, such as various types of displays and speakers; a storage unit 1008, such as a magnetic disk or an optical disk; and a communication unit 1009, such as a network card, a modem, or a wireless communication transceiver.
- the communication unit 1009 allows the device 1000 to exchange information/data with other devices through computer networks such as the Internet and/or various telecommunications networks.
- the computing unit 1001 may be any of a variety of general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 1001 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any appropriate processor, controller, microcontroller, etc.
- the computing unit 1001 performs various methods and processes described above, such as the training method of a deep learning model.
- the training method of the deep learning model may be implemented as a computer software program, which is tangibly included in a machine-readable medium, such as the storage unit 1008.
- part or all of the computer program may be loaded and/or installed onto device 1000 via ROM 1002 and/or communication unit 1009.
- when the computer program is loaded into the RAM 1003 and executed by the computing unit 1001, one or more steps of the training method of the deep learning model described above may be performed.
- the computing unit 1001 may be configured to perform the training method of the deep learning model in any other suitable manner (eg, by means of firmware).
- various implementations of the systems and techniques described above may be implemented in digital electronic circuit systems, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOC), complex programmable logic devices (CPLD), computer hardware, firmware, software, and/or combinations thereof.
- these various embodiments may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor and which may receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit data and instructions to the storage system, the at least one input device, and the at least one output device.
- program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing device, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowcharts and/or block diagrams to be implemented.
- the program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
- a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
- the machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
- machine-readable media may include, but are not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any suitable combination of the foregoing.
- more specific examples of machine-readable storage media would include an electrical connection based on one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
- the systems and techniques described herein may be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user; and a keyboard and a pointing device (e.g., a mouse or a trackball) through which the user can provide input to the computer.
- other kinds of devices may also be used to provide interaction with the user; for example, the feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form (including acoustic input, voice input, or tactile input).
- the systems and techniques described herein may be implemented in a computing system that includes back-end components (e.g., as a data server), or a computing system that includes middleware components (e.g., an application server), or a computing system that includes front-end components (e.g., a user computer having a graphical user interface or a web browser through which the user can interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such back-end components, middleware components, or front-end components.
- the components of the system may be interconnected by any form or medium of digital data communication (eg, a communications network). Examples of communication networks include: local area network (LAN), wide area network (WAN), and the Internet.
- Computer systems may include clients and servers. Clients and servers are generally remote from each other and typically interact over a communications network. The relationship of client and server is created by computer programs running on corresponding computers and having a client-server relationship with each other.
- the server can be a cloud server, also known as a cloud computing server or cloud host, which is a host product in the cloud computing service system that remedies the defects of difficult management and weak business scalability existing in traditional physical host and VPS ("Virtual Private Server", or "VPS" for short) services.
- the server can also be a distributed system server or a server combined with a blockchain.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- General Health & Medical Sciences (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Computational Linguistics (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Artificial Intelligence (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Neurology (AREA)
Abstract
Description
Claims (33)
- A training method for a deep learning model, comprising: determining, according to first training data of a current training round, a first target parameter that needs to be written into a target memory among first network parameters required for embedding processing of the first training data, wherein the target memory is a memory included in a target processor; determining remaining storage slots in the target memory according to a first mapping relationship between storage slots of the target memory and network parameters; and in response to the remaining storage slots meeting a storage requirement of the first target parameter, writing the first target parameter into the target memory, so that a computing core included in the target processor adjusts the first network parameters according to the first training data.
- The method according to claim 1, further comprising: in response to the remaining storage slots meeting the storage requirement of the first target parameter, allocating a storage slot among the remaining storage slots to the first target parameter; and updating the first mapping relationship according to identification information of the storage slot allocated to the first target parameter and identification information of the first target parameter, wherein writing the first target parameter into the target memory comprises: writing the first target parameter into the storage slot allocated to the first target parameter.
- The method according to claim 1, wherein determining the first target parameter that needs to be written into the target memory among the first network parameters required for embedding processing of the first training data comprises: determining the first network parameters required for embedding processing of the first training data; performing deduplication processing on the first network parameters to obtain deduplicated network parameters; and determining, according to the first mapping relationship and identification information of the deduplicated network parameters, the network parameters among the deduplicated network parameters that are not stored in the target memory as the first target parameter.
- The method according to claim 1, further comprising: in response to the remaining storage slots not meeting the storage requirement of the first target parameter, determining transferable network parameters among the network parameters already stored in the target memory; transferring the transferable network parameters from the target memory to a memory; and in response to the transferable network parameters being transferred to the memory, writing the first target parameter into the target memory.
- The method according to claim 4, wherein determining the transferable network parameters among the network parameters already stored in the target memory comprises: determining, according to a second mapping relationship between the storage slots of the target memory and parameter statuses of the network parameters stored in the storage slots, the network parameters whose parameter status is a target status as the transferable network parameters, wherein the parameter status includes at least one of: a reference status and a number of uses; and the target status includes at least one of: the reference status being an unreferenced status, and the number of uses being less than a number threshold; and wherein the method further comprises: in response to the transferable network parameters being transferred to the memory, allocating remaining storage slots in the target memory to the first target parameter; and updating the second mapping relationship according to the storage slot allocated to the first target parameter and the storage slots where the parameters in the first network parameters other than the first target parameter are located, so as to update the parameter status of the first network parameters.
- The method according to claim 5, further comprising: in response to the computing core completing the adjustment of the first network parameters, updating the second mapping relationship so as to update the reference status of the first network parameters.
- The method according to claim 4, wherein transferring the transferable network parameters from the target memory to the memory comprises: in response to a remaining storage space of the memory being less than a space threshold, writing the transferable network parameters into a hard disk storage via the memory.
- The method according to claim 1, further comprising: in response to the computing core training the first network parameters according to the first training data, determining, according to second training data of a next training round, a second target parameter that needs to be written into the target memory among second network parameters required for embedding processing of the second training data; determining the remaining storage slots in the target memory according to the first mapping relationship between the storage slots of the target memory and the network parameters; and in response to the remaining storage slots meeting a storage requirement of the second target parameter, writing the second target parameter into the target memory.
- The method according to claim 1, wherein the target processor includes a plurality of processors, and the first training data includes a plurality of batches of data respectively corresponding to the plurality of processors; and writing the first target parameter into the target memory comprises: for each processor among the plurality of processors, determining specified parameters in the first target parameter that are required for embedding processing of a batch of data corresponding to the each processor; replacing the parameters in the first target parameter other than the specified parameters with predetermined parameter values to obtain parameters to be written for the each processor; and writing the parameters to be written into the target memory included in the each processor, so that the computing core included in the each processor trains the specified parameters according to the batch of data corresponding to the each processor.
- The method according to claim 9, wherein, for each batch of data in the plurality of batches of data, the number of network parameters required for embedding processing of the each batch of data is related to a storage capacity of the target memory in the processor corresponding to the each batch of data.
- The method according to claim 9, further comprising: writing third network parameters required for prediction processing of the plurality of batches of data into the target memory in the each processor, so that the computing core included in the each processor adjusts the third network parameters according to the batch of data corresponding to the each processor.
- A training method for a deep learning model, comprising: determining, by a first processor according to first training data of a current training round, a first target parameter that needs to be written into a target memory among first network parameters required for embedding processing of the first training data, wherein the target memory is a memory included in a second processor; determining, by the first processor, remaining storage slots in the target memory according to a first mapping relationship between storage slots of the target memory and network parameters; in response to the remaining storage slots meeting a storage requirement of the first target parameter, writing, by the first processor, the first target parameter into the target memory and sending training task information based on the first training data to the second processor; and in response to receiving the training task information, adjusting, by a computing core of the second processor, the first network parameters according to the first training data.
- The method according to claim 12, wherein the second processor includes a plurality of processors, and the first training data includes a plurality of batches of data respectively corresponding to the plurality of processors; and writing the first target parameter into the target memory comprises: for each processor among the plurality of processors, determining specified parameters in the first target parameter that are required for embedding processing of a batch of data corresponding to the each processor; replacing the parameters in the first target parameter other than the specified parameters with predetermined parameters to obtain parameters to be written for the each processor; and writing the parameters to be written into the target memory included in the each processor.
- The method according to claim 13, wherein the plurality of processors are connected via a cache coherent interconnect protocol to form a processor ring; and adjusting the first network parameters according to the first training data comprises: performing, by the computing core of each processor among the plurality of processors, forward calculation and reverse calculation according to the batch of data corresponding to the each processor and the specified parameters, to obtain gradient data for the first network parameters; and adjusting, by the each processor, the first network parameters according to the storage slots where the first network parameters are located, using a global reduction algorithm based on the gradient data for the first network parameters and the gradient data obtained by the other processors among the plurality of processors.
- The method according to any one of claims 12 to 14, wherein the second processor includes an artificial intelligence chip, and the artificial intelligence chip includes a second-generation Kunlun chip.
- A training apparatus for a deep learning model, comprising: a target parameter determination module configured to determine, according to first training data of a current training round, a first target parameter that needs to be written into a target memory among first network parameters required for embedding processing of the first training data, wherein the target memory is a memory included in a target processor; a remaining slot determination module configured to determine remaining storage slots in the target memory according to a first mapping relationship between storage slots of the target memory and network parameters; and a parameter writing module configured to, in response to the remaining storage slots meeting a storage requirement of the first target parameter, write the first target parameter into the target memory, so that a computing core included in the target processor adjusts the first network parameters according to the first training data.
- The apparatus according to claim 16, further comprising: a slot allocation module configured to, in response to the remaining storage slots meeting the storage requirement of the first target parameter, allocate a storage slot among the remaining storage slots to the first target parameter; and a first relationship update module configured to update the first mapping relationship according to identification information of the storage slot allocated to the first target parameter and identification information of the first target parameter, wherein the parameter writing module is configured to write the first target parameter into the storage slot allocated to the first target parameter.
- The apparatus according to claim 16, wherein the target parameter determination module comprises: a required parameter determination sub-module configured to determine the first network parameters required for embedding processing of the first training data; a deduplication sub-module configured to perform deduplication processing on the first network parameters to obtain deduplicated network parameters; and a target parameter determination sub-module configured to determine, according to the first mapping relationship and identification information of the deduplicated network parameters, the network parameters among the deduplicated network parameters that are not stored in the target memory as the first target parameter.
- The apparatus according to claim 16, further comprising: a transfer parameter determination module configured to, in response to the remaining storage slots not meeting the storage requirement of the first target parameter, determine transferable network parameters among the network parameters already stored in the target memory; and a parameter transfer module configured to transfer the transferable network parameters from the target memory to a memory, wherein the parameter writing module is further configured to, in response to the transferable network parameters being transferred to the memory, write the first target parameter into the target memory.
- The apparatus according to claim 19, wherein the transfer parameter determination module is configured to: determine, according to a second mapping relationship between the storage slots of the target memory and parameter statuses of the network parameters stored in the storage slots, the network parameters whose parameter status is a target status as the transferable network parameters, wherein the parameter status includes at least one of: a reference status and a number of uses; and the target status includes at least one of: the reference status being an unreferenced status, and the number of uses being less than a number threshold; and wherein the apparatus further comprises: a slot allocation module configured to, in response to the transferable network parameters being transferred to the memory, allocate remaining storage slots in the target memory to the first target parameter; and a second relationship update module configured to update the second mapping relationship according to the storage slot allocated to the first target parameter and the storage slots where the parameters in the first network parameters other than the first target parameter are located, so as to update the parameter status of the first network parameters.
- The apparatus according to claim 20, wherein the second relationship update module is further configured to: in response to the computing core completing the adjustment of the first network parameters, update the second mapping relationship so as to update the reference status of the first network parameters.
- The apparatus according to claim 19, wherein the parameter transfer module is configured to: in response to a remaining storage space of the memory being less than a space threshold, write the transferable network parameters into a hard disk storage via the memory.
- The apparatus according to claim 16, wherein: the target parameter determination module is further configured to, in response to the computing core training the first network parameters according to the first training data, determine, according to second training data of a next training round, a second target parameter that needs to be written into the target memory among second network parameters required for embedding processing of the second training data; the remaining slot determination module is further configured to determine the remaining storage slots in the target memory according to the first mapping relationship between the storage slots of the target memory and the network parameters; and the parameter writing module is further configured to, in response to the remaining storage slots meeting a storage requirement of the second target parameter, write the second target parameter into the target memory.
- The apparatus according to claim 16, wherein the target processor includes a plurality of processors, and the first training data includes a plurality of batches of data respectively corresponding to the plurality of processors; and the parameter writing module comprises: a specified parameter determination sub-module configured to, for each processor among the plurality of processors, determine specified parameters in the first target parameter that are required for embedding processing of a batch of data corresponding to the each processor; a parameter replacement sub-module configured to replace the parameters in the first target parameter other than the specified parameters with predetermined parameter values to obtain parameters to be written for the each processor; and a writing sub-module configured to write the parameters to be written into the target memory included in the each processor, so that the computing core included in the each processor trains the specified parameters according to the batch of data corresponding to the each processor.
- The apparatus according to claim 24, wherein, for each batch of data in the plurality of batches of data, the number of network parameters required for embedding processing of the each batch of data is related to a storage capacity of the target memory in the processor corresponding to the each batch of data.
- The apparatus according to claim 24, wherein the parameter writing module is further configured to: write third network parameters required for prediction processing of the plurality of batches of data into the target memory in the each processor, so that the computing core included in the each processor adjusts the third network parameters according to the batch of data corresponding to the each processor.
- A training system for a deep learning model, comprising a first processor and a second processor, the second processor including a target memory and a computing core, wherein: the first processor is configured to: determine, according to first training data of a current training round, a first target parameter that needs to be written into the target memory among first network parameters required for embedding processing of the first training data; determine remaining storage slots in the target memory according to a first mapping relationship between storage slots of the target memory and network parameters; and in response to the remaining storage slots meeting a storage requirement of the first target parameter, write the first target parameter into the target memory and send training task information based on the first training data to the second processor; and the second processor is configured such that the computing core, in response to receiving the training task information, adjusts the first network parameters according to the first training data.
- The system according to claim 27, wherein the second processor includes a plurality of processors, and the first training data includes a plurality of batches of data respectively corresponding to the plurality of processors; and the first processor is configured to write the first target parameter into the target memory by: for each processor among the plurality of processors, determining specified parameters in the first target parameter that are required for embedding processing of a batch of data corresponding to the each processor; replacing the parameters in the first target parameter other than the specified parameters with predetermined parameters to obtain parameters to be written for the each processor; and writing the parameters to be written into the target memory included in the each processor.
- The system according to claim 28, wherein the plurality of processors are connected via a cache coherent interconnect protocol to form a processor ring; and the each processor is configured to adjust the first network parameters by: the computing core performing forward calculation and reverse calculation according to the batch of data corresponding to the each processor and the specified parameters to obtain gradient data for the first network parameters; and adjusting the first network parameters according to the storage slots where the first network parameters are located, using a global reduction algorithm based on the gradient data for the first network parameters and the gradient data obtained by the other processors among the plurality of processors.
- The system according to any one of claims 27 to 29, wherein the second processor includes an artificial intelligence chip, and the artificial intelligence chip includes a second-generation Kunlun chip.
- An electronic device, comprising: at least one processor; and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to perform the method according to any one of claims 1 to 15.
- A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are configured to cause a computer to perform the method according to any one of claims 1 to 15.
- A computer program product comprising a computer program/instructions, wherein the computer program/instructions, when executed by a processor, implement the steps of the method according to any one of claims 1 to 15.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1020247009547A KR20240046596A (ko) | 2022-05-19 | 2022-09-27 | 딥러닝 모델의 훈련 방법, 장치, 시스템, 기기, 매체 및 컴퓨터 프로그램 |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210559489.0 | 2022-05-19 | ||
CN202210559489.0A CN114861911B (zh) | 2022-05-19 | 2022-05-19 | 深度学习模型的训练方法、装置、系统、设备和介质 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2023221360A1 true WO2023221360A1 (zh) | 2023-11-23 |
Family
ID=82639886
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2022/121697 WO2023221360A1 (zh) | 2022-05-19 | 2022-09-27 | 深度学习模型的训练方法、装置、系统、设备和介质 |
Country Status (3)
Country | Link |
---|---|
KR (1) | KR20240046596A (zh) |
CN (1) | CN114861911B (zh) |
WO (1) | WO2023221360A1 (zh) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117743973A (zh) * | 2024-02-19 | 2024-03-22 | 北京搜狐新媒体信息技术有限公司 | 一种参数处理方法、装置、设备及存储介质 |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114861911B (zh) * | 2022-05-19 | 2023-04-07 | 北京百度网讯科技有限公司 | 深度学习模型的训练方法、装置、系统、设备和介质 |
CN116187426B (zh) * | 2022-11-09 | 2024-04-19 | 北京百度网讯科技有限公司 | 深度学习模型的模型参数多流广播方法及其装置 |
CN115965074B (zh) * | 2022-11-28 | 2023-11-10 | 北京百度网讯科技有限公司 | 深度学习模型的训练方法、数据处理方法、装置和设备 |
CN116185307B (zh) * | 2023-04-24 | 2023-07-04 | 之江实验室 | 一种模型数据的存储方法、装置、存储介质及电子设备 |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170357896A1 (en) * | 2016-06-09 | 2017-12-14 | Sentient Technologies (Barbados) Limited | Content embedding using deep metric learning algorithms |
CN109656722A (zh) * | 2019-01-04 | 2019-04-19 | Oppo广东移动通信有限公司 | 内存优化方法、装置、移动终端及存储介质 |
CN110532198A (zh) * | 2019-09-09 | 2019-12-03 | 成都西山居互动娱乐科技有限公司 | 一种存储空间分配的方法及装置 |
CN113885691A (zh) * | 2021-09-30 | 2022-01-04 | 上海商汤阡誓科技有限公司 | 芯片功耗调整、神经网络训练方法、装置以及芯片系统 |
CN114492794A (zh) * | 2022-01-28 | 2022-05-13 | 北京百度网讯科技有限公司 | 用于处理数据的方法、装置、设备、介质和产品 |
CN114861911A (zh) * | 2022-05-19 | 2022-08-05 | 北京百度网讯科技有限公司 | 深度学习模型的训练方法、装置、系统、设备和介质 |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9015083B1 (en) * | 2012-03-23 | 2015-04-21 | Google Inc. | Distribution of parameter calculation for iterative optimization methods |
CN108053029B (zh) * | 2017-12-27 | 2021-08-27 | 上海闪易半导体有限公司 | 一种基于存储阵列的神经网络的训练方法 |
KR20210043778A (ko) * | 2019-10-11 | 2021-04-22 | 삼성전자주식회사 | 불휘발성 메모리 장치를 제어하도록 구성된 스토리지 컨트롤러의 동작 방법 |
CN112257844B (zh) * | 2020-09-29 | 2022-04-26 | 浙江大学 | 一种基于混合精度配置的卷积神经网络加速器及其实现方法 |
CN113159284A (zh) * | 2021-03-31 | 2021-07-23 | 华为技术有限公司 | 一种模型训练方法及装置 |
CN113408696A (zh) * | 2021-05-17 | 2021-09-17 | 珠海亿智电子科技有限公司 | 深度学习模型的定点量化方法及装置 |
CN113505887B (zh) * | 2021-09-12 | 2022-01-04 | 浙江大学 | 一种针对忆阻器误差的忆阻器存储器神经网络训练方法 |
-
2022
- 2022-05-19 CN CN202210559489.0A patent/CN114861911B/zh active Active
- 2022-09-27 WO PCT/CN2022/121697 patent/WO2023221360A1/zh active Application Filing
- 2022-09-27 KR KR1020247009547A patent/KR20240046596A/ko unknown
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170357896A1 (en) * | 2016-06-09 | 2017-12-14 | Sentient Technologies (Barbados) Limited | Content embedding using deep metric learning algorithms |
CN109656722A (zh) * | 2019-01-04 | 2019-04-19 | Oppo广东移动通信有限公司 | 内存优化方法、装置、移动终端及存储介质 |
CN110532198A (zh) * | 2019-09-09 | 2019-12-03 | 成都西山居互动娱乐科技有限公司 | 一种存储空间分配的方法及装置 |
CN113885691A (zh) * | 2021-09-30 | 2022-01-04 | 上海商汤阡誓科技有限公司 | 芯片功耗调整、神经网络训练方法、装置以及芯片系统 |
CN114492794A (zh) * | 2022-01-28 | 2022-05-13 | 北京百度网讯科技有限公司 | 用于处理数据的方法、装置、设备、介质和产品 |
CN114861911A (zh) * | 2022-05-19 | 2022-08-05 | 北京百度网讯科技有限公司 | 深度学习模型的训练方法、装置、系统、设备和介质 |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117743973A (zh) * | 2024-02-19 | 2024-03-22 | 北京搜狐新媒体信息技术有限公司 | 一种参数处理方法、装置、设备及存储介质 |
CN117743973B (zh) * | 2024-02-19 | 2024-05-28 | 北京搜狐新媒体信息技术有限公司 | 一种参数处理方法、装置、设备及存储介质 |
Also Published As
Publication number | Publication date |
---|---|
KR20240046596A (ko) | 2024-04-09 |
CN114861911B (zh) | 2023-04-07 |
CN114861911A (zh) | 2022-08-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2023221360A1 (zh) | 深度学习模型的训练方法、装置、系统、设备和介质 | |
CN110134636B (zh) | 模型训练方法、服务器和计算机可读存储介质 | |
CN112860695B (zh) | 监控数据查询方法、装置、设备、存储介质及程序产品 | |
US20210326762A1 (en) | Apparatus and method for distributed model training, device, and computer readable storage medium | |
KR20210156243A (ko) | 딥러닝 프레임워크의 훈련 방법, 장치 및 저장 매체 | |
US20190208011A1 (en) | Accelerating data replication using multicast and non-volatile memory enabled nodes | |
US20230333764A1 (en) | Method and apparatus for compressing data of storage system, device, and readable storage medium | |
CN106339475A (zh) | 一种海量数据的分布式存储系统 | |
CN113656176B (zh) | 云设备的分配方法、装置、系统、电子设备、介质及产品 | |
CN115129621B (zh) | 一种内存管理方法、设备、介质及内存管理模块 | |
CN114911596B (zh) | 针对模型训练的调度方法、装置、电子设备和存储介质 | |
CN113344074B (zh) | 模型训练方法、装置、设备及存储介质 | |
Lerat et al. | Single node deep learning frameworks: Comparative study and CPU/GPU performance analysis | |
CN110401681A (zh) | 用于数据传输、数据接收的方法以及电子设备 | |
CN116540938A (zh) | 数据读取方法、装置、分布式存储系统、设备和存储介质 | |
WO2023051319A1 (zh) | 基于多数据对齐的数据发送和接收方法、装置和设备 | |
US11768814B2 (en) | Data transmissions between two databases | |
JP2023031248A (ja) | エッジコンピューティングネットワーク、データ伝送方法、装置、機器、及び記憶媒体 | |
WO2022000851A1 (zh) | 数据处理方法、装置、设备和存储介质 | |
CN113641688A (zh) | 节点更新方法、相关装置及计算机程序产品 | |
WO2017186049A1 (zh) | 信息处理方法和装置 | |
WO2023222113A1 (zh) | 稀疏参数的更新方法、训练节点、设备和存储介质 | |
US20220147373A1 (en) | Method and apparatus for acquiring information | |
WO2023138233A1 (zh) | 模型传输方法、装置、电子设备及可读存储介质 | |
EP4131017A2 (en) | Distributed data storage |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 22942388 Country of ref document: EP Kind code of ref document: A1 |
|
ENP | Entry into the national phase |
Ref document number: 20247009547 Country of ref document: KR Kind code of ref document: A |
|
WWE | Wipo information: entry into national phase |
Ref document number: 2022942388 Country of ref document: EP |
|
ENP | Entry into the national phase |
Ref document number: 2022942388 Country of ref document: EP Effective date: 20240327 |