WO2024016542A1 - Information fusion method, data communication method, apparatus, electronic device and non-volatile readable storage medium - Google Patents

Information fusion method, data communication method, apparatus, electronic device and non-volatile readable storage medium

Info

Publication number
WO2024016542A1
Authority
WO
WIPO (PCT)
Prior art keywords
training
parameters
node
nodes
computing
Prior art date
Application number
PCT/CN2022/133806
Other languages
English (en)
French (fr)
Inventor
闫瑞栋
郭振华
赵雅倩
邱志勇
Original Assignee
浪潮电子信息产业股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 浪潮电子信息产业股份有限公司 filed Critical 浪潮电子信息产业股份有限公司
Publication of WO2024016542A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations

Definitions

  • The present application relates to the field of computer technology, and more specifically, to an information fusion method, a data communication method, an information fusion device, a data communication device, an electronic device and a non-volatile readable storage medium.
  • the current distributed training system is a network or cluster composed of multiple computing nodes, and each computing node can be composed of one host or multiple hosts. Each host has a relatively independent storage device or storage unit. Compared with the stand-alone training method with only a single computing node, the scale of stored data and models in distributed systems has been significantly improved, making it possible to train deep neural network models with more than 10 billion parameters on large-scale data sets.
  • the distributed training system effectively shortens the execution time of training tasks through a variety of computing and communication architectures.
  • The distributed training system draws on the idea of "divide and conquer": first, the deep neural network model or large data set to be trained is split in a model-parallel, data-parallel or hybrid-parallel manner and allocated to the corresponding computing nodes; then, each computing node separately trains its split small-scale data or sub-model and generates local or intermediate training results; finally, the distributed training system aggregates all local training results in some way to obtain a global result and outputs the global training result.
  • the above processes are carried out simultaneously in a parallel manner, which can significantly reduce the traditional serial single-machine training time.
  • the centralized architecture mainly refers to the existence of a central node in the distributed training system, which is responsible for information interaction with other computing nodes and the synchronization and update of global information.
  • the parameter server architecture is the most typical. There are two main roles in this architecture: parameter server (server) and computing node (worker).
  • the parameter server is responsible for collecting information such as gradients or parameters sent by the computing nodes, performing global calculations and synchronization on the collected information, and returning the global synchronization information to each computing node.
  • the computing node receives the global information sent by the parameter server and performs subsequent iterative calculations, and sends the newly generated calculation results to the parameter server.
  • the parameter server and computing node iterate repeatedly according to the above process until the training end condition is reached.
  • In contrast, the decentralized architecture has no node similar to a central parameter server, and all computing nodes have an "equal" status.
  • In this architecture there is only one role: the computing node.
  • Each computing node holds only its own local data or local model parameters and, in each round of iterative training, achieves the communication and fusion of global or local information through operations such as All-Reduce rather than through a dedicated central node.
  • a comparative analysis of the advantages and disadvantages of centralized architecture and decentralized architecture is as follows:
  • The advantages of the centralized architecture are: first, there is no direct information interaction or communication between the computing nodes, and each computing node communicates only with the central parameter server node, so the training processes of the nodes are relatively independent; in other words, each node communicates with the central parameter server node at its own training speed, which naturally supports asynchronous communication strategies. Second, because the central parameter server is responsible for fusing the global information and sending it to each computing node, the model training accuracy and the convergence of the algorithm are fully guaranteed.
  • Finally, the centralized architecture has good fault tolerance: when a node is added or removed, the change does not have a direct impact on the other nodes.
  • the disadvantage is that the central parameter server node easily falls into a "communication bottleneck". Under the condition that the bandwidth of the central parameter server node is limited, as the number of computing nodes continues to increase and all communicate with the central parameter server node, the central parameter server node will encounter a communication bottleneck.
  • The advantage of the decentralized architecture is that each computing node usually interacts and communicates only with its neighbor nodes; computation based on local information is small and, to a certain extent, increases the training speed.
  • the disadvantage is that due to the lack of global information synchronization of the central node, the model training accuracy is poor or even the model training fails.
  • The main idea of the synchronization algorithm is: when a computing node in the distributed training system completes the current round of iteration, it must wait for the other computing nodes to complete their current round of iteration tasks before they can jointly proceed to the next round of training iterations.
  • A typical synchronization algorithm is the bulk synchronous parallel (BSP) algorithm.
  • In the BSP algorithm, when a computing node completes its current iteration task, it needs to synchronize information such as model parameters or gradients with the other computing nodes through different communication topologies; they then proceed to the next iteration from the same "starting point".
  • the BSP algorithm introduces a global synchronization barrier.
  • the advantage of the synchronization algorithm is to ensure the consistency of the model parameters of each computing node, thereby ensuring that the algorithm convergence analysis has a theoretical basis.
  • The disadvantage of the synchronization algorithm is that the system performance is limited by the node with the slowest training speed, i.e., the straggler effect.
  • the main idea of the asynchronous algorithm is that when a computing node in the system completes its current round of iteration, it can continue to execute the next round of iteration without waiting for other computing nodes.
  • The advantage of this algorithm is that it avoids the straggler effect of the synchronization algorithm and makes full use of the system performance.
  • However, the asynchronous algorithm leads to the problem of stale gradients, because computing nodes with large performance differences produce local gradient information of differing staleness that is nevertheless used by the central node.
  • The purpose of the embodiments of the present application is to provide an information fusion method, a data communication method, an information fusion device, a data communication device, an electronic device and a computer non-volatile readable storage medium, which improve the distributed training speed of the model and reduce the communication overhead between the central node and the computing nodes.
  • embodiments of the present application provide an information fusion method, which is applied to the central node in the distributed training system.
  • the above method includes:
  • When the communication trigger condition is met, the local parameters of each computing node in the above-mentioned distributed training system are obtained; wherein the above-mentioned communication trigger condition is that all key nodes participating in the current round of training have completed the current round of training tasks;
  • the key nodes participating in this round of training all execute the iterative training process for a preset number of times.
  • the above-mentioned selection of N key nodes among each computing node to participate in the next round of training includes:
  • Calculate the average parameter of the local parameters of the above key nodes, determine the deviation between the local parameters of each computing node and the above average parameter, and select the N computing nodes with the smallest deviation as the key nodes participating in the next round of training.
  • the above-mentioned determination of the deviation between the local parameters of each calculation node and the above-mentioned average parameters includes:
  • the above-mentioned average parameters are sent to each calculation node, so that each calculation node calculates the deviation between its own local parameter and the above-mentioned average parameter, and returns it to the above-mentioned central node.
  • The method further includes: dividing the training model into multiple training sub-models and allocating the training sub-models to the computing nodes; it also includes:
  • Multiple training samples are allocated to each computing node, so that each computing node performs an iterative training process based on the corresponding training samples.
  • the above distributes multiple training samples to each computing node, including:
  • Multiple training samples are allocated to each computing node based on the sampling method, or multiple training samples are split according to data dimensions and allocated to each computing node.
  • The above-mentioned allocation of multiple training samples to each computing node based on a sampling method includes: allocating the above-mentioned multiple training samples to each computing node through random sampling with replacement and/or local shuffled sampling; or allocating the above-mentioned multiple training samples to each computing node through random sampling with replacement and/or global shuffled sampling.
  • The above-mentioned splitting of multiple training samples according to data dimensions and allocating them to each computing node includes: when each training sample has multi-dimensional attributes or features, splitting the above-mentioned multiple training samples according to different attributes, and allocating the split sample subsets to the corresponding computing nodes.
  • the above-mentioned fusion of the local parameters of the N above-mentioned key nodes to obtain the global parameters includes: calculating the average value of the local parameters of the N above-mentioned key nodes, and determining the above-mentioned average value as the above-mentioned global parameters.
  • embodiments of the present application provide a data communication method, which is applied to computing nodes in a distributed training system.
  • the above method includes:
  • When the communication trigger condition is met, the computing node compresses its own local parameters based on a preset compression algorithm and transmits the compressed local parameters to the above-mentioned central node;
  • the corresponding training task is executed based on the above global parameters.
  • The above-mentioned preset compression algorithm is: C[x] = λ·(‖x‖₂ / d)·sign(x), where x is the above-mentioned local parameter, ‖x‖₂ is the L2 norm of x, sign(x) is the sign of x, d is the dimension of the above-mentioned local parameter, λ is a coefficient of the compression algorithm, x_i is the i-th dimension of x, and C[x] is the compressed local parameter.
  • The above-mentioned compression of the node's own local parameters based on the preset compression algorithm and transmission of the compressed local parameters to the central node includes: when all key nodes participating in the current round of training have completed the current round of training tasks, compressing the node's own local parameters based on the preset compression algorithm and transmitting the compressed local parameters to the central node.
  • The above-mentioned acquisition of the global parameters sent by the central node includes: when the above-mentioned computing node is selected as a key node participating in the next round of training, obtaining the global parameters sent by the above-mentioned central node; executing the corresponding training task based on the above-mentioned global parameters when the training command sent by the central node is received includes: when the training command sent by the above-mentioned central node is received, executing the next round of training tasks based on the above-mentioned global parameters.
  • The above method also includes: obtaining some of the multiple training sub-models, wherein the above-mentioned multiple training sub-models are sub-models obtained by dividing the training model associated with the above-mentioned training task, and the multiple training sub-models are allocated to the computing nodes in the above-mentioned distributed training system; and/or obtaining some of the multiple training samples, wherein the above-mentioned multiple training samples are training samples associated with the above-mentioned training task, the multiple training samples are allocated to the computing nodes, and each computing node is used to perform an iterative training process based on the corresponding training samples.
  • an information fusion device which is applied to the central node in the distributed training system.
  • the above device includes:
  • the first acquisition module is configured to acquire the local parameters of each computing node in the above-mentioned distributed training system when the communication trigger condition is met; wherein the above-mentioned communication trigger condition is that all key nodes participating in this round of training have completed the current round of training tasks. ;
  • the fusion module is set to select N key nodes from each computing node to participate in the next round of training, and fuse the local parameters of the N key nodes to obtain global parameters;
  • the sending module is configured to send the above-mentioned global parameters to each computing node, and send training commands to the above-mentioned key nodes, so that the above-mentioned key nodes execute the next round of training tasks based on the above-mentioned global parameters.
  • Embodiments of the present application provide a data communication device, which is applied to the computing nodes in a distributed training system.
  • the above device includes:
  • the compression module is configured to perform a compression operation on its own local parameters based on a preset compression algorithm when the communication trigger conditions are met, and transmit the compressed local parameters to the above-mentioned central node;
  • the second acquisition module is configured to acquire the global parameters sent by the above-mentioned central node; wherein the above-mentioned global parameters are obtained by the above-mentioned central node fusing the local parameters of N key nodes;
  • the execution module is configured to execute the corresponding training task based on the above global parameters when receiving the training command sent by the central node.
  • Embodiments of the present application provide an electronic device, including:
  • a memory configured to store a computer program; and
  • a processor configured to implement the steps of the above information fusion method or data communication method when executing the above computer program.
  • Embodiments of the present application provide a computer non-volatile readable storage medium.
  • A computer program is stored on the computer non-volatile readable storage medium.
  • When the computer program is executed by a processor, the steps of the above information fusion method or data communication method are implemented.
  • the central node only selects N key nodes for information fusion, effectively reducing the number of fused computing nodes.
  • In the next round of training, only the N selected key nodes execute training tasks while the other computing nodes do not, which improves the distributed training speed of the model.
  • The computing node compresses its own local parameters based on a preset compression algorithm before transmitting them to the central node, thereby reducing the communication volume between the central node and the computing node and reducing the communication overhead between them.
  • the embodiments of the present application also disclose an information fusion device, a data communication device, an electronic device, and a computer non-volatile readable storage medium, which can also achieve the above technical effects.
  • Figure 1 is a schematic diagram of centralized architecture and decentralized architecture
  • Figure 2 is a diagram of a distributed node information fusion framework oriented to parameter server architecture according to an exemplary embodiment
  • Figure 3 is a flow chart of an information fusion method according to an exemplary embodiment
  • Figure 4 is a flow chart of a data communication method according to an exemplary embodiment
  • Figure 5 is a structural diagram of an information fusion device according to an exemplary embodiment
  • Figure 6 is a structural diagram of a data communication device according to an exemplary embodiment
  • FIG. 7 is a structural diagram of an electronic device according to an exemplary embodiment.
  • the distributed node information fusion framework diagram for parameter server architecture provided by the embodiment of the present application is shown in Figure 2, including data/model partition components, parameter server architecture distributed training system components, node selection and data compression technology components, and training result output components.
  • The data/model partition component mainly completes the task of inputting the data set and model to be processed: the data splitting module is mainly responsible for splitting the data set and deploying the split sub-data sets onto the corresponding computing nodes; the model splitting module is mainly responsible for splitting the original large model into several smaller sub-models.
  • the parameter server architecture distributed training system components are mainly used to complete actual training tasks.
  • The node selection and data compression technology components are the core technologies of the entire distributed training system framework: the node selection module completes the selection of key computing nodes and avoids computing the information of all computing nodes, thereby effectively alleviating the "communication bottleneck" problem of the parameter server;
  • the data compression module compresses communication traffic from the perspective of data, thereby increasing the speed of model training.
  • the original distributed training system in Figure 2 includes computing node 1, computing node 2, computing node 3 and parameter server node. Through the selection method designed by the node selection module, the computing node 2 that does not meet the conditions is eliminated. Therefore, in the subsequent iteration process, only computing node 1, computing node 3 and the parameter server node actually participate in the calculation.
  • data compression technology is used for the communication information (such as gradients, model parameters, etc.) of computing node 1 and computing node 3 respectively to reduce the communication volume.
  • There are two main roles in the parameter server architecture: worker and server.
  • the worker is mainly responsible for: first, completing local training tasks based on its local data samples; second, communicating with the server through the client interface.
  • the server is mainly responsible for: first, aggregating or merging the local gradients sent by each worker; second, updating the global model parameters and returning them to each worker.
  • the training result output component is responsible for outputting the global solution of the training task and presenting it in a visual way to facilitate subsequent improvement and optimization.
  • each component performs its own duties and collaborates to complete various complex training tasks.
  • the embodiment of this application discloses an information fusion method that improves the distributed training speed of the model.
  • FIG. 3 a flow chart of an information fusion method is shown according to an exemplary embodiment. As shown in Figure 3, it includes:
  • the distributed training system includes multiple computing nodes (workers) and a central node (server). Each worker is bidirectionally connected to the server, indicating that data transmission can be bidirectional. However, there is no direct connection between workers. Each worker can independently carry out its own training tasks using the global parameters provided by the server. In detail, each worker communicates with the server through the following two operations: one is PULL (pull), that is, the worker obtains global parameters from the server; the other is PUSH (push), that is, the worker sends local parameters to the server.
  • the execution subject is the central node.
  • Each worker sends its respective local parameters to the server; when all key nodes participating in the current round of training have completed the current round of training tasks, the server can obtain the local parameters of each worker. The key nodes are the N nodes, selected by the central node among all computing nodes before the current round of training, that participate in the training, where N = 1, N = 2, or N is a positive integer greater than or equal to 3.
  • the above N nodes participating in training may be some nodes among all computing nodes.
  • the communication trigger condition can be that all key nodes participating in this round of training have completed a preset number of iterative training processes. This embodiment does not limit the preset number of times.
  • For example, if the preset number is 1, the workers synchronize local parameters with the server after every iteration; if the preset number is 10, each worker synchronizes local parameters with the server after completing 10 iterations; if the preset number is T (the total number of iterations), each worker synchronizes local parameters with the server only after completing all iterations.
  • S102 Select N key nodes to participate in the next round of training among each computing node, and fuse the local parameters of the N key nodes to obtain global parameters;
  • N key nodes are re-selected for the next round of training, and the local parameters of the N key nodes are fused to obtain the global parameters. That is, the average of the local parameters of the N key nodes is calculated as the global parameter.
  • The above-mentioned selection, among the computing nodes, of N key nodes to participate in the next round of training includes: calculating the average parameter of the local parameters of the above-mentioned key nodes, determining the deviation between the local parameters of each computing node and the above-mentioned average parameter, and selecting the N computing nodes with the smallest deviation as the key nodes participating in the next round of training; as a feasible implementation, the N selected computing nodes may be some of the above-mentioned computing nodes.
  • The server computes the average parameter of the local parameters and determines the deviation v_r between the local parameters of the r-th worker and the average parameter; v_1, v_2, ..., v_r, ..., v_n are arranged in descending order, and the N workers with the smaller deviations are selected as key nodes.
  • The above-mentioned determination of the deviation between the local parameters of each computing node and the above-mentioned average parameter includes: sending the above-mentioned average parameter to each computing node, so that each computing node calculates the deviation between its own local parameters and the above-mentioned average parameter and returns it to the above-mentioned central node.
  • In an optional embodiment, the server returns the calculated average parameter to each worker, and each worker calculates the deviation between its own local parameters and the average parameter and returns the deviation to the server.
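  • As an illustration only (not part of the original disclosure), the following is a minimal Python sketch of this selection-and-fusion step. It assumes the deviation is the Euclidean distance between a worker's local parameter vector and the average parameter, and that fusion is a plain average over the selected key nodes; for brevity the deviations are computed on the server, whereas in the embodiment described above each worker computes its own deviation and returns it to the central node.

    import numpy as np

    def select_key_nodes_and_fuse(local_params, n_keys):
        """Server-side step: pick the N workers closest to the average and fuse their parameters.

        local_params: dict mapping worker id -> local parameter vector pushed by that worker.
        n_keys:       number N of key nodes to keep for the next round of training.
        Returns (key_node_ids, global_params).
        """
        # Average parameter over the workers that reported this round.
        avg = np.mean(list(local_params.values()), axis=0)

        # Deviation of each worker from the average (Euclidean norm assumed here).
        deviation = {w: float(np.linalg.norm(p - avg)) for w, p in local_params.items()}

        # The N workers with the smallest deviation become the next round's key nodes.
        key_ids = sorted(deviation, key=deviation.get)[:n_keys]

        # Global parameters = average of the key nodes' local parameters.
        global_params = np.mean([local_params[w] for w in key_ids], axis=0)
        return key_ids, global_params

    # Hypothetical usage with 4 workers holding 3-dimensional parameter vectors:
    rng = np.random.default_rng(0)
    pushed = {w: rng.normal(size=3) for w in range(4)}
    key_ids, global_params = select_key_nodes_and_fuse(pushed, n_keys=2)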
  • S103 Send the above-mentioned global parameters to each computing node, and send training commands to the above-mentioned key nodes, so that the above-mentioned key nodes execute the next round of training tasks based on the above-mentioned global parameters.
  • the central node sends the global parameters to each computing node, and the computing node updates its own local parameters to the global parameters.
  • However, only the key nodes perform the next round of training tasks; the other computing nodes, which are not selected as key nodes, do not perform the next round of training tasks, which improves the distributed training speed of the model.
  • The inputs of the distributed node information fusion algorithm oriented to the parameter server architecture are: the total number of iterations T, the learning rate η, the initial model parameters x_0, the iterative communication trigger condition σ, and the number of key nodes N; the output is: the globally converged model parameters x_T.
  • the execution process is:
  • Each worker concurrently executes the computing node training function Worker_Training(t);
  • the server executes the server node training function Server_Training(t);
  • the computing node training function Worker_Training(t) can be defined as:
  • the worker PULLs the latest global parameter x from the server
  • the parameter server node training function Server_Training(t) is defined as follows:
  • the server sends the above global parameters to each worker
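  • To make this control flow concrete, here is a much-simplified toy sketch in Python; it is not the patent's algorithm. The local update is assumed to be one SGD step on a least-squares toy problem, the trigger condition σ is assumed to be "every sync_period iterations", and the selection and fusion step uses the averaging rule sketched after the deviation discussion above.

    import numpy as np

    rng = np.random.default_rng(0)
    d, T, lr, sync_period, n_keys = 5, 100, 0.05, 10, 2   # toy sizes and hyperparameters

    def local_sgd_step(x, lr):
        """Stand-in for Worker_Training: one SGD step on a random least-squares sample."""
        a, b = rng.normal(size=x.size), rng.normal()
        return x - lr * (a @ x - b) * a

    x_global = np.zeros(d)                   # initial model parameters x_0
    key_nodes = [0, 1, 2, 3]                 # all workers are treated as key nodes at the start
    local = {w: x_global.copy() for w in key_nodes}

    for t in range(T):
        # Key nodes train concurrently on their own local parameters.
        for w in key_nodes:
            local[w] = local_sgd_step(local[w], lr)
        if (t + 1) % sync_period == 0:       # communication trigger condition sigma (assumed form)
            avg = np.mean([local[w] for w in key_nodes], axis=0)
            # Re-select the N workers closest to the average as the next round's key nodes.
            key_nodes = sorted(key_nodes, key=lambda w: np.linalg.norm(local[w] - avg))[:n_keys]
            # Fuse the key nodes' parameters into the global parameters and "broadcast" them (PULL).
            x_global = np.mean([local[w] for w in key_nodes], axis=0)
            local = {w: x_global.copy() for w in key_nodes}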
  • the central node only selects N key nodes for information fusion, effectively reducing the number of fused computing nodes.
  • In the next round of training, only the N selected key nodes execute training tasks while the other computing nodes do not, which improves the distributed training speed of the model.
  • the data/model partition component is responsible for taking the data set and model to be processed as input to the deep learning training task, and providing an interface for users to access the data or model.
  • This component contains the data splitting module (also known as data parallelism) and the model splitting module (also known as model parallelism).
  • the method further includes: dividing the training model into multiple training sub-models, and allocating the above-mentioned training sub-models to each computing node.
  • Model parallelism splits the model parameters into multiple sub-models and assigns each sub-model to a different computing node. It is worth noting that, due to the particularity of neural network models, namely their layered structure, they have significant advantages in applying model parallelism. According to the splitting method, a neural network model can be split horizontally or vertically; that is, the above-mentioned division of the training model into multiple training sub-models includes: dividing the training model into multiple training sub-models in the horizontal direction or the vertical direction.
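  • Purely as an illustration of layer-wise model splitting (the patent does not specify a particular partitioning rule), a minimal Python sketch that assigns contiguous groups of layers of a small feed-forward network to different computing nodes:

    def split_model_by_layers(layer_sizes, num_nodes):
        """Partition a layered model into contiguous groups of layers, one group per computing node.

        layer_sizes: e.g. [784, 512, 256, 10] describes an MLP with three weight matrices.
        Returns one list of layer indices per node (a simple split along the depth of the network).
        """
        n_layers = len(layer_sizes) - 1                 # number of weight matrices
        per_node = -(-n_layers // num_nodes)            # ceiling division
        return [list(range(i, min(i + per_node, n_layers)))
                for i in range(0, n_layers, per_node)]

    # Example: a 3-layer MLP split across 2 computing nodes -> [[0, 1], [2]]
    print(split_model_by_layers([784, 512, 256, 10], num_nodes=2))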
  • the method further includes: allocating multiple training samples to each computing node, so that each computing node performs an iterative training process based on the corresponding training samples.
  • Data parallelism relies on multiple processors (computing nodes) in a parallel computing environment to subdivide the data set to achieve split computing.
  • Data parallel algorithms focus on distributing data on different parallel computing nodes, and each computing node executes the same computing model.
  • According to the splitting strategy of the data set, the data-parallel mode is divided into sample-based data parallelism and sample-dimension-based data parallelism. That is, the above-mentioned allocation of multiple training samples to each computing node includes: allocating multiple training samples to each computing node based on a sampling method, or splitting multiple training samples according to data dimensions and allocating them to each computing node.
  • Sample-based data parallelism: assume that the data set of the distributed training system contains multiple data samples and that there are multiple computing nodes.
  • These samples are distributed to the computing nodes through two methods: random sampling with replacement and local (or global) shuffled sampling.
  • Data parallelism based on sample dimensions: assume that the data set contains multiple samples and that each sample has multi-dimensional attributes or features.
  • The distributed training system includes multiple computing nodes. Starting from the sample attribute dimension, the samples are split according to different attributes, and the split sample subsets are allocated to the corresponding computing nodes.
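  • A minimal Python sketch of the two partitioning styles just described, assuming the data set is stored as a NumPy matrix whose rows are samples and whose columns are attributes; the concrete sampling and shuffling choices below are illustrative assumptions, not the patent's exact procedure.

    import numpy as np

    rng = np.random.default_rng(42)

    def split_by_samples(data, num_nodes, with_replacement=False):
        """Sample-based data parallelism: each node receives a subset of the rows (samples)."""
        if with_replacement:
            # Random sampling with replacement: one bootstrap subset per computing node.
            return [data[rng.integers(0, len(data), size=len(data) // num_nodes)]
                    for _ in range(num_nodes)]
        # (Global) shuffled sampling: shuffle the rows once, then split them evenly.
        return np.array_split(data[rng.permutation(len(data))], num_nodes)

    def split_by_dimensions(data, num_nodes):
        """Dimension-based data parallelism: each node receives a subset of the columns (attributes)."""
        return np.array_split(data, num_nodes, axis=1)

    # Example: 1000 samples with 8 attributes distributed across 4 computing nodes.
    data = rng.normal(size=(1000, 8))
    sample_shards = split_by_samples(data, num_nodes=4)       # 4 shards of 250 samples each
    feature_shards = split_by_dimensions(data, num_nodes=4)   # 4 shards of 2 attributes each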
  • the model splitting module and the data splitting module are used simultaneously, resulting in a hybrid splitting strategy of data and models.
  • the hybrid splitting strategy of data and model (hybrid parallelism), as the name suggests, combines the data parallel mode and the model parallel mode at the same time.
  • On the one hand the data set is split, and on the other hand the model is also split, so that the approach can be applied to more complex model training tasks.
  • the embodiment of the present application discloses a data communication method, which reduces the communication overhead between the central node and the computing node.
  • FIG. 4 a flow chart of a data communication method is shown according to an exemplary embodiment. As shown in Figure 4, it includes:
  • the execution subject of this embodiment is the computing node in the distributed training system.
  • the computing node needs to transmit its own local parameters to the central node.
  • The original 1-bit compression technique is defined as follows: let C[·] denote the compression operation, ‖·‖₁ denote the L1 norm of a vector, x ∈ R^d denote a d-dimensional real vector, and sign(x) denote taking the sign of the vector x; then the 1-bit compression of the vector x is C[x] = (‖x‖₁ / d)·sign(x).
  • Although this compression reduces the communication volume, in some cases it produces errors. For example, for the vector x = [1, -2, 3] and the vector y = [1, 2, 3], C[x] = (|1| + |-2| + |3|)/3·(+) and C[y] = (|1| + |2| + |3|)/3·(+): the compression results of the two vectors are the same. In other words, different vectors yield the same result after the original 1-bit compression, so this kind of compression obviously produces errors. On the contrary, the goal of compression should be to differentiate vectors as much as possible. To this end, this embodiment designs an improved 1-bit compression technique to avoid the above problem.
  • The improved 1-bit compression technique (that is, the preset compression algorithm) is: C[x] = λ·(‖x‖₂ / d)·sign(x), where x is the above-mentioned local parameter, ‖x‖₂ is the L2 norm of x, sign(x) is the sign of x, d is the dimension of the above-mentioned local parameter, λ is a coefficient of the compression algorithm, x_i is the i-th dimension of x, and C[x] is the compressed local parameter.
  • There are two main differences between the improved scheme and the original scheme: first, the λ coefficient is used to avoid the error problem; second, the L2 norm replaces the original L1 norm, because the L2 norm has better mathematical properties. It can be seen that, through the above preset compression algorithm, the original 32-bit or 16-bit training data can be compressed to 1 bit, thereby reducing the communication overhead between the central node and the computing node.
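  • A hedged Python sketch of this kind of scaled 1-bit compression. The reconstructed form C[x] = λ·(‖x‖₂/d)·sign(x) follows the definitions above, but the exact definition of the coefficient λ is given only in a formula image of the original filing and is not recoverable here, so lam_fn below is a caller-supplied placeholder (defaulting to λ = 1), and sign(·) is assumed to act elementwise.

    import numpy as np

    def one_bit_compress(x, lam_fn=None):
        """Compress a parameter/gradient vector x into a scalar scale plus one sign bit per coordinate.

        lam_fn(x) stands in for the patent's lambda coefficient; its exact formula is not
        reproduced here, so a neutral lambda = 1.0 is used by default.
        Returns (scale, sign_bits); the receiver reconstructs scale * sign_bits.
        """
        lam = 1.0 if lam_fn is None else float(lam_fn(x))
        scale = lam * np.linalg.norm(x, ord=2) / x.size   # L2 norm replaces the original L1 norm
        sign_bits = np.sign(x).astype(np.int8)            # 1 bit of sign information per coordinate
        return scale, sign_bits

    def one_bit_decompress(scale, sign_bits):
        """Central-node-side reconstruction of the compressed vector."""
        return scale * sign_bits.astype(np.float32)

    # Example: a 32-bit float vector is reduced to one float scale plus d sign bits.
    x = np.array([1.0, -2.0, 3.0], dtype=np.float32)
    scale, bits = one_bit_compress(x)
    x_hat = one_bit_decompress(scale, bits)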
  • this embodiment also includes: obtaining the average parameter sent by the above-mentioned central node, and calculating the deviation between its own local parameters and the above-mentioned average parameter and returning it to the above-mentioned central node.
  • S202 Obtain the global parameters sent by the above-mentioned central node; wherein the above-mentioned global parameters are obtained by the above-mentioned central node fusing local parameters of N key nodes;
  • In the data communication method provided by the embodiments of the present application, the computing node compresses its own local parameters based on a preset compression algorithm before transmitting them to the central node, thereby reducing the communication volume between the central node and the computing node and reducing the communication overhead between them.
  • An information fusion device provided by an embodiment of the present application is introduced below.
  • An information fusion device described below and an information fusion method described above may be referred to each other.
  • FIG. 5 a structural diagram of an information fusion device is shown according to an exemplary embodiment. As shown in Figure 5, it includes:
  • the first acquisition module 501 is configured to acquire the local parameters of each computing node in the above-mentioned distributed training system when the communication trigger condition is met; wherein the above-mentioned communication trigger condition is that all key nodes participating in this round of training have completed this round of training. Task;
  • the fusion module 502 is configured to select N key nodes from each computing node to participate in the next round of training, and fuse the local parameters of the N key nodes to obtain global parameters;
  • the sending module 503 is configured to send the above-mentioned global parameters to each computing node, and send training commands to the above-mentioned key nodes, so that the above-mentioned key nodes execute the next round of training tasks based on the above-mentioned global parameters.
  • the central node only selects N key nodes for information fusion, effectively reducing the number of fused computing nodes.
  • In the next round of training, only the N selected key nodes execute training tasks while the other computing nodes do not, which improves the distributed training speed of the model.
  • the above communication triggering condition is that all key nodes participating in this round of training execute a preset number of iterative training processes.
  • the above fusion module 502 includes:
  • the selection unit is set to calculate the average parameter of the local parameters of the above-mentioned key nodes, determine the deviation of the local parameters of each calculation node from the above-mentioned average parameter, and select the N calculation nodes with the smallest deviation as the key nodes to participate in the next round of training;
  • the fusion unit is set to fuse the local parameters of the N above-mentioned key nodes to obtain the global parameters.
  • the above selection unit is configured to: calculate the average parameters of the local parameters of the above key nodes, and send the above average parameters to each computing node, so that each computing node calculates The deviation between its own local parameters and the above-mentioned average parameters is returned to the above-mentioned central node, and the N computing nodes with the smallest deviation are selected as key nodes to participate in the next round of training.
  • the first allocation module is configured to divide the training model into multiple training sub-models and allocate the above-mentioned training sub-models to computing nodes.
  • the above-mentioned first allocation module is configured to: divide the training model into multiple training sub-models in the horizontal direction or vertical direction, and allocate the above-mentioned training sub-models to each computing node.
  • the second allocation module is configured to allocate multiple training samples to each computing node, so that each computing node performs an iterative training process based on the corresponding training samples.
  • The above-mentioned second allocation module is configured to: allocate multiple training samples to each computing node based on a sampling method, or split multiple training samples according to data dimensions and allocate them to each computing node.
  • a data communication device provided by an embodiment of the present application is introduced below.
  • the data communication device described below and the data communication method described above may be referred to each other.
  • FIG. 6 a structural diagram of a data communication device according to an exemplary embodiment is shown. As shown in Figure 6, it includes:
  • the compression module 601 is configured to perform a compression operation on its own local parameters based on a preset compression algorithm when the communication triggering conditions are met, and transmit the compressed local parameters to the above-mentioned central node;
  • the second acquisition module 602 is configured to acquire the global parameters sent by the above-mentioned central node; wherein the above-mentioned global parameters are obtained by the above-mentioned central node fusing the local parameters of N key nodes;
  • the execution module 603 is configured to execute the corresponding training task based on the above-mentioned global parameters when receiving the training command sent by the central node.
  • In the data communication device provided by the embodiments of the present application, the computing node compresses its local parameters based on a preset compression algorithm before transmitting them to the central node, thereby reducing the communication volume between the central node and the computing node and reducing the communication overhead between them.
  • The above preset compression algorithm is: C[x] = λ·(‖x‖₂ / d)·sign(x), where x is the above-mentioned local parameter, ‖x‖₂ is the L2 norm of x, sign(x) is the sign of x, d is the dimension of the above-mentioned local parameter, λ is a coefficient of the compression algorithm, x_i is the i-th dimension of x, and C[x] is the compressed local parameter.
  • the calculation module is configured to obtain the average parameters sent by the above-mentioned central node, and calculate the deviation between its own local parameters and the above-mentioned average parameters and return it to the above-mentioned central node.
  • Figure 7 is a structural diagram of an electronic device according to an exemplary embodiment. As shown in Figure 7, the electronic device includes:
  • Communication interface 1 can interact with other devices such as network devices;
  • the processor 2 is connected to the communication interface 1 to implement information interaction with other devices, and is configured to execute the information fusion method or data communication method provided by one or more of the above technical solutions when running a computer program.
  • the above computer program is stored in the memory 3 .
  • The various components in the electronic device are coupled together through the bus system 4.
  • It can be understood that the bus system 4 is configured to enable connection and communication between these components.
  • In addition to a data bus, the bus system 4 also includes a power bus, a control bus and a status signal bus.
  • the various buses are labeled as bus system 4 in Figure 7 .
  • the memory 3 in the embodiment of the present application is configured to store various types of data to support the operation of the electronic device.
  • Examples of such data include: any computer program configured to operate on an electronic device.
  • the memory 3 may be a volatile memory or a non-volatile memory, or may include both volatile and non-volatile memories.
  • The non-volatile memory can be a read-only memory (ROM, Read Only Memory), a programmable read-only memory (PROM, Programmable Read-Only Memory), an erasable programmable read-only memory (EPROM, Erasable Programmable Read-Only Memory), an electrically erasable programmable read-only memory (EEPROM, Electrically Erasable Programmable Read-Only Memory), a ferromagnetic random access memory (FRAM, ferromagnetic random access memory), a flash memory (Flash Memory), a magnetic surface memory, an optical disc, or a compact disc read-only memory (CD-ROM, Compact Disc Read-Only Memory); the magnetic surface memory can be a disk memory or a tape memory.
  • Volatile memory can be random access memory (RAM, Random Access Memory), which is used as an external cache.
  • By way of example but not limitation, many forms of RAM are available, such as static random access memory (SRAM, Static Random Access Memory), synchronous static random access memory (SSRAM, Synchronous Static Random Access Memory), dynamic random access memory (DRAM, Dynamic Random Access Memory), synchronous dynamic random access memory (SDRAM, Synchronous Dynamic Random Access Memory), double data rate synchronous dynamic random access memory (DDRSDRAM, Double Data Rate Synchronous Dynamic Random Access Memory), enhanced synchronous dynamic random access memory (ESDRAM, Enhanced Synchronous Dynamic Random Access Memory), SyncLink dynamic random access memory (SLDRAM, SyncLink Dynamic Random Access Memory), and direct Rambus random access memory (DRRAM, Direct Rambus Random Access Memory).
  • the memory 3 described in the embodiments of the present application is intended to include, but is not limited to, these and any other suitable types of memory.
  • the methods disclosed in the above embodiments of the present application can be applied to the processor 2 or implemented by the processor 2 .
  • the processor 2 may be an integrated circuit chip with signal processing capabilities. During the implementation process, each step of the above method can be completed by instructions in the form of hardware integrated logic circuits or software in the processor 2 .
  • the above-mentioned processor 2 may be a general-purpose processor, a DSP, or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc.
  • the processor 2 can implement or execute each method, step and logical block diagram disclosed in the embodiment of this application.
  • a general-purpose processor may be a microprocessor or any conventional processor, etc.
  • the steps of the method disclosed in the embodiments of this application can be directly implemented by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor.
  • the software module may be located in a storage medium, and the storage medium is located in the memory 3.
  • the processor 2 reads the program in the memory 3 and completes the steps of the foregoing method in combination with its hardware.
  • The embodiments of the present application also provide a storage medium, that is, a computer storage medium, which may be a computer non-volatile readable storage medium, for example a memory 3 storing a computer program.
  • The above computer program can be executed by the processor 2 to complete the steps of the aforementioned information fusion method or data communication method.
  • Computer non-volatile readable storage media can be FRAM, ROM, PROM, EPROM, EEPROM, Flash Memory, magnetic surface memory, optical disk, or CD-ROM and other memories.
  • the aforementioned program can be stored in a computer-readable storage medium.
  • When the program is executed, the steps of the above method embodiments are performed; the aforementioned storage medium includes various media that can store program code, such as a mobile storage device, a ROM, a RAM, a magnetic disk or an optical disk.
  • If the above-mentioned integrated units in the embodiments of the present application are implemented in the form of software function modules and sold or used as independent products, they may also be stored in a computer-readable storage medium.
  • The technical solutions of the embodiments of the present application, in essence or in the part that contributes to the related art, can be embodied in the form of a software product.
  • The computer software product is stored in a storage medium and includes a number of instructions to cause an electronic device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the methods described in the various embodiments of the present application.
  • the aforementioned storage media include: mobile storage devices, ROM, RAM, magnetic disks or optical disks and other media that can store program codes.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Computational Mathematics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Algebra (AREA)
  • Computer And Data Communications (AREA)

Abstract

The embodiments of the present application disclose an information fusion method, a data communication method, an apparatus, an electronic device and a computer non-volatile readable storage medium, relating to the field of computer technology. The information fusion method includes: when a communication trigger condition is met, obtaining the local parameters of each computing node in a distributed training system, wherein the communication trigger condition is that all key nodes participating in the current round of training have completed the current round of training tasks; selecting, among the computing nodes, N key nodes to participate in the next round of training, and fusing the local parameters of the N key nodes to obtain global parameters; and sending the global parameters to each computing node and sending a training command to the key nodes, so that the key nodes execute the next round of training tasks based on the global parameters. The embodiments of the present application improve the distributed training speed of the model.

Description

Information fusion method, data communication method, apparatus, electronic device and non-volatile readable storage medium
Cross-Reference to Related Applications
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on July 18, 2022, with application number 202210838709.3 and entitled "Information fusion method, data communication method, apparatus, electronic device and storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of computer technology, and more specifically, to an information fusion method, a data communication method, an information fusion device, a data communication device, an electronic device and a non-volatile readable storage medium.
Background
In the past, due to limitations of data and hardware, single-machine training based on the limited samples that could be obtained was the main approach to training machine learning models. In recent years, however, the rapid development of big data, artificial intelligence, high-performance computing and Internet technologies has given rise to all kinds of massive and complex data and models, driving machine learning and deep learning model training to evolve steadily toward distributed computing architectures and making this a key measure for achieving breakthroughs of artificial intelligence technology in computer vision, natural language processing, speech recognition, autonomous driving and other fields. Compared with the traditional single-machine training approach, distributed training technology has two significant advantages:
First, the scale of the data and models that distributed training technology can store keeps increasing. The current distributed training system is a network or cluster composed of multiple computing nodes, and each computing node can consist of one or more hosts, each of which has a relatively independent storage device or storage unit. Compared with a single-machine training approach with only one computing node, the scale of the data and models that a distributed system can store is significantly larger, making it possible to train deep neural network models with more than ten billion parameters on large-scale data sets.
Second, the training time of distributed training technology keeps decreasing. The distributed training system effectively shortens the execution time of training tasks through a variety of computing and communication architectures. In detail, the distributed training system draws on the idea of "divide and conquer": first, the deep neural network model or large data set to be trained is split in a model-parallel, data-parallel or hybrid-parallel manner and allocated to the corresponding computing nodes; then, each computing node separately trains the split small-scale data or sub-model and generates local or intermediate training results; finally, the distributed training system aggregates all local training results in some way to obtain a global result and outputs the global training result. The above processes are carried out simultaneously in parallel, which can greatly reduce the time of traditional serial single-machine training.
In summary, distributed training has become a hot and key technology in the era of big data, and a large amount of related research and practice has been carried out in both academia and industry. In order to solve the problem of efficiently training deep neural network models with massive parameters on large data sets, researchers have focused on exploring inter-node communication methods for distributed deep learning model training. Existing related techniques can be divided into centralized-architecture algorithms and decentralized-architecture algorithms according to the communication architecture, and into synchronous algorithms and asynchronous algorithms according to the information synchronization method.
As shown in Figure 1, the centralized architecture mainly means that there is a central node in the distributed training system, which is responsible for information interaction with the other computing nodes and for the synchronization and updating of global information. Among centralized architectures, the parameter server architecture is the most typical. There are two main roles in this architecture: the parameter server (server) and the computing node (worker). The parameter server is responsible for collecting information such as gradients or parameters sent by the computing nodes, performing global computation and synchronization on the collected information, and returning the global synchronization information to each computing node. A computing node receives the global information sent by the parameter server, performs subsequent iterative computation, and sends the newly generated computation results to the parameter server. The parameter server and the computing nodes iterate repeatedly according to the above process until the training end condition is reached. In contrast, in the decentralized architecture there is no node similar to a central parameter server, and all computing nodes have an "equal" status. In this architecture there is only one role: the computing node. Each computing node holds only its own local data or local model parameters and, in each round of iterative training, achieves the communication and fusion of global or local information through operations such as All-Reduce, rather than through a dedicated central node.
In addition, a comparative analysis of the advantages and disadvantages of the centralized and decentralized architectures is as follows. The advantages of the centralized architecture are: first, there is no direct information interaction or communication between the computing nodes, and each computing node communicates only with the central parameter server node, so the training processes of the nodes are relatively independent; in other words, each node communicates with the central parameter server node at its own training speed, which naturally supports asynchronous communication strategies. Second, because the central parameter server is responsible for fusing the global information and sending it to each computing node, the model training accuracy and the convergence of the algorithm are fully guaranteed. Finally, the centralized architecture has good fault tolerance: when a node is added or removed, the change does not directly affect the other nodes. The disadvantage is that the central parameter server node easily falls into a "communication bottleneck": under the condition that its bandwidth is limited, as the number of computing nodes keeps increasing and all of them communicate with the central parameter server node, the central parameter server node will encounter a communication bottleneck. The advantage of the decentralized architecture is that each computing node usually interacts and communicates only with its neighbor nodes; computation based on local information is small and, to a certain extent, increases the training speed. The disadvantage is that, due to the lack of global information synchronization by a central node, the model training accuracy is poor and model training may even fail.
The main idea of the synchronous algorithm is: when a computing node in the distributed training system completes the current round of iteration, it must wait for the other computing nodes to complete their current round of iteration tasks before they can jointly proceed to the next round of training iterations. A typical synchronous algorithm is the bulk synchronous parallel (BSP) algorithm. In detail, in the BSP algorithm, when a computing node completes its current iteration task, it needs to synchronize information such as model parameters or gradients with the other computing nodes through different communication topologies; then they enter the next round of iteration from the same "starting point". To ensure that the iterations start from the same point, the BSP algorithm introduces a global synchronization barrier. It works by forcing the computing nodes with stronger processing power and faster iterations to stop at the synchronization barrier and wait for the computing nodes with weaker processing power and slower iterations to complete their current round of iteration tasks before the training system executes the next round of iterations. The advantage of the synchronous algorithm is that it guarantees the consistency of the model parameters of each computing node, thereby ensuring that the convergence analysis of the algorithm has a theoretical basis. Its disadvantage is that the system performance is limited by the node with the slowest training speed, i.e., the straggler effect.
The main idea of the asynchronous algorithm is that when a computing node in the system completes its current round of iteration, it can continue with the next round of iteration without waiting for the other computing nodes. The benefit of this algorithm is that it avoids the straggler effect of the synchronous algorithm and makes full use of the system performance. However, because computing nodes with large performance differences produce local gradient information of differing staleness that is used by the central node, the asynchronous algorithm leads to the problem of stale gradients.
In summary, although there are existing methods and algorithms for the communication problem of deep learning model training, they have the following shortcoming: the algorithm logic is complex and the amount of computation is large, which limits algorithm performance. Effective solutions to deep learning problems usually rely on large data sets and large models. However, existing research has shown that training a neural network model with an inefficient communication method takes at least several weeks, so it is difficult to apply to time-sensitive task scenarios.
Summary of the Invention
The purpose of the embodiments of the present application is to provide an information fusion method, a data communication method, an information fusion device, a data communication device, an electronic device and a computer non-volatile readable storage medium, which improve the distributed training speed of the model and reduce the communication overhead between the central node and the computing nodes.
To achieve the above purpose, the embodiments of the present application provide an information fusion method, applied to the central node in a distributed training system, the method including:
when a communication trigger condition is met, obtaining the local parameters of each computing node in the above-mentioned distributed training system, wherein the above-mentioned communication trigger condition is that all key nodes participating in the current round of training have completed the current round of training tasks;
selecting, among the computing nodes, N key nodes to participate in the next round of training, and fusing the local parameters of the N key nodes to obtain global parameters;
sending the above-mentioned global parameters to each computing node, and sending a training command to the above-mentioned key nodes, so that the key nodes execute the next round of training tasks based on the above-mentioned global parameters.
Wherein, all key nodes participating in the current round of training having completed the current round of training tasks includes:
all key nodes participating in the current round of training having completed a preset number of iterations of the training process.
Wherein, the above-mentioned selection, among the computing nodes, of N key nodes to participate in the next round of training includes:
calculating the average parameter of the local parameters of the above-mentioned key nodes, determining the deviation between the local parameters of each computing node and the above-mentioned average parameter, and selecting the N computing nodes with the smallest deviation as the key nodes to participate in the next round of training.
Wherein, the above-mentioned determination of the deviation between the local parameters of each computing node and the above-mentioned average parameter includes:
sending the above-mentioned average parameter to each computing node, so that each computing node calculates the deviation between its own local parameters and the above-mentioned average parameter and returns it to the above-mentioned central node.
Wherein, the method further includes:
dividing the training model into multiple training sub-models, and allocating the above-mentioned training sub-models to the computing nodes.
Wherein, the above-mentioned division of the training model into multiple training sub-models includes:
dividing the training model into multiple training sub-models in the horizontal direction or the vertical direction.
Wherein, the method further includes:
allocating multiple training samples to the computing nodes, so that each computing node performs the iterative training process based on the corresponding training samples.
Wherein, the above-mentioned allocation of multiple training samples to the computing nodes includes:
allocating multiple training samples to the computing nodes based on a sampling method, or splitting multiple training samples according to data dimensions and allocating them to the computing nodes.
Wherein, the above-mentioned allocation of multiple training samples to the computing nodes based on a sampling method includes: allocating the above-mentioned multiple training samples to the computing nodes through random sampling with replacement and/or local shuffled sampling; or allocating the above-mentioned multiple training samples to the computing nodes through random sampling with replacement and/or global shuffled sampling.
Wherein, the above-mentioned splitting of multiple training samples according to data dimensions and allocating them to the computing nodes includes: when each training sample has multi-dimensional attributes or features, splitting the above-mentioned multiple training samples according to different attributes, and allocating the split sample subsets to the corresponding computing nodes.
Wherein, the above-mentioned fusion of the local parameters of the N key nodes to obtain the global parameters includes: calculating the average value of the local parameters of the N key nodes, and determining the above-mentioned average value as the above-mentioned global parameters.
To achieve the above purpose, the embodiments of the present application provide a data communication method, applied to a computing node in a distributed training system, the method including:
when a communication trigger condition is met, compressing its own local parameters based on a preset compression algorithm, and transmitting the compressed local parameters to the above-mentioned central node;
obtaining the global parameters sent by the above-mentioned central node, wherein the above-mentioned global parameters are obtained by the above-mentioned central node by fusing the local parameters of N key nodes;
when a training command sent by the central node is received, executing the corresponding training task based on the above-mentioned global parameters.
Wherein, the above-mentioned preset compression algorithm is:
C[x] = λ·(‖x‖₂ / d)·sign(x)
where x is the above-mentioned local parameter, ‖x‖₂ is the L2 norm of x, sign(x) is the sign of x, d is the dimension of the above-mentioned local parameter, λ is a coefficient of the compression algorithm, x_i is the i-th dimension of x, and C[x] is the compressed local parameter.
Wherein, the method further includes:
obtaining the average parameter sent by the above-mentioned central node, calculating the deviation between its own local parameters and the above-mentioned average parameter, and returning it to the above-mentioned central node.
Wherein, the above-mentioned compression of its own local parameters based on the preset compression algorithm when the communication trigger condition is met, and transmission of the compressed local parameters to the central node, includes: when all key nodes participating in the current round of training have completed the current round of training tasks, compressing its own local parameters based on the preset compression algorithm and transmitting the compressed local parameters to the central node. The above-mentioned obtaining of the global parameters sent by the central node includes: when the above-mentioned computing node is selected as a key node participating in the next round of training, obtaining the global parameters sent by the above-mentioned central node. The above-mentioned execution of the corresponding training task based on the global parameters when a training command sent by the central node is received includes: when the training command sent by the above-mentioned central node is received, executing the next round of training tasks based on the above-mentioned global parameters.
Wherein, the above-mentioned method further includes: obtaining some of multiple training sub-models, wherein the above-mentioned multiple training sub-models are sub-models obtained by dividing the training model associated with the above-mentioned training task, and the multiple training sub-models are allocated to the computing nodes in the above-mentioned distributed training system; and/or obtaining some of multiple training samples, wherein the above-mentioned multiple training samples are training samples associated with the above-mentioned training task, the multiple training samples are allocated to the computing nodes, and each computing node is used to perform the iterative training process based on the corresponding training samples.
To achieve the above purpose, the embodiments of the present application provide an information fusion device, applied to the central node in a distributed training system, the device including:
a first acquisition module, configured to obtain the local parameters of each computing node in the above-mentioned distributed training system when a communication trigger condition is met, wherein the above-mentioned communication trigger condition is that all key nodes participating in the current round of training have completed the current round of training tasks;
a fusion module, configured to select, among the computing nodes, N key nodes to participate in the next round of training, and to fuse the local parameters of the N key nodes to obtain global parameters;
a sending module, configured to send the above-mentioned global parameters to each computing node, and to send a training command to the above-mentioned key nodes, so that the key nodes execute the next round of training tasks based on the above-mentioned global parameters.
To achieve the above purpose, the embodiments of the present application provide a data communication device, applied to a computing node in a distributed training system, the device including:
a compression module, configured to compress its own local parameters based on a preset compression algorithm when a communication trigger condition is met, and to transmit the compressed local parameters to the above-mentioned central node;
a second acquisition module, configured to obtain the global parameters sent by the above-mentioned central node, wherein the above-mentioned global parameters are obtained by the above-mentioned central node by fusing the local parameters of N key nodes;
an execution module, configured to execute the corresponding training task based on the above-mentioned global parameters when a training command sent by the central node is received.
To achieve the above purpose, the embodiments of the present application provide an electronic device, including:
a memory, configured to store a computer program;
a processor, configured to implement the steps of the above information fusion method or data communication method when executing the above computer program.
To achieve the above purpose, the embodiments of the present application provide a computer non-volatile readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the steps of the above information fusion method or data communication method are implemented.
In the information fusion method provided by the embodiments of the present application, the central node selects only N key nodes for information fusion, effectively reducing the number of computing nodes to be fused; in the next round of training, only the N selected key nodes execute training tasks while the other computing nodes do not, which improves the distributed training speed of the model.
In the data communication method provided by the embodiments of the present application, the computing node compresses its own local parameters based on a preset compression algorithm before transmitting them to the central node, which reduces the communication volume between the central node and the computing node and reduces the communication overhead between them.
The embodiments of the present application also disclose an information fusion device, a data communication device, an electronic device and a computer non-volatile readable storage medium, which can likewise achieve the above technical effects.
It should be understood that the above general description and the following detailed description are only exemplary and do not limit the embodiments of the present application.
Brief Description of the Drawings
In order to explain the technical solutions in the embodiments of the present application or in the related art more clearly, the drawings needed in the description of the embodiments or the related art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application, and those of ordinary skill in the art can obtain other drawings from these drawings without creative effort. The drawings are used to provide an understanding of the present disclosure and constitute a part of the specification; together with the following optional implementations, they are used to explain the present disclosure, but do not constitute a limitation of the present disclosure. In the drawings:
Figure 1 is a schematic diagram of the centralized architecture and the decentralized architecture;
Figure 2 is a diagram of a distributed node information fusion framework oriented to the parameter server architecture according to an exemplary embodiment;
Figure 3 is a flow chart of an information fusion method according to an exemplary embodiment;
Figure 4 is a flow chart of a data communication method according to an exemplary embodiment;
Figure 5 is a structural diagram of an information fusion device according to an exemplary embodiment;
Figure 6 is a structural diagram of a data communication device according to an exemplary embodiment;
Figure 7 is a structural diagram of an electronic device according to an exemplary embodiment.
Detailed Description of the Embodiments
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only some of the embodiments of the present application, not all of them. Based on the embodiments of the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the embodiments of the present application. In addition, in the embodiments of the present application, "first", "second" and the like are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence.
A distributed node information fusion framework oriented to the parameter server architecture provided by an embodiment of the present application is shown in Figure 2, and includes a data/model partition component, a parameter-server-architecture distributed training system component, a node selection and data compression technology component, and a training result output component.
The data/model partition component mainly completes the task of inputting the data set and model to be processed: the data splitting module is mainly responsible for splitting the data set and deploying the split sub-data sets onto the corresponding computing nodes; the model splitting module is mainly responsible for splitting the original large model into several smaller sub-models.
The parameter-server-architecture distributed training system component is mainly used to complete the actual training tasks.
The node selection and data compression technology component is the core technology of the entire distributed training system framework: the node selection module completes the selection of key computing nodes and avoids computing the information of all computing nodes, thereby effectively alleviating the "communication bottleneck" problem of the parameter server; the data compression module compresses the communication traffic from the data perspective, thereby increasing the model training speed. For example, the original distributed training system in Figure 2 includes computing node 1, computing node 2, computing node 3 and the parameter server node; through the selection method designed by the node selection module, computing node 2, which does not meet the conditions, is eliminated. Therefore, in the subsequent iterations only computing node 1, computing node 3 and the parameter server node actually participate in the computation. In addition, data compression is applied to the communication information (such as gradients and model parameters) of computing node 1 and computing node 3 respectively to reduce the communication volume. There are two main roles in the parameter server architecture: worker and server. The worker is mainly responsible for: first, completing local training tasks based on its local data samples; and second, communicating with the server through the client interface. The server is mainly responsible for: first, aggregating or fusing the local gradients sent by each worker; and second, updating the global model parameters and returning them to each worker.
The training result output component is responsible for outputting the global solution of the training task and presenting it in a visual way, to facilitate subsequent improvement and optimization.
In summary, each component performs its own duties, and they cooperate to complete various complex training tasks.
An embodiment of the present application discloses an information fusion method that improves the distributed training speed of the model.
Referring to Figure 3, which is a flow chart of an information fusion method according to an exemplary embodiment, as shown in Figure 3, the method includes:
S101: when a communication trigger condition is met, obtain the local parameters of each computing node in the above-mentioned distributed training system, wherein the above-mentioned communication trigger condition is that all key nodes participating in the current round of training have completed the current round of training tasks;
This embodiment is applied to a distributed training system that includes multiple computing nodes (workers) and one central node (server). Each worker is bidirectionally connected to the server, meaning data can be transmitted in both directions; however, there is no direct connection between workers. Each worker can independently carry out its own training task using the global parameters provided by the server. In detail, each worker communicates with the server through the following two operations: one is PULL, i.e., the worker obtains the global parameters from the server; the other is PUSH, i.e., the worker sends its local parameters to the server. In this embodiment, the execution subject is the central node.
In an optional embodiment, each worker sends its respective local parameters to the server; when all key nodes participating in the current round of training have completed the current round of training tasks, the server can obtain the local parameters of each worker. The key nodes are the N nodes participating in training that the central node selects among all computing nodes before the current round of training is executed, where N = 1, N = 2, or N is a positive integer greater than or equal to 3; as an optional implementation, the above-mentioned N nodes participating in training may be some of all the computing nodes. Optionally, the communication trigger condition may be that all key nodes participating in the current round of training have completed a preset number of iterations of the training process. This embodiment does not limit the preset number: for example, if the preset number is 1, the workers synchronize local parameters with the server after every iteration; if the preset number is 10, each worker synchronizes local parameters with the server after completing 10 iterations; if the preset number is T (the total number of iterations), each worker synchronizes local parameters with the server only after completing all iterations.
S102: select, among the computing nodes, N key nodes to participate in the next round of training, and fuse the local parameters of the N key nodes to obtain global parameters;
In this step, N key nodes are re-selected for the next round of training, and the local parameters of the N key nodes are fused to obtain the global parameters; that is, the average of the local parameters of the N key nodes is calculated as the global parameters.
As an optional implementation, the above-mentioned selection, among the computing nodes, of N key nodes to participate in the next round of training includes: calculating the average parameter of the local parameters of the above-mentioned key nodes, determining the deviation between the local parameters of each computing node and the above-mentioned average parameter, and selecting the N computing nodes with the smallest deviation as the key nodes to participate in the next round of training; as a feasible implementation, the N selected computing nodes may be some of the above-mentioned computing nodes.
In an optional implementation, the server computes the average parameter of the local parameters and determines the deviation v_r between the local parameters of the r-th worker and the average parameter; v_1, v_2, ..., v_r, ..., v_n are arranged in descending order, and the N workers with the smaller deviations are selected as key nodes.
As a feasible implementation, the above-mentioned determination of the deviation between the local parameters of each computing node and the above-mentioned average parameter includes: sending the above-mentioned average parameter to each computing node, so that each computing node calculates the deviation between its own local parameters and the above-mentioned average parameter and returns it to the above-mentioned central node.
In an optional embodiment, the server returns the calculated average parameter to each worker, and each worker calculates the deviation between its own local parameters and the average parameter and returns it to the server.
S103: send the above-mentioned global parameters to each computing node, and send a training command to the above-mentioned key nodes, so that the key nodes execute the next round of training tasks based on the above-mentioned global parameters.
In this step, the central node sends the global parameters to each computing node, and each computing node updates its own local parameters to the global parameters. However, only the key nodes execute the next round of training tasks; the other computing nodes, which are not selected as key nodes, do not execute the next round of training tasks, which improves the distributed training speed of the model.
The inputs of the distributed node information fusion algorithm oriented to the parameter server architecture are: the total number of iterations T, the learning rate η, the initial model parameters x_0, the iterative communication trigger condition σ, and the number of key nodes N; the output is: the globally converged model parameters x_T. The execution process is:
for iteration t = 0, 1, ..., T do
    each worker concurrently executes the computing-node training function Worker_Training(t);
    if iteration t satisfies the communication trigger condition σ do
        the server executes the server-node training function Server_Training(t);
    end if
end for
Return the globally converged model parameters x_T
The computing-node training function Worker_Training(t) can be defined as:
Function Worker_Training(t)
    assume the r-th worker performs one random sampling and obtains a training sample;
    the worker PULLs the latest global parameter x from the server;
    based on the parameter x and the training sample, the worker computes the local stochastic gradient;
    the worker updates its local parameters;
    the worker calls the gradient compression function Compress_gradient() to compress the update;
    the worker PUSHes the compressed result to the server;
end Function
The parameter-server-node training function Server_Training(t) is defined as follows:
Function Server_Training(t)
    call the computing-node selection function Worker_Selection() to select N key nodes for global parameter information fusion and synchronization;
    compute the global model parameter information fusion (the average of the local parameters of the N key nodes);
    the server sends the above global parameters to each worker;
end Function
In the information fusion method provided by the embodiments of the present application, the central node selects only N key nodes for information fusion, effectively reducing the number of computing nodes to be fused; in the next round of training, only the N selected key nodes execute training tasks while the other computing nodes do not, which improves the distributed training speed of the model.
It can be understood that two necessary prerequisites of a deep learning model training task are data and the model. Training a deep learning model relies on a high-quality data set. The data/model partition component is responsible for taking the data set and model to be processed as the input of the deep learning training task, and for providing an interface for users to access the data or the model.
Generally speaking, the input deep learning model/data set is so large that it is difficult to process. Therefore, the idea of divide and conquer is adopted, and the original large-scale data set or model is decomposed so that its processing becomes relatively easy. This component contains a data splitting module (also known as data parallelism) and a model splitting module (also known as model parallelism).
On the basis of the above embodiment, as an optional implementation, the method further includes: dividing the training model into multiple training sub-models, and allocating the above-mentioned training sub-models to the computing nodes.
In an optional embodiment, if the model of the training task is too large to be stored on a single machine, the model needs to be effectively split so that the training task becomes feasible. Model parallelism splits the model parameters into multiple sub-models and assigns each sub-model to a different computing node. It is worth noting that, due to the particularity of neural network models, namely their layered structure, they have significant advantages in applying model parallelism. According to the splitting method, a neural network model can be split horizontally or vertically; that is, the above-mentioned division of the training model into multiple training sub-models includes: dividing the training model into multiple training sub-models in the horizontal direction or the vertical direction.
On the basis of the above embodiment, as an optional implementation, the method further includes: allocating multiple training samples to the computing nodes, so that each computing node performs the iterative training process based on the corresponding training samples.
Data parallelism relies on multiple processors (computing nodes) in a parallel computing environment subdividing the data set to achieve split computation. Data-parallel algorithms focus on distributing the data over different parallel computing nodes, and each computing node executes the same computation model. According to the splitting strategy of the data set, the data-parallel mode is divided into sample-based data parallelism and sample-dimension-based data parallelism. That is, the above-mentioned allocation of multiple training samples to the computing nodes includes: allocating multiple training samples to the computing nodes based on a sampling method, or splitting multiple training samples according to data dimensions and allocating them to the computing nodes. Sample-based data parallelism: assume that the data set of the distributed training system contains multiple data samples and that there are multiple computing nodes; these samples are distributed to the computing nodes through two methods, random sampling with replacement and local (or global) shuffled sampling. Data parallelism based on sample dimensions: assume that the data set contains multiple samples, each sample has multi-dimensional attributes or features, and the distributed training system includes multiple computing nodes; starting from the sample attribute dimension, the samples are split according to different attributes, and the split sample subsets are allocated to the corresponding computing nodes.
In addition, in some scenarios the model splitting module and the data splitting module are used simultaneously, resulting in a hybrid splitting strategy of data and model. The hybrid splitting strategy of data and model (hybrid parallelism), as the name suggests, combines the data-parallel mode and the model-parallel mode: on the one hand the data set is split, and on the other hand the model is also split, so that the approach can be applied to more complex model training tasks.
An embodiment of the present application discloses a data communication method that reduces the communication overhead between the central node and the computing node.
Referring to Figure 4, which is a flow chart of a data communication method according to an exemplary embodiment, as shown in Figure 4, the method includes:
S201: when a communication trigger condition is met, compress its own local parameters based on a preset compression algorithm, and transmit the compressed local parameters to the above-mentioned central node;
The execution subject of this embodiment is a computing node in the distributed training system. In an optional embodiment, the computing node needs to transmit its own local parameters to the central node.
In real deep neural network model training scenarios, studies have shown that gradient computation or communication accounts for more than 94% of the total GPU training time, which severely constrains training efficiency. To reduce the communication volume, an improved 1-bit compression technique is adopted. The original 1-bit compression technique is defined as follows:
Let C[·] denote the compression operation, ‖·‖₁ denote the L1 norm of a vector, x ∈ R^d denote a d-dimensional real vector, and sign(x) denote taking the sign of the vector x; then the 1-bit compression operation on the vector x is:
C[x] = (‖x‖₁ / d)·sign(x)
Although the above compression can reduce the communication volume, in some cases it produces errors. For example, for the vector x = [1, -2, 3] and the vector y = [1, 2, 3]:
C[x] = (|1| + |-2| + |3|)/3·(+);
C[y] = (|1| + |2| + |3|)/3·(+);
上述两个向量压缩结果相同。换言之,不同的向量,采用原始的1-bit压缩后结果竟然相同,显然这种压缩会产生误码。相反地,压缩的目标应尽量做到差异化。为此,本实施例设计了一种改进的1-bit(比特)压缩技术规避上述问题。改进后的1-bit压缩技术(也即预设压缩算法)如下:
Figure PCTCN2022133806-appb-000014
where x is the local parameters, ||x||_2 is the L2 norm of x, sign(x) is the sign of x, d is the dimension of the local parameters,
Figure PCTCN2022133806-appb-000015
x_i is the i-th dimension of x, and C[x] is the compressed local parameters.
The improved scheme differs from the original scheme in two main respects: first, the coefficient λ is used to avoid the miscoding problem; second, the L2 norm replaces the original L1 norm, because the L2 norm has better mathematical properties. With this preset compression algorithm, the original 32-bit or 16-bit training data can be compressed to 1 bit, thereby reducing the communication overhead between the central node and the computing nodes.
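The original operator and the miscoding example above can be reproduced directly. For the improved operator, the exact formula and the definition of the coefficient λ are given only in the formula images, so the sketch below assumes the form λ · (||x||_2 / d) · sign(x), consistent with the two differences just described, and takes λ as an explicit argument.

import numpy as np

def one_bit_compress_original(x):
    # Original 1-bit compression: C[x] = (||x||_1 / d) * sign(x).
    return (np.linalg.norm(x, 1) / x.size) * np.sign(x)

def one_bit_compress_improved(x, lam):
    # Assumed form of the improved operator: the L2 norm replaces the L1 norm and a
    # coefficient lambda is applied; lambda's definition appears only in the formula
    # image, so it is passed in here rather than computed.
    return lam * (np.linalg.norm(x, 2) / x.size) * np.sign(x)

# Miscoding example from the text: both vectors share the same scaling factor 2.0,
# so their differing magnitudes cannot be distinguished after the original compression.
x, y = np.array([1.0, -2.0, 3.0]), np.array([1.0, 2.0, 3.0])
print(np.linalg.norm(x, 1) / x.size, np.linalg.norm(y, 1) / y.size)   # 2.0 2.0
print(one_bit_compress_original(x), one_bit_compress_original(y))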
Optionally, this embodiment further includes: obtaining the average parameter sent by the central node, computing the deviation between the node's own local parameters and the average parameter, and returning the deviation to the central node.
S202: Obtain the global parameters sent by the central node, wherein the global parameters are obtained by the central node fusing the local parameters of N key nodes;
S203: When a training command sent by the central node is received, execute the corresponding training task based on the global parameters.
With the data communication method provided by the embodiments of the present application, a computing node compresses its own local parameters based on the preset compression algorithm before transmitting them to the central node, which reduces the communication volume between the central node and the computing nodes and lowers the communication overhead between them.
An information fusion apparatus provided by an embodiment of the present application is described below; the information fusion apparatus described below and the information fusion method described above may be referred to in conjunction with each other.
Referring to FIG. 5, a structural diagram of an information fusion apparatus according to an exemplary embodiment, as shown in FIG. 5, includes:
a first obtaining module 501, configured to obtain the local parameters of each computing node in the distributed training system when the communication trigger condition is satisfied, wherein the communication trigger condition is that all key nodes participating in the current round of training have completed the current round of training tasks;
a fusion module 502, configured to select, from the computing nodes, N key nodes to participate in the next round of training, and fuse the local parameters of the N key nodes to obtain global parameters;
a sending module 503, configured to send the global parameters to each computing node and send a training command to the key nodes, so that the key nodes execute the next round of training tasks based on the global parameters.
With the information fusion apparatus provided by the embodiments of the present application, the central node selects only N key nodes for information fusion, which effectively reduces the number of computing nodes involved in the fusion; in the next round of training, only the N key nodes execute training tasks while the other computing nodes do not, which speeds up the distributed training of the model.
On the basis of the above embodiments, as an optional implementation, the communication trigger condition is that all key nodes participating in the current round of training have completed a preset number of iterations of the iterative training process.
On the basis of the above embodiments, as an optional implementation, the fusion module 502 includes:
a selection unit, configured to compute the average parameter of the local parameters of the key nodes, determine the deviation between each computing node's local parameters and the average parameter, and select the N computing nodes with the smallest deviations as the key nodes participating in the next round of training;
a fusion unit, configured to fuse the local parameters of the N key nodes to obtain the global parameters.
On the basis of the above embodiments, as an optional implementation, the selection unit is configured to: compute the average parameter of the local parameters of the key nodes, send the average parameter to each computing node so that each computing node computes the deviation between its own local parameters and the average parameter and returns the deviation to the central node, and select the N computing nodes with the smallest deviations as the key nodes participating in the next round of training.
On the basis of the above embodiments, as an optional implementation, the apparatus further includes:
a first allocation module, configured to divide the training model into multiple training sub-models and allocate the training sub-models to the computing nodes.
On the basis of the above embodiments, as an optional implementation, the first allocation module is configured to: divide the training model into multiple training sub-models in the horizontal direction or the vertical direction, and allocate the training sub-models to the computing nodes.
On the basis of the above embodiments, as an optional implementation, the apparatus further includes:
a second allocation module, configured to allocate multiple training samples to the computing nodes, so that each computing node executes the iterative training process based on its corresponding training samples.
On the basis of the above embodiments, as an optional implementation, the second allocation module is configured to: allocate multiple training samples to the computing nodes based on a sampling method, or split multiple training samples by data dimension and allocate them to the computing nodes.
A data communication apparatus provided by an embodiment of the present application is described below; the data communication apparatus described below and the data communication method described above may be referred to in conjunction with each other.
Referring to FIG. 6, a structural diagram of a data communication apparatus according to an exemplary embodiment, as shown in FIG. 6, includes:
a compression module 601, configured to compress the computing node's own local parameters based on a preset compression algorithm when the communication trigger condition is satisfied, and transmit the compressed local parameters to the central node;
a second obtaining module 602, configured to obtain the global parameters sent by the central node, wherein the global parameters are obtained by the central node fusing the local parameters of N key nodes;
an execution module 603, configured to execute the corresponding training task based on the global parameters when a training command sent by the central node is received.
With the data communication apparatus provided by the embodiments of the present application, a computing node compresses its own local parameters based on the preset compression algorithm before transmitting them to the central node, which reduces the communication volume between the central node and the computing nodes and lowers the communication overhead between them.
On the basis of the above embodiments, as an optional implementation, the preset compression algorithm is:
Figure PCTCN2022133806-appb-000016
where x is the local parameters, ||x||_2 is the L2 norm of x, sign(x) is the sign of x, d is the dimension of the local parameters,
Figure PCTCN2022133806-appb-000017
x_i is the i-th dimension of x, and C[x] is the compressed local parameters.
On the basis of the above embodiments, as an optional implementation, the apparatus further includes:
a computing module, configured to obtain the average parameter sent by the central node, compute the deviation between the computing node's own local parameters and the average parameter, and return the deviation to the central node.
Regarding the apparatuses in the above embodiments, the optional manners in which each module performs its operations have been described in detail in the embodiments of the corresponding methods and will not be elaborated here.
Based on the hardware implementation of the above program modules, and in order to implement the methods of the embodiments of the present application, an embodiment of the present application further provides an electronic device. FIG. 7 is a structural diagram of an electronic device according to an exemplary embodiment; as shown in FIG. 7, the electronic device includes:
a communication interface 1, capable of exchanging information with other devices such as network devices;
a processor 2, connected to the communication interface 1 to exchange information with other devices, and configured to execute, when running a computer program, the information fusion method or the data communication method provided by one or more of the above technical solutions; the computer program is stored in a memory 3.
Of course, in practice, the components of the electronic device are coupled together through a bus system 4. It can be understood that the bus system 4 is configured to implement connection and communication among these components. In addition to a data bus, the bus system 4 also includes a power bus, a control bus, and a status signal bus. However, for the sake of clarity, the various buses are all labeled as the bus system 4 in FIG. 7.
The memory 3 in the embodiments of the present application is configured to store various types of data to support the operation of the electronic device. Examples of such data include any computer program configured to operate on the electronic device.
It can be understood that the memory 3 may be a volatile memory or a non-volatile memory, and may also include both volatile and non-volatile memories. The non-volatile memory may be a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a ferromagnetic random access memory (FRAM), a Flash Memory, a magnetic surface memory, an optical disc, or a Compact Disc Read-Only Memory (CD-ROM); the magnetic surface memory may be a magnetic disk memory or a magnetic tape memory. The volatile memory may be a Random Access Memory (RAM), which serves as an external cache. By way of example and not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Synchronous Static Random Access Memory (SSRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDRSDRAM), Enhanced Synchronous Dynamic Random Access Memory (ESDRAM), SyncLink Dynamic Random Access Memory (SLDRAM), and Direct Rambus Random Access Memory (DRRAM). The memory 3 described in the embodiments of the present application is intended to include, but is not limited to, these and any other suitable types of memory.
The methods disclosed in the above embodiments of the present application may be applied to, or implemented by, the processor 2. The processor 2 may be an integrated circuit chip with signal processing capability. During implementation, each step of the above methods may be completed by an integrated hardware logic circuit in the processor 2 or by instructions in the form of software. The above processor 2 may be a general-purpose processor, a DSP, another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The processor 2 may implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor or any conventional processor. The steps of the methods disclosed in the embodiments of the present application may be directly embodied as being executed by a hardware decoding processor, or executed by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium, and the storage medium is located in the memory 3; the processor 2 reads the program in the memory 3 and completes the steps of the foregoing methods in combination with its hardware.
When the processor 2 executes the program, the corresponding flows of the methods of the embodiments of the present application are implemented; for brevity, details are not repeated here.
In an exemplary embodiment, an embodiment of the present application further provides a storage medium, i.e., a computer storage medium, which may be a computer non-volatile readable storage medium, for example, a memory 3 storing a computer program; the computer program can be executed by the processor 2 to complete the steps of the foregoing information fusion method or data communication method. The computer non-volatile readable storage medium may be a memory such as FRAM, ROM, PROM, EPROM, EEPROM, Flash Memory, magnetic surface memory, optical disc, or CD-ROM.
Those of ordinary skill in the art will understand that all or part of the steps implementing the above method embodiments may be completed by hardware related to program instructions. The foregoing program may be stored in a computer-readable storage medium, and when executed, the program performs the steps of the above method embodiments. The foregoing storage medium includes various media capable of storing program code, such as a removable storage device, a ROM, a RAM, a magnetic disk, or an optical disc.
Alternatively, if the integrated units described above in the embodiments of the present application are implemented in the form of software functional modules and sold or used as independent products, they may also be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the embodiments of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing an electronic device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the methods described in the embodiments of the present application. The foregoing storage medium includes various media capable of storing program code, such as a removable storage device, a ROM, a RAM, a magnetic disk, or an optical disc.
The above descriptions are merely optional implementations of the embodiments of the present application, but the protection scope of the embodiments of the present application is not limited thereto. Any variation or replacement readily conceivable by a person skilled in the art within the technical scope disclosed in the embodiments of the present application shall fall within the protection scope of the embodiments of the present application. Therefore, the protection scope of the embodiments of the present application shall be subject to the protection scope of the claims.

Claims (20)

  1. An information fusion method, applied to a central node in a distributed training system, the method comprising:
    when a communication trigger condition is satisfied, obtaining local parameters of each computing node in the distributed training system, wherein the communication trigger condition is that all key nodes participating in a current round of training have completed current-round training tasks;
    selecting, from the computing nodes, N key nodes to participate in a next round of training, and fusing the local parameters of the N key nodes to obtain global parameters;
    sending the global parameters to each computing node, and sending a training command to the key nodes, so that the key nodes execute next-round training tasks based on the global parameters.
  2. The information fusion method according to claim 1, wherein all key nodes participating in the current round of training having completed the current-round training tasks comprises:
    all key nodes participating in the current round of training having completed a preset number of iterations of an iterative training process.
  3. The information fusion method according to claim 1, wherein selecting, from the computing nodes, N key nodes to participate in the next round of training comprises:
    computing an average parameter of the local parameters of the key nodes, determining a deviation between the local parameters of each computing node and the average parameter, and selecting the N computing nodes with the smallest deviations as the key nodes participating in the next round of training.
  4. The information fusion method according to claim 3, wherein determining the deviation between the local parameters of each computing node and the average parameter comprises:
    sending the average parameter to each computing node, so that each computing node computes the deviation between its own local parameters and the average parameter and returns the deviation to the central node.
  5. The information fusion method according to claim 1, further comprising:
    dividing a training model into multiple training sub-models, and allocating the training sub-models to the computing nodes.
  6. The information fusion method according to claim 5, wherein dividing the training model into multiple training sub-models comprises:
    dividing the training model into multiple training sub-models in a horizontal direction or a vertical direction.
  7. The information fusion method according to claim 1, further comprising:
    allocating multiple training samples to the computing nodes, so that each computing node executes an iterative training process based on its corresponding training samples.
  8. The information fusion method according to claim 7, wherein allocating multiple training samples to the computing nodes comprises:
    allocating multiple training samples to the computing nodes based on a sampling method, or splitting multiple training samples by data dimension and allocating them to the computing nodes.
  9. The information fusion method according to claim 8, wherein allocating multiple training samples to the computing nodes based on a sampling method comprises:
    allocating the multiple training samples to the computing nodes by random sampling with replacement and/or local shuffled sampling; or
    allocating the multiple training samples to the computing nodes by random sampling with replacement and/or global shuffled sampling.
  10. The information fusion method according to claim 8, wherein splitting multiple training samples by data dimension and allocating them to the computing nodes comprises:
    in a case where each training sample has multi-dimensional attributes or features, splitting the multiple training samples by attribute, and allocating the resulting sample subsets to the corresponding computing nodes.
  11. The information fusion method according to claim 1, wherein fusing the local parameters of the N key nodes to obtain the global parameters comprises:
    computing an average of the local parameters of the N key nodes, and determining the average as the global parameters.
  12. A data communication method, applied to a computing node in a distributed training system, the method comprising:
    when a communication trigger condition is satisfied, compressing the computing node's own local parameters based on a preset compression algorithm, and transmitting the compressed local parameters to a central node;
    obtaining global parameters sent by the central node, wherein the global parameters are obtained by the central node fusing local parameters of N key nodes;
    when a training command sent by the central node is received, executing a corresponding training task based on the global parameters.
  13. The data communication method according to claim 12, wherein the preset compression algorithm is:
    Figure PCTCN2022133806-appb-100001
    wherein x is the local parameters, ||x||_2 is the L2 norm of x, sign(x) is the sign of x, d is the dimension of the local parameters,
    Figure PCTCN2022133806-appb-100002
    x_i is the i-th dimension of x, and C[x] is the compressed local parameters.
  14. The data communication method according to claim 12, further comprising:
    obtaining an average parameter sent by the central node, computing a deviation between the computing node's own local parameters and the average parameter, and returning the deviation to the central node.
  15. The data communication method according to claim 12, wherein,
    compressing the computing node's own local parameters based on the preset compression algorithm when the communication trigger condition is satisfied and transmitting the compressed local parameters to the central node comprises: in a case where all key nodes participating in the current round of training have completed current-round training tasks, compressing the computing node's own local parameters based on the preset compression algorithm, and transmitting the compressed local parameters to the central node;
    obtaining the global parameters sent by the central node comprises: in a case where the computing node is selected as a key node participating in the next round of training, obtaining the global parameters sent by the central node;
    executing the corresponding training task based on the global parameters when the training command sent by the central node is received comprises: when the training command sent by the central node is received, executing a next-round training task based on the global parameters.
  16. The data communication method according to claim 12, wherein the method further comprises:
    obtaining some of multiple training sub-models, wherein the multiple training sub-models are sub-models obtained by dividing a training model associated with the training task, and the multiple training sub-models are allocated to the computing nodes in the distributed training system; and/or
    obtaining some of multiple training samples, wherein the multiple training samples are training samples associated with the training task, the multiple training samples are allocated to the computing nodes, and each computing node is configured to execute an iterative training process based on its corresponding training samples.
  17. An information fusion apparatus, applied to a central node in a distributed training system, the apparatus comprising:
    a first obtaining module, configured to obtain local parameters of each computing node in the distributed training system when a communication trigger condition is satisfied, wherein the communication trigger condition is that all key nodes participating in a current round of training have completed current-round training tasks;
    a fusion module, configured to select, from the computing nodes, N key nodes to participate in a next round of training, and fuse the local parameters of the N key nodes to obtain global parameters;
    a sending module, configured to send the global parameters to each computing node, and send a training command to the key nodes, so that the key nodes execute next-round training tasks based on the global parameters.
  18. A data communication apparatus, applied to a computing node in a distributed training system, the apparatus comprising:
    a compression module, configured to compress the computing node's own local parameters based on a preset compression algorithm when a communication trigger condition is satisfied, and transmit the compressed local parameters to a central node;
    a second obtaining module, configured to obtain global parameters sent by the central node, wherein the global parameters are obtained by the central node fusing local parameters of N key nodes;
    an execution module, configured to execute a corresponding training task based on the global parameters when a training command sent by the central node is received.
  19. An electronic device, comprising:
    a memory, configured to store a computer program;
    a processor, configured to implement, when executing the computer program, the steps of the information fusion method according to any one of claims 1 to 11 or the data communication method according to any one of claims 12 to 16.
  20. A computer non-volatile readable storage medium, having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the information fusion method according to any one of claims 1 to 11 or the data communication method according to any one of claims 12 to 16.
PCT/CN2022/133806 2022-07-18 2022-11-23 信息融合方法、数据通信方法、装置及电子设备和非易失性可读存储介质 WO2024016542A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210838709.3 2022-07-18
CN202210838709.3A CN114997337B (zh) 2022-07-18 2022-07-18 信息融合、数据通信方法、装置及电子设备和存储介质

Publications (1)

Publication Number Publication Date
WO2024016542A1 true WO2024016542A1 (zh) 2024-01-25

Family

ID=83022643

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/133806 WO2024016542A1 (zh) 2022-07-18 2022-11-23 信息融合方法、数据通信方法、装置及电子设备和非易失性可读存储介质

Country Status (2)

Country Link
CN (1) CN114997337B (zh)
WO (1) WO2024016542A1 (zh)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114997337B (zh) * 2022-07-18 2023-01-13 浪潮电子信息产业股份有限公司 信息融合、数据通信方法、装置及电子设备和存储介质
CN118114780A (zh) * 2022-11-30 2024-05-31 华为技术有限公司 分布式机器学习方法、设备、存储介质及程序产品
CN115660078A (zh) * 2022-12-29 2023-01-31 浪潮电子信息产业股份有限公司 一种分布式计算方法、系统、存储介质和电子设备
CN115879543B (zh) * 2023-03-03 2023-05-05 浪潮电子信息产业股份有限公司 一种模型训练方法、装置、设备、介质及系统
CN116681127B (zh) * 2023-07-27 2023-11-07 山东海量信息技术研究院 一种神经网络模型训练方法、装置及电子设备和存储介质
CN116704296B (zh) * 2023-08-04 2023-11-03 浪潮电子信息产业股份有限公司 一种图像处理方法、装置、系统、设备及计算机存储介质

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113516250A (zh) * 2021-07-13 2021-10-19 北京百度网讯科技有限公司 一种联邦学习方法、装置、设备以及存储介质
US20220101189A1 (en) * 2020-09-30 2022-03-31 Vmware, Inc. Federated inference
CN114756383A (zh) * 2022-06-15 2022-07-15 苏州浪潮智能科技有限公司 一种分布式计算方法、系统、设备及存储介质
CN114997337A (zh) * 2022-07-18 2022-09-02 浪潮电子信息产业股份有限公司 信息融合、数据通信方法、装置及电子设备和存储介质

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107944566B (zh) * 2017-11-28 2020-12-22 杭州云脑科技有限公司 一种机器学习方法、主节点、工作节点及系统
US11775863B2 (en) * 2019-05-22 2023-10-03 Oracle International Corporation Enforcing fairness on unlabeled data to improve modeling performance
CN110276455B (zh) * 2019-06-19 2022-08-30 南京邮电大学 基于全局率权重的分布式深度学习系统
CN111461343B (zh) * 2020-03-13 2023-08-04 北京百度网讯科技有限公司 模型参数更新方法及其相关设备
US20230422054A1 (en) * 2020-09-07 2023-12-28 Lg Electronics Inc. Method for performing federated learning in wireless communication system, and apparatus therefor

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220101189A1 (en) * 2020-09-30 2022-03-31 Vmware, Inc. Federated inference
CN113516250A (zh) * 2021-07-13 2021-10-19 北京百度网讯科技有限公司 一种联邦学习方法、装置、设备以及存储介质
CN114756383A (zh) * 2022-06-15 2022-07-15 苏州浪潮智能科技有限公司 一种分布式计算方法、系统、设备及存储介质
CN114997337A (zh) * 2022-07-18 2022-09-02 浪潮电子信息产业股份有限公司 信息融合、数据通信方法、装置及电子设备和存储介质

Also Published As

Publication number Publication date
CN114997337B (zh) 2023-01-13
CN114997337A (zh) 2022-09-02

Similar Documents

Publication Publication Date Title
WO2024016542A1 (zh) 信息融合方法、数据通信方法、装置及电子设备和非易失性可读存储介质
WO2023240845A1 (zh) 一种分布式计算方法、系统、设备及存储介质
Liu et al. Adaptive asynchronous federated learning in resource-constrained edge computing
CN107330516B (zh) 模型参数训练方法、装置及系统
CN107480789B (zh) 一种深度学习模型的高效转换方法及装置
CN105446896B (zh) 映射化简应用的缓存管理方法和装置
KR20180131836A (ko) 파라미터 서버 및 그것에 의해 수행되는 분산 딥러닝 파라미터 공유 방법
CN108280522A (zh) 一种插件式分布式机器学习计算框架及其数据处理方法
KR20180125734A (ko) 파라미터 공유 장치 및 방법
Ahn et al. ShmCaffe: A distributed deep learning platform with shared memory buffer for HPC architecture
Wang et al. An efficient and non-intrusive GPU scheduling framework for deep learning training systems
CN110659278A (zh) 基于cpu-gpu异构架构的图数据分布式处理系统
US20210390405A1 (en) Microservice-based training systems in heterogeneous graphic processor unit (gpu) cluster and operating method thereof
Ahn et al. Soft memory box: A virtual shared memory framework for fast deep neural network training in distributed high performance computing
Bac et al. Serverless computing approach for deploying machine learning applications in edge layer
Li et al. Optimizing makespan and resource utilization for multi-DNN training in GPU cluster
Madsen et al. Enorm: Efficient window-based computation in large-scale distributed stream processing systems
CN115879543B (zh) 一种模型训练方法、装置、设备、介质及系统
CN110021339B (zh) 基于蛋白质折叠测算蛋白质结构的集群并行计算加速方法
Yang et al. Parameter communication consistency model for large-scale security monitoring based on mobile computing
Duan et al. Lightweight federated reinforcement learning for independent request scheduling in microgrids
Kim et al. Scale-Train: A Scalable DNN Training Framework for a Heterogeneous GPU Cloud
Zhou et al. Scheduling-efficient framework for neural network on heterogeneous distributed systems and mobile edge computing systems
CN115292044A (zh) 数据处理方法、装置、电子设备及存储介质
Ma et al. Cloud-based multidimensional parallel dynamic programming algorithm for a cascade hydropower system