WO2018077293A1 - Data transmission method and system, and electronic device - Google Patents

Data transmission method and system, and electronic device

Info

Publication number
WO2018077293A1
WO2018077293A1 · PCT/CN2017/108450 · CN2017108450W
Authority
WO
WIPO (PCT)
Prior art keywords
data
node
matrix
sparse
deep learning
Prior art date
Application number
PCT/CN2017/108450
Other languages
English (en)
French (fr)
Inventor
朱元昊
颜深根
Original Assignee
北京市商汤科技开发有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京市商汤科技开发有限公司
Publication of WO2018077293A1 publication Critical patent/WO2018077293A1/zh
Priority to US16/382,058 (published as US20190236453A1)

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133Distances to prototypes
    • G06F18/24143Distances to neighbourhood prototypes, e.g. restricted Coulomb energy networks [RCEN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00Computing arrangements based on specific mathematical models
    • G06N7/01Probabilistic graphical models, e.g. probabilistic networks

Definitions

  • the present application relates to deep learning techniques, and more particularly to data transmission methods and systems, and electronic devices.
  • the deep learning training system is a computing system that acquires a deep learning model by training input data.
  • the deep learning training system needs to process a large amount of training data.
  • the ImageNet data set released by the Stanford University Computer Vision Laboratory contains more than 14 million high-precision images.
  • single-node deep learning training systems often take weeks or even months to complete operations due to their computational power and memory limitations. In this case, the distributed deep learning training system has received extensive attention in industry and academia.
  • a typical distributed deep learning training system usually uses a distributed computing framework to run a gradient descent algorithm.
  • the network traffic generated by gradient aggregation and parameter broadcasts is usually proportional to the size of the deep learning model.
  • new deep learning models keep growing in size.
  • the AlexNet model contains more than 60 million parameters, and the VGG-16 model has hundreds of millions of parameters. Therefore, a large amount of network traffic is generated during deep learning training and, constrained by network bandwidth and other conditions, communication time becomes one of the performance bottlenecks of the distributed deep learning training system.
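  • For a rough sense of why this traffic matters, the following back-of-the-envelope sketch estimates the dense per-iteration gradient volume; it is illustrative only, and the parameter count used for VGG-16 is an assumption (the text only says "hundreds of millions"):

```python
# Rough, illustrative estimate (not a figure from the application) of the dense
# network traffic one worker produces per iteration when gradients are exchanged
# as single-precision (4-byte) values.
models = {
    "AlexNet": 60_000_000,    # "more than 60 million parameters"
    "VGG-16": 138_000_000,    # assumed count; the text only says "hundreds of millions"
}

for name, num_params in models.items():
    megabytes = num_params * 4 / 1e6
    print(f"{name}: ~{megabytes:.0f} MB of gradient traffic per worker per iteration")
```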
  • the embodiment of the present application provides a data transmission scheme.
  • an embodiment of the present application provides a data transmission method, including:
  • performing sparse processing on at least a portion of the first data includes: comparing the at least portion of the first data with a given filtering threshold, and filtering out, from the at least portion, the part smaller than the filtering threshold, wherein the filtering threshold decreases as the number of training iterations of the deep learning model increases.
  • before performing the sparse processing on the at least part of the first data, the method further includes: randomly determining a portion of the first data as the at least part; and performing sparse processing on the determined at least part of the first data.
  • sending the at least partially sparsified first data to the at least one other node comprises: compressing the at least part of the first data after the sparse processing; and sending the compressed first data to the at least one other node.
  • the method according to the first aspect further includes: acquiring second data sent by the at least one other node for updating the parameters of the deep learning model trained by the distributed system; and updating the parameters of the deep learning model at least according to the second data.
  • acquiring the second data sent by the at least one other node for updating the parameters of the deep learning model trained by the distributed system includes: receiving and decompressing the second data that the at least one other node compressed and sent for updating the parameters of the deep learning model trained by the distributed system.
  • the first data includes: a gradient matrix computed in any one round of training during the iterative training of the deep learning model; and/or a parameter difference matrix between an old parameter of any one round of training during the iterative training of the deep learning model and a new parameter obtained by updating the old parameter at least according to the second data sent by the at least one other node for updating the parameters of the deep learning model trained by the distributed system.
  • when the first data includes the gradient matrix, performing sparse processing on at least part of the first data includes: selecting, from the gradient matrix, first partial matrix elements whose absolute values are smaller than the filtering threshold; randomly selecting second partial matrix elements from the gradient matrix; and setting to 0 the values of the matrix elements of the gradient matrix that belong to both the first partial matrix elements and the second partial matrix elements, to obtain a sparse gradient matrix; and sending the at least partially sparsified first data to the at least one other node includes: compressing the sparse gradient matrix into a character string, and sending the character string to the at least one other node through a network.
  • when the first data includes the parameter difference matrix, performing sparse processing on at least part of the first data includes: selecting, from the parameter difference matrix, third partial matrix elements whose absolute values are smaller than the filtering threshold; randomly selecting fourth partial matrix elements from the parameter difference matrix; and setting to 0 the values of the matrix elements of the parameter difference matrix that belong to both the third partial matrix elements and the fourth partial matrix elements, to obtain a sparse parameter difference matrix; and sending the at least partially sparsified first data to the at least one other node includes: compressing the sparse parameter difference matrix into a character string, and sending the character string to the at least one other node through the network.
  • a data transmission system including:
  • a data determining module configured to determine first data that is to be sent by any node in the distributed system to at least one other node for parameter updating the deep learning model trained by the distributed system
  • a sparse processing module configured to perform sparse processing on at least part of the first data
  • a data sending module configured to send, to the at least one other node, the first data that is at least partially subjected to the sparse processing.
  • the sparse processing module includes: a filtering submodule, configured to compare the at least part of the first data with a given filtering threshold and filter out, from the at least part, the portion smaller than the filtering threshold, wherein the filtering threshold decreases as the number of training iterations of the deep learning model increases.
  • the sparse processing module further includes: a random selection submodule, configured to randomly determine a portion of the first data as the at least part; and a sparse submodule, configured to perform sparse processing on the determined at least part of the first data.
  • the data sending module includes: a compression submodule, configured to compress the at least partially sparsified first data; and a sending submodule, configured to send the compressed first data to the at least one other node.
  • the system further includes: a data acquiring module, configured to acquire second data sent by the at least one other node for updating the parameters of the deep learning model trained by the distributed system; and an updating module, configured to update the parameters of the deep learning model at least according to the second data.
  • the data acquiring module includes: a receiving and decompressing submodule, configured to receive and decompress the second data that the at least one other node compressed and sent for updating the parameters of the deep learning model trained by the distributed system.
  • the first data includes: a gradient matrix computed in any one round of training during the iterative training of the deep learning model; and/or a parameter difference matrix between an old parameter of any one round of training during the iterative training of the deep learning model and a new parameter obtained by updating the old parameter at least according to the second data sent by the at least one other node for updating the parameters of the deep learning model trained by the distributed system.
  • when the first data includes the gradient matrix, the filtering submodule is configured to select, from the gradient matrix, first partial matrix elements whose absolute values are smaller than the filtering threshold; the random selection submodule is configured to randomly select second partial matrix elements from the gradient matrix; the sparse submodule is configured to set to 0 the values of the matrix elements of the gradient matrix that belong to both the first partial matrix elements and the second partial matrix elements, to obtain a sparse gradient matrix; the compression submodule is configured to compress the sparse gradient matrix into a character string; and the sending submodule sends the character string to the at least one other node through a network.
  • when the first data includes the parameter difference matrix, the filtering submodule is configured to select, from the parameter difference matrix, third partial matrix elements whose absolute values are smaller than the filtering threshold; the random selection submodule is configured to randomly select fourth partial matrix elements from the parameter difference matrix; the sparse submodule is configured to set to 0 the values of the matrix elements of the parameter difference matrix that belong to both the third partial matrix elements and the fourth partial matrix elements, to obtain a sparse parameter difference matrix; the compression submodule is configured to compress the sparse parameter difference matrix into a character string; and the sending submodule is configured to send the character string to the at least one other node through the network.
  • an electronic device including the data transmission system described in any of the embodiments of the present application.
  • an electronic device including: a processor and the data transmission system described in any of the embodiments of the present application; when the processor runs the data transmission system, the units in the data transmission system of any of the embodiments of the present application are run.
  • an electronic device includes: one or more processors, a memory, a communication component, and a communication bus, through which the processor, the memory, and the communication component communicate with one another;
  • the memory is configured to store at least one executable instruction, and the executable instruction causes the processor to perform operations corresponding to the data transmission method provided by any embodiment of the present application.
  • a computer program comprising computer readable code; when the computer readable code runs on a device, a processor in the device executes instructions for implementing the steps of the data transmission method described in any of the above embodiments of the present application.
  • a computer readable storage medium for storing computer readable instructions; when the instructions are executed, the operations of the steps of the data transmission method described in any of the above embodiments of the present application are implemented.
  • the data transmission method and system, the electronic device, the program, and the medium provided by the embodiments of the present application determine first data to be sent by any node in the distributed system to at least one other node for updating the parameters of the deep learning model trained by the distributed system, perform sparse processing on at least part of the first data, and send the at least partially sparsified first data to the at least one other node.
  • embodiments of the present application can thus discard at least some unimportant data (such as gradients and/or parameters), reduce the network traffic generated by each gradient accumulation and/or parameter broadcast, and shorten training time.
  • the present application does not need to lower the communication frequency and can obtain the latest parameters in time, so it can be used both in deep learning training systems that communicate in every iteration and in systems that need to reduce the communication frequency.
  • FIG. 1 is a flow chart of an embodiment of a data transmission method in accordance with the present application.
  • FIG. 2 is an exemplary flow chart of gradient filtering in an embodiment of a data transmission method of the present application.
  • FIG. 3 is an exemplary flow chart of parameter filtering in an embodiment of a data transmission method of the present application.
  • FIG. 4 is a schematic structural diagram of an embodiment of a data transmission system according to the present application.
  • FIG. 5 is a schematic structural diagram of another embodiment of a data transmission system according to the present application.
  • FIG. 6 is a schematic structural diagram of an embodiment of a node device according to the present application.
  • FIG. 7 is a schematic structural diagram of an embodiment of an electronic device according to the present application.
  • Embodiments of the present application can be applied to electronic devices such as terminal devices, computer systems, servers, etc., which can operate with numerous other general purpose or special purpose computing system environments or configurations.
  • Examples of well-known terminal devices, computing systems, environments, and/or configurations suitable for use with electronic devices such as terminal devices, computer systems, servers, and the like include, but are not limited to: Personal computer system, server computer system, thin client, thick client, handheld or laptop device, microprocessor based system, set top box, programmable consumer electronics, network personal computer, small computer system, mainframe computer system and including A distributed cloud computing technology environment for any of the above systems, and the like.
  • Electronic devices such as terminal devices, computer systems, servers, etc., can be described in the general context of computer system executable instructions (such as program modules) being executed by a computer system.
  • program modules may include routines, programs, target programs, components, logic, data structures, and the like that perform particular tasks or implement particular abstract data types.
  • the computer system/server can be implemented in a distributed cloud computing environment where tasks are performed by remote processing devices that are linked through a communication network.
  • program modules may be located on a local or remote computing system storage medium including storage devices.
  • FIG. 1 is a flow chart of an embodiment of a data transmission method in accordance with the present application. As shown in FIG. 1, the data transmission method of this embodiment includes:
  • in step S110, first data to be sent by a node in the distributed system to at least one other node for updating the parameters of the deep learning model trained by the distributed system is determined.
  • the distributed system therein may be, for example, a cluster composed of a plurality of computing nodes, or may be composed of a plurality of computing nodes and a parameter server.
  • the deep learning model therein may include, for example, but not limited to, a neural network (such as a convolutional neural network), and the parameters may be, for example, matrix variables for constructing a deep learning model, and the like.
  • step S110 may be performed by a processor invoking a corresponding instruction stored in a memory, or may be performed by a data determination module executed by the processor.
  • in step S120, sparse processing is performed on at least part of the first data.
  • the sparse processing is to remove less important parts from the first data, thereby reducing the network traffic consumed by transmitting the first data and reducing the training time of the deep learning model.
  • step S120 may be performed by a processor invoking a corresponding instruction stored in a memory, or may be performed by a sparse processing module executed by the processor.
  • in step S130, the at least partially sparsified first data is sent to the at least one other node.
  • step S130 may be performed by a processor invoking a corresponding instruction stored in a memory, or may be performed by a data transmitting module executed by the processor.
  • the data transmission method of the embodiments of the present application is used to transmit, between any two computing nodes, or between a computing node and a parameter server, in the distributed deep learning system, data for updating the parameters of the deep learning model running on the computing nodes. It can ignore less important parts of the transmitted data, such as unimportant gradients and/or parameters, which helps reduce the network traffic generated during aggregation and broadcast operations, thereby reducing the time spent on network transmission in each iteration and in turn shortening the overall training time of deep learning.
  • performing sparse processing on at least a portion of the first data may include: comparing the at least portion of the first data with a given filtering threshold, and filtering out, from the compared at least portion of the first data, the part smaller than the filtering threshold.
  • the filtering threshold may be decreased as the number of training iterations of the deep learning model increases, so that the minor parameters are less likely to be selected and eliminated in the later stage of training.
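  • As a concrete illustration of such a decreasing threshold, the following sketch assumes one possible functional form; the application's exact formula appears later only as an image, so the form below, the function name, and the default constants are assumptions chosen merely to reproduce the stated qualitative behavior:

```python
import math

def filtering_threshold(t, phi_gsmp=5e-4, d_gsmp=0.5):
    """Illustrative filtering-threshold schedule (assumed functional form).

    The application states only that phi_gsmp is an initial threshold preset
    before training (typically 1e-4..1e-3), that d_gsmp is a preset constant
    (typically 0.1..1), and that the term d_gsmp*log(t) makes the threshold
    shrink as the iteration count t grows; the exact formula is not reproduced
    here, so the denominator below is an assumption.
    """
    return phi_gsmp / (1.0 + d_gsmp * math.log(max(t, 1)))

# The threshold decays with the iteration count, so small values become
# harder to cull late in training:
for t in (1, 10, 1_000, 100_000):
    print(t, filtering_threshold(t))
```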
  • before performing sparse processing on at least part of the first data, the method may further include: randomly determining a portion of the first data as the at least part; and performing sparse processing on the determined at least part of the first data.
  • the partial data in the first data is sparsely processed, and the remaining data in the first data is not subjected to sparse processing.
  • Part of the data that has not been sparsely processed can be sent in a conventional manner.
  • in an optional example, this may be performed by the processor invoking corresponding instructions stored in the memory, or may be performed by a data acquiring module run by the processor, for example by the random selection submodule and the sparse submodule in the data acquiring module run by the processor, respectively.
  • sending the at least partially sparsified first data to the at least one other node may include: compressing the at least partially sparsified first data, where the compression may use a general-purpose compression algorithm such as snappy or zlib; and sending the compressed first data to the at least one other node.
  • in an optional example, this may be performed by the processor invoking corresponding instructions stored in the memory, or may be performed by a data sending module run by the processor, for example by the compression submodule and the sending submodule in the data sending module run by the processor, respectively.
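  • A minimal sketch of this compress-then-send step, using zlib from the Python standard library (the helper names are illustrative, not from the application; snappy could be substituted where available):

```python
import zlib
import numpy as np

def compress_matrix(mat):
    """Serialize a (mostly zero) float32 matrix and compress it into a byte string.

    A general-purpose compressor such as zlib (snappy would be used analogously)
    shrinks the long runs of zero bytes left behind by the sparsification step.
    """
    return zlib.compress(mat.astype(np.float32).tobytes())

def decompress_matrix(blob, shape):
    """Inverse of compress_matrix: decompress and restore the original shape."""
    return np.frombuffer(zlib.decompress(blob), dtype=np.float32).reshape(shape).copy()
```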
  • the method may further include:
  • any one of the foregoing nodes acquires second data sent by the at least one other node for updating the parameters of the deep learning model trained by the distributed system, for example by receiving and decompressing the second data that the at least one other node compressed and sent for updating the parameters of the deep learning model trained by the distributed system.
  • in an optional example, this may be performed by the processor invoking corresponding instructions stored in the memory, or may be performed by a data acquiring module run by the processor;
  • the parameters of the deep learning model are updated at least according to the second data.
  • the timing of the update may occur when any of the above nodes completes the training of the current round during the iterative training of the deep learning model. In an alternative example, this may be performed by the processor invoking the corresponding instruction stored in the memory or by the update module being executed by the processor.
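  • A hedged sketch of this receive-decompress-update path follows; it assumes, purely for illustration, that the second data is an aggregated gradient matrix and that a plain SGD step is applied (the function and argument names are not from the application):

```python
import zlib
import numpy as np

def apply_received_update(params, blob, shape, learning_rate=0.01):
    """Illustrative receive path: decompress second data (here assumed to be an
    aggregated sparse gradient matrix) and use it to update the cached parameters.

    The SGD step and the names are assumptions; the application only requires
    that the parameters be updated at least according to the second data once
    the current round of training finishes.
    """
    gradient = np.frombuffer(zlib.decompress(blob), dtype=np.float32).reshape(shape)
    return params - learning_rate * gradient
```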
  • the first data includes: a gradient matrix computed by any one of the above nodes in any one round of training during the iterative training of the deep learning model.
  • the distributed deep learning training system provides raw gradient values (including the gradient values produced by each computing node) as input, and the input gradients may be a matrix of single-precision values, which is the matrix variable used to update the parameters of the deep learning model.
  • and/or, in another optional embodiment, the first data includes: a parameter difference matrix between an old parameter of any one round of training of any one of the above nodes during the iterative training of the deep learning model and a new parameter obtained by updating the old parameter at least according to the second data sent by the at least one other node for updating the parameters of the deep learning model trained by the distributed system.
  • in each parameter broadcast operation, the distributed deep learning training system replaces the parameters cached by each computing node with the newly updated parameters.
  • the parameters refer to the matrix variables that construct the deep learning model, and can be matrices of single-precision values.
  • performing sparse processing on at least part of the first data may include: selecting, from the gradient matrix, the first portion that the absolute values are respectively smaller than the filtering threshold a matrix element; randomly selecting a second partial matrix element from the gradient matrix; and setting a value of a matrix element belonging to the first partial matrix element and the second partial matrix element in the gradient matrix to 0, to obtain a sparse gradient matrix.
  • sending the at least partially sparsely processed first data to the at least one other node may include: compressing the sparse gradient matrix into a character string; and transmitting the character string to the at least one other node through the network.
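  • The following is a minimal sketch of this gradient-matrix case; the function and parameter names are illustrative, and the 0.7 random ratio is just one value in the 50%-90% range mentioned below:

```python
import zlib
import numpy as np

def sparsify_and_pack_gradients(grad, threshold, random_ratio=0.7, seed=None):
    """Sketch of the described gradient sparsification (names are illustrative).

    An element is zeroed only if it is selected by BOTH strategies:
      (1) absolute-value strategy: |g| is below the filtering threshold, and
      (2) random strategy: the element falls in a randomly chosen subset.
    The resulting sparse gradient matrix is then compressed into a byte string.
    """
    rng = np.random.default_rng(seed)
    grad = grad.astype(np.float32)
    by_magnitude = np.abs(grad) < threshold            # first partial matrix elements
    by_random = rng.random(grad.shape) < random_ratio  # second partial matrix elements
    sparse = np.where(by_magnitude & by_random, np.float32(0.0), grad)
    return zlib.compress(sparse.tobytes())
```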
  • FIG. 2 is an exemplary flow chart of gradient filtering in an embodiment of a data transmission method of the present application. As shown in FIG. 2, this embodiment includes:
  • step S210 a number of gradients are selected from the original gradient matrix, for example using an absolute value strategy.
  • the absolute value strategy is to select a gradient whose absolute value is less than a given filtering threshold.
  • the filtering threshold can be calculated, for example, by a formula (reproduced in the original only as an image) in which φgsmp denotes the initial filtering threshold, which can be preset before deep learning training, and dgsmp is also a preset constant. In a deep learning training system, the required number of iterations can be specified in advance, and t denotes the current iteration count of the deep learning training; the term dgsmp×log(t) dynamically changes the filtering threshold as the number of iterations increases. As the number of iterations increases, the filtering threshold becomes smaller and smaller, so that in the later stages of training, small gradients are less likely to be culled. In this embodiment, the value of φgsmp can be between 1×10^-4 and 1×10^-3, and the value of dgsmp can be between 0.1 and 1; the specific values can be adjusted according to the specific application.
  • a number of gradients are selected from the input raw gradient matrix, for example using a stochastic strategy.
  • the random strategy randomly selects a given proportion, for example 50%-90% or 60%-80%, of all the input gradient values.
  • steps S210-220 may be performed by a processor calling a corresponding instruction stored in the memory, or may be performed by a sparse processing module executed by the processor or a randomly selected sub-module therein.
  • in step S230, the gradient values selected by both the absolute value strategy and the random strategy are unimportant to the computation and have little influence; they are set to 0, thereby converting the input gradient matrix into a sparse gradient matrix.
  • in step S240, the sparse gradient matrix is processed using a compression strategy to reduce its size.
  • the compression strategy uses a general compression algorithm such as snappy, zlib, etc. to compress the sparse gradient matrix into a string.
  • steps S230-240 may be performed by a processor invoking a corresponding instruction stored in the memory, or may be performed by a sparse processing module executed by the processor or a sparse sub-module therein.
  • through this embodiment, a gradient matrix is turned into a character string by the culling operations of the absolute value strategy and the random strategy and the compression operation of the compression strategy, and its size is greatly reduced.
  • in the gradient accumulation operation, the computing node transmits the generated character string over the network, and the network traffic generated by this process is correspondingly reduced, so the communication time of the gradient accumulation process can be effectively reduced.
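  • A toy end-to-end example of the combined culling and compression (illustrative numbers only; the threshold and ratio below are arbitrary choices within the ranges discussed above, and the size reduction shown is not a benchmark from the application):

```python
import zlib
import numpy as np

# Toy, illustrative check of the size reduction.
rng = np.random.default_rng(0)
grad = rng.normal(scale=1e-4, size=(1000, 1000)).astype(np.float32)

dense_payload = zlib.compress(grad.tobytes())

cull = (np.abs(grad) < 2e-4) & (rng.random(grad.shape) < 0.8)
sparse_payload = zlib.compress(np.where(cull, np.float32(0.0), grad).tobytes())

print(f"dense payload:  {len(dense_payload) / 1e6:.2f} MB")
print(f"sparse payload: {len(sparse_payload) / 1e6:.2f} MB")  # far smaller after culling
```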
  • performing sparse processing on at least part of the first data may include: selecting, from the parameter difference matrix, third partial matrix elements whose absolute values are smaller than the filtering threshold; randomly selecting fourth partial matrix elements from the parameter difference matrix; and setting to 0 the values of the matrix elements of the parameter difference matrix that belong to both the third partial matrix elements and the fourth partial matrix elements, to obtain a sparse parameter difference matrix.
  • sending the at least one part of the first data after the sparse processing to the at least one other node may include: compressing the sparse parameter difference matrix into a character string; and sending the character string to the at least one other node through the network.
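  • The parameter-difference case can be sketched in the same way; this mirrors the gradient sketch above but operates on θnew − θold, and all names are illustrative:

```python
import zlib
import numpy as np

def sparsify_and_pack_param_diff(theta_new, theta_old, threshold,
                                 random_ratio=0.7, seed=None):
    """Sketch of the parameter-difference case (names are illustrative).

    Forms theta_diff = theta_new - theta_old, zeroes the entries selected by
    both the absolute-value and random strategies, and compresses the sparse
    difference matrix into a byte string for broadcasting.
    """
    rng = np.random.default_rng(seed)
    diff = (theta_new - theta_old).astype(np.float32)
    cull = (np.abs(diff) < threshold) & (rng.random(diff.shape) < random_ratio)
    sparse_diff = np.where(cull, np.float32(0.0), diff)
    return zlib.compress(sparse_diff.tobytes())
```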
  • FIG. 3 is an exemplary flow chart of parameter filtering in an embodiment of a data transmission method of the present application.
  • in this embodiment, the newly updated parameters in the deep learning model are denoted θnew and the cached old parameters are denoted θold; the parameter difference matrix is θdiff = θnew - θold, a matrix of the same size as the new and old parameters.
  • as shown in FIG. 3, this embodiment includes:
  • step S310 a number of values are selected from the parameter difference matrix ⁇ diff, for example, using an absolute value strategy.
  • the absolute value strategy is to select a gradient whose absolute value is less than a given filtering threshold.
  • the filtering threshold can be calculated, for example, by a formula (reproduced in the original only as an image) in which φgsmp denotes the initial filtering threshold, which can be preset before deep learning training, and dgsmp is also a preset constant. In a deep learning training system, the required number of iterations can be specified in advance, and t denotes the current iteration count of the deep learning training; the term dgsmp×log(t) dynamically changes the filtering threshold as the number of iterations increases. As the number of iterations increases, the filtering threshold becomes smaller and smaller, so that in the later stages of training, small values are less likely to be culled. In this embodiment, the value of φgsmp can be between 1×10^-4 and 1×10^-3, and the value of dgsmp can be between 0.1 and 1; the specific values can be adjusted according to the specific application.
  • a number of values are selected from the ⁇ diff matrix, for example using a stochastic strategy.
  • the random strategy randomly selects a given proportion, for example 50%-90% or 60%-80%, of all the input θdiff values.
  • steps S310-320 may be performed by the processor invoking corresponding instructions stored in the memory, or may be performed by a sparse processing module executed by the processor or a randomly selected sub-module therein.
  • step S330 the ⁇ diff value selected by both the absolute value strategy and the random strategy is set to 0, thereby converting the ⁇ diff matrix into a sparse matrix.
  • the sparse matrix is processed using a compression strategy to reduce the volume.
  • the compression strategy uses a common compression algorithm, such as snappy, zlib, etc., to compress the sparse matrix into a string.
  • the above steps S330-340 may be performed by a processor invoking a corresponding instruction stored in the memory, or may be performed by a sparse processing module executed by the processor or a sparse sub-module therein.
  • the deep learning training system can greatly reduce the network traffic generated in the parameter broadcast operation by broadcasting the generated character string through the network. Therefore, the communication time can be effectively reduced, thereby reducing the overall deep learning training time.
  • after obtaining the character string, the computing node performs the decompression operation and adds θdiff to the cached θold to update the corresponding parameters.
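  • A minimal sketch of this receiver-side step (the function name is illustrative):

```python
import zlib
import numpy as np

def apply_broadcast_param_diff(theta_old, blob, shape):
    """Receiver-side sketch: decompress the broadcast string back into the sparse
    theta_diff matrix and add it to the cached old parameters."""
    theta_diff = np.frombuffer(zlib.decompress(blob), dtype=np.float32).reshape(shape)
    return theta_old + theta_diff
```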
  • the same node can apply the gradient filtering mode shown in FIG. 2 or the parameter filtering mode shown in FIG. 3, and the corresponding steps are not described herein.
  • any of the data transmission methods provided by the embodiments of the present application may be performed by any suitable device having data processing capabilities, including but not limited to: a terminal device, a server, and the like.
  • any data transmission method provided by the embodiment of the present application may be executed by a processor, such as the processor, by executing a corresponding instruction stored in the memory to perform any one of the data transmission methods mentioned in the embodiments of the present application. This will not be repeated below.
  • all or some of the steps for implementing the above method embodiments can be completed by hardware related to program instructions; the foregoing program may be stored in a computer readable storage medium, and when the program is executed, the steps of the foregoing method embodiments are performed; and the foregoing storage medium includes various media that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disk.
  • FIG. 4 is a schematic structural diagram of an embodiment of a data transmission system according to the present application.
  • the data processing system of the embodiment of the present invention can be used to implement the foregoing various data processing method embodiments of the present application. As shown in FIG. 4, the system of this embodiment includes:
  • the data determining module 410 is configured to determine, by the node in the distributed system, the first data to be sent to the at least one other node for parameter updating the deep learning model of the distributed system training;
  • a sparse processing module 420 configured to perform sparse processing on at least part of the first data
  • the sparse processing module 420 may include: a filtering submodule 422, configured to compare at least part of the first data with a given filtering threshold and filter out, from the compared at least part of the first data, the portion smaller than the filtering threshold, wherein the filtering threshold decreases as the number of training iterations of the deep learning model increases.
  • the data sending module 430 is configured to send the at least partially sparsified first data to the at least one other node.
  • the sparse processing module 420 may further include: a random selection submodule, configured to randomly determine a portion of the first data as the at least part before sparse processing is performed on at least part of the first data according to a predetermined policy; and a sparse submodule, configured to perform sparse processing on the determined at least part of the first data.
  • the data sending module 430 may include: a compression submodule 432, configured to compress the at least partially sparsified first data; and a sending submodule 434, configured to send the compressed first data to the at least one other node.
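  • A structural sketch of how these modules could be wired together in code; the class and method names are illustrative stand-ins, not the application's API, and the transport is stubbed out:

```python
import zlib
import numpy as np

class SparseProcessingModule:
    """Illustrative stand-in for module 420: filtering + random selection + sparsification."""

    def __init__(self, threshold, random_ratio=0.7, seed=None):
        self.threshold = threshold
        self.random_ratio = random_ratio
        self.rng = np.random.default_rng(seed)

    def process(self, first_data):
        cull = (np.abs(first_data) < self.threshold) & \
               (self.rng.random(first_data.shape) < self.random_ratio)
        return np.where(cull, np.float32(0.0), first_data.astype(np.float32))

class DataSendingModule:
    """Illustrative stand-in for module 430: compression submodule + sending submodule."""

    def __init__(self, send_fn):
        self.send_fn = send_fn  # transport stub, e.g. a socket or RPC call

    def send(self, sparse_data):
        self.send_fn(zlib.compress(sparse_data.tobytes()))
```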
  • FIG. 5 is a schematic structural diagram of another embodiment of a data transmission system according to the present application. As shown in FIG. 5, compared with the embodiment shown in FIG. 4, the data transmission system of this embodiment further includes:
  • the data obtaining module 510 is configured to acquire second data that is sent by at least one other node for performing parameter update on the deep learning model of the distributed system training;
  • the updating module 520 is configured to update parameters of the deep learning model of any of the nodes according to the second data.
  • the data acquiring module 510 may include a receiving and decompressing submodule 512, configured to receive and decompress the second data that the at least one other node compressed and sent for updating the parameters of the deep learning model trained by the distributed system.
  • the first data includes: a gradient matrix computed by any one of the above nodes in any one round of training during the iterative training of the deep learning model; and/or a parameter difference matrix between an old parameter of any one round of training of any one of the above nodes during the iterative training of the deep learning model and a new parameter obtained by updating the old parameter at least according to the second data sent by the at least one other node for updating the parameters of the deep learning model trained by the distributed system.
  • when the first data includes the gradient matrix, the filtering submodule 422 is configured to select, from the gradient matrix, first partial matrix elements whose absolute values are smaller than the given filtering threshold; the random selection submodule is configured to randomly select second partial matrix elements from the gradient matrix; the sparse submodule is configured to set to 0 the values of the matrix elements of the gradient matrix that belong to both the first partial matrix elements and the second partial matrix elements, to obtain a sparse gradient matrix; the compression submodule is configured to compress the sparse gradient matrix into a character string; and the sending submodule sends the character string to the at least one other node through the network.
  • when the first data includes the parameter difference matrix, the filtering submodule is configured to select, from the parameter difference matrix, third partial matrix elements whose absolute values are smaller than the given filtering threshold; the random selection submodule is configured to randomly select fourth partial matrix elements from the parameter difference matrix; the sparse submodule is configured to set to 0 the values of the matrix elements of the parameter difference matrix that belong to both the third partial matrix elements and the fourth partial matrix elements, to obtain a sparse parameter difference matrix; the compression submodule is configured to compress the sparse parameter difference matrix into a character string; and the sending submodule is configured to send the character string to the at least one other node through the network.
  • the embodiment of the present application further provides an electronic device, including the data processing system of any of the foregoing embodiments of the present application.
  • the embodiment of the present application further provides another electronic device, including: a processor and the data transmission system of any of the above embodiments of the present application; when the processor runs the data transmission system, the units in the data transmission system of any of the above embodiments of the present application are run.
  • the embodiment of the present application further provides yet another electronic device, including: one or more processors, a memory, a plurality of kinds of cache elements, a communication component, and a communication bus, where the processor, the memory, the plurality of kinds of cache units, and the communication component communicate with one another through the communication bus, the plurality of kinds of cache elements differ in transmission rate and/or storage space, and the plurality of kinds of cache elements are preset with different lookup priorities according to the transmission rate and/or the storage space;
  • the memory is configured to store at least one executable instruction, and the executable instruction causes the processor to perform operations corresponding to the data transmission method of any of the above embodiments of the present application.
  • FIG. 6 is a schematic structural diagram of an embodiment of a node device according to the present application. It includes a processor 602, a communication component 604, a memory 606, and a communication bus 608. Communication components can include, but are not limited to, I/O interfaces, network cards, and the like.
  • Processor 602, communication component 604, and memory 606 complete communication with one another via communication bus 608.
  • the communication component 604 is configured to communicate with network elements of other devices, such as a client or a data collection device.
  • the processor 602 is configured to execute the program 610. Specifically, the related steps in the foregoing method embodiments may be performed.
  • the program can include program code, the program code including computer operating instructions.
  • there may be one or more processors 602, and each may take the form of a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application, and the like.
  • the memory 606 is configured to store the program 610.
  • Memory 606 may include high speed RAM memory and may also include non-volatile memory, such as at least one disk memory.
  • the program 610 includes at least one executable instruction and may specifically be configured to cause the processor 602 to perform the following operations: determining first data to be sent by any node in the distributed system to at least one other node for updating the parameters of the deep learning model trained by the distributed system; performing sparse processing on at least part of the first data; and sending the at least partially sparsified first data to the at least one other node.
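  • Putting the pieces together, one iteration of the loop described here might look like the following sketch; all names, the SGD-style update, and the stubbed transport functions are assumptions for illustration:

```python
import zlib
import numpy as np

def training_iteration(params, compute_gradient, send, receive,
                       threshold, random_ratio=0.7, learning_rate=0.01, seed=None):
    """One illustrative iteration (stubs for compute_gradient, send and receive):
    determine the first data (a gradient matrix), sparsify and compress it, send
    it, then obtain second data from the other nodes and update the parameters."""
    rng = np.random.default_rng(seed)

    grad = compute_gradient(params).astype(np.float32)           # first data
    cull = (np.abs(grad) < threshold) & (rng.random(grad.shape) < random_ratio)
    sparse = np.where(cull, np.float32(0.0), grad)
    send(zlib.compress(sparse.tobytes()))                        # sparsified, compressed

    blob = receive()                                             # second data (compressed)
    update = np.frombuffer(zlib.decompress(blob), dtype=np.float32).reshape(grad.shape)
    return params - learning_rate * update                       # parameter update
```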
  • FIG. 7 is a schematic structural diagram of an embodiment of an electronic device according to the present application.
  • the electronic device includes one or more processors, a communication unit, and the like; the one or more processors are, for example, one or more central processing units (CPUs) 701 and/or one or more graphics processing units (GPUs) 713.
  • the processor can perform various appropriate actions and processing according to executable instructions stored in the read-only memory (ROM) 702 or executable instructions loaded from the storage portion 708 into the random access memory (RAM) 703.
  • the communication unit 712 may include, but is not limited to, a network card, which may include, but is not limited to, an IB (Infiniband) network card; the processor can communicate with the read-only memory 702 and/or the random access memory 703 to execute executable instructions, is connected to the communication unit 712 through the bus 704, and communicates with other target devices via the communication unit 712, thereby completing operations corresponding to any data processing method provided by the embodiments of the present application, for example: determining first data to be sent by any node in the distributed system to at least one other node for updating the parameters of the deep learning model trained by the distributed system; performing sparse processing on at least part of the first data; and sending the at least partially sparsified first data to the at least one other node.
  • in the RAM 703, various programs and data required for the operation of the device can also be stored.
  • the CPU 701, the ROM 702, and the RAM 703 are connected to each other through a bus 704.
  • ROM 702 is an optional module.
  • the RAM 703 stores executable instructions or writes executable instructions to the ROM 702 at runtime, the executable instructions causing the processor 701 to perform operations corresponding to the data processing methods described above.
  • An input/output (I/O) interface 705 is also coupled to bus 704.
  • the communication portion 712 may be integrated or may be provided with a plurality of sub-modules (eg, a plurality of IB network cards) and on the bus link.
  • the following components are connected to the I/O interface 705: an input portion 706 including a keyboard, a mouse, etc.; an output portion 707 including a cathode ray tube (CRT), a liquid crystal display (LCD), and the like, and a speaker; a storage portion 708 including a hard disk or the like And a communication portion 709 including a network interface card such as a LAN card, a modem, or the like.
  • the communication section 709 performs communication processing via a network such as the Internet.
  • Driver 710 is also connected to I/O interface 705 as needed.
  • a removable medium 711 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory or the like, is mounted on the drive 710 as needed so that a computer program read therefrom is installed into the storage portion 708 as needed.
  • FIG. 7 is only an optional implementation manner.
  • the number and types of the components in FIG. 7 may be selected, deleted, added, or replaced according to actual needs;
  • different functional components may also be implemented separately or in an integrated manner; for example, the GPU and the CPU may be arranged separately, or the GPU may be integrated on the CPU, and the communication part may be arranged separately or integrated on the CPU or the GPU, and so on.
  • an embodiment of the present disclosure includes a computer program product comprising a computer program tangibly embodied on a machine readable medium; the computer program comprises program code for executing the method illustrated in the flowchart, and the program code may include instructions corresponding to the method steps provided in the embodiments of the present application, for example: an instruction for determining first data to be sent by any node in the distributed system to at least one other node for updating the parameters of the deep learning model trained by the distributed system; an instruction for performing sparse processing on at least part of the first data; and an instruction for sending the at least partially sparsified first data to the at least one other node.
  • the embodiment of the present application further provides a computer program, including computer readable code, when the computer readable code is run on a device, the processor in the device executes to implement any of the embodiments of the present application. Instructions for each step in the data transfer method.
  • the embodiment of the present application further provides a computer readable storage medium for storing a computer readable instruction, when the instruction is executed, implementing the operations of each step in the data transmission method of any embodiment of the present application.
  • the above methods according to embodiments of the present application may be implemented in hardware or firmware, or as software or computer code that can be stored in a recording medium (such as a CD-ROM, a RAM, a floppy disk, a hard disk, or a magneto-optical disk), or as computer code downloaded over a network that is originally stored in a remote recording medium or a non-transitory machine readable medium and is to be stored in a local recording medium, so that the methods described herein can be processed by such software stored on a recording medium using a general-purpose computer, a special-purpose processor, or programmable or dedicated hardware such as an ASIC or an FPGA.
  • it will be understood that a computer, processor, microprocessor controller, or programmable hardware includes storage components (e.g., RAM, ROM, flash memory, etc.) that can store or receive software or computer code; when the software or computer code is accessed and executed by the computer, processor, or hardware, the processing methods described herein are implemented. Moreover, when a general-purpose computer accesses code for implementing the processing shown herein, execution of the code converts the general-purpose computer into a special-purpose computer for performing the processing shown herein.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Complex Calculations (AREA)
  • Mobile Radio Communication Systems (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

Embodiments of the present application disclose a data transmission method and system, and an electronic device. The method includes: determining first data to be sent by a node in a distributed system to at least one other node for updating parameters of a deep learning model trained by the distributed system; performing sparse processing on at least part of the first data; and sending the at least partially sparsified first data to the at least one other node. Implementation of the present application helps reduce network communication traffic and shorten deep learning training time without lowering the communication frequency.

Description

Data transmission method and system, and electronic device
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on October 28, 2016 under application No. CN 201610972729.4 and entitled "Data transmission method and system, and electronic device", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to deep learning techniques, and in particular to data transmission methods and systems, and electronic devices.
Background
With the arrival of the big data era, deep learning has been widely applied in fields including image recognition, recommendation systems, and natural language processing. A deep learning training system is a computing system that obtains a deep learning model by training on input data. In industrial settings, in order to provide high-quality deep learning models, a deep learning training system needs to process a large amount of training data; for example, the ImageNet data set released by the Stanford University Computer Vision Laboratory contains more than 14 million high-precision images. However, because of its limited computing power and memory, a single-node deep learning training system often takes weeks or even months to complete the computation. Under these circumstances, distributed deep learning training systems have received broad attention in industry and academia.
A typical distributed deep learning training system usually runs a gradient descent algorithm on a distributed computing framework. In each iteration, the network traffic generated by gradient aggregation, parameter broadcasting, and the like is usually proportional to the size of the deep learning model, and new deep learning models keep growing in size: for example, the AlexNet model contains more than 60 million parameters, and the VGG-16 model has hundreds of millions of parameters. A large amount of network traffic is therefore generated during deep learning training and, constrained by network bandwidth and other conditions, communication time becomes one of the performance bottlenecks of a distributed deep learning training system.
Summary of the Invention
Embodiments of the present application provide a data transmission scheme.
According to one aspect of the embodiments of the present application, a data transmission method is provided, including:
determining first data to be sent by a node in a distributed system to at least one other node for updating parameters of a deep learning model trained by the distributed system;
performing sparse processing on at least part of the first data;
sending the at least partially sparsified first data to the at least one other node.
Optionally, performing sparse processing on at least part of the first data includes: comparing the at least part of the first data with a given filtering threshold, and filtering out, from the at least part, the portion smaller than the filtering threshold, where the filtering threshold decreases as the number of training iterations of the deep learning model increases.
Optionally, before performing sparse processing on at least part of the first data, the method further includes: randomly determining a portion of the first data as the at least part; and performing sparse processing on the determined at least part of the first data.
Optionally, sending the at least partially sparsified first data to the at least one other node includes: compressing the at least partially sparsified first data; and sending the compressed first data to the at least one other node.
Optionally, the method further includes: acquiring second data sent by the at least one other node for updating the parameters of the deep learning model trained by the distributed system; and updating the parameters of the deep learning model at least according to the second data.
Optionally, acquiring the second data sent by the at least one other node for updating the parameters of the deep learning model trained by the distributed system includes: receiving and decompressing the second data that the at least one other node compressed and sent for updating the parameters of the deep learning model trained by the distributed system.
Optionally, the first data includes: a gradient matrix computed in any one round of training during the iterative training of the deep learning model; and/or a parameter difference matrix between an old parameter of any one round of training during the iterative training of the deep learning model and a new parameter obtained by updating the old parameter at least according to the second data sent by the at least one other node for updating the parameters of the deep learning model trained by the distributed system.
Optionally, when the first data includes the gradient matrix, performing sparse processing on at least part of the first data includes: selecting, from the gradient matrix, first partial matrix elements whose absolute values are smaller than the filtering threshold; randomly selecting second partial matrix elements from the gradient matrix; and setting to 0 the values of the matrix elements of the gradient matrix that belong to both the first partial matrix elements and the second partial matrix elements, to obtain a sparse gradient matrix; and sending the at least partially sparsified first data to the at least one other node includes: compressing the sparse gradient matrix into a character string, and sending the character string to the at least one other node through a network.
Optionally, when the first data includes the parameter difference matrix, performing sparse processing on at least part of the first data includes: selecting, from the parameter difference matrix, third partial matrix elements whose absolute values are smaller than the filtering threshold; randomly selecting fourth partial matrix elements from the parameter difference matrix; and setting to 0 the values of the matrix elements of the parameter difference matrix that belong to both the third partial matrix elements and the fourth partial matrix elements, to obtain a sparse parameter difference matrix; and sending the at least partially sparsified first data to the at least one other node includes: compressing the sparse parameter difference matrix into a character string, and sending the character string to the at least one other node through the network.
According to another aspect of the embodiments of the present application, a data transmission system is provided, including:
a data determining module, configured to determine first data to be sent by any node in the distributed system to at least one other node for updating parameters of a deep learning model trained by the distributed system;
a sparse processing module, configured to perform sparse processing on at least part of the first data;
a data sending module, configured to send the at least partially sparsified first data to the at least one other node.
Optionally, the sparse processing module includes: a filtering submodule, configured to compare the at least part of the first data with a given filtering threshold and filter out, from the at least part, the portion smaller than the filtering threshold, where the filtering threshold decreases as the number of training iterations of the deep learning model increases.
Optionally, the sparse processing module further includes: a random selection submodule, configured to randomly determine a portion of the first data as the at least part; and a sparse submodule, configured to perform sparse processing on the determined at least part of the first data.
Optionally, the data sending module includes: a compression submodule, configured to compress the at least partially sparsified first data; and a sending submodule, configured to send the compressed first data to the at least one other node.
Optionally, the system further includes: a data acquiring module, configured to acquire second data sent by the at least one other node for updating the parameters of the deep learning model trained by the distributed system; and an updating module, configured to update the parameters of the deep learning model at least according to the second data.
Optionally, the data acquiring module includes: a receiving and decompressing submodule, configured to receive and decompress the second data that the at least one other node compressed and sent for updating the parameters of the deep learning model trained by the distributed system.
Optionally, the first data includes: a gradient matrix computed in any one round of training during the iterative training of the deep learning model; and/or a parameter difference matrix between an old parameter of any one round of training during the iterative training of the deep learning model and a new parameter obtained by updating the old parameter at least according to the second data sent by the at least one other node for updating the parameters of the deep learning model trained by the distributed system.
Optionally, when the first data includes the gradient matrix, the filtering submodule is configured to select, from the gradient matrix, first partial matrix elements whose absolute values are smaller than the filtering threshold; the random selection submodule is configured to randomly select second partial matrix elements from the gradient matrix; the sparse submodule is configured to set to 0 the values of the matrix elements of the gradient matrix that belong to both the first partial matrix elements and the second partial matrix elements, to obtain a sparse gradient matrix; the compression submodule is configured to compress the sparse gradient matrix into a character string; and the sending submodule sends the character string to the at least one other node through a network.
Optionally, when the first data includes the parameter difference matrix, the filtering submodule is configured to select, from the parameter difference matrix, third partial matrix elements whose absolute values are smaller than the filtering threshold; the random selection submodule is configured to randomly select fourth partial matrix elements from the parameter difference matrix; the sparse submodule is configured to set to 0 the values of the matrix elements of the parameter difference matrix that belong to both the third partial matrix elements and the fourth partial matrix elements, to obtain a sparse parameter difference matrix; the compression submodule is configured to compress the sparse parameter difference matrix into a character string; and the sending submodule is configured to send the character string to the at least one other node through the network.
According to yet another aspect of the embodiments of the present application, an electronic device is provided, including the data transmission system described in any embodiment of the present application.
According to still another aspect of the embodiments of the present application, an electronic device is provided, including:
a processor and the data transmission system described in any embodiment of the present application;
where, when the processor runs the data transmission system, the units in the data transmission system described in any embodiment of the present application are run.
According to still another aspect of the embodiments of the present application, an electronic device is provided, including: one or more processors, a memory, a communication component, and a communication bus, where the processor, the memory, and the communication component communicate with one another through the communication bus;
the memory is configured to store at least one executable instruction, and the executable instruction causes the processor to perform operations corresponding to the data transmission method provided by any embodiment of the present application.
According to still another aspect of the embodiments of the present application, a computer program is provided, including computer readable code; when the computer readable code runs on a device, a processor in the device executes instructions for implementing the steps of the data transmission method described in any of the above embodiments of the present application.
According to still another aspect of the embodiments of the present application, a computer readable storage medium is further provided for storing computer readable instructions; when the instructions are executed, the operations of the steps of the data transmission method described in any of the above embodiments of the present application are implemented.
With the data transmission method and system, electronic device, program, and medium provided by the embodiments of the present application, first data to be sent by any node in a distributed system to at least one other node for updating parameters of a deep learning model trained by the distributed system is determined, sparse processing is performed on at least part of the first data, and the at least partially sparsified first data is sent to the at least one other node. The embodiments of the present application can thus discard at least some unimportant data (for example gradients and/or parameters), reduce the network traffic generated by each gradient accumulation and/or parameter broadcast, and shorten the training time. The present application does not need to lower the communication frequency and can obtain the latest parameters in time, so it can be used both in deep learning training systems that communicate in every iteration and in systems that need to reduce the communication frequency.
The technical solutions of the present application are described in further detail below with reference to the accompanying drawings and embodiments.
Brief Description of the Drawings
The accompanying drawings, which constitute a part of the specification, describe embodiments of the present application and, together with the description, serve to explain the principles of the present application.
The present application is described below with reference to the accompanying drawings and in combination with optional embodiments, in which:
FIG. 1 is a flow chart of an embodiment of a data transmission method according to the present application.
FIG. 2 is an exemplary flow chart of gradient filtering in an embodiment of the data transmission method of the present application.
FIG. 3 is an exemplary flow chart of parameter filtering in an embodiment of the data transmission method of the present application.
FIG. 4 is a schematic structural diagram of an embodiment of a data transmission system according to the present application.
FIG. 5 is a schematic structural diagram of another embodiment of a data transmission system according to the present application.
FIG. 6 is a schematic structural diagram of an embodiment of a node device according to the present application.
FIG. 7 is a schematic structural diagram of an embodiment of an electronic device according to the present application.
For clarity, these drawings are schematic and simplified; they give only the details necessary for understanding the present application and omit other details.
Detailed Description
Various exemplary embodiments of the present application are now described in detail with reference to the accompanying drawings. It should be understood that, while the detailed description and specific examples indicate optional embodiments of the present application, they are given for illustrative purposes only. It should be noted that, unless otherwise specified, the relative arrangement of components and steps, the numerical expressions, and the numerical values set forth in these embodiments do not limit the scope of the present application.
Meanwhile, it should be understood that, for ease of description, the dimensions of the parts shown in the drawings are not drawn according to actual proportional relationships.
The following description of at least one exemplary embodiment is in fact merely illustrative and is in no way intended to limit the present application or its application or use.
Techniques, methods, and devices known to a person of ordinary skill in the relevant art may not be discussed in detail, but where appropriate, such techniques, methods, and devices should be regarded as part of the specification.
It should be noted that similar reference numerals and letters denote similar items in the following drawings; therefore, once an item is defined in one drawing, it does not need to be further discussed in subsequent drawings.
Embodiments of the present application can be applied to electronic devices such as terminal devices, computer systems, and servers, which can operate together with numerous other general-purpose or special-purpose computing system environments or configurations. Examples of well-known terminal devices, computing systems, environments, and/or configurations suitable for use with electronic devices such as terminal devices, computer systems, and servers include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, handheld or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, networked personal computers, small computer systems, mainframe computer systems, distributed cloud computing technology environments including any of the above systems, and the like.
Electronic devices such as terminal devices, computer systems, and servers can be described in the general context of computer-system-executable instructions (such as program modules) executed by a computer system. Generally, program modules may include routines, programs, target programs, components, logic, data structures, and the like, which perform particular tasks or implement particular abstract data types. The computer system/server can be implemented in a distributed cloud computing environment, where tasks are performed by remote processing devices linked through a communication network. In a distributed cloud computing environment, program modules may be located on local or remote computing system storage media including storage devices.
FIG. 1 is a flow chart of an embodiment of a data transmission method according to the present application. As shown in FIG. 1, the data transmission method of this embodiment includes:
In step S110, first data to be sent by a node in the distributed system to at least one other node for updating the parameters of the deep learning model trained by the distributed system is determined.
The distributed system here may be, for example, a cluster composed of a plurality of computing nodes, or may be composed of a plurality of computing nodes and a parameter server. The deep learning model here may include, for example but not limited to, a neural network (such as a convolutional neural network), and the parameters may be, for example, matrix variables used to construct the deep learning model, and the like.
In an optional example, step S110 may be performed by a processor invoking corresponding instructions stored in a memory, or may be performed by a data determining module run by the processor.
In step S120, sparse processing is performed on at least part of the first data.
In the embodiments of the present application, the purpose of the sparse processing is to cull less important parts from the first data, so that the network traffic consumed by transmitting the first data becomes smaller and the training time of the deep learning model is reduced.
In an optional example, step S120 may be performed by a processor invoking corresponding instructions stored in a memory, or may be performed by a sparse processing module run by the processor.
In step S130, the at least partially sparsified first data is sent to the at least one other node.
In an optional example, step S130 may be performed by a processor invoking corresponding instructions stored in a memory, or may be performed by a data sending module run by the processor.
The data transmission method of the embodiments of the present application is used to transmit, between any two computing nodes, or between a computing node and a parameter server, in a distributed deep learning system, data for updating the parameters of the deep learning model running on the computing nodes. It can ignore less important parts of the transmitted data, such as unimportant gradients and/or parameters, which helps reduce the network traffic generated during aggregation and broadcast operations, thereby reducing the time spent on network transmission in each iteration and in turn shortening the overall deep learning training time.
In one optional embodiment, performing sparse processing on at least part of the first data may include: comparing the at least part of the first data with a given filtering threshold, and filtering out, from the compared at least part of the first data, the portion smaller than the filtering threshold.
The filtering threshold may decrease as the number of training iterations of the deep learning model increases, so that in the later stages of training, tiny parameters are less likely to be selected and culled.
In one optional embodiment, before performing sparse processing on at least part of the first data, the method may further include: randomly determining a portion of the first data as the at least part; and performing sparse processing on the determined at least part of the first data. In other words, part of the data in the first data is sparsified, while the remaining data in the first data is not; the part that is not sparsified can be sent in a conventional manner. In an optional example, this may be performed by a processor invoking corresponding instructions stored in a memory, or may be performed by a data acquiring module run by the processor, for example by the random selection submodule and the sparse submodule in the data acquiring module run by the processor, respectively.
In one optional embodiment, sending the at least partially sparsified first data to the at least one other node may include: compressing the at least partially sparsified first data, where a general-purpose compression algorithm such as snappy or zlib may be used; and then sending the compressed first data to the at least one other node. In an optional example, this may be performed by a processor invoking corresponding instructions stored in a memory, or may be performed by a data sending module run by the processor, for example by the compression submodule and the sending submodule in the data sending module run by the processor, respectively.
In another implementation of the data transmission method of the present application, the method may further include:
the above node acquires second data sent by the at least one other node for updating the parameters of the deep learning model trained by the distributed system, for example by receiving and decompressing the second data that the at least one other node compressed and sent for updating the parameters of the deep learning model trained by the distributed system. In an optional example, this may be performed by a processor invoking corresponding instructions stored in a memory, or may be performed by a data acquiring module run by the processor;
the parameters of the deep learning model are updated at least according to the second data. The update may take place when the node completes the current round of training during the iterative training of the deep learning model. In an optional example, this may be performed by a processor invoking corresponding instructions stored in a memory, or may be performed by an updating module run by the processor.
In one optional embodiment, the first data includes: the gradient matrix computed by the node in any one round of training during the iterative training of the deep learning model. The distributed deep learning training system provides raw gradient values (including the gradient values produced by each computing node) as input; the input gradients may be a matrix of single-precision values, which is the matrix variable used to update the parameters of the deep learning model. And/or, in another optional embodiment, the first data includes: a parameter difference matrix between an old parameter of any one round of training of the node during the iterative training of the deep learning model and a new parameter obtained by updating the old parameter at least according to the second data sent by the at least one other node for updating the parameters of the deep learning model trained by the distributed system. In each parameter broadcast operation, the distributed deep learning training system replaces the parameters cached by each computing node with the newly updated parameters. The parameters here refer to the matrix variables that construct the deep learning model and may be matrices of single-precision values.
In an optional example of the embodiments of the present application, when the first data includes the gradient matrix, performing sparse processing on at least part of the first data may include: selecting, from the gradient matrix, first partial matrix elements whose absolute values are smaller than the filtering threshold; randomly selecting second partial matrix elements from the gradient matrix; and setting to 0 the values of the matrix elements of the gradient matrix that belong to both the first partial matrix elements and the second partial matrix elements, to obtain a sparse gradient matrix. Correspondingly, in this example, sending the at least partially sparsified first data to the at least one other node may include: compressing the sparse gradient matrix into a character string, and sending the character string to the at least one other node through the network.
FIG. 2 is an exemplary flow chart of gradient filtering in an embodiment of the data transmission method of the present application. As shown in FIG. 2, this embodiment includes:
In step S210, a number of gradients are selected from the original gradient matrix, for example using an absolute value strategy.
The absolute value strategy selects gradients whose absolute values are smaller than a given filtering threshold. The filtering threshold can, for example, be calculated by a formula (reproduced in the original only as an image, Figure PCTCN2017108450-appb-000001) in which φgsmp denotes the initial filtering threshold, which can be preset before deep learning training, and dgsmp is also a preset constant. In a deep learning training system, the required number of iterations can be specified in advance, and t denotes the current iteration count of the deep learning training; the term dgsmp×log(t) dynamically changes the filtering threshold as the number of iterations increases. As the number of iterations increases, the filtering threshold becomes smaller and smaller, so that in the later stages of training, tiny gradients are less likely to be selected and culled. In this embodiment, φgsmp may take a value between 1×10^-4 and 1×10^-3 and dgsmp between 0.1 and 1; the specific values can be adjusted according to the specific application.
In step S220, a number of gradients are selected from the input original gradient matrix, for example using a random strategy.
The random strategy randomly selects a given proportion, for example 50%-90% or 60%-80%, of all the input gradient values.
In an optional example, the above steps S210-S220 may be performed by a processor invoking corresponding instructions stored in a memory, or may be performed by a sparse processing module run by the processor or the random selection submodule therein.
In step S230, the gradient values selected by both the absolute value strategy and the random strategy are unimportant to the computation and have little influence; they are set to 0, thereby converting the input gradient matrix into a sparse gradient matrix.
In step S240, the sparse gradient matrix is processed using a compression strategy to reduce its size.
The compression strategy uses a general-purpose compression algorithm, such as snappy or zlib, to compress the sparse gradient matrix into a character string.
In an optional example, the above steps S230-S240 may be performed by a processor invoking corresponding instructions stored in a memory, or may be performed by a sparse processing module run by the processor or the sparse submodule therein.
Through the embodiment shown in FIG. 2, a gradient matrix is turned into a character string by the culling operations of the absolute value strategy and the random strategy and the compression operation of the compression strategy, and its size is greatly reduced. In the gradient accumulation operation, the computing node transmits the generated character string over the network, and the network traffic generated by this process is correspondingly reduced; therefore, the communication time of the gradient accumulation process can be effectively reduced.
In another optional example of the embodiments of the present application, when the first data includes the parameter difference matrix, performing sparse processing on at least part of the first data may include: selecting, from the parameter difference matrix, third partial matrix elements whose absolute values are smaller than the filtering threshold; randomly selecting fourth partial matrix elements from the parameter difference matrix; and setting to 0 the values of the matrix elements of the parameter difference matrix that belong to both the third partial matrix elements and the fourth partial matrix elements, to obtain a sparse parameter difference matrix. Correspondingly, in this example, sending the at least partially sparsified first data to the at least one other node may include: compressing the sparse parameter difference matrix into a character string, and sending the character string to the at least one other node through the network.
图3为根据本申请数据传输方法实施例中参数过滤的一个示例性流程图。在本实施例中,深度学习模型中新更新的参数由θnew表示,缓存的旧参数由θold表示。参数差值矩阵表示为:θdiff=θnew-θold,是一个与新参数和旧参数同样规模的矩阵。如图3所示,该实施例包括:
在步骤S310,例如采用绝对值策略,从参数差值矩阵θdiff中选定若干数值。
其中,绝对值策略为选取绝对值小于给定过滤阈值的梯度。其中的过滤阈值可以示例性地由以下公式计算:
Figure PCTCN2017108450-appb-000002
其中,φgsmp表示初始过滤阈值,可以在深度学习训练前预先设定,dgsmp也是一个预设设定的常量。在深度学习训练系统中,需要的迭代次数是可以预先指定的,t表示深度学习训练中当前的迭代次数。dgsmp×log(t)可以随着迭代次数的增加而动态改变过滤阈值。随着迭代次数的增加,过滤阈值越来越小,这样,在训练后期,微小梯度更不容易被选择剔除。在本实施例中,φgsmp的取值可以1x10-4到1x10-3之间,dgsmp的取值可以可以在0.1到1之间,具体的取值可根据具体应用调整。
In step S320, a number of values are selected from the θdiff matrix, for example by using a random strategy.
The random strategy randomly selects, from all the values of the input θdiff matrix, a given proportion, for example 50%-90% or 60%-80%.
In an optional example, steps S310-S320 above may be performed by a processor invoking corresponding instructions stored in a memory, or may be performed by a sparse processing module run by the processor or a random selection sub-module therein.
In step S330, the θdiff values selected by both the absolute-value strategy and the random strategy are set to 0, so that the θdiff matrix is converted into a sparse matrix.
In step S340, the sparse matrix is processed with a compression strategy to reduce its size.
The compression strategy uses a general-purpose compression algorithm, such as snappy or zlib, to compress the sparse matrix into a string.
In an optional example, steps S330-S340 above may be performed by a processor invoking corresponding instructions stored in a memory, or may be performed by a sparse processing module run by the processor or a sparsification sub-module therein.
By broadcasting the generated string over the network, the deep learning training system can greatly reduce the network traffic generated in the parameter broadcast operation; therefore, the communication time can be effectively reduced, which in turn reduces the overall deep learning training time. After obtaining the aforementioned string, a computing node performs a decompression operation and adds θdiff to the cached θold to update the corresponding parameters.
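A minimal sketch of the receiving side just described, assuming the broadcast payload was produced by zlib-compressing a single-precision θdiff matrix (as in the illustrative sender sketch above); the function name is an assumption for illustration, not an API from the original disclosure.

import zlib

import numpy as np


def apply_parameter_broadcast(payload: bytes, theta_old: np.ndarray) -> np.ndarray:
    """Decompress the broadcast sparse parameter-difference matrix and update the cached parameters."""
    raw = zlib.decompress(payload)
    theta_diff = np.frombuffer(raw, dtype=np.float32).reshape(theta_old.shape)
    return theta_old.astype(np.float32) + theta_diff  # theta_new = theta_old + theta_diff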
In an optional embodiment, the same node may apply both the gradient filtering scheme shown in FIG. 2 and the parameter filtering scheme shown in FIG. 3; the corresponding steps are not repeated here.
Any data transmission method provided in the embodiments of the present application may be performed by any appropriate device with data processing capability, including but not limited to a terminal device and a server. Alternatively, any data transmission method provided in the embodiments of the present application may be performed by a processor; for example, the processor performs any data transmission method mentioned in the embodiments of the present application by invoking corresponding instructions stored in a memory. This will not be repeated below.
A person of ordinary skill in the art can understand that all or part of the steps for implementing the foregoing method embodiments may be completed by hardware related to program instructions. The foregoing program may be stored in a computer-readable storage medium, and when the program is executed, the steps of the foregoing method embodiments are performed. The foregoing storage medium includes various media that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disc.
FIG. 4 is a schematic structural diagram of an embodiment of the data transmission system according to the present application. The data transmission system of this embodiment may be used to implement the foregoing data transmission method embodiments of the present application. As shown in FIG. 4, the system of this embodiment includes:
a data determination module 410, configured to determine first data to be sent by any node in a distributed system to at least one other node and used for updating parameters of a deep learning model trained by the distributed system;
a sparse processing module 420, configured to perform sparse processing on at least part of the first data;
In an optional implementation of the data transmission system embodiments of the present application, the sparse processing module 420 may include: a filtering sub-module 422, configured to compare the at least part of the first data with a given filter threshold, respectively, and to filter out, from the compared at least part of the first data, the portions that are smaller than the filter threshold, wherein the filter threshold decreases as the number of training iterations of the deep learning model increases.
a data sending module 430, configured to send, to the at least one other node, the first data at least part of which has been subjected to sparse processing.
In yet another embodiment of the data transmission systems of the present application, the sparse processing module 420 may further include: a random selection sub-module, configured to randomly determine a portion of the first data as the at least part before sparse processing is performed on at least part of the first data according to a predetermined strategy; and a sparsification sub-module, configured to perform sparse processing on the determined at least part of the first data.
In an optional implementation of the data transmission system embodiments of the present application, the data sending module 430 may include: a compression sub-module 432, configured to compress the at least partly sparsified first data; and a sending sub-module 434, configured to send the compressed first data to the at least one other node. FIG. 5 is a schematic structural diagram of another embodiment of the data transmission system according to the present application. As shown in FIG. 5, compared with the embodiment shown in FIG. 4, the data transmission system of this embodiment further includes:
a data acquisition module 510, configured to obtain second data sent by the at least one other node and used for updating the parameters of the deep learning model trained by the distributed system;
an update module 520, configured to update the parameters of the deep learning model of the aforementioned node at least according to the second data.
In an optional implementation of the data transmission system embodiments of the present application, the data acquisition module 510 may include a receiving and decompression sub-module 512, configured to receive and decompress the second data that was compressed and then sent by the at least one other node and that is used for updating the parameters of the deep learning model trained by the distributed system.
In one optional implementation, the first data includes: a gradient matrix computed by the aforementioned node in any one training pass during the iterative training of the deep learning model; and/or a parameter difference matrix between the old parameters of any one training pass of the aforementioned node during the iterative training of the deep learning model and the new parameters obtained by updating the old parameters at least according to second data that is sent by the at least one other node and used for updating the parameters of the deep learning model trained by the distributed system.
When the first data includes the gradient matrix, the filtering sub-module 422 is configured to select, from the gradient matrix, a first portion of matrix elements whose absolute values are respectively smaller than the given filter threshold; the random selection sub-module is configured to randomly select a second portion of matrix elements from the gradient matrix; the sparsification sub-module is configured to set to 0 the values of the matrix elements of the gradient matrix that belong to both the first portion of matrix elements and the second portion of matrix elements, to obtain a sparse gradient matrix; the compression sub-module is configured to compress the sparse gradient matrix into a string; and the sending sub-module sends the string to the aforementioned at least one other node over the network.
When the first data includes the parameter difference matrix, the filtering sub-module is configured to select, from the parameter difference matrix, a third portion of matrix elements whose absolute values are respectively smaller than the given filter threshold; the random selection sub-module is configured to randomly select a fourth portion of matrix elements from the parameter difference matrix; the sparsification sub-module is configured to set to 0 the values of the matrix elements of the parameter difference matrix that belong to both the third portion of matrix elements and the fourth portion of matrix elements, to obtain a sparse parameter difference matrix; the compression sub-module is configured to compress the sparse parameter difference matrix into a string; and the sending sub-module is configured to send the string to the aforementioned at least one other node over the network.
An embodiment of the present application further provides an electronic device, including the data transmission system of any one of the foregoing embodiments of the present application.
An embodiment of the present application further provides another electronic device, including:
a processor and the data transmission system of any one of the foregoing embodiments of the present application;
wherein, when the processor runs the data transmission system, the units in the data transmission system of any one of the foregoing embodiments of the present application are run.
An embodiment of the present application further provides yet another electronic device, including: one or more processors, a memory, a plurality of cache elements, a communication component and a communication bus, wherein the processor, the memory, the plurality of cache elements and the communication component communicate with one another through the communication bus, the plurality of cache elements differ in transfer rate and/or storage space, and the plurality of cache elements are preset with different lookup priorities according to their transfer rate and/or storage space;
the memory is configured to store at least one executable instruction, and the executable instruction causes the processor to perform operations corresponding to the data transmission method of any one of the foregoing embodiments of the present application.
FIG. 6 is a schematic structural diagram of an embodiment of a node device according to the present application. The node device includes: a processor 602, a communication component 604, a memory 606, and a communication bus 608. The communication component may include, but is not limited to, an I/O interface, a network interface card, and the like.
The processor 602, the communication component 604 and the memory 606 communicate with one another through the communication bus 608.
The communication component 604 is configured to communicate with network elements of other devices, such as clients or data collection devices.
The processor 602 is configured to execute a program 610 and may specifically perform the relevant steps in the foregoing method embodiments.
Specifically, the program may include program code, and the program code includes computer operation instructions.
There may be one or more processors 602, and each processor may take the form of a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application.
The memory 606 is configured to store the program 610. The memory 606 may include a high-speed RAM memory, and may also include a non-volatile memory, for example at least one disk memory.
The program 610 includes at least one executable instruction, which may specifically be used to cause the processor 602 to perform the following operations: determining first data to be sent by any node in a distributed system to at least one other node and used for updating parameters of a deep learning model trained by the distributed system; performing sparse processing on at least part of the first data; and sending, to the at least one other node, the first data at least part of which has been sparsified.
For the specific implementation of each step in the program 610, reference may be made to the corresponding descriptions of the corresponding steps and units in the foregoing embodiments, which will not be repeated here. A person skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the devices and modules described above, reference may be made to the corresponding process descriptions in the foregoing method embodiments, which will not be repeated here.
FIG. 7 is a schematic structural diagram of an embodiment of an electronic device according to the present application. Referring to FIG. 7, a schematic structural diagram of an electronic device suitable for implementing a terminal device or a server of the embodiments of the present application is shown. As shown in FIG. 7, the electronic device includes one or more processors, a communication part, and the like. The one or more processors are, for example, one or more central processing units (CPUs) 701 and/or one or more graphics processing units (GPUs) 713, and the processors may perform various appropriate actions and processing according to executable instructions stored in a read-only memory (ROM) 702 or executable instructions loaded from a storage section 708 into a random access memory (RAM) 703. The communication part 712 may include, but is not limited to, a network interface card, which may include, but is not limited to, an IB (Infiniband) network interface card. The processors may communicate with the read-only memory 702 and/or the random access memory 703 to execute the executable instructions, are connected to the communication part 712 through a bus 704, and communicate with other target devices via the communication part 712, so as to complete the operations corresponding to any data transmission method provided by the embodiments of the present application, for example: determining first data to be sent by any node in a distributed system to at least one other node and used for updating parameters of a deep learning model trained by the distributed system; performing sparse processing on at least part of the first data; and sending, to the at least one other node, the first data at least part of which has been sparsified.
In addition, the RAM 703 may also store various programs and data required for the operation of the apparatus. The CPU 701, the ROM 702 and the RAM 703 are connected to one another through the bus 704. Where the RAM 703 is present, the ROM 702 is an optional module. The RAM 703 stores executable instructions, or writes executable instructions into the ROM 702 at runtime, and the executable instructions cause the processor 701 to perform operations corresponding to the foregoing data transmission method. An input/output (I/O) interface 705 is also connected to the bus 704. The communication part 712 may be provided in an integrated manner, or may be provided with a plurality of sub-modules (for example, a plurality of IB network interface cards) linked on the bus.
The following components are connected to the I/O interface 705: an input section 706 including a keyboard, a mouse and the like; an output section 707 including a cathode-ray tube (CRT), a liquid crystal display (LCD), a speaker and the like; a storage section 708 including a hard disk and the like; and a communication section 709 including a network interface card such as a LAN card or a modem. The communication section 709 performs communication processing via a network such as the Internet. A drive 710 is also connected to the I/O interface 705 as needed. A removable medium 711, such as a magnetic disk, an optical disc, a magneto-optical disc or a semiconductor memory, is mounted on the drive 710 as needed, so that a computer program read therefrom is installed into the storage section 708 as needed.
It should be noted that the architecture shown in FIG. 7 is only an optional implementation. In specific practice, the number and types of the components in FIG. 7 may be selected, reduced, increased or replaced according to actual needs. Different functional components may be provided separately or in an integrated manner; for example, the GPU and the CPU may be provided separately, or the GPU may be integrated on the CPU, and the communication part may be provided separately or may be integrated on the CPU or the GPU, and so on. These alternative implementations all fall within the protection scope disclosed by the present application.
In particular, according to the embodiments of the present disclosure, the process described above with reference to the flowchart may be implemented as a computer software program. For example, an embodiment of the present disclosure includes a computer program product, which includes a computer program tangibly contained on a machine-readable medium. The computer program contains program code for performing the method shown in the flowchart, and the program code may include instructions corresponding to the method steps provided by the embodiments of the present application, for example: an instruction for determining first data to be sent by any node in a distributed system to at least one other node and used for updating parameters of a deep learning model trained by the distributed system; an instruction for performing sparse processing on at least part of the first data; and an instruction for sending, to the at least one other node, the first data at least part of which has been sparsified.
In addition, an embodiment of the present application further provides a computer program, including computer-readable code. When the computer-readable code runs on a device, a processor in the device executes instructions for implementing the steps of the data transmission method of any embodiment of the present application.
In addition, an embodiment of the present application further provides a computer-readable storage medium for storing computer-readable instructions. When the instructions are executed, the operations of the steps of the data transmission method of any embodiment of the present application are implemented.
The embodiments in this specification are described in a progressive manner; the description of each embodiment focuses on its differences from the other embodiments, and for the same or similar parts between the embodiments, reference may be made to one another. As the system embodiments substantially correspond to the method embodiments, their description is relatively brief, and for relevant parts, reference may be made to the description of the method embodiments. Unless explicitly indicated otherwise, the singular forms "a", "an" and "the" used herein include the plural meaning (i.e., mean "at least one"). It should be further understood that the terms "have", "include" and/or "comprise" used in this specification indicate the presence of the stated features, steps, operations, elements and/or components, but do not exclude the presence or addition of one or more other features, steps, operations, elements, components and/or combinations thereof. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items. Unless explicitly indicated otherwise, the steps of any method disclosed herein need not be performed exactly in the order disclosed.
Some optional embodiments have been described above, but it should be emphasized that the present application is not limited to these embodiments and may be implemented in other ways within the scope of the subject matter of the present application.
It should be pointed out that, according to implementation needs, the components/steps described in the embodiments of the present application may be split into more components/steps, and two or more components/steps or partial operations of components/steps may be combined into new components/steps, so as to achieve the purpose of the embodiments of the present application.
The above methods according to the embodiments of the present application may be implemented in hardware or firmware, or implemented as software or computer code that can be stored in a recording medium (such as a CD-ROM, a RAM, a floppy disk, a hard disk or a magneto-optical disc), or implemented as computer code that is downloaded over a network, originally stored in a remote recording medium or a non-transitory machine-readable medium, and to be stored in a local recording medium, so that the methods described herein can be processed by such software stored on a recording medium using a general-purpose computer, a dedicated processor, or programmable or dedicated hardware (such as an ASIC or an FPGA). It can be understood that a computer, a processor, a microprocessor controller or programmable hardware includes a storage component (for example, a RAM, a ROM, a flash memory, etc.) that can store or receive software or computer code, and when the software or computer code is accessed and executed by the computer, processor or hardware, the processing methods described herein are implemented. In addition, when a general-purpose computer accesses code for implementing the processing shown herein, the execution of the code converts the general-purpose computer into a dedicated computer for performing the processing shown herein.
A person of ordinary skill in the art may be aware that the units and method steps of the examples described in combination with the embodiments disclosed herein can be implemented by electronic hardware, or by a combination of computer software and electronic hardware. Whether these functions are performed by hardware or software depends on the optional application and design constraints of the technical solution. A person skilled in the art may use different methods to implement the described functions for each optional application, but such implementation should not be considered as going beyond the scope of the embodiments of the present application.
The foregoing implementations are only used to describe the embodiments of the present application and are not intended to limit the embodiments of the present application. A person of ordinary skill in the relevant technical field may also make various changes and modifications without departing from the spirit and scope of the embodiments of the present application; therefore, all equivalent technical solutions also belong to the scope of the embodiments of the present application, and the patent protection scope of the embodiments of the present application should be defined by the claims.

Claims (23)

  1. A data transmission method, characterized by comprising:
    determining first data to be sent by a node in a distributed system to at least one other node and used for updating parameters of a deep learning model trained by the distributed system;
    performing sparse processing on at least part of the first data; and
    sending, to the at least one other node, the first data at least part of which has been subjected to sparse processing.
  2. The method according to claim 1, characterized in that performing sparse processing on at least part of the first data comprises:
    comparing the at least part of the first data with a given filter threshold, respectively, and filtering out, from the at least part, the portions that are smaller than the filter threshold, wherein the filter threshold decreases as the number of training iterations of the deep learning model increases.
  3. The method according to claim 1 or 2, characterized in that before performing sparse processing on at least part of the first data, the method further comprises:
    randomly determining a portion of the first data as the at least part; and
    performing sparse processing on the determined at least part of the first data.
  4. The method according to any one of claims 1-3, characterized in that sending, to the at least one other node, the first data at least part of which has been subjected to sparse processing comprises:
    compressing the at least partly sparsified first data; and
    sending the compressed first data to the at least one other node.
  5. The method according to any one of claims 1-4, characterized by further comprising:
    obtaining second data sent by the at least one other node and used for updating the parameters of the deep learning model trained by the distributed system; and
    updating the parameters of the deep learning model at least according to the second data.
  6. The method according to claim 5, characterized in that obtaining the second data sent by the at least one other node and used for updating the parameters of the deep learning model trained by the distributed system comprises:
    receiving and decompressing the second data that was compressed and then sent by the at least one other node and that is used for updating the parameters of the deep learning model trained by the distributed system.
  7. The method according to any one of claims 1-6, characterized in that the first data comprises:
    a gradient matrix computed in any one training pass during the iterative training of the deep learning model; and/or
    a parameter difference matrix between old parameters of any one training pass during the iterative training of the deep learning model and new parameters obtained by updating the old parameters at least according to the second data sent by the at least one other node and used for updating the parameters of the deep learning model trained by the distributed system.
  8. The method according to claim 7, characterized in that when the first data comprises the gradient matrix, performing sparse processing on at least part of the first data comprises:
    selecting, from the gradient matrix, a first portion of matrix elements whose absolute values are respectively smaller than the filter threshold;
    randomly selecting a second portion of matrix elements from the gradient matrix; and
    setting to 0 the values of the matrix elements of the gradient matrix that belong to both the first portion of matrix elements and the second portion of matrix elements, to obtain a sparse gradient matrix;
    sending, to the at least one other node, the first data at least part of which has been subjected to sparse processing comprises:
    compressing the sparse gradient matrix into a string; and
    sending the string to the at least one other node over a network.
  9. The method according to claim 7 or 8, characterized in that when the first data comprises the parameter difference matrix, performing sparse processing on at least part of the first data comprises:
    selecting, from the parameter difference matrix, a third portion of matrix elements whose absolute values are respectively smaller than the filter threshold;
    randomly selecting a fourth portion of matrix elements from the parameter difference matrix; and
    setting to 0 the values of the matrix elements of the parameter difference matrix that belong to both the third portion of matrix elements and the fourth portion of matrix elements, to obtain a sparse parameter difference matrix;
    sending, to the at least one other node, the first data at least part of which has been subjected to sparse processing comprises:
    compressing the sparse parameter difference matrix into a string; and
    sending the string to the at least one other node over a network.
  10. A data transmission system, characterized by comprising:
    a data determination module, configured to determine first data to be sent by a node in a distributed system to at least one other node and used for updating parameters of a deep learning model trained by the distributed system;
    a sparse processing module, configured to perform sparse processing on at least part of the first data; and
    a data sending module, configured to send, to the at least one other node, the first data at least part of which has been subjected to sparse processing.
  11. The system according to claim 10, characterized in that the sparse processing module comprises:
    a filtering sub-module, configured to compare the at least part of the first data with a given filter threshold, respectively, and to filter out, from the at least part, the portions that are smaller than the filter threshold, wherein the filter threshold decreases as the number of training iterations of the deep learning model increases.
  12. The system according to claim 10 or 11, characterized in that the sparse processing module further comprises:
    a random selection sub-module, configured to randomly determine a portion of the first data as the at least part; and
    a sparsification sub-module, configured to perform sparse processing on the determined at least part of the first data.
  13. The system according to any one of claims 10-12, characterized in that the data sending module comprises:
    a compression sub-module, configured to compress the at least partly sparsified first data; and
    a sending sub-module, configured to send the compressed first data to the at least one other node.
  14. The system according to any one of claims 10-13, characterized by further comprising:
    a data acquisition module, configured to obtain second data sent by the at least one other node and used for updating the parameters of the deep learning model trained by the distributed system; and
    an update module, configured to update the parameters of the deep learning model at least according to the second data.
  15. The system according to claim 14, characterized in that the data acquisition module comprises:
    a receiving and decompression sub-module, configured to receive and decompress the second data that was compressed and then sent by the at least one other node and that is used for updating the parameters of the deep learning model trained by the distributed system.
  16. The system according to any one of claims 10-15, characterized in that the first data comprises:
    a gradient matrix computed in any one training pass during the iterative training of the deep learning model; and/or
    a parameter difference matrix between old parameters of any one training pass during the iterative training of the deep learning model and new parameters obtained by updating the old parameters at least according to the second data sent by the at least one other node and used for updating the parameters of the deep learning model trained by the distributed system.
  17. The system according to claim 16, characterized in that when the first data comprises the gradient matrix, the filtering sub-module is configured to select, from the gradient matrix, a first portion of matrix elements whose absolute values are respectively smaller than the filter threshold;
    the random selection sub-module is configured to randomly select a second portion of matrix elements from the gradient matrix;
    the sparsification sub-module is configured to set to 0 the values of the matrix elements of the gradient matrix that belong to both the first portion of matrix elements and the second portion of matrix elements, to obtain a sparse gradient matrix;
    the compression sub-module is configured to compress the sparse gradient matrix into a string; and
    the sending sub-module sends the string to the at least one other node over a network.
  18. The system according to claim 16 or 17, characterized in that when the first data comprises the parameter difference matrix, the filtering sub-module is configured to select, from the parameter difference matrix, a third portion of matrix elements whose absolute values are respectively smaller than the filter threshold;
    the random selection sub-module is configured to randomly select a fourth portion of matrix elements from the parameter difference matrix;
    the sparsification sub-module is configured to set to 0 the values of the matrix elements of the parameter difference matrix that belong to both the third portion of matrix elements and the fourth portion of matrix elements, to obtain a sparse parameter difference matrix;
    the compression sub-module is configured to compress the sparse parameter difference matrix into a string; and
    the sending sub-module is configured to send the string to the at least one other node over a network.
  19. An electronic device, characterized by comprising the data transmission system according to any one of claims 10-18.
  20. An electronic device, characterized by comprising:
    a processor and the data transmission system according to any one of claims 10-18;
    wherein, when the processor runs the data transmission system, the units in the data transmission system according to any one of claims 10-18 are run.
  21. An electronic device, characterized by comprising: one or more processors, a memory, a communication component and a communication bus, wherein the processor, the memory and the communication component communicate with one another through the communication bus;
    the memory is configured to store at least one executable instruction, and the executable instruction causes the processor to perform operations corresponding to the data transmission method according to any one of claims 1-9.
  22. A computer program, comprising computer-readable code, characterized in that when the computer-readable code runs on a device, a processor in the device executes instructions for implementing the steps of the data transmission method according to any one of claims 1-9.
  23. A computer-readable storage medium for storing computer-readable instructions, characterized in that when the instructions are executed, the operations of the steps of the data transmission method according to any one of claims 1-9 are implemented.
PCT/CN2017/108450 2016-10-28 2017-10-30 Data transmission method and system, and electronic device WO2018077293A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US16/382,058 US20190236453A1 (en) 2016-10-28 2019-04-11 Method and system for data transmission, and electronic device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610972729.4A CN108021982B (zh) 2016-10-28 2016-10-28 Data transmission method and system, and electronic device
CN201610972729.4 2016-10-28

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/382,058 Continuation US20190236453A1 (en) 2016-10-28 2019-04-11 Method and system for data transmission, and electronic device

Publications (1)

Publication Number Publication Date
WO2018077293A1 true WO2018077293A1 (zh) 2018-05-03

Family

ID=62023122

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/108450 WO2018077293A1 (zh) 2016-10-28 2017-10-30 Data transmission method and system, and electronic device

Country Status (3)

Country Link
US (1) US20190236453A1 (zh)
CN (1) CN108021982B (zh)
WO (1) WO2018077293A1 (zh)


Also Published As

Publication number Publication date
US20190236453A1 (en) 2019-08-01
CN108021982A (zh) 2018-05-11
CN108021982B (zh) 2021-12-28


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17865594

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17865594

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 14.08.2019)
