CN114328360A - Data transmission method, device, electronic equipment and medium - Google Patents

Data transmission method, device, electronic equipment and medium

Info

Publication number: CN114328360A
Application number: CN202111447979.3A
Authority: CN (China)
Prior art keywords: gradient, data set, data, mask, matrix
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 赵谦谦, 阚宏伟, 王彦伟
Current Assignee: Inspur Electronic Information Industry Co Ltd
Original Assignee: Inspur Electronic Information Industry Co Ltd
Priority date / Filing date: 2021-11-30
Application filed by Inspur Electronic Information Industry Co Ltd
Priority to CN202111447979.3A
Publication of CN114328360A

Landscapes

  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The embodiments of the present application disclose a data transmission method, a data transmission apparatus, an electronic device and a computer-readable storage medium. A gradient matrix of data to be processed is divided into a plurality of fixed-length data sets, and a preset number of gradient values is screened from each data set according to a set screening granularity. The screened gradient values are combined into target data sets, and a gradient mask corresponding to each target data set is determined based on the positions of the gradient values in the gradient matrix. The target data set and its corresponding gradient mask are transmitted to a peer device, which can restore the target data set to the corresponding positions according to the gradient mask and thereby reconstruct the gradient matrix. Compression of the gradient matrix is achieved by setting the screening granularity, and restoration of the target data set is achieved based on the gradient mask, so that the sparsity of the compressed gradient matrix is reduced and the compressed gradient matrix remains an effective approximation of the original gradient matrix.

Description

Data transmission method, device, electronic equipment and medium
Technical Field
The present application relates to the field of device communication technologies, and in particular, to a data transmission method, an apparatus, an electronic device, and a computer-readable storage medium.
Background
Model training for distributed deep learning generally relies on clusters of Graphics Processing Units (GPUs) for acceleration. The parallelization strategies can be divided into data parallelism and model parallelism, of which data parallelism is the most common. In data parallelism, each GPU holds a complete copy of the model and the training data are distributed across the GPUs. Each GPU independently executes a forward pass to compute the loss function and then performs back propagation to compute the gradient matrix; finally, the GPUs perform AllReduce collective communication to synchronize the gradients, the weight matrix is updated with the averaged gradient, and the process is repeated until model training is complete.
The most common implementation of AllReduce collective communication across multiple GPUs is the Ring-AllReduce algorithm. The algorithm connects the GPUs in a ring topology, divides the data into small blocks and passes them around the ring. Each GPU receives a block of data from the previous GPU in the ring while sending a block of the same size to the next GPU, which balances the transmit and receive bandwidth of every link. Assuming that the number of GPUs is N, the data size is K and the link bandwidth is B, the overall communication time is 2(N−1)·K/(N·B). When the number of GPUs N is sufficiently large, the overall communication time depends only on the data size K and the link bandwidth B. The link bandwidth B is determined by existing external bus and network interconnect technology and cannot be increased quickly in the short term, whereas the data size K can be reduced effectively through algorithmic optimization.
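As a quick illustration of the communication-time estimate above, the following is a minimal host-side sketch; the function name, parameter units and the example figures (8 GPUs, 1 GB of gradients, 25 GB/s links) are illustrative assumptions, not values from the patent.

```cuda
#include <cstdio>

// Ring-AllReduce communication-time estimate 2*(N-1)*K/(N*B).
// N: number of GPUs, K: data size in bytes, B: per-link bandwidth in bytes/s.
double ring_allreduce_time(int N, double K, double B) {
    return 2.0 * (N - 1) * K / (N * B);
}

int main() {
    // Example: 8 GPUs, 1 GB of gradients, 25 GB/s links -> about 0.07 s.
    printf("%.4f s\n", ring_allreduce_time(8, 1.0e9, 25.0e9));
    return 0;
}
```

For 8 GPUs the factor 2(N−1)/N is already 1.75, close to its limit of 2, which is why halving K through gradient compression roughly halves the communication time.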
The gradient matrix of a deep learning model is very large, the absolute values of many of its gradients are small, and these gradients contribute little to updating the weight matrix. Researchers have therefore proposed methods to compress the gradient matrix so that these insignificant gradient values can be discarded during communication. Two commonly used schemes for gradient matrix compression are fine-grained sparse communication and coarse-grained sparse communication. Fine-grained sparse communication first selects a threshold, then traverses the gradient matrix, retains only the gradients whose values exceed the threshold, stores them as key-value pairs (value and index), and transmits only the set of key-value pairs during communication. Coarse-grained sparse communication first divides the gradient matrix into a number of fixed-size blocks, typically hundreds or thousands of blocks for the whole matrix, then computes the L1 norm of each block and selects a proportion of the blocks with the larger L1 norms for communication.
However, the sparse-matrix computation required by fine-grained sparse communication is poorly suited to the GPU architecture, so the computational overhead of compression is very high. Moreover, the number and positions of the gradients selected by each GPU differ, which causes load imbalance and low network bandwidth utilization. Coarse-grained sparse communication avoids the mismatch between fine-grained sparse communication and the GPU architecture, but the granularity of its gradient matrix blocking is large and the blocking is not fine enough, so the compressed gradient matrix does not approximate the original gradient matrix well and precision is easily lost during communication.
It can be seen that how to reduce the sparsity of the compressed gradient matrix and ensure that the compressed gradient matrix is an effective approximation of the original gradient matrix is a problem to be solved by those skilled in the art.
Disclosure of Invention
An object of the embodiments of the present application is to provide a data transmission method, an apparatus, an electronic device, and a computer-readable storage medium, which can reduce sparsity of a compressed gradient matrix and ensure that the compressed gradient matrix is an effective approximation of an original gradient matrix.
In order to solve the foregoing technical problem, an embodiment of the present application provides a data transmission method, including:
dividing a gradient matrix of data to be processed into a plurality of data sets with fixed lengths;
screening a preset number of gradient values from each data set according to a set screening granularity; combining the screened gradient numerical values into a target data set;
determining a gradient mask corresponding to each target data set based on the position of the gradient numerical value in the gradient matrix;
and transmitting the target data set and the corresponding gradient mask to opposite-end equipment.
Optionally, the screening, according to the set screening granularity, a preset number of gradient values from each data set includes:
determining source thread numbers of residual data in the data set by each thread of the data set according to a set mask calculation mode;
acquiring a corresponding gradient numerical value based on the source thread number;
and selecting a preset number of gradient values with the maximum value according to all the gradient values contained in each data set.
Optionally, the set screening granularity comprises screening two gradient values from four adjacent gradient values;
correspondingly, the selecting the gradient values with the maximum value according to all the gradient values included in each data set includes:
and taking every adjacent four gradient values as a data set, and screening out two gradient values with the maximum value from the four gradient values.
Optionally, the determining a gradient mask corresponding to each target data set based on the position of the gradient value in the gradient matrix includes:
and according to a binary form, setting the position of the screened gradient numerical value in the gradient matrix to be one, and setting the rest position in the gradient matrix to be zero.
Optionally, the transmitting the target data set and the gradient mask corresponding to the target data set to the peer device includes:
combining a plurality of gradient masks into a gradient mask group according to a set data length;
and calling a thread to write the gradient mask group into a global memory, so that the opposite-end device can read the gradient mask from the global memory.
Optionally, the method further comprises:
under the condition that the data set to be restored and the corresponding target gradient mask are obtained, filling target gradient values contained in the data set to be restored to corresponding positions according to the target gradient mask, and setting the positions without corresponding data to zero to obtain a restored gradient matrix.
Optionally, the method further comprises:
and dynamically adjusting the set screening granularity according to different stages of model training.
The embodiment of the application also provides a data transmission device, which comprises a dividing unit, a screening unit, a combining unit, a determining unit and a transmission unit;
the dividing unit is used for dividing the gradient matrix of the data to be processed into a plurality of data sets with fixed lengths;
the screening unit is used for screening a preset number of gradient numerical values from each data set according to a set screening granularity;
the combination unit is used for combining the screened gradient numerical values into a target data set;
the determining unit is configured to determine a gradient mask corresponding to each target data set based on a position of the gradient value in the gradient matrix;
and the transmission unit is used for transmitting the target data set and the corresponding gradient mask to opposite-end equipment.
Optionally, the screening unit includes a determining subunit, an obtaining subunit, and a selecting subunit;
the determining subunit is used for determining, for each thread of the data set according to a set mask calculation mode, the source thread numbers of the remaining data in the data set where the thread is located;
the obtaining subunit is configured to obtain a corresponding gradient numerical value based on the source thread number;
and the selecting subunit is configured to select a preset number of gradient values with a maximum value according to all the gradient values included in each data set.
Optionally, the set screening granularity comprises screening two gradient values from four adjacent gradient values;
correspondingly, the selecting subunit is configured to select two gradient values with the largest value from the four gradient values by using each adjacent four gradient values as a data set.
Optionally, the determining unit is configured to set the selected gradient value to one in the position of the gradient matrix and set the remaining position of the gradient matrix to zero in a binary form.
Optionally, the transmission unit includes a merging subunit and a writing subunit;
the merging subunit is configured to merge the multiple gradient masks into one gradient mask group according to a set data length;
the write-in subunit is configured to invoke a thread to write the gradient mask group into a global memory, so that the peer device reads the gradient mask from the global memory.
Optionally, a reduction unit is further included;
and the restoring unit is used for filling target gradient values contained in the data set to be restored to corresponding positions according to the target gradient mask under the condition that the data set to be restored and the target gradient mask corresponding to the data set to be restored are obtained, and setting the positions without corresponding data to zero so as to obtain a restored gradient matrix.
Optionally, an adjusting unit is further included;
and the adjusting unit is used for dynamically adjusting the set screening granularity according to different stages of model training.
An embodiment of the present application further provides a data transmission device, including:
a memory for storing a computer program;
a processor for executing said computer program to implement the steps of the data transmission method as described above.
An embodiment of the present application further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the data transmission method as described above.
According to the above technical solution, the gradient matrix of the data to be processed is divided into a plurality of fixed-length data sets, and a preset number of gradient values is screened from each data set according to the set screening granularity. By setting the screening granularity, the data volume of the gradient matrix can be effectively reduced, achieving compression of the gradient matrix. The screened gradient values are combined into target data sets, and a gradient mask corresponding to each target data set is determined based on the positions of the gradient values in the gradient matrix; the gradient mask characterizes the position of each gradient value of the target data set in the gradient matrix. The target data set and its corresponding gradient mask are transmitted to the peer device, which can restore the target data set to the corresponding positions according to the gradient mask and thereby reconstruct a gradient matrix with a high degree of approximation to the original gradient matrix. In this technical solution, compression of the gradient matrix is achieved by setting the screening granularity, and restoration of the target data set is achieved based on the gradient mask, so that the sparsity of the compressed gradient matrix is reduced and the compressed gradient matrix is guaranteed to be an effective approximation of the original gradient matrix.
Drawings
In order to more clearly illustrate the embodiments of the present application, the drawings needed for the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings can be obtained by those skilled in the art without inventive effort.
Fig. 1 is a flowchart of a data transmission method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a 2:4 fine-grained structured sparse provided by an embodiment of the present application;
FIG. 3 is a schematic diagram of a warp shuffle-based gradient selection provided in an embodiment of the present application;
fig. 4 is a schematic diagram illustrating calculation of a gradient mask according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a data transmission device according to an embodiment of the present application;
fig. 6 is a structural diagram of a data transmission device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without any creative effort belong to the protection scope of the present application.
The terms "including" and "having," and any variations thereof, in the description and claims of this application and the drawings described above, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements but may include other steps or elements not expressly listed.
In order that those skilled in the art will better understand the disclosure, the following detailed description will be given with reference to the accompanying drawings.
Next, a data transmission method provided in an embodiment of the present application is described in detail. Fig. 1 is a flowchart of a data transmission method provided in an embodiment of the present application, where the method includes:
s101: a gradient matrix of data to be processed is divided into a plurality of fixed length data sets.
The gradient matrix contains the gradient values. The gradient values are used to represent the weights of the deep learning model.
In the embodiment of the present application, in order to reduce the data amount of the gradient matrix, the gradient matrix may be compressed. At the same time, in order to keep the compressed gradient matrix close to the original gradient matrix, the gradient matrix can be grouped to obtain data sets; because the gradient values contained in each data set are relatively similar, neighboring elements, a part of the gradient values can be screened out from each data set, thereby achieving the purpose of compressing the gradient matrix.
In a particular implementation, a fixed length may be set for the partitioning of the data set. The gradient matrix is partitioned by a fixed length, resulting in a plurality of data sets. The fixed length can be set according to actual requirements, for example, 4 gradient values can be set as one data set.
S102: and screening a preset number of gradient values from each data set according to the set screening granularity, and combining the screened gradient values into a target data set.
The screening granularity refers to how many gradient values are screened from each dataset. The screening granularity can be a specific number or a proportional relation.
Taking every 4 gradient values as a data set as an example, the screening granularity may be 2, i.e. 2 gradient values are screened out of every 4 gradient values to form a target data set. The screening granularity may also be 1/2, i.e. 4 × 1/2 = 2 gradient values are screened out of every 4 gradient values to form a target data set.
For the sake of convenience of distinction, in the present embodiment, the dataset of the screened combination of gradient values is referred to as a target dataset.
When screening the gradient values, a preset number of gradient values may be screened out according to the magnitude of each gradient value contained in each data set. In a specific implementation, each thread of the data set may determine, according to a set mask calculation manner, the source thread numbers of the remaining data in the data set where the thread is located. The source thread number indicates the storage position of the data, and the corresponding gradient value can be acquired based on the source thread number. Then, from all the gradient values contained in each data set, the preset number of gradient values with the largest values are selected.
Taking the example that the set screening granularity includes screening two gradient values from four adjacent gradient values, each adjacent four gradient values may be used as a data set, and the two gradient values with the largest value are screened from the four gradient values.
As shown in fig. 2, which is a schematic diagram of a 2:4 fine-grained structured sparse provided by an embodiment of the present application, a white box in fig. 2 may be used to represent an unselected gradient value in a gradient matrix, and a black box is used to represent a selected gradient value in the gradient matrix. The selected gradient values constitute the form of the target data set as indicated to the right of the arrow in fig. 2.
S103: and determining a gradient mask corresponding to each target data set based on the position of the gradient value in the gradient matrix.
The gradient mask characterizes the position of each gradient value in the target dataset in the gradient matrix. In a specific implementation, the position of the screened gradient value in the gradient matrix may be set to be one, and the remaining position in the gradient matrix may be set to be zero according to a binary form.
Referring to Fig. 2, which is a schematic diagram of 2:4 fine-grained structured sparsity, the binary values on the far right of Fig. 2 represent the gradient masks corresponding to the respective target data sets. Taking the first row of data as an example, "01101001" indicates that the gradient values at the 2nd, 3rd, 5th and 8th positions of the gradient matrix are selected and combined into the target data set.
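To make steps S101 to S103 concrete for the 2:4 case, the following is a minimal host-side reference sketch: every 4 adjacent gradients form a data set, the 2 with the largest absolute value are kept in their original order, and one bit per original position is packed into 32-bit mask words. The function and variable names (compress_2of4, values, masks) are illustrative assumptions, and the bit ordering within a mask word is a design choice of this sketch, not mandated by the patent.

```cuda
#include <cmath>
#include <cstdint>
#include <vector>

// Keep the 2 largest-magnitude values of every 4 adjacent gradients,
// append them to `values`, and set their position bits in `masks`
// (one bit per original gradient, 32 positions per mask word).
void compress_2of4(const std::vector<float>& grad,
                   std::vector<float>& values,
                   std::vector<uint32_t>& masks) {
    masks.assign((grad.size() + 31) / 32, 0u);
    for (size_t g = 0; g < grad.size() / 4; ++g) {
        const float* x = &grad[4 * g];
        int a = 0, b = 1;                       // indices of the two largest |x|
        if (std::fabs(x[1]) > std::fabs(x[0])) { a = 1; b = 0; }
        for (int i = 2; i < 4; ++i) {
            if (std::fabs(x[i]) > std::fabs(x[a]))      { b = a; a = i; }
            else if (std::fabs(x[i]) > std::fabs(x[b])) { b = i; }
        }
        int lo = a < b ? a : b, hi = a < b ? b : a;     // keep original order
        values.push_back(x[lo]);
        values.push_back(x[hi]);
        uint32_t bits = (1u << a) | (1u << b);          // 4-bit group mask
        masks[4 * g / 32] |= bits << (4 * g % 32);      // pack 8 groups per word
    }
}
```

The compressed stream is half the size of the original gradient matrix, and the masks add only one bit per original gradient, so for 32-bit gradients the overall traffic is roughly 50% + 1/32 of the uncompressed volume.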
S104: and transmitting the target data set and the corresponding gradient mask to opposite-end equipment.
The target data set and its corresponding gradient mask are transmitted to the peer device, which can restore the target data set to the corresponding positions according to the gradient mask; the gradient matrix can thus be reconstructed, and the reconstructed gradient matrix has a high degree of approximation to the original gradient matrix.
Taking the efficient implementation of fine-grained structured sparse communication on a GPU as an example: in the GPU thread model, a Grid is composed of multiple Blocks, a Block is composed of multiple threads, and every 32 adjacent threads constitute a warp. Threads within a warp execute the same instruction at the same time on different data, i.e. single instruction, multiple threads (SIMT). In the GPU memory hierarchy, off-chip global memory is slower than on-chip shared memory, and on-chip shared memory is slower than registers. When the global memory accessed by the threads of a warp falls within a 128-byte-aligned contiguous region, the accesses are coalesced into a single memory transaction.
In 2:4 fine-grained structured sparse communication, the threads of a warp load 32 adjacent values of the original gradient matrix from global memory, one value per thread. Each thread then determines whether the value it owns is one of the two with the larger absolute values in its thread bundle (every 4 adjacent threads form a bundle).
For this, each thread needs access to all the data in its thread bundle. Such computation is typically implemented either by having the warp access global memory again to load the neighboring data, or by having the warp first load the data into shared memory to optimize the subsequent accesses.
Fig. 3 is a schematic diagram of warp-shuffle-based gradient selection according to an embodiment of the present application; it illustrates how threads in a warp exchange data without using shared memory, namely via the warp shuffle primitives. These primitives allow a thread to directly read the registers of other threads in the warp, which reduces latency and saves memory resources.
In a specific implementation, the __shfl_xor_sync primitive performs a bitwise XOR of the current thread (lane) number with laneMask to obtain the source thread number, and then directly fetches the value of the corresponding variable on that source thread, i.e. the gradient value. For the 2:4 gradient selection problem, calling __shfl_xor_sync three times with laneMask equal to 1, 2 and 3 completes a full data exchange within the thread bundle. Finally, the threads corresponding to the retained gradient values write them to global memory; each writing thread can determine its write position from its thread number and the computed gradient mask. Alternatively, the retained data can first be written to shared memory, and the first half of the threads in the Block then write the data from shared memory to global memory, saving half of the global memory transactions.
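The following CUDA sketch shows one way the warp-shuffle-based 2:4 selection described above could look. It is a minimal sketch under stated assumptions: the kernel and variable names are illustrative, ties are broken by lane position so that exactly two threads per bundle survive, the write position is derived from the bundle's keep mask via __ballot_sync and __popc rather than from a separately stored gradient mask, and the launch is assumed to cover exactly the gradient array with a length that is a multiple of the block size (so every lane of every warp participates in the shuffles).

```cuda
#include <cuda_runtime.h>

__global__ void select_2of4(const float* __restrict__ grad,
                            float* __restrict__ compressed) {
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x & 31;     // lane index inside the warp
    int pos  = lane & 3;             // position inside the 4-thread bundle

    float v = grad[tid];             // each thread owns one gradient
    float a = fabsf(v);

    // Exchange |v| with the other three lanes of the bundle
    // (laneMask = 1, 2, 3) and count how many of them beat this thread.
    int rank = 0;
    #pragma unroll
    for (int laneMask = 1; laneMask <= 3; ++laneMask) {
        float other   = __shfl_xor_sync(0xffffffffu, a, laneMask);
        int  otherPos = pos ^ laneMask;
        // Tie-break on bundle position so exactly two threads survive.
        if (other > a || (other == a && otherPos < pos)) ++rank;
    }
    bool kept = (rank < 2);

    // Derive the write slot from the bundle's keep mask so the compressed
    // stream preserves the original order of the retained values.
    unsigned int warpKeep   = __ballot_sync(0xffffffffu, kept);
    unsigned int bundleKeep = (warpKeep >> (lane & ~3)) & 0xFu;
    if (kept) {
        int group = tid >> 2;                                  // data set index
        int slot  = __popc(bundleKeep & ((1u << pos) - 1u));   // 0 or 1
        compressed[2 * group + slot] = v;
    }
}
```

A launch such as select_2of4<<<n / 256, 256>>>(d_grad, d_compressed) with n a multiple of 256 would compress n gradients into n/2 values; the variant mentioned above that stages the retained values in shared memory before writing is omitted here for brevity.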
Fine-grained structured sparse communication also needs to compute the gradient mask and transmit it together with the data, so that the receiver, i.e. the peer device, can recover the approximate gradient matrix. The computation of the gradient mask can likewise be implemented efficiently via the warp shuffle primitives.
Fig. 4 is a schematic diagram illustrating calculation of a gradient mask according to an embodiment of the present application. The __shfl_down_sync primitive adds delta to the current thread (lane) number to obtain the source thread number and then directly fetches the value of the corresponding variable on that source thread. When the source thread number exceeds the warp boundary, the variable on the current thread remains unchanged. As can be seen from Fig. 2, one warp corresponds to a 32-bit, i.e. 4-byte, gradient mask, so in the embodiment of the present application each thread in the warp is assigned a 4-byte variable initialized to 0. After gradient selection is completed, each thread whose data is retained sets the corresponding bit in this variable. Finally, __shfl_down_sync is called five times with delta equal to 16, 8, 4, 2 and 1 to perform a parallel reduction whose reduction operation is bitwise OR; the thread numbered 0 ends up with the complete gradient mask and writes it to shared memory. A part of the threads in the Block then write the gradient masks from shared memory to global memory.
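A minimal CUDA sketch of this warp-level mask reduction is given below; the function name and the choice to return the mask from every lane (only lane 0 holds the complete result) are assumptions of the sketch.

```cuda
// Each lane contributes one bit (set if its gradient was retained); five
// __shfl_down_sync steps with delta = 16, 8, 4, 2, 1 OR the bits together,
// after which lane 0 holds the complete 32-bit gradient mask of the warp.
__device__ unsigned int warp_gradient_mask(bool kept) {
    unsigned int lane = threadIdx.x & 31;
    unsigned int mask = kept ? (1u << lane) : 0u;   // 4-byte variable, own bit only

    #pragma unroll
    for (int delta = 16; delta >= 1; delta >>= 1) {
        mask |= __shfl_down_sync(0xffffffffu, mask, delta);
    }
    // Only lane 0 is guaranteed to hold the full mask; it can write the
    // result to shared memory, and a subset of the Block's threads can
    // later copy the masks to global memory.
    return mask;
}
```

This pairs naturally with the selection sketch above: the `kept` flag computed there is exactly the bit each lane contributes here.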
Taking 2:4 fine-grained structured sparse processing as an example, 2 gradient values are screened out from every 4 gradient values to serve as a target data set, and when the target data set and a corresponding gradient mask are transmitted to opposite-end equipment, a plurality of gradient masks can be combined into one gradient mask group according to a set data length; and the calling thread writes the gradient mask group into the global memory so as to facilitate the opposite-end device to read the gradient mask from the global memory.
One thread can process 32 bits of data, so the set data length may be 32 bits. When a target data set contains 2 bits of data, 16 target data sets can be combined into one gradient mask group.
When the peer device restores the approximate gradient matrix, a part of the threads in the Block first load the gradient masks corresponding to the Block into shared memory. The threads in each warp then fetch the corresponding gradient mask from shared memory. All threads in a warp access the same 4-byte gradient mask, so the accesses can be merged into a single shared memory transaction in broadcast mode. Finally, the warp fills the target gradient values contained in the data set to be restored into the corresponding positions according to the gradient mask, sets the positions without corresponding data to zero, and thus obtains the restored gradient matrix, which may be written to global memory. The step of fetching the corresponding data can also reduce global memory transactions by going through shared memory.
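The restoration step could look like the following minimal CUDA sketch, which assumes 2:4 sparsity (16 retained values per warp of 32 gradients) and the lane-order compressed layout used in the selection sketch above; for brevity the masks are read directly from global memory rather than staged in shared memory first. Kernel and variable names are illustrative assumptions.

```cuda
#include <cuda_runtime.h>

__global__ void restore_2of4(const float* __restrict__ compressed,
                             const unsigned int* __restrict__ masks,
                             float* __restrict__ grad,
                             int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    int warpId        = tid >> 5;         // one 32-bit mask per 32 gradients
    unsigned int lane = tid & 31;
    unsigned int mask = masks[warpId];    // same word for the whole warp

    float value = 0.0f;                   // positions without data become zero
    if (mask & (1u << lane)) {
        // Offset in the compressed stream: 16 values per preceding warp plus
        // the number of retained values in the lower lanes of this warp.
        int idx = warpId * 16 + __popc(mask & ((1u << lane) - 1u));
        value = compressed[idx];
    }
    grad[tid] = value;                    // restored (approximate) gradient matrix
}
```

Staging the per-Block masks in shared memory first, as described above, lets all 32 threads of a warp read their common mask word in a single broadcast shared-memory transaction instead of a global-memory access.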
According to the above technical solution, the gradient matrix of the data to be processed is divided into a plurality of fixed-length data sets, and a preset number of gradient values is screened from each data set according to the set screening granularity. By setting the screening granularity, the data volume of the gradient matrix can be effectively reduced, achieving compression of the gradient matrix. The screened gradient values are combined into target data sets, and a gradient mask corresponding to each target data set is determined based on the positions of the gradient values in the gradient matrix; the gradient mask characterizes the position of each gradient value of the target data set in the gradient matrix. The target data set and its corresponding gradient mask are transmitted to the peer device, which can restore the target data set to the corresponding positions according to the gradient mask and thereby reconstruct a gradient matrix with a high degree of approximation to the original gradient matrix. In this technical solution, compression of the gradient matrix is achieved by setting the screening granularity, and restoration of the target data set is achieved based on the gradient mask, so that the sparsity of the compressed gradient matrix is reduced and the compressed gradient matrix is guaranteed to be an effective approximation of the original gradient matrix.
For some deep learning models, a 50% sparsity (2:4) still retains many insignificant gradients. In that case a smaller sparsity, i.e. retaining a smaller proportion of gradients, should be used according to the characteristics of the deep learning model. Different sparsity may also be used at different stages of model training; that is, the set screening granularity is dynamically adjusted according to the stage of model training.
For example, in the early stage of model training a larger sparsity can be used to improve the degree of approximation, so that the loss function decreases rapidly; later, a smaller sparsity is used to reduce the communication traffic and speed up model iteration, as in the sketch after this paragraph. In all cases the structured pattern of the sparsity must be preserved so that the scheme remains efficient to implement on the GPU.
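A minimal host-side sketch of such a schedule is shown below; the struct, function name, stage boundaries and the specific keep:group ratios are illustrative assumptions, chosen only to show how a structured granularity could be switched by training stage.

```cuda
struct SparsityConfig {
    int keep;    // gradient values kept per data set
    int group;   // fixed length of each data set
};

// Denser selection early for a better approximation, sparser later
// to cut communication traffic; the pattern stays structured (keep:group).
SparsityConfig pick_granularity(int epoch, int total_epochs) {
    if (epoch < total_epochs / 4) return {2, 4};   // 2:4, 50% retained
    if (epoch < total_epochs / 2) return {1, 4};   // 1:4, 25% retained
    return {1, 8};                                 // 1:8, 12.5% retained
}
```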
Fig. 5 is a schematic structural diagram of a data transmission apparatus according to an embodiment of the present application, including a dividing unit 51, a screening unit 52, a combining unit 53, a determining unit 54, and a transmitting unit 55;
a dividing unit 51, configured to divide a gradient matrix of data to be processed into a plurality of data sets of fixed length;
a screening unit 52, configured to screen a preset number of gradient values from each data set according to a set screening granularity;
a combining unit 53 for combining the screened gradient values into a target data set;
a determining unit 54, configured to determine a gradient mask corresponding to each target data set based on a position of the gradient value in the gradient matrix;
and a transmission unit 55, configured to transmit the target data set and the gradient mask corresponding to the target data set to an opposite device.
Optionally, the screening unit includes a determining subunit, an obtaining subunit and a selecting subunit;
the determining subunit is used for determining, for each thread of the data set according to a set mask calculation mode, the source thread numbers of the remaining data in the data set where the thread is located;
the acquiring subunit is used for acquiring a corresponding gradient numerical value based on the source thread number;
and the selecting subunit is used for selecting the gradient values with the maximum value of the preset number according to all the gradient values contained in each data set.
Optionally, the set screening granularity comprises screening two gradient values from four adjacent gradient values;
correspondingly, the selecting subunit is configured to select two gradient values with the largest value from the four gradient values by using each adjacent four gradient values as a data set.
Optionally, the determining unit is configured to set the selected gradient value to one in the position of the gradient matrix and set the remaining position of the gradient matrix to zero in the binary form.
Optionally, the transmission unit includes a merging subunit and a writing subunit;
a merging subunit, configured to merge the multiple gradient masks into one gradient mask group according to a set data length;
and the writing subunit is used for calling the thread to write the gradient mask group into the global memory so as to facilitate the opposite-end device to read the gradient mask from the global memory.
Optionally, a reduction unit is further included;
and the restoring unit is used for filling the target gradient values contained in the data set to be restored to corresponding positions according to the target gradient mask under the condition that the data set to be restored and the corresponding target gradient mask are obtained, and setting the positions without corresponding data to zero so as to obtain a restored gradient matrix.
Optionally, an adjusting unit is further included;
and the adjusting unit is used for dynamically adjusting the set screening granularity according to different stages of model training.
The description of the features in the embodiment corresponding to fig. 5 may refer to the related description of the embodiment corresponding to fig. 1, and is not repeated here.
According to the above technical solution, the gradient matrix of the data to be processed is divided into a plurality of fixed-length data sets, and a preset number of gradient values is screened from each data set according to the set screening granularity. By setting the screening granularity, the data volume of the gradient matrix can be effectively reduced, achieving compression of the gradient matrix. The screened gradient values are combined into target data sets, and a gradient mask corresponding to each target data set is determined based on the positions of the gradient values in the gradient matrix; the gradient mask characterizes the position of each gradient value of the target data set in the gradient matrix. The target data set and its corresponding gradient mask are transmitted to the peer device, which can restore the target data set to the corresponding positions according to the gradient mask and thereby reconstruct a gradient matrix with a high degree of approximation to the original gradient matrix. In this technical solution, compression of the gradient matrix is achieved by setting the screening granularity, and restoration of the target data set is achieved based on the gradient mask, so that the sparsity of the compressed gradient matrix is reduced and the compressed gradient matrix is guaranteed to be an effective approximation of the original gradient matrix.
Fig. 6 is a structural diagram of a data transmission device according to an embodiment of the present application, and as shown in fig. 6, the data transmission device includes: a memory 20 for storing a computer program;
the processor 21 is configured to implement the steps of the data transmission method according to the above-mentioned embodiment when executing the computer program.
The data transmission device provided by the embodiment may include, but is not limited to, a smart phone, a tablet computer, a notebook computer, or a desktop computer.
The processor 21 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and the like. The processor 21 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 21 may also include a main processor and a coprocessor, where the main processor is a processor for Processing data in an awake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 21 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content required to be displayed on the display screen. In some embodiments, the processor 21 may further include an AI (Artificial Intelligence) processor for processing a calculation operation related to machine learning.
The memory 20 may include one or more computer-readable storage media, which may be non-transitory. Memory 20 may also include high speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In this embodiment, the memory 20 is at least used for storing the following computer program 201, wherein after being loaded and executed by the processor 21, the computer program can implement the relevant steps of the data transmission method disclosed in any of the foregoing embodiments. In addition, the resources stored in the memory 20 may also include an operating system 202, data 203, and the like, and the storage manner may be a transient storage manner or a permanent storage manner. Operating system 202 may include, among others, Windows, Unix, Linux, and the like. Data 203 may include, but is not limited to, gradient matrices, screening granularities, target data sets, gradient masks, and the like.
In some embodiments, the data transfer device may further include a display 22, an input/output interface 23, a communication interface 24, a power supply 25, and a communication bus 26.
Those skilled in the art will appreciate that the configuration shown in fig. 6 does not constitute a limitation of the data transmission apparatus and may include more or fewer components than those shown.
It is to be understood that, if the data transmission method in the above embodiments is implemented in the form of software functional units and sold or used as a stand-alone product, it may be stored in a computer-readable storage medium. Based on such understanding, the technical solutions of the present application may be substantially or partially implemented in the form of a software product, which is stored in a storage medium and executes all or part of the steps of the methods of the embodiments of the present application, or all or part of the technical solutions. And the aforementioned storage medium includes: a U disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrically erasable programmable ROM, a register, a hard disk, a removable magnetic disk, a CD-ROM, a magnetic or optical disk, and other various media capable of storing program codes.
Based on this, the embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the steps of the data transmission method as described above.
The functions of the functional modules of the computer-readable storage medium according to the embodiment of the present invention may be specifically implemented according to the method in the foregoing method embodiment, and the specific implementation process may refer to the related description of the foregoing method embodiment, which is not described herein again.
A data transmission method, an apparatus, an electronic device, and a computer-readable storage medium provided in the embodiments of the present application are described in detail above. The embodiments are described in a progressive manner in the specification, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
A data transmission method, an apparatus, an electronic device, and a computer-readable storage medium provided by the present application are described in detail above. The principles and embodiments of the present invention are explained herein using specific examples, which are presented only to assist in understanding the method and its core concepts. It should be noted that, for those skilled in the art, it is possible to make various improvements and modifications to the present invention without departing from the principle of the present invention, and those improvements and modifications also fall within the scope of the claims of the present application.

Claims (10)

1. A method of data transmission, comprising:
dividing a gradient matrix of data to be processed into a plurality of data sets with fixed lengths;
screening a preset number of gradient values from each data set according to a set screening granularity; combining the screened gradient numerical values into a target data set;
determining a gradient mask corresponding to each target data set based on the position of the gradient numerical value in the gradient matrix;
and transmitting the target data set and the corresponding gradient mask to opposite-end equipment.
2. The data transmission method according to claim 1, wherein the screening a preset number of gradient values from each of the data sets according to a set screening granularity comprises:
determining source thread numbers of residual data in the data set by each thread of the data set according to a set mask calculation mode;
acquiring a corresponding gradient numerical value based on the source thread number;
and selecting a preset number of gradient values with the maximum value according to all the gradient values contained in each data set.
3. The data transmission method according to claim 2, wherein the set screening granularity includes screening two gradient values from four adjacent gradient values;
correspondingly, the selecting the gradient values with the maximum value according to all the gradient values included in each data set includes:
and taking every adjacent four gradient values as a data set, and screening out two gradient values with the maximum value from the four gradient values.
4. The data transmission method according to claim 1, wherein the determining a gradient mask corresponding to each target data set based on the position of the gradient value in the gradient matrix comprises:
and according to a binary form, setting the position of the screened gradient numerical value in the gradient matrix to be one, and setting the rest position in the gradient matrix to be zero.
5. The data transmission method according to claim 1, wherein the transmitting the target data set and its corresponding gradient mask to a peer device comprises:
combining a plurality of gradient masks into a gradient mask group according to a set data length;
and writing the gradient mask group into a global memory by a calling thread so as to be convenient for the opposite-end device to read the gradient mask from the global memory.
6. The data transmission method according to claim 1, further comprising:
under the condition that the data set to be restored and the corresponding target gradient mask are obtained, filling target gradient values contained in the data set to be restored to corresponding positions according to the target gradient mask, and setting the positions without corresponding data to zero to obtain a restored gradient matrix.
7. The data transmission method according to any one of claims 1 to 6, further comprising:
and dynamically adjusting the set screening granularity according to different stages of model training.
8. A data transmission device is characterized by comprising a dividing unit, a screening unit, a combining unit, a determining unit and a transmission unit;
the dividing unit is used for dividing the gradient matrix of the data to be processed into a plurality of data sets with fixed lengths;
the screening unit is used for screening a preset number of gradient numerical values from each data set according to a set screening granularity;
the combination unit is used for combining the screened gradient numerical values into a target data set;
the determining unit is configured to determine a gradient mask corresponding to each target data set based on a position of the gradient value in the gradient matrix;
and the transmission unit is used for transmitting the target data set and the corresponding gradient mask to opposite-end equipment.
9. A data transmission device, comprising:
a memory for storing a computer program;
a processor for executing the computer program to carry out the steps of the data transmission method according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, which computer program, when being executed by a processor, carries out the steps of the data transmission method according to one of claims 1 to 7.
CN202111447979.3A 2021-11-30 2021-11-30 Data transmission method, device, electronic equipment and medium Pending CN114328360A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111447979.3A CN114328360A (en) 2021-11-30 2021-11-30 Data transmission method, device, electronic equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111447979.3A CN114328360A (en) 2021-11-30 2021-11-30 Data transmission method, device, electronic equipment and medium

Publications (1)

Publication Number Publication Date
CN114328360A true CN114328360A (en) 2022-04-12

Family

ID=81049150

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111447979.3A Pending CN114328360A (en) 2021-11-30 2021-11-30 Data transmission method, device, electronic equipment and medium

Country Status (1)

Country Link
CN (1) CN114328360A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116341628A (en) * 2023-02-24 2023-06-27 北京大学长沙计算与数字经济研究院 Gradient sparsification method, system, equipment and storage medium for distributed training
CN116341628B (en) * 2023-02-24 2024-02-13 北京大学长沙计算与数字经济研究院 Gradient sparsification method, system, equipment and storage medium for distributed training


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination