CN112261023A - Data transmission method and device of convolutional neural network - Google Patents

Data transmission method and device of convolutional neural network

Info

Publication number
CN112261023A
CN112261023A (application CN202011104673.3A)
Authority
CN
China
Prior art keywords
array
transmission
processing unit
data
compressed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011104673.3A
Other languages
Chinese (zh)
Inventor
罗建刚 (Luo Jiangang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd filed Critical Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202011104673.3A
Publication of CN112261023A
Legal status: Pending

Links

Images

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 69/00: Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass
    • H04L 69/04: Protocols for data compression, e.g. ROHC
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063: Physical realisation using electronic means
    • H04L 47/00: Traffic control in data switching networks
    • H04L 47/10: Flow control; Congestion control

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Computer Security & Cryptography (AREA)
  • Neurology (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses a data transmission method and apparatus for a convolutional neural network. The method comprises dividing data to be transmitted into a plurality of arrays according to a data partitioning scheme and, for each array, performing the following steps in sequence in response to the preceding array starting aggregation: invoking computing resources to perform sparse compression on the array at the source processing unit to generate a compressed array; invoking communication resources to perform a transmission-mode-based reduction on the compressed array; invoking communication resources to perform transmission-mode-based aggregation on the compressed array; and invoking computing resources to decompress the compressed array at the target processing unit to recover the array. The invention reduces the volume of communication data while ensuring convergence accuracy, thereby improving transmission efficiency, reducing waiting time, and increasing overall speed.

Description

Data transmission method and device of convolutional neural network
Technical Field
The present invention relates to the field of neural networks, and more particularly, to a data transmission method and apparatus for a convolutional neural network.
Background
Increasingly sophisticated machine learning algorithms, such as deep neural networks (DNN) and convolutional neural networks (CNN), achieve unprecedented performance in many practical applications and solve difficult problems in areas such as speech recognition, text processing, and image recognition. However, training on a single graphics processing unit (GPU) often takes a long time, and this inefficiency limits practical application. The most widely used way to reduce training time is data-parallel training. In data-parallel training, each GPU holds a complete copy of the model parameters and frequently exchanges parameters with the other GPUs participating in training, which incurs significant communication cost and becomes a system bottleneck when communication is slow.
The communication bottleneck in training can be addressed from both the hardware and the software side: more advanced GPU interconnect technology in hardware, and advanced modern communication libraries in software. Ring communication is the most widely applied of the existing communication schemes; it combines effectively with pipelining, scales well, and is common in large-volume data transmission. However, on low-speed links, for example over some PCIe connections, the transmission speed is only about 7.5 GB/s, which has gradually become a bottleneck for GPU computation. Multi-node transmission usually goes over a network, which restricts GPU interactive computation even more severely.
For the prior-art problems of large communication data volume, long transmission time, and slow overall task progress in convolutional neural networks, no effective solution has yet been proposed.
Disclosure of Invention
In view of the above, an object of the embodiments of the present invention is to provide a data transmission method and apparatus for a convolutional neural network that reduce the amount of communication data while ensuring convergence accuracy, thereby improving transmission efficiency, reducing waiting time, and increasing overall speed.
In view of the above object, a first aspect of the embodiments of the present invention provides a data transmission method for a convolutional neural network, including dividing data to be transmitted into a plurality of arrays according to a data partitioning scheme and, for each array, performing the following steps in sequence in response to the preceding array starting aggregation:
invoking computing resources to perform sparse compression on the array at the source processing unit to generate a compressed array;
invoking communication resources to perform a transmission-mode-based reduction on the compressed array;
invoking communication resources to perform transmission-mode-based aggregation on the compressed array;
invoking computing resources to decompress the compressed array at the target processing unit to recover the array.
In some embodiments, performing sparse compression on the array to generate the compressed array comprises:
extracting the value and position of each element from the array to form (position, value) element pairs;
deleting the element pairs whose value is zero;
combining the remaining element pairs to form the compressed array.
In some embodiments, the method further comprises: after deleting the element pairs whose value is zero, additionally deleting, based on a predetermined filtering threshold, the element pairs whose values are below the threshold.
In some embodiments, the data partitioning and transmission modes are determined based on the processing unit topology.
In some embodiments, the processing unit topology is determined based on the number and architecture of processing units used by the convolutional neural network.
In some embodiments, the data partitioning scheme is an even distribution based on the number of processing units; the transmission mode is ring transmission or ring all-reduce transmission; and the processing unit topology is a ring topology.
In some embodiments, further comprising: while the transport-based aggregation is being performed, the compute resources are also initially invoked to perform sparse compression on its next array.
In some embodiments, the method further comprises: pre-establishing a transmission interface for the convolutional neural network, and performing the transmission-mode-based reduction and aggregation on the compressed array through the transmission interface.
A second aspect of an embodiment of the present invention provides a data transmission apparatus for a convolutional neural network, including:
a processor; and
a memory storing program code executable by the processor, wherein the program code, when executed, divides data to be transmitted into a plurality of arrays according to a data partitioning scheme and, for each array, performs the following steps in sequence in response to the preceding array starting aggregation:
invoking computing resources to perform sparse compression on the array at the source processing unit to generate a compressed array;
invoking communication resources to perform a transmission-mode-based reduction on the compressed array;
invoking communication resources to perform transmission-mode-based aggregation on the compressed array;
invoking computing resources to decompress the compressed array at the target processing unit to recover the array.
In some embodiments, the data partitioning manner and the transmission manner are both determined based on the topology of the processing unit; the processing unit topology is determined based on the number and architecture of processing units used by the convolutional neural network.
The invention has the following beneficial technical effects: the data transmission method and apparatus for a convolutional neural network provided by the embodiments of the present invention invoke computing resources at the source processing unit to perform sparse compression on an array to generate a compressed array; invoke communication resources to perform a transmission-mode-based reduction on the compressed array; invoke communication resources to perform transmission-mode-based aggregation on the compressed array; and invoke computing resources at the target processing unit to decompress the compressed array to recover the array. This technical scheme reduces the volume of communication data while ensuring convergence accuracy, thereby improving transmission efficiency, reducing waiting time, and increasing overall speed.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention; for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a schematic flow chart of a data transmission method of a convolutional neural network provided in the present invention;
FIG. 2 is a block diagram of a data transmission method of a convolutional neural network according to the present invention;
fig. 3 is a schematic pipeline diagram of a data transmission method of a convolutional neural network provided by the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the following embodiments of the present invention are described in further detail with reference to the accompanying drawings.
It should be noted that all expressions using "first" and "second" in the embodiments of the present invention serve to distinguish two entities or parameters that share the same name. "First" and "second" are used merely for convenience of description and should not be construed as limiting the embodiments of the present invention, and the following embodiments do not repeat this note.
In view of the above object, a first aspect of embodiments of the present invention proposes an embodiment of a data transmission method of a convolutional neural network that reduces the amount of communication data while ensuring convergence accuracy. Fig. 1 is a schematic flow chart of a data transmission method of a convolutional neural network provided by the present invention.
As shown in fig. 1, the data transmission method of the convolutional neural network comprises dividing data to be transmitted into a plurality of arrays according to a data partitioning scheme and, for each array, performing the following steps in sequence in response to the preceding array starting aggregation:
step S101, invoking computing resources to perform sparse compression on the array at the source processing unit to generate a compressed array;
step S103, invoking communication resources to perform a transmission-mode-based reduction on the compressed array;
step S105, invoking communication resources to perform transmission-mode-based aggregation on the compressed array;
step S107, invoking computing resources to decompress the compressed array at the target processing unit to recover the array.
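To make the cooperation of steps S101 to S107 concrete, the following minimal Python sketch shows one batch of data flowing through the four steps. The helper names (sparse_compress, ring_reduce, ring_gather, sparse_decompress) and the Size/N chunking arithmetic are illustrative assumptions, not interfaces defined by this disclosure; sketches of the helpers themselves appear later in this description.

def transmit_all(data, num_units, sparse_compress, ring_reduce,
                 ring_gather, sparse_decompress):
    # Split the data to be transmitted into num_units arrays of Size/N
    # elements each (the data partitioning scheme of a ring topology).
    chunk = len(data) // num_units
    arrays = [data[i * chunk:(i + 1) * chunk] for i in range(num_units)]
    results = []
    for arr in arrays:
        packed = sparse_compress(arr)          # step S101, source processing unit
        packed = ring_reduce(packed)           # step S103, transmission-mode-based reduction
        packed = ring_gather(packed)           # step S105, transmission-mode-based aggregation
        results.append(sparse_decompress(packed, len(arr)))  # step S107, target unit
    return results

In practice the loop body for array k+1 would start as soon as aggregation of array k begins, which is the pipelining discussed below.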
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing related hardware. The program can be stored in a computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like. Embodiments of the computer program may achieve the same or similar effects as any of the corresponding method embodiments described above.
In some embodiments, performing sparse compression on the array to generate the compressed array comprises:
extracting the value and position of each element from the array to form (position, value) element pairs;
deleting the element pairs whose value is zero;
combining the remaining element pairs to form the compressed array.
In some embodiments, the method further comprises: after deleting the element pairs whose value is zero, additionally deleting, based on a predetermined filtering threshold, the element pairs whose values are below the threshold.
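A minimal NumPy sketch of the sparse compression and decompression just described; the function names and the use of the absolute value in the threshold test are assumptions made for illustration.

import numpy as np

def sparse_compress(arr, threshold=0.0):
    # Form (position, value) pairs, drop the zero-valued pairs, and
    # optionally drop pairs whose magnitude is below the predetermined
    # filtering threshold (threshold=0.0 keeps exactly the nonzeros).
    arr = np.asarray(arr)
    positions = np.nonzero(np.abs(arr) > threshold)[0]
    values = arr[positions]
    return positions, values

def sparse_decompress(packed, length):
    # Inverse operation: scatter the retained values back into a dense array.
    positions, values = packed
    out = np.zeros(length, dtype=values.dtype)
    out[positions] = values
    return out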
In some embodiments, the data partitioning and transmission modes are determined based on the processing unit topology.
In some embodiments, the processing unit topology is determined based on the number and architecture of processing units used by the convolutional neural network.
In some embodiments, the data partitioning scheme is an even distribution based on the number of processing units; the transmission mode is ring transmission or ring all-reduce transmission; and the processing unit topology is a ring topology.
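For the ring transmission mentioned here, the per-step chunk scheduling can be made concrete as follows; the indexing convention is the usual ring all-reduce one and is an assumption, since the disclosure does not fix it.

# Reduce-scatter phase of a ring all-reduce over N processing units:
# the data is split into N arrays of Size/N, and in each of the N-1
# steps, rank r sends chunk (r - step) % N to rank (r + 1) % N and
# reduces the chunk it receives into its local copy.
N = 4
for step in range(N - 1):
    for rank in range(N):
        send_chunk = (rank - step) % N
        recv_chunk = (rank - step - 1) % N
        print(f"step {step}: rank {rank} -> rank {(rank + 1) % N}, "
              f"sends chunk {send_chunk}, reduces chunk {recv_chunk}")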
In some embodiments, the method further comprises: while the transmission-mode-based aggregation is being performed, computing resources are additionally invoked to begin sparse compression of the next array.
In some embodiments, the method further comprises: pre-establishing a transmission interface for the convolutional neural network, and performing the transmission-mode-based reduction and aggregation on the compressed array through the transmission interface.
The following further illustrates embodiments of the invention in accordance with the embodiments shown in fig. 2 and 3.
Referring to fig. 2, the framework is divided into three main parts. First, a deep learning framework data transmission interface is established (for PyTorch, TensorFlow, MXNet, etc.); this interface is kept consistent with NCCL to ensure the generality of the program. Second, topology establishment and selection: a lower-latency topology is built according to the GPU architecture and selected in combination with factors such as data volume. Different topologies call for different transmission modes and different data partitioning schemes; for example, in ring communication each GPU handles Size/N data at a time (where Size is the total size of the data to be transmitted and N is the number of GPUs). Third, the sparse-compression communication part: sparse storage uses a row-compression format, and since the data transmitted are all in one-dimensional array form, only the element values and their column indices are needed to express them. For example, the array to be transmitted:
(0,6,0,0,7,0,0,0,0,0,0,0,2,0,0,1)
can be expressed as:
(1,4,12,15)(6,7,2,1)
it can be seen that with a sparseness of 25%, the amount of transmission is only 50% of the original amount of data. And the matrix after sparse compression can be subjected to reduction operation (summation, maximum value taking and the like) under the compression condition, so that the method has higher acceleration effect compared with the traditional compression method.
However, sparse compression and decompression take up computing time and affect program efficiency. To reduce this overhead, the same strategy is adopted as in traditional compression: a pipeline is used to hide the sparse-compression time and thereby improve program efficiency. As shown in fig. 3, compression of the second array starts synchronously during the ring aggregation; since the ring aggregation and the ring reduction mainly occupy communication bandwidth and make little use of computing resources, the next array to be transmitted can undergo sparse compression during the current transmission, hiding the compression time and improving program efficiency.
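The pipelining described here can be sketched with a single-worker thread pool: compression of array k+1 is submitted before array k is transmitted, so the compression time is hidden behind the communication. A real implementation would overlap CUDA streams with NCCL transfers; this host-side sketch, including the helper names, is an assumption made for illustration.

from concurrent.futures import ThreadPoolExecutor

def pipelined_transmit(arrays, sparse_compress, transmit):
    # Overlap sparse compression (computing resources) with ring
    # reduction and aggregation (communication resources), as in fig. 3.
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(sparse_compress, arrays[0])
        for k in range(len(arrays)):
            packed = pending.result()        # compression of array k is done
            if k + 1 < len(arrays):
                # start compressing array k+1 while array k is on the wire
                pending = pool.submit(sparse_compress, arrays[k + 1])
            transmit(packed)                 # ring reduction + aggregation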
The embodiment of the invention builds on ring and tree communication and adopts a sparse compression method to reduce the data volume during transmission and raise the effective transmission bandwidth. When the sparsified data is 1/n of the source data, an acceleration ratio of up to n/2 can be obtained (each retained element is transmitted as a value plus an index, hence the factor of 2). Tests show that, with an appropriate threshold, convergence of the deep learning framework is not adversely affected. Data sparsification therefore effectively improves the communication bandwidth of the GPU while guaranteeing convergence of the deep learning model, alleviating to a certain extent the problems of low-speed networks and low GPU communication efficiency.
As can be seen from the foregoing embodiments, the data transmission method for a convolutional neural network provided by the embodiments of the present invention invokes computing resources at the source processing unit to perform sparse compression on an array to generate a compressed array; invokes communication resources to perform a transmission-mode-based reduction on the compressed array; invokes communication resources to perform transmission-mode-based aggregation on the compressed array; and invokes computing resources at the target processing unit to decompress the compressed array to recover the array. This technical scheme reduces the volume of communication data while ensuring convergence accuracy, thereby improving transmission efficiency, reducing waiting time, and increasing overall speed.
It should be particularly noted that the steps in the embodiments of the data transmission method of the convolutional neural network described above can be interchanged, replaced, added, or deleted, so data transmission methods of a convolutional neural network transformed by such reasonable permutations and combinations shall also fall within the scope of the present invention, and the scope of the present invention shall not be limited to the described embodiments.
In view of the above object, a second aspect of the embodiments of the present invention proposes an embodiment of a data transmission apparatus of a convolutional neural network that reduces the amount of communication data while ensuring convergence accuracy. The data transmission device of the convolutional neural network comprises:
a processor; and
a memory storing program code executable by the processor, wherein the program code, when executed, divides data to be transmitted into a plurality of arrays according to a data partitioning scheme and, for each array, performs the following steps in sequence in response to the preceding array starting aggregation:
invoking computing resources to perform sparse compression on the array at the source processing unit to generate a compressed array;
invoking communication resources to perform a transmission-mode-based reduction on the compressed array;
invoking communication resources to perform transmission-mode-based aggregation on the compressed array;
invoking computing resources to decompress the compressed array at the target processing unit to recover the array.
In some embodiments, the data partitioning manner and the transmission manner are both determined based on the topology of the processing unit; the processing unit topology is determined based on the number and architecture of processing units used by the convolutional neural network.
As can be seen from the foregoing embodiments, the data transmission apparatus for a convolutional neural network according to the embodiments of the present invention invokes computing resources at the source processing unit to perform sparse compression on an array to generate a compressed array; invokes communication resources to perform a transmission-mode-based reduction on the compressed array; invokes communication resources to perform transmission-mode-based aggregation on the compressed array; and invokes computing resources at the target processing unit to decompress the compressed array to recover the array. This technical scheme reduces the volume of communication data while ensuring convergence accuracy, thereby improving transmission efficiency, reducing waiting time, and increasing overall speed.
It should be particularly noted that the embodiment of the data transmission apparatus described above uses the embodiments of the data transmission method of the convolutional neural network to describe the working process of each module, and those skilled in the art can readily apply these modules to other embodiments of the method. Of course, since the steps in the method embodiments can be interchanged, replaced, added, or deleted, data transmission apparatuses of a convolutional neural network transformed by such reasonable permutations and combinations shall also fall within the scope of the present invention, and the scope of the present invention shall not be limited to the above embodiment.
The foregoing is an exemplary embodiment of the present disclosure, but it should be noted that various changes and modifications could be made herein without departing from the scope of the present disclosure as defined by the appended claims. The functions, steps and/or actions of the method claims in accordance with the disclosed embodiments described herein need not be performed in any particular order. Furthermore, although elements of the disclosed embodiments of the invention may be described or claimed in the singular, the plural is contemplated unless limitation to the singular is explicitly stated.
Those of ordinary skill in the art will understand that the discussion of any embodiment above is merely exemplary and is not intended to imply that the scope of the disclosure of the embodiments of the present invention (including the claims) is limited to these examples. Within the spirit of the embodiments of the present invention, technical features in the above embodiment or in different embodiments may also be combined, and many other variations of the different aspects of the embodiments exist as described above, which are not provided in detail for the sake of brevity. Therefore, any omissions, modifications, equivalent substitutions, improvements, and the like made within the spirit and principles of the embodiments of the present invention shall be included within the scope of protection of the embodiments of the present invention.

Claims (10)

1. A data transmission method for a convolutional neural network, characterized by comprising dividing data to be transmitted into a plurality of arrays according to a data partitioning scheme and, for each array, performing the following steps in sequence in response to the preceding array starting aggregation:
invoking computing resources to perform sparse compression on the array at a source processing unit to generate a compressed array;
invoking communication resources to perform a transmission-mode-based reduction on the compressed array;
invoking communication resources to perform transmission-mode-based aggregation on the compressed array;
invoking computing resources to decompress the compressed array at a target processing unit to recover the array.
2. The method of claim 1, wherein performing sparse compression on the array to generate a compressed array comprises:
extracting the value and position of each element from the array to form (position, value) element pairs;
deleting the element pairs whose value is zero;
combining the remaining element pairs to form the compressed array.
3. The method of claim 2, further comprising: after deleting the element pairs whose value is zero, additionally deleting, based on a predetermined filtering threshold, the element pairs whose values are below the threshold.
4. The method of claim 1, wherein the data partitioning scheme and the transmission scheme are both determined based on processing unit topology.
5. The method of claim 4, wherein the processing unit topology is determined based on the number and architecture of processing units used by the convolutional neural network.
6. The method of claim 5, wherein the data partitioning scheme is an even distribution based on the number of processing units; the transmission mode is ring transmission or ring all-reduce transmission; and the processing unit topology is a ring topology.
7. The method of claim 1, further comprising: while the transmission-mode-based aggregation is being performed, computing resources are additionally invoked to begin sparse compression of the next array.
8. The method of claim 1, further comprising: pre-establishing a transmission interface for the convolutional neural network, and performing the transmission-mode-based reduction and aggregation on the compressed array through the transmission interface.
9. A data transmission apparatus for a convolutional neural network, comprising:
a processor; and
a memory storing program code executable by the processor, wherein the program code, when executed, divides data to be transmitted into a plurality of arrays according to a data partitioning scheme and, for each array, performs the following steps in sequence in response to the preceding array starting aggregation:
invoking computing resources to perform sparse compression on the array at a source processing unit to generate a compressed array;
invoking communication resources to perform a transmission-mode-based reduction on the compressed array;
invoking communication resources to perform transmission-mode-based aggregation on the compressed array;
invoking computing resources to decompress the compressed array at a target processing unit to recover the array.
10. The apparatus of claim 9, wherein the data partitioning scheme and the transmission scheme are both determined based on a processing unit topology; the processing unit topology is determined based on the number and architecture of processing units used by the convolutional neural network.
CN202011104673.3A 2020-10-15 2020-10-15 Data transmission method and device of convolutional neural network Pending CN112261023A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011104673.3A CN112261023A (en) 2020-10-15 2020-10-15 Data transmission method and device of convolutional neural network


Publications (1)

Publication Number Publication Date
CN112261023A 2021-01-22

Family

ID=74243614

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011104673.3A Pending CN112261023A (en) 2020-10-15 2020-10-15 Data transmission method and device of convolutional neural network

Country Status (1)

Country Link
CN (1) CN112261023A (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101621514A (en) * 2009-07-24 2010-01-06 北京航空航天大学 Network data compressing method, network system and synthesis center equipment
US20150067009A1 (en) * 2013-08-30 2015-03-05 Microsoft Corporation Sparse matrix data structure
CN106775598A (en) * 2016-12-12 2017-05-31 温州大学 A kind of Symmetric Matrices method of the compression sparse matrix based on GPU
CN108229644A (en) * 2016-12-15 2018-06-29 上海寒武纪信息科技有限公司 The device of compression/de-compression neural network model, device and method
CN111699695A (en) * 2017-12-06 2020-09-22 V-诺瓦国际有限公司 Method and apparatus for decoding a received encoded data set
US20190190538A1 (en) * 2017-12-18 2019-06-20 Facebook, Inc. Accelerator hardware for compression and decompression
CN110134636A (en) * 2018-02-09 2019-08-16 中兴通讯股份有限公司 Model training method, server and computer readable storage medium
CN110377288A (en) * 2018-04-13 2019-10-25 赛灵思公司 Neural network compresses compiler and its compiling compression method
CN110909870A (en) * 2018-09-14 2020-03-24 中科寒武纪科技股份有限公司 Training device and method
CN111324630A (en) * 2020-03-04 2020-06-23 中科弘云科技(北京)有限公司 MPI-based neural network architecture search parallelization method and equipment
CN111737540A (en) * 2020-05-27 2020-10-02 中国科学院计算技术研究所 Graph data processing method and medium applied to distributed computing node cluster

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022222578A1 (en) * 2021-04-21 2022-10-27 华为技术有限公司 Aggregation communication method and system, and computer device

Similar Documents

Publication Publication Date Title
CN110390385B (en) BNRP-based configurable parallel general convolutional neural network accelerator
CN108108809B (en) Hardware architecture for reasoning and accelerating convolutional neural network and working method thereof
WO2021109699A1 (en) Artificial intelligence accelerator, device, chip and data processing method
WO2022105805A1 (en) Data processing method and in-memory computing chip
CN107066239A (en) A kind of hardware configuration for realizing convolutional neural networks forward calculation
WO2022001141A1 (en) Gpu communication method and device, and medium
CN112333234B (en) Distributed machine learning training method and device, electronic equipment and storage medium
US11948352B2 (en) Speculative training using partial gradients update
US20230244537A1 (en) Efficient gpu resource allocation optimization method and system
CN112905530B (en) On-chip architecture, pooled computing accelerator array, unit and control method
CN111079923A (en) Spark convolution neural network system suitable for edge computing platform and circuit thereof
CN114356578B (en) Parallel computing method, device, equipment and medium for natural language processing model
CN112817730A (en) Deep neural network service batch processing scheduling method and system and GPU
WO2022110860A1 (en) Hardware environment-based data operation method, apparatus and device, and storage medium
CN110600020B (en) Gradient transmission method and device
CN112261023A (en) Data transmission method and device of convolutional neural network
WO2020103883A1 (en) Method for executing matrix multiplication, circuit and soc
CN109740619B (en) Neural network terminal operation method and device for target recognition
CN117273084A (en) Calculation method and device of neural network model, electronic equipment and storage medium
US20230083565A1 (en) Image data processing method and apparatus, storage medium, and electronic device
CN116431562B (en) Multi-head attention mechanism fusion calculation distribution method based on acceleration processor
WO2020238106A1 (en) Data processing method, electronic apparatus, and computer-readable storage medium
US20230306236A1 (en) Device and method for executing lstm neural network operation
US20230128421A1 (en) Neural network accelerator
CN107894957B (en) Convolutional neural network-oriented memory data access and zero insertion method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20210122)