CN113655986A - FFT convolution algorithm parallel implementation method and system based on NUMA affinity - Google Patents

FFT convolution algorithm parallel implementation method and system based on NUMA affinity

Info

Publication number: CN113655986A
Application number: CN202111000202.2A
Authority: CN (China)
Prior art keywords: fast fourier; memory access; fourier transform; result; uniform memory
Legal status: Granted; Active
Other languages: Chinese (zh)
Other versions: CN113655986B, CN113655986B9
Inventors: 王庆林, 梅松竹, 郝若晨, 李东升, 姜晶菲, 赖志权, 黄显栋, 刘杰
Current Assignee: National University of Defense Technology
Original Assignee: National University of Defense Technology
Events: application filed by National University of Defense Technology; publication of CN113655986A, CN113655986B and CN113655986B9; application granted

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 7/00: Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F 7/38: Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F 7/48: Methods or arrangements for performing computations using exclusively denominational number representation, using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F 7/4806: Computations with complex numbers
    • G06F 7/4812: Complex multiplication
    • G06F 17/00: Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10: Complex mathematical operations
    • G06F 17/14: Fourier, Walsh or analogous domain transformations, e.g. Laplace, Hilbert, Karhunen-Loeve transforms
    • G06F 17/141: Discrete Fourier transforms
    • G06F 17/142: Fast Fourier transforms, e.g. using a Cooley-Tukey type algorithm
    • G06F 17/16: Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention discloses a NUMA-affinity-based parallel implementation method and system for the FFT convolution algorithm. The method comprises the following steps: performing a fast Fourier transform on the input data and storing the first fast Fourier transform result on designated non-uniform memory access (NUMA) nodes; performing a fast Fourier transform on the weights and storing the second fast Fourier transform result on designated NUMA nodes; based on the first and second fast Fourier transform results, performing complex matrix multiplication in parallel at both the NUMA level and the multi-core level, and evenly distributing the result of the complex matrix multiplication over all NUMA nodes; and performing an inverse fast Fourier transform on the result of the complex matrix multiplication to obtain the output of the FFT convolution algorithm. The method significantly reduces the remote memory access overhead incurred by FFT convolution on NUMA architectures and thereby improves FFT convolution performance on such architectures.

Description

FFT convolution algorithm parallel implementation method and system based on NUMA affinity
Technical Field
The invention relates to the technical field of FFT (fast Fourier transform) convolution algorithms, in particular to a parallel FFT convolution algorithm implementation method and system based on NUMA (Non Uniform Memory Access) affinity.
Background
The convolutional neural network is one of the most representative deep learning algorithms and is widely applied in artificial intelligence scenarios. Convolution operations typically account for a large portion of a convolutional neural network's computational cost. FFT-based convolution algorithms can effectively reduce the complexity of the convolution computation and thus its cost, so implementing high-performance FFT convolution on multi-core and many-core processors is an active research topic. Existing work, however, targets multi-core/many-core processors with a Uniform Memory Access (UMA) architecture and is not optimized for the NUMA architecture. On a many-core NUMA processor, a core can directly access the local memory of the NUMA node it belongs to, but must access the remote memory attached to other NUMA nodes through the on-chip network. Memory access latency therefore increases significantly when a core and the memory it accesses are located on different NUMA nodes.
Therefore, how to effectively reduce the remote memory access overhead in the FFT convolution calculation process on the NUMA architecture and improve the performance of the FFT convolution on the NUMA architecture is an urgent problem to be solved.
Disclosure of Invention
In view of this, the invention provides a parallel implementation method for an FFT convolution algorithm based on NUMA affinity, which can significantly reduce the remote memory access overhead in the FFT convolution calculation process on the NUMA architecture and improve the performance of the FFT convolution on the NUMA architecture.
The invention provides a parallel implementation method of an FFT convolution algorithm based on NUMA affinity, which comprises the following steps:
performing fast Fourier transform on input data, and storing a first fast Fourier transform result to a specified non-uniform memory access node;
performing fast Fourier transform on the weight, and storing a second fast Fourier transform result to a specified non-uniform memory access node;
based on the first fast Fourier transform result and the second fast Fourier transform result, realizing the parallel complex matrix multiplication of the non-uniform memory access level and the multi-core level, and evenly distributing the result of the complex matrix multiplication to all the non-uniform memory access nodes;
and performing fast Fourier inverse transformation based on the result of the complex matrix multiplication to obtain the output of a fast Fourier convolution algorithm.
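For orientation, the four steps above follow the classic FFT convolution pipeline: forward FFT of inputs and weights, a per-frequency complex matrix multiplication over the channel dimension, and an inverse FFT. The NumPy sketch below shows that pipeline on a single node; it deliberately ignores the NUMA placement, tiling, and blocking that the invention adds, and all sizes and names are illustrative assumptions, not the patent's layout.

```python
import numpy as np

def fft_conv(inputs, filters):
    """Minimal FFT convolution sketch.
    inputs:  [B][C][H][W]
    filters: [K][C][Hf][Wf]
    returns: [B][K][H-Hf+1][W-Wf+1] (valid convolution)."""
    B, C, H, W = inputs.shape
    K, _, Hf, Wf = filters.shape
    # Steps 1-2: forward FFTs; the kernel is zero-padded to H x W.
    D = np.fft.rfft2(inputs, s=(H, W))    # [B][C][H][W//2+1]
    G = np.fft.rfft2(filters, s=(H, W))   # [K][C][H][W//2+1]
    # Step 3: per-frequency complex matrix multiply over channels C:
    # Z[b,k,f] = sum_c D[b,c,f] * G[k,c,f]
    Z = np.einsum('bchw,kchw->bkhw', D, G)
    # Step 4: inverse FFT gives a circular convolution; the region starting
    # at (Hf-1, Wf-1) is free of wrap-around and equals the valid result.
    out = np.fft.irfft2(Z, s=(H, W))
    return out[:, :, Hf - 1:, Wf - 1:]

x = np.random.rand(2, 3, 8, 8)
w = np.random.rand(4, 3, 3, 3)
y = fft_conv(x, w)
print(y.shape)  # (2, 4, 6, 6)
```

Note that multiplying the un-conjugated spectra computes a true convolution (kernel flipped relative to the cross-correlation used by most deep learning frameworks); a framework-style layer would flip the kernel first.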
Preferably, the performing of the fast Fourier transform on the input data and the storing of the first fast Fourier transform result on designated non-uniform memory access nodes include:

dividing the convolution input Input[B][C][H][W] into B×C×X×Y blocks of size Δ×Δ, where B denotes the mini-batch size of the convolution computation, C denotes the number of input channels, H and W denote the height and width of the input and output feature maps respectively, Δ×Δ is the size of each block, X and Y denote the numbers of blocks along the height and width of a feature map (obtained by rounding up against the tile size Δ and the convolution kernel size), Hf and Wf denote the size of the convolution kernel, and ⌈·⌉ denotes rounding up;

having each processor core independently process the fast Fourier transform of one Δ×Δ block, dividing each block's transform result into tuples of 2×L values, and storing all tuples evenly distributed over the designated non-uniform memory access nodes, where L denotes the vector register width of the processor;

after all processor cores have completed the fast Fourier transform of all B×C×X×Y blocks in parallel, obtaining the first fast Fourier transform result D[N][Pb][Cbl1][Bbr][γ][Cl1][Br][2×L], where N denotes the number of non-uniform memory access nodes, P denotes the total number of tuples after division, Pb = ⌈P/N⌉ is the number of tuples per node, Cbl1 = ⌈C/Cl1⌉, Bbr = ⌈B/Br⌉, γ = X×Y denotes the number of blocks into which each feature map is divided, and Cl1 and Br are block sizes of the complex matrix multiplication.
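The even tuple distribution can be illustrated with a toy index calculation. The sketch below assumes a round-robin tuple-to-node mapping, which the text does not specify; every name and number here is an illustrative assumption.

```python
# Sketch of the NUMA-affine tuple distribution described above.
L = 4        # assumed vector register width, in elements
N = 2        # number of NUMA nodes
delta = 8    # each FFT tile is delta x delta

# One tile's FFT result holds delta*delta complex values, i.e.
# 2*delta*delta scalars, grouped into tuples of 2*L scalars each.
num_scalars = 2 * delta * delta
P = num_scalars // (2 * L)   # total tuples per tile

def home_node(tuple_idx: int) -> int:
    """Assumed round-robin mapping: tuple t lives on node t % N, so every
    node ends up storing P/N tuples of every tile (an even distribution)."""
    return tuple_idx % N

placement = [home_node(t) for t in range(P)]
per_node = [placement.count(n) for n in range(N)]
print(P, per_node)  # 16 [8, 8]
```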
Preferably, the performing of the fast Fourier transform on the weights and the storing of the second fast Fourier transform result on designated non-uniform memory access nodes include:

padding the convolution input Filter[K][C][Hf][Wf] into K×C blocks of size Δ×Δ, where K denotes the number of output channels;

having each processor core independently compute the fast Fourier transform of one Δ×Δ block, dividing each block's transform result into tuples of 2×L values, and storing all tuples evenly distributed over the designated non-uniform memory access nodes, where L denotes the vector register width of the processor;

after all processor cores have completed the fast Fourier transform of all K×C blocks in parallel, obtaining the second fast Fourier transform result G[N][Pb][Cbl1][Kbr][Cl1][Kr][2×L], where Kbr = ⌈K/Kr⌉ and Kr is a block size of the parallel complex matrix multiplication.
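The blocked shapes D[N][Pb][Cbl1][Bbr][γ][Cl1][Br][2×L] and G[N][Pb][Cbl1][Kbr][Cl1][Kr][2×L] are easiest to see numerically. A small sketch, assuming the block counts are the usual ceiling divisions (Cbl1 = ⌈C/Cl1⌉ and so on; the exact formulas appear only as images in this copy of the patent, so treat them as our inference), with illustrative sizes:

```python
import math

# Illustrative sizes (not taken from the patent).
B, C, K = 32, 64, 96      # mini-batch, input channels, output channels
Br, Cl1, Kr = 4, 8, 12    # block sizes of the complex matrix multiplication
X, Y = 4, 4               # tiles per feature map along height and width
gamma = X * Y             # blocks per feature map

Bbr = math.ceil(B / Br)    # mini-batch blocks
Cbl1 = math.ceil(C / Cl1)  # input-channel blocks
Kbr = math.ceil(K / Kr)    # output-channel blocks

print(Bbr, Cbl1, Kbr, gamma)  # 8 8 8 16
```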
Preferably, the performing of the non-uniform memory access-level and multi-core-level parallel complex matrix multiplication based on the first fast Fourier transform result and the second fast Fourier transform result, and the even distribution of the result of the complex matrix multiplication over all non-uniform memory access nodes, comprise the following steps:
step 1, input D[N][Pb][Cbl1][Bbr][γ][Cl1][Br][2×L] and G[N][Pb][Cbl1][Kbr][Cl1][Kr][2×L]; obtain the number n of the non-uniform memory access node on which the current processor core resides (0 ≤ n < N), the total number Cores of cores in the current non-uniform memory access node, and the number cid of the processor core within the current node (0 ≤ cid < Cores), where Dn[Pb][Cbl1][Bbr][γ][Cl1][Br][2×L] and Gn[Pb][Cbl1][Kbr][Cl1][Kr][2×L] denote the portions of D and G stored on the nth non-uniform memory access node;
step 2, let δ = 0;
step 3, let cs = 0;
step 4, let cbμ = cid;
step 5, solve for kss, bss and μ according to the formulas kss = ⌊cbμ/(Bbr×γ)⌋, krs = kss×Kl2, bss = ⌊(cbμ − kss×Bbr×γ)/γ⌋ and μ = cbμ − kss×Bbr×γ − bss×γ, where ⌊·⌋ denotes rounding down;
step 6, let kk = 0;
step 7, according to the values of δ, cs, kk+krs, cbμ, bss and μ, fetch g′n = Gn[δ][cs][kk+krs] and d′n = Dn[δ][cs][bss][μ] from the current non-uniform memory access node, and fetch z′ = Z[kk+krs][bss][μ][δ] from the global tensor Z, where g′n denotes the sub-tensor of the G tensor stored on the nth non-uniform memory access node, of size Cl1×Kr×(2×L); d′n denotes the sub-tensor of the D tensor stored on the nth non-uniform memory access node, of size Cl1×Br×(2×L); and z′ denotes a sub-tensor of the Z tensor, which is evenly distributed over all non-uniform memory access nodes, of size Br×Kr×(2×L);
step 8, compute z′ = z′ + d′nᵀ · g′n, i.e. accumulate into z′[Br][Kr] the complex products of d′n and g′n summed over the Cl1 dimension;
step 9, store the values of z′[Br][Kr][2×L] back into Z[kk+krs][bss][μ][δ];
step 10, compute kk = kk + 1;
step 11, if kk < min(Kl2, Kbr − kss×Kl2) holds, jump to step 7 to continue processing; otherwise execute step 12, where Kl2 is a block size of the complex matrix multiplication and min denotes the minimum of the two values;
step 12, compute cbμ = cbμ + Cores;
step 13, if cbμ < Bbr×Kbl2×γ holds, jump to step 5 to continue processing; otherwise execute step 14, where Kbl2 = ⌈Kbr/Kl2⌉;
step 14, compute cs = cs + 1;
step 15, if cs < Cbl1 holds, jump to step 4 to continue processing; otherwise execute step 16;
step 16, compute δ = δ + 1;
step 17, if δ < Pb holds, jump to step 3 to continue processing; otherwise execute step 18;
step 18, the computation is complete; output the result Z[Kbr][Bbr][γ][P][Br][Kr][2×L].
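The heart of steps 4, 5, 12 and 13 is a flattened work index cbμ that each core walks through with stride Cores, so the cores of one node interleave over the (kss, bss, μ) work items. A sketch of that decomposition, with the floor-division formulas reconstructed from step 5's remainder relation μ = cbμ − kss×Bbr×γ − bss×γ (the originals appear only as images in this copy, so treat them as our inference):

```python
import math

# Illustrative block counts (not taken from the patent).
Bbr, Kbr, gamma = 3, 8, 4
Kl2 = 2
Kbl2 = math.ceil(Kbr / Kl2)   # step 13's loop bound is Bbr * Kbl2 * gamma

def decompose(cb_mu: int):
    """Step 5: recover (kss, bss, mu) from the flat work index cb_mu,
    using strides (Bbr*gamma, gamma, 1)."""
    kss = cb_mu // (Bbr * gamma)
    bss = (cb_mu - kss * Bbr * gamma) // gamma
    mu = cb_mu - kss * Bbr * gamma - bss * gamma
    return kss, bss, mu

# Steps 4 and 12: core `cid` starts at cb_mu = cid and strides by Cores,
# so the Cores cores of a node share the work round-robin.
Cores, cid = 4, 1
work = []
cb_mu = cid
while cb_mu < Bbr * Kbl2 * gamma:   # step 13's termination test
    work.append(decompose(cb_mu))
    cb_mu += Cores
print(len(work), work[0])  # 12 (0, 0, 1)
```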
Preferably, the performing of the inverse fast Fourier transform based on the result of the complex matrix multiplication to obtain the output of the fast Fourier convolution algorithm includes:

having each processor core extract one Δ×Δ-sized block from Z[Kbr][Bbr][γ][P][Br][Kr][2×L] and complete the inverse fast Fourier transform of that Δ×Δ block, obtaining one convolution output block of the corresponding size;

after all processor cores have together completed the inverse fast Fourier transform of all blocks, splicing the results into output[B][K][H][W] to obtain the output of the fast Fourier convolution algorithm.
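Taken together, each core inverse-transforms one tile and writes its output block into the correct (h, w) window of output[B][K][H][W]. A toy splice is sketched below, assuming non-overlapping output tiles of size T×T; the patent's exact valid-region size is given only as an image here, so the geometry is an illustrative assumption.

```python
import numpy as np

B, K, H, W = 1, 2, 8, 8
T = 4                 # assumed valid output size of one tile
X = Y = H // T        # tiles per feature map side

# Stand-in for the per-tile IFFT outputs: [B][K][X][Y][T][T].
tiles = np.arange(B * K * X * Y * T * T, dtype=float).reshape(B, K, X, Y, T, T)

out = np.zeros((B, K, H, W))
for x in range(X):            # splice every tile back at its (x, y) offset
    for y in range(Y):
        out[:, :, x*T:(x+1)*T, y*T:(y+1)*T] = tiles[:, :, x, y]
print(out.shape)  # (1, 2, 8, 8)
```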
A parallel implementation system for FFT convolution algorithm based on NUMA affinity comprises:
the first fast Fourier transform module is used for carrying out fast Fourier transform on input data and storing a first fast Fourier transform result to a specified non-uniform memory access node;
the second fast Fourier transform module is used for performing a fast Fourier transform on the weights and storing a second fast Fourier transform result on a designated non-uniform memory access node;
the parallel complex matrix multiplication module is used for realizing the parallel complex matrix multiplication of the non-uniform memory access level and the multi-core level based on the first fast Fourier transform result and the second fast Fourier transform result and evenly distributing the result of the complex matrix multiplication to all the non-uniform memory access nodes;
and the fast Fourier inverse conversion module is used for carrying out fast Fourier inverse conversion based on the result of the complex matrix multiplication to obtain the output of the fast Fourier convolution algorithm.
Preferably, when performing the fast Fourier transform on the input data and storing the first fast Fourier transform result on designated non-uniform memory access nodes, the first fast Fourier transform module is specifically configured to:

divide the convolution input Input[B][C][H][W] into B×C×X×Y blocks of size Δ×Δ, where B denotes the mini-batch size of the convolution computation, C denotes the number of input channels, H and W denote the height and width of the input and output feature maps respectively, Δ×Δ is the size of each block, X and Y denote the numbers of blocks along the height and width of a feature map (obtained by rounding up against the tile size Δ and the convolution kernel size), Hf and Wf denote the size of the convolution kernel, and ⌈·⌉ denotes rounding up;

have each processor core independently process the fast Fourier transform of one Δ×Δ block, divide each block's transform result into tuples of 2×L values, and store all tuples evenly distributed over the designated non-uniform memory access nodes, where L denotes the vector register width of the processor;

after all processor cores have completed the fast Fourier transform of all B×C×X×Y blocks in parallel, obtain the first fast Fourier transform result D[N][Pb][Cbl1][Bbr][γ][Cl1][Br][2×L], where N denotes the number of non-uniform memory access nodes, P denotes the total number of tuples after division, Pb = ⌈P/N⌉ is the number of tuples per node, Cbl1 = ⌈C/Cl1⌉, Bbr = ⌈B/Br⌉, γ = X×Y denotes the number of blocks into which each feature map is divided, and Cl1 and Br are block sizes of the complex matrix multiplication.
Preferably, when performing the fast Fourier transform on the weights and storing the second fast Fourier transform result on designated non-uniform memory access nodes, the second fast Fourier transform module is specifically configured to:

pad the convolution input Filter[K][C][Hf][Wf] into K×C blocks of size Δ×Δ, where K denotes the number of output channels;

have each processor core independently compute the fast Fourier transform of one Δ×Δ block, divide each block's transform result into tuples of 2×L values, and store all tuples evenly distributed over the designated non-uniform memory access nodes, where L denotes the vector register width of the processor;

after all processor cores have completed the fast Fourier transform of all K×C blocks in parallel, obtain the second fast Fourier transform result G[N][Pb][Cbl1][Kbr][Cl1][Kr][2×L], where Kbr = ⌈K/Kr⌉ and Kr is a block size of the parallel complex matrix multiplication.
Preferably, when performing the non-uniform memory access-level and multi-core-level parallel complex matrix multiplication based on the first and second fast Fourier transform results and evenly distributing the result of the complex matrix multiplication over all non-uniform memory access nodes, the parallel complex matrix multiplication module is specifically configured to perform the following steps:
step 1, input D[N][Pb][Cbl1][Bbr][γ][Cl1][Br][2×L] and G[N][Pb][Cbl1][Kbr][Cl1][Kr][2×L]; obtain the number n of the non-uniform memory access node on which the current processor core resides (0 ≤ n < N), the total number Cores of cores in the current non-uniform memory access node, and the number cid of the processor core within the current node (0 ≤ cid < Cores), where Dn[Pb][Cbl1][Bbr][γ][Cl1][Br][2×L] and Gn[Pb][Cbl1][Kbr][Cl1][Kr][2×L] denote the portions of D and G stored on the nth non-uniform memory access node;
step 2, let δ = 0;
step 3, let cs = 0;
step 4, let cbμ = cid;
step 5, solve for kss, bss and μ according to the formulas kss = ⌊cbμ/(Bbr×γ)⌋, krs = kss×Kl2, bss = ⌊(cbμ − kss×Bbr×γ)/γ⌋ and μ = cbμ − kss×Bbr×γ − bss×γ, where ⌊·⌋ denotes rounding down;
step 6, let kk = 0;
step 7, according to the values of δ, cs, kk+krs, cbμ, bss and μ, fetch g′n = Gn[δ][cs][kk+krs] and d′n = Dn[δ][cs][bss][μ] from the current non-uniform memory access node, and fetch z′ = Z[kk+krs][bss][μ][δ] from the global tensor Z, where g′n denotes the sub-tensor of the G tensor stored on the nth non-uniform memory access node, of size Cl1×Kr×(2×L); d′n denotes the sub-tensor of the D tensor stored on the nth non-uniform memory access node, of size Cl1×Br×(2×L); and z′ denotes a sub-tensor of the Z tensor, which is evenly distributed over all non-uniform memory access nodes, of size Br×Kr×(2×L);
step 8, compute z′ = z′ + d′nᵀ · g′n, i.e. accumulate into z′[Br][Kr] the complex products of d′n and g′n summed over the Cl1 dimension;
step 9, store the values of z′[Br][Kr][2×L] back into Z[kk+krs][bss][μ][δ];
step 10, compute kk = kk + 1;
step 11, if kk < min(Kl2, Kbr − kss×Kl2) holds, jump to step 7 to continue processing; otherwise execute step 12, where Kl2 is a block size of the complex matrix multiplication and min denotes the minimum of the two values;
step 12, compute cbμ = cbμ + Cores;
step 13, if cbμ < Bbr×Kbl2×γ holds, jump to step 5 to continue processing; otherwise execute step 14, where Kbl2 = ⌈Kbr/Kl2⌉;
step 14, compute cs = cs + 1;
step 15, if cs < Cbl1 holds, jump to step 4 to continue processing; otherwise execute step 16;
step 16, compute δ = δ + 1;
step 17, if δ < Pb holds, jump to step 3 to continue processing; otherwise execute step 18;
step 18, the computation is complete; output the result Z[Kbr][Bbr][γ][P][Br][Kr][2×L].
Preferably, when performing the inverse fast Fourier transform based on the result of the complex matrix multiplication to obtain the output of the fast Fourier convolution algorithm, the inverse fast Fourier transform module is specifically configured to:

have each processor core extract one Δ×Δ-sized block from Z[Kbr][Bbr][γ][P][Br][Kr][2×L] and complete the inverse fast Fourier transform of that Δ×Δ block, obtaining one convolution output block of the corresponding size;

after all processor cores have together completed the inverse fast Fourier transform of all blocks, splice the results into output[B][K][H][W] to obtain the output of the fast Fourier convolution algorithm.
In summary, the invention discloses a NUMA-affinity-based parallel implementation method for the FFT convolution algorithm: a fast Fourier transform is performed on the input data and the first fast Fourier transform result is stored on designated non-uniform memory access nodes; a fast Fourier transform is performed on the weights and the second fast Fourier transform result is stored on designated non-uniform memory access nodes; then, based on the first and second fast Fourier transform results, complex matrix multiplication is performed in parallel at both the non-uniform memory access level and the multi-core level, and its result is evenly distributed over all non-uniform memory access nodes; finally, an inverse fast Fourier transform is performed on the result of the complex matrix multiplication to obtain the output of the fast Fourier convolution algorithm. The method significantly reduces the remote memory access overhead incurred by FFT convolution on NUMA architectures and improves FFT convolution performance on such architectures.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a flowchart of an embodiment of a parallel implementation method of an FFT convolution algorithm based on NUMA affinity disclosed in the present invention;
FIG. 2 is a schematic illustration of NUMA affinity based translation result partitioning as disclosed herein;
FIG. 3 is a schematic diagram of communications between NUMA nodes before optimization;
FIG. 4 is a schematic diagram of communications between optimized NUMA nodes;
FIG. 5 is a schematic structural diagram of an embodiment of a NUMA affinity-based FFT convolution algorithm parallel implementation system disclosed in the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in fig. 1, which is a flowchart of an embodiment of a parallel implementation method of an FFT convolution algorithm based on NUMA affinity disclosed in the present invention, the method may include the following steps:
s101, performing fast Fourier transform on input data, and storing a first fast Fourier transform result to a designated non-uniform memory access node;
When the FFT convolution algorithm is to be implemented in parallel, the input FFT is first performed based on NUMA affinity; that is, an FFT is applied to the input data and the transform result is stored on the designated nodes.

Specifically, the convolution input Input[B][C][H][W] is divided into B×C×X×Y blocks of size Δ×Δ, where B denotes the mini-batch size of the convolution computation, C denotes the number of input channels, H and W denote the height and width of the input and output feature maps respectively, Δ×Δ is the size of each block, X and Y denote the numbers of blocks along the height and width of a feature map (obtained by rounding up against the tile size Δ and the convolution kernel size), Hf and Wf denote the size of the convolution kernel, and ⌈·⌉ denotes rounding up.

Each processor core independently processes the FFT of one Δ×Δ block; each block's transform result is divided into tuples of 2×L values, and all tuples are stored evenly distributed over the designated NUMA nodes, as shown in fig. 2, where L denotes the vector register width of the processor.

After all processor cores have completed the FFT of all B×C×X×Y blocks in parallel, the first FFT result D[N][Pb][Cbl1][Bbr][γ][Cl1][Br][2×L] is obtained, where N denotes the number of NUMA nodes, P denotes the total number of tuples after division, Pb = ⌈P/N⌉ is the number of tuples per node, Cbl1 = ⌈C/Cl1⌉, Bbr = ⌈B/Br⌉, γ = X×Y denotes the number of blocks into which each feature map is divided, and Cl1 and Br are block sizes of the complex matrix multiplication.
S102, performing fast Fourier transform on the weight, and storing a second fast Fourier transform result to a designated non-uniform memory access node;
and performing Kernel FFT conversion based on NUMA (non uniform memory access) affinity, namely performing FFT conversion on the weights, and storing the conversion result to a specified node.
Specifically, the convolution is input to Filter K][C][Hf][Wf]Padding into KC blocks with the size of delta multiplied by delta, wherein K represents the number of output channels;
solving each by each processor core alone
Figure BDA0003233272060000111
Block-wise FFT conversion, FFT-converted
Figure BDA0003233272060000112
Dividing the vector register into 2 × L tuples, and storing all the tuples to the specified NUMA nodes in an evenly distributed manner, as shown in fig. 2, where L represents the vector register width of the processor;
after all processor cores finish FFT conversion of all KC blocks in parallel, a second FFT conversion result G [ N ] is obtained][Pb][Cbl1][Kbr][Cl1][Kr][2×L]Wherein, in the step (A),
Figure BDA0003233272060000113
Kris the block size in the parallel complex matrix multiplication.
S103, based on the first fast Fourier transform result and the second fast Fourier transform result, the parallel complex matrix multiplication of the non-uniform memory access level and the multi-core level is realized, and the result of the complex matrix multiplication is evenly distributed to all the non-uniform memory access nodes;
then, NUMA-level and multi-core-level parallel complex matrix multiplication is realized based on the converted result, and the result of the complex matrix multiplication is evenly distributed to all NUMA nodes.
Specifically, the method can comprise the following steps:
step 1, inputting D [ N ]][Pb][Cbl1][Bbr][γ][Cl1][Br][2×L]And G [ N ]][Pb][Cbl1][Kbr][Cl1][Kr][2×L]Obtaining the NUMA node number n of the current processor core, wherein n is more than or equal to 0<N, the total number of Cores in the current NUMA node and the number cid of the processor core in the current NUMA node, wherein 0 is less than or equal to cid<Cores, wherein Dn[Pb][Cbl1][Bbr][γ][Cl1][Br][2×L]And Gn[Pb][Cbl1][Kbr][Cl1][Kr][2×L]To representD. A portion of G stored on the nth NUMA node;
step 2, making delta equal to 0;
step 3, setting cs to be 0;
step 4, letting cb mu be cid;
step 5, according to the formula
Figure BDA0003233272060000121
krs=kss×Kl2
Figure BDA0003233272060000122
μ=cbμ-kss×BbrSolving the solutions of kss, bss and mu respectively by the multiplied by gamma-bss multiplied by gamma, wherein,
Figure BDA0003233272060000123
represents rounding down;
step 6, making kk equal to 0;
step 7, according to the values of delta, cs, kk + krs, cb mu, bss and mu, g 'is obtained from the current NUMA node'n=Gn,δ,cs,kk+krs,d′n=Dn,δ,cs,bss,μFrom global, get Z' ═ Zkk+krs,bss,μ,δWherein, g'nA sub-tensor representing the storage of the G tensor at the nth NUMA node, of size Cl1×Kr×(2×L),d′nA sub-tensor representing the storage of the D tensor at the nth NUMA node, of size Cl1×BrX (2 × L), Z' denotes a sub-tensor of the Z tensor which is evenly distributed over all NUMA introductions, with the size Br×Kr×(2×L);
Step 8, calculating
Figure BDA0003233272060000124
Step 9, mixing z' [ B ]r][Kr][2×L]Value of (2) is stored back in Zkk+krs,bss,μ,δPerforming the following steps;
step 10, calculating kk-kk + 1;
step 11, if kk<min(Kl2,Kbr-kss×Kl2) If yes, go to step 7 and continueC, processing; if kk<min(Kl2,Kbr-kss×Kl2) If not, go to step 12, where Kl2The size of the block in the multiplication of the complex matrix is min represents the minimum value of the two;
step 12, calculate cbμ = cbμ + Cores;
step 13, if cbμ < Bbr × Kbl2 × γ holds, skip to step 5 to continue processing; if it does not hold, execute step 14; wherein Kbl2 = ⌈Kbr / Kl2⌉ and ⌈·⌉ represents rounding up;
step 14, calculate cs = cs + 1;
step 15, if cs < Cbl1 holds, skip to step 4 to continue processing; if it does not hold, execute step 16;
step 16, calculate δ = δ + 1;
step 17, if δ < Pb holds, skip to step 3 to continue processing; if it does not hold, execute step 18;
step 18, the calculation is complete; output the result Z[Kbr][Bbr][γ][P][Br][Kr][2×L].
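The work partition in steps 4-5 and 12-13 (each core starts at cbμ = cid and strides by Cores, decomposing the flat index cbμ into the block coordinates kss, bss and μ) can be illustrated with a minimal single-threaded Python sketch. All sizes below are arbitrary illustrative values, not values prescribed by the method:

```python
import math

# Illustrative sizes only (not taken from the patent).
Bbr, Kbr, Kl2, gamma, Cores = 3, 7, 2, 4, 5

Kbl2 = math.ceil(Kbr / Kl2)      # number of Kl2-sized blocks of Kbr
total = Bbr * Kbl2 * gamma       # loop bound of step 13

def decompose(cbmu):
    """Step 5: recover (kss, krs, bss, mu) from the flat work index cbmu."""
    kss = cbmu // (Bbr * gamma)              # rounding down
    krs = kss * Kl2
    bss = (cbmu - kss * Bbr * gamma) // gamma
    mu = cbmu - kss * Bbr * gamma - bss * gamma
    return kss, krs, bss, mu

# Each core cid starts at cbmu = cid (step 4) and strides by Cores (step 12).
seen = []
for cid in range(Cores):
    cbmu = cid
    while cbmu < total:          # step 13 condition
        kss, krs, bss, mu = decompose(cbmu)
        assert 0 <= kss < Kbl2 and 0 <= bss < Bbr and 0 <= mu < gamma
        seen.append(cbmu)
        cbmu += Cores

# The cores of one node cover every work item exactly once, with no overlap.
assert sorted(seen) == list(range(total))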
S104, performing an inverse fast Fourier transform based on the result of the complex matrix multiplication to obtain the output of the fast Fourier convolution algorithm.
Finally, the output IFFT is performed based on the result of the complex matrix multiplication to obtain the output of the FFT convolution algorithm.
In particular, each processor core extracts one block from Z[Kbr][Bbr][γ][P][Br][Kr][2×L], completes the IFFT of that block, and obtains one block of the output result of the convolution;
After all the processor cores have jointly completed the IFFTs of all blocks, the results are spliced into output[B][K][H][W], which is the output of the FFT convolution algorithm.
Fig. 3 is a schematic diagram of the communication between NUMA nodes before optimization, and fig. 4 shows the communication between NUMA nodes after optimization. It can be seen that the present invention significantly reduces the overhead of remote memory accesses on a NUMA architecture and, by exploiting NUMA-level parallelism across multiple NUMA nodes together with core-level parallelism among the processor cores within a single NUMA node, significantly improves the computational performance of the FFT convolution algorithm on NUMA architectures.
As shown in fig. 5, which is a schematic structural diagram of an embodiment of a parallel implementation system for an FFT convolution algorithm based on NUMA affinity disclosed in the present invention, the system may include:
a first fast fourier transform module 501, configured to perform fast fourier transform on input data, and store a first fast fourier transform result to a designated non-uniform memory access node;
a second fast fourier transform module 502, configured to perform fast fourier transform on the weight, and store a second fast fourier transform result to a designated non-uniform memory access node;
a parallel complex matrix multiplication module 503, configured to implement parallel complex matrix multiplication of the non-uniform memory access level and the multi-core level based on the first fast fourier transform result and the second fast fourier transform result, and evenly distribute the result of the complex matrix multiplication to all non-uniform memory access nodes;
and the inverse fast Fourier transform module 504 is configured to perform the inverse fast Fourier transform based on the result of the complex matrix multiplication to obtain the output of the fast Fourier convolution algorithm.
The working principle of the parallel implementation system of the FFT convolution algorithm based on NUMA affinity disclosed in this embodiment is the same as that of the parallel implementation method of the FFT convolution algorithm based on NUMA affinity, and is not described herein again.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), flash memory, Read-Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A parallel implementation method for FFT convolution algorithm based on NUMA affinity is characterized by comprising the following steps:
performing fast Fourier transform on input data, and storing a first fast Fourier transform result to a specified non-uniform memory access node;
performing fast Fourier transform on the weight, and storing a second fast Fourier transform result to a specified non-uniform memory access node;
based on the first fast Fourier transform result and the second fast Fourier transform result, realizing the parallel complex matrix multiplication of the non-uniform memory access level and the multi-core level, and evenly distributing the result of the complex matrix multiplication to all the non-uniform memory access nodes;
and performing fast Fourier inverse transformation based on the result of the complex matrix multiplication to obtain the output of a fast Fourier convolution algorithm.
2. The method of claim 1, wherein performing the fast Fourier transform on the input data and storing the first fast Fourier transform result on a designated non-uniform memory access node comprises:
dividing the convolution Input[B][C][H][W] into B×C×X×Δ blocks of size δ×δ, wherein B represents the mini-batch size in the convolution calculation, C represents the number of input channels, H and W represent the height and width of the convolution input and output feature maps respectively, δ×δ is the size of each divided block, X and Δ are the numbers of blocks (rounded up) into which each feature map is divided along the height and width directions respectively, Hf and Wf represent the size of the convolution kernel, and ⌈·⌉ represents rounding up;
performing, by each processor core independently, the fast Fourier transform of each δ×δ block, dividing each fast Fourier transform result into tuples of size 2×L, and evenly distributing and storing all tuples on the designated non-uniform memory access nodes, wherein L represents the width of a vector register of the processor;
after all the processor cores complete the fast Fourier transforms of all B×C×X×Δ blocks in parallel, obtaining a first fast Fourier transform result D[N][Pb][Cbl1][Bbr][γ][Cl1][Br][2×L], wherein N represents the number of non-uniform memory access nodes, P represents the total number of tuples divided, Pb = ⌈P/N⌉, Cbl1 = ⌈C/Cl1⌉, Bbr = ⌈B/Br⌉, γ = X×Δ denotes the number of blocks into which each feature map is divided, and Cl1 and Br are block sizes in the complex matrix multiplication.
3. The method of claim 2, wherein performing the fast Fourier transform on the weights and storing the second fast Fourier transform result on a designated non-uniform memory access node comprises:
padding the convolution input Filter[K][C][Hf][Wf] into K×C blocks of size δ×δ, wherein K represents the number of output channels;
performing, by each processor core independently, the fast Fourier transform of each δ×δ block, dividing each fast Fourier transform result into tuples of size 2×L, and evenly distributing and storing all tuples on the designated non-uniform memory access nodes, wherein L represents the width of a vector register of the processor;
after all the processor cores complete the fast Fourier transforms of all K×C blocks in parallel, obtaining a second fast Fourier transform result G[N][Pb][Cbl1][Kbr][Cl1][Kr][2×L], wherein Kbr = ⌈K/Kr⌉ and Kr is a block size in the parallel complex matrix multiplication.
4. The method of claim 3, wherein performing the non-uniform memory access level and multi-core level parallel complex matrix multiplication based on the first fast Fourier transform result and the second fast Fourier transform result, and evenly distributing the complex matrix multiplication result to all non-uniform memory access nodes, comprises:
step 1, input D[N][Pb][Cbl1][Bbr][γ][Cl1][Br][2×L] and G[N][Pb][Cbl1][Kbr][Cl1][Kr][2×L]; obtain the number n of the non-uniform memory access node where the current processor core resides (0 ≤ n < N), the total number of cores Cores in the current non-uniform memory access node, and the number cid of the processor core within the current non-uniform memory access node (0 ≤ cid < Cores), wherein Dn[Pb][Cbl1][Bbr][γ][Cl1][Br][2×L] and Gn[Pb][Cbl1][Kbr][Cl1][Kr][2×L] represent the portions of D and G stored on the nth non-uniform memory access node;
step 2, set δ = 0;
step 3, set cs = 0;
step 4, set cbμ = cid;
step 5, according to the formulas
kss = ⌊cbμ / (Bbr × γ)⌋,
krs = kss × Kl2,
bss = ⌊(cbμ - kss × Bbr × γ) / γ⌋,
μ = cbμ - kss × Bbr × γ - bss × γ,
solve for kss, bss and μ respectively, wherein ⌊·⌋ represents rounding down;
step 6, set kk = 0;
step 7, according to the values of δ, cs, kk + krs, cbμ, bss and μ, obtain g'n = Gn,δ,cs,kk+krs and d'n = Dn,δ,cs,bss,μ from the current non-uniform memory access node, and obtain Z' = Zkk+krs,bss,μ,δ from global memory, wherein g'n is a sub-tensor of the G tensor stored on the nth non-uniform memory access node, of size Cl1×Kr×(2×L); d'n is a sub-tensor of the D tensor stored on the nth non-uniform memory access node, of size Cl1×Br×(2×L); and Z' is a sub-tensor of the Z tensor that is evenly distributed over all non-uniform memory access nodes, of size Br×Kr×(2×L);
step 8, calculate z' = z' + (d'n)ᵀ × g'n, i.e. accumulate z'[b][k] = z'[b][k] + Σc d'n[c][b]·g'n[c][k] for 0 ≤ c < Cl1, where · denotes complex multiplication carried out over the 2×L vector lanes;
step 9, store the value of z'[Br][Kr][2×L] back into Zkk+krs,bss,μ,δ;
step 10, calculate kk = kk + 1;
step 11, if kk < min(Kl2, Kbr - kss × Kl2) holds, skip to step 7 to continue processing; if it does not hold, execute step 12, wherein Kl2 is a block size in the complex matrix multiplication and min represents the minimum of the two values;
step 12, calculate cbμ = cbμ + Cores;
step 13, if cbμ < Bbr × Kbl2 × γ holds, skip to step 5 to continue processing; if it does not hold, execute step 14; wherein Kbl2 = ⌈Kbr / Kl2⌉ and ⌈·⌉ represents rounding up;
step 14, calculate cs = cs + 1;
step 15, if cs < Cbl1 holds, skip to step 4 to continue processing; if it does not hold, execute step 16;
step 16, calculate δ = δ + 1;
step 17, if δ < Pb holds, skip to step 3 to continue processing; if it does not hold, execute step 18;
step 18, the calculation is complete; output the result Z[Kbr][Bbr][γ][P][Br][Kr][2×L].
5. The method of claim 4, wherein performing an inverse fast Fourier transform based on the result of the complex matrix multiplication to obtain an output of a fast Fourier convolution algorithm comprises:
extracting, by each processor core, one block from Z[Kbr][Bbr][γ][P][Br][Kr][2×L], completing the inverse fast Fourier transform of that block, and obtaining one block of the output result of the convolution;
after all the processor cores have jointly completed the inverse fast Fourier transforms of all blocks, splicing the results into output[B][K][H][W] to obtain the output of the fast Fourier convolution algorithm.
6. An FFT convolution algorithm parallel implementation system based on NUMA affinity is characterized by comprising the following components:
the first fast Fourier transform module is used for carrying out fast Fourier transform on input data and storing a first fast Fourier transform result to a specified non-uniform memory access node;
the second fast Fourier transform module is used for performing the fast Fourier transform on the weights and storing a second fast Fourier transform result on a designated non-uniform memory access node;
the parallel complex matrix multiplication module is used for realizing the parallel complex matrix multiplication of the non-uniform memory access level and the multi-core level based on the first fast Fourier transform result and the second fast Fourier transform result and evenly distributing the result of the complex matrix multiplication to all the non-uniform memory access nodes;
and the inverse fast Fourier transform module is used for performing the inverse fast Fourier transform based on the result of the complex matrix multiplication to obtain the output of the fast Fourier convolution algorithm.
7. The system of claim 6, wherein the first fast Fourier transform module, when performing the fast Fourier transform on the input data and storing the first fast Fourier transform result on the designated non-uniform memory access node, is specifically configured to:
divide the convolution Input[B][C][H][W] into B×C×X×Δ blocks of size δ×δ, wherein B represents the mini-batch size in the convolution calculation, C represents the number of input channels, H and W represent the height and width of the convolution input and output feature maps respectively, δ×δ is the size of each divided block, X and Δ are the numbers of blocks (rounded up) into which each feature map is divided along the height and width directions respectively, Hf and Wf represent the size of the convolution kernel, and ⌈·⌉ represents rounding up;
perform, by each processor core independently, the fast Fourier transform of each δ×δ block, divide each fast Fourier transform result into tuples of size 2×L, and evenly distribute and store all tuples on the designated non-uniform memory access nodes, wherein L represents the width of a vector register of the processor;
after all the processor cores complete the fast Fourier transforms of all B×C×X×Δ blocks in parallel, obtain a first fast Fourier transform result D[N][Pb][Cbl1][Bbr][γ][Cl1][Br][2×L], wherein N represents the number of non-uniform memory access nodes, P represents the total number of tuples divided, Pb = ⌈P/N⌉, Cbl1 = ⌈C/Cl1⌉, Bbr = ⌈B/Br⌉, γ = X×Δ denotes the number of blocks into which each feature map is divided, and Cl1 and Br are block sizes in the complex matrix multiplication.
8. The system of claim 7, wherein the second fast Fourier transform module, when performing the fast Fourier transform on the weights and storing the second fast Fourier transform result on the designated non-uniform memory access nodes, is specifically configured to:
pad the convolution input Filter[K][C][Hf][Wf] into K×C blocks of size δ×δ, wherein K represents the number of output channels;
perform, by each processor core independently, the fast Fourier transform of each δ×δ block, divide each fast Fourier transform result into tuples of size 2×L, and evenly distribute and store all tuples on the designated non-uniform memory access nodes, wherein L represents the width of a vector register of the processor;
after all the processor cores complete the fast Fourier transforms of all K×C blocks in parallel, obtain a second fast Fourier transform result G[N][Pb][Cbl1][Kbr][Cl1][Kr][2×L], wherein Kbr = ⌈K/Kr⌉ and Kr is a block size in the parallel complex matrix multiplication.
9. The system according to claim 8, wherein the parallel complex matrix multiplication module, when performing the non-uniform memory access level and multi-core level parallel complex matrix multiplication based on the first fast Fourier transform result and the second fast Fourier transform result, and evenly distributing the complex matrix multiplication result to all non-uniform memory access nodes, is specifically configured to perform the following steps:
step 1, input D[N][Pb][Cbl1][Bbr][γ][Cl1][Br][2×L] and G[N][Pb][Cbl1][Kbr][Cl1][Kr][2×L]; obtain the number n of the non-uniform memory access node where the current processor core resides (0 ≤ n < N), the total number of cores Cores in the current non-uniform memory access node, and the number cid of the processor core within the current non-uniform memory access node (0 ≤ cid < Cores), wherein Dn[Pb][Cbl1][Bbr][γ][Cl1][Br][2×L] and Gn[Pb][Cbl1][Kbr][Cl1][Kr][2×L] represent the portions of D and G stored on the nth non-uniform memory access node;
step 2, set δ = 0;
step 3, set cs = 0;
step 4, set cbμ = cid;
step 5, according to the formulas
kss = ⌊cbμ / (Bbr × γ)⌋,
krs = kss × Kl2,
bss = ⌊(cbμ - kss × Bbr × γ) / γ⌋,
μ = cbμ - kss × Bbr × γ - bss × γ,
solve for kss, bss and μ respectively, wherein ⌊·⌋ represents rounding down;
step 6, set kk = 0;
step 7, according to the values of δ, cs, kk + krs, cbμ, bss and μ, obtain g'n = Gn,δ,cs,kk+krs and d'n = Dn,δ,cs,bss,μ from the current non-uniform memory access node, and obtain Z' = Zkk+krs,bss,μ,δ from global memory, wherein g'n is a sub-tensor of the G tensor stored on the nth non-uniform memory access node, of size Cl1×Kr×(2×L); d'n is a sub-tensor of the D tensor stored on the nth non-uniform memory access node, of size Cl1×Br×(2×L); and Z' is a sub-tensor of the Z tensor that is evenly distributed over all non-uniform memory access nodes, of size Br×Kr×(2×L);
step 8, calculate z' = z' + (d'n)ᵀ × g'n, i.e. accumulate z'[b][k] = z'[b][k] + Σc d'n[c][b]·g'n[c][k] for 0 ≤ c < Cl1, where · denotes complex multiplication carried out over the 2×L vector lanes;
step 9, store the value of z'[Br][Kr][2×L] back into Zkk+krs,bss,μ,δ;
step 10, calculate kk = kk + 1;
step 11, if kk < min(Kl2, Kbr - kss × Kl2) holds, skip to step 7 to continue processing; if it does not hold, execute step 12, wherein Kl2 is a block size in the complex matrix multiplication and min represents the minimum of the two values;
step 12, calculate cbμ = cbμ + Cores;
step 13, if cbμ < Bbr × Kbl2 × γ holds, skip to step 5 to continue processing; if it does not hold, execute step 14; wherein Kbl2 = ⌈Kbr / Kl2⌉ and ⌈·⌉ represents rounding up;
step 14, calculate cs = cs + 1;
step 15, if cs < Cbl1 holds, skip to step 4 to continue processing; if it does not hold, execute step 16;
step 16, calculate δ = δ + 1;
step 17, if δ < Pb holds, skip to step 3 to continue processing; if it does not hold, execute step 18;
step 18, the calculation is complete; output the result Z[Kbr][Bbr][γ][P][Br][Kr][2×L].
10. The system according to claim 9, wherein the inverse fast Fourier transform module, when performing the inverse fast Fourier transform based on the result of the complex matrix multiplication, is specifically configured to:
extract, by each processor core, one block from Z[Kbr][Bbr][γ][P][Br][Kr][2×L], complete the inverse fast Fourier transform of that block, and obtain one block of the output result of the convolution;
after all the processor cores have jointly completed the inverse fast Fourier transforms of all blocks, splice the results into output[B][K][H][W] to obtain the output of the fast Fourier convolution algorithm.
CN202111000202.2A 2021-08-27 2021-08-27 FFT convolution algorithm parallel implementation method and system based on NUMA affinity Active CN113655986B9 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111000202.2A CN113655986B9 (en) 2021-08-27 2021-08-27 FFT convolution algorithm parallel implementation method and system based on NUMA affinity

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111000202.2A CN113655986B9 (en) 2021-08-27 2021-08-27 FFT convolution algorithm parallel implementation method and system based on NUMA affinity

Publications (3)

Publication Number Publication Date
CN113655986A true CN113655986A (en) 2021-11-16
CN113655986B CN113655986B (en) 2023-06-30
CN113655986B9 CN113655986B9 (en) 2023-10-10

Family

ID=78482338

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111000202.2A Active CN113655986B9 (en) 2021-08-27 2021-08-27 FFT convolution algorithm parallel implementation method and system based on NUMA affinity

Country Status (1)

Country Link
CN (1) CN113655986B9 (en)


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103558592A (en) * 2013-10-08 2014-02-05 北京航空航天大学 Satellite-borne SAR echo data simulation method based on MPI parallel computing
CN111143766A (en) * 2019-12-24 2020-05-12 上海寒武纪信息科技有限公司 Method and apparatus for processing two-dimensional complex matrix by artificial intelligence processor
CN112559952A (en) * 2019-09-26 2021-03-26 无锡江南计算技术研究所 Heterogeneous many-core fast Fourier transform method based on sequence layering


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023182765A1 (en) * 2022-03-21 2023-09-28 Samsung Electronics Co., Ltd. Speech enhancement method and device using fast fourier convolution
CN116401502A (en) * 2023-06-09 2023-07-07 之江实验室 Method and device for optimizing Winograd convolution based on NUMA system characteristics
CN116401502B (en) * 2023-06-09 2023-11-03 之江实验室 Method and device for optimizing Winograd convolution based on NUMA system characteristics

Also Published As

Publication number Publication date
CN113655986B (en) 2023-06-30
CN113655986B9 (en) 2023-10-10

Similar Documents

Publication Publication Date Title
CN106228238B (en) Accelerate the method and system of deep learning algorithm on field programmable gate array platform
TWI639119B (en) Adaptive execution engine for convolution computing systems cross-reference to related applications
CN109993299B (en) Data training method and device, storage medium and electronic device
CN113655986B9 (en) FFT convolution algorithm parallel implementation method and system based on NUMA affinity
CN112084038B (en) Memory allocation method and device of neural network
US10755169B2 (en) Hybrid non-uniform convolution transform engine for deep learning applications
CN112199636B (en) Fast convolution method and device suitable for microprocessor
Bottleson et al. clcaffe: Opencl accelerated caffe for convolutional neural networks
KR20200100190A (en) Image Transformation for Machine Learning
CN113986816A (en) Reconfigurable computing chip
Alexandru et al. Efficient implementation of the overlap operator on multi-GPUs
CN113885941A (en) Singular value decomposition operation implementation method, device and related equipment
CN110490308B (en) Design method of acceleration library, terminal equipment and storage medium
CN114461978B (en) Data processing method and device, electronic equipment and readable storage medium
CN110414672B (en) Convolution operation method, device and system
CN115238879A (en) Architecture search method of deep neural network and hardware accelerator
Wu et al. Skeletongcn: a simple yet effective accelerator for gcn training
CN116881618B (en) General matrix multiplication calculation optimization method, device and processor
CN115328440A (en) General sparse matrix multiplication implementation method and device based on 2D systolic array
TWI798591B (en) Convolutional neural network operation method and device
CN110930290B (en) Data processing method and device
Ho et al. Towards FPGA-assisted spark: An SVM training acceleration case study
CN113888390A (en) Feature map processing method and device, electronic equipment and computer readable medium
CN118503205B (en) Method and apparatus for processing tensor data
CN111931919B (en) Sparse neural network computing method and device based on systolic array

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CI03 Correction of invention patent

Correction item: Description

Correct: Paragraphs 0001-0129 of the instruction manual submitted on May 23, 2023

False: Paragraphs 0001-0129 of the instruction manual submitted on the application date

Number: 26-02

Page: ??

Volume: 39
