CN113655986B - FFT convolution algorithm parallel implementation method and system based on NUMA affinity - Google Patents

FFT convolution algorithm parallel implementation method and system based on NUMA affinity Download PDF

Info

Publication number
CN113655986B
CN113655986B (application CN202111000202.2A)
Authority
CN
China
Prior art keywords
fast fourier
fourier transform
memory access
result
uniform memory
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111000202.2A
Other languages
Chinese (zh)
Other versions
CN113655986A (en)
CN113655986B9 (en)
Inventor
王庆林
梅松竹
郝若晨
李东升
姜晶菲
赖志权
黄显栋
刘杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN202111000202.2A
Publication of CN113655986A
Application granted
Publication of CN113655986B
Publication of CN113655986B9
Status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/4806Computations with complex numbers
    • G06F7/4812Complex multiplication
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/14Fourier, Walsh or analogous domain transformations, e.g. Laplace, Hilbert, Karhunen-Loeve, transforms
    • G06F17/141Discrete Fourier transforms
    • G06F17/142Fast Fourier transforms, e.g. using a Cooley-Tukey type algorithm
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/16Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Discrete Mathematics (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a NUMA-affinity-based parallel implementation method and system for the FFT convolution algorithm. The method comprises: performing a fast Fourier transform on the input data and storing the first fast Fourier transform result on designated non-uniform memory access nodes; performing a fast Fourier transform on the weights and storing the second fast Fourier transform result on designated non-uniform memory access nodes; performing non-uniform-memory-access-level and multi-core-level parallel complex matrix multiplication based on the first and second fast Fourier transform results, and distributing the complex matrix multiplication result evenly across all non-uniform memory access nodes; and performing an inverse fast Fourier transform on the complex matrix multiplication result to obtain the output of the fast Fourier convolution algorithm. The invention significantly reduces the remote memory access overhead incurred during FFT convolution on NUMA architectures and improves FFT convolution performance on such architectures.

Description

FFT convolution algorithm parallel implementation method and system based on NUMA affinity
Technical Field
The invention relates to the technical field of FFT (Fast Fourier Transform) convolution algorithms, and in particular to a parallel implementation method and system for an FFT convolution algorithm based on NUMA (Non-Uniform Memory Access) affinity.
Background
Convolutional neural networks are among the most representative deep learning algorithms and are widely applied in artificial intelligence scenarios. Convolution operations typically account for a significant portion of the computational cost of convolutional neural networks. FFT-based convolution algorithms can effectively reduce the complexity of the convolution computation and thereby its cost. Implementing high-performance FFT convolution algorithms on multi-core and many-core processors has long been a research hotspot. Existing work targets multi-core/many-core processors with a UMA (Uniform Memory Access) architecture and is not optimized for NUMA architectures. On a many-core NUMA processor, a core accesses the local memory of its own NUMA node directly, and accesses the remote memory attached to other NUMA nodes through the network on chip. Memory access latency therefore increases significantly when the core and the memory are located on different NUMA nodes.
Therefore, how to effectively reduce the remote memory access overhead in the FFT convolution calculation process on the NUMA architecture and improve the performance of the FFT convolution on the NUMA architecture is a problem to be solved.
Disclosure of Invention
In view of this, the invention provides a parallel implementation method of FFT convolution algorithm based on NUMA affinity, which can remarkably reduce remote memory access overhead in FFT convolution calculation process on NUMA architecture and improve performance of FFT convolution on NUMA architecture.
The invention provides a parallel implementation method of FFT convolution algorithm based on NUMA affinity, comprising the following steps:
performing fast Fourier transform on input data, and storing a first fast Fourier transform result to a designated non-uniform memory access node;
performing fast Fourier transform on the weights, and storing a second fast Fourier transform result to a designated non-uniform memory access node;
realizing non-uniform memory access level and multi-core level parallel complex matrix multiplication based on the first fast Fourier transform result and the second fast Fourier transform result, and uniformly distributing the complex matrix multiplication result to all non-uniform memory access nodes;
and performing fast Fourier inverse transformation based on the complex matrix multiplication result to obtain the output of a fast Fourier convolution algorithm.
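For illustration only, the four steps above can be sketched as a single NumPy routine that ignores NUMA placement, tiling and the 2×L data layout; the function name and shapes are illustrative, and the result is a true (flipped-kernel) convolution:

```python
import numpy as np

def fft_conv2d(inp, filt):
    """Sketch of the four-step pipeline: FFT(input), FFT(weights),
    per-frequency complex matrix multiply over channels, inverse FFT.
    inp: [B, C, H, W]; filt: [K, C, Hf, Wf]; returns [B, K, H-Hf+1, W-Wf+1]."""
    B, C, H, W = inp.shape
    K, _, Hf, Wf = filt.shape
    fi = np.fft.rfft2(inp, s=(H, W))        # step 1: FFT of the input
    ff = np.fft.rfft2(filt, s=(H, W))       # step 2: FFT of the weights
    # step 3: at every frequency point, a complex "matrix multiplication"
    # that contracts the input-channel dimension C
    fo = np.einsum('bchw,kchw->bkhw', fi, ff)
    out = np.fft.irfft2(fo, s=(H, W))       # step 4: inverse FFT
    # keep only the region unaffected by circular wrap-around ("valid" output)
    return out[:, :, Hf - 1:, Wf - 1:]
```

A production implementation additionally tiles the H×W feature maps into blocks and vectorizes the per-frequency multiplications, which is what the preferred embodiments elaborate.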
Preferably, the performing of a fast Fourier transform on the input data and the storing of the first fast Fourier transform result on the designated non-uniform memory access nodes comprises:

dividing the convolution input Input[B][C][H][W] into B×C×X×Δ blocks of size [formula omitted], where B denotes the mini-batch size in the convolution computation, C denotes the number of input channels, H and W denote the height and width of the input and output feature maps respectively, [formula omitted] is the block size after partitioning, H_f and W_f denote the height and width of the convolution kernel, and ⌈·⌉ denotes rounding up;

having each processor core independently process the fast Fourier transform of one block, dividing the fast Fourier transform result of each block into tuples of 2×L values, and storing all tuples evenly distributed across the designated non-uniform memory access nodes, where L denotes the vector register width of the processor;

after all processor cores have completed the fast Fourier transforms of all B×C×X×Δ blocks in parallel, obtaining the first fast Fourier transform result D[N][Pb][Cb_l1][Bb_r][γ][C_l1][B_r][2×L], where N denotes the number of non-uniform memory access nodes, Pb [formula omitted] denotes the total number of tuples after division, γ = X×Δ denotes the number of blocks in each feature map, and C_l1 and B_r are block sizes in the complex matrix multiplication.
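As a sketch of the 2×L tuple layout described above (the exact interleaving inside a tuple is not specified in the text, so the [L reals | L imaginaries] order below is an assumption), packing a block's complex FFT result into SIMD-friendly tuples might look like:

```python
import numpy as np

L = 4  # assumed vector register width (floats per SIMD register)

def to_tuples(freq):
    """Pack a 1-D complex array of P frequency points into ceil(P/L) tuples
    of 2*L floats each: L real parts followed by L imaginary parts, so one
    register can hold the reals and another the imaginaries of L points."""
    P = freq.size
    Pb = -(-P // L)                          # ceil(P / L), cf. Pb above
    padded = np.zeros(Pb * L, dtype=complex)
    padded[:P] = freq
    blocks = padded.reshape(Pb, L)
    return np.stack([blocks.real, blocks.imag], axis=1).reshape(Pb, 2 * L)

def from_tuples(tuples, P):
    """Inverse of to_tuples: recover the first P complex frequency points."""
    blocks = tuples.reshape(-1, 2, L)
    return (blocks[:, 0] + 1j * blocks[:, 1]).reshape(-1)[:P]
```

With this layout, one complex multiplication of L frequency points becomes a handful of element-wise SIMD operations on the real and imaginary halves.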
Preferably, the performing of a fast Fourier transform on the weights and the storing of the second fast Fourier transform result on the designated non-uniform memory access nodes comprises:

padding the convolution kernel Filter[K][C][H_f][W_f] into K×C blocks of size δ×δ, where K denotes the number of output channels;

having each processor core independently compute the fast Fourier transform of one block, dividing the fast Fourier transform result [formula omitted] into tuples of 2×L values, and storing all tuples evenly distributed across the designated non-uniform memory access nodes, where L denotes the vector register width of the processor;

after all processor cores have completed the fast Fourier transforms of all K×C blocks in parallel, obtaining the second fast Fourier transform result G[N][Pb][Cb_l1][Kb_r][C_l1][K_r][2×L], where [formula omitted] and K_r is a block size in the parallel complex matrix multiplication.
Preferably, the implementing of non-uniform memory access level and multi-core level parallel complex matrix multiplication based on the first and second fast Fourier transform results, and the even distribution of the complex matrix multiplication result across all non-uniform memory access nodes, comprises the following steps:

step 1: input D[N][Pb][Cb_l1][Bb_r][γ][C_l1][B_r][2×L] and G[N][Pb][Cb_l1][Kb_r][C_l1][K_r][2×L]; obtain the number n of the non-uniform memory access node on which the current processor core resides (0 ≤ n < N), the total number of cores Cores in the current non-uniform memory access node, and the index cid of the processor core within the current node (0 ≤ cid < Cores), where D_n[Pb][Cb_l1][Bb_r][γ][C_l1][B_r][2×L] and G_n[Pb][Cb_l1][Kb_r][C_l1][K_r][2×L] denote the portions of D and G stored on the n-th non-uniform memory access node;

step 2: let δ = 0;

step 3: let cs = 0;

step 4: let cbμ = cid;

step 5: solve for kss, bss and μ according to kss = ⌊cbμ/(Bb_r×γ)⌋, krs = kss×K_l2, bss = ⌊(cbμ − kss×Bb_r×γ)/γ⌋ and μ = cbμ − kss×Bb_r×γ − bss×γ, where ⌊·⌋ denotes rounding down;

step 6: let kk = 0;

step 7: according to the values of δ, cs, kk+krs, cbμ, bss and μ, obtain g′_n = G_{n,δ,cs,kk+krs} and d′_n = D_{n,δ,cs,bss,μ} from the current non-uniform memory access node, and obtain Z′ = Z_{kk+krs,bss,μ,δ} from global memory, where g′_n denotes the sub-tensor of the G tensor stored on the n-th non-uniform memory access node, of size C_l1×K_r×(2×L); d′_n denotes the sub-tensor of the D tensor stored on the n-th non-uniform memory access node, of size C_l1×B_r×(2×L); and Z′ denotes the sub-tensor of the Z tensor distributed evenly across all non-uniform memory access nodes, of size B_r×K_r×(2×L);

step 8: compute Z′[b][k] += Σ_c d′_n[c][b]·g′_n[c][k], i.e. accumulate into Z′ the complex products of d′_n and g′_n summed over the C_l1 dimension, for each of the L frequency points of the 2×L tuple;

step 9: store the value of Z′[B_r][K_r][2×L] back into Z_{kk+krs,bss,μ,δ};

step 10: compute kk = kk + 1;

step 11: if kk < min(K_l2, Kb_r − kss×K_l2) holds, jump to step 7 to continue processing; otherwise execute step 12, where K_l2 is a block size in the complex matrix multiplication and min takes the smaller of the two values;

step 12: compute cbμ = cbμ + Cores;

step 13: if cbμ < Bb_r×Kb_l2×γ holds, jump to step 5 to continue processing; otherwise execute step 14, where Kb_l2 = ⌈Kb_r/K_l2⌉;

step 14: compute cs = cs + 1;

step 15: if cs < Cb_l1 holds, jump to step 4 to continue processing; otherwise execute step 16;

step 16: compute δ = δ + 1;

step 17: if δ < Pb holds, jump to step 3 to continue processing; otherwise execute step 18;

step 18: the computation is complete; output the result Z[Kb_r][Bb_r][γ][P][B_r][K_r][2×L].
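The accumulation formula of step 8 survives only as an unreproduced image; inferring from the stated sub-tensor sizes, a plausible micro-kernel accumulates, for each of the L frequency points in a 2×L tuple, the complex products of d′_n and g′_n over the C_l1 dimension. The layout assumption (L reals followed by L imaginaries per tuple) is illustrative:

```python
import numpy as np

def complex_mm_kernel(d, g, z, L):
    """Plausible reconstruction of step 8: accumulate into z[B_r][K_r][2L]
    the complex products of d[C_l1][B_r][2L] and g[C_l1][K_r][2L], where
    each 2L slot holds L frequency points as (L reals | L imaginaries)."""
    dr, di = d[..., :L], d[..., L:]
    gr, gi = g[..., :L], g[..., L:]
    # complex multiply-accumulate over the channel block C_l1,
    # independently for each of the L frequency points
    z[..., :L] += (np.einsum('cbl,ckl->bkl', dr, gr)
                   - np.einsum('cbl,ckl->bkl', di, gi))
    z[..., L:] += (np.einsum('cbl,ckl->bkl', dr, gi)
                   + np.einsum('cbl,ckl->bkl', di, gr))
```

Because each core reads d′_n and g′_n from its own NUMA node and only Z′ from global memory, the kernel's operand traffic stays node-local.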
Preferably, the performing of an inverse fast Fourier transform based on the complex matrix multiplication result to obtain the output of the fast Fourier convolution algorithm comprises:

having each processor core extract one block of size [formula omitted] from Z[Kb_r][Bb_r][γ][P][B_r][K_r][2×L] and complete the inverse fast Fourier transform of that block, the inverse fast Fourier transform of one block yielding a convolution output block of size [formula omitted];

after all processor cores have together completed the inverse fast Fourier transforms of all blocks, splicing the results into Output[B][K][H][W], which is the output of the fast Fourier convolution algorithm.
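The per-block inverse transform and splicing above is the classic overlap-based tiling of FFT convolution; a 1-D overlap-save sketch (all names illustrative, not the patent's exact tiling) shows why the spliced valid regions of the per-tile inverse transforms reproduce the full convolution:

```python
import numpy as np

def overlap_save_1d(x, h, tile):
    """Convolve x with h by cutting x into overlapping tiles
    (overlap = len(h) - 1), FFT-convolving each tile of length `tile`,
    discarding the circularly wrapped prefix, and splicing the rest."""
    M = len(h)
    step = tile - (M - 1)                   # valid output samples per tile
    hfreq = np.fft.rfft(h, tile)            # kernel FFT, computed once
    xp = np.concatenate([np.zeros(M - 1), x])   # prime the first tile
    out = []
    for start in range(0, len(x), step):
        seg = xp[start:start + tile]
        if len(seg) < tile:                 # zero-pad the final tile
            seg = np.concatenate([seg, np.zeros(tile - len(seg))])
        y = np.fft.irfft(np.fft.rfft(seg) * hfreq, tile)
        out.append(y[M - 1:])               # discard the wrapped prefix
    return np.concatenate(out)[:len(x)]
```

The 2-D case used for feature maps works the same way along both the height and width axes.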
An FFT convolution algorithm parallel implementation system based on NUMA affinity, comprising:
the first fast Fourier transform module is used for performing fast Fourier transform on input data and storing a first fast Fourier transform result to a designated non-uniform memory access node;
the second fast Fourier transform module is used for performing fast Fourier transform on the weights and storing second fast Fourier transform results to the appointed non-uniform memory access nodes;
the parallel complex matrix multiplication module is used for realizing non-uniform memory access level and multi-core level parallel complex matrix multiplication based on the first fast Fourier transform result and the second fast Fourier transform result, and uniformly distributing the complex matrix multiplication result to all non-uniform memory access nodes;
and the inverse fast Fourier transform module is used for performing an inverse fast Fourier transform based on the complex matrix multiplication result to obtain the output of the fast Fourier convolution algorithm.
Preferably, when performing the fast Fourier transform on the input data and storing the first fast Fourier transform result on the designated non-uniform memory access nodes, the first fast Fourier transform module is specifically configured to:

divide the convolution input Input[B][C][H][W] into B×C×X×Δ blocks of size [formula omitted], where B denotes the mini-batch size in the convolution computation, C denotes the number of input channels, H and W denote the height and width of the input and output feature maps respectively, [formula omitted] is the block size after partitioning, H_f and W_f denote the height and width of the convolution kernel, and ⌈·⌉ denotes rounding up;

have each processor core independently process the fast Fourier transform of one block, divide the fast Fourier transform result of each block into tuples of 2×L values, and store all tuples evenly distributed across the designated non-uniform memory access nodes, where L denotes the vector register width of the processor;

after all processor cores have completed the fast Fourier transforms of all B×C×X×Δ blocks in parallel, obtain the first fast Fourier transform result D[N][Pb][Cb_l1][Bb_r][γ][C_l1][B_r][2×L], where N denotes the number of non-uniform memory access nodes, Pb [formula omitted] denotes the total number of tuples after division, γ = X×Δ denotes the number of blocks in each feature map, and C_l1 and B_r are block sizes in the complex matrix multiplication.
Preferably, when performing the fast Fourier transform on the weights and storing the second fast Fourier transform result on the designated non-uniform memory access nodes, the second fast Fourier transform module is specifically configured to:

pad the convolution kernel Filter[K][C][H_f][W_f] into K×C blocks of size δ×δ, where K denotes the number of output channels;

have each processor core independently compute the fast Fourier transform of one block, divide the fast Fourier transform result [formula omitted] into tuples of 2×L values, and store all tuples evenly distributed across the designated non-uniform memory access nodes, where L denotes the vector register width of the processor;

after all processor cores have completed the fast Fourier transforms of all K×C blocks in parallel, obtain the second fast Fourier transform result G[N][Pb][Cb_l1][Kb_r][C_l1][K_r][2×L], where [formula omitted] and K_r is a block size in the parallel complex matrix multiplication.
Preferably, when implementing the non-uniform memory access level and multi-core level parallel complex matrix multiplication based on the first and second fast Fourier transform results, and distributing the complex matrix multiplication result evenly across all non-uniform memory access nodes, the parallel complex matrix multiplication module is specifically configured to perform the following steps:

step 1: input D[N][Pb][Cb_l1][Bb_r][γ][C_l1][B_r][2×L] and G[N][Pb][Cb_l1][Kb_r][C_l1][K_r][2×L]; obtain the number n of the non-uniform memory access node on which the current processor core resides (0 ≤ n < N), the total number of cores Cores in the current non-uniform memory access node, and the index cid of the processor core within the current node (0 ≤ cid < Cores), where D_n[Pb][Cb_l1][Bb_r][γ][C_l1][B_r][2×L] and G_n[Pb][Cb_l1][Kb_r][C_l1][K_r][2×L] denote the portions of D and G stored on the n-th non-uniform memory access node;

step 2: let δ = 0;

step 3: let cs = 0;

step 4: let cbμ = cid;

step 5: solve for kss, bss and μ according to kss = ⌊cbμ/(Bb_r×γ)⌋, krs = kss×K_l2, bss = ⌊(cbμ − kss×Bb_r×γ)/γ⌋ and μ = cbμ − kss×Bb_r×γ − bss×γ, where ⌊·⌋ denotes rounding down;

step 6: let kk = 0;

step 7: according to the values of δ, cs, kk+krs, cbμ, bss and μ, obtain g′_n = G_{n,δ,cs,kk+krs} and d′_n = D_{n,δ,cs,bss,μ} from the current non-uniform memory access node, and obtain Z′ = Z_{kk+krs,bss,μ,δ} from global memory, where g′_n denotes the sub-tensor of the G tensor stored on the n-th non-uniform memory access node, of size C_l1×K_r×(2×L); d′_n denotes the sub-tensor of the D tensor stored on the n-th non-uniform memory access node, of size C_l1×B_r×(2×L); and Z′ denotes the sub-tensor of the Z tensor distributed evenly across all non-uniform memory access nodes, of size B_r×K_r×(2×L);

step 8: compute Z′[b][k] += Σ_c d′_n[c][b]·g′_n[c][k], i.e. accumulate into Z′ the complex products of d′_n and g′_n summed over the C_l1 dimension, for each of the L frequency points of the 2×L tuple;

step 9: store the value of Z′[B_r][K_r][2×L] back into Z_{kk+krs,bss,μ,δ};

step 10: compute kk = kk + 1;

step 11: if kk < min(K_l2, Kb_r − kss×K_l2) holds, jump to step 7 to continue processing; otherwise execute step 12, where K_l2 is a block size in the complex matrix multiplication and min takes the smaller of the two values;

step 12: compute cbμ = cbμ + Cores;

step 13: if cbμ < Bb_r×Kb_l2×γ holds, jump to step 5 to continue processing; otherwise execute step 14, where Kb_l2 = ⌈Kb_r/K_l2⌉;

step 14: compute cs = cs + 1;

step 15: if cs < Cb_l1 holds, jump to step 4 to continue processing; otherwise execute step 16;

step 16: compute δ = δ + 1;

step 17: if δ < Pb holds, jump to step 3 to continue processing; otherwise execute step 18;

step 18: the computation is complete; output the result Z[Kb_r][Bb_r][γ][P][B_r][K_r][2×L].
Preferably, when performing the inverse fast Fourier transform based on the complex matrix multiplication result to obtain the output of the fast Fourier convolution algorithm, the inverse fast Fourier transform module is specifically configured to:

have each processor core extract one block of size [formula omitted] from Z[Kb_r][Bb_r][γ][P][B_r][K_r][2×L] and complete the inverse fast Fourier transform of that block, the inverse fast Fourier transform of one block yielding a convolution output block of size [formula omitted];

after all processor cores have together completed the inverse fast Fourier transforms of all blocks, splice the results into Output[B][K][H][W] to obtain the output of the fast Fourier convolution algorithm.
In summary, the invention discloses a parallel implementation method of an FFT convolution algorithm based on NUMA affinity, which comprises the steps of performing fast fourier transform on input data, and storing a first fast fourier transform result to a designated non-uniform memory access node; performing fast Fourier transform on the weights, and storing a second fast Fourier transform result to a designated non-uniform memory access node; then realizing non-uniform memory access level and multi-core level parallel complex matrix multiplication based on the first fast Fourier transform result and the second fast Fourier transform result, and uniformly distributing the complex matrix multiplication result to all non-uniform memory access nodes; and performing fast Fourier inverse transformation based on the complex matrix multiplication result to obtain the output of a fast Fourier convolution algorithm. The invention can obviously reduce the remote memory access overhead in the FFT convolution calculation process on the NUMA architecture and improve the FFT convolution performance on the NUMA architecture.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of an embodiment of a parallel implementation method of FFT convolution algorithm based on NUMA affinity disclosed by the invention;
FIG. 2 is a diagram illustrating the partitioning of translation results based on NUMA affinity according to the present disclosure;
FIG. 3 is a schematic diagram of communication between NUMA nodes prior to optimization;
FIG. 4 is a schematic diagram of communication between optimized NUMA nodes;
fig. 5 is a schematic structural diagram of an embodiment of a parallel implementation system of FFT convolution algorithm based on NUMA affinity.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
As shown in fig. 1, a flowchart of an embodiment of a method for implementing parallel FFT convolution algorithm based on NUMA affinity disclosed in the present invention, the method may include the following steps:
s101, performing fast Fourier transform on input data, and storing a first fast Fourier transform result to a designated non-uniform memory access node;
When the FFT convolution algorithm is to be implemented in parallel, the input FFT is performed first with NUMA affinity: the input data undergoes an FFT, and the transform results are stored on the designated nodes.

Specifically, the convolution input Input[B][C][H][W] is divided into B×C×X×Δ blocks of size [formula omitted], where B denotes the mini-batch size in the convolution computation, C denotes the number of input channels, H and W denote the height and width of the input and output feature maps respectively, [formula omitted] is the block size after partitioning, H_f and W_f denote the height and width of the convolution kernel, and ⌈·⌉ denotes rounding up.

Each processor core independently processes the FFT of one block; the FFT result of each block is divided into tuples of 2×L values, and all tuples are stored evenly distributed across the designated NUMA nodes, as shown in FIG. 2, where L denotes the vector register width of the processor.

After all processor cores have completed the FFTs of all B×C×X×Δ blocks in parallel, the first FFT result D[N][Pb][Cb_l1][Bb_r][γ][C_l1][B_r][2×L] is obtained, where N denotes the number of NUMA nodes, Pb [formula omitted] denotes the total number of tuples after division, γ = X×Δ denotes the number of blocks in each feature map, and C_l1 and B_r are block sizes in the complex matrix multiplication.
S102, performing fast Fourier transform on the weights, and storing a second fast Fourier transform result to a designated non-uniform memory access node;
The kernel FFT is likewise performed with NUMA affinity: the weights undergo an FFT, and the transform results are stored on the designated nodes.

Specifically, the convolution kernel Filter[K][C][H_f][W_f] is padded into K×C blocks of size δ×δ, where K denotes the number of output channels.

Each processor core independently computes the FFT of one block; the FFT result [formula omitted] is divided into tuples of 2×L values, and all tuples are stored evenly distributed across the designated NUMA nodes, as shown in FIG. 2, where L denotes the vector register width of the processor.

After all processor cores have completed the FFTs of all K×C blocks in parallel, the second FFT result G[N][Pb][Cb_l1][Kb_r][C_l1][K_r][2×L] is obtained, where [formula omitted] and K_r is a block size in the parallel complex matrix multiplication.
S103, realizing non-uniform memory access level and multi-core level parallel complex matrix multiplication based on a first fast Fourier transform result and a second fast Fourier transform result, and uniformly distributing the complex matrix multiplication result to all non-uniform memory access nodes;
and then, realizing NUMA level and multi-core level parallel complex matrix multiplication based on the converted result, and uniformly distributing the complex matrix multiplication result to all NUMA nodes.
Specifically, the method comprises the following steps:
step 1, input D [ N ]][Pb][Cb l1 ][Bb r ][γ][C l1 ][B r ][2×L]And G [ N ]][Pb][Cb l1 ][Kb r ][C l1 ][K r ][2×L]Obtaining the number n of NUMA nodes where the current processor core is located, wherein n is more than or equal to 0<N, total number of Cores Cores in current NUMA node, and number cid of processor core in current NUMA node, wherein 0 is equal to or less than cid<Cores, where D n [Pb][Cb l1 ][Bb r ][γ][C l1 ][B r ][2×L]And G n [Pb][Cb l1 ][Kb r ][C l1 ][K r ][2×L]The portion of representation D, G stored on the nth NUMA node;
step 2, let δ=0;
step 3, letting cs=0;
step 4, letting cbμ=cid;
step 5, according to the formula
Figure BDA0003233272060000121
krs=kss×K l2 ,/>
Figure BDA0003233272060000122
μ=cbμ-kss×Bb r Xγ -bss. Gamma. Solving kss, bss and μ, respectively, wherein +.>
Figure BDA0003233272060000123
Representing a downward rounding;
step 6, let kk=0;
step 7, obtaining g 'from the current NUMA node according to the values of delta, cs, kk+krs, cb mu, bss and mu' n =G n,δ,cs,kk+krs ,d′ n =D n,δ,cs,bss,μ Obtaining Z' =z from global kk+krs,bss,μ,δ Wherein g' n Representing the sub-tensor of the G tensor stored on the nth NUMA node with the size of C l1 ×K r ×(2×L),d′ n Representing the D tensor stored on the nth NUMA node with a size of C l1 ×B r X (2 XL), Z' represents the sub-tensor of the Z tensor evenly distributed over all NUMA introductions, of size B r ×K r ×(2×L);
Step 8, calculating
Figure BDA0003233272060000124
Step 9, Z' [ B ] r ][K r ][2×L]Value of (2) is stored back to Z kk+krs,bss,μ,δ In (a) and (b);
step 10, calculating kk=kk+1;
step 11, if kk<min(K l2 ,Kb r -kss×K l2 ) If yes, jumping to the step 7 to continue processing; if kk is<min(K l2 ,Kb r -kss×K l2 ) If not, go to step 12, wherein K l2 For the block size in the complex matrix multiplication, min represents the minimum value of the two;
step 12, calculating cbμ=cbμ+cores;
step 13, if cb mu<Bb r ×Kb l2 If xγ is true, the process proceeds to step 5, and if cb μ is true<Bb r ×Kb l2 If x gamma is not satisfied, executing step 14; wherein the method comprises the steps of
Figure BDA0003233272060000125
Step 14, calculating cs=cs+1;
step 15, if cs<Cb l1 If yes, jumping to the step 4 to continue processing; if cs<Cb l1 Is not formed intoStanding, executing step 16;
step 16, calculating delta=delta+1;
step 17, if delta < Pb is true, jumping to step 3 to continue processing, and if delta < Pb is not true, executing step 18;
step 18, the calculation is complete; output the result Z[Kb_r][Bb_r][γ][P][B_r][K_r][2×L].
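The traversal in steps 2 to 18 amounts to each core striding over a flat work index cbμ and decomposing it into block coordinates. The following Python sketch illustrates that decomposition under the formulas of step 5; the function names and the example block sizes are illustrative assumptions, not part of the method:

```python
# Illustrative sketch of the step-5 index decomposition: the flat per-core
# work index cb_mu is split into (kss, krs, bss, mu). Names follow the text;
# the concrete block sizes used below are example values, not from the patent.

def decompose(cb_mu, bb_r, gamma, k_l2):
    kss = cb_mu // (bb_r * gamma)                   # K-block group index
    krs = kss * k_l2                                # starting K_r-block offset
    bss = (cb_mu - kss * bb_r * gamma) // gamma     # B_r-block index
    mu = cb_mu - kss * bb_r * gamma - bss * gamma   # block index inside a map
    return kss, krs, bss, mu


def core_work_indices(cid, cores, bb_r, kb_l2, gamma):
    """Steps 4/12/13: each core starts at cb_mu = cid and strides by Cores."""
    return list(range(cid, bb_r * kb_l2 * gamma, cores))
```

Each triple round-trips as cbμ = kss×Bb_r×γ + bss×γ + μ, which matches the relation given for μ in step 5.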
S104, performing fast Fourier inverse transformation based on the complex matrix multiplication result to obtain the output of a fast Fourier convolution algorithm.
Finally, an inverse FFT is performed on the complex matrix multiplication result to obtain the output of the FFT convolution algorithm.
Specifically, each processor core extracts from Z[Kb_r][Bb_r][γ][P][B_r][K_r][2×L] one block of size δ×(⌊δ/2⌋+1) and completes the IFFT of one δ×δ block; the IFFT of one block yields a convolution output result of size (δ-H_f+1)×(δ-W_f+1);
After all processor cores jointly finish the IFFT of all blocks, the results are spliced into output[B][K][H][W], giving the output of the FFT convolution algorithm.
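A single-core, single-channel sketch of this tile-wise flow (forward FFT of a δ×δ tile, pointwise complex multiplication with the kernel FFT, inverse FFT, splice of the valid region) may look as follows in Python with NumPy; the function name and the simplification to one channel and one core are assumptions for illustration:

```python
import numpy as np

def fft_conv_tiled(x, f, delta):
    """Single-channel tile-wise FFT convolution of image x with kernel f."""
    hf, wf = f.shape
    out_h, out_w = x.shape[0] - hf + 1, x.shape[1] - wf + 1
    step = delta - hf + 1                        # valid outputs per tile
    f_hat = np.fft.rfft2(f, s=(delta, delta))    # kernel FFT, zero-padded
    y = np.zeros((out_h, out_w))
    for i in range(0, out_h, step):
        for j in range(0, out_w, step):
            tile = np.zeros((delta, delta))
            h = min(delta, x.shape[0] - i)
            w = min(delta, x.shape[1] - j)
            tile[:h, :w] = x[i:i + h, j:j + w]
            # pointwise complex multiply in the frequency domain, then IFFT
            z = np.fft.irfft2(np.fft.rfft2(tile) * f_hat, s=(delta, delta))
            rh = min(step, out_h - i)
            rw = min(step, out_w - j)
            # keep only the region unaffected by circular wrap-around
            y[i:i + rh, j:j + rw] = z[hf - 1:hf - 1 + rh, wf - 1:wf - 1 + rw]
    return y
```

Each tile contributes a (δ-H_f+1)×(δ-W_f+1) valid region, which is spliced into the final output as described above.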
Fig. 3 shows a schematic diagram of the communication between NUMA nodes before optimization, and fig. 4 shows the communication between NUMA nodes after optimization. As can be seen, the present invention significantly reduces the cost of remote memory access on a NUMA architecture, and significantly improves the computing performance of the FFT convolution algorithm on a NUMA architecture through NUMA-level parallelism among multiple NUMA nodes and core-level parallelism among the processor cores within a single NUMA node.
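The two levels of parallelism assume that each worker knows its NUMA node number n and in-node core number cid. A minimal sketch of deriving and applying such a mapping on Linux is shown below; the contiguous core numbering (node n owning cores n·Cores to (n+1)·Cores-1) is an assumption for illustration only, and a real implementation would query the actual topology, e.g. via libnuma:

```python
import os

def node_and_cid(rank, cores_per_node):
    """Map a global worker rank to (NUMA node n, in-node core id cid)."""
    return rank // cores_per_node, rank % cores_per_node


def cores_of_node(node, cores_per_node):
    """Core ids of one NUMA node under the contiguous-numbering assumption."""
    return list(range(node * cores_per_node, (node + 1) * cores_per_node))


def pin_worker(rank, cores_per_node):
    """Bind the current worker to its assumed core (Linux-only call)."""
    node, cid = node_and_cid(rank, cores_per_node)
    if hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, {node * cores_per_node + cid})
    return node, cid
```

With the worker pinned, memory it first touches is allocated on its own node, which is what keeps the accesses in fig. 4 local.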
Referring to fig. 5, which is a schematic structural diagram of an embodiment of a NUMA affinity-based FFT convolution algorithm parallel implementation system disclosed herein, the system may include:
the first fast fourier transform module 501 is configured to perform fast fourier transform on input data, and store a result of the first fast fourier transform to a specified non-uniform memory access node;
the second fast Fourier transform module 502 is configured to perform fast Fourier transform on the weights, and store the second fast Fourier transform result to a designated non-uniform memory access node;
the parallel complex matrix multiplication module 503 is configured to implement non-uniform memory access level and multi-core level parallel complex matrix multiplication based on the first fast Fourier transform result and the second fast Fourier transform result, and evenly distribute the complex matrix multiplication result to all non-uniform memory access nodes;
the inverse fast fourier transform module 504 is configured to perform inverse fast fourier transform based on the complex matrix multiplication result, and obtain an output of the fast fourier convolution algorithm.
The working principle of the NUMA affinity-based FFT convolution algorithm parallel implementation system disclosed in this embodiment is the same as that of the NUMA affinity-based FFT convolution algorithm parallel implementation method, and is not described here again.
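The interplay of the four modules can also be illustrated numerically: transform the input, transform the weights, reduce the channel dimension with an independent complex matrix product at every frequency point (the role of the parallel complex matrix multiplication module), and inverse-transform. The NumPy sketch below uses one tile covering the whole feature map and simplified layouts relative to the blocked D, G and Z tensors; it illustrates the mathematics only, not the patented NUMA distribution:

```python
import numpy as np

def fft_conv_layer(x, w):
    """x: [C,H,W] input, w: [K,C,Hf,Wf] kernels -> y: [K,H-Hf+1,W-Wf+1]."""
    c, h_, w_ = x.shape
    k, _, hf, wf = w.shape
    d = np.fft.rfft2(x, s=(h_, w_))        # input FFT, shape [C, H, W//2+1]
    g = np.fft.rfft2(w, s=(h_, w_))        # weight FFT, shape [K, C, H, W//2+1]
    # independent complex reduction over C at every frequency point,
    # i.e. the "complex matrix multiplication" between the two FFT results
    z = np.einsum('chw,kchw->khw', d, g)
    y = np.fft.irfft2(z, s=(h_, w_))       # inverse FFT back to space
    return y[:, hf - 1:, wf - 1:]          # valid convolution region
```

At each frequency point the einsum is precisely a complex reduction over the channel dimension, the same contraction performed in step 8 of the method.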
In the present specification, each embodiment is described in a progressive manner, and each embodiment is mainly described in a different point from other embodiments, and identical and similar parts between the embodiments are all enough to refer to each other. For the device disclosed in the embodiment, since it corresponds to the method disclosed in the embodiment, the description is relatively simple, and the relevant points refer to the description of the method section.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. The software modules may be disposed in Random Access Memory (RAM), memory, read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (4)

1. A parallel implementation method of FFT convolution algorithm based on NUMA affinity is characterized by comprising the following steps:
performing fast Fourier transform on input data, and storing a first fast Fourier transform result to a designated non-uniform memory access node;
performing fast Fourier transform on the weights, and storing a second fast Fourier transform result to a designated non-uniform memory access node;
realizing non-uniform memory access level and multi-core level parallel complex matrix multiplication based on the first fast Fourier transform result and the second fast Fourier transform result, and uniformly distributing the complex matrix multiplication result to all non-uniform memory access nodes;
performing fast Fourier inverse transformation based on the complex matrix multiplication result to obtain the output of a fast Fourier convolution algorithm; wherein:
the fast fourier transforming the input data and storing the first fast fourier transforming result to the designated non-uniform memory access node includes:
the convolution input[B][C][H][W] is divided into B×C×X×Δ blocks of size δ×δ, where B represents the size of the mini-batch in the convolution calculation, C represents the number of input channels, H and W represent the height and width of the feature maps of the convolution input and output respectively, δ×δ is the partitioned block size, X = ⌈H/(δ-H_f+1)⌉ and Δ = ⌈W/(δ-W_f+1)⌉, H_f and W_f represent the size of the convolution kernel, and ⌈·⌉ represents rounding up;
each processor core individually processes the fast Fourier transform of each δ×δ block; the fast Fourier transform result of each block is divided into 2×L tuples, and all tuples are stored on the designated non-uniform memory access nodes in an evenly distributed manner, where L represents the vector register width of the processor;
after all processor cores complete the fast Fourier transform of all B×C×X×Δ blocks in parallel, the first fast Fourier transform result D[N][Pb][Cb_l1][Bb_r][γ][C_l1][B_r][2×L] is obtained, where N represents the number of non-uniform memory access nodes, Pb represents the total number of tuples after division, Cb_l1 = ⌈C/C_l1⌉, Bb_r = ⌈B/B_r⌉, γ = X×Δ represents the number of divided blocks in each feature map, and C_l1 and B_r are block sizes in the complex matrix multiplication;
the performing fast fourier transform on the weight, and storing the second fast fourier transform result to a designated non-uniform memory access node, includes:
the convolution kernel Filter[K][C][H_f][W_f] is padded into K×C blocks of δ×δ size, where K represents the number of output channels;
each processor core individually solves the fast Fourier transform of each δ×δ block; the fast Fourier transform result of each block is divided into 2×L tuples, and all tuples are stored on the designated non-uniform memory access nodes in an evenly distributed manner, where L represents the vector register width of the processor;
after all processor cores complete the fast Fourier transform of all K×C blocks in parallel, the second fast Fourier transform result G[N][Pb][Cb_l1][Kb_r][C_l1][K_r][2×L] is obtained, where Kb_r = ⌈K/K_r⌉ and K_r is a block size in the parallel complex matrix multiplication;
the method for realizing non-uniform memory access level and multi-core level parallel complex matrix multiplication based on the first fast Fourier transform result and the second fast Fourier transform result and evenly distributing the complex matrix multiplication result to all non-uniform memory access nodes comprises the following steps:
step 1, input D[N][Pb][Cb_l1][Bb_r][γ][C_l1][B_r][2×L] and G[N][Pb][Cb_l1][Kb_r][C_l1][K_r][2×L]; obtain the number n of the non-uniform memory access node where the current processor core is located, where 0 ≤ n < N, the total number of cores Cores in the current non-uniform memory access node, and the number cid of the processor core within the current non-uniform memory access node, where 0 ≤ cid < Cores; D_n[Pb][Cb_l1][Bb_r][γ][C_l1][B_r][2×L] and G_n[Pb][Cb_l1][Kb_r][C_l1][K_r][2×L] denote the portions of D and G stored on the nth non-uniform memory access node;
step 2, let δ=0;
step 3, letting cs=0;
step 4, letting cbμ=cid;
step 5, according to the formulas kss = ⌊cbμ/(Bb_r×γ)⌋, krs = kss×K_l2, bss = ⌊(cbμ - kss×Bb_r×γ)/γ⌋ and μ = cbμ - kss×Bb_r×γ - bss×γ, solve kss, bss and μ respectively, where ⌊·⌋ represents rounding down;
step 6, let kk=0;
step 7, according to the values of δ, cs, kk+krs, cbμ, bss and μ, obtain g′_n = G_{n,δ,cs,kk+krs} and d′_n = D_{n,δ,cs,bss,μ} from the current non-uniform memory access node, and obtain Z′ = Z_{kk+krs,bss,μ,δ} from global memory, where g′_n represents the sub-tensor of the G tensor stored on the nth non-uniform memory access node, of size C_l1×K_r×(2×L), d′_n represents the sub-tensor of the D tensor stored on the nth non-uniform memory access node, of size C_l1×B_r×(2×L), and Z′ represents the sub-tensor of the Z tensor evenly distributed over all non-uniform memory access nodes, of size B_r×K_r×(2×L);
Step 8, calculate the complex matrix multiply-accumulate Z′ = Z′ + (d′_n)ᵀ·g′_n, in which the C_l1 dimension is reduced and each scalar product is a complex multiplication over the 2×L tuples;
Step 9, Z' [ B ] r ][K r ][2×L]Value of (2) is stored back to Z kk+krs,bss,μ,δ In (a) and (b);
step 10, calculating kk=kk+1;
step 11, if kk < min(K_l2, Kb_r - kss×K_l2) holds, jump to step 7 to continue processing; otherwise, execute step 12, where K_l2 is a block size in the complex matrix multiplication and min takes the minimum of the two values;
step 12, calculating cbμ=cbμ+cores;
step 13, if cbμ < Bb_r×Kb_l2×γ holds, jump to step 5 to continue processing; otherwise, execute step 14, where Kb_l2 = ⌈Kb_r/K_l2⌉ and ⌈·⌉ represents rounding up;
Step 14, calculating cs=cs+1;
step 15, if cs < Cb_l1 holds, jump to step 4 to continue processing; otherwise, execute step 16;
step 16, calculating delta=delta+1;
step 17, if delta < Pb is true, jumping to step 3 to continue processing, and if delta < Pb is not true, executing step 18;
step 18, the calculation is complete; output the result Z[Kb_r][Bb_r][γ][P][B_r][K_r][2×L].
2. The method of claim 1, wherein performing an inverse fast fourier transform based on the complex matrix multiplication results to obtain an output of a fast fourier convolution algorithm comprises:
extracting, by each processor core, one block of size δ×(⌊δ/2⌋+1) from Z[Kb_r][Bb_r][γ][P][B_r][K_r][2×L], and completing the inverse fast Fourier transform of one δ×δ block, the inverse fast Fourier transform of one block yielding a convolution output result of size (δ-H_f+1)×(δ-W_f+1);
after all processor cores jointly complete the inverse fast Fourier transform of all blocks, the results are spliced into output[B][K][H][W], obtaining the output of the fast Fourier convolution algorithm.
3. An FFT convolution algorithm parallel implementation system based on NUMA affinity, characterized by comprising:
the first fast Fourier transform module is used for performing fast Fourier transform on input data and storing a first fast Fourier transform result to a designated non-uniform memory access node;
the second fast Fourier transform module is used for performing fast Fourier transform on the weights and storing second fast Fourier transform results to the appointed non-uniform memory access nodes;
the parallel complex matrix multiplication module is used for realizing non-uniform memory access level and multi-core level parallel complex matrix multiplication based on the first fast Fourier transform result and the second fast Fourier transform result, and uniformly distributing the complex matrix multiplication result to all non-uniform memory access nodes;
the fast Fourier inverse transformation module is used for performing fast Fourier inverse transformation based on the complex matrix multiplication result to obtain the output of a fast Fourier convolution algorithm; wherein:
the first fast fourier transform module is specifically configured to, when performing fast fourier transform on input data and storing a result of the first fast fourier transform on a specified non-uniform memory access node:
the convolution input[B][C][H][W] is divided into B×C×X×Δ blocks of size δ×δ, where B represents the size of the mini-batch in the convolution calculation, C represents the number of input channels, H and W represent the height and width of the feature maps of the convolution input and output respectively, δ×δ is the partitioned block size, X = ⌈H/(δ-H_f+1)⌉ and Δ = ⌈W/(δ-W_f+1)⌉, H_f and W_f represent the size of the convolution kernel, and ⌈·⌉ represents rounding up;
each processor core individually processes the fast Fourier transform of each δ×δ block; the fast Fourier transform result of each block is divided into 2×L tuples, and all tuples are stored on the designated non-uniform memory access nodes in an evenly distributed manner, where L represents the vector register width of the processor;
after all processor cores complete the fast Fourier transform of all B×C×X×Δ blocks in parallel, the first fast Fourier transform result D[N][Pb][Cb_l1][Bb_r][γ][C_l1][B_r][2×L] is obtained, where N represents the number of non-uniform memory access nodes, Pb represents the total number of tuples after division, Cb_l1 = ⌈C/C_l1⌉, Bb_r = ⌈B/B_r⌉, γ = X×Δ represents the number of divided blocks in each feature map, and C_l1 and B_r are block sizes in the complex matrix multiplication;
the second fast fourier transform module is specifically configured to, when performing fast fourier transform on the weights and storing a second fast fourier transform result on the specified non-uniform memory access node:
the convolution kernel Filter[K][C][H_f][W_f] is padded into K×C blocks of δ×δ size, where K represents the number of output channels;
each processor core individually solves the fast Fourier transform of each δ×δ block; the fast Fourier transform result of each block is divided into 2×L tuples, and all tuples are stored on the designated non-uniform memory access nodes in an evenly distributed manner, where L represents the vector register width of the processor;
after all processor cores complete the fast Fourier transform of all K×C blocks in parallel, the second fast Fourier transform result G[N][Pb][Cb_l1][Kb_r][C_l1][K_r][2×L] is obtained, where Kb_r = ⌈K/K_r⌉ and K_r is a block size in the parallel complex matrix multiplication;
the parallel complex matrix multiplication module is specifically configured to perform the following steps when performing non-uniform memory access level and multi-core parallel complex matrix multiplication based on the first fft result and the second fft result, and evenly distributing the complex matrix multiplication result to all non-uniform memory access nodes:
step 1, input D[N][Pb][Cb_l1][Bb_r][γ][C_l1][B_r][2×L] and G[N][Pb][Cb_l1][Kb_r][C_l1][K_r][2×L]; obtain the number n of the non-uniform memory access node where the current processor core is located, where 0 ≤ n < N, the total number of cores Cores in the current non-uniform memory access node, and the number cid of the processor core within the current non-uniform memory access node, where 0 ≤ cid < Cores; D_n[Pb][Cb_l1][Bb_r][γ][C_l1][B_r][2×L] and G_n[Pb][Cb_l1][Kb_r][C_l1][K_r][2×L] denote the portions of D and G stored on the nth non-uniform memory access node;
step 2, let δ=0;
step 3, letting cs=0;
step 4, letting cbμ=cid;
step 5, according to the formulas kss = ⌊cbμ/(Bb_r×γ)⌋, krs = kss×K_l2, bss = ⌊(cbμ - kss×Bb_r×γ)/γ⌋ and μ = cbμ - kss×Bb_r×γ - bss×γ, solve kss, bss and μ respectively, where ⌊·⌋ represents rounding down;
step 6, let kk=0;
step 7, according to the values of δ, cs, kk+krs, cbμ, bss and μ, obtain g′_n = G_{n,δ,cs,kk+krs} and d′_n = D_{n,δ,cs,bss,μ} from the current non-uniform memory access node, and obtain Z′ = Z_{kk+krs,bss,μ,δ} from global memory, where g′_n represents the sub-tensor of the G tensor stored on the nth non-uniform memory access node, of size C_l1×K_r×(2×L), d′_n represents the sub-tensor of the D tensor stored on the nth non-uniform memory access node, of size C_l1×B_r×(2×L), and Z′ represents the sub-tensor of the Z tensor evenly distributed over all non-uniform memory access nodes, of size B_r×K_r×(2×L);
Step 8, calculate the complex matrix multiply-accumulate Z′ = Z′ + (d′_n)ᵀ·g′_n, in which the C_l1 dimension is reduced and each scalar product is a complex multiplication over the 2×L tuples;
Step 9, Z' [ B ] r ][K r ][2×L]Value of (2) is stored back to Z kk+krs,bss,μ,δ In (a) and (b);
step 10, calculating kk=kk+1;
step 11, if kk < min(K_l2, Kb_r - kss×K_l2) holds, jump to step 7 to continue processing; otherwise, execute step 12, where K_l2 is a block size in the complex matrix multiplication and min takes the minimum of the two values;
step 12, calculating cbμ=cbμ+cores;
step 13, if cbμ < Bb_r×Kb_l2×γ holds, jump to step 5 to continue processing; otherwise, execute step 14, where Kb_l2 = ⌈Kb_r/K_l2⌉ and ⌈·⌉ represents rounding up;
Step 14, calculating cs=cs+1;
step 15, if cs < Cb_l1 holds, jump to step 4 to continue processing; otherwise, execute step 16;
step 16, calculating delta=delta+1;
step 17, if delta < Pb is true, jumping to step 3 to continue processing, and if delta < Pb is not true, executing step 18;
step 18, the calculation is complete; output the result Z[Kb_r][Bb_r][γ][P][B_r][K_r][2×L].
4. The system according to claim 3, wherein the inverse fast Fourier transform module, when performing inverse fast Fourier transform based on the complex matrix multiplication result to obtain the output of the fast Fourier convolution algorithm, is specifically configured to:
extracting, by each processor core, one block of size δ×(⌊δ/2⌋+1) from Z[Kb_r][Bb_r][γ][P][B_r][K_r][2×L], and completing the inverse fast Fourier transform of one δ×δ block, the inverse fast Fourier transform of one block yielding a convolution output result of size (δ-H_f+1)×(δ-W_f+1);
after all processor cores jointly complete the inverse fast Fourier transform of all blocks, the results are spliced into output[B][K][H][W], obtaining the output of the fast Fourier convolution algorithm.
CN202111000202.2A 2021-08-27 2021-08-27 FFT convolution algorithm parallel implementation method and system based on NUMA affinity Active CN113655986B9 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111000202.2A CN113655986B9 (en) 2021-08-27 2021-08-27 FFT convolution algorithm parallel implementation method and system based on NUMA affinity

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111000202.2A CN113655986B9 (en) 2021-08-27 2021-08-27 FFT convolution algorithm parallel implementation method and system based on NUMA affinity

Publications (3)

Publication Number Publication Date
CN113655986A CN113655986A (en) 2021-11-16
CN113655986B true CN113655986B (en) 2023-06-30
CN113655986B9 CN113655986B9 (en) 2023-10-10

Family

ID=78482338

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111000202.2A Active CN113655986B9 (en) 2021-08-27 2021-08-27 FFT convolution algorithm parallel implementation method and system based on NUMA affinity

Country Status (1)

Country Link
CN (1) CN113655986B9 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023182765A1 (en) * 2022-03-21 2023-09-28 Samsung Electronics Co., Ltd. Speech enhancement method and device using fast fourier convolution
CN116401502B (en) * 2023-06-09 2023-11-03 之江实验室 Method and device for optimizing Winograd convolution based on NUMA system characteristics

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103558592A (en) * 2013-10-08 2014-02-05 北京航空航天大学 Satellite-borne SAR echo data simulation method based on MPI parallel computing
CN111143766A (en) * 2019-12-24 2020-05-12 上海寒武纪信息科技有限公司 Method and apparatus for processing two-dimensional complex matrix by artificial intelligence processor
CN112559952A (en) * 2019-09-26 2021-03-26 无锡江南计算技术研究所 Heterogeneous many-core fast Fourier transform method based on sequence layering

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103558592A (en) * 2013-10-08 2014-02-05 北京航空航天大学 Satellite-borne SAR echo data simulation method based on MPI parallel computing
CN112559952A (en) * 2019-09-26 2021-03-26 无锡江南计算技术研究所 Heterogeneous many-core fast Fourier transform method based on sequence layering
CN111143766A (en) * 2019-12-24 2020-05-12 上海寒武纪信息科技有限公司 Method and apparatus for processing two-dimensional complex matrix by artificial intelligence processor

Also Published As

Publication number Publication date
CN113655986A (en) 2021-11-16
CN113655986B9 (en) 2023-10-10

Similar Documents

Publication Publication Date Title
CN106228238B (en) Accelerate the method and system of deep learning algorithm on field programmable gate array platform
TWI748151B (en) Accelerator for neural network computing and execution method thereof
JP6857286B2 (en) Improved performance of neural network arrays
CN111667051B (en) Neural network accelerator applicable to edge equipment and neural network acceleration calculation method
JP7025441B2 (en) Scheduling of neural network processing
CN113655986B (en) FFT convolution algorithm parallel implementation method and system based on NUMA affinity
CN112084038B (en) Memory allocation method and device of neural network
JP7430744B2 (en) Improving machine learning models to improve locality
CN112199636B (en) Fast convolution method and device suitable for microprocessor
CN108304926B (en) Pooling computing device and method suitable for neural network
CN108304925B (en) Pooling computing device and method
US10755169B2 (en) Hybrid non-uniform convolution transform engine for deep learning applications
JP2023109847A (en) Image transformation for machine learning
US11842220B2 (en) Parallelization method and apparatus with processing of neural network model for manycore system
CN113569511A (en) Quantum circuit simulation method and device
CN113037800A (en) Job scheduling method and job scheduling device
CN111523642A (en) Data reuse method, operation method and device and chip for convolution operation
CN116762080A (en) Neural network generation device, neural network operation device, edge device, neural network control method, and software generation program
CN112149047A (en) Data processing method and device, storage medium and electronic device
CN116644804B (en) Distributed training system, neural network model training method, device and medium
Wu et al. Skeletongcn: a simple yet effective accelerator for gcn training
CN115328440A (en) General sparse matrix multiplication implementation method and device based on 2D systolic array
CN116090518A (en) Feature map processing method and device based on systolic operation array and storage medium
CN113986816A (en) Reconfigurable computing chip
KR20220114228A (en) Processor, method for operating the same, and electronic device including the same

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CI03 Correction of invention patent

Correction item: Description

Correct: Paragraphs 0001-0129 of the instruction manual submitted on May 23, 2023

False: Paragraphs 0001-0129 of the instruction manual submitted on the application date

Number: 26-02

Page: ??

Volume: 39
