CN117931430A - Method for realizing DFT performance optimization by processing device and data processing system - Google Patents
- Publication number
- CN117931430A (application CN202311844671.1A)
- Authority
- CN
- China
- Prior art keywords
- processing
- computing device
- subsequence
- matrix
- dft
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5061—Partitioning or combining of resources
- G06F9/5066—Algorithms for mapping a plurality of inter-dependent sub-tasks onto a plurality of physical CPUs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/14—Fourier, Walsh or analogous domain transformations, e.g. Laplace, Hilbert, Karhunen-Loeve, transforms
- G06F17/141—Discrete Fourier transforms
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/40—Transformation of program code
- G06F8/41—Compilation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2209/00—Indexing scheme relating to G06F9/00
- G06F2209/50—Indexing scheme relating to G06F9/50
- G06F2209/5017—Task decomposition
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Software Systems (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Computational Mathematics (AREA)
- Mathematical Analysis (AREA)
- Mathematical Optimization (AREA)
- Pure & Applied Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Discrete Mathematics (AREA)
- Algebra (AREA)
- Databases & Information Systems (AREA)
- Stored Programmes (AREA)
Abstract
The disclosure provides a method and a data processing system for realizing DFT performance optimization with a processing device, wherein the data processing system is used for executing data processing. The data processing system includes a computing device and a storage device, wherein the storage device is configured to store program instructions, and the computing device is configured to load and execute the program instructions such that the computing device performs steps comprising: the computing device splits the input sequence and the W matrix according to a splitting strategy to obtain input subsequences and corresponding W sub-matrices, wherein the W matrix is a conjugate symmetric matrix; the computing device performs reshape operation processing on the input subsequence; the computing device left-multiplies the reshape-processed input subsequence by the W sub-matrix to obtain a first subsequence; and the computing device determines the DFT result from all the first subsequences.
Description
Technical Field
The present disclosure relates generally to the field of intelligent computing, and more particularly to the field of neural networks. More particularly, the present disclosure relates to a method and data processing system for implementing DFT performance optimization with a processing device.
Background
In intelligent computing systems, common operations in neural network model algorithms, such as convolution and pooling, are packaged into operators through a programming framework for direct invocation by programmers. TensorFlow, PyTorch, and the like are currently popular deep learning frameworks. In these programming frameworks, computational graphs are typically used to describe the computation of machine learning algorithms, with tensors representing all data in the computational graph and operators representing the various operations.
As shown in fig. 5a and 5b, a schematic diagram of the DFT implementation flow developed on the processing device is shown. Take an input x with batch B as an example; C denotes the real/imaginary-part dimension and is 1 or 2. The flow comprises the following steps. In step 1, the W matrix is generated; the W matrix in step 1 is generated by a hand-written single kernel, implemented on-chip using sin/cos and vaa instructions. The transpose is implemented by calling cnnlTranspose twice. In step 2, the matmul is implemented by calling cnnlMatmul. In step 3, the subsequences obtained in step 2 are permuted, and in step 4 the real part and the imaginary part are integrated. This step is executed together with step 5 at run time, processing one sub-graph at a time; depending on the available on-chip space, a sub-graph may contain 2 L, 4 L, and so on. In step 5, the results of step 4 are merged, where the task size of each pass corresponds to the size that each processing core in the computing device can process. In step 6, post-processing such as a transposition operation is performed on the result of step 5. In the prior art, the four steps step 3, step 4, step 5 and step 6 each require a separately developed single-kernel implementation. In this scheme, the data is first put into the required layout format using the non-optimized Cooley-Tukey algorithm, and the temporary result needs to be saved in the workspace. In step 2, DFT is performed on the split subsequences, and the temporary result is again saved in the workspace. It can be seen that, in the process of implementing DFT with the computing device, the redundant IO is large and the performance is greatly reduced.
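For illustration only, the following numpy sketch mirrors this prior-art flow on hypothetical small sizes; the cnnlTranspose/cnnlMatmul library calls are replaced by numpy equivalents, and the merge of step 5/step 6 is collapsed into a single FFT over the sub-sequence axis, so it sketches the data flow rather than the actual kernels:

```python
import numpy as np

B, L, m = 2, 8, 4                      # hypothetical batch, split size, exponent
M = 2 ** m
N = L * M
x = np.random.rand(B, N)               # real input, so C (re/im) is 1 here

# step1: generate the L-point W matrix (on chip this uses sin/cos instructions).
n = np.arange(L)
W = np.exp(-2j * np.pi * np.outer(n, n) / L)   # W[k, n], conjugate symmetric

# transpose: lay out the 2^m interleaved sub-sequences as rows of length L.
sub = x.reshape(B, L, M).transpose(0, 2, 1)    # sub[b, r, :] = x[b, r::M]

# step2: matmul -- an L-point DFT of every sub-sequence (temporary -> workspace).
inner = sub @ W.T

# step3/step4: permute and integrate real/imaginary parts via twiddle factors.
r = np.arange(M)[:, None]
k1 = np.arange(L)[None, :]
inner = inner * np.exp(-2j * np.pi * r * k1 / N)

# step5/step6: merge the sub-results over the 2^m axis and reorder the output.
out = np.fft.fft(inner, axis=1)                # butterfly combination
X = out.reshape(B, N)                          # X[k1 + L*k2]
assert np.allclose(X, np.fft.fft(x, axis=-1))
```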
Disclosure of Invention
To at least partially solve one or more of the technical problems mentioned in the background, the present disclosure provides solutions from a number of aspects.
In a first aspect, the present disclosure discloses a data processing system for performing data processing; the data processing system includes: a computing device, a storage device; wherein,
The storage device is configured to store the program instructions;
the computing device configured to load and execute the program instructions such that the computing device performs steps comprising:
The computing device splits the input sequence and the W matrix according to a splitting strategy to obtain an input subsequence and a corresponding W sub-matrix; wherein the W matrix is a conjugate symmetric matrix;
the computing device performs reshape operation processing on the input subsequence;
the computing device left-multiplies the reshape-processed input subsequence by the W sub-matrix to obtain a first subsequence;
The computing means determines the DFT result from all the first sub-sequences.
Preferably, in this embodiment, the computing device is further configured to determine DFT results according to all the first sub-sequences:
according to the size of the storage space of the neuron storage unit of the processor core in the computing device, integrating the real part and the imaginary part of the first subsequence to obtain an integration result;
performing Stockham-based butterfly processing using the integration result to obtain a corresponding second subsequence;
and determining DFT results according to all the second subsequences.
Preferably, in this embodiment, the computing device is further configured to determine DFT results according to all the first sub-sequences:
performing data rearrangement processing on all the first subsequences;
After integrating the real part and the imaginary part of the rearranged first subsequences, merging the integration result according to the task type and the number of the processor cores;
Performing Cooley-Tukey-based butterfly processing on the result of the merging operation to obtain a corresponding second subsequence;
and determining DFT results according to all the second subsequences.
Preferably, in this embodiment, the computing device is further configured to determine DFT results according to all the second sub-sequences:
And performing transposition operation on the second subsequence to obtain a DFT result.
Preferably, in this embodiment, the computing device splits the input sequence using radix-2 decomposition, ensuring that L is the maximum scale that a processor core can process.
In a second aspect, the present disclosure provides a method of implementing DFT performance optimization with a processing device, the method comprising:
The processing device splits the input sequence and the W matrix according to a splitting strategy to obtain an input subsequence and a corresponding W sub-matrix; wherein the W matrix is a conjugate symmetric matrix;
the processing device carries out reshape operation processing on the input subsequence;
the processing device left-multiplies the reshape-processed input subsequence by the W sub-matrix to obtain a first subsequence;
the processing device determines DFT results according to all the first subsequences;
The processing device compiles DFT performance optimization to obtain corresponding binary instruction sequences to be distributed to the computing device for executing corresponding tasks.
Preferably, in this embodiment, the processing device is further configured to determine DFT results according to all the first subsequences:
according to the size of the storage space of the neuron storage unit of the processor core in the computing device, integrating the real part and the imaginary part of the first subsequence to obtain an integration result;
performing Stockham-based butterfly processing using the integration result to obtain a corresponding second subsequence;
and determining DFT results according to all the second subsequences.
Preferably, in this embodiment, the processing device is further configured to determine DFT results according to all the first subsequences:
performing data rearrangement processing on all the first subsequences;
After integrating the real part and the imaginary part of the rearranged first subsequences, merging the integration result according to the task type and the number of the processor cores;
Performing Cooley-Tukey-based butterfly processing on the result of the merging operation to obtain a corresponding second subsequence;
and determining DFT results according to all the second subsequences.
Preferably, in this embodiment, the processing means splits the input sequence using radix-2 decomposition, ensuring that L is the maximum scale that a processor core can process.
Preferably, in the present embodiment, the processing means generates a half-scale W matrix; wherein the half-scale W matrix includes a main diagonal and data on one side of the main diagonal.
Through the above scheme, based on the hardware resource advantage of the multi-core computing device, the redundant IO of the DFT process in the existing scheme is optimized, and the performance is greatly improved.
Drawings
The above, as well as additional purposes, features, and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description when read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which like reference numerals refer to similar or corresponding parts and in which:
FIG. 1 illustrates a block diagram of a board of an embodiment of the present disclosure;
FIG. 2 illustrates a block diagram of a combination processing device according to an embodiment of the present disclosure;
FIG. 3a illustrates a schematic internal architecture of a single processor core of a single core computing device of an embodiment of the present disclosure;
FIG. 3b shows a simplified schematic diagram of the internal structure of a multi-core computing device of an embodiment of the present disclosure;
FIG. 4 shows a simplified schematic of the internal architecture of a computing device when it is multi-core;
FIG. 5a discloses one of the schematic diagrams of the prior art DFT implementation process;
FIG. 5b discloses a second schematic diagram of a prior art DFT implementation;
FIG. 6 shows a flow chart of a method for implementing DFT performance optimization with a processing device;
FIG. 7 shows a schematic diagram of an operation process after modification of the prior art scheme;
FIG. 8 shows a schematic diagram of real and imaginary part determination in a DFT performance optimization scheme based on stockham algorithm;
FIG. 9 shows a schematic diagram of the operations involving the W matrix in a cooley tukey algorithm-based DFT performance optimization scheme;
FIG. 10 shows a schematic diagram of operations involving the W matrix in a stockham algorithm-based DFT performance optimization scheme;
FIG. 11 shows a schematic diagram of a performance optimization scheme involving sram in a DFT performance optimization scheme based on cooley tukey algorithm;
FIG. 12a shows one of the schematic diagrams of DFT performance optimization schemes based on stockham algorithm;
FIG. 12b shows a second schematic diagram of a DFT performance optimization scheme based on stockham algorithm;
FIG. 13 illustrates a block diagram of a hardware configuration of a data processing system in which various aspects of embodiments of the present disclosure may be implemented.
Detailed Description
The following description of the embodiments of the present disclosure will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all, embodiments of the disclosure. Based on the embodiments in this disclosure, all other embodiments that may be made by those skilled in the art without inventive effort are within the scope of the present disclosure.
It should be understood that the terms "first," "second," "third," and "fourth," and the like, as may appear in the claims, specification and drawings of the present disclosure, are used for distinguishing between different objects and not for describing a particular sequential order. The terms "comprises" and "comprising" when used in the specification and claims of the present disclosure, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the present disclosure is for the purpose of describing particular embodiments only, and is not intended to be limiting of the disclosure. As used in the specification and claims of this disclosure, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the present disclosure and claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
As used in this specification and the claims, the term "if" may be interpreted as "when", "once", "in response to a determination" or "in response to detection", depending on the context.
Specific embodiments of the present disclosure are described in detail below with reference to the accompanying drawings.
Exemplary hardware architecture
Fig. 1 shows a schematic structural diagram of a board 10 according to an embodiment of the present disclosure. As shown in fig. 1, the board 10 includes a chip 101, which is a system-on-chip (SoC) integrated with one or more combined processing devices. The combined processing device is an artificial-intelligence computing unit used to support various deep learning and machine learning algorithms and to meet intelligent processing requirements in complex fields such as computer vision, speech, natural language processing, and data mining. In particular, deep learning technology is widely applied in the cloud intelligence field; a notable characteristic of cloud intelligence applications is the large amount of input data and the high requirements on the storage and computing capacity of the platform. The board 10 of this embodiment is suitable for cloud intelligence applications, having large off-chip storage, large on-chip storage, and strong computing capability.
The chip 101 is connected to an external device 103 through an external interface device 102. The external device 103 is, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card, a wifi interface, or the like. The data to be processed may be transferred by the external device 103 to the chip 101 through the external interface means 102. The calculation result of the chip 101 may be transmitted back to the external device 103 via the external interface means 102. The external interface device 102 may have different interface forms, such as PCIe interfaces, etc., according to different application scenarios.
The board 10 also includes a memory device 104 for storing data, which includes one or more memory cells 105. The memory device 104 is connected to the control device 106 and the chip 101 via a bus and transmits data. The control device 106 in the board 10 is configured to regulate the state of the chip 101. To this end, in one application scenario, the control device 106 may include a single chip microcomputer (Micro Controller Unit, MCU).
Fig. 2 is a block diagram showing a combination processing apparatus in the chip 101 of this embodiment. As shown in fig. 2, the combined processing means 20 comprises computing means 201, interface means 202, processing means 203 and storage means 204.
The computing device 201 is configured to perform user-specified operations, primarily implemented as a single-core smart processor or as a multi-core smart processor, to perform deep learning or machine learning computations, which may interact with the processing device 203 through the interface device 202 to collectively accomplish the user-specified operations.
The interface means 202 are used for transmitting data and control instructions between the computing means 201 and the processing means 203. For example, the computing device 201 may obtain input data from the processing device 203 via the interface device 202, writing to a storage device on the chip of the computing device 201. Further, the computing device 201 may obtain control instructions from the processing device 203 via the interface device 202, and write the control instructions into a control cache on the chip of the computing device 201. Alternatively or in addition, the interface device 202 may also read data in the memory device of the computing device 201 and transmit it to the processing device 203.
The processing device 203 is a general-purpose processing device that performs basic control including, but not limited to, data handling and starting and/or stopping the computing device 201. Depending on the implementation, the processing device 203 may be one or more types of processors, such as a central processing unit (CPU), a graphics processing unit (GPU), or another general-purpose and/or special-purpose processor, including but not limited to a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc., and the number thereof may be determined according to actual needs. As previously mentioned, the computing device 201 of the present disclosure may be considered to have a single-core structure or a homogeneous multi-core structure on its own. However, when the computing device 201 and the processing device 203 are considered together, they are considered to form a heterogeneous multi-core structure.
The storage device 204 is used to store data to be processed. It may be a DRAM, for example a DDR memory, typically 16 GB or larger in size, and is used to store data for the computing device 201 and/or the processing device 203.
When the computing device 201 runs a neural network, the processing device 203 is generally required to compile the neural network to obtain an executable file, where the executable file includes device information, that is, which device in the heterogeneous computer system the executable file needs to be executed on. The executable file is assembled and linked to obtain an executable program of the neural network, and the executable program is stored in the storage device 204.
The processing device 203 may read an executable program from a storage location of the executable program and obtain a plurality of tasks of the program according to the executable program. These tasks are distributed via the interface means 202 to the computing means 201 for execution, ultimately obtaining the result of the operation.
Fig. 3a shows a schematic diagram of the internal structure of a processing core when the computing device 201 in fig. 2 is a single-core device. The computing device 301 is configured to process input data such as computer vision, voice, natural language, data mining, etc., and the computing device 301 includes three modules: a control module 31 (also referred to as a controller), an arithmetic module 32 (also referred to as an operator), and a storage module 33 (also referred to as a memory).
The control module 31 is used for coordinating and controlling the operation of the operation module 32 and the storage module 33 to complete the task of deep learning, and comprises a fetch unit (instruction fetch unit, IFU) 311 and an instruction decode unit (instruction decode unit, IDU) 312. The instruction fetching unit 311 is configured to fetch an instruction from the processing device 203, and the instruction decoding unit 312 decodes the fetched instruction and sends the decoded result to the operation module 32 and the storage module 33 as control information.
The operation module 32 includes a vector operation unit 321 and a matrix operation unit 322. The vector operation unit 321 is used for performing vector operations and can support complex operations such as vector multiplication, addition, nonlinear transformation and the like; the matrix operation unit 322 is responsible for the core computation of the deep learning algorithm, i.e., matrix multiplication and convolution.
The storage module 33 is used for storing or transferring related data, and includes a neuron storage unit (Neuron RAM, NRAM) 331, a weight storage unit (Weight RAM, WRAM) 332, and a direct memory access module (direct memory access, DMA) 333. The NRAM 331 is used to store input neurons, output neurons, and intermediate results of computation; the WRAM 332 is used to store the convolution kernels, i.e., the weights, of the deep learning network; the DMA 333 is coupled to the DRAM 204 via the bus 34 and is responsible for data transfer between the computing device 301 and the DRAM 204. It should be noted that the NRAM and WRAM herein may be two memory areas formed by dividing the same memory in a logical memory space, or may be two independent memories, which is not specifically limited herein.
Fig. 3b shows a simplified schematic diagram of the internal architecture of the computing device 201 when it is multi-core. The multi-core computing device may be abstracted using a hierarchical hardware model. As shown, the multi-core computing device may be abstracted into three levels, namely a chip level (Chip) 360, a processor cluster level (Cluster) 370, and a processor core level (Core) 380. The embodiments of the present disclosure mainly relate to the data transfer of the storage units and to the computation units; the drawings and description therefore only briefly show and introduce the related computation structures, and other portions are omitted.
At the chip level, local DDR memory is included on each chip, each processor chip acts as a compute and control unit, and each processor chip includes multiple processors as compute units.
At the processor cluster level, each multiprocessor includes a plurality of accelerator cores as control and computation units, and further has a shared memory SRAM as a memory unit.
At the processor core level, each accelerator core includes an array of local memory and local processing units. NFU refers to a neural arithmetic unit (Neuron Function Unit) for performing convolution calculations. The structure of a single processor core may be similar to the block diagram of a single core computing device shown in FIG. 3a and will not be described in detail herein.
In the multi-core computing device, the storage model includes the board global memory, the SRAM (shared memory) on each Cluster, and the NRAM, WRAM, registers, and the like on each Core. For better performance, the data movement and the memory/computation balance between the memory levels below the Card may be explicitly controlled. The SRAM is included in a memory processing unit (Memory Process Unit Core, abbreviated as MPU or Mem Core). Core refers to an intelligent processing core (Intelligent Process Unit Core, IPU Core or Core for short) in the multi-core computing device; one IPU Core contains an NRAM, a WRAM, an NFU, and the like. Cluster refers to a processor cluster or computing cluster; a multi-core computing device typically comprises several Clusters, and one Cluster comprises 1 Mem Core + N IPU Cores.
Fig. 4 shows a simplified schematic diagram of the internal architecture of the computing device 201 of fig. 2 when it is multi-core. The multi-core computing device may be abstracted using a hierarchical hardware model. As shown, the multi-core computing device 400 is a system-on-chip that includes at least one compute cluster (cluster), each of which in turn includes a plurality of processor cores, in other words, the multi-core computing device 400 is formed in a system-on-chip-compute cluster-processor core hierarchy.
At the system-on-chip level, as shown, the multi-core computing device 400 includes an external memory controller 41, a peripheral communication module 42, an on-chip interconnect module 43, a global synchronization module 44, and a plurality of computing clusters 45.
There may be a plurality of external memory controllers 41, 2 being shown by way of example, for accessing external memory devices (e.g., DRAM 204 in FIG. 2) to read data from or write data to off-chip in response to access requests issued by the processor cores. The peripheral communication module 42 is configured to receive a control signal from the processing device (203 of fig. 2) via the interface device (202 of fig. 2) and to initiate the computing device (201 of fig. 2) to perform a task. The on-chip interconnect module 43 connects the external memory controller 41, the peripheral communication module 42, and the plurality of computing clusters 45 for transmitting data and control signals between the respective modules. The global synchronization module 44 is, for example, a global synchronization barrier controller (global barrier controller, GBC) for coordinating the working progress of each computing cluster to ensure synchronization of information. The plurality of computing clusters 45 are the computing cores of the multi-core computing device 400, 4 on each die being illustratively shown, the multi-core computing device 400 of the present disclosure may also include 8, 16, 64, or even more computing clusters 45 as hardware evolves. The computing clusters 45 are used to efficiently execute the deep learning algorithm.
At the level of the compute clusters, as shown in fig. 4, each compute cluster 45 includes a plurality of processor cores 406 as control and compute units, and a shared memory core 407 as a memory unit. Further, each computing cluster may further include a local synchronization module 412, configured to coordinate the working progress of each processor core in the computing cluster, so as to ensure synchronization of information. The processor cores 406 are illustratively shown as 4, and the present disclosure does not limit the number of processor cores 406.
The storage cores 407 are mainly used for storing and communicating, i.e., storing shared data or intermediate results between the processor cores 406, and executing communication between the compute clusters 45 and the DRAM 204, communication between the compute clusters 45, communication between the processor cores 406, and the like. In other embodiments, the memory core 407 has scalar operation capabilities to perform scalar operations.
The memory core 407 includes a shared memory unit (SRAM) 408, a broadcast bus 409, a compute cluster direct memory access module (cluster direct memory access, CDMA) 410, and a global direct memory access module (global direct memory access, GDMA) 411. The SRAM 408 plays the role of a high-performance data transfer station: data multiplexed between different processor cores 406 in the same computing cluster 45 does not need to be obtained from the DRAM 204 by each processor core 406 separately, but is transferred between the processor cores 406 through the SRAM 408; the memory core 407 only needs to quickly distribute the multiplexed data from the SRAM 408 to the multiple processor cores 406, which improves inter-core communication efficiency and greatly reduces on-chip/off-chip input/output accesses. The broadcast bus 409, the CDMA 410, and the GDMA 411 are used to perform communication between the processor cores 406, communication between the computing clusters 45, and data transfer between the computing clusters 45 and the DRAM 204, respectively.
At the level of the processor cores, the structure of a single processor core may be similar to the block diagram of a single core computing device shown in FIG. 3a, and will not be described in detail herein.
Data processing scheme
A method for implementing DFT performance optimization with a processing device according to an embodiment of the present disclosure is described below, as shown in fig. 6. The method comprises the following steps:
Step 601), the processing device splits the input sequence and the W matrix according to a splitting strategy to obtain an input subsequence and a corresponding W sub-matrix; wherein the W matrix is a conjugate symmetric matrix.
In this technical scheme, in order to reduce the redundant IO amount during the operation, the processing device splits the input sequence using radix-2 decomposition, ensuring that L is the maximum scale that a processor core can process.
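A minimal sketch of such a splitting strategy is given below; the function name split_strategy and the parameter l_max are illustrative assumptions, with l_max standing for the largest sub-sequence a single processor core can process (in practice bounded by the NRAM capacity):

```python
def split_strategy(n: int, l_max: int) -> tuple[int, int]:
    """Radix-2 split of an n-point sequence: n = L * 2**m, with L as large as
    a single processor core can process (l_max is an assumed upper bound)."""
    l, m = n, 0
    while l > l_max and l % 2 == 0:
        l //= 2
        m += 1
    return l, m

# For example, split_strategy(48000, 512) returns (375, 7), i.e. 48000 = 375 * 2**7,
# which is consistent with the [16, 48000] scale discussed later.
```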
As shown in fig. 9, in the conventional DFT implementation, each output data column (rr) and (ri), marked by the red boxes, is conjugate symmetric, and the result of the second half can be derived from the result of the first half; that is, the matrix multiplication with the lower half of W_re and W_im is redundant. Moreover, since the permuted data blocks are discontinuous, each element of the W matrix has to be fetched one by one, so the efficiency is low. By exploiting the conjugate symmetry of the W matrix and loading only the conjugate-symmetric part of the data, the efficiency can be further improved.
When the existing DFT implementation is improved, the Cooley-Tukey algorithm is replaced by the Stockham algorithm, as shown in FIG. 10. Each output data column (rr) and (ri) of the calculated result is conjugate symmetric, and the result of the second half can be derived from the result of the first half; that is, the matrix multiplication with the lower half of W_re and W_im is redundant. With this arrangement, loading only the conjugate-symmetric part of the data is relatively efficient, and the IO amount expected to be saved is related to the splitting result, i.e. L×2^m. Thus, whether the DFT implementation is based on the Cooley-Tukey algorithm or the Stockham algorithm, the processing means generates a half-scale W matrix, where the half-scale W matrix includes the main diagonal and the data on one side of the main diagonal.
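As an illustration of this row-halving idea only (assuming a real input sub-sequence; the exact on-chip storage layout of the half-scale W matrix may differ), a minimal numpy sketch:

```python
import numpy as np

L = 8
x = np.random.rand(L)                      # a real input sub-sequence

# Generate only the upper half of W: rows k = 0 .. L/2 instead of all L rows.
k = np.arange(L // 2 + 1)[:, None]
n = np.arange(L)[None, :]
W_half = np.exp(-2j * np.pi * k * n / L)   # shape (L/2 + 1, L)

y_half = W_half @ x                        # first half of the DFT result
# Row L-k of the full W is the complex conjugate of row k, so for a real input
# the second half of the result is the conjugate of the first half, reversed.
y = np.concatenate([y_half, np.conj(y_half[1:L // 2][::-1])])

assert np.allclose(y, np.fft.fft(x))
```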
Step 602) the processing means performs reshape operation processing on the input sub-sequence.
Fig. 7 is a schematic diagram of the operation process after the improvement of the existing scheme. As can be seen from FIG. 7, the input subsequence [a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x] is processed by reshape into a 3*8 layout: [[a, b, c, d, e, f, g, h], [i, j, k, l, m, n, o, p], [q, r, s, t, u, v, w, x]].
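For reference, the same reshape expressed in numpy (the reshape only reinterprets the layout and moves no data):

```python
import numpy as np

seq = np.array(list("abcdefghijklmnopqrstuvwx"))   # the 24-element subsequence
print(seq.reshape(3, 8))
# [['a' 'b' 'c' 'd' 'e' 'f' 'g' 'h']
#  ['i' 'j' 'k' 'l' 'm' 'n' 'o' 'p']
#  ['q' 'r' 's' 't' 'u' 'v' 'w' 'x']]
```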
Step 603), the processing device left-multiplies the reshape-processed input subsequence by the W sub-matrix to obtain a first subsequence.
In this embodiment, in order to eliminate the transpose overhead of the existing scheme, the matmul mode needs to be modified.
In the existing DFT implementation, the reason for introducing the transpose is as follows: the original practice is that in[batch, L, 2^m, c] is transposed into in[batch, c, 2^m, L], and then a batch_matmul with the W[L, L] matrix is performed, i.e. in[batch, 2^m, L] * W[L, L], once for the real part of W and once for the imaginary part;
The batch matmul can be optimized in the following way: compute W_trans[c, L, L] * in[batch, L, 2^m, c], which eliminates the transpose. Looking at the W matrix generation mechanism, we have W_trans = W, where trans here mainly refers to a row-column interchange and does not involve a sign conversion of the complex values. So W_trans[c, L, L] * in[batch, L, 2^m, c] = W[c, L, L] * in[batch, L, 2^m, c]. In the original DFT implementation, which adopts the Cooley-Tukey algorithm, matmul is called once for the real part and once for the imaginary part; taking RFFT as an example, the input has to be loaded twice: in[batch, 2^m, L] * W_real[L, L] and in[batch, 2^m, L] * W_img[L, L]. If the way the W matrix is generated is changed, the computation becomes in[batch, 2^m, L] * W[L, L, 2], so the input only needs to be loaded once; the repeated input load is avoided and the IO is expected to be reduced by n, the matmul IO going from 4n to 3n. The generation of W must be modified accordingly, and a specific and efficient generation method needs further discussion; however, since W is generated only once per network call, the effect should be small.
If the Stockham algorithm is adopted in the DFT implementation, the real and imaginary parts are likewise produced in one pass, W[2, L, L] * in[batch, L, 2^m, c], so that repeatedly loading the input data is avoided and the IO can be reduced by n (the matmul IO goes from 4n to 3n). The matmul is still divided by real and imaginary parts, i.e. W_real * in_real, W_real * in_imag, W_imag * in_real, W_imag * in_imag, but in this matmul approach the real and imaginary parts are mixed and arranged together, i.e. the ordering becomes [c_w, L, 2^m, c_in], where c_w and c_in are both 2, as shown in fig. 8.
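The following numpy sketch illustrates, on hypothetical small shapes, how stacking the real and imaginary parts of W into a single tensor lets one load of W and one pass over the input produce all four real products, arranged as [c_w, batch, L, 2^m, c_in] (the [c_w, L, 2^m, c_in] ordering of the text, with the batch dimension kept explicit); it illustrates the data layout only and is not the cnnlMatmul kernel itself:

```python
import numpy as np

B, L, M = 2, 8, 4                       # hypothetical batch, split length, 2^m
x = np.random.rand(B, L, M, 2)          # in[batch, L, 2^m, c_in], c_in = 2 (re/im)

k = np.arange(L)
theta = -2 * np.pi * np.outer(k, k) / L
W = np.stack([np.cos(theta), np.sin(theta)])     # W[c_w, L, L], c_w = 2 (re/im)

# A single pass over W and the input yields all four real products,
# arranged as [c_w, batch, L, 2^m, c_in].
prod = np.einsum('wkn,bnmc->wbkmc', W, x)

# Combine the four products into the complex result of the left multiplication W @ x.
re = prod[0, ..., 0] - prod[1, ..., 1]           # Wr*xr - Wi*xi
im = prod[0, ..., 1] + prod[1, ..., 0]           # Wr*xi + Wi*xr

# Check against a direct complex left multiplication.
xc = x[..., 0] + 1j * x[..., 1]
Wc = W[0] + 1j * W[1]
assert np.allclose(re + 1j * im, np.einsum('kn,bnm->bkm', Wc, xc))
```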
Step 604) the processing means determines the DFT result from all the first sub-sequences.
As shown in fig. 7, the existing DFT implementation uses the Cooley-Tukey algorithm, and for the original data to reach the expected layout it has to pass through reshape → transpose → permute. The reason for the permute is that the order along L is expected to be 0,4,2,6,1,5,3,7, whereas it is actually 0,1,2,3,4,5,6,7; the L blocks corresponding to one computed sub-graph therefore have to be gathered L by L at load time, with the memcpy operating on the lowest-dimension data_size. After the transpose is removed, the data layout becomes [L, 2^m], and if a whole L is to be loaded, the lowest-dimension data_size of the memcpy is 1.
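The expected order 0,4,2,6,1,5,3,7 is the bit-reversed order of the indices 0..7; a small illustrative helper (not part of the patented scheme) that produces this order:

```python
def bit_reversed_order(n_bits: int) -> list[int]:
    """Indices 0 .. 2**n_bits - 1 in bit-reversed order, as Cooley-Tukey expects."""
    return [int(format(i, f'0{n_bits}b')[::-1], 2) for i in range(1 << n_bits)]

print(bit_reversed_order(3))   # [0, 4, 2, 6, 1, 5, 3, 7]
```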
In the improved scheme based on the Cooley-Tukey algorithm, as can be seen from fig. 7, the processing device is further configured to determine the DFT result according to all the first subsequences by:
performing data rearrangement processing on all the first subsequences;
after integrating the real part and the imaginary part of the rearranged first subsequences, merging the integration results according to the task type and the number of processor cores;
performing Cooley-Tukey-based butterfly processing on the result of the merging operation to obtain corresponding second subsequences;
and determining DFT results according to all the second subsequences.
In the Cooley-Tukey-based DFT implementation, the inter-layer IO caches of Cooley-Tukey interact between the SRAM 408 and the GDMA; if tasks are reasonably allocated and the SRAM 408 is used, one more layer can be fused and 4n of IO is saved, but using the SRAM 408 also brings data-move (mv) time overhead and increases code complexity, as shown in FIG. 11. The sub-graph that can be placed on-chip is enlarged by a finer findlimit strategy plus spatial multiplexing. At present, the sub-graph size s is increased from 4 to 5, and the performance improves by about 10%; because of the IO bottleneck, it may even be worth adding some redundant computation in exchange for less space usage so as to reduce IO.
As shown in fig. 7, the DFT implementation is improved by replacing the Cooley-Tukey algorithm with the Stockham algorithm. The Stockham algorithm changes the combination order of each L: the combination order of the intermediate butterfly transforms is different, but the final result is consistent with Cooley-Tukey, and the permute process is hidden inside the butterfly transforms. Analysis shows that the performance problem brought by the permute can be solved by adopting the Stockham algorithm. Thus, after the transpose is eliminated, when multiple L are loaded at a time, the lowest dimension can be contiguous to a certain extent; as shown in fig. 7, with the Stockham algorithm the target data blocks are contiguous and can therefore be carried contiguously. Specifically, before the DFT implementation is optimized, the size of the sub-graph that can be placed on-chip is calculated with L as the basic block, as shown by the red box in fig. 7. One core loads [l_sub, 2^m_sub] at a time, then transposes it to [2^m_sub, l_sub] and repeats the original calculation steps; it is expected that 2n of IO can be saved.
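For reference, a minimal single-threaded radix-2 Stockham FFT is sketched below in numpy (power-of-two length assumed); it shows how the reordering is absorbed into the write pattern of each butterfly stage, which is the property exploited here, and it is not the multi-core on-chip implementation of the present disclosure:

```python
import numpy as np

def stockham_fft(x):
    """Radix-2 Stockham autosort FFT: no explicit bit-reversal permutation is
    needed, because each stage writes its outputs in already-sorted order
    while ping-ponging between two buffers. Length must be a power of two."""
    x = np.asarray(x, dtype=complex).copy()
    y = np.empty_like(x)
    n, s = len(x), 1                                 # n: current sub-length, s: stride
    while n > 1:
        m = n // 2
        w = np.exp(-2j * np.pi * np.arange(m) / n)   # stage twiddles
        a = x.reshape(n, s)[:m, :]                   # x[q + s*p]
        b = x.reshape(n, s)[m:, :]                   # x[q + s*(p+m)]
        out = y.reshape(m, 2, s)
        out[:, 0, :] = a + b                         # y[q + s*(2p)]
        out[:, 1, :] = (a - b) * w[:, None]          # y[q + s*(2p+1)]
        x, y = y, x                                  # swap buffers
        n, s = m, s * 2
    return x

x = np.random.rand(8) + 1j * np.random.rand(8)
assert np.allclose(stockham_fft(x), np.fft.fft(x))
```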
Based on the Stockham algorithm, as can be seen from fig. 7, the processing means is further configured to determine the DFT result according to all the first subsequences by:
according to the size of the storage space of the neuron storage unit of the processor core in the computing device, integrating the real part and the imaginary part of the first subsequence to obtain an integration result;
performing Stockham-based butterfly processing using the integration result to obtain a corresponding second subsequence;
and determining DFT results according to all the second subsequences.
In the optimized Stockham algorithm, first consider the scale [16, 48000] delivered by the network. When the 2^m dimension and the L dimension are partitioned at the same time, it is desirable to put down as much of the 2^m dimension as possible: 2^5 × 375 / 2^7 = 93.75, i.e. l_sub = 93.75, which is aligned down to 64 (this may lead to low computational efficiency). That is, the L direction is split into 4 parts, and each IPU core in one cluster computes one part from the first layer to the last layer, while different clusters take different batches. No SRAM is needed in this case, and it is expected that 8n of IO can be saved, as shown in fig. 12a.
This can be generalized as follows: there is no data dependence along the L direction, so when the 2^m direction does not need to be split, the SRAM does not need to be used (in fact, the Cooley-Tukey algorithm before optimization can adopt a similar method, as long as the split along L is guaranteed to complete its computation in one pass without interacting with the DRAM).
For the general scale: when the 2^m direction has to be split into 4 parts or more, using the SRAM allows one more layer of sub-graphs to be placed on-chip, reducing one IO interaction with the DRAM, as shown in FIG. 12b.
Step 605) the processing device compiles the DFT performance optimization to obtain a corresponding binary instruction sequence for distribution to the computing device for execution of the corresponding task.
The description of the embodiment shows that, based on the hardware resource advantage of the multi-core computing device, the redundant IO of the DFT process in the existing scheme is optimized, so that the performance is greatly improved.
FIG. 13 illustrates a block diagram of a data processing system in which various aspects of embodiments of the present disclosure may be implemented. As shown, the data processing system 1300 includes a computing device 1310 and a storage device 1320. The storage device 1320 is configured to store the program instructions; the computing device 1310 is configured to load and execute the program instructions such that the computing device 1310 performs steps comprising: the computing device 1310 splits the input sequence and the W matrix according to a splitting strategy to obtain input subsequences and corresponding W sub-matrices, wherein the W matrix is a conjugate symmetric matrix; the computing device 1310 performs reshape operation processing on the input subsequences; the computing device 1310 left-multiplies the reshape-processed input subsequence by the W sub-matrix to obtain a first subsequence;
the computing device 1310 determines DFT results from all the first sub-sequences.
In the data processing system 1300 of fig. 13, only the constituent elements related to the present embodiment are shown. Thus, it will be apparent to those of ordinary skill in the art that the data processing system 1300 may also include common constituent elements different from those shown in FIG. 13, such as a display.
Data processing system 1300 may correspond to a computing device having various processing functions, such as functions for programming, compiling source code. For example, data processing system 1300 may be implemented as various types of devices, such as a Personal Computer (PC), a server device, a mobile device, and so forth.
In addition, the processing device is configured to execute program instructions to control all functions of computing device 1310. For example, the processing device controls all functions of the computing device 1310 by executing programs stored in the storage device 1320 on the computing device 1310. The processing device may be implemented by a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), an Application Processor (AP), an artificial intelligence processor chip (IPU), etc., provided in computing device 1310. However, the present disclosure is not limited thereto.
The storage device 1320 is hardware for storing various data processed in the computing device 1310. For example, the storage device 1320 may store processed data and data to be processed in the computing device 1310, such as data that has been processed or is to be processed by the processing device, e.g. source code before compilation, assembly instructions after compilation, and the like. Further, the storage device 1320 may store program instructions of applications, drivers, etc. to be driven by the computing device 1310, for example various programs related to a data processing method to be executed by the processing device. The storage device 1320 may be a DRAM, but the present disclosure is not limited thereto. The storage device 1320 may include at least one of volatile memory or nonvolatile memory. The nonvolatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), flash memory, phase-change RAM (PRAM), magnetic RAM (MRAM), resistive RAM (RRAM), ferroelectric RAM (FRAM), and the like. The volatile memory may include dynamic RAM (DRAM), static RAM (SRAM), synchronous DRAM (SDRAM), PRAM, MRAM, RRAM, ferroelectric RAM (FeRAM), and the like. In an embodiment, the storage device 1320 may include at least one of a hard disk drive (HDD), a solid state drive (SSD), a compact flash (CF) card, a secure digital (SD) card, a micro secure digital (Micro-SD) card, a mini secure digital (Mini-SD) card, an extreme digital (xD) card, a cache, or a memory stick.
In summary, specific functions implemented by the storage device 1320 and the processing device of the computing device 1310 provided in the embodiments of the present disclosure may be explained in comparison with the foregoing embodiments in the present disclosure, and may achieve the technical effects of the foregoing embodiments, which will not be repeated herein.
In this embodiment, preferably, the computing device 1310 is further configured to determine DFT results according to all the first sub-sequences:
according to the size of the storage space of the neuron storage unit of the processor core in the computing device, integrating the real part and the imaginary part of the first subsequence to obtain an integration result;
performing Stockham-based butterfly processing using the integration result to obtain a corresponding second subsequence;
and determining DFT results according to all the second subsequences.
In this embodiment, preferably, the computing device is further configured to determine DFT results according to all the first sub-sequences:
performing data rearrangement processing on all the first subsequences;
After integrating the real part and the imaginary part of the rearranged first subsequences, merging the integration result according to the task type and the number of the processor cores;
Performing Cooley-Tukey-based butterfly processing on the result of the merging operation to obtain a corresponding second subsequence;
and determining DFT results according to all the second subsequences.
In this embodiment, preferably, the computing device is further configured to determine DFT results according to all the second sub-sequences:
And performing transposition operation on the second subsequence to obtain a DFT result.
In this embodiment, the computing device 1310 preferably splits the input sequence using radix-2 decomposition, ensuring that L is the maximum scale that a processor core can process.
In an embodiment of the present disclosure, there is also provided a computer-readable storage medium in which program instructions are stored, which, when loaded and executed by a processor, cause the processor to perform the data processing method described in the embodiments of the present disclosure. In an embodiment of the present disclosure, there is also provided a computer program product comprising a computer program or instructions which, when executed by a processor, implement the data processing method described in the embodiments of the present disclosure.
According to different application scenarios, the electronic device or apparatus of the present disclosure may include a server, a cloud server, a server cluster, a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet, an intelligent terminal, a PC device, an internet of things terminal, a mobile terminal, a cell phone, a vehicle recorder, a navigator, a sensor, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a visual terminal, an autopilot terminal, a vehicle, a household appliance, and/or a medical device. The vehicle comprises an aircraft, a ship and/or a vehicle; the household appliances comprise televisions, air conditioners, microwave ovens, refrigerators, electric cookers, humidifiers, washing machines, electric lamps, gas cookers and range hoods; the medical device includes a nuclear magnetic resonance apparatus, a B-mode ultrasonic apparatus, and/or an electrocardiograph apparatus. The electronic device or apparatus of the present disclosure may also be applied to the internet, the internet of things, data centers, energy sources, transportation, public management, manufacturing, education, power grids, telecommunications, finance, retail, construction sites, medical, and the like. Further, the electronic device or apparatus of the present disclosure may also be used in cloud, edge, terminal, etc. application scenarios related to artificial intelligence, big data, and/or cloud computing. In one or more embodiments, a computationally intensive electronic device or apparatus according to aspects of the present disclosure may be applied to a cloud device (e.g., a cloud server), while a less power consuming electronic device or apparatus may be applied to a terminal device and/or an edge device (e.g., a smart phone or camera). In one or more embodiments, the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that appropriate hardware resources can be matched from the hardware resources of the cloud device according to the hardware information of the terminal device and/or the edge device to simulate the hardware resources of the terminal device and/or the edge device, so as to complete unified management, scheduling and collaborative work of an end cloud entity or an edge cloud entity.
It should be noted that, for the sake of brevity, the present disclosure describes some methods and embodiments thereof as a series of actions and combinations thereof, but those skilled in the art will understand that the aspects of the present disclosure are not limited by the order of actions described. Thus, one of ordinary skill in the art will appreciate in light of the present disclosure or teachings that certain steps thereof may be performed in other sequences or concurrently. Further, those skilled in the art will appreciate that the embodiments described in this disclosure may be considered alternative embodiments, i.e., wherein the acts or modules involved are not necessarily required for the implementation of some or some aspects of this disclosure. In addition, the description of some embodiments of the present disclosure is also focused on, depending on the scenario. In view of this, those skilled in the art will appreciate that portions of one embodiment of the disclosure that are not described in detail may be referred to in connection with other embodiments.
In particular implementations, based on the disclosure and teachings of the present disclosure, one of ordinary skill in the art will appreciate that the several embodiments disclosed herein may also be implemented in ways not described herein. For example, for the foregoing embodiments of the electronic device or apparatus, the units are divided according to their logic functions, and there may be other ways of dividing the units in an actual implementation. For another example, multiple units or components may be combined or integrated into another system, or some features or functions in the units or components may be selectively disabled. With regard to the connection relationships between different units or components, the connections discussed above in connection with the figures may be direct or indirect couplings between the units or components. In some scenarios, the foregoing direct or indirect coupling involves a communication connection utilizing an interface, where the communication interface may support electrical, optical, acoustic, magnetic, or other forms of signal transmission.
In the present disclosure, units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units. The aforementioned components or units may be co-located or distributed across multiple network elements. In addition, some or all of the units may be selected, as needed, to achieve the objectives of the embodiments of the disclosure. In addition, in some scenarios, multiple units in embodiments of the disclosure may be integrated into one unit, or each unit may physically exist alone.
In other implementation scenarios, the integrated units may also be implemented in hardware, i.e., as specific hardware circuits, which may include digital circuits and/or analog circuits, etc. The physical implementation of the hardware structure of the circuits may include, but is not limited to, physical devices, which may include, but are not limited to, devices such as transistors or memristors. In view of this, the various types of devices described herein (e.g., computing devices or other processing devices) may be implemented by suitable hardware processors, such as CPUs, GPUs, FPGAs, DSPs, and ASICs. Further, the aforementioned storage unit or storage device may be any suitable storage medium (including a magnetic storage medium or a magneto-optical storage medium, etc.), which may be, for example, a resistive random access memory (RRAM), a dynamic random access memory (DRAM), a static random access memory (SRAM), an enhanced dynamic random access memory (EDRAM), a high bandwidth memory (HBM), a hybrid memory cube (HMC), a ROM, a RAM, etc.
While various embodiments of the present disclosure have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. Numerous modifications, changes, and substitutions will occur to those skilled in the art without departing from the spirit and scope of the present disclosure. It should be understood that various alternatives to the embodiments described herein may be employed in practicing the disclosure. The appended claims are intended to define the scope of the disclosure and therefore to cover all equivalents and alternatives falling within the scope of these claims.
Claims (10)
1. A data processing system for performing data processing, the data processing system comprising: a computing device and a storage device; wherein
the storage device is configured to store program instructions; and
the computing device is configured to load and execute the program instructions, such that the computing device performs steps comprising:
The computing device splits the input sequence and the W matrix according to a splitting strategy to obtain an input subsequence and a corresponding W sub-matrix; wherein the W matrix is a conjugate symmetric matrix;
the computing device performs a reshape operation on the input subsequence;
the computing device left-multiplies the reshaped input subsequence by the W sub-matrix to obtain a first subsequence;
the computing device determines the DFT result from all the first subsequences.
2. The system of claim 1, wherein the computing device is further configured to determine the DFT result from all the first subsequences by:
according to the size of the storage space of the neuron storage unit of the processor core in the computing device, integrating the real part and the imaginary part of the first subsequence to obtain an integration result;
performing Stockham-based butterfly processing on the integration result to obtain a corresponding second subsequence; and
determining the DFT result from all the second subsequences.
3. The system of claim 1, wherein the computing device is further configured to determine the DFT result from all the first subsequences by:
performing data rearrangement processing on all the first subsequences;
integrating the real part and the imaginary part of each rearranged first subsequence, and then merging the integration results according to the task type and the number of processor cores;
performing Cooley-Tukey-based butterfly processing on the result of the merging operation to obtain a corresponding second subsequence; and
determining the DFT result from all the second subsequences.
4. The system of claim 3, wherein the computing device is further configured to determine the DFT result from all the second subsequences by:
performing a transposition operation on the second subsequences to obtain the DFT result.
5. The system of claim 1, wherein the computing device splits the input sequence using a radix-2 decomposition while ensuring that L is the largest scale that a processor core can handle.
6. A method for implementing DFT performance optimization with a processing device, the method comprising:
The processing device splits the input sequence and the W matrix according to a splitting strategy to obtain an input subsequence and a corresponding W sub-matrix; wherein the W matrix is a conjugate symmetric matrix;
the processing device performs a reshape operation on the input subsequence;
the processing device left-multiplies the reshaped input subsequence by the W sub-matrix to obtain a first subsequence;
the processing device determines the DFT result from all the first subsequences;
the processing device compiles the DFT performance optimization into a corresponding binary instruction sequence, which is distributed to a computing device for executing the corresponding task.
7. The method of claim 6, wherein determining the DFT result from all the first subsequences comprises:
according to the size of the storage space of the neuron storage unit of the processor core in the computing device, integrating the real part and the imaginary part of the first subsequence to obtain an integration result;
performing Stockham-based butterfly processing on the integration result to obtain a corresponding second subsequence; and
determining the DFT result from all the second subsequences.
8. The method of claim 6, wherein determining the DFT result from all the first subsequences comprises:
performing data rearrangement processing on all the first subsequences;
integrating the real part and the imaginary part of each rearranged first subsequence, and then merging the integration results according to the task type and the number of processor cores;
performing Cooley-Tukey-based butterfly processing on the result of the merging operation to obtain a corresponding second subsequence; and
determining the DFT result from all the second subsequences.
9. The method of claim 6, wherein the processing device splits the input sequence using a radix-2 decomposition while ensuring that L is the largest scale that a processor core can handle.
10. The method of claim 6, wherein the processing device generates a half-scale W matrix, wherein the half-scale W matrix includes the main diagonal and the data on one side of the main diagonal.
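For orientation, the following is a minimal NumPy sketch of the split-multiply-recombine scheme outlined in claims 1 through 4: the input sequence is reshaped into L-point subsequences, each subsequence is left-multiplied by an L x L DFT sub-matrix to form the first subsequences, twiddle factors are applied, and a second-stage transform recombines the partial results into the full DFT. The function names dft_matrix and split_dft, the choice of NumPy, and the explicit dense second stage are illustrative assumptions; the claimed implementation additionally integrates real and imaginary parts according to the processor core's neuron storage capacity and uses Stockham- or Cooley-Tukey-based butterfly kernels on the target accelerator, none of which this sketch models.

```python
import numpy as np

def dft_matrix(n):
    """Dense n x n DFT matrix W with W[j, k] = exp(-2*pi*i*j*k/n).
    W is symmetric, and row n-j is the complex conjugate of row j; this
    conjugate symmetry is what a half-scale W matrix (claim 10) would exploit."""
    j, k = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    return np.exp(-2j * np.pi * j * k / n)

def split_dft(x, L):
    """Illustrative sketch only: compute the N-point DFT of x by splitting it
    into L-point pieces, left-multiplying the reshaped data by the L x L
    W sub-matrix, and recombining the partial results with twiddle factors."""
    x = np.asarray(x, dtype=complex)
    N = x.size
    assert N % L == 0, "L must divide the input length"
    M = N // L

    A = x.reshape(L, M)                        # A[n1, n2] = x[M*n1 + n2]
    B = dft_matrix(L) @ A                      # first subsequences: L-point DFT of each column
    k1 = np.arange(L).reshape(L, 1)
    n2 = np.arange(M).reshape(1, M)
    C = B * np.exp(-2j * np.pi * k1 * n2 / N)  # twiddle factors between the two stages
    D = dft_matrix(M) @ C.T                    # second stage; D[k2, k1] = X[k1 + L*k2]
    # Reading D out row-major interleaves the outputs as X[k1 + L*k2],
    # which corresponds to the transposition step of claim 4.
    return D.reshape(-1)

# Sanity check against NumPy's FFT (illustrative only):
x = np.random.randn(1024) + 1j * np.random.randn(1024)
assert np.allclose(split_dft(x, L=32), np.fft.fft(x))
```

Claims 2 and 3 replace the dense second stage above with Stockham- or Cooley-Tukey-based butterfly passes and reorganize the complex data into real and imaginary blocks sized to the on-chip neuron storage, which is where the claimed performance optimization lies; the sketch only reproduces the underlying arithmetic.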
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202311844671.1A | 2023-12-28 | 2023-12-28 | Method for realizing DFT performance optimization by processing device and data processing system
Publications (1)
Publication Number | Publication Date
---|---
CN117931430A | 2024-04-26
Family
ID=90756603
Family Applications (1)
Application Number | Title | Priority Date | Filing Date
---|---|---|---
CN202311844671.1A (pending) | Method for realizing DFT performance optimization by processing device and data processing system | 2023-12-28 | 2023-12-28
Country Status (1)
Country | Link
---|---
CN (1) | CN117931430A (en)
Legal Events
Code | Title
---|---
PB01 | Publication
SE01 | Entry into force of request for substantive examination