WO2022022362A1 - Data processing method and device, and storage medium - Google Patents


Info

Publication number
WO2022022362A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
matrix
register file
processed
address
Prior art date
Application number
PCT/CN2021/107658
Other languages
French (fr)
Chinese (zh)
Inventor
Wang Huayong (王华勇)
Original Assignee
ZTE Corporation (中兴通讯股份有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ZTE Corporation (中兴通讯股份有限公司)
Publication of WO2022022362A1


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 Complex mathematical operations
    • G06F 17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization

Definitions

  • the embodiments of the present application relate to the field of communication technologies, and in particular, to a data processing method, device, and storage medium.
  • MIMO Multiple-Input Multiple-Output
  • Wi-Fi wireless Internet access
  • matrix inversion is mainly implemented in an application-specific integrated circuit (ASIC) dedicated, through hard-wired logic, to one or more types of matrix inversion algorithms, which achieves the highest efficiency and optimal power consumption.
  • ASIC application specific integrated circuit
  • An embodiment of the present application provides a data processing method. The method includes the following steps: storing the data in the matrix to be processed included in a received task request in an input register file, wherein the task request includes the data in the matrix to be processed, the dimension of the matrix to be processed, and the number of blocks of the matrix to be processed; determining, according to the dimension of the matrix to be processed and preset mapping information, the target processing instruction storage address corresponding to the matrix to be processed, wherein the mapping information indicates the mapping relationship between matrix dimensions and processing instruction storage addresses, and the target processing instructions stored at the target processing instruction storage address are used to determine the inverse matrix of the matrix to be processed; configuring the computing resources in a resource pool according to the number of blocks of the matrix to be processed, and determining inter-block parallel computing unit information; and sequentially reading the target processing instructions from the target processing instruction storage address, and processing the data stored in the input register file in parallel according to the target processing instructions and the computing units corresponding to the inter-block parallel computing unit information.
  • The embodiment of the present application also proposes a data processing device. The device includes a memory, a processor, a program stored in the memory and runnable on the processor, and a data bus for connection and communication between the processor and the memory; the program implements the steps of the aforementioned method when executed by the processor.
  • An embodiment of the present application provides a storage medium for computer-readable storage, where the storage medium stores one or more programs, and the one or more programs can be executed by one or more processors to implement the steps of the foregoing method.
  • FIG. 1 is a schematic flowchart of a data processing method provided by an embodiment
  • FIG. 2 is a schematic flowchart of an embodiment of a data processing method provided by another embodiment
  • FIG. 3 is a schematic structural diagram of a data processing apparatus provided by an embodiment
  • FIG. 4 is a schematic diagram of a lower triangular matrix storage format provided by an embodiment
  • FIG. 5 is a schematic diagram of a processing instruction format
  • FIG. 6 is a schematic diagram of the access type of the data included in the processing instruction
  • FIG. 7 is a schematic diagram of access channel merging
  • FIG. 8 is a schematic structural diagram of a data processing apparatus provided by another embodiment
  • FIG. 9 is a schematic structural diagram of a data processing device provided by an embodiment.
  • the main purpose of the embodiments of the present application is to provide a data processing method, device, and storage medium, aiming at realizing the function of efficiently determining inverse matrices of various dimensional matrices.
  • FIG. 1 is a schematic flowchart of a data processing method provided by an embodiment. This embodiment is applicable to the scenario of determining the inverse matrix of the received matrix. This embodiment may be executed by a data processing apparatus, which may be implemented in software and/or hardware, and may be integrated into a communication device such as a multi-mode base station or a multi-mode terminal. As shown in FIG. 1, the data processing method provided by this embodiment includes the following steps:
  • Step 101 Store the data in the matrix to be processed included in the received task request in the input register file.
  • the task request includes: the data in the matrix to be processed, the dimension of the matrix to be processed, and the number of blocks of the matrix to be processed.
  • the dimension of the matrix to be processed in this embodiment may be any dimension. For example, 1, 2, 4, 8, 16, 24, and 32, etc.
  • the task request in this embodiment may be sent by other devices to the data processing apparatus, or may be generated by other modules of the data processing apparatus.
  • the number of blocks of the matrix to be processed refers to the number of matrices to be processed, i.e., how many matrix blocks are submitted in one task.
  • the matrix to be processed in this embodiment may be a lower triangular matrix.
  • the matrix inversion operation in massive MIMO plays an important role in the channel detection algorithm and precoding algorithm.
  • Step 102 Determine the target processing instruction storage address corresponding to the matrix to be processed according to the dimension of the matrix to be processed and preset mapping information.
  • the mapping information is used to indicate the mapping relationship between the dimension of the matrix and the storage address of the processing instruction.
  • the target processing instruction stored in the target processing instruction storage address is used to determine the inverse matrix of the matrix to be processed.
  • the application can be analyzed in advance according to its requirements, all matrix dimensions required by the application can be extracted, and the inversion algorithm and procedure for each dimension can be determined according to algorithm performance simulation.
  • the required matrix inversion ranges from 1x1 to 64x64, of which 2x2 uses the direct inversion algorithm, 4x4 uses the block inversion algorithm, and the others use the Cholesky decomposition algorithm.
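As an illustrative, hypothetical sketch of the Cholesky-based route mentioned above (not the patent's hardware implementation), a real symmetric positive-definite matrix A can be factored as A = L * L^T, the triangular factor inverted by forward substitution, and A^-1 formed as L^-T * L^-1; all function names here are assumptions for illustration:

```python
def cholesky(a):
    """Lower-triangular L with a = L @ L^T (real symmetric positive-definite)."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(l[i][k] * l[j][k] for k in range(j))
            if i == j:
                l[i][j] = (a[i][i] - s) ** 0.5  # diagonal: sqrt of remainder
            else:
                l[i][j] = (a[i][j] - s) / l[j][j]
    return l

def invert_lower(l):
    """Invert a lower-triangular matrix by forward substitution."""
    n = len(l)
    inv = [[0.0] * n for _ in range(n)]
    for j in range(n):
        inv[j][j] = 1.0 / l[j][j]
        for i in range(j + 1, n):
            s = sum(l[i][k] * inv[k][j] for k in range(j, i))
            inv[i][j] = -s / l[i][i]
    return inv

def cholesky_inverse(a):
    """A^-1 = L^-T @ L^-1, where A = L @ L^T."""
    n = len(a)
    linv = invert_lower(cholesky(a))
    return [[sum(linv[k][i] * linv[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]
```

For example, `cholesky_inverse([[4.0, 2.0], [2.0, 3.0]])` yields the inverse `[[0.375, -0.25], [-0.25, 0.5]]` of that matrix.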
  • 64x64 can be decomposed into 4 blocks of 16x16, 8 blocks of 8x8, or 64 blocks of 1x1, and combinations such as 8 blocks of 2x2, 8 blocks of 4x4, and so on also need to be supported.
  • these specifications are converted into pseudo-code by means of scripts, mapped to the various hardware resources in the form of processing instructions, and the address ranges of the codes for the various specifications are then counted and written into a configuration file, to be dynamically or statically downloaded to the hardware instruction memory in the data processing device, for example, downloaded to a random access memory (Random Access Memory, RAM).
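The script-generated configuration described above can be sketched as follows; the per-dimension programs and the dictionary layout are illustrative assumptions, not the patent's actual file format:

```python
def build_mapping(programs):
    """Pack per-dimension instruction programs into one instruction-RAM image
    and record each dimension's (start, end) address range as the mapping
    information. programs: {dimension: [instruction, ...]}."""
    ram, mapping, addr = [], {}, 0
    for dim, instrs in sorted(programs.items()):
        mapping[dim] = (addr, addr + len(instrs) - 1)  # inclusive address range
        ram.extend(instrs)
        addr += len(instrs)
    return ram, mapping
```

For example, packing a 2-instruction program for 2x2 and a 3-instruction program for 4x4 yields a 5-word RAM image with mapping `{2: (0, 1), 4: (2, 4)}`.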
  • RAM Random Access Memory
  • the data processing apparatus in this embodiment can load the above configuration file and download processing instructions and mapping information after power-on reset.
  • the processing instruction represents the calculation and control flow, and the mapping information stores the dimension of the matrix, the start address of the processing instruction, and the end address of the processing instruction (ie, the storage address of the processing instruction).
  • the target processing instruction storage address corresponding to the matrix to be processed may be determined according to the dimension of the matrix to be processed and a preset mapping table, that is, the correspondence between the matrix dimension and the storage address of the processing instruction.
  • the target processing instruction stored in the target processing instruction storage address in this embodiment is used to determine the inverse matrix of the matrix to be processed.
  • Step 103 According to the number of blocks of the matrix to be processed, configure the computing resources in the resource pool, and determine the parallel computing unit information between the blocks.
  • the computing resources in the resource pool in this embodiment may include computing resources such as multipliers.
  • the throughput of matrix inversion is mainly determined by the multiplier. According to the number of matrix blocks and dimensions extracted above, the maximum number of multiplication units required to calculate one element is determined, for example, it can be 64 multiplication units. Meanwhile, to support inter-block parallelism, these multiplication units can be grouped. Through the dynamic configuration of various networks, these resources can be organized into a unified large resource pool or into multiple small resource pools.
  • step 103 according to the number of blocks of the matrix to be processed, the computing resources in the resource pool are configured, and the information of the computing units paralleled between the blocks is determined. For example, assuming that the number of blocks of the matrix to be processed is 4, the computing resources in the resource pool can be divided into 4 independent networks, and the 4 independent networks process the data in the 4 matrices to be processed in parallel.
  • the calculation unit information here is used to indicate the parallel calculation units between blocks, and the corresponding relationship between the calculation units and the matrix to be processed.
  • the inter-block parallel computing unit in this embodiment refers to a computing unit that processes data in different matrices to be processed in parallel.
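Configuring the resource pool into inter-block parallel computing units, as in step 103, can be sketched as follows (a simplified assumption: the multiplier count divides evenly by the block count, and units are identified by index):

```python
def partition_resource_pool(num_multipliers, num_blocks):
    """Split a pool of multiplier indices into one independent network per
    matrix block, so the blocks can be inverted in parallel."""
    per_block = num_multipliers // num_blocks
    return [list(range(b * per_block, (b + 1) * per_block))
            for b in range(num_blocks)]
```

With 64 multiplication units and 4 blocks, this yields 4 independent networks of 16 units each, matching the example of dividing the pool into 4 networks that process the 4 matrices in parallel.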
  • step 102 and step 103 have no ordering dependency: they can be executed simultaneously or in either order.
  • Step 104 Read the target processing instructions sequentially from the target processing instruction storage address, and process the data stored in the input register file in parallel according to the target processing instructions and the computing units corresponding to the inter-block parallel computing unit information.
  • the number of target processing instructions stored in the target processing instruction storage address in this embodiment may be multiple.
  • the target processing instructions are sequentially read. After each target processing instruction is read, the data stored in the input register file is processed in parallel according to the target processing instruction and the computing unit corresponding to the parallel computing unit information between the blocks.
  • the target processing instruction includes an access address and a processing method. After the data is read from the input register file according to the access address in the target processing instruction, the read data can be processed according to the processing method included in the target processing instruction.
  • the inverse matrix of the matrix to be processed can be obtained.
  • step 104 may be performed through the following process:
  • Step 1041 Set the current running address as the i-th address in the storage address of the target processing instruction.
  • Step 1042 Process the data stored in the input register file in parallel according to the target processing instruction corresponding to the i-th address and the computing units corresponding to the inter-block parallel computing unit information.
  • the current operation address is the start address in the storage address of the target processing instruction. i is an integer greater than or equal to 0.
  • target processing instructions also include delay cycles.
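The sequential fetch-and-execute of steps 1041 and 1042, including the delay (halt) cycles carried in each instruction, can be sketched as follows; encoding instructions as dictionaries and the `execute` callback are assumptions for illustration:

```python
def run_program(ram, start, end, execute):
    """Fetch instructions ram[start..end] in order, execute each, and stall
    for the instruction's halt cycles (to avoid read/write conflicts).
    Returns the total cycle count."""
    cycles = 0
    addr = start                          # step 1041: current running address
    while addr <= end:
        instr = ram[addr]
        execute(instr)                    # step 1042: drive the computing units
        cycles += 1 + instr.get("halt", 0)  # stall n cycles after this instruction
        addr += 1
    return cycles
```

A two-instruction program where the first instruction carries 2 halt cycles thus takes 4 cycles in total: 1 + 2 for the first instruction, and 1 for the second.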
  • the data processing method provided in this embodiment draws on the ideas of Single Instruction Multiple Data (SIMD) and the ASIC, and can implement a general configurable matrix inversion method and device, so that the programmability of SIMD can be obtained.
  • SIMD Single Instruction Multiple Data
  • the advantages of programmability and scalability can be obtained, as can the ASIC's advantages of low latency, high efficiency, and low power consumption.
  • the data processing method provided in this embodiment can be applied to MIMO technical fields such as 5G wireless mobile communication, deep space communication, optical fiber communication, satellite digital video and audio broadcasting.
  • This embodiment provides a data processing method, including: storing the data in the matrix to be processed included in a received task request in an input register file, wherein the task request includes the data in the matrix to be processed, the dimension of the matrix to be processed, and the number of blocks of the matrix to be processed; determining, according to the dimension of the matrix to be processed and preset mapping information, the target processing instruction storage address corresponding to the matrix to be processed, wherein the mapping information indicates the mapping relationship between matrix dimensions and processing instruction storage addresses, and the target processing instructions stored at the target processing instruction storage address are used to determine the inverse matrix of the matrix to be processed; configuring the computing resources in the resource pool according to the number of blocks of the matrix to be processed, and determining inter-block parallel computing unit information; and sequentially reading the target processing instructions from the target processing instruction storage address, and processing the data stored in the input register file in parallel according to the target processing instructions and the computing units corresponding to the inter-block parallel computing unit information.
  • the storage address of the target processing instruction corresponding to the matrix to be processed can be obtained according to the dimension of the matrix to be processed; at the same time, the computing resources in the resource pool can be configured according to the number of blocks of the matrix to be processed, and the inter-block parallel computing unit information can be determined.
  • the data stored in the input register file is processed in parallel according to the target processing instructions sequentially read from the target processing instruction storage address and the computing units corresponding to the inter-block parallel computing unit information. On the one hand, this provides strong programmability and scalability for processing matrices of various dimensions; on the other hand, data can be processed in parallel, with low latency and high efficiency.
  • FIG. 2 is a schematic flowchart of an embodiment of a data processing method provided by another embodiment.
  • the data processing method provided by this embodiment includes the following steps:
  • Step 201 When it is determined that there is a task request in the task interface, determine whether the input register file is free.
  • Step 202 When it is determined that the input register file is idle, a task request is received, and the state of the input register file is set to a busy state.
  • in step 201, a task request is received only when the input register file is judged to be idle, which can avoid write operation errors.
  • setting the state of the input register file to the busy state can prevent the input register file from being occupied by other tasks and cause errors in the data processing process.
  • the state of the input register file can be set to the busy state while receiving the task request.
  • the state of the input register file can also be set to the busy state before receiving the task request.
  • Step 203 Store the data in the matrix to be processed included in the received task request in the input register file.
  • the task request includes: the data in the matrix to be processed, the dimension of the matrix to be processed, and the number of blocks of the matrix to be processed.
  • the matrix to be processed in this embodiment may be a lower triangular matrix.
  • Step 203 may specifically be: storing the data in the matrix to be processed included in the received task request in the input register file according to the preset storage format of the lower triangular matrix; and copying the data in the matrix to be processed included in the received task request, the copied data being stored according to the storage format of the upper triangular matrix.
  • preprocessing operations such as data alignment may also be performed.
  • FIG. 4 is a schematic diagram of a lower triangular matrix storage format provided by an embodiment. As shown in FIG. 4 , it is assumed that the minimum storage unit of the input register file can store an 8*8 matrix, and the input register file has 4 rows and 4 columns of the minimum storage unit.
  • as shown in (1) in Figure 4, the input register file can store 1 to 64 1x1 matrices; as shown in (2) in Figure 4, the input register file can store 1 to 8 8x8 matrices, with the storage locations shown in the gray area of (2); as shown in (3) in Figure 4, the input register file can store 1 to 4 16x16 matrices, with the storage locations shown in the gray area of (3); as shown in (4) in Figure 4, the input register file can store one 32x32 matrix, with the storage location shown in the gray area of (4). Further, the gray area above the dotted line in (4) can store a 24x24 matrix.
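A packed lower-triangular layout illustrates the storage-format idea in principle; the exact register-file placement of Figure 4 is not reproduced here, and this row-major indexing scheme is an assumption:

```python
def pack_lower(matrix):
    """Row-major packed storage of the lower triangle (elements with i >= j)."""
    return [matrix[i][j] for i in range(len(matrix)) for j in range(i + 1)]

def packed_index(i, j):
    """Offset of element (i, j), i >= j, in the packed layout: rows 0..i-1
    contribute i*(i+1)/2 elements, then j more within row i."""
    return i * (i + 1) // 2 + j
```

An n x n lower-triangular matrix then occupies n(n+1)/2 storage words instead of n^2.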
  • the access to data in the algorithm mainly includes scalar access, row vector access and column vector access.
  • the upper triangle can be stored at the same time, thereby converting column vector access into row vector access and reducing the complexity of data access.
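The mirror-copy idea above can be sketched as follows: after copying the lower triangle into the upper triangle, column j of the lower-triangular matrix equals row j of the mirrored matrix, so a column-vector access becomes a row-vector access (the function name is an illustrative assumption):

```python
def mirror_to_upper(lower):
    """Copy the lower triangle into the upper triangle, so that column j of
    the lower-triangular matrix can be read as row j of the result."""
    n = len(lower)
    full = [row[:] for row in lower]
    for i in range(n):
        for j in range(i):
            full[j][i] = full[i][j]  # mirror element (i, j) to (j, i)
    return full
```

For `[[1, 0], [2, 3]]`, column 0 of the lower-triangular matrix is `[1, 2]`, and after mirroring it is available as row 0 of the result.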
  • the data processing apparatus in this embodiment may further include an output register file.
  • FIG. 3 is a schematic structural diagram of a data processing apparatus according to an embodiment. As shown in FIG. 3 , the data processing device mainly includes five modules including instruction control, task input, computing resource pool, register file and task output.
  • the register file includes: input register file B (regFileB) and output register file C (regFileC).
  • Input register file B and output register file C may be vector register files.
  • in order to reduce data access delay and store intermediate calculation results, two vector register files are used, which enables pipelined data input/output and thereby reduces the overall processing delay.
  • These high-dimensional vector register files can store multiple low-dimensional matrices in parallel, so as to complete the parallel inversion of multiple low-dimensional matrices, thereby improving the utilization of the computing units, improving the throughput of matrix inversion, reducing control overhead, and reducing overall power consumption.
  • the pipelined data input/output means: in the data input stage, the task request can be received as long as the input register file is in an idle state, regardless of whether the output register file is idle; in the data output stage, the state of the input register file can be set to idle regardless of whether the data output is complete.
  • the pipeline operation of data input/output can reduce the overall processing delay of data processing.
  • Step 204 Determine whether the output register file is free.
  • step 203 may be performed after step 204 .
  • After the data is stored, it can be determined whether the output register file is free before starting to process the data.
  • Step 205 When it is determined that the output register file is idle, the state of the output register file is set to a busy state.
  • Step 206 After setting the state of the output register file to the busy state, determine to execute the step of determining the target processing instruction storage address corresponding to the matrix to be processed according to the dimension of the matrix to be processed and the preset correspondence between matrix dimensions and processing instruction storage addresses.
  • Before proceeding to step 207, in order to avoid write errors, the state of the output register file needs to be judged; step 207 is determined to be executed only when the output register file is free. Setting the state of the output register file to the busy state prevents the output register file from being occupied by other tasks and causing errors in the data processing process.
  • Step 207 Determine the target processing instruction storage address corresponding to the matrix to be processed according to the dimension of the matrix to be processed and the preset mapping information.
  • the mapping information is used to indicate the mapping relationship between the dimension of the matrix and the storage address of the processing instruction.
  • the target processing instruction stored at the target processing instruction storage address is used to determine the inverse matrix of the matrix to be processed.
  • B represents the data in the input register file
  • C represents the data in the output register file
  • D represents the execution result of the last target processing instruction
  • A represents the processing results of B and C.
  • FIG. 5 is a schematic diagram of a processing instruction format. As shown in Figure 5, the meanings of the fields in the processing instruction are as follows:
  • Instruction This field includes the above basic fine-grained operators and is mainly used to control the execution of the computing units.
  • Halt cycle Due to the strong dependency between some data in the matrix inversion algorithm, in order to avoid read and write conflicts, waiting cycles need to be inserted between some calculation processes. This field is used to suspend the pipeline for n cycles after executing this instruction.
  • Degree of parallelism This field is used for fine-grained intra-block parallelism, indicating the number of elements processed in parallel within a block. Combined with the coarse-grained inter-block parallelism in the task parameters (i.e., the number of blocks of the matrix to be processed), the computing resources and the various networks can be organized in various forms. For example, if the number of blocks is 2 and the degree of parallelism is 2, the accumulation/scaling/broadcasting network organizes the resource pool into four independent networks spanning two disjoint groups; this configuration can handle the inversion of 2 matrices simultaneously, and each matrix can compute two elements simultaneously.
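The arithmetic of combining the two levels of parallelism can be sketched as follows (a simplifying assumption: the network count divides the multiplier count evenly):

```python
def organize_networks(num_multipliers, num_blocks, parallelism):
    """Combine coarse-grained inter-block parallelism (a task parameter) with
    fine-grained intra-block parallelism (an instruction field): the pool is
    organized into num_blocks * parallelism independent networks, one per
    concurrently computed element. Returns (networks, multipliers each)."""
    networks = num_blocks * parallelism
    return networks, num_multipliers // networks
```

With 64 multipliers, 2 blocks, and a parallelism degree of 2, the pool is organized into 4 independent networks of 16 multipliers each, matching the example above.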
  • Calibration control This field is mainly used for the calculation and propagation of various calibration values in the algorithm process, as well as the selection of calibration values required for fixed point.
  • Source/Destination A This field is mainly used to control the behavior of data A. Through different controls, the operands can be obtained from input register file B or output register file C; the result can also be written back to input register file B or output register file C; at the same time, a constant value, or the conjugate of the fetched data, can also be obtained.
  • Source B and Source C These two fields are used to control the behavior of accessing register files B and C, respectively. With different controls, both row and column access are possible.
  • (From0, To0) represents the location of the data accessed in different resource blocks; (From1, To1) represents the sequence of the accessed data within the same resource block. Here, different resource blocks represent different rows or columns, and the same resource block represents the same row or the same column.
  • FIG. 6 is a schematic diagram of the access types of the data included in processing instructions. As shown in FIG. 6, when the type in the processing instruction is 1, column access is indicated: From0 and To0 represent accessing different columns, and From1 and To1 represent which rows are accessed in the columns corresponding to From0 and To0. For example, (From0, To0) can be (0, 2), which means accessing columns 0 to 2; (From1, To1) can be (2, 4), which means accessing rows 2 to 4 of each of columns 0 to 2.
  • From0 and To0 represent accessing different rows
  • From1 and To1 represent which columns in the row corresponding to From0 and To0 are accessed.
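Decoding a (From0, To0)/(From1, To1) access field into element coordinates can be sketched as follows; the type encoding (1 = column access, otherwise row access, per FIG. 6) is taken from the text, while the function shape is an illustrative assumption:

```python
def decode_access(acc_type, from0, to0, from1, to1):
    """Expand an access field into (row, col) pairs. Type 1 = column access:
    From0..To0 select columns and From1..To1 the rows within each column;
    otherwise row access, with the roles swapped."""
    if acc_type == 1:
        return [(r, c) for c in range(from0, to0 + 1)
                       for r in range(from1, to1 + 1)]
    return [(r, c) for r in range(from0, to0 + 1)
                   for c in range(from1, to1 + 1)]
```

The column-access example from FIG. 6 ((From0, To0) = (0, 2), (From1, To1) = (2, 4)) expands to the 9 elements at rows 2 to 4 of columns 0 to 2.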
  • in this way, the entire processing pipeline of matrix inversion is controlled by instructions, getting rid of the limitation of hard-wired solidification, increasing the flexibility of control, simplifying the control logic, and reducing the complexity of the hardware implementation.
  • Offline instruction programming can encode the matrix inversion processes of various algorithm and dimension combinations according to the application requirements, download them to the instruction RAM statically or dynamically, and configure the mapping relationships into the mapping information at the same time. At run time, the corresponding program is found from the mapping information according to the matrix dimension and executed. Therefore, offline instruction programming not only enhances flexibility and reduces control complexity, but also helps reduce dynamic power consumption.
  • Step 208 According to the number of blocks of the matrix to be processed, configure computing resources in the resource pool, and determine information of computing units paralleled between blocks.
  • There is no timing relationship between step 207 and step 208.
  • Step 209 After sequentially reading the target processing instructions from the target processing instruction storage addresses, determine the parallel computing unit information in the block according to the degree of parallelism in the block.
  • The process of sequentially reading the target processing instructions in step 209 is similar to the specific implementation of step 104 and will not be repeated here.
  • the degree of parallelism included in the target processing instruction in this embodiment refers to the degree of parallelism within a block.
  • the parallel computing unit information in the block can be determined according to the degree of parallelism in the block.
  • the task request includes the number of blocks of the matrix to be processed, and multiple small matrices are spliced into a large matrix and processed in parallel, so as to make full use of the resources, improve resource utilization, and reduce the delay overhead caused by serial processing.
  • matrix inversion is generally calculated element point by element point. If only one element point is processed at a time, although the approach is simple, it does not make sufficient use of the resources, and the delay is relatively large.
  • resources can be organized according to requirements, and multiple element points of the same matrix can be processed in parallel to achieve intra-block parallelism.
  • the intra-block parallelism refers to the number of elements in the same matrix to be processed that are processed in parallel.
  • This embodiment adopts the idea of computing resource pools, and combines computing resources into pools with multiple independent computing resources according to a certain granularity.
  • the computing resource pool can be combined into one or more sets of computing units according to application requirements, to process one large matrix, to process multiple small matrices in parallel, and to process multiple elements of the same matrix in parallel.
  • These resource pools can be combined into multiple parallel processing computing units as required to process multiple matrices or elements in parallel. In this way, since multiple parallel computing units share the same program and control logic, power consumption overhead is reduced, and when processing small matrices, throughput is improved and time delay is reduced.
  • these multiplication units are divided into 8 groups, with 8 in each group.
  • these resources can be organized into a unified large resource pool, or can be organized into a maximum of 8 small resource pools.
  • In step 208, according to the number of blocks of the matrix to be processed in the task request, the various networks are statically configured at a coarse-grained level, and the computing resource pool is organized into computing units for inter-block parallel processing.
  • In step 209, according to the degree of parallelism in the target processing instruction read from the target processing instruction storage address, the various networks and computing units are dynamically adjusted at a fine-grained level, and multiple elements within a block are processed in parallel, thereby controlling the various units to run dynamically as required.
  • Step 210 Process the data stored in the input register file in parallel according to the target processing instruction, the computing unit corresponding to the information of the parallel computing units between the blocks, and the computing unit corresponding to the information of the parallel computing units within the block.
  • the data stored in the input register file may be processed in parallel based on the target processing instruction through the computing unit corresponding to the inter-block parallel computing unit information and the computing unit corresponding to the intra-block parallel computing unit information.
  • the parallel processing process can realize parallel processing among multiple matrix blocks to be processed, and can also realize parallel processing among multiple elements in the same matrix to be processed.
  • the intermediate data acquired during the current processing of the data stored in the input register file can be stored in the output register file or the input register file, and the acquired result data are stored in the output register file.
  • the acquired intermediate data may also be stored in a cache unit of the computing unit. That is, the two register files in this embodiment can be used for input storage, output storage or intermediate temporary storage, respectively. At the same time, according to the number of blocks processed in parallel, multiple small matrices can be stored separately.
  • the target instruction also includes the access address of the target data (ie, the source B and source C fields in FIG. 5 ) and the processing mode of the target data.
  • the access channels of multiple target data can be merged into one access channel; data are read from the input register file and/or the output register file through the merged access channel; the target data are obtained from the read data according to the access addresses of the target data; and, according to the processing method included in the target processing instruction, the computing units corresponding to the inter-block parallel computing unit information and the computing units corresponding to the intra-block parallel computing unit information process the target data in parallel.
  • the computing unit here can be a parallel computing unit between blocks or a parallel computing unit within a block. This embodiment is not limited to this.
  • read control information can be generated. According to the read control information, all the data in the combined access channel are read from the input register file and/or the output register file, and the required target data are then selected from them.
  • relevant preprocessing such as fixed-pointing or conjugation is carried out; the previously collected fixed-pointing data is broadcast or assigned to each fixed-pointing unit as required by the target processing instruction; according to the target processing instruction, the computing units corresponding to the related inter-block parallel computing unit information and the computing units corresponding to the intra-block parallel computing unit information are started, the fixed point is processed dynamically and the fixed-point related data is collected as required by the instruction, and the result is written back to the register file and/or the master unit.
  • the register file stores the data of the lower triangle into the upper triangle at the same time, which is equivalent to making a copy and converts column access into row access. Because simultaneously accessed data is only local, after channel merging the row-read behavior can be confined to a limited number of rows, reducing the number of first-level selectors; after the row data is read, column selection is performed separately. This technique greatly reduces the number of row accesses and thereby a large amount of data-selector logic, which benefits physical back-end implementation.
  • FIG. 7 is a schematic diagram of access channel merging, in which eight 8x8 matrix blocks are completed at the same time. When the processing instruction shown in the figure is executed, eight independent channels would be required to read eight rows of data simultaneously; after the channels are merged, only four channels are needed to read four rows simultaneously. In this way, the number of read selectors can be greatly reduced, which benefits back-end wiring.
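The channel-merging idea can be illustrated with a small sketch (a hypothetical Python model, not the patented hardware; all names are invented): element requests that hit the same row are collapsed into one shared row-read channel, and column selection is performed afterward on the shared row data.

```python
def merge_access_channels(requests):
    """Group element requests (row, col) by row so that requests hitting
    the same row share one read channel; column selection happens after
    the shared row read."""
    channels = {}  # row -> list of requested columns
    for row, col in requests:
        channels.setdefault(row, []).append(col)
    return channels

# Eight units each request one element, but only four distinct rows are
# touched, so four shared row-read channels suffice instead of eight.
requests = [(0, 1), (0, 5), (2, 3), (2, 7), (4, 0), (4, 2), (6, 6), (6, 4)]
channels = merge_access_channels(requests)
```

Here eight independent requests collapse into four merged row reads, mirroring the 8-to-4 channel reduction described for FIG. 7.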
  • in step 209, if it is determined that the current running address equals the instruction end address in the target processing instruction storage address, the execution of the inversion instructions for the matrix to be processed is complete, and the input register file is released. Then, according to the task parameters and the fixed-pointing result of the master control, the output data is post-processed, the relevant result parameters and data are output, and the output register file is set to an idle state, completing the lower triangular inversion of the entire matrix.
  • the input-calculation-output three-stage pipeline is loosely coupled: the stages are only weakly correlated, and execution of a downstream stage is triggered only by a trigger signal.
  • the task input module receives the task request
  • the data in the matrix to be processed included in the received task request is stored in the input register file through the data write interface;
  • the computing units in the resource pool are configured as the inter-block parallel computing unit information; the target processing instruction storage address is read according to the task request and the mapping information (e.g., a mapping table), and the target processing instructions, instruction 0 to instruction N, are obtained; each target processing instruction is executed in sequence through instruction fetching, decoding and control; according to the target processing instruction, the resources in the resource pool are configured at fine granularity to determine the intra-block parallel computing unit information; and the data is processed in parallel by the configured computing units.
  • the resource pool in FIG. 3 includes a Multiply Accumulate (MAC) unit.
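A minimal behavioral model of one unit in such a resource pool may help; this is an illustrative sketch only (the class and method names are invented here, and complex operands are assumed because MIMO matrix data is typically complex-valued):

```python
class MacUnit:
    """Behavioral model of one multiply-accumulate (MAC) unit: acc += a * b."""
    def __init__(self):
        self.acc = 0 + 0j

    def mac(self, a, b):
        self.acc += a * b
        return self.acc

# A dot product of two length-4 complex vectors computed on one MAC unit.
u = MacUnit()
x = [1 + 1j, 2, 0 - 1j, 3]
y = [1, 1 - 1j, 2j, 1]
for a, b in zip(x, y):
    u.mac(a, b)
```

A resource pool would group many such units and route operands to them through the configured broadcast networks.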
  • the data processing method provided by this embodiment overcomes the problem that traditional matrix inversion methods cannot simultaneously satisfy indicators such as versatility, throughput, complexity and low delay. Drawing on the SIMD+ASIC idea, it proposes a highly configurable and universal matrix inversion implementation that can adapt to different protocols and their continuous evolution. At the same time, it combines techniques such as channel merging, dual vector register files, inter-block and intra-block parallelism, and various broadcast networks, and adopts the resource pool idea to reduce implementation cost as well as development and back-end risks, meeting the needs of various algorithms, dimensions and application scenarios.
  • FIG. 8 is a schematic structural diagram of a data processing apparatus provided by another embodiment.
  • the data processing apparatus provided in this embodiment includes the following modules: a storage module 81 , a first determination module 82 , a second determination module 83 , and a processing module 84 .
  • the storage module 81 is configured to store the data in the matrix to be processed included in the received task request in the input register file.
  • the task request includes: the data in the matrix to be processed, the dimension of the matrix to be processed, and the number of blocks of the matrix to be processed.
  • the first determining module 82 is configured to determine the target processing instruction storage address corresponding to the matrix to be processed according to the dimension of the matrix to be processed and preset mapping information.
  • the mapping information is used to indicate the mapping relationship between the dimension of the matrix and the storage address of the processing instruction.
  • the target processing instruction stored in the target processing instruction storage address is used to determine the inverse matrix of the matrix to be processed.
  • the second determination module 83 is configured to configure the computing resources in the resource pool according to the number of blocks of the matrix to be processed, and to determine the parallel computing unit information between the blocks.
  • the processing module 84 is configured to sequentially read the target processing instructions from the target processing instruction storage addresses, and process the data stored in the input register file in parallel according to the target processing instructions and the computing units corresponding to the parallel computing unit information between blocks.
  • the target processing instruction further includes an intra-block degree of parallelism.
  • the apparatus further includes: a third determination module configured to determine the parallel computing unit information within the block according to the degree of parallelism within the block.
  • the processing module 84 is specifically configured to process the data stored in the input register file in parallel according to the target processing instruction, the computing units corresponding to the information of the parallel computing units between the blocks, and the computing units corresponding to the information of the parallel computing units within the blocks.
  • the target processing instruction also includes a delay period.
  • the matrix to be processed is a lower triangular matrix.
  • the storage module 81 is specifically configured to: store the data in the matrix to be processed included in the received task request in the input register file according to the preset storage format of the lower triangular matrix; and copy the data in the matrix to be processed included in the received task request, storing the copied data according to the storage format of the upper triangular matrix.
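The dual lower/upper-triangle storage described above can be sketched as follows (an assumed software model; the function name and row-major layout are illustrative): each lower-triangle element is written twice, once at (i, j) and once mirrored at (j, i), so a column of the lower triangle can later be read as a row.

```python
def store_with_mirror(lower_elems, n):
    """Store a lower triangular matrix and simultaneously mirror it into
    the upper triangle, so that column access to the lower triangle can
    later be served as row access. lower_elems is row-major: row i
    contributes i+1 values."""
    m = [[0] * n for _ in range(n)]
    it = iter(lower_elems)
    for i in range(n):
        for j in range(i + 1):
            v = next(it)
            m[i][j] = v  # lower triangle entry
            m[j][i] = v  # mirrored copy in the upper triangle
    return m

m = store_with_mirror([1, 2, 3, 4, 5, 6], 3)
```

With this layout, reading column 0 of the lower triangle is the same as reading row 0 of the stored matrix, which is the row-access conversion the text relies on.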
  • the storage module 81 is further configured to: store the intermediate data obtained in the current processing of the data stored in the input register file in the output register file or the input register file, and store the obtained result data in the output register file.
  • the target instruction further includes an access address of the target data and a processing method of the target data.
  • the processing module 84 is specifically configured to: if it is determined, according to the access address of the target data, that the multiple target data to be accessed by the computing units lie in the same row or column, combine the access channels of the multiple target data into one access channel; read data from the input register file and/or the output register file according to the combined access channel; obtain the target data from the read data according to its access address; and, according to the processing mode included in the target processing instruction, process the target data in parallel through the computing units corresponding to the inter-block parallel computing unit information and the computing units corresponding to the intra-block parallel computing unit information.
  • the apparatus further includes: a judgment module, configured to judge whether the input register file is idle when it is determined that there is a task request at the task interface; and a receiving and setting module, configured to receive the task request when it is determined that the input register file is idle, and to set the state of the input register file to busy.
  • the judgment module is further configured to judge whether the output register file is free.
  • the apparatus further includes: a setting module and a fourth determination module.
  • the setting module is configured to set the state of the output register file to a busy state when the output register file is determined to be idle.
  • the fourth determination module is configured to, after the state of the output register file is set to the busy state, trigger execution of the step of determining the target processing instruction storage address corresponding to the matrix to be processed according to the dimension of the matrix to be processed and the preset mapping information.
  • the data processing apparatus provided in this embodiment is used to execute the data processing method in any of the foregoing embodiments.
  • the implementation principle and technical effect of the data processing apparatus provided in this embodiment are similar to those of the method embodiments, and details are not repeated here.
  • FIG. 9 is a schematic structural diagram of a data processing device provided by an embodiment.
  • the data processing device includes a processor 91 and a memory 92; the number of processors 91 in the data processing device may be one or more, and one processor 91 is taken as an example in FIG. 9;
  • the processor 91 and the memory 92 can be connected by a bus or other means, and the connection by a bus is taken as an example in FIG. 9 .
  • the memory 92 can be used to store software programs, computer-executable programs, and modules, such as the program instructions/modules corresponding to the data processing method in the embodiments of the present application (for example, the storage module 81, the first determination module 82, the second determination module 83 and the processing module 84 in the data processing apparatus).
  • the processor 91 executes the software programs, instructions, and modules stored in the memory 92 to perform various functional applications and data processing of the data processing device, that is, to implement the above-mentioned data processing method.
  • the memory 92 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the data processing apparatus, and the like. Additionally, memory 92 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device.
  • Embodiments of the present application also provide a storage medium containing computer-executable instructions, and the computer-executable instructions are used to execute a data processing method when executed by a computer processor, and the method includes:
  • storing the data in the matrix to be processed included in the received task request in the input register file, wherein the task request includes: the data in the matrix to be processed, the dimension of the matrix to be processed, and the number of blocks of the matrix to be processed;
  • determining the target processing instruction storage address corresponding to the matrix to be processed according to the dimension of the matrix to be processed and preset mapping information, wherein the mapping information is used to indicate the mapping relationship between the dimension of the matrix and the storage address of the processing instruction, and the target processing instruction stored at the target processing instruction storage address is used to determine the inverse matrix of the matrix to be processed;
  • configuring the computing resources in the resource pool according to the number of blocks of the matrix to be processed, and determining the inter-block parallel computing unit information;
  • sequentially reading the target processing instructions from the target processing instruction storage address, and processing the data stored in the input register file in parallel according to the target processing instructions and the computing units corresponding to the inter-block parallel computing unit information.
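The method steps can be sketched as a minimal control flow. This is a hypothetical software model, not the claimed hardware: all names here (process_task, the instruction memory as a list of callables, the striped group split) are illustrative assumptions.

```python
def process_task(task, mapping, instr_mem, resource_pool, input_regfile):
    """Hypothetical sketch of the method flow:
    store -> look up instruction address -> configure units -> execute."""
    # 1. Store the matrix data from the task request in the input register file.
    input_regfile.extend(task["data"])
    # 2. Map the matrix dimension to the target instruction address range.
    start, end = mapping[task["dim"]]
    # 3. Configure the resource pool: one computing-unit group per matrix block.
    groups = [resource_pool[i::task["blocks"]] for i in range(task["blocks"])]
    # 4. Read the target instructions in sequence; each instruction is applied
    #    to every group (parallelism is modeled as a sequential loop here).
    for instr in instr_mem[start:end]:
        for g in groups:
            instr(g, input_regfile)
    return groups

log = []
instr_mem = [lambda g, rf: log.append((tuple(g), len(rf)))]
mapping = {4: (0, 1)}
groups = process_task({"data": [1, 2], "dim": 4, "blocks": 2},
                      mapping, instr_mem, ["u0", "u1", "u2", "u3"], [])
```

The split into groups models the inter-block parallel computing unit information: each matrix block gets its own group of units drawn from the shared pool.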
  • in the storage medium containing computer-executable instructions provided by the present application, the computer-executable instructions are not limited to the above method operations, and can also perform related operations in the data processing method provided by any embodiment of the present application.
  • the various embodiments of the present application may be implemented in hardware or special purpose circuits, software, logic, or any combination thereof.
  • some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software that may be executed by a controller, microprocessor or other computing device, although the application is not limited thereto.
  • the division between functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be executed cooperatively by several physical components.
  • Some or all physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor or microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit.
  • Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media).
  • Computer storage media include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for the storage of information such as computer-readable instructions, data structures, program modules or other data.
  • Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer.
  • Communication media typically embody computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism, and can include any information delivery media, as is well known to those of ordinary skill in the art.


Abstract

A data processing method and device, and a storage medium, which relate to the technical field of communications. The method comprises: storing data in a matrix to be processed comprised in a received task request into an input register file (101); according to the dimension of the matrix and preset mapping information, determining a target processing instruction storage address corresponding to the matrix (102); according to the number of blocks of the matrix, configuring a computational resource in a resource pool, and determining inter-block parallel computing unit information (103); and sequentially reading a target processing instruction from the target processing instruction storage address, and, according to the target processing instruction and a computing unit corresponding to the inter-block parallel computing unit information, processing in parallel the data stored in the input register file (104).

Description

Data Processing Method, Device and Storage Medium
Cross Reference
This application is based on, and claims priority to, the Chinese patent application with application number 202010761369.X filed on July 31, 2020, the entire content of which is incorporated herein by reference.
Technical Field
The embodiments of the present application relate to the field of communication technologies, and in particular, to a data processing method, device, and storage medium.
Background
Multiple-Input Multiple-Output (MIMO) technology uses multiple antennas (or antenna arrays) and multiple channels to transmit and receive signals simultaneously at both the transmitter and the receiver. Transmitting and receiving over multiple antenna groups greatly improves data transmission efficiency and signal stability, and thereby qualitatively improves network quality, network speed and user capacity. Therefore, MIMO is a core technology of fifth-generation mobile communication networks and Wi-Fi 6, and is central to future wireless network development.
Signal processing in MIMO requires extensive matrix operations, of which matrix inversion is the most complex. Owing to the conjugate symmetry of matrices in MIMO, the inversion of lower triangular matrices is particularly important: the computation speed of lower triangular matrix inversion directly affects the real-time performance of the MIMO system, and thus the performance of the entire system.
There are many matrix inversion methods, and different methods suit different dimensions with different performance. For example, a 2x2 matrix can be inverted by directly expanding the inversion formula; a 4x4 matrix can be inverted either by computing the adjoint matrix by definition or by the block inversion decomposition algorithm from matrix theory; and for dimensions of 6x6 and above, a simplified Cholesky decomposition algorithm can be used. In the era of massive MIMO, matrices of various dimensions, including high-dimensional matrices, must be inverted. The computational complexity of matrix inversion grows as a power of the dimension, so for large matrices the complexity and amount of computation are very high. In short, matrix inversion in MIMO must support many algorithms and large dimensions, has high computational complexity, and must meet short processing times.
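The dimension-dependent algorithm selection described above can be sketched as a small dispatch function (illustrative only; the thresholds follow the 2x2 / 4x4 / 6x6-and-above scheme given in the text, and the return labels are invented names):

```python
def select_inversion_algorithm(n):
    """Pick an inversion algorithm by matrix dimension, mirroring the
    scheme described in the text: direct expansion for dimensions up to
    2, block inversion for up to 4, Cholesky-based decomposition above."""
    if n <= 2:
        return "direct"
    if n <= 4:
        return "block"
    return "cholesky"
```

In the described system, this choice is made offline: each dimension's routine is compiled to processing instructions, and the mapping table selects the routine at run time.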
At present, matrix inversion is mainly implemented with application-specific integrated circuits (ASICs), which are hard-wired to process one or several types of matrix inversion algorithms and can achieve optimal efficiency and power consumption.
However, in this approach, the matrix types an ASIC can process are limited, and its scalability and tailorability are poor.
Summary of the Invention
An embodiment of the present application provides a data processing method, including the following steps: storing the data in a matrix to be processed included in a received task request in an input register file, wherein the task request includes the data in the matrix to be processed, the dimension of the matrix to be processed, and the number of blocks of the matrix to be processed; determining a target processing instruction storage address corresponding to the matrix to be processed according to the dimension of the matrix to be processed and preset mapping information, wherein the mapping information indicates the mapping relationship between matrix dimensions and processing instruction storage addresses, and the target processing instructions stored at the target processing instruction storage address are used to determine the inverse matrix of the matrix to be processed; configuring the computing resources in a resource pool according to the number of blocks of the matrix to be processed, and determining inter-block parallel computing unit information; and sequentially reading the target processing instructions from the target processing instruction storage address, and processing the data stored in the input register file in parallel according to the target processing instructions and the computing units corresponding to the inter-block parallel computing unit information.
An embodiment of the present application further provides a data processing device, including a memory, a processor, a program stored in the memory and runnable on the processor, and a data bus for connection and communication between the processor and the memory, wherein the program, when executed by the processor, implements the steps of the foregoing method.
An embodiment of the present application provides a storage medium for computer-readable storage, storing one or more programs that can be executed by one or more processors to implement the steps of the foregoing method.
Description of Drawings
FIG. 1 is a schematic flowchart of a data processing method provided by an embodiment;
FIG. 2 is a schematic flowchart of a data processing method provided by another embodiment;
FIG. 3 is a schematic structural diagram of a data processing apparatus provided by an embodiment;
FIG. 4 is a schematic diagram of a lower triangular matrix storage format provided by an embodiment;
FIG. 5 is a schematic diagram of a processing instruction format;
FIG. 6 is a schematic diagram of access types of data included in a processing instruction;
FIG. 7 is a schematic diagram of access channel merging;
FIG. 8 is a schematic structural diagram of a data processing apparatus provided by another embodiment;
FIG. 9 is a schematic structural diagram of a data processing device provided by an embodiment.
Detailed Description
It should be understood that the specific embodiments described herein are only used to explain the embodiments of the present application and are not intended to limit them.
In the following description, suffixes such as "module", "component" or "unit" used to denote elements are only intended to facilitate the description of the embodiments of the present application and have no specific meaning in themselves. Thus, "module", "component" and "unit" may be used interchangeably.
The main purpose of the embodiments of the present application is to provide a data processing method, device and storage medium, aiming to efficiently determine the inverse matrices of matrices of various dimensions.
FIG. 1 is a schematic flowchart of a data processing method provided by an embodiment. This embodiment applies to the scenario of determining the inverse matrix of a received matrix. It may be executed by a data processing apparatus, which may be implemented in software and/or hardware and may be integrated into a communication device such as a multi-mode base station or a multi-mode terminal. As shown in FIG. 1, the data processing method provided by this embodiment includes the following steps:
Step 101: Store the data in the matrix to be processed included in the received task request in the input register file.
The task request includes: the data in the matrix to be processed, the dimension of the matrix to be processed, and the number of blocks of the matrix to be processed.
The dimension of the matrix to be processed in this embodiment may be any dimension, for example, 1, 2, 4, 8, 16, 24 or 32. The task request may be sent to the data processing apparatus by another device, or generated by another module of the apparatus. The number of blocks of the matrix to be processed refers to the number of matrices to be processed.
Optionally, the matrix to be processed in this embodiment may be a lower triangular matrix. In massive MIMO, matrix inversion plays an important role in channel detection and precoding algorithms.
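For reference, the inverse of a nonsingular lower triangular matrix can be computed by forward substitution, solving L * X = I column by column. This is a plain software sketch of the mathematical operation, not the hardware flow claimed by this application:

```python
def invert_lower_triangular(L):
    """Invert a nonsingular lower triangular matrix by forward
    substitution: solve L * X = I column by column. The inverse of a
    lower triangular matrix is itself lower triangular."""
    n = len(L)
    X = [[0.0] * n for _ in range(n)]
    for j in range(n):
        X[j][j] = 1.0 / L[j][j]
        for i in range(j + 1, n):
            s = sum(L[i][k] * X[k][j] for k in range(j, i))
            X[i][j] = -s / L[i][i]
    return X

L = [[2.0, 0.0, 0.0], [1.0, 3.0, 0.0], [4.0, 5.0, 6.0]]
X = invert_lower_triangular(L)
# L times X should give the identity matrix.
prod = [[sum(L[i][k] * X[k][j] for k in range(3)) for j in range(3)]
        for i in range(3)]
```

The inner sum over k is exactly the kind of multiply-accumulate workload that the resource pool of MAC units is sized for.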
Step 102: Determine the target processing instruction storage address corresponding to the matrix to be processed according to the dimension of the matrix to be processed and preset mapping information.
The mapping information indicates the mapping relationship between matrix dimensions and processing instruction storage addresses. The target processing instructions stored at the target processing instruction storage address are used to determine the inverse matrix of the matrix to be processed.
In this embodiment, the application can be analyzed in advance according to its requirements, all matrix dimensions needed by the application can be extracted, and the inversion algorithm and flow for each dimension can be determined through algorithm performance simulation. For example, 5G MIMO processing requires matrix inversion from 1x1 to 64x64, with 2x2 handled by a direct inversion algorithm, 4x4 by a block inversion algorithm, and the rest by the Cholesky decomposition algorithm. Meanwhile, a 64x64 task can be decomposed into 4x16x16, 8x8x8 or 64x1x1, and configurations such as 8x2x2 and 8x4x4 must also be supported.
According to the algorithm, these specifications are converted into pseudo-code by scripts and mapped to the various hardware resources in the form of processing instructions; the address ranges of the codes for the various specifications are then collected into a configuration file, to be downloaded dynamically or statically into the hardware instruction memory of the data processing apparatus, for example, into a random access memory (RAM).
After power-on reset, the data processing apparatus can load the above configuration file and download the processing instructions and the mapping information. The processing instructions represent the calculation and control flow, and the mapping information stores the mapping relationship among the matrix dimension, the processing instruction start address and the processing instruction end address (i.e., the processing instruction storage address).
In step 102, the target processing instruction storage address corresponding to the matrix to be processed can be determined according to the dimension of the matrix to be processed and a preset mapping table, i.e., the correspondence between matrix dimensions and processing instruction storage addresses. The target processing instructions stored at this address are used to determine the inverse matrix of the matrix to be processed.
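The mapping information can be modeled as a table from matrix dimension to an instruction address range; the addresses and routine comments below are invented purely for illustration:

```python
# Hypothetical mapping table downloaded at reset: for each supported matrix
# dimension, the start and end addresses of its inversion routine in the
# instruction RAM. All addresses are made up for this sketch.
MAPPING = {
    2: (0x000, 0x020),   # direct inversion routine
    4: (0x020, 0x060),   # block inversion routine
    8: (0x060, 0x100),   # Cholesky-based routine
}

def target_instruction_address(dim):
    """Resolve the target processing instruction storage address (start
    and end) for a matrix dimension; unsupported dimensions are rejected."""
    if dim not in MAPPING:
        raise ValueError(f"unsupported matrix dimension: {dim}")
    return MAPPING[dim]
```

At run time, the controller then fetches and executes instructions from start to end, which is the address-equality check used to detect completion in step 209.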
步骤103:根据待处理矩阵的块数,配置资源池中的计算资源,确定块间并行的计算单元信息。Step 103: According to the number of blocks of the matrix to be processed, configure the computing resources in the resource pool, and determine the parallel computing unit information between the blocks.
本实施例中的资源池中的计算资源可以包括乘法器等计算资源。矩阵求逆的吞吐量主要由乘法器决定,根据如上提取的矩阵块数和维度,确定计算一个元素最大需要的乘法单元的数量,例如,可以为64个乘法单元。同时,为了支持块间并行,可以将这些乘法单元分组。通过各种网络的动态配置,这些资源,即可以组织成一个统一的1个大资源池,也可以组织成多个小资源池。The computing resources in the resource pool in this embodiment may include computing resources such as multipliers. The throughput of matrix inversion is mainly determined by the multiplier. According to the number of matrix blocks and dimensions extracted above, the maximum number of multiplication units required to calculate one element is determined, for example, it can be 64 multiplication units. Meanwhile, to support inter-block parallelism, these multiplication units can be grouped. Through the dynamic configuration of various networks, these resources can be organized into a unified large resource pool or into multiple small resource pools.
在步骤103中,根据待处理矩阵的块数,配置资源池中的计算资源,确定块间并行的计算单元信息。举例来说,假设待处理矩阵的块数为4个,则可以将资源池中的计算资源分为4个独立的网络,这4个独立的网络并行处理4个待处理矩阵中的数据。In step 103, according to the number of blocks of the matrix to be processed, the computing resources in the resource pool are configured, and the information of the computing units paralleled between the blocks is determined. For example, assuming that the number of blocks of the matrix to be processed is 4, the computing resources in the resource pool can be divided into 4 independent networks, and the 4 independent networks process the data in the 4 matrices to be processed in parallel.
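A minimal sketch of this configuration step, assuming the resource pool is simply a list of multiplier units (the real apparatus reconfigures hardware networks, not Python lists):

```python
def partition_resource_pool(multiplier_units, num_blocks):
    """Split the pool of multiplier units into `num_blocks` independent
    groups, one per to-be-processed matrix block, so the blocks can be
    processed in parallel.  Illustrative software model only."""
    group_size = len(multiplier_units) // num_blocks
    return [multiplier_units[i * group_size:(i + 1) * group_size]
            for i in range(num_blocks)]
```

For the example in the text, 4 blocks yield 4 independent groups, each processing the data of one matrix.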
这里的计算单元信息用于指示块间并行的计算单元,以及,计算单元与待处理矩阵的对应关系。本实施例中的块间并行的计算单元指的是并行处理不同的待处理矩阵中的数据的计算单元。The calculation unit information here is used to indicate the parallel calculation units between blocks, and the corresponding relationship between the calculation units and the matrix to be processed. The inter-block parallel computing unit in this embodiment refers to a computing unit that processes data in different matrices to be processed in parallel.
需要说明的是，步骤102与步骤103的执行过程没有时序关系。两者可以同时执行，也可以以任意的顺序执行。It should be noted that there is no ordering constraint between step 102 and step 103; they can be executed simultaneously or in either order.
步骤104:从目标处理指令存储地址中依次读取目标处理指令,并根据目标处理指令以及块间并行的计算单元信息对应的计算单元,并行处理输入寄存器文件中存储的数据。Step 104: Read the target processing instructions sequentially from the target processing instruction storage addresses, and process the data stored in the input register file in parallel according to the target processing instructions and the computing units corresponding to the computing unit information paralleled between blocks.
本实施例中的目标处理指令存储地址中存储的目标处理指令的个数可以为多个。在步骤104中，依次读取目标处理指令。在每读取到一个目标处理指令之后，根据目标处理指令以及块间并行的计算单元信息对应的计算单元，并行处理输入寄存器文件中存储的数据。The number of target processing instructions stored at the target processing instruction storage address in this embodiment may be more than one. In step 104, the target processing instructions are read sequentially. After each target processing instruction is read, the data stored in the input register file is processed in parallel according to the target processing instruction and the computing units corresponding to the inter-block parallel computing unit information.
目标处理指令中包括访问地址及处理方式等，可以根据目标处理指令中的访问地址，从输入寄存器文件中读取到数据后，根据目标处理指令中包括的处理方式，对该读取到的数据进行处理。The target processing instruction includes an access address, a processing mode, and the like. After data is read from the input register file according to the access address in the target processing instruction, the read data is processed according to the processing mode included in the target processing instruction.
本实施例中,在执行完目标处理指令存储地址中的所有目标处理指令之后,即可以获取到待处理矩阵的逆矩阵。In this embodiment, after all the target processing instructions in the storage address of the target processing instruction are executed, the inverse matrix of the matrix to be processed can be obtained.
一实施例中,可以通过以下过程执行步骤104:In one embodiment, step 104 may be performed through the following process:
步骤1041:将当前运行地址设置为目标处理指令存储地址中的第i个地址。Step 1041: Set the current running address as the i-th address in the storage address of the target processing instruction.
步骤1042:根据第i个地址对应的目标处理指令以及块间并行的计算单元信息对应的计算单元,并行处理输入寄存器文件中存储的数据。Step 1042 : Process the data stored in the input register file in parallel according to the target processing instruction corresponding to the ith address and the computing unit corresponding to the computing unit information in parallel between the blocks.
步骤1043：在处理完第i个地址对应的目标处理指令之后，在确定当前运行地址不等于目标处理指令存储地址中的指令结束地址时，确定i=i+1，返回执行将当前运行地址设置为目标处理指令存储地址中的第i个地址的步骤。Step 1043: After the target processing instruction corresponding to the i-th address has been processed, and when it is determined that the current running address is not equal to the instruction end address among the target processing instruction storage addresses, set i=i+1 and return to the step of setting the current running address to the i-th address among the target processing instruction storage addresses.
需要说明的是,在初始运行时,当前运行地址为目标处理指令存储地址中的开始地址。i为大于或者等于0的整数。It should be noted that, during initial operation, the current operation address is the start address in the storage address of the target processing instruction. i is an integer greater than or equal to 0.
更进一步地，由于矩阵求逆算法中某些数据之间有较强的依赖关系，为了避免读写冲突，某些计算流程之间需要插入等待周期。基于该需求，目标处理指令还包括延迟周期。上述步骤1043的实现过程具体为：在确定当前运行地址不等于目标处理指令存储地址中的指令结束地址之后，在延迟第i个地址对应的目标处理指令中包括的延迟周期之后，确定i=i+1，返回执行将当前运行地址设置为目标处理指令存储地址中的第i个地址的步骤。举例来说，假设i=3，目标处理指令存储地址中的指令结束地址为第11个地址，在处理完第3个地址对应的目标处理指令之后，在确定当前运行地址不等于目标处理指令存储地址中的指令结束地址时，在延迟第3个地址对应的目标处理指令中包括的延迟周期之后，确定i=4，返回执行将当前运行地址设置为目标处理指令存储地址中的第4个地址的步骤。Furthermore, since some data in the matrix inversion algorithm have strong dependencies, waiting cycles need to be inserted between certain calculation steps to avoid read-write conflicts. Based on this requirement, the target processing instruction further includes a delay cycle. The implementation of step 1043 above is specifically: after determining that the current running address is not equal to the instruction end address among the target processing instruction storage addresses, and after waiting for the delay cycles included in the target processing instruction corresponding to the i-th address, set i=i+1 and return to the step of setting the current running address to the i-th address among the target processing instruction storage addresses. For example, assume i=3 and the instruction end address among the target processing instruction storage addresses is the 11th address. After the target processing instruction corresponding to the 3rd address has been processed, when it is determined that the current running address is not equal to the instruction end address, and after the delay cycles included in the target processing instruction corresponding to the 3rd address have elapsed, i is set to 4, and execution returns to the step of setting the current running address to the 4th address among the target processing instruction storage addresses.
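Steps 1041 to 1043, including the halt (delay) cycles, amount to a simple fetch-execute loop. A sketch, with `fetch`, `execute`, and `wait` as hypothetical callbacks standing in for the hardware:

```python
def run_program(start_addr, end_addr, fetch, execute, wait):
    """Walk the instruction addresses in order, execute each instruction,
    and honour its halt (delay) cycles before advancing, mirroring steps
    1041-1043.  Sketch only; the instruction encoding is hypothetical."""
    addr = start_addr                  # initially, the program start address
    while True:
        instr = fetch(addr)            # step 1041/1042: read and execute
        execute(instr)
        if addr == end_addr:           # step 1043: stop at the end address
            break
        wait(instr.get("halt", 0))     # insert the instruction's wait cycles
        addr += 1                      # i = i + 1, continue with next address
```

Once the loop reaches the instruction end address, all target processing instructions have been executed and the inverse matrix is available.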
本实施例提供的数据处理方法，借鉴了单指令多数据流（Single Instruction Multiple Data，SIMD）和ASIC的思想，可以实现通用可配置矩阵求逆方法和装置，从而，既可获得SIMD的可编程性和可扩展性的优点，也可获得ASIC低时延、高效率和低功耗的优点。The data processing method provided in this embodiment draws on the ideas of Single Instruction Multiple Data (SIMD) and ASIC design, and can implement a general, configurable matrix inversion method and apparatus, thereby obtaining both the programmability and scalability advantages of SIMD and the low-latency, high-efficiency, low-power advantages of an ASIC.
本实施例提供的数据处理方法可以应用于5G无线移动通信、深空通信、光纤通信、卫星数字视频和音频广播等MIMO技术领域。The data processing method provided in this embodiment can be applied to MIMO technical fields such as 5G wireless mobile communication, deep space communication, optical fiber communication, satellite digital video and audio broadcasting.
本实施例提供一种数据处理方法，包括：将接收到的任务请求中包括的待处理矩阵中的数据存储在输入寄存器文件中，其中，任务请求包括：待处理矩阵中的数据、待处理矩阵的维度以及待处理矩阵的块数；根据待处理矩阵的维度，以及预置的映射信息，确定待处理矩阵对应的目标处理指令存储地址，其中，映射信息用于指示矩阵的维度与处理指令存储地址之间的映射关系，目标处理指令存储地址中存储的目标处理指令用于确定待处理矩阵的逆矩阵；根据待处理矩阵的块数，配置资源池中的计算资源，确定块间并行的计算单元信息；从目标处理指令存储地址中依次读取目标处理指令，并根据目标处理指令以及块间并行的计算单元信息对应的计算单元，并行处理输入寄存器文件中存储的数据。该数据处理方法，可以根据待处理矩阵的维度，获取到待处理矩阵对应的目标处理指令存储地址，同时，还可以根据待处理矩阵的块数，配置资源池中的计算资源，确定块间并行的计算单元信息，之后，根据目标处理指令存储地址中依次读取到的目标处理指令以及块间并行的计算单元信息对应的计算单元，并行处理输入寄存器文件中存储的数据，一方面，可以实现处理多种维度的矩阵，可编程性和可扩展性强，另一方面，可以并行处理数据，时延较低、效率较高。This embodiment provides a data processing method, including: storing the data in a to-be-processed matrix included in a received task request in an input register file, where the task request includes the data in the to-be-processed matrix, the dimension of the to-be-processed matrix, and the number of blocks of the to-be-processed matrix; determining, according to the dimension of the to-be-processed matrix and preset mapping information, a target processing instruction storage address corresponding to the to-be-processed matrix, where the mapping information indicates the mapping relationship between matrix dimensions and processing instruction storage addresses, and the target processing instructions stored at the target processing instruction storage address are used to determine the inverse of the to-be-processed matrix; configuring computing resources in a resource pool according to the number of blocks of the to-be-processed matrix, and determining inter-block parallel computing unit information; and sequentially reading the target processing instructions from the target processing instruction storage address, and processing the data stored in the input register file in parallel according to the target processing instructions and the computing units corresponding to the inter-block parallel computing unit information. With this data processing method, the target processing instruction storage address corresponding to the to-be-processed matrix can be obtained from its dimension, and at the same time the computing resources in the resource pool can be configured according to its number of blocks to determine the inter-block parallel computing unit information; the data stored in the input register file is then processed in parallel according to the sequentially read target processing instructions and the corresponding computing units. On the one hand, matrices of various dimensions can be processed, with strong programmability and scalability; on the other hand, data can be processed in parallel, with low latency and high efficiency.
图2为另一实施例提供的数据处理方法实施例的流程示意图。本实施例在图1所示实施例及各种可选方案的基础上,对数据处理方法包括的其他步骤作一详细说明。如图2所示,本实施例提供的数据处理方法包括如下步骤:FIG. 2 is a schematic flowchart of an embodiment of a data processing method provided by another embodiment. In this embodiment, on the basis of the embodiment shown in FIG. 1 and various optional solutions, other steps included in the data processing method are described in detail. As shown in Figure 2, the data processing method provided by this embodiment includes the following steps:
步骤201:在确定任务接口存在任务请求时,判断输入寄存器文件是否为空闲。Step 201: When it is determined that there is a task request in the task interface, determine whether the input register file is free.
步骤202:在确定输入寄存器文件空闲时,接收任务请求,并将输入寄存器文件的状态设置为忙状态。Step 202: When it is determined that the input register file is idle, a task request is received, and the state of the input register file is set to a busy state.
在步骤201中,在判断输入寄存器文件为空闲时,接收任务请求,可以避免写操作错误。同时,将输入寄存器文件的状态设置为忙状态,可以避免输入寄存器文件被其他任务占用而导致数据处理过程出错。为了提高效率,可以在接收任务请求的同时,将输入寄存器文件的状态设置为忙状态。当然,也可以在接收任务请求之前,将输入寄存器文件的状态设置为忙状态。In step 201, when it is judged that the input register file is idle, a task request is received, which can avoid a write operation error. At the same time, setting the state of the input register file to the busy state can prevent the input register file from being occupied by other tasks and cause errors in the data processing process. To improve efficiency, the state of the input register file can be set to the busy state while receiving the task request. Of course, the state of the input register file can also be set to the busy state before receiving the task request.
步骤203:将接收到的任务请求中包括的待处理矩阵中的数据存储在输入寄存器文件中。Step 203: Store the data in the matrix to be processed included in the received task request in the input register file.
其中,任务请求包括:待处理矩阵中的数据、待处理矩阵的维度以及待处理矩阵的块数。The task request includes: the data in the matrix to be processed, the dimension of the matrix to be processed, and the number of blocks of the matrix to be processed.
可选地，本实施例中的待处理矩阵可以为下三角矩阵。步骤203具体可以为：按照预设的下三角矩阵的存储格式，将接收到的任务请求中包括的待处理矩阵中的数据存储在输入寄存器文件中；拷贝接收到的任务请求中包括的待处理矩阵中的数据，并按照上三角矩阵的存储格式存储拷贝的数据。Optionally, the to-be-processed matrix in this embodiment may be a lower triangular matrix. Step 203 may specifically be: storing the data in the to-be-processed matrix included in the received task request in the input register file according to a preset lower-triangular-matrix storage format; and copying the data in the to-be-processed matrix included in the received task request, and storing the copied data according to an upper-triangular-matrix storage format.
可选地,在将数据存储之后,还可以进行数据拉齐等预处理操作。Optionally, after the data is stored, preprocessing operations such as data alignment may also be performed.
图4为一实施例提供的下三角矩阵存储格式的示意图。如图4所示，假设输入寄存器文件最小存储单元可以存储8*8的矩阵，并且，该输入寄存器文件有4行、4列该最小存储单元。那么，如图4中(1)图所示，该输入寄存器文件可以存储1~64个1x1矩阵；如图4中(2)图所示，该输入寄存器文件可以存储1~8个8x8矩阵，存储的位置如(2)图中的灰色区域所示；如图4中(3)图所示，该输入寄存器文件可以存储1~4个16x16矩阵，存储的位置如(3)图中的灰色区域所示；如图4中(4)图所示，该输入寄存器文件可以存储1个32x32矩阵，存储的位置如(4)图中的灰色区域所示。进一步地，(4)图中的虚线之上的灰色区域可以存储1个24x24矩阵。FIG. 4 is a schematic diagram of a lower-triangular-matrix storage format provided by an embodiment. As shown in FIG. 4, assume that the minimum storage unit of the input register file can store an 8*8 matrix, and that the input register file has 4 rows and 4 columns of such minimum storage units. Then, as shown in diagram (1) of FIG. 4, the input register file can store 1 to 64 1x1 matrices; as shown in diagram (2), it can store 1 to 8 8x8 matrices, at the locations shown by the gray area in diagram (2); as shown in diagram (3), it can store 1 to 4 16x16 matrices, at the locations shown by the gray area in diagram (3); and as shown in diagram (4), it can store one 32x32 matrix, at the location shown by the gray area in diagram (4). Further, the gray area above the dotted line in diagram (4) can store one 24x24 matrix.
通过分析矩阵求逆的各种算法,算法中对数据的访问,主要有标量访问、行矢量访问以及列矢量访问。对于下三角矩阵求逆,可以同时存储上三角,从而把列矢量访问转化为行矢量访问,降低数据访问的复杂度。By analyzing various algorithms of matrix inversion, the access to data in the algorithm mainly includes scalar access, row vector access and column vector access. For the inversion of the lower triangular matrix, the upper triangle can be stored at the same time, thereby converting column vector access into row vector access and reducing the complexity of data access.
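A sketch of this dual storage, assuming the matrix is real-valued and represented as nested lists (the physical layout of FIG. 4 is not modeled here):

```python
def store_lower_triangular(L):
    """Store a lower-triangular matrix and simultaneously mirror it into the
    upper triangle, so that a column read of the lower triangle becomes a
    row read of the mirrored copy.  `L` is given as rows of increasing
    length; illustrative software model only."""
    n = len(L)
    full = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            full[i][j] = L[i][j]   # lower-triangular storage format
            full[j][i] = L[i][j]   # mirrored copy, upper-triangular format
    return full
```

After this copy, column j of the lower triangle can be fetched as the row-j slice `full[j][j:]`, which is the access-complexity reduction described above.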
可选地,本实施例中的数据处理装置中还可以包括输出寄存器文件。Optionally, the data processing apparatus in this embodiment may further include an output register file.
图3为一实施例提供的数据处理装置的结构示意图。如图3所示,该数据处理装置主要包括指令控制、任务输入、计算资源池、寄存器文件以及任务输出等五大模块。FIG. 3 is a schematic structural diagram of a data processing apparatus according to an embodiment. As shown in FIG. 3 , the data processing device mainly includes five modules including instruction control, task input, computing resource pool, register file and task output.
寄存器文件包括:输入寄存器文件B(regFileB)和输出寄存器文件C(regFileC)。输入寄存器文件B和输出寄存器文件C可以为矢量寄存器文件。The register file includes: input register file B (regFileB) and output register file C (regFileC). Input register file B and output register file C may be vector register files.
为降低数据访问时延和存储中间计算过程，采用两个矢量寄存器堆（即，矢量寄存器文件），可以兼顾实现数据输入/输出的流水操作，从而降低整体的处理时延。这些高维度的矢量寄存器堆，可以并行存储多个低维度的矩阵，从而完成多个低维度矩阵的并行求逆，从而提高计算单元的利用率，提高矩阵求逆吞吐量和降低控制开销，降低整体功耗。To reduce data-access latency and to store intermediate calculation results, two vector register banks (i.e., vector register files) are used, which also enables pipelined data input/output and thereby reduces the overall processing latency. These high-dimensional vector register banks can store multiple low-dimensional matrices in parallel, so that multiple low-dimensional matrices can be inverted in parallel, improving the utilization of the computing units, increasing matrix inversion throughput, reducing control overhead, and lowering overall power consumption.
兼顾实现数据输入/输出的流水操作意为：在数据输入阶段，在输入寄存器文件为空闲状态时，不论输出寄存器文件是否为空闲，均可以实现接收任务请求；在数据输出阶段，不论数据是否输出完成，均可将输入寄存器文件的状态设置为空闲状态。数据输入/输出的流水操作可以降低数据处理的整体处理时延。Supporting pipelined data input/output means: in the data input stage, as long as the input register file is idle, a task request can be accepted regardless of whether the output register file is idle; in the data output stage, the state of the input register file can be set to idle regardless of whether the data output has completed. Pipelined data input/output reduces the overall processing latency.
步骤204:判断输出寄存器文件是否为空闲。Step 204: Determine whether the output register file is free.
可选地,步骤203可以在步骤204之后执行。在将数据存储之后,在开始处理数据之前,可以判断输出寄存器文件是否为空闲。Optionally, step 203 may be performed after step 204 . After the data is stored, it can be determined whether the output register file is free before starting to process the data.
步骤205:在确定输出寄存器文件空闲时,将输出寄存器文件的状态设置为忙状态。Step 205: When it is determined that the output register file is idle, the state of the output register file is set to a busy state.
步骤206：在将输出寄存器文件的状态设置为忙状态之后，确定执行根据待处理矩阵的维度，以及预置的矩阵维度与处理指令存储地址的对应关系，确定待处理矩阵对应的目标处理指令存储地址的步骤。Step 206: After the state of the output register file is set to busy, determine to execute the step of determining, according to the dimension of the to-be-processed matrix and the preset correspondence between matrix dimensions and processing instruction storage addresses, the target processing instruction storage address corresponding to the to-be-processed matrix.
在进行步骤207之前,为了避免写错误,需要判断输出寄存器文件的状态。在确定输出寄存器文件为空闲时,确定执行步骤207。将输出寄存器文件的状态设置为忙状态,可以避免输出寄存器文件被其他任务占用而导致数据处理过程出错。Before proceeding to step 207, in order to avoid writing errors, it is necessary to judge the state of the output register file. When it is determined that the output register file is free, step 207 is determined to be executed. Setting the state of the output register file to the busy state can prevent the output register file from being occupied by other tasks and causing errors in the data processing process.
步骤207:根据待处理矩阵的维度,以及预置的映射信息,确定待处理矩阵对应的目标处理指令存储地址。Step 207: Determine the target processing instruction storage address corresponding to the matrix to be processed according to the dimension of the matrix to be processed and the preset mapping information.
其中,映射信息用于指示矩阵的维度与处理指令存储地址之间的映射关系。目标处理执行存储地址中存储的目标处理指令用于确定待处理矩阵的逆矩阵。The mapping information is used to indicate the mapping relationship between the dimension of the matrix and the storage address of the processing instruction. The target process executes the target process instruction stored in the memory address for determining the inverse of the matrix to be processed.
本实施例中通过分析矩阵求逆的各种算法,提取出各种求逆算法所必须的基础细粒度算子。主要包括如下几类:In this embodiment, basic fine-grained operators necessary for various inversion algorithms are extracted by analyzing various algorithms for matrix inversion. Mainly include the following categories:
编码 (coding)    指令 (instruction)    功能 (function)
4'h0             MUL                   A=B*C
4'h1             MAC                   A=∑(B*C)
4'h2             SMAC                  A=A-∑(B*C)
4'h3             ZMAC                  A=0-∑(B*C)
4'h4             MACM                  A=∑(B*C)*D
4'h5             SMACM                 A=(A-∑(B*C))*D
4'h6             ZMACM                 A=(0-∑(B*C))*D
4'h7             DIV                   A=1/A
4'h8             DMACM                 A=1/∑(B*C)
4'h9             DSMACM                A=1/(A-∑(B*C))
4'hA             DZMACM                A=1/(0-∑(B*C))
4'hB             MOVE                  数据搬移指令 (data move instruction)
4'hC             FIX                   定点化指令 (fixed-point instruction)
其中,B表示输入寄存器文件中的数据,C表示输出寄存器文件中的数据,D表示上一次目标处理指令的执行结果,A表示B和C的处理结果。Among them, B represents the data in the input register file, C represents the data in the output register file, D represents the execution result of the last target processing instruction, and A represents the processing results of B and C.
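As a reference for the table above, a few of the fine-grained operators can be expressed directly (B and C are plain Python sequences here; the real operands are register-file vectors, and the real arithmetic is fixed-point):

```python
def execute_operator(op, A, B, C, D=1):
    """Reference semantics for several operators from the table, where B and
    C are element vectors, A is the accumulator/destination, and D is the
    previous result.  Illustrative only."""
    s = sum(b * c for b, c in zip(B, C))   # the shared ∑(B*C) term
    if op == "MAC":        # A = ∑(B*C)
        return s
    if op == "SMAC":       # A = A - ∑(B*C)
        return A - s
    if op == "MACM":       # A = ∑(B*C) * D
        return s * D
    if op == "DSMACM":     # A = 1 / (A - ∑(B*C))
        return 1 / (A - s)
    raise NotImplementedError(op)
```

The remaining entries (ZMAC, SMACM, DIV, etc.) follow the same pattern with different combinations of the subtraction, multiplication, and reciprocal steps.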
通过分析矩阵求逆算法流程，其控制相对比较简单，而且属于大规模并行计算。本实施例采用离线编程，既简化控制逻辑，又增强灵活性。Analysis of the matrix inversion algorithm flow shows that its control is relatively simple and that it is a massively parallel computation. This embodiment adopts offline programming, which both simplifies the control logic and enhances flexibility.
图5为处理指令格式的示意图。如图5所示,该处理指令中的各字段的意义如下:FIG. 5 is a schematic diagram of a processing instruction format. As shown in Figure 5, the meanings of the fields in the processing instruction are as follows:
指令:包括以上的基本细粒度算子,主要用于控制计算单元的执行。Instructions: including the above basic fine-grained operators, which are mainly used to control the execution of computing units.
Halt周期:由于矩阵求逆算法中某些数据之间有较强的依赖关系,为避免读写冲突,某些计算流程之间需要插入等待周期。该字段用于在执行完此条指令后,流水线暂停n个周期。Halt cycle: Due to the strong dependency between some data in the matrix inversion algorithm, in order to avoid read and write conflicts, waiting cycles need to be inserted between some calculation processes. This field is used to suspend the pipeline for n cycles after executing this instruction.
并行度：该字段用于细粒度的块内并行度，表示块内并行处理的元素个数，结合任务参数中的粗粒度的块间并行度（即待处理矩阵的块数），可以把计算资源和各种网络组织成各种形式。比如，如果块数为2，并行度为2，则累加/定标/广播网络会把资源池组织成四个独立的网络，累加/定标/广播功能仅在独立网络内部执行，并不会跨越两个互不相干的独立网络，这个可以同时处理2个矩阵的求逆，并且每个矩阵可同时计算两个元素。Parallelism: This field specifies the fine-grained intra-block parallelism, i.e., the number of elements processed in parallel within a block. Combined with the coarse-grained inter-block parallelism in the task parameters (i.e., the number of blocks of the to-be-processed matrix), the computing resources and the various networks can be organized in various forms. For example, if the number of blocks is 2 and the parallelism is 2, the accumulation/scaling/broadcast network organizes the resource pool into four independent networks; accumulation/scaling/broadcast functions are executed only inside one independent network and never span two unrelated networks. This configuration can invert 2 matrices simultaneously, computing two elements of each matrix at the same time.
定标控制:该字段主要用于算法流程中的各种定标值的计算、传播以及定点化需要的定标值的选择等。Calibration control: This field is mainly used for the calculation and propagation of various calibration values in the algorithm process, as well as the selection of calibration values required for fixed point.
源/目的A：该字段主要用于控制数据A的行为。通过不同的控制，可以从输入寄存器文件B或输出寄存器文件C中获取操作数；也可以把结果回写到输入寄存器文件B或输出寄存器文件C中；同时，还可以获取常量值或所取数据的共轭值。Source/Destination A: This field mainly controls the behavior of data A. Through different controls, an operand can be fetched from input register file B or output register file C; the result can also be written back to input register file B or output register file C; in addition, a constant value, or the conjugate of the fetched data, can be obtained.
源B和源C：这两个字段分别用于控制访问寄存器文件B和C的行为。通过不同的控制，既可以行访问，也可以列访问。其中类型表示访问类型：类型=0，表示行访问；类型=1，表示列访问。(From0,To0)表示不同资源块所访问数据的位置；(From1,To1)表示相同资源块内所访问数据的序列。其中，不同资源块表示不同的行或者列，相同资源块表示同一行或者同一列。Source B and Source C: These two fields control the behavior of accesses to register files B and C, respectively. Through different controls, either row access or column access is possible. The type field indicates the access type: type=0 indicates row access; type=1 indicates column access. (From0, To0) indicates the positions of the data accessed across different resource blocks, and (From1, To1) indicates the range of the data accessed within one resource block. Here, different resource blocks correspond to different rows or columns, and one resource block corresponds to one row or one column.
图6为处理指令中包括的数据的访问类型的示意图。如图6所示，在处理指令中的类型为1时，表示列访问。From0、To0表示访问不同的列，From1、To1表示访问From0、To0对应的列中的哪几行。举例来说，(From0,To0)可以为(0,2)，表示访问第0列至第2列；(From1,To1)可以为(2,4)，表示访问第0列至第2列中每列的第2行至第4行。FIG. 6 is a schematic diagram of the access types of the data included in a processing instruction. As shown in FIG. 6, when the type in the processing instruction is 1, it indicates column access. From0 and To0 indicate which columns are accessed, and From1 and To1 indicate which rows of the columns selected by From0 and To0 are accessed. For example, (From0, To0) may be (0, 2), indicating that columns 0 through 2 are accessed, and (From1, To1) may be (2, 4), indicating that rows 2 through 4 of each of columns 0 through 2 are accessed.
如图6所示,在处理指令中的类型为0时,表示行访问。From0、To0表示访问不同的行,From1、To1表示访问From0、To0对应的行中的哪几列。As shown in FIG. 6, when the type in the processing instruction is 0, it indicates a row access. From0 and To0 represent accessing different rows, and From1 and To1 represent which columns in the row corresponding to From0 and To0 are accessed.
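The (type, From0, To0, From1, To1) addressing can be modeled as follows (a hypothetical software sketch; the ranges are assumed inclusive, matching the example above):

```python
def addresses(access_type, from0, to0, from1, to1):
    """Expand a (type, From0, To0, From1, To1) access field into (row, col)
    coordinates.  Type 0 (row access) reads columns From1..To1 of rows
    From0..To0; type 1 (column access) reads rows From1..To1 of columns
    From0..To0.  Illustrative only."""
    if access_type == 0:   # row access
        return [(r, c) for r in range(from0, to0 + 1)
                       for c in range(from1, to1 + 1)]
    else:                  # column access
        return [(r, c) for c in range(from0, to0 + 1)
                       for r in range(from1, to1 + 1)]
```

For the column-access example in the text (columns 0-2, rows 2-4), this yields nine element coordinates.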
根据以上几个字段的不同组合,通过译码,控制着矩阵求逆的整个处理流水过程,从而摆脱硬连线固化的限制,增加控制的灵活性,简化控制逻辑,降低硬件实现复杂度。According to the different combinations of the above fields, through decoding, the entire processing pipeline of matrix inversion is controlled, so as to get rid of the limitation of hard-wired solidification, increase the flexibility of control, simplify the control logic, and reduce the complexity of hardware implementation.
离线指令编程可以根据应用需求，把多种算法和维度组合的矩阵求逆过程编码，并静态和动态下载到指令RAM，同时把映射关系配置到映射信息中，根据任务参数中的待处理矩阵的维度，从映射信息中查找到相应程序并执行。因此，离线指令编程不仅增强灵活性、降低控制复杂度，而且有益于动态功耗的降低。Offline instruction programming can, according to application requirements, encode matrix inversion processes for various combinations of algorithms and dimensions, download them statically or dynamically to the instruction RAM, and at the same time configure the corresponding relationships into the mapping information; according to the dimension of the to-be-processed matrix in the task parameters, the corresponding program is looked up in the mapping information and executed. Therefore, offline instruction programming not only enhances flexibility and reduces control complexity, but also helps reduce dynamic power consumption.
步骤208:根据待处理矩阵的块数,配置资源池中的计算资源,确定块间并行的计算单元信息。Step 208 : According to the number of blocks of the matrix to be processed, configure computing resources in the resource pool, and determine information of computing units paralleled between blocks.
步骤207与步骤208之间,没有时序关系。There is no timing relationship between step 207 and step 208 .
步骤209:从目标处理指令存储地址中依次读取目标处理指令之后,根据块内并行度,确定块内并行的计算单元信息。Step 209: After sequentially reading the target processing instructions from the target processing instruction storage addresses, determine the parallel computing unit information in the block according to the degree of parallelism in the block.
步骤209中依次读取目标处理指令的过程与步骤104的具体实现过程相类似，此处不再赘述。The process of sequentially reading the target processing instructions in step 209 is similar to the specific implementation of step 104, and is not repeated here.
如图5所示,本实施例中的目标处理指令包括的并行度指的是块内并行度。在从目标处理指令存储地址中读取到目标处理指令之后,可以根据块内并行度,确定块内并行的计算单元信息。As shown in FIG. 5 , the degree of parallelism included in the target processing instruction in this embodiment refers to the degree of parallelism within a block. After the target processing instruction is read from the storage address of the target processing instruction, the parallel computing unit information in the block can be determined according to the degree of parallelism in the block.
在实际应用中，由于要求矩阵的维度范围比较多，比如1，2，4，8，16，24，32等。如果按照最大能力预留资源，那么在处理小维度矩阵的时候，资源无法得到充分利用，既浪费资源，又增加延迟。因此，本实施例中，任务请求中包括待处理矩阵的块数，采用把多块小矩阵拼接成大矩阵，并行处理，充分利用其资源，提高资源的利用率，同时降低由于串行处理引起的时延开销。矩阵求逆一般是针对每个元素点进行相应的计算，如果每次均只处理一个元素点，虽然简单，但并不能充分利用资源，同时时延也比较大。本实施例可以把资源按需求组织起来，并行处理相同矩阵的多个元素点，达到块内并行的目的。其中，块内并行度指的是并行处理的同一个待处理矩阵中的元素的个数。In practical applications, a wide range of matrix dimensions must be supported, such as 1, 2, 4, 8, 16, 24, and 32. If resources are reserved according to the maximum capability, they cannot be fully utilized when small-dimension matrices are processed, which both wastes resources and increases latency. Therefore, in this embodiment, the task request includes the number of blocks of the to-be-processed matrix, and multiple small matrices are concatenated into one large matrix and processed in parallel, which makes full use of the resources, improves resource utilization, and reduces the latency overhead caused by serial processing. Matrix inversion generally performs a computation for each element; processing only one element at a time is simple, but it neither makes full use of the resources nor keeps the latency low. In this embodiment, the resources can be organized as required so that multiple elements of the same matrix are processed in parallel, achieving intra-block parallelism. Here, the intra-block parallelism refers to the number of elements of the same to-be-processed matrix that are processed in parallel.
本实施例采用计算资源池的思想，把计算资源按一定粒度组合成具有多个独立计算资源的池，计算资源池可以根据应用需求，组合成一套或多套计算单元，处理一个大矩阵或并行处理多个小矩阵，以及并行处理同一矩阵多个元素。可以根据需要，把这些资源池组合成多个并行处理的计算单元，并行处理多个矩阵或元素。这样，由于多个并行计算单元共享相同的程序和控制逻辑，降低功耗开销，并且在处理小矩阵时，提高吞吐量，减少时间延迟。This embodiment adopts the idea of a computing resource pool: the computing resources are combined, at a certain granularity, into a pool of multiple independent computing resources. According to application requirements, the pool can be organized into one or more sets of computing units to process one large matrix, to process multiple small matrices in parallel, or to process multiple elements of the same matrix in parallel. As needed, these resource pools can be combined into multiple parallel computing units that process multiple matrices or elements simultaneously. In this way, since the multiple parallel computing units share the same program and control logic, power overhead is reduced, and when small matrices are processed, throughput is improved and latency is reduced.
一种实现方式中,如果确定计算一个元素最大需要64个乘法单元,同时为支持各种粗/细粒度的并行度,把这些乘法单元分为8组,每组8个。通过各种网络的动态配置,这些资源,即可以组织成一个统一的1个大资源池,也可以组织成最大8个的小资源池。In an implementation manner, if it is determined that a maximum of 64 multiplication units are required to calculate one element, and at the same time to support various coarse/fine-grained parallelisms, these multiplication units are divided into 8 groups, with 8 in each group. Through the dynamic configuration of various networks, these resources can be organized into a unified large resource pool, or can be organized into a maximum of 8 small resource pools.
在步骤208中，根据任务请求中待处理矩阵的块数，在粗粒度层面静态配置各种网络，把计算资源池组织成块间并行处理的计算单元。在步骤209中，根据从目标处理指令存储地址中读取到的目标处理指令中的并行度，再从细粒度层面动态调整各种网络和计算单元，并行处理块内的多个元素，从而实现控制各种单元按需求动态运行。In step 208, according to the number of blocks of the to-be-processed matrix in the task request, the various networks are statically configured at the coarse-grained level, and the computing resource pool is organized into computing units for inter-block parallel processing. In step 209, according to the parallelism in the target processing instruction read from the target processing instruction storage address, the various networks and computing units are further adjusted dynamically at the fine-grained level so that multiple elements within a block are processed in parallel, thereby controlling the units to run dynamically as required.
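Combining the two levels, static inter-block configuration (step 208) and dynamic intra-block parallelism taken from the instruction (step 209), can be sketched as follows (group sizes and counts are illustrative, not taken from the patent):

```python
def configure_units(unit_groups, num_blocks, parallelism):
    """Coarse-grained step: split the pooled unit groups into num_blocks
    block-parallel sets.  Fine-grained step: split each set further by the
    instruction's intra-block parallelism, one sub-set per element computed
    in parallel.  Illustrative software model only."""
    per_block = len(unit_groups) // num_blocks
    blocks = [unit_groups[i * per_block:(i + 1) * per_block]
              for i in range(num_blocks)]
    per_elem = per_block // parallelism
    return [[blk[j * per_elem:(j + 1) * per_elem]
             for j in range(parallelism)]
            for blk in blocks]
```

With 8 unit groups, 2 blocks, and parallelism 2, this yields the four independent networks of the earlier example.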
步骤210:根据目标处理指令、块间并行的计算单元信息对应的计算单元以及块内并行的计算单元信息对应的计算单元,并行处理输入寄存器文件中存储的数据。Step 210 : Process the data stored in the input register file in parallel according to the target processing instruction, the computing unit corresponding to the information of the parallel computing units between the blocks, and the computing unit corresponding to the information of the parallel computing units within the block.
在步骤210中,可以基于目标处理指令,通过块间并行的计算单元信息对应的计算单元以及块内并行的计算单元信息对应的计算单元,并行处理输入寄存器文件中存储的数据。该并行处理过程既可以实现多个待处理矩阵块之间的处理并行,也可以实现同一个待处理矩阵内的多个元素之间的并行处理。In step 210, the data stored in the input register file may be processed in parallel based on the target processing instruction through the computing unit corresponding to the inter-block parallel computing unit information and the computing unit corresponding to the intra-block parallel computing unit information. The parallel processing process can realize parallel processing among multiple matrix blocks to be processed, and can also realize parallel processing among multiple elements in the same matrix to be processed.
本实施例中，由于是依次获取目标处理指令，在每处理一个目标处理指令之后，可以将本次处理输入寄存器文件中存储的数据过程中，获取到的中间数据存储在输出寄存器文件或者输入寄存器文件中，将获取到的结果数据存储在输出寄存器文件中。可选地，获取到的中间数据还可以存储在计算单元的缓存单元中。即，本实施例中的两个寄存器文件可分别用于输入存储、输出存储或中间临时存储。同时，根据并行处理的块数，可以分别存储多个小矩阵。In this embodiment, since the target processing instructions are fetched sequentially, after each target processing instruction is processed, the intermediate data produced while processing the data stored in the input register file can be stored in the output register file or the input register file, and the final result data is stored in the output register file. Optionally, the intermediate data can also be stored in a cache unit of the computing unit. That is, the two register files in this embodiment can each serve as input storage, output storage, or intermediate temporary storage. Moreover, according to the number of blocks processed in parallel, multiple small matrices can be stored separately.
As shown in FIG. 5, the target instruction further includes the access addresses of the target data (i.e., the source B and source C fields in FIG. 5) and the processing mode of the target data. In step 210, if it is determined from the access addresses of the target data that multiple pieces of target data to be accessed by the computing units share the same row or column, the access channels of those pieces of target data are merged into one access channel; data is read from the input register file and/or the output register file through the merged access channel; the target data is then obtained from the read data according to its access addresses; and the target data is processed in parallel according to the processing mode included in the target processing instruction, the computing units corresponding to the inter-block parallel computing unit information, and the computing units corresponding to the intra-block parallel computing unit information.
The computing units here may be inter-block parallel computing units or intra-block parallel computing units; this embodiment is not limited in this respect.
In a specific implementation, read control information can be generated after the access channels are merged. According to the read control information, all data in the merged access channel is read from the input register file and/or the output register file, and the required target data is then selected from it. After the required target data has been read, related preprocessing is performed, such as fixed-point conversion or conjugation; according to the previously collected fixed-point data, the data is broadcast or dispatched to the fixed-point units as required by the target processing instruction; according to the instruction, the computing units corresponding to the inter-block parallel computing unit information and to the intra-block parallel computing unit information are started, fixed-point conversion is performed and fixed-point-related data is collected dynamically as the instruction requires, and the results are written back to the register files and/or the master control unit.
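The preprocessing mentioned above (optional conjugation followed by fixed-point conversion before dispatch to the computing units) might look roughly like the following software model. This is only an illustrative sketch: the word width, fractional bits, and saturation behavior are assumptions for the example and are not taken from the patent.

```python
def to_fixed_point(x: complex, frac_bits: int = 12, width: int = 16) -> complex:
    """Quantize a complex sample to a signed fixed-point grid with
    `frac_bits` fractional bits, saturating to the `width`-bit range.
    (Assumed parameters; the patent does not fix the word width.)"""
    lo, hi = -(1 << (width - 1)), (1 << (width - 1)) - 1
    def q(v: float) -> int:
        n = round(v * (1 << frac_bits))   # scale to the fixed-point grid
        return max(lo, min(hi, n))        # saturate
    return complex(q(x.real), q(x.imag))

def preprocess(x: complex, conjugate: bool) -> complex:
    """Conjugate if the instruction requests it, then fix-point the value."""
    if conjugate:
        x = x.conjugate()
    return to_fixed_point(x)
```

With 12 fractional bits, `preprocess(0.5 + 0.25j, True)` yields the conjugated, scaled value `2048 - 1024j`.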
Because there are multiple resource pools, parallel processing requires a large amount of data, so large amounts of data must be read from the register files in parallel. With a conventional read scheme, many levels of data-selector logic would be needed, which is difficult to implement in the physical back end due to congestion. Given the data-access pattern of matrix inversion, and of lower triangular matrix inversion in particular, the register file also stores the lower-triangle data into the upper triangle, which amounts to keeping a copy and converts column accesses into row accesses. Since the data accessed at any one time is local, channel merging confines the reads to a relatively small number of rows, reducing the number of first-level selectors; after the row data has been read, column selection is performed separately. This technique greatly reduces the number of row accesses, and thus a large amount of data-selector logic, which benefits the physical back-end implementation.
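The mirrored-storage idea described above can be sketched as a small software model. This is an illustrative sketch of the addressing trick, not the patented register-file hardware; the function names are invented for the example.

```python
def store_mirrored(lower):
    """Store an n x n lower triangular matrix (a list of rows) and mirror
    the strictly lower triangle into the upper triangle, so that a column
    access can be served as a row access."""
    n = len(lower)
    regfile = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            regfile[i][j] = lower[i][j]   # original lower-triangle entry
            regfile[j][i] = lower[i][j]   # mirrored copy in the upper triangle
    return regfile

def read_column_as_row(regfile, j):
    # Column j of the lower triangle (diagonal downward) now lives in
    # row j from the diagonal rightward: one contiguous row access
    # instead of a strided column access.
    return regfile[j][j:]
```

For example, for the lower triangular matrix with rows `[1]`, `[2, 3]`, `[4, 5, 6]`, reading "column 1" becomes the single row read `[3, 5]`.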
FIG. 7 is a schematic diagram of access-channel merging. Eight 8x8 matrix blocks are completed at the same time. When the processing instruction shown in the figure is executed, 8 independent channels would need to read 8 rows of data simultaneously; after channel merging, only 4 channels need to read 4 rows simultaneously. This greatly reduces the number of read selectors and benefits back-end routing.
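The merging step itself can be modeled as grouping the requested addresses by physical register-file row, so that requests hitting the same row share one read channel. A minimal sketch under an assumed request format (this models only the grouping, not the FIG. 7 hardware):

```python
def merge_channels(requests):
    """Merge per-channel read requests that target the same register-file
    row into a single access channel.

    requests: list of (row, col) addresses, one per requesting channel.
    Returns a dict mapping each distinct row to the set of columns needed
    from it; the dict has one entry per merged channel.
    """
    merged = {}
    for row, col in requests:
        merged.setdefault(row, set()).add(col)
    return merged

# Eight channels each request one element, but only four distinct rows
# are touched, so four merged channels suffice; column selection then
# picks the needed elements out of each row that was read.
reqs = [(0, 0), (0, 1), (1, 0), (1, 2), (2, 3), (2, 4), (3, 5), (3, 6)]
channels = merge_channels(reqs)
```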
In step 209, if the current running address is determined to be equal to the instruction end address among the target processing instruction storage addresses, the inversion instructions for the matrix to be processed have finished executing, and the input register file is released. Then, according to the task parameters and the master-control fixed-point results, the output data is post-processed, and the relevant result parameters and data are output; the output register file is then set to the idle state, completing the lower triangular inversion of the entire matrix.
The above data processing procedure of this embodiment uses a loosely coupled three-stage input-compute-output pipeline; the stages are only weakly correlated, and each stage triggers the execution of the next stage solely through a trigger signal.
Referring again to FIG. 3, after the task input module receives a task request, the data in the matrix to be processed included in the task request is stored in the input register file through the data write interface; the computing units in the resource pool are configured as the inter-block parallel computing unit information; the target processing instruction storage addresses are read according to the task request and the mapping information (for example, a mapping table), and the target processing instructions, instruction 0 through instruction N, are thereby obtained; each target processing instruction is executed in turn through instruction fetch, decode, and control; according to the target processing instruction, the resources in the resource pool are configured at fine granularity to determine the intra-block parallel computing unit information; and the data is processed in parallel by the configured computing units. During processing, data must be read from the register files through the data read interface, channel merging, and so on. After a target processing instruction has executed, the results are stored in the register file through the data write interface. The resource pool in FIG. 3 includes multiply-accumulate (MAC) units.
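The MAC units in the resource pool perform the basic multiply-accumulate step into which the matrix operations decompose; a trivial software model follows (the hardware issues many such steps in parallel across blocks and elements, which this sequential sketch does not attempt to show):

```python
def mac(acc: complex, a: complex, b: complex) -> complex:
    """One multiply-accumulate step: acc + a * b."""
    return acc + a * b

def dot(u, v):
    """A row-times-column product expressed as a chain of MAC steps,
    as an inversion instruction sequence would issue them."""
    acc = 0j
    for a, b in zip(u, v):
        acc = mac(acc, a, b)
    return acc
```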
The data processing method provided by this embodiment overcomes the problem that traditional matrix inversion methods cannot simultaneously satisfy versatility, throughput, complexity, and low latency. Drawing on the SIMD-plus-ASIC idea, it proposes a highly configurable and general matrix inversion implementation that can adapt to different protocols and their continuing evolution. It also combines techniques such as channel merging, dual vector register files, inter-block and intra-block parallelism, and various broadcast networks, and adopts the resource-pool approach to reduce implementation cost and development and back-end risk, meeting the needs of various algorithms, dimensions, and application scenarios.
FIG. 8 is a schematic structural diagram of a data processing apparatus provided by another embodiment. As shown in FIG. 8, the data processing apparatus provided by this embodiment includes the following modules: a storage module 81, a first determination module 82, a second determination module 83, and a processing module 84.

The storage module 81 is configured to store, in an input register file, the data in a matrix to be processed included in a received task request.

The task request includes: the data in the matrix to be processed, the dimension of the matrix to be processed, and the number of blocks of the matrix to be processed.

The first determination module 82 is configured to determine the target processing instruction storage addresses corresponding to the matrix to be processed according to the dimension of the matrix to be processed and preset mapping information.

The mapping information indicates the mapping relationship between matrix dimensions and processing instruction storage addresses. The target processing instructions stored at the target processing instruction storage addresses are used to determine the inverse matrix of the matrix to be processed.

The second determination module 83 is configured to configure the computing resources in the resource pool according to the number of blocks of the matrix to be processed and determine the inter-block parallel computing unit information.

The processing module 84 is configured to read the target processing instructions sequentially from the target processing instruction storage addresses and process the data stored in the input register file in parallel according to the target processing instructions and the computing units corresponding to the inter-block parallel computing unit information.
Optionally, the target processing instruction includes an intra-block parallelism degree. The apparatus further includes a third determination module configured to determine the intra-block parallel computing unit information according to the intra-block parallelism degree. The processing module 84 is specifically configured to process the data stored in the input register file in parallel according to the target processing instruction, the computing units corresponding to the inter-block parallel computing unit information, and the computing units corresponding to the intra-block parallel computing unit information.
In one implementation, the processing module 84 is specifically configured to: set the current running address to the i-th address among the target processing instruction storage addresses; process the data stored in the input register file in parallel according to the target processing instruction corresponding to the i-th address and the computing units corresponding to the inter-block parallel computing unit information; and, after the target processing instruction corresponding to the i-th address has been processed, upon determining that the current running address is not equal to the instruction end address among the target processing instruction storage addresses, set i = i + 1 and return to the step of setting the current running address to the i-th address among the target processing instruction storage addresses.
Further, the target processing instruction also includes a delay period. The processing module 84 is specifically configured to: after determining that the current running address is not equal to the instruction end address among the target processing instruction storage addresses, and after waiting for the delay period included in the target processing instruction corresponding to the i-th address, set i = i + 1 and return to the step of setting the current running address to the i-th address among the target processing instruction storage addresses.
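The address-sequencing loop described in this implementation (run the instruction at the current address, compare against the end address, optionally wait the instruction's delay period, then advance) can be sketched roughly as follows. The instruction fields and the `wait` callback are assumptions chosen for illustration, not the patent's instruction encoding.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Instruction:
    delay_cycles: int              # delay period carried by the instruction
    execute: Callable[[], None]    # the work the instruction performs

def run_program(instrs: List[Instruction], start: int, end: int, wait):
    """Execute instructions from address `start` up to and including `end`.

    `wait(cycles)` models stalling for an instruction's delay period
    before the running address advances (i = i + 1)."""
    i = start
    while True:
        current = i                    # current running address
        instrs[current].execute()      # process the instruction at address i
        if current == end:             # instruction end address reached
            break
        wait(instrs[current].delay_cycles)
        i = current + 1                # i = i + 1, loop back
```

A host model would supply `execute` callbacks that drive the configured computing units; here they can simply record the order of execution.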
Optionally, the matrix to be processed is a lower triangular matrix. The storage module 81 is specifically configured to: store, in the input register file, the data in the matrix to be processed included in the received task request according to a preset lower triangular matrix storage format; and copy the data in the matrix to be processed included in the received task request and store the copied data according to an upper triangular matrix storage format.

In one implementation, the storage module 81 is further configured to store the intermediate data obtained while processing the data stored in the input register file in the output register file or the input register file, and to store the obtained result data in the output register file.

Optionally, the target instruction further includes the access addresses of the target data and the processing mode of the target data. The processing module 84 is specifically configured to: if it is determined, according to the access addresses of the target data, that multiple pieces of target data to be accessed by the computing units share the same row or column, merge the access channels of the multiple pieces of target data into one access channel; read data from the input register file and/or the output register file through the merged access channel; obtain the target data from the read data according to its access addresses; and process the target data in parallel according to the processing mode included in the target processing instruction, the computing units corresponding to the inter-block parallel computing unit information, and the computing units corresponding to the intra-block parallel computing unit information.

In one implementation, the apparatus further includes: a judgment module configured to judge whether the input register file is idle when it is determined that a task request exists at the task interface; and a receiving and setting module configured to receive the task request and set the state of the input register file to busy when the input register file is determined to be idle.

In another implementation, the judgment module is further configured to judge whether the output register file is idle. In this implementation, the apparatus further includes a setting module and a fourth determination module. The setting module is configured to set the state of the output register file to busy when the output register file is determined to be idle. The fourth determination module is configured to determine, after the state of the output register file has been set to busy, to execute the step of determining the target processing instruction storage addresses corresponding to the matrix to be processed according to the dimension of the matrix to be processed and the preset mapping information.

The data processing apparatus provided by this embodiment is used to execute the data processing method of any of the above embodiments; its implementation principle and technical effect are similar and are not repeated here.
FIG. 9 is a schematic structural diagram of a data processing device provided by an embodiment. As shown in FIG. 9, the data processing device includes a processor 91 and a memory 92. The number of processors 91 in the data processing device may be one or more; one processor 91 is taken as an example in FIG. 9. The processor 91 and the memory 92 in the data processing device may be connected by a bus or in other ways; connection by a bus is taken as an example in FIG. 9.

As a computer-readable storage medium, the memory 92 can be used to store software programs, computer-executable programs, and modules, such as the program instructions/modules corresponding to the data processing method in the embodiments of the present application (for example, the storage module 81, the first determination module 82, the second determination module 83, and the processing module 84 in the data processing apparatus). By running the software programs, instructions, and modules stored in the memory 92, the processor 91 executes the various functional applications and data processing of the data processing device, that is, implements the above data processing method.

The memory 92 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system and the application program required by at least one function, and the data storage area may store data created according to the use of the data processing device, and the like. In addition, the memory 92 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device.
An embodiment of the present application further provides a storage medium containing computer-executable instructions which, when executed by a computer processor, are used to execute a data processing method, the method including:

storing, in an input register file, the data in a matrix to be processed included in a received task request, where the task request includes: the data in the matrix to be processed, the dimension of the matrix to be processed, and the number of blocks of the matrix to be processed;

determining, according to the dimension of the matrix to be processed and preset mapping information, the target processing instruction storage addresses corresponding to the matrix to be processed, where the mapping information indicates the mapping relationship between matrix dimensions and processing instruction storage addresses, and the target processing instructions stored at the target processing instruction storage addresses are used to determine the inverse matrix of the matrix to be processed;

configuring the computing resources in a resource pool according to the number of blocks of the matrix to be processed, and determining inter-block parallel computing unit information; and

reading the target processing instructions sequentially from the target processing instruction storage addresses, and processing the data stored in the input register file in parallel according to the target processing instructions and the computing units corresponding to the inter-block parallel computing unit information.

Certainly, in the storage medium containing computer-executable instructions provided by the present application, the computer-executable instructions are not limited to the above method operations and can also perform related operations in the data processing method provided by any embodiment of the present application.
The above are merely exemplary embodiments of the present application and are not intended to limit the protection scope of the present application.

In general, the various embodiments of the present application may be implemented in hardware or dedicated circuits, software, logic, or any combination thereof. For example, some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software executable by a controller, microprocessor, or other computing device, although the present application is not limited thereto.

Those of ordinary skill in the art can understand that all or some of the steps of the methods disclosed above, and the functional modules/units of the systems and devices, can be implemented as software, firmware, hardware, and appropriate combinations thereof.
In a hardware implementation, the division between the functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed cooperatively by several physical components. Some or all of the physical components may be implemented as software executed by a processor such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to those of ordinary skill in the art, the term computer storage media includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for the storage of information, such as computer-readable instructions, data structures, program modules, or other data. Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disk (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer. In addition, as is well known to those of ordinary skill in the art, communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.
The preferred embodiments of the present application have been described above with reference to the accompanying drawings; they do not limit the scope of rights of the embodiments of the present application. Any modifications, equivalent replacements, and improvements made by those skilled in the art without departing from the scope and essence of the embodiments of the present application shall fall within the scope of rights of the embodiments of the present application.

Claims (11)

  1. A data processing method, the method comprising:
    storing, in an input register file, data in a matrix to be processed included in a received task request, wherein the task request comprises: the data in the matrix to be processed, a dimension of the matrix to be processed, and a number of blocks of the matrix to be processed;
    determining, according to the dimension of the matrix to be processed and preset mapping information, target processing instruction storage addresses corresponding to the matrix to be processed, wherein the mapping information indicates a mapping relationship between matrix dimensions and processing instruction storage addresses, and target processing instructions stored at the target processing instruction storage addresses are used to determine an inverse matrix of the matrix to be processed;
    configuring computing resources in a resource pool according to the number of blocks of the matrix to be processed, and determining inter-block parallel computing unit information; and
    reading the target processing instructions sequentially from the target processing instruction storage addresses, and processing the data stored in the input register file in parallel according to the target processing instructions and computing units corresponding to the inter-block parallel computing unit information.
  2. The method according to claim 1, wherein the target processing instruction comprises an intra-block parallelism degree;
    after reading the target processing instructions sequentially from the target processing instruction storage addresses, the method further comprises:
    determining intra-block parallel computing unit information according to the intra-block parallelism degree;
    wherein processing the data stored in the input register file according to the target processing instructions and the computing units corresponding to the inter-block parallel computing unit information comprises:
    processing the data stored in the input register file in parallel according to the target processing instructions, the computing units corresponding to the inter-block parallel computing unit information, and computing units corresponding to the intra-block parallel computing unit information.
  3. The method according to claim 1 or 2, wherein reading the target processing instructions sequentially from the target processing instruction storage addresses, and processing the data stored in the input register file in parallel according to the target processing instructions and the computing units corresponding to the inter-block parallel computing unit information, comprises:
    setting a current running address to an i-th address among the target processing instruction storage addresses;
    processing the data stored in the input register file in parallel according to a target processing instruction corresponding to the i-th address and the computing units corresponding to the inter-block parallel computing unit information; and
    after the target processing instruction corresponding to the i-th address has been processed, upon determining that the current running address is not equal to an instruction end address among the target processing instruction storage addresses, setting i = i + 1 and returning to the step of setting the current running address to the i-th address among the target processing instruction storage addresses.
  4. The method according to claim 3, wherein the target processing instruction further comprises a delay period;
    wherein, upon determining that the current running address is not equal to the instruction end address among the target processing instruction storage addresses, setting i = i + 1 and returning to the step of setting the current running address to the i-th address among the target processing instruction storage addresses comprises:
    after determining that the current running address is not equal to the instruction end address among the target processing instruction storage addresses, and after waiting for the delay period included in the target processing instruction corresponding to the i-th address, setting i = i + 1 and returning to the step of setting the current running address to the i-th address among the target processing instruction storage addresses.
  5. The method according to any one of claims 1 to 4, wherein the matrix to be processed is a lower triangular matrix;
    wherein storing, in the input register file, the data in the matrix to be processed included in the received task request comprises:
    storing, in the input register file, the data in the matrix to be processed included in the received task request according to a preset lower triangular matrix storage format; and
    copying the data in the matrix to be processed included in the received task request, and storing the copied data according to an upper triangular matrix storage format.
  6. The method according to any one of claims 1 to 4, wherein, after processing the data stored in the input register file in parallel according to the target processing instructions, the computing units corresponding to the inter-block parallel computing unit information, and the computing units corresponding to the intra-block parallel computing unit information, the method further comprises:
    storing intermediate data obtained while processing the data stored in the input register file in an output register file or the input register file, and storing obtained result data in the output register file.
  7. The method according to claim 6, wherein the target processing instruction further comprises access addresses of target data and a processing mode for the target data;
    processing the data stored in the input register file in parallel according to the target processing instruction, the computing units corresponding to the inter-block parallel computing unit information, and the computing units corresponding to the intra-block parallel computing unit information comprises:
    if it is determined, according to the access addresses of the target data, that multiple target data to be accessed by a computing unit share the same row or column, merging the access channels of the multiple target data into one access channel;
    reading data from the input register file and/or the output register file according to the merged access channel;
    obtaining the target data from the read data according to the access addresses of the target data;
    processing the target data in parallel according to the processing mode included in the target processing instruction, the computing units corresponding to the inter-block parallel computing unit information, and the computing units corresponding to the intra-block parallel computing unit information.
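The access merging in claim 7 can be sketched as grouping target-data addresses by row, issuing one read per group (one access channel), and then extracting each target from the shared read. The `(row, col)` address form and the row-indexed register file are illustrative assumptions:

```python
from collections import defaultdict

def merge_accesses(addresses):
    """Group (row, col) target addresses by row so that all targets in the
    same row share a single merged access channel."""
    channels = defaultdict(list)
    for row, col in addresses:
        channels[row].append(col)
    return dict(channels)

def read_targets(register_file, addresses):
    """Issue one read per merged channel, then obtain each target value
    from the read data according to its access address."""
    channels = merge_accesses(addresses)
    out = {}
    for row, cols in channels.items():
        row_data = register_file[row]      # single access for the whole row
        for col in cols:
            out[(row, col)] = row_data[col]
    return out
```

Three target addresses spanning two rows thus collapse into two register-file accesses instead of three, which is the point of the merge.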
  8. The method according to claim 6 or 7, wherein before the data in the matrix to be processed included in the received task request is stored in the input register file, the method further comprises:
    upon determining that a task request is present at the task interface, judging whether the input register file is idle;
    upon determining that the input register file is idle, receiving the task request and setting the state of the input register file to busy.
  9. The method according to claim 6 or 7, wherein before the target processing instruction storage address corresponding to the matrix to be processed is determined according to the dimension of the matrix to be processed and the preset mapping information, the method further comprises:
    judging whether the output register file is idle;
    upon determining that the output register file is idle, setting the state of the output register file to busy;
    after setting the state of the output register file to busy, determining to execute the step of determining the target processing instruction storage address corresponding to the matrix to be processed according to the dimension of the matrix to be processed and the preset mapping information.
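Claims 8 and 9 describe the same idle/busy handshake applied to the input and output register files respectively: proceed only when the file is idle, and mark it busy before using it. A minimal sketch; the `RegisterFile` class and function names are illustrative, not from the source:

```python
class RegisterFile:
    def __init__(self):
        self.busy = False
        self.data = None

def try_accept_task(input_rf, task_data):
    """Claim 8: accept a pending task request only if the input register
    file is idle, then set its state to busy. Returns True on acceptance."""
    if input_rf.busy:
        return False
    input_rf.busy = True
    input_rf.data = task_data
    return True

def try_start_processing(output_rf):
    """Claim 9: before resolving the target processing instruction storage
    address, check that the output register file is idle and mark it busy.
    Returns True when processing may proceed."""
    if output_rf.busy:
        return False
    output_rf.busy = True
    return True
```

A second request arriving while either file is busy is simply refused until the file is released, which serializes tasks through the register-file pair.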
  10. A data processing device, comprising a memory, a processor, a program stored on the memory and executable on the processor, and a data bus for implementing connection and communication between the processor and the memory, wherein the program, when executed by the processor, implements the steps of the data processing method according to any one of claims 1 to 9.
  11. A storage medium for computer-readable storage, wherein the storage medium stores one or more programs executable by one or more processors to implement the steps of the data processing method according to any one of claims 1 to 9.
PCT/CN2021/107658 2020-07-31 2021-07-21 Data processing method and device, and storage medium WO2022022362A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010761369.X 2020-07-31
CN202010761369.XA CN114065122A (en) 2020-07-31 2020-07-31 Data processing method, device and storage medium

Publications (1)

Publication Number Publication Date
WO2022022362A1 true WO2022022362A1 (en) 2022-02-03

Family

ID=80037122

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/107658 WO2022022362A1 (en) 2020-07-31 2021-07-21 Data processing method and device, and storage medium

Country Status (2)

Country Link
CN (1) CN114065122A (en)
WO (1) WO2022022362A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115801147A (en) * 2022-11-30 2023-03-14 珠海笛思科技有限公司 Data communication processing method and system

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115880134B (en) * 2023-01-31 2024-04-16 南京砺算科技有限公司 Constant data processing method using vector register, graphics processor, and medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070208760A1 (en) * 2006-03-06 2007-09-06 Reuter James M Data-state-describing data structures
CN101562744A (en) * 2008-04-18 2009-10-21 展讯通信(上海)有限公司 Two-dimensional inverse transformation device
CN101621306A (en) * 2008-06-30 2010-01-06 中兴通讯股份有限公司 Mapping method and device for multiple-input multiple-output system precoding matrix
CN104572588A (en) * 2014-12-23 2015-04-29 中国电子科技集团公司第三十八研究所 Matrix inversion processing method and device
CN105790809A (en) * 2016-02-24 2016-07-20 东南大学 Coarse-grained reconfigurable array and routing structure for MIMO channel detection system
CN108647007A (en) * 2018-04-28 2018-10-12 天津芯海创科技有限公司 Arithmetic system and chip


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115801147A (en) * 2022-11-30 2023-03-14 珠海笛思科技有限公司 Data communication processing method and system
CN115801147B (en) * 2022-11-30 2023-09-22 珠海笛思科技有限公司 Data communication processing method and system

Also Published As

Publication number Publication date
CN114065122A (en) 2022-02-18

Similar Documents

Publication Publication Date Title
KR102443546B1 (en) matrix multiplier
CN107315574B (en) Apparatus and method for performing matrix multiplication operation
US10685082B2 (en) Sparse matrix multiplication using a single field programmable gate array module
WO2022022362A1 (en) Data processing method and device, and storage medium
CN106846235B (en) Convolution optimization method and system accelerated by NVIDIA Kepler GPU assembly instruction
CN108628799B (en) Reconfigurable single instruction multiple data systolic array structure, processor and electronic terminal
US20210089609A1 (en) Methods and apparatus for job scheduling in a programmable mixed-radix dft/idft processor
US11397791B2 (en) Method, circuit, and SOC for performing matrix multiplication operation
WO2021036729A1 (en) Matrix computation method, computation device, and processor
US10754818B2 (en) Multiprocessor device for executing vector processing commands
CN102629238B (en) Method and device for supporting vector condition memory access
Baboulin et al. An efficient distributed randomized algorithm for solving large dense symmetric indefinite linear systems
WO2021109665A1 (en) Data processing apparatus and method, base station, and storage medium
CN114385972A (en) Parallel computing method for directly solving structured triangular sparse linear equation set
US10127040B2 (en) Processor and method for executing memory access and computing instructions for host matrix operations
CN103235717B (en) There is the processor of polymorphic instruction set architecture
KR20210103393A (en) System and method for managing conversion of low-locality data into high-locality data
KR20210084220A (en) System and method for reconfigurable systolic array with partial read/write
CN116301920B (en) Compiling system for deploying CNN model to high-performance accelerator based on FPGA
CN115080496A (en) Network mapping method, data processing method and device, equipment, system and medium
CN114327639A (en) Accelerator based on data flow architecture, and data access method and equipment of accelerator
US20160162290A1 (en) Processor with Polymorphic Instruction Set Architecture
Esposito et al. Performance impact of rank-reordering on advanced polar decomposition algorithms
CN111352894A (en) Single-instruction multi-core system, instruction processing method and storage medium
CN113254078B (en) Data stream processing method for efficiently executing matrix addition on GPDPU simulator

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21848830

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21848830

Country of ref document: EP

Kind code of ref document: A1