WO2022022362A1 - Data processing method and device, and storage medium - Google Patents


Info

Publication number
WO2022022362A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
matrix
register file
processed
address
Prior art date
Application number
PCT/CN2021/107658
Other languages
French (fr)
Chinese (zh)
Inventor
Wang Huayong (王华勇)
Original Assignee
ZTE Corporation (中兴通讯股份有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ZTE Corporation (中兴通讯股份有限公司)
Publication of WO2022022362A1


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 Complex mathematical operations
    • G06F 17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization

Definitions

  • the embodiments of the present application relate to the field of communication technologies, and in particular, to a data processing method, device, and storage medium.
  • MIMO Multiple-Input Multiple-Output
  • Wi-Fi wireless Internet access
  • matrix inversion is mainly implemented in an application-specific integrated circuit (ASIC) dedicated, through hard-wired logic, to one or more types of matrix inversion algorithms, which achieves the highest efficiency and optimal power consumption.
  • ASIC application specific integrated circuit
  • An embodiment of the present application provides a data processing method. The method includes the following steps: storing the data in the matrix to be processed included in a received task request in an input register file, wherein the task request includes the data in the matrix to be processed, the dimension of the matrix to be processed, and the number of blocks of the matrix to be processed; determining, according to the dimension of the matrix to be processed and preset mapping information, the target processing instruction storage address corresponding to the matrix to be processed, wherein the mapping information indicates the mapping relationship between matrix dimensions and processing instruction storage addresses, and the target processing instructions stored at the target processing instruction storage address are used to determine the inverse matrix of the matrix to be processed; configuring the computing resources in a resource pool according to the number of blocks of the matrix to be processed, and determining inter-block parallel computing unit information; and sequentially reading the target processing instructions from the target processing instruction storage address, and processing the data stored in the input register file in parallel according to the target processing instructions and the computing units corresponding to the inter-block parallel computing unit information.
  • The embodiment of the present application also proposes a data processing device. The device includes a memory, a processor, a program stored in the memory and runnable on the processor, and a data bus for connection and communication between the processor and the memory; the program implements the steps of the aforementioned method when executed by the processor.
  • An embodiment of the present application provides a storage medium for computer-readable storage, where the storage medium stores one or more programs, and the one or more programs can be executed by one or more processors to implement the steps of the foregoing method.
  • FIG. 1 is a schematic flowchart of a data processing method provided by an embodiment
  • FIG. 2 is a schematic flowchart of an embodiment of a data processing method provided by another embodiment
  • FIG. 3 is a schematic structural diagram of a data processing apparatus provided by an embodiment
  • FIG. 4 is a schematic diagram of a lower triangular matrix storage format provided by an embodiment
  • FIG. 5 is a schematic diagram of a processing instruction format
  • FIG. 6 is a schematic diagram of the access type of the data included in the processing instruction
  • FIG. 7 is a schematic diagram of access channel merging
  • FIG. 8 is a schematic structural diagram of a data processing apparatus provided by another embodiment
  • FIG. 9 is a schematic structural diagram of a data processing device provided by an embodiment.
  • the main purpose of the embodiments of the present application is to provide a data processing method, device, and storage medium, aiming at realizing the function of efficiently determining inverse matrices of various dimensional matrices.
  • FIG. 1 is a schematic flowchart of a data processing method provided by an embodiment. This embodiment is applicable to the scenario of determining the inverse matrix of the received matrix. This embodiment may be executed by a data processing apparatus, which may be implemented in software and/or hardware, and may be integrated into a communication device such as a multi-mode base station or a multi-mode terminal. As shown in FIG. 1, the data processing method provided by this embodiment includes the following steps:
  • Step 101 Store the data in the matrix to be processed included in the received task request in the input register file.
  • the task request includes: the data in the matrix to be processed, the dimension of the matrix to be processed, and the number of blocks of the matrix to be processed.
  • the dimension of the matrix to be processed in this embodiment may be any dimension. For example, 1, 2, 4, 8, 16, 24, and 32, etc.
  • the task request in this embodiment may be sent by other devices to the data processing apparatus, or may be generated by other modules of the data processing apparatus.
  • the number of blocks of the matrix to be processed refers to the number of matrices to be processed, i.e., how many matrix blocks are submitted in one task.
  • the matrix to be processed in this embodiment may be a lower triangular matrix.
  • the matrix inversion operation in massive MIMO plays an important role in the channel detection algorithm and precoding algorithm.
  • Step 102 Determine the target processing instruction storage address corresponding to the matrix to be processed according to the dimension of the matrix to be processed and preset mapping information.
  • the mapping information is used to indicate the mapping relationship between the dimension of the matrix and the storage address of the processing instruction.
  • the target processing instruction stored in the target processing instruction storage address is used to determine the inverse matrix of the matrix to be processed.
  • the application can be analyzed in advance according to its requirements, all matrix dimensions required by the application can be extracted, and the inversion algorithm and procedure for each dimension can be determined according to algorithm performance simulation.
  • the required matrix inversion ranges from 1x1 to 64x64, of which 2x2 uses the direct inversion algorithm, 4x4 uses the block inversion algorithm, and the others use the Cholesky decomposition algorithm.
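As an illustrative, hypothetical sketch of the Cholesky-based route mentioned above (not the patent's hardware implementation), a real symmetric positive-definite matrix A can be factored as A = L * L^T, the triangular factor inverted by forward substitution, and A^-1 formed as L^-T * L^-1; all function names here are assumptions for illustration:

```python
def cholesky(a):
    """Lower-triangular L with a = L @ L^T (real symmetric positive-definite)."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(l[i][k] * l[j][k] for k in range(j))
            if i == j:
                l[i][j] = (a[i][i] - s) ** 0.5  # diagonal: sqrt of remainder
            else:
                l[i][j] = (a[i][j] - s) / l[j][j]
    return l

def invert_lower(l):
    """Invert a lower-triangular matrix by forward substitution."""
    n = len(l)
    inv = [[0.0] * n for _ in range(n)]
    for j in range(n):
        inv[j][j] = 1.0 / l[j][j]
        for i in range(j + 1, n):
            s = sum(l[i][k] * inv[k][j] for k in range(j, i))
            inv[i][j] = -s / l[i][i]
    return inv

def cholesky_inverse(a):
    """A^-1 = L^-T @ L^-1, where A = L @ L^T."""
    n = len(a)
    linv = invert_lower(cholesky(a))
    return [[sum(linv[k][i] * linv[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]
```

For example, `cholesky_inverse([[4.0, 2.0], [2.0, 3.0]])` yields the inverse `[[0.375, -0.25], [-0.25, 0.5]]` of that matrix.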
  • 64x64 can be decomposed into 4 blocks of 16x16, 8 blocks of 8x8, or 64 blocks of 1x1, and combinations such as 8 blocks of 2x2, 8 blocks of 4x4, and so on also need to be supported.
  • these specifications are converted into pseudo-code by means of scripts, mapped to the various hardware resources in the form of processing instructions, and the address ranges of the codes for the various specifications are then counted and written into a configuration file, to be dynamically or statically downloaded to the hardware instruction memory in the data processing device, for example, downloaded to a random access memory (Random Access Memory, RAM).
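The script-generated configuration described above can be sketched as follows; the per-dimension programs and the dictionary layout are illustrative assumptions, not the patent's actual file format:

```python
def build_mapping(programs):
    """Pack per-dimension instruction programs into one instruction-RAM image
    and record each dimension's (start, end) address range as the mapping
    information. programs: {dimension: [instruction, ...]}."""
    ram, mapping, addr = [], {}, 0
    for dim, instrs in sorted(programs.items()):
        mapping[dim] = (addr, addr + len(instrs) - 1)  # inclusive address range
        ram.extend(instrs)
        addr += len(instrs)
    return ram, mapping
```

For example, packing a 2-instruction program for 2x2 and a 3-instruction program for 4x4 yields a 5-word RAM image with mapping `{2: (0, 1), 4: (2, 4)}`.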
  • RAM Random Access Memory
  • the data processing apparatus in this embodiment can load the above configuration file and download processing instructions and mapping information after power-on reset.
  • the processing instruction represents the calculation and control flow, and the mapping information stores the dimension of the matrix, the start address of the processing instruction, and the end address of the processing instruction (ie, the storage address of the processing instruction).
  • the target processing instruction storage address corresponding to the matrix to be processed may be determined according to the dimension of the matrix to be processed and a preset mapping table, that is, the correspondence between the matrix dimension and the storage address of the processing instruction.
  • the target processing instruction stored in the target processing instruction storage address in this embodiment is used to determine the inverse matrix of the matrix to be processed.
  • Step 103 According to the number of blocks of the matrix to be processed, configure the computing resources in the resource pool, and determine the parallel computing unit information between the blocks.
  • the computing resources in the resource pool in this embodiment may include computing resources such as multipliers.
  • the throughput of matrix inversion is mainly determined by the multiplier. According to the number of matrix blocks and dimensions extracted above, the maximum number of multiplication units required to calculate one element is determined, for example, it can be 64 multiplication units. Meanwhile, to support inter-block parallelism, these multiplication units can be grouped. Through the dynamic configuration of various networks, these resources can be organized into a unified large resource pool or into multiple small resource pools.
  • step 103 according to the number of blocks of the matrix to be processed, the computing resources in the resource pool are configured, and the information of the computing units paralleled between the blocks is determined. For example, assuming that the number of blocks of the matrix to be processed is 4, the computing resources in the resource pool can be divided into 4 independent networks, and the 4 independent networks process the data in the 4 matrices to be processed in parallel.
  • the calculation unit information here is used to indicate the parallel calculation units between blocks, and the corresponding relationship between the calculation units and the matrix to be processed.
  • the inter-block parallel computing unit in this embodiment refers to a computing unit that processes data in different matrices to be processed in parallel.
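Configuring the resource pool into inter-block parallel computing units, as in step 103, can be sketched as follows (a simplified assumption: the multiplier count divides evenly by the block count, and units are identified by index):

```python
def partition_resource_pool(num_multipliers, num_blocks):
    """Split a pool of multiplier indices into one independent network per
    matrix block, so the blocks can be inverted in parallel."""
    per_block = num_multipliers // num_blocks
    return [list(range(b * per_block, (b + 1) * per_block))
            for b in range(num_blocks)]
```

With 64 multiplication units and 4 blocks, this yields 4 independent networks of 16 units each, matching the example of dividing the pool into 4 networks that process the 4 matrices in parallel.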
  • step 102 and step 103 have no ordering dependency: they can be executed simultaneously or in either order.
  • Step 104 Read the target processing instructions sequentially from the target processing instruction storage address, and process the data stored in the input register file in parallel according to the target processing instructions and the computing units corresponding to the inter-block parallel computing unit information.
  • the number of target processing instructions stored in the target processing instruction storage address in this embodiment may be multiple.
  • the target processing instructions are sequentially read. After each target processing instruction is read, the data stored in the input register file is processed in parallel according to the target processing instruction and the computing unit corresponding to the parallel computing unit information between the blocks.
  • the target processing instruction includes an access address and a processing method. After the data is read from the input register file according to the access address in the target processing instruction, the read data can be processed according to the processing method included in the target processing instruction.
  • the inverse matrix of the matrix to be processed can be obtained.
  • step 104 may be performed through the following process:
  • Step 1041 Set the current running address as the i-th address in the storage address of the target processing instruction.
  • Step 1042 Process the data stored in the input register file in parallel according to the target processing instruction corresponding to the i-th address and the computing units corresponding to the inter-block parallel computing unit information.
  • the current operation address is the start address in the storage address of the target processing instruction. i is an integer greater than or equal to 0.
  • target processing instructions also include delay cycles.
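The sequential fetch-and-execute of steps 1041 and 1042, including the delay (halt) cycles carried in each instruction, can be sketched as follows; encoding instructions as dictionaries and the `execute` callback are assumptions for illustration:

```python
def run_program(ram, start, end, execute):
    """Fetch instructions ram[start..end] in order, execute each, and stall
    for the instruction's halt cycles (to avoid read/write conflicts).
    Returns the total cycle count."""
    cycles = 0
    addr = start                          # step 1041: current running address
    while addr <= end:
        instr = ram[addr]
        execute(instr)                    # step 1042: drive the computing units
        cycles += 1 + instr.get("halt", 0)  # stall n cycles after this instruction
        addr += 1
    return cycles
```

A two-instruction program where the first instruction carries 2 halt cycles thus takes 4 cycles in total: 1 + 2 for the first instruction, and 1 for the second.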
  • the data processing method provided in this embodiment draws on the ideas of Single Instruction Multiple Data (SIMD) and the ASIC, and can implement a general configurable matrix inversion method and device, so that the programmability of SIMD can be obtained.
  • SIMD Single Instruction Multiple Data
  • the advantages of programmability and scalability can be obtained, as can the ASIC's advantages of low latency, high efficiency, and low power consumption.
  • the data processing method provided in this embodiment can be applied to MIMO technical fields such as 5G wireless mobile communication, deep space communication, optical fiber communication, satellite digital video and audio broadcasting.
  • This embodiment provides a data processing method, including: storing the data in the matrix to be processed included in a received task request in an input register file, wherein the task request includes the data in the matrix to be processed, the dimension of the matrix to be processed, and the number of blocks of the matrix to be processed; determining, according to the dimension of the matrix to be processed and preset mapping information, the target processing instruction storage address corresponding to the matrix to be processed, wherein the mapping information indicates the mapping relationship between matrix dimensions and processing instruction storage addresses, and the target processing instructions stored at the target processing instruction storage address are used to determine the inverse matrix of the matrix to be processed; configuring the computing resources in the resource pool according to the number of blocks of the matrix to be processed, and determining inter-block parallel computing unit information; and sequentially reading the target processing instructions from the target processing instruction storage address, and processing the data stored in the input register file in parallel according to the target processing instructions and the computing units corresponding to the inter-block parallel computing unit information.
  • the storage address of the target processing instruction corresponding to the matrix to be processed can be obtained according to the dimension of the matrix to be processed; at the same time, the computing resources in the resource pool can be configured according to the number of blocks of the matrix to be processed, and the inter-block parallel computing unit information can be determined.
  • the data stored in the input register file is processed in parallel according to the target processing instructions sequentially read from the target processing instruction storage address and the computing units corresponding to the inter-block parallel computing unit information. On the one hand, this provides strong programmability and scalability for processing matrices of various dimensions; on the other hand, data can be processed in parallel, with low latency and high efficiency.
  • FIG. 2 is a schematic flowchart of an embodiment of a data processing method provided by another embodiment.
  • the data processing method provided by this embodiment includes the following steps:
  • Step 201 When it is determined that there is a task request in the task interface, determine whether the input register file is free.
  • Step 202 When it is determined that the input register file is idle, a task request is received, and the state of the input register file is set to a busy state.
  • in step 201, a task request is received only when the input register file is judged to be idle, which can avoid write operation errors.
  • setting the state of the input register file to the busy state can prevent the input register file from being occupied by other tasks and cause errors in the data processing process.
  • the state of the input register file can be set to the busy state while receiving the task request.
  • the state of the input register file can also be set to the busy state before receiving the task request.
  • Step 203 Store the data in the matrix to be processed included in the received task request in the input register file.
  • the task request includes: the data in the matrix to be processed, the dimension of the matrix to be processed, and the number of blocks of the matrix to be processed.
  • the matrix to be processed in this embodiment may be a lower triangular matrix.
  • Step 203 may specifically be: storing the data in the matrix to be processed included in the received task request in the input register file according to the preset storage format of the lower triangular matrix; and copying the data in the matrix to be processed included in the received task request, the copied data being stored according to the storage format of the upper triangular matrix.
  • preprocessing operations such as data alignment may also be performed.
  • FIG. 4 is a schematic diagram of a lower triangular matrix storage format provided by an embodiment. As shown in FIG. 4 , it is assumed that the minimum storage unit of the input register file can store an 8*8 matrix, and the input register file has 4 rows and 4 columns of the minimum storage unit.
  • as shown in (1) in Figure 4, the input register file can store 1 to 64 1x1 matrices; as shown in (2) in Figure 4, the input register file can store 1 to 8 8x8 matrices, with the storage locations shown in the gray area of (2); as shown in (3) in Figure 4, the input register file can store 1 to 4 16x16 matrices, with the storage locations shown in the gray area of (3); as shown in (4) in Figure 4, the input register file can store one 32x32 matrix, with the storage location shown in the gray area of (4). Further, the gray area above the dotted line in (4) can store a 24x24 matrix.
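A packed lower-triangular layout illustrates the storage-format idea in principle; the exact register-file placement of Figure 4 is not reproduced here, and this row-major indexing scheme is an assumption:

```python
def pack_lower(matrix):
    """Row-major packed storage of the lower triangle (elements with i >= j)."""
    return [matrix[i][j] for i in range(len(matrix)) for j in range(i + 1)]

def packed_index(i, j):
    """Offset of element (i, j), i >= j, in the packed layout: rows 0..i-1
    contribute i*(i+1)/2 elements, then j more within row i."""
    return i * (i + 1) // 2 + j
```

An n x n lower-triangular matrix then occupies n(n+1)/2 storage words instead of n^2.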
  • the access to data in the algorithm mainly includes scalar access, row vector access and column vector access.
  • the upper triangle can be stored at the same time, thereby converting column vector access into row vector access and reducing the complexity of data access.
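The mirror-copy idea above can be sketched as follows: after copying the lower triangle into the upper triangle, column j of the lower-triangular matrix equals row j of the mirrored matrix, so a column-vector access becomes a row-vector access (the function name is an illustrative assumption):

```python
def mirror_to_upper(lower):
    """Copy the lower triangle into the upper triangle, so that column j of
    the lower-triangular matrix can be read as row j of the result."""
    n = len(lower)
    full = [row[:] for row in lower]
    for i in range(n):
        for j in range(i):
            full[j][i] = full[i][j]  # mirror element (i, j) to (j, i)
    return full
```

For `[[1, 0], [2, 3]]`, column 0 of the lower-triangular matrix is `[1, 2]`, and after mirroring it is available as row 0 of the result.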
  • the data processing apparatus in this embodiment may further include an output register file.
  • FIG. 3 is a schematic structural diagram of a data processing apparatus according to an embodiment. As shown in FIG. 3 , the data processing device mainly includes five modules including instruction control, task input, computing resource pool, register file and task output.
  • the register file includes: input register file B (regFileB) and output register file C (regFileC).
  • Input register file B and output register file C may be vector register files.
  • in order to reduce data access delay and store intermediate calculation results, two vector register files are used, which enables pipelined data input/output and thereby reduces the overall processing delay.
  • These high-dimensional vector register files can store multiple low-dimensional matrices in parallel, so as to complete the parallel inversion of multiple low-dimensional matrices, thereby improving the utilization of the computing units, improving the throughput of matrix inversion, reducing control overhead, and reducing overall power consumption.
  • the pipelined data input/output means: in the data input stage, the task request can be received as long as the input register file is in an idle state, regardless of whether the output register file is idle; in the data output stage, the state of the input register file can be set to idle regardless of whether the data output is complete.
  • the pipeline operation of data input/output can reduce the overall processing delay of data processing.
  • Step 204 Determine whether the output register file is free.
  • step 203 may be performed after step 204 .
  • After the data is stored, it can be determined whether the output register file is free before starting to process the data.
  • Step 205 When it is determined that the output register file is idle, the state of the output register file is set to a busy state.
  • Step 206 After setting the state of the output register file to the busy state, determine to execute the step of determining the target processing instruction storage address corresponding to the matrix to be processed according to the dimension of the matrix to be processed and the preset correspondence between matrix dimensions and processing instruction storage addresses.
  • Before proceeding to step 207, in order to avoid write errors, the state of the output register file needs to be judged; step 207 is determined to be executed only when the output register file is free. Setting the state of the output register file to the busy state prevents the output register file from being occupied by other tasks and causing errors in the data processing process.
  • Step 207 Determine the target processing instruction storage address corresponding to the matrix to be processed according to the dimension of the matrix to be processed and the preset mapping information.
  • the mapping information is used to indicate the mapping relationship between the dimension of the matrix and the storage address of the processing instruction.
  • the target processing instruction stored at the target processing instruction storage address is used to determine the inverse matrix of the matrix to be processed.
  • B represents the data in the input register file
  • C represents the data in the output register file
  • D represents the execution result of the last target processing instruction
  • A represents the processing results of B and C.
  • FIG. 5 is a schematic diagram of a processing instruction format. As shown in Figure 5, the meanings of the fields in the processing instruction are as follows:
  • Instruction This field includes the above basic fine-grained operators and is mainly used to control the execution of the computing units.
  • Halt cycle Due to the strong dependency between some data in the matrix inversion algorithm, in order to avoid read and write conflicts, waiting cycles need to be inserted between some calculation processes. This field is used to suspend the pipeline for n cycles after executing this instruction.
  • Degree of parallelism This field is used for fine-grained intra-block parallelism, indicating the number of elements processed in parallel within a block. Combined with the coarse-grained inter-block parallelism in the task parameters (i.e., the number of blocks of the matrix to be processed), the computing resources and the various networks can be organized in various forms. For example, if the number of blocks is 2 and the degree of parallelism is 2, the accumulation/scaling/broadcasting network organizes the resource pool into four independent networks spanning two disjoint groups; this configuration can handle the inversion of 2 matrices simultaneously, and each matrix can compute two elements simultaneously.
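The arithmetic of combining the two levels of parallelism can be sketched as follows (a simplifying assumption: the network count divides the multiplier count evenly):

```python
def organize_networks(num_multipliers, num_blocks, parallelism):
    """Combine coarse-grained inter-block parallelism (a task parameter) with
    fine-grained intra-block parallelism (an instruction field): the pool is
    organized into num_blocks * parallelism independent networks, one per
    concurrently computed element. Returns (networks, multipliers each)."""
    networks = num_blocks * parallelism
    return networks, num_multipliers // networks
```

With 64 multipliers, 2 blocks, and a parallelism degree of 2, the pool is organized into 4 independent networks of 16 multipliers each, matching the example above.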
  • Calibration control This field is mainly used for the calculation and propagation of various calibration values in the algorithm process, as well as the selection of calibration values required for fixed point.
  • Source/Destination A This field is mainly used to control the behavior of data A. Through different controls, the operands can be obtained from input register file B or output register file C; the result can also be written back to input register file B or output register file C; at the same time, a constant value, or the conjugate of the fetched data, can also be obtained.
  • Source B and Source C These two fields are used to control the behavior of accessing register files B and C, respectively. With different controls, both row and column access are possible.
  • (From0, To0) represents the location of the data accessed in different resource blocks; (From1, To1) represents the sequence of the accessed data within the same resource block. Here, different resource blocks represent different rows or columns, and the same resource block represents the same row or the same column.
  • FIG. 6 is a schematic diagram of the access types of the data included in processing instructions. As shown in FIG. 6, when the type in the processing instruction is 1, column access is indicated: From0 and To0 represent accessing different columns, and From1 and To1 represent which rows are accessed in the columns corresponding to From0 and To0. For example, (From0, To0) can be (0, 2), which means accessing columns 0 to 2; (From1, To1) can be (2, 4), which means accessing rows 2 to 4 of each of columns 0 to 2.
  • From0 and To0 represent accessing different rows
  • From1 and To1 represent which columns in the row corresponding to From0 and To0 are accessed.
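Decoding a (From0, To0)/(From1, To1) access field into element coordinates can be sketched as follows; the type encoding (1 = column access, otherwise row access, per FIG. 6) is taken from the text, while the function shape is an illustrative assumption:

```python
def decode_access(acc_type, from0, to0, from1, to1):
    """Expand an access field into (row, col) pairs. Type 1 = column access:
    From0..To0 select columns and From1..To1 the rows within each column;
    otherwise row access, with the roles swapped."""
    if acc_type == 1:
        return [(r, c) for c in range(from0, to0 + 1)
                       for r in range(from1, to1 + 1)]
    return [(r, c) for r in range(from0, to0 + 1)
                   for c in range(from1, to1 + 1)]
```

The column-access example from FIG. 6 ((From0, To0) = (0, 2), (From1, To1) = (2, 4)) expands to the 9 elements at rows 2 to 4 of columns 0 to 2.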
  • in this way, the entire processing pipeline of matrix inversion is controlled by instructions, getting rid of the limitation of hard-wired solidification, increasing the flexibility of control, simplifying the control logic, and reducing the complexity of the hardware implementation.
  • Offline instruction programming can encode the matrix inversion processes of various algorithm and dimension combinations according to the application requirements, download them to the instruction RAM statically or dynamically, and configure the mapping relationships into the mapping information at the same time. At run time, the corresponding program is found from the mapping information according to the matrix dimension and executed. Therefore, offline instruction programming not only enhances flexibility and reduces control complexity, but also helps reduce dynamic power consumption.
  • Step 208 According to the number of blocks of the matrix to be processed, configure computing resources in the resource pool, and determine information of computing units paralleled between blocks.
  • There is no timing relationship between step 207 and step 208.
  • Step 209 After sequentially reading the target processing instructions from the target processing instruction storage addresses, determine the parallel computing unit information in the block according to the degree of parallelism in the block.
  • The process of sequentially reading the target processing instructions in step 209 is similar to the specific implementation of step 104 and will not be repeated here.
  • the degree of parallelism included in the target processing instruction in this embodiment refers to the degree of parallelism within a block.
  • the parallel computing unit information in the block can be determined according to the degree of parallelism in the block.
  • the task request includes the number of blocks of the matrix to be processed, and multiple small matrices are spliced into a large matrix and processed in parallel, so as to make full use of the resources, improve resource utilization, and reduce the delay overhead caused by serial processing.
  • matrix inversion is generally calculated element point by element point. If only one element point is processed at a time, although the approach is simple, it does not make sufficient use of the resources, and the delay is relatively large.
  • resources can be organized according to requirements, and multiple element points of the same matrix can be processed in parallel to achieve intra-block parallelism.
  • the intra-block parallelism refers to the number of elements in the same matrix to be processed that are processed in parallel.
  • This embodiment adopts the idea of computing resource pools, and combines computing resources into pools with multiple independent computing resources according to a certain granularity.
  • the computing resource pool can be combined into one or more sets of computing units according to application requirements, to process one large matrix, to process multiple small matrices in parallel, and to process multiple elements of the same matrix in parallel.
  • These resource pools can be combined into multiple parallel processing computing units as required to process multiple matrices or elements in parallel. In this way, since multiple parallel computing units share the same program and control logic, power consumption overhead is reduced, and when processing small matrices, throughput is improved and time delay is reduced.
  • these multiplication units are divided into 8 groups, with 8 in each group.
  • these resources can be organized into a unified large resource pool, or can be organized into a maximum of 8 small resource pools.
  • In step 208, according to the number of blocks of the matrix to be processed in the task request, the various networks are statically configured at a coarse-grained level, and the computing resource pool is organized into computing units for inter-block parallel processing.
  • In step 209, according to the degree of parallelism in the target processing instruction read from the target processing instruction storage address, the various networks and computing units are dynamically adjusted at a fine-grained level, and multiple elements within a block are processed in parallel, thereby controlling the various units to run dynamically as required.
  • Step 210 Process the data stored in the input register file in parallel according to the target processing instruction, the computing unit corresponding to the information of the parallel computing units between the blocks, and the computing unit corresponding to the information of the parallel computing units within the block.
  • the data stored in the input register file may be processed in parallel based on the target processing instruction through the computing unit corresponding to the inter-block parallel computing unit information and the computing unit corresponding to the intra-block parallel computing unit information.
  • the parallel processing process can realize parallel processing among multiple matrix blocks to be processed, and can also realize parallel processing among multiple elements in the same matrix to be processed.
  • the intermediate data acquired during the current processing of the data stored in the input register file can be stored in the output register file or the input register file, and the acquired result data are stored in the output register file.
  • the acquired intermediate data may also be stored in a cache unit of the computing unit. That is, the two register files in this embodiment can be used for input storage, output storage or intermediate temporary storage, respectively. At the same time, according to the number of blocks processed in parallel, multiple small matrices can be stored separately.
  • the target instruction also includes the access address of the target data (ie, the source B and source C fields in FIG. 5 ) and the processing mode of the target data.
  • the access channels of multiple target data can be merged into one access channel; data are read from the input register file and/or the output register file through the merged access channel; the target data are obtained from the read data according to the access addresses of the target data; and, according to the processing method included in the target processing instruction, the computing units corresponding to the inter-block parallel computing unit information and the computing units corresponding to the intra-block parallel computing unit information process the target data in parallel.
  • the computing unit here can be a parallel computing unit between blocks or a parallel computing unit within a block. This embodiment is not limited to this.
  • read control information can be generated. According to the read control information, all the data in the combined access channel are read from the input register file and/or the output register file, and the required target data are then selected from them.
  • relevant preprocessing such as fixed-pointing or conjugation is carried out; the previously collected fixed-pointing data is broadcast or assigned to each fixed-pointing unit as required by the target processing instruction; according to the target processing instruction, the computing units corresponding to the related inter-block parallel computing unit information and the computing units corresponding to the intra-block parallel computing unit information are started, the fixed point is processed dynamically and the fixed-point related data is collected as required by the instruction, and the result is written back to the register file and/or the master unit.
  • the register file stores the data of the lower triangle into the upper triangle at the same time, which is equivalent to making a copy and converts column access into row access. Because simultaneously accessed data is only local, after channel merging the row-read behavior can be confined to a limited number of rows, reducing the number of first-level selectors; after the row data is read, column selection is performed separately. This technique greatly reduces the number of row accesses and thereby a large amount of data-selector logic, which benefits physical back-end implementation.
  • FIG. 7 is a schematic diagram of access channel merging, in which eight 8x8 matrix blocks are completed at the same time. When the processing instruction shown in the figure is executed, eight independent channels would be required to read eight rows of data simultaneously; after the channels are merged, only four channels are needed to read four rows simultaneously. In this way, the number of read selectors can be greatly reduced, which benefits back-end wiring.
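The channel-merging idea can be illustrated with a small sketch (a hypothetical Python model, not the patented hardware; all names are invented): element requests that hit the same row are collapsed into one shared row-read channel, and column selection is performed afterward on the shared row data.

```python
def merge_access_channels(requests):
    """Group element requests (row, col) by row so that requests hitting
    the same row share one read channel; column selection happens after
    the shared row read."""
    channels = {}  # row -> list of requested columns
    for row, col in requests:
        channels.setdefault(row, []).append(col)
    return channels

# Eight units each request one element, but only four distinct rows are
# touched, so four shared row-read channels suffice instead of eight.
requests = [(0, 1), (0, 5), (2, 3), (2, 7), (4, 0), (4, 2), (6, 6), (6, 4)]
channels = merge_access_channels(requests)
```

Here eight independent requests collapse into four merged row reads, mirroring the 8-to-4 channel reduction described for FIG. 7.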
  • in step 209, if it is determined that the current running address equals the instruction end address in the target processing instruction storage address, the execution of the inversion instructions for the matrix to be processed is complete, and the input register file is released. Then, according to the task parameters and the fixed-pointing result of the master control, the output data is post-processed, the relevant result parameters and data are output, and the output register file is set to an idle state, completing the lower triangular inversion of the entire matrix.
  • the input-calculation-output three-stage pipeline is loosely coupled: the stages are only weakly correlated, and execution of a downstream stage is triggered only by a trigger signal.
  • the task input module receives the task request
  • the data in the matrix to be processed included in the received task request is stored in the input register file through the data write interface;
  • the computing units in the resource pool are configured as the inter-block parallel computing unit information; the target processing instruction storage address is read according to the task request and the mapping information (e.g., a mapping table), and the target processing instructions, instruction 0 to instruction N, are obtained; each target processing instruction is executed in sequence through instruction fetching, decoding and control; according to the target processing instruction, the resources in the resource pool are configured at fine granularity to determine the intra-block parallel computing unit information; and the data is processed in parallel by the configured computing units.
  • the resource pool in FIG. 3 includes a Multiply Accumulate (MAC) unit.
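A minimal behavioral model of one unit in such a resource pool may help; this is an illustrative sketch only (the class and method names are invented here, and complex operands are assumed because MIMO matrix data is typically complex-valued):

```python
class MacUnit:
    """Behavioral model of one multiply-accumulate (MAC) unit: acc += a * b."""
    def __init__(self):
        self.acc = 0 + 0j

    def mac(self, a, b):
        self.acc += a * b
        return self.acc

# A dot product of two length-4 complex vectors computed on one MAC unit.
u = MacUnit()
x = [1 + 1j, 2, 0 - 1j, 3]
y = [1, 1 - 1j, 2j, 1]
for a, b in zip(x, y):
    u.mac(a, b)
```

A resource pool would group many such units and route operands to them through the configured broadcast networks.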
  • the data processing method provided by this embodiment overcomes the problem that traditional matrix inversion methods cannot simultaneously satisfy indicators such as versatility, throughput, complexity and low delay. Drawing on the SIMD+ASIC idea, it proposes a highly configurable and universal matrix inversion implementation that can adapt to different protocols and their continuous evolution. At the same time, it combines techniques such as channel merging, dual vector register files, inter-block and intra-block parallelism, and various broadcast networks, and adopts the resource pool idea to reduce implementation cost as well as development and back-end risks, meeting the needs of various algorithms, dimensions and application scenarios.
  • FIG. 8 is a schematic structural diagram of a data processing apparatus provided by another embodiment.
  • the data processing apparatus provided in this embodiment includes the following modules: a storage module 81 , a first determination module 82 , a second determination module 83 , and a processing module 84 .
  • the storage module 81 is configured to store the data in the matrix to be processed included in the received task request in the input register file.
  • the task request includes: the data in the matrix to be processed, the dimension of the matrix to be processed, and the number of blocks of the matrix to be processed.
  • the first determining module 82 is configured to determine the target processing instruction storage address corresponding to the matrix to be processed according to the dimension of the matrix to be processed and preset mapping information.
  • the mapping information is used to indicate the mapping relationship between the dimension of the matrix and the storage address of the processing instruction.
  • the target processing instruction stored in the target processing instruction storage address is used to determine the inverse matrix of the matrix to be processed.
  • the second determination module 83 is configured to configure the computing resources in the resource pool according to the number of blocks of the matrix to be processed, and to determine the parallel computing unit information between the blocks.
  • the processing module 84 is configured to sequentially read the target processing instructions from the target processing instruction storage addresses, and process the data stored in the input register file in parallel according to the target processing instructions and the computing units corresponding to the parallel computing unit information between blocks.
  • the target processing instruction further includes an intra-block degree of parallelism.
  • the apparatus further includes: a third determination module configured to determine the parallel computing unit information within the block according to the degree of parallelism within the block.
  • the processing module 84 is specifically configured to process the data stored in the input register file in parallel according to the target processing instruction, the computing units corresponding to the information of the parallel computing units between the blocks, and the computing units corresponding to the information of the parallel computing units within the blocks.
  • the target processing instruction also includes a delay period.
  • the matrix to be processed is a lower triangular matrix.
  • the storage module 81 is specifically configured to: store the data in the matrix to be processed included in the received task request in the input register file according to the preset storage format of the lower triangular matrix; and copy the data in the matrix to be processed included in the received task request, storing the copied data according to the storage format of the upper triangular matrix.
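The dual lower/upper-triangle storage described above can be sketched as follows (an assumed software model; the function name and row-major layout are illustrative): each lower-triangle element is written twice, once at (i, j) and once mirrored at (j, i), so a column of the lower triangle can later be read as a row.

```python
def store_with_mirror(lower_elems, n):
    """Store a lower triangular matrix and simultaneously mirror it into
    the upper triangle, so that column access to the lower triangle can
    later be served as row access. lower_elems is row-major: row i
    contributes i+1 values."""
    m = [[0] * n for _ in range(n)]
    it = iter(lower_elems)
    for i in range(n):
        for j in range(i + 1):
            v = next(it)
            m[i][j] = v  # lower triangle entry
            m[j][i] = v  # mirrored copy in the upper triangle
    return m

m = store_with_mirror([1, 2, 3, 4, 5, 6], 3)
```

With this layout, reading column 0 of the lower triangle is the same as reading row 0 of the stored matrix, which is the row-access conversion the text relies on.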
  • the storage module 81 is further configured to: store the intermediate data obtained in the current processing of the data stored in the input register file in the output register file or the input register file, and store the obtained result data in the output register file.
  • the target instruction further includes an access address of the target data and a processing method of the target data.
  • the processing module 84 is specifically configured to: if it is determined, according to the access address of the target data, that the multiple target data to be accessed by the computing units lie in the same row or column, combine the access channels of the multiple target data into one access channel; read data from the input register file and/or the output register file according to the combined access channel; obtain the target data from the read data according to its access address; and, according to the processing mode included in the target processing instruction, process the target data in parallel through the computing units corresponding to the inter-block parallel computing unit information and the computing units corresponding to the intra-block parallel computing unit information.
  • the apparatus further includes: a judgment module, configured to judge whether the input register file is idle when it is determined that there is a task request at the task interface; and a receiving and setting module, configured to receive the task request when it is determined that the input register file is idle, and to set the state of the input register file to busy.
  • the judgment module is further configured to judge whether the output register file is free.
  • the apparatus further includes: a setting module and a fourth determination module.
  • the setting module is configured to set the state of the output register file to a busy state when the output register file is determined to be idle.
  • the fourth determination module is configured to, after the state of the output register file is set to the busy state, trigger execution of the step of determining the target processing instruction storage address corresponding to the matrix to be processed according to the dimension of the matrix to be processed and the preset mapping information.
  • the data processing apparatus provided in this embodiment is used to execute the data processing method in any of the foregoing embodiments.
  • the implementation principle and technical effect of the data processing apparatus provided in this embodiment are similar to those of the method embodiments, and details are not repeated here.
  • FIG. 9 is a schematic structural diagram of a data processing device provided by an embodiment.
  • the data processing device includes a processor 91 and a memory 92; the number of processors 91 in the data processing device may be one or more, and one processor 91 is taken as an example in FIG. 9;
  • the processor 91 and the memory 92 can be connected by a bus or other means, and the connection by a bus is taken as an example in FIG. 9 .
  • the memory 92 can be used to store software programs, computer-executable programs, and modules, such as the program instructions/modules corresponding to the data processing method in the embodiments of the present application (for example, the storage module 81, the first determination module 82, the second determination module 83 and the processing module 84 in the data processing apparatus).
  • the processor 91 executes the software programs, instructions, and modules stored in the memory 92 to perform various functional applications and data processing of the data processing device, that is, to implement the above-mentioned data processing method.
  • the memory 92 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the data processing apparatus, and the like. Additionally, memory 92 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device.
  • Embodiments of the present application also provide a storage medium containing computer-executable instructions, and the computer-executable instructions are used to execute a data processing method when executed by a computer processor, and the method includes:
  • storing the data in the matrix to be processed included in the received task request in the input register file, wherein the task request includes: the data in the matrix to be processed, the dimension of the matrix to be processed, and the number of blocks of the matrix to be processed;
  • determining the target processing instruction storage address corresponding to the matrix to be processed according to the dimension of the matrix to be processed and preset mapping information, wherein the mapping information is used to indicate the mapping relationship between the dimension of the matrix and the storage address of the processing instruction, and the target processing instruction stored at the target processing instruction storage address is used to determine the inverse matrix of the matrix to be processed;
  • configuring the computing resources in the resource pool according to the number of blocks of the matrix to be processed, and determining the inter-block parallel computing unit information;
  • sequentially reading the target processing instructions from the target processing instruction storage address, and processing the data stored in the input register file in parallel according to the target processing instructions and the computing units corresponding to the inter-block parallel computing unit information.
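The method steps can be sketched as a minimal control flow. This is a hypothetical software model, not the claimed hardware: all names here (process_task, the instruction memory as a list of callables, the striped group split) are illustrative assumptions.

```python
def process_task(task, mapping, instr_mem, resource_pool, input_regfile):
    """Hypothetical sketch of the method flow:
    store -> look up instruction address -> configure units -> execute."""
    # 1. Store the matrix data from the task request in the input register file.
    input_regfile.extend(task["data"])
    # 2. Map the matrix dimension to the target instruction address range.
    start, end = mapping[task["dim"]]
    # 3. Configure the resource pool: one computing-unit group per matrix block.
    groups = [resource_pool[i::task["blocks"]] for i in range(task["blocks"])]
    # 4. Read the target instructions in sequence; each instruction is applied
    #    to every group (parallelism is modeled as a sequential loop here).
    for instr in instr_mem[start:end]:
        for g in groups:
            instr(g, input_regfile)
    return groups

log = []
instr_mem = [lambda g, rf: log.append((tuple(g), len(rf)))]
mapping = {4: (0, 1)}
groups = process_task({"data": [1, 2], "dim": 4, "blocks": 2},
                      mapping, instr_mem, ["u0", "u1", "u2", "u3"], [])
```

The split into groups models the inter-block parallel computing unit information: each matrix block gets its own group of units drawn from the shared pool.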
  • in the storage medium containing computer-executable instructions provided by the present application, the computer-executable instructions are not limited to the above method operations, and can also perform related operations in the data processing method provided by any embodiment of the present application.
  • the various embodiments of the present application may be implemented in hardware or special purpose circuits, software, logic, or any combination thereof.
  • some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software that may be executed by a controller, microprocessor or other computing device, although the application is not limited thereto.
  • the division between functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be executed cooperatively by several physical components.
  • Some or all physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor or microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit.
  • Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media).
  • Computer storage media include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for the storage of information such as computer-readable instructions, data structures, program modules or other data.
  • Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer.
  • Communication media typically embody computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism, and can include any information delivery media, as is well known to those of ordinary skill in the art.


Abstract

A data processing method and device, and a storage medium, which relate to the technical field of communications. The method comprises: storing data in a matrix to be processed comprised in a received task request into an input register file (101); according to the dimension of the matrix and preset mapping information, determining a target processing instruction storage address corresponding to the matrix (102); according to the number of blocks of the matrix, configuring a computational resource in a resource pool, and determining inter-block parallel computing unit information (103); and sequentially reading a target processing instruction from the target processing instruction storage address, and, according to the target processing instruction and a computing unit corresponding to the inter-block parallel computing unit information, processing in parallel the data stored in the input register file (104).

Description

Data Processing Method, Device and Storage Medium
Cross Reference
This application is based on, and claims priority to, the Chinese patent application with application number 202010761369.X filed on July 31, 2020, the entire content of which is incorporated herein by reference.
Technical Field
The embodiments of the present application relate to the field of communication technologies, and in particular, to a data processing method, device, and storage medium.
Background
Multiple-Input Multiple-Output (MIMO) technology uses multiple antennas (or antenna arrays) and multiple channels to transmit and receive signals simultaneously at both the transmitter and the receiver. Transmitting and receiving over multiple antenna groups greatly improves data transmission efficiency and signal stability, and thereby qualitatively improves network quality, network speed and user capacity. Therefore, MIMO is a core technology of fifth-generation mobile communication networks and Wi-Fi 6, and is central to future wireless network development.
Signal processing in MIMO requires extensive matrix operations, of which matrix inversion is the most complex. Owing to the conjugate symmetry of matrices in MIMO, the inversion of lower triangular matrices is particularly important: the computation speed of lower triangular matrix inversion directly affects the real-time performance of the MIMO system, and thus the performance of the entire system.
There are many matrix inversion methods, and different methods suit different dimensions with different performance. For example, a 2x2 matrix can be inverted by directly expanding the inversion formula; a 4x4 matrix can be inverted either by computing the adjoint matrix by definition or by the block inversion decomposition algorithm from matrix theory; and for dimensions of 6x6 and above, a simplified Cholesky decomposition algorithm can be used. In the era of massive MIMO, matrices of various dimensions, including high-dimensional matrices, must be inverted. The computational complexity of matrix inversion grows as a power of the dimension, so for large matrices the complexity and amount of computation are very high. In short, matrix inversion in MIMO must support many algorithms and large dimensions, has high computational complexity, and must meet short processing times.
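The dimension-dependent algorithm selection described above can be sketched as a small dispatch function (illustrative only; the thresholds follow the 2x2 / 4x4 / 6x6-and-above scheme given in the text, and the return labels are invented names):

```python
def select_inversion_algorithm(n):
    """Pick an inversion algorithm by matrix dimension, mirroring the
    scheme described in the text: direct expansion for dimensions up to
    2, block inversion for up to 4, Cholesky-based decomposition above."""
    if n <= 2:
        return "direct"
    if n <= 4:
        return "block"
    return "cholesky"
```

In the described system, this choice is made offline: each dimension's routine is compiled to processing instructions, and the mapping table selects the routine at run time.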
At present, matrix inversion is mainly implemented with application-specific integrated circuits (ASICs), which are hard-wired to process one or several types of matrix inversion algorithms and can achieve optimal efficiency and power consumption.
However, in this approach, the matrix types an ASIC can process are limited, and its scalability and tailorability are poor.
Summary of the Invention
An embodiment of the present application provides a data processing method, including the following steps: storing the data in a matrix to be processed included in a received task request in an input register file, wherein the task request includes the data in the matrix to be processed, the dimension of the matrix to be processed, and the number of blocks of the matrix to be processed; determining a target processing instruction storage address corresponding to the matrix to be processed according to the dimension of the matrix to be processed and preset mapping information, wherein the mapping information indicates the mapping relationship between matrix dimensions and processing instruction storage addresses, and the target processing instructions stored at the target processing instruction storage address are used to determine the inverse matrix of the matrix to be processed; configuring the computing resources in a resource pool according to the number of blocks of the matrix to be processed, and determining inter-block parallel computing unit information; and sequentially reading the target processing instructions from the target processing instruction storage address, and processing the data stored in the input register file in parallel according to the target processing instructions and the computing units corresponding to the inter-block parallel computing unit information.
An embodiment of the present application further provides a data processing device, including a memory, a processor, a program stored in the memory and runnable on the processor, and a data bus for connection and communication between the processor and the memory, wherein the program, when executed by the processor, implements the steps of the foregoing method.
An embodiment of the present application provides a storage medium for computer-readable storage, storing one or more programs that can be executed by one or more processors to implement the steps of the foregoing method.
Description of Drawings
FIG. 1 is a schematic flowchart of a data processing method provided by an embodiment;
FIG. 2 is a schematic flowchart of a data processing method provided by another embodiment;
FIG. 3 is a schematic structural diagram of a data processing apparatus provided by an embodiment;
FIG. 4 is a schematic diagram of a lower triangular matrix storage format provided by an embodiment;
FIG. 5 is a schematic diagram of a processing instruction format;
FIG. 6 is a schematic diagram of access types of data included in a processing instruction;
FIG. 7 is a schematic diagram of access channel merging;
FIG. 8 is a schematic structural diagram of a data processing apparatus provided by another embodiment;
FIG. 9 is a schematic structural diagram of a data processing device provided by an embodiment.
Detailed Description
It should be understood that the specific embodiments described herein are only used to explain the embodiments of the present application and are not intended to limit them.
In the following description, suffixes such as "module", "component" or "unit" used to denote elements are only intended to facilitate the description of the embodiments of the present application and have no specific meaning in themselves. Thus, "module", "component" and "unit" may be used interchangeably.
The main purpose of the embodiments of the present application is to provide a data processing method, device and storage medium, aiming to efficiently determine the inverse matrices of matrices of various dimensions.
FIG. 1 is a schematic flowchart of a data processing method provided by an embodiment. This embodiment applies to the scenario of determining the inverse matrix of a received matrix. It may be executed by a data processing apparatus, which may be implemented in software and/or hardware and may be integrated into a communication device such as a multi-mode base station or a multi-mode terminal. As shown in FIG. 1, the data processing method provided by this embodiment includes the following steps:
Step 101: Store the data in the matrix to be processed included in the received task request in the input register file.
The task request includes: the data in the matrix to be processed, the dimension of the matrix to be processed, and the number of blocks of the matrix to be processed.
The dimension of the matrix to be processed in this embodiment may be any dimension, for example, 1, 2, 4, 8, 16, 24 or 32. The task request may be sent to the data processing apparatus by another device, or generated by another module of the apparatus. The number of blocks of the matrix to be processed refers to the number of matrices to be processed.
Optionally, the matrix to be processed in this embodiment may be a lower triangular matrix. In massive MIMO, matrix inversion plays an important role in channel detection and precoding algorithms.
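For reference, the inverse of a nonsingular lower triangular matrix can be computed by forward substitution, solving L * X = I column by column. This is a plain software sketch of the mathematical operation, not the hardware flow claimed by this application:

```python
def invert_lower_triangular(L):
    """Invert a nonsingular lower triangular matrix by forward
    substitution: solve L * X = I column by column. The inverse of a
    lower triangular matrix is itself lower triangular."""
    n = len(L)
    X = [[0.0] * n for _ in range(n)]
    for j in range(n):
        X[j][j] = 1.0 / L[j][j]
        for i in range(j + 1, n):
            s = sum(L[i][k] * X[k][j] for k in range(j, i))
            X[i][j] = -s / L[i][i]
    return X

L = [[2.0, 0.0, 0.0], [1.0, 3.0, 0.0], [4.0, 5.0, 6.0]]
X = invert_lower_triangular(L)
# L times X should give the identity matrix.
prod = [[sum(L[i][k] * X[k][j] for k in range(3)) for j in range(3)]
        for i in range(3)]
```

The inner sum over k is exactly the kind of multiply-accumulate workload that the resource pool of MAC units is sized for.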
Step 102: Determine the target processing instruction storage address corresponding to the matrix to be processed according to the dimension of the matrix to be processed and preset mapping information.
The mapping information indicates the mapping relationship between matrix dimensions and processing instruction storage addresses. The target processing instructions stored at the target processing instruction storage address are used to determine the inverse matrix of the matrix to be processed.
In this embodiment, the application can be analyzed in advance according to its requirements, all matrix dimensions needed by the application can be extracted, and the inversion algorithm and flow for each dimension can be determined through algorithm performance simulation. For example, 5G MIMO processing requires matrix inversion from 1x1 to 64x64, with 2x2 handled by a direct inversion algorithm, 4x4 by a block inversion algorithm, and the rest by the Cholesky decomposition algorithm. Meanwhile, a 64x64 task can be decomposed into 4x16x16, 8x8x8 or 64x1x1, and configurations such as 8x2x2 and 8x4x4 must also be supported.
According to the algorithm, these specifications are converted into pseudo-code by scripts and mapped to the various hardware resources in the form of processing instructions; the address ranges of the codes for the various specifications are then collected into a configuration file, to be downloaded dynamically or statically into the hardware instruction memory of the data processing apparatus, for example, into a random access memory (RAM).
After power-on reset, the data processing apparatus can load the above configuration file and download the processing instructions and the mapping information. The processing instructions represent the calculation and control flow, and the mapping information stores the mapping relationship among the matrix dimension, the processing instruction start address and the processing instruction end address (i.e., the processing instruction storage address).
In step 102, the target processing instruction storage address corresponding to the matrix to be processed can be determined according to the dimension of the matrix to be processed and a preset mapping table, i.e., the correspondence between matrix dimensions and processing instruction storage addresses. The target processing instructions stored at this address are used to determine the inverse matrix of the matrix to be processed.
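The mapping information can be modeled as a table from matrix dimension to an instruction address range; the addresses and routine comments below are invented purely for illustration:

```python
# Hypothetical mapping table downloaded at reset: for each supported matrix
# dimension, the start and end addresses of its inversion routine in the
# instruction RAM. All addresses are made up for this sketch.
MAPPING = {
    2: (0x000, 0x020),   # direct inversion routine
    4: (0x020, 0x060),   # block inversion routine
    8: (0x060, 0x100),   # Cholesky-based routine
}

def target_instruction_address(dim):
    """Resolve the target processing instruction storage address (start
    and end) for a matrix dimension; unsupported dimensions are rejected."""
    if dim not in MAPPING:
        raise ValueError(f"unsupported matrix dimension: {dim}")
    return MAPPING[dim]
```

At run time, the controller then fetches and executes instructions from start to end, which is the address-equality check used to detect completion in step 209.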
步骤103:根据待处理矩阵的块数,配置资源池中的计算资源,确定块间并行的计算单元信息。Step 103: According to the number of blocks of the matrix to be processed, configure the computing resources in the resource pool, and determine the parallel computing unit information between the blocks.
本实施例中的资源池中的计算资源可以包括乘法器等计算资源。矩阵求逆的吞吐量主要由乘法器决定,根据如上提取的矩阵块数和维度,确定计算一个元素最大需要的乘法单元的数量,例如,可以为64个乘法单元。同时,为了支持块间并行,可以将这些乘法单元分组。通过各种网络的动态配置,这些资源,即可以组织成一个统一的1个大资源池,也可以组织成多个小资源池。The computing resources in the resource pool in this embodiment may include computing resources such as multipliers. The throughput of matrix inversion is mainly determined by the multiplier. According to the number of matrix blocks and dimensions extracted above, the maximum number of multiplication units required to calculate one element is determined, for example, it can be 64 multiplication units. Meanwhile, to support inter-block parallelism, these multiplication units can be grouped. Through the dynamic configuration of various networks, these resources can be organized into a unified large resource pool or into multiple small resource pools.
在步骤103中,根据待处理矩阵的块数,配置资源池中的计算资源,确定块间并行的计算单元信息。举例来说,假设待处理矩阵的块数为4个,则可以将资源池中的计算资源分为4个独立的网络,这4个独立的网络并行处理4个待处理矩阵中的数据。In step 103, according to the number of blocks of the matrix to be processed, the computing resources in the resource pool are configured, and the information of the computing units paralleled between the blocks is determined. For example, assuming that the number of blocks of the matrix to be processed is 4, the computing resources in the resource pool can be divided into 4 independent networks, and the 4 independent networks process the data in the 4 matrices to be processed in parallel.
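A minimal sketch of this configuration step, assuming the resource pool is simply a list of multiplier units (the real apparatus reconfigures hardware networks, not Python lists):

```python
def partition_resource_pool(multiplier_units, num_blocks):
    """Split the pool of multiplier units into `num_blocks` independent
    groups, one per to-be-processed matrix block, so the blocks can be
    processed in parallel.  Illustrative software model only."""
    group_size = len(multiplier_units) // num_blocks
    return [multiplier_units[i * group_size:(i + 1) * group_size]
            for i in range(num_blocks)]
```

For the example in the text, 4 blocks yield 4 independent groups, each processing the data of one matrix.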
这里的计算单元信息用于指示块间并行的计算单元,以及,计算单元与待处理矩阵的对应关系。本实施例中的块间并行的计算单元指的是并行处理不同的待处理矩阵中的数据的计算单元。The calculation unit information here is used to indicate the parallel calculation units between blocks, and the corresponding relationship between the calculation units and the matrix to be processed. The inter-block parallel computing unit in this embodiment refers to a computing unit that processes data in different matrices to be processed in parallel.
需要说明的是，步骤102与步骤103的执行过程没有时序关系。两者可以同时执行，也可以以任意的顺序执行。It should be noted that there is no ordering constraint between step 102 and step 103; they can be executed simultaneously or in either order.
步骤104:从目标处理指令存储地址中依次读取目标处理指令,并根据目标处理指令以及块间并行的计算单元信息对应的计算单元,并行处理输入寄存器文件中存储的数据。Step 104: Read the target processing instructions sequentially from the target processing instruction storage addresses, and process the data stored in the input register file in parallel according to the target processing instructions and the computing units corresponding to the computing unit information paralleled between blocks.
本实施例中的目标处理指令存储地址中存储的目标处理指令的个数可以为多个。在步骤104中，依次读取目标处理指令。在每读取到一个目标处理指令之后，根据目标处理指令以及块间并行的计算单元信息对应的计算单元，并行处理输入寄存器文件中存储的数据。The number of target processing instructions stored at the target processing instruction storage address in this embodiment may be more than one. In step 104, the target processing instructions are read sequentially. After each target processing instruction is read, the data stored in the input register file is processed in parallel according to the target processing instruction and the computing units corresponding to the inter-block parallel computing unit information.
目标处理指令中包括访问地址及处理方式等，可以根据目标处理指令中的访问地址，从输入寄存器文件中读取到数据后，根据目标处理指令中包括的处理方式，对该读取到的数据进行处理。The target processing instruction includes an access address, a processing mode, and the like. After data is read from the input register file according to the access address in the target processing instruction, the read data is processed according to the processing mode included in the target processing instruction.
本实施例中,在执行完目标处理指令存储地址中的所有目标处理指令之后,即可以获取到待处理矩阵的逆矩阵。In this embodiment, after all the target processing instructions in the storage address of the target processing instruction are executed, the inverse matrix of the matrix to be processed can be obtained.
一实施例中,可以通过以下过程执行步骤104:In one embodiment, step 104 may be performed through the following process:
步骤1041:将当前运行地址设置为目标处理指令存储地址中的第i个地址。Step 1041: Set the current running address as the i-th address in the storage address of the target processing instruction.
步骤1042:根据第i个地址对应的目标处理指令以及块间并行的计算单元信息对应的计算单元,并行处理输入寄存器文件中存储的数据。Step 1042 : Process the data stored in the input register file in parallel according to the target processing instruction corresponding to the ith address and the computing unit corresponding to the computing unit information in parallel between the blocks.
步骤1043：在处理完第i个地址对应的目标处理指令之后，在确定当前运行地址不等于目标处理指令存储地址中的指令结束地址时，确定i=i+1，返回执行将当前运行地址设置为目标处理指令存储地址中的第i个地址的步骤。Step 1043: After the target processing instruction corresponding to the i-th address has been processed, and when it is determined that the current running address is not equal to the instruction end address among the target processing instruction storage addresses, set i=i+1 and return to the step of setting the current running address to the i-th address among the target processing instruction storage addresses.
需要说明的是,在初始运行时,当前运行地址为目标处理指令存储地址中的开始地址。i为大于或者等于0的整数。It should be noted that, during initial operation, the current operation address is the start address in the storage address of the target processing instruction. i is an integer greater than or equal to 0.
更进一步地，由于矩阵求逆算法中某些数据之间有较强的依赖关系，为了避免读写冲突，某些计算流程之间需要插入等待周期。基于该需求，目标处理指令还包括延迟周期。上述步骤1043的实现过程具体为：在确定当前运行地址不等于目标处理指令存储地址中的指令结束地址之后，在延迟第i个地址对应的目标处理指令中包括的延迟周期之后，确定i=i+1，返回执行将当前运行地址设置为目标处理指令存储地址中的第i个地址的步骤。举例来说，假设i=3，目标处理指令存储地址中的指令结束地址为第11个地址，在处理完第3个地址对应的目标处理指令之后，在确定当前运行地址不等于目标处理指令存储地址中的指令结束地址时，在延迟第3个地址对应的目标处理指令中包括的延迟周期之后，确定i=4，返回执行将当前运行地址设置为目标处理指令存储地址中的第4个地址的步骤。Furthermore, since some data in the matrix inversion algorithm have strong dependencies, waiting cycles need to be inserted between certain calculation steps to avoid read-write conflicts. Based on this requirement, the target processing instruction further includes a delay cycle. The implementation of step 1043 above is specifically: after determining that the current running address is not equal to the instruction end address among the target processing instruction storage addresses, and after waiting for the delay cycles included in the target processing instruction corresponding to the i-th address, set i=i+1 and return to the step of setting the current running address to the i-th address among the target processing instruction storage addresses. For example, assume i=3 and the instruction end address among the target processing instruction storage addresses is the 11th address. After the target processing instruction corresponding to the 3rd address has been processed, when it is determined that the current running address is not equal to the instruction end address, and after the delay cycles included in the target processing instruction corresponding to the 3rd address have elapsed, i is set to 4, and execution returns to the step of setting the current running address to the 4th address among the target processing instruction storage addresses.
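Steps 1041 to 1043, including the halt (delay) cycles, amount to a simple fetch-execute loop. A sketch, with `fetch`, `execute`, and `wait` as hypothetical callbacks standing in for the hardware:

```python
def run_program(start_addr, end_addr, fetch, execute, wait):
    """Walk the instruction addresses in order, execute each instruction,
    and honour its halt (delay) cycles before advancing, mirroring steps
    1041-1043.  Sketch only; the instruction encoding is hypothetical."""
    addr = start_addr                  # initially, the program start address
    while True:
        instr = fetch(addr)            # step 1041/1042: read and execute
        execute(instr)
        if addr == end_addr:           # step 1043: stop at the end address
            break
        wait(instr.get("halt", 0))     # insert the instruction's wait cycles
        addr += 1                      # i = i + 1, continue with next address
```

Once the loop reaches the instruction end address, all target processing instructions have been executed and the inverse matrix is available.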
本实施例提供的数据处理方法，借鉴了单指令多数据流（Single Instruction Multiple Data，SIMD）和ASIC的思想，可以实现通用可配置矩阵求逆方法和装置，从而，既可获得SIMD的可编程性和可扩展性的优点，也可获得ASIC低时延、高效率和低功耗的优点。The data processing method provided in this embodiment draws on the ideas of Single Instruction Multiple Data (SIMD) and ASIC design, and can implement a general, configurable matrix inversion method and apparatus, thereby obtaining both the programmability and scalability advantages of SIMD and the low-latency, high-efficiency, low-power advantages of an ASIC.
本实施例提供的数据处理方法可以应用于5G无线移动通信、深空通信、光纤通信、卫星数字视频和音频广播等MIMO技术领域。The data processing method provided in this embodiment can be applied to MIMO technical fields such as 5G wireless mobile communication, deep space communication, optical fiber communication, satellite digital video and audio broadcasting.
本实施例提供一种数据处理方法，包括：将接收到的任务请求中包括的待处理矩阵中的数据存储在输入寄存器文件中，其中，任务请求包括：待处理矩阵中的数据、待处理矩阵的维度以及待处理矩阵的块数；根据待处理矩阵的维度，以及预置的映射信息，确定待处理矩阵对应的目标处理指令存储地址，其中，映射信息用于指示矩阵的维度与处理指令存储地址之间的映射关系，目标处理指令存储地址中存储的目标处理指令用于确定待处理矩阵的逆矩阵；根据待处理矩阵的块数，配置资源池中的计算资源，确定块间并行的计算单元信息；从目标处理指令存储地址中依次读取目标处理指令，并根据目标处理指令以及块间并行的计算单元信息对应的计算单元，并行处理输入寄存器文件中存储的数据。该数据处理方法，可以根据待处理矩阵的维度，获取到待处理矩阵对应的目标处理指令存储地址，同时，还可以根据待处理矩阵的块数，配置资源池中的计算资源，确定块间并行的计算单元信息，之后，根据目标处理指令存储地址中依次读取到的目标处理指令以及块间并行的计算单元信息对应的计算单元，并行处理输入寄存器文件中存储的数据，一方面，可以实现处理多种维度的矩阵，可编程性和可扩展性强，另一方面，可以并行处理数据，时延较低、效率较高。This embodiment provides a data processing method, including: storing the data in a to-be-processed matrix included in a received task request in an input register file, where the task request includes the data in the to-be-processed matrix, the dimension of the to-be-processed matrix, and the number of blocks of the to-be-processed matrix; determining, according to the dimension of the to-be-processed matrix and preset mapping information, a target processing instruction storage address corresponding to the to-be-processed matrix, where the mapping information indicates the mapping relationship between matrix dimensions and processing instruction storage addresses, and the target processing instructions stored at the target processing instruction storage address are used to determine the inverse of the to-be-processed matrix; configuring computing resources in a resource pool according to the number of blocks of the to-be-processed matrix, and determining inter-block parallel computing unit information; and sequentially reading the target processing instructions from the target processing instruction storage address, and processing the data stored in the input register file in parallel according to the target processing instructions and the computing units corresponding to the inter-block parallel computing unit information. With this data processing method, the target processing instruction storage address corresponding to the to-be-processed matrix can be obtained from its dimension, and at the same time the computing resources in the resource pool can be configured according to its number of blocks to determine the inter-block parallel computing unit information; the data stored in the input register file is then processed in parallel according to the sequentially read target processing instructions and the corresponding computing units. On the one hand, matrices of various dimensions can be processed, with strong programmability and scalability; on the other hand, data can be processed in parallel, with low latency and high efficiency.
图2为另一实施例提供的数据处理方法实施例的流程示意图。本实施例在图1所示实施例及各种可选方案的基础上,对数据处理方法包括的其他步骤作一详细说明。如图2所示,本实施例提供的数据处理方法包括如下步骤:FIG. 2 is a schematic flowchart of an embodiment of a data processing method provided by another embodiment. In this embodiment, on the basis of the embodiment shown in FIG. 1 and various optional solutions, other steps included in the data processing method are described in detail. As shown in Figure 2, the data processing method provided by this embodiment includes the following steps:
步骤201:在确定任务接口存在任务请求时,判断输入寄存器文件是否为空闲。Step 201: When it is determined that there is a task request in the task interface, determine whether the input register file is free.
步骤202:在确定输入寄存器文件空闲时,接收任务请求,并将输入寄存器文件的状态设置为忙状态。Step 202: When it is determined that the input register file is idle, a task request is received, and the state of the input register file is set to a busy state.
在步骤201中,在判断输入寄存器文件为空闲时,接收任务请求,可以避免写操作错误。同时,将输入寄存器文件的状态设置为忙状态,可以避免输入寄存器文件被其他任务占用而导致数据处理过程出错。为了提高效率,可以在接收任务请求的同时,将输入寄存器文件的状态设置为忙状态。当然,也可以在接收任务请求之前,将输入寄存器文件的状态设置为忙状态。In step 201, when it is judged that the input register file is idle, a task request is received, which can avoid a write operation error. At the same time, setting the state of the input register file to the busy state can prevent the input register file from being occupied by other tasks and cause errors in the data processing process. To improve efficiency, the state of the input register file can be set to the busy state while receiving the task request. Of course, the state of the input register file can also be set to the busy state before receiving the task request.
步骤203:将接收到的任务请求中包括的待处理矩阵中的数据存储在输入寄存器文件中。Step 203: Store the data in the matrix to be processed included in the received task request in the input register file.
其中,任务请求包括:待处理矩阵中的数据、待处理矩阵的维度以及待处理矩阵的块数。The task request includes: the data in the matrix to be processed, the dimension of the matrix to be processed, and the number of blocks of the matrix to be processed.
可选地，本实施例中的待处理矩阵可以为下三角矩阵。步骤203具体可以为：按照预设的下三角矩阵的存储格式，将接收到的任务请求中包括的待处理矩阵中的数据存储在输入寄存器文件中；拷贝接收到的任务请求中包括的待处理矩阵中的数据，并按照上三角矩阵的存储格式存储拷贝的数据。Optionally, the to-be-processed matrix in this embodiment may be a lower triangular matrix. Step 203 may specifically be: storing the data in the to-be-processed matrix included in the received task request in the input register file according to a preset lower-triangular-matrix storage format; and copying the data in the to-be-processed matrix included in the received task request, and storing the copied data according to an upper-triangular-matrix storage format.
可选地,在将数据存储之后,还可以进行数据拉齐等预处理操作。Optionally, after the data is stored, preprocessing operations such as data alignment may also be performed.
图4为一实施例提供的下三角矩阵存储格式的示意图。如图4所示，假设输入寄存器文件最小存储单元可以存储8*8的矩阵，并且，该输入寄存器文件有4行、4列该最小存储单元。那么，如图4中(1)图所示，该输入寄存器文件可以存储1~64个1x1矩阵；如图4中(2)图所示，该输入寄存器文件可以存储1~8个8x8矩阵，存储的位置如(2)图中的灰色区域所示；如图4中(3)图所示，该输入寄存器文件可以存储1~4个16x16矩阵，存储的位置如(3)图中的灰色区域所示；如图4中(4)图所示，该输入寄存器文件可以存储1个32x32矩阵，存储的位置如(4)图中的灰色区域所示。进一步地，(4)图中的虚线之上的灰色区域可以存储1个24x24矩阵。FIG. 4 is a schematic diagram of a lower-triangular-matrix storage format provided by an embodiment. As shown in FIG. 4, assume that the minimum storage unit of the input register file can store an 8*8 matrix, and that the input register file has 4 rows and 4 columns of such minimum storage units. Then, as shown in diagram (1) of FIG. 4, the input register file can store 1 to 64 1x1 matrices; as shown in diagram (2), it can store 1 to 8 8x8 matrices, at the locations shown by the gray area in diagram (2); as shown in diagram (3), it can store 1 to 4 16x16 matrices, at the locations shown by the gray area in diagram (3); and as shown in diagram (4), it can store one 32x32 matrix, at the location shown by the gray area in diagram (4). Further, the gray area above the dotted line in diagram (4) can store one 24x24 matrix.
通过分析矩阵求逆的各种算法,算法中对数据的访问,主要有标量访问、行矢量访问以及列矢量访问。对于下三角矩阵求逆,可以同时存储上三角,从而把列矢量访问转化为行矢量访问,降低数据访问的复杂度。By analyzing various algorithms of matrix inversion, the access to data in the algorithm mainly includes scalar access, row vector access and column vector access. For the inversion of the lower triangular matrix, the upper triangle can be stored at the same time, thereby converting column vector access into row vector access and reducing the complexity of data access.
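A sketch of this dual storage, assuming the matrix is real-valued and represented as nested lists (the physical layout of FIG. 4 is not modeled here):

```python
def store_lower_triangular(L):
    """Store a lower-triangular matrix and simultaneously mirror it into the
    upper triangle, so that a column read of the lower triangle becomes a
    row read of the mirrored copy.  `L` is given as rows of increasing
    length; illustrative software model only."""
    n = len(L)
    full = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            full[i][j] = L[i][j]   # lower-triangular storage format
            full[j][i] = L[i][j]   # mirrored copy, upper-triangular format
    return full
```

After this copy, column j of the lower triangle can be fetched as the row-j slice `full[j][j:]`, which is the access-complexity reduction described above.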
可选地,本实施例中的数据处理装置中还可以包括输出寄存器文件。Optionally, the data processing apparatus in this embodiment may further include an output register file.
图3为一实施例提供的数据处理装置的结构示意图。如图3所示,该数据处理装置主要包括指令控制、任务输入、计算资源池、寄存器文件以及任务输出等五大模块。FIG. 3 is a schematic structural diagram of a data processing apparatus according to an embodiment. As shown in FIG. 3 , the data processing device mainly includes five modules including instruction control, task input, computing resource pool, register file and task output.
寄存器文件包括:输入寄存器文件B(regFileB)和输出寄存器文件C(regFileC)。输入寄存器文件B和输出寄存器文件C可以为矢量寄存器文件。The register file includes: input register file B (regFileB) and output register file C (regFileC). Input register file B and output register file C may be vector register files.
为降低数据访问时延和存储中间计算过程，采用两个矢量寄存器堆（即，矢量寄存器文件），可以兼顾实现数据输入/输出的流水操作，从而降低整体的处理时延。这些高维度的矢量寄存器堆，可以并行存储多个低维度的矩阵，从而完成多个低维度矩阵的并行求逆，从而提高计算单元的利用率，提高矩阵求逆吞吐量和降低控制开销，降低整体功耗。To reduce data-access latency and to store intermediate calculation results, two vector register banks (i.e., vector register files) are used, which also enables pipelined data input/output and thereby reduces the overall processing latency. These high-dimensional vector register banks can store multiple low-dimensional matrices in parallel, so that multiple low-dimensional matrices can be inverted in parallel, improving the utilization of the computing units, increasing matrix inversion throughput, reducing control overhead, and lowering overall power consumption.
兼顾实现数据输入/输出的流水操作意为：在数据输入阶段，在输入寄存器文件为空闲状态时，不论输出寄存器文件是否为空闲，均可以实现接收任务请求；在数据输出阶段，不论数据是否输出完成，均可将输入寄存器文件的状态设置为空闲状态。数据输入/输出的流水操作可以降低数据处理的整体处理时延。Supporting pipelined data input/output means: in the data input stage, as long as the input register file is idle, a task request can be accepted regardless of whether the output register file is idle; in the data output stage, the state of the input register file can be set to idle regardless of whether the data output has completed. Pipelined data input/output reduces the overall processing latency.
步骤204:判断输出寄存器文件是否为空闲。Step 204: Determine whether the output register file is free.
可选地,步骤203可以在步骤204之后执行。在将数据存储之后,在开始处理数据之前,可以判断输出寄存器文件是否为空闲。Optionally, step 203 may be performed after step 204 . After the data is stored, it can be determined whether the output register file is free before starting to process the data.
步骤205:在确定输出寄存器文件空闲时,将输出寄存器文件的状态设置为忙状态。Step 205: When it is determined that the output register file is idle, the state of the output register file is set to a busy state.
步骤206：在将输出寄存器文件的状态设置为忙状态之后，确定执行根据待处理矩阵的维度，以及预置的矩阵维度与处理指令存储地址的对应关系，确定待处理矩阵对应的目标处理指令存储地址的步骤。Step 206: After the state of the output register file is set to busy, determine to execute the step of determining, according to the dimension of the to-be-processed matrix and the preset correspondence between matrix dimensions and processing instruction storage addresses, the target processing instruction storage address corresponding to the to-be-processed matrix.
在进行步骤207之前,为了避免写错误,需要判断输出寄存器文件的状态。在确定输出寄存器文件为空闲时,确定执行步骤207。将输出寄存器文件的状态设置为忙状态,可以避免输出寄存器文件被其他任务占用而导致数据处理过程出错。Before proceeding to step 207, in order to avoid writing errors, it is necessary to judge the state of the output register file. When it is determined that the output register file is free, step 207 is determined to be executed. Setting the state of the output register file to the busy state can prevent the output register file from being occupied by other tasks and causing errors in the data processing process.
步骤207:根据待处理矩阵的维度,以及预置的映射信息,确定待处理矩阵对应的目标处理指令存储地址。Step 207: Determine the target processing instruction storage address corresponding to the matrix to be processed according to the dimension of the matrix to be processed and the preset mapping information.
其中,映射信息用于指示矩阵的维度与处理指令存储地址之间的映射关系。目标处理执行存储地址中存储的目标处理指令用于确定待处理矩阵的逆矩阵。The mapping information is used to indicate the mapping relationship between the dimension of the matrix and the storage address of the processing instruction. The target process executes the target process instruction stored in the memory address for determining the inverse of the matrix to be processed.
本实施例中通过分析矩阵求逆的各种算法,提取出各种求逆算法所必须的基础细粒度算子。主要包括如下几类:In this embodiment, basic fine-grained operators necessary for various inversion algorithms are extracted by analyzing various algorithms for matrix inversion. Mainly include the following categories:
编码 (coding)    指令 (instruction)    功能 (function)
4'h0             MUL                   A=B*C
4'h1             MAC                   A=∑(B*C)
4'h2             SMAC                  A=A-∑(B*C)
4'h3             ZMAC                  A=0-∑(B*C)
4'h4             MACM                  A=∑(B*C)*D
4'h5             SMACM                 A=(A-∑(B*C))*D
4'h6             ZMACM                 A=(0-∑(B*C))*D
4'h7             DIV                   A=1/A
4'h8             DMACM                 A=1/∑(B*C)
4'h9             DSMACM                A=1/(A-∑(B*C))
4'hA             DZMACM                A=1/(0-∑(B*C))
4'hB             MOVE                  数据搬移指令 (data move instruction)
4'hC             FIX                   定点化指令 (fixed-point instruction)
其中,B表示输入寄存器文件中的数据,C表示输出寄存器文件中的数据,D表示上一次目标处理指令的执行结果,A表示B和C的处理结果。Among them, B represents the data in the input register file, C represents the data in the output register file, D represents the execution result of the last target processing instruction, and A represents the processing results of B and C.
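As a reference for the table above, a few of the fine-grained operators can be expressed directly (B and C are plain Python sequences here; the real operands are register-file vectors, and the real arithmetic is fixed-point):

```python
def execute_operator(op, A, B, C, D=1):
    """Reference semantics for several operators from the table, where B and
    C are element vectors, A is the accumulator/destination, and D is the
    previous result.  Illustrative only."""
    s = sum(b * c for b, c in zip(B, C))   # the shared ∑(B*C) term
    if op == "MAC":        # A = ∑(B*C)
        return s
    if op == "SMAC":       # A = A - ∑(B*C)
        return A - s
    if op == "MACM":       # A = ∑(B*C) * D
        return s * D
    if op == "DSMACM":     # A = 1 / (A - ∑(B*C))
        return 1 / (A - s)
    raise NotImplementedError(op)
```

The remaining entries (ZMAC, SMACM, DIV, etc.) follow the same pattern with different combinations of the subtraction, multiplication, and reciprocal steps.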
通过分析矩阵求逆算法流程，其控制相对比较简单，而且属于大规模并行计算。本实施例采用离线编程，既简化控制逻辑，又增强灵活性。Analysis of the matrix inversion algorithm flow shows that its control is relatively simple and that it is a massively parallel computation. This embodiment adopts offline programming, which both simplifies the control logic and enhances flexibility.
图5为处理指令格式的示意图。如图5所示,该处理指令中的各字段的意义如下:FIG. 5 is a schematic diagram of a processing instruction format. As shown in Figure 5, the meanings of the fields in the processing instruction are as follows:
指令:包括以上的基本细粒度算子,主要用于控制计算单元的执行。Instructions: including the above basic fine-grained operators, which are mainly used to control the execution of computing units.
Halt周期:由于矩阵求逆算法中某些数据之间有较强的依赖关系,为避免读写冲突,某些计算流程之间需要插入等待周期。该字段用于在执行完此条指令后,流水线暂停n个周期。Halt cycle: Due to the strong dependency between some data in the matrix inversion algorithm, in order to avoid read and write conflicts, waiting cycles need to be inserted between some calculation processes. This field is used to suspend the pipeline for n cycles after executing this instruction.
并行度：该字段用于细粒度的块内并行度，表示块内并行处理的元素个数，结合任务参数中的粗粒度的块间并行度（即待处理矩阵的块数），可以把计算资源和各种网络组织成各种形式。比如，如果块数为2，并行度为2，则累加/定标/广播网络会把资源池组织成四个独立的网络，累加/定标/广播功能仅在独立网络内部执行，并不会跨越两个互不相干的独立网络，这个可以同时处理2个矩阵的求逆，并且每个矩阵可同时计算两个元素。Parallelism: This field specifies the fine-grained intra-block parallelism, i.e., the number of elements processed in parallel within a block. Combined with the coarse-grained inter-block parallelism in the task parameters (i.e., the number of blocks of the to-be-processed matrix), the computing resources and the various networks can be organized in various forms. For example, if the number of blocks is 2 and the parallelism is 2, the accumulation/scaling/broadcast network organizes the resource pool into four independent networks; accumulation/scaling/broadcast functions are executed only inside one independent network and never span two unrelated networks. This configuration can invert 2 matrices simultaneously, computing two elements of each matrix at the same time.
定标控制:该字段主要用于算法流程中的各种定标值的计算、传播以及定点化需要的定标值的选择等。Calibration control: This field is mainly used for the calculation and propagation of various calibration values in the algorithm process, as well as the selection of calibration values required for fixed point.
源/目的A：该字段主要用于控制数据A的行为。通过不同的控制，可以从输入寄存器文件B或输出寄存器文件C中获取操作数；也可以把结果回写到输入寄存器文件B或输出寄存器文件C中；同时，还可以获取常量值或所取数据的共轭值。Source/Destination A: This field mainly controls the behavior of data A. Through different controls, an operand can be fetched from input register file B or output register file C; the result can also be written back to input register file B or output register file C; in addition, a constant value, or the conjugate of the fetched data, can be obtained.
源B和源C：这两个字段分别用于控制访问寄存器文件B和C的行为。通过不同的控制，既可以行访问，也可以列访问。其中类型表示访问类型：类型=0，表示行访问；类型=1，表示列访问。(From0,To0)表示不同资源块所访问数据的位置；(From1,To1)表示相同资源块内所访问数据的序列。其中，不同资源块表示不同的行或者列，相同资源块表示同一行或者同一列。Source B and Source C: These two fields control the behavior of accesses to register files B and C, respectively. Through different controls, either row access or column access is possible. The type field indicates the access type: type=0 indicates row access; type=1 indicates column access. (From0, To0) indicates the positions of the data accessed across different resource blocks, and (From1, To1) indicates the range of the data accessed within one resource block. Here, different resource blocks correspond to different rows or columns, and one resource block corresponds to one row or one column.
图6为处理指令中包括的数据的访问类型的示意图。如图6所示，在处理指令中的类型为1时，表示列访问。From0、To0表示访问不同的列，From1、To1表示访问From0、To0对应的列中的哪几行。举例来说，(From0,To0)可以为(0,2)，表示访问第0列至第2列；(From1,To1)可以为(2,4)，表示访问第0列至第2列中每列的第2行至第4行。FIG. 6 is a schematic diagram of the access types of the data included in a processing instruction. As shown in FIG. 6, when the type in the processing instruction is 1, it indicates column access. From0 and To0 indicate which columns are accessed, and From1 and To1 indicate which rows of the columns selected by From0 and To0 are accessed. For example, (From0, To0) may be (0, 2), indicating that columns 0 through 2 are accessed, and (From1, To1) may be (2, 4), indicating that rows 2 through 4 of each of columns 0 through 2 are accessed.
如图6所示,在处理指令中的类型为0时,表示行访问。From0、To0表示访问不同的行,From1、To1表示访问From0、To0对应的行中的哪几列。As shown in FIG. 6, when the type in the processing instruction is 0, it indicates a row access. From0 and To0 represent accessing different rows, and From1 and To1 represent which columns in the row corresponding to From0 and To0 are accessed.
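The (type, From0, To0, From1, To1) addressing can be modeled as follows (a hypothetical software sketch; the ranges are assumed inclusive, matching the example above):

```python
def addresses(access_type, from0, to0, from1, to1):
    """Expand a (type, From0, To0, From1, To1) access field into (row, col)
    coordinates.  Type 0 (row access) reads columns From1..To1 of rows
    From0..To0; type 1 (column access) reads rows From1..To1 of columns
    From0..To0.  Illustrative only."""
    if access_type == 0:   # row access
        return [(r, c) for r in range(from0, to0 + 1)
                       for c in range(from1, to1 + 1)]
    else:                  # column access
        return [(r, c) for c in range(from0, to0 + 1)
                       for r in range(from1, to1 + 1)]
```

For the column-access example in the text (columns 0-2, rows 2-4), this yields nine element coordinates.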
根据以上几个字段的不同组合,通过译码,控制着矩阵求逆的整个处理流水过程,从而摆脱硬连线固化的限制,增加控制的灵活性,简化控制逻辑,降低硬件实现复杂度。According to the different combinations of the above fields, through decoding, the entire processing pipeline of matrix inversion is controlled, so as to get rid of the limitation of hard-wired solidification, increase the flexibility of control, simplify the control logic, and reduce the complexity of hardware implementation.
离线指令编程可以根据应用需求，把多种算法和维度组合的矩阵求逆过程编码，并静态和动态下载到指令RAM，同时把映射关系配置到映射信息中，根据任务参数中的待处理矩阵的维度，从映射信息中查找到相应程序并执行。因此，离线指令编程不仅增强灵活性、降低控制复杂度，而且有益于动态功耗的降低。Offline instruction programming can, according to application requirements, encode matrix inversion processes for various combinations of algorithms and dimensions, download them statically or dynamically to the instruction RAM, and at the same time configure the corresponding relationships into the mapping information; according to the dimension of the to-be-processed matrix in the task parameters, the corresponding program is looked up in the mapping information and executed. Therefore, offline instruction programming not only enhances flexibility and reduces control complexity, but also helps reduce dynamic power consumption.
步骤208:根据待处理矩阵的块数,配置资源池中的计算资源,确定块间并行的计算单元信息。Step 208 : According to the number of blocks of the matrix to be processed, configure computing resources in the resource pool, and determine information of computing units paralleled between blocks.
步骤207与步骤208之间,没有时序关系。There is no timing relationship between step 207 and step 208 .
步骤209:从目标处理指令存储地址中依次读取目标处理指令之后,根据块内并行度,确定块内并行的计算单元信息。Step 209: After sequentially reading the target processing instructions from the target processing instruction storage addresses, determine the parallel computing unit information in the block according to the degree of parallelism in the block.
步骤209中依次读取目标处理指令的过程与步骤104的具体实现过程相类似，此处不再赘述。The process of sequentially reading the target processing instructions in step 209 is similar to the specific implementation of step 104, and is not repeated here.
如图5所示,本实施例中的目标处理指令包括的并行度指的是块内并行度。在从目标处理指令存储地址中读取到目标处理指令之后,可以根据块内并行度,确定块内并行的计算单元信息。As shown in FIG. 5 , the degree of parallelism included in the target processing instruction in this embodiment refers to the degree of parallelism within a block. After the target processing instruction is read from the storage address of the target processing instruction, the parallel computing unit information in the block can be determined according to the degree of parallelism in the block.
在实际应用中，由于要求矩阵的维度范围比较多，比如1，2，4，8，16，24，32等。如果按照最大能力预留资源，那么在处理小维度矩阵的时候，资源无法得到充分利用，既浪费资源，又增加延迟。因此，本实施例中，任务请求中包括待处理矩阵的块数，采用把多块小矩阵拼接成大矩阵，并行处理，充分利用其资源，提高资源的利用率，同时降低由于串行处理引起的时延开销。矩阵求逆一般是针对每个元素点进行相应的计算，如果每次均只处理一个元素点，虽然简单，但并不能充分利用资源，同时时延也比较大。本实施例可以把资源按需求组织起来，并行处理相同矩阵的多个元素点，达到块内并行的目的。其中，块内并行度指的是并行处理的同一个待处理矩阵中的元素的个数。In practical applications, a wide range of matrix dimensions must be supported, such as 1, 2, 4, 8, 16, 24, and 32. If resources are reserved according to the maximum capability, they cannot be fully utilized when small-dimension matrices are processed, which both wastes resources and increases latency. Therefore, in this embodiment, the task request includes the number of blocks of the to-be-processed matrix, and multiple small matrices are concatenated into one large matrix and processed in parallel, which makes full use of the resources, improves resource utilization, and reduces the latency overhead caused by serial processing. Matrix inversion generally performs a computation for each element; processing only one element at a time is simple, but it neither makes full use of the resources nor keeps the latency low. In this embodiment, the resources can be organized as required so that multiple elements of the same matrix are processed in parallel, achieving intra-block parallelism. Here, the intra-block parallelism refers to the number of elements of the same to-be-processed matrix that are processed in parallel.
本实施例采用计算资源池的思想，把计算资源按一定粒度组合成具有多个独立计算资源的池，计算资源池可以根据应用需求，组合成一套或多套计算单元，处理一个大矩阵或并行处理多个小矩阵，以及并行处理同一矩阵多个元素。可以根据需要，把这些资源池组合成多个并行处理的计算单元，并行处理多个矩阵或元素。这样，由于多个并行计算单元共享相同的程序和控制逻辑，降低功耗开销，并且在处理小矩阵时，提高吞吐量，减少时间延迟。This embodiment adopts the idea of a computing resource pool: the computing resources are combined, at a certain granularity, into a pool of multiple independent computing resources. According to application requirements, the pool can be organized into one or more sets of computing units to process one large matrix, to process multiple small matrices in parallel, or to process multiple elements of the same matrix in parallel. As needed, these resource pools can be combined into multiple parallel computing units that process multiple matrices or elements simultaneously. In this way, since the multiple parallel computing units share the same program and control logic, power overhead is reduced, and when small matrices are processed, throughput is improved and latency is reduced.
一种实现方式中,如果确定计算一个元素最大需要64个乘法单元,同时为支持各种粗/细粒度的并行度,把这些乘法单元分为8组,每组8个。通过各种网络的动态配置,这些资源,即可以组织成一个统一的1个大资源池,也可以组织成最大8个的小资源池。In an implementation manner, if it is determined that a maximum of 64 multiplication units are required to calculate one element, and at the same time to support various coarse/fine-grained parallelisms, these multiplication units are divided into 8 groups, with 8 in each group. Through the dynamic configuration of various networks, these resources can be organized into a unified large resource pool, or can be organized into a maximum of 8 small resource pools.
在步骤208中，根据任务请求中待处理矩阵的块数，在粗粒度层面静态配置各种网络，把计算资源池组织成块间并行处理的计算单元。在步骤209中，根据从目标处理指令存储地址中读取到的目标处理指令中的并行度，再从细粒度层面动态调整各种网络和计算单元，并行处理块内的多个元素，从而实现控制各种单元按需求动态运行。In step 208, according to the number of blocks of the to-be-processed matrix in the task request, the various networks are statically configured at the coarse-grained level, and the computing resource pool is organized into computing units for inter-block parallel processing. In step 209, according to the parallelism in the target processing instruction read from the target processing instruction storage address, the various networks and computing units are further adjusted dynamically at the fine-grained level so that multiple elements within a block are processed in parallel, thereby controlling the units to run dynamically as required.
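Combining the two levels, static inter-block configuration (step 208) and dynamic intra-block parallelism taken from the instruction (step 209), can be sketched as follows (group sizes and counts are illustrative, not taken from the patent):

```python
def configure_units(unit_groups, num_blocks, parallelism):
    """Coarse-grained step: split the pooled unit groups into num_blocks
    block-parallel sets.  Fine-grained step: split each set further by the
    instruction's intra-block parallelism, one sub-set per element computed
    in parallel.  Illustrative software model only."""
    per_block = len(unit_groups) // num_blocks
    blocks = [unit_groups[i * per_block:(i + 1) * per_block]
              for i in range(num_blocks)]
    per_elem = per_block // parallelism
    return [[blk[j * per_elem:(j + 1) * per_elem]
             for j in range(parallelism)]
            for blk in blocks]
```

With 8 unit groups, 2 blocks, and parallelism 2, this yields the four independent networks of the earlier example.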
步骤210:根据目标处理指令、块间并行的计算单元信息对应的计算单元以及块内并行的计算单元信息对应的计算单元,并行处理输入寄存器文件中存储的数据。Step 210 : Process the data stored in the input register file in parallel according to the target processing instruction, the computing unit corresponding to the information of the parallel computing units between the blocks, and the computing unit corresponding to the information of the parallel computing units within the block.
在步骤210中,可以基于目标处理指令,通过块间并行的计算单元信息对应的计算单元以及块内并行的计算单元信息对应的计算单元,并行处理输入寄存器文件中存储的数据。该并行处理过程既可以实现多个待处理矩阵块之间的处理并行,也可以实现同一个待处理矩阵内的多个元素之间的并行处理。In step 210, the data stored in the input register file may be processed in parallel based on the target processing instruction through the computing unit corresponding to the inter-block parallel computing unit information and the computing unit corresponding to the intra-block parallel computing unit information. The parallel processing process can realize parallel processing among multiple matrix blocks to be processed, and can also realize parallel processing among multiple elements in the same matrix to be processed.
本实施例中，由于是依次获取目标处理指令，在每处理一个目标处理指令之后，可以将本次处理输入寄存器文件中存储的数据过程中，获取到的中间数据存储在输出寄存器文件或者输入寄存器文件中，将获取到的结果数据存储在输出寄存器文件中。可选地，获取到的中间数据还可以存储在计算单元的缓存单元中。即，本实施例中的两个寄存器文件可分别用于输入存储、输出存储或中间临时存储。同时，根据并行处理的块数，可以分别存储多个小矩阵。In this embodiment, since the target processing instructions are fetched sequentially, after each target processing instruction is processed, the intermediate data produced while processing the data stored in the input register file can be stored in the output register file or the input register file, and the final result data is stored in the output register file. Optionally, the intermediate data can also be stored in a cache unit of the computing unit. That is, the two register files in this embodiment can each serve as input storage, output storage, or intermediate temporary storage. Moreover, according to the number of blocks processed in parallel, multiple small matrices can be stored separately.
As shown in FIG. 5, the target instruction further includes the access addresses of the target data (i.e., the source B and source C fields in FIG. 5) and the processing mode of the target data. In step 210, if it is determined from the access addresses of the target data that multiple pieces of target data to be accessed by the computing units share the same row or column, the access channels of those pieces of target data are merged into one access channel; data is read from the input register file and/or the output register file through the merged access channel; the target data is then obtained from the read data according to its access addresses; and the target data is processed in parallel according to the processing mode included in the target processing instruction, the computing units corresponding to the inter-block parallel computing unit information, and the computing units corresponding to the intra-block parallel computing unit information.
The computing units here may be inter-block parallel computing units or intra-block parallel computing units; this embodiment is not limited in this respect.
In a specific implementation, read control information can be generated after the access channels are merged. According to the read control information, all data in the merged access channel is read from the input register file and/or the output register file, and the required target data is then selected from it. After the required target data has been read, related preprocessing is performed, such as fixed-point conversion or conjugation; according to the previously collected fixed-point data, the data is broadcast or dispatched to the fixed-point units as required by the target processing instruction; according to the instruction, the computing units corresponding to the inter-block parallel computing unit information and to the intra-block parallel computing unit information are started, fixed-point conversion is performed and fixed-point-related data is collected dynamically as the instruction requires, and the results are written back to the register files and/or the master control unit.
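The preprocessing mentioned above (optional conjugation followed by fixed-point conversion before dispatch to the computing units) might look roughly like the following software model. This is only an illustrative sketch: the word width, fractional bits, and saturation behavior are assumptions for the example and are not taken from the patent.

```python
def to_fixed_point(x: complex, frac_bits: int = 12, width: int = 16) -> complex:
    """Quantize a complex sample to a signed fixed-point grid with
    `frac_bits` fractional bits, saturating to the `width`-bit range.
    (Assumed parameters; the patent does not fix the word width.)"""
    lo, hi = -(1 << (width - 1)), (1 << (width - 1)) - 1
    def q(v: float) -> int:
        n = round(v * (1 << frac_bits))   # scale to the fixed-point grid
        return max(lo, min(hi, n))        # saturate
    return complex(q(x.real), q(x.imag))

def preprocess(x: complex, conjugate: bool) -> complex:
    """Conjugate if the instruction requests it, then fix-point the value."""
    if conjugate:
        x = x.conjugate()
    return to_fixed_point(x)
```

With 12 fractional bits, `preprocess(0.5 + 0.25j, True)` yields the conjugated, scaled value `2048 - 1024j`.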
Because there are multiple resource pools, parallel processing requires a large amount of data, so large amounts of data must be read from the register files in parallel. With a conventional read scheme, many levels of data-selector logic would be needed, which is difficult to implement in the physical back end due to congestion. Given the data-access pattern of matrix inversion, and of lower triangular matrix inversion in particular, the register file also stores the lower-triangle data into the upper triangle, which amounts to keeping a copy and converts column accesses into row accesses. Since the data accessed at any one time is local, channel merging confines the reads to a relatively small number of rows, reducing the number of first-level selectors; after the row data has been read, column selection is performed separately. This technique greatly reduces the number of row accesses, and thus a large amount of data-selector logic, which benefits the physical back-end implementation.
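The mirrored-storage idea described above can be sketched as a small software model. This is an illustrative sketch of the addressing trick, not the patented register-file hardware; the function names are invented for the example.

```python
def store_mirrored(lower):
    """Store an n x n lower triangular matrix (a list of rows) and mirror
    the strictly lower triangle into the upper triangle, so that a column
    access can be served as a row access."""
    n = len(lower)
    regfile = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            regfile[i][j] = lower[i][j]   # original lower-triangle entry
            regfile[j][i] = lower[i][j]   # mirrored copy in the upper triangle
    return regfile

def read_column_as_row(regfile, j):
    # Column j of the lower triangle (diagonal downward) now lives in
    # row j from the diagonal rightward: one contiguous row access
    # instead of a strided column access.
    return regfile[j][j:]
```

For example, for the lower triangular matrix with rows `[1]`, `[2, 3]`, `[4, 5, 6]`, reading "column 1" becomes the single row read `[3, 5]`.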
FIG. 7 is a schematic diagram of access-channel merging. Eight 8x8 matrix blocks are completed at the same time. When the processing instruction shown in the figure is executed, 8 independent channels would need to read 8 rows of data simultaneously; after channel merging, only 4 channels need to read 4 rows simultaneously. This greatly reduces the number of read selectors and benefits back-end routing.
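The merging step itself can be modeled as grouping the requested addresses by physical register-file row, so that requests hitting the same row share one read channel. A minimal sketch under an assumed request format (this models only the grouping, not the FIG. 7 hardware):

```python
def merge_channels(requests):
    """Merge per-channel read requests that target the same register-file
    row into a single access channel.

    requests: list of (row, col) addresses, one per requesting channel.
    Returns a dict mapping each distinct row to the set of columns needed
    from it; the dict has one entry per merged channel.
    """
    merged = {}
    for row, col in requests:
        merged.setdefault(row, set()).add(col)
    return merged

# Eight channels each request one element, but only four distinct rows
# are touched, so four merged channels suffice; column selection then
# picks the needed elements out of each row that was read.
reqs = [(0, 0), (0, 1), (1, 0), (1, 2), (2, 3), (2, 4), (3, 5), (3, 6)]
channels = merge_channels(reqs)
```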
In step 209, if the current running address is determined to be equal to the instruction end address among the target processing instruction storage addresses, the inversion instructions for the matrix to be processed have finished executing, and the input register file is released. Then, according to the task parameters and the master-control fixed-point results, the output data is post-processed, and the relevant result parameters and data are output; the output register file is then set to the idle state, completing the lower triangular inversion of the entire matrix.
The above data processing procedure of this embodiment uses a loosely coupled three-stage input-compute-output pipeline; the stages are only weakly correlated, and each stage triggers the execution of the next stage solely through a trigger signal.
Referring again to FIG. 3, after the task input module receives a task request, the data in the matrix to be processed included in the task request is stored in the input register file through the data write interface; the computing units in the resource pool are configured as the inter-block parallel computing unit information; the target processing instruction storage addresses are read according to the task request and the mapping information (for example, a mapping table), and the target processing instructions, instruction 0 through instruction N, are thereby obtained; each target processing instruction is executed in turn through instruction fetch, decode, and control; according to the target processing instruction, the resources in the resource pool are configured at fine granularity to determine the intra-block parallel computing unit information; and the data is processed in parallel by the configured computing units. During processing, data must be read from the register files through the data read interface, channel merging, and so on. After a target processing instruction has executed, the results are stored in the register file through the data write interface. The resource pool in FIG. 3 includes multiply-accumulate (MAC) units.
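The MAC units in the resource pool perform the basic multiply-accumulate step into which the matrix operations decompose; a trivial software model follows (the hardware issues many such steps in parallel across blocks and elements, which this sequential sketch does not attempt to show):

```python
def mac(acc: complex, a: complex, b: complex) -> complex:
    """One multiply-accumulate step: acc + a * b."""
    return acc + a * b

def dot(u, v):
    """A row-times-column product expressed as a chain of MAC steps,
    as an inversion instruction sequence would issue them."""
    acc = 0j
    for a, b in zip(u, v):
        acc = mac(acc, a, b)
    return acc
```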
The data processing method provided by this embodiment overcomes the problem that traditional matrix inversion methods cannot simultaneously satisfy versatility, throughput, complexity, and low latency. Drawing on the SIMD-plus-ASIC idea, it proposes a highly configurable and general matrix inversion implementation that can adapt to different protocols and their continuing evolution. It also combines techniques such as channel merging, dual vector register files, inter-block and intra-block parallelism, and various broadcast networks, and adopts the resource-pool approach to reduce implementation cost and development and back-end risk, meeting the needs of various algorithms, dimensions, and application scenarios.
FIG. 8 is a schematic structural diagram of a data processing apparatus provided by another embodiment. As shown in FIG. 8, the data processing apparatus provided by this embodiment includes the following modules: a storage module 81, a first determination module 82, a second determination module 83, and a processing module 84.

The storage module 81 is configured to store, in an input register file, the data in a matrix to be processed included in a received task request.

The task request includes: the data in the matrix to be processed, the dimension of the matrix to be processed, and the number of blocks of the matrix to be processed.

The first determination module 82 is configured to determine the target processing instruction storage addresses corresponding to the matrix to be processed according to the dimension of the matrix to be processed and preset mapping information.

The mapping information indicates the mapping relationship between matrix dimensions and processing instruction storage addresses. The target processing instructions stored at the target processing instruction storage addresses are used to determine the inverse matrix of the matrix to be processed.

The second determination module 83 is configured to configure the computing resources in the resource pool according to the number of blocks of the matrix to be processed and determine the inter-block parallel computing unit information.

The processing module 84 is configured to read the target processing instructions sequentially from the target processing instruction storage addresses and process the data stored in the input register file in parallel according to the target processing instructions and the computing units corresponding to the inter-block parallel computing unit information.
Optionally, the target processing instruction includes an intra-block parallelism degree. The apparatus further includes a third determination module configured to determine the intra-block parallel computing unit information according to the intra-block parallelism degree. The processing module 84 is specifically configured to process the data stored in the input register file in parallel according to the target processing instruction, the computing units corresponding to the inter-block parallel computing unit information, and the computing units corresponding to the intra-block parallel computing unit information.
In one implementation, the processing module 84 is specifically configured to: set the current running address to the i-th address among the target processing instruction storage addresses; process the data stored in the input register file in parallel according to the target processing instruction corresponding to the i-th address and the computing units corresponding to the inter-block parallel computing unit information; and, after the target processing instruction corresponding to the i-th address has been processed, upon determining that the current running address is not equal to the instruction end address among the target processing instruction storage addresses, set i = i + 1 and return to the step of setting the current running address to the i-th address among the target processing instruction storage addresses.
Further, the target processing instruction also includes a delay period. The processing module 84 is specifically configured to: after determining that the current running address is not equal to the instruction end address among the target processing instruction storage addresses, and after waiting for the delay period included in the target processing instruction corresponding to the i-th address, set i = i + 1 and return to the step of setting the current running address to the i-th address among the target processing instruction storage addresses.
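The address-sequencing loop described in this implementation (run the instruction at the current address, compare against the end address, optionally wait the instruction's delay period, then advance) can be sketched roughly as follows. The instruction fields and the `wait` callback are assumptions chosen for illustration, not the patent's instruction encoding.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Instruction:
    delay_cycles: int              # delay period carried by the instruction
    execute: Callable[[], None]    # the work the instruction performs

def run_program(instrs: List[Instruction], start: int, end: int, wait):
    """Execute instructions from address `start` up to and including `end`.

    `wait(cycles)` models stalling for an instruction's delay period
    before the running address advances (i = i + 1)."""
    i = start
    while True:
        current = i                    # current running address
        instrs[current].execute()      # process the instruction at address i
        if current == end:             # instruction end address reached
            break
        wait(instrs[current].delay_cycles)
        i = current + 1                # i = i + 1, loop back
```

A host model would supply `execute` callbacks that drive the configured computing units; here they can simply record the order of execution.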
Optionally, the matrix to be processed is a lower triangular matrix. The storage module 81 is specifically configured to: store, in the input register file, the data in the matrix to be processed included in the received task request according to a preset lower triangular matrix storage format; and copy the data in the matrix to be processed included in the received task request and store the copied data according to an upper triangular matrix storage format.

In one implementation, the storage module 81 is further configured to store the intermediate data obtained while processing the data stored in the input register file in the output register file or the input register file, and to store the obtained result data in the output register file.

Optionally, the target instruction further includes the access addresses of the target data and the processing mode of the target data. The processing module 84 is specifically configured to: if it is determined, according to the access addresses of the target data, that multiple pieces of target data to be accessed by the computing units share the same row or column, merge the access channels of the multiple pieces of target data into one access channel; read data from the input register file and/or the output register file through the merged access channel; obtain the target data from the read data according to its access addresses; and process the target data in parallel according to the processing mode included in the target processing instruction, the computing units corresponding to the inter-block parallel computing unit information, and the computing units corresponding to the intra-block parallel computing unit information.

In one implementation, the apparatus further includes: a judgment module configured to judge whether the input register file is idle when it is determined that a task request exists at the task interface; and a receiving and setting module configured to receive the task request and set the state of the input register file to busy when the input register file is determined to be idle.

In another implementation, the judgment module is further configured to judge whether the output register file is idle. In this implementation, the apparatus further includes a setting module and a fourth determination module. The setting module is configured to set the state of the output register file to busy when the output register file is determined to be idle. The fourth determination module is configured to determine, after the state of the output register file has been set to busy, to execute the step of determining the target processing instruction storage addresses corresponding to the matrix to be processed according to the dimension of the matrix to be processed and the preset mapping information.

The data processing apparatus provided by this embodiment is used to execute the data processing method of any of the above embodiments; its implementation principle and technical effect are similar and are not repeated here.
FIG. 9 is a schematic structural diagram of a data processing device provided by an embodiment. As shown in FIG. 9, the data processing device includes a processor 91 and a memory 92. The number of processors 91 in the data processing device may be one or more; one processor 91 is taken as an example in FIG. 9. The processor 91 and the memory 92 in the data processing device may be connected by a bus or in other ways; connection by a bus is taken as an example in FIG. 9.

As a computer-readable storage medium, the memory 92 can be used to store software programs, computer-executable programs, and modules, such as the program instructions/modules corresponding to the data processing method in the embodiments of the present application (for example, the storage module 81, the first determination module 82, the second determination module 83, and the processing module 84 in the data processing apparatus). By running the software programs, instructions, and modules stored in the memory 92, the processor 91 executes the various functional applications and data processing of the data processing device, that is, implements the above data processing method.

The memory 92 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system and the application program required by at least one function, and the data storage area may store data created according to the use of the data processing device, and the like. In addition, the memory 92 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device.
An embodiment of the present application further provides a storage medium containing computer-executable instructions which, when executed by a computer processor, are used to execute a data processing method, the method including:

storing, in an input register file, the data in a matrix to be processed included in a received task request, where the task request includes: the data in the matrix to be processed, the dimension of the matrix to be processed, and the number of blocks of the matrix to be processed;

determining, according to the dimension of the matrix to be processed and preset mapping information, the target processing instruction storage addresses corresponding to the matrix to be processed, where the mapping information indicates the mapping relationship between matrix dimensions and processing instruction storage addresses, and the target processing instructions stored at the target processing instruction storage addresses are used to determine the inverse matrix of the matrix to be processed;

configuring the computing resources in a resource pool according to the number of blocks of the matrix to be processed, and determining inter-block parallel computing unit information; and

reading the target processing instructions sequentially from the target processing instruction storage addresses, and processing the data stored in the input register file in parallel according to the target processing instructions and the computing units corresponding to the inter-block parallel computing unit information.

Certainly, in the storage medium containing computer-executable instructions provided by the present application, the computer-executable instructions are not limited to the above method operations and can also perform related operations in the data processing method provided by any embodiment of the present application.
The above are merely exemplary embodiments of the present application and are not intended to limit the protection scope of the present application.

In general, the various embodiments of the present application may be implemented in hardware or dedicated circuits, software, logic, or any combination thereof. For example, some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software executable by a controller, microprocessor, or other computing device, although the present application is not limited thereto.

Those of ordinary skill in the art can understand that all or some of the steps of the methods disclosed above, and the functional modules/units of the systems and devices, can be implemented as software, firmware, hardware, and appropriate combinations thereof.
In a hardware implementation, the division between the functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed cooperatively by several physical components. Some or all of the physical components may be implemented as software executed by a processor such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to those of ordinary skill in the art, the term computer storage media includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for the storage of information, such as computer-readable instructions, data structures, program modules, or other data. Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disk (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer. In addition, as is well known to those of ordinary skill in the art, communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.
The preferred embodiments of the present application have been described above with reference to the accompanying drawings; they do not limit the scope of rights of the embodiments of the present application. Any modifications, equivalent replacements, and improvements made by those skilled in the art without departing from the scope and essence of the embodiments of the present application shall fall within the scope of rights of the embodiments of the present application.

Claims (11)

  1. A data processing method, the method comprising:
    storing, in an input register file, data in a matrix to be processed included in a received task request, wherein the task request comprises: the data in the matrix to be processed, a dimension of the matrix to be processed, and a number of blocks of the matrix to be processed;
    determining, according to the dimension of the matrix to be processed and preset mapping information, target processing instruction storage addresses corresponding to the matrix to be processed, wherein the mapping information indicates a mapping relationship between matrix dimensions and processing instruction storage addresses, and target processing instructions stored at the target processing instruction storage addresses are used to determine an inverse matrix of the matrix to be processed;
    configuring computing resources in a resource pool according to the number of blocks of the matrix to be processed, and determining inter-block parallel computing unit information; and
    reading the target processing instructions sequentially from the target processing instruction storage addresses, and processing the data stored in the input register file in parallel according to the target processing instructions and computing units corresponding to the inter-block parallel computing unit information.
  2. The method according to claim 1, wherein the target processing instruction comprises an intra-block parallelism degree;
    after reading the target processing instructions sequentially from the target processing instruction storage addresses, the method further comprises:
    determining intra-block parallel computing unit information according to the intra-block parallelism degree;
    wherein processing the data stored in the input register file according to the target processing instructions and the computing units corresponding to the inter-block parallel computing unit information comprises:
    processing the data stored in the input register file in parallel according to the target processing instructions, the computing units corresponding to the inter-block parallel computing unit information, and computing units corresponding to the intra-block parallel computing unit information.
  3. The method according to claim 1 or 2, wherein reading the target processing instructions sequentially from the target processing instruction storage addresses, and processing the data stored in the input register file in parallel according to the target processing instructions and the computing units corresponding to the inter-block parallel computing unit information, comprises:
    setting a current running address to an i-th address among the target processing instruction storage addresses;
    processing the data stored in the input register file in parallel according to a target processing instruction corresponding to the i-th address and the computing units corresponding to the inter-block parallel computing unit information; and
    after the target processing instruction corresponding to the i-th address has been processed, upon determining that the current running address is not equal to an instruction end address among the target processing instruction storage addresses, setting i = i + 1 and returning to the step of setting the current running address to the i-th address among the target processing instruction storage addresses.
  4. The method according to claim 3, wherein the target processing instruction further comprises a delay period;
    wherein, upon determining that the current running address is not equal to the instruction end address among the target processing instruction storage addresses, setting i = i + 1 and returning to the step of setting the current running address to the i-th address among the target processing instruction storage addresses comprises:
    after determining that the current running address is not equal to the instruction end address among the target processing instruction storage addresses, and after waiting for the delay period included in the target processing instruction corresponding to the i-th address, setting i = i + 1 and returning to the step of setting the current running address to the i-th address among the target processing instruction storage addresses.
  5. The method according to any one of claims 1 to 4, wherein the matrix to be processed is a lower triangular matrix;
    wherein storing, in the input register file, the data in the matrix to be processed included in the received task request comprises:
    storing, in the input register file, the data in the matrix to be processed included in the received task request according to a preset lower triangular matrix storage format; and
    copying the data in the matrix to be processed included in the received task request, and storing the copied data according to an upper triangular matrix storage format.
  6. The method according to any one of claims 1 to 4, wherein, after processing the data stored in the input register file in parallel according to the target processing instructions, the computing units corresponding to the inter-block parallel computing unit information, and the computing units corresponding to the intra-block parallel computing unit information, the method further comprises:
    storing intermediate data obtained while processing the data stored in the input register file in an output register file or the input register file, and storing obtained result data in the output register file.
  7. The method according to claim 6, wherein the target processing instruction further comprises access addresses of target data and a processing mode for the target data;
    processing the data stored in the input register file in parallel according to the target processing instruction, the computing units corresponding to the inter-block parallel computing unit information, and the computing units corresponding to the intra-block parallel computing unit information comprises:
    if it is determined, according to the access addresses of the target data, that multiple target data to be accessed by a computing unit share the same row or column, merging the access channels of the multiple target data into one access channel;
    reading data from the input register file and/or the output register file according to the merged access channel;
    obtaining the target data from the read data according to the access addresses of the target data;
    processing the target data in parallel according to the processing mode included in the target processing instruction, the computing units corresponding to the inter-block parallel computing unit information, and the computing units corresponding to the intra-block parallel computing unit information.
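The access merging in claim 7 can be sketched as grouping target-data addresses by row, issuing one read per group (one access channel), and then extracting each target from the shared read. The `(row, col)` address form and the row-indexed register file are illustrative assumptions:

```python
from collections import defaultdict

def merge_accesses(addresses):
    """Group (row, col) target addresses by row so that all targets in the
    same row share a single merged access channel."""
    channels = defaultdict(list)
    for row, col in addresses:
        channels[row].append(col)
    return dict(channels)

def read_targets(register_file, addresses):
    """Issue one read per merged channel, then obtain each target value
    from the read data according to its access address."""
    channels = merge_accesses(addresses)
    out = {}
    for row, cols in channels.items():
        row_data = register_file[row]      # single access for the whole row
        for col in cols:
            out[(row, col)] = row_data[col]
    return out
```

Three target addresses spanning two rows thus collapse into two register-file accesses instead of three, which is the point of the merge.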
  8. The method according to claim 6 or 7, wherein before the data in the matrix to be processed included in the received task request is stored in the input register file, the method further comprises:
    upon determining that a task request is present at the task interface, judging whether the input register file is idle;
    upon determining that the input register file is idle, receiving the task request and setting the state of the input register file to busy.
  9. The method according to claim 6 or 7, wherein before the target processing instruction storage address corresponding to the matrix to be processed is determined according to the dimension of the matrix to be processed and the preset mapping information, the method further comprises:
    judging whether the output register file is idle;
    upon determining that the output register file is idle, setting the state of the output register file to busy;
    after setting the state of the output register file to busy, determining to execute the step of determining the target processing instruction storage address corresponding to the matrix to be processed according to the dimension of the matrix to be processed and the preset mapping information.
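Claims 8 and 9 describe the same idle/busy handshake applied to the input and output register files respectively: proceed only when the file is idle, and mark it busy before using it. A minimal sketch; the `RegisterFile` class and function names are illustrative, not from the source:

```python
class RegisterFile:
    def __init__(self):
        self.busy = False
        self.data = None

def try_accept_task(input_rf, task_data):
    """Claim 8: accept a pending task request only if the input register
    file is idle, then set its state to busy. Returns True on acceptance."""
    if input_rf.busy:
        return False
    input_rf.busy = True
    input_rf.data = task_data
    return True

def try_start_processing(output_rf):
    """Claim 9: before resolving the target processing instruction storage
    address, check that the output register file is idle and mark it busy.
    Returns True when processing may proceed."""
    if output_rf.busy:
        return False
    output_rf.busy = True
    return True
```

A second request arriving while either file is busy is simply refused until the file is released, which serializes tasks through the register-file pair.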
  10. A data processing device, comprising a memory, a processor, a program stored on the memory and executable on the processor, and a data bus for implementing connection and communication between the processor and the memory, wherein the program, when executed by the processor, implements the steps of the data processing method according to any one of claims 1 to 9.
  11. A storage medium for computer-readable storage, wherein the storage medium stores one or more programs executable by one or more processors to implement the steps of the data processing method according to any one of claims 1 to 9.
PCT/CN2021/107658 2020-07-31 2021-07-21 Data processing method and device, and storage medium WO2022022362A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010761369.X 2020-07-31
CN202010761369.XA CN114065122A (en) 2020-07-31 2020-07-31 Data processing method, device and storage medium

Publications (1)

Publication Number Publication Date
WO2022022362A1 true WO2022022362A1 (en) 2022-02-03

Family

ID=80037122

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/107658 WO2022022362A1 (en) 2020-07-31 2021-07-21 Data processing method and device, and storage medium

Country Status (2)

Country Link
CN (1) CN114065122A (en)
WO (1) WO2022022362A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115801147A (en) * 2022-11-30 2023-03-14 珠海笛思科技有限公司 Data communication processing method and system

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115880134B (en) * 2023-01-31 2024-04-16 南京砺算科技有限公司 Constant data processing method using vector register, graphics processor, and medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070208760A1 (en) * 2006-03-06 2007-09-06 Reuter James M Data-state-describing data structures
CN101562744A (en) * 2008-04-18 2009-10-21 展讯通信(上海)有限公司 Two-dimensional inverse transformation device
CN101621306A (en) * 2008-06-30 2010-01-06 中兴通讯股份有限公司 Mapping method and device for multiple-input multiple-output system precoding matrix
CN104572588A (en) * 2014-12-23 2015-04-29 中国电子科技集团公司第三十八研究所 Matrix inversion processing method and device
CN105790809A (en) * 2016-02-24 2016-07-20 东南大学 Coarse-grained reconfigurable array and routing structure for MIMO channel detection system
CN108647007A (en) * 2018-04-28 2018-10-12 天津芯海创科技有限公司 Arithmetic system and chip


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115801147A (en) * 2022-11-30 2023-03-14 珠海笛思科技有限公司 Data communication processing method and system
CN115801147B (en) * 2022-11-30 2023-09-22 珠海笛思科技有限公司 Data communication processing method and system

Also Published As

Publication number Publication date
CN114065122A (en) 2022-02-18

Similar Documents

Publication Publication Date Title
KR102443546B1 (en) matrix multiplier
CN107315574B (en) Apparatus and method for performing matrix multiplication operation
US10685082B2 (en) Sparse matrix multiplication using a single field programmable gate array module
WO2022022362A1 (en) Data processing method and device, and storage medium
CN106846235B (en) Convolution optimization method and system accelerated by NVIDIA Kepler GPU assembly instruction
CN108628799B (en) Reconfigurable single instruction multiple data systolic array structure, processor and electronic terminal
US20210089609A1 (en) Methods and apparatus for job scheduling in a programmable mixed-radix dft/idft processor
US11397791B2 (en) Method, circuit, and SOC for performing matrix multiplication operation
WO2021036729A1 (en) Matrix computation method, computation device, and processor
US10754818B2 (en) Multiprocessor device for executing vector processing commands
CN102629238B (en) Method and device for supporting vector condition memory access
Baboulin et al. An efficient distributed randomized algorithm for solving large dense symmetric indefinite linear systems
WO2021109665A1 (en) Data processing apparatus and method, base station, and storage medium
CN114385972A (en) Parallel computing method for directly solving structured triangular sparse linear equation set
US10127040B2 (en) Processor and method for executing memory access and computing instructions for host matrix operations
CN103235717B (en) There is the processor of polymorphic instruction set architecture
KR20210103393A (en) System and method for managing conversion of low-locality data into high-locality data
KR20210084220A (en) System and method for reconfigurable systolic array with partial read/write
CN116301920B (en) Compiling system for deploying CNN model to high-performance accelerator based on FPGA
CN115080496A (en) Network mapping method, data processing method and device, equipment, system and medium
CN114327639A (en) Accelerator based on data flow architecture, and data access method and equipment of accelerator
US20160162290A1 (en) Processor with Polymorphic Instruction Set Architecture
Esposito et al. Performance impact of rank-reordering on advanced polar decomposition algorithms
CN111352894A (en) Single-instruction multi-core system, instruction processing method and storage medium
CN113254078B (en) Data stream processing method for efficiently executing matrix addition on GPDPU simulator

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21848830

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21848830

Country of ref document: EP

Kind code of ref document: A1