CN116150055B - Data access method and device based on on-chip cache and transposition method and device - Google Patents

Data access method and device based on on-chip cache and transposition method and device

Info

Publication number
CN116150055B
Authority
CN
China
Prior art keywords
data
cache
line
mode
matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211580278.1A
Other languages
Chinese (zh)
Other versions
CN116150055A (en)
Inventor
Wang Yinshen (王胤燊)
Zhou Liangjiang (周良将)
Wang Bingnan (汪丙南)
Ding Manlai (丁满来)
Ding Chibiao (丁赤飚)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Aerospace Information Research Institute of CAS
Original Assignee
Aerospace Information Research Institute of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Aerospace Information Research Institute of CAS filed Critical Aerospace Information Research Institute of CAS
Priority to CN202211580278.1A
Publication of CN116150055A
Application granted
Publication of CN116150055B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/14 Handling requests for interconnection or transfer
    • G06F13/20 Handling requests for interconnection or transfer for access to input/output bus
    • G06F13/28 Handling requests for interconnection or transfer for access to input/output bus using burst mode transfer, e.g. direct memory access DMA, cycle steal
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/76 Architectures of general purpose stored program computers
    • G06F15/78 Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807 System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G06F15/781 On-chip cache; Off-chip memory
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The invention relates to a data access method and device based on on-chip cache, and to a transposition method and device. The data access method comprises the following steps: using a block-structured DMA that supports matrix transfer, data blocks are read out of DDR in batches, block by block; each data block is written into an array-type on-chip cache in either a row-wise or a column-wise write mode; and, according to the memory access requests of a vector processor for the cached data in the array-type on-chip cache, data is fetched in parallel in either a row-wise or a column-wise fetch mode. The transposition method comprises: dividing a large two-dimensional matrix to be transposed into equally sized small matrices; reading the small matrices out block by block with the on-chip-cache-based data access method and transposing them; and storing the transposed matrix obtained from each block at its correct position to assemble the large two-dimensional transposed matrix. The invention improves the performance of a vector processor in fetching column data from matrix data and improves the efficiency of matrix transposition.

Description

Data access method and device based on on-chip cache and transposition method and device
Technical Field
The invention belongs to the technical field of data processing, and particularly relates to a data access method and device based on on-chip cache, and a transposition method and device.
Background
In the field of large-scale signal processing, the matrices to be processed are large, often reaching several hundred MB, and must be processed in multiple data batches. Signal matrices, moreover, are frequently processed column by column, for example by column-wise FFTs or column-wise conjugate operations.
Vector processors based on the SIMD (single instruction, multiple data) computing model, on the other hand, are often used for vector and matrix computing tasks. They achieve high computational density by packing multiple operands that execute the same instruction into one wide register and accessing and operating on them together. When faced with column-wise matrix accesses, however, SIMD operation directly reduces the efficiency of a vector processor, because the non-contiguous storage of the data elements prevents the operands from being packed and fetched together.
In addition, conventional matrix transposition reads data row by row into the on-chip cache via DMA and then stores it back to DDR memory column by column. During the column-wise store, because the data is non-contiguous in DDR, the DDR must activate many access channels, which lowers the transfer bandwidth and ultimately becomes the performance bottleneck of the whole processing flow.
Disclosure of Invention
In view of the above analysis, the invention aims to disclose a data access method and device based on on-chip cache, and a transposition method and device, which improve the performance of a vector processor in fetching column data from matrix data and improve the efficiency of matrix transposition.
The invention discloses a data access method based on on-chip cache, comprising the following steps:
Step S1: using a block-structured DMA that supports matrix transfer, read data blocks out of DDR in batches, block by block; write each data block into an array-type on-chip cache in either a row-wise write mode or a column-wise write mode;
in the row-wise write mode, the data in the data block is written into the cache continuously, row after row;
in the column-wise write mode, the data in the data block is written row by row into the corresponding rows of the cache, so that the cached data keeps the same row-column structure as the data block;
Step S2: for the cached data in the array-type on-chip cache, fetch data in parallel in either a row-wise fetch mode or a column-wise fetch mode, according to the memory access request of the vector processor;
in the row-wise fetch mode, the vector processor fetches data from the cache sequentially, following the contiguous row-wise addresses of the cached data, and then outputs SIMD vector data;
in the column-wise fetch mode, the vector processor fetches one element from each row of the cache in parallel order at the same column-wise address, performs SIMD merging, and then outputs SIMD vector data.
Further, the data access method comprises three data transmission modes:
Mode 1: step S1 uses the row-wise write mode and step S2 uses the row-wise fetch mode, realizing normal data transmission of the data block;
Mode 2: step S1 uses the column-wise write mode and step S2 uses the column-wise fetch mode, realizing transposed data transmission of the data block;
Mode 3: step S1 uses the column-wise write mode and step S2 uses the row-wise fetch mode, again realizing normal data transmission of the data block.
Further, in the storage structure of the array-type on-chip cache, the storage space is split into a plurality of row groups; the number of row groups is equal to, or a multiple of, the SIMD merge width of the vector processor.
Further, step S1 specifically comprises:
Step S101: parameterize the block-structured DMA so that the size of the data block moved by each DMA transfer matches the storage scale of the array-type on-chip cache;
Step S102: according to the parameterized configuration, the DMA automatically calculates the DDR address of each row of the data block to be moved;
Step S103: determine the transmission mode from the transmission-mode flag bit of the parameterized configuration; if the flag bit is "0", go to step S104; if the flag bit is "1", go to step S105;
Step S104: perform normal transmission in the row-wise write mode; calculate the total size of the data block and write its data into the cache continuously, row after row, completing the write of the data block into the cache;
Step S105: perform transposed transmission in the column-wise write mode; from the row count of the data block, calculate the cache row address at which each row of the data block is to be written, and from the column count of the data block, calculate the in-row address of each datum in each row; then write the data of the data block row by row into the corresponding cache rows, so that the cached data keeps the same row-column structure as the data block.
Further, parameterizing the block-structured DMA comprises:
configuring the data block start address Mem_Addr0, which denotes the DDR storage address of the top-left starting element of the data block;
configuring the contiguous valid length X_Slice of the data block; for transposed transmission, X_Slice must not exceed the capacity of each row of the array-type on-chip cache; X_Slice is also smaller than the row width X_Full of the full matrix of original data in DDR;
configuring the row count Y_Slice of the data block; for transposed transmission, Y_Slice is the row count of the array-type on-chip cache or an integer multiple of it;
configuring the transmission-mode flag bit for writing the data block into the array-type on-chip cache; a flag bit of "1" means transposed transmission, and a flag bit of "0" means normal transmission.
Further, step S2 specifically comprises:
Step S201: from the parameterized configuration of step S1, determine the write mode used for the data in the array-type on-chip cache; if it is the row-wise write mode, go to step S202; if the data block was written into the array-type on-chip cache in the column-wise write mode, go to step S203;
Step S202: in the row-wise fetch mode, the vector processor fetches data from the cache sequentially by row-wise addresses and then outputs SIMD vector data, completing the fetch of the cached data;
Step S203: examine the memory access instruction of the vector processor; if it is a transpose access instruction, go to step S204; if it is a normal access instruction, go to step S205;
Step S204: in the column-wise fetch mode, the vector processor fetches one element from each row of cache addresses in parallel order at the same column-wise address, performs SIMD merging, and then outputs SIMD vector data, completing the fetch of the cached data;
Step S205: in the row-wise fetch mode, the vector processor fetches rows of data from the cache sequentially, row by row, following the contiguous row-wise addresses of the cached data, and then outputs SIMD vector data, completing the fetch of the cached data.
Further, in the column-wise fetch mode, an address decoder calculates the column-wise and row-wise cache addresses used for SIMD merging; according to the calculated column and row addresses, the elements at the same column address are fetched from each row of the cached data in turn and SIMD-merged.
The invention also discloses a data access device based on the array-type on-chip cache, comprising a data write module and a data read module:
the data write module is used to read data blocks out of DDR in batches, block by block, using a block-structured DMA that supports matrix transfer, and to write each data block into the array-type on-chip cache in either a row-wise or a column-wise write mode; in the row-wise write mode, the data in the data block is written into the cache continuously, row after row; in the column-wise write mode, the data is written row by row into the corresponding rows of the cache, so that the cached data keeps the same row-column structure as the data block;
the data read module is used to fetch data in parallel from the cached data in the array-type on-chip cache, in either a row-wise or a column-wise fetch mode, according to the memory access request of the vector processor; in the row-wise fetch mode, the vector processor fetches data from the cache sequentially, following the contiguous row-wise addresses of the cached data, and then outputs SIMD vector data; in the column-wise fetch mode, the vector processor fetches one element from each row of the cache in parallel order at the same column-wise address, performs SIMD merging, and then outputs SIMD vector data.
The invention also discloses a transposition method for a large-scale two-dimensional matrix, comprising the following steps:
Step S1: divide the large two-dimensional matrix [L1, L2] to be transposed, located in DDR, into equally sized small matrices [M1, M2]; here N1 = L1/M1 and N2 = L2/M2, where N1 and N2 are positive integers greater than 0;
Step S2: using the data access method based on on-chip cache described above, read the small matrices [M1, M2] out of DDR block by block with the block-structured DMA and write them into the array-type on-chip cache in the column-wise write mode; use the vector processor to fetch data in parallel in the column-wise fetch mode, transposing each small matrix [M1, M2] into a matrix [M2, M1];
Step S3: store the matrix [M2, M1] obtained from each transposed block at its correct position, assembling the matrix [L2, L1]; the correct position is the position that each matrix block [M2, M1] occupies in the matrix [L2, L1] when the matrix [L1, L2] is transposed into the matrix [L2, L1].
The invention also discloses a transposition device for a large-scale two-dimensional matrix, comprising: a matrix division module, a small-matrix transposition module and a matrix synthesis module;
the matrix division module is used to divide the large two-dimensional matrix [L1, L2] to be transposed, located in DDR, into equally sized small matrices [M1, M2]; here N1 = L1/M1 and N2 = L2/M2, where N1 and N2 are positive integers greater than 0;
the small-matrix transposition module is used to read the small matrices [M1, M2] out of DDR block by block with the block-structured DMA, using the data access method based on on-chip cache described above, and to write them into the array-type on-chip cache in the column-wise write mode; the vector processor fetches data in parallel in the column-wise fetch mode, transposing each small matrix [M1, M2] into a matrix [M2, M1];
the matrix synthesis module is used to store the matrix [M2, M1] obtained from each transposed block at its correct position, assembling the matrix [L2, L1]; the correct position is the position that each matrix block [M2, M1] occupies in the matrix [L2, L1] when the matrix [L1, L2] is transposed into the matrix [L2, L1].
The invention can realize at least one of the following beneficial effects:
Unlike the traditional matrix transposition scheme of whole-row reads and whole-column writes, the invention uses block transmission, which improves the transfer efficiency of reading and writing back transposed data blocks and raises the access bandwidth by exploiting the contiguous-access characteristic of DDR memory.
The array-structure design of the on-chip cache that stores the transposed data blocks supports natural transposed emission of the data, satisfies the vector processor's requirement for SIMD vector transposition, and improves the data access efficiency of the vector processor.
Drawings
The drawings are only for purposes of illustrating particular embodiments and are not to be construed as limiting the invention; like reference numerals refer to like parts throughout the several views.
FIG. 1 is a flowchart of a data access method based on on-chip cache in a first embodiment of the present invention;
FIG. 2 is a flowchart illustrating a process of writing a data block into an on-chip cache according to a first embodiment of the present invention;
FIG. 3 is a flowchart illustrating a read process of on-chip cache data according to a first embodiment of the present invention;
FIG. 4 is a flowchart of a method for transposing a large-scale two-dimensional matrix in a third embodiment of the present invention;
FIG. 5 is a block diagram of a transposition device for a large-scale two-dimensional matrix in a fourth embodiment of the present invention.
Detailed Description
Preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings, which form a part of the present application and, together with the embodiments of the invention, serve to explain its principles.
Embodiment 1
An embodiment of the invention discloses a data access method based on on-chip cache which, as shown in FIG. 1, comprises the following steps:
Step S1: using a block-structured DMA that supports matrix transfer, read data blocks out of DDR in batches, block by block; write each data block into an array-type on-chip cache in either a row-wise write mode or a column-wise write mode;
in the row-wise write mode, the data in the data block is written into the cache continuously, row after row;
in the column-wise write mode, the data in the data block is written row by row into the corresponding rows of the cache, so that the cached data keeps the same row-column structure as the data block;
Step S2: for the cached data in the array-type on-chip cache, fetch data in parallel in either a row-wise fetch mode or a column-wise fetch mode, according to the memory access request of the vector processor;
in the row-wise fetch mode, the vector processor fetches data from the cache sequentially, following the contiguous row-wise addresses of the cached data, and then outputs SIMD vector data;
in the column-wise fetch mode, the vector processor fetches one element from each row of the cache in parallel order at the same column-wise address, performs SIMD merging, and then outputs SIMD vector data.
Specifically, in this embodiment, the storage space of the array-type on-chip cache is split into a plurality of row groups; the number of row groups is equal to, or a multiple of, the SIMD merge width of the vector processor.
When the number of row groups equals the SIMD merge width of the vector processor, a number of data rows equal to the SIMD width is written into the array-type on-chip cache for the column-wise fetch mode; during SIMD merging, one element is fetched in parallel order, at the same column-wise address, from each of these SIMD-width rows.
When the number of row groups is a multiple of the SIMD merge width of the vector processor, the above process is repeated in cycles, processing the data group by group.
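For illustration only, and not as the claimed hardware, the following minimal C model assumes a SIMD merge width of 8 and a cache of SIMD_W row groups of ROW_CAP elements each (both values are assumptions); it shows how a column-wise fetch gathers one element per row at the same in-row address:

    #include <stdint.h>

    #define SIMD_W  8     /* assumed SIMD merge width of the vector processor */
    #define ROW_CAP 16    /* assumed capacity, in elements, of each cache row */

    static int32_t cache[SIMD_W][ROW_CAP];   /* one storage row per SIMD lane */

    /* Column-wise fetch: read the element at the same in-row column address
     * from every row in parallel order and merge the results into one SIMD
     * vector. With K*SIMD_W rows, this is simply repeated K times, once per
     * group of SIMD_W rows, as described above. */
    static void fetch_column(int col, int32_t simd_out[SIMD_W])
    {
        for (int r = 0; r < SIMD_W; r++)
            simd_out[r] = cache[r][col];
    }

A row-wise fetch, by contrast, reads a contiguous run of elements from a single row.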
Specifically, the data access method based on on-chip cache in this embodiment comprises three data transmission modes:
Mode 1: step S1 uses the row-wise write mode and step S2 uses the row-wise fetch mode, realizing normal data transmission of the data block.
In this mode, the normally transmitted data block can simply be treated as one contiguous data segment and stored into the on-chip cache sequentially, with no row distinction or gap handling; this corresponds to the normal row-wise fetch format of the vector processor.
Mode 2: step S1 uses the column-wise write mode and step S2 uses the column-wise fetch mode, realizing transposed data transmission of the data block.
In mode 2, each row of the data block is stored into a row of the array-type on-chip cache in one-to-one correspondence, and the transposition of the data block is realized by the column-wise fetch of the vector processor; this improves the transfer efficiency of reading and writing back transposed data blocks and raises the access bandwidth by exploiting the contiguous-access characteristic of DDR memory.
Mode 3: step S1 uses the column-wise write mode and step S2 uses the row-wise fetch mode, again realizing normal data transmission of the data block.
In mode 3, normal reads of the data block data and reads of its transposed data can both be obtained through a simple change of instruction, which improves processing efficiency for complex operations on the data block, as the sketch below summarizes.
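As a summary sketch only (the enum and function names are ours, not the patent's), the net effect of each write/fetch combination can be written as:

    typedef enum { WRITE_ROW, WRITE_COL } write_mode_t;   /* step S1 choice */
    typedef enum { FETCH_ROW, FETCH_COL } fetch_mode_t;   /* step S2 choice */

    /* Mode 1: WRITE_ROW + FETCH_ROW -> normal transfer.
     * Mode 2: WRITE_COL + FETCH_COL -> transposed transfer.
     * Mode 3: WRITE_COL + FETCH_ROW -> normal transfer; once a block is
     * column-written into the cache, normal and transposed reads differ
     * only in the fetch instruction used. */
    static int output_is_transposed(write_mode_t w, fetch_mode_t f)
    {
        return (w == WRITE_COL) && (f == FETCH_COL);
    }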
In a more specific scheme, as shown in FIG. 2, step S1 specifically comprises:
Step S101: parameterize the block-structured DMA so that the size of the data block moved by each DMA transfer matches the storage scale of the array-type on-chip cache.
Specifically, parameterizing the block-structured DMA comprises:
configuring the data block start address Mem_Addr0, which denotes the DDR storage address of the top-left starting element of the data block;
configuring the contiguous valid length X_Slice of the data block; for transposed transmission, X_Slice must not exceed the capacity of each row of the array-type on-chip cache; X_Slice is also smaller than the row width X_Full of the full matrix of original data in DDR;
configuring the row count Y_Slice of the data block; for transposed transmission, Y_Slice is the row count of the array-type on-chip cache or an integer multiple of it;
configuring the transmission-mode flag bit for writing the data block into the array-type on-chip cache; a flag bit of "1" means transposed transmission, and a flag bit of "0" means normal transmission.
Step S102: according to the parameterized configuration, the DMA automatically calculates the DDR address of each row of the data block to be moved.
Step S103: determine the transmission mode from the transmission-mode flag bit of the parameterized configuration; if the flag bit is "0", go to step S104; if the flag bit is "1", go to step S105.
Step S104: perform normal transmission in the row-wise write mode; calculate the total size of the data block and write its data into the cache continuously, row after row, completing the write of the data block into the cache.
Step S105: perform transposed transmission in the column-wise write mode; from the row count of the data block, calculate the cache row address at which each row of the data block is to be written, and from the column count of the data block, calculate the in-row address of each datum in each row; then write the data of the data block row by row into the corresponding cache rows, so that the cached data keeps the same row-column structure as the data block.
In this step, the parameter-configurable DMA structure supporting matrix transfer allows data in memory to be moved to the processor's on-chip cache in blocks and in batches at the hardware level, reducing the time the processor spends on data pre-processing; combined with the column access function of the on-chip storage, it supports the column-wise computation needs of the vector processor.
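As a software-level sketch of this parameterization (the field names follow the text, while the element-size argument and the row-stride formula are our assumptions), the DMA descriptor and the per-row source address calculation of step S102 might look like:

    #include <stdint.h>

    typedef struct {
        uint64_t mem_addr0;  /* DDR address of the block's top-left element  */
        uint32_t x_slice;    /* valid row length, <= cache row capacity      */
        uint32_t y_slice;    /* row count, a multiple of the cache row count */
        uint32_t x_full;     /* row width of the full source matrix in DDR   */
        uint8_t  transpose;  /* flag bit: 1 = transposed, 0 = normal         */
    } block_dma_cfg_t;

    /* DDR source address of row i of the block: consecutive block rows are
     * assumed to lie one full matrix row (x_full elements) apart in DDR. */
    static uint64_t dma_row_addr(const block_dma_cfg_t *c, uint32_t i,
                                 uint32_t elem_size)
    {
        return c->mem_addr0 + (uint64_t)i * c->x_full * elem_size;
    }

Each of the y_slice rows fetched this way is then written to the cache either contiguously (flag "0", step S104) or into its corresponding cache row (flag "1", step S105).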
Specifically, as shown in FIG. 3, step S2 comprises:
Step S201: from the parameterized configuration of step S1, determine the write mode used for the data in the array-type on-chip cache; if it is the row-wise write mode, go to step S202; if the data block was written into the array-type on-chip cache in the column-wise write mode, go to step S203.
Step S202: in the row-wise fetch mode, the vector processor fetches data from the cache sequentially by row-wise addresses and then outputs SIMD vector data, completing the fetch of the cached data.
Step S203: examine the memory access instruction of the vector processor; if it is a transpose access instruction, go to step S204; if it is a normal access instruction, go to step S205.
Step S204: in the column-wise fetch mode, the vector processor fetches one element from each row of cache addresses in parallel order at the same column-wise address, performs SIMD merging, and then outputs SIMD vector data, completing the fetch of the cached data.
Specifically, in the column-wise fetch mode, an address decoder calculates the column-wise and row-wise cache addresses used for SIMD merging; according to the calculated column and row addresses, the elements at the same column address are fetched from each row of the cached data in turn and SIMD-merged.
More specifically, the transpose memory access instructions of the vector processor are LOADC/STORC (C for Column). After decoding, the column access instruction is sent in parallel to the on-chip cache storage rows together with the in-row column address obtained by decoding. The data element corresponding to that column address is read out of each row, sent to the SIMD merge unit, merged and spliced into complete SIMD data, and finally output to the processor through the selection logic.
Step S205: in the row-wise fetch mode, the vector processor fetches rows of data from the cache sequentially, row by row, following the contiguous row-wise addresses of the cached data, and then outputs SIMD vector data, completing the fetch of the cached data.
Specifically, the normal memory access instructions of the vector processor are LOAD/STORE. After decoding, a normal access instruction yields the in-row address of a single storage row, and the address is sent to that one storage row only. The storage row fetches a segment of contiguous row-wise SIMD data, which is returned to the processor through the selection logic, realizing a simple row access function.
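Continuing the toy cache declarations from the earlier sketch (SIMD_W, ROW_CAP and cache), a minimal software model of the two fetch paths might read as follows; the split of the address into row and column parts is an assumption, not the patented decode logic:

    /* LOADC (column access, step S204): the decoded in-row column address is
     * broadcast to all SIMD_W storage rows at once; each row returns one
     * element, and the merge unit splices them into a full SIMD vector. */
    static void loadc(uint32_t addr, int32_t simd_out[SIMD_W])
    {
        uint32_t col = addr % ROW_CAP;        /* assumed in-row address split */
        for (int r = 0; r < SIMD_W; r++)
            simd_out[r] = cache[r][col];
    }

    /* LOAD (normal access, step S205): the address selects a single storage
     * row, which returns a contiguous row-wise segment of SIMD data. */
    static void load(uint32_t addr, int32_t simd_out[SIMD_W])
    {
        uint32_t row = addr / ROW_CAP;
        uint32_t col = addr % ROW_CAP;
        for (int i = 0; i < SIMD_W; i++)
            simd_out[i] = cache[row][col + i];  /* assumes col + SIMD_W <= ROW_CAP */
    }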
In summary, unlike the conventional matrix transposition mode of whole row reading and whole column writing, in the embodiment of the invention, block transmission is adopted, meanwhile, the transmission efficiency of reading and writing back transposed data blocks is improved, and the access bandwidth is improved by combining the continuous memory access characteristic of the DDR memory.
The array structure design of the on-chip cache for storing the transposed data blocks can support natural transpose emission of data, meets the requirement of a vector processor on SIMD vector transposition, and improves the data access efficiency of the vector processor.
Embodiment 2
This embodiment of the invention discloses a data access device based on the array-type on-chip cache, comprising a data write module and a data read module:
the data write module is used to read data blocks out of DDR in batches, block by block, using a block-structured DMA that supports matrix transfer, and to write each data block into the array-type on-chip cache in either a row-wise or a column-wise write mode; in the row-wise write mode, the data in the data block is written into the cache continuously, row after row; in the column-wise write mode, the data is written row by row into the corresponding rows of the cache, so that the cached data keeps the same row-column structure as the data block;
the data read module is used to fetch data in parallel from the cached data in the array-type on-chip cache, in either a row-wise or a column-wise fetch mode, according to the memory access request of the vector processor; in the row-wise fetch mode, the vector processor fetches data from the cache sequentially, following the contiguous row-wise addresses of the cached data, and then outputs SIMD vector data; in the column-wise fetch mode, the vector processor fetches one element from each row of the cache in parallel order at the same column-wise address, performs SIMD merging, and then outputs SIMD vector data.
The specific technical details and technical effects of this embodiment are the same as those of the previous embodiment, to which reference may be made; they are not repeated here.
Embodiment 3
This embodiment of the invention discloses a transposition method for a large-scale two-dimensional matrix which, as shown in FIG. 4, comprises the following steps:
Step S1: divide the large two-dimensional matrix [L1, L2] to be transposed, located in DDR, into equally sized small matrices [M1, M2]; here N1 = L1/M1 and N2 = L2/M2, where N1 and N2 are positive integers greater than 0.
Step S2: using the data access method based on on-chip cache described in the first embodiment, read the small matrices [M1, M2] out of DDR block by block with the block-structured DMA and write them into the array-type on-chip cache in the column-wise write mode; use the vector processor to fetch data in parallel in the column-wise fetch mode, transposing each small matrix [M1, M2] into a matrix [M2, M1].
Step S3: store the matrix [M2, M1] obtained from each transposed block at its correct position, assembling the matrix [L2, L1]; the correct position is the position that each matrix block [M2, M1] occupies in the matrix [L2, L1] when the matrix [L1, L2] is transposed into the matrix [L2, L1].
The large-matrix transposition realized by this embodiment addresses the main difficulties faced by conventional large-matrix transposition:
For a vector processor with a given SIMD computation width, conventional DMA moves matrix row data into the on-chip cache in contiguous batches; the row-wise storage format cannot match the needs of column-wise computation, and the SIMD data parallelism cannot be fully exploited.
If a simple matrix transposition pre-processing step is performed first, data read row-wise from the source matrix in DDR must be stored back column-wise into an intermediate matrix and only then moved to the on-chip cache for vector computation; the redundant data movement and the lower column-write bandwidth reduce the overall transpose-compute performance.
This embodiment therefore has the effect of improving the overall transpose-compute performance.
Embodiment 4
This embodiment discloses a transposition device for a large-scale two-dimensional matrix which, as shown in FIG. 5, comprises: a matrix division module, a small-matrix transposition module and a matrix synthesis module;
the matrix division module is used to divide the large two-dimensional matrix [L1, L2] to be transposed, located in DDR, into equally sized small matrices [M1, M2]; here N1 = L1/M1 and N2 = L2/M2, where N1 and N2 are positive integers greater than 0;
the small-matrix transposition module is used to read the small matrices [M1, M2] out of DDR block by block with the block-structured DMA, using the data access method based on on-chip cache described in the first embodiment, and to write them into the array-type on-chip cache in the column-wise write mode; the vector processor fetches data in parallel in the column-wise fetch mode, transposing each small matrix [M1, M2] into a matrix [M2, M1];
the matrix synthesis module is used to store the matrix [M2, M1] obtained from each transposed block at its correct position, assembling the matrix [L2, L1]; the correct position is the position that each matrix block [M2, M1] occupies in the matrix [L2, L1] when the matrix [L1, L2] is transposed into the matrix [L2, L1].
The specific technical details and technical effects of this embodiment are the same as those of the previous embodiments, to which reference may be made; they are not repeated here.
The present invention is not limited to the above embodiments; any changes or substitutions that can readily occur to those skilled in the art within the technical scope disclosed by the invention are intended to fall within the protection scope of the invention.

Claims (7)

1. A data access method based on on-chip cache, characterized by comprising the following steps:
Step S1: using a block-structured DMA that supports matrix transfer, read data blocks out of DDR in batches, block by block; write each data block into an array-type on-chip cache in either a row-wise write mode or a column-wise write mode;
in the row-wise write mode, the data in the data block is written into the cache continuously, row after row;
in the column-wise write mode, the data in the data block is written row by row into the corresponding rows of the cache, so that the cached data keeps the same row-column structure as the data block;
Step S2: for the cached data in the array-type on-chip cache, fetch data in parallel in either a row-wise fetch mode or a column-wise fetch mode, according to the memory access request of the vector processor;
in the row-wise fetch mode, the vector processor fetches data from the cache sequentially, following the contiguous row-wise addresses of the cached data, and then outputs SIMD vector data;
in the column-wise fetch mode, the vector processor fetches one element from each row of the cache in parallel order at the same column-wise address, performs SIMD merging, and then outputs SIMD vector data;
in the storage structure of the array-type on-chip cache, the storage space is split into a plurality of row groups; the number of row groups is equal to, or a multiple of, the SIMD merge width of the vector processor;
when the number of row groups equals the SIMD merge width of the vector processor, a number of data rows equal to the SIMD width is written into the array-type on-chip cache for the column-wise fetch mode; during SIMD merging, one element is fetched in parallel order, at the same column-wise address, from each of these SIMD-width rows;
when the number of row groups is a multiple of the SIMD merge width of the vector processor, the process used for the equal-width case is repeated in cycles, processing the data group by group;
the data access method comprises three data transmission modes:
mode 1: step S1 uses the row-wise write mode and step S2 uses the row-wise fetch mode, realizing normal data transmission of the data block;
in this mode, the normally transmitted data block can simply be treated as one contiguous data segment and stored into the on-chip cache sequentially, with no row distinction or gap handling;
mode 2: step S1 uses the column-wise write mode and step S2 uses the column-wise fetch mode, realizing transposed data transmission of the data block;
in mode 2, each row of the data block is stored into a row of the array-type on-chip cache in one-to-one correspondence, and the transposition of the data block is realized by the column-wise fetch of the vector processor;
mode 3: step S1 uses the column-wise write mode and step S2 uses the row-wise fetch mode, realizing normal data transmission of the data block;
step S1 specifically comprises: Step S101: parameterize the block-structured DMA so that the size of the data block moved by each DMA transfer matches the storage scale of the array-type on-chip cache;
Step S102: according to the parameterized configuration, the DMA automatically calculates the DDR address of each row of the data block to be moved;
Step S103: determine the transmission mode from the transmission-mode flag bit of the parameterized configuration; if the flag bit is "0", go to step S104; if the flag bit is "1", go to step S105;
Step S104: perform normal transmission in the row-wise write mode; calculate the total size of the data block and write its data into the cache continuously, row after row, completing the write of the data block into the cache;
Step S105: perform transposed transmission in the column-wise write mode; from the row count of the data block, calculate the cache row address at which each row of the data block is to be written, and from the column count of the data block, calculate the in-row address of each datum in each row; then write the data of the data block row by row into the corresponding cache rows, so that the cached data keeps the same row-column structure as the data block.
2. The data access method based on on-chip cache of claim 1, characterized in that parameterizing the block-structured DMA comprises:
configuring the data block start address Mem_Addr0, which denotes the DDR storage address of the top-left starting element of the data block;
configuring the contiguous valid length X_Slice of the data block; for transposed transmission, X_Slice must not exceed the capacity of each row of the array-type on-chip cache; X_Slice is also smaller than the row width X_Full of the full matrix of original data in DDR;
configuring the row count Y_Slice of the data block; for transposed transmission, Y_Slice is the row count of the array-type on-chip cache or an integer multiple of it;
configuring the transmission-mode flag bit for writing the data block into the array-type on-chip cache; a flag bit of "1" means transposed transmission, and a flag bit of "0" means normal transmission.
3. The data access method based on on-chip cache of claim 1, characterized in that step S2 specifically comprises:
Step S201: from the parameterized configuration of step S1, determine the write mode used for the data in the array-type on-chip cache; if it is the row-wise write mode, go to step S202; if the data block was written into the array-type on-chip cache in the column-wise write mode, go to step S203;
Step S202: in the row-wise fetch mode, the vector processor fetches data from the cache sequentially by row-wise addresses and then outputs SIMD vector data, completing the fetch of the cached data;
Step S203: examine the memory access instruction of the vector processor; if it is a transpose access instruction, go to step S204; if it is a normal access instruction, go to step S205;
Step S204: in the column-wise fetch mode, the vector processor fetches one element from each row of cache addresses in parallel order at the same column-wise address, performs SIMD merging, and then outputs SIMD vector data, completing the fetch of the cached data;
Step S205: in the row-wise fetch mode, the vector processor fetches rows of data from the cache sequentially, row by row, following the contiguous row-wise addresses of the cached data, and then outputs SIMD vector data, completing the fetch of the cached data.
4. The data access method based on on-chip cache of claim 3, characterized in that,
in the column-wise fetch mode, an address decoder calculates the column-wise and row-wise cache addresses used for SIMD merging; according to the calculated column and row addresses, the elements at the same column address are fetched from each row of the cached data in turn and SIMD-merged.
5. A data access device according to the data access method based on on-chip cache of any one of claims 1 to 4, comprising a data write module and a data read module:
the data write module is used to read data blocks out of DDR in batches, block by block, using a block-structured DMA that supports matrix transfer, and to write each data block into the array-type on-chip cache in either a row-wise or a column-wise write mode; in the row-wise write mode, the data in the data block is written into the cache continuously, row after row; in the column-wise write mode, the data is written row by row into the corresponding rows of the cache, so that the cached data keeps the same row-column structure as the data block;
the data read module is used to fetch data in parallel from the cached data in the array-type on-chip cache, in either a row-wise or a column-wise fetch mode, according to the memory access request of the vector processor; in the row-wise fetch mode, the vector processor fetches data from the cache sequentially, following the contiguous row-wise addresses of the cached data, and then outputs SIMD vector data; in the column-wise fetch mode, the vector processor fetches one element from each row of the cache in parallel order at the same column-wise address, performs SIMD merging, and then outputs SIMD vector data.
6. A transposition method for a large-scale two-dimensional matrix, characterized by comprising:
Step S1: divide the large two-dimensional matrix [L1, L2] to be transposed, located in DDR, into equally sized small matrices [M1, M2]; here N1 = L1/M1 and N2 = L2/M2, where N1 and N2 are positive integers greater than 0;
Step S2: using the data access method based on on-chip cache of any one of claims 1 to 4, read the small matrices [M1, M2] out of DDR block by block with the block-structured DMA and write them into the array-type on-chip cache in the column-wise write mode; use the vector processor to fetch data in parallel in the column-wise fetch mode, transposing each small matrix [M1, M2] into a matrix [M2, M1];
Step S3: store the matrix [M2, M1] obtained from each transposed block at its correct position, assembling the matrix [L2, L1]; the correct position is the position that each matrix block [M2, M1] occupies in the matrix [L2, L1] when the matrix [L1, L2] is transposed into the matrix [L2, L1].
7. A transposition device for a large-scale two-dimensional matrix, characterized by comprising: a matrix division module, a small-matrix transposition module and a matrix synthesis module;
the matrix division module is used to divide the large two-dimensional matrix [L1, L2] to be transposed, located in DDR, into equally sized small matrices [M1, M2]; here N1 = L1/M1 and N2 = L2/M2, where N1 and N2 are positive integers greater than 0;
the small-matrix transposition module is used to read the small matrices [M1, M2] out of DDR block by block with the block-structured DMA, using the data access method based on on-chip cache of any one of claims 1 to 4, and to write them into the array-type on-chip cache in the column-wise write mode; the vector processor fetches data in parallel in the column-wise fetch mode, transposing each small matrix [M1, M2] into a matrix [M2, M1];
the matrix synthesis module is used to store the matrix [M2, M1] obtained from each transposed block at its correct position, assembling the matrix [L2, L1]; the correct position is the position that each matrix block [M2, M1] occupies in the matrix [L2, L1] when the matrix [L1, L2] is transposed into the matrix [L2, L1].
CN202211580278.1A 2022-12-09 2022-12-09 Data access method and device based on on-chip cache and transposition method and device Active CN116150055B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211580278.1A CN116150055B (en) Data access method and device based on on-chip cache and transposition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211580278.1A CN116150055B (en) Data access method and device based on on-chip cache and transposition method and device

Publications (2)

Publication Number Publication Date
CN116150055A (en) 2023-05-23
CN116150055B (en) 2023-12-29

Family

ID=86351643

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211580278.1A Active CN116150055B (en) Data access method and device based on on-chip cache and transposition method and device

Country Status (1)

Country Link
CN (1) CN116150055B (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103135096A (en) * 2013-01-11 2013-06-05 北京理工大学 Synthetic aperture radar imaging and processing transposition storage method and data access method
CN106933756A (en) * 2015-12-31 2017-07-07 北京国睿中数科技股份有限公司 For the quick transposition methods of DMA and device of variable matrix
CN110781447A (en) * 2019-10-19 2020-02-11 天津大学 DDR-based high-efficiency matrix transposition processing method
CN115185859A (en) * 2022-09-13 2022-10-14 北京天地一格科技有限公司 Radar signal processing system and low-delay matrix transposition processing device and method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on multi-mode SAR real-time imaging algorithms based on GPU; Zhai Xingang et al.; Electronic Measurement Technology (电子测量技术); Vol. 39, No. 10; pp. 81-86 *

Also Published As

Publication number Publication date
CN116150055A (en) 2023-05-23

Similar Documents

Publication Publication Date Title
US10210935B2 (en) Associative row decoder
US10153042B2 (en) In-memory computational device with bit line processors
US5546343A (en) Method and apparatus for a single instruction operating multiple processors on a memory chip
KR920001618B1 (en) Orthoginal transform processor
US20120163113A1 (en) Memory controller and memory controlling method
JP2010521728A (en) Circuit for data compression and processor using the same
WO2001090915A2 (en) Processor array and parallel data processing methods
KR100503094B1 (en) DSP having wide memory bandwidth and DSP memory mapping method
US6804771B1 (en) Processor with register file accessible by row column to achieve data array transposition
CN110674927A (en) Data recombination method for pulse array structure
US10552307B2 (en) Storing arrays of data in data processing systems
CN101776988B (en) Restructurable matrix register file with changeable block size
JP3320922B2 (en) Memory device
Hidalgo et al. Area-efficient architecture for fast Fourier transform
US6085304A (en) Interface for processing element array
CN116150055B (en) Data access method and device based on on-chip cache and transposition method and device
US5673214A (en) Discrete cosine transform processor
US20140089370A1 (en) Parallel bit reversal devices and methods
US8581918B2 (en) Method and system for efficiently organizing data in memory
CN111814675B (en) Convolutional neural network feature map assembly system supporting dynamic resolution based on FPGA
CN114330635A (en) Device and method for scaling and accelerating data of neural network
CN114072778A (en) Memory processing unit architecture
CN115995249B (en) Matrix transposition operation device based on DRAM
JPH07210545A (en) Parallel processing processors
US20240045922A1 (en) Zero padding for convolutional neural networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant