CN116150055B - Data access method and device based on on-chip cache and transposition method and device - Google Patents

Data access method and device based on on-chip cache and transposition method and device

Info

Publication number
CN116150055B
Authority
CN
China
Prior art keywords
data
cache
line
mode
matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211580278.1A
Other languages
Chinese (zh)
Other versions
CN116150055A (en)
Inventor
Wang Yinshen (王胤燊)
Zhou Liangjiang (周良将)
Wang Bingnan (汪丙南)
Ding Manlai (丁满来)
Ding Chibiao (丁赤飚)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Aerospace Information Research Institute of CAS
Original Assignee
Aerospace Information Research Institute of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Aerospace Information Research Institute of CAS filed Critical Aerospace Information Research Institute of CAS
Priority to CN202211580278.1A
Publication of CN116150055A
Application granted
Publication of CN116150055B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/14 Handling requests for interconnection or transfer
    • G06F13/20 Handling requests for interconnection or transfer for access to input/output bus
    • G06F13/28 Handling requests for interconnection or transfer for access to input/output bus using burst mode transfer, e.g. direct memory access DMA, cycle steal
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/76 Architectures of general purpose stored program computers
    • G06F15/78 Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807 System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G06F15/781 On-chip cache; Off-chip memory
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The invention relates to a data access method and device based on on-chip cache, and to a transposition method and device. The data access method comprises the following steps: using a block-structured DMA that supports matrix transfer, data blocks are read out of DDR in batches, block by block; each data block is written into an array-type on-chip cache in either a row-wise or a column-wise write mode; and, according to the memory access requests of a vector processor for the cached data in the array-type on-chip cache, data is fetched in parallel in either a row-wise or a column-wise fetch mode. The transposition method comprises: dividing a large two-dimensional matrix to be transposed into equally sized small matrices; reading the small matrices out block by block with the on-chip-cache-based data access method and transposing them; and storing the transposed matrix obtained from each block at its correct position to assemble the large two-dimensional transposed matrix. The invention improves the performance of a vector processor in fetching column data from matrix data and improves the efficiency of matrix transposition.

Description

Data access method and device based on on-chip cache and transposition method and device
Technical Field
The invention belongs to the technical field of data processing, and particularly relates to a data access method and device based on on-chip cache, and a transposition method and device.
Background
In the field of large-scale signal processing, the matrices to be processed are large, often reaching several hundred MB, and must be processed in multiple data batches. Signal matrices, moreover, are frequently processed column by column, for example by column-wise FFTs or column-wise conjugate operations.
Vector processors based on the SIMD (single instruction, multiple data) computing model, on the other hand, are often used for vector and matrix computing tasks. They achieve high computational density by packing multiple operands that execute the same instruction into one wide register and accessing and operating on them together. When faced with column-wise matrix accesses, however, SIMD operation directly reduces the efficiency of a vector processor, because the non-contiguous storage of the data elements prevents the operands from being packed and fetched together.
In addition, conventional matrix transposition reads data row by row into the on-chip cache via DMA and then stores it back to DDR memory column by column. During the column-wise store, because the data is non-contiguous in DDR, the DDR must activate many access channels, which lowers the transfer bandwidth and ultimately becomes the performance bottleneck of the whole processing flow.
Disclosure of Invention
In view of the above analysis, the invention aims to disclose a data access method and device based on on-chip cache, and a transposition method and device, which improve the performance of a vector processor in fetching column data from matrix data and improve the efficiency of matrix transposition.
The invention discloses a data access method based on on-chip cache, comprising the following steps:
Step S1: using a block-structured DMA that supports matrix transfer, read data blocks out of DDR in batches, block by block; write each data block into an array-type on-chip cache in either a row-wise write mode or a column-wise write mode;
in the row-wise write mode, the data in the data block is written into the cache continuously, row after row;
in the column-wise write mode, the data in the data block is written row by row into the corresponding rows of the cache, so that the cached data keeps the same row-column structure as the data block;
Step S2: for the cached data in the array-type on-chip cache, fetch data in parallel in either a row-wise fetch mode or a column-wise fetch mode, according to the memory access request of the vector processor;
in the row-wise fetch mode, the vector processor fetches data from the cache sequentially, following the contiguous row-wise addresses of the cached data, and then outputs SIMD vector data;
in the column-wise fetch mode, the vector processor fetches one element from each row of the cache in parallel order at the same column-wise address, performs SIMD merging, and then outputs SIMD vector data.
Further, the data access method comprises three data transmission modes:
Mode 1: step S1 uses the row-wise write mode and step S2 uses the row-wise fetch mode, realizing normal data transmission of the data block;
Mode 2: step S1 uses the column-wise write mode and step S2 uses the column-wise fetch mode, realizing transposed data transmission of the data block;
Mode 3: step S1 uses the column-wise write mode and step S2 uses the row-wise fetch mode, again realizing normal data transmission of the data block.
Further, in the storage structure of the array-type on-chip cache, the storage space is split into a plurality of row groups; the number of row groups is equal to, or a multiple of, the SIMD merge width of the vector processor.
Further, step S1 specifically comprises:
Step S101: parameterize the block-structured DMA so that the size of the data block moved by each DMA transfer matches the storage scale of the array-type on-chip cache;
Step S102: according to the parameterized configuration, the DMA automatically calculates the DDR address of each row of the data block to be moved;
Step S103: determine the transmission mode from the transmission-mode flag bit of the parameterized configuration; if the flag bit is "0", go to step S104; if the flag bit is "1", go to step S105;
Step S104: perform normal transmission in the row-wise write mode; calculate the total size of the data block and write its data into the cache continuously, row after row, completing the write of the data block into the cache;
Step S105: perform transposed transmission in the column-wise write mode; from the row count of the data block, calculate the cache row address at which each row of the data block is to be written, and from the column count of the data block, calculate the in-row address of each datum in each row; then write the data of the data block row by row into the corresponding cache rows, so that the cached data keeps the same row-column structure as the data block.
Further, parameterizing the block-structured DMA comprises:
configuring the data block start address Mem_Addr0, which denotes the DDR storage address of the top-left starting element of the data block;
configuring the contiguous valid length X_Slice of the data block; for transposed transmission, X_Slice must not exceed the capacity of each row of the array-type on-chip cache; X_Slice is also smaller than the row width X_Full of the full matrix of original data in DDR;
configuring the row count Y_Slice of the data block; for transposed transmission, Y_Slice is the row count of the array-type on-chip cache or an integer multiple of it;
configuring the transmission-mode flag bit for writing the data block into the array-type on-chip cache; a flag bit of "1" means transposed transmission, and a flag bit of "0" means normal transmission.
Further, step S2 specifically comprises:
Step S201: from the parameterized configuration of step S1, determine the write mode used for the data in the array-type on-chip cache; if it is the row-wise write mode, go to step S202; if the data block was written into the array-type on-chip cache in the column-wise write mode, go to step S203;
Step S202: in the row-wise fetch mode, the vector processor fetches data from the cache sequentially by row-wise addresses and then outputs SIMD vector data, completing the fetch of the cached data;
Step S203: examine the memory access instruction of the vector processor; if it is a transpose access instruction, go to step S204; if it is a normal access instruction, go to step S205;
Step S204: in the column-wise fetch mode, the vector processor fetches one element from each row of cache addresses in parallel order at the same column-wise address, performs SIMD merging, and then outputs SIMD vector data, completing the fetch of the cached data;
Step S205: in the row-wise fetch mode, the vector processor fetches rows of data from the cache sequentially, row by row, following the contiguous row-wise addresses of the cached data, and then outputs SIMD vector data, completing the fetch of the cached data.
Further, in the column-wise fetch mode, an address decoder calculates the column-wise and row-wise cache addresses used for SIMD merging; according to the calculated column and row addresses, the elements at the same column address are fetched from each row of the cached data in turn and SIMD-merged.
The invention also discloses a data access device based on the array-type on-chip cache, comprising a data write module and a data read module:
the data write module is used to read data blocks out of DDR in batches, block by block, using a block-structured DMA that supports matrix transfer, and to write each data block into the array-type on-chip cache in either a row-wise or a column-wise write mode; in the row-wise write mode, the data in the data block is written into the cache continuously, row after row; in the column-wise write mode, the data is written row by row into the corresponding rows of the cache, so that the cached data keeps the same row-column structure as the data block;
the data read module is used to fetch data in parallel from the cached data in the array-type on-chip cache, in either a row-wise or a column-wise fetch mode, according to the memory access request of the vector processor; in the row-wise fetch mode, the vector processor fetches data from the cache sequentially, following the contiguous row-wise addresses of the cached data, and then outputs SIMD vector data; in the column-wise fetch mode, the vector processor fetches one element from each row of the cache in parallel order at the same column-wise address, performs SIMD merging, and then outputs SIMD vector data.
The invention also discloses a transposition method for a large-scale two-dimensional matrix, comprising the following steps:
Step S1: divide the large two-dimensional matrix [L1, L2] to be transposed, located in DDR, into equally sized small matrices [M1, M2]; here N1 = L1/M1 and N2 = L2/M2, where N1 and N2 are positive integers greater than 0;
Step S2: using the data access method based on on-chip cache described above, read the small matrices [M1, M2] out of DDR block by block with the block-structured DMA and write them into the array-type on-chip cache in the column-wise write mode; use the vector processor to fetch data in parallel in the column-wise fetch mode, transposing each small matrix [M1, M2] into a matrix [M2, M1];
Step S3: store the matrix [M2, M1] obtained from each transposed block at its correct position, assembling the matrix [L2, L1]; the correct position is the position that each matrix block [M2, M1] occupies in the matrix [L2, L1] when the matrix [L1, L2] is transposed into the matrix [L2, L1].
The invention also discloses a transposition device for a large-scale two-dimensional matrix, comprising: a matrix division module, a small-matrix transposition module and a matrix synthesis module;
the matrix division module is used to divide the large two-dimensional matrix [L1, L2] to be transposed, located in DDR, into equally sized small matrices [M1, M2]; here N1 = L1/M1 and N2 = L2/M2, where N1 and N2 are positive integers greater than 0;
the small-matrix transposition module is used to read the small matrices [M1, M2] out of DDR block by block with the block-structured DMA, using the data access method based on on-chip cache described above, and to write them into the array-type on-chip cache in the column-wise write mode; the vector processor fetches data in parallel in the column-wise fetch mode, transposing each small matrix [M1, M2] into a matrix [M2, M1];
the matrix synthesis module is used to store the matrix [M2, M1] obtained from each transposed block at its correct position, assembling the matrix [L2, L1]; the correct position is the position that each matrix block [M2, M1] occupies in the matrix [L2, L1] when the matrix [L1, L2] is transposed into the matrix [L2, L1].
The invention can realize at least one of the following beneficial effects:
Unlike the traditional matrix transposition scheme of whole-row reads and whole-column writes, the invention uses block transmission, which improves the transfer efficiency of reading and writing back transposed data blocks and raises the access bandwidth by exploiting the contiguous-access characteristic of DDR memory.
The array-structure design of the on-chip cache that stores the transposed data blocks supports natural transposed emission of the data, satisfies the vector processor's requirement for SIMD vector transposition, and improves the data access efficiency of the vector processor.
Drawings
The drawings are only for purposes of illustrating particular embodiments and are not to be construed as limiting the invention; like reference numerals refer to like parts throughout the several views.
FIG. 1 is a flowchart of a data access method based on on-chip cache in a first embodiment of the present invention;
FIG. 2 is a flowchart illustrating a process of writing a data block into an on-chip cache according to a first embodiment of the present invention;
FIG. 3 is a flowchart illustrating a read process of on-chip cache data according to a first embodiment of the present invention;
FIG. 4 is a flowchart of a method for transposing a large-scale two-dimensional matrix in a third embodiment of the present invention;
FIG. 5 is a block diagram of a transposition device for a large-scale two-dimensional matrix in a fourth embodiment of the present invention.
Detailed Description
Preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings, which form a part of the present application and, together with the embodiments of the invention, serve to explain its principles.
Embodiment 1
An embodiment of the invention discloses a data access method based on on-chip cache which, as shown in FIG. 1, comprises the following steps:
Step S1: using a block-structured DMA that supports matrix transfer, read data blocks out of DDR in batches, block by block; write each data block into an array-type on-chip cache in either a row-wise write mode or a column-wise write mode;
in the row-wise write mode, the data in the data block is written into the cache continuously, row after row;
in the column-wise write mode, the data in the data block is written row by row into the corresponding rows of the cache, so that the cached data keeps the same row-column structure as the data block;
Step S2: for the cached data in the array-type on-chip cache, fetch data in parallel in either a row-wise fetch mode or a column-wise fetch mode, according to the memory access request of the vector processor;
in the row-wise fetch mode, the vector processor fetches data from the cache sequentially, following the contiguous row-wise addresses of the cached data, and then outputs SIMD vector data;
in the column-wise fetch mode, the vector processor fetches one element from each row of the cache in parallel order at the same column-wise address, performs SIMD merging, and then outputs SIMD vector data.
Specifically, in this embodiment, the storage space of the array-type on-chip cache is split into a plurality of row groups; the number of row groups is equal to, or a multiple of, the SIMD merge width of the vector processor.
When the number of row groups equals the SIMD merge width of the vector processor, a number of data rows equal to the SIMD width is written into the array-type on-chip cache for the column-wise fetch mode; during SIMD merging, one element is fetched in parallel order, at the same column-wise address, from each of these SIMD-width rows.
When the number of row groups is a multiple of the SIMD merge width of the vector processor, the above process is repeated in cycles, processing the data group by group.
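For illustration only, and not as the claimed hardware, the following minimal C model assumes a SIMD merge width of 8 and a cache of SIMD_W row groups of ROW_CAP elements each (both values are assumptions); it shows how a column-wise fetch gathers one element per row at the same in-row address:

    #include <stdint.h>

    #define SIMD_W  8     /* assumed SIMD merge width of the vector processor */
    #define ROW_CAP 16    /* assumed capacity, in elements, of each cache row */

    static int32_t cache[SIMD_W][ROW_CAP];   /* one storage row per SIMD lane */

    /* Column-wise fetch: read the element at the same in-row column address
     * from every row in parallel order and merge the results into one SIMD
     * vector. With K*SIMD_W rows, this is simply repeated K times, once per
     * group of SIMD_W rows, as described above. */
    static void fetch_column(int col, int32_t simd_out[SIMD_W])
    {
        for (int r = 0; r < SIMD_W; r++)
            simd_out[r] = cache[r][col];
    }

A row-wise fetch, by contrast, reads a contiguous run of elements from a single row.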
Specifically, the data access method based on on-chip cache in this embodiment comprises three data transmission modes:
Mode 1: step S1 uses the row-wise write mode and step S2 uses the row-wise fetch mode, realizing normal data transmission of the data block.
In this mode, the normally transmitted data block can simply be treated as one contiguous data segment and stored into the on-chip cache sequentially, with no row distinction or gap handling; this corresponds to the normal row-wise fetch format of the vector processor.
Mode 2: step S1 uses the column-wise write mode and step S2 uses the column-wise fetch mode, realizing transposed data transmission of the data block.
In mode 2, each row of the data block is stored into a row of the array-type on-chip cache in one-to-one correspondence, and the transposition of the data block is realized by the column-wise fetch of the vector processor; this improves the transfer efficiency of reading and writing back transposed data blocks and raises the access bandwidth by exploiting the contiguous-access characteristic of DDR memory.
Mode 3: step S1 uses the column-wise write mode and step S2 uses the row-wise fetch mode, again realizing normal data transmission of the data block.
In mode 3, normal reads of the data block data and reads of its transposed data can both be obtained through a simple change of instruction, which improves processing efficiency for complex operations on the data block, as the sketch below summarizes.
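As a summary sketch only (the enum and function names are ours, not the patent's), the net effect of each write/fetch combination can be written as:

    typedef enum { WRITE_ROW, WRITE_COL } write_mode_t;   /* step S1 choice */
    typedef enum { FETCH_ROW, FETCH_COL } fetch_mode_t;   /* step S2 choice */

    /* Mode 1: WRITE_ROW + FETCH_ROW -> normal transfer.
     * Mode 2: WRITE_COL + FETCH_COL -> transposed transfer.
     * Mode 3: WRITE_COL + FETCH_ROW -> normal transfer; once a block is
     * column-written into the cache, normal and transposed reads differ
     * only in the fetch instruction used. */
    static int output_is_transposed(write_mode_t w, fetch_mode_t f)
    {
        return (w == WRITE_COL) && (f == FETCH_COL);
    }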
In a more specific scheme, as shown in FIG. 2, step S1 specifically comprises:
Step S101: parameterize the block-structured DMA so that the size of the data block moved by each DMA transfer matches the storage scale of the array-type on-chip cache.
Specifically, parameterizing the block-structured DMA comprises:
configuring the data block start address Mem_Addr0, which denotes the DDR storage address of the top-left starting element of the data block;
configuring the contiguous valid length X_Slice of the data block; for transposed transmission, X_Slice must not exceed the capacity of each row of the array-type on-chip cache; X_Slice is also smaller than the row width X_Full of the full matrix of original data in DDR;
configuring the row count Y_Slice of the data block; for transposed transmission, Y_Slice is the row count of the array-type on-chip cache or an integer multiple of it;
configuring the transmission-mode flag bit for writing the data block into the array-type on-chip cache; a flag bit of "1" means transposed transmission, and a flag bit of "0" means normal transmission.
Step S102: according to the parameterized configuration, the DMA automatically calculates the DDR address of each row of the data block to be moved.
Step S103: determine the transmission mode from the transmission-mode flag bit of the parameterized configuration; if the flag bit is "0", go to step S104; if the flag bit is "1", go to step S105.
Step S104: perform normal transmission in the row-wise write mode; calculate the total size of the data block and write its data into the cache continuously, row after row, completing the write of the data block into the cache.
Step S105: perform transposed transmission in the column-wise write mode; from the row count of the data block, calculate the cache row address at which each row of the data block is to be written, and from the column count of the data block, calculate the in-row address of each datum in each row; then write the data of the data block row by row into the corresponding cache rows, so that the cached data keeps the same row-column structure as the data block.
In this step, the parameter-configurable DMA structure supporting matrix transfer allows data in memory to be moved to the processor's on-chip cache in blocks and in batches at the hardware level, reducing the time the processor spends on data pre-processing; combined with the column access function of the on-chip storage, it supports the column-wise computation needs of the vector processor.
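As a software-level sketch of this parameterization (the field names follow the text, while the element-size argument and the row-stride formula are our assumptions), the DMA descriptor and the per-row source address calculation of step S102 might look like:

    #include <stdint.h>

    typedef struct {
        uint64_t mem_addr0;  /* DDR address of the block's top-left element  */
        uint32_t x_slice;    /* valid row length, <= cache row capacity      */
        uint32_t y_slice;    /* row count, a multiple of the cache row count */
        uint32_t x_full;     /* row width of the full source matrix in DDR   */
        uint8_t  transpose;  /* flag bit: 1 = transposed, 0 = normal         */
    } block_dma_cfg_t;

    /* DDR source address of row i of the block: consecutive block rows are
     * assumed to lie one full matrix row (x_full elements) apart in DDR. */
    static uint64_t dma_row_addr(const block_dma_cfg_t *c, uint32_t i,
                                 uint32_t elem_size)
    {
        return c->mem_addr0 + (uint64_t)i * c->x_full * elem_size;
    }

Each of the y_slice rows fetched this way is then written to the cache either contiguously (flag "0", step S104) or into its corresponding cache row (flag "1", step S105).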
Specifically, as shown in FIG. 3, step S2 comprises:
Step S201: from the parameterized configuration of step S1, determine the write mode used for the data in the array-type on-chip cache; if it is the row-wise write mode, go to step S202; if the data block was written into the array-type on-chip cache in the column-wise write mode, go to step S203.
Step S202: in the row-wise fetch mode, the vector processor fetches data from the cache sequentially by row-wise addresses and then outputs SIMD vector data, completing the fetch of the cached data.
Step S203: examine the memory access instruction of the vector processor; if it is a transpose access instruction, go to step S204; if it is a normal access instruction, go to step S205.
Step S204: in the column-wise fetch mode, the vector processor fetches one element from each row of cache addresses in parallel order at the same column-wise address, performs SIMD merging, and then outputs SIMD vector data, completing the fetch of the cached data.
Specifically, in the column-wise fetch mode, an address decoder calculates the column-wise and row-wise cache addresses used for SIMD merging; according to the calculated column and row addresses, the elements at the same column address are fetched from each row of the cached data in turn and SIMD-merged.
More specifically, the transpose memory access instructions of the vector processor are LOADC/STORC (C for Column). After decoding, the column access instruction is sent in parallel to the on-chip cache storage rows together with the in-row column address obtained by decoding. The data element corresponding to that column address is read out of each row, sent to the SIMD merge unit, merged and spliced into complete SIMD data, and finally output to the processor through the selection logic.
Step S205: in the row-wise fetch mode, the vector processor fetches rows of data from the cache sequentially, row by row, following the contiguous row-wise addresses of the cached data, and then outputs SIMD vector data, completing the fetch of the cached data.
Specifically, the normal memory access instructions of the vector processor are LOAD/STORE. After decoding, a normal access instruction yields the in-row address of a single storage row, and the address is sent to that one storage row only. The storage row fetches a segment of contiguous row-wise SIMD data, which is returned to the processor through the selection logic, realizing a simple row access function.
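Continuing the toy cache declarations from the earlier sketch (SIMD_W, ROW_CAP and cache), a minimal software model of the two fetch paths might read as follows; the split of the address into row and column parts is an assumption, not the patented decode logic:

    /* LOADC (column access, step S204): the decoded in-row column address is
     * broadcast to all SIMD_W storage rows at once; each row returns one
     * element, and the merge unit splices them into a full SIMD vector. */
    static void loadc(uint32_t addr, int32_t simd_out[SIMD_W])
    {
        uint32_t col = addr % ROW_CAP;        /* assumed in-row address split */
        for (int r = 0; r < SIMD_W; r++)
            simd_out[r] = cache[r][col];
    }

    /* LOAD (normal access, step S205): the address selects a single storage
     * row, which returns a contiguous row-wise segment of SIMD data. */
    static void load(uint32_t addr, int32_t simd_out[SIMD_W])
    {
        uint32_t row = addr / ROW_CAP;
        uint32_t col = addr % ROW_CAP;
        for (int i = 0; i < SIMD_W; i++)
            simd_out[i] = cache[row][col + i];  /* assumes col + SIMD_W <= ROW_CAP */
    }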
In summary, unlike the conventional matrix transposition mode of whole row reading and whole column writing, in the embodiment of the invention, block transmission is adopted, meanwhile, the transmission efficiency of reading and writing back transposed data blocks is improved, and the access bandwidth is improved by combining the continuous memory access characteristic of the DDR memory.
The array structure design of the on-chip cache for storing the transposed data blocks can support natural transpose emission of data, meets the requirement of a vector processor on SIMD vector transposition, and improves the data access efficiency of the vector processor.
Embodiment 2
This embodiment of the invention discloses a data access device based on the array-type on-chip cache, comprising a data write module and a data read module:
the data write module is used to read data blocks out of DDR in batches, block by block, using a block-structured DMA that supports matrix transfer, and to write each data block into the array-type on-chip cache in either a row-wise or a column-wise write mode; in the row-wise write mode, the data in the data block is written into the cache continuously, row after row; in the column-wise write mode, the data is written row by row into the corresponding rows of the cache, so that the cached data keeps the same row-column structure as the data block;
the data read module is used to fetch data in parallel from the cached data in the array-type on-chip cache, in either a row-wise or a column-wise fetch mode, according to the memory access request of the vector processor; in the row-wise fetch mode, the vector processor fetches data from the cache sequentially, following the contiguous row-wise addresses of the cached data, and then outputs SIMD vector data; in the column-wise fetch mode, the vector processor fetches one element from each row of the cache in parallel order at the same column-wise address, performs SIMD merging, and then outputs SIMD vector data.
The specific technical details and technical effects of this embodiment are the same as those of the previous embodiment, to which reference may be made; they are not repeated here.
Embodiment 3
This embodiment of the invention discloses a transposition method for a large-scale two-dimensional matrix which, as shown in FIG. 4, comprises the following steps:
Step S1: divide the large two-dimensional matrix [L1, L2] to be transposed, located in DDR, into equally sized small matrices [M1, M2]; here N1 = L1/M1 and N2 = L2/M2, where N1 and N2 are positive integers greater than 0.
Step S2: using the data access method based on on-chip cache described in the first embodiment, read the small matrices [M1, M2] out of DDR block by block with the block-structured DMA and write them into the array-type on-chip cache in the column-wise write mode; use the vector processor to fetch data in parallel in the column-wise fetch mode, transposing each small matrix [M1, M2] into a matrix [M2, M1].
Step S3: store the matrix [M2, M1] obtained from each transposed block at its correct position, assembling the matrix [L2, L1]; the correct position is the position that each matrix block [M2, M1] occupies in the matrix [L2, L1] when the matrix [L1, L2] is transposed into the matrix [L2, L1].
The large-matrix transposition realized by this embodiment addresses the main difficulties faced by conventional large-matrix transposition:
For a vector processor with a given SIMD computation width, conventional DMA moves matrix row data into the on-chip cache in contiguous batches; the row-wise storage format cannot match the needs of column-wise computation, and the SIMD data parallelism cannot be fully exploited.
If a simple matrix transposition pre-processing step is performed first, data read row-wise from the source matrix in DDR must be stored back column-wise into an intermediate matrix and only then moved to the on-chip cache for vector computation; the redundant data movement and the lower column-write bandwidth reduce the overall transpose-compute performance.
This embodiment therefore has the effect of improving the overall transpose-compute performance.
Embodiment 4
This embodiment discloses a transposition device for a large-scale two-dimensional matrix which, as shown in FIG. 5, comprises: a matrix division module, a small-matrix transposition module and a matrix synthesis module;
the matrix division module is used to divide the large two-dimensional matrix [L1, L2] to be transposed, located in DDR, into equally sized small matrices [M1, M2]; here N1 = L1/M1 and N2 = L2/M2, where N1 and N2 are positive integers greater than 0;
the small-matrix transposition module is used to read the small matrices [M1, M2] out of DDR block by block with the block-structured DMA, using the data access method based on on-chip cache described in the first embodiment, and to write them into the array-type on-chip cache in the column-wise write mode; the vector processor fetches data in parallel in the column-wise fetch mode, transposing each small matrix [M1, M2] into a matrix [M2, M1];
the matrix synthesis module is used to store the matrix [M2, M1] obtained from each transposed block at its correct position, assembling the matrix [L2, L1]; the correct position is the position that each matrix block [M2, M1] occupies in the matrix [L2, L1] when the matrix [L1, L2] is transposed into the matrix [L2, L1].
The specific technical details and technical effects of this embodiment are the same as those of the previous embodiments, to which reference may be made; they are not repeated here.
The present invention is not limited to the above embodiments; any changes or substitutions that can readily occur to those skilled in the art within the technical scope disclosed by the invention are intended to fall within the protection scope of the invention.

Claims (7)

1. A data access method based on on-chip cache, characterized by comprising the following steps:
Step S1: using a block-structured DMA that supports matrix transfer, read data blocks out of DDR in batches, block by block; write each data block into an array-type on-chip cache in either a row-wise write mode or a column-wise write mode;
in the row-wise write mode, the data in the data block is written into the cache continuously, row after row;
in the column-wise write mode, the data in the data block is written row by row into the corresponding rows of the cache, so that the cached data keeps the same row-column structure as the data block;
Step S2: for the cached data in the array-type on-chip cache, fetch data in parallel in either a row-wise fetch mode or a column-wise fetch mode, according to the memory access request of the vector processor;
in the row-wise fetch mode, the vector processor fetches data from the cache sequentially, following the contiguous row-wise addresses of the cached data, and then outputs SIMD vector data;
in the column-wise fetch mode, the vector processor fetches one element from each row of the cache in parallel order at the same column-wise address, performs SIMD merging, and then outputs SIMD vector data;
in the storage structure of the array-type on-chip cache, the storage space is split into a plurality of row groups; the number of row groups is equal to, or a multiple of, the SIMD merge width of the vector processor;
when the number of row groups equals the SIMD merge width of the vector processor, a number of data rows equal to the SIMD width is written into the array-type on-chip cache for the column-wise fetch mode; during SIMD merging, one element is fetched in parallel order, at the same column-wise address, from each of these SIMD-width rows;
when the number of row groups is a multiple of the SIMD merge width of the vector processor, the process used for the equal-width case is repeated in cycles, processing the data group by group;
the data access method comprises three data transmission modes:
mode 1: step S1 uses the row-wise write mode and step S2 uses the row-wise fetch mode, realizing normal data transmission of the data block;
in this mode, the normally transmitted data block can simply be treated as one contiguous data segment and stored into the on-chip cache sequentially, with no row distinction or gap handling;
mode 2: step S1 uses the column-wise write mode and step S2 uses the column-wise fetch mode, realizing transposed data transmission of the data block;
in mode 2, each row of the data block is stored into a row of the array-type on-chip cache in one-to-one correspondence, and the transposition of the data block is realized by the column-wise fetch of the vector processor;
mode 3: step S1 uses the column-wise write mode and step S2 uses the row-wise fetch mode, realizing normal data transmission of the data block;
step S1 specifically comprises: Step S101: parameterize the block-structured DMA so that the size of the data block moved by each DMA transfer matches the storage scale of the array-type on-chip cache;
Step S102: according to the parameterized configuration, the DMA automatically calculates the DDR address of each row of the data block to be moved;
Step S103: determine the transmission mode from the transmission-mode flag bit of the parameterized configuration; if the flag bit is "0", go to step S104; if the flag bit is "1", go to step S105;
Step S104: perform normal transmission in the row-wise write mode; calculate the total size of the data block and write its data into the cache continuously, row after row, completing the write of the data block into the cache;
Step S105: perform transposed transmission in the column-wise write mode; from the row count of the data block, calculate the cache row address at which each row of the data block is to be written, and from the column count of the data block, calculate the in-row address of each datum in each row; then write the data of the data block row by row into the corresponding cache rows, so that the cached data keeps the same row-column structure as the data block.
2. The data access method based on on-chip cache of claim 1, characterized in that parameterizing the block-structured DMA comprises:
configuring the data block start address Mem_Addr0, which denotes the DDR storage address of the top-left starting element of the data block;
configuring the contiguous valid length X_Slice of the data block; for transposed transmission, X_Slice must not exceed the capacity of each row of the array-type on-chip cache; X_Slice is also smaller than the row width X_Full of the full matrix of original data in DDR;
configuring the row count Y_Slice of the data block; for transposed transmission, Y_Slice is the row count of the array-type on-chip cache or an integer multiple of it;
configuring the transmission-mode flag bit for writing the data block into the array-type on-chip cache; a flag bit of "1" means transposed transmission, and a flag bit of "0" means normal transmission.
3. The data access method based on on-chip cache of claim 1, characterized in that step S2 specifically comprises:
Step S201: from the parameterized configuration of step S1, determine the write mode used for the data in the array-type on-chip cache; if it is the row-wise write mode, go to step S202; if the data block was written into the array-type on-chip cache in the column-wise write mode, go to step S203;
Step S202: in the row-wise fetch mode, the vector processor fetches data from the cache sequentially by row-wise addresses and then outputs SIMD vector data, completing the fetch of the cached data;
Step S203: examine the memory access instruction of the vector processor; if it is a transpose access instruction, go to step S204; if it is a normal access instruction, go to step S205;
Step S204: in the column-wise fetch mode, the vector processor fetches one element from each row of cache addresses in parallel order at the same column-wise address, performs SIMD merging, and then outputs SIMD vector data, completing the fetch of the cached data;
Step S205: in the row-wise fetch mode, the vector processor fetches rows of data from the cache sequentially, row by row, following the contiguous row-wise addresses of the cached data, and then outputs SIMD vector data, completing the fetch of the cached data.
4. The data access method based on on-chip cache of claim 3, characterized in that,
in the column-wise fetch mode, an address decoder calculates the column-wise and row-wise cache addresses used for SIMD merging; according to the calculated column and row addresses, the elements at the same column address are fetched from each row of the cached data in turn and SIMD-merged.
5. A data access device according to the data access method based on on-chip cache of any one of claims 1 to 4, comprising a data write module and a data read module:
the data write module is used to read data blocks out of DDR in batches, block by block, using a block-structured DMA that supports matrix transfer, and to write each data block into the array-type on-chip cache in either a row-wise or a column-wise write mode; in the row-wise write mode, the data in the data block is written into the cache continuously, row after row; in the column-wise write mode, the data is written row by row into the corresponding rows of the cache, so that the cached data keeps the same row-column structure as the data block;
the data read module is used to fetch data in parallel from the cached data in the array-type on-chip cache, in either a row-wise or a column-wise fetch mode, according to the memory access request of the vector processor; in the row-wise fetch mode, the vector processor fetches data from the cache sequentially, following the contiguous row-wise addresses of the cached data, and then outputs SIMD vector data; in the column-wise fetch mode, the vector processor fetches one element from each row of the cache in parallel order at the same column-wise address, performs SIMD merging, and then outputs SIMD vector data.
6. A transposition method for a large-scale two-dimensional matrix, characterized by comprising:
Step S1: divide the large two-dimensional matrix [L1, L2] to be transposed, located in DDR, into equally sized small matrices [M1, M2]; here N1 = L1/M1 and N2 = L2/M2, where N1 and N2 are positive integers greater than 0;
Step S2: using the data access method based on on-chip cache of any one of claims 1 to 4, read the small matrices [M1, M2] out of DDR block by block with the block-structured DMA and write them into the array-type on-chip cache in the column-wise write mode; use the vector processor to fetch data in parallel in the column-wise fetch mode, transposing each small matrix [M1, M2] into a matrix [M2, M1];
Step S3: store the matrix [M2, M1] obtained from each transposed block at its correct position, assembling the matrix [L2, L1]; the correct position is the position that each matrix block [M2, M1] occupies in the matrix [L2, L1] when the matrix [L1, L2] is transposed into the matrix [L2, L1].
7. A transposition device for a large-scale two-dimensional matrix, characterized by comprising: a matrix division module, a small-matrix transposition module and a matrix synthesis module;
the matrix division module is used to divide the large two-dimensional matrix [L1, L2] to be transposed, located in DDR, into equally sized small matrices [M1, M2]; here N1 = L1/M1 and N2 = L2/M2, where N1 and N2 are positive integers greater than 0;
the small-matrix transposition module is used to read the small matrices [M1, M2] out of DDR block by block with the block-structured DMA, using the data access method based on on-chip cache of any one of claims 1 to 4, and to write them into the array-type on-chip cache in the column-wise write mode; the vector processor fetches data in parallel in the column-wise fetch mode, transposing each small matrix [M1, M2] into a matrix [M2, M1];
the matrix synthesis module is used to store the matrix [M2, M1] obtained from each transposed block at its correct position, assembling the matrix [L2, L1]; the correct position is the position that each matrix block [M2, M1] occupies in the matrix [L2, L1] when the matrix [L1, L2] is transposed into the matrix [L2, L1].
CN202211580278.1A 2022-12-09 2022-12-09 Data access method and device based on on-chip cache and transposition method and device Active CN116150055B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211580278.1A CN116150055B (en) Data access method and device based on on-chip cache and transposition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211580278.1A CN116150055B (en) Data access method and device based on on-chip cache and transposition method and device

Publications (2)

Publication Number Publication Date
CN116150055A (en) 2023-05-23
CN116150055B (en) 2023-12-29

Family

ID=86351643

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211580278.1A Active CN116150055B (en) Data access method and device based on on-chip cache and transposition method and device

Country Status (1)

Country Link
CN (1) CN116150055B (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103135096A (en) * 2013-01-11 2013-06-05 北京理工大学 Synthetic aperture radar imaging and processing transposition storage method and data access method
CN106933756A (en) * 2015-12-31 2017-07-07 北京国睿中数科技股份有限公司 For the quick transposition methods of DMA and device of variable matrix
CN110781447A (en) * 2019-10-19 2020-02-11 天津大学 DDR-based high-efficiency matrix transposition processing method
CN115185859A (en) * 2022-09-13 2022-10-14 北京天地一格科技有限公司 Radar signal processing system and low-delay matrix transposition processing device and method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on multi-mode SAR real-time imaging algorithms based on GPU; Zhai Xingang et al.; Electronic Measurement Technology (电子测量技术); Vol. 39, No. 10; pp. 81-86 *

Also Published As

Publication number Publication date
CN116150055A (en) 2023-05-23

Similar Documents

Publication Publication Date Title
US10210935B2 (en) Associative row decoder
US10153042B2 (en) In-memory computational device with bit line processors
US5546343A (en) Method and apparatus for a single instruction operating multiple processors on a memory chip
KR920001618B1 (en) Orthoginal transform processor
US20120163113A1 (en) Memory controller and memory controlling method
JP2010521728A (en) Circuit for data compression and processor using the same
WO2001090915A2 (en) Processor array and parallel data processing methods
KR100503094B1 (en) DSP having wide memory bandwidth and DSP memory mapping method
US6804771B1 (en) Processor with register file accessible by row column to achieve data array transposition
CN110674927A (en) Data recombination method for pulse array structure
US10552307B2 (en) Storing arrays of data in data processing systems
CN101776988B (en) Restructurable matrix register file with changeable block size
JP3320922B2 (en) Memory device
Hidalgo et al. Area-efficient architecture for fast Fourier transform
US6085304A (en) Interface for processing element array
CN116150055B (en) Data access method and device based on on-chip cache and transposition method and device
US5673214A (en) Discrete cosine transform processor
US20140089370A1 (en) Parallel bit reversal devices and methods
US8581918B2 (en) Method and system for efficiently organizing data in memory
CN111814675B (en) Convolutional neural network feature map assembly system supporting dynamic resolution based on FPGA
CN114330635A (en) Device and method for scaling and accelerating data of neural network
CN114072778A (en) Memory processing unit architecture
CN115995249B (en) Matrix transposition operation device based on DRAM
JPH07210545A (en) Parallel processing processors
US20240045922A1 (en) Zero padding for convolutional neural networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant