CN113448624A - Data access method, device and system and AI accelerator

Data access method, device and system and AI accelerator

Info

Publication number
CN113448624A
Authority
CN
China
Prior art keywords: data, read, block group, cache block, memory
Legal status: Granted
Application number
CN202110801631.3A
Other languages
Chinese (zh)
Other versions
CN113448624B (en)
Inventor
Liu Cong (刘聪)
Current Assignee
Anhui Lingsi Intelligent Technology Co ltd
Original Assignee
Anhui Lingsi Intelligent Technology Co ltd
Application filed by Anhui Lingsi Intelligent Technology Co ltd
Priority to CN202110801631.3A
Publication of CN113448624A
Application granted
Publication of CN113448624B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/30003 Arrangements for executing specific machine instructions
    • G06F 9/3004 Arrangements for executing specific machine instructions to perform operations on memory
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F 12/02 Addressing or allocation; Relocation
    • G06F 12/08 Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F 12/0802 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F 12/0877 Cache access modes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F 7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F 7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F 7/52 Multiplying; Dividing
    • G06F 7/523 Multiplying only
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

A data access method, device, and system, and an AI accelerator are provided. The method comprises: reading data to be calculated stored in a memory; writing the read data into a preset cache block group based on the number of multipliers in the processing array, so that the data at the same write address of each cache block group correspond to different read addresses in the memory and the data stored in each cache block group are completely different; and the processing array reading the data to be calculated from the preset cache block group for parallel calculation. With this scheme, the overall performance of the AI accelerator can be improved.

Description

Data access method, device and system and AI accelerator
Technical Field
The invention relates to the technical field of AI acceleration, in particular to a data access method, a device and a system thereof and an AI accelerator.
Background
Nowadays, Artificial Intelligence (AI) technology is increasingly applied in people's daily life, for example in face recognition, image segmentation, speech recognition, and speech synthesis. The development of AI technology is inseparable from advances in AI algorithms. Chips are the carrier on which AI algorithms are deployed, and increasingly complex AI algorithms place ever higher performance requirements on chips, especially with regard to computing power. System On Chip (SOC) chips based on a traditional Central Processing Unit (CPU) and Graphics Processing Unit (GPU) can hardly meet these algorithm requirements any longer, and heterogeneous SOC chips with an AI acceleration processor have become a main research direction.
To improve computing power, the main computing unit of an AI accelerator is usually a processing element (PE) array including a plurality of multipliers, which performs the various massively parallel computations required by the algorithms, such as the convolution operations and matrix operations in neural networks. The larger the number of multipliers, the higher the theoretical computing power. However, because of chip area and power consumption constraints, the number of multipliers cannot be increased without limit, so fully utilizing a limited number of multipliers becomes the key to improving chip performance.
However, the overall performance of conventional AI accelerators is poor and cannot meet users' requirements on the overall performance of the AI accelerator.
Disclosure of Invention
The problem addressed by the invention is how to improve the overall performance of an AI accelerator.
To solve the above problem, an embodiment of the present invention provides a data access method, where the method includes: reading data to be calculated stored in a memory; based on the number of multipliers in the processing array, writing the read data into a preset cache block group, so that the data of the same write address of each cache block group respectively correspond to different read addresses in the memory, and the data stored in each cache block group are completely different; and the processing array reads the data to be calculated from the preset cache block group for parallel calculation.
The embodiment of the invention also provides a data access device, which comprises: the device comprises a cache block group, a configuration unit, a reading control unit and a writing control unit; the reading control unit is suitable for reading the data to be calculated stored in the memory; the configuration unit is suitable for determining an address allocation mode, an interactive data bit width B and a basic data unit bit width U based on the number of the multipliers in the processing array, and inputting the address allocation mode, the interactive data bit width B and the basic data unit bit width U to the write-in control unit so as to control the write-in control unit to write the read data into a preset cache block group; the data of the same write-in address of each cache block group respectively corresponds to different read addresses in the memory through the address allocation mode, and the data stored in each cache block group are completely different; the write-in control unit is suitable for writing the read data into the preset cache block group; the processing array is suitable for reading the data to be calculated from the preset cache block group for parallel calculation.
The embodiment of the invention also provides an AI accelerator, which comprises the data access device.
The embodiment of the invention also provides a data access system, which comprises: the memory is used for storing data to be calculated; the AI accelerator is used for reading data to be calculated from the memory and storing the data in the preset cache block group; the AI accelerator further comprises a PE array; and the PE array is used for reading data from the preset cache block group for parallel computation.
Compared with the prior art, the technical scheme of the embodiment of the invention has the following advantages:
By applying the scheme of the invention, on the one hand, after the data to be calculated is read from the memory, it is written into the preset cache block group, so the format conversion can be completed in the cache without changing the data storage format in the memory. On the other hand, because the data at the same write address of each cache block group correspond to different read addresses in the memory, the data read from the memory in one clock cycle can be written into the cache in one clock cycle, and the PE array can read a two-dimensional data block from the same address of the cache in one clock cycle, which saves clock cycles and reduces the difficulty of data read control in the AI accelerator. In addition, the read data is written into the preset cache block group based on the number of multipliers in the processing element (PE) array, so the write pattern can be adjusted according to the number of multipliers, keeping the multipliers in the PE array as close to the ideal working state as possible, that is, every multiplier performs an effective operation in every clock cycle, thereby improving the overall performance of the AI accelerator.
Furthermore, based on the serial-parallel conversion configuration signal, the depth of the preset cache block group is adjusted, and then serial-parallel conversion of the cache block can be realized, so that idle cache blocks in the cache can be used for storing data, and the utilization rate of the cache is improved.
Drawings
FIG. 1 is a block diagram of a data access system according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating acceleration of matrix operations and convolution operations by parallel computation according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating a structure of a cache according to an embodiment of the present invention;
FIG. 4 is a flow chart of a data access method according to an embodiment of the present invention;
FIG. 5 is a diagram illustrating a data access process according to an embodiment of the present invention;
FIG. 6 is a diagram illustrating a serial arrangement of N cache blocks according to an embodiment of the present invention;
FIG. 7 is a diagram illustrating a serial-to-parallel conversion between two cache blocks according to an embodiment of the present invention;
FIG. 8 is a block diagram of a data access device according to an embodiment of the present invention;
FIG. 9 is a diagram illustrating a processing element (PE) array reading data from a cache according to an embodiment of the present invention.
Detailed Description
At present, a PE array is usually a synchronous sequential digital circuit driven by a single clock, and its ideal operating state is that all multipliers perform effective operations in every clock cycle, which requires that each multiplier be supplied with two valid input operands every clock cycle.
Generally, the input of parallel computation in the algorithm is a group of multidimensional data, and the data amount is far larger than the number of multipliers, so that the input data needs to be divided into a certain number of parts and distributed to each multiplier in the PE array according to a certain specific format and sequence.
The following two schemes are currently commonly used to assign data to each multiplier in the PE array:
the first scheme is as follows: the input data are well arranged in advance according to a format and a sequence required by parallel computing and are stored in a memory of a chip, so that an AI accelerator can conveniently read the data and distribute the data to each multiplier;
scheme II: the data are stored in the memory of the chip according to a normal sequence, and the AI accelerator reads the data according to a division format and a sequence required by parallel computation and redistributes the data to each multiplier.
In the first scheme, data is arranged in the memory in advance, on one hand, if the CPU is used for performing the operations, the CPU needs to undergo a large number of instruction fetch execution and data access operations, and the bit width of data processed by the CPU is usually small (32bit or 64bit), which consumes a large number of clock cycles and seriously affects the overall performance of algorithm implementation; on the other hand, data stored in the memory in an abnormal order is not friendly to software, and is not beneficial to the operation which cannot be realized by the AI accelerator performed by the software.
In the second scheme, the AI accelerator is required to read data in a specific manner, which increases the difficulty of data reading control in the AI accelerator on one hand, and on the other hand, for various parallel computations, it is almost impossible to read the input data required to be operated by each multiplier in the PE array in one clock cycle, thereby reducing performance.
Besides, the above two schemes also face the following problems: for different parallel computations and different parameters of the same parallel computation (such as the length and width of a matrix in matrix operation, the length and width of a feature image in convolution operation, the number of channels, and the like), different input data formats and sequences need to be adopted, otherwise, the utilization rate of multipliers in the PE array is affected, and thus the performance is affected.
In order to solve the problem, the invention provides a data access method, on one hand, after data to be calculated is read from a memory, the data to be calculated is written into a preset cache block group, the format conversion can be completed by utilizing cache without changing the data storage format in the memory, so that the clock period is saved, the software is more friendly, and the operation which cannot be realized by an AI accelerator can be performed on the memory by the software. On the other hand, the data of the same write address of each cache block group respectively corresponds to different read addresses in the memory, so that the PE array can read one two-dimensional block data from the same address in the cache in one clock cycle, and the difficulty in reading and controlling the AI accelerator data is reduced. In addition, the read data is written into the preset cache block group based on the number of the multipliers in the Processing (PE) array, so that the data writing mode can be adjusted according to the number of the multipliers, the multipliers in the PE array are as close to the ideal working state as possible, that is, each multiplier can perform effective operation in each clock cycle, and the overall performance of the AI accelerator is improved.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below.
Fig. 1 is a data access system according to an embodiment of the present invention, where the system may include: a memory 11 and an AI accelerator 12. The AI accelerator 12 may include a cache 121 and a PE array 122.
In a specific implementation, the input data of the memory 11 is usually a group of multidimensional data, and to ensure software friendliness, the input data of the memory 11 is stored according to a general multidimensional array raster scanning sequence.
Specifically, taking N-dimensional data { D0, D1, D2, …, DN } as an example, when it is stored in multidimensional-array raster-scan order, the data is stored along dimension D0 first, then along D1, and so on.
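As a rough illustration (not taken from the patent) of this raster-scan order, the following Python sketch maps a coordinate tuple to its linear offset in memory, with D0 as the fastest-varying dimension; the sizes are arbitrary toy values.

    def flat_index(coords, dims):
        """Map coordinates (d0, d1, ..., dN-1) to a linear memory offset,
        with d0 the fastest-varying dimension (stored first)."""
        offset, stride = 0, 1
        for c, size in zip(coords, dims):
            offset += c * stride
            stride *= size
        return offset

    dims = (4, 3, 2)                          # hypothetical sizes of D0, D1, D2
    assert flat_index((1, 0, 0), dims) == 1   # neighbours along D0 are adjacent in memory
    assert flat_index((0, 1, 0), dims) == 4   # the stride of D1 equals the length of D0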
In general, the input data stored in the memory 11 can be read as two-dimensional data with respect to any chosen dimension: the chosen dimension is the determinate dimension, and the remaining dimensions, taken together, form an equivalent dimension.
For example, referring to fig. 2(b), taking the convolution operation in a Convolutional Neural Network (CNN) as an example, the input feature image is usually three-dimensional data W × H × C, and can be regarded as two-dimensional data of (W × H) × C, where dimension C is the determinate dimension and dimension (W × H) is the equivalent dimension. The input feature image can also be regarded as two-dimensional data of (W × C) × H, where dimension H is the determinate dimension and dimension (W × C) is the equivalent dimension.
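As a small illustration (an assumption for exposition, not taken from the patent), both views can be obtained without moving any data when the image is stored in raster-scan order with C varying fastest; numpy and the toy sizes below are used only for the example.

    import numpy as np

    W, H, C = 4, 3, 2
    # Raster-scan storage: C varies fastest, then W, then H (one possible order).
    feature = np.arange(W * H * C).reshape(H, W, C)

    view_c = feature.reshape(H * W, C)   # (W x H) x C view: C is the determinate dimension
    view_h = feature.reshape(H, W * C)   # (W x C) x H view: H is the determinate dimension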
The data used by the PE array 122 for parallel computation is typically not one-dimensional data but a two-dimensional data block, i.e., data spanning two dimensions.
For example, as shown in fig. 2(a), when a matrix operation is performed on the left matrix and the right matrix, the left matrix is divided into a plurality of small matrices like matrix 1, the right matrix is divided into a plurality of small matrices like matrix 2, and the operation is finally carried out on these small matrices of the left and right matrices.
Similarly, when the convolution operation in fig. 2(b) is performed between the input feature image and a plurality of convolution kernels (e.g., K0, K1, K2, K3, …), the input feature image can be divided into two-dimensional data blocks for the operation, for example by splitting it along the two dimensions C and W before performing the convolution.
However, due to the storage manner of the data in the memory 11, only one-dimensional data can be read from the memory 11 in one clock cycle, and the requirement of parallel computation by the PE array cannot be met, so that format conversion needs to be performed once, the converted data is stored in the cache 121, and it is ensured that the PE array can read one two-dimensional block data from the cache 121 in one clock cycle.
Fig. 3 is a schematic structural diagram of the cache. Referring to fig. 3, in a specific implementation, the cache may be composed of a plurality of Static Random-Access Memory (SRAM) blocks (cache blocks for short) on the SOC chip, where each cache block has independent control lines, data lines, and address lines, meaning that, in one clock cycle, each cache block can be read or written at its own address independently of the others. Assuming that the data bit width of each cache block is band, its depth is depth, and there are N blocks, the data bit width of the whole cache is N × band.
In practical applications, the size of each cache block may be the same or different. For convenience of description, it is assumed that the size of each cache block is the same, i.e. the data bit width and depth of each cache block are the same.
The data storage method provided in the embodiment of the present invention is described in detail below:
referring to fig. 4, the method may include the steps of:
and step 41, reading the data to be calculated stored in the memory.
In one embodiment, the data input into the memory is generally a set of multidimensional data, and the data read from the memory each time is one-dimensional data.
Step 42, based on the number of multipliers in the Processing (PE) array, writing the read data into a preset cache block group, so that the data at the same write address of each cache block group respectively corresponds to different read addresses in the memory, and the data stored in each cache block group is completely different.
Wherein the number of the cache block groups is at least one; the processing element (PE) array is adapted to read the data to be calculated from the preset cache block group for parallel calculation.
In a specific implementation, the number of cache blocks included in each cache block group may be one, or may be two or more, and is not limited specifically. The addresses of the cache blocks within the same cache block group are consecutive.
In an embodiment of the present invention, to simplify the write complexity, the size of each cache block group may be the same, that is, the depth and the data bit width of each cache block group are the same. For example, referring to fig. 3, each cache block group may be composed of 2 cache blocks in the cache, the depth of the cache block group is 2 × depth, and the data bit width of the cache block group is still band.
The read data is written into the preset cache block group in such a way that the data at the same write address of each cache block group correspond to different read addresses in the memory. Thus the data read from the memory in one clock cycle can be written into the cache in one clock cycle, and the PE array can read a two-dimensional data block from the same address of the cache in one clock cycle, meeting the requirement of the PE array for parallel computation. Moreover, because the read data is written into the preset cache block group based on the number of multipliers in the processing element (PE) array, the way the data is written can be adjusted according to the number of multipliers, keeping the multipliers in the PE array as close to the ideal working state as possible, that is, every multiplier performs an effective operation in every clock cycle, thereby improving the overall performance of the AI accelerator.
In a specific implementation, based on the number of multipliers in a Processing (PE) array, the read data may be written into a preset cache block group by using a plurality of methods, which is not limited specifically, as long as the data at the same write address of each cache block group respectively corresponds to different read addresses in the memory, and the data stored in each cache block group is completely different.
In a specific implementation, each multiplication operation performed by a multiplier operates on a data block, and the size of the data block can be set according to the width of a row of data in the memory.
In an embodiment of the present invention, an interactive data bit width B and a basic data unit bit width U may be determined based on the number of multipliers in the processing element (PE) array, and the one-dimensional data read from the memory each time is then stored using the interactive data bit width B and the basic data unit bit width U, until the data to be calculated stored in the memory has been completely written into the preset cache block group.
A basic data unit is the block of data read from the memory as a unit, and the basic data unit bit width U is the width of that block. The width of a row of data in the memory is an integral multiple of the basic data unit bit width U, so that the data blocks read from the memory are all the same size. The interactive data bit width B is the total width of the data read from the memory in one read, and is usually an integral multiple of the basic data unit bit width U.
Specifically, when the interactive data bit width B and the basic data unit bit width U are used to store the one-dimensional data read from the memory each time, one-dimensional data is read from the data to be calculated stored in the memory according to the interactive data bit width B; the read one-dimensional data is divided into a plurality of basic data units according to the preset basic data unit bit width U; and the plurality of basic data units are written into different write addresses of the preset cache block group. The interactive data bit width B is an integral multiple of the basic data unit bit width U.
It can be understood that, referring to fig. 3, the interactive data bit width B should be less than or equal to the total data bit width N × band of the cache, so that the one-dimensional data of width B can be written into different write addresses of the preset cache block group.
Referring to fig. 5, the data stored in the memory is shown in fig. 5(a). For convenience of description, it is assumed that the data to be calculated stored in the memory comprises a plurality of rows, each row being divided into W basic data units of width U. The numbers 0, 1, 2, …, 4W-1 are the identifiers of the basic data units. For example, the first row of data consists of basic data units 0 through W-1.
Assuming that B is 4U, the one-dimensional data read from the memory for the first time can be equally divided into 4 basic data units of bit width U, namely basic data units 0, 1, 2, and 3. Basic data units 0 to 3 are written into different addresses of cache block groups 0 to 3, respectively.
In an embodiment of the present invention, the write address interval between adjacent basic data units is A. As shown in fig. 5(b), when basic data unit 0 is written into address addr_0 of cache block group 0, basic data unit 1 is written into address addr_A of cache block group 1, basic data unit 2 into address addr_2A of cache block group 2, and basic data unit 3 into address addr_3A of cache block group 3.
In a specific implementation, when one-dimensional data is read from the memory across a dimension for the m-th time, in order for the first basic data unit of this cross-dimension read to land at the same write address as the first basic data unit of the previous read but in a different cache block group, the basic data units of the read data can be cyclically shifted by m basic data unit lengths and then written into different write addresses of the preset cache block group, where m ≥ 1 and m is an integer. The shift may be, but is not limited to, a cyclic right shift; cyclic shifts in other directions are also possible.
With reference to fig. 5(a), a cross-dimension read means that two adjacent reads do not fetch data of the same dimension. For example, if the first read fetches the data of the first row in the memory and the second read fetches the data of the second row, the second read is a cross-dimension read relative to the first.
Referring to fig. 5(c), when one-dimensional data is read from the memory across a dimension for the first time, the corresponding four basic data units are W, W+1, W+2, and W+3. After a cyclic right shift by 1 basic data unit length, they are written into different write addresses of the preset cache block group: basic data unit W is written into address addr_0 of cache block group 1, basic data unit W+1 into address addr_A of cache block group 2, basic data unit W+2 into address addr_2A of cache block group 3, and basic data unit W+3 into address addr_3A of cache block group 0.
As shown in fig. 5(d), when one-dimensional data is read across a dimension for the second time, the corresponding four basic data units are 2W, 2W+1, 2W+2, and 2W+3. After a cyclic right shift by 2 basic data unit lengths, basic data unit 2W is written into address addr_0 of cache block group 2, basic data unit 2W+1 into address addr_A of cache block group 3, basic data unit 2W+2 into address addr_2A of cache block group 0, and basic data unit 2W+3 into address addr_3A of cache block group 1.
As shown in fig. 5(e), when one-dimensional data is read across a dimension for the third time, the corresponding four basic data units are 3W, 3W+1, 3W+2, and 3W+3. After a cyclic right shift by 3 basic data unit lengths, basic data unit 3W is written into address addr_0 of cache block group 3, basic data unit 3W+1 into address addr_A of cache block group 0, basic data unit 3W+2 into address addr_2A of cache block group 1, and basic data unit 3W+3 into address addr_3A of cache block group 2.
The final data layout in the cache is shown in fig. 5(f): at different write addresses, data from the same read address rotate regularly across the cache block groups because of the shift, and at the same write address, data from different read addresses are interleaved together. Therefore, when the PE array performs a computation, it only needs to read the data at the same write address from all the cache block groups to obtain a two-dimensional data block. This rotation interleaving meets the algorithm requirements while guaranteeing that the transfer of input data between the memory and the cache and between the cache and the PE array is completed within a single clock cycle.
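The write pattern described above can be summarized with a short behavioural model. The sketch below is a minimal Python illustration under assumed toy values for W, A and the number of cache block groups (none of these values come from the patent's figures): each basic data unit keeps the write address given by its position within the read, while the cyclic shift moves it into a different cache block group on each cross-dimension read, producing the rotation interleaving of fig. 5(f).

    NUM_GROUPS = 4          # B / U: basic data units per one-dimensional read
    W = 8                   # basic data units per memory row (toy value)
    A = 2                   # write-address interval between adjacent basic data units
    DEPTH = NUM_GROUPS * A  # enough cache depth for this toy example

    # cache[group][write_address] -> identifier of the basic data unit stored there
    cache = [[None] * DEPTH for _ in range(NUM_GROUPS)]

    def write_read_result(m, units):
        """Write one one-dimensional read into the cache block groups.
        m = 0 models the very first read; m >= 1 models the m-th cross-dimension read.
        Unit j keeps write address j * A but lands in cache block group
        (j + m) % NUM_GROUPS, so data from different memory read addresses
        never collide on the same (group, address) pair."""
        for j, unit in enumerate(units):
            cache[(j + m) % NUM_GROUPS][j * A] = unit

    # Four consecutive memory rows, identified by basic-data-unit index.
    for m in range(NUM_GROUPS):
        write_read_result(m, [m * W + j for j in range(NUM_GROUPS)])

    # The PE array obtains a two-dimensional block by reading the same write
    # address from every group in a single clock cycle.
    for addr in range(0, NUM_GROUPS * A, A):
        print(addr, [cache[g][addr] for g in range(NUM_GROUPS)])
    # addr 0 -> [0, 8, 16, 24]  : one unit from each of the four memory rows
    # addr 2 -> [25, 1, 9, 17]  : the same rows again, rotated by one group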
In a specific implementation, the amount of data to be calculated may be large, and the AI accelerator needs to read from the memory many times. For example, in the embodiment shown in fig. 5, after the third cross-dimension read there may be a 4th cross-dimension read, a 5th cross-dimension read, and so on.
In order to ensure that the one-dimensional data read each time can be written into the cache according to the method of the present invention, in an embodiment of the present invention the write address interval A between adjacent basic data units may be set to L1/(B/U), where L1 denotes the length of the equivalent dimension, i.e. the combined length of the dimensions other than the dimension along which data is read from the memory.
For example, referring to fig. 5(a), assume that L1 is 32 and A is 8, so that write address addr_0 and write address addr_A are not consecutive. After every B/U cross-dimension reads, the data at a given cache address have been fully written, so the write address of the first basic data unit of the (k × B/U)-th cross-dimension read can be made consecutive with the write address of the first basic data unit of the ((k-1) × B/U)-th cross-dimension read, which makes it easy for the PE array to distinguish data belonging to different read addresses in the memory, where k ≥ 1 and k is an integer.
Specifically, the write address of the first basic data unit of the (k × B/U)-th cross-dimension read may be set to the write address of the first basic data unit of the ((k-1) × B/U)-th cross-dimension read plus 1.
For example, referring to fig. 5(a), the fourth cross-dimension read corresponds to a first basic data unit of 4W, and its write address may be addr_1; the eighth cross-dimension read corresponds to a first basic data unit of 8W, and its write address may be addr_2. In this way, the PE array can read several consecutive cache addresses to obtain the corresponding data blocks.
In another embodiment of the present invention, A may also be set to a fixed value, for example A = 1. In that case, when m = k × B/U, the write address of the first basic data unit of the read data is L2/U, where L2 denotes the length of the dimension along which the data is read from the memory. Referring to fig. 5(f), the fourth cross-dimension read corresponds to a first basic data unit of 4W, and its write address may be addr_W.
It should be understood that, once the write address of the first basic data unit of the (k × B/U)-th cross-dimension read is determined, for the (k × B/U + 1)-th through the ((k+1) × B/U - 1)-th cross-dimension reads, each basic data unit is written to the same write address as the corresponding basic data unit of the read with m = k × B/U; the description of fig. 5(c) to 5(d) can be referred to for the implementation, which is not repeated here.
In a specific implementation, the interactive data bit width B and the basic data unit bit width U can be flexibly adjusted according to the number of multipliers in the PE array.
In particular, during a parallel computation, it may be necessary to read data from the buffer multiple times and perform the corresponding multiplication.
The interactive data bit width B can first be determined from the number of multipliers in the PE array. Since the interactive data bit width B is an integral multiple of the basic data unit bit width U, once B is determined, several selectable values of the basic data unit bit width U are obtained. One of these values is then chosen as the basic data unit bit width U such that, throughout the parallel computation, the multipliers are essentially never left idle during any multiplication operation, thereby maximizing the utilization of the multipliers during the parallel computation.
For example, in the matrix multiplication, it is assumed that the number of multipliers is 64, which means that the PE array can perform 64 multiplication operations in parallel in one clock cycle. The left and right matrices of the multiplication can be divided into 4 × 4 small data blocks, and the multiplication of two 4 × 4 small matrices is performed each time, i.e. each small block in fig. 2(a) is 4 × 4 in size, which corresponds to exactly 64 multiplications. Finally, the result of multiplying two large matrixes can be obtained by matching with an accumulator. At this time, the interactive data bit width B is 16.
Accordingly, the selectable values of the basic data unit bit width U are 2, 4, and 8. Assuming that the matrix operation is performed on 12 × 12 matrices, when U is 2 or 8, some of the 64 multipliers are left idle while reading and multiplying the small blocks at the edges of the left or right matrix. If U is 4, every multiplier is busy on every read and the utilization is 100%, so U = 4 can be chosen. Thus, for matrices of different sizes, the utilization of the multipliers can be improved as much as possible by configuring B and U appropriately.
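As a rough sketch of this configuration step, the candidate values of U for the 64-multiplier, 12 × 12 example can be compared as follows; the tiling assumption that each read covers a U × (B/U) tile, and the utilization formula itself, are simplifications introduced here for illustration rather than the patent's own procedure.

    import math

    def utilization(rows, cols, u, b=16):
        """Fraction of multiplier work that is useful when the matrix is tiled
        into blocks of u x (b // u) elements and edge blocks must be padded."""
        block_h, block_w = u, b // u
        padded_rows = math.ceil(rows / block_h) * block_h
        padded_cols = math.ceil(cols / block_w) * block_w
        return (rows * cols) / (padded_rows * padded_cols)

    for u in (2, 4, 8):                 # candidate U values for B = 16
        print(u, utilization(12, 12, u))
    # U = 2 or U = 8 pad one side of the 12 x 12 matrix up to 16 (utilization 0.75);
    # U = 4 tiles it exactly (utilization 1.0), matching the choice made in the text.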
At this point, the data is stored exactly as shown in fig. 5(f). As can be seen from fig. 5(f), the data at the same write address form exactly one 4 × 4 data block. Taking the 4 × 4 data block as the operation unit and denoting the row and column coordinates of the data block within the large matrix by an index (x, y), the corresponding cache address equals A × x + y.
In one embodiment of the present invention, as shown in fig. 2(a), it is assumed that the left matrix needs to be divided into m × n small data blocks and the right matrix into n × l small data blocks, where m × n × l equals the number of multipliers. Then, for the storage of the left matrix, B = m × n and U = m are configured; for the storage of the right matrix, B = n × l and U = n are configured, and the requirement can be met.
Therefore, in each clock cycle the PE array can read the corresponding 4 × 4 data blocks from the same address of each cache block group and operate on them, ensuring the continuity of data reading and computation.
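A hypothetical helper (an illustration, not the patent's interface) makes the A × x + y addressing concrete: the block at block coordinates (x, y) is obtained by reading one address from every cache block group, as the PE array would do in one clock cycle. It reuses the toy cache model sketched earlier.

    def read_block(cache, x, y, A, num_groups=4):
        addr = A * x + y                                     # cache address of block (x, y)
        return [cache[g][addr] for g in range(num_groups)]   # one basic data unit per group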
In other words, by adopting the scheme of the invention, B and U are reasonably set according to the number of the multipliers, and the utilization rate of the multipliers can be effectively improved.
In a specific implementation, if the interactive data bit width B is configured to a very small value as required by the parallel computation, then with the data storage method of the embodiment of the present invention a large number of cache blocks are left idle, which reduces the utilization of the cache. For example, if B = band, only cache block 1 in fig. 3 is used to store data, and cache blocks 2 through N are left empty.
To avoid this and raise the utilization of the cache, in a specific implementation the originally parallel cache blocks may be converted into serially arranged cache blocks before the read data is written into the preset cache block group, so that no cache block needs to be left idle and the utilization of the cache blocks is improved. In this case, each cache block group consists of two or more cache blocks. The cache blocks within the same cache block group are arranged in series, i.e. their addresses are consecutive.
In an embodiment of the invention, before the read data is written into the preset cache block group, a serial-parallel conversion configuration signal may be received, and the depth of the preset cache block group may then be adjusted based on this signal. The serial-parallel conversion configuration signal is related to the interactive data bit width B.
Because the serial-parallel conversion configuration signal is related to the interactive data bit width B, the depth of the cache block groups used to store the data to be calculated can be adjusted flexibly for different values of B, so that idle cache blocks in the cache are used to store the data to be calculated as far as possible, improving the utilization of the cache blocks.
For example, for the cache structure shown in fig. 3, if the interactive data bit width B equals band, the N cache blocks may be configured in a serial structure, as shown in fig. 6, so that the N cache blocks form one cache block group. When the depth of each cache block is depth, the depth of the cache block group is N × depth. In this case, no cache block is left idle.
In the specific implementation, each cache block is provided with an independent control line, a data line and an address line, and the serial-parallel conversion of the cache blocks can be simply realized by using the high order bits of the address to perform chip selection.
In an embodiment of the present invention, when the number of cache blocks available for storing the data to be calculated is greater than or equal to M times the interactive data bit width B, M cache blocks are taken as one cache block group, where M ≥ 2 and M is an integer.
Taking two cache blocks, cache block 1 and cache block 2, as an example, assume that the data bit width of cache block 1 and cache block 2 is W and their depth is 2^H, so that the addresses of cache block 1 and cache block 2 are identified by H binary bits. Both cache block 1 and cache block 2 have a control line port cs, a data line port wdata, and an address line port addr.
Here addr[H-1:0] denotes the value on the address line port addr, represented by H binary bits; data[W-1:0] and data[2W-1:W] denote the values on the corresponding data line ports.
In the embodiment of the present invention, the high-order address bit addr[H] and the chip select signal cs are combined by logic operations, and the results are input to selectors 71 and 72. data[2W-1:W] and data[W-1:0] are input to the data line port of cache block 2 via selector 73. The serial-parallel conversion configuration signal select controls the outputs of selectors 71 to 73.
Specifically, when the serial-parallel conversion configuration signal select is 0, selectors 71 to 73 all pass the signal on their "0" input to the corresponding port. Selectors 71 and 72 select the original chip select signal cs and feed it to the control line ports of cache block 1 and cache block 2, so the two blocks can be selected simultaneously as cache blocks for storing the data to be calculated; together they form a cache with a bit width of 2W and a depth of 2^H. Accordingly, data[W-1:0] is written to the data line port of cache block 1 and data[2W-1:W] is written to the data line port of cache block 2.
When the serial-parallel conversion configuration signal select is 1, selectors 71 to 73 all pass the signal on their "1" input to the corresponding port. Selectors 71 and 72 select, as the chip select signal, the result of the logic operation between the high-order address bit and the original chip select signal cs, and feed it to the control line ports of cache block 1 and cache block 2, so that at any time only one of the two blocks is selected for storing the data to be calculated. The two cache blocks then form a cache with a bit width of W and a depth of 2^(H+1). Accordingly, data[W-1:0] is written to the data line ports of both cache block 1 and cache block 2.
In a specific implementation, the devices performing the logic operations include AND gate 74, AND gate 75, and inverter 76. The high-order address bit and the original chip select signal cs are input to AND gate 74, and the result is input to the "1" input of selector 71. The high-order address bit is inverted by inverter 76 and then input, together with the original chip select signal cs, to AND gate 75, whose result is input to the "1" input of selector 72.
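A small behavioural model may make the two modes easier to follow. The Python sketch below mirrors the selector behaviour described above; it is not the patent's RTL, and the function name and modelling style are assumptions.

    def drive_blocks(select, cs, addr, data, H, W):
        """Return (cs1, cs2, wdata1, wdata2) for cache block 1 and cache block 2.

        select = 0: both blocks selected together -> one 2W-wide, 2**H-deep cache.
        select = 1: address bit addr[H] picks a block -> one W-wide, 2**(H+1)-deep cache.
        """
        addr_high = (addr >> H) & 1                  # addr[H], the high-order address bit
        data_low = data & ((1 << W) - 1)             # data[W-1:0]
        data_high = (data >> W) & ((1 << W) - 1)     # data[2W-1:W]

        if select == 0:
            # Parallel mode: the original cs drives both blocks; block 2 stores the high half.
            return cs, cs, data_low, data_high
        # Serial mode: AND gate 74 selects block 1 when addr[H] = 1, while inverter 76
        # and AND gate 75 select block 2 when addr[H] = 0; selector 73 routes
        # data[W-1:0] to both data line ports.
        return cs & addr_high, cs & (1 - addr_high), data_low, data_low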
It is understood that, in an implementation, the depth of the preset cache block group may also be adjusted by other methods, which are not limited herein.
By adopting the data access method of the embodiment of the invention, the format conversion is completed with a single level of cache, without changing the data storage format in the memory. At the same time, the rotation interleaving of the data meets the algorithm requirements while guaranteeing that the transfer of input data between the memory and the cache and between the cache and the PE array is completed within a single clock cycle. In addition, different parameters can be configured flexibly for different parallel computations to improve the utilization of the multipliers. The method of the embodiment of the invention keeps the multipliers in the PE array as close to the ideal working state as possible while remaining friendly to the software layer, that is, every multiplier performs an effective operation in every clock cycle, thereby improving the overall performance of the AI accelerator.
In order to help those skilled in the art better understand and implement the present invention, the data access device, the AI accelerator, and the data access system corresponding to the above method are described in detail below.
Referring to fig. 8, an embodiment of the present invention provides a data access apparatus 80, where the apparatus 80 may include: a buffer block group 81, a configuration unit 82, a read control unit 83, and a write control unit 84. Wherein:
the cache block group 81 includes at least one cache block, and the number of the cache block groups is at least one;
a read control unit 83 adapted to read data to be calculated stored in the memory;
a configuration unit 82, adapted to determine an address allocation manner, an interactive data bit width B, and a basic data unit bit width U based on the number of multipliers in the processing array, and input the determined address allocation manner, interactive data bit width B, and basic data unit bit width U to the write control unit 84, so as to control the write control unit 84 to write the read data into the preset buffer block group 81; the data of the same write-in address of each cache block group respectively corresponds to different read addresses in the memory through the address allocation mode, and the data stored in each cache block group are completely different;
a write control unit 84, adapted to write the read data into a preset buffer block group, wherein the processing array is adapted to read the data to be calculated from the preset buffer block group for parallel calculation.
In an embodiment of the invention, the configuration unit 82 is further adapted to generate a serial-parallel conversion configuration signal and input it to the preset cache block group 81 before the read data is written into the preset cache block group 81, so as to adjust the depth of the preset cache block group.
And the bit width B of the interactive data is an integral multiple of the bit width U of the basic data unit.
In a specific implementation, the configuration unit 82 may further provide a data reading manner to the read control unit 83, so as to control the read control unit 83 to read data from the memory in the provided manner.
Taking the input feature image in fig. 2 as an example, if three-dimensional data of W × H × C is stored in the memory, then with the data access method and device of the embodiment of the present invention the reading manner can be controlled so that the data is read as two-dimensional data of (W × H) × C, and the PE array reads the data block shown as the shaded portion in fig. 9(a) in every clock cycle. The data can also be read as two-dimensional data of (W × C) × H, in which case the PE array reads the data block shown as the shaded portion in fig. 9(b) in every clock cycle, thereby accelerating the convolution operation.
The invention also provides an AI accelerator, which is characterized by comprising the data access device 80.
The present invention also provides a data access system, which may include, with reference to fig. 1: a memory 11 and the AI accelerator 12 described above. Wherein:
the memory 11 is used for storing data to be calculated;
the AI accelerator 12 is configured to read data to be calculated from the memory 11 and store the data in a preset cache block group;
the AI accelerator 12 also includes a PE array 122; the PE array 122 is used for reading data from the preset buffer block group for parallel computation.
The data access device 80, the AI accelerator and the data access system can be specifically implemented with reference to the above description of the data access method, and are not described herein again.
Although the present invention is disclosed above, the present invention is not limited thereto. Various changes and modifications may be effected therein by one skilled in the art without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (13)

1. A method for accessing data, comprising:
reading data to be calculated stored in a memory;
based on the number of multipliers in the processing array, writing the read data into a preset cache block group, so that the data of the same write address of each cache block group respectively correspond to different read addresses in the memory, and the data stored in each cache block group are completely different;
and the processing array reads the data to be calculated from the preset cache block group for parallel calculation.
2. The data access method of claim 1, wherein writing the read data to a predetermined set of cache blocks based on the number of multipliers in the processing array comprises:
determining an interactive data bit width B and a basic data unit bit width U based on the number of multipliers in the processing array;
storing the one-dimensional data read from the memory each time by adopting the following method until the data to be calculated stored in the memory is completely written into the preset cache block group:
reading one-dimensional data from the data to be calculated stored in the memory according to the interactive data bit width B; dividing the read one-dimensional data into a plurality of basic data units according to a preset basic data unit bit width U; writing the plurality of basic data units into different write addresses of a preset cache block group;
and the bit width B of the interactive data is an integral multiple of the bit width U of the basic data unit.
3. The data accessing method of claim 2, wherein the write address interval A between adjacent basic data units is L1/(B/U), or A = 1, where L1 denotes the length of the equivalent dimension other than the dimension along which the data is read from the memory.
4. The data accessing method of claim 2, wherein writing the plurality of basic data units into different write addresses of the preset cache block group comprises:
when the one-dimensional data is read from the memory across a dimension for the m-th time, cyclically shifting the plurality of basic data units corresponding to the read data by m basic data unit lengths and then writing them into different write addresses of the preset cache block group, where m ≥ 1 and m is an integer.
5. The data accessing method of claim 4, wherein writing the plurality of basic data units into different write addresses of the preset cache block group further comprises:
when m is k × B/U, increasing the write address of the first basic data unit in the read data by 1 relative to the write address of the first basic data unit in the read data when m is (k-1) × B/U, where k ≥ 1 and k is an integer;
or, when m is k × B/U, setting the write address of the first basic data unit in the read data to L2/U, where L2 denotes the length of the dimension along which the read data is located in the memory.
6. The data accessing method of claim 4, wherein writing the plurality of basic data units into different write addresses of the preset cache block group further comprises:
when m is from k × B/U + 1 to (k+1) × B/U - 1, the write address of each basic data unit in the read data is the same as the write address of the corresponding basic data unit in the read data when m is k × B/U.
7. The data accessing method according to any one of claims 2 to 6, wherein the cache block group consists of one cache block.
8. The data accessing method according to any one of claims 2 to 6, further comprising, before writing the read data into the preset cache block group:
receiving a serial-parallel conversion configuration signal;
adjusting the depth of the preset cache block group based on the serial-parallel conversion configuration signal;
wherein the serial-parallel conversion configuration signal is related to the interactive data bit width B.
9. The data accessing method according to claim 8, wherein said adjusting the depth of the preset cache block group based on the serial-parallel conversion configuration signal comprises:
when the number of cache blocks available for storing the data to be calculated is greater than or equal to M times the interactive data bit width B, taking M cache blocks as one cache block group, where M ≥ 2 and M is an integer.
10. A data access device, comprising: a cache block group, a configuration unit, a read control unit and a write control unit, wherein:
the reading control unit is suitable for reading the data to be calculated stored in the memory;
the configuration unit is suitable for determining an address allocation mode, an interactive data bit width B and a basic data unit bit width U based on the number of the multipliers in the processing array, and inputting the address allocation mode, the interactive data bit width B and the basic data unit bit width U to the write-in control unit so as to control the write-in control unit to write the read data into a preset cache block group; the data of the same write-in address of each cache block group respectively corresponds to different read addresses in the memory through the address allocation mode, and the data stored in each cache block group are completely different;
the write-in control unit is suitable for writing the read data into the preset cache block group;
the processing array is suitable for reading the data to be calculated from the preset cache block group for parallel calculation.
11. The data access device according to claim 10, wherein the configuration unit is further adapted to generate a serial-parallel conversion configuration signal and input it to the preset cache block group before the read data is written into the preset cache block group, so as to adjust the depth of the preset cache block group;
wherein the serial-parallel conversion configuration signal is related to the interactive data bit width B.
12. An AI accelerator, characterized in that it comprises the data access device according to claim 10 or 11.
13. A data access system, comprising:
the memory is used for storing data to be calculated;
the AI accelerator of claim 12, configured to read data to be computed from a memory and store the data in a preset cache block set;
the AI accelerator further comprises a PE array; and the PE array is used for reading data from the preset cache block group for parallel computation.
Application CN202110801631.3A, priority date 2021-07-15, filing date 2021-07-15: Data access method, device, system and AI accelerator (status: Active; granted as CN113448624B)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110801631.3A CN113448624B (en) 2021-07-15 2021-07-15 Data access method, device, system and AI accelerator

Publications (2)

Publication Number Publication Date
CN113448624A true CN113448624A (en) 2021-09-28
CN113448624B CN113448624B (en) 2023-06-27

Family

ID=77816280

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110801631.3A Active CN113448624B (en) 2021-07-15 2021-07-15 Data access method, device, system and AI accelerator

Country Status (1)

Country Link
CN (1) CN113448624B (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1916884A (en) * 2005-08-19 2007-02-21 骏亿电子股份有限公司 Integration type data processor
CN103164363A (en) * 2011-12-16 2013-06-19 中兴通讯股份有限公司 Data processing method and data processing device
US20140359219A1 (en) * 2013-05-31 2014-12-04 Altera Corporation Cache Memory Controller for Accelerated Data Transfer
CN110073329A (en) * 2016-12-16 2019-07-30 华为技术有限公司 Memory access device, computing device, and device applied to convolutional neural network operations
CN108229645A (en) * 2017-04-28 2018-06-29 北京市商汤科技开发有限公司 Convolution accelerates and computation processing method, device, electronic equipment and storage medium
CN110309912A (en) * 2018-03-27 2019-10-08 北京深鉴智能科技有限公司 Data access method and device, hardware accelerator, computing device, storage medium
CN110892373A (en) * 2018-07-24 2020-03-17 深圳市大疆创新科技有限公司 Data access method, processor, computer system and removable device
CN109886400A (en) * 2019-02-19 2019-06-14 合肥工业大学 The convolutional neural networks hardware accelerator system and its calculation method split based on convolution kernel
CN110688616A (en) * 2019-08-26 2020-01-14 陈小柏 Strip array convolution module based on ping-pong RAM and operation method thereof
CN112328522A (en) * 2020-11-26 2021-02-05 北京润科通用技术有限公司 Data processing method and device
CN112506567A (en) * 2020-11-27 2021-03-16 海光信息技术股份有限公司 Data reading method and data reading circuit

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王阳 (Wang Yang) et al.: "基于脉动阵列的矩阵乘法器硬件加速技术研究" [Research on hardware acceleration technology for matrix multipliers based on systolic arrays], vol. 32, no. 11, pages 120-124 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113886286A (en) * 2021-12-02 2022-01-04 南京芯驰半导体科技有限公司 Two-dimensional structure compatible data reading and writing system and method
CN113886286B (en) * 2021-12-02 2022-03-01 南京芯驰半导体科技有限公司 Two-dimensional structure compatible data reading and writing system and method

Also Published As

Publication number Publication date
CN113448624B (en) 2023-06-27

Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CB03 Change of inventor or designer information
Inventor after: Liu Cong
Inventor after: Wang Zhiguo
Inventor after: Shao Zhiyong
Inventor after: Liu Wei
Inventor before: Liu Cong