CN114840470A - Dimension transformation device friendly to on-chip cache and neural network processor - Google Patents


Info

Publication number
CN114840470A
CN114840470A (application CN202210335890.6A)
Authority
CN
China
Prior art keywords
data
dimension
output
input
control module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210335890.6A
Other languages
Chinese (zh)
Inventor
谢耀
李智
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Denglin Technology Co ltd
Chengdu Denglin Technology Co ltd
Original Assignee
Shanghai Denglin Technology Co ltd
Chengdu Denglin Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Denglin Technology Co ltd, Chengdu Denglin Technology Co ltd filed Critical Shanghai Denglin Technology Co ltd
Priority claimed from CN202210335890.6A
Publication of CN114840470A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/78Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G06F15/781On-chip cache; Off-chip memory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06MCOUNTING MECHANISMS; COUNTING OF OBJECTS NOT OTHERWISE PROVIDED FOR
    • G06M1/00Design features of general application
    • G06M1/27Design features of general application for representing the result of count in the form of electric signals, e.g. by sensing markings on the counter drum
    • G06M1/272Design features of general application for representing the result of count in the form of electric signals, e.g. by sensing markings on the counter drum using photoelectric means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The application provides an on-chip-cache-friendly dimension transformation device comprising a control module, a data cache module composed of a plurality of memory blocks, a write control module, and a read control module. The control module receives a data-transfer instruction to be processed and acquires the configuration information associated with it. In response to a mismatch between the input and output data dimensions in that configuration information, it generates the corresponding input addresses and reads the corresponding data from an external storage unit in order from the lowest to the highest dimension of the output data. The write control module writes the input data from the external storage unit into the data cache module, and the read control module reads data from the data cache module for output. The device completes data dimension transformation without reducing on-chip cache access efficiency.

Description

Dimension transformation device friendly to on-chip cache and neural network processor
Technical Field
The present application relates to data processing technology in a neural network, and in particular, to a data dimension transformation device suitable for a neural network processor.
Background
The statements in this section merely provide background information related to the present disclosure and may not constitute prior art for the purposes of describing the present disclosure.
Artificial Intelligence (AI) technology has developed rapidly in recent years and has penetrated fields such as visual perception, speech recognition, driver assistance, smart home, and traffic scheduling. Many AI algorithms involve neural-network-based learning and computation, for example convolutional neural networks (CNNs), recurrent neural networks (RNNs), and deep neural networks (DNNs). These algorithms often involve multi-layer networks or learning models composed of several networks and require strong parallel computing power to process massive data, so processors supporting multi-core parallel computing, such as GPUs, GPGPUs, and AI acceleration chips, are generally employed to perform the relevant neural network operations. These processors often need to provide the same batch of data to different layers of a neural network or to different neural networks, and the format and dimensionality of the data required frequently differ between those layers or networks.
Taking a convolutional neural network as an example, each layer performs its operation (e.g., a convolution) using the input feature data and the layer's parameters (e.g., convolution kernels). The feature data, also called a feature map, can be regarded as a data block with a certain width and height. The output feature data of each layer is provided to the next layer as its input feature data, but because the number of nodes and the mode of operation differ from layer to layer, the required dimension order, arrangement, and format of the input feature data often differ as well. Existing software approaches solve this data-matching problem between layers or networks by reading the data to be processed from the on-chip cache into memory, performing the dimension transformation in software, and writing the result back from memory into the on-chip cache. This not only increases the software complexity of the AI algorithm but also increases the memory access overhead of the chip and reduces its throughput.
The above-mentioned contents are only for assisting understanding of the technical solutions of the present application, and are not taken as a basis for evaluating the prior art of the present application.
Disclosure of Invention
The application provides a dimension conversion device friendly to on-chip cache, which can effectively complete data dimension conversion on the premise of not reducing on-chip cache access efficiency.
The above purpose is realized by the following technical scheme:
according to a first aspect of the embodiments of the present application, a dimension transformation module friendly to on-chip cache is provided, which includes a control module, a data cache module composed of a plurality of memory blocks, a write-in control module, and a read-out control module. The control module receives a data handling instruction to be processed and simultaneously acquires configuration information related to the instruction, wherein the configuration information at least comprises base address information of input data and output data, dimension information of the input data and the output data, and data size and data step length of each dimension of the input data and the output data. The control module generates corresponding input addresses using the received configuration information to read corresponding data in an order from a low dimension to a high dimension of the output data dimension. The write-in control module writes the input data from the external storage unit into the data cache module. And the reading control module reads data from the data cache module according to the instruction of the control module so as to output the data.
The dimension transformation device of this embodiment can not only transfer data among different storages but also perform dimension transformation or matching while the data are in flight. For example, after the processor finishes computing one layer of a neural network, the device can transform the data while moving the results from the on-chip cache to off-chip storage, so that when the processor starts the next layer's operation it reads data that have already been transformed. The processor no longer needs to spend extra clock cycles and computational resources on the transformation, which reduces its computational load and improves its data throughput. In addition, because the device performs the transformation in parallel with the data transfer, reading and outputting the data in order from the lowest to the highest dimension according to the configuration information, it avoids the software overhead of reading all the data out, transforming them, and writing them back, and completes the transformation without sacrificing data-transfer performance.
In some embodiments, each memory block in the data cache module is an on-chip random access memory (on-chip RAM), and the number of memory blocks is chosen so that it evenly divides both the preset bit width of the input data and the preset bit width of the output data; the bit width of the input data is the same as the bit width of the output data. With this arrangement, an expensive on-chip RAM wide enough to hold the entire bit width of the input data is no longer required; instead, a data cache built from several narrower memory blocks splits each input word into multiple slices stored in parallel across the blocks. This saves area overhead and relaxes the cost and performance requirements on the on-chip RAM.
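For illustration, splitting one wide input word across N narrower memory blocks might look like the following Python sketch (function and parameter names are hypothetical, not from the patent):

```python
def split_word(data, total_bits, n_blocks):
    """Split one total_bits-wide input word into n_blocks equal slices,
    the least-significant slice going to memory block 0."""
    slice_bits = total_bits // n_blocks   # width handled by each block
    mask = (1 << slice_bits) - 1
    return [(data >> (i * slice_bits)) & mask for i in range(n_blocks)]
```

For a 16-bit word 0xABCD split across four 4-bit blocks, block 0 receives 0xD and block 3 receives 0xA; all blocks are written in the same cycle.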
In some embodiments, the apparatus may include a plurality of output channels, and the number of output channels likewise evenly divides both the bit width of the input data and the bit width of the output data. In some embodiments, each output channel corresponds to one output address, the data within each channel is contiguous, and the depth of each memory block is at least the bit width of the output data divided by the number of output channels and by the bit width of a single data element.
In this embodiment, using multiple output channels reduces the internal cache overhead of the dimension transformation device: output can begin as soon as the data cache holds enough data for a single output channel, which improves data-transfer efficiency.
In some embodiments, the dimension data of the input data and the output data are stored in a linear arrangement from a low dimension to a high dimension.
In some embodiments, the control module further includes one set of input counters and multiple sets of output counters. When generating input addresses to read data, the control module uses the set of input counters to track the dimension indices of the data currently being read, with one input counter per dimension; the count values of the set are updated together after each data read request completes. The multiple sets of output counters correspond to the multiple output channels, one set per channel. The control module generates an output address for each channel and uses that channel's set of output counters, one counter per dimension, to track the dimension indices of the data currently being output; after each output from a channel, that channel's counters are updated together. In general, the number of dimensions of the input data is the same as the number of dimensions of the output data.
When the control module requests data from the external storage unit, the generated input address is computed from the input data base address, the data size of each dimension, the data step length (stride) of each dimension, and the count value of each dimension's input counter.
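As a behavioral sketch of that address calculation (the flat counter/stride lists and all names here are illustrative assumptions, not the patent's implementation), the input address is the base address plus a stride-weighted sum of the counters:

```python
def input_address(base, counters, strides):
    """Input address = base address plus each dimension's counter times
    that dimension's data step length (stride); counters and strides
    are ordered from the lowest to the highest dimension."""
    return base + sum(c * s for c, s in zip(counters, strides))
```

For example, with base 4096, counters [2, 1] and strides [1, 16], the generated address is 4096 + 2·1 + 1·16 = 4114.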
In some embodiments, the control module adjusts the per-dimension input counters to read data as follows: the input data dimensions other than the lowest dimension of the output data are traversed in full, in order from low to high; the input counter for the lowest dimension of the output data is incremented by 1 after each data request; and the input counter for the lowest dimension of the original input data is incremented by the single input data amount, which depends on how much data is read per request, until the lowest dimension of the output data is exhausted. For example, the single input data amount may be the ratio of the input data bit width to the bit width of a single data element.
In some embodiments, when the input and output data dimensions are identical (e.g., NDHWC -> NDHWC), or when the lowest dimension is unchanged between input and output (e.g., NDHWC -> WHDNC), the control module reads data in order from the lowest to the highest dimension of the input data. That is, after each input request, the input counter of the lowest input dimension (dimension C in the examples above) is incremented by the single input data amount (the number of data elements contained in each input transfer). When that counter reaches the data size of its dimension, it is cleared and a carry of 1 is added to the input counter of the next dimension up (for input dimension order NDHWC, once dimension C has been read through, the counter of the next dimension W is incremented by 1). This process repeats until all data have been read.

On output, the control module adjusts the per-dimension output counters as follows to drive the output channels: each channel writes data in order from the lowest to the highest input dimension. After each output, the output counter for the lowest input dimension is incremented by the single output data amount (which depends on the width of each data element and the number of channels); when that counter reaches the data size of its dimension, it is cleared and 1 is carried to the counter of the next dimension up. This process repeats until all data have been output.
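The odometer-style carry described above can be sketched in Python (a behavioral model with illustrative names, not the patent's hardware):

```python
def advance_counters(counters, sizes, step_lowest=1):
    """Advance a set of per-dimension counters after one request,
    carrying from the lowest dimension upward (odometer style).

    counters    : count values, index 0 = lowest dimension
    sizes       : data size of each dimension, same ordering
    step_lowest : amount added to the lowest-dimension counter per
                  request (the single input/output data amount)
    Returns True while more data remain, False when the sweep is done.
    """
    counters[0] += step_lowest
    for d in range(len(counters)):
        if counters[d] < sizes[d]:
            return True
        counters[d] = 0            # clear this dimension ...
        if d + 1 < len(counters):
            counters[d + 1] += 1   # ... and carry 1 to the next one up
    return False                   # highest dimension overflowed: done
```

A 2x2 sweep visits (0,0), (1,0), (0,1), (1,1) and then reports completion, matching the clear-and-carry behaviour of the input and output counters.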
In some embodiments, when the lowest dimension changes between input and output (e.g., NDHWC -> NDHCW or NDHWC -> CDHNW), the control module reads data in order from the lowest to the highest dimension of the output data. That is, after each input request, the input counter of the lowest output dimension (dimension W in the examples above) is incremented by 1. When that counter reaches the data size of its dimension, it is cleared and a carry is added to the input counter of the next dimension up in the output order. If that next dimension is the lowest dimension of the input data (as with input NDHWC and output NDHCW, where dimension C follows W in the output order), the carry is the single input data amount B/b, which depends on the data width of each input transfer; in all other cases the carry is 1 (for output order CDHNW, once dimension W has been read through, the counter of the next dimension N is incremented by 1). This process repeats until all data have been read.
On output, the control module adjusts the per-dimension output counters as follows to drive the output channels: each channel writes data in order from the lowest to the highest output dimension. That is, after each output, the output counter for the lowest output dimension is incremented by the single output data amount (which depends on the width of each data element and the number of channels). When that counter reaches the data size of its dimension, it is cleared and a carry is added to the output counter of the next dimension up in the output order: if that next dimension is the lowest dimension of the input data, the carry is the single output data amount; otherwise the carry is 1. This process repeats until all data have been output.
In the above embodiment, by setting the corresponding input counter and output counter in each dimension, a flexible input address and output address generation manner is provided, thereby facilitating the dimension conversion device to read and output data more simply and conveniently.
In some embodiments, the write control module is configured to: when the input and output data dimensions are unchanged, or the lowest dimension is unchanged between them, count each received input transfer, take the count modulo the depth of each memory block in the data cache module to generate a write address for every block, and write the currently received input data into the blocks of the data cache module accordingly. The storage depth of the data cache module is B/M/b, where B is the output bit width, M is the number of output channels, and b is the number of bits per data element. The write address generated for each memory block is therefore:

write_address[i] = write_cnt % (B/M/b), i ∈ [0, N)

where write_cnt is the write control module's running count of received input data starting from 0, i is an integer starting from 0, and N is the number of memory blocks.
In some embodiments, the read control module is configured to: when the input and output data dimensions are unchanged, or the lowest dimension is unchanged between them, generate read addresses for the memory blocks of the data cache module, as instructed by the control module, according to:

read_address[i] = read_cnt % (B/M/b), i ∈ [0, N)

where read_cnt is the read control module's running count of read data starting from 0, B is the output bit width, M is the number of output channels, b is the number of bits per data element, N is the number of memory blocks, and i is an integer starting from 0.
In some embodiments, the write control module is configured to: when the lowest dimension changes between input and output, count each received input transfer; circularly shift the currently received input data to the right by the current count value multiplied by B/N bits, where N is the number of memory blocks in the data cache module and B is the input bit width; and take the count modulo the depth of each memory block to generate a write address for every block, writing the shifted data into the blocks of the data cache module accordingly. As before, the storage depth of the data cache module is B/M/b, where B is the output bit width, M is the number of output channels, and b is the number of bits per data element, so the write address generated for each memory block is:

write_address[i] = write_cnt % (B/M/b), i ∈ [0, N)

where write_cnt is the write control module's running count of received input data starting from 0, i is an integer starting from 0, and N is the number of memory blocks.
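The circular right shift can be sketched as a behavioral model (names are illustrative; the hardware would implement this as a barrel rotator):

```python
def rotate_right(data, write_cnt, B, N):
    """Circularly rotate a B-bit input word right by write_cnt * (B//N)
    bits before splitting it across the N memory blocks, so successive
    input words land in different blocks."""
    shift = (write_cnt * (B // N)) % B
    mask = (1 << B) - 1
    return ((data >> shift) | (data << (B - shift))) & mask
```

For B = 4 bits and N = 4 blocks the rotation step is one bit per received word: the word 0b0001 received as the second transfer (write_cnt = 1) becomes 0b1000.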
In some embodiments, the read control module is configured to: when the lowest dimension changes between input and output, generate read addresses for the memory blocks of the data cache module, as instructed by the control module, according to:

read_address[i] = (i + read_cnt) % (B/M/b), i ∈ [0, N)

where read_cnt is the running count of read data starting from 0, B is the output bit width, M is the number of output channels, b is the number of bits per data element, i is an integer starting from 0, and N is the number of memory blocks.
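The contrast between the two address formulas can be illustrated with a short sketch (all names are hypothetical): the plain write address uses the same row in every block, while the staggered read address (i + read_cnt) lets one read cycle pick a different row from each block.

```python
def write_addresses(write_cnt, depth, n_blocks):
    """Unchanged-lowest-dimension case: every memory block is written
    at the same row, write_cnt modulo the block depth (B // M // b)."""
    return [write_cnt % depth for _ in range(n_blocks)]

def read_addresses(read_cnt, depth, n_blocks):
    """Changed-lowest-dimension case: block i is read at row
    (i + read_cnt) % depth, so a single read cycle touches a different
    row of every block and the elements of the output's lowest
    dimension emerge simultaneously."""
    return [(i + read_cnt) % depth for i in range(n_blocks)]
```

With depth 4 and four blocks, the first staggered read touches rows [0, 1, 2, 3] and the second touches rows [1, 2, 3, 0]: no two blocks ever contend for the same row.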
The above embodiments provide a flexible read/write scheme for the data cache inside the dimension transformation device, so that the data of the lowest output dimension are spread across different memory blocks and can all be read out in the same cycle on output.
According to a second aspect of the embodiments of the present application, there is provided a processor for a neural network, including the dimension transformation apparatus according to the first aspect of the embodiments of the present application, configured to perform data transfer between an on-chip cache and an off-chip memory of the processor.
The technical scheme of the embodiment of the application can have the following beneficial effects:
the dimension conversion device can not only carry data among different storages, but also realize dimension conversion or matching among different data while carrying the data, thereby reducing the calculation load of a processor and improving the data throughput of the processor. In addition, the dimension conversion device adopts data cache and multi-channel parallel output which are composed of a plurality of small on-chip RAMs, output can be started only by meeting the data quantity of a single output channel with part of cached data, and feature data which are transmitted from one neural network to another neural network or transmitted from one layer of the neural network to another layer are not required to be completely read into the cache for conversion. Therefore, the area overhead of on-chip cache is saved, and the efficiency of data transmission is not influenced.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application. It is obvious that the drawings in the following description are only some embodiments of the application, and that for a person skilled in the art, other drawings can be derived from them without inventive effort. In the drawings:
fig. 1 is a schematic block diagram of a dimension transformation apparatus according to an embodiment of the present application.
FIG. 2 is a diagram illustrating a relationship between a data storage address and a dimension size and a step size according to an embodiment of the present application.
Fig. 3 is a schematic diagram illustrating a process of reading data from an external storage unit by a dimension transformation device according to an embodiment of the present application.
Fig. 4 is a schematic diagram illustrating a process of reading data from an external storage unit by a dimension transformation device according to an embodiment of the present application.
Fig. 5 is a schematic structural diagram of a dimension transformation apparatus according to another embodiment of the present application.
Detailed Description
For the purpose of making the present application more apparent, its technical solutions and advantages will be further described in detail by means of specific embodiments in the following, with reference to the accompanying drawings. It is to be understood that the embodiments described are only a few embodiments of the present application, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the application. One skilled in the relevant art will recognize, however, that the embodiments of the present application can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known methods, devices, implementations, or operations have not been shown or described in detail to avoid obscuring aspects of the application.
The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
The data dimension information may be, without limitation, one or more of the dimensions commonly used in artificial-intelligence networks: height (H), width (W), depth (D), channel (C), and number of samples (N). For example, data in the NDHWC format has 5 dimensions, the lowest being C and the highest N. During neural-network operation, after the processor finishes one layer it usually passes that layer's output to the next layer as feature data. The dimension arrangement of feature data often differs between layers and between networks; for example, some layers arrange data as NDHWC while others use NCDHW. As mentioned above, performing the dimension transformation through software programming increases the chip's memory access overhead and reduces its throughput. The inventors tried in practice to implement dimension transformation with a dedicated hardware module operating on the on-chip cache, but found that completing the transformation that way requires an on-chip cache large enough to hold all the data being transformed, which is expensive in both area and energy consumption.
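To see why a layout change forces data movement, the linear offset of the same element under two dimension orders can be computed with a small sketch (the tensor sizes, indices, and names here are made-up examples, not from the patent):

```python
def offset(idx, sizes):
    """Linear offset of element idx (ordered low -> high dimension)
    in a tensor whose per-dimension sizes are in the same order;
    the lowest dimension varies fastest in memory."""
    off, stride = 0, 1
    for i, s in zip(idx, sizes):
        off += i * stride
        stride *= s
    return off

# Hypothetical tensor: N=2, D=1, H=2, W=2, C=3. The element
# (n=0, d=0, h=1, w=0, c=2) under two layouts:
# NDHWC layout: dimension order (C, W, H, D, N), C fastest.
ndhwc = offset([2, 0, 1, 0, 0], [3, 2, 2, 1, 2])
# NCDHW layout: dimension order (W, H, D, C, N), W fastest.
ncdhw = offset([0, 1, 0, 2, 0], [2, 2, 1, 3, 2])
```

The same logical element sits at offset 8 in the NDHWC layout but at offset 10 in the NCDHW layout, so converting between the two layouts necessarily reorders the data in memory.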
The embodiments of the present application provide a dimension transformation device that does not sacrifice data-transfer performance and can significantly reduce on-chip cache usage. Fig. 1 is a schematic diagram of the functional modules of a dimension transformation device according to an embodiment of the present application. The device comprises a control module and a data path formed by a write control module, a data cache module, and a read control module. According to an instruction received from an external control unit and its configuration information, the control module generates an input address to request data from the preceding module and instructs the write control module to prepare to receive the input data from that module. The received instruction is one that requires the participation of the dimension transformation device and may include, but is not limited to, a data-transfer instruction such as a STORE instruction. The preceding module may be, without limitation, a memory, an on-chip random access memory (on-chip RAM), an on-chip read-only memory (on-chip ROM), an external memory, or any other data cache. The control module may also generate an output address, based on the configuration information received from the external control unit, to request a write to the following module and to instruct the read control module to prepare to read data from the data cache module and deliver the output data to the following module. The following module may likewise be, without limitation, a memory, an on-chip RAM, an external memory, or any other data cache.
When the write control module receives input data from the previous-stage module, it generates a write address for the data cache module according to its storage state, and then transmits the write signal, the write address and the input data to the data cache module together. The data cache module receives the write signal, the write address and the input data from the write control module and caches the input data at the write address. Meanwhile, the data cache module can also receive a read signal and a read address from the read control module and return the output data corresponding to the read address to the read control module. The read control module generates a read address and a read signal according to the indication signal and configuration information from the control module, sends them to the data cache module, and then transmits the output data returned by the data cache module, together with the output address generated by the control module, to the next-stage module.
In an embodiment of the present application, the control module instructs, according to the received configuration information, the respective modules to read data in a specific order and output the data in a correspondingly set manner, so as to implement dimension transformation between input data and output data (described in detail below). The dimension transformation device can not only transfer data between different storages, but can also perform dimension transformation or matching while the data is being transferred. For example, after the processor completes the computation of one layer of the neural network, the device can perform the dimension transformation while moving the results from the on-chip cache to off-chip storage, so that when the processor begins the next layer of operation, it reads data that has already been transformed. The processor no longer needs to spend additional clock cycles and computational resources on the data dimension transformation. The dimension transformation device thus reduces the computational load of the processor and improves its data throughput.
In the embodiment of the present application, the bit width of the input data and that of the output data are required to be the same, so that the data throughput does not change during dimension transformation. The bit width of the input and output data is denoted by B hereinafter, and their bit range may be written as [B-1:0]. Common bit widths for input and output data include, but are not limited to, 2048, 1024 and 512. In computer processing, however, data is not accessed bit by bit, but in units composed of multiple bits; for example, 8 bits constitute 1 byte. The number of bits occupied by a single datum is denoted by b below, and should typically be a multiple of 8. The data types supported in embodiments of the present application may include, but are not limited to, 8-bit integers, 16-bit integers, 32-bit integers, 16-bit floating-point numbers, 32-bit floating-point numbers, and 64-bit floating-point numbers. Thus, the dimension transformation device reads B bits from the external unit at a time, i.e., B/b data per transfer (hereinafter these B/b data may also be referred to collectively as "one data").
In an embodiment of the application, the data cache module is composed of a plurality of on-chip RAMs. For convenience of description, assume the data cache module contains N RAMs, which may also be called N memory blocks (banks), i.e., bank0 to bankN-1. A single datum, being the basic unit of dimension transformation, can be stored only within one bank and cannot span multiple banks; the bit width of each bank should therefore be an integer multiple of the number of bits b occupied by a single datum. To allow a full word of data to be read and output at once, thereby maintaining data throughput, input data of bit width B is buffered evenly across the N banks, so N is set to a natural number that divides the bit width B of the input and output data evenly. That is, in the data cache module, the bit width of each bank is B/N, and B/N is an integer multiple of the bit count b of a single datum. Each bank thus corresponds to one write address, one read address and a data slice of width B/N: bank0 corresponds to write address 0, read address 0, and the bits [B/N-1:0] of the input/output data; bank1 corresponds to write address 1, read address 1, and the bits [2×B/N-1:B/N]; and so on, with bankN-1 corresponding to write address N-1, read address N-1, and the bits [B-1:(N-1)×B/N].
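The bank-to-bit-slice mapping above can be sketched in a few lines of software. This is an illustrative model, not the hardware itself; the values B = 2048 and N = 32 are the reference parameters used later in the text.

```python
# Illustrative model of the bank slicing described above: a B-bit word is
# split into N slices of B/N bits each, with bank0 holding bits [B/N-1:0],
# bank1 holding bits [2*B/N-1:B/N], and so on up to bankN-1.
B = 2048          # bit width of one input/output word (reference value)
N = 32            # number of banks; must divide B evenly
SLICE = B // N    # bits per bank: 64

def split_into_banks(word: int) -> list[int]:
    """Return the N bank slices of a B-bit word (bank0 = lowest bits)."""
    mask = (1 << SLICE) - 1
    return [(word >> (i * SLICE)) & mask for i in range(N)]

def merge_from_banks(slices: list[int]) -> int:
    """Reassemble the B-bit word from its N bank slices."""
    word = 0
    for i, s in enumerate(slices):
        word |= s << (i * SLICE)
    return word
```

Splitting and merging are exact inverses, so buffering a word across the banks and reading it back reproduces the original word.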
The depth of each bank of the data cache module depends on the specific output requirements, i.e., it must at least hold the set minimum amount of output data. In some embodiments of the present application, to reduce internal cache overhead, the output data may be divided into M output channels, each corresponding to one output address, i.e., output address 0 to output address M-1. The data within each channel must be contiguous, while data between channels need not be. M should be a natural number that divides the bit width B of the input and output data evenly, so the data bit width of each output channel is B/M. It should be understood that a single datum cannot be output across channels, so B/M cannot be too small and should be an integer multiple of the bit count b of a single datum. Accordingly, the depth of each bank of the data cache (i.e., the number of data each bank can store) is B/M/b, which meets the storage requirement of the dimension transformation. In some embodiments, to sustain continuous data transfer, each bank may use a ping-pong RAM with a depth of 2×B/M/b. The size of M affects the time and resources required for transforming the lowest dimension into a higher dimension: a larger M requires more on-chip storage and incurs a larger output delay, but yields a more complete dimension transformation result, so the choice of M is a compromise. The parameters defined above may take, but are not limited to, the following reference values: B = 2048, N = 32 and M = 8, with b = 8 being the minimum value that supports all data formats. The required depth of each bank is then B/M/b = 32 (or 2×B/M/b = 64), and the bit width of each bank is B/N = 64.
It should also be noted that, without loss of generality, embodiments of the present application address the dimension transformation of data stored in a linear arrangement format, i.e., the data of each dimension is stored sequentially from the lowest dimension to the highest, with addresses increasing accordingly. Taking the dimension arrangement NDHWC as an example, the number of dimensions is 5, with C as the lowest dimension and N as the highest. Each datum can be uniquely represented by one dimension coordinate within the arrangement. For example, a datum in NDHWC format can be represented by a 5-dimensional coordinate, where (0,0,0,0,0) denotes the first datum of the batch, and (0,2,0,0,1) denotes the second datum along the C dimension within the third block along the D dimension, with the H and W dimensions at their first positions. As the coordinate increases, the address of the datum in external storage increases by default, as shown in fig. 2. The data size of each dimension indicates the number of data in that dimension; for example, if the data size of dimension 0 is L0, then dimension 0 contains L0 data, from data 0 to data L0-1. Under this coordinate representation, the value of a coordinate in a given dimension is always smaller than the data size of that dimension. For example, if the data size of the C dimension in NDHWC is 2, the C dimension has only two coordinate values: 0 and 1. When the coordinates of a lower dimension are exhausted, a carry is made into the next higher dimension. For example, if in NDHWC the size of the C dimension is 2 and the size of the W dimension is 2, the data are arranged in memory in the order (0,0,0,0,0), (0,0,0,0,1), (0,0,0,1,0), (0,0,0,1,1), and so on. The dimension step size refers to the address change caused by incrementing a given dimension coordinate by 1.
The step size of each dimension must be greater than or equal to the data size of the dimension below it, and the step size of the lowest dimension is the size of one datum. For example, in the exemplary NDHWC arrangement above, incrementing the W coordinate by 1 (e.g., from (0,0,0,0,0) to (0,0,0,1,0)) increases the address by the step size of the W dimension. As another example, in fig. 2, dimension 0 has data size L0 and dimension 1 has data size L1; dimension-1 data block 0 is separated in address from the next dimension-1 data block 1 by the dimension-1 step size, which is greater than or equal to the dimension-0 data size L0. As can be seen from fig. 2, for storage in the above linear arrangement format, the storage address of any datum can easily be determined from a given start address, the dimension data sizes and the dimension step sizes, so that the datum can be fetched from that address.
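The linear-arrangement addressing just described reduces to: address = start address + the sum over all dimensions of coordinate times step size. A minimal sketch follows, with illustrative stride values chosen to satisfy the rule that each step size is at least the data size of the dimension below it:

```python
def linear_address(base: int, coords, strides) -> int:
    """Address of a datum stored in linear arrangement format.

    coords and strides are ordered highest dimension first (e.g. N,D,H,W,C);
    the lowest-dimension stride is 1, i.e. the size of one datum.
    """
    assert len(coords) == len(strides)
    return base + sum(c * s for c, s in zip(coords, strides))

# Example from the text: NDHWC with C size 2 and W size 2, so the data are
# laid out (0,0,0,0,0), (0,0,0,0,1), (0,0,0,1,0), (0,0,0,1,1), ...
strides = (16, 8, 4, 2, 1)  # N,D,H,W,C step sizes (illustrative values)
order = [(0,0,0,0,0), (0,0,0,0,1), (0,0,0,1,0), (0,0,0,1,1)]
addrs = [linear_address(0, c, strides) for c in order]  # 0, 1, 2, 3
```

With these strides the four coordinates map to consecutive addresses 0 to 3, matching the storage order given in the text.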
With reference to fig. 1, when the control module receives a data transfer instruction issued by an external control unit, it simultaneously obtains the configuration information related to executing that instruction. The configuration information includes, but is not limited to: the dimension information of the input data, the data size of each input dimension, the step size of each input dimension, the dimension information of the output data, the data size of each output dimension, the step size of each output dimension, the base address of the input data, the base address of the output data, the data format, and so on. The control module generates an input address from the received configuration information and sends it, together with an input request, to the previous-stage module to request data. One input request and input address are sent per clock cycle until the data of all dimensions have been read.
In some embodiments, the control module includes one set of input counters and multiple sets of output counters for generating the input addresses and output addresses, respectively. The set of input counters corresponds to the dimension information of the input data, with one counter per dimension identifying the dimension coordinate of the data currently being read; the set updates its count values together after each data read request. The multiple sets of output counters correspond to the multiple output channels, one set per channel. Each set of output counters corresponds to the dimension information of the output data, with one counter per dimension identifying the dimension coordinate of the data currently being output; each set updates its count values together after each data output. In general, the number of dimensions of the input data should equal that of the output data. Each counter starts at zero, increments automatically when its counting condition is met, and, when its count reaches the data size of its dimension, is cleared while a carry is passed to the counter of the next higher dimension. That is, a higher-dimension counter counts exactly when the counter of the dimension below it reaches that dimension's data size. The increment of a higher-dimension counter is always 1, whereas the increment of the lowest-dimension counter depends on the data width of each transfer: the lowest-dimension counter advances by the number of data contained in each request or each write.
The input address calculated by the control module for each read request is the sum, over all dimensions, of the product of that dimension's input-counter value and its data step size. At the start of a read, the control module initializes an input counter to 0 for each dimension to represent the dimension coordinate of each datum, and computes the input address for each request from these coordinates and the step-size information of each dimension. For example, in the case of NDHWC, the requested input address may be calculated according to the following formula:

read_address = N_cnt × N_stride + D_cnt × D_stride + H_cnt × H_stride + W_cnt × W_stride + C_cnt × C_stride

where N_cnt, D_cnt, H_cnt, W_cnt and C_cnt are the values of the input counters for each dimension of the NDHWC-format data, and N_stride, D_stride, H_stride, W_stride and C_stride are the data step sizes of each dimension. For NDHWC, the step size C_stride of the lowest dimension C is 1. In the embodiment of the present application, the data requested from the previous stage at any moment always lies along the lowest dimension of the input data, so the higher-dimension coordinates never change within a single request, i.e., the first and last datum of one request never differ in their higher-dimension coordinates. If the remaining data of the lowest dimension are fewer than the amount requested at one time, only the existing data are requested and the shortfall is left empty, so that the address calculation is not disturbed. For example, when the input data dimension is NDHWC, the data corresponding to the address of the control module's first input request are (0,0,0,0,0) to (0,0,0,0,B/b-1), where, as above, B is the preset bit width of the input data and b is the number of bits of a single datum; if, however, the data size of the C dimension is only B/2/b, i.e., the C dimension holds only B/2/b data, then the data corresponding to the address of the first input request are (0,0,0,0,0) to (0,0,0,0,B/2/b-1).
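The interplay between the input counters and the address formula above can be modelled as follows. This is a simplified software sketch of the no-transformation case (the dimension sizes, strides and burst size B/b are illustrative), not a description of the actual circuit:

```python
def input_addresses(sizes, strides, burst):
    """Yield one input address per read request.

    sizes/strides are ordered highest dimension first (e.g. N,D,H,W,C);
    burst is B/b, the number of lowest-dimension data fetched per request.
    """
    cnt = [0] * len(sizes)                # one input counter per dimension
    while True:
        yield sum(c * s for c, s in zip(cnt, strides))
        cnt[-1] += burst                  # lowest dim advances by B/b
        d = len(cnt) - 1
        while d > 0 and cnt[d] >= sizes[d]:
            cnt[d] = 0                    # clear and carry to higher dim
            cnt[d - 1] += 1
            d -= 1
        if cnt[0] >= sizes[0]:            # highest dimension exhausted
            return

# Tiny example: C size 8, W size 2, all higher dimensions of size 1, burst 4.
addrs = list(input_addresses(sizes=(1, 1, 1, 2, 8),
                             strides=(16, 16, 16, 8, 1), burst=4))
# Requests cover C 0-3 and C 4-7 of W=0, then C 0-3 and C 4-7 of W=1.
```

Each request advances the lowest-dimension counter by one burst; when it wraps, a carry of 1 propagates upward, exactly as the text describes.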
The control module determines whether dimension transformation is needed from the input and output data dimension information in the received configuration information. When no transformation is needed, the data can be read sequentially from the low dimension to the high dimension as described above. When a transformation is needed, the control module adjusts the input counters of each dimension so that the generated input addresses read the corresponding data in order from the lowest to the highest dimension of the output data.
More specifically, when the input and output data dimensions are identical, or when their lowest dimension is unchanged, the control module reads data sequentially from the low dimension to the high dimension of the input data. For example, with input dimension NDHWC, the control module reads B/b C-dimension data per clock cycle until all C-dimension data with NDHW = (0,0,0,0) have been obtained; it then increments the W coordinate by 1 and again reads B/b C-dimension data per clock cycle until all C-dimension data with NDHW = (0,0,0,1) have been obtained, and so on.
Fig. 3 is a schematic diagram illustrating the flow of reading data from an external unit by a dimension transformation apparatus according to an example of the present application. During initialization, the dimension transformation device is reset: all counters in the control module are cleared and the internal control logic returns to its initial state. When the control module receives the instruction and configuration information, if it determines that the input and output data dimensions are identical (e.g., NDHWC -> NDHWC) or that their lowest dimension is unchanged (e.g., NDHWC -> WHDNC), it calculates an input address from the count value of each dimension's input counter and each dimension's data step size, issues an input data request, and begins reading in order from the low dimension to the high dimension of the input data. After each input data request, the input counter of the lowest input dimension (e.g., dimension C above) is incremented by the single-transfer data amount (i.e., the number of data per word); when that counter reaches the data size of its dimension, it is cleared and the input counter of the next higher dimension is incremented by 1 (e.g., for input dimension NDHWC, once all C-dimension data have been read, the input counter of the next dimension W is incremented by 1). This process then repeats from the low dimension to the high dimension until all data have been read.
The control module also checks whether the count value of every dimension's input counter has reached the data size of its dimension; when all have, all data of the current instruction have been read, and the internal state of the control module can be restored in preparation for executing the next instruction.
When the lowest dimension of the input data differs from that of the output data, the control module must adjust the order in which data are read. Taking input dimension NDHWC and output dimension NCDHW as an example, the control module detects that the lowest input dimension is C while the lowest output dimension is W. It therefore first generates addresses to read (0,0,0,0,0) to (0,0,0,0,B/b-1), then reads (0,0,0,1,0) to (0,0,0,1,B/b-1), and so on until the W dimension is exhausted; it then continues with (0,0,0,0,B/b) to (0,0,0,0,2B/b-1), and so on. That is, when the lowest dimensions of the input and output data differ, the control module reads all the data in order from the low dimension to the high dimension of the input data, except that the lowest dimension of the output data is advanced first: the input counter of the output data's lowest dimension is incremented by 1 after each data request, and only when that dimension is exhausted is the input counter of the input data's original lowest dimension incremented by the single-transfer data amount. It should be understood that the above manner is only an example and not a limitation; for storage in the linear arrangement format described with respect to fig. 2 above, the storage address of any datum can easily be determined from the start address, dimension sizes and dimension step sizes of the input data, so that the datum can be fetched from that address.
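A sketch of the adjusted request order for the NDHWC -> NCDHW example above: W becomes the fastest-varying coordinate, and C advances by one burst of B/b data only after W wraps. The sizes and burst value are illustrative, and this models the ordering only, not the hardware:

```python
def adjusted_requests(w_size, c_size, burst):
    """Yield one (w, c_start) pair per request; each request covers
    C coordinates c_start .. c_start + burst - 1 at the given W."""
    for c_start in range(0, c_size, burst):   # input's lowest dim, in bursts
        for w in range(w_size):               # output's lowest dim, step 1
            yield (w, c_start)

reqs = list(adjusted_requests(w_size=2, c_size=8, burst=4))
# W sweeps 0..1 for C block 0-3, then sweeps again for C block 4-7:
# [(0, 0), (1, 0), (0, 4), (1, 4)]
```

Compare the unchanged case, where C would be exhausted in burst steps before W advanced at all.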
In this way, the control module can generate, from the start address, dimension sizes and dimension step sizes given in the configuration information, input addresses that read the corresponding data from the previous-stage module for caching in the order required by the output dimension arrangement in the configuration information, thereby completing the corresponding dimension transformation during the data transfer.
Fig. 4 is a schematic diagram illustrating the process of reading data from an external unit when the lowest dimensions of the input and output data differ, according to an exemplary dimension transformation apparatus of the present application. The input address is still generated from the count value of each dimension's input counter and each dimension's data step size, according to the input-address formula given above; unlike fig. 3, however, the counting behaviour of the input counters changes. When the lowest dimension of the input and output data differs (e.g., NDHWC -> NDHCW or NDHWC -> CDHNW), the control module adjusts the counting of each dimension's input counter so that the generated input addresses read the corresponding data in order from the low dimension to the high dimension of the output data. As shown in fig. 4, after each input data request the input counter of the output data's lowest dimension (e.g., dimension W in the example above) is incremented by 1; when its count reaches the data size of that dimension, the counter is cleared and a carry is passed to the input counter of the next higher dimension in the output arrangement. If that next higher dimension is the lowest dimension of the input data (e.g., input dimension NDHWC with output dimension NDHCW, where the dimension above W in the output is C), the carry equals the single-transfer data amount, i.e., the input counter of the input data's lowest dimension is incremented by the number of data per word; in all other cases the carry is 1. For example, with input dimension NDHWC: if the output dimension is NDHCW, then after the data of the lowest dimension W have been read, the input counter of the next dimension C is carried by the single-transfer amount B/b, since dimension C is exactly the lowest dimension of the input data; whereas if the dimension above W in the output arrangement is not C, its input counter simply carries 1. The above process is repeated until all data are read.
With continued reference to fig. 1, after the control module issues an input request and input address to the previous-stage module, the previous-stage module feeds back the input data and an input indication after a certain delay. Upon receiving the input indication, the write control module counts the input data and divides each input word evenly into N parts according to the number N of banks in the data cache module, so that each part is subsequently stored in its corresponding bank. As mentioned above, bank0 corresponds to write address 0 and the bits [B/N-1:0] of the input data; bank1 corresponds to write address 1 and the bits [2×B/N-1:B/N]; and so on, with bankN-1 corresponding to write address N-1 and the bits [B-1:(N-1)×B/N]. The write control module generates the write address for the input data from its count of the input data: as the count increases, the write address of each bank increases accordingly. For example, the slices of the first input word are stored at address 0 of each bank, and the slices of the second input word at address 1 of each bank. The maximum number of data that a bank can store is called the depth of the bank. The larger the bank depth, the larger the occupied area and the higher the cost; the smaller the bank depth, the larger the access overhead, as the data transfer then requires many frequent accesses. As mentioned above, the depth of each bank of the data cache module may be set according to the bit width of the input data and the number of output channels (e.g., B/M/b or 2×B/M/b as exemplified above).
When the count of the input data by the write control module is the same as the depth of the bank, the count returns to 0, and a new round of counting is restarted.
When the input and output data dimensions are identical, or when their lowest dimension is unchanged, the write address generated by the write control module for each memory block of the data cache module can be obtained by the following formula:

write_address[i] = write_cnt % (B/M/b), i ∈ [0, N)

where write_cnt is the write control module's count of the received input data, starting from 0; i is an integer bank index starting from 0; and N is the number of memory blocks. B/M/b is the storage depth of the data cache module (which can also be understood as the depth of each memory block), where B is the bit width of the input or output data, M is the number of output channels, and b is the number of bits of a single datum.
When the lowest dimension of the input and output data differs, the write control module counts each received input word; performs a circular right shift on the currently received input data, where the number of bits shifted is the current count value multiplied by B/N, N being the number of memory blocks in the data cache module and B the bit width of the input data; and generates a write address for each memory block by taking the current count value modulo the depth of each memory block, whereupon the processed data is written into the memory blocks of the data cache module. In general, the write address generated by the write control module for each memory block can be obtained by the following formula:

write_address[i] = write_cnt % (B/M/b), i ∈ [0, N)

where write_cnt is the write control module's count of the received input data, starting from 0; i is an integer bank index starting from 0; N is the number of memory blocks; and the remaining parameters are as above.
It can be seen that, when the lowest dimensions of the input and output data differ, the write control module adjusts the storage order of the input data in the data cache so that linearly arranged output data, ordered by the output data dimensions, can be delivered to the next-stage module. Still taking input dimension NDHWC and output dimension NCDHW as the example, when the control module detects that the lowest dimensions of the input and output data differ, it sends a corresponding dimension transformation indication to the write control module. The write control module counts the received input data (e.g., modulo the depth of each bank, as mentioned above) and performs a circular right shift of the current input word by the count value multiplied by B/N bits. This operation places data sharing the same lowest output-dimension coordinate into different banks, so that such data can be read out simultaneously.
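The write-side behaviour for the transformed case can be sketched as below, using the reference parameters from the text (B = 2048, N = 32, M = 8, b = 8). The rotation direction and exact bank placement here are one plausible reading of the description, so treat this as an illustrative model rather than the definitive hardware behaviour:

```python
B, N, M, b = 2048, 32, 8, 8   # reference parameter values from the text
DEPTH = B // M // b           # per-bank depth: 32 entries

def rotate_right(word: int, bits: int, width: int = B) -> int:
    """Circular right shift of a width-bit word."""
    bits %= width
    mask = (1 << width) - 1
    return ((word >> bits) | (word << (width - bits))) & mask

def write_step(word: int, count: int):
    """One write: rotate the word by count*B/N bits and address all banks
    at count % (B/M/b), per the formulas above."""
    addr = count % DEPTH
    return addr, rotate_right(word, count * (B // N))
```

For count 1 the word is rotated by one bank width (64 bits), so a slice that would have landed in one bank lands in its neighbour instead, spreading data with the same lowest output-dimension coordinate across different banks.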
With continued reference to fig. 1, the write control module transmits the generated write address, the write signal and the processed input data to the data cache module, and notifies the control module of how much data has been delivered to the data cache module. When enough data is available for output, the control module notifies the read control module to read data from the data cache module for output. When the input and output data dimensions are identical, the control module can determine that enough data is available as soon as one word has been written into the data cache; in this case, the continuous data obtained from the previous stage is simply output as-is, in its original continuous linear arrangement. However, when the lowest dimensions of the input and output data differ, the data are read across dimensions: as mentioned above, when fetching from the previous-stage module, the coordinate of the output data's lowest dimension advances by 1 per request, and sufficient data for output cannot be guaranteed until the number of lowest-dimension coordinates covered reaches the data amount required by an output channel. In the embodiment of a dimension transformation apparatus with M output channels, to ensure stable throughput, the control module may determine that enough data is available once at least B/M/b input words have been written into the data cache (i.e., at least the data bit width of one channel is satisfied).
For the process of writing the data cache, again take the example above with input dimension NDHWC and output dimension NCDHW, using the reference values given earlier: B = 2048, N = 32, M = 8 and b = 8, so the depth of each bank is 64 and the bit width of each bank is 64. The transformation proceeds as follows: in the first clock cycle of writing NDHWC data, the data with dimension coordinates (0,0,0,0,0) to (0,0,0,0,7) is written to address 0 of bank0; in the second clock cycle, the data (0,0,0,1,0) to (0,0,0,1,7) is written to address 1 of bank1; and so on, until in the thirty-second clock cycle the data (0,0,0,31,0) to (0,0,0,31,7) is written to address 31 of bank31. At this point the minimum output data amount has been reached.
With continued reference to FIG. 1, the control module triggers the readout control module when the control module determines that there is sufficient data output. The reading control module can generate N reading addresses and reading signals to read the data in the data cache module. Corresponding to the above write control module, when the input data dimension and the output data dimension are not changed, or when the lowest dimension of the input data and the output data is not changed, the read control module generates the read addresses for the storage blocks in the data cache module according to the following formula according to the indication of the control module:
read address [i]=(read cnt )%(B/M/b),i∈[0,N)
wherein read cnt The count value of the read data, which is read by the read control module, is a count value starting from 0; b is the bit width of the output data, M is the number of output channels, B is the number of bits of the data read, N is the number of memory blocks, i is an integer starting from 0. When the lowest dimension of the input data dimension and the output data dimension changes, the read control module generates read addresses for all storage blocks in the data cache module according to the following formula (wherein all parameters are the same as above) according to the indication of the control module:
read_address[i] = (i + read_cnt) % (B/M/b), i ∈ [0, N).
Here each read address uses a different offset, so that the output data is read and output in order from the low dimension to the high dimension of the output dimension.
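The two read-address formulas can be sketched as follows. This is a hypothetical Python helper, not part of the patent; the defaults are the reference values given above, and the parameter names follow the text:

```python
def read_addresses(read_cnt, B=2048, M=8, b=8, N=32, lowest_dim_changed=True):
    """Per-bank read addresses for the two cases in the text.

    B: bit width of the output data, M: number of output channels,
    b: bit width of a single data element, N: number of storage blocks.
    """
    if lowest_dim_changed:
        # read_address[i] = (i + read_cnt) % (B/M/b): each bank gets a
        # different offset, producing the diagonal read pattern
        return [(i + read_cnt) % (B // M // b) for i in range(N)]
    # read_address[i] = read_cnt % (B/M/b): all banks share one address
    return [read_cnt % (B // M // b) for _ in range(N)]
```

With the reference values, B/M/b = 32, so at read_cnt = 0 the changed-lowest-dimension case reads address i from bank i, exactly the pattern used in the example below.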
For the process of reading the data cache, still using the reference values given above, the translation from input dimension NDHWC to output dimension NCDHW proceeds as follows: on the first read, the readout control module reads from bank0 the data at address 0, i.e., (0,0,0,0,0) to (0,0,0,0,7) in NDHWC coordinates; at the same time it reads from bank1 the data at address 1, i.e., (0,0,0,1,0) to (0,0,0,1,7), and so on, up to bank31, from which it reads the data at address 31, i.e., (0,0,0,31,0) to (0,0,0,31,7). Finally, the readout control module arranges the data along the W dimension: the 512-bit data (0,0,0,0,0) to (0,0,0,0,31), expressed in NCDHW coordinates, becomes the data corresponding to output address 0; the 512-bit data (0,1,0,0,0) to (0,1,0,0,31) corresponds to output address 1; the 512-bit data (0,2,0,0,0) to (0,2,0,0,31) corresponds to output address 2; and so on, until the 512-bit data (0,7,0,0,0) to (0,7,0,0,31) corresponds to output address 7.
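The diagonal read and the regrouping along W can be sketched together. This Python model uses the same assumed reference values and tracks coordinates instead of bits; the bank contents are those left by the write example above:

```python
N_BANKS, DEPTH, CH = 32, 64, 8

# Fill the banks as the write example above leaves them: the chunk for
# W-position w sits in bank w at address w (stored as NDHWC coordinates)
banks = [[None] * DEPTH for _ in range(N_BANKS)]
for w in range(N_BANKS):
    banks[w][w] = [(0, 0, 0, w, c) for c in range(CH)]

# Diagonal read at read_cnt = 0: bank i is read at address (i + 0) % 32
read_cnt = 0
chunks = [banks[i][(i + read_cnt) % N_BANKS] for i in range(N_BANKS)]

# Regroup along W: output channel c gathers element c from every chunk,
# i.e. the contiguous NCDHW run (0, c, 0, 0, 0) .. (0, c, 0, 0, 31)
out = {c: [chunk[c] for chunk in chunks] for c in range(CH)}
```

out[0] then lists (0,0,0,0,0) through (0,0,0,31,0) in NDHWC coordinates, which is exactly the contiguous W run carried by output address 0 in NCDHW order.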
While instructing the readout control module to work, the control module generates the output address corresponding to each output channel according to the count of the per-dimension output counters of each output channel and the data step length of each dimension. When output starts, the control module starts an output counter with an initial value of 0 for each dimension to represent the dimensional coordinate of each piece of data, and calculates the corresponding output address from the dimensional coordinates and the step length information of each dimension. For example, for the five dimensions N, D, H, W and C, each output address can be calculated by the following equation:
write_out_address = N_out_cnt × N_out_stride + D_out_cnt × D_out_stride + H_out_cnt × H_out_stride + W_out_cnt × W_out_stride + C_out_cnt × C_out_stride
wherein N_out_cnt, D_out_cnt, H_out_cnt, W_out_cnt and C_out_cnt are the values of the output counters for the N, D, H, W and C dimensions of the output data, and N_out_stride, D_out_stride, H_out_stride, W_out_stride and C_out_stride are the step lengths of the corresponding output dimensions. For the case where the output data dimension is NCDHW, the value of W_out_stride is 1.
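The output-address formula is a plain dot product of counters and strides. The following sketch is a hypothetical helper; the dimension sizes and stride values for the NCDHW layout (N=1, C=16, D=2, H=2, W=32, element-unit strides) are invented for illustration and are not from the text:

```python
def output_address(cnt, stride, dims=("N", "D", "H", "W", "C")):
    """write_out_address = sum over dims of out_cnt[dim] * out_stride[dim]."""
    return sum(cnt[d] * stride[d] for d in dims)

# Element strides for an NCDHW layout with sizes N=1, C=16, D=2, H=2, W=32:
# W is the lowest output dimension, so W_out_stride = 1
stride = {"W": 1, "H": 32, "D": 64, "C": 128, "N": 2048}
```

For example, the coordinate with C_out_cnt = 3 and all other counters 0 maps to address 3 × 128 = 384 under these assumed strides.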
When the input data dimension and the output data dimension are unchanged, or the lowest dimension of the input data and the output data is unchanged, the control module adjusts the output counters of each dimension so that the output channels output data as follows: each output channel writes out data in order of the input data dimension from low to high. That is, after each beat of data is output, the output counter corresponding to the lowest input dimension is increased by the single output data amount (which is related to the width of each beat of data and the number of channels); when the count value of that counter reaches the data size of that dimension, the counter is cleared and 1 is added to the output counter of the next higher dimension; this process repeats until all data are output.
When the lowest dimension changes between the input data dimension and the output data dimension, the control module adjusts the output counters of each dimension so that the output channels output data as follows: each output channel writes out data in order of the output data dimension from low to high. That is, after each beat of data is output, the output counter corresponding to the lowest output dimension is increased by the single output data amount (which is related to the width of each beat of data and the number of channels); when the count value of that counter reaches the data size of that dimension, the counter is cleared and a carry is made to the output counter of the next higher output dimension. If that next higher dimension happens to be the lowest dimension of the input data, the carry is the single output data amount (related to the data width of each beat of output data); otherwise the carry is 1. This process repeats until all data are output.
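The carry scheme for the changed-lowest-dimension case can be sketched as follows. This is a hypothetical Python model; the dimension sizes and per-beat amounts below are invented for illustration and are not from the text:

```python
def advance(cnt, order, size, input_lowest, lowest_step, input_lowest_step):
    """Advance the per-dimension output counters by one beat of output.

    order: output dims from lowest (fastest) to highest, e.g. W,H,D,C,N
    lowest_step: values of the lowest output dim covered per beat
    input_lowest_step: carry used when the next higher dim is the input's
    lowest dim (several of its values are emitted per beat); otherwise 1
    """
    cnt[order[0]] += lowest_step
    for d, upper in zip(order, order[1:]):
        if cnt[d] >= size[d]:
            cnt[d] = 0
            cnt[upper] += input_lowest_step if upper == input_lowest else 1
    return cnt

# NDHWC -> NCDHW: output order low-to-high is W,H,D,C,N; the input's
# lowest dimension is C. Assume sizes W=64, H=2, D=2, C=16, N=1, with
# 32 W values and 8 C values covered per output beat.
order = ("W", "H", "D", "C", "N")
size = {"W": 64, "H": 2, "D": 2, "C": 16, "N": 1}
cnt = {d: 0 for d in order}
for _ in range(8):
    advance(cnt, order, size, "C", lowest_step=32, input_lowest_step=8)
```

Under these assumptions, after 8 beats the W, H and D counters have each wrapped and the C counter has received one carry of 8, so the following beats emit channels 8 to 15.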
It should be understood that, although dimension transformation has been described taking the five dimensions N, D, H, W and C as an example for the input and output data, the input and output data are not limited to five dimensions; any subset of the above dimensions (for example four, three or two dimensions) may be used. As long as the input data and the output data have the same number of dimensions and differ only in arrangement, the dimension transformation apparatus of the embodiments of the present application is applicable.
Fig. 5 is a schematic structural diagram of the functional modules of a dimension transformation apparatus according to another embodiment of the present invention. This apparatus differs from the one shown in Fig. 1 in that the control module generates the input address and the output address through a dedicated input address generator (also referred to as an input walker) and a dedicated output address generator (also referred to as an output walker), respectively. The control module controls and schedules the other modules according to configuration information received from an external control unit. The input walker may generate input addresses based on information provided by the control module to request data from the previous-stage module, and instructs the write control module to initialize its internal state in preparation for receiving the input data from the previous-stage module. The output walker may generate output addresses based on information provided by the control module to request that data be written to the subsequent-stage module. The remaining modules are similar to those described above in connection with Fig. 1 and are not described again here.
It can be seen that the dimension transformation apparatus of the embodiments of the present invention reduces the computational load of the processor by performing data dimension conversion during data movement or transmission. The apparatus uses a data cache composed of multiple small on-chip RAMs; it does not need to read all of the feature data transferred from one neural network to another, or from one layer of a neural network to another layer, into the cache before converting it. Instead, it reads part of the data of each dimension of the feature data according to the configuration information and outputs it, thereby saving on-chip cache area overhead without affecting the efficiency of data transmission. The dimension transformation apparatus can be integrated directly on a processor chip, or within a DMA module of a processor.
In still other embodiments of the present invention, a processor for a neural network is also provided, which includes the dimension transformation apparatus described above in conjunction with the figures. In the processor, tasks of multiple threads run simultaneously on different computing cores; different computing cores perform different computations according to their instructions; the data processed by each computing core and the computation results are temporarily stored in an internal on-chip cache; and data transfer between the on-chip cache and the off-chip memory of the processor is performed using the dimension transformation apparatus described above.
Reference in the specification to "various embodiments," "some embodiments," "one embodiment," or "an embodiment," etc., means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment. Thus, appearances of the phrases "in various embodiments," "in some embodiments," "in one embodiment," or "in an embodiment," or the like, in various places throughout this specification are not necessarily referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. Thus, a particular feature, structure, or characteristic illustrated or described in connection with one embodiment may be combined, in whole or in part, with a feature, structure, or characteristic of one or more other embodiments without limitation, as long as the combination is not logical or operational.
The terms "comprises," "comprising," and "having," and similar terms in this specification, are intended to cover non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. The word "a" or "an" does not exclude a plurality. Additionally, the various elements of the drawings of the present application are merely schematic illustrations and are not drawn to scale.
Although the present application has been described through the above-described embodiments, the present application is not limited to the embodiments described herein, and various changes and modifications may be made without departing from the scope of the present application.

Claims (10)

1. A dimension conversion device friendly to on-chip cache, comprising a control module, a data cache module composed of a plurality of storage blocks, a write control module and a readout control module, wherein:
the control module is configured to receive a data handling instruction to be processed and simultaneously acquire configuration information related to the instruction, wherein the configuration information at least comprises base address information of input data and output data, dimension information of the input data and the output data, and data size and data step length of each dimension of the input data and the output data;
the control module is configured to generate corresponding input addresses by using the received configuration information so as to read corresponding data according to the order of the output data dimension from a low dimension to a high dimension;
the write control module is configured to write input data from an external storage unit into the data cache module;
the readout control module is configured to read and output data from the data cache module according to the instruction of the control module.
2. The apparatus according to claim 1, wherein each memory block in the data cache module is an on-chip random access memory, and the number of the memory blocks at least satisfies a preset bit width of input data and a preset bit width of output data, wherein the bit width of the input data is the same as the bit width of the output data.
3. The apparatus according to claim 1, wherein said apparatus comprises a plurality of output channels, the number of said output channels being at least sufficient to divide a preset bit width of said input data and a preset bit width of said output data.
4. The apparatus of claim 3, wherein each of said output channels corresponds to an output address, the data within each of said output channels is contiguous, and the depth of each of said storage blocks is at least equal to or greater than the ratio of the data bit width of each of said output channels to the bit width of a single data element.
5. The apparatus of claim 1, wherein the dimension data of the input data and the output data are stored in a linear arrangement from a low dimension to a high dimension, and the input address is calculated according to base address information of the input data, size information of the dimension data, step size information of the dimension data, and a count value of an input counter of the dimension.
6. The apparatus of claim 5, wherein the control module is further configured to, upon detecting a change in the lowest dimension between the input data dimension and the output data dimension, adjust the input counters to read data in order of the output data dimension from low to high, comprising:
after each input data request, increasing by 1 the input counter corresponding to the lowest dimension of the output data; when the count value of the input counter corresponding to the lowest dimension of the output data reaches the data size of that dimension, clearing that input counter and carrying to the input counter of the next higher dimension in the output data dimension order; if that next higher dimension is the lowest dimension of the input data, the carry is the data size of a single input, otherwise the carry is 1; and repeating the above process until all data are read.
7. The apparatus of claim 1, wherein the write control module is further configured to:
when the input data dimension and the output data dimension are not changed or the lowest dimension of the input data and the output data is not changed, counting the input data received each time, performing modulo on the depth of each storage block in the data cache module according to the current count value to generate a write address for each storage block, and accordingly writing the currently received input data into each storage block of the data cache module.
8. The apparatus of claim 1, wherein the write control module is further configured to:
counting the input data received each time when the lowest dimension of the input data dimension and the output data dimension is changed;
performing a circular right shift on currently received input data, wherein the number of bits shifted is the current count value multiplied by B/N, N being the number of storage blocks in the data cache module and B being the bit width of the input data;
and performing modulo operation on the depth of each storage block in the data cache module according to the current count value to generate a write address for each storage block, and accordingly writing the processed data into each storage block of the data cache module.
9. The apparatus of claim 3, wherein the read control module is further configured to generate the read addresses for the respective memory blocks in the data cache module according to the following formula according to the indication of the control module when the lowest dimension of the input data dimension and the output data dimension changes:
read_address[i] = (i + read_cnt) % (B/M/b), i ∈ [0, N)
wherein read_cnt represents a count value, starting from 0, of the amount of data read out; B is the bit width of the output data, M is the number of output channels, b is the bit width of a single data element, i is an integer starting from 0, and N is the number of storage blocks.
10. A processor for a neural network, comprising the dimension transformation apparatus of any one of claims 1-9, for data transfer between an on-chip cache and an off-chip memory of the processor.
CN202210335890.6A 2022-03-31 2022-03-31 Dimension transformation device friendly to on-chip cache and neural network processor Pending CN114840470A (en)

Publications (1)

Publication Number Publication Date
CN114840470A true CN114840470A (en) 2022-08-02



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination