WO2024012492A1 - Artificial intelligence chip, method, device, and medium for flexibly accessing data - Google Patents

Artificial intelligence chip, method, device, and medium for flexibly accessing data

Info

Publication number
WO2024012492A1
WO2024012492A1 · PCT/CN2023/107010 · CN2023107010W
Authority
WO
WIPO (PCT)
Prior art keywords
read
loop
address
tensor
layer
Application number
PCT/CN2023/107010
Other languages
English (en)
French (fr)
Inventor
施云峰
周俊
王剑
Original Assignee
北京有竹居网络技术有限公司
Application filed by 北京有竹居网络技术有限公司
Publication of WO2024012492A1


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/78Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7839Architectures of general purpose stored program computers comprising a single central processing unit with memory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/16Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/177Initialisation or configuration control
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/78Architectures of general purpose stored program computers comprising a single central processing unit

Definitions

  • The present application relates to the field of artificial intelligence and, more specifically, to an artificial intelligence processor chip for flexibly accessing data, a method for flexibly accessing data in an artificial intelligence processor chip, an electronic device, and a non-transitory storage medium.
  • In an artificial intelligence chip, the design of the data path is also critical. The input and output data of the computing unit can be provided and stored in various ways, and these choices determine different chip storage-compute architectures.
  • Take the graphics processing unit (GPU) as an example. Its storage hierarchy consists of the L1 cache, shared memory, the register file, the L2 cache, and external DRAM. The main purpose of dividing storage into these levels is to reduce data-transfer latency and improve bandwidth.
  • The L1 cache is generally split into an L1D cache and an L1I cache, which store data and instructions respectively. An L1 cache is provided for each processor core, with sizes ranging from 16 KB to 64 KB.
  • The L2 cache is often a per-core private cache and is not split between data and instructions; a corresponding L2 cache is usually provided for each processor core, with sizes ranging from 256 KB to 1 MB.
  • The L1 cache is the fastest but has the least space; the L2 cache is slower but larger; external DRAM has the most space but is the slowest. Therefore, keeping frequently accessed DRAM data in the L1 cache reduces the latency of transferring data from external DRAM on every access and improves data-processing efficiency.
  • Some AI chips achieve high efficiency through customized data paths, but the price is a loss of flexibility: once the network structure is modified, the chip risks becoming unusable. Other AI chips address bandwidth and latency problems by adding large on-chip buffers, but access to the static random access memory (SRAM) is initiated by hardware, meaning that computation and storage are hardware-coupled. This makes the access strategy inflexible, so efficiency drops in certain scenarios and software cannot intervene.
  • An artificial intelligence processor chip for flexibly accessing data includes: a memory configured to store tensor data read in from outside the processor chip, the read-in tensor data comprising a plurality of elements for performing tensor operations of operators included in artificial intelligence calculations; a storage control unit configured to control, according to the tensor operations of the operators, the reading of elements from the memory to be sent to a computing unit, wherein the storage control unit includes an address calculation module, the address calculation module has an interface for receiving parameters configured through software, and the address calculation module calculates addresses in the memory in a one-level read loop or in multi-level read loop nesting according to the configured parameters received through the interface, so as to read elements from the calculated addresses and send them to the computing unit; and the computing unit, configured to perform the tensor operations of the operators with the received elements.
  • A method for flexibly accessing data in an artificial intelligence processor chip includes: storing, by a memory in the artificial intelligence processor chip, tensor data read in from outside the processor chip, the read-in tensor data including a plurality of elements used to perform tensor operations of operators included in artificial intelligence calculations; controlling, according to the tensor operations of the operators, the reading of elements from the memory to be sent to a computing unit, including calculating addresses in the memory in a one-level read loop or in multi-level read loop nesting according to received software-configured parameters, so as to read elements from the calculated addresses and send them to the computing unit in the artificial intelligence processor chip; and performing, by the computing unit, the tensor operations of the operators with the received elements.
  • An electronic device includes: a memory for storing instructions; and a processor for reading the instructions in the memory and executing the methods according to various embodiments of the present application.
  • A non-transitory storage medium has instructions stored thereon which, when executed, perform the methods according to various embodiments of the present application.
  • the address in the memory can be flexibly calculated through software-configured parameters to flexibly read the elements in the memory, without being limited to the order or address ordering in which these elements are stored in the memory.
  • Figure 1 shows an example diagram of a computational graph in a neural network applied to image data processing and recognition.
  • Figure 2 shows a schematic diagram of an artificial intelligence processor chip for flexibly accessing data according to an embodiment of the present application.
  • Figure 3 shows an exploded schematic diagram of an artificial intelligence processor chip for flexibly accessing data according to an embodiment of the present application.
  • Figure 4 shows an example of performing 2-layer loop nested reading on an input tensor according to an embodiment of the present application.
  • FIG. 5 shows a schematic diagram of calculating an address according to software-configured parameters according to an embodiment of the present application.
  • Figure 6 shows an example of incompletely aligned 3-layer loop nested reading of input tensors according to an embodiment of the present application.
  • FIG. 7 shows a schematic diagram of the internal structure of an SRAM according to an embodiment of the present application.
  • Figure 8 shows a flowchart of a method for flexibly accessing data in an artificial intelligence processor chip according to an embodiment of the present application.
  • FIG. 9 illustrates a block diagram of an exemplary electronic device suitable for implementing embodiments of the present application.
  • Figure 10 shows a schematic diagram of a non-transitory computer-readable storage medium according to an embodiment of the present disclosure.
  • A prompt message is sent to the user to clearly remind the user that the requested operation will require acquiring and using the user's personal information, so that the user can autonomously choose, based on the prompt, whether to provide personal information to software or hardware such as an electronic device, application, server, or storage medium that performs the operations of the technical solution of the present disclosure.
  • The prompt may be sent, for example, as a pop-up window, in which the prompt information is presented as text.
  • The pop-up window may also contain a selection control for the user to choose "agree" or "disagree" to providing personal information to the electronic device.
  • Current neural networks and machine learning systems use tensors as their basic data structure.
  • At its core, a tensor is a container for data, and the data it contains is almost always numerical; it is a container of numbers.
  • The specific values in a tensor can be application data, including image data, natural language data, and so on.
  • A scalar is a 0-dimensional tensor, such as 2, 3, or 5. For image data, each such value may represent, for example, the grayscale value of a pixel.
  • A vector is a 1-dimensional tensor, such as [0,3,20], and a matrix is a 2-dimensional tensor, such as [[2,3],[1,5]].
  • There can also be 3-dimensional tensors (for example, a tensor of shape (3,2,1): [[[1],[2]],[[3],[4]],[[5],[6]]]), 4-dimensional tensors, and so on.
  • These tensors can be used to represent data in specific application scenarios, such as image data, natural language data, etc.
  • The functions a neural network performs on these application data can include image recognition (for example, inputting image data and identifying what animal the image contains) and natural language recognition (for example, inputting the user's speech and identifying the user's intention, such as whether the user is asking to open a music player).
  • The recognition process in the above application scenarios can be realized by the neural network receiving the various input application data as tensors and computing on them.
  • The calculation of a neural network can consist of a series of tensor operations, which can be complex geometric transformations of input data given as tensors of several dimensions.
  • These tensor operations are called operators, and the neural network's calculations can be converted into a computation graph containing multiple operators, connected by lines that represent the dependencies between their calculations.
  • AI chips are chips dedicated to neural network operations, specially designed to accelerate the execution of neural networks. A neural network can be expressed in purely mathematical formulas, and according to these formulas it can be represented by a computational graph model, which is a visual representation of those formulas. The computational graph model splits a compound operation into multiple sub-operations, each of which is called an operator (Op).
  • The calculation of neural networks generates a large amount of intermediate data. If it is stored in dynamic random access memory (DRAM), overall performance is low due to excessive latency and insufficient bandwidth.
  • This problem can be alleviated by adding an L2 cache.
  • The advantage is that the cache is transparent to programming, so it does not affect programming and can reduce latency.
  • However, the cache miss rate may be high, depending on the access addresses and access timing seen by the L2 cache.
  • The on-chip SRAM of the artificial intelligence chip is used to store the data required and generated during neural network calculations.
  • The flow of data can be actively controlled by software, and the time of data transfer between SRAM and DRAM can be hidden through pre-configuration. Since the data access patterns of neural network calculations are quite flexible, if the SRAM access is not flexible enough, some operators must be completed in multiple computation passes.
  • Figure 1 shows an example diagram of a computational graph in a neural network applied to image data processing and recognition.
  • A tensor carrying image data (such as the chrominance values of pixels) is input into an example computational graph as shown in Figure 1.
  • This computational graph shows only some of the operators, for the reader's convenience.
  • The operation of this graph is to first pass the tensor through the Transpose operator; one branch is then calculated through the Reshape operator, and the other branch through the Fully connected operator.
  • The Transpose operator is a tensor operation that does not change the values in the input tensor.
  • Its function is to change the order of the dimensions (axes) of an array. For example, for a two-dimensional array, interchanging the order of the two dimensions is a matrix transpose.
  • The Transpose operator can also be applied to higher dimensions.
  • Its input parameter is the dimension order of the output array, with sequence numbers starting from 0.
  • The input tensor of the Transpose operator is, for example, the two-dimensional matrix [[1,2,3],[4,5,6],[7,8,9],[10,11,12]]; that is, the image data is a 4*3 two-dimensional matrix.
  • transpose([[1,2,3],[4,5,6],[7,8,9],[10,11,12]]) means transposing this two-dimensional matrix, which becomes [[1,4,7,10],[2,5,8,11],[3,6,9,12]], a 3*4 matrix.
  • Although the Transpose operator changes the order of the dimensions, that is, the shape of the tensor, it does not change the values in the tensor: they are still 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12.
  • In some cases the Transpose operator does not change the shape of the tensor. For example, a 3*3 matrix is still a 3*3 matrix after transposition, and the values in the tensor are unchanged, but the arrangement order of the values in the transposed matrix differs.
  • The specific operation of the Reshape operator is to change the shape attribute of a tensor; it can rearrange an m*n matrix a into an i*j matrix b.
  • In Figure 1, the Reshape operator (Reshape(A, 2, 6), where A is the input tensor) changes the shape of the above tensor from 3*4 to 2*6, so the output tensor after the Reshape operator is, for example, [[1,4,7,10,2,5],[8,11,3,6,9,12]].
  • Although the Reshape operator also changes the shape of the tensor, it does not change the values in the tensor: they are still 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12.
  • The Fully connected operator (also known as the Full Connection operator) can be regarded as a special convolution layer, or as a product of tensors. It takes the entire input tensor as a feature map and performs feature extraction, that is, a linear transformation from one feature space to another; the output tensor is a weighted sum of the input tensor.
  • In Figure 1, the Fully connected operator multiplies the input tensor (the output tensor of the Transpose operator, the 3*4 matrix above) by a weight matrix x of size 4*1, for example [[40],[50],[60],[70]]: it reads the first row of the transposed matrix and multiplies it by the weight matrix x, then the second row, then the third row.
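  • As a minimal illustration of this step (a sketch, not the chip's implementation), the following C snippet computes the weighted sums row by row, assuming the 3*4 transposed matrix above and the weight values 40, 50, 60, 70 given later in the text:

      #include <stdio.h>

      /* Sketch: the Fully connected step as a matrix-vector product.
         A is the 3*4 output of the Transpose operator; x is the 4*1
         weight matrix whose values are given later in the text. */
      int main(void) {
          const int A[3][4] = {{1, 4, 7, 10}, {2, 5, 8, 11}, {3, 6, 9, 12}};
          const int x[4] = {40, 50, 60, 70};
          for (int i = 0; i < 3; i++) {      /* one row of A at a time */
              int y = 0;
              for (int k = 0; k < 4; k++)    /* weighted sum of the row */
                  y += A[i][k] * x[k];
              printf("y[%d] = %d\n", i, y);  /* 1360, 1580, 1800 */
          }
          return 0;
      }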
  • Suppose an artificial intelligence chip is to perform the calculation from the Transpose operator to the Fully connected operator in the computation graph of Figure 1. The chip must first read the input tensor of the Transpose operator into the memory of the storage unit within the chip.
  • The input tensor is assumed to be [[1,2,3],[4,5,6],[7,8,9],[10,11,12]]. Storage into the memory of the storage unit in the chip is generally continuous, that is, it is stored as 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12. The computing unit in the chip then runs the operation of the Transpose operator (for example, converting 4*3 to 3*4), producing [[1,4,7,10],[2,5,8,11],[3,6,9,12]], and the chip stores this resulting tensor back into the memory of the storage unit within the chip as intermediate data.
  • This storage is again generally continuous, that is, the data is stored as 1, 4, 7, 10, 2, 5, 8, 11, 3, 6, 9, 12. At this point, the operation of the Transpose operator is complete.
  • Next, for the Fully connected operator, the input tensor (the output tensor of the Transpose operator) must be multiplied by the weight matrix x of size 4*1, for example [[40],[50],[60],[70]].
  • The chip therefore reads in the weight matrix x, which is likewise stored continuously in the memory of the storage unit within the chip, that is, as 40, 50, 60, 70.
  • The computing unit reads the values of the output tensor of the Transpose operator from the memory of the storage unit in storage order, reads the corresponding values of the weight matrix x in storage order, and performs the multiply-accumulate.
  • The computing unit of the chip then stores the calculated results into the memory of the storage unit within the chip.
  • The various hardware units of the chip must thus cooperate in a process of reading, calculating, storing, re-reading, re-calculating, and storing again; in this way, the computational efficiency of the entire process is very low, and so is its flexibility.
  • In view of this, the present disclosure proposes a method for an artificial intelligence processor chip to flexibly access data, which can use software configuration and related parameters of the artificial intelligence processor chip to replace certain operator operations with the chip's read operations, flexibly access data, and improve calculation efficiency.
  • Figure 2 shows a schematic diagram of an artificial intelligence processor chip for flexibly accessing data according to an embodiment of the present application.
  • As shown in Figure 2, an artificial intelligence processor chip 200 for flexibly accessing data includes: a memory 201 configured to store tensor data read in from outside the processor chip 200, the read-in tensor data including multiple elements used to perform tensor operations of operators included in artificial intelligence calculations; a storage control unit 202 configured to control, according to the tensor operations of the operators, the reading of elements from the memory to be sent to a computing unit 203, wherein the storage control unit 202 includes an address calculation module 2021, the address calculation module 2021 has an interface for receiving parameters configured through software, and the address calculation module 2021 calculates addresses in the memory in a one-level read loop or in multi-level read loop nesting according to the configured parameters, so as to read elements from the calculated addresses and send them to the computing unit 203; and the computing unit 203, configured to perform the tensor operations of the operators with the received elements.
  • In this way, an address calculation module 2021 is provided in the storage control unit 202; it has an interface for receiving parameters configured through software, and according to those parameters it can calculate addresses in the memory in a one-level read loop or in multi-level read loop nesting, so as to read elements from the calculated addresses and send them to the computing unit 203.
  • Addresses in the memory can thus be calculated and read flexibly through software-configured parameters. That is, the flexibly calculated addresses can differ from the storage order: the user can configure a new reading order in the memory through the parameters, rather than having to read elements in the order in which they are stored, as in the prior art.
  • the address in the memory can be flexibly calculated through software-configured parameters to flexibly read the elements in the memory, without being limited to the order or address ordering of storing these elements in the memory.
  • As mentioned above, although the Transpose operator changes the order of the dimensions, it does not change the elements in the tensor: they are still 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12. Thus, if addresses in the memory can be calculated in a one-level read loop or in multi-level read loop nesting according to configured parameters so as to read elements from the calculated addresses and send them to the computing unit 203, the user can configure a new reading order in software through parameters, so that the stored 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 are read out in exactly the order produced by the Transpose operator.
  • As above, the input tensor is assumed to be [[1,2,3],[4,5,6],[7,8,9],[10,11,12]], and storage into the memory of the memory unit in the chip is generally continuous, that is, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 are stored at, for example, addresses 00000001, 00000002, 00000003, 00000004, 00000005, 00000006, 00000007, 00000008, 00000009, 00000010, 00000011, 00000012. The user can then configure a new reading sequence in software with parameters, so that the address calculation module 2021 calculates the addresses in the memory according to the configured parameters.
  • The order of the addresses calculated according to the configured parameters is: 00000001, 00000004, 00000007, 00000010, 00000002, 00000005, 00000008, 00000011, 00000003, 00000006, 00000009, 00000012; that is, the transpose operation of the Transpose operator can be replaced directly by the order in which addresses are read.
  • In other words, the software-configured parameters indicate that the tensor operation of an operator is replaced by reading elements from addresses in the memory according to that tensor operation; in this example, the parameters indicate that the tensor operation of the Transpose operator is replaced by reading elements from addresses in the memory according to the Transpose operator's tensor operation.
  • In one embodiment, the parameters configured through software include: a value indicating how many addresses separate the address of the first element read in the first step of each layer's read loop from the initial address of the input tensor in the memory, a value indicating how many steps are performed in a read loop, and a value indicating the step size between steps within a read loop. Note that the step size here is similar to the stride in neural network concepts.
  • That is, the parameters configured through software need only specify the starting address, how many elements are read in total, and how many addresses are spaced between elements on each read; only 3 parameters are needed.
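  • As a minimal sketch (with a hypothetical struct and field names, not the chip's actual register layout), these three software-configured parameters of a one-level read loop could be expressed in C as:

      /* Hypothetical representation of the three configured parameters
         for a one-level read loop. */
      struct read_loop_cfg {
          unsigned int base_offset; /* addresses between the first read and the tensor's initial address */
          unsigned int steps;       /* how many reading steps the loop performs */
          unsigned int stride;      /* address step size between consecutive reads */
      };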
  • Take the Gather operator as another example. The Gather operator selects some values from the input tensor as the output tensor; it is thus a tensor operation that does not change the values in the input tensor. Here the input tensor is [1,2,3,4,5,6,7,8], and assume the Gather operation selects [1,3,5,7] from [1,2,3,4,5,6,7,8].
  • In the prior art, the chip first stores the input tensor [1, 2, 3, 4, 5, 6, 7, 8] continuously in memory at, for example, addresses 00000001, 00000002, 00000003, 00000004, 00000005, 00000006, 00000007, 00000008; the computing unit of the chip then performs the Gather operation to obtain [1,3,5,7] and stores it at, for example, addresses 00000009, 00000010, 00000011, 00000012; the chip then performs further operator operations on this result [1,3,5,7] through the computing unit.
  • With embodiments of the present application, the address calculation module can instead directly calculate the addresses of 1, 3, 5, and 7 in the input tensor [1, 2, 3, 4, 5, 6, 7, 8] from the software-configured parameters, and read out [1,3,5,7] from those addresses so that the computing unit can perform further operator operations on the result [1,3,5,7].
  • Specifically, the software-configured parameters may include: a value of 4 indicating how many reading steps are performed in the one-level read loop (4 reads in total), and a value of 2 indicating the step size between steps within the one-level read loop (2 addresses are added after each read before the next read).
  • The address calculation module then calculates that the addresses to read, in order, are 00000001, 00000003, 00000005, 00000007 (starting from address 00000001 and adding 2 addresses before each subsequent read, 4 reads in total), and the computing unit correspondingly reads the elements stored at 00000001, 00000003, 00000005, and 00000007, that is, 1, 3, 5, and 7, in the calculated address order.
  • In this way, the software-configured parameters let the elements stored at 00000001, 00000003, 00000005, and 00000007 (that is, 1, 3, 5, and 7) be read in the calculated address order, directly replacing the operation of the Gather operator and saving the time and hardware cost of computing the Gather operation in the prior art, of storing the result tensor of the Gather operation, and of reading the individual elements back from the addresses storing that result tensor.
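  • A minimal C sketch of how such a configured one-level read loop could generate the Gather-replacing address sequence (the names are illustrative, not the chip's API):

      #include <stdio.h>

      int main(void) {
          unsigned int base_address = 1; /* address of the first element */
          unsigned int steps = 4;        /* configured: 4 reads in total */
          unsigned int stride = 2;       /* configured: +2 addresses per read */
          for (unsigned int cnt = 0; cnt < steps; cnt++)
              printf("%08u\n", base_address + cnt * stride);
          /* prints 00000001, 00000003, 00000005, 00000007 */
          return 0;
      }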
  • In one embodiment, the parameters configured through software may include: a value indicating how many reading steps are performed in each layer's read loop, and a value indicating the step size between steps within each layer's read loop, where the read loops of the layers proceed in a nested manner from the outer layer to the inner layer.
  • The parameters configured through software may also include: a value indicating how many addresses separate the address of the first element read in the first step of each layer's read loop from the initial address of the input tensor in the memory. This makes the reading method more flexible.
  • Loop nesting means that each step of the outer loop triggers a complete execution of the inner loop: the outer loop executes one step, the inner loop runs to completion, then the outer loop enters its second step and the inner loop runs again.
  • The outer and inner loops above are an example of 2-level loop nesting. For example, when reading a two-dimensional matrix, the outer loop can control which column is read, and the inner loop can control which row value within that column is read. In C, multi-level for-loop nesting statements can be used to execute multi-level read loop nesting.
  • As above, the input tensor is assumed to be [[1,2,3],[4,5,6],[7,8,9],[10,11,12]], and storage into the memory of the memory unit in the chip is generally continuous, that is, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 are stored at, for example, addresses 00000001, 00000002, 00000003, 00000004, 00000005, 00000006, 00000007, 00000008, 00000009, 00000010, 00000011, 00000012.
  • A 2-level read loop can be set up. Assume the first-level read loop is the inner loop and the second-level read loop is the outer loop; that is, in the present disclosure, for multi-level loop nesting, the higher the level number, the further out the loop, and the lower the level number, the further in.
  • The parameters configured through software can then include: a value indicating how many steps each layer's read loop performs (the second-level (outer) read loop has 3 steps, traversing the 3 columns of the tensor; the first-level (inner) read loop has 4 steps, traversing all rows in a column), and a value indicating the step size between steps in each layer's read loop (the step size between steps of the second-level read loop is 1, so its first step starts reading from 00000001 and its second step starts from 00000001 plus 1 address, i.e. 00000002; the step size between steps of the first-level read loop is 3, so within a column the first read is at 00000001 and the second is at 00000001 plus 3 addresses, i.e. 00000004), where the read loops of the layers proceed in a nested manner from the outer layer to the inner layer.
  • Here loop_1 denotes the second-level read loop and loop_0 the first-level read loop; loop_1_cnt_max is 3 and loop_0_cnt_max is 4.
  • With these parameters, the address calculation module does not simply read the storage addresses 00000001, 00000002, 00000003, 00000004, 00000005, 00000006, 00000007, 00000008, 00000009, 00000010, 00000011, 00000012 in their stored order; instead, the calculation proceeds as follows.
  • First, the compiler calculates the initial address of the read, in this example the initial address 00000001 at which the input tensor is stored in the memory.
  • The second-level read loop starts reading from 00000001. All steps of the first-level read loop are executed within the first step of the second-level read loop: in the 4 steps of the first-level read loop, addresses are read 4 times with a step size of 3. The address calculation module thus calculates the read addresses 00000001, 00000004, 00000007, 00000010, so the elements read in the first step of the second-level read loop are 1, 4, 7, and 10, stored at addresses 00000001, 00000004, 00000007, and 00000010 respectively.
  • The step size between the initial address of the second step of the second-level read loop and that of the first step is 1, so this pass starts reading from address 00000001 plus 1, i.e. 00000002. All steps of the first-level read loop are again executed within the second step of the second-level read loop: 4 reads with a step size of 3, giving addresses 00000002, 00000005, 00000008, 00000011, so the elements read in the second step of the second-level read loop are 2, 5, 8, and 11.
  • Likewise, the third step of the second-level read loop starts from 00000002 plus 1, i.e. 00000003, and the 4 reads of the first-level read loop with step size 3 give addresses 00000003, 00000006, 00000009, 00000012, so the elements read in the third step of the second-level read loop are 3, 6, 9, and 12.
  • Therefore, the order of the addresses calculated by the address calculation module is 00000001, 00000004, 00000007, 00000010, 00000002, 00000005, 00000008, 00000011, 00000003, 00000006, 00000009, 00000012, and reading from these addresses in sequence yields the elements 1, 4, 7, 10, 2, 5, 8, 11, 3, 6, 9, 12.
  • The number of steps of the second-level read loop is 3, so after the address calculation module has executed the three second-level steps, together with the first-level address calculations within each, it can stop.
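  • The following C sketch reproduces this 2-level read loop with the configured parameters (loop_1: 3 steps, stride 1; loop_0: 4 steps, stride 3); the addresses are printed in the document's zero-padded notation:

      #include <stdio.h>

      int main(void) {
          unsigned int base_address = 1;                   /* where element 1 is stored */
          unsigned int loop_1_cnt_max = 3, jump1_addr = 1; /* outer: 3 columns */
          unsigned int loop_0_cnt_max = 4, jump0_addr = 3; /* inner: 4 rows */
          for (unsigned int loop_1_cnt = 0; loop_1_cnt < loop_1_cnt_max; loop_1_cnt++)
              for (unsigned int loop_0_cnt = 0; loop_0_cnt < loop_0_cnt_max; loop_0_cnt++)
                  printf("%08u\n", base_address + loop_1_cnt * jump1_addr
                                                + loop_0_cnt * jump0_addr);
          /* prints 00000001, 00000004, 00000007, 00000010, 00000002, ..., 00000012 */
          return 0;
      }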
  • The case of only one loop can also be configured through the software configuration parameters: the number of steps of the second-level read loop is simply set to 1, so that all steps of the first-level read loop execute only once.
  • In this way, using software configuration parameters and the address calculation module to calculate addresses can directly replace various tensor operations, and setting up more than one level of loop-nested address calculation makes the read addresses calculated by the address calculation module more flexible: they are not limited to the sequential addresses at which the values of the input tensor are stored. Replacing the operator's operation directly saves the time and hardware cost of computing the operator's operation in the prior art, of storing the operator's result tensor, and of reading the individual elements back from the addresses storing that result tensor, thereby reducing computing delays, reducing hardware computing costs, and improving the operating efficiency of artificial intelligence chips.
  • In one embodiment, the replaced tensor operation may be a tensor operation that does not change the values in the input tensor, where the address calculation module calculates, in a one-level read loop or in multi-level read loop nesting according to the configured parameters received through the interface, the memory address to be read on each read, allowing reads to replace the tensor operation.
  • Such tensor operations may include the following: the operation of the transpose operator, the operation of the reshape operator, the operation of the broadcast operator, the operation of the gather operator, the operation of the reverse operator, the operation of the concat operator, the operation of the cast operator, and so on.
  • Besides these operator operations, many other types of operator operations can also be implemented flexibly through the address calculation module and software configuration parameters of the embodiments of the present application.
  • Figure 3 shows an exploded schematic diagram of an artificial intelligence processor chip for flexibly accessing data according to an embodiment of the present application.
  • FIG. 3 shows the internal units and parameters used in the processing engine (Process Engine, PE) 300 of the artificial intelligence processor chip.
  • The processing engine 300 may include a configuration unit 301, a computing unit 302, and a storage unit 303.
  • The configuration unit 301 is used to configure the computing unit 302 and the storage unit 303; the computing unit 302 is mainly used for convolution, matrix calculation, vector calculation, and the like; the storage unit 303 includes an on-chip SRAM memory 3031 (sized, for example, at 8 MB, though this is not a limitation) and a memory access control module 3032, used for the interaction between data internal and external to the processing engine 300 and for the data accesses of the computing unit 302 within the processing engine 300.
  • the processing engine 300 also includes a storage control unit 304.
  • the storage control unit 304 implements the following specific functions:
  • The sram_read function is used to read data from SRAM 3031 and send it to the computing unit 302, which performs calculations on the read data, such as tensor operations like convolution, matrix calculation, or vector calculation;
  • the sram_write function is used to obtain calculation result data from the computing unit 302 and write it into SRAM 3031;
  • the sram_upload function is used to transfer data stored in SRAM 3031 to the outside of the processing engine 300 (for example, to other processing engines or to DRAM);
  • the sram_download function is used to bring data from outside the processing engine 300 (from other processing engines or DRAM) into SRAM 3031.
  • The sram_upload and sram_download functions handle data interaction with devices outside the processing engine 300, while the sram_read and sram_write functions handle data interaction between the computing unit 302 and the storage unit 303 inside the processing engine 300.
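  • Purely illustrative C-style prototypes for these four functions are sketched below; the actual chip configures them in hardware, and these signatures are assumptions for illustration only:

      typedef unsigned int sram_addr_t;

      /* read elements from SRAM 3031 and stream them to the computing unit 302 */
      void sram_read(sram_addr_t src, unsigned int num_elements);
      /* write calculation results from the computing unit 302 back into SRAM 3031 */
      void sram_write(sram_addr_t dst, unsigned int num_elements);
      /* move data from SRAM 3031 out of the processing engine (other PEs or DRAM) */
      void sram_upload(sram_addr_t src, void *dst_outside, unsigned int num_bytes);
      /* bring data from outside the processing engine into SRAM 3031 */
      void sram_download(const void *src_outside, sram_addr_t dst, unsigned int num_bytes);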
  • SRAM 3031 is a shared memory within the processing engine 300 (its size is not limited to 8 MB), mainly used to store intermediate data within the processing engine 300, including data to be calculated and calculation results. SRAM 3031 can be divided into multiple channels (banks) to increase the overall data bandwidth.
  • The full crossbar interconnection path (crossbar) 3041 is a full interconnection structure between the internal storage-control access interfaces of the processing engine 300 and the SRAM's multiple channels.
  • Together with the memory access control module 3032 of the storage unit 303, the full crossbar interconnection path 3041 controls reading elements from the addresses, calculated by the address calculation module 3042, in the SRAM memory 3031 of the storage unit 303.
  • The computing path (computing unit 302) and the data path (storage unit 303) of the entire processing engine 300 are configured separately. Completing one operator calculation requires multiple modules to cooperate. For example, for one convolution calculation, sram_read is configured to feed feature data to the computing unit 302, sram_read is configured to feed weights to the computing unit 302, the computing unit 302 performs the matrix convolution calculation, and sram_write is configured to output the calculation results to the SRAM memory 3031 in the storage unit 303. In this way, the choice of calculation method becomes more flexible.
  • The storage control unit 304 is configured to control, according to the tensor operation of the operator, the reading of data from the memory 3031 to be sent to the computing unit. The storage control unit 304 includes an address calculation module 3042, which has an interface for receiving the software-configured parameter data_noc; according to the configured parameters, the address calculation module 3042 calculates addresses in the memory 3031 in a one-level read loop or in multi-level read loop nesting, so as to read elements from the calculated addresses and send them to the computing unit 302.
  • In one embodiment, the parameters configured through software include: a value indicating how many reading steps to perform in a one-level read loop, and a value indicating the step size between steps within the one-level read loop.
  • In one embodiment, the parameters configured through software include: a value indicating how many reading steps to perform in each level's read loop, and a value indicating the step size between steps within each level's read loop, where the read loops of the levels proceed in a nested manner from the outer layer to the inner layer.
  • The above setting of multiple nested read loops makes it possible to read the addresses storing the same tensor, or the same segment of addresses, multiple times, thereby realizing various complex address calculations and gaining greater flexibility in read addressing.
  • For example, with 8 levels of read loops (Loop7 outermost down to Loop0 innermost), the loops run like this: the Loop7-level loop runs a total of loop_7_cnt_max steps; in each step of the Loop7-level loop, the Loop6-level loop runs a total of loop_6_cnt_max steps; in each step of the Loop6-level loop, the Loop5-level loop runs a total of loop_5_cnt_max steps; in each step of the Loop5-level loop, the Loop4-level loop runs a total of loop_4_cnt_max steps; in each step of the Loop4-level loop, the Loop3-level loop runs a total of loop_3_cnt_max steps; in each step of the Loop3-level loop, the Loop2-level loop runs a total of loop_2_cnt_max steps; in each step of the Loop2-level loop, the Loop1-level loop runs a total of loop_1_cnt_max steps; and in each step of the Loop1-level loop, the Loop0-level loop runs a total of loop_0_cnt_max steps.
  • In this way, the lowest Loop0-level loop runs a total of loop_0_cnt_max*loop_1_cnt_max*loop_2_cnt_max*loop_3_cnt_max*loop_4_cnt_max*loop_5_cnt_max*loop_6_cnt_max*loop_7_cnt_max steps, while the Loop1-level loop above it runs a total of loop_1_cnt_max*loop_2_cnt_max*loop_3_cnt_max*loop_4_cnt_max*loop_5_cnt_max*loop_6_cnt_max*loop_7_cnt_max steps.
  • the highest-level Loop7 layer loop runs a total of loop_7_cnt_max steps.
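  • A generic C sketch of this 8-level nested read loop, driven by counters rather than 8 literal for-statements (unused levels can simply be configured with a step count of 1); the helper names are illustrative, not the chip's API:

      #include <stdio.h>

      #define NUM_LOOPS 8 /* Loop0 (innermost) .. Loop7 (outermost) */

      /* loop_cnt_max[i] is loop_i_cnt_max; jump_addr[i] is loop i's stride. */
      static void run_read_loops(unsigned int base_address,
                                 const unsigned int loop_cnt_max[NUM_LOOPS],
                                 const unsigned int jump_addr[NUM_LOOPS]) {
          unsigned int cnt[NUM_LOOPS] = {0};
          for (;;) {
              unsigned int addr = base_address;
              for (int i = 0; i < NUM_LOOPS; i++)
                  addr += cnt[i] * jump_addr[i]; /* sum the per-loop offsets */
              printf("%08u\n", addr);            /* one read per innermost step */
              /* advance the counters like an odometer: Loop0 runs fastest */
              int i = 0;
              while (i < NUM_LOOPS && ++cnt[i] == loop_cnt_max[i])
                  cnt[i++] = 0;
              if (i == NUM_LOOPS)
                  break; /* Loop7 has completed loop_7_cnt_max steps */
          }
      }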
  • Figure 4 shows an example of performing 2-level loop-nested reading on an input tensor according to an embodiment of the present application.
  • The numbers in the boxes in Figure 4 represent the SRAM addresses storing the elements of the input tensor (for convenience of explanation, the address of each storage location is written directly as a number corresponding to the element itself).
  • The read (access) addresses to be achieved through the software configuration parameters are the gray blocks in Figure 4, and the reading sequence is 0-2-4-6-9-11-13-15.
  • base_address is the starting address, which can be pre-calculated by the compiler; it is usually the initial address of the segment in SRAM where the input tensor is stored (i.e., the location of the first element, in this example address 0).
  • One parameter, loop_0_cnt_max = 4, is configured through software, meaning the first-level (inner) read loop has 4 steps, or a loop size of 4. Another parameter, jump0_addr = 2, is configured through software, meaning the step size between steps of the first-level read loop is 2.
  • Each read loop has a corresponding counter loop_xx_cnt (xx indicates which read loop); for example, loop_0_cnt increases from 1 to 4, and loop_1_cnt increases from 1 to 2.
  • Figure 5 shows a schematic diagram of calculating the address from the software-configured parameters according to an embodiment of the present application (sram_addr represents the addresses 0-15 stored in the SRAM). The address calculation module calculates the address as follows: for example, the arrow in Figure 5 starts at address 9 (the base_address initial address 0 plus an offset of 9) and runs the first step of the first-level (inner) read loop loop_0 (marked 1_0 in Figure 5, 4 steps in total), reading out element 9 from address 9.
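  • A C sketch of this Figure 4/5 configuration: loop_0 (inner) has 4 steps of stride 2, and loop_1 (outer) has 2 steps; the outer stride is assumed here to be 9 so that the second outer step starts at base_address 0 plus 9:

      #include <stdio.h>

      int main(void) {
          unsigned int base_address = 0;
          unsigned int loop_1_cnt_max = 2, jump1_addr = 9; /* assumed outer stride */
          unsigned int loop_0_cnt_max = 4, jump0_addr = 2;
          for (unsigned int loop_1_cnt = 0; loop_1_cnt < loop_1_cnt_max; loop_1_cnt++)
              for (unsigned int loop_0_cnt = 0; loop_0_cnt < loop_0_cnt_max; loop_0_cnt++)
                  printf("%u ", base_address + loop_1_cnt * jump1_addr
                                             + loop_0_cnt * jump0_addr);
          /* prints: 0 2 4 6 9 11 13 15 */
          return 0;
      }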
  • In one embodiment, the parameters configured through software may also include: a value indicating how many addresses separate the address of the first element read by a level's read loop from the initial address of the input tensor in the memory. This allows the initial address of each level's read loop to be configured flexibly.
  • In some tensor operations, the order of read addresses may follow a first rule over one part of the address range and a second, different rule over another part; the two rules can correspond to 2 different loop nesting configurations.
  • To support this, the parameters configured through software may include conditions on the parameters, and a parameter takes different values depending on whether its condition is met.
  • In one embodiment, the parameter is a value indicating how many reading steps a specific level's read loop performs, and the condition is which step an outer read loop (outer relative to that specific level) is currently performing. For example, a configuration can be added to a specified read loop and bound to another read loop to handle misalignment: the Loop5 read loop is selected, and the Loop1 read loop has two different synchronized configurations, loop_1_cnt_max0 and loop_1_cnt_max1, linked to the Loop5 read loop.
  • Figure 6 shows an example of incompletely aligned 3-level loop-nested reading of an input tensor according to an embodiment of the present application.
  • The addresses at which the input tensor is stored in SRAM are as shown in Figure 6.
  • The numbers in the boxes in Figure 6 represent the SRAM addresses storing the elements of the input tensor (for convenience of explanation, the address of each storage location is written directly as a number corresponding to the element itself).
  • The read (access) addresses to be achieved through the software configuration parameters are the gray blocks in Figure 6, and the reading sequence is 0-8-1-9-2-10-3-11-4-12-16-17-18-19-20.
  • Three levels of loop nesting are used: Loop2, Loop1, and Loop0. The software configuration parameters can be: the top-level Loop2 has step count loop_2_cnt_max = 2 and step size jump2_addr = 16; the next level Loop1 has step count loop_1_cnt_max = 5 and step size jump1_addr = 1; and the bottom-level Loop0 has step count loop_0_cnt_max_0 = 2 and step size jump0_addr = 8. Loop2 and Loop0 are then configured and bound, with the condition that Loop0's step count depends on Loop2's current step (loop_0_cnt_max_0 = 2 in Loop2's first step and loop_0_cnt_max_1 = 1 in its second step, as described below).
  • In step 1 of Loop1, all steps of Loop0 run from address 0: 2 steps with a step size of 8, so 0 is read first and then, 8 addresses later, 8; these 2 steps read 0-8.
  • In step 2 of Loop1, the step size is 1, so all steps of Loop0 run from 1: 2 steps with a step size of 8, reading 1 and then 9; these 2 steps read 1-9.
  • In step 3 of Loop1, all steps of Loop0 run from 2: 2 steps with a step size of 8, reading 2 and then 10; these 2 steps read 2-10.
  • In step 4 of Loop1, all steps of Loop0 run from 3: 2 steps with a step size of 8, reading 3 and then 11; these 2 steps read 3-11.
  • In step 5 of Loop1, all steps of Loop0 run from 4: 2 steps with a step size of 8, reading 4 and then 12; these 2 steps read 4-12.
  • Then step 2 of the top-level Loop2 (2 steps in total) executes, adding the step size 16 to the initial address 0, so reading starts from address 16.
  • In step 1 of Loop1, all steps of Loop0 run starting from 16; under the bound condition only 1 step runs (loop_0_cnt_max_1 = 1), so the step size 8 is not used and only 16 is read. Steps 2 through 5 of Loop1 then read 17, 18, 19, and 20 in the same way.
  • In this way, the software-configured parameters enable three levels of read loop nesting to achieve a complex address reading sequence: 0-8-1-9-2-10-3-11-4-12-16-17-18-19-20.
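  • A C sketch of this Figure 6 configuration, where Loop0's step count is bound to Loop2's current step (loop_0_cnt_max_0 = 2 in Loop2's first step, loop_0_cnt_max_1 = 1 in its second):

      #include <stdio.h>

      int main(void) {
          unsigned int base_address = 0;
          unsigned int loop_2_cnt_max = 2, jump2_addr = 16;
          unsigned int loop_1_cnt_max = 5, jump1_addr = 1;
          unsigned int jump0_addr = 8;
          for (unsigned int c2 = 0; c2 < loop_2_cnt_max; c2++) {
              /* conditional parameter: Loop0's count depends on Loop2's step */
              unsigned int loop_0_cnt_max = (c2 == 0) ? 2 : 1;
              for (unsigned int c1 = 0; c1 < loop_1_cnt_max; c1++)
                  for (unsigned int c0 = 0; c0 < loop_0_cnt_max; c0++)
                      printf("%u ", base_address + c2 * jump2_addr
                                                 + c1 * jump1_addr
                                                 + c0 * jump0_addr);
          }
          /* prints: 0 8 1 9 2 10 3 11 4 12 16 17 18 19 20 */
          return 0;
      }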
  • In this way, addresses in the memory can be calculated flexibly so that elements are read from the calculated addresses and sent to the computing unit, implementing a flexible address reading method, increasing the computing efficiency of the artificial intelligence processor chip, and reducing its cost; in some cases this can also replace the specific tensor operation itself in artificial intelligence calculations, thereby simplifying the operator's operation.
  • In one embodiment, the address currently to be read is calculated based on the current step of each read loop (or of the single read loop) and the respective step sizes of the read loops.
  • Specifically, the address calculation unit can calculate the address of each read in a manner similar to determining the position of a point in a multi-dimensional (read-loop) spatial coordinate system: address = base_address + offset_address_dim.
  • base_address can be pre-calculated by the compiler and is usually the initial address of the segment in SRAM where the input tensor is stored (that is, the location of the first element); offset_address_dim is the sum of the address offsets of each dimension (read loop):
  • offset_address_dim = offset_addr_0 + offset_addr_1 + offset_addr_2 + offset_addr_3 + offset_addr_4 + offset_addr_5 + offset_addr_6 + offset_addr_7
  • When the actual address calculation module calculates the address, it only needs to know the current step of each read loop (or of the single read loop) and the respective step sizes in order to calculate the address currently to be read.
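  • In C, the address computation described above amounts to the following (a sketch; the function name and 8-loop array layout are assumptions for illustration):

      /* address of the current read = base address plus, for every read
         loop, (current step counter) * (that loop's stride) */
      unsigned int calc_sram_addr(unsigned int base_address,
                                  const unsigned int loop_cnt[8],    /* loop_0_cnt..loop_7_cnt */
                                  const unsigned int jump_addr[8]) { /* jump0_addr..jump7_addr */
          unsigned int offset_address_dim = 0;
          for (int i = 0; i < 8; i++)
              offset_address_dim += loop_cnt[i] * jump_addr[i]; /* offset_addr_i */
          return base_address + offset_address_dim;
      }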
  • FIG. 7 shows a schematic diagram of the internal structure of an SRAM according to an embodiment of the present application.
  • As shown in Figure 7, SRAM can also be divided into multiple channels (banks).
  • Data written from outside can be placed directly into different channels, and read data, intermediate calculation data, and result data can likewise each be placed directly into different channels.
  • The addressing method within SRAM is also configurable. By default, the highest bits of the address can be used to distinguish the different channels, and interleaving at other granularities can be performed through a configurable address hash. In the hardware design, a multi-bit channel selection signal bank_sel is ultimately generated, which selects a different SRAM channel sram_bank (sram_bank0-sram_bank3, etc.) for the data of each port (port0, port1, port2, port3, etc.).
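  • As a purely illustrative sketch of the default addressing (the address and bank widths here are assumptions, not taken from the document), using the highest address bits as the channel selector could look like:

      /* With 4 channels (sram_bank0..sram_bank3), take the top 2 bits of
         an assumed 16-bit SRAM address as the bank_sel signal. */
      #define ADDR_BITS 16u
      #define BANK_BITS 2u

      static unsigned int bank_sel(unsigned int sram_addr) {
          return (sram_addr >> (ADDR_BITS - BANK_BITS)) & ((1u << BANK_BITS) - 1u);
      }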
  • Multi-port access can use handshake signals, and the data path supports backpressure: when the inlet flow is greater than the outlet flow, or when the downstream stage is not ready while the current stage is transmitting data, the upstream stage must be backpressured and hold its data unchanged until the handshake succeeds and the data can be updated.
  • In one embodiment, the storage control unit 304 includes a full crossbar interconnection structure that can access multiple SRAM channels in parallel, with separate read and write functions, which is equivalent to a 2-level cascade of full crossbar interconnection paths and alleviates routing problems in the hardware implementation.
  • the underlying memory uses single-port SRAM to save power consumption and area.
  • In this way, the full crossbar interconnection structure can access, in parallel or simultaneously, the multiple channels storing read data, intermediate result data, and final result data, further accelerating reading and writing and improving the operating efficiency of the artificial intelligence chip.
  • In summary, using software configuration parameters and the address calculation module to calculate addresses can directly replace various tensor operations and, as described above, save the associated computation, storage, and read-back time and hardware costs of the prior art.
  • The software is fully in control, and flexibility is maximized.
  • Through the multi-level read-loop addressing mode, various complex address access patterns can be realized.
  • Through multi-level asymmetric looping, non-aligned configurations can be achieved, enabling even more complex address access patterns.
  • The addressing method between channels is maintained through software, ensuring the flexibility of data access.
  • Data transfer and calculation paths can hide each other's latency, achieving concurrency between different modules.
  • Different channels of the SRAM store different types of data, such as types of data that need to be accessed at the same time, so that such data can be read from different SRAM channels simultaneously in parallel, improving efficiency.
  • The multiply-accumulate (MAC) utilization of convolution calculations can reach almost 100%.
  • Figure 8 shows a flowchart of a method for flexibly accessing data in an artificial intelligence processor chip according to an embodiment of the present application.
  • As shown in Figure 8, a method 800 for flexibly accessing data in an artificial intelligence processor chip includes: step 801, storing, by the memory in the artificial intelligence processor chip, tensor data read in from outside the processor chip, the read-in tensor data including multiple elements used to perform tensor operations of operators included in artificial intelligence calculations; step 802, controlling, according to the tensor operations of the operators, the reading of elements from the memory to be sent to the computing unit, including calculating addresses in the memory in a one-level read loop or in multi-level read loop nesting according to received software-configured parameters, so as to read elements from the calculated addresses and send them to the computing unit in the artificial intelligence processor chip; and step 803, performing, by the computing unit, the tensor operations of the operators with the received elements.
  • the address in the memory can be flexibly calculated through software-configured parameters to flexibly read the elements in the memory, without being limited to the order or address ordering in which these elements are stored in the memory.
  • In one embodiment, the parameters configured through software include: a value indicating how many elements are to be read from the read-in tensor data in a one-level read loop, and a value indicating the step size between steps within the one-level read loop.
  • In one embodiment, the parameters configured through software include: a value indicating how many reading steps are performed in each level's read loop, and a value indicating the step size between steps within each level's read loop, where the read loops of the levels are nested from the outer layer to the inner layer.
  • In one embodiment, the parameters configured through software include: a value indicating how many addresses separate the address of the first element read by the one-level read loop from the initial address of the input tensor in the memory.
  • In one embodiment, the parameters configured by software include conditions on the parameters, and a parameter takes different values when its condition is met and when it is not.
  • In one embodiment, the parameter is a value indicating how many reading steps a specific level's read loop performs, and the condition is which step an outer read loop (outer relative to that specific level) is currently performing.
  • In one embodiment, the method 800 further includes: calculating the address currently to be read based on the current step of each read loop (or of the single read loop) and the respective step sizes of the read loops.
  • In one embodiment, the software-configured parameters indicate that the tensor operation of the operator is replaced by reading elements from addresses in the memory according to that tensor operation.
  • In one embodiment, the tensor operation is a tensor operation that does not change the values in the input tensor, where the memory address to be read on each read is calculated in a one-level read loop or in multi-level read loop nesting according to the configured parameters received through the interface, making the reads a replacement for the tensor operation.
  • In one embodiment, the tensor operations include at least one of the following: the operation of the transpose operator, the operation of the reshape operator, the operation of the broadcast operator, the operation of the gather operator, the operation of the reverse operator, the operation of the concat operator, and the operation of the cast operator.
  • In this way, the operations of certain tensor operators can be replaced directly, saving the prior art's time and hardware cost of computing the operations of those tensor operators, of storing the result tensors of those operations, and of reading each element back from the addresses where those result tensors are stored.
  • In one embodiment, the memory is divided into multiple channels for respectively storing data that can be accessed in parallel, and the method further includes accessing the data stored in the multiple channels of the memory in parallel through a full crossbar interconnect whose read and write functions are separated. A sketch of the software-configured parameter set follows.
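By way of illustration only (not language from the application), the software-configured parameter set consumed by the address calculation module might be modeled in C as below; the names mirror the loop_xx_cnt_max and jump_xx_addr register names used later in the description, and the layout is an assumption:

    /* Sketch of an assumed software-configured read-loop parameter set. */
    typedef struct {
        unsigned base_address;   /* initial address of the tensor in on-chip memory */
        int      levels;         /* 1 = single read loop; >1 = nested read loops    */
        int      cnt_max[8];     /* read steps per level, index 0 = innermost loop  */
        int      jump_addr[8];   /* stride between steps within each level          */
    } read_loop_cfg;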
  • FIG. 9 illustrates a block diagram of an exemplary electronic device suitable for implementing embodiments of the present application.
  • The electronic device may include a processor (H1) and a storage medium (H2) coupled to the processor (H1), the storage medium storing computer-executable instructions that, when executed by the processor, perform the steps of the various methods of the embodiments of the present application.
  • The processor (H1) may include, but is not limited to, one or more processors or microprocessors, for example.
  • The storage medium (H2) may include, but is not limited to, for example, random-access memory (RAM), read-only memory (ROM), flash memory, EPROM memory, EEPROM memory, registers, and computer storage media (such as hard disks, floppy disks, solid-state drives, removable disks, CD-ROMs, DVD-ROMs, Blu-ray discs, etc.).
  • The electronic device may also include a data bus (H3), an input/output (I/O) bus (H4), a display (H5), and input/output devices (H6) (e.g., keyboard, mouse, speakers, etc.).
  • The processor (H1) can communicate with external devices (H5, H6, etc.) through the I/O bus (H4) via a wired or wireless network (not shown).
  • The storage medium (H2) may also store at least one computer-executable instruction for performing, when run by the processor (H1), the various functions and/or method steps of the embodiments described in the present technology.
  • The at least one computer-executable instruction may also be compiled into, or constitute, a software product in which one or more computer-executable instructions, when run by a processor, perform the various functions and/or method steps of the embodiments described in the present technology.
  • Figure 10 shows a schematic diagram of a non-transitory computer-readable storage medium according to an embodiment of the present disclosure.
  • Computer-readable storage media include, for example but without limitation, volatile memory and/or non-volatile memory.
  • Volatile memory may include, for example, random-access memory (RAM) and/or cache memory (cache).
  • Non-volatile memory may include, for example, read-only memory (ROM), hard disks, flash memory, etc.
  • The computer-readable storage medium 1020 may be connected to a computing device such as a computer; when the computing device executes the computer-readable instructions 1010 stored on the computer-readable storage medium 1020, the various methods described above can be performed.
  • Item 1. An artificial intelligence processor chip for flexibly accessing data, including:
  • a memory configured to store read-in tensor data from outside the processor chip, the read-in tensor data including a plurality of elements for performing tensor operations of operators included in artificial intelligence computation;
  • a storage control unit configured to control, according to the tensor operation of the operator, the reading of elements from the memory to be sent to a computing unit, wherein the storage control unit includes an address calculation module having an interface for receiving software-configured parameters, and the address calculation module calculates addresses in the memory in a one-level read loop or in nested multi-level read loops according to the configured parameters received through the interface, so as to read elements from the calculated addresses to be sent to the computing unit; and
  • a computing unit configured to perform the tensor operation of the operator using the received elements.
  • Item 2. The processor chip according to item 1, wherein the software-configured parameters include: a value indicating how many elements of the read-in tensor data are read in a one-level read loop and a value indicating the stride between successive steps within the one-level read loop; or the software-configured parameters include: a value indicating how many read steps are performed in each level of read loop and a value indicating the stride between successive steps within each level of read loop, wherein the levels of read loops proceed in a nested manner from the outermost level to the innermost level.
  • Item 3. The processor chip according to item 1, wherein the software-configured parameters include: a value indicating how many addresses separate the address of the first element read by a level of read loop from the initial address of the input tensor in the memory.
  • Item 4. The processor chip according to item 1, wherein the software-configured parameters include a condition on a parameter, and wherein the parameter takes different values when the condition is met and when it is not met.
  • Item 5. The processor chip according to item 4, wherein the parameter is a value indicating how many read steps are performed in a specific level of read loop, and the condition is which step has been reached in another level of read loop that is outer than the specific level.
  • Item 6. The processor chip according to any one of items 2-5, wherein the address calculation module calculates the address currently to be read according to the current step of the one level or of each level of read loop and the respective strides of the one level or of each level of read loop.
  • Item 7. The processor chip according to item 1, wherein the software-configured parameters indicate that the tensor operation of the operator is replaced by a manner of reading elements from addresses in the memory according to the tensor operation of the operator.
  • Item 8. The processor chip according to item 7, wherein the tensor operation is a tensor operation that does not change the values in the input tensor, and wherein the address calculation module calculates, according to the configured parameters received through the interface, the address in the memory to be read on each read in a one-level read loop or in nested multi-level read loops, so that the reading replaces the tensor operation.
  • Item 9. The processor chip according to item 8, wherein the tensor operation includes at least one of the following: the operation of the transpose operator, the operation of the reshape operator, the operation of the broadcast operator, the operation of the gather operator, the operation of the reverse operator, the operation of the concat operator, and the operation of the cast operator, wherein the memory is divided into multiple channels for respectively storing data that can be accessed in parallel, and the storage control unit includes a full crossbar interconnect with separated read and write functions to access in parallel the data stored in the multiple channels of the memory.
  • Item 10. A method for flexibly accessing data in an artificial intelligence processor chip, including:
  • storing, by the memory in the artificial intelligence processor chip, read-in tensor data from outside the processor chip, the read-in tensor data including a plurality of elements for performing tensor operations of operators included in artificial intelligence computation;
  • controlling, according to the tensor operation of the operator, the reading of elements from the memory to be sent to a computing unit, including: calculating addresses in the memory in a one-level read loop or in nested multi-level read loops according to received software-configured parameters, so as to read elements from the calculated addresses to be sent to the computing unit in the artificial intelligence processor chip; and
  • performing, by the computing unit, the tensor operation of the operator using the received elements.
  • Item 11. The method according to item 10, wherein the software-configured parameters include: a value indicating how many elements of the read-in tensor data are read in a one-level read loop and a value indicating the stride between successive steps within the one-level read loop; or the software-configured parameters include: a value indicating how many read steps are performed in each level of read loop and a value indicating the stride between successive steps within each level of read loop, wherein the levels of read loops proceed in a nested manner from the outermost level to the innermost level.
  • Item 12. The method according to item 10, wherein the software-configured parameters include: a value indicating how many addresses separate the address of the first element read by a level of read loop from the initial address of the input tensor in the memory.
  • Item 13. The method according to item 10, wherein the software-configured parameters include a condition on a parameter, and wherein the parameter takes different values when the condition is met and when it is not met.
  • Item 14. The method according to item 13, wherein the parameter is a value indicating how many read steps are performed in a specific level of read loop, and the condition is which step has been reached in another level of read loop that is outer than the specific level.
  • Item 15. The method according to any one of items 11-14, further including: calculating the address currently to be read according to the current step of the one level or of each level of read loop and the respective strides of the one level or of each level of read loop.
  • Item 16. The method according to item 10, wherein the software-configured parameters indicate that the tensor operation of the operator is replaced by a manner of reading elements from addresses in the memory according to the tensor operation of the operator.
  • Item 17. The method according to item 16, wherein the tensor operation is a tensor operation that does not change the values in the input tensor, wherein the address in the memory to be read on each read is calculated in a one-level read loop or in nested multi-level read loops according to the configured parameters received through the interface, so that the reading replaces the tensor operation.
  • Item 18. The method according to item 17, wherein the tensor operation includes at least one of the following: the operation of the transpose operator, the operation of the reshape operator, the operation of the broadcast operator, the operation of the gather operator, the operation of the reverse operator, the operation of the concat operator, and the operation of the cast operator, wherein the memory is divided into multiple channels for respectively storing data that can be accessed in parallel, and the method further includes accessing in parallel, through a full crossbar interconnect with separated read and write functions, the data stored in the multiple channels of the memory.
  • Item 19. An electronic device, including:
  • a memory for storing instructions; and
  • a processor for reading the instructions in the memory and executing the method according to any one of items 10-18.
  • Item 20. A non-transitory storage medium having instructions stored thereon,
  • wherein the instructions, when read by a processor, cause the processor to perform the method according to any one of items 10-18.
  • Such means may include various hardware and/or software components and/or modules, including but not limited to hardware circuits, application-specific integrated circuits (ASICs), or processors.
  • The various illustrated logic blocks, modules, and circuits may be implemented or performed with a general-purpose processor, a digital signal processor (DSP), an ASIC, a field-programmable gate array (FPGA) or other programmable logic device (PLD), discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein.
  • A general-purpose processor may be a microprocessor, but in the alternative the processor may be any commercially available processor, controller, microcontroller, or state machine.
  • A processor may also be implemented as a combination of computing devices, such as a combination of a DSP and a microprocessor, multiple microprocessors, a microprocessor cooperating with a DSP core, or any other such configuration.
  • The steps of a method or algorithm described in connection with the present disclosure may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two.
  • Software modules may reside on any form of tangible storage medium. Some examples of usable storage media include random-access memory (RAM), read-only memory (ROM), flash memory, EPROM memory, EEPROM memory, registers, hard disks, removable disks, CD-ROMs, etc.
  • A storage medium can be coupled to the processor so that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor.
  • a software module may be a single instruction or many instructions, and may be distributed over several different code segments, between different programs, and across multiple storage media.
  • Storage media can be any available tangible media that can be accessed by a computer.
  • By way of example and not limitation, such computer-readable media may include RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other tangible medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
  • As used herein, disk and disc include compact discs (CDs), laser discs, optical discs, digital versatile discs (DVDs), floppy disks, and Blu-ray discs, where disks typically reproduce data magnetically, while discs reproduce data optically with lasers.
  • Such a computer program product can perform the operations given here.
  • a computer program product may be a computer-readable tangible medium having instructions tangibly stored (and/or encoded) thereon, the instructions executable by a processor to perform the operations described herein.
  • a computer program product may include packaging materials.
  • Software or instructions may also be transmitted over a transmission medium.
  • For example, software may be transmitted from a website, server, or other remote source using a transmission medium such as coaxial cable, fiber-optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, or microwave.
  • modules and/or other appropriate means for performing the methods and techniques described herein may be downloaded and/or otherwise obtained by user terminals and/or base stations as appropriate.
  • For example, such a device may be coupled to a server to facilitate the transfer of means for performing the methods described herein.
  • Alternatively, the various methods described herein may be provided via storage components (e.g., RAM, ROM, or physical storage media such as CDs or floppy disks) so that a user terminal and/or base station can obtain the various methods upon coupling the storage component to, or providing it to, the device.
  • Any other suitable technique for providing the methods and techniques described herein to a device may be utilized.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Complex Calculations (AREA)

Abstract

Provided are an artificial intelligence processor chip, a method for flexibly accessing data in an artificial intelligence processor chip, an electronic device, and a non-transitory storage medium. The chip includes: a memory configured to store read-in tensor data from outside the processor chip, the read-in tensor data including a plurality of elements for performing tensor operations of operators included in artificial intelligence computation; a storage control unit configured to control, according to the tensor operation of the operator, the reading of elements from the memory to be sent to a computing unit, the storage control unit including an address calculation module having an interface for receiving software-configured parameters, the address calculation module calculating addresses in the memory in a one-level read loop or in nested multi-level read loops according to the configured parameters received through the interface, so as to read elements from the calculated addresses to be sent to the computing unit; and a computing unit configured to perform the tensor operation of the operator using the received elements.

Description

Artificial intelligence chip, method for flexibly accessing data, device, and medium

This application claims priority to Chinese Patent Application No. 202210836577.0, filed on July 15, 2022; the entire disclosure of the above Chinese patent application is incorporated herein by reference as part of this application.

Technical Field

The present application relates to the field of artificial intelligence and, more specifically, to an artificial intelligence processor chip for flexibly accessing data, a method for flexibly accessing data in an artificial intelligence processor chip, an electronic device, and a non-transitory storage medium.

Background
For an artificial intelligence (AI) chip to achieve better performance, fully exploit its compute capability, and raise multiply-accumulate (MAC) utilization, the design of the data path is critical. During neural network computation, the input and output data of the computing unit can be supplied and stored in many ways, which leads to different chip storage-compute architectures. For example, the Graphics Processing Unit (GPU) is a parallel computing processor that adopts a multi-level cache system whose storage hierarchy consists of the L1 cache, shared memory, the register file, the L2 cache, and external DRAM. These storage levels exist mainly to reduce data-transfer latency and increase bandwidth. The L1 cache is generally split into an L1D cache and an L1I cache, used to store data and instructions respectively; an L1 cache of 16 KB to 64 KB is usually provided for each processor core. The L2 cache often serves as a private cache that does not distinguish instructions from data; an L2 cache of 256 KB to 1 MB is usually provided for each processor core. The L1 cache is the fastest but smallest, the L2 cache is slower but larger, and external DRAM is the largest but slowest. Storing frequently accessed data from DRAM in the L1 cache therefore reduces the latency of moving data from external DRAM on every access and improves the efficiency of data processing. However, because the processor structure must remain general and flexible, the data pipeline carries a certain amount of redundancy; for example, every computation starts by fetching operands from registers and ends by storing data back into registers, which costs considerable power.

Some AI chips achieve very high efficiency through customized data paths; the corresponding cost is a loss of flexibility, and once the network structure is modified there is a risk that the chip becomes unusable. Other AI chips solve the bandwidth and latency problems by adding a very large on-chip buffer, but the accesses to the static random-access memory (SRAM) are initiated by hardware; that is, computation and storage are hardware-coupled. This makes the access policy inflexible, so efficiency drops in certain scenarios and software cannot intervene.
Summary

According to one aspect of the present application, an artificial intelligence processor chip for flexibly accessing data is provided, including: a memory configured to store read-in tensor data from outside the processor chip, the read-in tensor data including a plurality of elements for performing tensor operations of operators included in artificial intelligence computation; a storage control unit configured to control, according to the tensor operation of the operator, the reading of elements from the memory to be sent to a computing unit, the storage control unit including an address calculation module having an interface for receiving software-configured parameters, the address calculation module calculating addresses in the memory in a one-level read loop or in nested multi-level read loops according to the configured parameters received through the interface, so as to read elements from the calculated addresses to be sent to the computing unit; and a computing unit configured to perform the tensor operation of the operator using the received elements.

In another aspect, a method for flexibly accessing data in an artificial intelligence processor chip is provided, including: storing, by the memory in the artificial intelligence processor chip, read-in tensor data from outside the processor chip, the read-in tensor data including a plurality of elements for performing tensor operations of operators included in artificial intelligence computation; controlling, according to the tensor operation of the operator, the reading of elements from the memory to be sent to a computing unit, including: calculating addresses in the memory in a one-level read loop or in nested multi-level read loops according to received software-configured parameters, so as to read elements from the calculated addresses to be sent to the computing unit in the artificial intelligence processor chip; and performing, by the computing unit, the tensor operation of the operator using the received elements.

In another aspect, an electronic device is provided, including: a memory for storing instructions; and a processor for reading the instructions in the memory and executing the methods of the various embodiments of the present application.

In another aspect, a non-transitory storage medium is provided, having instructions stored thereon, wherein the instructions, when read by a processor, cause the processor to execute the methods of the various embodiments of the present application.

In this way, addresses in the memory can be calculated flexibly through software-configured parameters so that elements in the memory can be read flexibly, without being limited to the order or address sequence in which those elements are stored in the memory.
Brief Description of the Drawings

To explain the technical solutions of the embodiments of the present disclosure or of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present disclosure, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Figure 1 shows an example of a computational graph in a neural network applied to image data processing and recognition.

Figure 2 shows a schematic diagram of an artificial intelligence processor chip for flexibly accessing data according to an embodiment of the present application.

Figure 3 shows an exploded schematic diagram of an artificial intelligence processor chip for flexibly accessing data according to an embodiment of the present application.

Figure 4 shows an example of reading an input tensor with two levels of nested loops according to an embodiment of the present application.

Figure 5 shows a schematic diagram of calculating addresses from software-configured parameters according to an embodiment of the present application.

Figure 6 shows an example of reading an input tensor with three levels of nested loops that are not fully aligned, according to an embodiment of the present application.

Figure 7 shows a schematic diagram of the internal structure of the SRAM according to an embodiment of the present application.

Figure 8 shows a flowchart of a method for flexibly accessing data in an artificial intelligence processor chip according to an embodiment of the present application.

Figure 9 shows a block diagram of an exemplary electronic device suitable for implementing embodiments of the present application.

Figure 10 shows a schematic diagram of a non-transitory computer-readable storage medium according to an embodiment of the present disclosure.
Detailed Description

Reference will now be made in detail to specific embodiments of the present application, examples of which are illustrated in the accompanying drawings. Although the application will be described in conjunction with specific embodiments, it will be understood that the application is not intended to be limited to the described embodiments. On the contrary, it is intended to cover alterations, modifications, and equivalents included within the spirit and scope of the application as defined by the appended claims. It should be noted that the method steps described here can all be implemented by any functional block or functional arrangement, and any functional block or functional arrangement can be implemented as a physical entity, a logical entity, or a combination of the two.

It should be understood that, before the technical solutions disclosed in the embodiments of the present disclosure are used, the user shall be informed, in an appropriate manner and in accordance with relevant laws and regulations, of the type, scope of use, and usage scenarios of the personal information involved in the present disclosure, and the user's authorization shall be obtained.

For example, in response to receiving an active request from a user, prompt information is sent to the user to explicitly remind the user that the requested operation will require acquiring and using the user's personal information, so that the user can autonomously choose, according to the prompt information, whether to provide personal information to the software or hardware, such as an electronic device, application, server, or storage medium, that performs the operations of the technical solution of the present disclosure.

As an optional but non-limiting implementation, in response to receiving an active request from the user, the prompt information may be sent to the user, for example, in the form of a pop-up window, in which the prompt information may be presented as text. In addition, the pop-up window may also carry a selection control for the user to choose "agree" or "disagree" to provide personal information to the electronic device.

It should be understood that the above process of notification and obtaining user authorization is only illustrative and does not limit the implementation of the present disclosure; other methods that satisfy relevant laws and regulations may also be applied to implementations of the present disclosure.

It should be understood that the data involved in this technical solution (including but not limited to the data itself and the acquisition or use of the data) shall comply with the requirements of the corresponding laws, regulations, and relevant provisions.
The recognition process for the above application scenarios can be implemented by a neural network receiving various application data as input tensors and performing the computation of the neural network. Current neural networks and machine learning systems use the tensor as the basic data structure. The core of the tensor concept is that it is a data container; the data it contains is almost always numerical, so it is a container of numbers. The specific values in a tensor can be application data, including, for example, image data, natural language data, and so on.

For example, a scalar is a 0-dimensional tensor, e.g., 2, 3, 5. In a specific application scenario such as image data, 2 may represent the grayscale value of one pixel in the image data, 3 the grayscale value of another pixel, and 5 the grayscale value of yet another pixel. A vector is a 1-dimensional tensor, e.g., [0, 3, 20], and a matrix is a 2-dimensional tensor, e.g., [[2,3],[1,5]]. There can also be 3-dimensional tensors (e.g., a with shape (3,2,1): [[[1],[2]],[[3],[4]],[[5],[6]]]), 4-dimensional tensors, and so on. All of these tensors can represent data in specific application scenarios, such as image data or natural language data. Neural network functions for such application data may include image recognition (e.g., inputting image data and recognizing what animal the image contains), natural language recognition (e.g., inputting the user's speech and recognizing the user's intent, such as whether the user is asking to open a music player), and so on.

The recognition process for the above application scenarios can be implemented by a neural network receiving various application data as input tensors and performing the neural network's computation. As explained above, the computation of a neural network can consist of a series of tensor operations, and these tensor operations can be complex geometric transformations of multi-dimensional input tensors. These tensor operations can be called operators; the computation of the neural network can be converted into a computational graph containing multiple operators, which can be connected by lines that represent the dependencies between the operators' computations.

An artificial intelligence (AI) chip is a chip dedicated to neural network computation, specially designed mainly to accelerate neural network execution. A neural network can be expressed in pure mathematical formulas; from these formulas, a computational-graph model can represent the network, the computational graph being a visual representation of those formulas. The computational-graph model can split a compound operation into multiple sub-operations, each of which is called an operator (Op).

Neural network computation produces a large amount of intermediate data. If all of it is stored in dynamic random-access memory (DRAM), overall performance suffers from excessive latency and insufficient bandwidth. Adding an L2 cache can alleviate this problem; the advantage is that it is invisible to programming, so it does not affect programming while reducing latency. However, the access addresses and access timing of the L2 cache lead to a relatively high cache-miss rate, and when locality is poor it is also hard to hide the data-access time.

Using the AI chip's on-chip SRAM to store the data required and produced by neural network computation allows software to actively control the flow of data and to hide the time of data transfer between SRAM and DRAM through pre-configuration. Because the data-access patterns of neural network computation are quite flexible, if SRAM access is not flexible enough, some operators will have to be completed in multiple computation passes.

If data computation and data transfer are fully coupled and these operations are initiated by hardware, the operator's computation pattern is fixed, leaving no room for software adjustment.

Therefore, a more flexible way of accessing the on-chip SRAM of an AI chip is still needed.
Figure 1 shows an example of a computational graph in a neural network applied to image data processing and recognition.

For example, a tensor carrying image data (e.g., pixel chroma values) is input into the example computational graph shown in Figure 1. The graph shows only some of the operators for readability. The computation proceeds by first passing the tensor through a Transpose operator; one branch then goes through a Reshape operator, and the other branch goes through a Fully connected operator.

Assume the tensor is first input to the Transpose operator, which is a tensor operation that does not change the values in the input tensor. The role of the Transpose operator is to change the order of the array's dimensions (axes). For a two-dimensional array, swapping the order of the two dimensions is matrix transposition; the Transpose operator also applies to cases with more dimensions. The input parameter of the Transpose operator is the dimension order of the output array, numbered from 0. Suppose the input tensor of the Transpose operator is the two-dimensional matrix [[1,2,3],[4,5,6],[7,8,9],[10,11,12]], i.e., the image data is a 4*3 two-dimensional matrix. Then transpose([[1,2,3],[4,5,6],[7,8,9],[10,11,12]]) transposes this two-dimensional matrix into [[1,4,7,10],[2,5,8,11],[3,6,9,12]], a 3*4 matrix.

It can be seen that although the Transpose operator changes the dimension order, i.e., the shape of the tensor, it does not change the values in the tensor: they are still 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12. Of course, the Transpose operator may also leave the shape unchanged; for example, a 3*3 matrix is still a 3*3 matrix after transposition, and the values in the tensor are unchanged, but the arrangement of the values inside the transposed matrix is different.

Then the tensor produced by the Transpose operator is split into two branches: one branch goes through the Reshape operator, and the other branch goes through the Fully connected operator.

The specific operation of the Reshape operator is to change the shape attribute of a tensor: it can rearrange an m*n matrix a into a matrix b of size i*j. For example, the Reshape operator (Reshape(A, 2, 6), where A is the input tensor) changes the shape of the above tensor from 3*4 to 2*6. Thus, after the Reshape operator, the output tensor is, for example, [[1,4,7,10,2,5],[8,11,3,6,9,12]] (assuming the conventional row-major element order).

It can be seen that the Reshape operator likewise changes the shape of the tensor but does not change the values in it: they are still 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12.

The Fully connected operator (also called the Full Connection operator) can be regarded as a special convolutional layer, or as a tensor product: it takes the whole input tensor as a feature map and performs feature extraction. That is, it is a linear transformation from one feature space to another, and the output tensor is a weighted sum of the input tensor. For example, the Fully connected operator multiplies the input tensor (the output tensor of the Transpose operator) by a 4*1 weight matrix x, e.g., [[40],[50],[60],[70]]. That is, the first row of the transposed matrix is multiplied by the weight matrix x, then the second row by the weight matrix x, then the third row by the weight matrix x. Specifically, 1 times 40 plus 4 times 50 plus 7 times 60 plus 10 times 70 gives the first value of the result tensor of the Fully connected operator; 2 times 40 plus 5 times 50 plus 8 times 60 plus 11 times 70 gives the second value; and 3 times 40 plus 6 times 50 plus 9 times 60 plus 12 times 70 gives the third value.

In the prior art, to perform the computation from the Transpose operator to the Fully connected operator in the computational graph of Figure 1 with an AI chip, the chip must first read the input tensor of the Transpose operator into the memory of the on-chip storage unit. This input tensor is generally stored contiguously, i.e., stored as 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12. The computing unit in the chip then runs the Transpose operation (e.g., converting 4*3 to 3*4) so that it becomes [[1,4,7,10],[2,5,8,11],[3,6,9,12]], and the chip stores the resulting tensor back into the memory of the on-chip storage unit as intermediate data. At this point it is again generally stored contiguously, i.e., stored as 1, 4, 7, 10, 2, 5, 8, 11, 3, 6, 9, 12. The Transpose operation is then complete.

The Fully connected operation is then executed: the input tensor (the output tensor of the Transpose operator) is matrix-multiplied by the 4*1 weight matrix x. First, the chip reads in the weight matrix x and stores it, generally contiguously, in the memory of the on-chip storage unit, i.e., stored as 40, 50, 60, 70. The computing unit then reads, in storage order, one value at a time of the Transpose operator's output tensor from the memory of the storage unit, reads the corresponding value of the weight matrix x in storage order, and multiply-accumulates them. Specifically, 1 times 40 plus 4 times 50 plus 7 times 60 plus 10 times 70 is the first value of the result tensor of the Fully connected operator, 2 times 40 plus 5 times 50 plus 8 times 60 plus 11 times 70 is the second value, and 3 times 40 plus 6 times 50 plus 9 times 60 plus 12 times 70 is the third value. The computing unit of the chip then stores the computed result into the memory of the on-chip storage unit.

In other words, the computation of the Transpose operator and of the subsequent Fully connected operator requires the chip's hardware units to cooperate in a process of reading, computing, storing, reading again, computing again, and storing again, and the computational efficiency of this whole process is very low, as is its flexibility.

The present disclosure proposes an artificial intelligence processor chip for flexibly accessing data, which can use the software configuration of the chip and related parameters to replace the operations of certain operators with the chip's flexible data-access reads, improving computational efficiency.
Figure 2 shows a schematic diagram of an artificial intelligence processor chip for flexibly accessing data according to an embodiment of the present application.

As shown in Figure 2, an artificial intelligence processor chip 200 for flexibly accessing data includes: a memory 201 configured to store read-in tensor data from outside the processor chip 200, the read-in tensor data including multiple elements for performing tensor operations of operators included in artificial intelligence computation; a storage control unit 202 configured to control, according to the tensor operation of the operator, the reading of elements from the memory to be sent to a computing unit 203, the storage control unit 202 including an address calculation module 2021 that has an interface for receiving software-configured parameters and that calculates addresses in the memory in a one-level read loop or in nested multi-level read loops according to the configured parameters, so as to read elements from the calculated addresses to be sent to the computing unit 203; and the computing unit 203, configured to perform the tensor operation of the operator using the received elements.

According to this embodiment, an address calculation module 2021 with an interface for receiving software-configured parameters is provided in the storage control unit 202; the address calculation module 2021 can calculate addresses in the memory in a one-level read loop or in nested multi-level read loops according to the configured parameters, so as to read elements from the calculated addresses to be sent to the computing unit 203. Addresses in the memory can thus be calculated flexibly for reading through software-configured parameters; that is, the flexibly calculated addresses can differ from the order in which the elements were stored, and the user configures a new read order in software through the parameters, instead of having to read the elements in storage order as in the prior art.

In this way, addresses in the memory can be calculated flexibly through software-configured parameters so that elements in the memory can be read flexibly, without being limited to the order or address sequence in which those elements are stored in the memory.

Returning to the example of Figure 1: although the Transpose operator changes the order of the dimensions, it does not change the elements in the tensor, which are still 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12. That is, if addresses in the memory can be calculated in a one-level read loop or in nested multi-level read loops according to the configured parameters so that elements are read from the calculated addresses and sent to the computing unit 203, the user can configure a new read order in software through parameters, so that the stored 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 are read in the order of the tensor produced by the Transpose operator.

Specifically, assume the input tensor is stored contiguously in the memory of the on-chip storage unit, i.e., 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 are stored (at, for example, the (hexadecimal) addresses 00000001, 00000002, 00000003, 00000004, 00000005, 00000006, 00000007, 00000008, 00000009, 00000010, 00000011, 00000012). The user can then configure a new read order in software so that the address calculation module 2021 calculates the addresses in the memory according to the configured parameters, for example in the order 00000001, 00000004, 00000007, 00000010, 00000002, 00000005, 00000008, 00000011, 00000003, 00000006, 00000009, 00000012; that is, the order of the read addresses can directly replace the transposition performed by the Transpose operator.

In this way, addresses in the memory can be calculated flexibly for reading through software-configured parameters; the flexibly calculated addresses can differ from the storage order, and the user configures the new read order in software through the parameters rather than reading the elements in storage order as in the prior art.

In one embodiment, the software-configured parameters indicate that the tensor operation of the operator is replaced by a manner of reading elements from addresses in the memory according to the tensor operation of the operator. In the example of Figure 1, the software-configured parameters indicate that the tensor operation of the Transpose operator is replaced by reading elements from addresses in the memory according to the Transpose operator's tensor operation.
In one embodiment, the software-configured parameters include: a value indicating how many addresses separate the address of the first element read in the first step of each level of read loop from the initial address of the input tensor in the memory, a value indicating how many read steps are performed in a read loop, and a value indicating the stride between successive steps within a level of read loop. Note that here the stride is similar to the stride concept in neural networks.

In this embodiment, if there is only one read loop, the software-configured parameters only need to give the starting address, how many elements are read in total, and how many addresses are skipped between elements on each read; that is, only three parameters are needed.

For example, suppose a Gather operation is to be performed. The Gather operator picks some values from among the values of the input tensor as the output tensor; it can be seen that the Gather operation is a tensor operation that does not change the values in the input tensor. Here the input tensor is [1,2,3,4,5,6,7,8], and suppose the Gather operation selects [1,3,5,7] from [1,2,3,4,5,6,7,8].

In the prior art, to complete the Gather operation, the chip must first store the input tensor [1,2,3,4,5,6,7,8] contiguously in memory at the (hexadecimal) addresses, e.g., 00000001, 00000002, 00000003, 00000004, 00000005, 00000006, 00000007, 00000008; the chip's computing unit then performs the Gather operation to obtain [1,3,5,7], which is stored at the addresses (hexadecimal), e.g., 00000009, 00000010, 00000011, 00000012. The chip then continues, through the computing unit, to carry out further operator computations on the result [1,3,5,7].

With this embodiment, however, the address calculation module can use the software-configured parameters to calculate the addresses of 1, 3, 5, 7 directly from the input tensor [1,2,3,4,5,6,7,8] and read out [1,3,5,7] from those addresses, so that the computing unit can carry out further operator computations on the result [1,3,5,7]. Specifically, the software-configured parameters can include: a value of 4 indicating how many read steps are performed in one read loop (four reads in total), and a stride value of 2 between successive steps within the read loop (after each read, advance 2 addresses before the next read).

From these parameters, the address calculation module can calculate that the addresses to read are, in order, 00000001, 00000003, 00000005, 00000007 (starting at address 00000001, advancing 2 addresses after each read, for 4 reads in total). The computing unit reads, in this calculated address order, what is stored at 00000001, 00000003, 00000005, 00000007, i.e., 1, 3, 5, 7.

In this way, using software-configured parameters to make the computing unit read, in the calculated address order, the contents of 00000001, 00000003, 00000005, 00000007, i.e., 1, 3, 5, 7, directly replaces the Gather operation, and saves the prior art's time and hardware cost of computing the Gather operation, of storing its result tensor, and of reading each element back from the addresses where the result tensor is stored.
For more complex and flexible forms of reading, there can be multiple levels of nested read loops. In one embodiment, the software-configured parameters can include: a value indicating how many read steps are performed in each level of read loop and a value indicating the stride between successive steps within each level of read loop, where the levels of read loops proceed in a nested manner from the outermost level to the innermost level. In one embodiment, the software-configured parameters can also include: a value indicating how many addresses separate the address of the first element read in the first step of each level of read loop from the initial address of the input tensor in the memory. This makes the manner of reading even more flexible.

In this embodiment there are multiple levels of nested read loops. Loop nesting means that the outer loop executes one step, the inner loop runs to completion, and only then does the outer loop proceed to its second step, after which the inner loop runs once again. The outer and inner loops above are an example of two-level loop nesting. For example, when reading a two-dimensional matrix, the outer loop can control which column is read, and the inner loop can control which row's value within that column is read. In the C language, multi-level nested for-loop statements can be used to execute multi-level nested read loops.

Continuing the example of Figure 1 as an example of two-level loop nesting: assume the input tensor is stored contiguously in the memory of the on-chip storage unit, i.e., 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 are stored (at, for example, the (hexadecimal) addresses 00000001, 00000002, 00000003, 00000004, 00000005, 00000006, 00000007, 00000008, 00000009, 00000010, 00000011, 00000012).

To replace this Transpose operation, i.e., to read out 1, 4, 7, 10, 2, 5, 8, 11, 3, 6, 9, 12 in sequence, two levels of read loop can be set up. Suppose the first-level read loop is the inner loop and the second-level read loop is the outer loop. That is, in the present disclosure, for multi-level loop nesting, the higher the level number, the more outer the loop, and the lower the level number, the more inner the loop.

The software-configured parameters can then include: values indicating how many read steps are performed in each level of read loop (the step count of the second-level (outer) read loop is 3, i.e., it traverses the tensor's 3 columns; the step count of the first-level (inner) read loop is 4, i.e., it traverses all rows within a column), and values indicating the stride between successive steps within each level (the stride between the steps of the second-level read loop is 1, i.e., the first step starts reading at 00000001 and the second step starts at 00000001 plus 1 address, which is 00000002; the stride between the steps of the first-level read loop is 3, i.e., the first step reads from 00000001 and the second step reads from 00000001 plus 3 addresses, which is 00000004), where the levels of read loops proceed in a nested manner from the outermost level to the innermost level.
Suppose the pseudocode of the two-level nested read loop is, for example:

For loop_1 from 1 to loop_1_cnt_max
    For loop_0 from 1 to loop_0_cnt_max

where loop_1 denotes the second-level read loop and loop_0 denotes the first-level read loop. With the parameter settings above, loop_1_cnt_max is 3 and loop_0_cnt_max is 4.
Next, the address calculation module reads, according to these software-configured parameters, from the addresses 00000001, 00000002, 00000003, 00000004, 00000005, 00000006, 00000007, 00000008, 00000009, 00000010, 00000011, 00000012 at which 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 are stored.

Specifically, according to these software-configured parameters, in the second-level read loop, suppose the compiler has calculated the initial read address; in this example it is 00000001, the initial address at which the input tensor is stored in memory, so the second-level read loop starts reading from 00000001. In the first step of the second-level read loop, all steps of the first-level read loop are executed. That is, in the 4 steps of the first-level read loop, 4 reads are performed with a stride of 3 addresses: in the first step of the second-level read loop, the address calculation module calculates the read addresses 00000001, 00000004, 00000007, 00000010, so the elements to be read in the first step of the second-level read loop are 1, 4, 7, 10, stored at the addresses 00000001, 00000004, 00000007, 00000010 respectively.

The stride between the initial address of the second step of the second-level read loop and that of the first step is 1, so this time reading starts from 00000001 plus 1 address, i.e., from address 00000002. In the second step of the second-level read loop, all steps of the first-level read loop are executed: in the 4 steps of the first-level read loop, 4 reads are performed with a stride of 3 addresses, and the address calculation module calculates the read addresses 00000002, 00000005, 00000008, 00000011, so the elements to be read in the second step of the second-level read loop are 2, 5, 8, 11, stored at those addresses.

The stride between the initial address of the third step of the second-level read loop and that of the second step is 1, so this time reading starts from 00000002 plus 1 address, i.e., from address 00000003. In the third step of the second-level read loop, all steps of the first-level read loop are executed: 4 reads with a stride of 3 addresses, giving the read addresses 00000003, 00000006, 00000009, 00000012, so the elements to be read in the third step of the second-level read loop are 3, 6, 9, 12, stored at those addresses.

Thus, according to these software-configured parameters, the order of the addresses calculated by the address calculation module is 00000001, 00000004, 00000007, 00000010, 00000002, 00000005, 00000008, 00000011, 00000003, 00000006, 00000009, 00000012, and the elements read from these addresses in sequence are 1, 4, 7, 10, 2, 5, 8, 11, 3, 6, 9, 12.

According to these software-configured parameters, the step count of the second-level read loop is 3, so the address calculation module can stop after executing the second-level read loop three times together with the address calculations of the first-level read loop within each second-level step. An illustrative sketch of this two-level configuration follows.
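Purely as an illustration (a sketch under the parameters just stated, not hardware code), the two-level nesting can be written in C; decimal addresses are printed for readability:

    #include <stdio.h>

    int main(void) {
        int base_address = 1;                        /* assumed: address of element 1 */
        int loop_1_cnt_max = 3, jump_1_addr = 1;     /* outer loop: 3 steps, stride 1 */
        int loop_0_cnt_max = 4, jump_0_addr = 3;     /* inner loop: 4 steps, stride 3 */

        /* Prints 1 4 7 10 2 5 8 11 3 6 9 12: reading the stored values at these
         * addresses yields the transposed order, replacing the Transpose operator. */
        for (int l1 = 0; l1 < loop_1_cnt_max; l1++)
            for (int l0 = 0; l0 < loop_0_cnt_max; l0++)
                printf("%d ", base_address + l1 * jump_1_addr + l0 * jump_0_addr);
        printf("\n");
        return 0;
    }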
In this embodiment, the case of only one loop pass can also be configured through the software-configured parameters, for example by setting the step count of the second-level read loop to 1, i.e., executing all steps of the first-level read loop only once.

In addition, the specific parameter values above are only examples, i.e., a number is used to directly represent its meaning; this is not a limitation. Other numeric values, or content other than numbers, can be used to represent the meaning; for example, 0 could represent a stride or step count of 1 (since chips usually count from 0), or A could represent a stride or step count of 1, as long as the chip can derive the corresponding meaning from the value.

In this way, software-configured parameters used with the address calculation module to calculate addresses can directly replace various tensor operations, and providing an address-calculation process with more than one level of loop nesting makes the read addresses calculated by the address calculation module more flexible, not limited to the sequential addresses at which the values of the input tensor are stored. This can directly replace the operator's computation and save the prior art's time and hardware cost of computing the operator's operation, of storing the operator's result tensor, and of reading each element back from the addresses of the stored result tensor, thereby reducing computation latency, reducing hardware computation cost, and improving the operating efficiency of the artificial intelligence chip.
In general, software-configured parameters used with the address calculation module to calculate addresses can not only produce more flexibly arranged address sequences but also replace various tensor operations. In one embodiment, the replaced tensor operation can be a tensor operation that does not change the values in the input tensor, where the address calculation module calculates, according to the configured parameters received through the interface, the address in the memory to be read on each read in a one-level read loop or in nested multi-level read loops, so that the reading replaces the tensor operation.

In one embodiment, the tensor operations can include the following: the operation of the transpose operator, the operation of the reshape operator, the operation of the broadcast operator, the operation of the gather operator, the operation of the reverse operator, the operation of the concat operator, the operation of the cast operator, and so on. Of course, the operations of many other types of operators can also be implemented flexibly through the address calculation module and software-configured parameters of the embodiments of the present application.

Of course, software-configured parameters used with the address calculation module are not limited to replacing specific tensor operations; they can calculate the addresses to be read in various flexible ways, realizing flexible reading beyond the inherent limitation of sequential reads imposed by the hardware itself.

A concrete chip-hardware example and parameter example are described below for application in practical scenarios.
Figure 3 shows an exploded schematic diagram of an artificial intelligence processor chip for flexibly accessing data according to an embodiment of the present application.

Figure 3 shows the units inside a Process Engine (PE) 300 of the artificial intelligence processor chip and the parameters used.

The processing engine 300 may include a configuration unit 301, a computing unit 302, and a storage unit 303, and is configured through the storage control unit 304 described below. The computing unit 302 is mainly used for convolution, matrix computation, vector computation, and the like; the storage unit 303 includes an on-chip SRAM memory 3031 (with a size of, for example, 8 MB, though this is not a limitation) and a memory-access control module 3032 for the exchange of data internal and external to the processing engine 300 and for the data accesses of the computing unit 302 within the processing engine 300.

The processing engine 300 also includes a storage control unit 304, which implements the following specific functions:

the sram_read function is used to read data from the SRAM 3031 and send it to the computing unit 302 to perform computation on the read data, e.g., tensor operations such as convolution, matrix computation, and vector computation;

the sram_write function is used to obtain computation-result data from the computing unit 302 and write and store it into the SRAM 3031;

the sram_upload function is used to move data stored in the SRAM 3031 to the outside of the processing engine 300 (e.g., to other processing engines or to DRAM);

the sram_download function is used to download data from outside the processing engine 300 (data from other processing engines or from DRAM) into the SRAM 3031.

That is, the sram_upload and sram_download functions are used for data exchange with devices outside the processing engine 300, while the sram_read and sram_write functions are used for data exchange between the computing unit 302 and the storage unit 303 inside the processing engine 300.

The SRAM 3031 is the shared memory within the processing engine 300; its size is not limited to 8 MB, and it is mainly used to store the intermediate data of the processing engine 300 (including the data to be computed and the results of computation). The SRAM 3031 can be partitioned into multiple channels (banks) to increase the overall data bandwidth.

The full crossbar interconnect 3041 is a fully connected structure between the storage-control access interfaces inside the processing engine 300 and the multiple banks of the SRAM. Together with the memory-access control module 3032 of the storage unit 303, the full crossbar interconnect 3041 controls reading elements from the address, calculated by the address calculation module 3042, in the SRAM memory 3031 of the storage unit 303.

It can be seen that the compute path (computing unit 302) and the data path (storage unit 303) of the whole processing engine 300 are configured separately. Completing one operator's computation requires several modules to cooperate. For example, for one convolution computation, sram_read is configured to feed feature data to the computing unit 302, sram_read is configured to feed weights to the computing unit 302, the computing unit 302 performs the matrix convolution computation, and sram_write is configured to output the computation result to the SRAM memory 3031 of the storage unit 303. In this way, the choice of computation scheme becomes more flexible.
Specifically, the storage control unit 304 is configured to control, according to the tensor operation of the operator, the reading of data from the memory 3031 to be sent to the computing unit. The storage control unit 304 includes an address calculation module 3042 that has an interface for receiving the software-configured parameters data_noc; the address calculation module 3042 calculates addresses in the memory 3031 in a one-level read loop or in nested multi-level read loops according to the configured parameters, so as to read elements from the calculated addresses to be sent to the computing unit 302.

Since most addressing in artificial intelligence computation is fairly regular (for example, matrix multiplication, fully connected, and convolution computations all read data from the addresses of the stored tensors in a regular pattern), various complex address calculations can be implemented by configuring parameters in software.

In one embodiment, if only one read loop is provided, the software-configured parameters include: a value indicating how many read steps are performed in the read loop and a value indicating the stride between successive steps within the one-level read loop.

In one embodiment, if multiple nested read loops are provided, the software-configured parameters include: a value indicating how many read steps are performed in each level of read loop and a value indicating the stride between successive steps within each level of read loop, where the levels of read loops proceed in a nested manner from the outermost level to the innermost level.

Providing multiple nested read loops as above makes it possible to read the addresses storing the same tensor, or the same range of addresses, multiple times, thereby implementing various complex address calculations and obtaining more flexibility in read addressing.
For example, with eight levels of loop nesting (from the outer loop to the inner loop: loop_7 down to loop_0), the pseudocode is as follows:

For loop_7 from 1 to loop_7_cnt_max
    For loop_6 from 1 to loop_6_cnt_max
        For loop_5 from 1 to loop_5_cnt_max
            For loop_4 from 1 to loop_4_cnt_max
                For loop_3 from 1 to loop_3_cnt_max
                    For loop_2 from 1 to loop_2_cnt_max
                        For loop_1 from 1 to loop_1_cnt_max
                            For loop_0 from 1 to loop_0_cnt_max

The following addressing-related parameters are configured in registers: loop_xx_cnt_max (where xx indicates which read loop), indicating how many read steps are performed in each level of read loop, and jump_xx_addr (where xx indicates which read loop), indicating the stride between successive steps within each level of read loop.

At run time, the above eight-level read loop operates as follows. First, the loop_7-level loop runs for a total of loop_7_cnt_max steps; within each step of the loop_7 loop, the loop_6 loop runs a total of loop_6_cnt_max steps; within each step of the loop_6 loop, the loop_5 loop runs a total of loop_5_cnt_max steps; within each step of the loop_5 loop, the loop_4 loop runs a total of loop_4_cnt_max steps; within each step of the loop_4 loop, the loop_3 loop runs a total of loop_3_cnt_max steps; within each step of the loop_3 loop, the loop_2 loop runs a total of loop_2_cnt_max steps; within each step of the loop_2 loop, the loop_1 loop runs a total of loop_1_cnt_max steps; and within each step of the loop_1 loop, the loop_0 loop runs a total of loop_0_cnt_max steps. It can be seen that the bottom loop_0 loop runs a total of loop_0_cnt_max*loop_1_cnt_max*loop_2_cnt_max*loop_3_cnt_max*loop_4_cnt_max*loop_5_cnt_max*loop_6_cnt_max*loop_7_cnt_max steps, while the loop_1 loop above it runs a total of loop_1_cnt_max*loop_2_cnt_max*loop_3_cnt_max*loop_4_cnt_max*loop_5_cnt_max*loop_6_cnt_max*loop_7_cnt_max steps; by analogy, the top loop_7 loop runs a total of loop_7_cnt_max steps.

It can be seen that, in the eight-level nested loop configured above, there is, viewed abstractly, a multiplicative relationship from the bottom level to the top level: the number of times a bottom level loops equals the product of the loop counts of all the levels above it (a short illustrative sketch of this relationship follows).
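As a brief illustration of this multiplicative relationship (the step counts below are assumed values, not from the application):

    #include <stdio.h>

    int main(void) {
        /* Assumed step counts for loop_0 .. loop_7. */
        long cnt_max[8] = {4, 3, 2, 2, 1, 1, 1, 1};
        long runs = 1;
        /* loop_k's body runs cnt_max[7] * cnt_max[6] * ... * cnt_max[k] times. */
        for (int k = 7; k >= 0; k--) {
            runs *= cnt_max[k];
            printf("loop_%d body runs %ld times in total\n", k, runs);
        }
        return 0;
    }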
Consider an example of reading a tensor in two-dimensional space with two levels of nested loops, as shown in Figure 4, which shows an example of reading an input tensor with two levels of nested loops according to an embodiment of the present application.

Assume the input tensor is as shown in Figure 4, where the numbers in the boxes are the SRAM addresses storing the elements of the input tensor (for ease of explanation, the address storing an element is written directly as the number corresponding to the element itself). Suppose the read (access) addresses to be achieved through software-configured parameters are the gray blocks in Figure 4, and the read order is 0-2-4-6-9-11-13-15.

Then base_address is the starting address, which can be precomputed by the compiler and is usually the initial address of the range of addresses at which the input tensor is stored in the SRAM (i.e., the location where the first element is stored; in this example, address 0). One parameter is configured in software as loop_0_cnt_max=4, indicating that the first-level (inner) read loop has 4 steps, i.e., a loop size of 4; another parameter jump0_addr=2 indicates that the stride between steps within the first-level read loop is 2; another parameter loop_1_cnt_max=2 indicates that the second-level (outer) read loop has 2 steps, i.e., a loop size of 2; and another parameter jump1_addr=9 indicates that the stride between steps within the second-level read loop is 9.

The pseudocode is as follows:

For loop_1 from 1 to 2
    For loop_0 from 1 to 4

Each read loop has a corresponding counter loop_xx_cnt (where xx indicates which read loop); for example, loop_0_cnt increments from 1 to 4, and loop_1_cnt increments from 1 to 2.
According to the parameters configured above, and with reference to Figure 5 (which shows the calculation of addresses from software-configured parameters according to an embodiment of the present application, where sram_addr denotes the addresses 0-15 stored in the SRAM), the address calculation module calculates the addresses as follows.

First, step 1 of the second-level (outer) read loop loop_1 runs (2 steps in total). Starting from the base_address initial address 0, step 1 of the first-level (inner) read loop loop_0 runs (0_0 in Figure 5, 4 steps in total), reading element 0 from address 0. Since the parameter jump0_addr=2 indicates a stride of 2 between steps within the first-level read loop, step 2 of loop_0 (0_1 in Figure 5) reads element 2 from address 2 (address 0+2), step 3 of loop_0 (0_2 in Figure 5) reads element 4 from address 4 (address 2+2), and step 4 of loop_0 (0_3 in Figure 5) reads element 6 from address 6 (address 4+2). The 4 steps of the first-level (inner) read loop loop_0 are then complete.

Next, step 2 of the second-level (outer) read loop loop_1 runs (2 steps in total). Since the parameter jump1_addr=9 indicates a stride of 9 between steps within the second-level read loop (the arrow in Figure 5), starting at address 9 obtained as base_address 0+9, step 1 of the first-level (inner) read loop loop_0 runs (1_0 in Figure 5, 4 steps in total), reading element 9 from address 9. With jump0_addr=2 indicating a stride of 2 within the first-level read loop, step 2 of loop_0 (1_1 in Figure 5) reads element 11 from address 11 (9+2), step 3 of loop_0 (1_2 in Figure 5) reads element 13 from address 13 (11+2), and step 4 of loop_0 (1_3 in Figure 5) reads element 15 from address 15 (13+2). The 4 steps of the first-level (inner) read loop loop_0 are then complete.

At this point, both steps of the second-level (outer) read loop loop_1 have been executed, and the address calculation and address reading of the address calculation module terminate. The elements are thus read in the order 0-2-4-6-9-11-13-15.

Of course, in one embodiment the software-configured parameters can further include: a value indicating how many addresses separate the address of the first element read by a level of read loop from the initial address of the input tensor in the memory. This also allows the initial address of each level of read loop to be configured flexibly.

It can be seen that, by configuring in software the parameters corresponding to the two-level nested read loop above, elements can be read flexibly from the addresses of the SRAM; a runnable sketch of this example follows.
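The following C sketch (illustrative only; it generalizes the two-level walk with the Figure 4 parameters assumed) reproduces the read order 0-2-4-6-9-11-13-15 and extends naturally to more loop levels:

    #include <stdio.h>

    #define LEVELS 2

    /* Per-level parameters, index 0 = innermost loop (loop_0). */
    static const int cnt_max[LEVELS]   = {4, 2};  /* loop_0_cnt_max=4, loop_1_cnt_max=2 */
    static const int jump_addr[LEVELS] = {2, 9};  /* jump0_addr=2,     jump1_addr=9     */

    /* Walks the nested read loops from the outermost level down,
     * emitting one SRAM address per innermost step. */
    static void walk(int level, int addr) {
        if (level < 0) { printf("%d ", addr); return; }
        for (int step = 0; step < cnt_max[level]; step++)
            walk(level - 1, addr + step * jump_addr[level]);
    }

    int main(void) {
        walk(LEVELS - 1, 0 /* base_address */);  /* prints 0 2 4 6 9 11 13 15 */
        printf("\n");
        return 0;
    }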
Similarly, a mechanism with more than two levels of nested read loops can be used; no limitation is placed on this here.

Suppose an eight-level nested loop is set up as above. Viewed abstractly, there is a multiplicative relationship from the bottom level to the top level: the number of times a bottom level loops equals the product of the loop counts of all the levels above it. This is a fully aligned scheme: every nested loop level is regular, so the read addresses also follow a fixed pattern.

For certain special scenarios, however, the reads may not be fully aligned; for example, the order of the read addresses may follow a first pattern over one range of addresses but a second pattern, different from the first, over another range of addresses, so that two different loop-nesting schemes exist. In this case, the software-configured parameters can include a condition on a parameter, and the parameter takes different values when the condition is met and when it is not met.

In one embodiment, the parameter is a value indicating how many read steps are performed in a specific level of read loop, and the condition is which step has been reached in another level of read loop that is outer than the specific level. For example, a configuration can be added to a designated read loop and bound to another read loop to handle the misaligned case. For example, for the loop_5 read loop, two different step-count configurations for the loop_1 read loop, loop_1_cnt_max0 and loop_1_cnt_max1, are provided and linked with the loop_5 read loop.

That is, the software-configured parameters can include a condition (loop_5_cnt == loop_5_cnt_max) on a parameter (loop_1_cnt_max, the number of read steps performed in the loop_1-level read loop), and the parameter loop_1_cnt_max takes the different values loop_1_cnt_max_1 and loop_1_cnt_max_0 when the condition is met (the loop_5-level read loop, which is outer than the loop_1 level, has reached its last step, i.e., loop_5_cnt == loop_5_cnt_max) and when it is not met (loop_5_cnt != loop_5_cnt_max).

That is, when execution reaches loop_5, loop_1 runs at least once within every step of loop_5. If loop_5 has not reached its last step, i.e., loop_5_cnt != loop_5_cnt_max, the number of loop_1 steps run within that loop_5 step is loop_1_cnt_max_0; if loop_5 has reached its last step, i.e., loop_5_cnt == loop_5_cnt_max, the number of loop_1 steps run within that loop_5 step is loop_1_cnt_max_1.

In this way, the calculation of addresses for reading can be made even more flexible, as the following sketch illustrates.
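By way of illustration only (the register names are those assumed above), choosing the loop_1 step count from the current step of loop_5 can be sketched in C as:

    #include <stdio.h>

    /* Returns the loop_1 step count for the current loop_5 step,
     * per the binding described above. */
    static int pick_loop_1_cnt_max(int loop_5_cnt, int loop_5_cnt_max,
                                   int loop_1_cnt_max_0, int loop_1_cnt_max_1) {
        return (loop_5_cnt == loop_5_cnt_max) ? loop_1_cnt_max_1
                                              : loop_1_cnt_max_0;
    }

    int main(void) {
        int loop_5_cnt_max = 3;
        for (int loop_5_cnt = 1; loop_5_cnt <= loop_5_cnt_max; loop_5_cnt++)
            printf("loop_5 step %d: loop_1 runs %d steps\n", loop_5_cnt,
                   pick_loop_1_cnt_max(loop_5_cnt, loop_5_cnt_max, 4, 2));
        return 0;
    }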
Consider an example of reading a tensor in two-dimensional space with three levels of nested loops in the not-fully-aligned situation above, as shown in Figure 6, which shows an example of reading an input tensor with three levels of nested loops that are not fully aligned according to an embodiment of the present application.

Assume the input tensor is stored in the SRAM at the addresses shown in Figure 6. The numbers in the boxes of Figure 6 are the SRAM addresses storing the elements of the input tensor (for ease of explanation, the address storing an element is written directly as the number corresponding to the element itself). Suppose the read (access) addresses to be achieved through software-configured parameters are the gray blocks in Figure 6, and the read order is 0-8-1-9-2-10-3-11-4-12-16-17-18-19-20.

It can be seen that reading 0-8-1-9-2-10-3-11-4-12 is a fully aligned read that follows one pattern, while reading 16-17-18-19-20 is not fully aligned with the way 0-8-1-9-2-10-3-11-4-12 is read and follows another pattern. Here, software-configured parameters are used to implement a not-fully-aligned nested-loop read.

Specifically, three levels of loop nesting (Loop2, Loop1, Loop0) are set up to calculate the addresses to read and realize the above read order.

The software-configured parameters can be: the step count loop_2_cnt_max of the top-level Loop2 is 2 and its stride jump2_addr is 16; the step count loop_1_cnt_max of the next level Loop1 is 5 and its stride jump1_addr is 1; the step count loop_0_cnt_max_0 of the bottom-level Loop0 is 2 and its stride jump0_addr is 8. Loop2 is then configured to be bound to Loop0, with the condition that when Loop2 reaches its last step (loop_2_cnt == loop_2_cnt_max), the step count of Loop0 changes from loop_0_cnt_max_0, i.e., 2, to loop_0_cnt_max_1, whose value is 1.

The reason the step count loop_2_cnt_max of the top-level Loop2 is set to 2 is that step 1 performs the reads in the order 0-8-1-9-2-10-3-11-4-12, while step 2 performs the reads in the order 16-17-18-19-20. Since the two steps execute different orders with different patterns, the lower-level read loops Loop1 and Loop0 must cooperate in step 2 to realize the different read order.
Specifically, running these three levels of nested loops with the software-configured parameters above, the calculated addresses are as follows.

First, step 1 of the top-level Loop2 executes (2 steps in total); within this step, the 5 steps of Loop1 run.

In step 1 of Loop1, all steps of Loop0 run, i.e., 2 steps starting from 0 with a stride of 8 per step: 0 is read first, then 8 addresses are added and 8 is read, so these 2 steps read 0-8. In step 2 of Loop1, the stride is 1, so all steps of Loop0 run starting from 1, i.e., 2 steps with a stride of 8: 1 is read first, then 9, so these 2 steps read 1-9. In step 3 of Loop1, Loop0 starts from 2 and likewise reads 2-10. In step 4 of Loop1, Loop0 starts from 3 and reads 3-11. In step 5 of Loop1, Loop0 starts from 4 and reads 4-12.
Then step 2 of the top-level Loop2 executes (2 steps in total), adding the stride 16 to the initial address 0, i.e., reading starts from 16.

At this point the condition loop_2_cnt == loop_2_cnt_max is satisfied. Therefore, the step count of Loop0 is loop_0_cnt_max_1, i.e., no longer 2 steps but 1 step. In this step 2, the 5 steps of Loop1 run, and within each step of Loop1 a Loop0 read loop with a step count of 1 runs.

Specifically, in step 1 of Loop1, all steps of Loop0 run, i.e., 1 step starting from 16, meaning only one read, so the stride 8 is not used; thus only 16 is read. In step 2 of Loop1, the stride is 1, so the single Loop0 step runs from 16+1=17, reading 17. In step 3 of Loop1, the single Loop0 step runs from 17+1=18, reading 18. In step 4 of Loop1, it runs from 18+1=19, reading 19. In step 5 of Loop1, it runs from 19+1=20, reading 20.

Thus, the software-configured parameters make the three-level nested read loop realize the complex address read order 0-8-1-9-2-10-3-11-4-12-16-17-18-19-20. An illustrative sketch follows.
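For illustration (a sketch under the Figure 6 parameters, not hardware code), the following C program reproduces that order, including the binding of Loop0's step count to Loop2's last step:

    #include <stdio.h>

    int main(void) {
        int base = 0;
        int loop_2_cnt_max = 2, jump2_addr = 16;
        int loop_1_cnt_max = 5, jump1_addr = 1;
        int loop_0_cnt_max_0 = 2, loop_0_cnt_max_1 = 1, jump0_addr = 8;

        for (int l2 = 1; l2 <= loop_2_cnt_max; l2++) {
            /* Binding: on Loop2's last step, Loop0 runs 1 step instead of 2. */
            int loop_0_cnt_max = (l2 == loop_2_cnt_max) ? loop_0_cnt_max_1
                                                        : loop_0_cnt_max_0;
            for (int l1 = 1; l1 <= loop_1_cnt_max; l1++)
                for (int l0 = 1; l0 <= loop_0_cnt_max; l0++)
                    printf("%d ", base + (l2 - 1) * jump2_addr
                                       + (l1 - 1) * jump1_addr
                                       + (l0 - 1) * jump0_addr);
        }
        printf("\n");   /* prints 0 8 1 9 2 10 3 11 4 12 16 17 18 19 20 */
        return 0;
    }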
Of course, the above only exemplifies setting the condition as which step has been reached in another level of read loop that is outer than a specific level, and then, depending on whether the condition is met, setting different numbers of read steps in that specific level of read loop. The present application is not limited to this; other conditions, and other changes to parameters when conditions are met, can be considered in order to realize more complex address read orders even more flexibly.

Therefore, the software-configured parameters of the present application allow addresses in the memory to be calculated flexibly so that elements are read from the calculated addresses and sent to the computing unit, realizing flexible address reading, improving the computational efficiency and cost-effectiveness of the artificial intelligence processor chip, and in some cases also replacing specific tensor operations in artificial intelligence computation themselves, thereby simplifying the computation of operators.
For the specific address calculation of nested read loops, in one embodiment, the address currently to be read is calculated according to the current step of the one level or of each level of read loop and the respective strides of the one level or of each level of read loop.

Specifically, for calculating which address in the SRAM is ultimately to be read, the address calculation unit can, from the software-configured parameters above, compute the address of each read, Address, in a manner similar to determining the position of a point in a multi-dimensional (read-loop) coordinate system:

Address = base_address + offset_address_dim;

base_address can be precomputed by the compiler and is usually the initial address of the range of addresses at which the input tensor is stored in the SRAM (i.e., the location where the first element is stored), while offset_address_dim is the sum of the address offsets of all dimensions (read loops):

offset_address_dim = offset_addr_0 + offset_addr_1 + offset_addr_2 + offset_addr_3 + offset_addr_4 + offset_addr_5 + offset_addr_6 + offset_addr_7

where offset_addr_xx is the offset of the initial address of the xx-th level read loop relative to the initial address of the next outer level of read loop (or, in the case of the outermost read loop, relative to base_address). For each dimension, offset_addr_xx = (loop_xx_counter - 1) * jump_xx_addr, where loop_xx_counter indicates the current step of the xx-th level read loop.

That is, when the address calculation module actually calculates an address, it only needs to know the current step of the one level or of each level of read loop and the respective strides of the one level or of each level of read loop in order to obtain the address that should currently be read.
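As an illustrative transcription of the formula above (a sketch, not the hardware implementation), the address of the current read can be computed in C as:

    /* Address = base_address + sum over all levels of (loop_counter - 1) * jump_addr,
     * where loop_counter[xx] is the 1-based current step of the xx-th read loop. */
    unsigned calc_address(unsigned base_address,
                          const int loop_counter[8],
                          const int jump_addr[8]) {
        unsigned address = base_address;
        for (int xx = 0; xx < 8; xx++)
            address += (unsigned)((loop_counter[xx] - 1) * jump_addr[xx]);
        return address;
    }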
Figure 7 shows a schematic diagram of the internal structure of the SRAM according to an embodiment of the present application.

In addition, the SRAM can be divided into multiple channels (banks). To make data writing and reading faster, data written in from outside can be placed directly into different banks, and read-in data, intermediate computation data, and result data can likewise be placed directly into different banks. The addressing scheme within the SRAM is also configurable: by default, the highest address bits can be used to distinguish the different banks, and interleaving at other granularities can be performed through a configurable address hash. In the hardware design, a multi-bit bank-select signal bank_sel is ultimately generated for the data of each port (port0, port1, port2, port3, etc.) to select a different SRAM bank sram_bank (sram_bank0-sram_bank3, etc.).
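A minimal sketch of the default scheme (assuming, purely for illustration, 4 banks selected by the top 2 bits of a 20-bit address; these widths are invented, not from the application):

    #define ADDR_BITS 20
    #define BANK_BITS 2

    /* Derives the multi-bit bank-select signal bank_sel from an SRAM address
     * using the highest address bits, per the default addressing described above. */
    unsigned bank_sel(unsigned sram_addr) {
        return (sram_addr >> (ADDR_BITS - BANK_BITS)) & ((1u << BANK_BITS) - 1);
    }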
Multi-port access can use handshake signals, and the data path supports backpressure (when ingress traffic exceeds egress traffic, backpressure is needed; or, when the next stage is not ready, if the current stage passes data onward it must backpressure the previous stage, so the previous stage must hold its data unchanged and can only update it after a successful handshake). The storage control unit 304 contains a full crossbar interconnect structure, with separated read and write functions, that can access the multiple banks of the SRAM in parallel; it amounts to a two-stage cascade of full crossbar interconnects, which alleviates the wiring problems of the hardware implementation. The underlying memory uses single-port SRAM to save power and area. The full crossbar interconnect structure can access, in parallel or simultaneously, the multiple banks that separately store the read-in data, the intermediate result data, and the final result data, further speeding up reading and writing and improving the operating efficiency of the artificial intelligence chip.

In this way, software-configured parameters used with the address calculation module to calculate addresses can directly replace various tensor operations, and providing an address-calculation process with more than one level of loop nesting makes the read addresses calculated by the address calculation module more flexible, not limited to the sequential addresses at which the values of the input tensor are stored. This can directly replace the operator's computation and save the prior art's time and hardware cost of computing the operator's operation, of storing the operator's result tensor, and of reading each element back from the addresses of the stored result tensor, thereby reducing computation latency, reducing hardware computation cost, and improving the operating efficiency of the artificial intelligence chip.

In summary, the separate configuration of the data path and the compute path achieves full software control and maximizes flexibility. Multi-level read-loop addressing can realize all kinds of complex address access patterns. Multi-level asymmetric loops allow non-aligned configurations, realizing still more complex address access patterns. The on-chip shared memory is split into multiple banks, and the addressing across banks is maintained in software. The separation of the data path and the compute path guarantees the flexibility of data access, while data transfer and computation can hide each other, achieving concurrency between different modules. Through sensible partitioning of intermediate data by the compiler, different banks of the SRAM store different types of data, for example different types of data that may be accessed simultaneously, so that when such data must be accessed at the same time it can be read from different SRAM banks in parallel to improve efficiency. Once computation starts, the multiply-accumulate (MAC) utilization of convolution computation can reach almost 100%.
Figure 8 shows a flowchart of a method for flexibly accessing data in an artificial intelligence processor chip according to an embodiment of the present application.

As shown in Figure 8, a method 800 for flexibly accessing data in an artificial intelligence processor chip includes: step 801, storing, by the memory in the artificial intelligence processor chip, read-in tensor data from outside the processor chip, the read-in tensor data including multiple elements for performing tensor operations of operators included in artificial intelligence computation; step 802, controlling, according to the tensor operation of the operator, the reading of elements from the memory to be sent to the computing unit, including: calculating addresses in the memory in a one-level read loop or in nested multi-level read loops according to received software-configured parameters, so as to read elements from the calculated addresses to be sent to the computing unit in the artificial intelligence processor chip; and step 803, performing, by the computing unit, the tensor operation of the operator using the received elements.

In this way, addresses in the memory can be calculated flexibly through software-configured parameters so that elements in the memory can be read flexibly, without being limited to the order or address sequence in which those elements are stored in the memory.
In one embodiment, the software-configured parameters include: a value indicating how many elements of the read-in tensor data are read in a one-level read loop, and a value indicating the stride between successive steps within the one-level read loop.

In one embodiment, the software-configured parameters include: a value indicating how many read steps are performed in each level of read loop, and a value indicating the stride between successive steps within each level of read loop, where the levels of read loops proceed in a nested manner from the outermost level to the innermost level.

In one embodiment, the software-configured parameters include: a value indicating how many addresses separate the address of the first element read by a level of read loop from the initial address of the input tensor in the memory.

In one embodiment, the software-configured parameters include a condition on a parameter, and the parameter takes different values when the condition is met and when it is not met.

In one embodiment, the parameter is a value indicating how many read steps are performed in a specific level of read loop, and the condition is which step has been reached in another level of read loop that is outer than the specific level.

In one embodiment, the method 800 further includes: calculating the address currently to be read according to the current step of the one level or of each level of read loop and the respective strides of the one level or of each level of read loop.

In this way, the calculation of addresses for reading can be made even more flexible.

In one embodiment, the software-configured parameters indicate that the tensor operation of the operator is replaced by a manner of reading elements from addresses in the memory according to the tensor operation of the operator.

In one embodiment, the tensor operation is a tensor operation that does not change the values in the input tensor, where the address in the memory to be read on each read is calculated in a one-level read loop or in nested multi-level read loops according to the configured parameters received through the interface, so that the reading replaces the tensor operation.

In one embodiment, the tensor operation includes at least one of the following: the operation of the transpose operator, the operation of the reshape operator, the operation of the broadcast operator, the operation of the gather operator, the operation of the reverse operator, the operation of the concat operator, and the operation of the cast operator.

In this way, using software-configured parameters to make the computing unit read, in the calculated address order, what is stored at the calculated addresses can directly replace the operations of certain tensor operators, and saves the prior art's time and hardware cost of computing the operations of those tensor operators, of storing the result tensors of those operations, and of reading each element back from the addresses where those result tensors are stored.

In one embodiment, the memory is divided into multiple channels for respectively storing data that can be accessed in parallel, and the method further includes accessing in parallel, through a full crossbar interconnect with separated read and write functions, the data stored in the multiple channels of the memory.

In this way, software-configured parameters and address calculation can directly replace various tensor operations, and providing an address-calculation process with more than one level of loop nesting makes the calculated read addresses more flexible, not limited to the sequential addresses at which the values of the input tensor are stored. This can directly replace the operator's computation and save the prior art's time and hardware cost of computing the operator's operation, of storing the operator's result tensor, and of reading each element back from the addresses of the stored result tensor, thereby reducing computation latency, reducing hardware computation cost, and improving the operating efficiency of the artificial intelligence chip.
Figure 9 shows a block diagram of an exemplary electronic device suitable for implementing embodiments of the present application.

The electronic device may include a processor (H1) and a storage medium (H2) coupled to the processor (H1), the storage medium storing computer-executable instructions for performing the steps of the various methods of the embodiments of the present application when executed by the processor.

The processor (H1) may include, but is not limited to, one or more processors or microprocessors, for example.

The storage medium (H2) may include, but is not limited to, for example, random-access memory (RAM), read-only memory (ROM), flash memory, EPROM memory, EEPROM memory, registers, and computer storage media (such as hard disks, floppy disks, solid-state drives, removable disks, CD-ROMs, DVD-ROMs, Blu-ray discs, etc.).

In addition, the electronic device may further include a data bus (H3), an input/output (I/O) bus (H4), a display (H5), and input/output devices (H6) (e.g., keyboard, mouse, speakers, etc.).

The processor (H1) can communicate with external devices (H5, H6, etc.) through the I/O bus (H4) via a wired or wireless network (not shown).

The storage medium (H2) may also store at least one computer-executable instruction for performing, when run by the processor (H1), the various functions and/or method steps of the embodiments described in the present technology.

In one embodiment, the at least one computer-executable instruction may also be compiled into, or constitute, a software product in which one or more computer-executable instructions, when run by a processor, perform the various functions and/or method steps of the embodiments described in the present technology.

Figure 10 shows a schematic diagram of a non-transitory computer-readable storage medium according to an embodiment of the present disclosure.

As shown in Figure 10, the computer-readable storage medium 1020 stores instructions, for example computer-readable instructions 1010. When the computer-readable instructions 1010 are run by a processor, the various methods described above can be performed. Computer-readable storage media include, for example but without limitation, volatile memory and/or non-volatile memory. Volatile memory may include, for example, random-access memory (RAM) and/or cache memory (cache). Non-volatile memory may include, for example, read-only memory (ROM), hard disks, flash memory, etc. For example, the computer-readable storage medium 1020 may be connected to a computing device such as a computer; when the computing device runs the computer-readable instructions 1010 stored on the computer-readable storage medium 1020, the various methods described above can be performed.
The present application provides the following items:

Item 1. An artificial intelligence processor chip for flexibly accessing data, including:

a memory configured to store read-in tensor data from outside the processor chip, the read-in tensor data including a plurality of elements for performing tensor operations of operators included in artificial intelligence computation;

a storage control unit configured to control, according to the tensor operation of the operator, the reading of elements from the memory to be sent to a computing unit, the storage control unit including an address calculation module having an interface for receiving software-configured parameters, the address calculation module calculating addresses in the memory in a one-level read loop or in nested multi-level read loops according to the configured parameters received through the interface, so as to read elements from the calculated addresses to be sent to the computing unit; and

a computing unit configured to perform the tensor operation of the operator using the received elements.

Item 2. The processor chip according to item 1, wherein the software-configured parameters include: a value indicating how many elements of the read-in tensor data are read in a one-level read loop and a value indicating the stride between successive steps within the one-level read loop; or the software-configured parameters include: a value indicating how many read steps are performed in each level of read loop and a value indicating the stride between successive steps within each level of read loop, wherein the levels of read loops proceed in a nested manner from the outermost level to the innermost level.

Item 3. The processor chip according to item 1, wherein the software-configured parameters include: a value indicating how many addresses separate the address of the first element read by a level of read loop from the initial address of the input tensor in the memory.

Item 4. The processor chip according to item 1, wherein the software-configured parameters include a condition on a parameter, and wherein the parameter takes different values when the condition is met and when it is not met.

Item 5. The processor chip according to item 4, wherein the parameter is a value indicating how many read steps are performed in a specific level of read loop, and the condition is which step has been reached in another level of read loop that is outer than the specific level.

Item 6. The processor chip according to any one of items 2-5, wherein the address calculation module calculates the address currently to be read according to the current step of the one level or of each level of read loop and the respective strides of the one level or of each level of read loop.

Item 7. The processor chip according to item 1, wherein the software-configured parameters indicate that the tensor operation of the operator is replaced by a manner of reading elements from addresses in the memory according to the tensor operation of the operator.

Item 8. The processor chip according to item 7, wherein the tensor operation is a tensor operation that does not change the values in the input tensor, and wherein the address calculation module calculates, according to the configured parameters received through the interface, the address in the memory to be read on each read in a one-level read loop or in nested multi-level read loops, so that the reading replaces the tensor operation.

Item 9. The processor chip according to item 8, wherein the tensor operation includes at least one of the following: the operation of the transpose operator, the operation of the reshape operator, the operation of the broadcast operator, the operation of the gather operator, the operation of the reverse operator, the operation of the concat operator, and the operation of the cast operator, wherein the memory is divided into multiple channels for respectively storing data that can be accessed in parallel, and the storage control unit includes a full crossbar interconnect with separated read and write functions to access in parallel the data stored in the multiple channels of the memory.
Item 10. A method for flexibly accessing data in an artificial intelligence processor chip, including:

storing, by the memory in the artificial intelligence processor chip, read-in tensor data from outside the processor chip, the read-in tensor data including a plurality of elements for performing tensor operations of operators included in artificial intelligence computation;

controlling, according to the tensor operation of the operator, the reading of elements from the memory to be sent to a computing unit, including: calculating addresses in the memory in a one-level read loop or in nested multi-level read loops according to received software-configured parameters, so as to read elements from the calculated addresses to be sent to the computing unit in the artificial intelligence processor chip; and

performing, by the computing unit, the tensor operation of the operator using the received elements.

Item 11. The method according to item 10, wherein the software-configured parameters include: a value indicating how many elements of the read-in tensor data are read in a one-level read loop and a value indicating the stride between successive steps within the one-level read loop; or the software-configured parameters include: a value indicating how many read steps are performed in each level of read loop and a value indicating the stride between successive steps within each level of read loop, wherein the levels of read loops proceed in a nested manner from the outermost level to the innermost level.

Item 12. The method according to item 10, wherein the software-configured parameters include: a value indicating how many addresses separate the address of the first element read by a level of read loop from the initial address of the input tensor in the memory.

Item 13. The method according to item 10, wherein the software-configured parameters include a condition on a parameter, and wherein the parameter takes different values when the condition is met and when it is not met.

Item 14. The method according to item 13, wherein the parameter is a value indicating how many read steps are performed in a specific level of read loop, and the condition is which step has been reached in another level of read loop that is outer than the specific level.

Item 15. The method according to any one of items 11-14, further including: calculating the address currently to be read according to the current step of the one level or of each level of read loop and the respective strides of the one level or of each level of read loop.

Item 16. The method according to item 10, wherein the software-configured parameters indicate that the tensor operation of the operator is replaced by a manner of reading elements from addresses in the memory according to the tensor operation of the operator.

Item 17. The method according to item 16, wherein the tensor operation is a tensor operation that does not change the values in the input tensor, wherein the address in the memory to be read on each read is calculated in a one-level read loop or in nested multi-level read loops according to the configured parameters received through the interface, so that the reading replaces the tensor operation.

Item 18. The method according to item 17, wherein the tensor operation includes at least one of the following: the operation of the transpose operator, the operation of the reshape operator, the operation of the broadcast operator, the operation of the gather operator, the operation of the reverse operator, the operation of the concat operator, and the operation of the cast operator, wherein the memory is divided into multiple channels for respectively storing data that can be accessed in parallel, and the method further includes accessing in parallel, through a full crossbar interconnect with separated read and write functions, the data stored in the multiple channels of the memory.

Item 19. An electronic device, including:

a memory for storing instructions; and

a processor for reading the instructions in the memory and executing the method according to any one of items 10-18.

Item 20. A non-transitory storage medium having instructions stored thereon, wherein the instructions, when read by a processor, cause the processor to execute the method according to any one of items 10-18.
Of course, the specific embodiments described above are merely examples and not limitations, and a person skilled in the art may, according to the concept of the present application, merge and combine some steps and apparatuses from the embodiments described separately above to achieve the effects of the present application. Such merged and combined embodiments are also included in the present application and are not described one by one here.

Note that the advantages, benefits, effects, and the like mentioned in the present disclosure are merely examples and not limitations; they cannot be regarded as necessarily possessed by every embodiment of the present application. In addition, the specific details disclosed above are only for the purpose of example and ease of understanding, not limitation; the above details do not restrict the present application to being implemented with those specific details.

The block diagrams of devices, apparatuses, equipment, and systems involved in the present disclosure are merely illustrative examples and are not intended to require or imply that connections, arrangements, and configurations must be made in the way shown in the block diagrams. As those skilled in the art will recognize, these devices, apparatuses, equipment, and systems can be connected, arranged, and configured in any way. Words such as "include", "comprise", and "have" are open words meaning "including but not limited to" and can be used interchangeably therewith. The words "or" and "and" as used here refer to "and/or" and can be used interchangeably therewith unless the context clearly indicates otherwise. The word "such as" as used here refers to the phrase "such as but not limited to" and can be used interchangeably therewith.

The step flowcharts in the present disclosure and the above method descriptions are merely illustrative examples and are not intended to require or imply that the steps of the various embodiments must be performed in the order given. As those skilled in the art will recognize, the steps of the above embodiments can be performed in any order. Words such as "thereafter", "then", and "next" are not intended to limit the order of the steps; they are only used to guide the reader through the descriptions of these methods. Furthermore, any reference to a singular element using the articles "a", "an", or "the" is not to be construed as limiting the element to the singular.

In addition, the steps and apparatuses in the various embodiments herein are not limited to being carried out in a particular embodiment; in fact, relevant parts of the steps and apparatuses in the various embodiments herein can be combined according to the concept of the present application to conceive new embodiments, and these new embodiments are also included within the scope of the present application.
The various operations of the methods described above can be performed by any appropriate means capable of performing the corresponding functions. Such means may include various hardware and/or software components and/or modules, including but not limited to hardware circuits, application-specific integrated circuits (ASICs), or processors.

The various illustrated logic blocks, modules, and circuits can be implemented or performed with a general-purpose processor, a digital signal processor (DSP), an ASIC, a field-programmable gate array (FPGA) or other programmable logic device (PLD), discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general-purpose processor may be a microprocessor, but in the alternative the processor may be any commercially available processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, such as a combination of a DSP and a microprocessor, multiple microprocessors, a microprocessor cooperating with a DSP core, or any other such configuration.

The steps of a method or algorithm described in connection with the present disclosure may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. Software modules may reside in any form of tangible storage medium. Some examples of usable storage media include random-access memory (RAM), read-only memory (ROM), flash memory, EPROM memory, EEPROM memory, registers, hard disks, removable disks, CD-ROMs, and so on. A storage medium can be coupled to the processor so that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. A software module may be a single instruction or many instructions, and may be distributed over several different code segments, between different programs, and across multiple storage media.
The methods disclosed herein include actions for implementing the described methods. The methods and/or actions can be interchanged with one another without departing from the scope of the claims. In other words, unless a specific order of actions is specified, the order and/or use of specific actions can be modified without departing from the scope of the claims.

The functions described above can be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions can be stored as instructions on a tangible computer-readable medium. A storage medium can be any available tangible medium that can be accessed by a computer. By way of example and not limitation, such computer-readable media can include RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other tangible medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. As used herein, disk and disc include compact discs (CDs), laser discs, optical discs, digital versatile discs (DVDs), floppy disks, and Blu-ray discs, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers.

Thus, a computer program product can perform the operations presented herein. For example, such a computer program product can be a computer-readable tangible medium having instructions tangibly stored (and/or encoded) thereon, the instructions being executable by a processor to perform the operations described herein. The computer program product can include packaging material.

Software or instructions can also be transmitted over a transmission medium. For example, software can be transmitted from a website, server, or other remote source using a transmission medium such as coaxial cable, fiber-optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, or microwave.

Furthermore, modules and/or other appropriate means for performing the methods and techniques described herein can be downloaded and/or otherwise obtained by user terminals and/or base stations as appropriate. For example, such a device can be coupled to a server to facilitate the transfer of means for performing the methods described herein. Alternatively, the various methods described herein can be provided via storage components (e.g., RAM, ROM, or physical storage media such as CDs or floppy disks) so that a user terminal and/or base station can obtain the various methods upon coupling the storage component to the device or providing the storage component to the device. Moreover, any other appropriate technique for providing the methods and techniques described herein to a device can be utilized.
Other examples and implementations are within the scope and spirit of the present disclosure and the appended claims. For example, owing to the nature of software, the functions described above can be implemented using software executed by a processor, hardware, firmware, hard-wiring, or any combination of these. Features implementing functions can also be physically located at various positions, including being distributed so that parts of a function are implemented at different physical locations. Moreover, as used herein, including in the claims, "or" as used in a list of items beginning with "at least one" indicates a disjunctive list, so that, for example, a list of "at least one of A, B, or C" means A or B or C, or AB or AC or BC, or ABC (i.e., A and B and C). Furthermore, the word "exemplary" does not mean that the described example is preferred or better than other examples.

Various changes, substitutions, and alterations to the techniques described herein can be made without departing from the taught techniques defined by the appended claims. Furthermore, the scope of the claims of the present disclosure is not limited to the specific aspects of the processes, machines, manufacture, compositions of events, means, methods, and actions described above. Processes, machines, manufacture, compositions of events, means, methods, or actions currently existing or later to be developed that perform substantially the same function or achieve substantially the same result as the corresponding aspects described herein can be utilized. Accordingly, the appended claims include within their scope such processes, machines, manufacture, compositions of events, means, methods, or actions.

The above description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present application. Various modifications to these aspects will be readily apparent to those skilled in the art, and the general principles defined herein can be applied to other aspects without departing from the scope of the present application. Therefore, the present application is not intended to be limited to the aspects shown here but accords with the widest scope consistent with the principles and novel features disclosed herein.

The above description has been given for the purposes of illustration and description. Furthermore, this description is not intended to limit the embodiments of the present application to the forms disclosed herein. Although multiple example aspects and embodiments have been discussed above, those skilled in the art will recognize certain variations, modifications, changes, additions, and sub-combinations thereof.

Claims (20)

  1. An artificial intelligence processor chip for flexibly accessing data, comprising:
    a memory configured to store read-in tensor data from outside the processor chip, the read-in tensor data comprising a plurality of elements for performing tensor operations of operators included in artificial intelligence computation;
    a storage control unit configured to control, according to the tensor operation of the operator, the reading of elements from the memory to be sent to a computing unit, wherein the storage control unit comprises an address calculation module having an interface for receiving software-configured parameters, and the address calculation module calculates addresses in the memory in a one-level read loop or in nested multi-level read loops according to the configured parameters received through the interface, so as to read elements from the calculated addresses to be sent to the computing unit; and
    a computing unit configured to perform the tensor operation of the operator using the received elements.
  2. The processor chip according to claim 1, wherein the software-configured parameters comprise: a value indicating how many elements of the read-in tensor data are read in a one-level read loop and a value indicating the stride between successive steps within the one-level read loop; or the software-configured parameters comprise: a value indicating how many read steps are performed in each level of read loop and a value indicating the stride between successive steps within each level of read loop, wherein the levels of read loops proceed in a nested manner from the outermost level to the innermost level.
  3. The processor chip according to claim 1, wherein the software-configured parameters comprise: a value indicating how many addresses separate the address of the first element read by a level of read loop from the initial address of the input tensor in the memory.
  4. The processor chip according to claim 1, wherein the software-configured parameters comprise a condition on a parameter, and wherein the parameter takes different values when the condition is met and when it is not met.
  5. The processor chip according to claim 4, wherein the parameter is a value indicating how many read steps are performed in a specific level of read loop, and the condition is which step has been reached in another level of read loop that is outer than the specific level.
  6. The processor chip according to any one of claims 2-5, wherein the address calculation module calculates the address currently to be read according to the current step of the one level or of each level of read loop and the respective strides of the one level or of each level of read loop.
  7. The processor chip according to claim 1, wherein the software-configured parameters indicate that the tensor operation of the operator is replaced by a manner of reading elements from addresses in the memory according to the tensor operation of the operator.
  8. The processor chip according to claim 7, wherein the tensor operation is a tensor operation that does not change the values in the input tensor, and wherein the address calculation module calculates, according to the configured parameters received through the interface, the address in the memory to be read on each read in a one-level read loop or in nested multi-level read loops, so that the reading replaces the tensor operation.
  9. The processor chip according to claim 8, wherein the tensor operation comprises at least one of the following: the operation of the transpose operator, the operation of the reshape operator, the operation of the broadcast operator, the operation of the gather operator, the operation of the reverse operator, the operation of the concat operator, and the operation of the cast operator, wherein the memory is divided into multiple channels for respectively storing data that can be accessed in parallel, and the storage control unit comprises a full crossbar interconnect with separated read and write functions to access in parallel the data stored in the multiple channels of the memory.
  10. A method for flexibly accessing data in an artificial intelligence processor chip, comprising:
    storing, by the memory in the artificial intelligence processor chip, read-in tensor data from outside the processor chip, the read-in tensor data comprising a plurality of elements for performing tensor operations of operators included in artificial intelligence computation;
    controlling, according to the tensor operation of the operator, the reading of elements from the memory to be sent to a computing unit, comprising: calculating addresses in the memory in a one-level read loop or in nested multi-level read loops according to received software-configured parameters, so as to read elements from the calculated addresses to be sent to the computing unit in the artificial intelligence processor chip; and
    performing, by the computing unit, the tensor operation of the operator using the received elements.
  11. The method according to claim 10, wherein the software-configured parameters comprise: a value indicating how many elements of the read-in tensor data are read in a one-level read loop and a value indicating the stride between successive steps within the one-level read loop; or the software-configured parameters comprise: a value indicating how many read steps are performed in each level of read loop and a value indicating the stride between successive steps within each level of read loop, wherein the levels of read loops proceed in a nested manner from the outermost level to the innermost level.
  12. The method according to claim 10, wherein the software-configured parameters comprise: a value indicating how many addresses separate the address of the first element read by a level of read loop from the initial address of the input tensor in the memory.
  13. The method according to claim 10, wherein the software-configured parameters comprise a condition on a parameter, and wherein the parameter takes different values when the condition is met and when it is not met.
  14. The method according to claim 13, wherein the parameter is a value indicating how many read steps are performed in a specific level of read loop, and the condition is which step has been reached in another level of read loop that is outer than the specific level.
  15. The method according to any one of claims 11-14, further comprising: calculating the address currently to be read according to the current step of the one level or of each level of read loop and the respective strides of the one level or of each level of read loop.
  16. The method according to claim 10, wherein the software-configured parameters indicate that the tensor operation of the operator is replaced by a manner of reading elements from addresses in the memory according to the tensor operation of the operator.
  17. The method according to claim 16, wherein the tensor operation is a tensor operation that does not change the values in the input tensor, wherein the address in the memory to be read on each read is calculated in a one-level read loop or in nested multi-level read loops according to the configured parameters received through the interface, so that the reading replaces the tensor operation.
  18. The method according to claim 17, wherein the tensor operation comprises at least one of the following: the operation of the transpose operator, the operation of the reshape operator, the operation of the broadcast operator, the operation of the gather operator, the operation of the reverse operator, the operation of the concat operator, and the operation of the cast operator, wherein the memory is divided into multiple channels for respectively storing data that can be accessed in parallel, and the method further comprises accessing in parallel, through a full crossbar interconnect with separated read and write functions, the data stored in the multiple channels of the memory.
  19. An electronic device, comprising:
    a memory for storing instructions; and
    a processor for reading the instructions in the memory and executing the method according to any one of claims 10-18.
  20. A non-transitory storage medium having instructions stored thereon,
    wherein the instructions, when read by a processor, cause the processor to execute the method according to any one of claims 10-18.
PCT/CN2023/107010 2022-07-15 2023-07-12 Artificial intelligence chip, method for flexibly accessing data, device, and medium WO2024012492A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210836577.0A CN117435547A (zh) 2022-07-15 2022-07-15 人工智能芯片、灵活地访问数据的方法、设备和介质
CN202210836577.0 2022-07-15

Publications (1)

Publication Number Publication Date
WO2024012492A1 true WO2024012492A1 (zh) 2024-01-18

Family

ID=89535585

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/107010 WO2024012492A1 (zh) 2022-07-15 2023-07-12 人工智能芯片、灵活地访问数据的方法、设备和介质

Country Status (2)

Country Link
CN (1) CN117435547A (zh)
WO (1) WO2024012492A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117785759A (zh) * 2024-02-28 2024-03-29 北京壁仞科技开发有限公司 数据存储方法、数据读取方法、电子设备和存储介质

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110462586A (zh) * 2017-05-23 2019-11-15 谷歌有限责任公司 使用加法器访问多维张量中的数据
CN111160545A (zh) * 2019-12-31 2020-05-15 北京三快在线科技有限公司 人工神经网络处理系统及其数据处理方法
CN111767508A (zh) * 2020-07-09 2020-10-13 地平线(上海)人工智能技术有限公司 计算机实现张量数据计算的方法、装置、介质和设备
US20220113968A1 (en) * 2019-08-14 2022-04-14 Jerry D. Harthcock Fully pipelined binary conversion hardware operator logic circuit
US20220164192A1 (en) * 2020-11-26 2022-05-26 Electronics And Telecommunications Research Institute Parallel processor, address generator of parallel processor, and electronic device including parallel processor

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110462586A (zh) * 2017-05-23 2019-11-15 谷歌有限责任公司 使用加法器访问多维张量中的数据
US20220113968A1 (en) * 2019-08-14 2022-04-14 Jerry D. Harthcock Fully pipelined binary conversion hardware operator logic circuit
CN111160545A (zh) * 2019-12-31 2020-05-15 北京三快在线科技有限公司 人工神经网络处理系统及其数据处理方法
CN111767508A (zh) * 2020-07-09 2020-10-13 地平线(上海)人工智能技术有限公司 计算机实现张量数据计算的方法、装置、介质和设备
US20220164192A1 (en) * 2020-11-26 2022-05-26 Electronics And Telecommunications Research Institute Parallel processor, address generator of parallel processor, and electronic device including parallel processor

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117785759A (zh) * 2024-02-28 2024-03-29 北京壁仞科技开发有限公司 数据存储方法、数据读取方法、电子设备和存储介质
CN117785759B (zh) * 2024-02-28 2024-04-23 北京壁仞科技开发有限公司 数据存储方法、数据读取方法、电子设备和存储介质

Also Published As

Publication number Publication date
CN117435547A (zh) 2024-01-23

Similar Documents

Publication Publication Date Title
KR102443546B1 (ko) 행렬 곱셈기
WO2024012492A1 (zh) 人工智能芯片、灵活地访问数据的方法、设备和介质
US10884707B1 (en) Transpose operations using processing element array
CN109948774A (zh) 基于网络层捆绑运算的神经网络加速器及其实现方法
US11928580B2 (en) Interleaving memory requests to accelerate memory accesses
CN107766079A (zh) 处理器以及用于在处理器上执行指令的方法
JP7008983B2 (ja) テンソルデータにアクセスするための方法および装置
CN111630487B (zh) 用于神经网络处理的共享存储器的集中式-分布式混合组织
CN111583095B (zh) 图像数据存储方法、图像数据处理方法、系统及相关装置
WO2023071238A1 (zh) 计算图的编译、调度方法及相关产品
US11875248B2 (en) Implementation of a neural network in multicore hardware
WO2022142479A1 (zh) 一种硬件加速器、数据处理方法、系统级芯片及介质
WO2021142713A1 (zh) 神经网络处理的方法、装置与系统
CN107909537A (zh) 一种基于卷积神经网络的图像处理方法及移动终端
CN107506329A (zh) 一种自动支持循环迭代流水线的粗粒度可重构阵列及其配置方法
WO2021244045A1 (zh) 一种神经网络的数据处理方法及装置
Shang et al. LACS: A high-computational-efficiency accelerator for CNNs
WO2020093968A1 (zh) 卷积处理引擎及控制方法和相应的卷积神经网络加速器
CN107894957B (zh) 面向卷积神经网络的存储器数据访问与插零方法及装置
US11087067B2 (en) Systems and methods for implementing tile-level predication within a machine perception and dense algorithm integrated circuit
Bai et al. An OpenCL-based FPGA accelerator with the Winograd’s minimal filtering algorithm for convolution neuron networks
CN113469326A (zh) 在神经网络模型中执行剪枝优化的集成电路装置及板卡
CN115904681A (zh) 任务调度方法、装置及相关产品
WO2024012491A1 (zh) 优化神经网络模块算力的方法、芯片、电子设备和介质
JP2008102599A (ja) プロセッサ

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23838982

Country of ref document: EP

Kind code of ref document: A1