CN112596881B - Storage component and artificial intelligence processor
- Publication number
- CN112596881B (application number CN202011565305.9A)
- Authority
- CN
- China
- Prior art keywords
- data
- unit
- processing
- read
- storage unit
- Prior art date
- Legal status (assumed; not a legal conclusion)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/48—Program initiating; Program switching, e.g. by interrupt
- G06F9/4806—Task transfer initiation or dispatching
- G06F9/4843—Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
- G06F9/4881—Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/16—Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0602—Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
- G06F3/061—Improving I/O performance
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0628—Interfaces specially adapted for storage systems making use of a particular technique
- G06F3/0655—Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Mathematical Physics (AREA)
- Pure & Applied Mathematics (AREA)
- Mathematical Analysis (AREA)
- Mathematical Optimization (AREA)
- Computational Mathematics (AREA)
- Human Computer Interaction (AREA)
- Data Mining & Analysis (AREA)
- Computing Systems (AREA)
- Algebra (AREA)
- Databases & Information Systems (AREA)
- Memory System (AREA)
Abstract
The present disclosure relates to a storage component and an artificial intelligence processor. The storage component is applied to a computational core of an artificial intelligence processor; the artificial intelligence processor comprises a plurality of computational cores, each computational core comprising a processing component and a storage component. The storage component comprises first, second, and third storage units, and the processing component comprises an axon unit, a cell body unit, and a routing unit. According to the storage component of the embodiments of the present disclosure, the processing component and the storage component are both arranged within the computing core, so that the storage component directly receives read-write access from the processing component without the processing component having to read or write memory outside the core. The distributed architecture of multiple storage units allows different data to be stored separately, which makes it convenient for the processing component to access the storage units and suits processing components with a many-core architecture. The size and power consumption of the artificial intelligence processor can thereby be reduced, and its processing efficiency improved.
Description
Technical Field
The present disclosure relates to the field of computers, and more particularly, to a storage component and an artificial intelligence processor.
Background
In the related art, computer memory generally includes ROM (Read-Only Memory), RAM (Random Access Memory), cache memory (Cache), and the like.
Information (data or programs) is written into a ROM when it is manufactured and is stored permanently. This information can generally only be read out, not rewritten, and it is retained even when the device is powered off. ROM is typically used to store a computer's basic programs and data, such as the BIOS ROM, and is commonly packaged in a dual in-line package (DIP).
RAM supports both reading and writing, but the data stored in it is lost when the machine powers down. RAM is generally used as a computer's main memory. A memory module is a circuit board on which RAM chips are integrated; it is inserted into a memory slot in the computer to reduce the space occupied by the RAM chips. Common module capacities include 1 GB, 2 GB, and 4 GB per module.
Cache is a general term for data caches such as the level-1 cache (L1 Cache), level-2 cache (L2 Cache), and level-3 cache (L3 Cache). It sits between the CPU and main memory and offers a faster read-write speed than main memory. When the CPU writes data to or reads data from main memory, that data is also stored in the cache. When the CPU needs the data again, it reads it from the cache instead of accessing the slower main memory; of course, if the needed data is not in the cache, the CPU reads it from main memory.
In the related art, the operating speed of a microprocessor exceeds the memory access speed by several orders of magnitude, so fetching data from memory is a key bottleneck, also called the "memory wall". Statistics show that about 50% of the computation cycles in a computer are spent waiting for data to be loaded from memory. A computer's memory and processor are separate, so all data must be moved back and forth between the two. The CPU's operating speed increases faster than the memory access speed, so a speed mismatch exists between the CPU and memory. This problem not only limits system bandwidth and increases system power consumption, but also further increases the cost and size of the computer.
Disclosure of Invention
In view of the above, the present disclosure provides a storage component and an artificial intelligence processor.
According to an aspect of the present disclosure, there is provided a storage component, wherein the storage component is applied to a computation core of an artificial intelligence processor, the artificial intelligence processor including a plurality of computation cores, each computation core including a processing component and a storage component, the storage component including: a first storage unit, a second storage unit, and a third storage unit; the processing component includes an axon unit, a cell body unit, and a routing unit, wherein the first storage unit is configured to store processing data and weight data and to receive read-write access of the axon unit and read-write access of the cell body unit, so that the axon unit performs data processing on the read processing data and weight data and writes an obtained first processing result into the first storage unit, and the cell body unit reads the processing data and/or the first processing result; the second storage unit is configured to store operation parameters and to receive read-write access of the axon unit and the cell body unit and read-write access of the routing unit, so that the cell body unit performs data processing according to the read operation parameters and the read processing data and writes an obtained second processing result into the second storage unit, and the routing unit reads the operation parameters; the third storage unit is configured to receive write-only access of the cell body unit and read-only access of the routing unit, so that the cell body unit writes the read processing data, the first processing result, and/or the second processing result into the third storage unit, and the routing unit reads the processing data, the first processing result, and/or the second processing result and sends them to an external circuit according to the operation parameters.
In one possible implementation manner, the access priorities of the first storage unit, the second storage unit, and the third storage unit with respect to the axon unit, the cell body unit, and the routing unit are set such that the axon unit has a higher access priority than the cell body unit, and the cell body unit has a higher access priority than the routing unit.
In a possible implementation manner, the first storage unit includes a processing data space and a weight data space; the processing data space is used for storing processing data and the first processing result and receiving read-write access of the axon unit and the cell body unit; and the weight data space is used for storing weight data and receiving read-write access of the axon unit.
In a possible implementation manner, the second storage unit includes an operation parameter space and a first cache space, the operation parameter space being used for storing operation parameters and receiving read-only access of the axon unit, the cell body unit, and the routing unit; the first cache space is used for receiving read-write access of the cell body unit and storing the second processing result.
In a possible implementation manner, the first cache space is further configured to receive communication data written by the routing unit, where the communication data includes data received by the routing unit from an external circuit.
In a possible implementation manner, the third storage unit includes a second cache space, and the second cache space is used for receiving the write-only access of the cell body unit and the read-only access of the routing unit.
In one possible implementation manner, the first storage unit is further configured to: determining the storage address bits according to the address selection bits of the vector type data: addressing according to the storage address bit to obtain a storage address of the vector type data; writing the vector type data to the memory address.
In one possible implementation manner, the first storage unit is further configured to: performing extension processing on the image type data according to the dimensionality of the image type data to obtain a plurality of vectors of the image type data; and storing the plurality of vectors of the image type data according to the dimensionality of the image type data.
In one possible implementation manner, the first storage unit is further configured to: splitting the vector of the image type data under the condition that the vector length of the image type data is larger than the bit width of the first storage unit to obtain the split vector; and performing zero padding processing on the split vector according to the bit width of the first storage unit, and storing the vector after the zero padding processing.
According to another aspect of the present disclosure, there is provided an artificial intelligence processor comprising a plurality of computing cores, the computing cores comprising processing components and the above-mentioned storage components.
According to the storage component of the embodiments of the present disclosure, the processing component and the storage component can both be arranged within the computing core, so that the storage component directly receives read-write access from the processing component without the processing component having to read or write memory outside the core. The first storage unit comprises a processing data space for storing processing data and a weight data space for storing weight data, which makes it convenient to read the weight data and the processing data in parallel and to perform matrix multiply-add processing on the processing data, improving reading efficiency and processing efficiency. The second storage unit makes it convenient for the cell body unit to read, write, and process data, and for the routing unit to read the operation parameters for data communication, which can also improve reading efficiency and processing efficiency. The cell body unit can write data to be sent into the third storage unit, and the routing unit can read and send that data, which can improve data read-write efficiency. Furthermore, the access priority of each storage unit can be set according to the processing order of the processing units, so as to reduce access conflicts and improve access efficiency. In addition, different storage modes can be used for different types of data, which improves data storage efficiency and makes the data convenient to read and write. In summary, the storage component optimizes the memory read-write speed, suits processing components with a many-core architecture, and can reduce the volume and power consumption of the artificial intelligence processor while improving its processing efficiency.
Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of the specification, illustrate exemplary embodiments, features, and aspects of the disclosure and, together with the description, serve to explain the principles of the disclosure.
FIG. 1 shows a schematic diagram of a computational core according to an embodiment of the present disclosure;
FIG. 2 shows a schematic diagram of storing vector type data, according to an embodiment of the present disclosure;
FIG. 3 shows a schematic diagram of storing image type data according to an embodiment of the present disclosure;
FIG. 4 shows a schematic diagram of storing image type data according to an embodiment of the present disclosure;
FIG. 5 shows a schematic diagram of a storage component according to an embodiment of the disclosure;
FIG. 6 shows a block diagram of an electronic device according to an embodiment of the present disclosure;
FIG. 7 shows a block diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
Various exemplary embodiments, features and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers indicate functionally identical or similar elements. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present disclosure.
Fig. 1 shows a schematic diagram of a computational core according to an embodiment of the present disclosure. The storage component according to the embodiment of the present disclosure is applied to a computational core of an artificial intelligence processor including a plurality of computational cores.
As shown in FIG. 1, each computing core includes a processing component and a storage component. The storage component includes a first storage unit 11, a second storage unit 12, and a third storage unit 13; the processing component includes an axon unit 21, a cell body unit 22, and a routing unit 23.
The first storage unit 11 is configured to store processing data and weight data, receive read-write access of an axon unit and read-write access of a cell body unit, enable the axon unit to perform data processing on the read processing data and the weight data, write an obtained first processing result into the first storage unit, and enable the cell body unit to read the processing data and/or the first processing result;
the second storage unit 12 is configured to store an operation parameter, receive read-write access of the axon unit, the cell body unit, and read-write access of the routing unit, so that the cell body unit performs data processing according to the read operation parameter and the processing data, write an obtained second processing result in the second storage unit, and cause the routing unit to read the operation parameter;
the third storage unit 13 is configured to receive a write-only access of the cell unit and a read-only access of the routing unit, so that the cell unit writes the read processing data, the first processing result, and/or the second processing result into the third storage unit, and causes the routing unit to read and send the processing data, the first processing result, and/or the second processing result to an external circuit according to the operation parameter.
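For illustration only, the access types recited in the three paragraphs above can be summarized in a short sketch. This is a hypothetical Python summary, not part of the claimed structure; the dictionary keys and function names are assumptions made for readability.

```python
# Illustrative summary of the access types described above.
# "rw" = read-write, "w" = write-only, "r" = read-only, None = no direct access.
ACCESS = {
    "first_storage_unit":  {"axon": "rw", "cell_body": "rw", "routing": None},
    "second_storage_unit": {"axon": "rw", "cell_body": "rw", "routing": "rw"},
    "third_storage_unit":  {"axon": None, "cell_body": "w",  "routing": "r"},
}

def may_read(unit, storage):
    """True if the unit's access type to the given storage unit permits reads."""
    return "r" in (ACCESS[storage].get(unit) or "")

def may_write(unit, storage):
    """True if the unit's access type to the given storage unit permits writes."""
    return "w" in (ACCESS[storage].get(unit) or "")

# Example: the routing unit may read, but not write, the third storage unit.
assert may_read("routing", "third_storage_unit")
assert not may_write("routing", "third_storage_unit")
```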
According to the storage component of the embodiments of the present disclosure, the processing component and the storage component can both be arranged within the computing core, so that the storage component directly receives read-write access from the processing component without the processing component having to read or write memory outside the core. The distributed storage architecture of the plurality of storage units can store different data separately, which makes it convenient for the processing component to access the storage units, optimizes the memory read-write speed, and suits processing components with a many-core architecture; it can also reduce the size and power consumption of the artificial intelligence processor and improve its processing efficiency.
In one possible implementation, the storage component is applied to a computational core of an artificial intelligence processor. The artificial intelligence processor may be a brain-inspired computing chip, that is, it improves processing efficiency and reduces power consumption by taking the brain's processing mode as a reference and simulating the way neurons in the brain transmit and process information. The artificial intelligence processor may comprise a plurality of computing cores, which can process different tasks independently or process the same task in parallel, thereby improving processing efficiency. The computing cores can transmit information between cores through the routing units within them. The processing component simulates the way brain neurons process information and is divided into processing units such as an axon unit and a cell body unit; these processing units respectively perform read-write access, read-only access, or write-only access to the plurality of storage units of the storage component so as to exchange data with the in-core storage component, and respectively undertake their own data processing tasks and/or data transmission tasks to obtain data processing results or to communicate with other computing cores. The present disclosure does not limit the application field of the storage component.
In one possible implementation manner, the first storage unit may include multiple Static Random-Access Memory (SRAM) chips to improve the read-write speed of the processing component. In an example, the first storage unit may include four SRAMs with a read-write width of 32B and a capacity of 32KB. The present disclosure does not limit the read-write width and capacity of the first storage unit.
In one possible implementation, the first storage unit may include four SRAMs, and the four blocks may respectively be used to store different data. In an example, the four SRAMs of the first storage unit may be used to store the processing data and the weight data of the neural network respectively; for example, two SRAMs may be used to store the processing data and the other two SRAMs may be used to store the weight data. In an example, all four SRAMs may also be used to store both the processing data and the weight data of the neural network; the present disclosure does not limit the type and use of the memory included in the first storage unit.
In an example, the axon unit may read the weight data and the processing data stored in the first storage unit for data processing. For example, when the artificial intelligence processor performs a neural network processing task, the axon unit may read the weight data and processing data stored in the first storage unit, perform type conversion on them, for instance converting the read data into matrix type data or vector type data, and transmit the converted data to the dendrite unit, which performs matrix or vector multiply-add operations to obtain the first processing result. The axon unit may then read the first processing result obtained by the dendrite unit and write it into the first storage unit.
In a possible implementation manner, the first storage unit includes a processing data space and a weight data space. The processing data space is used for storing processing data and the first processing result and receives read-write access of the axon unit and the cell body unit; the weight data space is used for storing weight data and receives read-write access of the axon unit.
In an example, the first storage unit comprises four SRAMs; two SRAMs can be used as the processing data space for storing processing data and the first processing result, and the other two SRAMs can be used as the weight data space for storing the weights of the neural network. When performing neural network processing, weighted-sum processing, that is, multiply-add processing, may be performed on the processing data using the weight data.
In an example, the axon unit may read the processing data in the processing data space and the weight data in the weight data space, and may perform data conversion of the processing data and the weight data, for example into matrix type data or vector type data, with the dendrite unit then performing matrix operations to realize efficient weighted-sum processing, that is, multiply-add processing, on the processing data.
In an example, the read-write bit width of each of the four SRAMs may be 32B, with a capacity of 32KB; the present disclosure does not limit the read-write bit width and capacity of the first storage unit. Two SRAMs may form a group of memory cells, so the first storage unit may comprise two groups of memory cells, used for storing the processing data and the weight data respectively. When the processing data and the weight data are read, the two groups of memory cells can be accessed in parallel, so that the processing data and the weight data are read simultaneously and the reading efficiency is improved. In an example, when the axon unit reads and writes the processing data and the weight data, it can read and write the two SRAMs with a bit width of 32B; the present disclosure does not limit the read-write bit width used by the axon unit when accessing the first storage unit.
In an example, the dendrite unit may obtain a first processing result, and the axon unit may read the first processing result and write to the processing data space of the first storage unit. The first processing result may also be obtained by the axon unit, for example, the data conversion result may be used as the first processing result and written into the processing data space of the first storage unit, that is, the axon unit may have read-write access to the processing data space of the first storage unit. The present disclosure does not limit the execution unit that obtains the first processing result.
In an example, the axon unit may have read and write access to the weight data space of the first storage unit, e.g., when performing a multiply-add operation, the weight data in the weight data space may be read to perform a weighted sum of the processed data. When the neural network is trained, the weight of the neural network can be changed along with the training process, and the axon unit can write the weight generated in the training process into a weight data space. The read-write mode of the weight data is not limited in the present disclosure.
In an example, the axon unit may include two sets of data buses, which may be used for read-only access and read-write access, respectively, to the first storage unit. For example, if the process data or the weight data does not need to be modified, the first memory cell may be read-only accessed through the data bus for read-only access. And if the weight data needs to be modified or the first processing result needs to be written into the first storage unit, performing read-write access on the first storage unit by using the data bus for the read-write access.
In an example, the processing data may be data to be processed, such as image data, audio data, and the like, previously stored in the first storage unit, or may be an intermediate result of processing, for example, a first processing result of a certain processing step may be used as processing data of another processing step. The present disclosure is not limited as to the type of data processed.
By the mode, the first storage unit comprises the processing data space for storing the processing data and the weight data space for storing the weight data, so that the weight data and the processing data can be read in parallel conveniently, matrix multiplication and addition processing can be performed on the processing data, and the reading efficiency and the processing efficiency are improved.
In one possible implementation, the cell body unit can be used for data transfer as well as for nonlinear operations. For example, it may be used to transfer the processing result of the axon unit to the third storage unit so that the routing unit can read and send the processing result. For another example, the cell body unit may perform tensor comparison and nonlinear operations such as LUT (look-up table) activation functions and LIF (leaky integrate-and-fire) neuron operations on the processing data; for instance, the cell body unit may read the processing data in the first storage unit and the nonlinear operation parameters stored in the second storage unit, and perform the nonlinear operations on the processing data.
In a possible implementation manner, the second storage unit includes an operation parameter space and a first cache space, the operation parameter space being used for storing operation parameters and receiving read-only access of the axon unit, the cell body unit, and the routing unit; the first cache space is used for receiving read-write access of the cell body unit and storing the second processing result.
In an example, the second storage unit may include a block of SRAM, which may be used to store the operation parameters and cache data. For example, the SRAM has a read-write bit width of 16B and a capacity of 16KB; the present disclosure does not limit the type, bit width, and capacity of the memory included in the second storage unit. In an example, the operation parameters may include a routing table, a lookup table, nonlinear operation parameters, and the like; the present disclosure does not limit the category of the operation parameters. The second storage unit may allow the cell body unit to read the operation parameters to perform nonlinear operations, or allow the routing unit to read the routing table to perform data communication.
In an example, the cell body unit reads the processing data in the first storage unit, for example with a read-write bit width of 32B, and reads the nonlinear operation parameters (for example, parameters of an activation function) in the operation parameter space of the second storage unit, for example with a read-write bit width of 16B; it then performs activation processing on the processing data, obtains the second processing result, and writes the second processing result into the first cache space of the second storage unit. The cell body unit may further read data in the first cache space or in the first storage unit and write the data into the third storage unit, and the routing unit may read the data in the third storage unit and the routing table in the operation parameter space, for example with a read-write bit width of 16B, and send the data to another computational core according to the routing table.
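As a minimal, self-contained sketch of this read-activate-write flow: the class and function names below are hypothetical, and the ReLU-like activation is only a placeholder, since the actual nonlinear operation and its parameters are not limited by this disclosure.

```python
from dataclasses import dataclass, field

@dataclass
class SecondStorageUnit:
    """Hypothetical model of the second storage unit: operation parameters
    read by the cell body unit, plus a first cache space written by it."""
    operation_params: list
    first_cache_space: list = field(default_factory=list)

def cell_body_activate(processing_data, second):
    """Read the nonlinear operation parameters, apply a placeholder activation
    to processing data read from the first storage unit, and write the second
    processing result into the first cache space."""
    scale = second.operation_params[0]
    result = [max(0.0, scale * x) for x in processing_data]  # ReLU-like placeholder
    second.first_cache_space = result
    return result

# Example: processing data previously read from the first storage unit.
out = cell_body_activate([1.0, -2.0, 3.0], SecondStorageUnit(operation_params=[0.5]))
# out == [0.5, 0.0, 1.5]
```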
In a possible implementation manner, the first cache space is further configured to receive communication data written by the routing unit, where the communication data includes data received by the routing unit from an external circuit.
In an example, the routing unit may further receive data sent by another computing core, and write the data into the first cache space of the second storage unit, for example, the data may be written into the first cache space according to a read-write bit width of 1B, 2B, or 8B, and the cell body unit may read the data, process the data, or transmit the data to another storage unit (e.g., the first storage unit).
In this way, the cell body unit can conveniently read, write, and process data, and the routing unit can conveniently read the operation parameters for data communication, so the reading efficiency and processing efficiency can be improved.
In a possible implementation manner, the third storage unit includes a second cache space, and the second cache space is used for receiving the write-only access of the cell unit and the read-only access of the routing unit.
In an example, the third storage unit may include a second cache space for storing data to be transmitted, for example, the second cache space may be a register provided in the routing unit, for example, a register with a capacity of 16B and a read-write bit width of 16B. The cell body unit may write the data to be sent into the second cache space, for example, may write the processed data, the first processing result, and/or the second processing result into the second cache space, and the routing unit may read the written data and send the data according to the routing table.
In an example, a third storage unit may be disposed in the routing unit and configured to cache data to be sent, and the cell body unit may write the data to be sent to the third storage unit without reading the data from the third storage unit, so that the access to the third storage unit by the cell body unit may be a write-only access. The routing unit can read the data to be sent and send the data according to the routing table, and when receiving the data sent by other computing cores, the routing unit can directly write the data into the first cache space of the second storage unit instead of the second cache space of the third storage unit, so that the routing unit does not need to write the third storage unit, and the routing unit can perform read-only access on the third storage unit.
In this way, the cell body unit can write the data to be sent into the third storage unit, and the routing unit can read the data and send the data, so that the data reading and writing efficiency can be improved.
In one possible implementation manner, the access priorities of the first storage unit, the second storage unit, and the third storage unit with respect to the axon unit, the cell body unit, and the routing unit are set such that the axon unit has a higher access priority than the cell body unit, and the cell body unit has a higher access priority than the routing unit.
In one possible implementation, the priority of access may be set according to the processing order of the data by the processing units. In an example, the access priority of the first storage unit to each processing unit may be set to be higher for axon units and lower for cell body units. For example, when the two processing units access the first memory unit simultaneously, the first memory unit may generate an access arbitration signal according to priority, for example, the arbitration signal may be set to allow the axon unit to access first, and after the axon unit has accessed, the arbitration signal may be set to allow the cell body unit to access.
In an example, the axon unit may first access the first storage unit to process the processing data in the first storage unit, obtain a first processing result, and write to the first storage unit. After the axon unit is accessed, the cell body unit can access the first storage unit so as to read the first processing result and write the first processing result into the third storage unit.
In an example, the access priority of the second storage unit to each processing unit may be set such that the access priority of the cell body unit is higher and the access priority of the routing unit is lower. For example, when the two processing units access the second storage unit simultaneously, the second storage unit may generate an access arbitration signal according to the priority, for example, the arbitration signal may be set to allow the cell unit to access first, and after the cell unit has completed accessing, the arbitration signal may be set to allow the routing unit to access.
In an example, the cell body unit may read the operation parameter in the second storage unit and the processing data in the first storage unit, perform a non-linear operation on the processing data according to the operation parameter to obtain a second processing result, and then write the second processing result into the first cache space of the second storage unit. Furthermore, the cell body unit can read the second processing result and write the second processing result into the third storage unit, and after the cell body unit finishes reading the second processing result, the routing unit can read the routing table in the second storage unit and send the second processing result according to the routing table.
In an example, the access priority of the third storage unit to each processing unit may be set such that the access priority of the cell unit is higher and the access priority of the routing unit is lower. For example, when the two processing units access the third storage unit simultaneously, the third storage unit may generate an access arbitration signal according to the priority, for example, the arbitration signal may be set to allow the cell unit to access first, and after the cell unit has completed accessing, the arbitration signal may be set to allow the routing unit to access.
In an example, the cell body unit may read a second processing result in the second storage unit, or processing data in the first storage unit, or a first processing result, to wait for data transmission, and write the data into the third storage unit, and after the cell body unit finishes writing, the routing unit may read data to be transmitted in the third storage unit and a routing table in the second storage unit, and transmit the second processing result according to the routing table.
In an example, if the processed data is transient data, such as intermediate data, the data may no longer be valid after being accessed and processed, and may no longer be accessed in subsequent processing, and the data may be released without affecting subsequent use, thereby saving storage space. For example, the second cache space may be used to store data received from other computational cores, and after the data is read by the cell body unit and transferred to the first memory unit for further computation, the data received in the second cache space may be released to save memory space.
In this way, the access priority of each storage unit can be set according to the processing sequence of each processing unit, so that access conflict is reduced, and the access efficiency is improved.
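A small sketch of this priority-based arbitration follows, assuming a single grant per cycle; the unit names and the single-winner model are illustrative assumptions and do not describe the actual hardware arbiter.

```python
# Hypothetical priority order per storage unit, as described above: the axon
# unit outranks the cell body unit, which outranks the routing unit.
PRIORITY = {
    "first_storage_unit":  ["axon", "cell_body"],
    "second_storage_unit": ["cell_body", "routing"],
    "third_storage_unit":  ["cell_body", "routing"],
}

def arbitrate(storage, requesters):
    """Grant access to the highest-priority unit among simultaneous requesters."""
    for unit in PRIORITY[storage]:
        if unit in requesters:
            return unit
    return None

# Example: the axon unit and the cell body unit request the first storage unit
# in the same cycle; the axon unit is granted access first.
assert arbitrate("first_storage_unit", {"axon", "cell_body"}) == "axon"
```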
In a possible implementation manner, the data such as the processing data, the weight data, the first processing result, the second processing result, and the like may be data of a vector type, a matrix type, an image type, and the like. Before processing the processing data, the processing data may be written into the first memory unit so that the processing data is read by the processing units such as the axon unit, the cell body unit, and the like. Alternatively, after the axon unit obtains the first processing result, the first processing result may be written to the first storage unit. Alternatively, after the cell body unit obtains the second processing result, the second processing result may be written into the second memory unit. During the writing process, different types of data may be stored differently.
In one possible implementation manner, the first storage unit is further configured to: determining a storage address bit according to an address selection bit of the vector type data: addressing according to the storage address bit to obtain a storage address of the vector type data; writing the vector type data to the memory address.
In one possible implementation, data such as the processing data, the weight data, and the first processing result may be vector type data. The vector type data may be a vector composed of binary data, a vector composed of integer data, a membrane potential bias vector, or the like; the present disclosure does not limit the type of the vector type data.
In an example, when storing vector type data to the first storage unit, the storage address of the vector type data is determined and the high or low half is selected, and the data is then written into the first storage unit. For example, the first storage unit may be written with a storage bit width of 32B. In an example, in the process of executing convolutional neural network and multilayer perceptron operations, when data generated by the operations is stored, the data can be addressed first and written into the first storage unit after the storage address is determined.
For example, the data may be addressed with a bit width of 16B: data with a bit width of 32B is divided into a high 16B and a low 16B. In each 16B of data, the first 13 bits are the storage address and are used for addressing, and the last bit is the high/low valid bit used to select between the high and low halves: a valid bit of 1 indicates that this 16B is the high 16B of the data, which is then addressed and stored as the high half, while a valid bit of 0 indicates that this 16B is the low 16B of the data, which is addressed and stored as the low half.
In an example, when the computing core executes other tasks, such as data conversion tasks, direct addressing can be used; in this process, the high/low valid bits contained in the data have no effect. For example, addressing may be performed with a bit width of 32B.
Fig. 2 illustrates a schematic diagram of storing vector-type data, according to an embodiment of the present disclosure. As shown in fig. 2, the addressing may be in 16B bits wide and the vector type data may be stored in 16B bits wide. In an example, if the vector type data is less than 16B in length, zeros may be padded at the end to fill in the length of 16B and stored.
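A sketch of the 16B addressing and zero padding described above is given below. The split of the address field into a 13-bit storage address and a high/low valid bit follows the description, but the exact bit positions assumed here, and the helper names, are illustrative only.

```python
BANK_WIDTH = 16  # bytes: the 16B addressing width described above

def split_address_field(addr_field):
    """Split a hypothetical address field into (storage_address, high_low_bit).

    Per the description, 13 bits carry the storage address used for addressing,
    and the last bit selects the high (1) or low (0) 16B half of a 32B word.
    Placing the valid bit in bit 0, with the address just above it, is an
    assumption made for this sketch.
    """
    return (addr_field >> 1) & 0x1FFF, addr_field & 0x1

def pad_vector(vector, width=BANK_WIDTH):
    """Zero-pad a vector shorter than 16B up to the storage width (see FIG. 2)."""
    return bytes(vector) + bytes(max(0, width - len(vector)))

# Example: a 10B vector is padded to 16B before storage.
assert len(pad_vector(bytes(10))) == 16
```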
In one possible implementation, if the processed data, the first processing result, and the like are image type data, the image type data may be expanded to obtain a plurality of vectors of the image type data, and the plurality of vectors may be stored.
In one possible implementation manner, the first storage unit is further configured to: performing extension processing on the image type data according to the dimensionality of the image type data to obtain a plurality of vectors of the image type data; storing a plurality of vectors of the image type data according to the dimensionality of the image type data.
In a possible implementation manner, in the process of performing the operation of the convolutional neural network and the multilayer perceptron, if the depth of the neural network is small, the number of channels is small, and the extension can be performed according to the color dimensionality of the image, for example, the image can be extended according to three dimensionalities of R (red), G (green), and B (blue), that is, the image type data is extended into three groups of data, namely, a data group consisting of R values of pixels of the image, a data group consisting of G values of pixels of the image, and a data group consisting of B values of pixels of the image.
Further, three sets of data may be stored separately, for example, the three sets of data may be spread in the width direction into a plurality of vectors separately, i.e., each set of data is spread into a plurality of vectors. For example, if the image-type data is an image with a resolution of 1024 × 768, three sets of data obtained by extending the image in three dimensions R, G, B are also image-type data of 1024 × 768, and further, the three sets of image-type data can be extended in the width direction, for example, 768 vectors can be extended respectively, and each vector includes 1024 elements. The image-type data may also be extended in the height direction, and the present disclosure does not limit the extension direction.
In one possible implementation, after obtaining the plurality of vectors, respectively, the plurality of vectors may be stored, respectively. Multiple vectors can be stored separately in the three dimensions R, G, B.
Fig. 3 illustrates a schematic diagram of storing image type data according to an embodiment of the present disclosure. As shown in fig. 3, the data sets of the R dimension may be spread into a plurality of vectors and stored to the first storage units, respectively. Similarly, the data in the G dimension and the B dimension may be spread into a plurality of vectors and stored in the first storage unit.
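A sketch of the channel-wise expansion illustrated in FIG. 3, assuming the image is given as rows of (R, G, B) pixel tuples; the function name and the data layout are assumptions made only for illustration.

```python
def expand_by_channel(image):
    """Expand image type data along the R, G, B dimensions into row vectors.

    `image` is assumed to be indexed as image[row][col] = (r, g, b). For each
    channel, the image is spread in the width direction into one vector per
    row, matching the per-dimension storage illustrated in FIG. 3.
    """
    vectors = {"R": [], "G": [], "B": []}
    for row in image:
        vectors["R"].append([px[0] for px in row])
        vectors["G"].append([px[1] for px in row])
        vectors["B"].append([px[2] for px in row])
    return vectors

# A 1024 x 768 image yields 768 vectors of 1024 elements for each of R, G, B.
```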
In one possible implementation manner, the first storage unit is further configured to: splitting the vector of the image type data under the condition that the vector length of the image type data is larger than the bit width of the first storage unit to obtain the split vector; and performing zero padding processing on the split vector according to the bit width of the first storage unit, and storing the vector after the zero padding processing.
In an example, the length of the vector obtained by extending may be larger than the storage bit width of the first storage unit, for example, the length of the vector is 1024B, and the storage bit width of the first storage unit is 16B. The vectors may be split, each vector yielding a plurality of 16B-long subvectors. For another example, if the length of the last sub-vector in the sub-vectors obtained by splitting is less than 16B, it can be complemented by zero padding to 16B and stored.
In a possible implementation manner, in the process of performing the operation of the convolutional neural network and the multilayer perceptron, if the depth of the neural network is larger and the number of channels is larger, the extension may be performed according to the dimension of the size of the image, for example, the extension may be performed on the image according to two dimensions of the width and the height, so as to obtain a plurality of vectors. For example, the image type data is an image with a resolution of 1024 × 768, 1024 × 768 vectors can be obtained, and elements of each vector may include, for example, an R value, a B value, a G value, a gray value, a brightness value, some labeling information, and the like, and the disclosure does not limit the elements of the vectors.
Further, the plurality of vectors may be stored respectively. For example, if the image type data is an image with a resolution of 1024 × 768, 1024 × 768 vectors can be obtained and stored respectively. In the storage process, the vectors can be stored one by one, first along the width direction of the image type data and then along its height direction: for example, the 1024 vectors in the 1st row are stored first, then the 1024 vectors in the 2nd row, and so on, and finally the 1024 vectors in the 768th row. The vectors may also be stored one by one first along the height direction and then along the width direction of the image type data; the present disclosure does not limit the direction.
Fig. 4 illustrates a schematic diagram of storing image type data according to an embodiment of the present disclosure. As shown in fig. 4, the image type data may be spread to obtain a plurality of vectors, and the plurality of vectors may be stored one by one first in a width direction of the image type data and then in a height direction of the image type data.
In an example, the length of a vector obtained by extension may be larger than the storage bit width of the first storage unit, for example a vector length of 24B with a storage bit width of 16B. The vectors may be split, each yielding multiple 16B-long sub-vectors. If the last sub-vector obtained by splitting is shorter than 16B (for example, a 24B vector splits into sub-vectors of 16B and 8B), the sub-vector shorter than 16B may be complemented to 16B by zero padding and then stored. As shown in FIG. 4, the vector x_grp0 is a split sub-vector with length equal to 16B, and the vector x_grp1 is a split sub-vector with length smaller than 16B; zero padding can be performed on x_grp1 to complement its length to 16B.
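A minimal sketch of the splitting and zero-padding step shown in FIG. 4, assuming byte-oriented vectors and a 16B storage bit width; the names x_grp0 and x_grp1 follow the figure, and everything else is illustrative.

```python
def split_and_pad(vector, width=16):
    """Split a vector longer than the storage bit width into width-sized
    sub-vectors and zero-pad the final sub-vector to the full width."""
    sub_vectors = [vector[i:i + width] for i in range(0, len(vector), width)]
    sub_vectors[-1] = sub_vectors[-1] + bytes(width - len(sub_vectors[-1]))
    return sub_vectors

# Example matching FIG. 4: a 24B vector splits into x_grp0 (16B) and x_grp1
# (8B), and x_grp1 is zero-padded up to 16B before being stored.
parts = split_and_pad(bytes(24))
assert [len(p) for p in parts] == [16, 16]
```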
In one possible implementation, the plurality of storage units of the storage component may perform memory management in a pipelined manner. In an example, a plurality of processing units of the processing component may perform pipelined processing, for example, the processing result of a group of processing data by the axon unit may be read by the cell body unit and further processed, and the next group of processing data may be processed by the axon unit while the cell body unit is processing. In the memory management, the memory unit may also perform memory management in the form of a pipeline. For example, the processing data may be divided into a plurality of groups, the axon unit may read one group of processing data for processing, and write the processing result of the group of processing data into the first storage unit, the cell body unit may read the processing result of the group of processing data for further processing without waiting for all processing data to be processed, and after all processing results are obtained, the cell body unit may read all processing results, and the first storage unit only needs to open a storage space sufficient for caching the processing result of the group of processing data, and does not need to open a storage space for caching the processing result of all processing data, thereby improving the memory management efficiency.
In an example, the processing data may be an image comprising a plurality of rows of pixels. The axon unit may read one or several rows of pixels at a time and, after processing, write the processing result of those rows into the first storage unit; the cell body unit may read that processing result for further processing, and after the cell body unit finishes reading, the first storage unit may release the storage space holding the result. At the same time, the axon unit can read the next one or several rows of pixels and write their processing result into the first storage unit again, the cell body unit reads that result, the first storage unit releases the corresponding storage space, and so on. Through such pipelined processing and memory management, the processing efficiency of each processing unit can be improved: the cell body unit does not need to wait for the processing result of the whole image, and can start further processing each time the processing result of one or several rows of pixels becomes available. Meanwhile, the utilization of the storage space of the storage component is improved: only storage space for the processing result of one or several rows of pixels needs to be opened up, rather than space for the processing result of the whole image, which improves memory management efficiency.
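A generator-based sketch of this pipelined flow is shown below; the stage names and the trivial per-row computation are assumptions, and the sketch only illustrates that a single row group's intermediate result is buffered and consumed at a time.

```python
def axon_stage(rows, weights):
    """Process the image one row group at a time (placeholder multiply step)."""
    for row in rows:
        yield [w * x for w, x in zip(weights, row)]   # first processing result

def cell_body_stage(partial_results):
    """Consume each row group's result as soon as it becomes available."""
    for result in partial_results:
        yield sum(result)                             # placeholder reduction

# Only one row group's intermediate result is buffered at a time; the space
# holding it can be released (reused) once the downstream stage has read it.
rows = [[1, 2, 3], [4, 5, 6]]
weights = [1, 1, 1]
outputs = list(cell_body_stage(axon_stage(rows, weights)))  # -> [6, 15]
```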
According to the storage component of the embodiments of the present disclosure, the processing component and the storage component can both be arranged within the computing core, so that the storage component directly receives read-write access from the processing component without the processing component having to read or write memory outside the core. The first storage unit comprises a processing data space for storing processing data and a weight data space for storing weight data, which makes it convenient to read the weight data and the processing data in parallel and to perform matrix multiply-add processing on the processing data, improving reading efficiency and processing efficiency. The second storage unit makes it convenient for the cell body unit to read, write, and process data, and for the routing unit to read the operation parameters for data communication, which can also improve reading efficiency and processing efficiency. The cell body unit can write data to be sent into the third storage unit, and the routing unit can read and send that data, which can improve data read-write efficiency. Furthermore, the access priority of each storage unit can be set according to the processing order of the processing units, so as to reduce access conflicts and improve access efficiency. In addition, different storage modes can be used for different types of data, which improves data storage efficiency and makes the data convenient to read and write. In summary, the storage component optimizes the memory read-write speed, suits processing components with a many-core architecture, and can reduce the volume and power consumption of the artificial intelligence processor while improving its processing efficiency.
FIG. 5 shows a schematic diagram of a storage component according to an embodiment of the present disclosure. The first storage unit may include two memory groups, Mem0 and Mem1, each of which may include two SRAMs with a bit width of 32B and a capacity of 32KB, and which may be used to store processing data and weight data.
In one possible implementation, the second storage unit may include a Mem2 memory, which may be an SRAM with a read-write bit width of 16B and a capacity of 16KB, and may be used to store the operation parameters and the buffer data.
In one possible implementation, the third storage unit may include a Mem3 memory, which may be a register with a capacity of 16B and a read-write bit width of 16B and may be used to store data to be transmitted.
In one possible implementation, the axon unit may transmit read or write instructions to the first storage unit through the data selector (MUX). For example, the axon unit may access the processing data and the weight data, perform weighted-sum processing, that is, multiply-add operations, through the dendrite unit to obtain a first processing result, and write the first processing result to the first storage unit through the data selector MUX.
In one possible implementation manner, the cell body unit may read the operation parameters in the second storage unit and the processing data in the first storage unit, perform nonlinear operation processing on the processing data, and write the obtained second processing result into the second storage unit through the data selector MUX.
In one possible implementation, the cell body unit may read the processing data and/or the first processing result in the first storage unit, or the second processing result in the second storage unit, and write the read data into the third storage unit through the data selector MUX. The routing unit may read the operation parameters (routing table) in the second storage unit and the data in the third storage unit, and transmit the data based on the operation parameters.
In a possible implementation manner, the present disclosure further provides an artificial intelligence processor, where the artificial intelligence processor includes a plurality of computing cores, and the computing cores include a processing component and the above storage component.
FIG. 6 is a block diagram illustrating a combined processing device 1200 according to an embodiment of the present disclosure. As shown in FIG. 6, the combined processing device 1200 includes a computing processing device 1202 (e.g., an artificial intelligence processor including multiple computing cores as described above), an interface device 1204, other processing devices 1206, and a storage device 1208. Depending on the application scenario, one or more computing devices 1210 (e.g., computing cores) may be included in the computing processing device.
In one possible implementation, the computing processing device of the present disclosure may be configured to perform operations specified by a user. In an exemplary application, the computing processing device may be implemented as a single-core or multi-core artificial intelligence processor. Similarly, one or more computing devices included within the computing processing device may be implemented as an artificial intelligence processor core or as part of the hardware structure of an artificial intelligence processor core. When a plurality of computing devices are implemented in this way, the computing processing device of the present disclosure may be regarded as having a single-core structure or a homogeneous multi-core structure.
In an exemplary operation, the computing processing device of the present disclosure may interact with other processing devices through the interface device to jointly perform user-specified operations. Depending on the implementation, the other processing devices of the present disclosure may include one or more types of general-purpose and/or special-purpose processors, such as a central processing unit (CPU), a graphics processing unit (GPU), an artificial intelligence processor, and the like. These processors may include, but are not limited to, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), other programmable logic devices, discrete gate or transistor logic, or discrete hardware components, and their number may be determined according to actual needs. As mentioned above, the computing processing device of the present disclosure alone may be regarded as having a single-core structure or a homogeneous multi-core structure; however, when the computing processing device and the other processing devices are considered together, they may be regarded as forming a heterogeneous multi-core structure.
In one or more embodiments, the other processing devices may serve as an interface between the computing processing device of the present disclosure (which may be embodied as an artificial intelligence computing device, e.g., a computing device associated with neural network operations) and external data and control, performing basic control including, but not limited to, data transfer and starting and/or stopping the computing device. In further embodiments, the other processing devices may also cooperate with the computing processing device to jointly complete computational tasks.
In one or more embodiments, the interface device may be used to transfer data and control instructions between the computing processing device and the other processing devices. For example, the computing processing device may obtain input data from the other processing devices via the interface device and write the input data into an on-chip storage device (or memory) of the computing processing device. Further, the computing processing device may obtain control instructions from the other processing devices via the interface device and write them into an on-chip control cache of the computing processing device. Alternatively or additionally, the interface device may read data from the storage device of the computing processing device and transmit the data to the other processing devices.
Additionally or alternatively, the combined processing device of the present disclosure may further include a storage device. As shown in the figure, the storage device is connected to the computing processing device and the other processing devices, respectively. In one or more embodiments, the storage device may be used to store data of the computing processing device and/or the other processing devices, for example data that cannot be fully retained in the internal or on-chip storage of the computing processing device or the other processing devices.
According to different application scenarios, the artificial intelligence chip of the present disclosure can be used in a server, a cloud server, a server cluster, a data processing device, a robot, a computer, a printer, a scanner, a tablet computer, an intelligent terminal, a PC device, an internet-of-things terminal, a mobile terminal, a mobile phone, a vehicle data recorder, a navigator, a sensor, a camera, a video camera, a projector, a watch, an earphone, a mobile storage device, a wearable device, a visual terminal, an autonomous driving terminal, a vehicle, a household appliance, and/or a medical device. The vehicles include aircraft, ships, and/or automobiles; the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lamps, gas stoves, and range hoods; the medical devices include nuclear magnetic resonance apparatuses, B-mode ultrasound scanners, and/or electrocardiographs.
Fig. 7 illustrates a block diagram of an electronic device 1900 according to an embodiment of the disclosure. For example, the electronic device 1900 may be provided as a server. Referring to Fig. 7, the electronic device 1900 includes a processing component 1922 (e.g., an artificial intelligence processor including multiple computing cores), which further includes one or more computing cores, and memory resources, represented by memory 1932, for storing instructions, such as application programs, executable by the processing component 1922. The application programs stored in memory 1932 may include one or more modules each corresponding to a set of instructions. Further, the processing component 1922 is configured to execute the instructions to perform the above-described method.
The electronic device 1900 may also include a power component 1926 configured to perform power management of the electronic device 1900, a wired or wireless network interface 1950 configured to connect the electronic device 1900 to a network, and an input/output (I/O) interface 1958. The electronic device 1900 may operate based on an operating system stored in memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
In the present disclosure, units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units. The aforementioned components or units may be co-located or distributed across multiple network elements. Some or all of the units may be selected according to actual needs to achieve the purpose of the solutions described in the embodiments of the present disclosure. In addition, in some scenarios, multiple units in embodiments of the present disclosure may be integrated into one unit, or each unit may exist physically separately.
In the foregoing embodiments, the description of each embodiment has its own emphasis; for parts not described in detail in a given embodiment, reference may be made to the related descriptions of other embodiments. The technical features of the embodiments may be combined arbitrarily; for brevity, not all possible combinations are described, but any combination of these technical features should be considered to fall within the scope of this specification as long as it contains no contradiction.
The electronic device or processor of the present disclosure may also be applied to fields such as the internet, the internet of things, data centers, energy, transportation, public administration, manufacturing, education, power grids, telecommunications, finance, retail, construction sites, and medicine. Further, the electronic device or processor of the present disclosure may also be used in cloud, edge, and terminal application scenarios related to artificial intelligence, big data, and/or cloud computing. In one or more embodiments, an electronic device or processor with high computing power according to the present disclosure may be applied to a cloud device (e.g., a cloud server), while an electronic device or processor with lower power consumption may be applied to a terminal device and/or an edge device (e.g., a smartphone or a camera). In one or more embodiments, the hardware information of the cloud device and that of the terminal and/or edge device are mutually compatible, so that appropriate resources can be matched from the hardware resources of the cloud device, according to the hardware information of the terminal and/or edge device, to simulate the hardware resources of the terminal and/or edge device, thereby achieving unified management, scheduling, and cooperative work in device-cloud integration or cloud-edge-device integration.
Having described embodiments of the present disclosure, the foregoing description is exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, their practical application, or improvements over technologies available in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
Claims (10)
1. A memory component for application to a computing core of an artificial intelligence processor, the artificial intelligence processor comprising a plurality of computing cores, each computing core comprising a processing component and a memory component, the memory component comprising: a first storage unit, a second storage unit and a third storage unit; the processing component comprising an axon unit, a cell body unit and a routing unit,
the first storage unit is used for storing processing data and weight data, receiving read-write access of an axon unit and read-write access of a cell body unit, enabling the axon unit to perform data processing on the read processing data and the weight data, writing an obtained first processing result into the first storage unit, and enabling the cell body unit to read the processing data and/or the first processing result;
the second storage unit is used for storing operation parameters, receiving read-write access of the axon unit and the cell body unit and read-write access of the routing unit, enabling the cell body unit to perform data processing according to the read operation parameters and the read processing data, writing an obtained second processing result into the second storage unit, and enabling the routing unit to read the operation parameters;
the third storage unit is configured to receive a write-only access of the cell body unit and a read-only access of the routing unit, so that the cell body unit writes the read processing data, the first processing result, and/or the second processing result into the third storage unit, and the routing unit reads and sends the processing data, the first processing result, and/or the second processing result to an external circuit according to the operation parameter.
2. The component of claim 1, wherein access priorities of the first, second, and third storage units for the axon unit, the cell body unit, and the routing unit are set to:
the axon unit has a higher access priority than the cell body unit;
the access priority of the cell body unit is higher than that of the routing unit.
3. The component of claim 1, wherein the first storage unit comprises a process data space and a weight data space,
the processing data space is used for storing processing data and the first processing result and receiving read-write access of the axon unit and the cell body unit;
and the weight data space is used for storing weight data and receiving read-write access of the axon unit.
4. The component of claim 1, wherein the second storage unit comprises an operation parameter space and a first buffer space,
the operation parameter space is used for storing operation parameters and receiving read-only access of the axon unit, the cell body unit and the routing unit;
the first cache space is used for receiving read-write access of the cell body unit and storing the second processing result.
5. The component of claim 4, wherein the first buffer space is further configured to receive communication data written by the routing unit, the communication data comprising data from an external circuit received by the routing unit.
6. The component of claim 1, wherein the third storage unit comprises a second cache space for receiving write-only accesses by the cell body unit and read-only accesses by the routing unit.
7. The component of claim 1, wherein the first storage unit is further configured to:
determining a storage address bit according to an address selection bit of the vector type data;
addressing according to the storage address bit to obtain a storage address of the vector type data;
writing the vector type data to the storage address.
8. The component of claim 1, wherein the first storage unit is further configured to:
performing extension processing on the image type data according to the dimensionality of the image type data to obtain a plurality of vectors of the image type data;
storing the plurality of vectors of the image type data according to the dimensionality of the image type data.
9. The component of claim 8, wherein the first storage unit is further configured to:
splitting the vector of the image type data under the condition that the vector length of the image type data is larger than the bit width of the first storage unit to obtain the split vector;
and performing zero padding processing on the split vector according to the bit width of the first storage unit, and storing the vector after the zero padding processing.
10. An artificial intelligence processor comprising a plurality of computing cores, the computing cores comprising processing components and memory components according to any of claims 1 to 9.
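As an illustrative reading of the storage scheme recited in claims 7 to 9, the sketch below splits a vector that exceeds the storage bit width into bit-width-sized rows and zero-pads the last row. The 32-byte width follows the Mem0/Mem1 example above, while the element size and function names are assumptions.

```python
# Minimal sketch of the vector-storage behaviour recited in claims 7-9: a vector
# longer than the storage bit width is split into bit-width-sized segments, and
# the final segment is zero-padded before being stored.

BANK_WIDTH_BYTES = 32   # matches the Mem0/Mem1 example read-write bit width
ELEMENT_BYTES = 1       # assumed element size, for illustration only

def split_and_pad(vector: list[int]) -> list[list[int]]:
    """Split a vector into bank-width rows, zero-padding the final row."""
    per_row = BANK_WIDTH_BYTES // ELEMENT_BYTES
    rows = []
    for start in range(0, len(vector), per_row):
        row = vector[start:start + per_row]
        row += [0] * (per_row - len(row))  # zero padding up to the bit width
        rows.append(row)
    return rows

rows = split_and_pad(list(range(40)))   # 40 elements > 32-byte width -> 2 rows
print(len(rows), rows[1][:10])          # second row holds elements 32..39, then zeros
```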
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011565305.9A CN112596881B (en) | 2020-12-25 | 2020-12-25 | Storage component and artificial intelligence processor |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011565305.9A CN112596881B (en) | 2020-12-25 | 2020-12-25 | Storage component and artificial intelligence processor |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112596881A CN112596881A (en) | 2021-04-02 |
CN112596881B true CN112596881B (en) | 2022-10-25 |
Family
ID=75202188
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011565305.9A Active CN112596881B (en) | 2020-12-25 | 2020-12-25 | Storage component and artificial intelligence processor |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112596881B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114035743B (en) * | 2021-10-14 | 2024-05-14 | 长沙韶光半导体有限公司 | Robot sensing data storage method and related equipment |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108334942B (en) * | 2017-12-22 | 2020-08-04 | 清华大学 | Data processing method, device, chip and storage medium of neural network |
CN111930668B (en) * | 2020-08-03 | 2023-09-26 | 中国科学院计算技术研究所 | Arithmetic device, method, multi-core intelligent processor and multi-core heterogeneous intelligent processor |
2020
- 2020-12-25 CN CN202011565305.9A patent/CN112596881B/en active Active
Also Published As
Publication number | Publication date |
---|---|
CN112596881A (en) | 2021-04-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109062611B (en) | Neural network processing device and method for executing vector scaling instruction | |
CN110597559B (en) | Computing device and computing method | |
CN112416433B (en) | Data processing device, data processing method and related product | |
CN114003198B (en) | Inner product processing unit, arbitrary precision calculation device, method, and readable storage medium | |
CN112799599B (en) | Data storage method, computing core, chip and electronic equipment | |
CN111651202A (en) | Device for executing vector logic operation | |
CN112596881B (en) | Storage component and artificial intelligence processor | |
CN115129460A (en) | Method and device for acquiring operator hardware time, computer equipment and storage medium | |
CN111813721A (en) | Neural network data processing method, device, equipment and storage medium | |
CN112801276B (en) | Data processing method, processor and electronic equipment | |
CN112766475B (en) | Processing component and artificial intelligence processor | |
CN113469333B (en) | Artificial intelligence processor, method and related products for executing neural network model | |
CN113378115B (en) | Near-memory sparse vector multiplier based on magnetic random access memory | |
CN114692824A (en) | Quantitative training method, device and equipment of neural network model | |
CN113238976A (en) | Cache controller, integrated circuit device and board card | |
CN113238975A (en) | Memory, integrated circuit and board card for optimizing parameters of deep neural network | |
CN115373646A (en) | Information expansion method, device and related product | |
CN114692847B (en) | Data processing circuit, data processing method and related products | |
CN112232498B (en) | Data processing device, integrated circuit chip, electronic equipment, board card and method | |
CN111930669B (en) | Multi-core heterogeneous intelligent processor and operation method | |
CN114692810A (en) | Device and board card for calculating Winograd convolution | |
CN114692850A (en) | Device and board card for performing Winograd convolution forward conversion on neuron data | |
CN114565075A (en) | Apparatus, method and readable storage medium for supporting multiple access modes | |
CN115878184A (en) | Method, storage medium and device for moving multiple data based on one instruction | |
CN114692848A (en) | Device and board card for obtaining convolution result |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||