CN112799599A - Data storage method, computing core, chip and electronic equipment

Info

Publication number
CN112799599A
CN112799599A (application CN202110172560.5A)
Authority
CN
China
Prior art keywords
data
processing
convolution
layer
storage unit
Prior art date
Legal status
Granted
Application number
CN202110172560.5A
Other languages
Chinese (zh)
Other versions
CN112799599B (en)
Inventor
徐海峥
裴京
王松
马骋
Current Assignee
Tsinghua University
Original Assignee
Tsinghua University
Priority date
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN202110172560.5A
Publication of CN112799599A
Application granted
Publication of CN112799599B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 Interfaces specially adapted for storage systems
    • G06F 3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F 3/061 Improving I/O performance
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods

Abstract

The present disclosure relates to a data storage method, a computing core, a chip and an electronic device. The method is applied to a computing core of a processor; each computing core includes a processing component and a storage component, and the storage component includes two or more storage units. During a multilayer convolution operation, the processing component writes the weight data and the processing data of each convolutional layer into the storage units in sequence, according to the convolutional-layer processing order, and each storage unit receives and stores the weight data and the processing data. Data of adjacent layers can thus be dynamically overwritten while computation proceeds, which increases the remaining contiguous dynamically available space, frees a larger data space for subsequent computation, improves the spatio-temporal operating efficiency of the computing core, and further improves the performance of the chip.

Description

Data storage method, computing core, chip and electronic equipment
Technical Field
The present disclosure relates to the field of neuromorphic engineering, and in particular, to a data storage method, a computational core, a chip, and an electronic device.
Background
Convolution operations are common operations in neural networks. A many-core neuromorphic chip stores and transmits the data needed for convolution operations during mapping. These data can be divided into static data, which is not overwritten, and dynamic data, which is repeatedly overwritten. The memory space occupied by static data in the chip is relatively fixed and cannot be overwritten, while the memory space occupied by dynamic data can change and be overwritten according to timing requirements. The static data may include the weights of the convolution kernels of the neural network, and the dynamic data may include the processing data.
During data storage and transmission, data in an overlapping region is easily overwritten and corrupted; and even if the data is first moved out of the overlapping region before being sent, this adds computation clocks and reduces the computational efficiency of the chip. Storage and access in the chip are time-consuming, occupy a large amount of space, and reuse dynamic space inefficiently.
Disclosure of Invention
In view of the above, the present disclosure provides a data storage method, a computing core, a chip and an electronic device.
According to an aspect of the present disclosure, a data storage method is provided, which is applied to a computing core of a processor, where the processor includes a plurality of computing cores, each computing core includes a processing component and a storage component, and the storage component includes two or more storage units.
The method comprises the following steps: the processing component writes the weight data and the processing data of each convolution layer into each storage unit in sequence for storage according to the convolution layer processing sequence in the multilayer convolution operation process; each storage unit receives and stores the weight data and the processing data;
wherein, in the storage component, the processing data and the weight data of the same convolutional layer are stored in different storage units; in the same storage unit, the weight data space storing the weight data and the processing data space storing the processing data are sequentially arranged according to a first address direction;
wherein, the weight data space for storing the weight data of each convolution layer is sequentially arranged along the first address direction according to the processing sequence of the convolution layers; the processing data spaces for storing the processing data of the convolutional layers are sequentially arranged along the first address direction according to the processing sequence of the convolutional layers;
and in a weight data space for storing weight data of the same convolutional layer and a processing data space for storing processing data of the same convolutional layer, the weight data and the processing data are respectively and sequentially arranged according to a second address direction, and the first address direction is opposite to the second address direction.
In one possible implementation, the first address direction is a high address to low address direction and the second address direction is a low address to high address direction.
In one possible implementation, the method further includes:
the processing component sends the operation result data of each convolution layer in the multilayer convolution operation process to each storage unit for storage;
and each storage unit receives and stores the operation result data, wherein the operation result data of any convolution layer and the processing data of the convolution layer are stored in the same storage unit, and share a storage space with the processing data of the next convolution layer.
In one possible implementation, the method further includes:
the processing component writes first data received from outside the computing core into a storage unit and reads second data from the storage unit to be sent outside the computing core;
in multiple write operations, the initial address of each write operation is arranged according to the first address direction, and first data is written in each write operation according to the second address direction;
in the multiple read operations, the start address of each read operation is arranged according to the first address direction, and the second data is read according to the second address direction in each read operation.
In a possible implementation manner, in the storage unit, the storage order of the processing data and the weight data of each convolutional layer follows the convolution operation order: first the depth direction, then the transverse direction, then the longitudinal direction.
In one possible implementation, each layer of convolution operation process uses a plurality of convolution kernel groups, and each convolution kernel group comprises a plurality of convolution kernels;
in the storage unit, the weight data of the convolution kernel groups used in each layer of convolution operation are stored group by group, and within each group the weight data are stored according to the numbering order of the convolution kernels, the depth direction of the convolution kernels, the transverse direction, and the longitudinal direction.
According to another aspect of the present disclosure, there is provided a computing core comprising a processing component and a storage component;
the processing component writes the weight data and the processing data of each convolution layer into each storage unit in sequence for storage according to the convolution layer processing sequence in the multilayer convolution operation process;
the storage component comprises more than two storage units, and each storage unit receives and stores the weight data and the processing data;
wherein, in the storage component, the processing data and the weight data of the same convolutional layer are stored in different storage units; in the same storage unit, the weight data space storing the weight data and the processing data space storing the processing data are sequentially arranged according to a first address direction;
wherein, the weight data space for storing the weight data of each convolution layer is sequentially arranged along the first address direction according to the processing sequence of the convolution layers; the processing data spaces for storing the processing data of the convolutional layers are sequentially arranged along the first address direction according to the processing sequence of the convolutional layers;
and in a weight data space for storing weight data of the same convolutional layer and a processing data space for storing processing data of the same convolutional layer, the weight data and the processing data are respectively and sequentially arranged according to a second address direction, and the first address direction is opposite to the second address direction.
For the above-mentioned computation core, in one possible implementation, the first address direction is a high address to low address direction, and the second address direction is a low address to high address direction.
For the above computation core, in a possible implementation manner, the processing unit is further configured to send operation result data of each convolution layer in the multilayer convolution operation process to each storage unit for storage;
and each storage unit receives and stores the operation result data, wherein the operation result data of any convolution layer and the processing data of the convolution layer are stored in the same storage unit, and share a storage space with the processing data of the next convolution layer.
For the above-mentioned computing core, in a possible implementation manner, the processing unit is further configured to write first data received from outside the computing core into a storage unit, and read second data from the storage unit to send to outside the computing core;
in multiple write operations, the initial address of each write operation is arranged according to the first address direction, and first data is written in each write operation according to the second address direction;
in the multiple read operations, the start address of each read operation is arranged according to the first address direction, and the second data is read according to the second address direction in each read operation.
For the above computing core, in a possible implementation manner, in the storage unit, the storage order of the processing data and the weight data of each convolutional layer follows the convolution operation order: first the depth direction, then the transverse direction, then the longitudinal direction.
For the above computing core, in a possible implementation, each layer of the convolution operation process uses a plurality of convolution kernel groups, and each convolution kernel group includes a plurality of convolution kernels;
in the storage unit, the weight data of the convolution kernel groups used in each layer of convolution operation are stored group by group, and within each group the weight data are stored according to the numbering order of the convolution kernels, the depth direction of the convolution kernels, the transverse direction, and the longitudinal direction.
According to another aspect of the present disclosure, an artificial intelligence chip is provided, the chip including a plurality of computing cores.
According to another aspect of the present disclosure, an electronic device is provided that includes one or more artificial intelligence chips.
According to the embodiments of the present disclosure, by storing the weight data and the processing data in the weight data space and the processing data space of each storage unit according to the convolutional-layer processing order and the first address direction, data of adjacent convolutional layers can be dynamically overwritten while computation proceeds, which increases the remaining contiguous dynamically available space and frees a larger data space for subsequent computation. The weight data and the processing data of the same convolutional layer are stored and arranged sequentially according to the second address direction, which matches the convolution operation order of first the depth direction, then the transverse direction, then the longitudinal direction, and so improves the fit between the storage process and the convolution operation process. When the address range of data to be sent and the address range of data to be received in a storage unit overlap, arranging the start addresses of successive read and write operations according to the first address direction prevents the data in the overlapping region from being flushed and overwritten while the processing component performs multiple read and write operations in the same period, which improves the spatio-temporal operating efficiency of the computing core and further improves the performance of the chip.
Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features, and aspects of the disclosure and, together with the description, serve to explain the principles of the disclosure.
FIG. 1 shows a schematic diagram of a compute core, according to an embodiment of the present disclosure.
Fig. 2a is a schematic diagram illustrating data storage in the related art.
FIG. 2b shows a data storage schematic according to an embodiment of the present disclosure.
Fig. 3 shows a flowchart of a data storage method according to an embodiment of the present disclosure.
Fig. 4a is a schematic diagram illustrating data transmission of a memory cell in the related art.
Fig. 4b shows a data transmission schematic according to an embodiment of the present disclosure.
Fig. 5 shows a flowchart of a data storage method according to an embodiment of the present disclosure.
FIG. 6 shows a schematic diagram of a processing data storage order according to an embodiment of the present disclosure.
Fig. 7 illustrates a schematic diagram of a weight data storage order according to an embodiment of the present disclosure.
Fig. 8 is a schematic diagram illustrating a storage order principle of a convolution operation process according to an embodiment of the present disclosure.
Fig. 9 illustrates a block diagram of an electronic device according to an embodiment of the present disclosure.
FIG. 10 shows a block diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are some, but not all embodiments of the present disclosure. All other embodiments, which can be derived by a person skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the protection scope of the present disclosure.
It should be understood that the terms "first," "second," "third," and "fourth," etc. in the claims, description, and drawings of the present disclosure are used to distinguish between different objects and are not used to describe a particular order. The terms "comprises" and "comprising," when used in the specification and claims of this disclosure, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the disclosure herein is for the purpose of describing particular embodiments only, and is not intended to be limiting of the disclosure. As used in the specification and claims of this disclosure, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the specification and claims of this disclosure refers to and includes any and all possible combinations of one or more of the associated listed items.
As used in this specification and claims, the term "if" may be interpreted contextually as "when", "upon", "in response to determining", or "in response to detecting". Similarly, the phrase "if it is determined" or "if a [described condition or event] is detected" may be interpreted contextually to mean "upon determining", "in response to determining", "upon detecting [the described condition or event]", or "in response to detecting [the described condition or event]".
It should also be understood that a tensor, which is the container in which data is stored, can be thought of as a multidimensional array. Image data, as well as other perceptual data (e.g., audio, video, etc.), can be represented as multidimensional tensors and can be stored in memory in binary form. To facilitate understanding of the technical solution of the present disclosure, the processing data is exemplified by image data hereinafter. The image data used in this description is for the purpose of describing particular embodiments only and is not intended to limit the present disclosure. The present disclosure is applicable to processing data, including video, audio, images, etc., that may be stored in binary form in memory.
FIG. 1 shows a schematic diagram of a compute core, according to an embodiment of the present disclosure. The data storage method according to the embodiment of the disclosure is applied to a computing core of a processor, and the processor comprises a plurality of computing cores.
As shown in fig. 1, each compute core includes a processing component and a storage component. The processing component may include a dendrite unit, an axon unit, a soma unit, and a routing unit. The storage component may include two or more storage units; the storage units may be configured to store the processing data and the weight data, the space in a storage unit where the processing data is stored may be referred to as the processing data space, and the space where the weight data is stored may be referred to as the weight data space.
In a possible implementation manner, the processor may be a brain-like computing chip, that is, with reference to a processing mode of the brain, by simulating transmission and processing of information by neurons in the brain, processing efficiency is improved and power consumption is reduced. The processor may include multiple computing cores that may independently handle different tasks, such as: a convolution operation task, a pooling task or a full connection task and the like; the same task may also be processed in parallel, i.e., each compute core may process different portions of the same task assigned, for example: partial layers of convolution operation tasks in the multilayer convolution neural network operation task and the like. It should be noted that the present disclosure does not limit the number of computing cores in a chip and the tasks executed by the computing cores.
Within the computing core, a processing component and a storage component may be provided. The processing component may comprise a dendrite unit, an axon unit, a soma unit and a routing unit. The processing component can simulate the way neurons of the brain process information: the dendrite unit is used for receiving signals, the axon unit is used for sending spike signals, the soma unit is used for integration and transformation of the signals, and the routing unit is used for information transmission with other computing cores. The processing component in the computing core can perform read/write access on the plurality of storage units of the storage component, so as to exchange data with the storage component inside the computing core, and the units can undertake their respective data processing and/or data transmission tasks to obtain data processing results, or communicate with other computing cores. The present disclosure does not limit the field of application of the processing component.
In one possible implementation, the storage component may include two or more storage units, and each storage unit may be a static random-access memory (SRAM). For example, the storage units may be SRAMs with a read/write width of 32B and a capacity of 64KB. The present disclosure does not limit the read/write width and capacity of the storage units.
In one possible implementation, the storage unit includes a processing data space and a weight data space. The processing data space may be used to store dynamic data, i.e., data that may change during operation, data that needs to be input or output during operation, or data that will be changed by an associated operation. The weight data space may be used to store static data, i.e., data used as control or reference during program execution; static data may not change while the program runs, or at least not change over a long period. For example, for a convolutional neural network structure, the static data may be the weight data of the convolution operations, and the dynamic data may be the processing data, for example convolutional-layer input data (such as convolutional-layer input data processed through a shortcut), which can be erased and overwritten as the processing data changes: in the layer-by-layer iterative computation of the convolutional neural network, the processing data is continuously modified according to the timing requirements, and old processing data is overwritten with new processing data.
According to the computing core disclosed by the embodiment of the disclosure, the processing component and the storage component can be arranged in the computing core, so that the storage component directly receives read-write access of the processing component, the storage component outside the computing core does not need to be read and written, the memory read-write speed is optimized, and the method is suitable for the processing component of a many-core architecture.
In a possible implementation manner, the method may be used to implement storage and transmission of data in a multi-layer convolution operation process, where the data in the multi-layer convolution operation process includes processing data and weight data in each layer of convolution operation process.
For example, when a neural network algorithm is applied to perform target recognition on an image, a multilayer convolution operation is performed on the input image processing data. The basic data structure of a neural network is the layer; the various neural networks, including the deep neural network (DNN), the convolutional neural network (CNN), the recurrent neural network (RNN), the deep residual network (ResNet), and so on, are all neural networks formed by organically combining multiple layers. A layer can be understood as a data processing module; different data processing types require different layers, and different layers have different layer attribute states, that is, the weights of the layers. One or more inputs can be converted into one or more outputs under the action of the weights of the different layers.
If a neural network is a relatively small network, a computational core can satisfy the resources required for processing each layer of the neural network, and the processing data and the weight data of the neural network can be stored in the computational core.
If the neural network is a large neural network, the neural network has a plurality of layers, each layer needs to calculate a large amount of data, and one calculation core cannot meet the requirement of processing resources required by the neural network. For example, assuming that the neural network includes 7 convolutional layers L1-L7, the processor may assign the processing data and weight data of convolutional layers L1 and L2 to compute core a, assign the processing data and weight data of convolutional layers L3 and L4 to compute core B, and assign the processing data and weight data of convolutional layers L5-L7 to compute core C. It should be noted that the present disclosure does not limit the type of neural network in which the computing cores operate, and how the neural network is split and allocated to different computing cores.
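As a rough illustration of the splitting described above, the following Python sketch assigns the seven convolutional layers of the example to the three computing cores; the layer labels and core names come from the example, while the mapping structure itself is only an assumption for illustration (in practice the processor's mapping tools perform this assignment).

```python
# Illustrative sketch only: a static mapping of convolutional layers to computing
# cores, following the L1-L7 / core A-B-C example above. The real mapping is done
# by the processor's mapping tools and is not specified here.
layer_to_core = {
    "L1": "A", "L2": "A",             # processing data and weights of L1, L2 -> core A
    "L3": "B", "L4": "B",             # L3, L4 -> core B
    "L5": "C", "L6": "C", "L7": "C",  # L5-L7 -> core C
}

def core_for(layer: str) -> str:
    """Return the computing core that stores the weight and processing data of a layer."""
    return layer_to_core[layer]

print(core_for("L6"))  # -> "C"
```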
In the related art, when storing data in the multi-layer convolution operation process in a storage unit, a processor stores processing data (dynamic data) and weight data (static data) in the multi-layer convolution operation process separately on different storage units. That is, the processor stores the weight data in one memory location and the process data in another memory location.
Fig. 2a shows a schematic diagram of data storage in the related art. As shown in fig. 2a, there are two storage units, the memory unit MEM0 and the memory unit MEM1; each has a storage capacity of 64KB and a read/write bit width of 32B. The weight data W of the multilayer convolution operation is stored in memory unit MEM0, and the processing data X is stored in memory unit MEM1. Layers L5, L6 and L7 denote the numbers of the corresponding convolutional layers of the multilayer convolutional neural network during operation. When performing the convolution operations of the respective layers, the operations are performed in the order L5, L6, L7, and so on.
The weight data of layers L5, L6 and L7 are stored in memory unit MEM0: the 8KB weight data W of layer L5, the 18KB weight data W of layer L6 and the 8KB weight data W of layer L7 are stored sequentially in the second address direction, leaving 30KB of contiguous dynamically available space. Here, the second address direction may be the low-address-to-high-address direction, and the contiguous dynamically available space is blank space in which no data is stored.
The processing data of layers L5, L6 and L7 are stored in the other memory unit, MEM1: the 28KB processing data X of layer L5, the 15KB processing data X of layer L6 and the 15KB processing data X of layer L7 are stored sequentially in the second address direction, leaving 6KB of contiguous dynamically available space.
When the convolution operations of layers L5, L6 and L7 assigned to this node are completed, the results need to be sent to other computing cores, while the data sent by the computing core corresponding to the previous node is received at the same time. Because the data are stored in positive order along the second address direction, the computing core cannot, while computing layer L6 and outputting the L6 computation result as the input of layer L7, dynamically overwrite the data in the space occupied by layer L6.
Therefore, arranging the processing data and the weight data in their respective memory units in positive order, according to the convolutional-layer processing order and the second address direction, leads to low reuse of dynamic data and little remaining contiguous space; data of the same layer cannot be dynamically overwritten while computing, which constrains the storage space available for subsequent computation, and the continual data movement during storage increases computation time.
In view of the above-mentioned problem of data storage in the related art as shown in fig. 2a, fig. 3 shows a flowchart of a data storage method according to an embodiment of the present disclosure. The method as shown in fig. 3 may comprise the steps of:
in step S1, the processing unit writes the weight data and the processing data of each convolutional layer into each storage unit in sequence according to the convolutional layer processing sequence in the multi-layer convolutional calculation process, and stores the weight data and the processing data.
In step S2, each storage unit receives and stores the weight data and the processing data.
In one possible implementation manner, in the same storage unit, the weight data space in which the weight data is stored and the processing data space in which the processing data is stored are sequentially arranged in the first address direction. The first address direction may be a high address to low address direction herein, the first address direction being opposite the second address direction.
For example, fig. 2b shows a schematic diagram of data storage according to an embodiment of the present disclosure. As shown in fig. 2b, assume the storage component includes two storage units: memory unit MEM0 and memory unit MEM1. Each memory unit may be an SRAM with a read/write width of 32B and a capacity of 64KB. Layers L5, L6 and L7 denote the numbers of the corresponding convolutional layers of the multilayer convolutional neural network during operation.
As shown in fig. 2b, in the memory unit MEM0, a weight data space storing 8KB of static data W, a processing data space storing 30KB (15KB +15KB) of dynamic data X, and the remaining 40KB of continuous dynamic available space are sequentially arranged according to the first address direction.
As shown in fig. 2b, in the memory unit MEM1, a weight data space storing static data W of 26KB (18KB +8KB), a processing data space storing dynamic data X of 28KB, and the remaining 10KB of continuous dynamic available space are sequentially arranged in the first address direction.
Each storage unit includes a processing data space that can store dynamic data and a weight data space that can store static data. Arranging the weight data space storing the weight data and the processing data space storing the processing data sequentially along the first address direction increases the remaining contiguous dynamically available space of the storage unit.
In one possible implementation, in the storage unit, the processing data and the weight data of the same convolutional layer are stored in different storage units; the processing data and the weight data of different convolution layers can be stored in the same storage unit.
For example, as shown in fig. 2b, the processing unit sends the weight data W of the L5 layer into the weight data space of the memory cell MEM0, and sends the processing data X of the L5 layer into the processing data space of the memory cell MEM 1; sending the weight data W of the L6 layer into the weight data space of the memory cell MEM1, and sending the processing data X of the L6 layer into the processing data space of the memory cell MEM 0; the weight data W of the L7 layer is sent to the weight data space of the memory cell MEM1, and the processed data X of the L7 layer is sent to the processed data space of the memory cell MEM 0.
Because the processing data and the weight data of the same convolution layer are stored in different storage units, the processing unit can access a plurality of storage units in parallel to read the processing data and the weight data, and the computing efficiency of the computing core is improved.
In one possible implementation, the weight data spaces storing the weight data of each convolutional layer are sequentially arranged along the first address direction according to the convolutional layer processing sequence; the processing data spaces storing the processing data of the respective convolutional layers are sequentially arranged in the first address direction in accordance with the convolutional layer processing order.
For example, as shown in fig. 2b, in the memory unit MEM0, the weight data W of the L5 layer and the processing data X of the L6 layer and the L7 layer may sequentially store weight data W of the L5 layer of 8KB size, processing data X of the L6 layer of 15KB size, processing data X of the L7 layer of 15KB size, and the remaining continuous dynamically available space 40KB along the first address direction.
The weight data W of the L6 layer, the L7 layer, and the processing data X of the L5 layer in the memory unit MEM1 may sequentially store weight data W of the L6 layer of 18KB size, weight data W of the L7 layer of 8KB size, processing data X of the L5 layer of 28KB size, and the remaining continuous dynamic available space 10KB in the first address direction.
Comparing the remaining contiguous dynamically available space of 36KB (MEM0:30KB + MEM1:6KB) for memory cell MEM0 and memory cell MEM1 in the related art of FIG. 2a with the remaining contiguous dynamically available space of 50KB (MEM0:40KB + MEM1:10KB) for memory cell MEM0 and memory cell MEM1 in FIG. 2b, the method used in the embodiments of the present disclosure can increase the contiguous dynamically available space of memory cells.
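To make the arrangement concrete, the following Python sketch lays out the Fig. 2b example; the block sizes and unit assignments follow the figure, while the allocator itself and the resulting addresses are assumptions for illustration only.

```python
# Illustrative sketch (not from the patent text): lay out the Fig. 2b example in two
# 64KB storage units. Spaces are stacked from the high-address end downward (first
# address direction); within each space, data would then be filled from its low
# address upward (second address direction). Sizes and unit assignments follow
# Fig. 2b; the concrete addresses are assumptions for illustration only.
KB = 1024
UNIT_SIZE = 64 * KB

# (unit, kind, layer, size); within each unit the weight spaces come before the
# processing spaces, each in convolutional-layer processing order, per Fig. 2b
blocks = [
    ("MEM0", "W", "L5", 8 * KB),  ("MEM1", "W", "L6", 18 * KB), ("MEM1", "W", "L7", 8 * KB),
    ("MEM1", "X", "L5", 28 * KB), ("MEM0", "X", "L6", 15 * KB), ("MEM0", "X", "L7", 15 * KB),
]

def layout(blocks):
    top = {"MEM0": UNIT_SIZE, "MEM1": UNIT_SIZE}   # next free high address per unit
    placed = []
    for unit, kind, layer, size in blocks:
        start = top[unit] - size                   # each space is allocated downward
        top[unit] = start
        placed.append((unit, kind, layer, start, start + size))  # data fills start -> end
    return placed

for unit, kind, layer, start, end in layout(blocks):
    print(f"{unit}  {layer} {kind}: 0x{start:05X} - 0x{end:05X}")
```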
In a possible implementation manner, the processing unit sends operation result data of each convolution layer in the multilayer convolution operation process to each storage unit for storage, and each storage unit receives and stores the operation result data. The operation result data of any convolution layer and the processing data of the convolution layer are stored in the same storage unit, and share a storage space with the processing data of the next convolution layer.
For example, in the related art shown in fig. 2a, the processing data X of layer L6 cannot be dynamically overwritten while it is being computed: while the L6 operation result data is output as the processing data X of layer L7, the data in the space occupied by layer L6 cannot be overwritten. As shown in fig. 2b, by contrast, the processing component may dynamically overwrite the L6 processing data X in memory unit MEM0 while computing: the convolution operation result data of layer L6 serves as the processing data X of layer L7 and may directly overwrite the L6 processing data X whose processing has been completed. That is, the convolution operation result data of layer L6 can share a storage space with the processing data X of the adjacent layer L7, reusing a region of the storage unit and freeing sufficient space for subsequent computation.
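A minimal sketch of this in-place reuse, under the assumption that one layer's operation result simply replaces the input region it was computed from, might look like:

```python
# Illustrative sketch (an assumption-level model, not the chip's actual dataflow):
# the L6 operation result is written back over the L6 processing data it was computed
# from, and that same region is then read as the L7 processing data, so no extra
# region has to be reserved for the L7 input.
def conv_layer(processing_data, weights):
    # stand-in for the real convolution of one layer
    return [w * x for w, x in zip(weights, processing_data)]

mem0_x_region = [1, 2, 3, 4]       # region of MEM0 currently holding L6 processing data X
l6_weights = [10, 10, 10, 10]      # L6 weight data W (stored in MEM1 in Fig. 2b)

l6_result = conv_layer(mem0_x_region, l6_weights)
mem0_x_region[:] = l6_result       # dynamically overwrite the finished L6 data in place

l7_input = mem0_x_region           # the same region now serves as the L7 processing data
print(l7_input)                    # [10, 20, 30, 40]
```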
Therefore, by providing a weight data space and a processing data space in each storage unit, and sending the processing data and the weight data of the same layer to different storage units for storage, the method can store the weight data and the processing data in the weight data space and the processing data space, respectively, along the first address direction according to the convolutional-layer processing order. Data of adjacent layers can then be dynamically overwritten while computing, increasing the remaining contiguous dynamically available space and freeing a larger data space for subsequent computation, which improves the spatio-temporal operating efficiency of the computing core and further improves the performance of the chip.
In the related art, when a computing core performs multilayer convolution operations and stores the processing data, once the convolution data of one layer has been fully computed and computation moves to the next layer, the data that this computing core has finished computing needs to be sent to other computing cores for storage, while data of the previous layer, whose convolution has been completed elsewhere, is received at the same time. The data being received and the data being transmitted often reside in the same piece of storage space, and their address ranges may overlap.
Fig. 4a is a schematic diagram illustrating data transmission of a memory unit in the related art. As shown in fig. 4a, memory unit MEM1 has 28KB of data to receive and 28KB of data to transmit, and the address range of the data to be transmitted overlaps the address range of the data to be received by 14KB, namely the data space of memory unit MEM1 at addresses 0x5400~0x6200.
If the data space to be transmitted and the data space to be received in the storage unit are handled in positive order, that is, if in the multiple read/write operations of the processing component on the storage unit the start address of each read/write operation is arranged along the second address direction, then the 12KB of first data received by memory unit MEM1 in the first write (corresponding to addresses 0x5400~0x6000 of MEM1) will flush and overwrite the second data that is to be sent in the second read (corresponding to addresses 0x5200~0x5E00 of MEM1); the flushed portion corresponds to addresses 0x5400~0x5E00 of MEM1. As a result, the second data subsequently sent is erroneous data.
The first data may be data to be received by the memory cell occurring in the write operation performed on the memory cell; the second data may be data to be transmitted for the memory cell that occurs in performing a read operation on the memory cell.
If the data in the region to be sent is first moved up by 14KB, the region to be sent and the region to be received no longer overlap, and the data can then be sent sequentially along the second address direction; but this adds a data-moving step, which increases transmission delay and computation clocks, increases the operating load of the chip, and reduces the operating efficiency of the chip.
In one possible implementation, the processing component may write first data received from outside the computing core to a storage unit and read second data from the storage unit to send outside the computing core.
In the multiple write operations, the start address of each write operation is arranged according to the first address direction, and first data is written in each write operation according to the second address direction.
In the multiple read operations, the start address of each read operation is arranged according to the first address direction, and the second data is read according to the second address direction in each read operation.
For example, fig. 4b shows a data transmission diagram according to an embodiment of the present disclosure. As shown in fig. 4b, in memory unit MEM1 the start address of the 12KB second data to be transmitted the first time is 0x5600, while the start address of the 12KB first data to be received the first time is 0x6400. The start address of the 12KB second data to be transmitted the second time is 0x4A00, while the start address of the 12KB first data to be received the second time is 0x5800. The start address of the 4KB second data to be transmitted the third time is 0x4600, while the start address of the 4KB first data to be received the third time is 0x5400.
As shown in fig. 4b, the starting addresses of the three read operations in memory cell MEM1 are: 0x5600, 0x4a00, 0x4600, the start addresses of the three write operations are: 0x6400, 0x5800, 0x5400, that is, the starting address of each read-write operation is arranged according to the first address direction. And in each read-write operation, the first data or the second data can be read and written according to the direction of the second address and the initial address. It should be understood that the processing unit may perform multiple read/write operations on the storage unit, and the present disclosure does not limit the data capacity size and the specific number of read/write operations of each read/write operation of the processing unit.
In the storage unit, the space occupied by data that has already been transmitted can be reused at the next moment. For example, when memory unit MEM1 receives the 12KB first data for the second time, it needs to occupy the data space at addresses 0x5800~0x6400; because the processing component has already read MEM1 and taken away the second data the first time, releasing the data space at addresses 0x5600~0x6200, the overlapping space at addresses 0x5800~0x6200 is not flushed or overwritten. Similarly, when memory unit MEM1 receives the 4KB first data for the third time, it needs to occupy the data space at addresses 0x5400~0x5800; since the first read has already released the space at addresses 0x5600~0x6200 and the second read has released the space at addresses 0x4A00~0x5600, the two reads together have released the space at addresses 0x4A00~0x6200, so the data space at addresses 0x5400~0x5800 needed to receive the first data for the third time is not flushed or overwritten.
Therefore, in the data transmission process shown in fig. 4b, even when the address range of the data to be transmitted and the address range of the data to be received in the storage unit overlap, the multiple read and write operations performed by the processing component in the same period do not flush or overwrite the data in the overlapping region of the storage unit, and no extra time is spent moving data around to avoid such flushing; the routing transmission delay is thus eliminated, which improves the spatio-temporal operating efficiency of the chip and improves chip performance.
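The transfer order of fig. 4b can be checked with a small simulation; the start addresses and chunk lengths below are read from the figure and treated as abstract address units, and the checking logic itself is an assumption added for illustration.

```python
# Illustrative simulation of the Fig. 4b transfer order (an assumption-level model):
# read (send) and write (receive) chunks alternate, the start addresses of successive
# operations move from high to low (first address direction), and each chunk itself is
# accessed from low to high (second address direction).
reads  = [(0x5600, 0xC00), (0x4A00, 0xC00), (0x4600, 0x400)]   # second data to be sent
writes = [(0x6400, 0xC00), (0x5800, 0xC00), (0x5400, 0x400)]   # first data to be received

send_region = {a for start, length in reads for a in range(start, start + length)}
already_read = set()

for (r_start, r_len), (w_start, w_len) in zip(reads, writes):
    already_read |= set(range(r_start, r_start + r_len))   # the read happens first
    incoming = set(range(w_start, w_start + w_len))         # then the write lands
    clobbered = (incoming & send_region) - already_read     # unsent data overwritten?
    assert not clobbered, f"would flush unsent data at {sorted(clobbered)[:3]} ..."

print("no unsent data in the overlap region is overwritten")
```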
In one possible implementation manner, in a weight data space storing weight data of the same convolutional layer and a processing data space storing processing data of the same convolutional layer, the weight data and the processing data are respectively and sequentially arranged according to a second address direction, and the first address direction is opposite to the second address direction.
Some examples of ordering the weight data and the processing data are given below, and for each convolutional layer, the ordered weight data and processing data of the layer may be arranged in order of the second address direction in the weight data space of the layer according to the ordering (e.g., the order of sequence numbers 0, 1,2, … …), and in order of the second address direction in the processing data space of the layer.
Fig. 5 shows a flow diagram according to an embodiment of the present disclosure. As shown in fig. 5, the method for storing data in each layer of convolution operation process in the multilayer convolution operation process may include the following steps:
at step S31, the processing section determines the processing data and weight data storage order within each convolution layer.
In one possible implementation manner, for the processing data in each layer of convolution operation process in the multilayer convolution operation process, the processing unit determines that the storage sequence of the processing data in each convolution layer is the sequence of the depth direction, the transverse direction and the longitudinal direction.
FIG. 6 shows a schematic diagram of a processing data storage order according to an embodiment of the present disclosure. Assuming the bit width of the storage unit of the computing core is 32B, the in-layer processing data may be the input whole image data (512 pixels) split into 32 consecutive frames of 4 × 4 pixels each. As shown in fig. 6, the left cuboid in fig. 6 may represent the in-layer processing data; there are 32 layers in the depth direction (z-axis direction), each layer corresponding to one 4 × 4 pixel frame.
The processing unit may determine the storage order of the processing data in each convolution layer, that is, the order in which the respective cubes in the left-side rectangular solid of fig. 6 are sent to the storage unit. The processing component can mark each small cube in the cuboid on the left side of the figure 6 according to the depth direction (z-axis direction) and the transverse direction (x-axis direction) and the longitudinal direction (y-axis direction) of the cuboid on the left side of the figure 6, and the corresponding relation between the marked serial number and the coordinate (x, y, z) is as follows:
n(x, y, z) = z + M × x + M × J × y (x = 0, 1, …, J - 1; y = 0, 1, …, I - 1; z = 0, 1, …, M - 1, where J and I are the numbers of pixels of the processing data in the transverse and longitudinal directions, J = I = 4 in this example)
where M is the depth of the processed data, i.e., the z-direction dimension.
Therefore, the sequence numbers corresponding to the first frame image are:
[0, 32, 64, 96; 128, 160, 192, 224; 256, 288, 320, 352; 384, 416, 448, 480]
The sequence numbers corresponding to the 2nd frame image are:
[1, 33, 65, 97; 129, 161, 193, 225; 257, 289, 321, 353; 385, 417, 449, 481]
By analogy, the sequence numbers corresponding to the 32nd frame image are:
[31, 63, 95, 127; 159, 191, 223, 255; 287, 319, 351, 383; 415, 447, 479, 511]
the order of the numbers of the respective small cubes in the left-side rectangular parallelepiped in fig. 6 represents the processing data storage order determined by the processing section.
In a possible implementation manner, each layer of convolution operation process uses a plurality of convolution kernel groups, each convolution kernel group includes a plurality of convolution kernels, in the storage unit, the weight data of each convolution kernel group in each layer of convolution operation is stored in sequence, and the storage sequence of the weight data of each convolution kernel group is stored according to the numbering sequence of the convolution kernels, the depth direction of the convolution kernels, the transverse direction and the longitudinal direction.
For example, for the weight data in each layer of convolution operation process in the multilayer convolution operation process, the processing component may number and group the convolution kernels, and then determine a group of storage sequences according to the sequence of the numbers of the convolution kernels in each group, the depth direction of the convolution kernels, the transverse direction of the convolution kernels, and the longitudinal direction of the convolution kernels.
Wherein the weight data comprises a plurality of sets of convolution kernels, each set comprising a plurality of convolution kernels. The processing unit may group the convolution kernels in the weight data according to the depth of the processing data, and the number of the convolution kernels in each group may be the same as the depth of the processing data.
Fig. 7 illustrates a schematic diagram of a weight data storage order according to an embodiment of the present disclosure. As shown in FIG. 7, each cuboid on the left side of the figure represents a convolution kernel, and the 64 convolution kernels in the figure can first be numbered W0, W1, …, W63. Corresponding to the processing data depth (M = 32) shown in fig. 6, the convolution kernels may be grouped by 32, i.e. 32 convolution kernels per group, giving 2 groups (N = 2): W0, W1, …, W31 form the first group and W32, W33, …, W63 form the second group. The number of groups N may represent the depth of the weight data.
The processing component may determine a storage order of the weight data in each convolution layer, that is, determine a sequence in which each cube in each cuboid on the left side of fig. 7 is sent to the storage component.
The processing section may divide the cuboids on the left side of fig. 7 into N groups (N = 2), and determine a storage order in which the weight data are stored group by group.
For the storage sequence of each group of convolution kernels, the labels of the small cubes in each cuboid on the left side of fig. 7 can be given according to the numbering sequence of the convolution kernels, and then according to the depth direction (z-axis direction) of the convolution kernels, the transverse direction (x-axis direction) of the convolution kernels and the longitudinal direction (y-axis direction) of the convolution kernels. And the sequence direction of the convolution kernel numbers corresponds to the depth direction of the processed data.
The corresponding relation between the marked serial numbers and the coordinates (x, y, z) is as follows:
serial number(n, m, x, y) = (n - 1) × M × kx × ky + m + M × x + M × kx × y
where n = 1, 2, …, N is the group number, m = 0, 1, …, M - 1 is the number of the convolution kernel within its group, (x, y) is the position within the kernel, and kx × ky is the size of one convolution kernel (kx = ky = 4 in this example).
As can be seen from the correspondence between the serial numbers and the coordinates (x, y, z), for the weight data of group N = 1, the sequence numbers corresponding to convolution kernel W0 are: [0, 32, 64, 96, 128, 160, 192, 224, 256, 288, 320, 352, 384, 416, 448, 480]; the sequence numbers corresponding to convolution kernel W1 are: [1, 33, 65, 97, 129, 161, 193, 225, 257, 289, 321, 353, 385, 417, 449, 481]; and by analogy, the sequence numbers corresponding to convolution kernel W31 are: [31, 63, 95, 127, 159, 191, 223, 255, 287, 319, 351, 383, 415, 447, 479, 511].
Similarly, for the weight data of group N = 2, the sequence numbers corresponding to convolution kernel W32 are: [512, 544, 576, 608, 640, 672, 704, 736, 768, 800, 832, 864, 896, 928, 960, 992]; the sequence numbers corresponding to convolution kernel W33 are: [513, 545, 577, 609, 641, 673, 705, 737, 769, 801, 833, 865, 897, 929, 961, 993]; and by analogy, the sequence numbers corresponding to convolution kernel W63 are: [543, 575, 607, 639, 671, 703, 735, 767, 799, 831, 863, 895, 927, 959, 991, 1023].
The order of the numbers of the respective small cubes in the respective cuboids on the left side of fig. 7 may represent the storage order of the weight value data determined by the processing section.
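Likewise, the Fig. 7 weight ordering can be regenerated with a small assumed helper, which follows the numbering order within a group and then the kernel's transverse and longitudinal directions:

```python
# Illustrative sketch (assumed helper): storage order of the Fig. 7 weight data.
# 64 kernels of size 4 x 4 are numbered W0..W63 and split into N = 2 groups of M = 32;
# within a group the kernel number varies fastest (matching the processing-data depth),
# then the kernel's transverse direction, then its longitudinal direction.
KX, KY, M, N = 4, 4, 32, 2

def weight_serial(group: int, m: int, x: int, y: int) -> int:
    # group in [0, N), m = kernel index within the group, (x, y) = position in the kernel
    return group * (M * KX * KY) + m + M * x + M * KX * y

w0 = [weight_serial(0, 0, x, y) for y in range(KY) for x in range(KX)]
w32 = [weight_serial(1, 0, x, y) for y in range(KY) for x in range(KX)]
print(w0)    # [0, 32, 64, ..., 480]
print(w32)   # [512, 544, 576, ..., 992]
assert weight_serial(1, 31, 3, 3) == 1023   # last element of W63
```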
Step S32, the processing unit sends the processing data and the weight data to a storage unit according to the processing data and the weight data storage order.
In one possible implementation manner, the processing unit sends the processing data to the processing data space of the storage unit in the storage unit according to the storage sequence of the processing data and according to the second address direction.
For example, as shown in fig. 6, the serial numbers marked on the cubes in the figure may correspond to storage addresses of the storage units, and the processing unit may send the pixel values corresponding to the serial numbers marked on the cubes into the storage units by accessing the addresses corresponding to the storage units. The processing component can send the pixel value corresponding to the cube with the serial number 0 into the space of the address 0x0000 in the storage unit by accessing the address 0x0000 in the storage unit corresponding to the cube with the serial number 0; the processing component can send the pixel value corresponding to the cube with the sequence number 1 into the space of the address 0x0001 in the storage unit by accessing the address 0x0001 in the storage unit corresponding to the cube with the sequence number 1; by analogy, the processing unit may send the pixel value corresponding to the sequence number 511 cube into the address 0x01FF space in the storage unit by accessing the address 0x01FF in the storage unit corresponding to the sequence number 511 cube.
As shown in fig. 6, the processing unit sends the processing data to the storage unit in the depth direction, the transverse direction, and the longitudinal direction according to the sequence of the marked serial numbers.
In a possible implementation manner, the processing unit sends the weight data to a weight data space of a storage unit in the storage unit according to a second address direction according to a storage sequence of the weight data.
For example, as shown in fig. 7, the serial numbers marked on the cubes in the drawing may correspond to storage addresses of the storage units, and the processing component may send the weight data corresponding to the serial numbers marked on the cubes into the storage units by accessing the addresses corresponding to the storage units. The processing component can send the weight value corresponding to the sequence number 0 cube into the space of the address 0x0000 in the storage unit by accessing the address 0x0000 in the storage unit corresponding to the sequence number 0 cube; the processing component can send the weight value corresponding to the cube with the sequence number 1 into the space of the address 0x0001 in the storage unit by accessing the address 0x0001 in the storage unit corresponding to the cube with the sequence number 1; by analogy, the processing unit may send the weight value corresponding to the sequence number 1023 cube into the address 0x03FF space in the storage unit by accessing the address 0x03FF in the storage unit corresponding to the sequence number 1023 cube.
As shown in fig. 7, the processing component sends the weight data to the storage unit in a group sequence according to the sequence number, the depth direction of the convolution kernel, the transverse direction of the convolution kernel, and the longitudinal direction of the convolution kernel.
In step S33, the storage unit receives and stores the processing data and the weight data.
The weight data may be stored first and the processing data stored afterwards; the weight data may be stored in the weight data space of one storage unit, and the processing data in the processing data space of another storage unit.
In a possible implementation manner, in the storage unit, the storage order of the processing data and the weight data of each convolutional layer follows the convolution operation order: first the depth direction, then the transverse direction, then the longitudinal direction.
Fig. 8 is a schematic diagram illustrating the storage-order principle of the convolution operation process according to an embodiment of the present disclosure. As shown in fig. 8, assuming the processing data is image data, it can be represented by a three-dimensional tensor X[i, j, m] as follows:
X[i, j, m], i = 1, 2, …, I, j = 1, 2, …, J, m = 1, 2, …, M
where I denotes that the image data X[i, j, m] has I pixels in the longitudinal dimension, J denotes that it has J pixels in the lateral dimension, and M denotes that it has M pixels in the depth dimension; the size of the processing data is I × J × M.
The image data with depth M may be M frames of a continuous image sequence converted from a video, M sub-images of the same size split from a whole image, or image data with a channel depth of M, which is not limited in the present disclosure.
For example, suppose the M pieces of processing data are subjected to a sliding-window convolution operation with N groups of 3 × 3 fixed window sizes. If the N groups of 3 × 3 windows are applied to one picture at a time, then each time one piece of picture data is processed the storage unit must be accessed once to read the convolution kernels, and this read must be repeated M times, which results in frequent access to the storage unit. Therefore, a three-dimensional sliding-window convolution operation can be performed on the processing data along the depth direction, so that the convolution kernels (that is, the weight data) read in a single access to the storage unit can perform the sliding-window convolution on the M input images at the same time, thereby realizing parallel operation on the input images.
As shown in FIG. 8, assume that there are M pictures in the depth direction. The sliding window C[kx, ky, m] can slide in the image data along the depth direction, the transverse direction, and the longitudinal direction, and the sliding manner of the window is not limited in this application.
The weight corresponding to the image data C[kx, ky, m] taken by the sliding window is K[kx, ky, m, n]. In FIG. 8, the weight data K[kx, ky, m, n] comprise N groups, each group containing M convolution kernels corresponding to the processing data of depth M. The image data C[kx, ky, m] taken by the sliding window can be multiplied and accumulated with each group of convolution kernels K[kx, ky, m, n] (n = 1, 2, …, N) in turn, according to the number N of convolution weight groups.
Because the convolution operation on the image data X[i, j, m] is a multiply-accumulate operation, the order of the convolution operation can be adjusted according to the commutative and associative laws of multiplication and addition without changing the result of the convolution. Therefore, in order to reduce the number of accesses to the storage unit, the order of the convolution operation can be adjusted so that the operation is performed first along the depth direction M, and then along the transverse direction J and the longitudinal direction I.
The corresponding formula of the operation process is as follows:
Y[n] = Σ_ky Σ_kx Σ_(m=1…M) C[kx, ky, m] × K[kx, ky, m, n],  n = 1, 2, …, N
where C[kx, ky, m] represents the image data taken by the sliding window, i.e. the input image data corresponding to the convolution kernel size, and K[kx, ky, m, n] represents the convolution weight values; kx × ky is the size of one convolution kernel, and the convolution weights are divided into N groups along the depth direction M, with M convolution kernels per group. The convolution process first performs the multiply-add operation along the depth direction M, and then completes the sliding-window convolution over the remaining data along the transverse direction J and the longitudinal direction I according to the kernel size kx × ky.
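As a minimal sketch of the ordering described above (not the chip's implementation), the loop nest below performs the multiply-add with the depth index innermost; the array shapes, argument names, unit stride, and absence of padding are assumptions for illustration.

import numpy as np

# Sketch (illustrative only): sliding-window convolution with the multiply-add
# performed along the depth direction first, as described above.
# X has shape (I, J, M): longitudinal, transverse, depth.
# K has shape (KY, KX, M, N): N groups of kernels of size kx * ky and depth M.
def conv_depth_first(X, K):
    I, J, M = X.shape
    KY, KX, _, N = K.shape
    Y = np.zeros((I - KY + 1, J - KX + 1, N))
    for i in range(I - KY + 1):              # slide along the longitudinal direction
        for j in range(J - KX + 1):          # slide along the transverse direction
            C = X[i:i + KY, j:j + KX, :]     # image data taken by the window
            for n in range(N):               # one output per convolution kernel group
                acc = 0.0
                for ky in range(KY):
                    for kx in range(KX):
                        for m in range(M):   # innermost loop: depth direction
                            acc += C[ky, kx, m] * K[ky, kx, m, n]
                Y[i, j, n] = acc
    return Y

# Example with assumed sizes: an 8 x 8 x 16 input and 4 groups of 3 x 3 kernels.
X = np.random.rand(8, 8, 16)
K = np.random.rand(3, 3, 16, 4)
print(conv_depth_first(X, K).shape)          # (6, 6, 4)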
Performing the convolution operation in the order of depth first, then the transverse direction, and then the longitudinal direction helps the processing component realize parallel processing of data and improves the computation efficiency of the chip. For the processing data and the weight data involved in the convolution operation, the processing component can store the processing data in the order of the depth direction, the transverse direction, and the longitudinal direction, and store the weight data group by group in the order of the convolution kernel serial numbers, the depth direction of the convolution kernel, the transverse direction of the convolution kernel, and the longitudinal direction of the convolution kernel. In this way, the data stored first can be read first to participate in the operation, the storage order matches the computation order in the depth, transverse, and longitudinal directions, the degree of fit between the storage process and the convolution operation process is improved, and the computation efficiency is improved.
The present disclosure also provides a computing core. FIG. 1 illustrates an example of a computing core that includes a processing component and a storage component.
In a possible implementation manner, the processing component writes the weight data and the processing data of each convolution layer into each storage unit in sequence for storage, according to the convolution layer processing order in the multilayer convolution operation process.
The storage component comprises more than two storage units, and each storage unit receives and stores the weight data and the processing data.
Wherein, in the storage component, the processing data and the weight data of the same convolution layer are stored in different storage units; in the same storage unit, the weight data space in which the weight data are stored and the processing data space in which the processing data are stored are sequentially arranged in the first address direction.
Wherein, the weight data space for storing the weight data of each convolution layer is sequentially arranged along the first address direction according to the processing sequence of the convolution layers; the processing data spaces storing the processing data of the respective convolutional layers are sequentially arranged in the first address direction in accordance with the convolutional layer processing order.
And in a weight data space for storing weight data of the same convolutional layer and a processing data space for storing processing data of the same convolutional layer, the weight data and the processing data are respectively and sequentially arranged according to a second address direction, and the first address direction is opposite to the second address direction.
In one possible implementation, the first address direction is a high address to low address direction and the second address direction is a low address to high address direction.
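A minimal sketch of this opposite-direction arrangement follows, assuming a single storage unit whose addresses run from 0x0000 to 0x03FF; the unit size, space sizes, and function name are assumptions. Per-layer spaces are placed from the high end downward (first address direction), while the contents of each space are then written from low to high addresses (second address direction).

# Sketch (illustrative): allocate per-layer spaces from the high end of an
# assumed 1 KiB storage unit downward (first address direction); the data
# inside each space is later written low -> high (second address direction).
UNIT_SIZE = 0x0400  # assumed unit size: addresses 0x0000 .. 0x03FF

def allocate_spaces(space_sizes, unit_size=UNIT_SIZE):
    """space_sizes: sizes of the spaces in convolution-layer processing order.
    Returns (start, end) address ranges; later spaces sit at lower addresses."""
    spaces = []
    top = unit_size
    for size in space_sizes:
        start = top - size           # spaces arranged from high to low addresses
        spaces.append((start, top - 1))
        top = start
    return spaces

# Example with assumed space sizes of 256, 128 and 64 addresses:
print(allocate_spaces([256, 128, 64]))
# [(768, 1023), (640, 767), (576, 639)]  i.e. starting at 0x0300, 0x0280, 0x0240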
In a possible implementation manner, the processing component is further configured to send the operation result data of each convolution layer in the multilayer convolution operation process to each storage unit for storage;
and each storage unit receives and stores the operation result data, wherein the operation result data of any convolution layer and the processing data of the convolution layer are stored in the same storage unit, and share a storage space with the processing data of the next convolution layer.
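One possible reading of this sharing rule is sketched below; the function and argument names are assumptions. The sketch only illustrates that the result of one layer serves directly as the processing data of the next layer, without an intermediate copy.

# Sketch (illustrative, under assumed names): the result space of layer k is
# reused as the processing data space of layer k + 1, so adjacent layers can
# be computed without copying data between buffers.
def run_network(input_data, layer_weights, convolve):
    """layer_weights: per-layer weight data (assumed to reside in the other
    storage unit); convolve: assumed callable (data, weights) -> result."""
    data = input_data                   # processing data of the first layer
    for weights in layer_weights:
        data = convolve(data, weights)  # result of layer k becomes the
    return data                         # processing data of layer k + 1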
In a possible implementation, the processing component is further configured to write first data received from outside the computing core to a storage unit, and to read second data from a storage unit to be sent outside the computing core;
in multiple write operations, the initial address of each write operation is arranged according to the first address direction, and first data is written in each write operation according to the second address direction;
in the multiple read operations, the start address of each read operation is arranged according to the first address direction, and the second data is read according to the second address direction in each read operation.
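A sketch of this input/output pattern is given below under assumptions (the memory model, buffer size, and record contents are illustrative): the start address of each successive write moves in the first (high-to-low) direction, while the data of each individual write advances in the second (low-to-high) direction; reads follow the same pattern.

# Sketch (illustrative): successive write operations start at progressively
# lower addresses (first address direction), while the data of each write is
# laid down from low to high addresses (second address direction).
memory = [0] * 0x0400   # assumed storage-unit model
write_top = len(memory) # the next write region ends just below this address

def write_block(data):
    global write_top
    start = write_top - len(data)        # start addresses arranged high -> low
    for offset, value in enumerate(data):
        memory[start + offset] = value   # data written low -> high within a block
    write_top = start
    return start

def read_block(start, length):
    return memory[start:start + length]  # data read low -> high within a block

a = write_block([1, 2, 3, 4])            # occupies 0x03FC .. 0x03FF
b = write_block([5, 6])                  # occupies 0x03FA .. 0x03FB
print(read_block(a, 4), read_block(b, 2))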
In a possible implementation manner, in the storage unit, the storage order of the processing data and the weight data for each convolution layer follows the convolution operation order of the depth direction first, then the transverse direction, and then the longitudinal direction.
In one possible implementation, each layer of convolution operation process uses a plurality of convolution kernel groups, and each convolution kernel group comprises a plurality of convolution kernels;
In the storage unit, the weight data of the convolution kernel groups used in each layer of the convolution operation are stored group by group, and within each group the weight data are stored in the order of the serial numbers of the convolution kernels, the depth direction of the convolution kernels, the transverse direction, and the longitudinal direction.
For the above embodiments related to the computing core, reference may be made to the above description of the data storage method, which is not repeated here.
In a possible implementation manner, an embodiment of the present disclosure further provides an artificial intelligence chip, where the chip includes at least one computing core as described above. The chip may include a plurality of processors, which may include a plurality of computing cores, and the present disclosure does not limit the number of computing cores within the chip.
In a possible implementation manner, an embodiment of the present disclosure provides an electronic device including one or more artificial intelligence chips described above.
Fig. 9 is a block diagram illustrating a combined processing device 1200 according to an embodiment of the present disclosure. As shown in fig. 9, the combined processing device 1200 includes a computing processing device 1202 (e.g., an artificial intelligence processor including multiple computing cores as described above), an interface device 1204, other processing devices 1206, and a storage device 1208. Depending on the application scenario, one or more computing devices 1210 (e.g., computing cores) may be included in the computing processing device.
In one possible implementation, the computing processing device of the present disclosure may be configured to perform operations specified by a user. In an exemplary application, the computing processing device may be implemented as a single-core artificial intelligence processor or a multi-core artificial intelligence processor. Similarly, one or more computing devices included within a computing processing device may be implemented as an artificial intelligence processor core or as part of a hardware structure of an artificial intelligence processor core. When multiple computing devices are implemented as artificial intelligence processor cores or as part of a hardware structure of an artificial intelligence processor core, computing processing devices of the present disclosure may be considered to have a single core structure or a homogeneous multi-core structure.
In an exemplary operation, the computing processing device of the present disclosure may interact with other processing devices through the interface device to collectively perform user-specified operations. Depending on the implementation, the other processing devices of the present disclosure may include one or more types of general-purpose and/or special-purpose processors, such as a central processing unit (CPU), a graphics processing unit (GPU), an artificial intelligence processor, and the like. These processors may include, but are not limited to, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic, discrete hardware components, etc., and their number may be determined according to actual needs. As mentioned above, the computing processing device of the present disclosure, considered by itself, may be regarded as having a single-core structure or a homogeneous multi-core structure. However, when considered together, the computing processing device and the other processing devices may be regarded as forming a heterogeneous multi-core structure.
In one or more embodiments, the other processing devices may serve as an interface between the computing processing device of the present disclosure (which may be embodied as an artificial intelligence computing device, for example one associated with neural network operations) and external data and control, performing basic control including, but not limited to, data transfer and starting and/or stopping of the computing device. In further embodiments, the other processing devices may also cooperate with the computing processing device to jointly complete computational tasks.
In one or more embodiments, the interface device may be used to transfer data and control instructions between the computing processing device and the other processing devices. For example, the computing processing device may obtain input data from the other processing devices via the interface device and write the input data into a storage device (or memory) on the computing processing device. Further, the computing processing device may obtain control instructions from the other processing devices via the interface device and write them into an on-chip control cache of the computing processing device. Alternatively or additionally, the interface device may also read data from the storage device of the computing processing device and transmit the data to the other processing devices.
Additionally or alternatively, the combined processing device of the present disclosure may further include a storage device. As shown in fig. 9, the storage device is connected to the computing processing device and the other processing devices, respectively. In one or more embodiments, the storage device may be used to hold data of the computing processing device and/or the other processing devices, for example data that cannot be fully retained in the internal or on-chip storage of the computing processing device or the other processing devices.
According to different application scenarios, the artificial intelligence chip disclosed by the disclosure can be used for a server, a cloud server, a server cluster, a data processing device, a robot, a computer, a printer, a scanner, a tablet computer, an intelligent terminal, a PC device, a terminal of the internet of things, a mobile terminal, a mobile phone, a vehicle data recorder, a navigator, a sensor, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a visual terminal, an automatic driving terminal, a vehicle, a household appliance, and/or a medical device. The vehicle comprises an airplane, a ship and/or a vehicle; the household appliances comprise a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove and a range hood; the medical equipment comprises a nuclear magnetic resonance apparatus, a B-ultrasonic apparatus and/or an electrocardiograph.
Fig. 10 shows a block diagram of an electronic device 1900 according to an embodiment of the disclosure. For example, the electronic device 1900 may be provided as a server. Referring to FIG. 10, an electronic device 1900 includes a processing component 1922 (e.g., an artificial intelligence processor including multiple computing cores), which further includes one or more computing cores, and memory resources, represented by memory 1932, for storing instructions, e.g., applications, that are executable by the processing component 1922. The application programs stored in memory 1932 may include one or more modules that each correspond to a set of instructions. Further, the processing component 1922 is configured to execute instructions to perform the above-described method.
The electronic device 1900 may also include a power component 1926 configured to perform power management of the electronic device 1900, a wired or wireless network interface 1950 configured to connect the electronic device 1900 to a network, and an input/output (I/O) interface 1958. The electronic device 1900 may operate based on an operating system stored in memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
In the present disclosure, units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units. The aforementioned components or units may be co-located or distributed across multiple network elements. In addition, according to actual needs, part or all of the units can be selected to achieve the purpose of the scheme of the embodiment of the disclosure. In addition, in some scenarios, multiple units in embodiments of the present disclosure may be integrated into one unit or each unit may exist physically separately.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments. The technical features of the embodiments may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The electronic device or processor of the present disclosure may also be applied to the fields of the internet, internet of things, data centers, energy, transportation, public management, manufacturing, education, power grid, telecommunications, finance, retail, construction site, medical, and the like. Further, the electronic device or the processor disclosed by the disclosure can also be used in application scenes such as a cloud end, an edge end and a terminal which are related to artificial intelligence, big data and/or cloud computing. In one or more embodiments, a computationally powerful electronic device or processor according to the present disclosure may be applied to a cloud device (e.g., a cloud server), while a less power-consuming electronic device or processor may be applied to a terminal device and/or an edge-end device (e.g., a smartphone or a camera). In one or more embodiments, the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that appropriate hardware resources can be matched from the hardware resources of the cloud device to simulate the hardware resources of the terminal device and/or the edge device according to the hardware information of the terminal device and/or the edge device, and uniform management, scheduling and cooperative work of end-cloud integration or cloud-edge-end integration can be completed.
Having described embodiments of the present disclosure, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (14)

1. A data storage method, applied to a computing core of a processor, wherein the processor comprises a plurality of computing cores, each computing core comprises a processing component and a storage component, and the storage component comprises more than two storage units;
the method comprises the following steps:
the processing component writes the weight data and the processing data of each convolution layer into each storage unit in sequence for storage according to the convolution layer processing sequence in the multilayer convolution operation process;
each storage unit receives and stores the weight data and the processing data;
wherein, in the storage component, the processing data and the weight data of the same convolution layer are stored in different storage units; in the same storage unit, the weight data space storing the weight data and the processing data space storing the processing data are sequentially arranged according to a first address direction;
wherein, the weight data space for storing the weight data of each convolution layer is sequentially arranged along the first address direction according to the processing sequence of the convolution layers; the processing data spaces for storing the processing data of the convolutional layers are sequentially arranged along the first address direction according to the processing sequence of the convolutional layers;
and in a weight data space for storing weight data of the same convolutional layer and a processing data space for storing processing data of the same convolutional layer, the weight data and the processing data are respectively and sequentially arranged according to a second address direction, and the first address direction is opposite to the second address direction.
2. The method of claim 1, wherein the first address direction is a high address to low address direction and the second address direction is a low address to high address direction.
3. The method of claim 1, further comprising:
the processing component sends the operation result data of each convolution layer in the multilayer convolution operation process to each storage unit for storage;
and each storage unit receives and stores the operation result data, wherein the operation result data of any convolution layer and the processing data of the convolution layer are stored in the same storage unit, and share a storage space with the processing data of the next convolution layer.
4. The method of claim 1, further comprising:
the processing component writes first data received from outside the computing core into a storage unit and reads second data from the storage unit to be sent outside the computing core;
in multiple write operations, the initial address of each write operation is arranged according to the first address direction, and first data is written in each write operation according to the second address direction;
in the multiple read operations, the start address of each read operation is arranged according to the first address direction, and the second data is read according to the second address direction in each read operation.
5. The method of claim 1, wherein, in the storage unit, the storage order of the processing data and the weight data for each convolution layer follows the convolution operation order of the depth direction first, then the transverse direction, and then the longitudinal direction.
6. The method of claim 1 or 5, wherein each layer of the convolution operation uses a plurality of convolution kernel groups, each convolution kernel group comprising a plurality of convolution kernels;
in the storage unit, the weight data of the convolution kernel groups used in each layer of the convolution operation are stored group by group, and within each group the weight data are stored in the order of the serial numbers of the convolution kernels, the depth direction of the convolution kernels, the transverse direction, and the longitudinal direction.
7. A computing core, comprising a processing component and a storage component;
the processing component writes the weight data and the processing data of each convolution layer into each storage unit in sequence for storage according to the convolution layer processing sequence in the multilayer convolution operation process;
the storage component comprises more than two storage units, and each storage unit receives and stores the weight data and the processing data;
wherein, in the storage component, the processing data and the weight data of the same convolution layer are stored in different storage units; in the same storage unit, the weight data space storing the weight data and the processing data space storing the processing data are sequentially arranged according to a first address direction;
wherein, the weight data space for storing the weight data of each convolution layer is sequentially arranged along the first address direction according to the processing sequence of the convolution layers; the processing data spaces for storing the processing data of the convolutional layers are sequentially arranged along the first address direction according to the processing sequence of the convolutional layers;
and in a weight data space for storing weight data of the same convolutional layer and a processing data space for storing processing data of the same convolutional layer, the weight data and the processing data are respectively and sequentially arranged according to a second address direction, and the first address direction is opposite to the second address direction.
8. The compute core of claim 7 wherein the first address direction is a high address to low address direction and the second address direction is a low address to high address direction.
9. The computing core of claim 7, wherein the processing component is further configured to send the operation result data of each convolution layer in the multilayer convolution operation process to each storage unit for storage;
and each storage unit receives and stores the operation result data, wherein the operation result data of any convolution layer and the processing data of the convolution layer are stored in the same storage unit, and share a storage space with the processing data of the next convolution layer.
10. The computing core of claim 7, wherein the processing component is further configured to write first data received from outside the computing core to a storage unit and read second data from the storage unit for transmission outside the computing core;
in multiple write operations, the initial address of each write operation is arranged according to the first address direction, and first data is written in each write operation according to the second address direction;
in the multiple read operations, the start address of each read operation is arranged according to the first address direction, and the second data is read according to the second address direction in each read operation.
11. The computing core of claim 7, wherein, in the storage unit, the storage order of the processing data and the weight data for each convolution layer follows the convolution operation order of the depth direction first, then the transverse direction, and then the longitudinal direction.
12. The computing core according to claim 7 or 11, wherein each layer of the convolution operation uses a plurality of convolution kernel groups, each convolution kernel group comprising a plurality of convolution kernels;
in the storage unit, the weight data of the convolution kernel groups used in each layer of the convolution operation are stored group by group, and within each group the weight data are stored in the order of the serial numbers of the convolution kernels, the depth direction of the convolution kernels, the transverse direction, and the longitudinal direction.
13. An artificial intelligence chip, wherein the chip comprises a plurality of computing cores according to any one of claims 7 to 12.
14. An electronic device comprising one or more artificial intelligence chips according to claim 13.
CN202110172560.5A 2021-02-08 2021-02-08 Data storage method, computing core, chip and electronic equipment Active CN112799599B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110172560.5A CN112799599B (en) 2021-02-08 2021-02-08 Data storage method, computing core, chip and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110172560.5A CN112799599B (en) 2021-02-08 2021-02-08 Data storage method, computing core, chip and electronic equipment

Publications (2)

Publication Number Publication Date
CN112799599A true CN112799599A (en) 2021-05-14
CN112799599B CN112799599B (en) 2022-07-15

Family

ID=75814832

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110172560.5A Active CN112799599B (en) 2021-02-08 2021-02-08 Data storage method, computing core, chip and electronic equipment

Country Status (1)

Country Link
CN (1) CN112799599B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114942731A (en) * 2022-07-25 2022-08-26 北京星天科技有限公司 Data storage method and device
CN114968602A (en) * 2022-08-01 2022-08-30 成都图影视讯科技有限公司 Architecture, method and apparatus for a dynamically resource-allocated neural network chip
TWI799169B (en) * 2021-05-19 2023-04-11 神盾股份有限公司 Data processing method and circuit based on convolution computation

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105940381A (en) * 2013-12-26 2016-09-14 英特尔公司 Data reorder during memory access
CN107329734A (en) * 2016-04-29 2017-11-07 北京中科寒武纪科技有限公司 A kind of apparatus and method for performing convolutional neural networks forward operation
CN108229648A (en) * 2017-08-31 2018-06-29 深圳市商汤科技有限公司 Convolutional calculation method and apparatus, electronic equipment, computer storage media
CN109992198A (en) * 2017-12-29 2019-07-09 深圳云天励飞技术有限公司 The data transmission method and Related product of neural network
CN110309912A (en) * 2018-03-27 2019-10-08 北京深鉴智能科技有限公司 Data access method, hardware accelerator, calculates equipment, storage medium at device
US20200110536A1 (en) * 2018-10-09 2020-04-09 Western Digital Technologies, Inc. Optimizing data storage device operation by grouping logical block addresses and/or physical block addresses using hints
CN111860812A (en) * 2016-04-29 2020-10-30 中科寒武纪科技股份有限公司 Apparatus and method for performing convolutional neural network training

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105940381A (en) * 2013-12-26 2016-09-14 英特尔公司 Data reorder during memory access
CN107329734A (en) * 2016-04-29 2017-11-07 北京中科寒武纪科技有限公司 A kind of apparatus and method for performing convolutional neural networks forward operation
CN111860812A (en) * 2016-04-29 2020-10-30 中科寒武纪科技股份有限公司 Apparatus and method for performing convolutional neural network training
CN108229648A (en) * 2017-08-31 2018-06-29 深圳市商汤科技有限公司 Convolutional calculation method and apparatus, electronic equipment, computer storage media
CN109992198A (en) * 2017-12-29 2019-07-09 深圳云天励飞技术有限公司 The data transmission method and Related product of neural network
CN110309912A (en) * 2018-03-27 2019-10-08 北京深鉴智能科技有限公司 Data access method, hardware accelerator, calculates equipment, storage medium at device
US20200110536A1 (en) * 2018-10-09 2020-04-09 Western Digital Technologies, Inc. Optimizing data storage device operation by grouping logical block addresses and/or physical block addresses using hints

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
XU Haizheng, PEI Jing et al.: "A pseudo-parallel I/O scheduling strategy in RAID systems", Computer Engineering and Applications, vol. 45, no. 1, 1 April 2009 (2009-04-01) *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI799169B (en) * 2021-05-19 2023-04-11 神盾股份有限公司 Data processing method and circuit based on convolution computation
CN114942731A (en) * 2022-07-25 2022-08-26 北京星天科技有限公司 Data storage method and device
CN114942731B (en) * 2022-07-25 2022-10-25 北京星天科技有限公司 Data storage method and device
CN114968602A (en) * 2022-08-01 2022-08-30 成都图影视讯科技有限公司 Architecture, method and apparatus for a dynamically resource-allocated neural network chip
CN114968602B (en) * 2022-08-01 2022-10-21 成都图影视讯科技有限公司 Architecture, method and apparatus for a dynamically resource-allocated neural network chip

Also Published As

Publication number Publication date
CN112799599B (en) 2022-07-15

Similar Documents

Publication Publication Date Title
CN109117948B (en) Method for converting picture style and related product
CN109062611B (en) Neural network processing device and method for executing vector scaling instruction
CN112799599B (en) Data storage method, computing core, chip and electronic equipment
US11775430B1 (en) Memory access for multiple circuit components
WO2020073211A1 (en) Operation accelerator, processing method, and related device
WO2023045445A1 (en) Data processing device, data processing method, and related product
WO2023045446A1 (en) Computing apparatus, data processing method, and related product
CN112163601A (en) Image classification method, system, computer device and storage medium
WO2023123919A1 (en) Data processing circuit, data processing method, and related product
CN112686379A (en) Integrated circuit device, electronic equipment, board card and calculation method
CN107305486B (en) Neural network maxout layer computing device
CN112799598B (en) Data processing method, processor and electronic equipment
CN112084023A (en) Data parallel processing method, electronic equipment and computer readable storage medium
CN115129460A (en) Method and device for acquiring operator hardware time, computer equipment and storage medium
CN112801276B (en) Data processing method, processor and electronic equipment
CN112817898A (en) Data transmission method, processor, chip and electronic equipment
CN114281561A (en) Processing unit, synchronization method for a processing unit and corresponding product
CN113469333B (en) Artificial intelligence processor, method and related products for executing neural network model
CN112232498B (en) Data processing device, integrated circuit chip, electronic equipment, board card and method
WO2023045638A1 (en) Computing device, method for implementing convolution operation by using computing device, and related product
WO2023087698A1 (en) Computing apparatus and method for executing convolution operation, and related products
CN117235424A (en) Computing device, computing method and related product
CN113705785A (en) Network training method for avoiding forward computing data overlapping of many-core architecture chip
CN114692841A (en) Data processing device, data processing method and related product
CN112801278A (en) Data processing method, processor, chip and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant