CN111160545A - Artificial neural network processing system and data processing method thereof - Google Patents

Info

Publication number
CN111160545A
Authority
CN
China
Prior art keywords
layer
chip
cache
data
area
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911411798.8A
Other languages
Chinese (zh)
Inventor
陕天龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sankuai Online Technology Co Ltd
Original Assignee
Beijing Sankuai Online Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sankuai Online Technology Co Ltd filed Critical Beijing Sankuai Online Technology Co Ltd
Priority to CN201911411798.8A
Publication of CN111160545A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The application discloses an artificial neural network processing system and a data processing method thereof. The system comprises a control module, an operation module and a storage module. The control module comprises a cache control unit configured to control the artificial neural network operation cache; the operation module comprises one or more main operation units configured to execute operations on each network layer in the artificial neural network structure; the storage module comprises a fusion storage unit configured to load the operation data of each network layer of the artificial neural network and cache the operation results. The fusion storage unit comprises an on-chip memory and an off-chip memory, and the cache addresses of the on-chip memory and the off-chip memory are uniformly encoded to form a uniform address value. The scheme ensures the parallelism, compatibility and expandability of artificial neural network operations and reduces the power consumption of the system while maintaining high performance.

Description

Artificial neural network processing system and data processing method thereof
Technical Field
The application relates to the technical field of artificial intelligence, in particular to an artificial neural network processing system, a data processing method thereof, electronic equipment and a readable storage medium.
Background
In recent years, intelligent solutions based on artificial neural networks have become increasingly abundant: NVIDIA has built an artificial intelligence ecosystem around the GPU; Xilinx has built AI processing devices and development suites based on the FPGA; and the TPU introduced by Google in 2016 can perform the full inference workload of deep convolutional neural networks. In the edge computing devices of these major companies (an edge computing device is an open platform that integrates core network, computing, storage and application capabilities on the side close to the object or data source and can provide service at the nearest end), the storage module is crucial to the whole architecture: the GPU uses external GDDR (Graphics Double Data Rate) memory to provide high-speed bandwidth for the training and inference of artificial neural networks, while the TPU uses entirely on-chip storage resources to provide highly parallel data caching for AI algorithm inference.
In existing storage module schemes for artificial neural networks, the GPU uses external GDDR memory as a cache during training and inference; the high clock frequency brings considerable operation speed, but also enormous power consumption, making it difficult to use in devices such as miniaturized AI devices and cloud servers. The TPU performs caching entirely in on-chip storage resources during artificial neural network inference; although the overall power consumption is reduced, the limited on-chip storage resources may overflow, and the compatibility of the storage structure with new functional layers is insufficient.
Summary
In view of the above, the present application is proposed to provide an artificial neural network processing system and a data processing method thereof that overcome or at least partially solve the above problems.
In accordance with one aspect of the present application, there is provided an artificial neural network processing system, comprising: a control module, an operation module and a storage module;
the control module comprises a cache control unit configured to perform control of an artificial neural network operation cache;
the operation module comprises one or more main operation units, and the main operation units are configured to execute operation on each network layer in the artificial neural network structure;
the storage module comprises a fusion storage unit, and the fusion storage unit is configured to load operation data of each network layer of the artificial neural network and cache operation results; the fusion storage unit comprises an on-chip memory and an off-chip memory, and the cache address of the on-chip memory and the cache address of the off-chip memory are coded uniformly to form a uniform address value.
Optionally, the on-chip memory, the cache control unit, and the main operation unit are configured on the same chip.
Optionally, the storage area of the on-chip memory includes an on-chip cache first area and an on-chip cache second area, the storage area of the off-chip memory includes an off-chip cache first area and an off-chip cache second area, and the on-chip cache first area, the off-chip cache first area, the on-chip cache second area and the off-chip cache second area are uniformly encoded in sequence to form a uniform address value.
Optionally, the on-chip memory includes two symmetrically arranged static RAM cache matrices, each of which is composed of a plurality of static RAMs and a controller.
Optionally, the control module further comprises a main controller configured to perform at least one of the following operations: receiving and sending operation instructions of the artificial neural network, scheduling the operation flow of the artificial neural network, and making function decisions of the artificial neural network; and/or,
the operation module further comprises an auxiliary operation unit, and the auxiliary operation unit is configured to preprocess operation data of the artificial neural network and/or collect operation results; and/or,
the storage module further comprises an auxiliary storage unit, and the auxiliary storage unit is configured to calculate and load weight parameters of each network layer of the artificial neural network;
wherein the main controller and the auxiliary arithmetic unit are realized by a CPU; the auxiliary storage unit is realized through a CPU memory.
Optionally, the system further includes a control path module configured on the chip, where the control path module provides caches for communication between the main controller and both the operation unit and the cache control unit, and includes a cache instruction RAM, a weight instruction RAM and a weight data RAM.
According to another aspect of the present application, there is provided a data processing method applied to the artificial neural network processing system as described in any one of the above, the method including:
dividing the unified address value into regions, wherein the regions at least comprise a first cache region and a second cache region;
alternately writing the operation result data of each network layer of the neural network into the first cache region or the second cache region after it is returned from the main operation unit, reading the operation result data of that network layer from the first cache region or the second cache region, and writing it into the main operation unit to perform the operation of the next network layer.
Optionally, the first cache region is composed of a first on-chip cache region and a first off-chip cache region of continuous uniform address values, the second cache region is composed of a second on-chip cache region and a second off-chip cache region of continuous uniform address values, and data is preferentially written in or read from the first on-chip cache region.
Optionally, the reading or writing in the first cache region or the second cache region includes:
and judging whether the target network layer is an original data layer, a result data layer, an overflow layer or a backup layer, and determining a read or write target cache region according to the operational characteristics of each network layer and the relationship between the unified address value of the data and the maximum address value of the on-chip cache region.
Optionally, the determining whether the target network layer is an original data layer, a result data layer, an overflow layer or a backup layer, and determining the read or write target cache area according to the operational characteristics of each network layer and the relationship between the data uniform address value and the maximum address value of the on-chip cache area includes:
if the target network layer is not a cross-layer backup layer, not an original data layer, not a result data layer and not an overflow layer, or if the target network layer is an overflow layer but not a backup layer and the unified address value after the current read or write finishes is less than or equal to the maximum address value of the on-chip cache region, only the data in the on-chip cache region needs to be read or written;
if the target network layer is a cross-layer backup layer, an original data layer or a result data layer and is not an overflow layer, or if the target network layer is an overflow layer but not a backup layer and the unified address value before the current read or write starts is larger than the maximum address value of the on-chip cache region, only the data in the off-chip cache region needs to be read or written;
if the target read network layer is an overflow layer but not a backup layer, the unified address value before the current read starts is smaller than the maximum address value of the on-chip cache region, and the unified address value after the current read or write finishes is larger than the maximum address value of the on-chip cache region, the data in the on-chip cache region needs to be read or written first, and then the data in the off-chip cache region is read.
Optionally, the determining whether the target network layer is an original data layer, a result data layer, an overflow layer or a backup layer, and determining the write target cache area according to the operational characteristics of each network layer and the relationship between the unified address value of the data and the maximum address value of the on-chip cache area, further includes:
and if the target writing layer is a backup layer, writing the target writing layer into the on-chip cache region and the off-chip cache region at the same time.
Optionally, the alternately writing the operation result data of each network layer of the neural network into the first cache region or the second cache region after returning from the main operation unit, and reading the operation result data of the network layer from the first cache region or the second cache region and writing the operation result data into the main operation unit to perform the operation of the next network layer of the network layer includes:
determining the base address of each channel and the cache partition of each channel based on the layer number and the channel number of the network layer where the current request is located, and then reading or writing data of each channel in a burst mode based on the base address of each channel.
Optionally, the alternately writing the operation result data of each network layer of the neural network into the first cache region or the second cache region after returning from the main operation unit includes:
after receiving an external instruction, analyzing a write-in instruction and informing a write-in data FIFO group to prepare to receive data to be cached;
respectively adjusting the quantity proportion of write-in FIFO and the quantity proportion of read-out FIFO according to the data bit width proportion of the on-chip cache region and the off-chip cache region;
the number of read operations and the number of write operations are adjusted according to the relationship between the number of write FIFOs and the number of read FIFOs, so that the write data FIFO group temporarily stores data input in parallel by multiple channels and reads the data in a burst mode, and the total number of read operations and write operations is reduced.
Optionally, the alternately writing the operation result data returned from the operation unit by each network layer in the neural network into the first cache region or the second cache region includes:
if the target write layer is an overflow layer, first continuously writing data into the on-chip cache region, then writing data into the on-chip cache region followed by the off-chip cache region within a single instruction, and then continuously writing data into the off-chip cache region, wherein a plurality of write FIFOs (first in, first out) are set to work in parallel in the direct memory access when writing the off-chip cache region.
Optionally, the area further includes an original data cache area, a plurality of backup layer areas, and an operation result area, where the original data cache area, the first off-chip cache area, the second off-chip cache area, the plurality of backup layer areas, and the operation result area are disposed on an off-chip memory, and the original data cache area, the first on-chip cache area, the first off-chip cache area, the second on-chip cache area, the second off-chip cache area, the plurality of backup layer areas, and the operation result area are uniformly encoded in sequence to form a uniform address value.
In accordance with yet another aspect of the present application, there is provided an electronic device including: a processor; and a memory arranged to store computer executable instructions that, when executed, cause the processor to perform a method as any one of the above.
According to a further aspect of the application, there is provided a computer readable storage medium, wherein the computer readable storage medium stores one or more programs which, when executed by a processor, implement a method as in any above.
Therefore, the fusion storage module of the artificial neural network disclosed in the present application performs real-time decision and channel selection for the caching of network-layer data by combining the on-chip cache matrix with the off-chip cache, and realizes a high-speed storage architecture with high parallelism, strong compatibility and low power consumption. Specifically:
First, the technical scheme disclosed in the application ensures high operational parallelism and high performance of the artificial neural network. The on-chip storage matrix unit of the system can provide synchronous cache processing for highly parallel data, and the on-chip cache is used preferentially for networks whose scale does not exceed the on-chip storage resources.
Second, the application ensures the compatibility and expandability of the artificial neural network algorithm. The external memory is functionally fused with the on-chip memory and serves as an alternative cache and a function-expansion cache.
Third, the technical scheme disclosed in the application ensures a low power consumption level of the artificial neural network hardware system. In this application, the external memory is used only as an alternative and function-expansion cache of the internal cache, so the power consumption generated during network operation is mainly on the chip; in addition, the system disclosed in the application establishes an independent data path for the external memory that is compatible with the data path of the internal cache, so reading and writing do not need to pass through the on-chip cache, which greatly reduces on-chip data processing and delay time.
The foregoing description is only an overview of the technical solutions of the present application. In order that the technical means of the present application may be understood more clearly and implemented according to the content of the specification, and in order to make the above and other objects, features and advantages of the present application more apparent, the following detailed description of the present application is given.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 shows a schematic block diagram of an artificial neural network processing system according to an embodiment of the present application;
FIG. 2 shows a schematic flow diagram of a data processing method according to an embodiment of the present application;
FIG. 3 shows a schematic structural diagram of an electronic device according to an embodiment of the present application;
FIG. 4 shows a schematic structural diagram of a computer-readable storage medium according to an embodiment of the present application;
FIG. 5 is a schematic diagram illustrating an overall flow of data processing of an artificial neural network processing system according to an embodiment of the present application;
FIG. 6 illustrates an internal cache diagram of a control-path module according to one embodiment of the present application;
FIG. 7 illustrates a functional diagram of a cache instruction protocol according to one embodiment of the present application;
FIG. 8 is a diagram illustrating an overflow layer read/write command to memory correspondence according to an embodiment of the present application;
FIG. 9 illustrates base address partitioning in an original data cache according to one embodiment of the present application;
FIG. 10 is a schematic diagram illustrating a comparison of operations when the number ratio of write FIFOs and read FIFOs is different according to one embodiment of the present application;
FIG. 11 is a schematic diagram of an SRAM cache matrix structure according to an embodiment of the present application;
FIG. 12 shows a schematic diagram of a comparison of buffering times of a multi-FIFO DMA and a single-FIFO DMA according to an embodiment of the present application;
FIG. 13 is a diagram illustrating a unified address distribution and read/write flow according to an embodiment of the present application;
FIG. 14 illustrates write status determination and memory address partitioning for a backup layer according to one embodiment of the present application.
Detailed Description
Exemplary embodiments of the present application will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present application are shown in the drawings, it should be understood that the present application may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
FIG. 1 shows a schematic block diagram of an artificial neural network processing system according to an embodiment of the present application. The artificial neural network processing system disclosed in this embodiment is a complete framework that applies the fusion storage module and the data processing method. As shown in fig. 5, the system can realize the overall flow of reading data of the artificial neural network from the storage module or writing data into it, including: presetting the algorithm flow and loading data; reading and operating on the data of the first network layer; writing the operation result into the storage system; computing layer by layer and storing the final calculation result in the off-chip memory; and finishing the current round of operation and waiting for the start instruction of the next operation, and the like.
The system comprises a control module, an operation module and a storage module; these modules can be at least partially configured on a Field Programmable Gate Array (FPGA) or Application Specific Integrated Circuit (ASIC) chip.
The control module comprises a cache control unit configured to perform control of the artificial neural network operation cache. The cache control unit is one of the core functional units: it executes the cache instruction protocol function shown in fig. 7 and the data access control flow shown in fig. 13. The cache instruction protocol function shown in fig. 7 is implemented by programming in the cache control unit, thereby realizing the scheduling of image data in the on-chip and off-chip memories and the operations of the neural network in the operation module.
According to fig. 7, the cache control unit determines from the operation field that the current instruction is a read-direction instruction and activates the read command parsing module of the system read cache control area. When the sync signal is 1, the read command parsing module latches fields such as layer and channel_number, which describe the layer number, channel number and other information of the current network layer, and latches fields such as concat_enable, which describe the data amount, data type and other information requested by the current instruction.
If the read command parsing module determines that the layer number requested in the current step is 0 and the original data needs to be read, it notifies the original image read excitation module to automatically generate read commands so as to automatically read the original data in the cache; meanwhile, the command selector of the read cache control area selects the read command from the original image read excitation module according to the parsing result and outputs it to the cache control area.
The operation module comprises one or more main operation units, and the main operation units are configured to perform operation execution on different layer structures in the artificial neural network structure.
The main operation unit in this embodiment is configured on the chip and implements operation execution on each network layer through NAND gates and other logic arrays; the operations may specifically include convolution, pooling, tensor splicing, activation, upsampling, downsampling, and the like.
The storage module comprises a fusion storage unit consisting of an on-chip memory and an off-chip memory, and is configured to load operation data of each network layer of the artificial neural network and cache operation results; and the cache addresses of the on-chip memory and the off-chip memory in the fusion memory unit are mixed and uniformly coded to form a uniform address value.
The fusion storage unit disclosed in this embodiment is mainly used for storing the service data of the artificial neural network, mainly image data. The on-chip memory in the fusion storage unit can be directly controlled by the cache control unit to read or write data, so its processing speed is high; the off-chip memory is connected to the cache control unit through the INTERCONNECT module and the MIG interface controller, so that a data path is established independently for the off-chip memory and is compatible with the data path of the internal cache; reading and writing do not need to pass through the on-chip cache, which greatly reduces on-chip data processing and delay time. It should be noted that INTERCONNECT is an opto-electrical and Photonic Integrated Circuit (PIC) design software package that can be used for designing, simulating and analyzing integrated optical circuits, including Mach-Zehnder modulators, coupled ring resonators, arrayed waveguide gratings and other silicon-based photonic devices, as well as optical interconnect systems; it supports vectorized mapping and graphical display of photonic integrated circuit devices and is therefore often used in image processing computing devices. MIG is an interface control framework that mainly has a user-facing port and a DDR-facing port; using MIG, the user can access the DDR SDRAM through the signals of the user port, which simplifies operation.
The on-chip memory may consist of a static RAM array, and the off-chip memory may be implemented with dynamic DDR memory. In order to implement access scheduling of the image data, in this embodiment the storage addresses of these storage resources are uniformly encoded and given a uniform address value, so as to better load the operation data of each network layer of the artificial neural network and cache the operation results.
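For illustration only (the patent text does not give concrete region sizes or a routing function), the following Python sketch shows one way such a uniform encoding could map a single address value onto either the on-chip SRAM or the off-chip DDR; the region names, sizes and the route() helper are assumptions introduced here, not part of the disclosed system.

# Minimal sketch of a unified address map for the fusion storage unit.
# All region sizes and names are illustrative assumptions, not values from the patent.

ON_CHIP_U1_SIZE  = 0x04000   # on-chip cache first area  (assumed size)
OFF_CHIP_U1_SIZE = 0x40000   # off-chip cache first area (assumed size)
ON_CHIP_U0_SIZE  = 0x04000   # on-chip cache second area (assumed size)
OFF_CHIP_U0_SIZE = 0x40000   # off-chip cache second area (assumed size)

# Regions are encoded back to back, in the order given in the description,
# so that one uniform address value covers both memories.
_regions = []
_base = 0
for name, size, medium in [
        ("on_chip_U1",  ON_CHIP_U1_SIZE,  "sram"),
        ("off_chip_U1", OFF_CHIP_U1_SIZE, "ddr"),
        ("on_chip_U0",  ON_CHIP_U0_SIZE,  "sram"),
        ("off_chip_U0", OFF_CHIP_U0_SIZE, "ddr")]:
    _regions.append((name, _base, _base + size, medium))
    _base += size

def route(uniform_addr):
    """Return (region name, medium, local offset) for a uniform address value."""
    for name, lo, hi, medium in _regions:
        if lo <= uniform_addr < hi:
            return name, medium, uniform_addr - lo
    raise ValueError("address outside the unified map")

print(route(0x100))            # lands in the on-chip first area
print(route(ON_CHIP_U1_SIZE))  # first address of the off-chip first area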
To sum up, the artificial neural network processing system disclosed in this embodiment provides a scheme for accelerating processing of an artificial neural network, in which the fusion storage module includes an on-chip memory and an off-chip memory, so that dual fusion of the memory in storage addresses and operation functions is realized, parallelism, compatibility, and expandability of the system are improved, and the power consumption level of the system is reduced while high performance is ensured.
In one embodiment, the on-chip memory, the cache control unit, and the main arithmetic unit are configured on the same chip.
According to the functional design requirement, the on-chip memory, the cache control unit and the main operation unit are all arranged on the chip of the field programmable gate array or the application specific integrated circuit, so that the calculation and the scheduling of the neural network data are conveniently realized.
In one embodiment, the storage area of the on-chip memory includes an on-chip cache first area and an on-chip cache second area, the storage area of the off-chip memory includes an off-chip cache first area and an off-chip cache second area, and the on-chip cache first area, the off-chip cache first area, the on-chip cache second area and the off-chip cache second area are uniformly encoded in sequence to form a uniform address value.
The above arrangement of this embodiment realizes physical fusion of the fused storage units, and forms the fused storage area into two buffer areas, namely, a first buffer area and a second buffer area, so as to provide conditions for performing alternate access in the neural network calculation.
In one embodiment, the on-chip memory comprises two symmetrically arranged static RAM cache matrixes which are respectively composed of a plurality of static RAMs and a controller.
Referring to fig. 11, in order to implement the fused storage and facilitate data processing and scheduling in the present application, the on-chip memory disclosed in this embodiment includes two symmetrically arranged static RAM cache matrices, each composed of C × D static RAMs and a controller, so that the write data of each instruction (C rows × D channels) can be cached in parallel. Viewed externally, the on-chip cache area is a [2, C, D] three-dimensional SRAM matrix.
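As a hedged illustration of the [2, C, D] organisation described above, the sketch below models the two symmetric SRAM matrices in plain Python; the values of C, D and the SRAM depth, and the write_instruction() helper, are assumptions for the example only.

# Illustrative model of the on-chip cache as a [2, C, D] SRAM matrix
# (two symmetric matrices of C rows x D channels of SRAMs). C, D and the
# SRAM depth are assumed values; the patent leaves them open.
C, D, DEPTH = 4, 16, 1024

# sram[bank][row][channel] is one SRAM, modelled as a list of DEPTH words.
sram = [[[[0] * DEPTH for _ in range(D)] for _ in range(C)] for _ in range(2)]

def write_instruction(bank, burst_words):
    """Cache one instruction's write data (C rows x D channels) in parallel.

    burst_words[row][channel] is the word list destined for that SRAM; every
    one of the C x D SRAMs receives its own slice at the same time, which is
    what makes the per-instruction write fully parallel.
    """
    for row in range(C):
        for ch in range(D):
            words = burst_words[row][ch]
            sram[bank][row][ch][:len(words)] = words

write_instruction(0, [[[1, 2, 3]] * D for _ in range(C)])
print(sram[0][0][0][:4])   # [1, 2, 3, 0]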
In one embodiment, the system further includes a control path module configured on the chip, and the control path module provides a buffer for communication between the main controller and the arithmetic unit and the buffer control unit, and includes a buffer instruction RAM, a weight instruction RAM, and a weight data RAM.
Referring to fig. 6, in order to reduce the burden of the main processor and increase the response speed of the system, a control path module including a plurality of RAMs is further configured on the chip, so that the occupation of the main controller is reduced when the operation is performed in each network layer, and the system can operate even in the absence of the main controller.
In a typical loading process, after the system is started, the main controller CPU determines the algorithm flow according to the algorithm model, and from it determines the weight parameter loading flow and the cache instruction loading flow. The CPU presets and writes the instruction data of each flow into the instruction RAM caches of the on-chip control path module through the system bus in sequence; as long as the algorithm model does not change, the instruction data in the RAM caches is not modified, and the CPU determines operation start, result collection and operation stop according to the flow state of the control path.
Meanwhile, the CPU loads the weight parameter file into a CPU memory, and prepares to write the weight parameters required by the first-layer operation of the algorithm into a weight parameter RAM cache of the control access module through a system bus; and starting to collect image data to be operated, and preparing to write the preprocessed image data into an original data area of an off-chip memory of the storage module through a system bus. Of course, the weight parameters and the image data may be loaded and transmitted in other ways, the loading target of the weight parameters is the weight parameter RAM cache, and the loading target of the image data to be operated is the off-chip memory.
In one embodiment, the control module further comprises a master controller configured to perform at least one of the following: receiving and sending an operation instruction of the artificial neural network, scheduling an operation flow of the artificial neural network, and making a function decision of the artificial neural network; and/or the operation module further comprises an auxiliary operation unit, wherein the auxiliary operation unit is configured to preprocess the operation data of the artificial neural network and/or collect the operation result; and/or the storage module further comprises an auxiliary storage unit, wherein the auxiliary storage unit is configured to calculate and load weight parameters of each network layer of the artificial neural network; wherein the main controller and the auxiliary arithmetic unit are realized by a CPU; the auxiliary storage unit is realized through a CPU memory.
The artificial neural network processing system can adopt a CPU to carry out overall control, and can also adopt other modes to carry out control in a chip, wherein DDR series memory can be adopted as the CPU memory.
FIG. 2 shows a schematic flow diagram of a data processing method according to an embodiment of the present application; the method is applied to the artificial neural network processing system, and comprises the following steps:
step S110, dividing the uniform address value into areas, wherein the areas at least comprise a first cache area and a second cache area.
And step S120, after the operation result data of each network layer of the neural network is returned from the main operation unit, alternately writing the operation result data into the first cache region or the second cache region, reading the operation result data of the network layer from the first cache region or the second cache region, and writing the operation result data into the main operation unit to perform the operation of the next network layer of the network layer.
Referring to fig. 14, this embodiment describes the specific process of storing service data such as images into the storage module, or outputting it from the storage module, according to the operation requirements of each network layer in the present application; fig. 13 illustrates the read-write flow between the network layers. The flow involves the circulation of service data such as images among the main operation unit, the first buffer area U1 and the second buffer area U0, and can vividly be called ping-pong buffering. To describe the present application more concretely, the Yolov3-tiny neural network hierarchy is taken as an example in the following description.
The cache control unit informs the corresponding cache region of the read request and the address. When the cache region is an on-chip cache, the on-chip cache controller reads a cache matrix in a U0/U1 region corresponding to the current layer, selects a U0 region when the layer number is odd, and selects a U1 region when the layer number is even; when the cache region is an off-chip cache, the DMA reads the real address through the controller of the off-chip memory. Since the data addresses corresponding to the read requests are distributed in the corresponding sectors of the channels, the DMA accesses the target data of each channel in a burst mode, and then outputs the read results on the fixed 4/8/16 data paths according to the low-to-high arrangement of the channels.
Referring to fig. 13, when the reading of the original data and the writing of the first-layer operation result data are completed, the whole-layer operation result data of the first layer L1 has been buffered in the ping-pong buffer main 1 area (on-chip buffer U1) and the ping-pong buffer slave 1 area (off-chip buffer U1) shown in fig. 13. The operation result data of the first layer L1 is subsequently read as the input data of the second-layer L2 operation, and the operation result of the second layer L2 is written into the ping-pong buffer main 2 area (on-chip buffer U0) and the ping-pong buffer slave 2 area (off-chip buffer U0) of the buffer system; if the whole-layer write data of the second layer L2 does not overflow, only the ping-pong buffer main 2 area (on-chip buffer U0) is needed.
After the whole-layer operation results of the second layer L2 are written into the buffer, the operation result data of the second layer is read out as the operation input data of the third layer, and the operation results of the third layer are written into the ping-pong buffer main 1 area (on-chip buffer U1) and the ping-pong buffer slave 1 area (off-chip buffer U1) of the buffer system; if the whole-layer write data of the third layer does not overflow, only the ping-pong buffer main 1 area (on-chip buffer U1) is needed.
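The alternation described above can be summarised in the following minimal Python sketch; the run_layer() and process_network() helpers and the use of plain lists as buffers are illustrative assumptions, with the only rule taken from the text being that odd-numbered layers write their whole-layer result to the U1 area and even-numbered layers to the U0 area, each result then being read back as the next layer's input.

# Minimal sketch of the ping-pong buffering between the first cache area (U1)
# and the second cache area (U0). The buffers are plain Python lists and
# run_layer() is a stand-in for the main operation unit; both are assumptions.

def run_layer(layer_idx, input_rows):
    # placeholder for convolution / pooling / etc. performed by the main unit
    return [r + 1 for r in input_rows]

def process_network(raw_data, num_layers):
    buffers = {"U0": None, "U1": None}
    # Layer n writes its whole-layer result to U1 when n is odd, U0 when n is
    # even; layer n+1 then reads that buffer back as its input.
    layer_input = raw_data                       # layer 1 reads the original data area
    for layer in range(1, num_layers + 1):
        result = run_layer(layer, layer_input)
        target = "U1" if layer % 2 == 1 else "U0"
        buffers[target] = result                 # write whole-layer result
        layer_input = buffers[target]            # next layer reads it back
    return layer_input                           # final result goes to the result area

print(process_network([0, 0, 0], num_layers=4))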
In conclusion, by the data processing method, the caching efficiency can be improved, the data processing and delay time of the storage module can be greatly reduced, and the power consumption level of the system can be reduced.
In one embodiment, the first cache area consists of an on-chip first cache area and an off-chip first cache area of continuous uniform address values, the second cache area consists of an on-chip second cache area and an off-chip second cache area of continuous uniform address values, and data is preferentially selected to be written into or read from the on-chip cache area.
The region distribution in the unified address is shown on the left side with reference to fig. 13, where the on-chip memory is divided into two parts to be inserted into the off-chip memory, forming a plurality of buffer regions as shown in fig. 13. Because the on-chip cache has the advantages of high speed, low power consumption and the like, under the condition that the unified address value of the data is less than or equal to the maximum address value of the on-chip cache region, the data is preferentially selected to be written in or read from the on-chip cache region.
In one embodiment, the reading or writing of the first cache region or the second cache region comprises: and judging whether the network layer is an original data layer, a result data layer, an overflow layer or a backup layer, and determining a read or write target cache region according to the operational characteristics of each network layer and the relationship between the unified address value of the data and the maximum address value of the on-chip cache region.
Referring to fig. 8, in this embodiment the network layers are named according to their operation requirements and operation characteristics, and the corresponding caching or reading modes are selected accordingly. For example, the original data layer is the L0 layer, whose data needs to be read from the original data buffer; the result data layer is the last layer of the network, whose result is written into the operation result buffer after the operation of that layer is finished; and the backup layer refers to the situation in which result data backed up after the operation of a previous layer needs to be read when tensor splicing (CONCAT) is required in the operation, which often occurs in networks such as ResNet. The overflow layer refers to the situation in which the on-chip cache area is insufficient during operation, for example because of excessive neurons or excessive weight parameters, and the network layer needs to be stored in combination with the off-chip memory. When the calculated data amount exceeds the single-layer cache capacity allocated by the storage system, the module judges that the current layer is an overflow layer and that the data which overflows the on-chip cache needs to be cached in the off-chip cache. The write command parsing module then sends the overflow layer flag to the cache control area.
Still taking the Yolov3-tiny neural network hierarchy as an example, when the computation reaches the third level L3, the algorithm requires to read the cache data of the second level, at this time, there are cache data of the second level in both the ping-pong cache main 2 area (on-chip cache U0) and the backup level 1 area of the off-chip cache, and since the level where the requested data is located does not cross the level (from the third level to the second level) with the data cached in the last level, the data is selected to be read from the ping-pong cache main 2 area; when the operation reaches the fourth layer L4 or later, the algorithm requires to read the cache data of the second layer, only the backup layer 1 area of the off-chip cache has the cache data of the second layer, and the cross-layer (from the fourth layer to the second layer) occurs between the level where the requested data is located and the data cached at the previous layer, so the data is selected to be read from the backup layer 1 area of the off-chip cache. The reason for this selection is that the on-chip cache is in the form of an SRAM matrix, the parallelism of the accessed and stored data is higher, and intermediate caches such as DMA and the like are not needed, so that the on-chip cache speed is higher than that of the off-chip cache, and the on-chip cache is preferentially selected under the condition that the on-chip cache can be selected, which is beneficial to improving the overall efficiency of the storage system.
In one embodiment, the determining whether the network layer is an original data layer, a result data layer, an overflow layer, or a backup layer, and determining the read or write target cache area according to the operational characteristics of each network layer and the relationship between the unified address value of the data and the maximum address value of the on-chip cache area includes:
if the target network layer is not a cross-layer backup layer, not an original data layer, not a result data layer and not an overflow layer, or if the target network layer is an overflow layer but not a backup layer and the unified address value after the current read or write finishes is less than or equal to the maximum address value of the on-chip cache region, only the data in the on-chip cache region needs to be read or written.
If the target network layer is a cross-layer backup layer, an original data layer or a result data layer and is not an overflow layer, or if the target network layer is an overflow layer but not a backup layer and the unified address value before the current read or write starts is larger than the maximum address value of the on-chip cache region, only the data in the off-chip cache region needs to be read or written.
If the target read layer is an overflow layer but not a backup layer, the unified address value before the current read starts is smaller than the maximum address value of the on-chip cache region, and the unified address value after the current read or write finishes is larger than the maximum address value of the on-chip cache region, the data in the on-chip cache region needs to be read or written first, and then the data in the off-chip cache region is read.
In design and execution, after receiving a read command, the cache control unit first determines which memory or memories need to be read during the processing of the layer where the read command is located; RD_SW1 is determined as one of the following three types:
First type, RD_SW1_S1: only the on-chip cache needs to be read. The judgment condition is that the target read layer is not a cross-layer backup layer, not an original data layer, not a result data layer and not an overflow layer;
Second type, RD_SW1_S2: only the off-chip cache needs to be read. The judgment condition is that the target read layer is a cross-layer backup layer, an original data layer or a result data layer, and is not an overflow layer. The current step is of this second type;
Third type, RD_SW1_S3: the on-chip cache needs to be read first, and then the off-chip cache. The judgment condition is that the target read layer is an overflow layer and is not a backup layer.
Then, the memory from which the data requested by a single read instruction needs to be read before the next single read instruction arrives is determined; RD_SW2 is determined as one of the following three types:
First type, RD_SW2_S1: only the on-chip cache needs to be read. The judgment condition is RD_SW1 = RD_SW1_S1, or RD_SW1 = RD_SW1_S3 and the unified address value of the read cache after the current read finishes is less than or equal to the maximum address value of the on-chip cache;
Second type, RD_SW2_S2: only the off-chip cache needs to be read. The judgment condition is RD_SW1 = RD_SW1_S2, or RD_SW1 = RD_SW1_S3 and the unified address value of the read cache before the current read starts is larger than the maximum address value of the on-chip cache. The current step is of this second type;
Third type, RD_SW2_S3: the on-chip cache needs to be read first and then the off-chip cache. The judgment condition is RD_SW1 = RD_SW1_S3, the unified address value of the read cache before the current read starts is smaller than the maximum address value of the on-chip cache, and the unified address value of the read cache after the current read finishes is larger than the maximum address value of the on-chip cache.
The write command can likewise be divided into three cases, namely WR_SW1_S1, WR_SW1_S2 and WR_SW1_S3, which are not described again here.
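A hedged summary of the two-stage read decision (RD_SW1 per layer, RD_SW2 per instruction) is sketched below in Python; the LayerInfo flag names are assumptions standing in for the information extracted by the command parsing modules, and the case of a layer that is both an overflow layer and a cross-layer backup layer, which the text does not specify, is routed off chip here purely as an assumption.

from dataclasses import dataclass

@dataclass
class LayerInfo:
    is_original: bool = False            # original data layer (layer L0)
    is_result: bool = False              # result data layer (last layer)
    is_overflow: bool = False            # whole-layer data exceeds the on-chip area
    is_cross_layer_backup: bool = False  # backup data requested across more than one layer

def rd_sw1(layer: LayerInfo) -> str:
    """Per-layer decision: which memories are touched while this layer is processed."""
    if layer.is_overflow and not layer.is_cross_layer_backup:
        return "S3"  # on-chip cache first, then off-chip cache
    if layer.is_cross_layer_backup or layer.is_original or layer.is_result:
        # An overflow layer that is also a cross-layer backup is not specified
        # in the text; routing it off chip here is an assumption of this sketch.
        return "S2"  # off-chip cache only
    return "S1"      # on-chip cache only

def rd_sw2(sw1: str, addr_before: int, addr_after: int, on_chip_max: int) -> str:
    """Per-instruction decision: which memory a single read instruction touches."""
    if sw1 == "S1" or (sw1 == "S3" and addr_after <= on_chip_max):
        return "S1"  # read the on-chip cache only
    if sw1 == "S2" or (sw1 == "S3" and addr_before > on_chip_max):
        return "S2"  # read the off-chip cache only
    return "S3"      # read the on-chip cache, then the off-chip cache

layer = LayerInfo(is_overflow=True)
print(rd_sw1(layer), rd_sw2("S3", addr_before=100, addr_after=300, on_chip_max=200))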
In one embodiment, the determining whether the network layer is an original data layer, a result data layer, an overflow layer or a backup layer, and determining the write target cache area according to the calculation characteristics of each network layer and the relationship between the data uniform address value and the maximum address value of the on-chip cache area, further includes:
and if the target writing layer is a backup layer, writing the target writing layer into the on-chip cache region and the off-chip cache region at the same time.
Referring to fig. 14, when the algorithm requires the operation result of the second layer L2 to be backed up, each write command is determined as a fourth type, WR_SW2_S4, and needs to be written into the on-chip cache and the off-chip cache simultaneously. Here, "simultaneously" means that, during the period from the start of the current command to the start of the next write command, the data written into the cache system is written into the on-chip cache and the off-chip cache at the same time, so as to back up the current layer data; this is applied, for example, to cross-layer caching for up-sampling and CONCAT. The judgment condition is WR_SW1 = WR_SW1_S4. Because the data paths used to write the off-chip cache and the on-chip cache are independent, the data does not need to be written into the on-chip cache first, read back out and then written into the off-chip cache; each row of data is written into the off-chip cache and the on-chip cache simultaneously, and the interval between two write instructions is the larger of the times required to write the off-chip cache and the on-chip cache independently.
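The timing benefit of this simultaneous dual-path write can be illustrated as follows; the per-row write times are invented example figures, and only the rule that the write-instruction interval equals the larger of the two independent write times comes from the description.

# Illustration of the backup-layer dual write (WR_SW2_S4): because the on-chip
# and off-chip data paths are independent, each row is pushed to both targets
# at once, and the interval between write instructions is the larger of the two
# independent write times. The per-row timings below are assumed example values.

T_ON_CHIP_ROW = 1.0    # assumed time units to write one row into the SRAM matrix
T_OFF_CHIP_ROW = 3.0   # assumed time units to write one row through DMA to DDR

def backup_layer_write_time(num_rows: int):
    # A sequential scheme (write on chip, read back, write off chip) would cost
    # roughly the sum of both per-row times; the dual-path scheme costs the maximum.
    sequential = num_rows * (T_ON_CHIP_ROW + T_OFF_CHIP_ROW)
    dual_path = num_rows * max(T_ON_CHIP_ROW, T_OFF_CHIP_ROW)
    return sequential, dual_path

print(backup_layer_write_time(num_rows=64))   # (256.0, 192.0)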
When the third layer L3 is operated, the algorithm requires to read the cache data of the second layer, at this time, the cache data of the second layer exists in both the ping-pong cache main 2 area (on-chip cache U0) and the backup layer 1 area of the off-chip cache, and since the layer (from the third layer to the second layer) is not crossed between the layer where the requested data is located and the data cached in the previous layer, the data is selected to be read from the ping-pong cache main 2 area; when the operation reaches the fourth layer L4 or later, the algorithm requires to read the cache data of the second layer, only the backup layer 1 area of the off-chip cache has the cache data of the second layer, and the cross-layer (from the fourth layer to the second layer) occurs between the level where the requested data is located and the data cached at the previous layer, so the data is selected to be read from the backup layer 1 area of the off-chip cache. The reason for this selection is that the on-chip cache is in the form of an SRAM matrix, the parallelism of the accessed and stored data is higher, and intermediate buffers such as DMA and the like are not needed, so that the on-chip cache speed is higher than that of the off-chip cache, and the on-chip cache is preferentially selected under the condition that the on-chip cache can be selected, which is beneficial to improving the overall efficiency of the storage system.
In one embodiment, the alternately writing the operation result data of each network layer of the neural network into the first cache region or the second cache region after returning from the main operation unit, and reading the operation result data of the network layer from the first cache region or the second cache region and writing the operation result data into the main operation unit to perform the operation of the next network layer of the network layer includes:
determining the base address of each channel and the cache partition of each channel based on the layer number and the channel number of the network layer where the current request is located, and then reading or writing data of each channel in a burst mode based on the base address of each channel.
Referring to fig. 9, the cache control unit performs address encoding on a single write request according to the determination results of WR_SW1 and WR_SW2, and the calculated address result is the base address WR_SUB_BASE of the current burst data, which is the unique base address of each channel under the unified address control. Meanwhile, the address encoding module performs cache partitioning of the channels and channel base address calculation according to information of the layer where the current request is located, such as the layer number and the number of channels.
When the total number of channels of the first-layer data is 4, each channel occupies one storage area and the base addresses are WR_BASE1, WR_BASE2, WR_BASE3 and WR_BASE4 respectively, so the target base addresses of the current write request are (WR_BASE1 + WR_SUB_BASE), (WR_BASE2 + WR_SUB_BASE), (WR_BASE3 + WR_SUB_BASE) and (WR_BASE4 + WR_SUB_BASE) respectively. Since the base address of the whole buffer area where the first-layer data is located is U0_DATA_BASE, the unified base addresses of the write request are:
WR_BASE1+WR_SUB_BASE+U0_DATA_BASE,
WR_BASE2+WR_SUB_BASE+U0_DATA_BASE,
WR_BASE3+WR_SUB_BASE+U0_DATA_BASE,
WR_BASE4+WR_SUB_BASE+U0_DATA_BASE。
Since the cache of the storage system is divided into an on-chip cache and an off-chip cache, and address 0 of the on-chip cache corresponds to RAM_MEM_BASE in the unified address space, the real base addresses of the write request are respectively:
WR_BASE1+WR_SUB_BASE+U0_DATA_BASE-RAM_MEM_BASE,
WR_BASE2+WR_SUB_BASE+U0_DATA_BASE-RAM_MEM_BASE,
WR_BASE3+WR_SUB_BASE+U0_DATA_BASE-RAM_MEM_BASE,
WR_BASE4+WR_SUB_BASE+U0_DATA_BASE-RAM_MEM_BASE。
in the Yolov3-tiny network, the total number of first layer channels is 16, and the division is also performed by the above address allocation method.
The reading operation is applicable to the same situation, and is not described herein again.
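The base-address arithmetic above can be collected into a small helper as sketched below; only the formula real address = WR_BASEn + WR_SUB_BASE + U0_DATA_BASE - RAM_MEM_BASE comes from the description, while all numeric values are illustrative assumptions.

# The base-address arithmetic above, written out as a small helper. All numeric
# values are assumed for illustration; only the formula itself comes from the text.

U0_DATA_BASE = 0x2000    # assumed uniform base of the buffer area holding layer 1
RAM_MEM_BASE = 0x1000    # assumed uniform address of on-chip address 0
CHANNEL_BASES = [0x000, 0x400, 0x800, 0xC00]   # WR_BASE1..WR_BASE4 (4 channels)

def channel_addresses(wr_sub_base):
    """Return (uniform, real on-chip) target base addresses for each channel."""
    uniform = [b + wr_sub_base + U0_DATA_BASE for b in CHANNEL_BASES]
    real = [u - RAM_MEM_BASE for u in uniform]
    return uniform, real

uniform, real = channel_addresses(wr_sub_base=0x40)
for ch, (u, r) in enumerate(zip(uniform, real), start=1):
    print(f"channel {ch}: uniform 0x{u:05X}, real 0x{r:05X}")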
In one embodiment, the alternately writing the operation result data of each network layer of the neural network into the first cache region or the second cache region after being returned from the main operation unit includes:
after receiving an external instruction, analyzing a write-in instruction and informing a write-in data FIFO group to prepare to receive data to be cached;
respectively adjusting the quantity proportion of write-in FIFO and the quantity proportion of read-out FIFO according to the data bit width proportion of the on-chip cache region and the off-chip cache region;
the number of read operations and the number of write operations are adjusted according to the relationship between the number of write FIFOs and the number of read FIFOs, so that the write data FIFO group temporarily stores data input in parallel by multiple channels and reads the data in a burst mode, and the total number of read operations and write operations is reduced.
Referring to fig. 10, the main function of the write data FIFO group is to temporarily store data input in parallel by multiple channels. Since the amount of externally input data is uncertain and the response time of the ready signal issued outwards by the write data FIFO group is uncertain, the present application uses a burst read mode in the read logic of the write data FIFO group: when the multi-channel data in the write data FIFO group has accumulated to the whole-line length of the current layer, the write data FIFO group is allowed to read out the temporarily stored data in parallel, and the continuous writing of data from outside the storage system is not affected.
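A minimal sketch of this whole-line burst-read gate is given below; the WriteDataFifoGroup class, the single deque standing in for the multi-channel FIFOs, and the example line width are assumptions for illustration.

from collections import deque

# Minimal sketch of the burst-read gate on the write data FIFO group: external
# writes of arbitrary size keep arriving, and the group only releases data once
# a whole line of the current layer has accumulated. Line width is an assumed value.

LINE_WIDTH = 416                      # assumed whole-line length of the current layer

class WriteDataFifoGroup:
    def __init__(self):
        self.fifo = deque()

    def push(self, words):            # external writes can keep coming at any time
        self.fifo.extend(words)

    def burst_read(self):             # release data only in whole-line bursts
        if len(self.fifo) < LINE_WIDTH:
            return None
        return [self.fifo.popleft() for _ in range(LINE_WIDTH)]

group = WriteDataFifoGroup()
group.push(range(300))
print(group.burst_read())             # None: less than one whole line buffered
group.push(range(300))
print(len(group.burst_read()))        # 416: one whole line released in a burst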
The write data FIFO group can respectively adjust the number of the participating write FIFOs and the number proportion of the read FIFOs according to the bit width proportion of the on-chip cache and the off-chip cache.
Taking the convolution layer and the pooling layer as an example, when the operation data of the two layers use two types of input interfaces and the data bit widths differ when input into the cache system, the write FIFOs may allocate 8 FIFOs for convolution layer result data with a data bit width of 8×A bits and 2 FIFOs for pooling layer result data with a data bit width of 2×A bits.
The ratio of the numbers of read FIFOs can be illustrated with the on-chip cache and the off-chip cache: when the write instruction targets the on-chip cache, the read FIFOs may allocate 8 FIFOs for the on-chip cache with a write data bit width of 8×B bits; when the write instruction targets the off-chip cache, the read FIFOs may allocate 2 FIFOs for the off-chip cache with a write data bit width of 2×B bits.
When the numbers of write FIFOs and read FIFOs are equal, the operation of the current layer uses that fixed number of FIFOs as the write data FIFO group; when the number of write FIFOs is larger than the number of read FIFOs, several read operations are waited for after each write until the data volume of that write has been read out, and then the next write operation starts; when the number of write FIFOs is smaller than the number of read FIFOs, several write operations are waited for after each read until the data volume of the last read has been written, and then the next read operation starts. When the write cache finishes, the read/write FIFO ratio is fixed, and FIFO positions that were not written or read are padded with 0 by default.
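The scheduling rule above can be expressed as the following sketch; the fifo_schedule() helper and the example FIFO counts are assumptions, and only the wait-for-reads / wait-for-writes rhythm is taken from the description.

# Sketch of how the write-FIFO and read-FIFO counts set the rhythm of the write
# data FIFO group; the helper and the example counts are assumptions.

def fifo_schedule(n_write_fifos: int, n_read_fifos: int) -> str:
    if n_write_fifos == n_read_fifos:
        return "one read per write; this fixed FIFO count is used for the whole layer"
    if n_write_fifos > n_read_fifos:
        reads_per_write = n_write_fifos // n_read_fifos
        return f"after each write, wait for {reads_per_write} reads before the next write"
    writes_per_read = n_read_fifos // n_write_fifos
    return f"after each read, wait for {writes_per_read} writes before the next read"

# 8 write FIFOs (8xA-bit convolution results) drained by 2 read FIFOs (2xB-bit off-chip path):
print(fifo_schedule(8, 2))
print(fifo_schedule(2, 8))
# When the layer's write cache finishes, FIFO positions never written or read
# are padded with 0 by default, as described above.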
In one embodiment, the alternately writing the operation result data returned from the main operation unit by each network layer in the neural network into the first cache region or the second cache region comprises: if the target write layer is an overflow layer, first continuously writing data into the on-chip cache region, then writing data into the on-chip cache region followed by the off-chip cache region, and then continuously writing data into the off-chip cache region; when writing the off-chip cache region, a plurality of write FIFOs (first in, first out) are set to work in parallel in the direct memory access.
In this embodiment, if the target write layer is determined to be an overflow layer, the whole-layer write cache flow is WR_SW1_S3, which needs to be written into the on-chip cache region first and then into the off-chip cache region; the write cache states of the single instructions are WR_SW2_S1, WR_SW2_S3 and WR_SW2_S2 in sequence, that is, continuous on-chip writes appear first, a single instruction that writes on chip and then off chip appears in the middle, and continuous off-chip writes appear last.
Referring to fig. 12, when the target cache region is the off-chip cache region, the cache controller reads the data to be cached out of the write cache FIFO group and sends it to the Direct Memory Access (DMA) unit, using the current layer width as the burst length and the real address of the current instruction as the base address. To improve writing efficiency, the cache system builds 4 parallel FIFOs in the DMA for buffering: while the first line of data has been written into the first FIFO in the DMA but not yet into the external memory, the second line of data is written into the second FIFO in the DMA, and so on. Thus, even if the DMA takes a long time to write the first line into the external memory and buffers the following three lines in the meantime, and the acquisition interval between the fifth line and the fourth line becomes longer than that between the fourth line and the third line because of the differing read/write FIFO ratio, this buffering method inside the DMA can avoid lengthening the acquisition interval between the fourth line and the third line (T2 > T1), and thus avoid a decrease in the caching speed.
When the target cache region spans both the on-chip cache region and the off-chip cache region, the on-chip cache or the off-chip cache is selected according to the unified address. When the target address of a write exceeds the maximum address RAM_MAX_ADDR of the single-chip SRAM in the cache matrix, the on-chip write-cache operation is masked; meanwhile, the cache control unit forwards the write instruction to the off-chip cache controller, which starts the write DMA once the write address exceeds RAM_MAX_ADDR and uses the target address minus RAM_MAX_ADDR as the real address value written into the external memory. The burst length of the write DMA in this case is the current layer width minus the width of the data already written into the on-chip cache.
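The address-based routing of a single line write might be sketched as follows; RAM_MAX_ADDR and the real-address and burst-length formulas follow the text, while the function name and the returned plan structure are illustrative assumptions.

```python
def plan_layer_write(target_addr: int, layer_width: int, ram_max_addr: int) -> dict:
    """Split one line write into an on-chip part and an off-chip DMA part
    according to its unified target address."""
    plan = {"on_chip": None, "off_chip": None}
    end_addr = target_addr + layer_width - 1

    on_chip_width = 0
    if target_addr <= ram_max_addr:
        # part of the line fits below the on-chip maximum address
        on_chip_width = min(layer_width, ram_max_addr - target_addr + 1)
        plan["on_chip"] = {"addr": target_addr, "width": on_chip_width}

    if end_addr > ram_max_addr:
        # remainder is masked on-chip and routed to the write DMA
        first_off_chip = max(target_addr, ram_max_addr + 1)
        plan["off_chip"] = {
            "real_addr": first_off_chip - ram_max_addr,   # target minus RAM_MAX_ADDR
            "burst_len": layer_width - on_chip_width,     # width not yet written on-chip
        }
    return plan

# e.g. plan_layer_write(target_addr=200, layer_width=100, ram_max_addr=249)
# -> on-chip: addr 200, width 50; off-chip: real_addr 1, burst_len 50
```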
In one embodiment, the areas further include an original data cache area, a plurality of backup layer areas, and an operation result area. The original data cache area, the first off-chip cache area, the second off-chip cache area, the backup layer areas, and the operation result area are all located in the off-chip memory, and the original data cache area, the first on-chip cache area, the first off-chip cache area, the second on-chip cache area, the second off-chip cache area, the backup layer areas, and the operation result area are uniformly encoded in this order to form the unified address values.
Referring to the left part of fig. 13, which shows a specific partitioning of the off-chip memory and the on-chip memory, it can be seen that the off-chip memory contains more areas than the on-chip memory, and that the two on-chip memory areas are interleaved between the off-chip areas.
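The unified encoding of these areas can be sketched as a simple address map; the area ordering follows the text and fig. 13, while the key names and the example sizes below are made-up assumptions.

```python
# Illustrative sketch of the unified address map: areas are encoded
# consecutively, with the two on-chip areas interleaved between off-chip areas.

AREA_ORDER = [
    ("original_data_cache", "off-chip"),
    ("on_chip_cache_1",     "on-chip"),
    ("off_chip_cache_1",    "off-chip"),
    ("on_chip_cache_2",     "on-chip"),
    ("off_chip_cache_2",    "off-chip"),
    ("backup_layer_1",      "off-chip"),
    ("backup_layer_2",      "off-chip"),
    ("operation_result",    "off-chip"),
]

def build_address_map(sizes: dict) -> dict:
    """Assign each area a consecutive range of unified address values."""
    addr_map, base = {}, 0
    for name, location in AREA_ORDER:
        size = sizes[name]
        addr_map[name] = {"location": location, "start": base, "end": base + size - 1}
        base += size
    return addr_map

# example_sizes = {name: 1024 for name, _ in AREA_ORDER}
# build_address_map(example_sizes)["off_chip_cache_1"]
# -> {'location': 'off-chip', 'start': 2048, 'end': 3071}
```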
It should be noted that when the algorithm reaches the last layer, the storage module writes the operation result into the operation result area of the off-chip cache and notifies the control path module when caching is complete; after confirming the state of each instruction-load state machine, the control path module notifies the CPU, by interrupt or by polling the state machine, that this round of operation has finished. After receiving the message, the CPU accesses the operation result area of the external memory of the cache system through the system bus to fetch the data. Once the data has been fetched, the CPU instructs the control path to re-arm the instruction-load state machine without clearing the data in the instruction RAMs, prepares to operate on new original data, and the flow returns to step 2; if the network structure has changed and the instruction data needs to be reloaded, the flow returns to step 1; if the operation is to be terminated, the flow returns to step 1 and terminates.
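The branching at the end of a round can be summarized by a small helper; the step numbers refer to the flow in the preceding description, and the helper itself is only an illustrative sketch, not the patent's controller logic.

```python
def next_step(network_changed: bool, terminate: bool) -> str:
    """Decide where the flow goes after the CPU has fetched the result area
    over the system bus: step 1 reloads instruction data, step 2 keeps the
    instruction RAMs and runs new original data."""
    if terminate:
        return "step 1, then terminate"
    if network_changed:
        return "step 1 (reload instruction data)"
    return "step 2 (re-arm instruction-load state machine, keep instruction RAMs)"
```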
In summary, the technical solution disclosed in the present application comprises a control module, an operation module and a storage module. The control module comprises a cache control unit configured to control the artificial neural network operation cache; the operation module comprises one or more main operation units configured to perform the operations of the different layer structures in the artificial neural network; the storage module comprises a fusion storage unit composed of an on-chip memory and an off-chip memory, whose cache addresses are uniformly encoded to form unified address values, the fusion storage unit being configured to load the operation data of each network layer of the artificial neural network and to cache the operation results; the on-chip memory, the cache control unit and the main operation unit are arranged on the same chip. The scheme ensures the parallelism, compatibility and scalability of artificial neural network operations, and reduces the power consumption of the system while maintaining high performance.
It should be noted that:
the algorithms and displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose devices may be used with the teachings herein. The required structure for constructing such a device will be apparent from the description above. In addition, this application is not directed to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the present application as described herein, and any descriptions of specific languages are provided above to disclose the best modes of the present application.
In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the application may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the application, various features of the application are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the application and aiding in the understanding of one or more of the various application aspects. However, this method of disclosure should not be interpreted as reflecting an intention that the claimed application requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this application.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the application and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
The various component embodiments of the present application may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components in an artificial neural network processing system according to embodiments of the present application. The present application may also be embodied as apparatus or device programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing the present application may be stored on a computer readable medium or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.
For example, fig. 3 shows a schematic structural diagram of an electronic device according to an embodiment of the present application. The electronic device 300 comprises a processor 310 and a memory 320 arranged to store computer-executable instructions (computer-readable program code). The memory 320 may be an electronic memory such as a flash memory, an EEPROM (electrically erasable programmable read-only memory), an EPROM, a hard disk, or a ROM. The memory 320 has a storage space 330 storing computer-readable program code 331 for performing any of the method steps described above. For example, the storage space 330 may comprise respective pieces of computer-readable program code 331 for implementing the various steps of the above method. The computer-readable program code 331 may be read from or written to one or more computer program products, which comprise a program code carrier such as a hard disk, a compact disc (CD), a memory card or a floppy disk. Such a computer program product is typically a computer-readable storage medium such as that described in fig. 4. Fig. 4 shows a schematic structural diagram of a computer-readable storage medium according to an embodiment of the present application. The computer-readable storage medium 400 stores computer-readable program code 331 for performing the steps of the method according to the application, which can be read by the processor 310 of the electronic device 300; when executed by the electronic device 300, this code causes the electronic device 300 to perform each of the steps of the method described above. In particular, the computer-readable program code 331 stored on the computer-readable storage medium may perform the method shown in any of the embodiments described above. The computer-readable program code 331 may be compressed in a suitable form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the application, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The application may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third and so on does not indicate any ordering; these words may be interpreted as names.

Claims (17)

1. An artificial neural network processing system, comprising: a control module, an operation module and a storage module;
the control module comprises a cache control unit configured to perform control of an artificial neural network operation cache;
the operation module comprises one or more main operation units, and the main operation units are configured to execute operation on each network layer in the artificial neural network structure;
the storage module comprises a fusion storage unit, and the fusion storage unit is configured to load operation data of each network layer of the artificial neural network and cache operation results; the fusion storage unit comprises an on-chip memory and an off-chip memory, and the cache address of the on-chip memory and the cache address of the off-chip memory are coded uniformly to form a uniform address value.
2. The artificial neural network processing system of claim 1, wherein the on-chip memory, the cache control unit, and the main arithmetic unit are configured on a same chip.
3. The artificial neural network processing system of claim 1, wherein the storage area of the on-chip memory comprises an on-chip cache one area and an on-chip cache two area, the storage area of the off-chip memory comprises an off-chip cache one area and an off-chip cache two area, and the on-chip cache one area, the off-chip cache one area, the on-chip cache two area and the off-chip cache two area are uniformly encoded in sequence to form a uniform address value.
4. The system of claim 1, wherein the on-chip memory comprises two symmetrically arranged static RAM cache matrices, each of which is composed of a plurality of static RAMs and a controller.
5. The system of claim 2, wherein the control module further comprises a master controller configured to perform at least one of: receiving and sending operation instructions of the artificial neural network, scheduling the operation flow of the artificial neural network, and making function decisions of the artificial neural network; and/or,
the operation module further comprises an auxiliary operation unit configured to preprocess the operation data of the artificial neural network and/or collect the operation results; and/or,
the storage module further comprises an auxiliary storage unit, and the auxiliary storage unit is configured to calculate and load weight parameters of each network layer of the artificial neural network;
the main controller and the auxiliary operation unit are realized by a CPU; the auxiliary storage unit is realized through a CPU memory.
6. The system of claim 5, further comprising a control path module disposed on the chip, the control path module being configured to provide caching for communication between the master controller and the operation unit and the cache control unit, and comprising a cache instruction RAM, a weight instruction RAM, and a weight data RAM.
7. A data processing method applied to the neural network processing system according to any one of claims 1 to 6, the method comprising:
dividing the unified address value into regions, wherein the regions at least comprise a first cache region and a second cache region;
alternately writing the operation result data returned by each network layer of the neural network from the main operation unit into the first cache region or the second cache region, reading the operation result data of a network layer from the first cache region or the second cache region, and writing it into the main operation unit to perform the operation of the next network layer.
8. The method of claim 7, wherein the first cache region consists of a first on-chip cache region and a first off-chip cache region with consecutive unified address values, the second cache region consists of a second on-chip cache region and a second off-chip cache region with consecutive unified address values, and writing data to or reading data from the on-chip cache region is preferentially selected.
9. The method of claim 8, wherein the reading from or writing to the first cache region or the second cache region comprises:
and judging whether the target network layer is an original data layer, a result data layer, an overflow layer or a backup layer, and determining a read or write target cache region according to the operational characteristics of each network layer and the relationship between the unified address value of the data and the maximum address value of the on-chip cache region.
10. The method as claimed in claim 9, wherein said determining whether the target network layer is an original data layer, a result data layer, an overflow layer or a backup layer, and determining the target cache area for reading or writing according to the operational characteristics of each network layer and the relationship between the unified address value of the data and the maximum address value of the on-chip cache area comprises:
if the target network layer is not a cross-layer backup layer, not an original data layer, not a result data layer and not an overflow layer, or if the target network layer is an overflow layer but not a backup layer and the unified address value after the current read or write finishes is less than or equal to the maximum address value of the on-chip cache region, only the data in the on-chip cache region needs to be read or written;
if the target network layer is a cross-layer backup layer, an original data layer or a result data layer and is not an overflow layer, or if the target network layer is an overflow layer but not a backup layer and the unified address value before the current read or write starts is larger than the maximum address value of the on-chip cache region, only the data in the off-chip cache region needs to be read or written;
if the target network layer is an overflow layer but not a backup layer, the unified address value before the current read or write starts is smaller than the maximum address value of the on-chip cache region, and the unified address value after the current read or write finishes is larger than the maximum address value of the on-chip cache region, the data in the on-chip cache region needs to be read or written first, and then the data in the off-chip cache region needs to be read or written.
11. The method as claimed in claim 9, wherein said determining whether the target network layer is an original data layer, a result data layer, an overflow layer or a backup layer, and determining the target write buffer according to the calculation characteristics of each network layer and the relationship between the unified address value of the data and the maximum address value of the on-chip buffer further comprises:
if the target write layer is a backup layer, writing its data into the on-chip cache region and the off-chip cache region at the same time.
12. The method of claim 7, wherein the alternately writing the operation result data of each network layer of the neural network into the first cache region or the second cache region after it is returned from the main operation unit, and the reading the operation result data of the network layer from the first cache region or the second cache region and writing it into the main operation unit to perform the operation of the next network layer, comprises:
determining the base address of each channel and the cache partition of each channel based on the layer number and the channel number of the network layer where the current request is located, and then reading or writing data of each channel in a burst mode based on the base address of each channel.
13. The method of claim 7, wherein alternately writing the operation result data of each network layer of the neural network into the first cache region or the second cache region after returning from the main operation unit comprises:
after receiving an external instruction, analyzing a write-in instruction and informing a write-in data FIFO group to prepare to receive data to be cached;
respectively adjusting the quantity proportion of write-in FIFO and the quantity proportion of read-out FIFO according to the data bit width proportion of the on-chip cache region and the off-chip cache region;
the number of read operations and the number of write operations are adjusted according to the relationship between the number of write FIFOs and the number of read FIFOs, so that the write data FIFO group temporarily stores data input in parallel by multiple channels and reads the data in a burst mode, and the total number of read operations and write operations is reduced.
14. The method of claim 13, wherein alternately writing operation result data returned by each network layer in the neural network from the operation unit to the first cache region or the second cache region comprises:
if the target write layer is an overflow layer, first writing its data continuously into the on-chip cache region, then crossing over to the off-chip cache region and continuing to write there continuously, and, when writing into the off-chip cache region, setting a plurality of write FIFOs (first in, first out) inside the direct memory access to work in parallel.
15. The method of claim 8, wherein the area further comprises an original data buffer area, a plurality of backup layer areas and an operation result area, and the original data buffer area, the off-chip buffer one area, the off-chip buffer two area, the plurality of backup layer areas and the operation result area are all disposed on an off-chip memory, and the original data buffer area, the on-chip buffer one area, the off-chip buffer one area, the on-chip buffer two area, the off-chip buffer two area, the plurality of backup layer areas and the operation result area are uniformly encoded in sequence to form a uniform address value.
16. An electronic device, wherein the electronic device comprises: a processor; and a memory arranged to store computer-executable instructions that, when executed, cause the processor to perform the method of any one of claims 7-15.
17. A computer readable storage medium, wherein the computer readable storage medium stores one or more programs which, when executed by a processor, implement the method of any of claims 7-15.
CN201911411798.8A 2019-12-31 2019-12-31 Artificial neural network processing system and data processing method thereof Pending CN111160545A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911411798.8A CN111160545A (en) 2019-12-31 2019-12-31 Artificial neural network processing system and data processing method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911411798.8A CN111160545A (en) 2019-12-31 2019-12-31 Artificial neural network processing system and data processing method thereof

Publications (1)

Publication Number Publication Date
CN111160545A true CN111160545A (en) 2020-05-15

Family

ID=70560315

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911411798.8A Pending CN111160545A (en) 2019-12-31 2019-12-31 Artificial neural network processing system and data processing method thereof

Country Status (1)

Country Link
CN (1) CN111160545A (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2018010513A (en) * 2016-07-14 2018-01-18 日本電気株式会社 Information processing system, automatic application device, application terminal, method and program
CN110298443A (en) * 2016-09-29 2019-10-01 北京中科寒武纪科技有限公司 Neural network computing device and method
CN108197705A (en) * 2017-12-29 2018-06-22 国民技术股份有限公司 Convolutional neural networks hardware accelerator and convolutional calculation method and storage medium
CN108334474A (en) * 2018-03-05 2018-07-27 山东领能电子科技有限公司 A kind of deep learning processor architecture and method based on data parallel
CN108710941A (en) * 2018-04-11 2018-10-26 杭州菲数科技有限公司 The hard acceleration method and device of neural network model for electronic equipment
KR20190129240A (en) * 2018-05-10 2019-11-20 서울대학교산학협력단 Neural network processor based on row operation and data processing method using thereof
CN109740754A (en) * 2018-12-29 2019-05-10 北京中科寒武纪科技有限公司 Neural computing device, neural computing method and Related product
CN209708122U (en) * 2019-06-03 2019-11-29 南京宁麒智能计算芯片研究院有限公司 A kind of computing unit, array, module, hardware system

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111783967A (en) * 2020-05-27 2020-10-16 上海赛昉科技有限公司 Data double-layer caching method suitable for special neural network accelerator
CN111783967B (en) * 2020-05-27 2023-08-01 上海赛昉科技有限公司 Data double-layer caching method suitable for special neural network accelerator
CN111651380A (en) * 2020-06-17 2020-09-11 中国电子科技集团公司第十四研究所 Parameter loading method based on descriptor table
CN111651380B (en) * 2020-06-17 2023-08-18 中国电子科技集团公司第十四研究所 Parameter loading method based on descriptor table
WO2021259098A1 (en) * 2020-06-22 2021-12-30 深圳鲲云信息科技有限公司 Acceleration system and method based on convolutional neural network, and storage medium
CN111752879B (en) * 2020-06-22 2022-02-22 深圳鲲云信息科技有限公司 Acceleration system, method and storage medium based on convolutional neural network
CN111752879A (en) * 2020-06-22 2020-10-09 深圳鲲云信息科技有限公司 Acceleration system, method and storage medium based on convolutional neural network
CN112308226B (en) * 2020-08-03 2024-05-24 北京沃东天骏信息技术有限公司 Quantization of neural network model, method and apparatus for outputting information
CN112308226A (en) * 2020-08-03 2021-02-02 北京沃东天骏信息技术有限公司 Quantization of neural network models, method and apparatus for outputting information
CN112381220A (en) * 2020-12-08 2021-02-19 厦门壹普智慧科技有限公司 Neural network tensor processor
CN112381220B (en) * 2020-12-08 2024-05-24 厦门壹普智慧科技有限公司 Neural network tensor processor
CN112580485A (en) * 2020-12-14 2021-03-30 珠海零边界集成电路有限公司 Image reading and writing method and device, electronic equipment and storage medium
CN112488305A (en) * 2020-12-22 2021-03-12 西北工业大学 Neural network storage organization structure and configurable management method thereof
CN112488305B (en) * 2020-12-22 2023-04-18 西北工业大学 Neural network storage device and configurable management method thereof
CN112860320A (en) * 2021-02-09 2021-05-28 山东英信计算机技术有限公司 Method, system, device and medium for data processing based on RISC-V instruction set
CN113033785B (en) * 2021-02-26 2024-01-09 上海阵量智能科技有限公司 Chip, neural network training system, memory management method, device and equipment
CN113033785A (en) * 2021-02-26 2021-06-25 上海阵量智能科技有限公司 Chip, neural network training system, memory management method, device and equipment
CN113469333A (en) * 2021-06-28 2021-10-01 上海寒武纪信息科技有限公司 Artificial intelligence processor, method and related product for executing neural network model
CN113469333B (en) * 2021-06-28 2023-11-10 上海寒武纪信息科技有限公司 Artificial intelligence processor, method and related products for executing neural network model
CN113946538B (en) * 2021-09-23 2024-04-12 南京大学 Convolutional layer fusion storage device and method based on line caching mechanism
CN113946538A (en) * 2021-09-23 2022-01-18 南京大学 Convolutional layer fusion storage device and method based on line cache mechanism
WO2023124428A1 (en) * 2021-12-30 2023-07-06 上海商汤智能科技有限公司 Chip, accelerator card, electronic device and data processing method
WO2024012492A1 (en) * 2022-07-15 2024-01-18 北京有竹居网络技术有限公司 Artificial intelligence chip, method for flexibly accessing data, device, and medium

Similar Documents

Publication Publication Date Title
CN111160545A (en) Artificial neural network processing system and data processing method thereof
CN106228238B (en) Accelerate the method and system of deep learning algorithm on field programmable gate array platform
CN107250996B (en) Method and apparatus for compaction of memory hierarchies
US11748599B2 (en) Super-tiling in neural network processing to enable analytics at lower memory speed
WO2018077295A1 (en) Data processing method and apparatus for convolutional neural network
JPH04219859A (en) Harware distributor which distributes series-instruction-stream data to parallel processors
US8595437B1 (en) Compression status bit cache with deterministic isochronous latency
JP2011522325A (en) Local and global data sharing
US20190370630A1 (en) Neural network system, application processor having the same, and method of operating the neural network system
CN108074211B (en) Image processing device and method
CN111028360B (en) Data reading and writing method and system in 3D image processing, storage medium and terminal
CN111324294B (en) Method and device for accessing tensor data
US20220100660A1 (en) Reconfigurable cache architecture and methods for cache coherency
US20230334748A1 (en) Control stream stitching for multicore 3-d graphics rendering
CN113065643A (en) Apparatus and method for performing multi-task convolutional neural network prediction
CN110414672B (en) Convolution operation method, device and system
CN105874431A (en) Computing system with reduced data exchange overhead and related data exchange method thereof
CN114328360A (en) Data transmission method, device, electronic equipment and medium
US10990445B2 (en) Hardware resource allocation system for allocating resources to threads
CN117215491A (en) Rapid data access method, rapid data access device and optical module
CN111756802A (en) Method and system for scheduling data stream tasks on NUMA platform
CN116894755A (en) Multi-core state caching in graphics processing
CN112368686A (en) Heterogeneous computing system and memory management method
US11816025B2 (en) Hardware acceleration
CN115904681A (en) Task scheduling method and device and related products

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200515