WO2019196222A1 - A processor applied to a convolutional neural network - Google Patents

A processor applied to a convolutional neural network

Info

Publication number
WO2019196222A1
Authority
WO
WIPO (PCT)
Prior art keywords
memory
feature
index
convolution
store
Prior art date
Application number
PCT/CN2018/095364
Other languages
English (en)
French (fr)
Inventor
刘勇攀
袁哲
岳金山
杨华中
李学清
王智博
Original Assignee
清华大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 清华大学 filed Critical 清华大学
Publication of WO2019196222A1 publication Critical patent/WO2019196222A1/zh

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions

Definitions

  • the invention belongs to the technical field of arithmetic processing, and more particularly to a processor applied to a convolutional neural network.
  • The convolutional neural network (CNN) is a feedforward neural network whose artificial neurons respond to surrounding units within part of their coverage, making it well suited to processing large images.
  • Convolutional neural networks are widely used in fields such as image recognition and speech recognition, but their computational cost is very high.
  • The activation function ReLU (Rectified Linear Unit) used in convolutional neural networks produces a large number of sparse feature maps; likewise, training the network with pruning and similar methods produces a large amount of sparse weight data.
  • Exploiting the sparsity of the feature-map and weight data can greatly improve the computational efficiency of a convolutional neural network. Existing methods fall roughly into two classes: one skips zero values, for example by removing zeros from the input to avoid useless computation; the other ignores zero values, for example by not performing a multiplication when the input data is 0, thereby reducing the amount of computation.
  • These methods all focus on processing a sparse neural network itself, taking the sparsity of the network as a premise.
  • In practice, however, the output of each layer in a convolutional neural network may be either sparse or non-sparse.
  • Current processors can only handle sparse convolutional neural networks and cannot handle sparse and non-sparse convolutional neural networks at the same time.
  • the present invention provides a processor for use in a convolutional neural network.
  • a processor for a convolutional neural network comprising: a plurality of processing units, each of the processing units including a data processor and an index processor;
  • the data processor is configured to convolve each feature map output by each layer in the convolutional neural network with each convolution kernel in the convolutional neural network to obtain a convolution result;
  • the index processor is configured to obtain an index of the convolution result according to the index of each of the feature maps and the index of each of the convolution kernels;
  • when each of the feature maps is in a sparse state, the index of each feature map is the index of the non-zero feature elements in that feature map;
  • when each of the convolution kernels is in a sparse state, the index of each convolution kernel is the index of the non-zero weight elements in that kernel.
  • the processing unit further includes a feature data masking signal, a weight data masking signal, a processing unit masking signal, and an AND gate;
  • when each of the feature maps is in an intermediate state, the feature data masking signal is enabled;
  • when a feature element in the feature map is 0, the feature data masking signal is 0;
  • when each of the convolution kernels is in an intermediate state, the weight data masking signal is enabled;
  • when a weight element in the convolution kernel is 0, the weight data masking signal is 0;
  • the feature data masking signal and the weight data masking signal are combined by the AND gate to produce the processing unit masking signal;
  • when the feature data masking signal is 0 or the weight data masking signal is 0, the processing unit masking signal is 0 and the data processor and the index processor do not run.
  • each of the processing units further includes an accumulator
  • the accumulator is configured to store a convolution result output by each of the processing units and an index of the convolution result
  • a plurality of said processing units share said accumulator.
  • a conflict detector is further included;
  • the conflict detector is configured such that, when the convolution results output by any two adjacent processing units are routed, according to their indexes, to the same accumulator, the convolution results corresponding to those indexes are not stored into the accumulator in that cycle.
  • a storage unit is further included;
  • the storage unit includes a first storage module, a second storage module, and a third storage module;
  • the first storage module is configured to store each of the feature maps
  • the second storage module is configured to store each of the convolution kernels
  • the third storage module is configured to store the location of each of the feature maps, where the location includes the starting address at which each feature map is stored and the number of non-zero elements in each feature map.
  • the first storage module and the second storage module each comprise a plurality of memories;
  • the memories in the first storage module are reconfigured by an external control signal and a multiplexer to store the feature maps in different states;
  • the memories in the second storage module are reconfigured by an external control signal and a multiplexer to store the convolution kernels in different states.
  • the memories in the first storage module include a first memory, a second memory, and a third memory;
  • when each feature map is in a sparse state, the first memory is used to store the non-zero feature elements in each feature map, the second memory is used to store the indexes of the non-zero feature elements in the feature map, and the third memory is turned off;
  • when each feature map is in an intermediate state, the first memory and the second memory are used to store all feature elements in each feature map, and the third memory is turned on to store an identifier for each feature element in the feature map, the identifier marking feature elements that are 0;
  • when each feature map is in a dense state, the first memory and the second memory are used to store all feature elements in the feature map, and the third memory is turned off.
  • the memories in the second storage module include a fourth memory, a fifth memory, and a sixth memory;
  • when each convolution kernel is in a sparse state, the fourth memory is used to store the non-zero weight elements in each convolution kernel, the fifth memory is used to store the indexes of the non-zero weight elements in the kernel, and the sixth memory is turned off;
  • when each convolution kernel is in an intermediate state, the fourth memory and the fifth memory are used to store all weight elements in each kernel, and the sixth memory is turned on to store an identifier for each weight element in the kernel, the identifier marking weight elements that are 0;
  • when each convolution kernel is in a dense state, the fourth memory and the fifth memory are used to store all weight elements in the kernel, and the sixth memory is turned off.
  • each of the processing units is arranged in array form;
  • each column of processing units corresponds to the same feature element;
  • each row of processing units corresponds to the same weight element.
  • the memories in the first storage module include a first memory and a second memory, and the memories in the second storage module include a fourth memory and a fifth memory;
  • when each feature map is in a sparse state, the first memory is used to store the non-zero feature elements in each feature map, and the second memory is used to store the indexes of the non-zero feature elements in the feature map;
  • when each feature map is in a dense state, the first memory and the second memory are used to store all feature elements in the feature map;
  • when each convolution kernel is in a sparse state, the fourth memory is used to store the non-zero weight elements in each convolution kernel, and the fifth memory is used to store the indexes of the non-zero weight elements in the kernel;
  • when each convolution kernel is in a dense state, the fourth memory and the fifth memory are used to store all weight elements in the kernel.
  • The invention provides a processor applied to a convolutional neural network. A data processor performs the convolution of each feature map output by each layer of the convolutional neural network with each convolution kernel of the network, and an index processor computes the index of the convolution result from the index of the feature map and the index of the convolution kernel, yielding both the convolution result and its index.
  • When a feature map or convolution kernel is sparse, indexes are set only for the non-zero elements, which reduces storage space and the amount of computation and makes the processor applicable to convolutional neural networks in both sparse and non-sparse states.
  • FIG. 1 is a schematic diagram of an overall structure of a processor applied to a convolutional neural network according to an embodiment of the present invention
  • FIG. 2 is a schematic diagram of an overall structure of a storage unit in a processor applied to a convolutional neural network according to an embodiment of the present invention
  • FIG. 3 is a schematic diagram of a processing unit arrangement structure in a processor applied to a convolutional neural network according to an embodiment of the present invention
  • FIG. 4 is a schematic comparison of memory access counts in a processor applied to a convolutional neural network according to an embodiment of the present invention
  • FIG. 5 is a schematic diagram of chip area comparison in a processor applied to a convolutional neural network according to an embodiment of the present invention.
  • a processor for a convolutional neural network is provided.
  • the convolutional neural network of the embodiment of the present invention is composed of three parts: the first part is an input layer; the second part is composed of convolutional layers, where each convolutional layer is followed by a pooling layer or by no pooling layer; and the third part is a fully connected multilayer perceptron classifier.
  • Since the convolutional neural network has many convolutional layers and each convolutional layer has multiple convolution kernels, the computational load of the convolutional neural network is large.
  • multiple processing units are used to perform parallel calculation on any convolutional layer using multiple convolution kernels, thereby improving the calculation speed of the convolutional neural network.
  • FIG. 1 is a schematic diagram of an overall structure of a processor applied to a convolutional neural network according to an embodiment of the present invention, including: a plurality of processing units, each processing unit including a data processor and an index processor;
  • the data processor is configured to convolve each feature map output by each layer in the convolutional neural network with each convolution kernel in the convolutional neural network to obtain a convolution result;
  • Each feature map is convolved with multiple convolution kernels, and multiple convolution results are output; each convolution result is itself a feature map.
  • the index processor is configured to obtain the index of the convolution result according to the index of each feature map and the index of each convolution kernel.
  • the index of a feature map is a storage structure that orders the feature elements in the feature map, and the index of a convolution kernel is a storage structure that orders the weight elements in the convolution kernel.
  • the index of the convolution result is a storage structure that orders the convolution results; from the ordering of the feature elements and of the weight elements, the ordering of their convolution results is obtained, which determines the position of each element in the convolution result.
  • when each feature map is in a sparse state, the index of each feature map is the index of the non-zero feature elements in that feature map; when each convolution kernel is in a sparse state, the index of each convolution kernel is the index of the non-zero weight elements in that kernel.
  • the feature map and the convolution kernel are respectively divided into two states, a sparse state and a dense state.
  • the sparse state means that the number of elements with a value of 0 in the matrix is far more than the number of non-zero elements
  • the dense state means that non-zero elements account for the majority of the elements in the matrix.
  • the ratio between the number of non-zero elements in the matrix and the total number of all elements in the matrix is the density of the matrix. For example, if the number of non-zero elements in a matrix is 10 and the total number of all elements in the matrix is 100, the density of the matrix is 0.1.
  • When the density of a feature map or convolution kernel is less than a first preset threshold, that feature map or convolution kernel is in a sparse state; when its density is greater than a second preset threshold, it is in a dense state.
  • the first preset threshold is smaller than the second preset threshold.
  • when a feature map or convolution kernel is in a sparse state, its index is stored in advance, and only the non-zero elements are stored and indexed, which reduces storage space and the amount of computation.
  • when a feature map or convolution kernel is in a dense state, its index is the index of all elements in the feature map or kernel, and the index is generated directly on the processor chip.
  • the data processor performs the convolution of each feature map output by each layer in the convolutional neural network with each convolution kernel in the convolutional neural network, and the index processor operates on the index of the feature map and the index of the convolution kernel to obtain the convolution result and the index of the convolution result.
  • The processing unit in this embodiment further includes a feature data masking signal, a weight data masking signal, a processing unit masking signal, and an AND gate. When each feature map is in an intermediate state, the feature data masking signal is enabled; when a feature element in the feature map is 0, the feature data masking signal is 0. When each convolution kernel is in an intermediate state, the weight data masking signal is enabled; when a weight element in the convolution kernel is 0, the weight data masking signal is 0. The feature data masking signal and the weight data masking signal are combined by the AND gate to produce the processing unit masking signal; when either the feature data masking signal or the weight data masking signal is 0, the processing unit masking signal is 0 and the data processor and the index processor do not run.
  • When the density of a feature map is greater than or equal to the first preset threshold and less than or equal to the second preset threshold, the feature map is in an intermediate state.
  • In that case the feature data masking signal A-GUARD is enabled; A-GUARD masks the feature elements that are 0 in the feature map, so that no data or index computation is performed for them.
  • When the density of a convolution kernel is greater than or equal to the first preset threshold and less than or equal to the second preset threshold, the convolution kernel is in an intermediate state.
  • In that case the weight data masking signal W-GUARD is enabled; W-GUARD masks the weight elements that are 0 in the convolution kernel, so that no data or index computation is performed for them.
  • When a feature element in a feature map is 0, the feature data masking signal is 0; when a weight element in a convolution kernel is 0, the weight data masking signal is 0.
  • When the feature data masking signal is 0 or the weight data masking signal is 0, the processing unit masking signal EN is 0.
  • The processing unit masking signal EN gates the operation of the processing unit: when it is 0, the data processor and the index processor in the processing unit do not operate, thereby reducing power consumption, as shown in FIG. 1.
  • By using the feature data masking signal, the weight data masking signal, the processing unit masking signal, and the AND gate, no computation is performed on the zero elements of a feature map or convolution kernel when the feature map or convolution kernel is in an intermediate state, thereby reducing power consumption.
  • each processing unit in this embodiment further includes an accumulator; wherein the accumulator is configured to store an index of the convolution result and the convolution result output by each processing unit; the plurality of processing units share the accumulator.
  • each processing unit further includes an accumulator.
  • An accumulator is a register that stores the intermediate results produced by the calculation.
  • the accumulator is used to store the convolution result of the data processor output and the index of the convolution result output by the index processor.
  • Several processing units can share the accumulator, thereby reducing the size of each accumulator and greatly reducing the area of the processor chip.
  • the two group-connected processing units PE0 and PE1 in FIG. 1 share an accumulator.
  • This embodiment further includes a conflict detector, which handles conflicts between group-connected processing units and blocks writes when a conflict occurs.
  • According to the index of the convolution result output by a processing unit, that result is routed to the accumulator in the processing unit itself or to the accumulator in an adjacent processing unit. If the convolution results output by any two adjacent processing units are routed to the same accumulator, the outputs of the two processing units are blocked for one cycle, that is, not stored into the accumulator. In the following cycles, at most two cycles are needed to route the convolution results of the two processing units to the accumulator, which resolves the conflicts caused by processing units sharing an accumulator.
  • the processor further includes a storage unit, where the storage unit includes a first storage module, a second storage module, and a third storage module; the first storage module is configured to store each feature map.
  • the processor further includes a storage unit, where the storage unit includes three underlying storage modules, namely a first storage module, a second storage module, and a third storage module.
  • the first storage module is configured to store each feature map
  • the second storage module is configured to store each convolution kernel. Because the sizes of the feature maps output by different layers differ, and sparse-state and dense-state feature maps are stored in different ways, the third storage module is used to store the location of each feature map.
  • the location includes the starting address of each feature map storage and the number of non-zero elements in each feature map, so that each feature map can be accurately obtained from the first storage module.
  • the first storage module and the second storage module each comprise a plurality of memories; the memories in the first storage module are reconfigured by an external control signal and a multiplexer to store feature maps in different states, and the memories in the second storage module are reconfigured by an external control signal and a multiplexer to store convolution kernels in different states.
  • feature maps and convolution kernels can be divided into a number of states according to their density; this embodiment does not limit the number of states.
  • the first storage module and the second storage module each comprise a plurality of memories, the number of which is determined by the number of states of the feature maps and convolution kernels.
  • the memories in the first storage module are reconfigured by an external control signal CTRL and a multiplexer MUX to store feature maps in different states.
  • the memories in the second storage module are reconfigured by an external control signal and a multiplexer to store convolution kernels in different states.
  • Through reconfiguration, feature maps and convolution kernels in different states are stored in different forms, so that a single storage unit can store feature maps and convolution kernels in different states, which improves storage flexibility and reduces the chip area of the storage unit.
  • The memories in the first storage module in this embodiment include a first memory, a second memory, and a third memory. When each feature map is in a sparse state, the first memory stores the non-zero feature elements in each feature map, the second memory stores the indexes of the non-zero feature elements in the feature map, and the third memory is turned off. When each feature map is in an intermediate state, the first memory and the second memory store all feature elements in each feature map, and the third memory is turned on to store an identifier for each feature element in the feature map, the identifier marking feature elements that are 0. When each feature map is in a dense state, the first memory and the second memory store all feature elements in the feature map, and the third memory is turned off.
  • each feature map in this embodiment is divided into three states: a sparse state, a dense state, and an intermediate state.
  • The feature maps in the three states are stored using three memories: the first memory BANK0, the second memory BANK1, and the third memory ZERO GUARD BANK.
  • For a feature map in sparse state S, the non-zero feature elements are stored in BANK0, BANK1 stores the indexes of the non-zero feature elements in the feature map, and ZERO GUARD BANK is turned off.
  • For a feature map in intermediate state M, BANK0 and BANK1 both store the feature elements of the feature map, and ZERO GUARD BANK is turned on to store identifiers marking the feature elements that are 0.
  • For a feature map in dense state D, only BANK0 and BANK1 are opened to store the feature elements of the feature map, as shown in FIG. 2.
  • By reconfiguring the memories in the first storage module, feature maps in different states are stored in different forms; a single storage unit can thus store feature maps in different states, which improves storage flexibility and reduces the chip area of the storage unit.
  • The memories in the second storage module include a fourth memory, a fifth memory, and a sixth memory. When each convolution kernel is in a sparse state, the fourth memory stores the non-zero weight elements in each convolution kernel, the fifth memory stores the indexes of the non-zero weight elements in the kernel, and the sixth memory is turned off. When each convolution kernel is in an intermediate state, the fourth memory and the fifth memory store all weight elements in each kernel, and the sixth memory is turned on to store an identifier for each weight element in the kernel, the identifier marking weight elements that are 0. When each convolution kernel is in a dense state, the fourth memory and the fifth memory store the weight elements in the kernel, and the sixth memory is turned off.
  • each convolution kernel in this embodiment is divided into three states: a sparse state, a dense state, and an intermediate state.
  • Three states of convolution kernels are stored using three memories, a fourth memory, a fifth memory, and a sixth memory.
  • For a convolution kernel in sparse state S, the non-zero weight elements of the kernel are stored in the fourth memory, the fifth memory stores the indexes of the non-zero weight elements, and the sixth memory is turned off.
  • For a convolution kernel in intermediate state M, the fourth memory and the fifth memory both store the weight elements of the kernel, and the sixth memory is turned on to store identifiers marking the weight elements that are 0.
  • For a convolution kernel in dense state D, only the fourth memory and the fifth memory are opened to store the weight elements of the kernel.
  • By reconfiguring the memories in the second storage module, convolution kernels in different states are stored in different forms; a single storage unit can thus store convolution kernels in different states, which improves storage flexibility and reduces the chip area of the storage unit.
  • each processing unit in this embodiment is arranged in array form; each column of processing units corresponds to the same feature element, and each row of processing units corresponds to the same weight element.
  • each processing unit in the processor is arranged in an array form.
  • the processor also includes a multimodal hierarchical storage unit.
  • the storage unit includes a feature track group including a feature data track, a feature index track, and a ZERO GUARD track
  • the weight track group includes a weight data track, a weight index track, and a ZERO GUARD track.
  • Each column of processing units PE shares feature data; each row of processing units shares weight data.
  • Adjacent PEs in each row form a group-connected structure. For N² computations, the storage unit only needs to be read N times, so that convolutional neural networks with different degrees of sparsity are computed efficiently.
  • the memory in the first storage module in this embodiment includes a first memory and a second memory
  • the memory in the second storage module includes a fourth memory and a fifth memory
  • when each feature map is in a sparse state, the first memory stores the non-zero feature elements in each feature map and the second memory stores the indexes of the non-zero feature elements in the feature map; when each feature map is in a dense state, the first memory and the second memory store all feature elements in the feature map.
  • when each convolution kernel is in a sparse state, the fourth memory stores the non-zero weight elements in each convolution kernel and the fifth memory stores the indexes of the non-zero weight elements in the kernel; when each convolution kernel is in a dense state, the fourth memory and the fifth memory store the weight elements in the kernel.
  • each feature map and convolution kernel are respectively divided into two states: a sparse state and a dense state.
  • Two memories are used to store the feature maps and convolution kernels in the two states. Convolution kernels in different states are therefore stored in different forms, and a single storage unit can store convolution kernels in different states, which improves storage flexibility and reduces the chip area of the storage unit.
  • For example, the processor chip is fabricated in a TSMC 65 nm process; the chip has an area of 3 mm × 4 mm, an operating frequency of 20–200 MHz, and a power consumption of 20.5–248.4 mW.
  • FIG. 4 compares the number of storage accesses of this embodiment with other schemes; the fewer the accesses, the lower the energy and the higher the energy efficiency of the processor.
  • In sparse state S, this embodiment uses a sparse storage scheme like the zero-skip scheme and, compared with the zero-off scheme, greatly reduces storage accesses.
  • In intermediate state M, this embodiment uses the zero-off scheme, which reduces the storage accesses to index data required by the zero-skip scheme; when a processing unit is turned off, 65% of its energy is saved.
  • In the dense state, this embodiment can disable both the zero-off and zero-skip schemes at the same time, saving storage accesses.
  • As shown in FIG. 5, when both the convolution kernels and the feature maps are sparse, this embodiment reduces chip area by 91.9% compared with a conventional hash-based storage scheme at similar performance, and the group connection of processing units in this embodiment reduces storage area by 30.3% compared with the case where the processing units are not group-connected.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

A processor applied to a convolutional neural network, the processor comprising a plurality of processing units, each processing unit including a data processor and an index processor. The data processor is configured to convolve each feature map output by each layer in the convolutional neural network with each convolution kernel in the convolutional neural network to obtain a convolution result; the index processor is configured to obtain the index of the convolution result according to the index of each feature map and the index of each convolution kernel. When each feature map is in a sparse state, the index of each feature map is the index of the non-zero feature elements in that feature map; when each convolution kernel is in a sparse state, the index of each convolution kernel is the index of the non-zero weight elements in that kernel. Applying this processor reduces storage space and the amount of computation, and it is suitable for processing convolutional neural networks in both sparse and dense states.

Description

A processor applied to a convolutional neural network
Cross-Reference
This application claims the benefit of Chinese Patent Application No. 201810306605.1, entitled "A processor applied to a convolutional neural network" and filed on April 8, 2018, which is incorporated herein by reference in its entirety.
Technical Field
The present invention belongs to the technical field of arithmetic processing, and more particularly relates to a processor applied to a convolutional neural network.
Background
A convolutional neural network (CNN) is a feedforward neural network whose artificial neurons respond to surrounding units within part of their coverage, and it is well suited to processing large images. Convolutional neural networks are widely used in fields such as image recognition and speech recognition, but their computational cost is very high.
Because the activation function ReLU (Rectified Linear Unit) in a convolutional neural network produces a large number of sparse feature maps, and training the network with pruning and similar methods produces a large amount of sparse weight data, exploiting the sparsity of the feature-map and weight data can greatly improve the computational efficiency of a convolutional neural network. Many methods already exist that improve computation speed based on this sparsity. They fall roughly into two classes. One class focuses on skipping zero values, for example removing zeros from the input to reduce useless computation on zero inputs. The other class ignores zero values, for example not performing a multiplication when the input data is 0, thereby reducing computation. However, these methods all focus on processing a sparse neural network itself and take the sparsity of the network as a premise. In practice, the output of each layer in a convolutional neural network may be sparse or non-sparse, whereas current processors can only handle sparse convolutional neural networks and cannot handle sparse and non-sparse convolutional neural networks at the same time.
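As a quick editorial illustration of the sparsity that ReLU creates (this sketch is not part of the original disclosure; the input size and random data are arbitrary assumptions), the density of a feature map can be measured before and after the activation:

```python
import numpy as np

def density(x: np.ndarray) -> float:
    """Ratio of non-zero elements to total elements (the 'density' used later in the text)."""
    return np.count_nonzero(x) / x.size

rng = np.random.default_rng(0)
feature_map = rng.normal(size=(32, 32))   # pre-activation values, roughly zero-mean

relu_out = np.maximum(feature_map, 0.0)   # ReLU zeroes out every negative activation

print(f"density before ReLU: {density(feature_map):.2f}")  # ~1.00
print(f"density after  ReLU: {density(relu_out):.2f}")     # ~0.50
```

With a roughly zero-mean input, about half of the activations become exactly zero after ReLU, which is the feature-map sparsity the processor described below exploits.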
Summary of the Invention
To overcome the above problem that the prior art cannot handle sparse and non-sparse convolutional neural networks at the same time, or at least to partially solve the above problem, the present invention provides a processor applied to a convolutional neural network.
According to one aspect of the present invention, a processor applied to a convolutional neural network is provided, comprising: a plurality of processing units, each of the processing units including a data processor and an index processor;
wherein the data processor is configured to convolve each feature map output by each layer in the convolutional neural network with each convolution kernel in the convolutional neural network to obtain a convolution result;
the index processor is configured to obtain an index of the convolution result according to the index of each of the feature maps and the index of each of the convolution kernels;
when each of the feature maps is in a sparse state, the index of each feature map is the index of the non-zero feature elements in that feature map;
when each of the convolution kernels is in a sparse state, the index of each convolution kernel is the index of the non-zero weight elements in that kernel.
Specifically, the processing unit further includes a feature data masking signal, a weight data masking signal, a processing unit masking signal, and an AND gate;
wherein, when each of the feature maps is in an intermediate state, the feature data masking signal is enabled;
when a feature element in each of the feature maps is 0, the feature data masking signal is 0;
when each of the convolution kernels is in an intermediate state, the weight data masking signal is enabled;
when a weight element in each of the convolution kernels is 0, the weight data masking signal is 0;
the feature data masking signal and the weight data masking signal are combined by the AND gate to output the processing unit masking signal;
when the feature data masking signal is 0 or the weight data masking signal is 0, the processing unit masking signal is 0 and the data processor and the index processor do not run.
Specifically, each of the processing units further includes an accumulator;
wherein the accumulator is configured to store the convolution result output by each of the processing units and the index of the convolution result;
a plurality of the processing units share the accumulator.
Specifically, a conflict detector is further included;
wherein the conflict detector is configured such that, when the convolution results corresponding to the indexes of the convolution results output by any two adjacent processing units are routed, according to those indexes, to the same accumulator, the convolution results corresponding to those indexes are not stored into the accumulator in that cycle.
Specifically, a storage unit is further included;
wherein the storage unit includes a first storage module, a second storage module, and a third storage module;
the first storage module is configured to store each of the feature maps;
the second storage module is configured to store each of the convolution kernels;
the third storage module is configured to store the location of each of the feature maps, the location including the starting address at which each feature map is stored and the number of non-zero elements in each feature map.
Specifically, the first storage module and the second storage module each include a plurality of memories;
the memories in the first storage module are reconfigured by an external control signal and a multiplexer to store the feature maps in different states;
the memories in the second storage module are reconfigured by an external control signal and a multiplexer to store the convolution kernels in different states.
Specifically, the memories in the first storage module include a first memory, a second memory, and a third memory;
when each of the feature maps is in a sparse state, the first memory is used to store the non-zero feature elements in each feature map, the second memory is used to store the indexes of the non-zero feature elements in the feature map, and the third memory is turned off;
when each of the feature maps is in an intermediate state, the first memory and the second memory are used to store all feature elements in each feature map, and the third memory is turned on to store an identifier for each feature element in the feature map, the identifier being used to mark feature elements that are 0;
when each of the feature maps is in a dense state, the first memory and the second memory are used to store all feature elements in the feature map, and the third memory is turned off.
Specifically, the memories in the second storage module include a fourth memory, a fifth memory, and a sixth memory;
when each of the convolution kernels is in a sparse state, the fourth memory is used to store the non-zero weight elements in each convolution kernel, the fifth memory is used to store the indexes of the non-zero weight elements in the kernel, and the sixth memory is turned off;
when each of the convolution kernels is in an intermediate state, the fourth memory and the fifth memory are used to store all weight elements in each kernel, and the sixth memory is turned on to store an identifier for each weight element in the kernel, the identifier being used to mark weight elements that are 0;
when each of the convolution kernels is in a dense state, the fourth memory and the fifth memory are used to store all weight elements in the kernel, and the sixth memory is turned off.
Specifically, the processing units are arranged in array form;
wherein each column of processing units corresponds to the same feature element;
each row of processing units corresponds to the same weight element.
Specifically, the memories in the first storage module include a first memory and a second memory, and the memories in the second storage module include a fourth memory and a fifth memory;
when each of the feature maps is in a sparse state, the first memory is used to store the non-zero feature elements in each feature map, and the second memory is used to store the indexes of the non-zero feature elements in the feature map;
when each of the feature maps is in a dense state, the first memory and the second memory are used to store all feature elements in the feature map;
when each of the convolution kernels is in a sparse state, the fourth memory is used to store the non-zero weight elements in each convolution kernel, and the fifth memory is used to store the indexes of the non-zero weight elements in the kernel;
when each of the convolution kernels is in a dense state, the fourth memory and the fifth memory are used to store all weight elements in the kernel.
The present invention provides a processor applied to a convolutional neural network. Through a data processor, the processor performs the convolution of each feature map output by each layer of the convolutional neural network with each convolution kernel of the network; through an index processor, it operates on the index of the feature map and the index of the convolution kernel, thereby obtaining the convolution result and the index of the convolution result. When a feature map or convolution kernel is in a sparse state, indexes are set only for the non-zero elements, which reduces storage space and the amount of computation and makes the processor applicable to processing convolutional neural networks in both sparse and non-sparse states.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of the overall structure of a processor applied to a convolutional neural network according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the overall structure of the storage unit in a processor applied to a convolutional neural network according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the arrangement of processing units in a processor applied to a convolutional neural network according to an embodiment of the present invention;
FIG. 4 is a schematic comparison of memory access counts for a processor applied to a convolutional neural network according to an embodiment of the present invention;
FIG. 5 is a schematic comparison of chip area for a processor applied to a convolutional neural network according to an embodiment of the present invention.
Detailed Description
Specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings and examples. The following examples are intended to illustrate the present invention, not to limit its scope.
In one embodiment of the present invention, a processor applied to a convolutional neural network is provided. The convolutional neural network of the embodiment of the present invention is composed of three parts: the first part is an input layer; the second part is composed of convolutional layers, where each convolutional layer is followed by a pooling layer or by no pooling layer; and the third part is a fully connected multilayer perceptron classifier. Because the convolutional neural network has many convolutional layers and each convolutional layer has multiple convolution kernels, the computational load of the convolutional neural network is large. In this embodiment, multiple processing units perform parallel computation on any convolutional layer with multiple convolution kernels, thereby increasing the computation speed of the convolutional neural network.
FIG. 1 is a schematic diagram of the overall structure of a processor applied to a convolutional neural network according to an embodiment of the present invention, comprising: a plurality of processing units, each processing unit including a data processor and an index processor; wherein
the data processor is configured to convolve each feature map output by each layer in the convolutional neural network with each convolution kernel in the convolutional neural network to obtain a convolution result;
The data processor performs convolution according to the feature maps output by each layer in the convolutional neural network and the convolution kernels in the convolutional neural network to obtain convolution results. Each feature map is convolved with multiple convolution kernels and multiple convolution results are output; each convolution result is itself a feature map.
the index processor is configured to obtain the index of the convolution result according to the index of each feature map and the index of each convolution kernel;
The index processor obtains the index of the convolution result according to the index of each feature map and the index of each convolution kernel. The index of a feature map is a storage structure that orders the feature elements in the feature map, and the index of a convolution kernel is a storage structure that orders the weight elements in the convolution kernel. The index of the convolution result is a storage structure that orders the convolution results. From the ordering of the feature elements in the feature map and the ordering of the weight elements, the ordering of their convolution results is obtained, which determines the position of each element in the convolution result.
When each feature map is in a sparse state, the index of each feature map is the index of the non-zero feature elements in that feature map; when each convolution kernel is in a sparse state, the index of each convolution kernel is the index of the non-zero weight elements in that kernel.
In this embodiment, feature maps and convolution kernels are each divided into two states, a sparse state and a dense state. The sparse state means that the number of zero-valued elements in the matrix far exceeds the number of non-zero elements; the dense state means that non-zero elements account for the majority. The ratio of the number of non-zero elements in a matrix to the total number of elements in the matrix is defined as the density of the matrix. For example, if a matrix has 10 non-zero elements and 100 elements in total, its density is 0.1. When the density of a feature map or convolution kernel is less than a first preset threshold, the feature map or convolution kernel is in a sparse state; when the density is greater than a second preset threshold, it is in a dense state; the first preset threshold is smaller than the second preset threshold. When each feature map is in a sparse state, the index of each feature map is the index of the non-zero feature elements in that feature map; when each convolution kernel is in a sparse state, the index of each convolution kernel is the index of the non-zero weight elements in that kernel. When a feature map or convolution kernel is in a sparse state, its index is stored in advance, and only the non-zero elements are stored and indexed, which reduces storage space and computation. When a feature map or convolution kernel is in a dense state, its index is the index of all elements in the feature map or kernel and is generated directly on the processor chip.
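For illustration, the state assignment just described can be sketched in Python as follows; the numeric thresholds are assumptions made for the example, not values given in the disclosure, and the intermediate state is the one used by the masking-signal embodiment described further below.

```python
import numpy as np

FIRST_THRESHOLD = 0.3    # assumed value of the first preset threshold
SECOND_THRESHOLD = 0.7   # assumed value of the second preset threshold

def density(matrix: np.ndarray) -> float:
    # ratio of non-zero elements to the total number of elements
    return np.count_nonzero(matrix) / matrix.size

def classify(matrix: np.ndarray) -> str:
    d = density(matrix)
    if d < FIRST_THRESHOLD:
        return "sparse"        # store only non-zero elements plus a pre-stored index
    if d > SECOND_THRESHOLD:
        return "dense"         # store all elements; index generated on chip
    return "intermediate"      # store all elements plus zero-guard identifiers
```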
In this embodiment, the data processor performs the convolution of each feature map output by each layer of the convolutional neural network with each convolution kernel of the network, and the index processor operates on the index of the feature map and the index of the convolution kernel, so as to obtain the convolution result and the index of the convolution result. When a feature map or convolution kernel is in a sparse state, indexes are set only for the non-zero elements, which reduces storage space and the amount of computation and makes the processor suitable for processing convolutional neural networks in both sparse and non-sparse states.
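A behavioural sketch (not the hardware datapath itself) of what a processing unit computes in the sparse case is given below: the data-processor role multiplies non-zero feature and weight values, and the index-processor role derives each product's output index from the feature index and the weight index before accumulation. Coordinate-pair indexes and a "valid" convolution are assumptions made only for this example.

```python
from collections import defaultdict

def sparse_conv2d(feat_vals, feat_idx, w_vals, w_idx, out_h, out_w):
    """feat_vals[k] is a non-zero feature element whose index is feat_idx[k] = (row, col);
    w_vals / w_idx describe the non-zero weight elements of one kernel."""
    acc = defaultdict(float)                      # accumulator: output index -> partial sum
    for fv, (fy, fx) in zip(feat_vals, feat_idx):
        for wv, (ky, kx) in zip(w_vals, w_idx):
            oy, ox = fy - ky, fx - kx             # index processor: output index from the two input indexes
            if 0 <= oy < out_h and 0 <= ox < out_w:
                acc[(oy, ox)] += fv * wv          # data processor: multiply, then accumulate
    out_idx = sorted(acc)                         # index of the convolution result
    out_vals = [acc[i] for i in out_idx]          # the convolution result (itself a sparse feature map)
    return out_vals, out_idx
```

Only products of non-zero feature elements with non-zero weight elements are ever formed, which is exactly the saving that storing sparse data together with its index makes possible.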
On the basis of the above embodiment, the processing unit in this embodiment further includes a feature data masking signal, a weight data masking signal, a processing unit masking signal, and an AND gate. When each feature map is in an intermediate state, the feature data masking signal is enabled; when a feature element in each feature map is 0, the feature data masking signal is 0. When each convolution kernel is in an intermediate state, the weight data masking signal is enabled; when a weight element in each convolution kernel is 0, the weight data masking signal is 0. The feature data masking signal and the weight data masking signal are combined by the AND gate to output the processing unit masking signal; when the feature data masking signal is 0 or the weight data masking signal is 0, the processing unit masking signal is 0 and the data processor and the index processor do not run.
Specifically, when the density of a feature map is greater than or equal to the first preset threshold and less than or equal to the second preset threshold, the feature map is in an intermediate state. In that case the feature data masking signal A-GUARD is enabled; A-GUARD is used to mask the feature elements that are 0 in the feature map, that is, feature elements equal to 0 undergo no data or index computation. When the density of a convolution kernel is greater than or equal to the first preset threshold and less than or equal to the second preset threshold, the convolution kernel is in an intermediate state. In that case the weight data masking signal W-GUARD is enabled; W-GUARD is used to mask the weight elements that are 0 in the convolution kernel, that is, weight elements equal to 0 undergo no data or index computation. When a feature element in a feature map is 0, the feature data masking signal is 0; when a weight element in a convolution kernel is 0, the weight data masking signal is 0. When the feature data masking signal is 0 or the weight data masking signal is 0, the processing unit masking signal EN is 0. The processing unit masking signal EN gates the operation of the processing unit: when it is 0, the data processor and the index processor in the processing unit do not operate, which reduces power consumption, as shown in FIG. 1.
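For illustration, the gating logic can be modelled as below; the signal names A-GUARD, W-GUARD and EN follow the description, while the function structure itself is only an assumed sketch of the behaviour.

```python
def processing_unit_enable(a_guard_on: bool, w_guard_on: bool,
                           feature_element: float, weight_element: float) -> bool:
    """Return the processing-unit masking signal EN.

    A-GUARD / W-GUARD are only active in the intermediate state; when active,
    a zero operand drives the corresponding masking signal to 0, and the AND
    gate then drives EN to 0 so the data and index processors stay idle."""
    a_signal = 0 if (a_guard_on and feature_element == 0) else 1   # feature data masking signal
    w_signal = 0 if (w_guard_on and weight_element == 0) else 1    # weight data masking signal
    en = a_signal & w_signal                                       # AND gate output
    return bool(en)

# A zero feature element in the intermediate state disables the whole unit:
assert processing_unit_enable(True, True, 0.0, 0.7) is False
assert processing_unit_enable(True, True, 0.3, 0.7) is True
```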
In this embodiment, by using the feature data masking signal, the weight data masking signal, the processing unit masking signal, and the AND gate, no computation is performed on the zero elements of a feature map or convolution kernel when the feature map or convolution kernel is in an intermediate state, thereby reducing power consumption.
On the basis of the above embodiments, each processing unit in this embodiment further includes an accumulator, which is used to store the convolution result output by each processing unit and the index of the convolution result; multiple processing units share the accumulator.
Specifically, each processing unit further includes an accumulator. An accumulator is a register that stores the intermediate results produced by the computation. In this embodiment, the accumulator stores the convolution result output by the data processor and the index of the convolution result output by the index processor. Several processing units can share an accumulator, which reduces the size of each accumulator and greatly reduces the area of the processor chip. In FIG. 1, the two group-connected processing units PE0 and PE1 share an accumulator.
On the basis of the above embodiments, this embodiment further includes a conflict detector, which is configured such that, when the convolution results corresponding to the indexes of the convolution results output by any two adjacent processing units are routed, according to those indexes, to the same accumulator, the convolution results are not stored into the accumulator in that cycle.
Specifically, the processor further includes a conflict detector, which handles conflicts between group-connected processing units and blocks writes when a conflict occurs. According to the index of the convolution result output by a processing unit, the result is routed to the accumulator in that processing unit or to the accumulator in an adjacent processing unit. If the convolution results output by any two adjacent processing units are routed to the same accumulator, the outputs of the two processing units are blocked for one cycle, that is, not stored into the accumulator. In the following cycles, at most two cycles are needed to route the convolution results of the two processing units to the accumulator, which resolves the conflicts caused by processing units sharing an accumulator.
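A cycle-level sketch of the conflict rule for one pair of group-connected processing units follows; the exact retry ordering is an assumption made for the example rather than a statement about the actual pipeline.

```python
def route_two_pe_outputs(out0, out1):
    """out0 / out1: (accumulator_id, result_index, value) produced in the same cycle
    by two adjacent, group-connected PEs. Returns the per-accumulator sums and the
    number of cycles spent, following the rule: on a conflict both results are
    blocked for one cycle, then routed in at most two further cycles."""
    accumulators = {}

    def store(acc_id, idx, val):
        accumulators.setdefault(acc_id, {})
        accumulators[acc_id][idx] = accumulators[acc_id].get(idx, 0) + val

    if out0[0] != out1[0]:            # different accumulators: both stored this cycle
        store(*out0)
        store(*out1)
        return accumulators, 1
    cycles = 1                        # conflict detected: neither result is stored this cycle
    for result in (out0, out1):       # then one result is routed per following cycle
        store(*result)
        cycles += 1
    return accumulators, cycles       # 3 cycles total: 1 blocked + at most 2 to route
```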
On the basis of the above embodiments, the processor in this embodiment further includes a storage unit, which includes a first storage module, a second storage module, and a third storage module. The first storage module stores each feature map; the second storage module stores each convolution kernel; the third storage module stores the location of each feature map, the location including the starting address at which each feature map is stored and the number of non-zero elements in each feature map.
Specifically, the processor further includes a storage unit consisting of three underlying storage modules, namely a first storage module, a second storage module, and a third storage module. The first storage module stores each feature map and the second storage module stores each convolution kernel. Because the sizes of the feature maps output by different layers differ and sparse-state and dense-state feature maps are stored in different ways, the third storage module is used to store the location of each feature map. The location includes the starting address at which each feature map is stored and the number of non-zero elements in each feature map, so that each feature map can be retrieved accurately from the first storage module.
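As an illustration of why the third storage module is needed, the following sketch looks up a sparse feature map in the first storage module from its position record; the field and variable names are assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class FeatureMapLocation:
    start_address: int      # where the feature map begins in the first storage module
    nonzero_count: int      # number of non-zero elements actually stored there

def read_sparse_feature_map(first_module: list, location: FeatureMapLocation):
    """Return the stored words of one sparse-state feature map.

    Because sparse maps store only their non-zero elements, the stored length
    varies from map to map, so the (start address, non-zero count) pair is
    what makes each map recoverable."""
    begin = location.start_address
    end = begin + location.nonzero_count
    return first_module[begin:end]

# Example: two feature maps packed back to back in the first storage module.
first_module = [5, 7, 2, 9, 9, 4]
table = [FeatureMapLocation(0, 4), FeatureMapLocation(4, 2)]
assert read_sparse_feature_map(first_module, table[1]) == [9, 4]
```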
On the basis of the above embodiment, in this embodiment the first storage module and the second storage module each include a plurality of memories; the memories in the first storage module are reconfigured by an external control signal and a multiplexer to store feature maps in different states, and the memories in the second storage module are reconfigured by an external control signal and a multiplexer to store convolution kernels in different states.
Specifically, feature maps and convolution kernels can be divided into a number of states according to their density; this embodiment does not limit the number of states. The first storage module and the second storage module each include a plurality of memories, the number of which is determined by the number of states of the feature maps and convolution kernels. The memories in the first storage module are reconfigured by an external control signal CTRL and a multiplexer MUX to store feature maps in different states; the memories in the second storage module are reconfigured by an external control signal and a multiplexer to store convolution kernels in different states. Through reconfiguration, feature maps and convolution kernels in different states are stored in different forms, so that a single storage unit can store feature maps and convolution kernels in different states, which improves storage flexibility and reduces the chip area of the storage unit.
When the convolution kernels or the feature maps of a convolutional neural network are sparse, the prior art processes them fairly efficiently, but it is inefficient when both the convolution kernels and the feature maps are sparse. This is because, when both are sparse, the output data becomes completely irregular, while the static random-access memory (SRAM) commonly used in existing integrated circuits generally has only one or two ports and cannot support irregular writes of both the convolution kernels and the feature maps. Although the existing approach of hashing across multiple memory banks can support irregular writes of both the convolution kernels and the feature maps, it introduces many memory banks and makes the chip area too large.
On the basis of the above embodiment, in this embodiment the memories in the first storage module include a first memory, a second memory, and a third memory. When each feature map is in a sparse state, the first memory stores the non-zero feature elements in each feature map, the second memory stores the indexes of the non-zero feature elements in the feature map, and the third memory is turned off. When each feature map is in an intermediate state, the first memory and the second memory store all feature elements in each feature map, and the third memory is turned on to store an identifier for each feature element in the feature map, the identifier marking feature elements that are 0. When each feature map is in a dense state, the first memory and the second memory store all feature elements in the feature map, and the third memory is turned off.
Specifically, in this embodiment each feature map is divided into three states: a sparse state, a dense state, and an intermediate state. The feature maps in the three states are stored using three memories: the first memory BANK0, the second memory BANK1, and the third memory ZERO GUARD BANK. For a feature map in sparse state S, the non-zero feature elements are stored in BANK0, BANK1 stores the indexes of the non-zero feature elements in the feature map, and ZERO GUARD BANK is turned off. For a feature map in intermediate state M, BANK0 and BANK1 both store the feature elements of the feature map, and ZERO GUARD BANK is turned on to store identifiers marking the feature elements that are 0. For a feature map in dense state D, only BANK0 and BANK1 are opened to store the feature elements of the feature map, as shown in FIG. 2.
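A functional sketch of the three layouts follows; the dictionary stands in for the physical banks, and the way elements are split between BANK0 and BANK1 in the non-sparse cases is an assumed layout used only for illustration.

```python
import numpy as np

def store_feature_map(fmap: np.ndarray, state: str) -> dict:
    """Model of the reconfigurable first storage module (BANK0 / BANK1 / ZERO GUARD BANK)."""
    banks = {"BANK0": None, "BANK1": None, "ZERO_GUARD_BANK": None}
    flat = fmap.ravel()
    if state == "sparse":                       # state S: values in BANK0, their indexes in BANK1
        nz = np.flatnonzero(flat)
        banks["BANK0"] = flat[nz]
        banks["BANK1"] = nz                     # ZERO GUARD BANK stays off
    elif state == "intermediate":               # state M: all elements, plus zero identifiers
        half = flat.size // 2
        banks["BANK0"], banks["BANK1"] = flat[:half], flat[half:]   # assumed split across the two banks
        banks["ZERO_GUARD_BANK"] = (flat == 0).astype(np.uint8)     # identifier marking zero elements
    else:                                       # state D (dense): all elements, ZERO GUARD BANK off
        half = flat.size // 2
        banks["BANK0"], banks["BANK1"] = flat[:half], flat[half:]
    return banks
```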
In this embodiment, by reconfiguring the memories in the first storage module, feature maps in different states are stored in different forms; a single storage unit can thus store feature maps in different states, which improves storage flexibility and reduces the chip area of the storage unit.
On the basis of the above embodiment, the memories in the second storage module include a fourth memory, a fifth memory, and a sixth memory. When each convolution kernel is in a sparse state, the fourth memory stores the non-zero weight elements in each convolution kernel, the fifth memory stores the indexes of the non-zero weight elements in the kernel, and the sixth memory is turned off. When each convolution kernel is in an intermediate state, the fourth memory and the fifth memory store all weight elements in each kernel, and the sixth memory is turned on to store an identifier for each weight element in the kernel, the identifier marking weight elements that are 0. When each convolution kernel is in a dense state, the fourth memory and the fifth memory store all weight elements in the kernel, and the sixth memory is turned off.
Specifically, in this embodiment each convolution kernel is divided into three states: a sparse state, a dense state, and an intermediate state. The convolution kernels in the three states are stored using three memories, a fourth memory, a fifth memory, and a sixth memory. For a convolution kernel in sparse state S, the non-zero weight elements of the kernel are stored in the fourth memory, the fifth memory stores the indexes of the non-zero weight elements in the kernel, and the sixth memory is turned off. For a convolution kernel in intermediate state M, the fourth memory and the fifth memory both store the weight elements of the kernel, and the sixth memory is turned on to store identifiers marking the weight elements that are 0. For a convolution kernel in dense state D, only the fourth memory and the fifth memory are opened to store the weight elements of the kernel.
In this embodiment, by reconfiguring the memories in the second storage module, convolution kernels in different states are stored in different forms; a single storage unit can thus store convolution kernels in different states, which improves storage flexibility and reduces the chip area of the storage unit.
On the basis of the above embodiments, the processing units in this embodiment are arranged in array form, where each column of processing units corresponds to the same feature element and each row of processing units corresponds to the same weight element.
Specifically, as shown in FIG. 3, the processing units in the processor are arranged in array form. The processor also includes a multimodal hierarchical storage unit. The storage unit includes a feature track group and a weight track group: the feature track group includes a feature data track, a feature index track, and a ZERO GUARD track, and the weight track group includes a weight data track, a weight index track, and a ZERO GUARD track. Each column of processing units (PEs) shares feature data, and each row of processing units shares weight data. In addition, adjacent PEs in each row form a group-connected structure. For N² computations, the storage unit only needs to be read N times, so that convolutional neural networks with different degrees of sparsity are computed efficiently.
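The data reuse implied by this arrangement can be illustrated with the following sketch of one array step: N feature reads and N weight reads feed N² multiply operations, instead of one storage read per operation. The outer-product formulation is an illustrative reading of the figure, not a claim about the exact dataflow.

```python
import numpy as np

def pe_array_cycle(feature_elements: np.ndarray, weight_elements: np.ndarray) -> np.ndarray:
    """One step of an N x N PE array.

    feature_elements: N values read once and shared down the N columns.
    weight_elements:  N values read once and shared across the N rows.
    Returns the N x N grid of products, i.e. N**2 computations from N reads
    of each operand stream."""
    n = feature_elements.size
    assert weight_elements.size == n
    products = np.empty((n, n))
    for row in range(n):                 # every PE in a row sees the same weight element
        for col in range(n):             # every PE in a column sees the same feature element
            products[row, col] = weight_elements[row] * feature_elements[col]
    return products

# 4 feature reads + 4 weight reads -> 16 products
out = pe_array_cycle(np.arange(4.0), np.arange(4.0))
assert out.shape == (4, 4)
```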
On the basis of the above embodiment, in this embodiment the memories in the first storage module include a first memory and a second memory, and the memories in the second storage module include a fourth memory and a fifth memory. When each feature map is in a sparse state, the first memory stores the non-zero feature elements in each feature map and the second memory stores the indexes of the non-zero feature elements in the feature map; when each feature map is in a dense state, the first memory and the second memory store all feature elements in the feature map. When each convolution kernel is in a sparse state, the fourth memory stores the non-zero weight elements in each convolution kernel and the fifth memory stores the indexes of the non-zero weight elements in the kernel; when each convolution kernel is in a dense state, the fourth memory and the fifth memory store all weight elements in the kernel.
In this embodiment, each feature map and convolution kernel is divided into two states, a sparse state and a dense state, and two memories are used to store the feature maps and convolution kernels in the two states. Convolution kernels in different states are therefore stored in different forms, and a single storage unit can store convolution kernels in different states, which improves storage flexibility and reduces the chip area of the storage unit.
For example, the processor chip is fabricated in a TSMC 65 nm process; the chip has an area of 3 mm × 4 mm, an operating frequency of 20–200 MHz, and a power consumption of 20.5–248.4 mW. FIG. 4 compares the number of storage accesses of this embodiment with other schemes; the fewer the accesses, the lower the energy and the higher the energy efficiency of the processor. In sparse state S, this embodiment uses a sparse storage scheme like the zero-skip scheme and, compared with the zero-off scheme, greatly reduces storage accesses. In intermediate state M, this embodiment uses the zero-off scheme, which reduces the storage accesses to index data required by the zero-skip scheme; when a processing unit is turned off, 65% of its energy is saved. In the dense state, this embodiment can disable both the zero-off and zero-skip schemes at the same time, saving storage accesses. As shown in FIG. 5, when both the convolution kernels and the feature maps are sparse, this embodiment reduces chip area by 91.9% compared with a conventional hash-based storage scheme at similar performance, and the group connection of processing units in this embodiment reduces storage area by 30.3% compared with the case where the processing units are not group-connected.
Finally, the methods of the present application are merely preferred embodiments and are not intended to limit the scope of protection of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present invention shall be included within the scope of protection of the present invention.

Claims (10)

  1. A processor applied to a convolutional neural network, characterized by comprising: a plurality of processing units, each of the processing units including a data processor and an index processor;
    wherein the data processor is configured to convolve each feature map output by each layer in the convolutional neural network with each convolution kernel in the convolutional neural network to obtain a convolution result;
    the index processor is configured to obtain an index of the convolution result according to the index of each of the feature maps and the index of each of the convolution kernels;
    when each of the feature maps is in a sparse state, the index of each feature map is the index of the non-zero feature elements in that feature map;
    when each of the convolution kernels is in a sparse state, the index of each convolution kernel is the index of the non-zero weight elements in that kernel.
  2. The processor according to claim 1, characterized in that the processing unit further includes a feature data masking signal, a weight data masking signal, a processing unit masking signal, and an AND gate;
    wherein, when each of the feature maps is in an intermediate state, the feature data masking signal is enabled;
    when a feature element in each of the feature maps is 0, the feature data masking signal is 0;
    when each of the convolution kernels is in an intermediate state, the weight data masking signal is enabled;
    when a weight element in each of the convolution kernels is 0, the weight data masking signal is 0;
    the feature data masking signal and the weight data masking signal are combined by the AND gate to output the processing unit masking signal;
    when the feature data masking signal is 0 or the weight data masking signal is 0, the processing unit masking signal is 0 and the data processor and the index processor do not run.
  3. The processor according to claim 1, characterized in that each of the processing units further includes an accumulator;
    wherein the accumulator is configured to store the convolution result output by each of the processing units and the index of the convolution result;
    a plurality of the processing units share the accumulator.
  4. The processor according to claim 3, characterized by further comprising a conflict detector;
    wherein the conflict detector is configured such that, when the convolution results corresponding to the indexes of the convolution results output by any two adjacent processing units are routed, according to those indexes, to the same accumulator, the convolution results corresponding to those indexes are not stored into the accumulator in that cycle.
  5. The processor according to any one of claims 1 to 4, characterized by further comprising a storage unit;
    wherein the storage unit includes a first storage module, a second storage module, and a third storage module;
    the first storage module is configured to store each of the feature maps;
    the second storage module is configured to store each of the convolution kernels;
    the third storage module is configured to store the location of each of the feature maps, the location including the starting address at which each feature map is stored and the number of non-zero elements in each feature map.
  6. The processor according to claim 5, characterized in that the first storage module and the second storage module each include a plurality of memories;
    the memories in the first storage module are reconfigured by an external control signal and a multiplexer to store the feature maps in different states;
    the memories in the second storage module are reconfigured by an external control signal and a multiplexer to store the convolution kernels in different states.
  7. The processor according to claim 6, characterized in that the memories in the first storage module include a first memory, a second memory, and a third memory;
    when each of the feature maps is in a sparse state, the first memory is used to store the non-zero feature elements in each feature map, the second memory is used to store the indexes of the non-zero feature elements in the feature map, and the third memory is turned off;
    when each of the feature maps is in an intermediate state, the first memory and the second memory are used to store all feature elements in each feature map, and the third memory is turned on to store an identifier for each feature element in the feature map, the identifier being used to mark feature elements that are 0;
    when each of the feature maps is in a dense state, the first memory and the second memory are used to store all feature elements in the feature map, and the third memory is turned off.
  8. The processor according to claim 6, characterized in that the memories in the second storage module include a fourth memory, a fifth memory, and a sixth memory;
    when each of the convolution kernels is in a sparse state, the fourth memory is used to store the non-zero weight elements in each convolution kernel, the fifth memory is used to store the indexes of the non-zero weight elements in the kernel, and the sixth memory is turned off;
    when each of the convolution kernels is in an intermediate state, the fourth memory and the fifth memory are used to store all weight elements in each kernel, and the sixth memory is turned on to store an identifier for each weight element in the kernel, the identifier being used to mark weight elements that are 0;
    when each of the convolution kernels is in a dense state, the fourth memory and the fifth memory are used to store all weight elements in the kernel, and the sixth memory is turned off.
  9. The processor according to any one of claims 1 to 4, characterized in that the processing units are arranged in array form;
    wherein each column of processing units corresponds to the same feature element;
    each row of processing units corresponds to the same weight element.
  10. The processor according to claim 6, characterized in that the memories in the first storage module include a first memory and a second memory, and the memories in the second storage module include a fourth memory and a fifth memory;
    when each of the feature maps is in a sparse state, the first memory is used to store the non-zero feature elements in each feature map, and the second memory is used to store the indexes of the non-zero feature elements in the feature map;
    when each of the feature maps is in a dense state, the first memory and the second memory are used to store all feature elements in the feature map;
    when each of the convolution kernels is in a sparse state, the fourth memory is used to store the non-zero weight elements in each convolution kernel, and the fifth memory is used to store the indexes of the non-zero weight elements in the kernel;
    when each of the convolution kernels is in a dense state, the fourth memory and the fifth memory are used to store all weight elements in the kernel.
PCT/CN2018/095364 2018-04-08 2018-07-12 A processor applied to a convolutional neural network WO2019196222A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810306605.1A CN108510066B (zh) 2018-04-08 2018-04-08 A processor applied to a convolutional neural network
CN201810306605.1 2018-04-08

Publications (1)

Publication Number Publication Date
WO2019196222A1 true WO2019196222A1 (zh) 2019-10-17

Family

ID=63380723

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/095364 WO2019196222A1 (zh) A processor applied to a convolutional neural network

Country Status (2)

Country Link
CN (1) CN108510066B (zh)
WO (1) WO2019196222A1 (zh)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111414993A (zh) * 2020-03-03 2020-07-14 三星(中国)半导体有限公司 卷积神经网络的裁剪、卷积计算方法及装置
WO2021194529A1 (en) * 2020-03-25 2021-09-30 Western Digital Technologies, Inc. Flexible accelerator for sparse tensors in convolutional neural networks
US11462003B2 (en) 2020-03-25 2022-10-04 Western Digital Technologies, Inc. Flexible accelerator for sparse tensors in convolutional neural networks
US11755683B2 (en) 2019-12-23 2023-09-12 Western Digital Technologies, Inc. Flexible accelerator for sparse tensors (FAST) in machine learning

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020047823A1 (en) * 2018-09-07 2020-03-12 Intel Corporation Convolution over sparse and quantization neural networks
CN109543830B (zh) * 2018-09-20 2023-02-03 中国科学院计算技术研究所 一种用于卷积神经网络加速器的拆分累加器
CN109543815B (zh) * 2018-10-17 2021-02-05 清华大学 神经网络的加速方法及装置
CN109635940B (zh) * 2019-01-28 2021-04-06 深兰人工智能芯片研究院(江苏)有限公司 一种基于卷积神经网络的图像处理方法及图像处理装置
CN109858575B (zh) * 2019-03-19 2024-01-05 苏州市爱生生物技术有限公司 基于卷积神经网络的数据分类方法
CN110188209B (zh) * 2019-05-13 2021-06-04 山东大学 基于层次标签的跨模态哈希模型构建方法、搜索方法及装置
CN111126569B (zh) * 2019-12-18 2022-11-11 中国电子科技集团公司第五十二研究所 一种支持剪枝稀疏化压缩的卷积神经网络装置和计算方法
CN111291230B (zh) * 2020-02-06 2023-09-15 北京奇艺世纪科技有限公司 特征处理方法、装置、电子设备及计算机可读存储介质
CN111415004B (zh) * 2020-03-17 2023-11-03 阿波罗智联(北京)科技有限公司 用于输出信息的方法和装置
WO2021248433A1 (en) * 2020-06-12 2021-12-16 Moffett Technologies Co., Limited Method and system for dual-sparse convolution processing and parallelization
CN112200295B (zh) * 2020-07-31 2023-07-18 星宸科技股份有限公司 稀疏化卷积神经网络的排序方法、运算方法、装置及设备
CN114692847B (zh) * 2020-12-25 2024-01-09 中科寒武纪科技股份有限公司 数据处理电路、数据处理方法及相关产品

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017031630A1 (zh) * 2015-08-21 2017-03-02 中国科学院自动化研究所 基于参数量化的深度卷积神经网络的加速与压缩方法
CN106951961A (zh) * 2017-02-24 2017-07-14 清华大学 一种粗粒度可重构的卷积神经网络加速器及系统
CN107679622A (zh) * 2017-09-06 2018-02-09 清华大学 一种面向神经网络算法的模拟感知计算架构

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106650924B (zh) * 2016-10-27 2019-05-14 中国科学院计算技术研究所 一种基于时间维和空间维数据流压缩的处理器、设计方法
CN107239824A (zh) * 2016-12-05 2017-10-10 北京深鉴智能科技有限公司 用于实现稀疏卷积神经网络加速器的装置和方法
US10162794B1 (en) * 2018-03-07 2018-12-25 Apprente, Inc. Hierarchical machine learning system for lifelong learning

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017031630A1 (zh) * 2015-08-21 2017-03-02 中国科学院自动化研究所 基于参数量化的深度卷积神经网络的加速与压缩方法
CN106951961A (zh) * 2017-02-24 2017-07-14 清华大学 一种粗粒度可重构的卷积神经网络加速器及系统
CN107679622A (zh) * 2017-09-06 2018-02-09 清华大学 一种面向神经网络算法的模拟感知计算架构

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11755683B2 (en) 2019-12-23 2023-09-12 Western Digital Technologies, Inc. Flexible accelerator for sparse tensors (FAST) in machine learning
CN111414993A (zh) * 2020-03-03 2020-07-14 三星(中国)半导体有限公司 卷积神经网络的裁剪、卷积计算方法及装置
CN111414993B (zh) * 2020-03-03 2024-03-01 三星(中国)半导体有限公司 卷积神经网络的裁剪、卷积计算方法及装置
WO2021194529A1 (en) * 2020-03-25 2021-09-30 Western Digital Technologies, Inc. Flexible accelerator for sparse tensors in convolutional neural networks
US11462003B2 (en) 2020-03-25 2022-10-04 Western Digital Technologies, Inc. Flexible accelerator for sparse tensors in convolutional neural networks
US11797830B2 (en) 2020-03-25 2023-10-24 Western Digital Technologies, Inc. Flexible accelerator for sparse tensors in convolutional neural networks

Also Published As

Publication number Publication date
CN108510066A (zh) 2018-09-07
CN108510066B (zh) 2020-05-12

Similar Documents

Publication Publication Date Title
WO2019196222A1 (zh) 一种应用于卷积神经网络的处理器
US11042795B2 (en) Sparse neuromorphic processor
KR20200060302A (ko) 처리방법 및 장치
JP2018116469A (ja) 演算システムおよびニューラルネットワークの演算方法
Li et al. ReRAM-based accelerator for deep learning
Jarollahi et al. Algorithm and architecture for a low-power content-addressable memory based on sparse clustered networks
Wang et al. TRC‐YOLO: A real‐time detection method for lightweight targets based on mobile devices
Miyashita et al. Time-domain neural network: A 48.5 TSOp/s/W neuromorphic chip optimized for deep learning and CMOS technology
WO2021089009A1 (zh) 数据流重构方法及可重构数据流处理器
Andri et al. Chewbaccann: A flexible 223 tops/w bnn accelerator
Dutta et al. Hdnn-pim: Efficient in memory design of hyperdimensional computing with feature extraction
US11599181B1 (en) Systems and methods for reducing power consumption of convolution operations of artificial neural networks
US20230073012A1 (en) Memory processing unit core architectures
Yavits et al. Accelerator for sparse machine learning
Fan et al. DT-CGRA: Dual-track coarse-grained reconfigurable architecture for stream applications
Angizi et al. Pisa: A binary-weight processing-in-sensor accelerator for edge image processing
Moon et al. FPGA-based sparsity-aware CNN accelerator for noise-resilient edge-level image recognition
Saha et al. An energy-efficient and high throughput in-memory computing bit-cell with excellent robustness under process variations for binary neural network
Lee et al. A vocabulary forest object matching processor with 2.07 M-vector/s throughput and 13.3 nJ/vector per-vector energy for full-HD 60 fps video object recognition
Bose et al. A 75kb SRAM in 65nm CMOS for in-memory computing based neuromorphic image denoising
Guan et al. Recursive binary neural network training model for efficient usage of on-chip memory
Santoro et al. Design-space exploration of pareto-optimal architectures for deep learning with dvfs
Sudarshan et al. Optimization of DRAM based PIM architecture for energy-efficient deep neural network training
US20230244901A1 (en) Compute-in-memory sram using memory-immersed data conversion and multiplication-free operators
Sim et al. LUPIS: Latch-up based ultra efficient processing in-memory system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18914656

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18914656

Country of ref document: EP

Kind code of ref document: A1