CN117852589A - In-memory accelerator for accelerating convolutional neural network reasoning

Info

Publication number: CN117852589A
Application number: CN202311818871.XA
Authority: CN (China)
Original language: Chinese (zh)
Inventors: 孙晓天, 陈晓明, 韩银和
Applicant / Current Assignee: Institute of Computing Technology of CAS
Legal status: Pending
Classification: Complex Calculations (AREA)
Abstract

The invention provides an in-memory accelerator for accelerating convolutional neural network reasoning. The accelerator comprises a global memory, an on-chip route, and a plurality of cores connected to the global memory, each core comprising: a control unit for acquiring an instruction stream and causing each unit to execute corresponding operations based on the instruction stream, the instruction stream comprising computing operations and memory access operations; a local memory unit for executing the memory access operations, sequentially accessing the input data of the convolutional neural network in the global memory according to the size of a sliding window to obtain the input data corresponding to the computing operations, and sending the corresponding input data to the in-memory computing matrix unit; an in-memory computing matrix unit comprising a plurality of in-memory computing arrays, each in-memory computing array being used to execute the computing operations and perform matrix-vector multiplication on the input data corresponding to the computing operations; and a vector function unit for executing the computing operations and performing post-processing on the results of the matrix-vector multiplication.

Description

In-memory accelerator for accelerating convolutional neural network reasoning
Technical Field
The invention relates to a neural network processor architecture and a design method, in particular to the field of hardware acceleration of neural network model calculation, and more particularly relates to an in-memory accelerator for accelerating convolutional neural network reasoning.
Background
In recent years, deep neural networks have achieved significant breakthroughs in a variety of tasks. As the parameters of neural network models continue to grow, the industry expects hardware to deliver a considerable performance improvement in order to run deep neural network algorithms efficiently. Various deep neural network accelerators have therefore been proposed, but conventional accelerators based on complementary metal oxide semiconductor (Complementary Metal Oxide Semiconductor, CMOS) technology or the von Neumann architecture encounter challenges when dealing with large-scale neural network algorithms, such as the memory wall problem, so that existing accelerators experience bottlenecks in terms of storage, bandwidth, energy efficiency, and so on.
Currently, in-memory computing is regarded as an important technology for solving the memory wall problem: it combines computing and memory functions by exploiting the characteristics of nonvolatile memory devices, thereby effectively alleviating the memory wall, and it has become a popular research direction in the field of deep neural network accelerator design. In-memory computing has a number of implementations, among which emerging nonvolatile memory devices integrated into cross-point (crossbar) arrays have great potential to challenge CMOS; the resulting in-memory computing array, a two-dimensional array structure of nonvolatile memory devices, has attracted increasing attention due to its high memory density and highly parallel in-situ computing characteristics. For ease of understanding, FIG. 1 shows a schematic diagram of a typical two-dimensional array of nonvolatile memory devices. Because of its hardware characteristics, such an array can perform matrix-vector multiplication operations efficiently, usually in analog circuitry. The data are programmed as conductances into the nodes of the array, and the input is applied as a voltage to each row; according to Ohm's law, the current at each node is I_ij = G_ij · V_j; according to Kirchhoff's current law, the accumulated current that can be read in each column is I_i = Σ_j G_ij · V_j. On this basis, the array can compute the product of the matrix G and the vector V in parallel.
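For illustration, the following is a minimal numerical sketch (in Python, not part of the original disclosure) of the matrix-vector multiplication performed by such a crossbar, assuming ideal devices: each cross-point contributes a current G_ij·V_j by Ohm's law, and the currents along each output line sum by Kirchhoff's current law.

    import numpy as np

    def crossbar_mvm(G: np.ndarray, V: np.ndarray) -> np.ndarray:
        """Idealized crossbar computing I = G @ V: G[i, j] is the conductance
        programmed at cross-point (i, j), V[j] is the voltage applied to input
        line j. Ohm's law gives the node current I_ij = G_ij * V_j; Kirchhoff's
        current law sums the node currents of output line i to I_i = sum_j G_ij * V_j."""
        node_currents = G * V                 # Ohm's law at every cross-point
        return node_currents.sum(axis=1)      # Kirchhoff's law along each output line

    # Example: a 3x4 conductance matrix multiplied by a 4-element voltage vector.
    G = np.array([[1.0, 0.5, 0.0, 2.0],
                  [0.2, 0.1, 0.3, 0.0],
                  [1.5, 0.0, 0.7, 0.4]])
    V = np.array([0.3, 0.6, 0.1, 0.9])
    assert np.allclose(crossbar_mvm(G, V), G @ V)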
However, the nonvolatile memory array operates in the analog domain while the other circuits of the accelerator operate in the digital domain, so peripheral circuits such as digital-to-analog converters (DAC), analog-to-digital converters (ADC), sample-and-hold (S&H) and shift-and-accumulate (S&A) units are also required to complete the operation, and the on-chip memory capacity of a deep neural network accelerator is limited. The presence of these peripheral circuits increases the complexity and power consumption of the whole system, and, because models keep growing in size and complexity, the memory on the accelerator may become insufficient, which reduces the throughput of the accelerator and thereby the reasoning speed of the deep neural network. How to use the on-chip local memory effectively therefore becomes a key problem.
It should be noted that: the background is only for describing relevant information of the present invention to facilitate understanding of the technical solution of the present invention, but does not mean that the relevant information is necessarily prior art. Where there is no evidence that related information has been disclosed prior to the filing date of the present application, the related information should not be considered prior art.
Disclosure of Invention
It is therefore an object of the present invention to overcome the above-mentioned drawbacks of the prior art and to provide an in-memory accelerator for accelerating the reasoning of convolutional neural networks.
The invention aims at realizing the following technical scheme:
according to a first aspect of the present invention, there is provided an in-memory accelerator for accelerating reasoning of a convolutional neural network, the in-memory accelerator comprising a global memory for storing data of the convolutional neural network, an on-chip route for transmitting data of one core to other cores or receiving data transmitted by other cores, and a plurality of cores connected to the global memory, each core comprising: the system comprises a control unit, a vector function unit, a memory calculation matrix unit and a local memory unit, wherein: the control unit is configured to obtain an instruction stream, and execute corresponding operations based on each unit of the instruction stream, where the instruction stream includes: a computing operation and a memory access operation; the local memory unit is used for executing the memory access operation, sequentially accessing the input data of the convolutional neural network in the global memory according to the size of the sliding window to obtain the input data corresponding to the calculation operation, and sending the corresponding input data to the memory calculation matrix unit, wherein when the input data of the convolutional neural network is accessed according to the sliding window, the newly added data of the sliding window to be accessed and the data of the sliding window existing in the local memory are obtained from the global memory by taking the pixel as a unit, and the sliding window data to be accessed are obtained by reorganizing the data of the sliding window existing in the local memory; the in-memory computing matrix unit comprises a plurality of in-memory computing arrays, wherein each in-memory computing array is used for executing the computing operation and performing matrix-vector multiplication computation according to input data corresponding to the computing operation; the vector function unit is used for executing the calculation operation and carrying out post-processing according to the result of the matrix-vector multiplication calculation; and finally obtaining the reasoning result of the convolutional neural network according to the cooperation of the in-memory calculation matrix unit and the vector functional unit.
In some embodiments of the present invention, when input data of a convolutional neural network is accessed according to a sliding window, a difference set and an intersection set of data of an existing sliding window and data of the sliding window to be accessed are generated, and the data of the difference set accessed from a global memory and the data of the intersection set acquired from a local memory unit are recombined to obtain the sliding window data to be accessed.
In some embodiments of the present invention, the data accessing the difference set from the global memory and the data of the intersection set obtained from the local memory unit are recombined according to the order of the sliding window read data, so as to obtain sliding window data to be accessed.
In some embodiments of the invention, the local memory unit is configured to: dividing sliding window data accessed from the global memory according to the computing operation to obtain input data corresponding to the computing operation.
In some embodiments of the present invention, the sequentially accessing the input data of the convolutional neural network in the global memory according to the size of the sliding window to obtain the input data corresponding to the computing operation includes: sequentially acquiring input data of a convolutional neural network in a global memory according to the size of a sliding window to obtain data contained in a first sliding window in sequence, obtaining first sliding window data, and loading the first sliding window data into the local memory unit; moving the first sliding window by a unit length, and judging whether intersection exists between data in the current sliding window and the data of the first sliding window; if no intersection exists, sequentially acquiring data contained in the current sliding window, obtaining second sliding window data, and loading the second sliding window data to the local memory unit; if an intersection exists, determining a difference set between data in a current sliding window and the first sliding window data, recombining the data accessing the difference set from a global memory and the data of the intersection acquired from a local memory unit to obtain second sliding window data, and loading the second sliding window data to the local memory unit; and acquiring the input data of the convolutional neural network based on the acquisition mode of the second sliding window data to obtain the input data corresponding to the calculation operation.
In some embodiments of the invention, the instruction stream is derived according to the following: acquiring a convolutional neural network, wherein the convolutional neural network comprises a plurality of processing layers; dividing the weight matrix of each processing layer according to the size of the in-memory computing array to obtain a plurality of array groups corresponding to each processing layer; iteratively selecting weight replication multiples of each processing layer according to a preset genetic algorithm, and mapping a plurality of array groups corresponding to each processing layer to corresponding cores; and judging whether each core has an array group with incomplete computing tasks according to the array groups mapped by the cores, and if so, generating a corresponding instruction stream according to the incomplete computing tasks.
In some embodiments of the invention, the in-memory computing unit is configured to: in the case that two matrix-vector multiplication operations have structural conflict, the former matrix-vector multiplication operation is firstly executed in sequence, and then the latter matrix-vector multiplication operation is executed; in the case that a data dependency exists between two matrix-vector multiplication operations, a preceding matrix-vector multiplication operation is performed first, and a subsequent matrix-vector multiplication operation is performed according to the result of the preceding matrix-vector multiplication operation; in the case where there are no structural conflicts and data dependencies for multiple matrix-vector multiplication operations, the multiple matrix-vector multiplication operations are performed in parallel.
In some embodiments of the invention, the vector functional unit is configured to: performing post-processing according to the result of the matrix-vector multiplication calculation, wherein the post-processing comprises: an activation process, a pooling process, and/or an element-by-element process.
In some embodiments of the present invention, the element-by-element processing is performing accumulation calculation according to the result of the matrix-vector multiplication calculation, and performing activation processing on the accumulated result to obtain an inference result of the convolutional neural network.
According to a second aspect of the present invention, there is provided a method of accelerating convolutional neural network reasoning using the in-memory accelerator of the first aspect, the method comprising: acquiring an instruction stream; loading data of a convolutional neural network in a global memory according to the instruction stream to obtain corresponding input data; performing matrix-vector multiplication calculation according to the loaded input data to obtain a matrix-vector multiplication calculation result; and carrying out post-processing on the result of the matrix-vector multiplication calculation to obtain the reasoning result of the convolutional neural network.
Compared with the prior art, the invention has the advantages that:
1) By using a preset genetic algorithm, the replication multiple of each processing layer and the optimal task-mapping core can be selected automatically according to the amount of hardware resources, the hardware resources are fully utilized, balanced task distribution among the cores is ensured, and the computing parallelism is improved without affecting the distribution of computing tasks and their structural conflicts, which increases the throughput of the accelerator and accelerates the reasoning speed of the neural network.
2) By utilizing the data reading mode of a plurality of sliding windows, repeated data reading can be avoided, further waste of memory bandwidth is avoided, meanwhile, the instruction number and the complexity of control signals can be reduced, the bandwidth and the on-chip cache pressure are reduced, and the utilization rate of the memory is improved.
3) By utilizing the memory allocation strategy of array group multiplexing, the calculation result can be saved by multiplexing the memory blocks, so that the memory resources are classified, the use of the memory is optimized, the chip area and the system power consumption are reduced, and the memory waste is avoided.
Drawings
Embodiments of the invention are further described below with reference to the accompanying drawings, in which:
FIG. 1 is a schematic diagram of a two-dimensional array of non-volatile memory devices as is currently common;
FIG. 2 is a schematic diagram of an in-memory accelerator for accelerating reasoning of a convolutional neural network according to an embodiment of the present invention;
FIG. 3 is a schematic flow chart of a preset genetic algorithm according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a data reading method according to an embodiment of the invention;
FIG. 5 is a schematic diagram of a data reading manner of a plurality of sliding windows according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a memory allocation strategy according to an embodiment of the present invention;
FIG. 7 is a flow chart of an in-memory accelerator for accelerating convolutional neural network reasoning in accordance with an embodiment of the present invention.
Detailed Description
For the purpose of making the technical solutions and advantages of the present invention more apparent, the present invention will be further described in detail by way of specific embodiments with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
As mentioned in the background section, existing in-memory computing accelerators require peripheral circuits to complete their operation, while the on-chip memory capacity of a deep neural network accelerator is limited; the presence of these peripheral circuits increases the complexity and power consumption of the overall system, and, due to the increasing size and complexity of models, the memory on the accelerator may become insufficient, which reduces the throughput of the accelerator and thereby the reasoning speed of the deep neural network.
In order to solve the above problems, the present invention proposes an in-memory accelerator for accelerating convolutional neural network reasoning. As shown in FIG. 2, the in-memory accelerator includes a global memory, an on-chip route, and a plurality of cores connected to the global memory, and each core includes a control unit, a vector function unit, an in-memory computing matrix unit and a local memory unit, wherein: the control unit is configured to acquire an instruction stream and cause each unit to execute corresponding operations based on the instruction stream, the instruction stream comprising computing operations and memory access operations; the local memory unit is used for executing the memory access operations, sequentially accessing the input data of the convolutional neural network in the global memory according to the size of a sliding window to obtain the input data corresponding to the computing operations, and sending the corresponding input data to the in-memory computing matrix unit, wherein, when the input data of the convolutional neural network is accessed according to the sliding window, only the data newly required by the sliding window to be accessed is obtained from the global memory, pixel by pixel, and is recombined with the sliding window data already present in the local memory to obtain the sliding window data to be accessed; this reading mode avoids repeated reads of data and the resulting waste of memory bandwidth, so that subsequent calculations do not occupy excessive computing memory and the utilization rate of the memory is improved; the in-memory computing matrix unit comprises a plurality of in-memory computing arrays, each in-memory computing array being used to execute the computing operations and to perform matrix-vector multiplication on the input data corresponding to the computing operations; the vector function unit is used for executing the computing operations and performing post-processing on the results of the matrix-vector multiplication; and the reasoning result of the convolutional neural network is finally obtained through the cooperation of the in-memory computing matrix unit and the vector function unit.
In order to better understand the present invention, the following describes the technical scheme of the present invention in detail with reference to specific embodiments.
The invention provides an in-memory accelerator for accelerating reasoning of a convolutional neural network, which comprises a global memory, an on-chip route and a plurality of cores connected to the global memory, wherein the global memory is used for storing data of the convolutional neural network, the on-chip route is used for sending the data of one core to other cores or receiving the data sent by other cores, and the cores, which can be interconnected by on-chip routing, a bus or the like, execute calculations in parallel and asynchronously; each core comprises a control unit, a vector function unit, an in-memory computing matrix unit and a local memory unit, and the in-memory computing matrix unit and the vector function unit can only access the data in the local memory. The following details the process of data interaction among the global memory, the on-chip routing and the plurality of cores.
First, a hardware description file is read to initialize the in-memory accelerator, the hardware description file comprising the following parameters: global memory capacity, bus bandwidth, on-chip routing bandwidth, number of cores, number of vector function units per core, number of in-memory computing matrix units per core, and local memory capacity. Then a pre-trained convolutional neural network model is read and analyzed to obtain a model description file, the model description file comprising the node information and topological structure of the convolutional neural network; it takes nodes as basic elements and covers both topology and parameters. Table 1 shows an example of the model description file (taking node 2 as an example), where bitwidth represents the precision of the node weights, consumer represents the consumer name, provider represents the producer name, node_index represents the node number, node_operation represents the type of node operation, input_dim represents the input dimension of the node, output_dim represents the output dimension of the node, and parameter represents the parameter information of the node, as shown in Table 1:
Table 1 sample of model description file
According to one embodiment of the present invention, the pre-trained convolutional neural network model may be obtained ready-made or trained with an existing data set (e.g., the ILSVRC2012 data set); the training process using the training data includes: obtaining a training set comprising a plurality of image samples and labels indicating the categories to which the image samples belong, and training the convolutional neural network model one or more times with the training set to obtain the pre-trained convolutional neural network model. Illustratively, the convolutional neural network model may use an existing model, such as AlexNet, VGG, ResNet, Inception, etc., or an appropriate model may be constructed according to the actual study.
Second, in order to specify the data interaction among the global memory, the on-chip routing and the plurality of cores, the present invention provides a dedicated instruction set architecture, so that the control unit of each core controls the respective units to perform their operations based on the instruction stream; the instruction stream includes computing operations performed by the in-memory computing matrix unit and the vector function unit of each core, memory access operations performed by the local memory unit of each core, and communication operations performed by the on-chip routing (or bus), as shown in Table 2:
Table 2 in-memory accelerator specific instruction set architecture
The MVM instruction is a matrix-vector multiplication instruction, and the matrix-vector multiplication is performed by using an in-memory calculation matrix unit. The instruction format is "mvm idx dst src len", which means: the in-memory computing array with the sequence number of idx reads data with the address of src in the local memory, the length of the data is len, a matrix-vector multiplication operation is carried out, and then the computing result is stored in the local memory with the address of dst.
The VEC instruction is a post-processing instruction that performs post-processing using the vector function unit. The instruction format is "vec op dst src1 src2 len", which means: the vector function unit reads two vectors of length len at addresses src1 and src2 in the local memory, performs post-processing on the two vectors, and then stores the processed result in the local memory at address dst. Here op identifies the post-processing operation: op=0 represents element-by-element addition of two vectors, op=1 represents element-by-element subtraction of two vectors, op=2 represents element-by-element multiplication of two vectors, op=3 represents element-by-element division of two vectors, and op=4 represents the element-by-element maximum of two vectors; when the addresses src1 and src2 are the same, the vector at address src1 is activated, and correspondingly op=5 represents activation with the ReLU activation function, op=6 represents activation with the tanh activation function, and op=7 represents activation with the sigmoid activation function.
The LOAD instruction is a LOAD instruction that reads data from global memory using a local memory unit and writes the read data to the local memory. The instruction format is "load dst src size", which means: and writing the continuous data with the address src and the length size in the global memory into the local memory with the address dst and the length size.
STORE is a save instruction that uses a local memory unit to write local memory data into global memory. The instruction format is "store dst src size", which means: and writing the continuous data with the address src and the length size in the local memory into the global memory with the address dst and the length size.
The LMD is a move instruction that moves local memory data using a local memory unit. The instruction format is "lmd dst src size", interpreted as: and writing the continuous data with the address src and the length size in the local memory into the local memory with the address dst and the length size.
SEND is a SEND instruction that SENDs data to other cores using on-chip routing (or bus). The instruction format is "send src size idx", which means: and sending the continuous data with the address src and the length size in the local memory to the core with the sequence number idx.
RECV is a receive instruction that uses on-chip routing (or bus) to receive data from other cores. The instruction format is "recv dst size idx", which means: data sent from a core with the sequence number idx is received and stored in a local memory with the address dst and the length size.
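For illustration, an example instruction sequence using the above instruction formats is given below. It is a hypothetical fragment (addresses, lengths, array numbers and the choice of 64-element outputs are invented for the example, and addresses are assumed to count data elements): it loads one unrolled sliding window, runs it through two in-memory computing arrays holding two halves of a weight block, accumulates and activates the partial sums, and writes the result back.

    load  0x0000 0x8000 54           # load one unrolled 54-element sliding window into local memory
    mvm   0 0x0100 0x0000 27         # array 0: multiply the first 27 inputs by its weight sub-block
    mvm   1 0x0140 0x001B 27         # array 1: multiply the last 27 inputs by its weight sub-block
    vec   0 0x0180 0x0100 0x0140 64  # op=0: element-by-element addition of the two partial results
    vec   5 0x0180 0x0180 0x0180 64  # op=5, src1 = src2: ReLU activation in place
    store 0x9000 0x0180 64           # write the activated 64-element result back to global memory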
In addition, in the study of in-memory accelerators it was found that the prior art focuses on designing a specific hardware architecture and does not fully consider the details of deploying the neural network on the in-memory computing accelerator, so the throughput of the in-memory accelerator is low. The invention therefore provides a way of generating an instruction stream; the instruction stream is obtained as follows: acquiring a convolutional neural network comprising a plurality of processing layers (nodes); dividing the weight matrix of each processing layer according to the size of the in-memory computing array to obtain a plurality of array groups corresponding to each processing layer; iteratively selecting the weight replication multiple of each processing layer according to a preset genetic algorithm, and mapping the plurality of array groups corresponding to each processing layer to the corresponding cores; and judging, according to the array groups mapped to each core, whether the core has an array group with unfinished computing tasks, and, if so, generating a corresponding instruction stream according to the unfinished computing tasks.
Since the prior art relies on manually mapping the weight data onto the arrays, which ignores the impact of the weight mapping on array parallelism, and since the size of the in-memory computing arrays in the in-memory computing matrix unit is limited so that a complete processing layer (a convolutional layer or a fully connected layer) of a convolutional neural network generally cannot be mapped entirely into the same in-memory computing array, the weights of these processing layers need to be divided according to the size of the in-memory computing array. According to one embodiment of the present invention, the weights of each convolution kernel of a processing layer in the convolutional neural network are reorganized into a column (the organization order must be consistent with the order in which the sliding window is reorganized into a column) to obtain the corresponding weight matrix, and the weight matrix of each processing layer is divided according to the size of the in-memory computing array to obtain a plurality of array groups corresponding to each processing layer. Schematically, a fully connected layer of the convolutional neural network is regarded as a convolutional layer whose convolution kernel size is 1, whose number of convolution kernel channels equals the number of input elements of the fully connected layer, and whose number of convolution kernels equals the number of output elements of the fully connected layer. Reorganizing the weights of each convolution kernel of a convolutional layer into a column yields a weight matrix of height k_w × k_h × C_in and width C_out, where k_w represents the convolution kernel length, k_h represents the convolution kernel width, C_in represents the number of input channels, and C_out represents the number of output channels; the size of an in-memory computing array is width W_xbar and height H_xbar, and the weight matrix of each processing layer is divided according to this array size to obtain the corresponding plurality of array groups. The technical scheme of this embodiment can at least achieve the following beneficial technical effects: because the weight matrix is divided by the size of the in-memory computing array, the resulting array groups fit the scale of the in-memory computing hardware resources, which reduces the bandwidth and on-chip cache pressure.
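As a sketch (for illustration only, not a limitation of the embodiment), the partitioning of a weight matrix into crossbar-sized blocks can be expressed as follows; the convention that arrays sharing the same input rows form one array group is an assumption.

    import math

    def partition_weights(k_w, k_h, c_in, c_out, h_xbar, w_xbar):
        """Tile a (k_w*k_h*c_in) x c_out weight matrix into crossbar-sized blocks.
        Assumed grouping: arrays that share the same input rows form one array group,
        so each group holds ceil(c_out / w_xbar) arrays and there are
        ceil(k_w*k_h*c_in / h_xbar) groups per processing layer."""
        rows = k_w * k_h * c_in
        num_groups = math.ceil(rows / h_xbar)            # tiles along the input dimension
        arrays_per_group = math.ceil(c_out / w_xbar)     # tiles along the output dimension
        return num_groups, arrays_per_group

    # Example: a 3x3 convolution with 64 input and 128 output channels on 128x128 arrays.
    print(partition_weights(3, 3, 64, 128, 128, 128))    # -> (5, 1)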
Secondly, the prior art often ignores weight replication or selects the replication multiple with intuitive methods, such as replicating the first several layers of the network several times to balance the computation between layers, but such choices cannot use the resources effectively. Because the in-memory computing matrix unit in the in-memory accelerator serves as both a storage unit and a computing unit, replicating the weight data multiple times is an important way of improving computing parallelism; however, the core mapping affects the distribution of computing tasks and the structural conflicts between them. In order to improve computing parallelism without affecting the distribution of computing tasks and their structural conflicts, the invention proposes a genetic algorithm that solves the two problems at the same time. The genetic algorithm adopts integer coding, which balances flexibility and operating efficiency and avoids the increase in running time caused by binary coding. According to one embodiment of the present invention, in order to ensure that the core mapping is not too scattered, which would make the on-chip memory a limiting factor, the invention sets the number of processing layers that each core can accommodate, so that the position of each gene in the chromosome determines the sequence number of the core corresponding to an array group. The execution steps of the genetic algorithm are shown in FIG. 3: step T1, encoding the plurality of array groups of each processing layer as integers; step T2, randomly selecting a replication multiple for the weights of each processing layer, and randomly selecting the cores to be bound to the array groups; step T3, judging whether the selected replication multiple and mapped cores meet a preset requirement, the preset requirement being that the number of iterations of the genetic algorithm reaches an upper limit or the utilization rate meets the requirement; if so, going to step T4, and if not, going to step T5; step T4, if the requirement is met, replicating the weights according to the replication multiple and mapping the plurality of array groups of each processing layer to the corresponding cores according to the core mapping; step T5, if the preset requirement is not met, evaluating, with the fitness function, the time required by the in-memory computing units to execute the matrix-vector multiplication operations according to the current replication multiple and core mapping; step T6, selecting, according to the evaluated time, the several processing layers with the shortest time; step T7, randomly selecting one of these processing layers and selecting a weight replication multiple and a core mapping for it using the gene mutation of the genetic algorithm; then returning to step T3 to judge whether the selected weight replication multiple and core mapping meet the preset requirement; if so, step T4 is executed, and if not, step T5 is executed. Illustratively, the fitness function uses the inferred total time as its index, where the inferred total time is T = max_i(T_i), T_i = n_i × MVM_time, n_i represents the total number of matrix-vector multiplication tasks of the i-th core, and MVM_time is the time required by an in-memory computing array to perform one matrix-vector multiplication operation.
According to one embodiment of the present invention, the gene mutation comprises the following operators: 1) randomly selecting a processing layer, increasing its replication multiple, and mapping it onto the corresponding cores; for example, suppose the i-th processing layer is randomly selected and its replication multiple is originally r_i; after the mutation operation the replication multiple of the i-th processing layer is r_i + 1. 2) Randomly selecting a processing layer, reducing its replication multiple, and thereby freeing array resources; for example, suppose the i-th processing layer is randomly selected and its replication multiple is originally r_i; after the mutation operation the replication multiple of the i-th processing layer is r_i − 1. 3) Randomly selecting a processing layer and scattering its array groups onto more cores; for example, suppose the i-th processing layer is randomly selected and its array groups are distributed over k cores; the core mapping is then changed by reassigning its array groups to k + 1 cores, i.e., the degree of dispersion of the array groups is increased. 4) Randomly selecting a processing layer and merging its array groups into the cores that already hold the same processing layer; for example, suppose the i-th processing layer is randomly selected and its array groups are distributed over k cores; the core mapping is then changed by reassigning its array groups to k − 1 cores, i.e., the degree of dispersion of the array groups is reduced. Since the genetic algorithm is an iterative optimization process, it is possible that the k-th iteration selects the weight replication multiple and the (k+1)-th iteration selects the core mapping. The technical scheme of this embodiment can at least achieve the following beneficial technical effects: by selecting the weight replication multiple and the core mapping through the genetic algorithm, the computing parallelism and the resource utilization rate can be improved without affecting the distribution of computing tasks and their structural conflicts.
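A minimal Python sketch of the mutation step and the fitness function is given below; the solution encoding (layer index mapped to a replication multiple and a set of cores) and the helper names are assumptions made for illustration only.

    import random

    MVM_TIME = 1.0  # time of one in-memory matrix-vector multiplication (arbitrary unit)

    def fitness(tasks_per_core):
        """Inferred total time T = max_i(T_i) with T_i = n_i * MVM_time (lower is better)."""
        return max(n * MVM_TIME for n in tasks_per_core)

    def mutate(solution, num_cores):
        """Apply one of the four mutation operators to a randomly chosen processing layer.
        `solution` maps layer index -> (replication multiple r, set of mapped cores)."""
        layer = random.choice(list(solution))
        r, cores = solution[layer]
        op = random.randrange(4)
        if op == 0:                          # 1) increase the replication multiple
            r += 1
        elif op == 1 and r > 1:              # 2) decrease the replication multiple
            r -= 1
        elif op == 2:                        # 3) scatter the array groups over one more core
            cores = cores | {random.randrange(num_cores)}
        elif op == 3 and len(cores) > 1:     # 4) merge the array groups onto one fewer core
            cores = set(list(cores)[:-1])
        solution[layer] = (r, cores)
        return solution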
According to one embodiment of the invention, when the weight matrix and the in-memory computing array are bound, the arrays in the same array group are preferentially mapped into the same core, and because the arrays belonging to the same array group can be driven by the same instruction, the instruction number and the control signal complexity can be reduced. The technical scheme of the embodiment at least can realize the following beneficial technical effects: the arrays have identical inputs and if mapped onto the same core, the input data may be broadcast to the arrays, thereby avoiding repeated reads of data corresponding to each array while reducing bandwidth and on-chip cache pressure.
Then, according to the weight replication multiple and core mapping obtained by the preset genetic algorithm, and according to the array groups mapped to each core, it is judged whether each core has an array group with unfinished computing tasks, and, if so, a corresponding instruction stream is generated according to the unfinished computing tasks. According to one embodiment of the present invention, the total number of computing tasks of each array group is determined according to the weight replication multiple, the computing tasks of the array groups in each core are recorded in a per-core data structure, and the array groups are mapped to the corresponding cores according to the core mapping; each core is mapped with a plurality of array groups, and each array group corresponds to matrix-vector multiplication computing tasks. Whether a core has an array group with unfinished computing tasks is judged according to its data structure, and, if there is such an array group, a corresponding instruction stream is generated according to the unfinished computing tasks, wherein each time a matrix-vector multiplication computing task of the generated instruction stream is executed, the count of unfinished tasks in the corresponding data structure is decremented by one.
The process of accelerating the convolutional neural network reasoning by the initialized in-memory accelerator according to the generated instruction stream comprises the following steps:
1. The local memory unit is used to execute the memory access operations: it sequentially accesses the input data of the convolutional neural network in the global memory according to the size of a sliding window so as to obtain the input data corresponding to the computing operations, and sends the corresponding input data to the in-memory computing matrix unit.
The existing data reading method generally loads the data of the global memory into the local memory window by window, from left to right and from top to bottom, according to the size of the sliding window; however, this method repeatedly reads part of the data, which makes the data overly complex and bulky in subsequent calculations, occupies more computing memory, and wastes memory bandwidth. In order to improve the utilization rate of the memory, the invention provides a data reading mode over multiple sliding windows: when the input data of the convolutional neural network is accessed according to the sliding window, only the data newly required by the sliding window to be accessed is obtained from the global memory, pixel by pixel, and is recombined with the sliding window data already present in the local memory to obtain the sliding window data to be accessed. Specifically, when the input data of the convolutional neural network is accessed according to the sliding window, the difference set and the intersection set of the existing sliding window data and the sliding window data to be accessed are generated, and the data of the difference set accessed from the global memory and the data of the intersection set obtained from the local memory unit are recombined according to the order in which the sliding window reads data, so as to obtain the sliding window data to be accessed, wherein the order in which the sliding window reads data may be, for example, from left to right and from top to bottom.
According to an embodiment of the present invention, sequentially accessing the input data of the convolutional neural network in the global memory according to the size of the sliding window to obtain the input data corresponding to the computing operations includes: sequentially acquiring the input data of the convolutional neural network in the global memory according to the size of the sliding window to obtain the data contained in a first sliding window, obtaining first sliding window data, and loading the first sliding window data into the local memory unit; moving the first sliding window by one unit length, and judging whether an intersection (i.e., repeated data) exists between the data in the current sliding window and the first sliding window data; if no intersection exists, sequentially acquiring the data contained in the current sliding window to obtain second sliding window data, and loading the second sliding window data into the local memory unit; if an intersection exists, determining the difference set between the data in the current sliding window and the first sliding window data, recombining the data of the difference set accessed from the global memory with the data of the intersection obtained from the local memory unit to obtain the second sliding window data, and loading the second sliding window data into the local memory unit; and acquiring the input data of the convolutional neural network in the manner used for the second sliding window data, so as to obtain the input data corresponding to the computing operations. It is worth noting that, when the sliding window is only one unit long and is moved to the left or right by one unit length, no repeated data exists between the current sliding window and the first sliding window, and the data of the current sliding window is still acquired in the same manner as the data of the first sliding window. The technical scheme of this embodiment can at least achieve the following beneficial technical effects: the data reading mode over multiple sliding windows avoids repeated reads of data and the resulting waste of memory bandwidth, reduces the number of instructions and the complexity of the control signals, reduces the bandwidth and the on-chip cache pressure, and improves the utilization rate of the memory.
According to an example of the present invention, as shown in FIG. 4, the sliding window corresponds to the input data numbered 1 through 9 in the global memory; when the input data is read from the global memory into the local memory, the data is read into the local memory in the order from left to right and from top to bottom, so that it becomes continuous input data (123456789); the sliding window data (123456789) accessed from the global memory is then divided according to the computing operations, so as to obtain the input data corresponding to the computing operations (for example, the input data corresponding to array groups 0, 1 and 2).
According to an example of the present invention, the data reading mode over multiple sliding windows is shown in FIG. 5. The first sliding window corresponds to the input data at positions 0, 1, 2, 5, 6, 7, 10, 11 and 12 in the global memory; following the existing way of reading input data, the data corresponding to this sliding window is read into the local memory to form continuous data arranged in the order 0, 1, 2, 5, 6, 7, 10, 11, 12. The second sliding window corresponds to the input data at positions 1, 2, 3, 6, 7, 8, 11, 12 and 13 in the global memory, so the part of its data that repeats the first sliding window can be determined: the repeated data is the data at positions 1, 2, 6, 7, 11 and 12, i.e., the data at positions 1, 2, 6, 7, 11 and 12 is the intersection of the first sliding window data and the second sliding window data, while the data at positions 3, 8 and 13 is the difference set of the two sliding windows. When the data of the second sliding window is read, only the data not previously loaded (positions 3, 8, 13) is read from the global memory, while the previously loaded data (positions 1, 2, 6, 7, 11, 12) is moved within the local memory by LMD instructions, and the data at positions 3, 8, 13 and at positions 1, 2, 6, 7, 11, 12 is recombined according to the order in which the sliding window reads data, so as to obtain the second sliding window data (1, 2, 3, 6, 7, 8, 11, 12, 13); the data in the global memory is loaded into the local memory according to this reading mode. It should be appreciated that the size of the sliding window in this example is 3×3, and a person skilled in the art may adjust the size of the sliding window, for example to 2×2, 4×4 or 3×1, to obtain other embodiments.
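A small Python sketch of this reading strategy is given below; the indices follow the FIG. 5 example of a 3×3 window on a row-major feature map of width 5, and the function names are illustrative.

    def window_indices(top, left, k, width):
        """Flat indices covered by a k x k sliding window with top-left corner (top, left)."""
        return [(top + r) * width + (left + c) for r in range(k) for c in range(k)]

    def load_plan(prev_window, curr_window):
        """Split the next window into data to LOAD from global memory (difference set)
        and data to move inside local memory with LMD (intersection set),
        preserving the read order of the window."""
        prev = set(prev_window)
        to_load = [i for i in curr_window if i not in prev]   # difference set
        to_move = [i for i in curr_window if i in prev]       # intersection set
        return to_load, to_move

    first  = window_indices(0, 0, 3, 5)    # [0, 1, 2, 5, 6, 7, 10, 11, 12]
    second = window_indices(0, 1, 3, 5)    # [1, 2, 3, 6, 7, 8, 11, 12, 13]
    print(load_plan(first, second))        # ([3, 8, 13], [1, 2, 6, 7, 11, 12])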
2. The plurality of in-memory computing arrays in the in-memory computing matrix unit perform matrix-vector multiplication according to the input data, corresponding to the computing operations, that was accessed by the local memory unit.
According to one embodiment of the invention, for any two matrix-vector multiplication operations in the instruction stream of each core, the in-memory computing unit in each core is configured as follows: when two matrix-vector multiplication operations have a structural conflict, the earlier matrix-vector multiplication operation is executed first and the later one is executed afterwards, a structural conflict meaning that the two matrix-vector multiplication operations target the same array, so that the later operation starts only after the earlier one has finished; when a data dependency exists between two matrix-vector multiplication operations, the earlier matrix-vector multiplication operation is executed first and the later one is executed according to the result of the earlier one, a data dependency meaning that the output of the earlier matrix-vector multiplication operation is the input of the later one, so that the later operation starts only after the earlier one has finished; and when multiple matrix-vector multiplication operations have neither structural conflicts nor data dependencies, they are executed in parallel, but they are started in the order of the instruction stream, and the start times of two adjacent matrix-vector multiplication operations are constrained by the on-chip memory bandwidth.
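As an illustrative sketch (the operand encoding is an assumption), the scheduling rule can be expressed as a simple predicate deciding whether two MVM operations may be issued concurrently:

    def can_issue_in_parallel(op_a, op_b):
        """op = (array_idx, src, dst). A structural conflict means both operations
        target the same in-memory computing array; a data dependency means one
        operation reads the address the other writes. Adjacent issues are further
        limited by the on-chip memory bandwidth, which is not modelled here."""
        structural_conflict = op_a[0] == op_b[0]
        data_dependency = op_a[2] == op_b[1] or op_b[2] == op_a[1]
        return not (structural_conflict or data_dependency)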
According to one embodiment of the invention, each sliding window is expanded into a vector of size k_w × k_h × C_in, so that the convolution operation of the convolutional neural network can be converted into matrix-vector multiplication operations.
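A short im2col-style sketch of this expansion (stride 1 and no padding assumed; the unrolling order must match the order in which the kernels were reorganized into columns) is:

    import numpy as np

    def unroll_windows(x, k_w, k_h):
        """Unroll every k_h x k_w sliding window of input x with shape (C_in, H, W)
        into a vector of length k_w*k_h*C_in, so that convolution becomes
        matrix-vector multiplications with the reorganized weight matrix."""
        c_in, h, w = x.shape
        rows = []
        for i in range(h - k_h + 1):
            for j in range(w - k_w + 1):
                rows.append(x[:, i:i + k_h, j:j + k_w].reshape(-1))
        return np.stack(rows)                 # shape: (num_windows, k_w*k_h*C_in)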
3. The vector function unit executes the computing operations and post-processes the results of the matrix-vector multiplication performed in the in-memory computing matrix unit.
According to one embodiment of the invention, the vector function unit is configured to perform post-processing on the results of the matrix-vector multiplication, the post-processing including activation processing, pooling processing and/or element-by-element processing, wherein the post-processing is distributed among all the cores and completed jointly. More specifically, element-by-element processing (e.g., addition, subtraction, multiplication and/or division) is performed on the results of the matrix-vector multiplication, and activation processing is performed on the result of the element-by-element processing to obtain the reasoning result of the convolutional neural network. Illustratively, taking the addition of two vectors as an example, the results of the matrix-vector multiplication are accumulated and the accumulated result is activated to obtain the reasoning result of the convolutional neural network.
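A minimal sketch of this post-processing path, mirroring the VEC addition (op=0) followed by ReLU activation (op=5), might look as follows:

    import numpy as np

    def post_process(partial_a, partial_b):
        """Element-by-element accumulation of two partial MVM results followed by ReLU."""
        accumulated = partial_a + partial_b      # element-by-element addition
        return np.maximum(accumulated, 0.0)      # ReLU activation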
According to one embodiment of the invention, when the plurality of array groups of each processing layer is mapped to the cores for storage, the array groups of several processing layers may be mapped onto one core, and the array groups of one processing layer may be mapped onto several cores. The calculation results obtained by the plurality of array groups of the same processing layer need to be accumulated to obtain a complete convolution result; if the array groups of the same layer are mapped onto different cores, the data must be accumulated across cores, in which case the calculation results belonging to the same weight block in other cores are sent, via the on-chip routing, to the core holding the first array group of that weight block for accumulation. To reduce the synchronization overhead of inter-core transmission, each array group may compute several rounds before an inter-core transmission. Because it is difficult for the on-chip local memory to store the complete data of each layer, portions of the input/output data need to be transferred periodically between the global memory and the local memory.
4. The reasoning result of the convolutional neural network is finally obtained through the cooperation of the in-memory computing matrix unit and the vector function unit.
According to one embodiment of the invention, the reasoning result of the convolutional neural network is stored into the global memory by the local memory unit. In addition, besides the memory access operations, the local memory unit can be used to execute operations such as padding, concatenation and splitting.
Because the on-chip memory capacity is limited and accessing the global memory is a very expensive operation, it is important to use the on-chip local memory efficiently; the invention therefore provides a memory allocation policy for allocating memory resources so as to use the memory efficiently. According to an embodiment of the present invention, as shown in FIG. 6, the calculation results and the accumulation results of array groups 0, 1 and 2 need to be saved; the existing memory allocation policy is compared below with the memory allocation policies given by the present invention. FIG. 6(a) shows the existing memory allocation policy, which allocates a new memory block to every operation to save its result. Schematically: first, a memory block is allocated to the result of the calculation of array group 0, and result A is stored in the 1st memory block; a memory block is allocated to the result of the calculation of array group 1, and result B is stored in the 2nd memory block; the results of array group 0 and array group 1 are accumulated, and result C is stored in the 3rd memory block; a memory block is allocated to the result of the calculation of array group 2, and result D is stored in the 4th memory block; finally, result D of array group 2 and result C are accumulated, and the final result E is stored in the 5th memory block. Under this existing policy, several memory blocks are never accessed again after being used once, which wastes resources. On this basis, the invention provides a memory allocation policy of accumulation multiplexing, as shown in FIG. 6(b): accumulation multiplexing saves memory by reusing the memory block of an accumulation operation to save the result of a new operation. Schematically: first, a memory block is allocated to the result of the calculation of array group 0, and result A is stored in the 1st memory block; a memory block is allocated to the result of the calculation of array group 1, and result B is stored in the 2nd memory block; the results of array group 0 and array group 1 are accumulated, and result C is stored in the 3rd memory block; a memory block is allocated to the result of the calculation of array group 2, and result D is stored in the 4th memory block; when result D of array group 2 is finally accumulated with result C, no new memory block is allocated, and the accumulated final result E is instead saved into the 3rd memory block that was used to save the accumulation result C. The memory allocation policy of accumulation multiplexing reduces the memory blocks used to save intermediate results, but allocating a memory block to every array group still wastes resources; therefore, on the basis of accumulation multiplexing, a memory allocation policy of array-group multiplexing is provided, as shown in FIG. 6(c): array-group multiplexing further reuses the memory blocks of the matrix-vector multiplication operations to save memory. Schematically: first, a memory block is allocated to the result of the calculation of array group 0, and result A is stored in the 1st memory block; a memory block is allocated to the result of the calculation of array group 1, and result B is stored in the 2nd memory block; the result C of accumulating the calculation results of array group 0 and array group 1 is then stored in the 1st memory block, because the contents stored in the 1st and 2nd memory blocks (results A and B, respectively) are no longer used afterwards; the result D of the calculation of array group 2 is stored in the 2nd memory block; finally, the result E of accumulating the calculation result of array group 2 with the result C in the 1st memory block is stored in the 1st memory block again. The technical scheme of this embodiment can at least achieve the following beneficial technical effects: by reusing memory blocks to save the results of new operations, the use of the memory is planned, memory reuse is increased, the use of the on-chip memory is reduced, the chip area and the system power consumption are reduced, and the memory allocation policy of array-group multiplexing avoids wasting memory.
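An illustrative trace of the array-group multiplexing policy of FIG. 6(c) is sketched below; the trace and the helper name are assumptions, and it only records which memory block each result is written to.

    def array_group_multiplexing_trace():
        """Buffer assignment for E = (A + B) + D with memory-block reuse:
        a block is reused as soon as its previous content is no longer needed."""
        trace = [
            ("A = mvm(array group 0)", "block 1"),   # A -> 1st memory block
            ("B = mvm(array group 1)", "block 2"),   # B -> 2nd memory block
            ("C = A + B",              "block 1"),   # A and B become dead: reuse block 1
            ("D = mvm(array group 2)", "block 2"),   # B is dead: reuse block 2
            ("E = D + C",              "block 1"),   # C is consumed: reuse block 1
        ]
        for operation, block in trace:
            print(f"{operation:24s} -> {block}")

    array_group_multiplexing_trace()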
Correspondingly, the invention also provides a method for accelerating the reasoning of the convolutional neural network by using the in-memory accelerator, which is provided by the first aspect of the invention, and comprises the following steps: acquiring an instruction stream; loading data of a convolutional neural network in a global memory according to the instruction stream to obtain corresponding input data; performing matrix-vector multiplication calculation according to the loaded input data to obtain a matrix-vector multiplication calculation result; and carrying out post-processing on the result of the matrix-vector multiplication calculation to obtain the reasoning result of the convolutional neural network.
For a more complete understanding of the flow of the technical solution of the present invention, the related algorithm and the corresponding flow are described in detail below with reference to fig. 7; a hedged code-level sketch of the mapping procedure is also given after the step list. As shown in fig. 7, the process of performing accelerated computation with the in-memory accelerator for accelerating convolutional neural network reasoning provided by the present invention includes:
Step S1, encoding the plurality of array groups of each processing layer as integers;
Step S2, randomly selecting a weight replication multiple for each processing layer, and randomly selecting the cores to which the array groups are to be bound;
Step S3, judging whether the selected replication multiples and mapping cores meet a preset requirement, wherein the preset requirement is that the number of iterations of the genetic algorithm reaches an upper limit or the hardware resource utilization rate meets the requirement; if not satisfied, executing step S4; if satisfied, executing step S7;
Step S4, evaluating, by means of the fitness function and according to the selected replication multiples and mapping cores, the time required by the in-memory computing units to execute one matrix-vector multiplication operation;
Step S5, selecting the several processing layers with the shortest time according to the evaluated time;
Step S6, randomly selecting one processing layer from the selected processing layers, and selecting a weight replication multiple and a mapping core for that processing layer by means of gene mutation in the genetic algorithm; then returning to step S3 to judge whether the selected weight replication multiples and mapping cores meet the preset requirement; if not satisfied, executing step S4; if satisfied, executing step S7;
Step S7, replicating the weights according to the replication multiples, and mapping the array groups of each processing layer to the corresponding cores according to the selected mapping cores;
Step S8, judging, according to the array groups mapped to each core, whether the core still has an array group that has not completed its calculation task; if so, executing step S9, and if not, executing step S12;
Step S9, generating a corresponding instruction stream according to the uncompleted calculation task;
Step S10, reading the input data corresponding to the calculation task from the global memory according to the instruction stream;
Step S11, performing one matrix-vector multiplication operation according to the input data corresponding to the calculation task, then returning to step S8 to judge whether the core still has an array group that has not completed its calculation task; if so, executing step S9, and if not, executing step S12;
Step S12, post-processing the calculation results of the matrix-vector multiplication operations;
Step S13, storing the post-processing results into the global memory.
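The mapping procedure of steps S1-S7 can be summarized by the Python sketch below. The cost model used as the fitness function, the candidate replication multiples, the number of layers kept in step S5 and the utilization threshold are simplified assumptions made only for illustration; the patent does not fix these values.

```python
import random

NUM_CORES = 4          # assumed number of cores
MAX_ITERS = 200        # S3: upper limit on genetic-algorithm iterations (assumed)
UTIL_TARGET = 0.9      # S3: required utilization (assumed)

def evaluate(layers, solution):
    """S4: fitness = estimated time of one matrix-vector multiplication per layer,
    under an assumed cost model where replication speeds a layer up and a heavily
    loaded core slows down every layer bound to it."""
    core_load = [0] * NUM_CORES
    for (_, core), layer in zip(solution, layers):
        core_load[core] += layer["groups"]
    times = [layer["work"] / mult * (1.0 + 0.1 * core_load[core])
             for layer, (mult, core) in zip(layers, solution)]
    util = sum(core_load) / (NUM_CORES * max(core_load)) if max(core_load) else 0.0
    return times, util

def select_mapping(layers):
    # S1/S2: the array groups of each layer are encoded as integers; start from a
    # random replication multiple and a random core binding per layer.
    solution = [(random.choice([1, 2, 4]), random.randrange(NUM_CORES)) for _ in layers]
    for _ in range(MAX_ITERS):                                   # S3: stopping criteria
        times, util = evaluate(layers, solution)                 # S4
        if util >= UTIL_TARGET:
            break
        shortest = sorted(range(len(layers)), key=lambda i: times[i])[:3]   # S5
        i = random.choice(shortest)                              # S6: mutate one layer
        solution[i] = (random.choice([1, 2, 4]), random.randrange(NUM_CORES))
    return solution            # S7: replicate weights and map array groups accordingly

layers = [{"work": w, "groups": g} for w, g in [(8.0, 2), (16.0, 4), (4.0, 1)]]
print(select_mapping(layers))  # e.g. [(4, 1), (4, 2), (2, 0)]
```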
In summary, the in-memory accelerator for accelerating the reasoning of the convolutional neural network provided by the invention has the following advantages:
1) By using the preset genetic algorithm, the replication multiple of each processing layer and the optimal task mapping cores can be selected automatically according to the amount of hardware resources, so that hardware resources are fully utilized, computing tasks are evenly distributed among the cores and computing parallelism is improved without being affected by uneven task allocation or structural conflicts, which increases the throughput of the accelerator and accelerates the reasoning speed of the neural network.
2) By using the data reading mode of multiple sliding windows, repeated reading of data can be avoided and waste of memory bandwidth is thereby avoided; at the same time, the number of instructions and the complexity of the control signals can be reduced, the bandwidth and on-chip cache pressure are relieved, and the utilization rate of the memory is improved (a code sketch of this sliding-window reuse is given after this list).
3) By using the memory allocation policy of array group multiplexing, calculation results can be saved by multiplexing memory blocks, so that memory resources are allocated rationally, the use of memory is optimized, chip area and system power consumption are reduced, and memory waste is avoided.
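The sliding-window reuse mentioned in point 2) works as sketched below: when the window slides, only the difference set of pixels is fetched from global memory, while the intersection with the previous window is taken from local memory, and the two parts are recombined in the order in which the window reads data. The 3x3 window, the 1-pixel stride and the dictionary used as the local memory are assumptions made only for this example; eviction of pixels that are no longer needed is omitted for brevity.

```python
import numpy as np

def window_coords(top, left, k):
    """Pixel coordinates covered by a k x k window, in window-reading (row-major) order."""
    return [(top + r, left + c) for r in range(k) for c in range(k)]

def load_window(feature_map, top, left, k, local_mem, global_reads):
    """Load one sliding window: fetch only the difference set from global memory,
    reuse the intersection already held in local memory, then recombine."""
    needed = window_coords(top, left, k)
    for rc in needed:
        if rc not in local_mem:              # difference set -> expensive global read
            local_mem[rc] = feature_map[rc]
            global_reads.append(rc)
    return np.array([local_mem[rc] for rc in needed])   # recombined window data

fmap = np.arange(36, dtype=float).reshape(6, 6)          # toy input feature map
local, reads = {}, []
load_window(fmap, 0, 0, 3, local, reads)   # first window: all 9 pixels come from global memory
load_window(fmap, 0, 1, 3, local, reads)   # next window: only the 3 new pixels are fetched
print(len(reads))                          # prints 12 instead of 18 without reuse
```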
It should be noted that, although the steps are described above in a specific order, it is not meant to necessarily be performed in the specific order, and in fact, some of the steps may be performed concurrently or even in a changed order, as long as the required functions are achieved.
The present invention may be a system, method, and/or computer program product. The computer program product may include a computer readable storage medium having computer readable program instructions embodied thereon for causing a processor to implement aspects of the present invention.
The computer readable storage medium may be a tangible device that retains and stores instructions for use by an instruction execution device. The computer readable storage medium may include, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, a mechanical encoding device such as a punch card or a raised structure in a groove having instructions stored thereon, and any suitable combination of the foregoing.
The foregoing description of embodiments of the invention has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the embodiments described. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or the technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (10)

1. An in-memory accelerator for accelerating convolutional neural network reasoning, the in-memory accelerator comprising a global memory for storing data of the convolutional neural network, an on-chip route for transmitting data of one core to other cores or receiving data transmitted by other cores, and a plurality of cores connected to the global memory, each core comprising a control unit, a vector function unit, an in-memory computing matrix unit and a local memory unit, wherein:
the control unit is configured to obtain an instruction stream and control the units to execute corresponding operations based on the instruction stream, wherein the instruction stream includes: a computing operation and a memory access operation;
the local memory unit is used for executing the memory access operation, sequentially accessing the input data of the convolutional neural network in the global memory according to the size of the sliding window to obtain the input data corresponding to the computing operation, and sending the corresponding input data to the in-memory computing matrix unit, wherein, when the input data of the convolutional neural network is accessed according to the sliding window, the data newly added to the sliding window to be accessed relative to the sliding window data already present in the local memory is obtained from the global memory in units of pixels, and the sliding window data to be accessed is obtained by recombining it with the sliding window data already present in the local memory;
the in-memory computing matrix unit comprises a plurality of in-memory computing arrays, wherein each in-memory computing array is used for executing the computing operation and performing matrix-vector multiplication computation according to input data corresponding to the computing operation;
the vector function unit is used for executing the calculation operation and carrying out post-processing according to the result of the matrix-vector multiplication calculation;
and the reasoning result of the convolutional neural network is finally obtained through the cooperation of the in-memory computing matrix unit and the vector function unit.
2. The in-memory accelerator according to claim 1, wherein,
When the input data of the convolutional neural network is accessed according to the sliding window, generating a difference set and an intersection set of the data of the existing sliding window and the data of the sliding window to be accessed, and recombining the data of the difference set accessed from the global memory and the data of the intersection set acquired from the local memory unit to obtain the data of the sliding window to be accessed.
3. The in-memory accelerator according to claim 2, wherein,
the data of the difference set accessed from the global memory and the data of the intersection set acquired from the local memory unit are recombined according to the order in which the sliding window reads data, to obtain the sliding window data to be accessed.
4. The in-memory accelerator of claim 3, wherein the local memory unit is configured to: dividing sliding window data accessed from the global memory according to the computing operation to obtain input data corresponding to the computing operation.
5. The in-memory accelerator of claim 3, wherein sequentially accessing input data of the convolutional neural network in the global memory according to a size of a sliding window to obtain input data corresponding to the computing operation comprises:
sequentially acquiring input data of a convolutional neural network in a global memory according to the size of a sliding window to obtain data contained in a first sliding window in sequence, obtaining first sliding window data, and loading the first sliding window data into the local memory unit;
Moving the first sliding window by a unit length, and judging whether intersection exists between data in the current sliding window and the data of the first sliding window; if no intersection exists, sequentially acquiring data contained in the current sliding window, obtaining second sliding window data, and loading the second sliding window data to the local memory unit;
if an intersection exists, determining a difference set between data in a current sliding window and the first sliding window data, recombining the data accessing the difference set from a global memory and the data of the intersection acquired from a local memory unit to obtain second sliding window data, and loading the second sliding window data to the local memory unit;
and acquiring the input data of the convolutional neural network based on the acquisition mode of the second sliding window data to obtain the input data corresponding to the calculation operation.
6. The in-memory accelerator of claim 1, wherein the instruction stream is derived from:
acquiring a convolutional neural network, wherein the convolutional neural network comprises a plurality of processing layers;
dividing the weight matrix of each processing layer according to the size of the in-memory computing array to obtain a plurality of array groups corresponding to each processing layer;
iteratively selecting weight replication multiples of each processing layer according to a preset genetic algorithm, and mapping a plurality of array groups corresponding to each processing layer to corresponding cores;
And judging whether each core has an array group with incomplete computing tasks according to the array groups mapped by the cores, and if so, generating a corresponding instruction stream according to the incomplete computing tasks.
7. The in-memory accelerator of claim 1, wherein the in-memory computing matrix unit is configured to:
in the case that two matrix-vector multiplication operations have a structural conflict, the two operations are executed in sequence: the former matrix-vector multiplication operation is performed first, and then the latter matrix-vector multiplication operation is performed;
in the case that a data dependency exists between two matrix-vector multiplication operations, a preceding matrix-vector multiplication operation is performed first, and a subsequent matrix-vector multiplication operation is performed according to the result of the preceding matrix-vector multiplication operation;
in the case where there are no structural conflicts and data dependencies for multiple matrix-vector multiplication operations, the multiple matrix-vector multiplication operations are performed in parallel.
8. The in-memory accelerator of claim 1, wherein the vector functional unit is configured to: performing post-processing according to the result of the matrix-vector multiplication calculation, wherein the post-processing comprises: an activation process, a pooling process, and/or an element-by-element process.
9. The in-memory accelerator of claim 8, wherein the element-by-element processing is an accumulation calculation based on the result of the matrix-vector multiplication calculation, and the accumulated result is activated to obtain the reasoning result of the convolutional neural network.
10. A method of accelerating convolutional neural network reasoning using an in-memory accelerator as recited in any of claims 1-9, the method comprising:
acquiring an instruction stream;
loading data of a convolutional neural network in a global memory according to the instruction stream to obtain corresponding input data;
performing matrix-vector multiplication calculation according to the loaded input data to obtain a matrix-vector multiplication calculation result;
and carrying out post-processing on the result of the matrix-vector multiplication calculation to obtain the reasoning result of the convolutional neural network.