WO2019076095A1 - Processing method and apparatus - Google Patents

Processing method and apparatus

Info

Publication number
WO2019076095A1
Authority
WO
WIPO (PCT)
Prior art keywords
unit
neuron
weight
voltage
instruction
Prior art date
Application number
PCT/CN2018/095548
Other languages
French (fr)
Chinese (zh)
Inventor
刘少礼
周徐达
杜子东
刘道福
张磊
陈天石
胡帅
韦洁
孟小甫
Original Assignee
上海寒武纪信息科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN201710989575.4A external-priority patent/CN109697135B/en
Priority claimed from CN201711061069.5A external-priority patent/CN109697509B/en
Priority claimed from CN201711029543.6A external-priority patent/CN109725700A/en
Priority claimed from CN201711289667.8A external-priority patent/CN109903350B/en
Priority to KR1020197037574A priority Critical patent/KR102434729B1/en
Priority to EP19215860.8A priority patent/EP3660706B1/en
Application filed by 上海寒武纪信息科技有限公司 filed Critical 上海寒武纪信息科技有限公司
Priority to EP18868807.1A priority patent/EP3627397B1/en
Priority to KR1020197023878A priority patent/KR102434726B1/en
Priority to US16/482,710 priority patent/US11593658B2/en
Priority to KR1020197037566A priority patent/KR102434728B1/en
Priority to EP19215859.0A priority patent/EP3660628B1/en
Priority to EP19215858.2A priority patent/EP3667569A1/en
Publication of WO2019076095A1 publication Critical patent/WO2019076095A1/en
Priority to US16/528,948 priority patent/US10747292B2/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F1/00Details not covered by groups G06F3/00 - G06F13/00 and G06F21/00
    • G06F1/26Power supply means, e.g. regulation thereof
    • G06F1/32Means for saving power
    • G06F1/3203Power management, i.e. event-based initiation of a power-saving mode
    • G06F1/3206Monitoring of events, devices or parameters that trigger a change in power modality
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F1/00Details not covered by groups G06F3/00 - G06F13/00 and G06F21/00
    • G06F1/26Power supply means, e.g. regulation thereof
    • G06F1/32Means for saving power
    • G06F1/3203Power management, i.e. event-based initiation of a power-saving mode
    • G06F1/3234Power saving characterised by the action undertaken
    • G06F1/324Power saving characterised by the action undertaken by lowering clock frequency
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F1/00Details not covered by groups G06F3/00 - G06F13/00 and G06F21/00
    • G06F1/26Power supply means, e.g. regulation thereof
    • G06F1/32Means for saving power
    • G06F1/3203Power management, i.e. event-based initiation of a power-saving mode
    • G06F1/3234Power saving characterised by the action undertaken
    • G06F1/3287Power saving characterised by the action undertaken by switching off individual functional units in the computer system
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F1/00Details not covered by groups G06F3/00 - G06F13/00 and G06F21/00
    • G06F1/26Power supply means, e.g. regulation thereof
    • G06F1/32Means for saving power
    • G06F1/3203Power management, i.e. event-based initiation of a power-saving mode
    • G06F1/3234Power saving characterised by the action undertaken
    • G06F1/3296Power saving characterised by the action undertaken by lowering the supply or operating voltage
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/15Correlation function computation including computation of convolution operations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present application relates to the field of data processing, and in particular, to a processing method and apparatus, an operation method, and an apparatus.
  • Neural networks have been applied with great success.
  • However, the large-scale parameters and large-scale computation of neural networks have become a huge challenge for neural network applications.
  • Large-scale parameters place high demands on storage capacity and also lead to a large amount of memory-access energy consumption.
  • Large-scale computation places high demands on the design of the arithmetic unit and also leads to a large amount of computational energy consumption. Therefore, how to reduce the parameters and the computation of neural networks has become an urgent problem to be solved.
  • The purpose of the present application is to provide a processing method and apparatus, and an operation method and apparatus, to solve at least one of the above technical problems.
  • a processing method including:
  • the weights and the input neurons are quantized separately to determine a weight dictionary, a weight codebook, a neuron dictionary, and a neuron codebook;
  • the operation codebook is determined based on the weight codebook and the neuron codebook.
  • quantizing the weights includes the steps of:
  • grouping and clustering the weights to determine the weight dictionary, wherein the weight dictionary includes weight positions and weight indexes, the weight position indicating the position of a weight in the neural network structure;
  • the weight codebook is determined by replacing all the weights of each class with the central weight of that class, wherein the weight codebook includes the weight indexes and the central weights.
  • the step of quantizing the input neurons comprises the steps of:
  • the input neurons are divided into segments and encoded, and all input neurons of each segment are replaced with the central neuron of that segment to determine the neuron codebook.
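
The neuron quantization step can be illustrated with a short Python sketch (illustrative only, not the patent's implementation; uniform segmentation and the segment midpoint as the central neuron are assumptions):

```python
import numpy as np

def quantize_neurons(neurons, num_segments=4):
    """Segment the input-neuron value range and replace every neuron in a
    segment with that segment's central neuron (uniform segments assumed)."""
    edges = np.linspace(neurons.min(), neurons.max(), num_segments + 1)
    centers = (edges[:-1] + edges[1:]) / 2  # neuron codebook: index -> central neuron
    idx = np.clip(np.digitize(neurons, edges) - 1, 0, num_segments - 1)
    return idx, centers                      # idx plays the role of the neuron indexes

neurons = np.array([0.10, 0.35, 0.40, 0.70, 0.90])
idx, neuron_codebook = quantize_neurons(neurons)
decoded = neuron_codebook[idx]               # central neurons used in computation
```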
  • the determining the operation codebook includes the following steps:
  • the central weights and the central neurons are operated on to obtain operation results, and the operation results are arranged into a matrix to determine the operation codebook; a sketch follows below.
  • the operations include at least one of the following: addition, multiplication, and pooling, wherein the pooling includes average pooling, maximum pooling, and median pooling.
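
The sketch below illustrates how such an operation codebook could be precomputed, assuming multiplication as the operation; the function name and the index layout are illustrative, not taken from the patent:

```python
import numpy as np

def build_operation_codebook(weight_codebook, neuron_codebook, op=np.multiply):
    """Precompute op(central weight, central neuron) for every index pair;
    row = weight index, column = neuron index."""
    return op(weight_codebook[:, None], neuron_codebook[None, :])

# At run time the multiply step reduces to a table read:
#   result = operation_codebook[weight_index, neuron_index]
```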
  • the method further includes the steps of: retraining the weights and the input neurons, training only the weight codebook and the neuron codebook during the retraining while the contents of the weight dictionary and the neuron dictionary remain unchanged; the retraining uses the back-propagation algorithm.
  • grouping the weights includes:
  • layer-type grouping: the weights of all convolution layers in the neural network, the weights of all fully connected layers, and the weights of all long short-term memory (LSTM) network layers are each divided into a group;
  • inter-layer grouping: the weights of one or more convolution layers, the weights of one or more fully connected layers, and the weights of one or more LSTM network layers in the neural network are each divided into a group;
  • intra-layer grouping: the weights within one layer of the neural network are segmented, and each part after segmentation is divided into a group.
  • the clustering algorithm comprises K-means, K-medoids, Clara and/or Clarans.
  • the method for selecting the central weight of each class comprises: determining the value w0 at which the cost function J(w, w0) takes its minimum; that value w0 is the central weight;
  • the cost function is J(w, w0) = Σ_{i=1}^{n} (w_i − w0)², where:
  • w is the set of all weights in the class;
  • w0 is the central weight;
  • n is the number of weights in the class;
  • w_i is the i-th weight in the class, 1 ≤ i ≤ n, and i is a positive integer.
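
As a worked note on the formula above: for the squared-error cost, the minimizing w0 of a class is its mean. The following hedged sketch groups weights with a toy 1-D K-means and picks central weights accordingly (the clustering details and all names are assumptions, not the patent's algorithm):

```python
import numpy as np

def central_weight(class_weights):
    """The w0 minimizing J(w, w0) = sum_i (w_i - w0)^2 is the class mean."""
    return class_weights.mean()

def cluster_weights(weights, k=4, iters=20):
    """Toy 1-D K-means: cluster weights into k classes and return
    (weight indexes, weight codebook of central weights)."""
    centers = np.linspace(weights.min(), weights.max(), k)
    for _ in range(iters):
        idx = np.abs(weights[:, None] - centers[None, :]).argmin(axis=1)
        for j in range(k):
            if np.any(idx == j):          # leave empty classes unchanged
                centers[j] = central_weight(weights[idx == j])
    return idx, centers
```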
  • a processing apparatus comprising:
  • a memory for storing an operation instruction
  • the processor is configured to execute an operation instruction in the memory, and operate according to the foregoing processing method when the operation instruction is executed.
  • the operation instruction is a binary number, including an operation code and an address code
  • the operation code indicates an operation to be performed by the processor
  • the address code indicates the address in the memory from which the processor reads the data participating in the operation.
  • an arithmetic device including:
  • An instruction control unit configured to decode the received instruction to generate search control information
  • the lookup table unit is configured to look up the output neurons from the operation codebook according to the search control information and the received weight dictionary, neuron dictionary, operation codebook, weights, and input neurons.
  • the weight dictionary includes weight positions and weight indexes;
  • the neuron dictionary includes input neurons and neuron indexes;
  • the operation codebook includes the weight indexes, the neuron indexes, and the results of the operations on the input neurons and weights.
  • the computing device further includes:
  • a pre-processing unit configured to pre-process the externally input information to obtain the weights, the input neurons, the instructions, the weight dictionary, the neuron dictionary, and the operation codebook;
  • a storage unit for storing the input neurons, weights, weight dictionary, neuron dictionary, operation codebook, and instructions, and for receiving the output neurons;
  • a cache unit for buffering the instruction, input neurons, weights, weight indexes, neuron indexes, and output neurons;
  • a direct memory access unit for reading and writing data or instructions between the storage unit and the cache unit.
  • the cache unit includes:
  • an instruction cache for caching the instructions and outputting the cached instructions to the instruction control unit;
  • an input neuron cache for caching the input neurons;
  • an output neuron cache for caching the output neurons output by the lookup table unit.
  • the cache unit further includes:
  • a weight index cache for caching the weight indexes;
  • a neuron index cache for caching the neuron indexes.
  • the pre-processing unit is specifically configured to perform segmentation, Gaussian filtering, binarization, regularization, and/or normalization when pre-processing the externally input information.
  • the lookup table unit includes:
  • the addition lookup table is used to perform, through the table-lookup operation add_lookup, the addition of the central data data corresponding to the input index vector in;
  • in and data are vectors of length N, where N is a positive integer; that is, out = data[in_1] + data[in_2] + … + data[in_N].
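
Under the reconstructed formula above, the table-lookup addition amounts to a gather followed by a sum; a minimal, illustrative Python sketch (the summing interpretation of add_lookup is itself an assumption):

```python
import numpy as np

def add_lookup(in_idx, data):
    """out = data[in_1] + data[in_2] + ... + data[in_N]:
    gather the central data selected by the index vector, then sum."""
    return data[np.asarray(in_idx)].sum()

data = np.array([0.5, 1.0, 2.0])    # central data (codebook entries)
print(add_lookup([0, 2, 2], data))  # 0.5 + 2.0 + 2.0 = 4.5
```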
  • the instruction is a neural network specific instruction
  • the neural network specific instruction includes:
  • a data transfer instruction for completing data transfer between different storage media, the data formats including a matrix, a vector, and a scalar;
  • an operation instruction for completing the arithmetic operations of the neural network, including the matrix operation instruction, vector operation instruction, scalar operation instruction, convolutional neural network operation instruction, fully connected neural network operation instruction, pooling neural network operation instruction, RBM neural network operation instruction, LRN neural network operation instruction, LCN neural network operation instruction, LSTM neural network operation instruction, RNN neural network operation instruction, RELU neural network operation instruction, PRELU neural network operation instruction, SIGMOID neural network operation instruction, TANH neural network operation instruction, and MAXOUT neural network operation instruction;
  • a logic instruction for completing the logical operations of the neural network, including vector logic operation instructions and scalar logic operation instructions.
  • the neural network dedicated instruction includes at least one Cambricon instruction
  • the Cambricon instruction includes an operation code and an operand
  • the Cambricon instruction includes:
  • Cambricon control instruction for controlling an execution process
  • the Cambricon control instruction includes a jump instruction and a conditional branch instruction
  • the Cambricon data transfer instruction is used to complete data transfer between different storage media, and includes a load instruction, a store instruction, and a move instruction; the load instruction is used to load data from main memory to a cache; the store instruction is used to store data from a cache to main memory; and the move instruction is used to transfer data between caches, between a cache and a register, or between registers;
  • the Cambricon operation instruction is used to complete neural network arithmetic operations, and includes Cambricon matrix operation instructions, Cambricon vector operation instructions, and Cambricon scalar operation instructions; the Cambricon matrix operation instruction is used to complete matrix operations in the neural network, including matrix-multiply-vector, vector-multiply-matrix, matrix-multiply-scalar, outer product, matrix-add-matrix, and matrix-subtract-matrix operations; the Cambricon vector operation instruction is used to complete vector operations in the neural network, including vector basic operations, vector transcendental function operations, inner product operations, vector random generation operations, and maximum/minimum-of-a-vector operations; the Cambricon scalar operation instruction is used to complete scalar operations in the neural network, including scalar basic operations and scalar transcendental function operations;
  • the Cambricon logic instruction is used for the logical operations of the neural network, and includes Cambricon vector logic operation instructions and Cambricon scalar logic operation instructions; the Cambricon vector logic operation instructions are used for vector comparison operations, vector logic operations, and vector greater-than-merge operations, wherein the vector logic operations include AND, OR, and NOT; the Cambricon scalar logic operation instruction is used for scalar comparison operations and scalar logic operations.
  • the Cambricon data transmission instruction supports one or more of the following data organization modes: a matrix, a vector, and a scalar;
  • the vector basic operations include vector addition, subtraction, multiplication, and division;
  • the vector transcendental function refers to a function that does not satisfy any polynomial equation having polynomial coefficients, including exponential functions, logarithmic functions, trigonometric functions, and inverse trigonometric functions;
  • the scalar basic operations include scalar addition, subtraction, multiplication, and division;
  • the scalar transcendental function refers to a function that does not satisfy any polynomial equation having polynomial coefficients, including exponential functions, logarithmic functions, trigonometric functions, and inverse trigonometric functions;
  • the vector comparison includes greater than, less than, equal to, greater than or equal to, less than or equal to, and not equal to;
  • according to the received weights, the weight dictionary, the neuron dictionary, and the input neurons, the output neurons are looked up in the operation codebook.
  • the weight dictionary includes weight positions and weight indexes;
  • the neuron dictionary includes input neurons and neuron indexes;
  • the operation codebook includes the weight indexes, the neuron indexes, and the results of the operations on the weights and input neurons.
  • searching for output neurons in the operation codebook according to the search control information, weights, and input neurons includes the following steps:
  • the operation result is looked up in the operation codebook according to the weight index and the neuron index to determine the output neuron; a sketch of the complete lookup path follows below.
  • the operation result includes the result of at least one of the following operations: addition, multiplication, and pooling, wherein the pooling includes average pooling, maximum pooling, and median pooling.
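
Taken together, the lookup-based inference step can be sketched as follows (recovering indexes by nearest codebook entry is an assumption for illustration; in the described device the indexes come from the dictionaries):

```python
import numpy as np

def lookup_result(weight, input_neuron,
                  weight_codebook, neuron_codebook, operation_codebook):
    """Replace the arithmetic operation with lookups: map the weight and the
    input neuron to their indexes, then read the precomputed result."""
    w_idx = np.abs(weight_codebook - weight).argmin()
    n_idx = np.abs(neuron_codebook - input_neuron).argmin()
    return operation_codebook[w_idx, n_idx]
```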
  • before receiving the weights, the input neurons, the instructions, the weight dictionary, the neuron dictionary, and the operation codebook, the method further includes the step of: pre-processing the externally input information to obtain the weights, input neurons, instructions, weight dictionary, neuron dictionary, and operation codebook;
  • the method further includes the steps of: storing the weights, the input neurons, the instructions, the weight dictionary, the neuron dictionary, and the operation codebook, and receiving the output neurons; and caching the instructions, input neurons, weights, and output neurons.
  • after receiving the weights, the input neurons, the instructions, the weight dictionary, the neuron dictionary, and the operation codebook, the method further includes the step of: caching the weight indexes and the neuron indexes.
  • the pre-processing includes segmentation, Gaussian filtering, binarization, regularization, and/or normalization.
  • the instruction is a neural network specific instruction
  • the neural network specific instruction includes:
  • a data transfer instruction for completing data transfer between different storage media, the data formats including a matrix, a vector, and a scalar;
  • an operation instruction for completing the arithmetic operations of the neural network, including the matrix operation instruction, vector operation instruction, scalar operation instruction, convolutional neural network operation instruction, fully connected neural network operation instruction, pooling neural network operation instruction, RBM neural network operation instruction, LRN neural network operation instruction, LCN neural network operation instruction, LSTM neural network operation instruction, RNN neural network operation instruction, RELU neural network operation instruction, PRELU neural network operation instruction, SIGMOID neural network operation instruction, TANH neural network operation instruction, and MAXOUT neural network operation instruction;
  • a logic instruction for completing the logical operations of the neural network, including vector logic operation instructions and scalar logic operation instructions.
  • the neural network dedicated instruction includes at least one Cambricon instruction
  • the Cambricon instruction includes an operation code and an operand
  • the Cambricon instruction includes:
  • Cambricon control instruction for controlling an execution process
  • the Cambricon control instruction includes a jump instruction and a conditional branch instruction
  • the Cambricon data transfer instruction is used to complete data transfer between different storage media, and includes a load instruction, a store instruction, and a move instruction; the load instruction is used to load data from main memory to a cache; the store instruction is used to store data from a cache to main memory; and the move instruction is used to transfer data between caches, between a cache and a register, or between registers;
  • the Cambricon operation instruction is used to complete neural network arithmetic operations, and includes Cambricon matrix operation instructions, Cambricon vector operation instructions, and Cambricon scalar operation instructions; the Cambricon matrix operation instruction is used to complete matrix operations in the neural network, including matrix-multiply-vector, vector-multiply-matrix, matrix-multiply-scalar, outer product, matrix-add-matrix, and matrix-subtract-matrix operations; the Cambricon vector operation instruction is used to complete vector operations in the neural network, including vector basic operations, vector transcendental function operations, inner product operations, vector random generation operations, and maximum/minimum-of-a-vector operations; the Cambricon scalar operation instruction is used to complete scalar operations in the neural network, including scalar basic operations and scalar transcendental function operations;
  • the Cambricon logic instruction is used for the logical operations of the neural network, and includes Cambricon vector logic operation instructions and Cambricon scalar logic operation instructions; the Cambricon vector logic operation instructions are used for vector comparison operations, vector logic operations, and vector greater-than-merge operations, wherein the vector logic operations include AND, OR, and NOT; the Cambricon scalar logic operation instruction is used for scalar comparison operations and scalar logic operations.
  • the Cambricon data transmission instruction supports one or more of the following data organization modes: a matrix, a vector, and a scalar;
  • the vector basic operations include vector addition, subtraction, multiplication, and division;
  • the vector transcendental function refers to a function that does not satisfy any polynomial equation having polynomial coefficients, including exponential functions, logarithmic functions, trigonometric functions, and inverse trigonometric functions;
  • the scalar basic operations include scalar addition, subtraction, multiplication, and division;
  • the scalar transcendental function refers to a function that does not satisfy any polynomial equation having polynomial coefficients, including exponential functions, logarithmic functions, trigonometric functions, and inverse trigonometric functions;
  • the vector comparison includes greater than, less than, equal to, greater than or equal to, less than or equal to, and not equal to;
  • an arithmetic device comprising:
  • An instruction control unit configured to decode the received instruction to generate search control information
  • the lookup table unit is configured to look up the output neurons from the operation codebook according to the search control information and the received weight dictionary, neuron dictionary, operation codebook, weights, and input neurons.
  • the weight dictionary includes weight positions and weight indexes;
  • the neuron dictionary includes input neurons and neuron indexes;
  • the operation codebook includes the weight indexes, the neuron indexes, and the results of the operations on the input neurons and weights.
  • the computing device further includes:
  • a pre-processing unit configured to pre-process the externally input information to obtain the weights, the input neurons, the instructions, the weight dictionary, the neuron dictionary, and the operation codebook;
  • a storage unit for storing the input neurons, weights, weight dictionary, neuron dictionary, operation codebook, and instructions, and for receiving the output neurons;
  • a cache unit for buffering the instructions, input neurons, weights, weight indexes, neuron indexes, and output neurons;
  • a direct memory access unit for reading or writing data or instructions between the storage unit and the cache unit.
  • the cache unit includes:
  • an instruction cache for caching the instructions and outputting the cached instructions to the instruction control unit;
  • an input neuron cache for caching the input neurons;
  • an output neuron cache for caching the output neurons output by the lookup table unit.
  • the cache unit further includes:
  • a weight index cache for caching the weight indexes;
  • a neuron index cache for caching the neuron indexes.
  • the pre-processing performed by the pre-processing unit on the externally input information includes: segmentation, Gaussian filtering, binarization, regularization, and/or normalization.
  • the lookup table unit includes:
  • an addition lookup table used to perform, through the table-lookup operation add_lookup, the addition of the central data data corresponding to the input index vector in;
  • in and data are vectors of length N, where N is a positive integer; that is, out = data[in_1] + data[in_2] + … + data[in_N].
  • the instruction is a neural network specific instruction
  • the neural network specific instruction includes:
  • a data transfer instruction for completing data transfer between different storage media, the data formats including a matrix, a vector, and a scalar;
  • an operation instruction for completing the arithmetic operations of the neural network, including the matrix operation instruction, vector operation instruction, scalar operation instruction, convolutional neural network operation instruction, fully connected neural network operation instruction, pooling neural network operation instruction, RBM neural network operation instruction, LRN neural network operation instruction, LCN neural network operation instruction, LSTM neural network operation instruction, RNN neural network operation instruction, RELU neural network operation instruction, PRELU neural network operation instruction, SIGMOID neural network operation instruction, TANH neural network operation instruction, and MAXOUT neural network operation instruction;
  • a logic instruction for completing the logical operations of the neural network, including vector logic operation instructions and scalar logic operation instructions.
  • the neural network dedicated instruction includes at least one Cambricon instruction, the Cambricon instruction including an operation code and an operand;
  • the Cambricon instruction includes:
  • Cambricon control instruction for controlling an execution process
  • the Cambricon control instruction includes a jump instruction and a conditional branch instruction
  • the Cambricon data transfer instruction is used to complete data transfer between different storage media, and includes a load instruction, a store instruction, and a move instruction; the load instruction is used to load data from main memory to a cache; the store instruction is used to store data from a cache to main memory; and the move instruction is used to transfer data between caches, between a cache and a register, or between registers;
  • the Cambricon operation instruction is used to complete neural network arithmetic operations, and includes Cambricon matrix operation instructions, Cambricon vector operation instructions, and Cambricon scalar operation instructions; the Cambricon matrix operation instruction is used to complete matrix operations in the neural network, including matrix-multiply-vector, vector-multiply-matrix, matrix-multiply-scalar, outer product, matrix-add-matrix, and matrix-subtract-matrix operations; the Cambricon vector operation instruction is used to complete vector operations in the neural network, including vector basic operations, vector transcendental function operations, inner product operations, vector random generation operations, and maximum/minimum-of-a-vector operations; the Cambricon scalar operation instruction is used to complete scalar operations in the neural network, including scalar basic operations and scalar transcendental function operations;
  • the Cambricon logic instruction is used for the logical operations of the neural network, and includes Cambricon vector logic operation instructions and Cambricon scalar logic operation instructions; the Cambricon vector logic operation instructions are used for vector comparison operations, vector logic operations, and vector greater-than-merge operations, wherein the vector logic operations include AND, OR, and NOT; the Cambricon scalar logic operation instruction is used for scalar comparison operations and scalar logic operations.
  • the Cambricon data transmission instruction supports one or more of the following data organization modes: a matrix, a vector, and a scalar;
  • the vector basic operations include vector addition, subtraction, multiplication, and division;
  • the vector transcendental function refers to a function that does not satisfy any polynomial equation having polynomial coefficients, including exponential functions, logarithmic functions, trigonometric functions, and inverse trigonometric functions;
  • the scalar basic operations include scalar addition, subtraction, multiplication, and division;
  • the scalar transcendental function refers to a function that does not satisfy any polynomial equation having polynomial coefficients, including exponential functions, logarithmic functions, trigonometric functions, and inverse trigonometric functions;
  • the vector comparison includes greater than, less than, equal to, greater than or equal to, less than or equal to, and not equal to;
  • according to the received weights, the weight dictionary, the neuron dictionary, and the input neurons, the output neurons are looked up in the operation codebook.
  • the weight dictionary includes weight positions and weight indexes;
  • the neuron dictionary includes input neurons and neuron indexes;
  • the operation codebook includes the weight indexes, the neuron indexes, and the results of the operations on the weights and input neurons.
  • searching for output neurons in the operation codebook according to the search control information, weights, and input neurons includes the following steps:
  • the operation result is looked up in the operation codebook according to the weight index and the neuron index to determine the output neuron.
  • the operation result includes the result of at least one of the following operations: addition, multiplication, and pooling, wherein the pooling includes average pooling, maximum pooling, and median pooling.
  • before receiving the weights, the input neurons, the instructions, the weight dictionary, the neuron dictionary, and the operation codebook, the method further includes the steps of: pre-processing the externally input information to obtain the weights, input neurons, instructions, weight dictionary, neuron dictionary, and operation codebook;
  • the method further includes the steps of: storing the weights, the input neurons, the instructions, the weight dictionary, the neuron dictionary, and the operation codebook, and receiving the output neurons; and caching the instructions, input neurons, weights, and output neurons.
  • after receiving the weights, the input neurons, the instructions, the weight dictionary, the neuron dictionary, and the operation codebook, the method further includes the step of: caching the weight indexes and the neuron indexes.
  • the pre-processing includes segmentation, Gaussian filtering, binarization, regularization, and/or normalization.
  • the instruction is a neural network specific instruction
  • the neural network specific instruction includes:
  • a data transfer instruction for completing data transfer between different storage media, the data formats including a matrix, a vector, and a scalar;
  • an operation instruction for completing the arithmetic operations of the neural network, including the matrix operation instruction, vector operation instruction, scalar operation instruction, convolutional neural network operation instruction, fully connected neural network operation instruction, pooling neural network operation instruction, RBM neural network operation instruction, LRN neural network operation instruction, LCN neural network operation instruction, LSTM neural network operation instruction, RNN neural network operation instruction, RELU neural network operation instruction, PRELU neural network operation instruction, SIGMOID neural network operation instruction, TANH neural network operation instruction, and MAXOUT neural network operation instruction;
  • a logic instruction for completing the logical operations of the neural network, including vector logic operation instructions and scalar logic operation instructions.
  • the neural network dedicated instruction includes at least one Cambricon instruction
  • the Cambricon instruction includes an operation code and an operand
  • the Cambricon instruction includes:
  • Cambricon control instruction for controlling an execution process
  • the Cambricon control instruction includes a jump instruction and a conditional branch instruction
  • the Cambricon data transfer instruction is used to complete data transfer between different storage media, and includes a load instruction, a store instruction, and a move instruction; the load instruction is used to load data from main memory to a cache; the store instruction is used to store data from a cache to main memory; and the move instruction is used to transfer data between caches, between a cache and a register, or between registers;
  • the Cambricon operation instruction is used to complete neural network arithmetic operations, and includes Cambricon matrix operation instructions, Cambricon vector operation instructions, and Cambricon scalar operation instructions; the Cambricon matrix operation instruction is used to complete matrix operations in the neural network, including matrix-multiply-vector, vector-multiply-matrix, matrix-multiply-scalar, outer product, matrix-add-matrix, and matrix-subtract-matrix operations; the Cambricon vector operation instruction is used to complete vector operations in the neural network, including vector basic operations, vector transcendental function operations, inner product operations, vector random generation operations, and maximum/minimum-of-a-vector operations; the Cambricon scalar operation instruction is used to complete scalar operations in the neural network, including scalar basic operations and scalar transcendental function operations;
  • the Cambricon logic instruction is used for the logical operations of the neural network, and includes Cambricon vector logic operation instructions and Cambricon scalar logic operation instructions; the Cambricon vector logic operation instructions are used for vector comparison operations, vector logic operations, and vector greater-than-merge operations, wherein the vector logic operations include AND, OR, and NOT; the Cambricon scalar logic operation instruction is used for scalar comparison operations and scalar logic operations.
  • the Cambricon data transmission instruction supports one or more of the following data organization modes: a matrix, a vector, and a scalar;
  • the vector basic operations include vector addition, subtraction, multiplication, and division;
  • the vector transcendental function refers to a function that does not satisfy any polynomial equation having polynomial coefficients, including exponential functions, logarithmic functions, trigonometric functions, and inverse trigonometric functions;
  • the scalar basic operations include scalar addition, subtraction, multiplication, and division;
  • the scalar transcendental function refers to a function that does not satisfy any polynomial equation having polynomial coefficients, including exponential functions, logarithmic functions, trigonometric functions, and inverse trigonometric functions;
  • the vector comparison includes greater than, less than, equal to, greater than or equal to, less than or equal to, and not equal to;
  • Neural networks have been applied with great success, but their large-scale parameters place high demands on storage. On the one hand, the large number of neural network parameters requires huge storage capacity. On the other hand, accessing large amounts of neural network data brings huge memory-access energy consumption.
  • ECC: Error Correcting Code.
  • a storage device including:
  • a precise storage unit for storing the important bits of data, and an imprecise storage unit for storing the non-significant bits of data.
  • the precise storage unit uses ECC memory;
  • the imprecise storage unit uses non-ECC memory;
  • the data are neural network parameters, including input neurons, weights, and output neurons;
  • the precise storage unit is configured to store the important bits of the input neurons, the important bits of the output neurons, and the important bits of the weights; the imprecise storage unit is configured to store the non-significant bits of the input neurons, the non-significant bits of the output neurons, and the non-significant bits of the weights.
  • the data include floating point data and fixed point data; for floating point data, the sign bit and the exponent part are the important bits, and the mantissa part constitutes the non-significant bits;
  • for fixed point data, the sign bit and the first x bits of the numerical part are the important bits, and the remaining bits of the numerical part are the non-significant bits, where x is an integer greater than or equal to 0 and less than m, and m is the total number of bits of the data.
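
A hedged sketch of splitting a value into important and non-significant bits (the patent defines the parameter x for the fixed-point numerical part; reusing a top-x cut on the float32 mantissa here is an illustrative assumption):

```python
import struct

def split_float32(value, x=4):
    """Keep sign, exponent, and the top x mantissa bits as 'important';
    the remaining mantissa bits form the 'non-significant' part."""
    bits = struct.unpack('<I', struct.pack('<f', value))[0]
    low = (1 << (23 - x)) - 1          # mask of the low (23 - x) mantissa bits
    return bits & ~low & 0xFFFFFFFF, bits & low

def splice_float32(important, rest):
    """Recombine both parts into the complete float32 (used when the
    operation unit consumes spliced complete values)."""
    return struct.unpack('<f', struct.pack('<I', important | rest))[0]

imp, rest = split_float32(3.14159)
print(splice_float32(imp, rest))       # exact value restored
print(splice_float32(imp, 0))          # approximation from important bits only
```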
  • the ECC memory includes an ECC-checked DRAM and an ECC-checked SRAM; the ECC-checked SRAM uses a 6T SRAM, or a 4T SRAM or a 3T SRAM.
  • the non-ECC memory includes a non-ECC check DRAM and a non-ECC check SRAM; the non-ECC check SRAM uses 6T SRAM, or 4T SRAM or 3T SRAM.
  • in the 6T SRAM, the storage cell storing each bit includes 6 MOS transistors; in the 4T SRAM, the storage cell storing each bit includes 4 MOS transistors; in the 3T SRAM, the storage cell storing each bit includes 3 MOS transistors.
  • the four MOS transistors include a first MOS transistor, a second MOS transistor, a third MOS transistor, and a fourth MOS transistor; the first and second MOS transistors are used for gating, and the third and fourth MOS transistors are used for storage; the gate of the first MOS transistor is electrically connected to the word line WL and its source to the bit line BL; the gate of the second MOS transistor is electrically connected to the word line WL and its source to the bit line BLB; the gate of the third MOS transistor is connected to the source of the fourth MOS transistor and the drain of the second MOS transistor and is connected to the working voltage through resistor R2, and the drain of the third MOS transistor is grounded; the gate of the fourth MOS transistor is connected to the source of the third MOS transistor and the drain of the first MOS transistor and is connected to the working voltage through resistor R1, and the drain of the fourth MOS transistor is grounded; the WL is used to control gated access of the storage cell, and the BL is used for reading and writing the storage cell.
  • the three MOS transistors include a first MOS transistor, a second MOS transistor, and a third MOS transistor; the first MOS transistor is used for gating, and the second and third MOS transistors are used for storage, wherein:
  • the first MOS transistor gate is electrically connected to the word line WL
  • the source is electrically connected to the bit line BL
  • the second MOS transistor gate is connected to the third MOS transistor source, and is connected to the working voltage through the resistor R2.
  • a second MOS transistor drain is grounded;
  • a third MOS transistor gate is connected to the second MOS transistor source and the first MOS transistor drain, and is connected to the working voltage through the resistor R1, and the third MOS transistor drain is grounded;
  • the WL is used to control gated access of the storage cell, and the BL is used for reading and writing the storage cell.
  • a data processing apparatus including:
  • an arithmetic unit, an instruction control unit, and the storage device described above; the storage device is configured to receive input instructions and operation parameters, store the important bits of the operation parameters and the instructions in the precise storage unit, and store the non-important bits of the operation parameters in the imprecise storage unit;
  • the instruction control unit is configured to receive the instructions from the storage device and decode them to generate control information;
  • the operation unit is configured to receive the operation parameters from the storage device, perform the operation according to the control information, and transfer the operation result to the storage device.
  • the computing unit is a neural network processor.
  • the operation parameter is a neural network parameter
  • the operation unit is configured to receive the input neurons and weights from the storage device, complete the neural network operation according to the control information to obtain the output neurons, and transmit the output neurons to the storage device.
  • the operation unit is configured to receive the important bits of the input neurons and the important bits of the weights for calculation; or the operation unit is configured to receive complete input neurons and weights spliced from the important bits and the non-significant bits for calculation.
  • the apparatus further includes: an instruction cache, disposed between the storage device and the instruction control unit, for storing dedicated instructions; an input neuron hierarchical cache, disposed between the storage device and the operation unit, for caching the input neurons,
  • wherein the input neuron hierarchical cache includes an input neuron exact cache and an input neuron inexact cache;
  • a weight hierarchical cache, disposed between the storage device and the operation unit, for caching weight data,
  • wherein the weight hierarchical cache includes a weight exact cache and a weight inexact cache;
  • and an output neuron hierarchical cache, disposed between the storage device and the operation unit, for caching the output neurons, wherein the output neuron hierarchical cache includes an output neuron exact cache and an output neuron inexact cache.
  • a direct memory access unit DMA is further included for reading and writing data or instructions among the storage device, the instruction cache, the weight hierarchical cache, the input neuron hierarchical cache, and the output neuron hierarchical cache.
  • the instruction cache, the input neuron hierarchical cache, the weight hierarchical cache, and the output neuron hierarchical cache use 4T SRAM or 3T SRAM.
  • a preprocessing module is further included for preprocessing the input data and transmitting it to the storage device; the preprocessing includes segmentation, Gaussian filtering, binarization, regularization, and normalization.
  • the operation unit is a general purpose operation processor.
  • an electronic device including the data processing device described above.
  • a storage method comprising: accurately storing the important bits in data; and inexactly storing the non-significant bits in data.
  • accurately storing the important bits in the data specifically includes: extracting the important bits of the data and storing them in ECC memory for accurate storage.
  • inexactly storing the non-significant bits in the data specifically includes: extracting the non-significant bits of the data and storing them in non-ECC memory for inexact storage.
  • the data are neural network parameters, including input neurons, weights, and output neurons; the important bits of the input neurons, the important bits of the output neurons, and the important bits of the weights are accurately stored, while the non-significant bits of the input neurons, the non-significant bits of the output neurons, and the non-significant bits of the weights are stored inexactly.
  • the data include floating point data and fixed point data; for floating point data, the sign bit and the exponent part are the important bits, and the mantissa part constitutes the non-significant bits;
  • for fixed point data, the sign bit and the first x bits of the numerical part are the important bits, and the remaining bits of the numerical part are the non-significant bits, where x is an integer greater than or equal to 0 and less than m, and m is the total number of bits of the parameter.
  • the ECC memory includes an ECC-checked DRAM and an ECC-checked SRAM; and the ECC-checked SRAM uses a 6T SRAM, a 4T SRAM, or a 3T SRAM.
  • the non-ECC memory includes a non-ECC check DRAM and a non-ECC check SRAM; the non-ECC check SRAM uses 6T SRAM, 4T SRAM or 3T SRAM.
  • a data processing method including:
  • the operation is a neural network operation
  • the parameter is a neural network parameter
  • receiving the parameters, performing the operation according to the control information, and storing the operation result includes: receiving the input neurons and the weights, completing the neural network operation according to the control information to obtain the output neurons, and storing or outputting the output neurons.
  • receiving the input neurons and the weights and completing the neural network operation according to the control information to obtain the output neurons includes: receiving the important bits of the input neurons and the important bits of the weights for calculation; or receiving complete input neurons and weights spliced from the important bits and the non-significant bits for calculation.
  • the data processing method further includes: caching dedicated instructions; performing exact caching and inexact caching of the input neurons; performing exact caching and inexact caching of the weight data; and performing exact caching and inexact caching of the output neurons.
  • the operation is a general operation.
  • before receiving the instructions and the parameters, storing the important bits of the parameters and the instructions accurately, and storing the non-important bits of the parameters inexactly, the method further includes:
  • preprocessing the input data and storing it; the preprocessing includes segmentation, Gaussian filtering, binarization, regularization, and normalization.
  • a memory unit is provided, the memory unit being a 4T SRAM or a 3T SRAM for storing neural network parameters.
  • in the 4T SRAM, the storage cell storing each bit includes 4 MOS transistors; in the 3T SRAM, the storage cell storing each bit includes 3 MOS transistors.
  • the four MOS transistors include a first MOS transistor, a second MOS transistor, a third MOS transistor, and a fourth MOS transistor; the first and second MOS transistors are used for gating, and the third and fourth MOS transistors are used for storage; the gate of the first MOS transistor is electrically connected to the word line WL and its source to the bit line BL; the gate of the second MOS transistor is electrically connected to the word line WL and its source to the bit line BLB; the gate of the third MOS transistor is connected to the source of the fourth MOS transistor and the drain of the second MOS transistor and is connected to the working voltage through resistor R2, and the drain of the third MOS transistor is grounded; the gate of the fourth MOS transistor is connected to the source of the third MOS transistor and the drain of the first MOS transistor and is connected to the working voltage through resistor R1, and the drain of the fourth MOS transistor is grounded; the WL is used to control gated access of the storage cell, and the BL is used for reading and writing the storage cell.
  • the three MOS transistors include a first MOS transistor, a second MOS transistor, and a third MOS transistor; the first MOS transistor is used for gating, and the second and third MOS transistors are used for storage, wherein:
  • the first MOS transistor gate is electrically connected to the word line WL
  • the source is electrically connected to the bit line BL
  • the second MOS transistor gate is connected to the third MOS transistor source, and is connected to the working voltage through the resistor R2.
  • a second MOS transistor drain is grounded;
  • a third MOS transistor gate is connected to the second MOS transistor source and the first MOS transistor drain, and is connected to the working voltage through the resistor R1, and the third MOS transistor drain is grounded;
  • the WL is used to control gated access of the storage cell, and the BL is used for reading and writing the storage cell.
  • the neural network parameters include input neurons, weights, and output neurons.
  • DVFS: Dynamic Voltage and Frequency Scaling.
  • the DVFS technique dynamically adjusts the operating frequency and voltage of a chip (for the same chip, the higher the frequency, the higher the required voltage), thereby saving energy.
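
A minimal sketch of the DVFS idea (an illustrative policy, not the patent's controller; the operating points and the utilization-based rule are assumptions):

```python
def pick_operating_point(utilization, levels):
    """Choose the lowest (voltage, frequency) pair whose frequency still
    covers the observed demand; levels are sorted by ascending frequency."""
    demand = utilization * levels[-1][1]     # demand expressed in MHz
    for volt, freq in levels:
        if freq >= demand:
            return volt, freq
    return levels[-1]

levels = [(0.8, 400), (0.9, 800), (1.1, 1200)]   # hypothetical (V, MHz) points
print(pick_operating_point(0.3, levels))          # light load -> (0.8, 400)
```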
  • a dynamic voltage regulation and frequency modulation apparatus including:
  • an information collecting unit configured to collect, in real time, the working state information or application scenario information of a chip connected to the dynamic voltage regulation and frequency modulation apparatus, where the application scenario information is information obtained through neural network computation or collected by a sensor connected to the chip;
  • a voltage regulation and frequency modulation unit configured to send voltage frequency regulation information to the chip according to the working state information or the application scenario information of the chip, where the voltage frequency regulation information is used to instruct the chip to adjust its working voltage or working frequency.
  • the working state information of the chip includes an operating speed of the chip
  • the voltage frequency regulation information includes first voltage frequency regulation information
  • the voltage regulation and frequency modulation unit is configured to: send the first voltage frequency regulation information to the chip when the running speed of the chip is greater than the target speed, the first voltage frequency regulation information being used to instruct the chip to reduce its working frequency or working voltage,
  • where the target speed is the running speed of the chip when the user's demand is met.
  • the chip includes at least a first unit and a second unit, the output data of the first unit being the input data of the second unit; the working state information of the chip includes the operating speed of the first unit and the operating speed of the second unit, the voltage frequency regulation information includes second voltage frequency regulation information, and the voltage regulation and frequency modulation unit is further configured to:
  • send the second voltage frequency regulation information to the second unit when, according to the operating speeds of the first unit and the second unit, the running time of the first unit exceeds the running time of the second unit; the second voltage frequency regulation information is used to instruct the second unit to reduce its working frequency or working voltage.
  • the voltage frequency regulation information includes third voltage frequency regulation information,
  • and the voltage regulation and frequency modulation unit is further configured to:
  • send the third voltage frequency regulation information to the first unit when the running time of the second unit exceeds the running time of the first unit; the third voltage frequency regulation information is used to instruct the first unit to reduce its working frequency or working voltage. A sketch of this producer/consumer balancing follows below.
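
The producer/consumer balancing behind the second and third voltage frequency regulation information can be sketched as follows (the running-time comparison matches the conditions filled in above; the function name and return strings are illustrative):

```python
def pipeline_regulation(t_first, t_second):
    """The first unit feeds the second. Whichever unit finishes earlier
    idles while waiting, so it can be slowed down to save energy."""
    if t_first > t_second:
        return "second unit: lower frequency/voltage"   # 2nd regulation info
    if t_second > t_first:
        return "first unit: lower frequency/voltage"    # 3rd regulation info
    return "balanced: no regulation needed"

print(pipeline_regulation(t_first=2.0, t_second=1.2))
```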
  • the chip includes at least N units, and the working state information of the chip includes the working state information of at least S of the N units, where N is an integer greater than 1 and S is an integer less than or equal to N;
  • the voltage frequency regulation information includes fourth voltage frequency regulation information, and the voltage regulation and frequency modulation unit is configured to:
  • send the fourth voltage frequency regulation information to a unit A when, according to the working state information of unit A, unit A is determined to be in an idle state, the fourth voltage frequency regulation information being used to instruct unit A to reduce its working frequency or working voltage, where unit A is any one of the at least S units.
  • the voltage frequency regulation information includes fifth voltage frequency regulation information,
  • and the voltage regulation and frequency modulation unit is further configured to: send the fifth voltage frequency regulation information to unit A when unit A is determined to have returned to a working state, the fifth voltage frequency regulation information being used to instruct unit A to increase its working voltage or working frequency.
  • the application scenario of the chip is image recognition
  • the application scenario information is the number of objects in the image to be identified
  • the voltage frequency regulation information includes sixth voltage frequency regulation information.
  • the voltage regulation and frequency modulation unit is further configured to:
  • the application scenario information is object tag information
  • the voltage frequency regulation information includes seventh voltage frequency regulation information
  • the voltage regulation and frequency modulation unit is further configured to:
  • the chip is applied to voice recognition
  • the application scenario information is a voice input rate
  • the voltage frequency regulation information includes eighth voltage frequency regulation information
  • the voltage regulation and frequency modulation unit is further configured to:
  • the application scenario information is a keyword obtained by performing speech recognition on the chip
  • the voltage frequency regulation information includes ninth voltage frequency regulation information
  • the voltage regulation and frequency modulation unit is further configured to:
  • send the ninth voltage frequency regulation information to the chip, the ninth voltage frequency regulation information being used to instruct the chip to increase its working voltage or working frequency.
  • the chip is applied to machine translation
  • the application scenario information is a speed of text input or a number of characters in an image to be translated
  • the voltage frequency regulation information includes tenth voltage frequency regulation information.
  • the voltage regulation and frequency modulation unit is further configured to:
  • the application scenario information is ambient light intensity
  • the voltage frequency regulation information includes eleventh voltage frequency regulation information
  • the voltage regulation and frequency modulation unit is further configured to:
  • the eleventh voltage frequency regulation information is used to instruct the chip to reduce its working voltage or working frequency.
  • the chip is applied to image beautification;
  • the voltage frequency regulation information includes twelfth voltage frequency regulation information and thirteenth voltage frequency regulation information;
  • the voltage regulation and frequency modulation unit is further configured to:
  • if the application scenario information is a face image, send the twelfth voltage frequency regulation information to the chip, where the twelfth voltage frequency regulation information is used to instruct the chip to reduce its working voltage.
  • a dynamic voltage regulation and frequency modulation method including:
  • the working state information of the chip includes an operating speed of the chip
  • the voltage frequency regulation information includes first voltage frequency regulation information, and sending the voltage frequency regulation information to the chip according to the working state information or the application scenario information of the chip includes:
  • sending the first voltage frequency regulation information to the chip when the running speed of the chip is greater than the target speed, the first voltage frequency regulation information being used to instruct the chip to reduce its working frequency or working voltage, where the target speed is the running speed of the chip when the user's demand is met.
• the chip includes at least a first unit and a second unit, the output data of the first unit is the input data of the second unit, the working state information of the chip includes the operating speed of the first unit and the operating speed of the second unit, and the voltage frequency regulation information includes second voltage frequency regulation information; the sending the voltage frequency regulation information to the chip according to the working state information or the application scenario information of the chip further includes:
  • the second voltage frequency regulation information is used to instruct the second unit to reduce its operating frequency or operating voltage.
• the voltage frequency regulation information includes third voltage frequency regulation information
  • the sending the voltage frequency regulation information to the chip according to the working state information or the application scenario information of the chip further includes:
  • the third voltage frequency regulation information is used to indicate that the first unit reduces its operating frequency or operating voltage.
• the chip includes at least N units, and the working state information of the chip includes working state information of at least S units of the at least N units, where N is an integer greater than 1.
• the voltage frequency regulation information includes fourth voltage frequency regulation information, and the sending the voltage frequency regulation information to the chip according to the working state information or the application scenario information of the chip further includes:
  • the unit A is any one of the at least S units.
  • the voltage frequency regulation information includes fifth voltage frequency regulation information
  • the sending the voltage frequency regulation information to the chip according to the working state information or the application scenario information of the chip further includes:
  • the application scenario of the chip is image recognition
  • the application scenario information is the number of objects in the image to be identified
  • the voltage frequency regulation information includes sixth voltage frequency regulation information.
  • the sending the voltage frequency regulation information to the chip according to the working state information or the application scenario information of the chip further includes:
  • the application scenario information is object tag information
• the voltage frequency regulation information includes seventh voltage frequency regulation information, and the sending the voltage frequency regulation information to the chip according to the working state information or the application scenario information of the chip further includes:
  • the chip is applied to voice recognition
  • the application scenario information is a voice input rate
• the voltage frequency regulation information includes eighth voltage frequency regulation information, and the sending the voltage frequency regulation information to the chip according to the working state information or the application scenario information of the chip further includes:
  • the application scenario information is a keyword obtained by performing speech recognition on the chip
• the voltage frequency regulation information includes ninth voltage frequency regulation information, and the sending the voltage frequency regulation information to the chip according to the working state information or the application scenario information of the chip further includes:
  • the ninth voltage frequency regulation information is sent to the chip, and the ninth voltage frequency regulation information is used to indicate that the chip increases its working voltage or operating frequency.
  • the chip is applied to machine translation
  • the application scenario information is a speed of text input or a number of characters in an image to be translated
  • the voltage frequency regulation information includes tenth voltage frequency regulation information.
  • the sending the voltage frequency regulation information to the chip according to the working state information or the application scenario information of the chip further includes:
  • the application scenario information is an ambient light intensity
  • the voltage frequency regulation information includes eleventh voltage frequency regulation information
• the sending the voltage frequency regulation information to the chip according to the working state information or the application scenario information of the chip further includes:
• the eleventh voltage frequency regulation information is used to indicate that the chip reduces its working voltage or operating frequency.
  • the chip is applied to image beauty
  • the voltage frequency regulation information includes twelfth voltage frequency regulation information and thirteenth voltage frequency regulation information
• the sending the voltage frequency regulation information to the chip according to the working state information or the application scenario information of the chip further includes:
  • the application scenario information is a face image
• sending the twelfth voltage frequency regulation information to the chip, where the twelfth voltage frequency regulation information is used to indicate that the chip reduces its working voltage
• Dynamic Voltage Frequency Scaling (DVFS) is a dynamic voltage and frequency adjustment technology now widely used in the semiconductor field. DVFS adjusts the operating frequency and voltage of a chip (for the same chip, the higher the frequency, the higher the required voltage), thereby achieving energy savings.
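For orientation only, the following Python sketch shows the generic control loop behind this idea: a monitor compares a measured running speed against a target speed and steps the chip's frequency/voltage operating point up or down. The helper names (`read_speed`, `set_freq_voltage`), the level table, and the constants are hypothetical, not part of the claimed apparatus.

```python
# Illustrative DVFS control loop; all names and values are assumptions.
import time

TARGET_SPEED = 1000.0                     # speed that satisfies the user's demand
FREQ_LEVELS = [0.4, 0.6, 0.8, 1.0, 1.2]   # GHz; each level implies a matching voltage

def read_speed():
    """Stand-in for the information collecting unit: returns measured speed."""
    return 1100.0  # placeholder measurement

def set_freq_voltage(level):
    """Stand-in for applying a frequency/voltage operating point to the chip."""
    print(f"operating point -> {FREQ_LEVELS[level]} GHz")

def dvfs_step(level):
    speed = read_speed()
    if speed > TARGET_SPEED and level > 0:
        level -= 1        # faster than needed: lower frequency and voltage
    elif speed < TARGET_SPEED and level < len(FREQ_LEVELS) - 1:
        level += 1        # too slow: raise frequency and voltage
    set_freq_voltage(level)
    return level

level = len(FREQ_LEVELS) - 1
for _ in range(3):        # in a real chip this loop runs continuously
    level = dvfs_step(level)
    time.sleep(0.001)
```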
• a dynamic voltage regulation and frequency modulation method applied to a smart chip such as a convolution operation device, and a corresponding device design.
  • a convolution operation device includes: a dynamic voltage modulation and frequency modulation device, an instruction storage unit, a controller unit, a data access unit, an interconnection module, a main operation module, and N slave operation modules, N is an integer greater than 1, where:
  • the instruction storage unit is configured to store an instruction read by the data access unit
• the controller unit is configured to read an instruction from the instruction storage unit and translate the instruction into a control signal for controlling the behavior of other modules, where the other modules include the data access unit, the main operation module, and the N slave operation modules;
  • the data access unit is configured to perform data or instruction read and write operations between the external address space and the convolution operation device;
  • the N slave operation modules are configured to implement a convolution operation of the input data and the convolution kernel in the convolutional neural network algorithm
  • the interconnection module is configured to perform data transmission between the main operation module and the slave operation module;
• the main operation module is configured to splice the intermediate vectors of all input data into an intermediate result and perform subsequent operations on the intermediate result;
• the dynamic voltage regulation and frequency modulation device is configured to collect working state information of the convolution operation device and to send voltage frequency regulation information to the convolution operation device according to that working state information, where the voltage frequency regulation information is used to instruct the convolution operation device to adjust its operating voltage or operating frequency.
• the main operation module is further configured to add the intermediate result to the offset data and then perform an activation operation.
  • the N slave operation modules are specifically configured to calculate respective output scalars in parallel by using the same input data and respective convolution kernels.
• the activation function active used by the main operation module is any one of the nonlinear functions sigmoid, tanh, relu, and softmax.
• the interconnection module constitutes a data path for continuous or discretized data between the main operation module and the N slave operation modules, and the interconnection module is any one of a tree structure, a ring structure, a mesh structure, a hierarchical interconnection structure, and a bus structure.
  • the main operation module includes:
  • a first storage unit configured to buffer input data and output data used by the main operation module in the calculation process
  • a first operation unit configured to complete various computing functions of the main operation module
• a first data dependency determining unit, configured as the port through which the first operation unit reads and writes the first storage unit, to ensure the consistency of data read from and written to the first storage unit, and to read data from the first storage unit
  • each of the N slave computing modules includes:
  • a second operation unit configured to receive a control signal sent by the controller unit and perform an arithmetic logic operation
  • a second data dependency determining unit configured to perform read and write operations on the second storage unit and the third storage unit during the calculating process to ensure read and write consistency to the second storage unit and the third storage unit;
• a second storage unit configured to buffer the input data and the output scalar calculated by the slave operation module;
  • a third storage unit configured to cache a convolution kernel required by the slave computing module in the calculation process.
  • the first data dependency determining unit and the second data dependency determining unit ensure read and write consistency by:
  • the data access unit reads at least one of input data, offset data, and a convolution kernel from an external address space.
  • the dynamic voltage regulation and frequency modulation apparatus includes:
  • An information collecting unit configured to collect working state information of the convolution operation device in real time
• a voltage regulation and frequency modulation unit, configured to send voltage frequency regulation information to the convolution operation device according to the working state information of the convolution operation device, where the voltage frequency regulation information is used to instruct the convolution operation device to adjust its working voltage or working frequency.
  • the working state information of the convolution operation device includes an operating speed of the convolution operation device
  • the voltage frequency regulation information includes first voltage frequency regulation information
• the voltage regulation and frequency modulation unit is configured to:
• send the first voltage frequency regulation information to the convolution operation device when the operating speed of the convolution operation device is greater than a target speed, the first voltage frequency regulation information being used to instruct the convolution operation device to reduce its operating frequency or operating voltage, where the target speed is the operating speed of the convolution operation device when the user's demand is met.
  • the working state information of the convolution operation device includes an operating speed of the data access unit and an operating speed of the main computing module
  • the voltage frequency control information includes second voltage frequency control information.
• the voltage regulation and frequency modulation unit is further configured to:
  • the second voltage frequency regulation information is used to instruct the main operation module to reduce its operating frequency or operating voltage.
  • the voltage frequency regulation information includes third voltage frequency regulation information
• the voltage regulation and frequency modulation unit is further configured to:
  • the third voltage frequency regulation information is used to instruct the data access unit to reduce its operating frequency or operating voltage.
• the working state information of the convolution operation device includes working state information of at least S units/modules among the instruction storage unit, the controller unit, the data access unit, the interconnection module, the main operation module, and the N slave operation modules, where S is an integer greater than 1 and less than or equal to N+5
  • the voltage frequency regulation information includes fourth voltage frequency regulation information
  • the voltage regulation and frequency modulation unit is configured to:
  • the unit A is any one of the at least S units/modules.
  • the voltage frequency regulation information includes fifth voltage frequency regulation information
  • the voltage regulation and frequency modulation unit is further configured to:
  • a neural network processor comprising a convolution operation device as described above.
  • an electronic device comprising a neural network processor as described above.
  • a method for performing a forward operation of a single-layer convolutional neural network which is applied to the above-described convolution operation device, and includes:
• the controller unit reads an IO instruction from the first address of the instruction storage unit, and according to the decoded control signal, the data access unit reads all corresponding convolutional neural network operation instructions from the external address space and caches them in the instruction storage unit;
  • the controller unit then reads in the next IO instruction from the instruction storage unit, and according to the decoded control signal, the data access unit reads all data required by the main operation module from the external address space to the main operation module.
• the controller unit then reads the next IO instruction from the instruction storage unit, and according to the decoded control signal, the data access unit reads the convolution kernel data required by the slave operation modules from the external address space;
• the controller unit then reads the next CONFIG instruction from the instruction storage unit, and according to the decoded control signal, the convolution operation device configures the various constants required for the calculation of this layer of the neural network;
• the controller unit then reads the next COMPUTE instruction from the instruction storage unit, and according to the decoded control signal, the main operation module first sends the input data in the convolution window to the N slave operation modules through the interconnection module, where it is saved to the second storage units of the N slave operation modules, and then moves the convolution window according to the instruction;
• the operation units of the N slave operation modules read the convolution kernels from the third storage units, read the input data from the second storage units, complete the convolution operation of the input data and the convolution kernels, and return the obtained output scalars through the interconnection module;
• in the interconnection module, the output scalars returned by the N slave operation modules are successively spliced into a complete intermediate vector;
• the main operation module obtains the intermediate vectors returned by the interconnection module; after the convolution window has traversed all the input data, the main operation module splices all the returned vectors into an intermediate result, reads the offset data from the first storage unit according to the control signal decoded from the COMPUTE instruction, adds the offset data to the intermediate result through the vector addition unit, activates the result through the activation unit, and writes the final output data back to the first storage unit;
• the controller unit then reads the next IO instruction from the instruction storage unit, and according to the decoded control signal, the data access unit stores the output data in the first storage unit to the specified address in the external address space, and the operation ends.
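As a rough illustration of the COMPUTE data flow above (not the instruction-driven hardware itself), the following numpy sketch assumes 1-D input data, one convolution kernel per slave module, and ReLU standing in for the configured activation; the function and variable names are assumptions.

```python
import numpy as np

def single_layer_conv_forward(x, kernels, bias, window, stride=1):
    """Sketch: each 'slave module' computes one output scalar per window
    position with its own kernel; the 'main module' splices the returned
    scalars into intermediate vectors, adds the offset data, and activates."""
    intermediate = []
    for p in range(0, len(x) - window + 1, stride):    # convolution window moves
        patch = x[p:p + window]                        # broadcast via interconnection
        scalars = [np.dot(patch, k) for k in kernels]  # parallel in hardware
        intermediate.append(scalars)                   # splice into a vector
    out = np.asarray(intermediate) + bias              # add offset data
    return np.maximum(out, 0.0)                        # activation (ReLU assumed)

x = np.arange(8, dtype=float)
kernels = [np.ones(3), np.array([1.0, 0.0, -1.0])]     # N = 2 slave modules
print(single_layer_conv_forward(x, kernels, bias=0.1, window=3))
```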
  • the method further includes:
• sending voltage frequency regulation information to the convolution operation device according to the working state information of the convolution operation device, where the voltage frequency regulation information is used to instruct the convolution operation device to adjust its operating voltage or operating frequency.
  • the working state information of the convolution operation device includes an operating speed of the convolution operation device
• the voltage frequency regulation information includes first voltage frequency regulation information, and the sending the voltage frequency regulation information to the convolution operation device according to the working state information of the convolution operation device includes:
• sending the first voltage frequency regulation information to the convolution operation device when the operating speed of the convolution operation device is greater than a target speed, the first voltage frequency regulation information being used to instruct the convolution operation device to reduce its operating frequency or operating voltage, where the target speed is the operating speed of the chip when the user's demand is met.
  • the working state information of the convolution operation device includes an operating speed of the data access unit and an operating speed of the main computing module
  • the voltage frequency control information includes second voltage frequency control information.
  • the transmitting the voltage frequency regulation information to the convolution operation device according to the working state information of the convolution operation device further includes:
  • the second voltage frequency regulation information is used to instruct the main operation module to reduce its operating frequency or operating voltage.
• the voltage frequency regulation information includes third voltage frequency regulation information, and the sending the voltage frequency regulation information to the convolution operation device according to the working state information of the convolution operation device further includes:
  • the third voltage frequency regulation information is used to instruct the data access unit to reduce its operating frequency or operating voltage.
• the working state information of the convolution operation device includes working state information of at least S units/modules among the instruction storage unit, the controller unit, the data access unit, the interconnection module, the main operation module, and the N slave operation modules, where S is an integer greater than 1 and less than or equal to N+5
• the voltage frequency regulation information includes fourth voltage frequency regulation information, and the sending the voltage frequency regulation information to the convolution operation device according to the working state information of the convolution operation device further includes:
  • the unit A is any one of the at least S units/modules.
• the voltage frequency regulation information includes fifth voltage frequency regulation information, and the sending the voltage frequency regulation information to the convolution operation device according to the working state information of the convolution operation device further includes:
  • a method for performing a forward operation of a multi-layer convolutional neural network comprising:
• the operation instruction of this layer takes the output data address of the upper layer, stored in the main operation module, as the input data address of this layer, and the convolution kernel and offset data addresses in the instruction are changed to the addresses corresponding to this layer.
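A minimal sketch of this layer chaining, assuming a placeholder single-layer forward function; `layer_forward` and the per-layer parameter values are hypothetical.

```python
# The output of layer i becomes the input of layer i+1, with each layer's own
# kernel/offset parameters swapped in (address chaining modeled as data flow).
import numpy as np

def layer_forward(x, weight, bias):                  # placeholder layer
    return np.maximum(x * weight + bias, 0.0)

params = [(2.0, 0.1), (0.5, -0.2), (1.5, 0.0)]       # hypothetical per-layer values
data = np.array([1.0, -1.0, 3.0])                    # initial input data
for weight, bias in params:                          # previous output -> next input
    data = layer_forward(data, weight, bias)
print(data)
```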
• Images are the visual basis of human perception of the world and an important means by which humans acquire, express, and transmit information.
  • an image compression method including:
• acquiring an original image of a first resolution, where the original image is any training image in the compressed training atlas of a compressed neural network, and the label information of the original image is used as target label information;
  • the target original image is compressed based on the compressed neural network model to obtain a target compressed image of the second resolution.
  • the image compression method further includes:
• when the loss function does not converge to the first threshold, or the current training count of the compressed neural network is less than the second threshold, the target model is updated according to the loss function to obtain an updated model, the updated model is used as the target model, the next training image is used as the original image, and the step of acquiring an original image of the first resolution is performed again.
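The training control flow described above can be sketched as follows; `compress`, `recognize`, `loss_fn`, and `update` are trivial numeric stand-ins for the compressed and recognition neural networks, not real models, and the threshold values are assumptions.

```python
# Control-flow sketch of the compression training loop (all stand-in functions).
FIRST_THRESHOLD, SECOND_THRESHOLD = 0.01, 1000

def compress(model, image):            # stand-in for the compressed network
    return model * image

def recognize(image):                  # stand-in for the recognition network
    return image

def loss_fn(target, reference):
    return abs(target - reference)

def update(model, loss):               # stand-in for one training update
    return model - 0.1 * loss

def train_compressor(training_set, model):
    for count, (original, target_label) in enumerate(training_set, start=1):
        compressed = compress(model, original)   # first -> second resolution
        reference = recognize(compressed)        # reference label information
        loss = loss_fn(target_label, reference)
        if loss <= FIRST_THRESHOLD or count >= SECOND_THRESHOLD:
            return model                         # compressed NN model is ready
        model = update(model, loss)              # update, take next image
    return model

print(train_compressor([(1.0, 1.0), (2.0, 2.0)], model=1.2))
```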
  • the identifying the compressed image based on the identification neural network model, and obtaining the reference label information specifically includes:
• the pre-processing includes size processing
  • the pre-processing the compressed image to obtain the image to be identified specifically includes:
  • the compressed image is filled with pixels according to the basic image size to obtain the image to be recognized.
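A minimal sketch of this pixel filling, assuming zero-valued fill pixels and a 2-D grayscale image; the function name and fill value are assumptions.

```python
import numpy as np

def pad_to_basic_size(img, basic_h, basic_w, fill=0):
    """Pad an (H, W) image with fill pixels so it reaches the recognition
    network's basic image size; the original pixels are kept top-left."""
    h, w = img.shape
    if h >= basic_h and w >= basic_w:
        return img                   # already at least the basic size
    out = np.full((max(h, basic_h), max(w, basic_w)), fill, dtype=img.dtype)
    out[:h, :w] = img
    return out

print(pad_to_basic_size(np.ones((2, 3), dtype=np.uint8), 4, 4))
```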
  • the compressed training atlas includes at least an identification training atlas, and the method further includes:
  • the identification neural network is trained by using the identification training atlas to obtain the identification neural network model, and each training image in the identification training map set at least includes label information that is consistent with the type of the target label information.
  • the method further includes:
  • the compressed training atlas includes a plurality of dimensions
• the compressing of the original image by the target model to obtain a compressed image of the second resolution includes:
  • the original image is compressed based on the target model and the plurality of image information to obtain the compressed image.
  • an image compression apparatus includes a processor, a memory coupled to the processor, wherein:
• the memory is configured to store a first threshold, a second threshold, the current neural network model and training count of the compressed neural network, the compressed training atlas of the compressed neural network, the label information of each training image in the compressed training atlas, a recognition neural network model, and a compressed neural network model, with the current neural network model of the compressed neural network serving as a target model; the compressed neural network model is the corresponding target model when the training of the compressed neural network is completed, and the recognition neural network model is the corresponding neural network model when the training of the recognition neural network is completed;
  • the processor is configured to acquire an original image of a first resolution, where the original image is any training image in the compressed training map set, and label information of the original image is used as target label information;
• compress the original image based on the target model to obtain a compressed image of a second resolution, the second resolution being smaller than the first resolution; identify the compressed image based on the recognition neural network model to obtain reference label information; obtain a loss function according to the target label information and the reference label information; when the loss function converges to the first threshold, or the training count is greater than or equal to the second threshold, acquire a target original image of the first resolution and confirm that the target model is the compressed neural network model; and compress the target original image based on the compressed neural network model to obtain a target compressed image of the second resolution.
• the processor is further configured to: when the loss function does not converge to the first threshold, or the training count is less than the second threshold, update the target model according to the loss function to obtain an updated model, use the updated model as the target model, use the next training image as the original image, and perform the step of acquiring an original image of the first resolution.
• the processor is specifically configured to preprocess the compressed image to obtain an image to be identified, and to identify the image to be identified based on the recognition neural network model to obtain the reference label information.
  • the pre-processing includes size processing
  • the memory is further configured to store a basic image size of the recognition neural network
• the processor is specifically configured to, when the image size of the compressed image is smaller than the basic image size, fill the compressed image with pixels according to the basic image size to obtain the image to be recognized.
• the compressed training atlas includes at least an identification training atlas, and the processor is further configured to train the recognition neural network by using the identification training atlas to obtain the recognition neural network model, where each training image in the identification training atlas includes at least label information consistent with the type of the target label information.
• the processor is further configured to identify the target compressed image based on the recognition neural network model to obtain the label information of the target original image, and to store the label information of the target original image.
  • the compressed training atlas includes multiple dimensions
• the processor is specifically configured to identify the original image based on the target model to obtain multiple pieces of image information, each dimension corresponding to one piece of image information, and to compress the original image based on the target model and the multiple pieces of image information to obtain the compressed image.
• another electronic device includes a processor, a memory, a communication interface, and one or more programs, where the one or more programs are stored in the memory and configured to be executed by the processor, and the programs include instructions for some or all of the steps described in the image compression method above.
• a computer readable storage medium stores a computer program, the computer program includes program instructions, and the program instructions, when executed by a processor, cause the processor to perform the image compression method described above.
• the processing method and device, and the computing method and device, provided by the present application have at least the following advantages compared with the prior art:
• the neural network processor integrates a lookup-table-based calculation method, optimizes the table lookup operation, and simplifies the structure, reducing the memory access energy consumption and computation energy consumption of the neural network while also achieving diversified operations.
• the neural network can be retrained, and only the codebook needs to be trained during retraining, without training the weight dictionary, which simplifies the retraining operation.
• the dedicated neural network instructions and flexible operation units for locally quantized multi-layer artificial neural network operations solve the problems of insufficient operational performance of the central processing unit (CPU) and the graphics processing unit (GPU) and high front-end decoding overhead, effectively improving support for multi-layer artificial neural network algorithms.
  • FIG. 1A is a schematic flowchart of a processing method according to an embodiment of the present application.
  • FIG. 1B is a schematic diagram of a process for quantifying weights according to an embodiment of the present application.
  • FIG. 1C is a schematic diagram of a process for quantifying input neurons according to an embodiment of the present application.
  • FIG. 1D is a schematic diagram of a process for determining a computing codebook according to an embodiment of the present application.
• FIG. 1E is a schematic structural diagram of a processing apparatus according to an embodiment of the present application.
• FIG. 1F is a schematic structural diagram of an arithmetic device according to an embodiment of the present application.
  • FIG. 1G is a schematic structural diagram of an arithmetic device according to an embodiment of the present application.
  • FIG. 1H is a schematic flowchart diagram of an operation method according to an embodiment of the present application.
• FIG. 1I is a schematic flowchart diagram of another computing method according to a specific embodiment of the present disclosure.
  • FIG. 2A is a schematic structural diagram of a layered storage device according to an embodiment of the present application.
• FIG. 2B is a schematic structural diagram of a 4T SRAM memory unit according to an embodiment of the present application.
• FIG. 2C is a schematic structural diagram of a 3T SRAM memory unit according to an embodiment of the present application.
• FIG. 2D is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application.
• FIG. 2E is a schematic structural diagram of another data processing apparatus according to an embodiment of the present application.
• FIG. 2F is a flowchart of a data storage method according to an embodiment of the present application.
• FIG. 2G is a flowchart of a data processing method according to an embodiment of the present application.
  • FIG. 3A is a schematic structural diagram of a dynamic voltage regulation and frequency modulation apparatus according to an embodiment of the present application.
  • FIG. 3B is a schematic diagram of a dynamic voltage regulation and frequency modulation application scenario according to an embodiment of the present disclosure.
  • FIG. 3C is a schematic diagram of another dynamic voltage regulation and frequency modulation application scenario according to an embodiment of the present disclosure.
  • FIG. 3D is a schematic diagram of another dynamic voltage regulation and frequency modulation application scenario provided by an embodiment of the present application.
  • FIG. 3E is a schematic diagram of an implementation manner of an interconnection module 4 according to an embodiment of the present application.
  • FIG. 3F is a block diagram showing an example of a structure of a main operation module 5 in an apparatus for performing a forward operation of a convolutional neural network according to an embodiment of the present application.
  • FIG. 3G is a block diagram showing an example of a structure of a slave operation module 6 in an apparatus for performing a forward operation of a convolutional neural network according to an embodiment of the present application.
• FIG. 3H is a schematic flowchart of a dynamic voltage regulation and frequency modulation method according to an embodiment of the present application.
  • FIG. 4A is a schematic structural diagram of a convolution operation device according to an embodiment of the present application.
  • FIG. 4B is a block diagram showing an example of a structure of a main operation module in a convolution operation device according to an embodiment of the present application.
• FIG. 4C is a block diagram showing an example of a structure of a slave operation module in a convolution operation device according to an embodiment of the present application.
• FIG. 4D is a block diagram showing an example of a structure of a dynamic voltage regulation and frequency modulation apparatus in a convolution operation device according to an embodiment of the present application.
  • FIG. 4E is a schematic diagram of an implementation manner of the interconnect module 4 according to an embodiment of the present application.
  • FIG. 4F is a schematic structural diagram of another convolution operation device according to an embodiment of the present application.
  • FIG. 4G is a schematic flowchart of a method for performing a forward operation of a single-layer convolutional neural network according to an embodiment of the present application.
  • FIG. 5A is a schematic diagram of operations of a neural network according to an embodiment of the present application.
  • FIG. 5B is a schematic flowchart of an image compression method according to an embodiment of the present application.
  • FIG. 5C is a schematic diagram of a scenario of a size processing method according to an embodiment of the present application.
• FIG. 5D is a schematic flowchart of a single-layer neural network operation method according to an embodiment of the present application.
  • FIG. 5E is a schematic structural diagram of a reverse training device for performing a compressed neural network according to an embodiment of the present application.
• FIG. 5F is a schematic structural diagram of an H-tree module according to an embodiment of the present application.
  • FIG. 5G is a schematic structural diagram of a main operation module according to an embodiment of the present application.
• FIG. 5H is a schematic structural diagram of an operation module according to an embodiment of the present application.
  • FIG. 5I is a block diagram of an example of reverse training of a compressed neural network according to an embodiment of the present application.
• FIG. 5J is a schematic flowchart of an image compression method according to an embodiment of the present application.
  • FIG. 5K is a schematic structural diagram of an electronic device according to an embodiment of the present application.
• the present application provides a processing method and apparatus, and a computing method and apparatus.
• the processing method and device quantize the two kinds of input data, neurons and weights, respectively mining the inter-layer and inter-segment similarity of the data and the intra-layer and intra-segment local similarity of the data, so as to obtain the distribution characteristics of the two kinds of data and perform low-bit quantization, reducing the number of bits used to represent each datum and thereby reducing the data storage overhead and memory access overhead.
• the processing method and device realize the operations on the quantized neurons and weights through table lookup operations, reducing the memory access energy consumption and computation energy consumption of the neural network.
• the input neurons and output neurons mentioned in this application do not refer to the neurons in the input layer and the output layer of the entire neural network; rather, for any two adjacent layers in the network, the neurons in the lower layer of the network feedforward operation are the input neurons, and the neurons in the upper layer of the network feedforward operation are the output neurons.
  • FIG. 1A is a schematic flowchart of a processing method according to an embodiment of the present disclosure. As shown in FIG. 1A, the processing method includes:
• Step S1: quantizing the weights and the input neurons respectively, and determining the weight dictionary, the weight codebook, the neuron dictionary, and the neuron codebook;
  • the process of quantifying the weight includes the following steps:
• the weights are grouped, each group of weights is clustered by a clustering algorithm, and a group of weights is divided into m classes, where m is a positive integer; each class of weights corresponds to a weight index, and the weight dictionary is determined.
  • the weight dictionary includes a weight position and a weight index, and the weight position refers to a position of the weight in the neural network structure;
• all the weights of each class are replaced with a central weight, and the weight codebook is determined, where the weight codebook includes the weight indices and the center weights.
  • FIG. 1B is a schematic diagram of a process for quantifying weights according to an embodiment of the present application.
• the weights are grouped according to a preset grouping strategy to obtain an ordered weight matrix.
• intra-group sampling and clustering operations are performed on the grouped weight matrix, weights with similar values are classified into the same category, and the central weights of the four categories, calculated according to the loss function, are 1.50, -0.13, -1.3, and 0.23, each corresponding to the weights of one category.
  • the weight index of the category with a center weight of -1.3 is 00
  • the weight index of the category with a center weight of -0.13 is 01
• the weight index of the category with a center weight of 0.23 is 10, and the weight index of the category with a center weight of 1.50 is 11.
• the four weight indices (00, 01, 10, and 11) are then used to represent the weights in the corresponding categories, thereby obtaining the weight dictionary.
  • the weight dictionary also includes the weight position, that is, the position of the weight in the neural network structure.
• the weight position refers to the coordinate (p, q), i.e., the p-th row and q-th column; in the present embodiment, 1 ≤ p ≤ 4 and 1 ≤ q ≤ 4.
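For intuition, the following 1-D k-means sketch mirrors this grouping-and-clustering step on a 4x4 weight matrix like the one in the example; the initialization, iteration count, and function names are assumptions, not the patented procedure.

```python
import numpy as np

def quantize_weights(w, m=4, iters=20):
    """Cluster a weight matrix into m classes (1-D k-means), record the class
    index at every weight position (weight dictionary), and return the center
    weight of each class (weight codebook)."""
    flat = w.ravel()
    centers = np.linspace(flat.min(), flat.max(), m)   # assumed initialization
    for _ in range(iters):
        idx = np.argmin(np.abs(flat[:, None] - centers[None, :]), axis=1)
        for c in range(m):
            if np.any(idx == c):
                centers[c] = flat[idx == c].mean()     # minimizes squared distance
    dictionary = idx.reshape(w.shape)  # per-position class index (2 bits for m=4)
    return dictionary, centers

w = np.array([[ 1.40, -0.10,  0.20, -1.20],
              [ 1.60,  0.25, -1.40, -0.15],
              [ 0.20,  1.50, -0.10, -1.30],
              [-0.12,  0.26,  1.45, -1.25]])
dictionary, codebook = quantize_weights(w)
print(dictionary)   # weight dictionary: class index at each (p, q) position
print(codebook)     # weight codebook: center weight of each class
```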
• this quantization process fully exploits the inter-layer similarity and intra-layer local similarity of the neural network weights, and obtains the weight distribution characteristics of the neural network in order to perform low-bit quantization, reducing the number of bits used to represent each weight and thereby reducing the weight storage overhead and memory access overhead.
• the preset grouping strategy includes, but is not limited to, the following: grouping into one group, where all weights of the neural network are grouped into one group; layer-type grouping, where the weights of all convolution layers, the weights of all fully connected layers, and the weights of all long short-term memory network layers in the neural network are each grouped into one group; inter-layer grouping, where the weights of one or more convolution layers, the weights of one or more fully connected layers, and the weights of one or more long short-term memory network layers in the neural network are grouped into one group; and intra-layer grouping, where the weights within one layer of the neural network are segmented, and each segmented part is grouped into one group.
  • Clustering algorithms include K-means, K-medoids, Clara, and/or Clarans.
• the central weight of each class is chosen such that the cost function J(w, w0) is minimized, the value of w0 at the minimum being the center weight; the cost function can be the squared distance, J(w, w0) = Σ_{i=1}^{n} (w_i − w0)², where J is the cost function, w denotes all the weights in the class, w0 is the central weight, n is the number of weights in the class, w_i is the i-th weight in the class, 1 ≤ i ≤ n, and n is a positive integer. (For this cost function, the minimizing w0 is simply the mean of the weights in the class.)
  • the input neurons are quantized, which includes the steps of:
  • the input neurons are encoded, and all input neurons of each segment are replaced with a central neuron to determine a neuron codebook.
  • FIG. 1C is a schematic diagram of a process for quantifying input neurons according to an embodiment of the present application.
  • this embodiment uses a method for quantifying ReLU activation layer neurons as an example.
  • the ReLU function is segmented into four segments.
  • the central neurons representing the four segments are represented by 0.0, 0.2, 0.5, and 0.7, respectively, and the neuron index is represented by 00, 01, 10, and 11.
• a neuron codebook containing the neuron indices and the central neurons is generated, together with a neuron dictionary containing the neuron ranges and the neuron indices, where the neuron ranges and the neuron indices are stored correspondingly and x represents the value of an unquantized neuron. The quantization process of the input neurons can divide the input neurons into multiple segments according to actual needs, obtain an index for each segment to form the neuron dictionary, and then replace the input neurons in each segment with the central neurons in the neuron codebook according to the neuron indices. This fully exploits the similarity between input neurons and obtains the distribution characteristics of the input neurons for low-bit quantization, reducing the number of bits representing each input neuron and thereby reducing the storage overhead and memory access overhead of the input neurons.
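A small numpy sketch of this segment-based neuron quantization, using the central neurons and 2-bit indices from the example above; the segment boundaries chosen here are assumptions.

```python
import numpy as np

BOUNDARIES = np.array([0.1, 0.35, 0.6])            # assumed neuron range edges
CENTRAL_NEURONS = np.array([0.0, 0.2, 0.5, 0.7])   # neuron codebook values

def quantize_neurons(x):
    """Map each input neuron to its segment (neuron index), then replace it
    with the central neuron of that segment."""
    idx = np.searchsorted(BOUNDARIES, x)   # neuron dictionary lookup
    return idx, CENTRAL_NEURONS[idx]

idx, q = quantize_neurons(np.array([0.05, 0.30, 0.55, 0.90]))
print(idx)   # [0 1 2 3] -> 2-bit indices 00, 01, 10, 11
print(q)     # [0.  0.2 0.5 0.7]
```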
• Step S2: determining the operation codebook according to the weight codebook and the neuron codebook, which specifically includes the following steps:
• FIG. 1D is a schematic diagram of a process for determining an operation codebook according to an embodiment of the present application.
• the multiplication codebook is taken as an example in this embodiment; the codebook can also be an addition codebook, a pooling codebook, etc., and this application is not limited thereto.
• in the weight dictionary, the weight index corresponding to a weight and the center weight corresponding to that weight index are determined; in the neuron dictionary and neuron codebook, the neuron index corresponding to an input neuron and the central neuron corresponding to that index are determined; the neuron indices and the weight indices are used as the row indices and the column indices of the operation codebook, and the central neurons and the center weights are multiplied to form a matrix, so that the multiplication codebook is obtained.
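A sketch of building and querying such a multiplication codebook, using the center weights and central neurons from the examples above; the function name is hypothetical. The lookup of row 2, column 3 returning 0.046 matches the worked example later in the text.

```python
import numpy as np

central_neurons = np.array([0.0, 0.2, 0.5, 0.7])      # neuron codebook values
center_weights = np.array([-1.3, -0.13, 0.23, 1.5])   # weight codebook values

# Multiplication codebook: rows indexed by neuron index, columns by weight
# index; each entry is the precomputed product of the two center values.
mult_codebook = np.outer(central_neurons, center_weights)

def lookup_multiply(neuron_idx, weight_idx):
    """Replace a multiplication with a single table lookup."""
    return mult_codebook[neuron_idx, weight_idx]

print(lookup_multiply(1, 2))   # row 2, column 3: 0.2 * 0.23 = 0.046
```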
• Step S3 may further be included: retraining the weights and the input neurons, where only the weight codebook and the neuron codebook are trained during retraining while the contents of the weight dictionary and the neuron dictionary remain unchanged, which simplifies the retraining operation and reduces the workload.
  • the retraining employs a backpropagation algorithm.
  • FIG. 1E is a schematic structural diagram of a processing apparatus according to an embodiment of the present disclosure. As shown in FIG. 1E, the processing apparatus includes:
  • a memory 51 configured to store an operation instruction
  • the processor 52 is configured to execute an operation instruction in the memory 51, and perform an operation according to the foregoing processing method when the operation instruction is executed.
• the operation instruction may be a binary number including an operation code and an address code, where the operation code indicates the operation to be performed by the processor 52, and the address code instructs the processor 52 to read the data participating in the operation from its address in the memory 51.
• by executing the operation instructions in the memory 51, the processor 52 performs operations according to the foregoing processing method, quantizing disordered weights and input neurons into low-bit, normalized center weights and central neurons; by mining the local similarity of the weights and of the input neurons, the distribution characteristics of both are obtained, and low-bit quantization is performed according to these distribution characteristics, reducing the number of bits representing each weight and each input neuron and thereby reducing the storage overhead and memory access overhead of both.
  • FIG. 1F is a schematic structural diagram of an arithmetic device according to an embodiment of the present application.
  • the computing device includes: an instruction control unit 1 and a lookup table unit 2;
  • the instruction control unit 1 is configured to decode the received instruction to generate search control information
  • the lookup table unit 2 is configured to search for output neurons from the operation codebook according to the search control information generated by the instruction control unit 1 and the received weight dictionary, the neuron dictionary, the operation codebook, the weight and the input neurons.
• the weight dictionary includes weight positions (i.e., the position of a weight in the neural network structure, represented by (p, q), indicating the p-th row and q-th column) and weight indices;
  • the neuron dictionary includes an input neuron and a neuron index
  • the operational codebook includes a weight index, a neuron index, and an operation result of the input neuron and the weight.
• the specific working process of the lookup table unit is: determining the weight index according to the weight position of the weight in the weight dictionary; determining the neuron index according to the neuron range of the input neuron in the neuron dictionary; using the weight index and the neuron index as the column index and the row index of the operation codebook; and finding the value (operation result) at that row and column in the operation codebook, which is the output neuron.
• for example, the multiplication codebook is searched, and the value corresponding to the second row and the third column is 0.046, which is the output neuron.
  • pooling includes, but is not limited to, average pooling, maximum pooling, and median pooling.
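For clarity, a one-window numpy sketch of these three pooling variants; the function name and window values are illustrative.

```python
import numpy as np

def pool(window, kind="max"):
    """Average, maximum, or median pooling over one pooling window."""
    return {"avg": np.mean, "max": np.max, "median": np.median}[kind](window)

window = np.array([0.1, 0.7, 0.3, 0.5])
print(pool(window, "avg"), pool(window, "max"), pool(window, "median"))
```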
  • the lookup table may include at least one of the following according to different arithmetic operations:
• an addition lookup table, used to obtain, through the table lookup operation add_lookup, the sum of the central data corresponding to the input indices in; the index vector in and the data vector data are of length N, where N is a positive integer.
  • FIG. 1G is a schematic structural diagram of another computing device according to an embodiment of the present application.
• compared with the computing device in FIG. 1F, the computing device of this embodiment further includes a preprocessing unit 4, a storage unit 3, a cache unit 6, and a direct memory access unit 5, which optimize the processing of the present application and make the processing of data more orderly.
  • the pre-processing unit 4 is configured to pre-process the input information of the external input to obtain the weight, the input neuron, the instruction, the weight dictionary, the neuron dictionary, and the operation codebook, and the pre-processing includes but is not limited to segmentation, Gaussian filtering, binarization, regularization, and/or normalization.
  • the storage unit 3 is configured to store input neurons, weights, weights dictionary, neuron dictionary, operation codebook and instructions, and receive output neurons;
  • the cache unit 6 is configured to cache the instruction, the weight index, the neuron index, and the output neuron, and may include:
  • the instruction cache 61 is configured to buffer the instruction and output the cached instruction to the instruction control unit 1;
  • a weight buffer 62 configured to cache the weight, and output the cached weight to the lookup table unit 2;
  • the input neuron cache 63 is configured to buffer the input neuron and output the buffered input neuron to the lookup table unit 2;
  • the output neuron cache 64 is configured to cache the output neurons output by the lookup table unit 2, and output the buffered output neurons to the lookup table unit 2;
  • a neuron index cache 65 configured to determine a corresponding neuron index according to the input neuron, cache the neuron index, and output the cached neuron index to the lookup table unit 2;
  • the weight index cache 66 is configured to determine a corresponding weight index according to the weight, cache the weight index, and output the cached weight index to the lookup table unit 2.
  • the direct memory access unit 5 is configured to perform data or instruction reading and writing between the storage unit 3 and the cache unit 6.
  • the instruction may be a neural network specific instruction, including all instructions dedicated to completing an artificial neural network operation.
  • Neural network specific instructions include, but are not limited to, control instructions, data transfer instructions, arithmetic instructions, and logic instructions. Among them, the control command controls the execution process of the neural network.
  • Data transfer instructions complete the transfer of data between different storage media, including but not limited to matrices, vectors, and scalars.
• the arithmetic instruction completes the arithmetic operations of the neural network, including but not limited to the matrix operation instruction, vector operation instruction, scalar operation instruction, convolutional neural network operation instruction, fully connected neural network operation instruction, pooled neural network operation instruction, RBM neural network operation instruction, LRN neural network operation instruction, LCN neural network operation instruction, LSTM neural network operation instruction, RNN neural network operation instruction, RELU neural network operation instruction, PRELU neural network operation instruction, SIGMOID neural network operation instruction, TANH neural network operation instruction, and MAXOUT neural network operation instruction.
  • Logic instructions are used to perform logical operations on the neural network, including but not limited to vector logic operations instructions and scalar logic operation instructions.
  • the RBM neural network operation instruction is used to implement the Restricted Boltzmann Machine (RBM) neural network operation.
  • the LRN neural network operation instruction is used to implement the Local Response Normalization (LRN) neural network operation.
  • the LSTM neural network operation instructions are used to implement Long Short-Term Memory (LSTM) neural network operations.
  • the RNN neural network operation instruction is used to implement Recurrent Neural Networks (RNN) neural network operations.
  • the RELU neural network operation instruction is used to implement a Rectified linear unit (RELU) neural network operation.
  • the PRELU neural network operation instruction is used to implement Parametric Rectified Linear Unit (PRELU) neural network operations.
• the SIGMOID neural network operation instruction is used to implement S-type growth curve (SIGMOID) neural network operations.
  • the TANH neural network operation instruction is used to implement a hyperbolic tangent function (TANH) neural network operation.
• the MAXOUT neural network operation instruction is used to implement maxout (MAXOUT) neural network operations.
  • the neural network specific instruction includes a Cambricon instruction set, wherein the Cambricon instruction set includes at least one Cambricon instruction, and the Cambricon instruction has a length of 64 bits, and the Cambricon instruction includes an operation code and an operand.
• the Cambricon instruction set contains four types of instructions, namely Cambricon control instructions, Cambricon data transfer instructions, Cambricon operation instructions, and Cambricon logic instructions.
  • Cambricon control instructions are used to control the execution process.
• the Cambricon control instructions include a jump instruction and a conditional branch instruction.
  • Cambricon data transfer instruction is used to complete data transfer between different storage media.
  • Cambricon data transfer instructions include load instructions, store instructions, and move instructions.
  • the load instruction is used to load data from the main memory to the cache
  • the store instruction is used to store data from the cache to the main memory
• the move instruction is used to move data between caches, between a cache and a register, or between registers.
  • Data transfer instructions support three different ways of organizing data, including matrices, vectors, and scalars.
  • Cambricon operation instruction is used to perform neural network arithmetic operations.
  • Cambricon arithmetic instructions include Cambricon matrix operation instructions, Cambricon vector operation instructions, and Cambricon scalar operation instructions.
• the Cambricon matrix operation instruction is used to complete matrix operations in the neural network, including matrix multiply vector, vector multiply matrix, matrix multiply scalar, outer product, matrix add matrix, and matrix subtract matrix.
• the Cambricon vector operation instruction is used to perform vector operations in the neural network, including vector elementary arithmetics, vector transcendental functions, dot product, random vector generator, and maximum/minimum of a vector.
  • vector basic operations include vector addition, subtraction, multiplication, and division (add, subtract, multiply, divide), and vector transcendental functions are functions that do not satisfy any polynomial equation with polynomials as coefficients, including but not limited to exponential functions.
  • the Cambricon scalar instruction is used to perform scalar operations in neural networks, including scalar elementary arithmetics and scalar transcendental functions.
  • scalar basic operations include scalar addition, subtraction, multiplication, and division (add, subtract, multiply, divide), and scalar transcendental functions are functions that do not satisfy any polynomial equations with polynomials as coefficients, including but not limited to exponential functions.
  • Cambricon logic instructions are used to perform logical operations on the neural network.
  • Cambricon logic operations include Cambricon vector logic operations and Cambricon scalar logic operations.
  • the Cambricon vector logic operation instruction is used to complete vector comparison, vector logical operations, and vector greater than merge.
• the vector comparison includes but is not limited to greater than, less than, equal to, greater than or equal to, less than or equal to, and not equal to.
  • Vector logic operations include AND, OR, and NOT.
• the Cambricon scalar logic operation instruction is used to perform scalar comparison and scalar logical operations.
  • the scalar comparison includes but is not limited to greater than, less than, equal to, greater than or equal to, less than or equal to, and not equal to.
  • Scalar logic operations include AND, OR, and NOT.
  • FIG. 1H is a schematic flowchart of another operation method according to an embodiment of the present application. As shown in FIG. 1H, the operation method includes the following steps:
• receiving a weight, an input neuron, an instruction, a weight dictionary, a neuron dictionary, and an operation codebook, where the weight dictionary includes weight positions and weight indices, the neuron dictionary includes input neurons and neuron indices, and the operation codebook includes weight indices, neuron indices, and operation results of input neurons and weights.
  • Step S83 is similar to the specific working process of the lookup table unit, and specifically includes the following substeps:
• FIG. 1I is a schematic flowchart of another computing method according to an embodiment of the present application.
  • the calculation method includes the following steps:
  • Step S90 Preprocessing external input input information.
  • the pre-processing the input information of the external input specifically includes: obtaining a weight corresponding to the input information, an input neuron, an instruction, a weight dictionary, a neuron dictionary, and an operation codebook; and the preprocessing includes Segmentation, Gaussian filtering, binarization, regularization, or normalization.
  • Step S91 Receive the weight, the input neuron, the instruction, the weight dictionary, the neuron dictionary, and the operation codebook.
  • Step S92 storing the weight, the input neuron, the instruction, the weight dictionary, the neuron dictionary, and the operation codebook.
  • Step S93 buffering the weight, inputting a neuron, an instruction, a weight index, and a neuron index.
  • Step S94 decoding the instruction, and determining the search control information.
• Step S95: according to the weight, the input neuron, the weight dictionary, and the neuron dictionary, look up the neuron dictionary to determine the neuron index, and look up the weight position in the weight dictionary to determine the weight index.
• Step S96: searching for the operation result in the operation codebook according to the weight index and the neuron index, and determining the output neuron.
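• A minimal sketch of steps S95 and S96, assuming the dictionaries and the operation codebook are plain Python dicts; all names and values are hypothetical, and the actual device performs these lookups in a hardware lookup table unit:

```python
# Hypothetical quantization dictionaries: value -> index
weight_dictionary = {0.51: 0, 0.98: 1}
neuron_dictionary = {0.25: 0, 0.75: 1}
# Operation codebook: (weight index, neuron index) -> precomputed result
operation_codebook = {
    (0, 0): 0.1275, (0, 1): 0.3825,
    (1, 0): 0.2450, (1, 1): 0.7350,
}

def lookup_output_neuron(weight, input_neuron):
    w_idx = weight_dictionary[weight]          # step S95: determine the weight index
    n_idx = neuron_dictionary[input_neuron]    # step S95: determine the neuron index
    return operation_codebook[(w_idx, n_idx)]  # step S96: look up the operation result

print(lookup_output_neuron(0.98, 0.75))        # -> 0.735
```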
  • FIG. 2A is a schematic structural diagram of a hierarchical storage device according to an embodiment of the present disclosure.
• the device includes: a precise storage unit and an inexact storage unit, where the precise storage unit is used to store the important bits of the data, and the inexact storage unit is used to store the non-significant bits of the data.
• the precise storage unit uses error checking and correcting (ECC) memory, and the inexact storage unit uses non-ECC memory.
• the data stored by the tiered storage device is neural network parameters, including input neurons, weights, and output neurons; the precise storage unit stores the important bits of the input neurons, output neurons, and weights, and the inexact storage unit stores the non-significant bits of the input neurons, output neurons, and weights.
• the data stored by the tiered storage device includes floating-point data and fixed-point data. For floating-point data, the sign bit and the exponent portion are designated as important bits and the mantissa portion as non-significant bits; for fixed-point data, the sign bit and the first x bits of the value portion are designated as important bits and the remaining bits of the value portion as non-significant bits, where x is an integer greater than or equal to 0 and less than m, and m is the total number of bits of the fixed-point data. The important bits are stored in ECC memory for accurate storage, and the non-significant bits are stored in non-ECC memory for inexact storage.
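• A minimal sketch of this bit split for an m-bit fixed-point word, assuming m = 8 and x = 3 purely for illustration; the helper names are hypothetical and unsigned words are used for simplicity:

```python
def split_fixed_point(word, m=8, x=3):
    """Split an m-bit word: the top 1 + x bits (sign bit plus the first x
    value bits) are the important part destined for ECC memory; the low
    m - 1 - x bits are the non-significant part destined for non-ECC memory."""
    assert 0 <= word < (1 << m)
    low = m - 1 - x
    important = word >> low
    non_significant = word & ((1 << low) - 1)
    return important, non_significant

def splice(important, non_significant, m=8, x=3):
    """Reassemble the complete word when both parts are read back."""
    return (important << (m - 1 - x)) | non_significant

w = 0b10110110
hi, lo = split_fixed_point(w)
assert splice(hi, lo) == w
```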
• the ECC memory includes ECC-checked dynamic random access memory (DRAM) and ECC-checked static random access memory (SRAM);
  • the SRAM with ECC check uses 6T SRAM, and in other embodiments of the present application, 4T SRAM or 3T SRAM can also be used.
• the non-ECC memory includes non-ECC-checked DRAM and non-ECC-checked SRAM, and the non-ECC-checked SRAM uses 6T SRAM. In other embodiments of the present application, 4T SRAM or 3T SRAM may also be employed.
• the cell storing each bit in the 6T SRAM is composed of six MOS transistors (MOSFETs); the cell storing each bit in the 4T SRAM is composed of four MOS transistors; and the cell storing each bit in the 3T SRAM is composed of three MOS transistors.
• SRAMs that store neural network weights generally use 6T SRAM; although 6T SRAM has high stability, it has a large area and high read/write power consumption. The neural network algorithm has a certain fault tolerance that 6T SRAM cannot exploit. Therefore, in this embodiment, in order to fully exploit the fault tolerance of the neural network, 4T SRAM or 3T SRAM storage technology is used instead of 6T SRAM to increase SRAM storage density and reduce SRAM access power consumption, while the fault tolerance of the neural network algorithm is used to mask the weaker noise immunity of 4T SRAM.
  • FIG. 2B is a schematic structural diagram of a 4T SRAM memory cell according to an embodiment of the present application.
• the 4T SRAM memory cell is composed of four NMOS transistors: M1 (first MOS transistor), M2 (second MOS transistor), M3 (third MOS transistor), and M4 (fourth MOS transistor).
  • M1 and M2 are used for gating, and M3 and M4 are used for storage.
• the gate of M1 is electrically connected to the word line WL (Word Line) and its source to the bit line BL (Bit Line); the gate of M2 is electrically connected to the word line WL and its source to the bit line BLB; the connections of the gates of M3 and M4 form the cross-coupled storage structure shown in FIG. 2B.
• WL is used to control gated access to the memory cell, and BL is used to read and write the memory cell. When a read operation is performed, WL is pulled high and the bit is read from BL. When a write operation is performed, WL is pulled high and BL is pulled high or low; since the driving capability of BL is stronger than that of the memory cell, the original state is forcibly overwritten.
  • FIG. 2C is a schematic structural diagram of a 3T SRAM memory cell according to an embodiment of the present application.
• the 3T SRAM memory cell is composed of three MOS transistors: M1 (first MOS transistor), M2 (second MOS transistor), and M3 (third MOS transistor). M1 is used for gating, and M2 and M3 are used for storage.
• the gate of M1 is electrically connected to the word line WL (Word Line) and its source to the bit line BL (Bit Line); the gate of M2 is connected to the source of M3 and, through the resistor R2, to the operating voltage Vdd, and the drain of M2 is grounded; the gate of M3 is connected to the source of M2 and the drain of M1 and, through the resistor R1, to the operating voltage Vdd, and the drain of M3 is grounded.
• WL is used to control gated access to the memory cell, and BL is used to read and write the memory cell. When a read operation is performed, WL is pulled high and the bit is read from BL. When a write operation is performed, WL is pulled high and BL is pulled high or low; since the driving capability of BL is stronger than that of the memory cell, the original state is forcibly overwritten.
• the storage device of the present application adopts an approximate storage technology that fully exploits the fault tolerance of the neural network: the neural network parameters are stored approximately, with the important bits in the parameters stored accurately and the unimportant bits stored inaccurately, thereby reducing both storage overhead and memory-access energy costs.
  • FIG. 2D is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application.
  • the apparatus includes: an inaccurate arithmetic unit, an instruction control unit, and the hierarchical storage device described above.
• the tiered storage device receives the instruction and the operation parameters, stores the instruction and the important bits of the operation parameters in the precise storage unit, and stores the non-significant bits of the operation parameters in the inexact storage unit.
  • the instruction control unit receives the instruction in the tiered storage device and decodes the instruction to generate control information to control the inexact operation unit to perform the calculation operation.
  • the inaccurate operation unit receives the operation parameters in the tiered storage device, performs operations according to the control information, and transmits the operation results to the tiered storage device for storage or output.
  • the inaccurate arithmetic unit is a neural network processor.
  • the operation parameter is a neural network parameter
• the tiered storage device is used to store the neurons, weights, and instructions of the neural network: the important bits of the neurons, the important bits of the weights, and the instructions are stored in the precise storage unit, while the non-significant bits of the neurons and the non-significant bits of the weights are stored in the inexact storage unit.
  • the inexact computing unit receives the input neurons and weights in the tiered storage device, completes the neural network operation according to the control information to obtain the output neurons, and retransmits the output neurons to the tiered storage device for storage or output.
• the inaccurate operation unit may have two computation modes: (1) the inexact operation unit directly receives the important bits of the input neurons and the important bits of the weights from the precise storage unit of the tiered storage device for calculation; (2) the inexact operation unit receives complete input neurons and weights spliced from the important bits and the non-significant bits, where the important bits and non-significant bits of the input neurons and weights are spliced when read from the storage units.
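• A hedged sketch of the two computation modes, reusing the same hypothetical 8-bit split (m = 8, x = 3) on unsigned words; in the real apparatus the splicing happens inside the storage unit when the bits are read:

```python
def splice(hi, lo, m=8, x=3):
    # reassemble a complete word from important and non-significant bits
    return (hi << (m - 1 - x)) | lo

def mode1_multiply(n_hi, w_hi, m=8, x=3):
    # Mode (1): operate directly on the important bits from the precise
    # storage unit, treating the discarded low bits as zero (approximate).
    shift = m - 1 - x
    return (n_hi << shift) * (w_hi << shift)

def mode2_multiply(n_hi, n_lo, w_hi, w_lo, m=8, x=3):
    # Mode (2): splice the complete input neuron and weight, then operate.
    return splice(n_hi, n_lo, m, x) * splice(w_hi, w_lo, m, x)

print(mode1_multiply(0b1011, 0b0100))                    # approximate product
print(mode2_multiply(0b1011, 0b0110, 0b0100, 0b0011))    # exact product
```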
• in an embodiment, the data processing apparatus further includes a pre-processing module for pre-processing the input original data and transmitting it to the storage device; the pre-processing includes segmentation, Gaussian filtering, binarization, regularization, normalization, and so on.
• in an embodiment, the data processing apparatus further includes an instruction cache, an input neuron hierarchical cache, a weight hierarchical cache, and an output neuron hierarchical cache. The instruction cache is disposed between the hierarchical storage device and the instruction control unit for storing dedicated instructions. The input neuron hierarchical cache is disposed between the storage device and the inexact computing unit for buffering the input neurons, and includes an input neuron exact cache and an input neuron inexact cache, which respectively cache the important bits and non-significant bits of the input neurons. The weight hierarchical cache is disposed between the storage device and the inexact computing unit for buffering the weight data, and includes a weight exact cache and a weight inexact cache, which respectively cache the important bits and non-significant bits of the weights. The output neuron hierarchical cache is disposed between the storage device and the inexact computing unit for buffering the output neurons, and includes an output neuron exact cache and an output neuron inexact cache, which respectively cache the important bits and non-significant bits of the output neurons.
• the data processing apparatus further includes a direct memory access (DMA) unit for reading and writing data or instructions between the storage device, the instruction cache, the weight hierarchical cache, the input neuron hierarchical cache, and the output neuron hierarchical cache.
• the instruction cache, input neuron hierarchical cache, weight hierarchical cache, and output neuron hierarchical cache all use 4T SRAM or 3T SRAM.
• the inaccurate arithmetic unit includes, but is not limited to, three parts: a first-part multiplier, a second-part addition tree, and a third-part activation function unit.
• the first part multiplies input data 1 (in1) by input data 2 (in2) to obtain the multiplied output (out), the process being: out = in1 * in2; the second part adds the input data (in1) step by step through an addition tree to obtain the output data (out), where in1 is a vector of length N and N is greater than 1, the process being: out = in1[1] + in1[2] + ... + in1[N]; alternatively, the input data (in1) is accumulated through the addition tree and then added to the input data (in2) to obtain the output data (out), the process being: out = in1[1] + in1[2] + ... + in1[N] + in2.
• the third part passes the input data (in) through an activation function (active) to obtain the activation output data (out), the process being: out = active(in); in addition to the activation operation, the third part can pass the input data (in) through other nonlinear functions (f) to obtain the output data (out), the process being: out = f(in).
• for example, a pooling operation: out = pool(in), where pool is the pooling operation, including but not limited to average pooling, maximum pooling, and median pooling; the input data in is the data in a pooling kernel associated with the output out.
  • the non-precise operation unit performs operations including several parts.
  • the first part is to multiply the input data 1 and the input data 2 to obtain the multiplied data;
• the second part performs the addition tree operation, passing input data 1 through the addition tree step by step, or adding input data 1 step by step through the addition tree and then adding it to input data 2 to obtain the output data;
• the third part performs the activation function operation, passing the input data through an activation function (active) to obtain the output data.
  • the operations of the above parts can be freely combined to realize the operation of various functions.
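• The three parts can be sketched as composable Python functions, assuming a sigmoid activation purely for illustration; the function names are hypothetical:

```python
import math

def part1_multiply(in1, in2):
    # first part: out = in1 * in2 (element-wise)
    return [a * b for a, b in zip(in1, in2)]

def part2_addition_tree(in1, in2=None):
    # second part: accumulate in1 step by step, optionally adding in2
    total = sum(in1)
    return total + in2 if in2 is not None else total

def part3_activate(x, active=lambda v: 1.0 / (1.0 + math.exp(-v))):
    # third part: out = active(in), here a sigmoid by default
    return active(x)

# freely combining the parts yields e.g. a neuron: active(sum(w * x) + bias)
weights, inputs, bias = [0.5, -1.0], [2.0, 1.0], 0.1
print(part3_activate(part2_addition_tree(part1_multiply(weights, inputs), bias)))
```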
  • the data processing device of the present application can fully utilize the approximate storage technology, fully exploit the fault tolerance capability of the neural network, reduce the computational load of the neural network and the amount of neural network access, thereby reducing computational energy consumption and memory consumption.
• By adopting dedicated SIMD instructions for multi-layer artificial neural network operations and a customized operation unit, the problems of insufficient CPU and GPU operational performance and high front-end decoding overhead are solved, and support for the multi-layer artificial neural network operation algorithm is effectively improved;
• By adopting dedicated on-chip inexact-storage caches for the multi-layer artificial neural network operation algorithm, the importance of the input neuron and weight data is fully exploited, repeated reads of these data from memory are avoided, the memory access bandwidth is reduced, and memory bandwidth is prevented from becoming a performance bottleneck for multi-layer artificial neural network operations and their training algorithms.
• in other embodiments, the data processing apparatus may include a non-neural-network processor, such as a general-purpose arithmetic processor, which has corresponding general-purpose arithmetic instructions and data, for example scalar arithmetic operations and scalar logic operations; the general-purpose arithmetic processor includes, but is not limited to, one or more multipliers and one or more adders, and performs basic operations such as addition and multiplication.
• FIG. 2F is a flowchart of a data storage method according to an embodiment of the present application, including the following steps:
• S601: accurately store the important bits in the data.
• S602: inexactly store the non-significant bits in the data.
• specifically, the data storage method includes the following steps: the important bits in the data are stored in ECC memory for accurate storage, and the non-significant bits in the data are stored in non-ECC memory for inexact storage.
• the stored data is neural network parameters, and the bits representing a neural network parameter are divided into important bits and non-significant bits. For example, a parameter of a neural network has a total of m bits, of which n bits are important bits and (m - n) bits are non-significant bits, where m is an integer greater than 0, and n is an integer greater than 0 and less than or equal to m.
• the neural network parameters include input neurons, weights, and output neurons; the important bits of the input neurons, the important bits of the output neurons, and the important bits of the weights are stored accurately, while the non-significant bits of the input neurons, the non-significant bits of the output neurons, and the non-significant bits of the weights are stored inexactly.
• the data includes floating-point data and fixed-point data, wherein the sign bit and the exponent portion of the floating-point data are defined as important bits and the mantissa portion as non-significant bits; for fixed-point data, the sign bit and the first x bits of the value portion are important bits and the remaining bits of the value portion are non-significant bits, where x is an integer greater than or equal to 0 and less than m, and m is the total number of bits of the parameter.
• the ECC memory includes ECC-checked SRAM and ECC-checked DRAM; the non-ECC memory includes non-ECC-checked SRAM and non-ECC-checked DRAM; both the ECC-checked SRAM and the non-ECC-checked SRAM use 6T SRAM, and in other embodiments of the present application, 4T SRAM or 3T SRAM can also be used.
  • FIG. 2G is a flowchart of a data processing method according to an embodiment of the present application. As shown in FIG. 2G, the method includes:
• S1: receiving an instruction and parameters, accurately storing the instruction and the important bits of the parameters, and inaccurately storing the non-significant bits of the parameters;
• S2: decoding the stored instruction to generate control information;
• S3: receiving the parameters, performing operations according to the control information, and storing the operation results.
  • the above operation is a neural network operation, and the parameters are neural network parameters, including input neurons, weights, and output neurons.
  • Step S3 further includes: receiving the input neuron and the weight, completing the neural network operation according to the control information to obtain the output neuron, and storing or outputting the output neuron.
• the receiving the input neuron and the weight and completing the neural network operation according to the control information to obtain the output neuron includes: receiving the important bits of the input neuron and the important bits of the weight for calculation; or receiving complete input neurons and weights spliced from the important bits and the non-significant bits for calculation.
• the method further includes the following steps: caching dedicated instructions; performing exact and inexact caching of the input neurons; performing exact and inexact caching of the weight data; and performing exact and inexact caching of the output neurons.
• before step S1, the method further includes: pre-processing the parameters.
• a further embodiment of the present application is directed to a storage unit, which is a 4T SRAM or a 3T SRAM, for storing neural network parameters; the specific structure of the 4T SRAM is as shown in FIG. 2B, and the specific structure of the 3T SRAM is as shown in FIG. 2C, and will not be described here.
  • FIG. 3A is a schematic structural diagram of a dynamic voltage regulation and frequency modulation apparatus 100 according to an embodiment of the present application. As shown in FIG. 3A, the dynamic voltage regulation and frequency modulation apparatus 100 includes:
• the information collecting unit 101 is configured to collect, in real time, the working state information or application scenario information of the chip connected to the dynamic voltage regulation and frequency modulation apparatus, where the application scenario information is information obtained through the neural network or collected by sensors connected to the chip;
  • the voltage-adjusting and frequency-modulating unit 102 is configured to send voltage frequency regulation information to the chip according to the working state information or the application scenario information of the chip, where the voltage frequency regulation information is used to instruct the chip to adjust its working voltage or operating frequency.
  • the working state information of the chip includes an operating speed of the chip
  • the voltage frequency regulation information includes first voltage frequency regulation information
• the voltage regulating and frequency modulation unit 102 is configured to: send the first voltage frequency regulation information to the chip when the running speed of the chip is greater than a target speed, where the first voltage frequency regulation information is used to instruct the chip to reduce its operating voltage or operating frequency, and the target speed is the running speed of the chip when the user's demand is met.
  • the information collecting unit 101 collects the running speed of the chip connected thereto in real time.
• the running speed of the chip can be a different type of speed depending on the task performed by the chip: when the operation performed by the chip is video image processing, the running speed of the chip may be the frame rate of the video image processing; when the operation performed by the chip is voice recognition, the running speed of the chip is the speed at which voice information is recognized.
• when the voltage regulating and frequency modulation unit 102 determines that the running speed of the chip is greater than the target speed, that is, the chip already meets the user's demand, it sends the first voltage frequency regulation information to the chip to instruct the chip to reduce its operating voltage or operating frequency, thereby reducing the power consumption of the chip.
  • the operation performed by the above chip is video image processing, and the above target speed is 24 frames/second.
  • the information collecting unit collects the frame rate of the video image processing by the chip in real time, and the current frame rate of the video image processing by the chip is 54 frames/second.
• the voltage regulating and frequency modulation unit determines that the current frame rate of the video image processing of the chip is greater than the target speed, and sends the first voltage frequency regulation information to the chip to instruct the chip to reduce its operating voltage or operating frequency, thereby reducing the power consumption of the chip.
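• A minimal sketch of this rule, assuming an illustrative proportional scaling step; the target value matches the example above, but the regulation policy itself is an assumption:

```python
TARGET_FPS = 24.0  # target speed from the example above

def regulate(current_fps, voltage, frequency, step=0.05):
    """If the collected frame rate exceeds the target, emit lowered
    voltage/frequency settings (the first voltage frequency regulation
    information); otherwise leave the settings unchanged."""
    if current_fps > TARGET_FPS:
        return voltage * (1 - step), frequency * (1 - step)
    return voltage, frequency

print(regulate(54.0, 0.9, 1.0e9))  # 54 fps > 24 fps target -> scale down
```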
• in an embodiment, the chip includes at least a first unit and a second unit, the output data of the first unit is the input data of the second unit, the working state information of the chip includes the operating speed of the first unit and the operating speed of the second unit, and the voltage frequency regulation information includes second voltage frequency regulation information; the voltage regulating and frequency modulation unit 102 is further configured to: send the second voltage frequency regulation information to the second unit according to the operating speed of the first unit and the operating speed of the second unit, where the second voltage frequency regulation information is used to instruct the second unit to reduce its operating frequency or operating voltage.
  • the chip performing the task requires the cooperation of the first unit and the second unit, and the output data of the first unit is the input data of the second unit.
  • the information collecting unit 101 collects the operating speeds of the first unit and the second unit in real time.
• when the running speed of the second unit exceeds that of the first unit, so that the second unit must wait for the data output by the first unit, the voltage regulating and frequency modulation unit 102 sends the second voltage frequency regulation information to the second unit, instructing the second unit to lower its operating voltage or operating frequency, so as to reduce the power consumption of the whole chip without affecting its overall running speed.
• in an embodiment, the voltage frequency regulation information includes third voltage frequency regulation information, and the voltage regulating and frequency modulation unit 102 is further configured to: send the third voltage frequency regulation information to the first unit according to the running speeds of the first unit and the second unit, where the third voltage frequency regulation information is used to instruct the first unit to reduce its operating frequency or operating voltage.
• in an embodiment, the chip includes at least N units, and the working state information of the chip includes the working state information of at least S units among the at least N units, where N is an integer greater than 1 and S is an integer less than or equal to N.
  • the unit A is any one of the at least S units.
• the voltage frequency regulation information includes fourth voltage frequency regulation information and fifth voltage frequency regulation information, and the voltage regulating and frequency modulation unit 102 is further configured to:
  • the information collecting unit 101 collects the working state information of at least S units in the chip in real time.
• according to the working state information of the unit A, the voltage regulating and frequency modulation unit 102 sends the fourth voltage frequency regulation information to the unit A to instruct the unit A to lower its operating frequency or operating voltage, so that the power consumption of the unit A is reduced; or the voltage regulating and frequency modulation unit 102 sends the fifth voltage frequency regulation information to the unit A to instruct the unit A to raise its operating frequency or operating voltage, so that the operating speed of the unit A meets the needs of the work.
  • the application scenario of the chip is image recognition
  • the application scenario information is the number of objects in the image to be identified
  • the voltage frequency regulation information includes sixth voltage frequency regulation information.
  • the voltage regulating and frequency modulation unit 102 is also used to:
• the chip is applied to image recognition, the number of objects in the image to be identified is obtained by a neural network algorithm, and the information collecting unit 101 acquires the number of objects in the image to be identified (i.e., the above application scenario information) from the chip. When the voltage regulating and frequency modulation unit 102 determines that the number of objects in the image to be identified is less than a first threshold, it sends the sixth voltage frequency regulation information to the chip to instruct the chip to lower its operating voltage or operating frequency; when it determines that the number of objects in the image to be identified is greater than the first threshold, it sends voltage frequency regulation information to the chip to instruct the chip to raise its operating voltage or operating frequency.
  • the application scenario information is object tag information
  • the voltage frequency regulation information includes seventh voltage frequency regulation information
  • the voltage regulation and frequency modulation unit 102 is further configured to:
  • the preset object tag set includes a plurality of object tags, and the object tags may be “person”, “dog”, “tree”, and “flower”.
• when the chip determines by a neural network algorithm that the current application scenario includes a dog, the chip transmits the object tag information including "dog" to the information collecting unit 101; when the voltage regulating and frequency modulation unit 102 determines that the object tag information includes "dog", it sends the seventh voltage frequency regulation information to the chip to instruct the chip to raise its operating voltage or operating frequency; when it determines that the object tag information does not belong to the preset object tag set, the voltage regulating and frequency modulation unit 102 sends voltage frequency regulation information to the chip to instruct the chip to reduce its operating voltage or operating frequency.
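• A hedged sketch of the object-tag rule, using the tag set from the example above; the returned strings merely stand in for the seventh voltage frequency regulation information and its voltage-lowering counterpart:

```python
PRESET_TAGS = {"person", "dog", "tree", "flower"}

def regulate_by_tags(recognized_tags):
    """Raise voltage/frequency if any recognized tag is in the preset set,
    otherwise lower them (illustrative stand-ins for the regulation info)."""
    if PRESET_TAGS & set(recognized_tags):
        return "raise voltage/frequency"   # seventh voltage frequency regulation
    return "lower voltage/frequency"

print(regulate_by_tags(["dog", "car"]))    # -> raise voltage/frequency
print(regulate_by_tags(["car"]))           # -> lower voltage/frequency
```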
  • the chip is applied to voice recognition
  • the application scenario information is a voice input rate
  • the voltage frequency regulation information includes eighth voltage frequency regulation information
• the voltage regulating and frequency modulation unit 102 is further configured to:
  • the application scenario of the chip is voice recognition, and the input unit of the chip inputs voice to the chip at a certain rate.
  • the information collecting unit 101 collects the voice input rate in real time, and sends the voice input rate information to the voltage regulating and frequency modulation unit 102.
• when the voltage regulating and frequency modulation unit 102 determines that the voice input rate is less than the second threshold, the eighth voltage frequency regulation information is sent to the chip to instruct the chip to lower its operating voltage or operating frequency; when the voltage regulating and frequency modulation unit 102 determines that the voice input rate is greater than the second threshold, voltage frequency regulation information instructing the chip to raise its operating voltage or operating frequency is sent to the chip.
  • the application scenario information is a keyword obtained by performing speech recognition on the chip
  • the voltage frequency regulation information includes ninth voltage frequency regulation information
• the voltage regulating and frequency modulation unit 102 is further configured to:
• when the keyword belongs to a preset keyword set, the ninth voltage frequency regulation information is sent to the chip, where the ninth voltage frequency regulation information is used to instruct the chip to raise its operating voltage or operating frequency; when the keyword does not belong to the preset keyword set, the voltage regulating and frequency modulation unit 102 sends voltage frequency regulation information to the chip to instruct the chip to lower its operating voltage or operating frequency.
  • the application scenario of the above chip is speech recognition
  • the preset keyword set includes keywords such as “image beauty”, “neural network algorithm”, “image processing” and “Alipay”.
• when the keyword obtained by speech recognition belongs to the preset keyword set, the voltage regulating and frequency modulation unit 102 sends the ninth voltage frequency regulation information to the chip to instruct the chip to raise its operating voltage or operating frequency; otherwise, the voltage regulating and frequency modulation unit 102 sends voltage frequency regulation information to the chip to instruct the chip to lower its operating voltage or operating frequency.
  • the chip is applied to machine translation
  • the application scenario information is a speed of text input or a number of characters in an image to be translated
  • the voltage frequency regulation information includes tenth voltage frequency regulation information.
  • the voltage regulating and frequency modulation unit is further configured to:
  • the chip is applied to the machine translation, and the application scene information collected by the information collection unit 101 is the speed of the text input or the number of characters in the image to be translated, and the application scenario information is transmitted to the voltage modulation and frequency modulation unit 102.
• when the speed of text input or the number of characters in the image to be translated is low, the voltage regulating and frequency modulation unit 102 sends the tenth voltage frequency regulation information to the chip to instruct the chip to reduce its operating voltage or operating frequency; otherwise, the voltage regulating and frequency modulation unit 102 sends voltage frequency regulation information to the chip to instruct the chip to raise its operating voltage or operating frequency.
  • the application scenario information is ambient light intensity
  • the voltage frequency regulation information includes eleventh voltage frequency regulation information
  • the voltage regulation and frequency modulation unit is further configured to:
• the eleventh voltage frequency regulation information is used to instruct the chip to reduce its operating voltage or operating frequency.
  • the illumination intensity of the external environment is acquired by an illumination sensor connected to the chip.
  • the information collecting unit 101 transmits the light intensity to the voltage-modulating and frequency-modulating unit 102.
• when determining that the illumination intensity is less than the fifth threshold, the voltage regulating and frequency modulation unit 102 transmits the eleventh voltage frequency regulation information to the chip to instruct the chip to lower its operating voltage or operating frequency; when determining that the illumination intensity is greater than the fifth threshold, the voltage regulating and frequency modulation unit 102 transmits voltage frequency regulation information to the chip to instruct the chip to raise its operating voltage or operating frequency.
  • the chip is applied to image beauty
• the voltage frequency regulation information includes twelfth voltage frequency regulation information and thirteenth voltage frequency regulation information, and the voltage regulating and frequency modulation unit is further configured to: when the application scenario information is a face image, send the twelfth voltage frequency regulation information to the chip, where the twelfth voltage frequency regulation information is used to instruct the chip to raise its operating voltage or operating frequency; when the application scenario information is not a face image, send the thirteenth voltage frequency regulation information to the chip to instruct the chip to lower its operating voltage or operating frequency.
  • the chip is applied to voice recognition, and the application scenario information is voice strength.
• when the voice strength is greater than the sixth threshold, the voltage regulating and frequency modulation unit 102 sends voltage frequency regulation information to the chip to instruct the chip to reduce its operating voltage or operating frequency; when the voice strength is less than the sixth threshold, the voltage regulating and frequency modulation unit 102 sends voltage frequency regulation information to the chip to instruct the chip to raise its operating voltage or operating frequency.
  • the foregoing scene information may be information of an external scene collected by the sensor, such as light intensity, voice intensity, and the like.
• the application scenario information may also be information calculated according to an artificial intelligence algorithm; for example, in an object recognition task, the real-time calculation result information of the chip is fed back to the information collecting unit, where the information includes the number of objects in the scene, face images, object tags, keywords, and the like.
  • the artificial intelligence algorithm described above includes, but is not limited to, a neural network algorithm.
• the dynamic voltage regulation and frequency modulation apparatus collects, in real time, the working state information of the connected chip and each of its internal units, or the application scenario information of the chip, and adjusts the operating frequency or operating voltage of the chip or its internal units according to this information, thereby reducing the overall operating power consumption of the chip.
  • FIG. 3B is a schematic diagram of a dynamic voltage regulation and frequency modulation application scenario according to an embodiment of the present application.
  • the convolution operation device includes a dynamic voltage regulation and frequency modulation device 210, and a chip 220 connected to the dynamic voltage regulation and frequency modulation device.
  • the chip 220 includes a control unit 221, a storage unit 222, and an operation unit 223.
  • the chip 220 described above can be used for tasks such as image processing, voice processing, and the like.
  • the dynamic voltage-modulating and frequency-modulating device 210 collects the working state information of the chip 220 in real time.
  • the operational status information of the chip 220 includes the operating speed of the chip 220, the operating speed of the control unit 221, the operating speed of the storage unit 222, and the operating speed of the computing unit 223.
• when the dynamic voltage regulation and frequency modulation device 210 determines, according to the running speed of the storage unit 222 and the running speed of the operation unit 223, that the running time of the storage unit 222 exceeds the running time of the operation unit 223, the dynamic voltage regulation and frequency modulation device 210 can determine that the storage unit 222 becomes a bottleneck during the execution of the task: after the operation unit 223 completes the current operation, it must wait until the storage unit 222 finishes its read task and transmits the read data to the operation unit 223 before the operation unit 223 can perform an operation based on that data.
• the dynamic voltage regulation and frequency modulation device 210 transmits the first voltage frequency regulation information to the operation unit 223, where the first voltage frequency regulation information is used to instruct the operation unit 223 to lower its operating voltage or operating frequency, reducing the operating speed of the operation unit 223 so that the overall operating power consumption of the chip 220 is reduced without affecting the completion time of the task.
• when the dynamic voltage regulation and frequency modulation device 210 determines, according to the running speed of the storage unit 222 and the running speed of the operation unit 223, that the running time of the storage unit 222 is lower than the running time of the operation unit 223, it can determine that the operation unit 223 becomes a bottleneck during the execution of the task: after the storage unit 222 completes a data read, it must wait for the operation unit 223 to complete the current operation before transferring the read data to the operation unit 223.
• the dynamic voltage regulation and frequency modulation device 210 sends the second voltage frequency regulation information to the storage unit 222, where the second voltage frequency regulation information is used to instruct the storage unit 222 to lower its operating voltage or operating frequency, reducing the operating speed of the storage unit 222 so that the overall operating power consumption of the chip 220 is reduced without affecting the completion time of the task.
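• A hedged sketch of this bottleneck-matching rule, assuming a simple proportional frequency adjustment; the actual regulation policy of the apparatus is not specified here:

```python
def match_speeds(mem_time, compute_time, mem_freq, compute_freq):
    """Throttle whichever unit finishes earlier so its speed matches the
    slower (bottleneck) unit; returns new (mem_freq, compute_freq)."""
    if mem_time > compute_time:
        # storage unit is the bottleneck: slow the operation unit down
        compute_freq *= compute_time / mem_time
    elif compute_time > mem_time:
        # operation unit is the bottleneck: slow the storage unit down
        mem_freq *= mem_time / compute_time
    return mem_freq, compute_freq

print(match_speeds(mem_time=2.0, compute_time=1.0,
                   mem_freq=1.0e9, compute_freq=1.5e9))
```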
  • the dynamic voltage modulation and frequency modulation device 210 acquires the running speed of the chip 220 in real time.
• when the running speed of the chip 220 exceeds a target running speed, where the target running speed is an operating speed that can meet the user's demand, the dynamic voltage regulation and frequency modulation device 210 sends the third voltage frequency regulation information to the chip 220, where the third voltage frequency regulation information is used to instruct the chip 220 to lower its operating voltage or operating frequency, thereby reducing the operating power consumption of the chip 220.
  • the chip 220 is used for video processing.
• the frame rate of video processing required by the user under normal conditions is not less than 30 frames per second. Assuming the actual video processing frame rate of the chip 220 is 100 frames per second, the dynamic voltage regulation and frequency modulation device 210 sends voltage frequency regulation information to the chip 220 to instruct the chip 220 to lower its operating voltage or operating frequency, reducing the frame rate of video processing to about 30 frames per second.
  • the dynamic voltage modulation and frequency modulation device 210 monitors the working states of each unit (including the control unit 221, the storage unit 222, and the operation unit 223) in the chip 220 in real time.
• when the working state of a unit indicates that its running speed exceeds what the task requires, the fourth voltage frequency regulation information is sent to that unit to instruct it to lower its operating voltage or operating frequency, thereby reducing the power consumption of the chip 220; when the running speed of a unit fails to meet the working requirement, the dynamic voltage regulation and frequency modulation device 210 sends the fifth voltage frequency regulation information to that unit to raise its operating voltage or operating frequency, so that its running speed meets the working requirement. It can be seen that, in the solution of this embodiment of the application, the dynamic voltage regulation and frequency modulation device 210 acquires, in real time, the running speed information of the chip and each unit therein, and lowers the operating frequency or operating voltage of the chip or its internal units according to this information, thereby reducing the overall operating power consumption of the chip.
  • FIG. 3C is a schematic diagram of another dynamic voltage regulation and frequency modulation application scenario according to an embodiment of the present application.
• the convolution operation device includes a dynamic voltage regulation and frequency modulation device 317, a register unit 312, an interconnection module 313, an operation unit 314, a control unit 315, and a data access unit 316.
  • the operation unit 314 includes at least two of an addition calculator, a multiplication calculator, a comparator, and an activation operator.
  • the interconnecting module 313 is configured to control the connection relationship of the calculators in the computing unit 314 such that the at least two types of calculators form different computing topologies.
• the register unit 312 (which may be a register file, an instruction cache, or a scratch pad memory) is configured to store the operation instruction, the address of the data block in the storage medium, and the calculation topology corresponding to the operation instruction.
  • the convolution operation device further includes a storage medium 311.
  • the storage medium 311 may be an off-chip memory. Of course, in an actual application, it may also be an on-chip memory for storing data blocks.
• the control unit 315 is configured to extract an operation instruction, the operation domain corresponding to the operation instruction, and the first calculation topology corresponding to the operation instruction from the register unit 312, decode the operation instruction into an execution instruction used to control the operation unit 314 to perform the operation, transmit the operation domain to the data access unit 316, and transmit the first calculation topology to the interconnection module 313.
  • the data access unit 316 is configured to extract a data block corresponding to the operation domain from the storage medium 311, and transmit the data block to the interconnection module 313.
• the interconnection module 313 is configured to receive the data block and the first calculation topology, and to rearrange the data block according to the first calculation topology.
• the operation unit 314 is configured so that the execution instruction calls the calculators of the operation unit 314 to perform operations on the data block to obtain an operation result, which is transmitted to the data access unit 316 and stored in the storage medium 311. In an embodiment, the operation unit 314 is further configured to perform operations on the rearranged data block according to the first calculation topology and the execution instruction to obtain the operation result, and to transmit the operation result to the data access unit 316 for storage in the storage medium 311.
• the interconnection module 313 is further configured to form the first calculation topology by controlling the connection relationship of the calculators in the operation unit 314.
  • the dynamic voltage regulation and frequency modulation device 317 is configured to monitor the working state of the entire convolution operation device and dynamically adjust its voltage and frequency.
  • the specific calculation method of the convolution operation device is described below by using different operation instructions.
• the operation instruction here is exemplified by a convolution calculation instruction, which can be applied to a neural network, so the convolution calculation instruction can also be called a convolutional neural network instruction.
• the formula that it actually needs to execute can be: S = s(Σ W × xi + b), in which the convolution kernel W (which may include a plurality of data) is multiplied by the input data xi and the products are summed, then the bias b is optionally added, and the activation operation s(h) is then optionally performed to obtain the final output result S.
• from this formula, the calculation topology can be obtained as: multiplier - adder - (optional) activation operator.
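• A worked sketch of the formula S = s(Σ W × xi + b), assuming a sigmoid for the optional activation s; the function is hypothetical and mirrors the multiplier - adder - activation pipeline:

```python
import math

def conv_instruction(kernel, window, b=0.0, activation=None):
    h = sum(w * x for w, x in zip(kernel, window))  # multiplier + addition tree
    h += b                                          # optional bias addition
    return activation(h) if activation else h      # optional activation s(h)

sigmoid = lambda v: 1.0 / (1.0 + math.exp(-v))
print(conv_instruction([0.2, -0.5, 0.3], [1.0, 2.0, 3.0],
                       b=0.1, activation=sigmoid))  # ~0.5498
```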
  • the convolution calculation instruction may include an instruction set including a convolutional neural network COMPUTE instruction having different functions, and a CONFIG instruction, an IO instruction, a NOP instruction, a JUMP instruction, and a MOVE instruction.
  • the COMPUTE instruction includes:
• a convolution operation instruction, according to which the convolution operation device extracts input data of a specified size and a convolution kernel from a specified address of a memory (preferably a scratch pad memory or a scalar register file), and performs a convolution operation in the convolution operation unit;
• a convolutional neural network sigmoid instruction, according to which the convolution operation device extracts input data of a specified size and a convolution kernel from a specified address of a memory (preferably a scratch pad memory or a scalar register file), performs a convolution operation in the convolution operation unit, and then applies sigmoid activation to the output;
• a convolutional neural network TanH instruction, according to which the convolution operation device extracts input data of a specified size and a convolution kernel from a specified address of a memory (preferably a scratch pad memory), performs a convolution operation in the convolution operation unit, and then applies TanH activation to the output;
• a convolutional neural network ReLU instruction, according to which the convolution operation device extracts input data of a specified size and a convolution kernel from a specified address of a memory (preferably a scratch pad memory), performs a convolution operation in the convolution operation unit, and then applies ReLU activation to the output;
• a convolutional neural network group instruction, according to which the convolution operation device extracts input data of a specified size and a convolution kernel from a specified address of a memory (preferably a scratch pad memory), divides them into groups, performs a convolution operation in the convolution operation unit, and then activates the output.
  • the CONFIG command is used to configure the various constants required for the current layer calculation before each layer of artificial neural network calculation begins.
  • the IO instruction is used to read the input data required for calculation from the external storage space and store the data back to the external space after the calculation is completed.
• the NOP instruction is used to clear the control signals in all control signal buffer queues of the current convolution operation device, ensuring that all instructions before the NOP instruction are completely executed; the NOP instruction itself does not contain any operation;
  • the JUMP instruction is used to control the jump of the next instruction address to be read from the instruction storage unit, and is used to implement the jump of the control flow;
• the MOVE instruction is used to move data at one address in the internal address space of the convolution operation device to another address in the internal address space; the process is independent of the operation unit and does not occupy resources of the operation unit during execution.
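• A behavioral sketch of dispatching this instruction set, with every handler reduced to a hypothetical stub; the real device decodes these instructions into hardware control signals rather than Python calls:

```python
def run(program):
    pc = 0
    while pc < len(program):
        op, *args = program[pc]
        if op == "JUMP":              # control-flow jump to a new address
            pc = args[0]
            continue
        elif op == "NOP":             # drains control-signal queues; no operation
            pass
        elif op == "CONFIG":          # set the constants for the current layer
            print("configure:", args)
        elif op in ("IO", "MOVE", "COMPUTE"):
            print(op, args)           # stand-ins for memory and compute behavior
        pc += 1

run([("CONFIG", {"layer": 0}), ("IO", "load"), ("COMPUTE", "conv"),
     ("NOP",), ("JUMP", 6)])
```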
  • the method for executing the convolution calculation instruction by the convolution operation device may specifically be:
• the control unit 315 extracts the convolution calculation instruction, the operation domain corresponding to the convolution calculation instruction, and the first calculation topology corresponding to the convolution calculation instruction (multiplier - adder - adder - activation operator) from the register unit 312, transmits the operation domain to the data access unit, and transmits the first calculation topology to the interconnection module.
• the data access unit 316 extracts the convolution kernel w and the offset b corresponding to the operation domain from the storage medium 311 (when b is 0, the offset b does not need to be extracted), and transmits the convolution kernel w and the offset b to the operation unit 314.
• the multiplier of the operation unit 314 multiplies the convolution kernel w by the input data Xi to obtain a first result, the first result is input to the adder and accumulated to obtain a second result, the second result is added to the offset b to obtain a third result, the third result is input to the activation operator to perform the activation operation to obtain the output result s, and the output result s is transmitted to the data access unit for storage in the storage medium.
• after any of the above steps, the output can be directly transferred to the data access unit and stored in the storage medium without performing the subsequent steps.
  • the step of performing the addition of the second result and the offset b to obtain the third result is optional, that is, when b is 0, this step is not required.
  • the order of addition and multiplication operations can be reversed.
  • the first result may include a result of a plurality of multiplication operations.
  • an embodiment of the present application provides a neural network processor including the above convolution operation device.
  • the above neural network processor is used to perform artificial neural network operations, and implements artificial intelligence applications such as speech recognition, image recognition, and translation.
• Case 1: in the process of performing the convolution operation, the dynamic voltage regulation and frequency modulation device 317 acquires the running speeds of the data access unit 316 and the operation unit 314 of the neural network processor in real time. When the dynamic voltage regulation and frequency modulation device 317 determines, according to these running speeds, that the running time of the data access unit 316 exceeds the running time of the operation unit 314, it can determine that the data access unit 316 becomes a bottleneck during the convolution operation: after completing the current convolution operation, the operation unit 314 must wait for the data access unit 316 to finish its read task and transfer the read data before it can perform the next convolution operation.
• the dynamic voltage regulation and frequency modulation device 317 sends the first voltage frequency regulation information to the operation unit 314, where the first voltage frequency regulation information is used to instruct the operation unit 314 to lower its operating voltage or operating frequency, reducing the operating speed of the operation unit 314 so that it matches the running speed of the data access unit 316; this reduces the power consumption of the operation unit 314, prevents the operation unit 314 from idling, and finally reduces the overall operating power consumption of the neural network processor without affecting the completion time of the task.
• Case 2: the dynamic voltage regulation and frequency modulation device 317 acquires the running speeds of the data access unit 316 and the operation unit 314 of the neural network processor in real time. When the dynamic voltage regulation and frequency modulation device 317 determines, according to these running speeds, that the running time of the operation unit 314 exceeds the running time of the data access unit 316, it can determine that the operation unit 314 becomes a bottleneck during the convolution operation: after completing the current data read, the data access unit 316 must wait for the operation unit 314 to complete the current convolution operation before transferring the read data.
• the dynamic voltage regulation and frequency modulation device 317 sends the second voltage frequency regulation information to the data access unit 316, where the second voltage frequency regulation information is used to instruct the data access unit 316 to lower its operating voltage or operating frequency, reducing the running speed of the data access unit 316 so that it matches the running speed of the operation unit 314; this reduces the power consumption of the data access unit 316, prevents the data access unit 316 from idling, and finally reduces the overall operating power consumption of the neural network processor without affecting the completion time of the task.
• Case 3: the dynamic voltage regulation and frequency modulation device 317 collects, in real time, the working parameters of the artificial intelligence application run by the neural network processor, and adjusts the operating voltage or operating frequency of the neural network processor according to the working parameters.
  • the artificial intelligence application may be video image processing, object recognition, machine translation, voice recognition, image beauty, and the like.
• for example, when the neural network processor performs video image processing, the dynamic voltage regulation and frequency modulation device 317 collects the frame rate of the video image processing in real time. When the frame rate exceeds a target frame rate, where the target frame rate is the video image processing frame rate normally required by the user, the dynamic voltage regulation and frequency modulation device 317 sends the third voltage frequency regulation information to the neural network processor, where the third voltage frequency regulation information is used to instruct the neural network processor to lower its operating voltage or operating frequency, reducing the power consumption of the neural network processor while still satisfying the user's normal video image processing requirements.
• Case 4: when the neural network processor performs speech recognition, the dynamic voltage regulation and frequency modulation device 317 collects the speech recognition speed of the neural network processor in real time. When the speech recognition speed of the neural network processor exceeds the speech recognition speed actually required by the user, the dynamic voltage regulation and frequency modulation device 317 sends the fourth voltage frequency regulation information to the neural network processor, where the fourth voltage frequency regulation information is used to instruct the neural network processor to lower its operating voltage or operating frequency, reducing the power consumption of the neural network processor while satisfying the user's normal speech recognition requirements.
• the dynamic voltage regulation and frequency modulation device 317 monitors, in real time, the working state of each unit or module in the neural network processor (including the storage medium 311, the register unit 312, the interconnection module 313, the operation unit 314, the control unit 315, and the data access unit 316).
• when the working state of a unit or module indicates that its running speed exceeds what the task requires, the dynamic voltage regulation and frequency modulation device 317 sends the fifth voltage frequency regulation information to that unit or module to lower its operating voltage or operating frequency, further reducing the power consumption of that unit or module; when the running speed of a unit or module fails to meet the working requirement, the dynamic voltage regulation and frequency modulation device 317 sends the sixth voltage frequency regulation information to that unit or module to raise its operating voltage or operating frequency, so that its running speed meets the needs of the work.
  • FIG. 3D is a schematic diagram of another dynamic voltage-modulation frequency modulation application scenario provided by an embodiment of the present application.
  • the convolution operation device includes a dynamic voltage-modulating frequency modulation device 7, an instruction storage unit 1, a controller unit 2, a data access unit 3, an interconnection module 4, a main operation module 5, and a plurality of slave operation modules 6.
• the instruction storage unit 1, the controller unit 2, the data access unit 3, the interconnection module 4, the main operation module 5, and the slave operation modules 6 may all be implemented by hardware circuits (including but not limited to FPGAs, CGRAs, application-specific integrated circuits (ASICs), analog circuits, memristors, and the like).
  • the instruction storage unit 1 reads in an instruction through the data access unit 3 and stores the read instruction.
  • the controller unit 2 reads an instruction from the instruction storage unit 1, translates the instruction into a control signal that controls the behavior of other modules, and transmits it to other modules such as the data access unit 3, the main operation module 5, and the slave operation module 6.
  • the data access unit 3 can access the external address space, directly read and write data to and from the respective memory cells inside the convolution operation device, and complete data loading and storage.
  • the interconnect module 4 is used to connect the main operation module and the slave operation module, and can be implemented into different interconnection topologies (such as a tree structure, a ring structure, a grid structure, a hierarchical interconnection, a bus structure, etc.).
• the dynamic voltage regulation and frequency modulation device 7 is configured to acquire the working state information of the data access unit 3 and the main operation module 5 in real time, and to adjust the operating voltage or operating frequency of the data access unit 3 and the main operation module 5 according to their working state information.
  • an embodiment of the present invention provides a neural network processor including the above convolution operation device.
  • the above neural network processor is used to perform artificial neural network operations, and implements artificial intelligence applications such as speech recognition, image recognition, and translation.
  • the dynamic voltage modulation and frequency modulation device 7 works as follows:
• Case 1: While the neural network processor performs the convolution operation, the dynamic voltage regulation and frequency modulation device 7 acquires the running speeds of its data access unit 3 and main operation module 5 in real time.
• When the dynamic voltage regulation and frequency modulation device 7 determines from these running speeds that the running time of the data access unit 3 exceeds the running time of the main operation module 5, it can conclude that the data access unit 3 has become the bottleneck of the convolution operation. At this time, the main operation module 5 can only perform the convolution operation after the data access unit 3 has transmitted the data to it.
• The dynamic voltage regulation and frequency modulation device 7 sends first voltage frequency regulation information to the main operation module 5, instructing it to lower its working voltage or operating frequency so as to reduce its running speed. The running speed of the main operation module 5 is thereby matched to that of the data access unit 3, which reduces the power consumption of the main operation module 5 and avoids leaving it idle; in the end, the overall operating power consumption of the neural network processor is reduced without affecting the completion time of the task.
• Case 2: In the process of performing the convolution operation, the dynamic voltage regulation and frequency modulation device 7 acquires the running speeds of the data access unit 3 and the main operation module 5 of the neural network processor in real time. When the device 7 determines from these running speeds that the running time of the main operation module 5 exceeds the running time of the data access unit 3, it can conclude that the main operation module 5 has become the bottleneck of the convolution operation.
• The dynamic voltage regulation and frequency modulation device 7 then sends second voltage frequency regulation information to the data access unit 3, instructing it to lower its working voltage or operating frequency so as to reduce its running speed. The running speed of the data access unit 3 is thereby matched to that of the main operation module 5, which reduces the power consumption of the data access unit 3 and avoids leaving it idle; in the end, the overall operating power consumption of the neural network processor is reduced without affecting the completion time of the task.
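• Cases 1 and 2 together implement a speed-matching rule between a producer and a consumer. A minimal sketch, assuming measured running times per workload slice and a regulate() callback (both assumptions for illustration):

```python
# Speed matching between the data access unit 3 and the main operation
# module 5: whichever unit finishes first is slowed down so that neither
# idles while waiting for the other.
def match_speeds(t_data_access, t_main_op, regulate):
    if t_data_access > t_main_op:
        # Case 1: data access is the bottleneck; slow the main operation
        # module (first voltage frequency regulation information).
        regulate("main_operation_module", "down")
    elif t_main_op > t_data_access:
        # Case 2: the main operation module is the bottleneck; slow the data
        # access unit (second voltage frequency regulation information).
        regulate("data_access_unit", "down")
```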
• Case 3: The dynamic voltage regulation and frequency modulation device 7 collects in real time the working parameters of the neural network processor while it runs an artificial intelligence application, and adjusts the working voltage or operating frequency of the neural network processor according to those working parameters.
  • the artificial intelligence application may be video image processing, object recognition, machine translation, voice recognition, image beauty, and the like.
• For example, when the neural network processor performs video image processing, the dynamic voltage regulation and frequency modulation device 7 collects the frame rate of that video image processing in real time. When the frame rate exceeds the target frame rate, where the target frame rate is the video image processing frame rate normally required by the user, the device 7 sends third voltage frequency regulation information to the neural network processor. The third voltage frequency regulation information instructs the neural network processor to reduce its working voltage or operating frequency, reducing its power consumption while still satisfying the user's normal video image processing requirements.
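• As a concrete reading of Case 3 for video image processing, the sketch below scales down whenever the measured frame rate exceeds the user's target; the 30 fps target value is an assumption for illustration, not a value from this disclosure.

```python
TARGET_FPS = 30.0  # assumed frame rate normally required by the user

def regulate_by_frame_rate(measured_fps, send_regulation_info):
    # Third voltage frequency regulation information: reduce working voltage
    # or operating frequency once throughput exceeds what the user needs.
    if measured_fps > TARGET_FPS:
        send_regulation_info("down")
```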
• Case 4: When the neural network processor performs speech recognition, the dynamic voltage regulation and frequency modulation device 7 collects the speech recognition speed of the neural network processor in real time. When that speed exceeds the speech recognition speed actually required by the user, the device 7 sends fourth voltage frequency regulation information to the neural network processor, instructing it to reduce its working voltage or operating frequency, which reduces the power consumption of the neural network processor while still satisfying the user's normal speech recognition requirements.
• The dynamic voltage regulation and frequency modulation device 7 monitors and acquires in real time the working state information of each unit or module in the above neural network processor, including the instruction storage unit 1, the controller unit 2, the data access unit 3, the interconnection module 4, the main operation module 5, and the slave operation modules 6. When any unit or module of the neural network processor is in an idle state, the device 7 sends fifth voltage frequency regulation information to that unit or module to reduce its working voltage or operating frequency, further reducing its power consumption.
• When the unit or module returns to a working state, the dynamic voltage regulation and frequency modulation device 7 sends sixth voltage frequency regulation information to that unit or module to raise its working voltage or operating frequency, so that its operating speed meets the demands of the work.
• FIG. 3E schematically illustrates an embodiment of the interconnection module 4: an H-tree module.
• The interconnection module 4 constitutes a data path between the main operation module 5 and the plurality of slave operation modules 6, and has a binary tree structure composed of a plurality of nodes: each node transmits upstream data identically to its two downstream nodes, merges the data returned by its two downstream nodes, and returns the result to its upstream node.
• The neuron data in the main operation module 5 is sent to each slave operation module 6 through the interconnection module 4. After the calculation process of the slave operation modules 6 is completed, the value of the neuron output by each slave operation module is progressively assembled into a complete vector of neurons in the interconnection module 4. For example, if there are N slave operation modules in the device, the input data xi is sent to the N slave operation modules, each slave operation module convolves the input data xi with the convolution kernel corresponding to that module to obtain scalar data, and the interconnection module 4 merges the scalar data of all slave operation modules into an intermediate vector containing N elements.
• Assuming the convolution window traverses a total of A*B input data xi (A in the X direction and B in the Y direction, where X and Y are coordinate axes of the three-dimensional orthogonal coordinate system), the above convolution operation is performed for each of the A*B values xi, and all the resulting vectors are combined in the main operation module 5 to obtain a three-dimensional intermediate result of size A*B*N.
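• The reduction described above can be sketched as follows; numpy, the flattened kernels, and the toy shapes are assumptions for illustration, since the real H-tree performs the merge in hardware.

```python
import numpy as np

def intermediate_result(windows, kernels):
    """windows: array of shape (A, B, k), the input data xi at each of the
    A*B convolution-window positions; kernels: shape (N, k), one flattened
    convolution kernel per slave operation module."""
    A, B, _ = windows.shape
    N = kernels.shape[0]
    out = np.empty((A, B, N))
    for a in range(A):
        for b in range(B):
            xi = windows[a, b]        # the same xi is sent to all N slaves
            out[a, b] = kernels @ xi  # N scalars merged into one N-vector
    return out                        # three-dimensional A*B*N result
```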
  • FIG. 3F illustrates an example block diagram of the structure of the main operation module 5 in the apparatus for performing a convolutional neural network forward operation according to an embodiment of the present application.
  • the main operation module 5 includes a first operation unit 51, a first data dependency determination unit 52, and a first storage unit 53.
  • the first operation unit 51 includes a vector addition unit 511 and an activation unit 512.
• The first operation unit 51 receives the control signal from the controller unit 2 and completes the various operation functions of the main operation module 5. The vector addition unit 511 implements the bias-add operation in the forward calculation of the convolutional neural network: it adds the bias data to the intermediate result element by element to obtain a bias result. The activation operation unit 512 then performs the activation function operation on the bias result.
• The bias data may be read in from an external address space or stored locally.
• The first data dependency determination unit 52 is the port through which the first operation unit 51 reads and writes the first storage unit 53, and it ensures read/write consistency of the data in the first storage unit 53. The first data dependency determination unit 52 is also responsible for sending the data read from the first storage unit 53 to the slave operation modules 6 through the interconnection module 4; the output data of the slave operation modules 6 is sent directly to the first operation unit 51 through the interconnection module 4. The commands output by the controller unit 2 are sent to the first operation unit 51 and the first data dependency determination unit 52 to control their behavior.
• The first storage unit 53 is configured to buffer the input data and output data used by the main operation module 5 in the calculation process.
  • FIG. 3G illustrates an example block diagram of the structure of the slave arithmetic module 6 in an apparatus for performing a convolutional neural network forward operation in accordance with an embodiment of the present application.
• Each slave operation module 6 includes a second operation unit 61, a second data dependency determination unit 62, a second storage unit 63, and a third storage unit 64.
  • the second arithmetic unit 61 receives the control signal from the controller unit 2 and performs a convolution operation.
  • the second arithmetic unit includes a vector multiplication unit 611 and an accumulating unit 612, which are respectively responsible for the vector multiplication operation and the accumulation operation in the convolution operation.
• The second data dependency determination unit 62 is responsible for the read and write operations on the second storage unit 63 during the calculation process. Before performing a read or write, it first ensures that there is no read/write consistency conflict among the data used by the instructions. For example, all control signals sent to the second data dependency determination unit 62 are stored in an instruction queue inside the unit; if the range of data read by a read instruction in this queue conflicts with the range of data written by a write instruction earlier in the queue, the read instruction must wait until the write instruction it depends on has been executed.
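• The consistency rule amounts to an address-range overlap check against earlier writes still in the queue. A minimal sketch, in which the Instr record and the half-open address ranges are assumptions:

```python
from dataclasses import dataclass

@dataclass
class Instr:
    op: str     # "read" or "write"
    start: int  # first address touched
    end: int    # one past the last address touched

def may_issue(read, queue):
    """queue: instructions ahead of `read` in program order. The read may
    issue only if its range overlaps no earlier write's range."""
    for earlier in queue:
        overlap = read.start < earlier.end and earlier.start < read.end
        if earlier.op == "write" and overlap:
            return False  # conflict: wait until that write has executed
    return True
```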
  • the second storage unit 63 buffers the input data of the slave arithmetic module 6 and outputs scalar data.
  • the third storage unit 64 buffers the convolution kernel data required by the slave arithmetic module 6 in the calculation process.
• In summary, the dynamic voltage regulation and frequency modulation device collects in real time the running speed of the neural network processor and of its internal units and modules, and determines from those running speeds whether to lower the operating frequency or working voltage of the processor or of its internal units. This reduces the overall operating power consumption of the chip while still meeting the user's needs in actual work.
  • FIG. 3H is a schematic flowchart of a dynamic voltage regulation and frequency modulation method according to an embodiment of the present application. As shown in FIG. 3H, the method includes:
• The dynamic voltage regulation and frequency modulation device collects in real time the working state information or application scenario information of the chip connected to it, where the application scenario information is information obtained by the chip through neural network operations or information collected by sensors connected to the chip.
  • the dynamic voltage regulation and frequency modulation device sends voltage frequency regulation information to the chip according to the working state information or the application scenario information of the chip, where the voltage frequency regulation information is used to instruct the chip to adjust its working voltage or operating frequency.
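• The two steps of the method reduce to a collect-decide-send loop. The following sketch assumes callbacks for collection, decision, and delivery, all of which are illustrative:

```python
def dvfs_loop(collect_info, decide, send_to_chip):
    # Step 1: collect working state or application scenario information in
    # real time; Step 2: derive voltage frequency regulation information and
    # send it so the chip adjusts its working voltage or operating frequency.
    while True:
        info = collect_info()
        regulation = decide(info)   # e.g. "up", "down", or None
        if regulation is not None:
            send_to_chip(regulation)
```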
• The working state information of the chip includes the operating speed of the chip, the voltage frequency regulation information includes first voltage frequency regulation information, and sending the voltage frequency regulation information to the chip according to the working state information or application scenario information of the chip includes: when the operating speed of the chip is greater than a target speed, sending the first voltage frequency regulation information to the chip, the first voltage frequency regulation information being used to instruct the chip to reduce its operating frequency or working voltage, where the target speed is the running speed of the chip when the user's demand is met.
• The chip includes at least a first unit and a second unit, where the output data of the first unit is the input data of the second unit; the working state information of the chip includes the operating speed of the first unit and the operating speed of the second unit; and the voltage frequency regulation information includes second voltage frequency regulation information. Sending the voltage frequency regulation information to the chip according to the working state information or application scenario information of the chip further includes: when the running time of the first unit exceeds the running time of the second unit, sending the second voltage frequency regulation information to the second unit, the second voltage frequency regulation information being used to instruct the second unit to reduce its operating frequency or working voltage.
• The voltage frequency regulation information includes third voltage frequency regulation information, and sending the voltage frequency regulation information to the chip according to the working state information or application scenario information of the chip further includes: when the running time of the second unit exceeds the running time of the first unit, sending the third voltage frequency regulation information to the first unit, the third voltage frequency regulation information being used to instruct the first unit to reduce its operating frequency or working voltage.
• The chip includes at least N units, and the working state information of the chip includes the working state information of at least S of the N units, where N is an integer greater than 1 and S is an integer less than or equal to N. The voltage frequency regulation information includes fourth voltage frequency regulation information, and sending the voltage frequency regulation information to the chip according to the working state information of the chip further includes: when a unit A is determined to be in an idle state according to its working state information, sending the fourth voltage frequency regulation information to the unit A, the fourth voltage frequency regulation information being used to instruct the unit A to reduce its operating frequency or working voltage, where the unit A is any one of the at least S units.
• The voltage frequency regulation information includes fifth voltage frequency regulation information, and sending the voltage frequency regulation information to the chip according to the working state information or application scenario information of the chip further includes: when the unit A is determined to return to a working state according to its working state information, sending the fifth voltage frequency regulation information to the unit A, the fifth voltage frequency regulation information being used to instruct the unit A to raise its working voltage or operating frequency.
• When the application scenario of the chip is image recognition, the application scenario information is the number of objects in the image to be identified, the voltage frequency regulation information includes sixth voltage frequency regulation information, and the voltage regulation and frequency modulation unit is further configured to: when the number of objects in the image to be identified is less than a first threshold, send the sixth voltage frequency regulation information to the chip, the sixth voltage frequency regulation information being used to instruct the chip to reduce its working voltage or operating frequency.
• When the application scenario information is object tag information, the voltage frequency regulation information includes seventh voltage frequency regulation information, and the voltage regulation and frequency modulation unit is further configured to: when the object tag information belongs to a preset object tag set, send the seventh voltage frequency regulation information to the chip, the seventh voltage frequency regulation information being used to instruct the chip to raise its working voltage or operating frequency.
• When the chip is applied to voice recognition, the application scenario information is the voice input rate, the voltage frequency regulation information includes eighth voltage frequency regulation information, and the voltage regulation and frequency modulation unit is further configured to: when the voice input rate is lower than a threshold, send the eighth voltage frequency regulation information to the chip, the eighth voltage frequency regulation information being used to instruct the chip to reduce its working voltage or operating frequency.
• When the application scenario information is a keyword obtained by the chip performing voice recognition, the voltage frequency regulation information includes ninth voltage frequency regulation information, and the voltage regulation and frequency modulation unit is further configured to: when the keyword belongs to a preset keyword set, send the ninth voltage frequency regulation information to the chip, the ninth voltage frequency regulation information being used to instruct the chip to raise its working voltage or operating frequency.
• When the chip is applied to machine translation, the application scenario information is the speed of text input or the number of characters in the image to be translated, the voltage frequency regulation information includes tenth voltage frequency regulation information, and the voltage regulation and frequency modulation unit is further configured to: when the speed of text input or the number of characters in the image to be translated is lower than a threshold, send the tenth voltage frequency regulation information to the chip, the tenth voltage frequency regulation information being used to instruct the chip to reduce its working voltage or operating frequency.
• When the application scenario information is the ambient light intensity, the voltage frequency regulation information includes eleventh voltage frequency regulation information, and the voltage regulation and frequency modulation unit is further configured to: when the ambient light intensity is less than a threshold, send the eleventh voltage frequency regulation information to the chip, the eleventh voltage frequency regulation information being used to instruct the chip to reduce its working voltage or operating frequency.
• When the chip is applied to image beauty, the voltage frequency regulation information includes twelfth voltage frequency regulation information and thirteenth voltage frequency regulation information, and the voltage regulation and frequency modulation unit is further configured to: when the application scenario information is a face image, send the twelfth voltage frequency regulation information to the chip, the twelfth voltage frequency regulation information being used to instruct the chip to reduce its working voltage.
  • FIG. 4A is a schematic structural diagram of a convolution operation device according to an embodiment of the present application.
• The convolution operation device includes a dynamic voltage regulation and frequency modulation device 7, an instruction storage unit 1, a controller unit 2, a data access unit 3, an interconnection module 4, a main operation module 5, and N slave operation modules 6.
• The instruction storage unit 1, the controller unit 2, the data access unit 3, the interconnection module 4, the main operation module 5, and the N slave operation modules 6 may all be implemented by hardware circuits, including but not limited to FPGAs, CGRAs, application-specific integrated circuits (ASICs), analog circuits, and memristors.
  • the instruction storage unit 1 is configured to store an instruction read by the data access unit 3.
• The controller unit 2 is configured to read an instruction from the instruction storage unit 1, translate the instruction into control signals that control the behavior of other modules, and send those signals to the other modules, such as the data access unit 3, the main operation module 5, and the N slave operation modules 6.
  • the data access unit 3 is configured to perform data or instruction read and write operations between the external address space and the convolution operation device.
  • the data access unit 3 accesses the external address space, directly reads and writes data to each storage unit inside the device, and completes loading and storing of the data.
  • N slave arithmetic modules 6 are used to implement convolution operations of input data and convolution kernels in a convolutional neural network algorithm.
  • the N slave operation modules 6 are specifically configured to calculate respective output scalars in parallel by using the same input data and respective convolution kernels.
• The interconnection module 4 is configured to connect the main operation module 5 and the N slave operation modules 6, and can be implemented in different interconnection topologies (such as a tree structure, a ring structure, a grid structure, a hierarchical interconnection, or a bus structure).
  • the interconnection module 4 can implement data transmission between the main operation module 5 and the N slave operation modules 6.
  • the interconnection module 4 constitutes a data path of continuous or discretized data between the main operation module 5 and the N slave operation modules 6, and the interconnection module 4 is a tree structure, a ring structure, a grid structure, Any of a hierarchical interconnection and a bus structure.
  • the main operation module 5 is configured to splicing intermediate vectors of all input data into intermediate results, and performing subsequent operations on the intermediate results.
  • the main operation module 5 is further configured to add the intermediate result and the offset data, and then perform an activation operation.
  • the activation function active used by the main operation module is any nonlinear function of the nonlinear functions sigmoid, tanh, relu, and softmax.
  • the main operation module 5 includes:
  • the first storage unit 53 is configured to cache input data and output data used by the main operation module 5 in the calculation process
  • the first operation unit 51 is configured to complete various computing functions of the main operation module 5;
• The first data dependency determination unit 52 is the port through which the first operation unit 51 reads and writes the first storage unit 53; it ensures read/write consistency of the data in the first storage unit 53, reads the input neuron vector from the first storage unit 53 and sends it to the N slave operation modules 6 through the interconnection module 4, and sends the intermediate result vector from the interconnection module 4 to the first operation unit 51.
  • the slave operation module of each of the N slave operation modules 6 includes:
  • a second operation unit 61 configured to receive a control signal sent by the controller unit 2 and perform an arithmetic logic operation
• the second data dependency determination unit 62 is configured to perform read and write operations on the second storage unit 63 and the third storage unit 64 during the calculation process, ensuring read/write consistency of the second storage unit 63 and the third storage unit 64;
  • a second storage unit 63 configured to buffer input data and an output scalar calculated by the slave computing module
  • the third storage unit 64 is configured to buffer a convolution kernel required by the slave computing module in the calculation process.
• The first data dependency determination unit 52 and the second data dependency determination unit 62 ensure read/write consistency, for example by queuing control signals and holding back any read whose data range conflicts with that of an earlier, unfinished write.
  • the data access unit 3 reads at least one of input data, offset data, and a convolution kernel from an external address space.
• The main operation module 5 first delivers the input data to each of the N slave operation modules 6 through the interconnection module 4; the calculation process is then completed at the N slave operation modules 6.
• The interconnection module 4 progressively splices the output scalars of the N slave operation modules 6 into intermediate vectors and sends them back to the main operation module 5.
• The specific calculation method of the above convolution operation device is described below through different operation instructions. The operation instruction here is exemplified by a convolution calculation instruction; the convolution calculation instruction can be applied to a neural network, so it may also be called a convolutional neural network instruction.
• For the convolution calculation instruction, the formula that actually needs to be executed can be: s = s(Σ w·xi + b), in which the convolution kernel w (which may comprise a plurality of data) is multiplied by the input data xi, the products are summed, the offset b is then optionally added, and an activation operation s(h) is optionally applied to obtain the final output result S. According to this formula, the calculation topology can be obtained as: multiplier - adder - (optional) activation operator.
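• A worked numeric instance of this formula, with numpy, toy values, and a sigmoid chosen for the activation s(h) (all assumptions for illustration):

```python
import numpy as np

w = np.array([1.0, 0.5, -1.0])  # convolution kernel (may comprise many values)
x = np.array([2.0, 4.0, 1.0])   # input data xi under the convolution window
b = 0.5                         # optional offset

h = np.dot(w, x) + b            # multiplier and adder stages: 2 + 2 - 1 + 0.5 = 3.5
s = 1.0 / (1.0 + np.exp(-h))    # optional activation s(h); sigmoid(3.5) ~= 0.97
```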
• The convolution calculation instruction may belong to an instruction set that includes convolutional neural network COMPUTE instructions with different functions, as well as a CONFIG instruction, an IO instruction, a NOP instruction, a JUMP instruction, and a MOVE instruction.
  • the COMPUTE instruction includes:
• a convolution operation instruction, according to which the convolution operation device extracts input data of a specified size and a convolution kernel from specified addresses of a memory (preferably a scratch pad memory or a scalar register file) and performs the convolution operation in the convolution operation unit;
• a convolutional neural network sigmoid instruction, according to which the convolution operation device extracts input data of a specified size and a convolution kernel from specified addresses of a memory (preferably a scratch pad memory or a scalar register file), performs the convolution operation in the convolution operation unit, and then applies sigmoid activation to the output result;
• a convolutional neural network TanH instruction, according to which the convolution operation device extracts input data of a specified size and a convolution kernel from specified addresses of a memory (preferably a scratch pad memory), performs the convolution operation in the convolution operation unit, and then applies TanH activation to the output result;
• a convolutional neural network ReLU instruction, according to which the convolution operation device extracts input data of a specified size and a convolution kernel from specified addresses of a memory (preferably a scratch pad memory), performs the convolution operation in the convolution operation unit, and then applies ReLU activation to the output result; and
• a convolutional neural network group instruction, according to which the convolution operation device extracts input data of a specified size and a convolution kernel from specified addresses of a memory (preferably a scratch pad memory), divides them into groups, performs the convolution operation in the convolution operation unit, and then preferably activates the output result.
  • the CONFIG command is used to configure the various constants required for the current layer calculation before each layer of artificial neural network calculation begins.
  • the IO instruction is used to read the input data required for calculation from the external storage space and store the data back to the external space after the calculation is completed.
• The NOP instruction is responsible for clearing the control signals in all control signal buffer queues of the current device, ensuring that all instructions before the NOP instruction have completed; the NOP instruction itself does not contain any operation.
  • the JUMP instruction is used to control the jump of the next instruction address to be read from the instruction storage unit, and is used to implement the jump of the control flow;
• The MOVE instruction is used to move data at one address in the device's internal address space to another address in the internal address space; this process is independent of the operation unit and does not occupy the resources of the operation unit during execution.
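• One way to picture how the controller unit might dispatch this instruction set is the sketch below; the handler names and the state object are assumptions, not the disclosed microarchitecture.

```python
def dispatch(instr, state):
    # CONFIG/IO/NOP/JUMP/MOVE as described above; everything else is treated
    # as a COMPUTE variant handed to the operation unit.
    if instr.op == "CONFIG":
        state.constants = instr.args              # per-layer constants
    elif instr.op == "IO":
        state.data_access.transfer(instr.args)    # external <-> internal space
    elif instr.op == "NOP":
        state.control_queues.drain()              # barrier: all prior done
    elif instr.op == "JUMP":
        state.next_pc = instr.args["target"]      # control-flow jump
    elif instr.op == "MOVE":
        state.memory.copy(instr.args["src"], instr.args["dst"])
    else:
        state.operation_unit.run(instr)           # COMPUTE instruction
```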
  • the method for executing the convolution calculation instruction by the convolution operation device may specifically be:
• The controller unit 2 extracts from the instruction storage unit 1 the convolution calculation instruction, the operation domain corresponding to the convolution calculation instruction, and the first calculation topology corresponding to the convolution calculation instruction (multiplier - adder - (optional) activation operator); the controller unit transmits the operation domain to the data access unit 3 and transmits the first calculation topology to the interconnection module 4.
• The data access unit 3 extracts the convolution kernel w and the offset b corresponding to the operation domain from the external storage medium (when b is 0, the offset b does not need to be extracted), and transmits the convolution kernel w and the offset b to the main operation module 5.
  • the first result may include a result of a plurality of multiplication operations.
  • the dynamic voltage regulation and frequency modulation device 7 is configured to collect operation state information of the convolution operation device, and send voltage frequency regulation information to the convolution operation device according to the operation state information of the convolution operation device, where the voltage frequency regulation The information is used to instruct the convolution operation device to adjust its operating voltage or operating frequency.
  • the dynamic voltage regulation and frequency modulation device 7 includes:
  • the information collecting unit 71 is configured to collect the working state information of the convolution operation device in real time
• The voltage regulation and frequency modulation unit 72 is configured to send voltage frequency regulation information to the convolution operation device according to the working state information of the convolution operation device, the voltage frequency regulation information being used to instruct the convolution operation device to adjust its working voltage or operating frequency.
• The working state information of the convolution operation device includes the operating speed of the convolution operation device, the voltage frequency regulation information includes first voltage frequency regulation information, and the voltage regulation and frequency modulation unit 72 is configured to: when the operating speed of the convolution operation device is greater than a target speed, send the first voltage frequency regulation information to the convolution operation device, the first voltage frequency regulation information being used to instruct the convolution operation device to reduce its operating frequency or working voltage, where the target speed is the operating speed of the convolution operation device when the user's demand is met.
• The working state information of the convolution operation device includes the operating speed of the data access unit 3 and the operating speed of the main operation module 5, the voltage frequency regulation information includes second voltage frequency regulation information, and the voltage regulation and frequency modulation unit 72 is further configured to: when the running time of the data access unit 3 exceeds the running time of the main operation module 5, send the second voltage frequency regulation information to the main operation module 5, the second voltage frequency regulation information being used to instruct the main operation module 5 to reduce its operating frequency or working voltage.
• The voltage frequency regulation information includes third voltage frequency regulation information, and the voltage regulation and frequency modulation unit 72 is further configured to: when the running time of the main operation module 5 exceeds the running time of the data access unit 3, send the third voltage frequency regulation information to the data access unit 3, the third voltage frequency regulation information being used to instruct the data access unit 3 to reduce its operating frequency or working voltage.
• The working state information of the convolution operation device includes the working state information of at least S units/modules among the instruction storage unit 1, the controller unit 2, the data access unit 3, the interconnection module 4, the main operation module 5, and the N slave operation modules 6; the voltage frequency regulation information includes fourth voltage frequency regulation information; and the voltage regulation and frequency modulation unit 72 is configured to: when a unit A is determined to be in an idle state according to its working state information, send the fourth voltage frequency regulation information to the unit A, the fourth voltage frequency regulation information being used to instruct the unit A to reduce its operating frequency or working voltage, where the unit A is any one of the at least S units/modules.
• The voltage frequency regulation information includes fifth voltage frequency regulation information, and the voltage regulation and frequency modulation unit 72 is further configured to: when the unit A is determined to return to a working state according to its working state information, send the fifth voltage frequency regulation information to the unit A, the fifth voltage frequency regulation information being used to instruct the unit A to raise its working voltage or operating frequency.
  • an embodiment of the present invention provides a neural network processor including the above convolution operation device.
  • the above neural network processor is used to perform artificial neural network operations, and implements artificial intelligence applications such as speech recognition, image recognition, and translation.
• Case 1: While the neural network processor performs the convolution operation, the dynamic voltage regulation and frequency modulation device 7 in FIG. 4A acquires the running speeds of the data access unit 3 and the main operation module 5 of the neural network processor in real time.
• When the dynamic voltage regulation and frequency modulation device 7 determines from these running speeds that the running time of the data access unit 3 exceeds the running time of the main operation module 5, it can conclude that the data access unit 3 has become the bottleneck of the convolution operation. The main operation module 5 can only perform the convolution operation after the data access unit 3 has transmitted the data to it.
• The dynamic voltage regulation and frequency modulation device 7 sends first voltage frequency regulation information to the main operation module 5, instructing it to lower its working voltage or operating frequency so as to reduce its running speed. The running speed of the main operation module 5 is thereby matched to that of the data access unit 3, which reduces the power consumption of the main operation module 5 and avoids leaving it idle; in the end, the overall operating power consumption of the neural network processor is reduced without affecting the completion time of the task.
• Case 2: In the process of performing the convolution operation, the dynamic voltage regulation and frequency modulation device 7 in FIG. 4A acquires the running speeds of the data access unit 3 and the main operation module 5 of the neural network processor in real time. When the device 7 determines from these running speeds that the running time of the main operation module 5 exceeds the running time of the data access unit 3, it can conclude that the main operation module 5 has become the bottleneck of the convolution operation.
• The dynamic voltage regulation and frequency modulation device 7 then sends second voltage frequency regulation information to the data access unit 3, instructing it to lower its working voltage or operating frequency so as to reduce its running speed. The running speed of the data access unit 3 is thereby matched to that of the main operation module 5, which reduces the power consumption of the data access unit 3 and avoids leaving it idle; in the end, the overall operating power consumption of the neural network processor is reduced without affecting the completion time of the task.
• Case 3: The dynamic voltage regulation and frequency modulation device 7 in FIG. 4A collects in real time the working parameters of the neural network processor while it runs an artificial intelligence application, and adjusts the working voltage or operating frequency of the neural network processor according to those working parameters.
  • the artificial intelligence application may be video image processing, object recognition, machine translation, voice recognition, image beauty, and the like.
• For example, when the neural network processor performs video image processing, the dynamic voltage regulation and frequency modulation device 7 in FIG. 4A collects the frame rate of that video image processing in real time. When the frame rate exceeds the target frame rate, where the target frame rate is the video image processing frame rate normally required by the user, the device 7 sends third voltage frequency regulation information to the neural network processor. The third voltage frequency regulation information instructs the neural network processor to reduce its working voltage or operating frequency, reducing its power consumption while still satisfying the user's normal video image processing requirements.
• Case 4: When the neural network processor performs speech recognition, the dynamic voltage regulation and frequency modulation device 7 in FIG. 4A collects the speech recognition speed of the neural network processor in real time. When that speed exceeds the speech recognition speed actually required by the user, the device 7 sends fourth voltage frequency regulation information to the neural network processor, instructing it to reduce its working voltage or operating frequency, which reduces the power consumption of the neural network processor while still satisfying the user's normal speech recognition requirements.
• The dynamic voltage regulation and frequency modulation device 7 in FIG. 4A monitors and acquires in real time the working state information of each unit or module in the above neural network processor, including the instruction storage unit 1, the controller unit 2, the data access unit 3, the interconnection module 4, the main operation module 5, and the N slave operation modules 6. When any unit or module of the neural network processor is in an idle state, the device 7 sends fifth voltage frequency regulation information to that unit or module to reduce its working voltage or operating frequency, reducing its power consumption.
• When the unit or module returns to a working state, the dynamic voltage regulation and frequency modulation device 7 sends sixth voltage frequency regulation information to that unit or module to raise its working voltage or operating frequency, so that its operating speed meets the demands of the work.
• FIG. 4E schematically illustrates an embodiment of the interconnection module 4: an H-tree module.
• The interconnection module 4 constitutes a data path between the main operation module 5 and the plurality of slave operation modules 6, and has a binary tree structure composed of a plurality of nodes: each node transmits upstream data identically to its two downstream nodes, merges the data returned by its two downstream nodes, and returns the result to its upstream node.
• The neuron data in the main operation module 5 is sent to each slave operation module 6 through the interconnection module 4. After the calculation process of the slave operation modules 6 is completed, the value of the neuron output by each slave operation module is progressively assembled into a complete vector of neurons in the interconnection module 4. For example, if there are N slave operation modules in the convolution operation device, the input data xi is sent to the N slave operation modules, each slave operation module convolves the input data xi with the convolution kernel corresponding to that module to obtain scalar data, and the interconnection module 4 merges the scalar data of all slave operation modules into an intermediate vector containing N elements.
• Assuming the convolution window traverses a total of A*B input data xi (A in the X direction and B in the Y direction, where X and Y are coordinate axes of the three-dimensional orthogonal coordinate system), the above convolution operation is performed for each of the A*B values xi, and all the resulting vectors are combined in the main operation module 5 to obtain a three-dimensional intermediate result of size A*B*N.
  • FIG. 4B illustrates an example block diagram of the structure of the main operation module 5 in the apparatus for performing a convolutional neural network forward operation in accordance with an embodiment of the present application.
  • the main operation module 5 includes a first operation unit 51, a first data dependency determination unit 52, and a first storage unit 53.
  • the first operation unit 51 includes a vector addition unit 511 and an activation unit 512.
• The first operation unit 51 receives the control signal from the controller unit 2 in FIG. 4A and completes the various operation functions of the main operation module 5. The vector addition unit 511 implements the bias-add operation in the forward calculation of the convolutional neural network: it adds the bias data to the intermediate result element by element to obtain a bias result. The activation operation unit 512 then performs the activation function operation on the bias result.
• The bias data may be read in from an external address space or stored locally.
• The first data dependency determination unit 52 is the port through which the first operation unit 51 reads and writes the first storage unit 53, and it ensures read/write consistency of the data in the first storage unit 53. The first data dependency determination unit 52 is also responsible for sending the data read from the first storage unit 53 to the slave operation modules 6 through the interconnection module 4; the output data of the slave operation modules 6 is sent directly to the first operation unit 51 through the interconnection module 4. The commands output by the controller unit 2 are sent to the first operation unit 51 and the first data dependency determination unit 52 to control their behavior.
• The first storage unit 53 is configured to buffer the input data and output data used by the main operation module 5 in the calculation process.
  • FIG. 4C illustrates an example block diagram of the structure of the slave arithmetic module 6 in an apparatus for performing a convolutional neural network forward operation in accordance with an embodiment of the present application.
  • each slave arithmetic module 6 includes a second arithmetic unit 61, a second data dependency determining unit 62, a second storage unit 63, and a third storage unit 64.
• The second operation unit 61 receives the control signal from the controller unit 2 in FIG. 4A and performs the convolution operation.
  • the second arithmetic unit includes a vector multiplication unit 611 and an accumulating unit 612, which are respectively responsible for the vector multiplication operation and the accumulation operation in the convolution operation.
• The second data dependency determination unit 62 is responsible for the read and write operations on the second storage unit 63 during the calculation process. Before performing a read or write, it first ensures that there is no read/write consistency conflict among the data used by the instructions. For example, all control signals sent to the second data dependency determination unit 62 are stored in an instruction queue inside the unit; if the range of data read by a read instruction in this queue conflicts with the range of data written by a write instruction earlier in the queue, the read instruction must wait until the write instruction it depends on has been executed.
  • the second storage unit 63 buffers the input data of the slave arithmetic module 6 and outputs scalar data.
  • the third storage unit 64 buffers the convolution kernel data required by the slave arithmetic module 6 in the calculation process.
  • an embodiment of the present application provides a neural network processor including the above convolution operation device.
  • the above neural network processor is used to perform artificial neural network operations, and implements artificial intelligence applications such as speech recognition, image recognition, and translation.
• The information collecting unit 71 of the dynamic voltage regulation and frequency modulation device 7 collects in real time the working state information or application scenario information of the neural network processor connected to the device 7, where the application scenario information is information obtained by the neural network processor through neural network operations or information collected by sensors connected to the neural network processor. The voltage regulation and frequency modulation unit 72 of the device 7 sends voltage frequency regulation information to the neural network processor according to that working state information or application scenario information; the voltage frequency regulation information is used to instruct the neural network processor to adjust its working voltage or operating frequency.
• The working state information of the neural network processor includes the operating speed of the neural network processor, the voltage frequency regulation information includes first voltage frequency regulation information, and the voltage regulation and frequency modulation unit 72 is configured to: when the operating speed of the neural network processor is greater than a target speed, send the first voltage frequency regulation information to the neural network processor, the first voltage frequency regulation information being used to instruct the neural network processor to reduce its operating frequency or working voltage, where the target speed is the operating speed of the neural network processor when the user's needs are met.
  • the information collecting unit 71 collects the running speed of the neural network processor connected thereto in real time.
• The operating speed of the neural network processor can be a different type of speed depending on the task the processor performs. When the neural network processor performs video image processing, its operating speed may be the frame rate of that video image processing; when it performs voice recognition, its operating speed is the speed at which the input information is recognized.
• When the voltage regulation and frequency modulation unit 72 determines that the operating speed of the neural network processor is greater than the target speed, that is, greater than the operating speed needed to meet the user's demand, it sends the first voltage frequency regulation information to the neural network processor to instruct it to reduce its working voltage or operating frequency, thereby reducing the power consumption of the neural network processor.
• For example, the information collecting unit 71 collects in real time the frame rate at which the neural network processor performs video image processing; suppose the current frame rate of video processing is 54 frames per second. When the voltage regulation and frequency modulation unit 72 determines that this frame rate is greater than the target frame rate, it sends the first voltage frequency regulation information to the neural network processor to instruct it to reduce its working voltage or operating frequency, thereby reducing the power consumption of the neural network processor.
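• Carrying the 54 frames/second figure through one possible (assumed) proportional policy shows how much headroom exists; the 30 fps target and the proportional rule are illustrative assumptions, not values from this disclosure.

```python
measured_fps, target_fps = 54.0, 30.0
current_freq_mhz = 1000.0  # assumed current operating frequency

if measured_fps > target_fps:
    # Scale frequency so throughput just meets the target frame rate.
    new_freq_mhz = current_freq_mhz * target_fps / measured_fps  # ~555.6 MHz
```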
• The neural network processor includes at least a first unit and a second unit, where the output data of the first unit is the input data of the second unit. The working state information of the neural network processor includes the operating speed of the first unit and the operating speed of the second unit, the voltage frequency regulation information includes second voltage frequency regulation information, and the voltage regulation and frequency modulation unit 72 is further configured to: when the running time of the first unit exceeds the running time of the second unit, send the second voltage frequency regulation information to the second unit, the second voltage frequency regulation information being used to instruct the second unit to reduce its operating frequency or working voltage.
• Specifically, the first unit and the second unit of the neural network processor cooperate, with the output data of the first unit serving as the input data of the second unit. The information collecting unit 71 collects the operating speeds of the first unit and the second unit in real time. When the running time of the first unit exceeds the running time of the second unit, the voltage regulation and frequency modulation unit 72 sends the second voltage frequency regulation information to the second unit to instruct it to lower its working voltage or operating frequency, so as to reduce the power consumption of the entire neural network processor without affecting its overall operating speed.
• The voltage frequency regulation information includes third voltage frequency regulation information, and the voltage regulation and frequency modulation unit 72 is further configured to: when the running time of the second unit exceeds the running time of the first unit, send the third voltage frequency regulation information to the first unit, the third voltage frequency regulation information being used to instruct the first unit to reduce its operating frequency or working voltage.
• Specifically, the first unit and the second unit of the neural network processor cooperate, with the output data of the first unit serving as the input data of the second unit. The information collecting unit 71 collects the operating speeds of the first unit and the second unit in real time. When the running time of the second unit exceeds the running time of the first unit, the voltage regulation and frequency modulation unit 72 sends the third voltage frequency regulation information to the first unit to instruct it to lower its working voltage or operating frequency, so as to reduce the power consumption of the entire neural network processor without affecting its overall operating speed.
• The neural network processor includes at least N units, and the working state information of the neural network processor includes the working state information of at least S of those N units, where N is an integer greater than 1 and S is an integer less than or equal to N. The voltage frequency regulation information includes fourth voltage frequency regulation information, and the voltage regulation and frequency modulation unit 72 is configured to: when a unit A is determined to be in an idle state according to its working state information, send the fourth voltage frequency regulation information to the unit A, the fourth voltage frequency regulation information being used to instruct the unit A to reduce its operating frequency or working voltage, where the unit A is any one of the at least S units.
• The voltage frequency regulation information includes fifth voltage frequency regulation information, and the voltage regulation and frequency modulation unit 72 is further configured to: when the unit A is determined to return to a working state according to its working state information, send the fifth voltage frequency regulation information to the unit A, the fifth voltage frequency regulation information being used to instruct the unit A to raise its working voltage or operating frequency.
  • the information collecting unit 71 collects the working state information of at least S units inside the neural network processor in real time.
• When any unit A of the at least S units is in an idle state, the voltage regulation and frequency modulation unit 72 sends the fourth voltage frequency regulation information to the unit A to instruct it to lower its operating frequency or working voltage, reducing the power consumption of the unit A; when the unit A returns to a working state, the voltage regulation and frequency modulation unit 72 sends the fifth voltage frequency regulation information to the unit A to instruct it to raise its operating frequency or working voltage, so that the operating speed of the unit A satisfies the demands of the work.
• When the application scenario information is the number of objects in the image to be identified, the voltage frequency regulation information includes sixth voltage frequency regulation information, and the voltage regulation and frequency modulation unit 72 is further configured to:
• Specifically, the neural network processor is applied to image recognition, the number of objects in the image to be identified is obtained by the neural network processor through a neural network algorithm, and the information collecting unit 71 obtains that number from the neural network processor. When the voltage regulation and frequency modulation unit 72 determines that the number of objects in the image to be identified is less than a first threshold, it sends the sixth voltage frequency regulation information to the neural network processor to instruct it to reduce its working voltage or operating frequency; when it determines that the number of objects in the image to be identified is greater than the first threshold, it sends voltage frequency regulation information instructing the neural network processor to raise its working voltage or operating frequency.
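• The image-recognition rule is a two-sided threshold test on the object count. A minimal sketch, assuming a concrete threshold value and a regulate() callback:

```python
FIRST_THRESHOLD = 10  # assumed value of the first threshold

def regulate_by_object_count(num_objects, regulate):
    if num_objects < FIRST_THRESHOLD:
        regulate("down")  # sixth voltage frequency regulation information
    elif num_objects > FIRST_THRESHOLD:
        regulate("up")    # instruct the processor to raise voltage/frequency
```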
• When the application scenario information is object tag information, the voltage frequency regulation information includes seventh voltage frequency regulation information, and the voltage regulation and frequency modulation unit 72 is further configured to:
• For example, the preset object tag set includes a plurality of object tags, such as "person", "dog", "tree", and "flower".
• When the neural network processor determines through the neural network algorithm that the current application scenario includes a dog, it transmits object tag information containing "dog" to the information collecting unit 71. When the voltage regulation and frequency modulation unit 72 determines that this object tag information belongs to the preset object tag set, it sends the seventh voltage frequency regulation information to the neural network processor to instruct it to raise its working voltage or operating frequency; when it determines that the object tag information does not belong to the preset object tag set, it sends voltage frequency regulation information instructing the neural network processor to reduce its working voltage or operating frequency.
• When the application scenario information is the voice input rate, the voltage frequency regulation information includes eighth voltage frequency regulation information, and the voltage regulation and frequency modulation unit 72 is further configured to:
• when the voice input rate is lower than a threshold, send the eighth voltage frequency regulation information to the neural network processor, the eighth voltage frequency regulation information being used to instruct the neural network processor to reduce its working voltage or operating frequency.
• Specifically, the application scenario of the above neural network processor is speech recognition, and the input unit of the neural network processor inputs speech to it at a certain rate. The information collecting unit 71 collects the voice input rate in real time and transmits the voice input rate information to the voltage regulation and frequency modulation unit 72. When the voice input rate is low, the unit 72 sends the eighth voltage frequency regulation information to the neural network processor to instruct it to reduce its working voltage or operating frequency; otherwise, voltage frequency regulation information for instructing the neural network processor to raise its working voltage is sent to the neural network processor.
• When the application scenario information is a keyword obtained by the neural network processor performing voice recognition, the voltage frequency regulation information includes ninth voltage frequency regulation information, and the voltage regulation and frequency modulation unit 72 is further configured to: when the keyword belongs to a preset keyword set, send the ninth voltage frequency regulation information to the neural network processor, the ninth voltage frequency regulation information being used to instruct the neural network processor to raise its working voltage or operating frequency.
  • the frequency modulation unit 72 sends the voltage modulation and frequency modulation information for instructing the neural network processor to reduce its working voltage or operating frequency to the neural network processor.
  • For example, the preset keyword set includes keywords such as "image beauty", "neural network algorithm", "image processing", and "Alipay". If the application scenario information is "image beauty", the voltage regulation and frequency modulation unit 72 sends the ninth voltage frequency regulation information to the neural network processor to instruct it to raise its working voltage or operating frequency; if the application scenario information is "photographing", it sends voltage frequency regulation information instructing the neural network processor to lower its working voltage or operating frequency.
  • the application scenario information is a speed of text input or a number of characters in an image to be translated
  • the voltage frequency regulation information includes tenth voltage frequency regulation information
  • the voltage regulation frequency modulation unit 72 is also used to:
  • When the neural network processor is applied to machine translation, the application scenario information collected by the information collection unit 71 is the speed of text input or the number of characters in the image to be translated, and it is transmitted to the voltage regulation and frequency modulation unit 72.
  • When the speed of text input or the number of characters in the image to be translated is less than a threshold, the voltage regulation and frequency modulation unit 72 sends the tenth voltage frequency regulation information to the neural network processor to instruct it to reduce its working voltage or operating frequency;
  • otherwise, the voltage regulation and frequency modulation unit 72 sends voltage frequency regulation information instructing the neural network processor to raise its working voltage or operating frequency.
  • When the application scenario information is the ambient light intensity, the voltage frequency regulation information includes eleventh voltage frequency regulation information, and the voltage regulation and frequency modulation unit 72 is further configured to:
  • send the eleventh voltage frequency regulation information to the neural network processor when the ambient light intensity is less than a fifth threshold, where the eleventh voltage frequency regulation information is used to instruct the neural network processor to reduce its working voltage or operating frequency.
  • the illumination intensity of the external environment is acquired by an illumination sensor connected to the neural network processor.
  • the information collection unit 71 transmits the illumination intensity to the voltage regulation and frequency modulation unit 72.
  • When the voltage regulation and frequency modulation unit 72 determines that the illumination intensity is less than the fifth threshold, it sends the eleventh voltage frequency regulation information to the neural network processor to instruct it to reduce its working voltage or operating frequency;
  • when it determines that the illumination intensity is greater than the fifth threshold, the voltage regulation and frequency modulation unit 72 sends voltage frequency regulation information instructing the neural network processor to raise its working voltage or operating frequency.
  • the neural network processor is applied to image beauty
  • the voltage frequency regulation information includes twelfth voltage frequency regulation information and thirteenth voltage frequency regulation information
  • the voltage regulation and frequency modulation unit 72 is further configured to:
  • send the twelfth voltage frequency regulation information to the neural network processor when the application scenario information is a face image, where the twelfth voltage frequency regulation information is used to instruct the neural network processor to raise its working voltage or operating frequency; otherwise, send the thirteenth voltage frequency regulation information to instruct the neural network processor to reduce its working voltage or operating frequency;
  • the application scenario information is voice strength
  • When the voice strength is greater than a sixth threshold, the voltage regulation and frequency modulation unit 72 sends voltage frequency regulation information to the neural network processor instructing it to reduce its working voltage or operating frequency; when the voice strength is less than the sixth threshold, the voltage regulation and frequency modulation unit 72 sends voltage frequency regulation
  • information instructing the neural network processor to raise its working voltage or operating frequency. All of the scenario rules above follow the same threshold or set-membership pattern, as the sketch below illustrates.
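  • The scenario rules above all share one shape: compare a collected metric against a preset threshold, or test membership in a preset set, then send regulation information to raise or lower the voltage or frequency. A minimal Python sketch of that shared pattern follows; every name and threshold value in it is an illustrative assumption, not something defined by this application.

```python
# Hypothetical sketch of the threshold / set-membership pattern above.
# Metric names, thresholds, and tag sets are illustrative only.
THRESHOLDS = {
    "object_count": 10,       # "first threshold"
    "voice_input_rate": 50,   # e.g. samples per second
    "ambient_light": 200,     # "fifth threshold", e.g. lux
}
PRESET_TAGS = {"person", "dog", "tree", "flower"}

def regulate_by_metric(metric_name: str, value: float) -> str:
    """Below the threshold -> reduce voltage/frequency; above -> raise."""
    if value < THRESHOLDS[metric_name]:
        return "REDUCE_VOLTAGE_OR_FREQUENCY"
    return "RAISE_VOLTAGE_OR_FREQUENCY"

def regulate_by_tag(tag: str) -> str:
    """Tag in the preset set -> raise; otherwise -> reduce."""
    return ("RAISE_VOLTAGE_OR_FREQUENCY" if tag in PRESET_TAGS
            else "REDUCE_VOLTAGE_OR_FREQUENCY")

print(regulate_by_metric("object_count", 3))    # few objects -> reduce
print(regulate_by_metric("ambient_light", 900)) # bright scene -> raise
print(regulate_by_tag("dog"))                   # preset tag -> raise
```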
  • the foregoing scene information may be information of an external scene collected by the sensor, such as light intensity, voice intensity, and the like.
  • the application scenario information may also be information computed by an artificial intelligence algorithm. For example, in an object recognition task, the real-time calculation result information of the neural network processor is fed back to the information collecting unit, where this information includes the number of objects in the scene, face images, object tag keywords, and so on.
  • the artificial intelligence algorithm described above includes, but is not limited to, a neural network algorithm.
  • FIG. 4F is a schematic structural diagram of another convolution operation device according to an embodiment of the present application.
  • the convolution operation device includes a dynamic voltage modulation and frequency modulation device 617, a register unit 612, an interconnection module 613, an operation unit 614, a control unit 615, and a data access unit 616.
  • the operation unit 614 includes at least two of an addition calculator, a multiplication calculator, a comparator, and an activation operator.
  • the interconnecting module 613 is configured to control the connection relationship of the calculators in the computing unit 614 such that at least two types of calculators form different computing topologies.
  • the register unit 612 (which may be a register unit, an instruction cache, a scratch pad memory) is configured to store an operation instruction, an address of the data block in the storage medium, and a calculation topology corresponding to the operation instruction.
  • the convolution operation device further includes a storage medium 611.
  • the storage medium 611 may be an off-chip memory. Of course, in an actual application, it may also be an on-chip memory for storing data blocks.
  • the control unit 615 is configured to extract the operation instruction, the operation domain corresponding to the operation instruction, and the first calculation topology corresponding to the operation instruction from the register unit 612, and to decode the operation instruction into an execution instruction that controls
  • the operation unit 614 to perform arithmetic operations; the control unit transfers the operation domain to the data access unit 616 and transmits the calculation topology to the interconnection module 613.
  • the data access unit 616 is configured to extract a data block corresponding to the operation domain from the storage medium 611, and transmit the data block to the interconnection module 613.
  • the interconnecting module 613 is configured to receive the data block of the first computing topology.
  • the interconnection module 613 also rearranges the data block according to the first calculation topology.
  • the operation unit 614 is configured to execute the execution instruction, calling its calculators to operate on the data block to obtain an operation result, and to transmit the operation result to the data access unit 616 for storage in the storage medium 611.
  • the operation unit 614 is further configured to, according to the first computing topology and the execution instruction, invoke a calculator to perform an operation operation on the re-arranged data block, obtain an operation result, and transmit the operation result.
  • the data access unit 616 is stored in the storage medium 611.
  • the interconnecting module 613 is further configured to form a first computing topology according to the connection relationship of the calculator in the control computing unit 614.
  • the dynamic voltage regulation and frequency modulation device 617 is configured to monitor the working state of the entire convolution operation device and dynamically adjust its voltage and frequency.
  • the specific calculation method of the above convolution operation device is described by different operation instructions.
  • the operation instruction here is exemplified by a convolution calculation instruction; the convolution calculation instruction can be applied to a neural network, so it can also be called a convolutional neural network instruction.
  • the formula that it actually needs to execute can be: S = s(Σ_i w_i · x_i + b), in which
  • the convolution kernel w (which may include a plurality of data) is multiplied by the input data x_i, the products are summed, the offset b is then optionally added, and the activation operation s(h) is then optionally performed to obtain the final output result S.
  • the calculation topology can be obtained as a multiplier-adder-(optional) activation operator.
  • the convolution calculation instruction may include an instruction set including a convolutional neural network COMPUTE instruction having different functions, and a CONFIG instruction, an IO instruction, a NOP instruction, a JUMP instruction, and a MOVE instruction.
  • the COMPUTE instruction includes:
  • a convolution operation instruction, according to which the convolution operation device extracts input data of a specified size and a convolution kernel from a specified address of a memory (preferably a scratch pad memory or a scalar register file), and performs the convolution operation in the convolution operation unit;
  • a convolutional neural network sigmoid instruction, according to which the convolution operation device respectively extracts input data of a specified size and a convolution kernel from a specified address of a memory (preferably a scratch pad memory or a scalar register file), performs the convolution operation in the convolution operation unit, and then activates the output result with sigmoid;
  • a convolutional neural network TanH instruction, according to which the convolution operation device respectively extracts input data of a specified size and a convolution kernel from a designated address of a memory (preferably a scratch pad memory), performs the convolution operation in the convolution operation unit, and then activates the output result with TanH;
  • a convolutional neural network ReLU instruction, according to which the convolution operation device respectively extracts input data of a specified size and a convolution kernel from a specified address of a memory (preferably a scratch pad memory), performs the convolution operation in the convolution operation unit, and then activates the output result with ReLU;
  • a convolutional neural network group instruction, according to which the convolution operation device respectively extracts input data of a specified size and a convolution kernel from a designated address of a memory (preferably a scratch pad memory), divides them into groups, performs the convolution operation in the convolution operation unit, and then optionally activates the output result.
  • the CONFIG command is used to configure the various constants required for the current layer calculation before each layer of artificial neural network calculation begins.
  • the IO instruction is used to read the input data required for calculation from the external storage space and store the data back to the external space after the calculation is completed.
  • the NOP instruction is responsible for clearing the control signals in all control signal buffer queues of the current device, and ensuring that all instructions before the NOP instruction are all completed.
  • the NOP instruction itself does not contain any operations;
  • the JUMP instruction is used to control the jump of the next instruction address to be read from the instruction storage unit, and is used to implement the jump of the control flow;
  • the MOVE instruction is used to carry data at one address in the internal address space of the convolution operation device to another address in the internal address space of the convolution operation device; the process is independent of the operation unit and does not occupy its resources during execution.
  • the method for executing the convolution calculation instruction by the convolution operation device may specifically be:
  • the control unit 615 extracts the convolution calculation instruction, the operation domain corresponding to the convolution calculation instruction, and the first calculation topology corresponding to the convolution calculation instruction (multiplier - adder - adder - activation operator) from the register unit 612;
  • the control unit transmits the operation domain to the data access unit 616 and the first calculation topology to the interconnection module 613;
  • the data access unit 616 extracts the convolution kernel w and the offset b corresponding to the operation domain from the storage medium 611 (when b is 0, the offset b does not need to be extracted), and transmits the convolution kernel w and the offset b to the operation unit 614.
  • the multiplier of the operation unit 614 performs the multiplication operation on the convolution kernel w and the input data Xi to obtain the first result, and inputs the first result to the adder to perform the addition operation to obtain the second result; the second result and the offset b
  • are then added to obtain the third result;
  • the third result is input to the activation operator to perform an activation operation to obtain an output result s
  • the output result s is transmitted to the data access unit 616 for storage in the storage medium 611.
  • Optionally, the result can be transmitted directly to the data access unit 616 and stored in the storage medium 611 without the following steps.
  • the step of performing the addition of the second result and the offset b to obtain the third result is optional, that is, when b is 0, this step is not required.
  • the order of addition and multiplication operations can be reversed.
  • the first result may include a result of a plurality of multiplication operations.
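  • As a concrete illustration of the multiplier - adder - (adder) - activation dataflow walked through above, the following minimal Python sketch reproduces the arithmetic the convolution calculation instruction expresses; it is not the device's hardware implementation, and the function names are hypothetical.

```python
import math

def conv_instruction(x, w, b=0.0, activation=None):
    """Sketch of S = act(sum_i(w_i * x_i) + b), the dataflow named above."""
    # Multiplier: elementwise products (the "first result" may hold many products).
    products = [wi * xi for wi, xi in zip(w, x)]
    # Adder: accumulate the products into the "second result".
    acc = sum(products)
    # Optional second addition with the offset b, giving the "third result".
    if b:
        acc += b
    # Optional activation operator s(h).
    return activation(acc) if activation else acc

sigmoid = lambda h: 1.0 / (1.0 + math.exp(-h))
print(conv_instruction([1.0, 2.0], [0.5, -0.25], b=0.1, activation=sigmoid))
```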
  • an embodiment of the present invention provides a neural network processor including the above convolution operation device.
  • the above neural network processor is used to perform artificial neural network operations, and implements artificial intelligence applications such as speech recognition, image recognition, and translation.
  • the dynamic voltage modulation and frequency modulation device 617 in FIG. 4F works as follows:
  • Case 1: in the process of performing the convolution operation, the dynamic voltage modulation and frequency modulation device 617 in FIG. 4F acquires the running speeds of the data access unit 616 and the operation unit 614 of the neural network processor in real time. When the dynamic voltage modulation and frequency modulation device 617 determines from these running speeds that the running time of the data access unit 616 exceeds the running time of the operation unit 614, it can determine that during the convolution operation the data access unit 616 has become the bottleneck.
  • the operation unit 614 needs to wait for the data access unit 616 to execute the read task and transmit the read data to the operation unit 614.
  • the arithmetic unit 614 can perform a convolution operation operation based on the data transmitted by the data access unit 616 mentioned above.
  • the dynamic voltage modulation and frequency modulation device 617 sends the first voltage frequency regulation information to the operation unit 614, where the first voltage frequency regulation information is used to instruct the operation unit 614 to lower the operating voltage or the operating frequency thereof to reduce the operating speed of the operation unit 614.
  • In this way, the running speed of the operation unit 614 matches the running speed of the data access unit 616, which reduces the power consumption of the operation unit 614, prevents the operation unit 614 from idling, and finally reduces
  • the overall operating power consumption of the neural network processor without affecting the completion time of the task.
  • Case 2: in the process of performing the convolution operation, the dynamic voltage modulation and frequency modulation device 617 acquires the running speeds of the data access unit 616 and the operation unit 614 of the neural network processor in real time.
  • When the dynamic voltage modulation and frequency modulation device 617 determines from the running speeds of the data access unit 616 and the operation unit 614 that the running time of the operation unit 614 exceeds the running time of the data access unit 616, it can determine that during the convolution operation
  • the operation unit 614 has become the bottleneck.
  • In this case, the data access unit 616 must wait for the operation unit 614 to complete the current convolution operation before it can transmit the data it has read.
  • the dynamic voltage modulation and frequency modulation device 617 sends the second voltage frequency regulation information to the data access unit 616, where the second voltage frequency regulation information is used to instruct the data access unit 616 to lower its working voltage or operating frequency so as to reduce the running speed of the data access unit 616,
  • so that the running speed of the data access unit 616 matches the running speed of the operation unit 614; this reduces the power consumption of the data access unit 616, prevents the data access unit 616 from idling, and finally reduces the overall operating power consumption of the neural network processor without affecting the completion time of the task. The same matching rule, applied in both directions, is sketched below.
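  • Cases 1 and 2 apply the same rule in opposite directions: whichever unit finishes first is throttled so the two pipelines stay matched. A hedged sketch of that decision, with hypothetical timing inputs, follows.

```python
def dvfs_decision(access_time: float, compute_time: float) -> str:
    """Pick which unit to slow down, per Cases 1 and 2 above.

    access_time: measured running time of the data access unit 616.
    compute_time: measured running time of the operation unit 614.
    """
    if access_time > compute_time:
        # Case 1: memory access is the bottleneck -> slow the operation unit.
        return "first regulation info -> operation unit: lower voltage/frequency"
    if compute_time > access_time:
        # Case 2: computation is the bottleneck -> slow the data access unit.
        return "second regulation info -> data access unit: lower voltage/frequency"
    return "balanced: no regulation needed"

print(dvfs_decision(access_time=4.0, compute_time=2.5))
```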
  • the neural network processor performs an artificial neural network operation.
  • the dynamic voltage regulation and frequency modulation device 617 collects in real time the working parameters of the artificial intelligence application run by the neural network processor, and adjusts
  • the working voltage or operating frequency of the neural network processor according to these working parameters.
  • the artificial intelligence application may be video image processing, object recognition, machine translation, voice recognition, image beauty, and the like.
  • When the neural network processor performs video image processing, the dynamic voltage regulation and frequency modulation device 617 collects the frame rate of the video image processing in real time.
  • When the frame rate exceeds the target frame rate, the target frame rate being the video image processing frame rate normally required by the user, the dynamic voltage modulation and frequency modulation device 617 sends the third voltage frequency regulation information to the neural network processor,
  • where the third voltage frequency regulation information is used to instruct the neural network processor to reduce its working voltage or operating frequency, reducing the power consumption of the neural network processor while still satisfying the user's normal video image processing requirements.
  • Case 4: when the neural network processor performs speech recognition, the dynamic voltage modulation and frequency modulation device 617 collects the speech recognition speed of the neural network processor in real time. When the speech recognition speed of the neural network processor exceeds the speech recognition speed actually required by the user, the dynamic voltage regulation and frequency modulation device 617 sends the fourth voltage frequency regulation information to the neural network processor, where the fourth voltage frequency regulation information is used to instruct the neural
  • network processor to reduce its working voltage or operating frequency, reducing the power consumption of the neural network processor while still satisfying the user's normal speech recognition requirements.
  • the dynamic voltage modulation and frequency modulation device 617 monitors in real time the working state of each unit or module in the neural network processor (including the storage medium 611, the register unit 612, the interconnection module 613, the operation unit 614, the controller unit 615, and the data access unit 616).
  • When a unit or module is in an idle state, the dynamic voltage regulation and frequency modulation device sends the fifth voltage frequency regulation information to that unit or module to reduce its working voltage or operating frequency, further reducing the power consumption of the unit or module.
  • When the unit or module returns to a working state, the dynamic voltage regulation and frequency modulation device sends the sixth voltage frequency regulation information to the unit or module to raise its working voltage or operating frequency, so that the running speed of the unit or module meets the demands of the work.
  • FIG. 4G is a schematic flowchart of a method for performing a forward operation of a single-layer convolutional neural network according to an embodiment of the present application, where the method is applied to the convolution operation device. As shown in FIG. 4G, the method includes the following steps:
  • S702: the operation starts; the controller unit reads the IO instruction from the first address of the instruction storage unit, and according to the decoded control signal, the data access unit reads all the corresponding convolutional neural network operation instructions from the external address space and caches them in the instruction storage unit;
  • the controller unit then reads the next IO instruction from the instruction storage unit, and according to the decoded control signal, the data access unit reads all data required by the main operation module from the external address space to the main a first storage unit of the computing module;
  • the controller unit then reads the next IO instruction from the instruction storage unit, and the data access unit reads the convolution kernel data required by the operation module from the external address space according to the decoded control signal;
  • the controller unit then reads the next CONFIG instruction from the instruction storage unit, and the convolution operation device configures various constants required for the calculation of the layer neural network according to the decoded control signal;
  • the controller unit then reads the next COMPUTE instruction from the instruction storage unit, and according to the decoded control signal, the main operation module first sends the input data within the convolution window to the N slave operation modules through the interconnection module,
  • where it is saved to the second storage units of the N slave operation modules, and then moves the convolution window according to the instruction;
  • according to the control signal decoded from the COMPUTE instruction, the operation units of the N slave operation modules read the convolution kernel from the third storage unit, read the input data from the second storage unit, complete the convolution operation of the input data and the convolution kernel, and return the resulting output scalars through the interconnection module;
  • in the interconnection module, the output scalars returned by the N slave operation modules are successively spliced into a complete intermediate vector;
  • the main operation module obtains the intermediate vectors returned by the interconnection module; after the convolution window has traversed all the input data, the main operation module splices all the returned vectors into an intermediate result; according to the control signal decoded from the COMPUTE instruction, it reads the offset data from the first
  • storage unit, adds the offset to the intermediate result by the vector addition unit, then the activation unit activates the result, and the final output data is written back to the first storage unit;
  • the controller unit reads the next IO instruction from the instruction storage unit, and according to the decoded control signal, the data access unit stores the output data in the first storage unit to the specified address in the external address space; the operation then ends. The whole single-layer flow is sketched below.
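  • The following Python sketch condenses the dataflow of the steps above: the convolution window slides over the input, each of N hypothetical slave modules contributes one output scalar per window position, the scalars form an intermediate vector, and the offset and activation finish the layer. Array shapes and names are illustrative assumptions, not the device's actual storage layout.

```python
import numpy as np

def conv_layer_forward(x, kernels, bias, act=np.tanh):
    """x: (H, W) input plane; kernels: (N, kh, kw), one kernel per slave
    module; bias: (N,). Returns (out_h, out_w, N) output data."""
    N, kh, kw = kernels.shape
    out_h, out_w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((out_h, out_w, N))
    for i in range(out_h):                 # move the convolution window
        for j in range(out_w):
            window = x[i:i + kh, j:j + kw]
            # each "slave module" returns one output scalar per window;
            # together they are spliced into the intermediate vector
            out[i, j] = [np.sum(window * k) for k in kernels]
    return act(out + bias)                 # offset, then activation

x = np.random.rand(6, 6)
print(conv_layer_forward(x, np.random.rand(4, 3, 3), np.zeros(4)).shape)
```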
  • the method further includes:
  • the working state information of the convolution operation device includes an operating speed of the convolution operation device
  • the voltage frequency regulation information includes first voltage frequency regulation information
  • sending the voltage frequency regulation information to the convolution operation device according to the working state information of the convolution operation device includes:
  • sending the first voltage frequency regulation information to the convolution operation device when the running speed of the convolution operation device is greater than a target speed, where the first voltage frequency regulation information is used to instruct the convolution operation device to reduce its operating frequency or working voltage, and the target speed is the running speed of the chip when the user's needs are met.
  • the working state information of the convolution operation device includes an operation speed of the data access unit and an operation speed of the main operation unit
  • the voltage frequency regulation information includes second voltage frequency regulation information
  • the second voltage frequency regulation information is used to instruct the main operation unit to lower its operating frequency or operating voltage.
  • the voltage frequency regulation information includes third voltage frequency regulation information
  • the sending the voltage frequency regulation information to the convolution operation device according to the working state information of the convolution operation device further includes:
  • the third voltage frequency regulation information is used to instruct the data access unit to reduce its operating frequency or operating voltage.
  • the working state information of the convolution operation device includes the working state information of at least S units/modules among the instruction storage unit, the controller unit, the data access unit, the interconnection module, the main operation module, and the N slave operation modules,
  • where S is an integer greater than 1 and less than or equal to N+5;
  • the voltage frequency regulation information includes fourth voltage frequency regulation information; according to the working state information of the convolution operation device,
  • sending the voltage frequency regulation information to the convolution operation device further includes: sending the fourth voltage frequency regulation information to a unit A according to the working state information of the unit A,
  • where the unit A is any one of the at least S units/modules.
  • the voltage frequency regulation information includes the fifth voltage frequency regulation information
  • the sending the voltage frequency regulation information to the convolution operation device according to the working state information of the convolution operation device further includes:
  • a method for performing a forward operation of a multi-layer convolutional neural network, comprising: performing the single-layer neural network forward operation method shown in FIG. 4G for each layer; after the execution of the previous layer of the convolutional neural network is completed, the operation instruction of this layer uses the output data address of the previous layer stored in the main operation module as the input data address of this layer, and the convolution kernel address and the offset data address in the instruction are changed to the addresses corresponding to this layer, as sketched below.
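  • The multi-layer method is just the single-layer routine applied in a loop, with each layer's output becoming the next layer's input; a minimal sketch, assuming each layer is packaged as a callable with its own parameters baked in:

```python
def multilayer_forward(first_input, layers):
    """Chain single-layer forward passes, as in the multi-layer method above."""
    data = first_input
    for layer in layers:
        # The output data "address" of the previous layer becomes the input
        # of this layer; kernel and offset addresses are likewise switched
        # to this layer's parameters (here: baked into each callable).
        data = layer(data)
    return data

doubler = lambda xs: [2 * v for v in xs]   # stand-in for a real layer
print(multilayer_forward([1, 2, 3], [doubler, doubler]))  # [4, 8, 12]
```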
  • an image compression method and related apparatus which can train a compressed neural network for image compression, and improve the effectiveness of image compression and the accuracy of recognition.
  • FIG. 5A provides a neural network operation process according to the present application.
  • the dotted arrow indicates the reverse operation
  • the solid arrow indicates the forward operation.
  • In the forward operation, when the execution of the previous layer of the artificial neural network is completed, the output neurons obtained by the previous layer are used as the input neurons of the next layer (or some operations are performed on these output neurons
  • before they are used as the input neurons of the next layer); at the same time, the weights are replaced with the weights of the next layer.
  • In the reverse operation, when the reverse operation of the previous layer is completed, the input neuron gradient obtained by the previous layer is used as the output neuron gradient of the next layer (or some operations are performed on the input neuron
  • gradient before it is used as the output neuron gradient of the next layer); at the same time, the weights are replaced with the weights of the next layer.
  • the forward propagation phase of the neural network corresponds to the forward operation, which is the process of inputting data input to output data.
  • the back propagation phase corresponds to the reverse operation, in which the error between the final result data and the expected output data is propagated backwards layer by layer;
  • the weights of each layer are corrected and adjusted according to the error gradient. This is also the process of neural network learning and training, which reduces the network output error.
  • There is no limitation on the type of the compression training atlas of the compressed neural network or on the number of training images included in each type of training atlas.
  • the compressed training atlas may include multiple dimensions such as images of multiple angles, images of multiple light intensities, or images acquired by multiple different types of image acquisition devices.
  • When the compressed neural network is trained on compression training atlases corresponding to the different dimensions above, the effectiveness of image compression in different situations is improved and the application range of the image compression method is expanded.
  • Each training image in the compression training atlas includes label information.
  • The specific content of the label information is not limited in this application; it marks the image portion to be trained and can be used to detect whether the compressed neural network has completed training.
  • For example, for a driving image whose tag information is the target license plate information,
  • the driving image is input to the compressed neural network to obtain a compressed image,
  • and the compressed image is identified based on the recognition neural network model to obtain reference license plate information. If the reference license plate information matches the target license plate information, it can be determined that the compressed neural network has completed training; otherwise, when the current number of training iterations of the compressed neural network is less than the preset threshold, the compressed neural network continues to be trained.
  • the application does not limit the type of tag information, and may be license plate information, face information, traffic sign information, object classification information, and the like.
  • the recognition neural network model involved in the present application is the data obtained when the training of the recognition neural network used for image recognition is completed. The training method for the recognition neural network is not limited; training may be performed with a Batch Gradient Descent (BGD) algorithm, Stochastic Gradient Descent (SGD), or mini-batch SGD, and one training period is completed by a single forward operation and reverse gradient propagation.
  • Each training image in the identification training atlas includes tag information whose type is consistent with the type of the target tag information of the training images in the compression training atlas. That is to say, the recognition neural network model can identify the compressed images output by the compressed neural network (whether still being trained or having completed training).
  • For example, if the type of the tag information of the compression training images is license plate,
  • the type of the tag information of the identification training images at least includes license plate, thereby ensuring that the recognition neural network model can recognize the compressed images output by the compressed neural network and obtain the license plate information.
  • Optionally, the compression training atlas at least includes the identification training atlas.
  • In this way, the accuracy of the recognition neural network model can be improved, thereby improving the training efficiency of the compressed neural network, that is, the effectiveness of image compression.
  • FIG. 5B is a schematic flowchart of an image compression method according to an embodiment of the present application. As shown in FIG. 5B, the image compression method includes the following steps:
  • Step S201 Acquire an original image of the first resolution.
  • the first resolution is the input resolution of the compressed neural network
  • the second resolution is smaller than the first resolution and is the output resolution of the compressed neural network; that is, the compression ratio of images input to the compressed neural network (the ratio of the second resolution to the first resolution) is fixed, so the same compression ratio is obtained when different images are compressed based on the same compressed neural network model.
  • the original image is any training image in the compressed training map set of the compressed neural network, and the label information of the original image is used as the target label information.
  • the application does not limit how the tag information is obtained; it may be obtained by manual marking, or by inputting the original image into the recognition neural network and performing recognition based on the recognition neural network model.
  • Step S202 compress the original image based on the target model to obtain a compressed image of the second resolution.
  • the target model is the current neural network model of the compressed neural network, that is, the target model is the current parameter of the compressed neural network. Compressing the original image with a resolution equal to the input resolution of the compressed neural network based on the target model yields a compressed image having a resolution equal to the output resolution of the compressed neural network.
  • the compressing the original image based on the target model to obtain the compressed image of the second resolution comprises: identifying the original image based on the target model to obtain a plurality of image information; and based on the target model And compressing the original image with the plurality of image information to obtain the compressed image.
  • As described above, the training images include multiple dimensions.
  • When the original image is identified based on the target model, the image information corresponding to each dimension can be determined, and the original image is compressed according to each piece of image information,
  • thereby improving the accuracy of image compression in different dimensions.
  • Step S203 Identify the compressed image based on the recognition neural network model to obtain reference label information.
  • the present application does not limit the identification method, and may include two parts: feature extraction and feature recognition, and the result of feature recognition is used as reference label information.
  • For example, after a driving image is compressed, the reference label information corresponding to the driving compressed image is the license plate number; after a face image is compressed, the reference tag information corresponding to the face compressed image is the face recognition result.
  • identifying the compressed image by using the recognition neural network model to obtain the reference label information comprises: preprocessing the compressed image to obtain an image to be identified; and identifying the image to be identified based on the recognition neural network model to obtain the reference tag information.
  • the preprocessing includes, but is not limited to, any one or more of the following: data format conversion processing (eg, normalization processing, integer data conversion, etc.), data deduplication processing, data exception processing, data missing padding processing, and the like.
  • the acquiring the original image of the first resolution comprises: receiving an input image; and preprocessing the input image to obtain the original image.
  • the compression efficiency of image compression can be improved by preprocessing the input image.
  • the preprocessing described above also includes size processing, since a neural network has a fixed size requirement: it can only process images whose size equals the basic image size of the neural network.
  • the basic image size of the compressed neural network is taken as the first basic image size, and the basic image size of the recognition neural network as the second basic image size; that is, the compressed neural network requires the size of its input image to be equal to the first basic image size, and
  • the recognition neural network requires that the size of the input image be equal to the second basic image size.
  • the compressed neural network may compress the image to be compressed that satisfies the first basic image size to obtain a compressed image; the recognition neural network may identify the image to be identified that satisfies the second basic image size to obtain reference tag information.
  • the specific manner of the size processing is not limited, and may include a method of cutting or filling pixels, a method of scaling according to a basic image size, a down sampling method for an input image, and the like.
  • Pixel cropping removes non-critical information areas around the image;
  • the downsampling process reduces the sampling rate of a specific signal; for example, four adjacent pixel points are averaged as the value of one pixel at the corresponding position of the processed image, thereby reducing the size of the image, as in the sketch below.
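  • For the four-pixel averaging just mentioned, a minimal NumPy sketch of a 2x2 average downsample follows (it assumes a single-channel image and drops odd edge rows/columns).

```python
import numpy as np

def downsample_2x2(img: np.ndarray) -> np.ndarray:
    """Average each 2x2 block of adjacent pixels into one output pixel."""
    h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2   # drop odd edges
    img = img[:h, :w]
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

img = np.arange(16, dtype=float).reshape(4, 4)
print(downsample_2x2(img))   # 2x2 output; each value is the mean of 4 neighbours
```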
  • the preprocessing of the compressed image to obtain the image to be identified includes: when the image size of the compressed image is smaller than the basic image size of the recognition neural network, filling the compressed image with pixel points according to the basic image size to obtain the image to be identified.
  • the present application does not limit the pixel point, and may correspond to any color mode, for example: rgb (0, 0, 0).
  • the specific position of the pixel padding is not limited and may be any position outside the compressed image; that is, the compressed image itself is not processed, but the image is expanded by filling pixel points, so the compressed image is not deformed, which helps improve the efficiency and accuracy of image recognition.
  • For example, the compressed image is placed at the upper left of the image to be recognized, and the positions of the image to be recognized outside the compressed image are filled with pixel points, as sketched below.
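  • A hedged sketch of that padding step: the compressed image is kept intact at the upper left of a canvas of the basic image size, and the remaining positions are filled with a constant pixel value (zero here, standing in for a color such as rgb(0, 0, 0)); the helper name is hypothetical.

```python
import numpy as np

def pad_to_base_size(img: np.ndarray, base_h: int, base_w: int,
                     fill: float = 0.0) -> np.ndarray:
    """Place img at the upper left of a base_h x base_w canvas and fill
    the remaining positions with a constant pixel value."""
    h, w = img.shape[:2]
    assert h <= base_h and w <= base_w, "image already exceeds base size"
    canvas = np.full((base_h, base_w) + img.shape[2:], fill, dtype=img.dtype)
    canvas[:h, :w] = img          # the compressed image itself is untouched
    return canvas

small = np.ones((2, 3))
print(pad_to_base_size(small, 4, 4))
```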
  • the preprocessing of the input image to obtain the original image includes: when the image size of the input image is smaller than the first basic image size of the compressed neural network, filling
  • the input image with pixel points according to the first basic image size to obtain the original image.
  • The original image obtained by pixel point filling can be compressed and then identified by the recognition neural network to obtain reference label information; the pixel point filling does not change the compression ratio of the input image, which helps improve the efficiency and accuracy of training the compressed neural network.
  • Step S204 Acquire a loss function according to the target tag information and the reference tag information.
  • the loss function is used to describe the magnitude of the error between the target tag information and the reference tag information.
  • Since the tag information includes multiple dimensions, the loss function is generally calculated using a squared difference formula: E = Σ_{k=1}^{c} (t_k − y_k)²,
  • where c is the dimension of the tag information,
  • t_k is the kth dimension of the reference tag information,
  • and y_k is the kth dimension of the target tag information.
  • Step S205 Determine whether the loss function converges to the first threshold or whether the current training number of the compressed neural network is greater than or equal to the second threshold. If yes, go to step S206; if no, go to step S207.
  • the training period corresponding to each training image is completed by a single forward operation and reverse gradient propagation; the threshold of the loss function is set as the first threshold, and the threshold of the number
  • of training iterations of the compressed neural network is set as the second threshold. That is, if the loss function converges to the first threshold or the number of training iterations is greater than or equal to the second threshold, the training of the compressed neural network is completed, and the target model is used as the compressed neural network model corresponding to the trained compressed neural network.
  • the present application is not limited to the reverse training method of the compressed neural network.
  • the apparatus includes an instruction cache unit 21, a controller unit 22, a direct memory access unit 23, an H-tree module 24, a main operation module 25, and a plurality of slave operation modules 26, all of which can be implemented by hardware circuits (for example, an application specific integrated circuit, ASIC).
  • the instruction cache unit 21 reads instructions through the direct memory access unit 23 and caches the read instructions; the controller unit 22 reads instructions from the instruction cache unit 21 and translates them into microinstructions that control the behavior of the other modules,
  • such as the direct memory access unit 23, the main operation module 25, and the slave operation modules 26; the direct memory access unit 23 can access the external address space, directly read and write data to each cache unit inside the device, and complete the loading and storing of data.
  • FIG. 5F shows the structure of the H-tree module 24.
  • the H-tree module 24 constitutes a data path between the main operation module 25 and the plurality of slave operation modules 26, and has an H-tree structure.
  • the H-tree is a binary tree path composed of multiple nodes. Each node sends the upstream data to the downstream two nodes in the same way, and the data returned by the two downstream nodes are combined and returned to the upstream node. For example, in the inverse operation of the neural network, the vectors returned by the two downstream nodes are added to a vector at the current node and returned to the upstream node.
  • the output gradient vector partial sums output by each slave operation module 26 are added pairwise, stage by stage, in the H-tree module 24; that is, all the output gradient vector partial sums are summed to form the final output gradient vector, as in the sketch below.
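  • A small Python sketch of that pairwise combining: each tree level halves the number of partial sums until a single final output gradient vector remains. The list-based reduction is an illustrative stand-in for the hardware tree.

```python
import numpy as np

def htree_reduce(partial_sums):
    """Add partial sums pairwise, level by level, as the H-tree does."""
    level = list(partial_sums)
    while len(level) > 1:
        # pair up neighbours; an odd leftover is passed up unchanged
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])
        level = nxt
    return level[0]

parts = [np.array([1.0, 2.0]), np.array([3.0, 4.0]),
         np.array([5.0, 6.0]), np.array([7.0, 8.0])]
print(htree_reduce(parts))   # [16. 20.], the final output gradient vector
```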
  • FIG. 5G is a schematic structural diagram of the main operation module 25.
  • the main operation module 25 includes an operation unit 251, a data dependency determination unit 252, and a neuron buffer unit 253.
  • the neuron buffer unit 253 is configured to buffer the input data and the output data used by the main operation module 25 in the calculation process.
  • the arithmetic unit 251 performs various arithmetic functions of the main arithmetic module.
  • the data dependency judging unit 252 is the port through which the operation unit 251 reads and writes the neuron buffer unit 253, and at the same time it ensures that there is no consistency conflict in the reading and writing of data in the neuron buffer unit 253.
  • Specifically, the data dependency determining unit 252 determines whether there is a dependency between a microinstruction that has not yet been executed and the data of a microinstruction that is being executed; if not, the microinstruction is allowed to be issued immediately;
  • otherwise, the microinstruction is allowed to be issued only after all the microinstructions on which it depends have been executed.
  • For example, all microinstructions sent to the data dependency unit 252 are stored in an instruction queue inside the data dependency unit 252; in this queue, if the range of data read by a read instruction conflicts with the range of data written by a write instruction that is ahead of it in the queue, the read instruction must wait until the write instruction it depends on has been executed. The sketch below illustrates this range-overlap check.
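  • A hedged sketch of that consistency check: a read is held back whenever its address range overlaps the write range of any earlier instruction still in the queue. The (start, end) range representation is an illustrative assumption.

```python
def can_issue(read_range, pending_writes):
    """read_range: (start, end) addresses a queued read instruction covers.
    pending_writes: (start, end) ranges of earlier, still-unfinished writes.
    The read may issue only if it overlaps none of them."""
    r_start, r_end = read_range
    for w_start, w_end in pending_writes:
        if r_start < w_end and w_start < r_end:    # ranges overlap
            return False                           # wait for the write
    return True

print(can_issue((100, 132), [(0, 64), (96, 128)]))   # False: conflict
print(can_issue((200, 232), [(0, 64), (96, 128)]))   # True: no conflict
```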
  • the data dependency determination unit 252 is also responsible for reading the input gradient vector from the neuron buffer unit 253 and sending it to the slave operation modules 26 through the H-tree module 24, while the output data of the slave operation modules 26 is sent directly to the operation unit 251 through the H-tree module 24.
  • The instructions output by the controller unit 22 are sent to the operation unit 251 and the data dependency determination unit 252 to control their behavior.
  • FIG. 5H is a schematic structural diagram of the operation module 26.
  • each slave operation module 26 includes an operation unit 261, a data dependency determination unit 262, a neuron buffer unit 263, a weight buffer unit 264, and a weight gradient buffer unit 265.
  • the arithmetic unit 261 receives the micro-instructions issued by the controller unit 22 and performs arithmetic logic operations.
  • the data dependency determination unit 262 is responsible for the read and write operations on the cache unit in the calculation process.
  • the data dependency judging unit 262 ensures that there is no consistency conflict in the reading and writing of the cache units. Specifically, it determines whether there is a dependency between a microinstruction that has not yet been executed and the data of a microinstruction that is being executed; if not, the microinstruction is allowed to be issued immediately; otherwise, the microinstruction is allowed to be issued only after all the microinstructions on which it depends have been executed.
  • Likewise, all microinstructions sent to the data dependency unit 262 are stored in an instruction queue inside the data dependency unit 262; in this queue, if the range of data read by a read instruction conflicts with the range of data written by a write instruction that is ahead of it in the queue, the read instruction must wait until the write instruction it depends on has been executed.
  • the neuron buffer unit 263 buffers the input gradient vector data and the output gradient vector partial sums calculated by the slave operation module 26.
  • the weight buffer unit 264 buffers the weight vectors required by the slave operation module 26 in the calculation process; for each slave operation module 26, only the columns of the weight matrix corresponding to that slave operation module 26 are stored.
  • the weight gradient buffer unit 265 buffers the weight gradient data required by the corresponding slave module in updating the weights.
  • The weight gradient data stored by each slave operation module 26 corresponds to the weight vector it stores.
  • In computing the output gradient vector, each slave operation module computes only the products of the corresponding partial scalar elements of in_gradient with the corresponding columns of the weight matrix w; each output vector obtained is a partial sum of the final result, and these partial sums are added pairwise, stage by stage, in the H-tree to obtain the final result. The computation process thus becomes a parallel partial-sum computation followed by an accumulation process.
  • Each of the slave arithmetic modules 26 calculates a partial sum of the output gradient vectors and performs a summation operation in the H-tree module 24 to obtain the final output gradient vector.
  • Each slave arithmetic module 26 simultaneously multiplies the input gradient vector by the output value of each layer in the forward operation to calculate a weight gradient to update the weight stored in the slave arithmetic module 26.
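  • A NumPy sketch of the two per-module computations just described: the i-th module multiplies its scalar in_gradient[i] by its stored weight column to produce a partial sum of the output gradient vector, and multiplies in_gradient[i] by the saved forward input to produce its weight update gradient. The column-per-module layout follows the storage scheme above; the function name is hypothetical.

```python
import numpy as np

def slave_module_step(i, in_gradient, w_col_i, forward_input):
    """Work of the i-th slave module in one reverse step (names illustrative).

    in_gradient: input gradient vector of layer n (one scalar per module).
    w_col_i: the weight-matrix column stored by module i.
    forward_input: input neurons of layer n saved from the forward operation.
    """
    partial_out_gradient = in_gradient[i] * w_col_i    # summed in the H-tree
    dw_i = in_gradient[i] * forward_input              # weight update gradient
    return partial_out_gradient, dw_i

in_grad = np.array([0.1, -0.2, 0.3])
col = np.array([1.0, 2.0])        # column [w_i0, w_i1] held by module 0
x_fwd = np.array([0.5, 0.25])     # forward-pass input of layer n
print(slave_module_step(0, in_grad, col, x_fwd))
```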
  • Forward operation and reverse training are the two main processes of neural network algorithms. To train (update) the weights in the network, the neural network first computes the forward output of the input vector in the network composed of the current weights; this is the forward process. Then, the weights of each layer are trained (updated) layer by layer according to the difference between the output value and the label value of the input vector. During the forward computation, the output vector of each layer and the derivative value of the activation function are saved; these data are required by the reverse training process, so they are guaranteed to exist when reverse training begins.
  • the output value of each layer in the forward operation is data that already exists at the beginning of the reverse operation; it can be cached in the main operation module by the direct memory access unit and sent to the slave operation modules through the H-tree.
  • the main operation module 25 performs subsequent calculation based on the output gradient vector, for example, multiplying the output gradient vector by the derivative of the activation function in the forward operation to obtain the input gradient value of the next layer.
  • the derivative of the activation function in the forward operation is data that already exists at the beginning of the reverse operation, and can be cached in the main operation module by the direct memory access unit.
  • an instruction set for performing an artificial neural network forward operation on the aforementioned apparatus includes the CONFIG instruction, the COMPUTE instruction, the IO instruction, the NOP instruction, the JUMP instruction, and the MOVE instruction, where:
  • the CONFIG command is used to configure various constants required for current layer calculation before each layer of artificial neural network calculation begins;
  • IO instruction which is used to read input data required for calculation from an external address space and store the data back to the external space after the calculation is completed;
  • the NOP instruction is responsible for clearing the microinstructions currently loaded into all internal microinstruction buffer queues, and ensuring that all instructions before the NOP instruction are all completed.
  • the NOP instruction itself does not contain any operations;
  • the JUMP instruction is responsible for the jump of the address of the next instruction that the controller will read from the instruction cache unit, and is used to implement jumps in the control flow;
  • the MOVE instruction is used to carry data of an address in the internal address space of the device to another address in the internal address space of the device.
  • the process is independent of the operation unit and does not occupy the resources of the operation unit during execution.
  • FIG. 5I is a block diagram of an example of the reverse training of the compressed neural network provided by the embodiment of the present application.
  • the output gradient vector input gradient of the upper layer in Fig. 5I is multiplied by the corresponding activation function derivative to obtain the input data of this layer, and then multiplied by the weight matrix to obtain the output gradient vector.
  • the slave operation module 26 multiplies the input gradient by the input neurons in the forward operation to calculate the weight update gradient dw, and then uses w, dw, and the weight update gradient dw' used in the last weight update to update the
  • weight w according to the learning rate set by the instruction.
  • Specifically, the input gradient ([input gradient0, ..., input gradient3] in FIG. 5I) is the output gradient vector of the (n+1)th layer; it is first multiplied by the derivative values of the nth layer in the forward operation ([f'(out0), ..., f'(out3)] in FIG. 5I) to obtain the input gradient vector of the nth layer. This is completed in the main operation module 25, sent through the H-tree module 24 to the slave operation modules 26, and temporarily stored in the neuron buffer units 263 of the slave operation modules 26. Then, the input gradient vector is multiplied by the weight matrix to obtain the output gradient vector of the nth layer.
  • In this process, the i-th slave operation module computes the product of the i-th scalar in the input gradient vector and the column vector [w_i0, ..., w_iN] of the weight matrix; the resulting output vectors are
  • added pairwise, stage by stage, in the H-tree module 24 to obtain the final output gradient vector output gradient ([output gradient0, ..., output gradient3] in FIG. 5I).
  • At the same time, the weight update gradient is computed as dw_ij = x_j × in_gradient_i, where x_j is the jth element of the input vector of the nth layer in the forward operation, and in_gradient_i is the ith element of the input gradient vector of the nth layer in the reverse operation (that is, the product of input gradient and the derivative f' in FIG. 5I).
  • the input of the nth layer in the forward operation is the data existing at the beginning of the reverse training, and is sent to the slave arithmetic module 26 through the H-tree module 24 and temporarily stored in the neuron buffer unit 263.
  • After completing the calculation of the partial sums of the output gradient vector, the slave operation module 26 multiplies the i-th scalar of the input gradient vector by the input vector of the nth layer of the forward operation to obtain the weight update gradient vector dw, and then updates the weight accordingly, as sketched below.
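  • Putting the pieces of FIG. 5I together, the following hedged NumPy sketch performs one layer's reverse step: multiply the incoming gradient by the saved activation derivative, form the output gradient through the weight matrix, compute dw, and update w with a learning rate. The matrix orientation and the plain gradient-descent update are illustrative assumptions.

```python
import numpy as np

def reverse_layer_step(output_grad_next, f_prime, W, x_forward, lr=0.01):
    """One layer of reverse training, following FIG. 5I.

    output_grad_next: gradient arriving from layer n+1.
    f_prime: saved activation derivatives f'(out) from the forward pass.
    W: weight matrix of layer n (rows = inputs, cols = outputs; assumed).
    x_forward: input neurons of layer n saved from the forward pass.
    """
    in_gradient = output_grad_next * f_prime   # main module: elementwise product
    out_gradient = W @ in_gradient             # slave modules + H-tree summation
    dw = np.outer(x_forward, in_gradient)      # dw_ij = x_j * in_gradient_i (transposed layout)
    W -= lr * dw                               # weight update with learning rate
    return out_gradient, W

W = np.random.rand(4, 3)
g, W = reverse_layer_step(np.ones(3), np.full(3, 0.5), W, np.random.rand(4))
print(g.shape, W.shape)   # (4,) (4, 3)
```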
  • For example, an IO instruction is pre-stored at the first address of the instruction cache unit. The controller unit reads this IO instruction from the first address of the instruction cache unit, and according to the translated microinstruction, the direct memory access unit reads
  • from the external address space all the instructions related to the single-layer artificial neural network reverse training and caches them in the instruction cache unit. The controller unit then reads the next IO instruction from the instruction cache unit; according to the translated
  • microinstruction, the direct memory access unit reads all the data required by the main operation module from the external address space into the neuron buffer unit of the main operation module, where the data includes the input neurons, the activation function derivative values from the previous forward operation, and the input gradient vector.
  • The controller unit then reads the next IO instruction from the instruction cache unit;
  • according to the translated microinstruction, the direct memory access unit reads all the weight data and weight gradient data required by the slave operation modules from the external address space, and stores them respectively in the weight buffer units and weight gradient buffer units of the corresponding slave operation modules. The controller unit then reads the next CONFIG instruction from the instruction cache unit, and according to the parameters in the translated microinstruction, the operation units configure the values of their internal registers, including the various constants required by the calculation of this layer of the neural network, the precision of the calculation, and the learning rate used in updating the weights.
  • the controller unit then reads the next COMPUTE instruction from the instruction cache unit and, according to the decoded microinstruction, the main operation module sends the input gradient vector and the input neurons from the forward operation through the H-tree module to each slave operation module, where they are stored in the neuron cache unit of the slave operation module; according to the microinstruction decoded from the COMPUTE instruction, the operation unit of each slave operation module reads the weight vector (i.e., the partial columns of the weight matrix stored by that slave module) from the weight buffer unit, completes the vector-multiply-scalar operation of the weight vector and the input gradient vector, and returns the output vector partial sums through the H-tree; meanwhile, the slave operation module multiplies the input gradient vector by the input neurons to obtain the weight gradient, which is stored in the weight gradient buffer unit.
  • in the H-tree module, the output gradient partial sums returned by each slave operation module are added pairwise, stage by stage, to obtain the complete output gradient vector; the main operation module obtains the return value of the H-tree module and, according to the microinstruction decoded from the COMPUTE instruction, reads the activation function derivative values from the forward operation from the neuron cache unit, multiplies them by the returned output gradient vector to obtain the input gradient vector for the next layer of reverse training, and writes it back to the neuron buffer unit.
  • the controller unit then reads the next COMPUTE instruction from the instruction cache unit and, according to the decoded microinstruction, each slave operation module reads the weight w from the weight buffer unit, reads the weight gradient dw and the weight gradient dw' used in the previous weight update from the weight gradient buffer unit, and updates the weight w; the controller unit then reads the next IO instruction from the instruction cache unit and, according to the decoded microinstruction, the direct memory access unit stores the output gradient vector in the neuron cache unit to the specified address in the external address space, and the operation ends.
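  • Summarizing the flow above, the following is a hedged pseudocode listing of the instruction sequence (the mnemonics IO, CONFIG, and COMPUTE follow the description; the operand summaries are illustrative simplifications, not the exact microinstruction encoding):

```python
# Hypothetical sketch of the single-layer reverse-training instruction sequence.
program = [
    ("IO",      "load all reverse-training instructions into the instruction cache"),
    ("IO",      "load input neurons, activation derivatives, input gradient into the main module"),
    ("IO",      "load weights and weight gradients into each slave module"),
    ("CONFIG",  "set layer constants, calculation precision, and the weight-update learning rate"),
    ("COMPUTE", "broadcast the input gradient; slaves compute partial output gradients and dw"),
    ("COMPUTE", "slaves update weights using w, dw, and the previous update dw'"),
    ("IO",      "store the output gradient vector back to external memory"),
]
for opcode, effect in program:
    print(f"{opcode:8s} {effect}")
```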
  • for a multi-layer artificial neural network, the implementation process is similar to that of a single-layer network: after the previous layer finishes executing, the operation instruction of the next layer takes the output gradient vector calculated in the main operation module as the input gradient vector for the next layer's training and performs the above calculation process again, with the weight address and weight gradient address in the instruction changed to the addresses corresponding to that layer.
  • support for the reverse training of multi-layer artificial neural networks is thereby effectively improved; by using dedicated on-chip buffers for multi-layer neural network reverse training, the reusability of the input neurons and weight data is fully exploited, which avoids repeatedly reading these data from memory, reduces the memory access bandwidth, and prevents memory bandwidth from becoming a performance bottleneck for multi-layer artificial neural network training.
  • Step S206: acquire a target original image of the first resolution, and compress the target original image based on the compressed neural network model to obtain a target compressed image of the second resolution.
  • the target original image is an image whose type matches the tag information of the training images (i.e., an image belonging to the same data set); if the loss function converges to the first threshold, or the training count is greater than or equal to the second threshold, the compressed neural network has completed training, and the target original image can be input directly into the compressed neural network for image compression to obtain the target compressed image, which can then be recognized by the recognition neural network.
  • the method further includes: identifying the target compressed image based on the recognition neural network model to obtain the tag information of the target original image, and storing that tag information; in this way the compressed image is identified by the recognition neural network model, which improves efficiency and accuracy over manual identification of tag information.
  • Step S207: update the target model according to the loss function to obtain an updated model, use the updated model as the target model, use the next training image as the original image, and execute step S202.
  • the loss function is obtained from the reference tag value produced by the trained recognition neural network model and the target tag value carried by the original image; when the loss function satisfies the preset condition, or the current training count of the compressed neural network exceeds the preset threshold, training is completed; otherwise, the weights are repeatedly adjusted by training the compressed neural network, that is, the image content represented by each pixel of the same image is adjusted so as to reduce the loss of the compressed neural network; the compressed neural network model obtained through training is then used for image compression, which improves the effectiveness of image compression and thus the accuracy of recognition.
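  • The loop below is a minimal, hypothetical sketch of this training flow (the callable signatures, the thresholds passed as arguments, and the convergence test `loss <= first_threshold` are illustrative assumptions, not the exact configuration of this embodiment):

```python
from typing import Any, Callable, Iterable, Tuple

def train_compression_network(
    compress: Callable[[Any], Any],        # target model: first resolution -> second resolution
    recognize: Callable[[Any], Any],       # trained recognition network model (held fixed)
    update: Callable[[float], None],       # adjusts the compression weights from the loss
    dataset: Iterable[Tuple[Any, Any]],    # (original_image, target_tag_information) pairs
    loss_fn: Callable[[Any, Any], float],
    first_threshold: float,
    second_threshold: int,
) -> None:
    """Hypothetical sketch of the compressed-network training loop described above."""
    training_count = 0
    for original_image, target_tag in dataset:
        compressed_image = compress(original_image)   # compress to the second resolution
        reference_tag = recognize(compressed_image)   # recognition model yields reference tag
        loss = loss_fn(reference_tag, target_tag)
        training_count += 1
        # Training completes when the loss converges to the first threshold or the
        # training count reaches the second threshold
        if loss <= first_threshold or training_count >= second_threshold:
            break
        update(loss)                                  # otherwise keep adjusting the target model
```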
  • FIG. 5J is a schematic structural diagram of an image compression apparatus 300 according to an embodiment of the present disclosure.
  • the image compression apparatus 300 includes a processor 301 and a memory 302.
  • the memory 302 is configured to store the first threshold, the second threshold, the current neural network model and training count of the compressed neural network, the compressed training atlas of the compressed neural network, the tag information of each training image in the compressed training atlas, the recognition neural network model, and the compressed neural network model, with the current neural network model of the compressed neural network serving as the target model; the compressed neural network model is the target model obtained when training of the compressed neural network is completed, and the recognition neural network model is the corresponding neural network model obtained when training of the recognition neural network is completed.
  • the processor 301 is configured to: acquire an original image of the first resolution, where the original image is any training image in the compressed training atlas, and use the tag information of the original image as the target tag information; compress the original image to obtain a compressed image of the second resolution, the second resolution being smaller than the first resolution; identify the compressed image based on the recognition neural network model to obtain reference tag information; obtain the loss function according to the target tag information and the reference tag information; when the loss function converges to the first threshold, or the training count is greater than or equal to the second threshold, acquire the target original image of the first resolution and confirm that the target model is the compressed neural network model; and compress the target original image based on the compressed neural network model to obtain the target compressed image of the second resolution.
  • the processor 301 is further configured to: when the loss function does not converge to the first threshold, or when the training count is less than the second threshold, update the target model according to the loss function to obtain an updated model, use the updated model as the target model, use the next training image as the original image, and perform the step of acquiring an original image of the first resolution.
  • the processor 301 is specifically configured to preprocess the compressed image to obtain an image to be identified, and to identify the image to be identified based on the recognition neural network model to obtain the reference tag information.
  • the preprocessing includes size processing; the memory 302 is further configured to store the basic image size of the recognition neural network; and the processor 301 is specifically configured to: when the image size of the compressed image is smaller than the basic image size, pad the compressed image with pixels according to the basic image size to obtain the image to be identified, as sketched below.
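  • A minimal sketch of this size processing, assuming the padding fills the right and bottom borders with zero-valued pixels (the fill position and fill value are illustrative assumptions):

```python
import numpy as np

def pad_to_basic_size(compressed: np.ndarray, basic_h: int, basic_w: int) -> np.ndarray:
    """Pad a compressed image up to the recognition network's basic image size."""
    h, w = compressed.shape[:2]
    if h >= basic_h and w >= basic_w:
        return compressed  # already at least the basic size; no padding needed
    pad_h = max(basic_h - h, 0)
    pad_w = max(basic_w - w, 0)
    # Fill the missing rows/columns with zero pixels (assumed fill value)
    pad_spec = ((0, pad_h), (0, pad_w)) + ((0, 0),) * (compressed.ndim - 2)
    return np.pad(compressed, pad_spec, mode="constant", constant_values=0)
```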
  • the compressed training atlas includes at least a recognition training atlas; the processor 301 is further configured to train the recognition neural network with the recognition training atlas to obtain the recognition neural network model, where each training image in the recognition training atlas includes at least tag information consistent with the type of the target tag information.
  • the processor 301 is further configured to identify the target compressed image based on the recognition neural network model to obtain the tag information of the target original image; the memory 302 is also configured to store the tag information of the target original image.
  • the compressed training atlas includes multiple dimensions; the processor 301 is specifically configured to: identify the original image based on the target model to obtain multiple pieces of image information, each dimension corresponding to one piece of image information; and compress the original image based on the target model and the multiple pieces of image information to obtain the compressed image.
  • in this way, the compressed image of the original image is obtained based on the target model, the reference tag information of the compressed image is obtained based on the recognition neural network model, and the loss function is obtained from the target tag information carried by the original image and the reference tag information; when the loss function converges to the first threshold, training of the compressed neural network used for image compression is completed, the target model is used as the compressed neural network model, and the target compressed image of the target original image can be acquired based on the compressed neural network model.
  • the loss function is obtained from the reference tag value produced by the trained recognition neural network model and the target tag value carried by the original image; training is completed as soon as the loss function satisfies the preset condition or the current training count of the compressed neural network exceeds the preset threshold; otherwise, the weights are repeatedly adjusted by training the compressed neural network, that is, the image content represented by each pixel in the same image is adjusted, which reduces the loss of the compressed neural network and improves the effectiveness of image compression, thereby helping to improve the accuracy of recognition.
  • an electronic device 400 is provided.
  • the electronic device 400 includes an image compression device 300.
  • the electronic device 400 includes a processor 401, a memory 402, a communication interface 403, and one or more programs 404, where the one or more programs 404 are stored in the memory 402 and configured to be executed by the processor 401, and the programs include instructions for performing some or all of the steps described in the image compression method above.
  • each of the above units or modules may be a circuit, including a digital circuit, an analog circuit, and the like.
  • physical implementations of the various unit or module structures described above include, but are not limited to, physical devices such as transistors, memristors, and the like.
  • the above chip or the above neural network processor may be any suitable hardware processor such as a CPU, GPU, FPGA, DSP, ASIC, and the like.
  • the storage unit may be any suitable magnetic storage medium or magneto-optical storage medium, such as Resistive Random Access Memory (RRAM), Dynamic Random Access Memory (DRAM), Static Random Access Memory (SRAM), Enhanced Dynamic Random Access Memory (EDRAM), High Bandwidth Memory (HBM), Hybrid Memory Cube (HMC), and so on.
  • This application can be used in numerous general purpose or special purpose computing system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and distributed computing environments including any of the above systems or devices.
  • the present application provides a chip that includes the foregoing computing device, which is capable of performing multiple operations on weights and input neurons simultaneously, thereby achieving diversification of operations; in addition, by using dedicated on-chip caches for the multi-layer artificial neural network operation algorithm, the reusability of the input neurons and weight data is fully exploited, which avoids repeatedly reading these data from memory, reduces the memory access bandwidth, and prevents memory bandwidth from becoming a performance bottleneck for multi-layer artificial neural network operations and their training algorithms.
  • an embodiment of the present invention provides a chip package structure including the above neural network processor.
  • an embodiment of the present invention provides a board that includes the above chip package structure.
  • an embodiment of the present invention provides an electronic device that includes the above board.
  • the above electronic devices include, but are not limited to, robots, computers, printers, scanners, tablets, smart terminals, mobile phones, driving recorders, navigators, sensors, webcams, cloud servers, cameras, video cameras, projectors, watches, headphones, mobile storage, wearable devices, vehicles, household appliances, and medical equipment.
  • the vehicles include airplanes, ships, and/or cars; the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lamps, gas stoves, and range hoods; and the medical equipment includes nuclear magnetic resonance instruments, B-mode ultrasound scanners, and/or electrocardiographs.
  • the disclosed terminal and method may be implemented in other manners.
  • the device embodiments described above are merely illustrative.
  • the division of the above units is only a logical function division; in actual implementation there may be other division manners, for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
  • the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, or an electrical, mechanical, or other form of connection.
  • the units described above as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units; some or all of the units may be selected according to actual needs to achieve the objectives of the embodiments of the present invention.
  • each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • the above integrated unit can be implemented in the form of hardware or in the form of a software functional unit.
  • if the above integrated unit is implemented in the form of a software functional unit and sold or used as a stand-alone product, it may be stored in a computer readable storage medium; based on this understanding, the essence of the technical solution of the present invention, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium.
  • the software product includes a number of instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the various embodiments of the present invention.
  • the foregoing storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.


Abstract

Provided are a processing method and apparatus. The method involves: quantizing weights and input neurons separately to determine a weight dictionary, a weight codebook, a neuron dictionary, and a neuron codebook; and determining an operation codebook according to the weight codebook and the neuron codebook. Because the operation codebook is determined from the quantized data and the two types of quantized data are combined, data processing is facilitated.

Description

Processing method and apparatus

Technical field

The present application relates to the field of data processing, and in particular to a processing method and apparatus, and an operation method and apparatus.

Background

Neural networks have been applied with great success. However, the large-scale parameters and large-scale computation of neural networks pose a huge challenge to their application. On the one hand, large-scale parameters place high demands on storage capacity and lead to substantial memory access energy consumption. On the other hand, large-scale computation places high demands on the design of the operation unit and leads to substantial computational energy consumption. Therefore, how to reduce the parameters and the amount of computation of a neural network has become an urgent problem to be solved.

Summary of the invention

The purpose of the present application is to provide a processing method and apparatus, and an operation method and apparatus, so as to solve at least one of the above technical problems.

In an aspect of the present application, a processing method is provided, including:

quantizing the weights and the input neurons separately, to determine a weight dictionary, a weight codebook, a neuron dictionary, and a neuron codebook; and

determining an operation codebook according to the weight codebook and the neuron codebook.
In a possible embodiment of the present application, quantizing the weights includes the steps of:

grouping the weights, performing a clustering operation on each group of weights with a clustering algorithm, and dividing each group of weights into m classes, where m is a positive integer and each class of weights corresponds to one weight index, to determine the weight dictionary, where the weight dictionary includes weight positions and weight indices, and a weight position refers to the position of a weight in the neural network structure; and

replacing all the weights of each class with one center weight, to determine the weight codebook, where the weight codebook includes weight indices and center weights.
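As an illustration, the following minimal sketch performs the weight quantization just described under simplifying assumptions (a single weight group, a small hand-rolled K-means, and NumPy arrays; none of these choices are mandated by the present application):

```python
import numpy as np

def quantize_weights(weights: np.ndarray, m: int, iters: int = 20, seed: int = 0):
    """Cluster one group of weights into m classes; return the weight dictionary
    (weight position -> weight index) and the weight codebook (index -> center weight)."""
    rng = np.random.default_rng(seed)
    flat = weights.ravel()
    centers = rng.choice(flat, size=m, replace=False)   # initial center weights
    for _ in range(iters):
        # Assign every weight to the nearest center (its weight index)
        idx = np.abs(flat[:, None] - centers[None, :]).argmin(axis=1)
        for k in range(m):
            if np.any(idx == k):
                centers[k] = flat[idx == k].mean()      # minimizes sum (w_i - w0)^2
    weight_dictionary = idx.reshape(weights.shape)      # weight position -> weight index
    weight_codebook = centers                           # weight index -> center weight
    return weight_dictionary, weight_codebook
```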
In a possible embodiment of the present application, quantizing the input neurons includes the steps of:

dividing the input neurons into p segments, where each segment of input neurons corresponds to one neuron range and one neuron index, to determine the neuron dictionary, where p is a positive integer; and

encoding the input neurons, and replacing all the input neurons of each segment with one center neuron, to determine the neuron codebook.
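Similarly, a hedged sketch of the input-neuron quantization, assuming the p segments are equal-width intervals over the observed neuron value range and the center neuron is the segment midpoint (both illustrative assumptions):

```python
import numpy as np

def quantize_neurons(neurons: np.ndarray, p: int):
    """Divide the input-neuron value range into p segments; return the neuron
    indices for each input neuron and the neuron codebook (index -> center neuron)."""
    lo, hi = neurons.min(), neurons.max()
    edges = np.linspace(lo, hi, p + 1)                     # p neuron ranges
    idx = np.clip(np.digitize(neurons, edges[1:-1]), 0, p - 1)
    centers = (edges[:-1] + edges[1:]) / 2                 # one center neuron per segment
    return idx, centers                                    # neuron indices, neuron codebook
```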
In a possible embodiment of the present application, determining the operation codebook specifically includes the steps of:

determining the corresponding weight index in the weight codebook according to the weight, and then determining the center weight corresponding to that weight through the weight index;

determining the corresponding neuron index in the neuron codebook according to the input neuron, and then determining the center neuron corresponding to that input neuron through the neuron index; and

performing the operation on the center weight and the center neuron to obtain the operation result, and composing the operation results into a matrix, thereby determining the operation codebook.
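Combining the two codebooks, the operation codebook can be sketched as a precomputed table, here assuming multiplication as the operation (addition and pooling follow the same pattern):

```python
import numpy as np

def build_operation_codebook(weight_codebook: np.ndarray,
                             neuron_codebook: np.ndarray) -> np.ndarray:
    """Precompute every center-weight x center-neuron product into an m x p matrix;
    entry [i, j] is the operation result for weight index i and neuron index j."""
    return np.outer(weight_codebook, neuron_codebook)

# At run time a table read then replaces the multiplication:
# out = op_codebook[weight_index, neuron_index]
```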
In a possible embodiment of the present application, the operation includes at least one of the following: addition, multiplication, and pooling, where pooling includes average pooling, maximum pooling, and median pooling.

In a possible embodiment of the present application, the method further includes the step of retraining the weights and the input neurons; during retraining, only the weight codebook and the neuron codebook are trained, while the contents of the weight dictionary and the neuron dictionary remain unchanged, and the retraining uses a back-propagation algorithm.

In a possible embodiment of the present application, grouping the weights includes:

grouping into one group, in which all the weights in the neural network are grouped into one group;

layer-type grouping, in which the weights of all convolutional layers, the weights of all fully connected layers, and the weights of all long short-term memory (LSTM) network layers in the neural network are each divided into one group;

inter-layer grouping, in which the weights of one or more convolutional layers, the weights of one or more fully connected layers, and the weights of one or more LSTM network layers in the neural network are each divided into one group; and

intra-layer grouping, in which the weights within one layer of the neural network are segmented, and each segmented part is divided into one group.

In a possible embodiment of the present application, the clustering algorithm includes K-means, K-medoids, Clara, and/or Clarans.
In a possible embodiment of the present application, the method for selecting the center weight corresponding to each class includes determining the value of w0 that minimizes the cost function J(w, w0); that value of w0 is the center weight, where

J(w, w0) = Σ_{i=1}^{n} (w_i − w0)²

in which J(w, w0) is the cost function, w denotes all the weights in the class, w0 is the center weight, n is the number of weights in the class, w_i is the i-th weight in the class, 1 ≤ i ≤ n, and i is a positive integer.
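As a brief aside, minimizing this quadratic cost has a closed form: setting the derivative with respect to w0 to zero shows that the center weight is the mean of the weights in the class:

```latex
\frac{\partial J}{\partial w_0} = -2\sum_{i=1}^{n}(w_i - w_0) = 0
\quad\Longrightarrow\quad
w_0 = \frac{1}{n}\sum_{i=1}^{n} w_i
```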
In another aspect of the present application, a processing apparatus is provided, including:

a memory for storing operation instructions; and

a processor for executing the operation instructions in the memory, operating in accordance with the foregoing processing method when the operation instructions are executed.

In a possible embodiment of the present application, the operation instruction is a binary number including an operation code and an address code, where the operation code indicates the operation the processor is about to perform, and the address code indicates the address in the memory from which the processor reads the data participating in the operation.

In still another aspect of the present application, an operation apparatus is provided, including:

an instruction control unit for decoding received instructions and generating lookup control information; and

a lookup table unit for looking up output neurons from the operation codebook according to the lookup control information and the received weight dictionary, neuron dictionary, operation codebook, weights, and input neurons.

In a possible embodiment of the present application, the weight dictionary includes weight positions and weight indices; the neuron dictionary includes input neurons and neuron indices; and the operation codebook includes weight indices, neuron indices, and the operation results of input neurons and weights.
In a possible embodiment of the present application, the operation apparatus further includes:

a preprocessing unit for preprocessing externally input information to obtain the weights, input neurons, instructions, weight dictionary, neuron dictionary, and operation codebook;

a storage unit for storing the input neurons, weights, weight dictionary, neuron dictionary, operation codebook, and instructions, and for receiving output neurons;

a cache unit for caching the instructions, input neurons, weights, weight indices, neuron indices, and output neurons; and

a direct memory access unit for reading and writing data or instructions between the storage unit and the cache unit.

In a possible embodiment of the present application, the cache unit includes:

an instruction cache for caching the instructions and outputting the cached instructions to the instruction control unit;

a weight cache for caching the weights;

an input neuron cache for caching the input neurons; and

an output neuron cache for caching the output neurons output by the lookup table unit.
In a possible embodiment of the present application, the cache unit further includes:

a weight index cache for caching the weight indices; and

a neuron index cache for caching the neuron indices.

In a possible embodiment of the present application, when preprocessing the externally input information, the preprocessing unit is specifically used for segmentation, Gaussian filtering, binarization, regularization, and/or normalization.
In a possible embodiment of the present application, the lookup table unit includes:

a multiplication lookup table, used to input a weight index in1 and a neuron index in2 and complete, through the table lookup operation mult_lookup, the multiplication of the center weight data1 corresponding to the weight index and the center neuron data2 corresponding to the neuron index; that is, the table lookup operation out = mult_lookup(in1, in2) completes the multiplication function out = data1 * data2; and/or

an addition lookup table, used to complete, according to an input index in and through a stepwise addition lookup table via the table lookup operation add_lookup, the addition of the center data corresponding to the index, where in and data are vectors of length N and N is a positive integer, that is, the table lookup operation out = add_lookup(in) completes the addition function out = data[1] + data[2] + ... + data[N]; and/or used to input a weight index in1 and a neuron index in2 and complete, through the table lookup operation, the addition of the center weight data1 corresponding to the weight index and the center neuron data2 corresponding to the neuron index, that is, the table lookup operation out = add_lookup(in1, in2) completes the addition function out = data1 + data2; and/or

a pooling lookup table, used to complete the pooling operation on the center data corresponding to an input index; that is, the table lookup operation out = pool_lookup(in) completes the pooling operation out = pool(data), where the pooling operation includes average pooling, maximum pooling, and median pooling.
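For illustration, the multiplication lookup can be sketched as a single table read, assuming the two-dimensional operation codebook built earlier (the table layout is an illustrative assumption):

```python
import numpy as np

def mult_lookup(op_codebook: np.ndarray, in1: int, in2: int) -> float:
    """out = mult_lookup(in1, in2): return the precomputed product data1 * data2
    for weight index in1 and neuron index in2."""
    return op_codebook[in1, in2]

# Usage: with op_codebook = np.outer(weight_codebook, neuron_codebook),
# mult_lookup(op_codebook, 3, 7) returns weight_codebook[3] * neuron_codebook[7]
# without performing any multiplication at inference time.
```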
In a possible embodiment of the present application, the instructions are neural network dedicated instructions, which include:

control instructions for controlling the neural network execution process;

data transfer instructions for completing data transfer between different storage media, with data formats including matrices, vectors, and scalars;

operation instructions for completing the arithmetic operations of the neural network, including matrix operation instructions, vector operation instructions, scalar operation instructions, convolutional neural network operation instructions, fully connected neural network operation instructions, pooling neural network operation instructions, RBM neural network operation instructions, LRN neural network operation instructions, LCN neural network operation instructions, LSTM neural network operation instructions, RNN neural network operation instructions, RELU neural network operation instructions, PRELU neural network operation instructions, SIGMOID neural network operation instructions, TANH neural network operation instructions, and MAXOUT neural network operation instructions; and

logic instructions for completing the logical operations of the neural network, including vector logic operation instructions and scalar logic operation instructions.

In a possible embodiment of the present application, the neural network dedicated instructions include at least one Cambricon instruction, where a Cambricon instruction includes an operation code and operands, and the Cambricon instructions include:

Cambricon control instructions for controlling the execution process, the Cambricon control instructions including jump instructions and conditional branch instructions;

Cambricon data transfer instructions for completing data transfer between different storage media, including load instructions, store instructions, and move instructions, where the load instructions load data from main memory into the cache, the store instructions store data from the cache into main memory, and the move instructions move data between caches, between a cache and a register, or between registers;

Cambricon operation instructions for completing neural network arithmetic operations, including Cambricon matrix operation instructions, Cambricon vector operation instructions, and Cambricon scalar operation instructions, where the Cambricon matrix operation instructions complete matrix operations in the neural network, including matrix-multiply-vector, vector-multiply-matrix, matrix-multiply-scalar, outer product, matrix-add-matrix, and matrix-subtract-matrix operations; the Cambricon vector operation instructions complete vector operations in the neural network, including vector elementary arithmetic, vector transcendental functions, inner product, vector random generation, and maximum/minimum of a vector; and the Cambricon scalar operation instructions complete scalar operations in the neural network, including scalar elementary arithmetic and scalar transcendental functions; and

Cambricon logic instructions for the logical operations of the neural network, the Cambricon logic instructions including Cambricon vector logic operation instructions and Cambricon scalar logic operation instructions, where the Cambricon vector logic operation instructions are used for vector comparison, vector logical operations, and vector greater-than-merge operations, the vector logical operations including AND, OR, and NOT, and the Cambricon scalar logic operation instructions are used for scalar comparison and scalar logical operations.

In a possible embodiment of the present application, the Cambricon data transfer instructions support one or more of the following data organization modes: matrix, vector, and scalar; the vector elementary arithmetic includes vector addition, subtraction, multiplication, and division; the vector transcendental functions are functions that do not satisfy any polynomial equation taking polynomials as coefficients, including exponential, logarithmic, trigonometric, and inverse trigonometric functions; the scalar elementary arithmetic includes scalar addition, subtraction, multiplication, and division; the scalar transcendental functions are functions that do not satisfy any polynomial equation taking polynomials as coefficients, including exponential, logarithmic, trigonometric, and inverse trigonometric functions; the vector comparison includes greater than, less than, equal to, greater than or equal to, less than or equal to, and not equal to; the vector logical operations include AND, OR, and NOT; the scalar comparison includes greater than, less than, equal to, greater than or equal to, less than or equal to, and not equal to; and the scalar logical operations include AND, OR, and NOT.
In yet another aspect of the present application, another operation method is provided, including:

receiving weights, input neurons, instructions, a weight dictionary, a neuron dictionary, and an operation codebook;

decoding the instructions to determine lookup control information; and

looking up output neurons in the operation codebook according to the lookup control information, the weights, the weight dictionary, the neuron dictionary, and the input neurons.

In a possible embodiment of the present application, the weight dictionary includes weight positions and weight indices; the neuron dictionary includes input neurons and neuron indices; and the operation codebook includes weight indices, neuron indices, and the operation results of weights and input neurons.

In a possible embodiment of the present application, looking up output neurons in the operation codebook according to the lookup control information, the weights, and the input neurons includes the steps of:

according to the weights, input neurons, weight dictionary, and neuron dictionary, determining the neuron index by determining the neuron range in the neuron dictionary, and determining the weight index by determining the weight position in the weight dictionary; and

looking up the operation result in the operation codebook according to the weight index and the neuron index, to determine the output neuron.
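A hedged sketch of this two-step lookup, reusing the quantization outputs sketched earlier (the argument layout, in particular passing the segment boundaries as `neuron_edges`, is an illustrative assumption):

```python
import numpy as np

def lookup_output_neuron(weight_dictionary, neuron_edges, op_codebook,
                         weight_position, input_neuron):
    """Step 1: the dictionaries map a weight position and a raw neuron value to indices.
    Step 2: the operation codebook returns the precomputed result for that index pair."""
    w_idx = weight_dictionary[weight_position]                 # weight position -> weight index
    n_idx = int(np.clip(np.digitize(input_neuron, neuron_edges[1:-1]),
                        0, len(neuron_edges) - 2))             # neuron range -> neuron index
    return op_codebook[w_idx, n_idx]                           # operation result lookup
```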
In a possible embodiment of the present application, the operation result includes the result of at least one of the following operations: addition, multiplication, and pooling, where pooling includes average pooling, maximum pooling, and median pooling.

In a possible embodiment of the present application, before receiving the weights, input neurons, instructions, weight dictionary, neuron dictionary, and operation codebook, the method further includes the step of preprocessing externally input information to obtain the weights, input neurons, instructions, weight dictionary, neuron dictionary, and operation codebook; and

after receiving the weights, input neurons, instructions, weight dictionary, neuron dictionary, and operation codebook, the method further includes the steps of: storing the weights, input neurons, instructions, weight dictionary, neuron dictionary, and operation codebook, and receiving output neurons; and caching the instructions, input neurons, weights, and output neurons.

In a possible embodiment of the present application, after receiving the weights, input neurons, instructions, weight dictionary, neuron dictionary, and operation codebook, the method further includes the step of caching the weight indices and the neuron indices.

In a possible embodiment of the present application, the preprocessing includes segmentation, Gaussian filtering, binarization, regularization, and/or normalization.
In a possible embodiment of the present application, the instructions are neural network dedicated instructions, which include:

control instructions for controlling the neural network execution process;

data transfer instructions for completing data transfer between different storage media, with data formats including matrices, vectors, and scalars;

operation instructions for completing the arithmetic operations of the neural network, including matrix operation instructions, vector operation instructions, scalar operation instructions, convolutional neural network operation instructions, fully connected neural network operation instructions, pooling neural network operation instructions, RBM neural network operation instructions, LRN neural network operation instructions, LCN neural network operation instructions, LSTM neural network operation instructions, RNN neural network operation instructions, RELU neural network operation instructions, PRELU neural network operation instructions, SIGMOID neural network operation instructions, TANH neural network operation instructions, and MAXOUT neural network operation instructions; and

logic instructions for completing the logical operations of the neural network, including vector logic operation instructions and scalar logic operation instructions.

In a possible embodiment of the present application, the neural network dedicated instructions include at least one Cambricon instruction, where a Cambricon instruction includes an operation code and operands, and the Cambricon instructions include:

Cambricon control instructions for controlling the execution process, the Cambricon control instructions including jump instructions and conditional branch instructions;

Cambricon data transfer instructions for completing data transfer between different storage media, including load instructions, store instructions, and move instructions, where the load instructions load data from main memory into the cache, the store instructions store data from the cache into main memory, and the move instructions move data between caches, between a cache and a register, or between registers;

Cambricon operation instructions for completing neural network arithmetic operations, including Cambricon matrix operation instructions, Cambricon vector operation instructions, and Cambricon scalar operation instructions, where the Cambricon matrix operation instructions complete matrix operations in the neural network, including matrix-multiply-vector, vector-multiply-matrix, matrix-multiply-scalar, outer product, matrix-add-matrix, and matrix-subtract-matrix operations; the Cambricon vector operation instructions complete vector operations in the neural network, including vector elementary arithmetic, vector transcendental functions, inner product, vector random generation, and maximum/minimum of a vector; and the Cambricon scalar operation instructions complete scalar operations in the neural network, including scalar elementary arithmetic and scalar transcendental functions; and

Cambricon logic instructions for the logical operations of the neural network, the Cambricon logic instructions including Cambricon vector logic operation instructions and Cambricon scalar logic operation instructions, where the Cambricon vector logic operation instructions are used for vector comparison, vector logical operations, and vector greater-than-merge operations, the vector logical operations including AND, OR, and NOT, and the Cambricon scalar logic operation instructions are used for scalar comparison and scalar logical operations.

In a possible embodiment of the present application, the Cambricon data transfer instructions support one or more of the following data organization modes: matrix, vector, and scalar; the vector elementary arithmetic includes vector addition, subtraction, multiplication, and division; the vector transcendental functions are functions that do not satisfy any polynomial equation taking polynomials as coefficients, including exponential, logarithmic, trigonometric, and inverse trigonometric functions; the scalar elementary arithmetic includes scalar addition, subtraction, multiplication, and division; the scalar transcendental functions are functions that do not satisfy any polynomial equation taking polynomials as coefficients, including exponential, logarithmic, trigonometric, and inverse trigonometric functions; the vector comparison includes greater than, less than, equal to, greater than or equal to, less than or equal to, and not equal to; the vector logical operations include AND, OR, and NOT; the scalar comparison includes greater than, less than, equal to, greater than or equal to, less than or equal to, and not equal to; and the scalar logical operations include AND, OR, and NOT.
In still another aspect of the present application, another operation apparatus is provided, the operation apparatus including:

an instruction control unit for decoding received instructions and generating lookup control information; and

a lookup table unit for looking up output neurons from the operation codebook according to the lookup control information and the received weight dictionary, neuron dictionary, operation codebook, weights, and input neurons.

In a possible embodiment of the present application, the weight dictionary includes weight positions and weight indices; the neuron dictionary includes input neurons and neuron indices; and the operation codebook includes weight indices, neuron indices, and the operation results of input neurons and weights.

In a possible embodiment of the present application, the operation apparatus further includes:

a preprocessing unit for preprocessing externally input information to obtain the weights, input neurons, instructions, weight dictionary, neuron dictionary, and operation codebook;

a storage unit for storing the input neurons, weights, weight dictionary, neuron dictionary, operation codebook, and instructions, and for receiving output neurons;

a cache unit for caching the instructions, input neurons, weights, weight indices, neuron indices, and output neurons; and

a direct memory access unit for reading and writing data or instructions between the storage unit and the cache unit.
In a possible embodiment of the present application, the cache unit includes:

an instruction cache for caching the instructions and outputting the cached instructions to the instruction control unit;

a weight cache for caching the weights;

an input neuron cache for caching the input neurons; and

an output neuron cache for caching the output neurons output by the lookup table unit.

In a possible embodiment of the present application, the cache unit further includes:

a weight index cache for caching the weight indices; and

a neuron index cache for caching the neuron indices.

In a possible embodiment of the present application, the preprocessing performed by the preprocessing unit on the externally input information includes segmentation, Gaussian filtering, binarization, regularization, and/or normalization.
In a possible embodiment of the present application, the lookup table unit includes:

a multiplication lookup table, used to input a weight index in1 and a neuron index in2 and complete, through the table lookup operation mult_lookup, the multiplication of the center weight data1 corresponding to the weight index and the center neuron data2 corresponding to the neuron index; that is, the table lookup operation out = mult_lookup(in1, in2) completes the multiplication function out = data1 * data2; and/or

an addition lookup table, used to complete, according to an input index in and through a stepwise addition lookup table via the table lookup operation add_lookup, the addition of the center data corresponding to the index, where in and data are vectors of length N and N is a positive integer, that is, the table lookup operation out = add_lookup(in) completes the addition function out = data[1] + data[2] + ... + data[N]; and/or used to input a weight index in1 and a neuron index in2 and complete, through the table lookup operation, the addition of the center weight data1 corresponding to the weight index and the center neuron data2 corresponding to the neuron index, that is, the table lookup operation out = add_lookup(in1, in2) completes the addition function out = data1 + data2; and/or

a pooling lookup table, used to complete the pooling operation on the center data corresponding to an input index; that is, the table lookup operation out = pool_lookup(in) completes the pooling operation out = pool(data), where the pooling operation includes average pooling, maximum pooling, and median pooling.
In a possible embodiment of the present application, the instructions are neural network dedicated instructions, which include:

control instructions for controlling the neural network execution process;

data transfer instructions for completing data transfer between different storage media, with data formats including matrices, vectors, and scalars;

operation instructions for completing the arithmetic operations of the neural network, including matrix operation instructions, vector operation instructions, scalar operation instructions, convolutional neural network operation instructions, fully connected neural network operation instructions, pooling neural network operation instructions, RBM neural network operation instructions, LRN neural network operation instructions, LCN neural network operation instructions, LSTM neural network operation instructions, RNN neural network operation instructions, RELU neural network operation instructions, PRELU neural network operation instructions, SIGMOID neural network operation instructions, TANH neural network operation instructions, and MAXOUT neural network operation instructions; and

logic instructions for completing the logical operations of the neural network, including vector logic operation instructions and scalar logic operation instructions.

In a possible embodiment of the present application, the neural network dedicated instructions include at least one Cambricon instruction, where a Cambricon instruction includes an operation code and operands, and the Cambricon instructions include:

Cambricon control instructions for controlling the execution process, the Cambricon control instructions including jump instructions and conditional branch instructions;

Cambricon data transfer instructions for completing data transfer between different storage media, including load instructions, store instructions, and move instructions, where the load instructions load data from main memory into the cache, the store instructions store data from the cache into main memory, and the move instructions move data between caches, between a cache and a register, or between registers;

Cambricon operation instructions for completing neural network arithmetic operations, including Cambricon matrix operation instructions, Cambricon vector operation instructions, and Cambricon scalar operation instructions, where the Cambricon matrix operation instructions complete matrix operations in the neural network, including matrix-multiply-vector, vector-multiply-matrix, matrix-multiply-scalar, outer product, matrix-add-matrix, and matrix-subtract-matrix operations; the Cambricon vector operation instructions complete vector operations in the neural network, including vector elementary arithmetic, vector transcendental functions, inner product, vector random generation, and maximum/minimum of a vector; and the Cambricon scalar operation instructions complete scalar operations in the neural network, including scalar elementary arithmetic and scalar transcendental functions; and

Cambricon logic instructions for the logical operations of the neural network, the Cambricon logic instructions including Cambricon vector logic operation instructions and Cambricon scalar logic operation instructions, where the Cambricon vector logic operation instructions are used for vector comparison, vector logical operations, and vector greater-than-merge operations, the vector logical operations including AND, OR, and NOT, and the Cambricon scalar logic operation instructions are used for scalar comparison and scalar logical operations.

In a possible embodiment of the present application, the Cambricon data transfer instructions support one or more of the following data organization modes: matrix, vector, and scalar; the vector elementary arithmetic includes vector addition, subtraction, multiplication, and division; the vector transcendental functions are functions that do not satisfy any polynomial equation taking polynomials as coefficients, including exponential, logarithmic, trigonometric, and inverse trigonometric functions; the scalar elementary arithmetic includes scalar addition, subtraction, multiplication, and division; the scalar transcendental functions are functions that do not satisfy any polynomial equation taking polynomials as coefficients, including exponential, logarithmic, trigonometric, and inverse trigonometric functions; the vector comparison includes greater than, less than, equal to, greater than or equal to, less than or equal to, and not equal to; the vector logical operations include AND, OR, and NOT; the scalar comparison includes greater than, less than, equal to, greater than or equal to, less than or equal to, and not equal to; and the scalar logical operations include AND, OR, and NOT.
In still another aspect, the present application provides yet another processing method, including:
receiving a weight, an input neuron, an instruction, a weight dictionary, a neuron dictionary, and an operation codebook;
decoding the instruction to determine lookup control information; and
looking up an output neuron in the operation codebook according to the lookup control information, the weight, the weight dictionary, the neuron dictionary, and the input neuron.
In a possible embodiment of the present application, the weight dictionary includes weight positions and weight indices; the neuron dictionary includes input neurons and neuron indices; and the operation codebook includes weight indices, neuron indices, and operation results of weights and input neurons.
In a possible embodiment of the present application, looking up the output neuron in the operation codebook according to the lookup control information, the weight, and the input neuron includes the following steps (a sketch of the lookup follows the steps):
determining, according to the weight, the input neuron, the weight dictionary, and the neuron dictionary, the neuron index by determining the neuron range in the neuron dictionary, and the weight index by determining the weight position in the weight dictionary; and
looking up the operation result in the operation codebook according to the weight index and the neuron index, so as to determine the output neuron.
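As a concrete illustration of the two lookup steps above, the following minimal Python sketch replaces the weight-by-neuron operation with a table lookup. The dictionary layouts (a quantized weight value keyed to its weight index, a neuron value range keyed to its neuron index) and all names are illustrative assumptions, not the patent's concrete data structures.

```python
# Minimal sketch of the dictionary/codebook lookup; all layouts are assumed.

def lookup_output_neuron(weight, input_neuron,
                         weight_dictionary, neuron_dictionary,
                         operation_codebook):
    # weight_dictionary maps a quantized weight value (its "position" here)
    # to a weight index.
    weight_index = weight_dictionary[weight]
    # neuron_dictionary maps a neuron value range (low, high) to a neuron
    # index; the input neuron is matched to the range containing it.
    neuron_index = next(index
                        for (low, high), index in neuron_dictionary.items()
                        if low <= input_neuron < high)
    # operation_codebook holds precomputed results keyed by the index pair,
    # so the weight-by-neuron operation becomes a single table lookup.
    return operation_codebook[(weight_index, neuron_index)]

# Toy usage: the codebook stores precomputed products.
weight_dict = {0.5: 0, -1.0: 1}
neuron_dict = {(0.0, 1.0): 0, (1.0, 2.0): 1}
codebook = {(0, 0): 0.25, (0, 1): 0.75, (1, 0): -0.5, (1, 1): -1.5}
print(lookup_output_neuron(0.5, 1.2, weight_dict, neuron_dict, codebook))  # 0.75
```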
In a possible embodiment of the present application, the operation result includes the result of at least one of the following operations: addition, multiplication, and pooling, where pooling includes average pooling, max pooling, and median pooling.
In a possible embodiment of the present application, before receiving the weight, the input neuron, the instruction, the weight dictionary, the neuron dictionary, and the operation codebook, the method further includes the step of preprocessing externally input information to obtain the weight, the input neuron, the instruction, the weight dictionary, the neuron dictionary, and the operation codebook; and
after receiving the weight, the input neuron, the instruction, the weight dictionary, the neuron dictionary, and the operation codebook, the method further includes the steps of storing the weight, the input neuron, the instruction, the weight dictionary, the neuron dictionary, and the operation codebook, and receiving the output neuron; and caching the instruction, the input neuron, the weight, and the output neuron.
In a possible embodiment of the present application, after receiving the weight, the input neuron, the instruction, the weight dictionary, the neuron dictionary, and the operation codebook, the method further includes the step of caching the weight index and the neuron index.
In a possible embodiment of the present application, the preprocessing includes segmentation, Gaussian filtering, binarization, regularization, and/or normalization.
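The following minimal sketch walks through several of the preprocessing operations named above on a small array. Each step is an illustrative stand-in (assumed smoothing kernel, threshold, and tile size), not the patent's specific algorithm.

```python
import numpy as np

def preprocess(image, block=2):
    # Normalization: scale values into [0, 1].
    x = (image - image.min()) / max(image.max() - image.min(), 1e-8)
    # Gaussian filtering: a fixed 3x3 kernel applied with edge padding.
    kernel = np.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]], dtype=float) / 16
    padded = np.pad(x, 1, mode="edge")
    smoothed = sum(kernel[i, j] * padded[i:i + x.shape[0], j:j + x.shape[1]]
                   for i in range(3) for j in range(3))
    # Binarization: threshold at the mean.
    binary = (smoothed > smoothed.mean()).astype(float)
    # Segmentation: split into block x block tiles.
    h, w = binary.shape
    return [binary[i:i + block, j:j + block]
            for i in range(0, h, block) for j in range(0, w, block)]

tiles = preprocess(np.arange(16.0).reshape(4, 4))
print(len(tiles))  # 4 tiles of shape (2, 2)
```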
In a possible embodiment of the present application, the instruction is a neural network dedicated instruction, and the neural network dedicated instruction includes:
a control instruction for controlling the neural network execution process;
a data transfer instruction for completing data transfer between different storage media, where the data formats include matrix, vector, and scalar;
an operation instruction for completing arithmetic operations of the neural network, including a matrix operation instruction, a vector operation instruction, a scalar operation instruction, a convolutional neural network operation instruction, a fully connected neural network operation instruction, a pooling neural network operation instruction, an RBM neural network operation instruction, an LRN neural network operation instruction, an LCN neural network operation instruction, an LSTM neural network operation instruction, an RNN neural network operation instruction, a RELU neural network operation instruction, a PRELU neural network operation instruction, a SIGMOID neural network operation instruction, a TANH neural network operation instruction, and a MAXOUT neural network operation instruction; and
a logic instruction for completing logical operations of the neural network, including a vector logical operation instruction and a scalar logical operation instruction.
In a possible embodiment of the present application, the neural network dedicated instruction includes at least one Cambricon instruction, the Cambricon instruction includes an operation code and an operand, and the Cambricon instruction includes:
a Cambricon control instruction for controlling the execution process, where the Cambricon control instruction includes a jump instruction and a conditional branch instruction;
a Cambricon data transfer instruction for completing data transfer between different storage media, including a load instruction, a store instruction, and a move instruction, where the load instruction is used to load data from main memory into a cache, the store instruction is used to store data from a cache into main memory, and the move instruction is used to move data between caches, between a cache and a register, or between registers;
a Cambricon operation instruction for completing arithmetic operations of the neural network, including a Cambricon matrix operation instruction, a Cambricon vector operation instruction, and a Cambricon scalar operation instruction, where the Cambricon matrix operation instruction is used to complete matrix operations in the neural network, including matrix-multiply-vector, vector-multiply-matrix, matrix-multiply-scalar, outer product, matrix-add-matrix, and matrix-subtract-matrix operations; the Cambricon vector operation instruction is used to complete vector operations in the neural network, including vector elementary arithmetic, vector transcendental function, inner product, vector random generation, and maximum/minimum-of-a-vector operations; and the Cambricon scalar operation instruction is used to complete scalar operations in the neural network, including scalar elementary arithmetic and scalar transcendental function operations; and
a Cambricon logic instruction for logical operations of the neural network, where the Cambricon logic instruction includes a Cambricon vector logical operation instruction and a Cambricon scalar logical operation instruction; the Cambricon vector logical operation instruction is used for vector compare operations, vector logical operations, and vector-greater-than-merge operations, where the vector logical operations include AND, OR, and NOT; and the Cambricon scalar logical operation instruction is used for scalar compare operations and scalar logical operations.
In a possible embodiment of the present application, the Cambricon data transfer instruction supports one or more of the following data organization modes: matrix, vector, and scalar. The vector elementary arithmetic includes vector addition, subtraction, multiplication, and division; a vector transcendental function is a function that does not satisfy any polynomial equation whose coefficients are polynomials, and includes exponential functions, logarithmic functions, trigonometric functions, and inverse trigonometric functions. The scalar elementary arithmetic includes scalar addition, subtraction, multiplication, and division; a scalar transcendental function is a function that does not satisfy any polynomial equation whose coefficients are polynomials, and includes exponential functions, logarithmic functions, trigonometric functions, and inverse trigonometric functions. The vector compare operations include greater than, less than, equal to, greater than or equal to, less than or equal to, and not equal to; the vector logical operations include AND, OR, and NOT; the scalar compare operations include greater than, less than, equal to, greater than or equal to, less than or equal to, and not equal to; and the scalar logical operations include AND, OR, and NOT.
Neural networks have achieved very successful applications, but large-scale neural network parameters place high demands on storage. On the one hand, a large number of neural network parameters requires an enormous storage capacity. On the other hand, accessing large amounts of neural network data incurs enormous memory-access energy consumption.
At present, the memory that stores neural network parameters is error checking and correcting (ECC, Error Correcting Code) memory. Although ECC memory can correct errors that occur when data is read, it also incurs additional storage capacity overhead and memory-access power overhead. Neural network algorithms have a certain fault tolerance, and storing all parameters of a neural network in ECC memory ignores this fault tolerance, bringing extra storage overhead, computation overhead, and memory-access overhead. Therefore, how to select memory suitable for neural network processing in light of the fault tolerance of neural networks is a problem that urgently needs to be solved.
In still another aspect, the present application provides a storage device, including:
an accurate storage unit for storing important bits of data; and
an inaccurate storage unit for storing non-important bits of the data.
In a possible embodiment of the present application, the accurate storage unit uses ECC memory, and the inaccurate storage unit uses non-ECC memory.
In a possible embodiment of the present application, the data is a neural network parameter, including input neurons, weights, and output neurons; the accurate storage unit is used to store the important bits of the input neurons, the important bits of the output neurons, and the important bits of the weights; and the inaccurate storage unit is used to store the non-important bits of the input neurons, the non-important bits of the output neurons, and the non-important bits of the weights.
In a possible embodiment of the present application, the data includes floating-point data and fixed-point data. For floating-point data, the sign bit and the exponent part are important bits and the mantissa part is non-important bits; for fixed-point data, the sign bit and the first x bits of the numerical part are important bits and the remaining bits of the numerical part are non-important bits, where x is an integer greater than or equal to 0 and less than m, and m is the total number of bits of the data.
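Under the bit partition just stated, an IEEE-754 float32 splits into 9 important bits (sign plus exponent) and 23 non-important mantissa bits. The sketch below shows one assumed way to split and re-merge such a value; the helper names are illustrative.

```python
import struct

# A minimal sketch of the float32 bit partition: sign + exponent are the
# "important" bits, the mantissa is "non-important".

def split_float32(value):
    bits = struct.unpack('<I', struct.pack('<f', value))[0]
    important = bits >> 23                    # sign (1 bit) + exponent (8 bits)
    non_important = bits & ((1 << 23) - 1)    # mantissa (23 bits)
    return important, non_important

def merge_float32(important, non_important):
    bits = (important << 23) | non_important
    return struct.unpack('<f', struct.pack('<I', bits))[0]

hi, lo = split_float32(3.14159)
print(merge_float32(hi, lo))   # ~3.14159 (full float32 round trip)
print(merge_float32(hi, 0))    # 2.0 (magnitude survives even if the mantissa is lost)
```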
In a possible embodiment of the present application, the ECC memory includes ECC-checked DRAM and ECC-checked SRAM, and the ECC-checked SRAM uses 6T SRAM, 4T SRAM, or 3T SRAM.
In a possible embodiment of the present application, the non-ECC memory includes non-ECC-checked DRAM and non-ECC-checked SRAM, and the non-ECC-checked SRAM uses 6T SRAM, 4T SRAM, or 3T SRAM.
In a possible embodiment of the present application, the memory cell storing each bit in the 6T SRAM includes six MOS transistors, the memory cell storing each bit in the 4T SRAM includes four MOS transistors, and the memory cell storing each bit in the 3T SRAM includes three MOS transistors.
In a possible embodiment of the present application, the four MOS transistors include a first MOS transistor, a second MOS transistor, a third MOS transistor, and a fourth MOS transistor, where the first and second MOS transistors are used for gating and the third and fourth MOS transistors are used for storage. The gate of the first MOS transistor is electrically connected to the word line WL and its source to the bit line BL; the gate of the second MOS transistor is electrically connected to the word line WL and its source to the bit line BLB; the gate of the third MOS transistor is connected to the source of the fourth MOS transistor and the drain of the second MOS transistor and is connected to the working voltage through the resistor R2, while the drain of the third MOS transistor is grounded; the gate of the fourth MOS transistor is connected to the source of the third MOS transistor and the drain of the first MOS transistor and is connected to the working voltage through the resistor R1, while the drain of the fourth MOS transistor is grounded. WL is used to control gated access to the memory cell, and BL is used to read and write the memory cell.
In a possible embodiment of the present application, the three MOS transistors include a first MOS transistor, a second MOS transistor, and a third MOS transistor, where the first MOS transistor is used for gating and the second and third MOS transistors are used for storage. The gate of the first MOS transistor is electrically connected to the word line WL and its source to the bit line BL; the gate of the second MOS transistor is connected to the source of the third MOS transistor and is connected to the working voltage through the resistor R2, while the drain of the second MOS transistor is grounded; the gate of the third MOS transistor is connected to the source of the second MOS transistor and the drain of the first MOS transistor and is connected to the working voltage through the resistor R1, while the drain of the third MOS transistor is grounded. WL is used to control gated access to the memory cell, and BL is used to read and write the memory cell.
In still another aspect, the present application provides a data processing device, including:
an operation unit, an instruction control unit, and the above storage device, where the storage device is configured to receive input instructions and operation parameters, store the important bits of the operation parameters together with the instructions in the accurate storage unit, and store the non-important bits of the operation parameters in the inaccurate storage unit; the instruction control unit is configured to receive the instructions in the storage device and decode them to generate control information; and the operation unit is configured to receive the operation parameters in the storage device, perform operations according to the control information, and transfer the operation results to the storage device.
In a possible embodiment of the present application, the operation unit is a neural network processor.
In a possible embodiment of the present application, the operation parameters are neural network parameters, and the operation unit is configured to receive the input neurons and weights in the storage device, complete the neural network operation according to the control information to obtain output neurons, and transfer the output neurons to the storage device.
In a possible embodiment of the present application, the operation unit is configured to receive the important bits of the input neurons and the important bits of the weights in the storage device for computation; or the operation unit is configured to receive complete input neurons and weights obtained by splicing the important bits and the non-important bits for computation.
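The two computation modes just described can be illustrated as follows: one multiply uses only the accurately stored bits (mantissas treated as zero), the other first splices the important and non-important bits back into complete values. The bit partition repeats the float32 assumption from the previous sketch, and all names are illustrative.

```python
import struct

# Sketch of the two computation modes, under the assumed float32 partition
# (sign + exponent accurate, mantissa approximate).

def _bits(v):
    return struct.unpack('<I', struct.pack('<f', v))[0]

def _value(hi, lo):
    return struct.unpack('<f', struct.pack('<I', (hi << 23) | lo))[0]

def compute_important_only(n_hi, w_hi):
    # Multiply using only the accurately stored bits (mantissas zeroed).
    return _value(n_hi, 0) * _value(w_hi, 0)

def compute_spliced(n_hi, n_lo, w_hi, w_lo):
    # Multiply after splicing important and non-important bits back together.
    return _value(n_hi, n_lo) * _value(w_hi, w_lo)

neuron, weight = _bits(0.8), _bits(1.5)
n_hi, n_lo = neuron >> 23, neuron & 0x7FFFFF
w_hi, w_lo = weight >> 23, weight & 0x7FFFFF
print(compute_important_only(n_hi, w_hi))        # coarse: 0.5 * 1.0 = 0.5
print(compute_spliced(n_hi, n_lo, w_hi, w_lo))   # full precision: ~1.2
```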
In a possible embodiment of the present application, the device further includes: an instruction cache, arranged between the storage device and the instruction control unit, for storing dedicated instructions; an input neuron hierarchical cache, arranged between the storage device and the operation unit, for caching input neurons, where the input neuron hierarchical cache includes an accurate input neuron cache and an inaccurate input neuron cache; a weight hierarchical cache, arranged between the storage device and the operation unit, for caching weight data, where the weight hierarchical cache includes an accurate weight cache and an inaccurate weight cache; and an output neuron hierarchical cache, arranged between the storage device and the operation unit, for caching output neurons, where the output neuron hierarchical cache includes an accurate output neuron cache and an inaccurate output neuron cache.
In a possible embodiment of the present application, the device further includes a direct memory access unit (DMA) for reading and writing data or instructions among the storage device, the instruction cache, the weight hierarchical cache, the input neuron hierarchical cache, and the output neuron hierarchical cache.
In a possible embodiment of the present application, the instruction cache, the input neuron hierarchical cache, the weight hierarchical cache, and the output neuron hierarchical cache use 4T SRAM or 3T SRAM.
In a possible embodiment of the present application, the device further includes a preprocessing module for preprocessing input data and transferring it to the storage device, where the preprocessing includes segmentation, Gaussian filtering, binarization, regularization, and normalization.
In a possible embodiment of the present application, the operation unit is a general-purpose operation processor.
In still another aspect, the present application provides an electronic device including the above data processing device.
In still another aspect, the present application provides a storage method, including: accurately storing important bits of data; and inaccurately storing non-important bits of the data.
In a possible embodiment of the present application, accurately storing the important bits of the data specifically includes: extracting the important bits of the data, and storing these important bits in ECC memory for accurate storage.
In a possible embodiment of the present application, inaccurately storing the non-important bits of the data specifically includes: extracting the non-important bits of the data, and storing these non-important bits in non-ECC memory for inaccurate storage.
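A minimal sketch of this storage method follows, under the same float32 bit-partition assumption: the important bits of each parameter go to a simulated ECC region and the non-important bits to a simulated non-ECC region, with two Python dicts standing in for the physical memories.

```python
import struct

# Sketch of the storage method: 9 "important" bits (sign + exponent) to an
# ECC region, 23 mantissa bits to a non-ECC region. Everything is illustrative.

ecc_memory, non_ecc_memory = {}, {}

def store_parameter(addr, value):
    bits = struct.unpack('<I', struct.pack('<f', value))[0]
    ecc_memory[addr] = bits >> 23              # accurate (ECC) storage
    non_ecc_memory[addr] = bits & 0x7FFFFF     # inaccurate (non-ECC) storage

def load_parameter(addr):
    bits = (ecc_memory[addr] << 23) | non_ecc_memory[addr]
    return struct.unpack('<f', struct.pack('<I', bits))[0]

store_parameter(0x10, 0.8125)
print(load_parameter(0x10))   # 0.8125
```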
In a possible embodiment of the present application, the data is a neural network parameter, including input neurons, weights, and output neurons; the important bits of the input neurons, the important bits of the output neurons, and the important bits of the weights are stored accurately, while the non-important bits of the input neurons, the non-important bits of the output neurons, and the non-important bits of the weights are stored inaccurately.
In a possible embodiment of the present application, the data includes floating-point data and fixed-point data. For floating-point data, the sign bit and the exponent part are important bits and the mantissa part is non-important bits; for fixed-point data, the sign bit and the first x bits of the numerical part are important bits and the remaining bits of the numerical part are non-important bits, where x is an integer greater than or equal to 0 and less than m, and m is the total number of bits of the parameter.
In a possible embodiment of the present application, the ECC memory includes ECC-checked DRAM and ECC-checked SRAM, and the ECC-checked SRAM uses 6T SRAM, 4T SRAM, or 3T SRAM.
In a possible embodiment of the present application, the non-ECC memory includes non-ECC-checked DRAM and non-ECC-checked SRAM, and the non-ECC-checked SRAM uses 6T SRAM, 4T SRAM, or 3T SRAM.
In still another aspect, the present application provides a data processing method, including:
receiving instructions and parameters, accurately storing the important bits of the parameters together with the instructions, and inaccurately storing the non-important bits of the parameters; receiving the instructions and decoding them to generate control information; and receiving the parameters, performing operations according to the control information, and storing the operation results.
In a possible embodiment of the present application, the operation is a neural network operation, and the parameters are neural network parameters.
In a possible embodiment of the present application, receiving the parameters, performing operations according to the control information, and storing the operation results includes: receiving input neurons and weights, completing the neural network operation according to the control information to obtain output neurons, and storing or outputting the output neurons.
In a possible embodiment of the present application, receiving the input neurons and weights and completing the neural network operation according to the control information to obtain the output neurons includes: receiving the important bits of the input neurons and the important bits of the weights for computation; or receiving complete input neurons and weights obtained by splicing the important bits and the non-important bits for computation.
In a possible embodiment of the present application, the data processing method further includes: caching dedicated instructions; performing accurate caching and inaccurate caching of the input neurons; performing accurate caching and inaccurate caching of the weight data; and performing accurate caching and inaccurate caching of the output neurons.
In a possible embodiment of the present application, the operation is a general-purpose operation.
In a possible embodiment of the present application, before receiving the instructions and parameters, accurately storing the important bits of the parameters together with the instructions, and inaccurately storing the non-important bits of the parameters, the method further includes: preprocessing input data and storing it, where the preprocessing includes segmentation, Gaussian filtering, binarization, regularization, and normalization.
In still another aspect, the present application provides a storage unit, where the storage unit is 4T SRAM or 3T SRAM and is used to store neural network parameters.
In a possible embodiment of the present application, the memory cell storing each bit in the 4T SRAM includes four MOS transistors, and the memory cell storing each bit in the 3T SRAM includes three MOS transistors.
In a possible embodiment of the present application, the four MOS transistors include a first MOS transistor, a second MOS transistor, a third MOS transistor, and a fourth MOS transistor, where the first and second MOS transistors are used for gating and the third and fourth MOS transistors are used for storage. The gate of the first MOS transistor is electrically connected to the word line WL and its source to the bit line BL; the gate of the second MOS transistor is electrically connected to the word line WL and its source to the bit line BLB; the gate of the third MOS transistor is connected to the source of the fourth MOS transistor and the drain of the second MOS transistor and is connected to the working voltage through the resistor R2, while the drain of the third MOS transistor is grounded; the gate of the fourth MOS transistor is connected to the source of the third MOS transistor and the drain of the first MOS transistor and is connected to the working voltage through the resistor R1, while the drain of the fourth MOS transistor is grounded. WL is used to control gated access to the memory cell, and BL is used to read and write the memory cell.
In a possible embodiment of the present application, the three MOS transistors include a first MOS transistor, a second MOS transistor, and a third MOS transistor, where the first MOS transistor is used for gating and the second and third MOS transistors are used for storage. The gate of the first MOS transistor is electrically connected to the word line WL and its source to the bit line BL; the gate of the second MOS transistor is connected to the source of the third MOS transistor and is connected to the working voltage through the resistor R2, while the drain of the second MOS transistor is grounded; the gate of the third MOS transistor is connected to the source of the second MOS transistor and the drain of the first MOS transistor and is connected to the working voltage through the resistor R1, while the drain of the third MOS transistor is grounded. WL is used to control gated access to the memory cell, and BL is used to read and write the memory cell.
In a possible embodiment of the present application, the neural network parameters include input neurons, weights, and output neurons.
With the increase of operating frequencies and the continuous development of semiconductor processes, chip power consumption has become an important consideration in deep submicron integrated circuits. Dynamic voltage and frequency scaling (DVFS) is a dynamic voltage and frequency adjustment technique now widely used in the semiconductor field; specifically, DVFS dynamically adjusts the operating frequency and voltage of a chip (for the same chip, the higher the frequency, the higher the required voltage) to save energy. However, the prior art lacks a dynamic voltage and frequency scaling method applied to intelligent chips and the design of a corresponding device, and cannot use application scenario information to adjust the voltage and frequency of a chip in advance.
In still another aspect, the present application provides a dynamic voltage and frequency scaling device, including:
an information acquisition unit for acquiring, in real time, working state information or application scenario information of a chip connected to the dynamic voltage and frequency scaling device, where the application scenario information is information obtained by the chip through neural network computation or collected by a sensor connected to the chip; and
a voltage and frequency scaling unit for sending voltage frequency regulation information to the chip according to the working state information or application scenario information of the chip, where the voltage frequency regulation information is used to instruct the chip to adjust its working voltage or working frequency.
In a possible embodiment of the present application, the working state information of the chip includes the running speed of the chip, the voltage frequency regulation information includes first voltage frequency regulation information, and the voltage and frequency scaling unit is configured to:
send the first voltage frequency regulation information to the chip when the running speed of the chip is greater than a target speed, where the first voltage frequency regulation information is used to instruct the chip to lower its working frequency or working voltage, and the target speed is the running speed of the chip when user requirements are met.
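This first rule can be stated compactly. The sketch below assumes a send() callback that delivers the regulation information to the chip, with speeds as plain numbers; both are illustrative assumptions.

```python
# Minimal sketch of the speed rule: scale down whenever the chip runs
# faster than the target speed that already satisfies the user.

def regulate_speed(running_speed, target_speed, send):
    if running_speed > target_speed:
        # First voltage frequency regulation information.
        send("lower working frequency or working voltage")

regulate_speed(1.3e9, 1.0e9, print)  # prints the scale-down instruction
```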
In a possible embodiment of the present application, the chip includes at least a first unit and a second unit, where the output data of the first unit is the input data of the second unit, the working state information of the chip includes the running speed of the first unit and the running speed of the second unit, the voltage frequency regulation information includes second voltage frequency regulation information, and the voltage and frequency scaling unit is further configured to:
send the second voltage frequency regulation information to the second unit when it is determined, according to the running speed of the first unit and the running speed of the second unit, that the running time of the first unit exceeds the running time of the second unit, where the second voltage frequency regulation information is used to instruct the second unit to lower its working frequency or working voltage.
In a possible embodiment of the present application, the voltage frequency regulation information includes third voltage frequency regulation information, and the voltage and frequency scaling unit is further configured to:
send the third voltage frequency regulation information to the first unit when it is determined, according to the running speed of the first unit and the running speed of the second unit, that the running time of the second unit exceeds the running time of the first unit, where the third voltage frequency regulation information is used to instruct the first unit to lower its working frequency or working voltage.
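Together, the second and third rules balance a two-stage pipeline: whichever unit finishes its batch sooner is told to slow down so both stages finish together. A minimal sketch, assuming measured per-batch runtimes and an illustrative send() callback:

```python
# Minimal sketch of the producer/consumer balancing rules; names are assumed.

def balance_pipeline(runtime_first, runtime_second, send):
    """runtime_*: measured time per batch for each unit; send(unit, info)
    delivers voltage frequency regulation information to a unit."""
    if runtime_first > runtime_second:
        # First unit is the bottleneck: slow the second unit down (2nd info).
        send("second_unit", "lower working frequency or voltage")
    elif runtime_second > runtime_first:
        # Second unit is the bottleneck: slow the first unit down (3rd info).
        send("first_unit", "lower working frequency or voltage")

balance_pipeline(2.0, 1.2, lambda unit, info: print(unit, "->", info))
```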
In a possible embodiment of the present application, the chip includes at least N units, where the working state information of the chip includes the working state information of at least S of the at least N units, N is an integer greater than 1, and S is an integer less than or equal to N; the voltage frequency regulation information includes fourth voltage frequency regulation information, and the voltage and frequency scaling unit is configured to:
send the fourth voltage frequency regulation information to a unit A when it is determined, according to the working state information of the unit A, that the unit A is in an idle state, where the fourth voltage frequency regulation information is used to instruct the unit A to lower its working frequency or working voltage,
where the unit A is any one of the at least S units.
In a possible embodiment of the present application, the voltage frequency regulation information includes fifth voltage frequency regulation information, and the voltage and frequency scaling unit is further configured to:
send the fifth voltage frequency regulation information to the unit A when it is determined, according to the working state information of the unit A, that the unit A is back in a working state, where the fifth voltage frequency regulation information is used to instruct the unit A to raise its working voltage or working frequency.
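The fourth and fifth rules amount to per-unit gating: scale an idle unit down and scale it back up when it resumes work. A minimal sketch with assumed state strings and an illustrative send() callback:

```python
# Minimal sketch of the per-unit idle/active gating rules; names are assumed.

def regulate_unit(unit_name, state, send):
    if state == "idle":
        # Fourth voltage frequency regulation information.
        send(unit_name, "lower working frequency or voltage")
    elif state == "working":
        # Fifth voltage frequency regulation information.
        send(unit_name, "raise working voltage or frequency")

for unit, state in {"unit_0": "idle", "unit_1": "working"}.items():
    regulate_unit(unit, state, lambda u, info: print(u, "->", info))
```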
In a possible embodiment of the present application, the application scenario of the chip is image recognition, the application scenario information is the number of objects in an image to be recognized, the voltage frequency regulation information includes sixth voltage frequency regulation information, and the voltage and frequency scaling unit is further configured to:
send the sixth voltage frequency regulation information to the chip when it is determined that the number of objects in the image to be recognized is less than a first threshold, where the sixth voltage frequency regulation information is used to instruct the chip to lower its working voltage or working frequency.
In a possible embodiment of the present application, the application scenario information is object tag information, the voltage frequency regulation information includes seventh voltage frequency regulation information, and the voltage and frequency scaling unit is further configured to:
send the seventh voltage frequency regulation information to the chip when it is determined that the object tag information belongs to a preset object tag set, where the seventh voltage frequency regulation information is used to instruct the chip to raise its working voltage or working frequency.
In a possible embodiment of the present application, the chip is applied to speech recognition, the application scenario information is a speech input rate, the voltage frequency regulation information includes eighth voltage frequency regulation information, and the voltage and frequency scaling unit is further configured to:
send the eighth voltage frequency regulation information to the chip when the speech input rate is less than a second threshold, where the eighth voltage frequency regulation information is used to instruct the chip to lower its working voltage or working frequency.
In a possible embodiment of the present application, the application scenario information is a keyword obtained by the chip through speech recognition, the voltage frequency regulation information includes ninth voltage frequency regulation information, and the voltage and frequency scaling unit is further configured to:
send the ninth voltage frequency regulation information to the chip when the keyword belongs to a preset keyword set, where the ninth voltage frequency regulation information is used to instruct the chip to raise its working voltage or working frequency.
In a possible embodiment of the present application, the chip is applied to machine translation, the application scenario information is the text input speed or the number of characters in an image to be translated, the voltage frequency regulation information includes tenth voltage frequency regulation information, and the voltage and frequency scaling unit is further configured to:
send the tenth voltage frequency regulation information to the chip when the text input speed is less than a third threshold or the number of characters in the image to be translated is less than a fourth threshold, where the tenth voltage frequency regulation information is used to instruct the chip to lower its working voltage or working frequency.
In a possible embodiment of the present application, the application scenario information is the ambient light intensity, the voltage frequency regulation information includes eleventh voltage frequency regulation information, and the voltage and frequency scaling unit is further configured to:
send the eleventh voltage frequency regulation information to the chip when the ambient light intensity is less than a fifth threshold, where the eleventh voltage frequency regulation information is used to instruct the chip to lower its working voltage or working frequency.
In a possible embodiment of the present application, the chip is applied to image beautification, the voltage frequency regulation information includes twelfth voltage frequency regulation information and thirteenth voltage frequency regulation information, and the voltage and frequency scaling unit is further configured to:
send the twelfth voltage frequency regulation information to the chip when the application scenario information is a face image, where the twelfth voltage frequency regulation information is used to instruct the chip to lower its working voltage; and
send the thirteenth voltage frequency regulation information to the chip when the application scenario information is not a face image, where the thirteenth voltage frequency regulation information is used to instruct the chip to lower its working voltage or working frequency. A consolidated sketch of these scenario rules follows.
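The scenario-driven rules above share one shape: scenario information compared against a threshold or a preset set yields a "lower" or "raise" instruction for the chip. The sketch below collects the sixth through ninth rules into one dispatch function (the remaining rules follow the same pattern); thresholds, tag sets, and keyword sets are illustrative assumptions.

```python
# Minimal sketch of the scenario-driven regulation rules; all values assumed.

FIRST_THRESHOLD = 3      # objects in the image       (6th regulation info)
SECOND_THRESHOLD = 2.0   # speech input rate, words/s (8th regulation info)
PRESET_TAGS = {"person", "vehicle"}        # preset object tag set (7th)
PRESET_KEYWORDS = {"navigate", "call"}     # preset keyword set (9th)

def regulate_by_scenario(scenario):
    kind, value = scenario
    if kind == "object_count" and value < FIRST_THRESHOLD:
        return "lower working voltage or frequency"
    if kind == "object_tag" and value in PRESET_TAGS:
        return "raise working voltage or frequency"
    if kind == "speech_rate" and value < SECOND_THRESHOLD:
        return "lower working voltage or frequency"
    if kind == "keyword" and value in PRESET_KEYWORDS:
        return "raise working voltage or frequency"
    return "keep current working voltage and frequency"

print(regulate_by_scenario(("object_count", 1)))   # lower
print(regulate_by_scenario(("keyword", "call")))   # raise
```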
In still another aspect, the present application provides a dynamic voltage and frequency scaling method, including:
acquiring, in real time, working state information or application scenario information of a chip connected to the dynamic voltage and frequency scaling device, where the application scenario information is information obtained by the chip through neural network computation or collected by a sensor connected to the chip; and
sending voltage frequency regulation information to the chip according to the working state information or application scenario information of the chip, where the voltage frequency regulation information is used to instruct the chip to adjust its working voltage or working frequency.
In a possible embodiment of the present application, the working state information of the chip includes the running speed of the chip, the voltage frequency regulation information includes first voltage frequency regulation information, and sending the voltage frequency regulation information to the chip according to the working state information or application scenario information of the chip includes:
sending the first voltage frequency regulation information to the chip when the running speed of the chip is greater than a target speed, where the first voltage frequency regulation information is used to instruct the chip to lower its working frequency or working voltage, and the target speed is the running speed of the chip when user requirements are met.
In a possible embodiment of the present application, the chip includes at least a first unit and a second unit, where the output data of the first unit is the input data of the second unit, the working state information of the chip includes the running speed of the first unit and the running speed of the second unit, and the voltage frequency regulation information includes second voltage frequency regulation information; sending the voltage frequency regulation information to the chip according to the working state information or application scenario information of the chip further includes:
sending the second voltage frequency regulation information to the second unit when it is determined, according to the running speed of the first unit and the running speed of the second unit, that the running time of the first unit exceeds the running time of the second unit, where the second voltage frequency regulation information is used to instruct the second unit to lower its working frequency or working voltage.
In a possible embodiment of the present application, the voltage frequency regulation information includes third voltage frequency regulation information, and sending the voltage frequency regulation information to the chip according to the working state information or application scenario information of the chip further includes:
sending the third voltage frequency regulation information to the first unit when it is determined, according to the running speed of the first unit and the running speed of the second unit, that the running time of the second unit exceeds the running time of the first unit, where the third voltage frequency regulation information is used to instruct the first unit to lower its working frequency or working voltage.
In a possible embodiment of the present application, the chip includes at least N units, where the working state information of the chip includes the working state information of at least S of the at least N units, N is an integer greater than 1, and S is an integer less than or equal to N; the voltage frequency regulation information includes fourth voltage frequency regulation information, and sending the voltage frequency regulation information to the chip according to the working state information or application scenario information of the chip further includes:
sending the fourth voltage frequency regulation information to a unit A when it is determined, according to the working state information of the unit A, that the unit A is in an idle state, where the fourth voltage frequency regulation information is used to instruct the unit A to lower its working frequency or working voltage,
where the unit A is any one of the at least S units.
In a possible embodiment of the present application, the voltage frequency regulation information includes fifth voltage frequency regulation information, and sending the voltage frequency regulation information to the chip according to the working state information or application scenario information of the chip further includes:
sending the fifth voltage frequency regulation information to the unit A when it is determined, according to the working state information of the unit A, that the unit A is back in a working state, where the fifth voltage frequency regulation information is used to instruct the unit A to raise its working voltage or working frequency.
在本申请的一可能实施例中,所述芯片的应用场景为图像识别,所述应用场景信息为待识别图像中物体的个数,所述电压频率调控信息包括第六电压频率调控信息,所述根据所述芯片的工作状态信息或应用场景信息向所述芯片发送电压频率调控信息还包括:In a possible embodiment of the present application, the application scenario of the chip is image recognition, the application scenario information is the number of objects in the image to be identified, and the voltage frequency regulation information includes sixth voltage frequency regulation information. The sending the voltage frequency regulation information to the chip according to the working state information or the application scenario information of the chip further includes:
当确定所述待识别图像中物体的个数小于第一阈值时,向所述芯片发送所述第六电压频率调控信息,所述第六电压频率调控信息用于指示所述芯片降低其工作电压或者工作频率。When it is determined that the number of objects in the image to be identified is less than a first threshold, sending the sixth voltage frequency regulation information to the chip, where the sixth voltage frequency regulation information is used to indicate that the chip reduces its working voltage Or the working frequency.
在本申请的一可能实施例中,所述应用场景信息为物体标签信息,所述电压频率调控信息包括第七电压频率调控信息,所述根据所述芯片的工作状态信息或应用场景信息向所述芯片发送电压频率调控信息还包括:In a possible embodiment of the present application, the application scenario information is object tag information, and the voltage frequency regulation information includes seventh voltage frequency regulation information, where the device is based on the working state information or the application scenario information of the chip. The chip transmitting voltage frequency regulation information further includes:
当确定所述物体标签信息属于预设物体标签集时,向所述芯片发送所述第七电压频率调控信息,所述第七电压频率调控信息用于指示所述芯片升高其工作电压或者工作频率。When it is determined that the object tag information belongs to the preset object tag set, sending the seventh voltage frequency regulation information to the chip, where the seventh voltage frequency regulation information is used to indicate that the chip raises its working voltage or works frequency.
在本申请的一可能实施例中,所述芯片应用于语音识别,所述应用场景信息为语音输入速率,所述电压频率调控信息包括第八电压频率调控信息,所述根据所述芯片的工作状态信息或应用场景信息向所述芯片发送电压频率调控信息还包括:In a possible embodiment of the present application, the chip is applied to voice recognition, the application scenario information is a voice input rate, and the voltage frequency regulation information includes eighth voltage frequency regulation information, where the operation is performed according to the chip. The sending the voltage frequency regulation information to the chip by the status information or the application scenario information further includes:
When the voice input rate is less than the second threshold, the eighth voltage-frequency regulation information is sent to the chip, where the eighth voltage-frequency regulation information instructs the chip to reduce its working voltage or working frequency.
In a possible embodiment of the present application, the application scenario information is a keyword obtained by the chip through speech recognition, the voltage-frequency regulation information includes ninth voltage-frequency regulation information, and the sending of voltage-frequency regulation information to the chip according to the working state information or the application scenario information of the chip further includes:
when the keyword belongs to a preset keyword set, sending the ninth voltage-frequency regulation information to the chip, where the ninth voltage-frequency regulation information instructs the chip to increase its working voltage or working frequency.
In a possible embodiment of the present application, the chip is applied to machine translation, the application scenario information is a text input speed or a number of characters in an image to be translated, the voltage-frequency regulation information includes tenth voltage-frequency regulation information, and the sending of voltage-frequency regulation information to the chip according to the working state information or the application scenario information of the chip further includes:
when the text input speed is less than a third threshold, or the number of characters in the image to be translated is less than a fourth threshold, sending the tenth voltage-frequency regulation information to the chip, where the tenth voltage-frequency regulation information instructs the chip to reduce its working voltage or working frequency.
In a possible embodiment of the present application, the application scenario information is an ambient light intensity, the voltage-frequency regulation information includes eleventh voltage-frequency regulation information, and the sending of voltage-frequency regulation information to the chip according to the working state information or the application scenario information of the chip further includes:
when the ambient light intensity is less than a fifth threshold, sending the eleventh voltage-frequency regulation information to the chip, where the eleventh voltage-frequency regulation information instructs the chip to reduce its working voltage or working frequency.
In a possible embodiment of the present application, the chip is applied to image beautification, the voltage-frequency regulation information includes twelfth voltage-frequency regulation information and thirteenth voltage-frequency regulation information, and the sending of voltage-frequency regulation information to the chip according to the working state information or the application scenario information of the chip further includes:
when the application scenario information is a face image, sending the twelfth voltage-frequency regulation information to the chip, where the twelfth voltage-frequency regulation information instructs the chip to reduce its working voltage; and
when the application scenario information is not a face image, sending the thirteenth voltage-frequency regulation information to the chip, where the thirteenth voltage-frequency regulation information instructs the chip to reduce its working voltage or working frequency.
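Taken together, the scenario-driven rules above map each piece of application scenario information to a raise/lower decision. The following is a minimal sketch of that mapping; the threshold values, the scenario field names, and the preset keyword set are illustrative assumptions, since the text specifies only the comparisons themselves.

```python
# Minimal sketch of the scenario-driven DVFS decision rules described above.
# All threshold values, field names, and the keyword set are illustrative
# assumptions; only the decision logic follows the text.

THRESHOLDS = {
    "voice_rate": 10.0,        # second threshold, assumed units/s
    "text_speed": 5.0,         # third threshold, assumed chars/s
    "chars_to_translate": 20,  # fourth threshold, assumed
    "light_intensity": 50.0,   # fifth threshold, assumed lux
}
PRESET_KEYWORDS = {"wake", "translate"}  # assumed preset keyword set

def decide(scenario: dict):
    """Return 'lower_vf', 'raise_vf', 'lower_v', or None for a scenario dict."""
    if scenario.get("voice_rate", float("inf")) < THRESHOLDS["voice_rate"]:
        return "lower_vf"                       # eighth regulation information
    if scenario.get("keyword") in PRESET_KEYWORDS:
        return "raise_vf"                       # ninth regulation information
    if (scenario.get("text_speed", float("inf")) < THRESHOLDS["text_speed"]
            or scenario.get("chars_to_translate", float("inf"))
               < THRESHOLDS["chars_to_translate"]):
        return "lower_vf"                       # tenth regulation information
    if scenario.get("light_intensity", float("inf")) < THRESHOLDS["light_intensity"]:
        return "lower_vf"                       # eleventh regulation information
    if scenario.get("is_face_image") is True:
        return "lower_v"                        # twelfth: voltage only
    if scenario.get("is_face_image") is False:
        return "lower_vf"                       # thirteenth regulation information
    return None
```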
As operating frequencies rise and semiconductor processes continue to advance, chip power consumption has become an important consideration in deep sub-micron integrated circuits. Dynamic Voltage and Frequency Scaling (DVFS) is a power-management technique now widely adopted in the semiconductor field: it dynamically adjusts the operating frequency and voltage of a chip (for the same chip, a higher frequency requires a higher voltage) in order to save energy. However, the prior art lacks a DVFS method, and a corresponding device design, applicable to intelligent chips such as convolution operation devices.
In still another aspect of the present application, a convolution operation device is provided, including a dynamic voltage and frequency scaling (DVFS) device, an instruction storage unit, a controller unit, a data access unit, an interconnection module, a main operation module, and N slave operation modules, where N is an integer greater than 1, and where:
the instruction storage unit is configured to store the instructions read in by the data access unit;
the controller unit is configured to read instructions from the instruction storage unit and decode each instruction into control signals that control the behavior of the other modules, the other modules including the data access unit, the main operation module, and the N slave operation modules;
the data access unit is configured to perform data or instruction read/write operations between an external address space and the convolution operation device;
the N slave operation modules are configured to implement the convolution of the input data with the convolution kernels in a convolutional neural network algorithm;
the interconnection module is configured to transfer data between the main operation module and the slave operation modules;
the main operation module is configured to splice the intermediate vectors of all input data into an intermediate result and perform subsequent operations on the intermediate result; and
the DVFS device is configured to collect working state information of the convolution operation device and to send voltage-frequency regulation information to the convolution operation device according to that working state information, where the voltage-frequency regulation information instructs the convolution operation device to adjust its working voltage or working frequency.
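To make the division of responsibilities concrete, the following sketch models this module composition in Python. All class and method names are illustrative assumptions; only the allocation of duties among the modules is taken from the text.

```python
# Illustrative sketch of the module composition described above; names are
# assumptions, the division of responsibilities comes from the text.

class DataAccessUnit:
    """Moves data/instructions between the external address space and the device."""
    def __init__(self, external_memory: dict):
        self.mem = external_memory
    def read(self, addr):
        return self.mem[addr]
    def write(self, addr, value):
        self.mem[addr] = value

class SlaveModule:
    """Convolves shared input data with its own kernel, yielding one scalar."""
    def __init__(self, kernel):
        self.kernel = kernel
    def convolve(self, window):
        return sum(a * b for a, b in zip(window, self.kernel))

class MainModule:
    """Splices the scalars gathered by the interconnect into an intermediate result."""
    def combine(self, scalars):
        return list(scalars)

class ConvolutionDevice:
    def __init__(self, kernels, external_memory):
        assert len(kernels) > 1                  # N is an integer greater than 1
        self.access = DataAccessUnit(external_memory)
        self.slaves = [SlaveModule(k) for k in kernels]
        self.main = MainModule()

    def step(self, window):
        # interconnect: broadcast one window, gather one scalar per slave
        return self.main.combine(s.convolve(window) for s in self.slaves)
```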
In a possible embodiment of the present application, the main operation module is further configured to add the intermediate result to the bias data and then perform an activation operation.
In a possible embodiment of the present application, the N slave operation modules are specifically configured to compute their respective output scalars in parallel, using the same input data and their respective convolution kernels.
In a possible embodiment of the present application, the activation function active used by the main operation module is any one of the nonlinear functions sigmoid, tanh, relu, and softmax.
In a possible embodiment of the present application, the interconnection module forms a data path for continuous or discretized data between the main operation module and the N slave operation modules, and the interconnection module has any one of the following structures: a tree structure, a ring structure, a mesh structure, a hierarchical interconnection structure, and a bus structure.
In a possible embodiment of the present application, the main operation module includes:
a first storage unit, configured to buffer the input data and output data used by the main operation module during computation;
a first operation unit, configured to perform the various computational functions of the main operation module; and
a first data dependency determination unit, which is the port through which the first operation unit reads and writes the first storage unit, and which is configured to guarantee the consistency of data reads and writes to the first storage unit, to read an input neuron vector from the first storage unit and send it to the N slave operation modules through the interconnection module, and to send the intermediate result vector from the interconnection module to the first operation unit.
In a possible embodiment of the present application, each of the N slave operation modules includes:
a second operation unit, configured to receive the control signals issued by the controller unit and perform arithmetic and logic operations;
a second data dependency determination unit, configured to perform the read and write operations on a second storage unit and a third storage unit during computation, so as to guarantee read/write consistency for the second storage unit and the third storage unit;
a second storage unit, configured to buffer the input data and the output scalar computed by that slave operation module; and
a third storage unit, configured to buffer the convolution kernel needed by that slave operation module during computation.
In a possible embodiment of the present application, the first data dependency determination unit and the second data dependency determination unit guarantee read/write consistency as follows:
determining whether a dependency exists between the data of a control signal that has not yet been executed and a control signal that is currently being executed; if no dependency exists, the control signal is allowed to issue immediately; otherwise, the control signal is allowed to issue only after all control signals on which it depends have completed execution.
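As a concrete illustration, the issue rule above can be implemented as a simple scoreboard. In the sketch below, representing each control signal by the address ranges it reads and writes is an assumption; the text specifies only the dependency condition and the wait-until-complete behavior.

```python
# Minimal sketch of the issue rule above. Modeling a control signal as a dict
# of (start, end) address ranges it reads/writes is an assumption; the text
# only requires that a signal wait for every in-flight signal it depends on.

def overlaps(a, b):
    """True if two half-open (start, end) address ranges intersect."""
    return a[0] < b[1] and b[0] < a[1]

def depends_on(new, in_flight):
    """RAW/WAR/WAW: any read/write overlap with an executing control signal."""
    return (any(overlaps(r, w) for r in new["reads"] for w in in_flight["writes"])
            or any(overlaps(w, r) for w in new["writes"] for r in in_flight["reads"])
            or any(overlaps(w, v) for w in new["writes"] for v in in_flight["writes"]))

def may_issue(new, executing):
    """A control signal issues only when it depends on no executing signal."""
    return not any(depends_on(new, s) for s in executing)
```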
In a possible embodiment of the present application, the data access unit reads at least one of the input data, the bias data, and the convolution kernels from the external address space.
In a possible embodiment of the present application, the DVFS device includes:
an information collection unit, configured to collect the working state information of the convolution operation device in real time; and
a voltage-and-frequency scaling unit, configured to send voltage-frequency regulation information to the convolution operation device according to the working state information of the convolution operation device, where the voltage-frequency regulation information instructs the convolution operation device to adjust its working voltage or working frequency.
In a possible embodiment of the present application, the working state information of the convolution operation device includes the running speed of the convolution operation device, the voltage-frequency regulation information includes first voltage-frequency regulation information, and the voltage-and-frequency scaling unit is configured to:
when the running speed of the convolution operation device is greater than a target speed, send the first voltage-frequency regulation information to the convolution operation device, where the first voltage-frequency regulation information instructs the convolution operation device to reduce its working frequency or working voltage, and the target speed is the running speed of the convolution operation device that satisfies the user's requirement.
In a possible embodiment of the present application, the working state information of the convolution operation device includes the running speed of the data access unit and the running speed of the main operation module, the voltage-frequency regulation information includes second voltage-frequency regulation information, and the voltage-and-frequency scaling unit is further configured to:
when it is determined, from the running speed of the data access unit and the running speed of the main operation module, that the running time of the data access unit exceeds the running time of the main operation module, send the second voltage-frequency regulation information to the main operation module, where the second voltage-frequency regulation information instructs the main operation module to reduce its working frequency or working voltage.
In a possible embodiment of the present application, the voltage-frequency regulation information includes third voltage-frequency regulation information, and the voltage-and-frequency scaling unit is further configured to:
when it is determined, from the running speed of the data access unit and the running speed of the main operation module, that the running time of the main operation module exceeds the running time of the data access unit, send the third voltage-frequency regulation information to the data access unit, where the third voltage-frequency regulation information instructs the data access unit to reduce its working frequency or working voltage.
In a possible embodiment of the present application, the working state information of the convolution operation device includes the working state information of at least S units/modules among the instruction storage unit, the controller unit, the data access unit, the interconnection module, the main operation module, and the N slave operation modules, where S is an integer greater than 1 and less than or equal to N+5, the voltage-frequency regulation information includes fourth voltage-frequency regulation information, and the voltage-and-frequency scaling unit is configured to:
when it is determined from the working state information of a unit A that the unit A is idle, send the fourth voltage-frequency regulation information to the unit A, where the fourth voltage-frequency regulation information instructs the unit A to reduce its working frequency or working voltage,
where the unit A is any one of the at least S units/modules.
In a possible embodiment of the present application, the voltage-frequency regulation information includes fifth voltage-frequency regulation information, and the voltage-and-frequency scaling unit is further configured to:
when it is determined from the working state information of the unit A that the unit A has returned to a working state, send the fifth voltage-frequency regulation information to the unit A, where the fifth voltage-frequency regulation information instructs the unit A to increase its working voltage or working frequency.
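The first through fifth pieces of regulation information amount to a producer/consumer balancing rule plus per-unit idle gating. The sketch below summarizes this; the speed/idle probes and the message format are illustrative assumptions, while the comparisons follow the text.

```python
# Minimal sketch of the balancing rules above. The device interface
# (run-time probes, idle/resumed flags) is an assumption; only the
# comparisons come from the text.

def regulate(device):
    msgs = []
    if device.access_run_time() > device.main_run_time():
        msgs.append(("main_module", "lower"))       # second regulation information
    elif device.main_run_time() > device.access_run_time():
        msgs.append(("data_access_unit", "lower"))  # third regulation information
    for unit in device.monitored_units():           # at least S of the N+5 units
        if unit.idle():
            msgs.append((unit.name, "lower"))       # fourth regulation information
        elif unit.resumed():
            msgs.append((unit.name, "raise"))       # fifth regulation information
    return msgs
```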
In yet another aspect of the present application, a neural network processor is provided, including the convolution operation device described above.
In yet another aspect of the present application, an electronic device is provided, including the neural network processor described above.
In still another aspect of the present application, a method for performing a forward operation of a single-layer convolutional neural network is provided, applied to the convolution operation device described above, and including:
pre-storing an input/output (IO) instruction at the first address of the instruction storage unit;
when the operation starts, the controller unit reads the IO instruction from the first address of the instruction storage unit, and, according to the decoded control signals, the data access unit reads all corresponding convolutional neural network operation instructions from the external address space and caches them in the instruction storage unit;
the controller unit then reads the next IO instruction from the instruction storage unit, and, according to the decoded control signals, the data access unit reads all data needed by the main operation module from the external address space into the first storage unit of the main operation module;
the controller unit then reads the next IO instruction from the instruction storage unit, and, according to the decoded control signals, the data access unit reads the convolution kernel data needed by the slave operation modules from the external address space;
the controller unit then reads the next CONFIG instruction from the instruction storage unit, and, according to the decoded control signals, the convolution operation device configures the various constants needed for the computation of this layer of the neural network;
the controller unit then reads the next COMPUTE instruction from the instruction storage unit, and, according to the decoded control signals, the main operation module first sends the input data within the convolution window to the N slave operation modules through the interconnection module and saves it to the second storage units of the N slave operation modules, and then moves the convolution window according to the instruction;
according to the control signals decoded from the COMPUTE instruction, the operation units of the N slave operation modules read the convolution kernels from the third storage units and the input data from the second storage units, complete the convolution of the input data with the convolution kernels, and return the resulting output scalars through the interconnection module;
in the interconnection module, the output scalars returned by the N slave operation modules are spliced stage by stage into a complete intermediate vector;
the main operation module obtains the intermediate vectors returned by the interconnection module; after the convolution window has traversed all input data, the main operation module splices all returned vectors into an intermediate result, reads the bias data from the first storage unit according to the control signals decoded from the COMPUTE instruction, adds the bias data to the intermediate result in the vector-add unit to obtain a biased result, activates the biased result in the activation unit, and writes the final output data back to the first storage unit; and
the controller unit then reads the next IO instruction from the instruction storage unit, and, according to the decoded control signals, the data access unit stores the output data in the first storage unit to the specified address in the external address space, whereupon the operation ends.
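The single-layer forward pass thus reduces to a fixed instruction sequence: IO, IO, IO, CONFIG, COMPUTE, IO. The driver-loop sketch below summarizes it; the handler names and the decode/dispatch interface are hypothetical, while the order of steps is from the text.

```python
# Sketch of the single-layer forward instruction sequence described above.
# Handler names and the device interface are hypothetical; only the
# instruction order is taken from the text.

PROGRAM = [
    ("IO",      "fetch_instructions"),   # cache all CNN instructions on chip
    ("IO",      "load_main_data"),       # input/bias data -> first storage unit
    ("IO",      "load_kernels"),         # kernels -> slave third storage units
    ("CONFIG",  "set_layer_constants"),  # per-layer constants
    ("COMPUTE", "run_convolution"),      # window broadcast, slave dot products,
                                         # splice, bias add, activation
    ("IO",      "store_output"),         # output -> external address space
]

def run(device):
    for opcode, handler in PROGRAM:
        signals = device.decode(opcode)      # controller unit decodes
        getattr(device, handler)(signals)    # dispatch to module behavior
```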
In a possible embodiment of the present application, the method further includes:
collecting the working state information of the convolution operation device in real time; and
sending voltage-frequency regulation information to the convolution operation device according to the working state information of the convolution operation device, where the voltage-frequency regulation information instructs the convolution operation device to adjust its working voltage or working frequency.
In a possible embodiment of the present application, the working state information of the convolution operation device includes the running speed of the convolution operation device, the voltage-frequency regulation information includes first voltage-frequency regulation information, and the sending of voltage-frequency regulation information to the convolution operation device according to its working state information includes:
when the running speed of the convolution operation device is greater than a target speed, sending the first voltage-frequency regulation information to the convolution operation device, where the first voltage-frequency regulation information instructs the convolution operation device to reduce its working frequency or working voltage, and the target speed is the running speed of the convolution operation device that satisfies the user's requirement.
In a possible embodiment of the present application, the working state information of the convolution operation device includes the running speed of the data access unit and the running speed of the main operation module, the voltage-frequency regulation information includes second voltage-frequency regulation information, and the sending of voltage-frequency regulation information to the convolution operation device according to its working state information further includes:
when it is determined, from the running speed of the data access unit and the running speed of the main operation module, that the running time of the data access unit exceeds the running time of the main operation module, sending the second voltage-frequency regulation information to the main operation module, where the second voltage-frequency regulation information instructs the main operation module to reduce its working frequency or working voltage.
In a possible embodiment of the present application, the voltage-frequency regulation information includes third voltage-frequency regulation information, and the sending of voltage-frequency regulation information to the convolution operation device according to its working state information further includes:
when it is determined, from the running speed of the data access unit and the running speed of the main operation module, that the running time of the main operation module exceeds the running time of the data access unit, sending the third voltage-frequency regulation information to the data access unit, where the third voltage-frequency regulation information instructs the data access unit to reduce its working frequency or working voltage.
In a possible embodiment of the present application, the working state information of the convolution operation device includes the working state information of at least S units/modules among the instruction storage unit, the controller unit, the data access unit, the interconnection module, the main operation module, and the N slave operation modules, where S is an integer greater than 1 and less than or equal to N+5, the voltage-frequency regulation information includes fourth voltage-frequency regulation information, and the sending of voltage-frequency regulation information to the convolution operation device according to its working state information further includes:
when it is determined from the working state information of a unit A that the unit A is idle, sending the fourth voltage-frequency regulation information to the unit A, where the fourth voltage-frequency regulation information instructs the unit A to reduce its working frequency or working voltage,
where the unit A is any one of the at least S units/modules.
In a possible embodiment of the present application, the voltage-frequency regulation information includes fifth voltage-frequency regulation information, and the sending of voltage-frequency regulation information to the convolution operation device according to its working state information further includes:
when it is determined from the working state information of the unit A that the unit A has returned to a working state, sending the fifth voltage-frequency regulation information to the unit A, where the fifth voltage-frequency regulation information instructs the unit A to increase its working voltage or working frequency.
In another aspect of the present application, a method for performing a forward operation of a multi-layer convolutional neural network is provided, including:
performing the single-layer convolutional neural network forward operation method described above for each layer, where, after the previous layer of the convolutional neural network has finished executing, the operation instruction of the current layer takes the output data address of the previous layer, stored in the main operation module, as the input data address of the current layer, and the convolution kernel and bias data addresses in the instruction are changed to the addresses corresponding to the current layer.
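A minimal sketch of this layer-to-layer address chaining follows; representing each layer's operation instruction as a record with address fields is an illustrative assumption.

```python
# Sketch of multi-layer chaining: each layer's instruction rebinds its input
# address to the previous layer's output address. Field names are assumptions.

def chain_layers(layer_instructions, first_input_addr):
    prev_output = first_input_addr
    for instr in layer_instructions:
        instr["input_addr"] = prev_output   # reuse the previous layer's output
        # kernel/bias addresses are already per-layer in each instruction
        prev_output = instr["output_addr"]
    return layer_instructions
```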
With the advent of the big-data era, data is growing at an explosive rate; enormous volumes of data carry information between people, and images, as the visual foundation of human perception of the world, are an important means by which humans acquire, express, and transmit information.
In the prior art, image compression effectively reduces the amount of data and increases the transmission rate of an image. However, after an image is compressed it is difficult to retain all of the information of the original image; therefore, how to perform image compression remains a technical problem to be solved by those skilled in the art.
In still another aspect of the present application, an image compression method is provided, including:
acquiring an original image of a first resolution, where the original image is any training image in a compression training image set of a compression neural network, and the label information of the original image is taken as target label information;
compressing the original image based on a target model to obtain a compressed image of a second resolution, where the second resolution is smaller than the first resolution, and the target model is the current neural network model of the compression neural network;
recognizing the compressed image based on a recognition neural network model to obtain reference label information, where the recognition neural network model is the neural network model obtained when the training of the recognition neural network is completed;
obtaining a loss function from the target label information and the reference label information;
when the loss function converges to a first threshold, or when the current number of training iterations of the compression neural network is greater than or equal to a second threshold, acquiring a target original image of the first resolution and taking the target model as the compression neural network model obtained when the training of the compression neural network is completed; and
compressing the target original image based on the compression neural network model to obtain a target compressed image of the second resolution.
In a possible embodiment of the present application, the image compression method further includes:
when the loss function has not converged to the first threshold, or the current number of training iterations of the compression neural network is less than the second threshold, updating the target model according to the loss function to obtain an updated model, taking the updated model as the target model, taking the next training image as the original image, and returning to the step of acquiring the original image of the first resolution.
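The training procedure above amounts to the following loop. In the sketch, the loss form, the optimizer step, and the model interfaces are illustrative assumptions; only the stopping conditions and the use of a fixed recognition network follow the text.

```python
# Sketch of the compression-network training loop described above. The loss
# form, update step, and model interfaces are assumptions; the stopping
# conditions (loss converged to the first threshold, or iteration count at
# least the second threshold) follow the text.

def train_compressor(compressor, recognizer, loss_fn, training_set,
                     first_threshold, second_threshold, update_step):
    iterations = 0
    for original, target_label in training_set:
        compressed = compressor(original)          # first -> second resolution
        reference_label = recognizer(compressed)   # fixed recognition model
        loss = loss_fn(target_label, reference_label)
        iterations += 1
        if loss <= first_threshold or iterations >= second_threshold:
            return compressor                      # training complete
        update_step(compressor, loss)              # e.g., back-propagation
    return compressor
```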
In a possible embodiment of the present application, recognizing the compressed image based on the recognition neural network model to obtain the reference label information specifically includes:
preprocessing the compressed image to obtain an image to be recognized; and
recognizing the image to be recognized based on the recognition neural network model to obtain the reference label information.
In a possible embodiment of the present application, the preprocessing includes size processing, and preprocessing the compressed image to obtain the image to be recognized specifically includes:
when the image size of the compressed image is smaller than the basic image size of the recognition neural network, padding the compressed image with pixels according to the basic image size to obtain the image to be recognized.
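A minimal sketch of this size processing follows; zero-valued fill and top-left anchoring are assumptions, since the text only requires filling pixels up to the basic image size.

```python
# Sketch of the size processing above: pad a compressed image up to the
# recognition network's basic image size. Zero fill and top-left anchoring
# are assumptions.

def pad_to_basic_size(image, basic_h, basic_w, fill=0):
    h, w = len(image), len(image[0])
    if h >= basic_h and w >= basic_w:
        return image                      # already at least the basic size
    padded = [[fill] * max(w, basic_w) for _ in range(max(h, basic_h))]
    for i in range(h):
        for j in range(w):
            padded[i][j] = image[i][j]
    return padded
```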
In a possible embodiment of the present application, the compression training image set includes at least a recognition training image set, and the method further includes:
training the recognition neural network with the recognition training image set to obtain the recognition neural network model, where each training image in the recognition training image set includes at least label information consistent in type with the target label information.
In a possible embodiment of the present application, after the target original image is compressed based on the compression neural network model to obtain the target compressed image of the second resolution, the method further includes:
recognizing the target compressed image based on the recognition neural network model to obtain the label information of the target original image, and storing the label information of the target original image.
In a possible embodiment of the present application, the compression training image set includes a plurality of dimensions, and compressing the original image based on the target model to obtain the compressed image of the second resolution includes:
recognizing the original image based on the target model to obtain a plurality of pieces of image information, each dimension corresponding to one piece of image information; and
compressing the original image based on the target model and the plurality of pieces of image information to obtain the compressed image.
In still another aspect of the present application, an image compression apparatus is provided, including a processor and a memory connected to the processor, where:
the memory is configured to store a first threshold, a second threshold, the current neural network model and number of training iterations of a compression neural network, the compression training image set of the compression neural network and the label information of each training image in the compression training image set, a recognition neural network model, and a compression neural network model, the current neural network model of the compression neural network being taken as a target model, the compression neural network model being the target model obtained when the training of the compression neural network is completed, and the recognition neural network model being the neural network model obtained when the training of the recognition neural network is completed; and
the processor is configured to acquire an original image of a first resolution, where the original image is any training image in the compression training image set, and take the label information of the original image as target label information; compress the original image based on the target model to obtain a compressed image of a second resolution, the second resolution being smaller than the first resolution; recognize the compressed image based on the recognition neural network model to obtain reference label information; obtain a loss function from the target label information and the reference label information; when the loss function converges to the first threshold, or the number of training iterations is greater than or equal to the second threshold, acquire a target original image of the first resolution and confirm the target model as the compression neural network model; and compress the target original image based on the compression neural network model to obtain a target compressed image of the second resolution.
In a possible embodiment of the present application, the processor is further configured to, when the loss function has not converged to the first threshold, or the number of training iterations is less than the second threshold, update the target model according to the loss function to obtain an updated model, take the updated model as the target model, take the next training image as the original image, and return to the step of acquiring the original image of the first resolution.
In a possible embodiment of the present application, the processor is specifically configured to preprocess the compressed image to obtain an image to be recognized, and to recognize the image to be recognized based on the recognition neural network model to obtain the reference label information.
In a possible embodiment of the present application, the preprocessing includes size processing; the memory is further configured to store the basic image size of the recognition neural network; and the processor is specifically configured to, when the image size of the compressed image is smaller than the basic image size, pad the compressed image with pixels according to the basic image size to obtain the image to be recognized.
In a possible embodiment of the present application, the compression training image set includes at least a recognition training image set, and the processor is further configured to train the recognition neural network with the recognition training image set to obtain the recognition neural network model, where each training image in the recognition training image set includes at least label information consistent in type with the target label information.
In a possible embodiment of the present application, the processor is further configured to recognize the target compressed image based on the recognition neural network model to obtain the label information of the target original image, and the memory is further configured to store the label information of the target original image.
In a possible embodiment of the present application, the compression training image set includes a plurality of dimensions, and the processor is specifically configured to recognize the original image based on the target model to obtain a plurality of pieces of image information, each dimension corresponding to one piece of image information, and to compress the original image based on the target model and the plurality of pieces of image information to obtain the compressed image.
In another aspect of the present application, another electronic device is provided, including a processor, a memory, a communication interface, and one or more programs, where the one or more programs are stored in the memory and configured to be executed by the processor, and the programs include instructions for some or all of the steps described in the image compression method above.
In another aspect of the present application, a computer-readable storage medium is provided, the computer storage medium storing a computer program, the computer program including program instructions that, when executed by a processor, cause the processor to perform the image compression method described above.
Compared with the prior art, the processing method and apparatus and the operation method and apparatus provided by the present application have at least the following advantages:
1. A quantization method is used to quantize the neurons and weights of the neural network: the quantized weights are represented by a weight dictionary and a weight codebook, the quantized neurons are represented by a neuron dictionary and a neuron codebook, and the operations in the neural network are then converted into table-lookup operations. This reduces the amount of neural network parameter storage and reduces memory-access and computation energy consumption. The neural network processor integrates a lookup-table-based computation method, which optimizes the table-lookup operation, simplifies the structure, reduces the memory-access and computation energy consumption of the neural network, and also enables diversified operations.
2. The neural network can be retrained, and during retraining only the codebooks need to be trained; the weight dictionary does not need to be trained, which simplifies the retraining operation.
3. Neural-network-specific instructions for locally quantized multi-layer artificial neural network operations, together with a flexible operation unit, solve the problems of insufficient computational performance of CPUs and GPUs and of high front-end decoding overhead, effectively improving support for multi-layer artificial neural network operation algorithms.
4. By using a dedicated on-chip cache for multi-layer artificial neural network operation algorithms, the reusability of input neurons and weight data is fully exploited, avoiding repeated reads of these data from memory, reducing memory-access bandwidth, and preventing memory bandwidth from becoming a performance bottleneck for multi-layer artificial neural network operations and their training algorithms.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1A is a schematic flowchart of a processing method according to an embodiment of the present application.
FIG. 1B is a schematic diagram of a process of quantizing weights according to an embodiment of the present application.
FIG. 1C is a schematic diagram of a process of quantizing input neurons according to an embodiment of the present application.
FIG. 1D is a schematic diagram of a process of determining an operation codebook according to an embodiment of the present application.
FIG. 1E is a schematic structural diagram of a processing apparatus according to an embodiment of the present application.
FIG. 1F is a schematic structural diagram of an operation apparatus according to an embodiment of the present application.
FIG. 1G is a schematic structural diagram of an operation apparatus according to a specific embodiment of the present application.
FIG. 1H is a schematic flowchart of an operation method according to an embodiment of the present application.
FIG. 1I is a schematic flowchart of another operation method according to a specific embodiment of the present application.
FIG. 2A is a schematic structural diagram of a hierarchical storage apparatus according to an embodiment of the present application.
FIG. 2B is a schematic structural diagram of a 4T SRAM storage cell according to an embodiment of the present application.
FIG. 2C is a schematic structural diagram of a 3T SRAM storage cell according to an embodiment of the present application.
FIG. 2D is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application.
FIG. 2E is a schematic structural diagram of another data processing apparatus according to an embodiment of the present application.
FIG. 2F is a flowchart of a data storage method according to an embodiment of the present application.
FIG. 2G is a flowchart of a data processing method according to an embodiment of the present application.
FIG. 3A is a schematic structural diagram of a dynamic voltage and frequency scaling (DVFS) apparatus according to an embodiment of the present application.
FIG. 3B is a schematic diagram of a DVFS application scenario according to an embodiment of the present application.
FIG. 3C is a schematic diagram of another DVFS application scenario according to an embodiment of the present application.
FIG. 3D is a schematic diagram of another DVFS application scenario according to an embodiment of the present application.
FIG. 3E is a schematic diagram of an implementation of the interconnection module 4 according to an embodiment of the present application.
FIG. 3F is an example block diagram of the structure of the main operation module 5 in an apparatus for performing a convolutional neural network forward operation according to an embodiment of the present application.
FIG. 3G is an example block diagram of the structure of a slave operation module 6 in an apparatus for performing a convolutional neural network forward operation according to an embodiment of the present application.
FIG. 3H is a schematic flowchart of a DVFS method according to an embodiment of the present application.
FIG. 4A is a schematic structural diagram of a convolution operation device according to an embodiment of the present application.
FIG. 4B is an example block diagram of the structure of the main operation module in a convolution operation device according to an embodiment of the present application.
FIG. 4C is an example block diagram of the structure of a slave operation module in a convolution operation device according to an embodiment of the present application.
FIG. 4D is an example block diagram of the structure of the DVFS device in a convolution operation device according to an embodiment of the present application.
FIG. 4E is a schematic diagram of an implementation of the interconnection module 4 according to an embodiment of the present application.
FIG. 4F is a schematic structural diagram of another convolution operation device according to an embodiment of the present application.
FIG. 4G is a schematic flowchart of a method for performing a single-layer convolutional neural network forward operation according to an embodiment of the present application.
FIG. 5A is a schematic diagram of the operation of a neural network according to an embodiment of the present application.
FIG. 5B is a schematic flowchart of an image compression method according to an embodiment of the present application.
FIG. 5C is a schematic diagram of a scenario of a size processing method according to an embodiment of the present application.
FIG. 5D is a schematic flowchart of a single-layer neural network operation method according to an embodiment of the present application.
FIG. 5E is a schematic structural diagram of an apparatus for performing reverse training of a compression neural network according to an embodiment of the present application.
FIG. 5F is a schematic structural diagram of an H-tree module according to an embodiment of the present application.
FIG. 5G is a schematic structural diagram of a main operation module according to an embodiment of the present application.
FIG. 5H is a schematic structural diagram of an operation module according to an embodiment of the present application.
FIG. 5I is an example block diagram of reverse training of a compression neural network according to an embodiment of the present application.
FIG. 5J is a schematic flowchart of an image compression method according to an embodiment of the present application.
FIG. 5K is a schematic structural diagram of an electronic device according to an embodiment of the present application.
DETAILED DESCRIPTION
In view of the technical defect in the prior art that the enormous computational load of neural network data processing hinders the application of neural networks, the present application provides a processing method and apparatus and an operation method and apparatus. The processing method and apparatus quantize two kinds of data, input neurons and weights, mining the similarity between inter-layer and inter-segment data as well as the local similarity of intra-layer and intra-segment data, so as to exploit the distribution characteristics of these two kinds of data for low-bit quantization and reduce the number of bits used to represent each datum, thereby reducing data storage overhead and memory-access overhead. The processing method and apparatus carry out the operations on the quantized neurons and weights through table-lookup operations, reducing the memory-access energy consumption and computation energy consumption of the neural network.
The input neurons and output neurons mentioned in the present application do not refer to the neurons in the input layer and the output layer of the entire neural network. Rather, for any two adjacent layers in the network, the neurons in the lower layer of the network feed-forward operation are the input neurons, and the neurons in the upper layer of the network feed-forward operation are the output neurons. Taking a convolutional neural network as an example, suppose a convolutional neural network has L layers, with K = 1, 2, ..., L-1; for the K-th layer and the (K+1)-th layer, the K-th layer is called the input layer and the neurons therein are the input neurons, while the (K+1)-th layer is called the output layer and the neurons therein are the output neurons. That is, except for the top layer, each layer can serve as an input layer, and its next layer is the corresponding output layer.
To make the objects, technical solutions, and advantages of the present application clearer, the present application is further described in detail below with reference to specific embodiments and the accompanying drawings.
Referring to FIG. 1A, FIG. 1A is a schematic flowchart of a processing method according to an embodiment of the present application. As shown in FIG. 1A, the processing method includes:
step S1: quantizing the weights and the input neurons respectively, and determining a weight dictionary, a weight codebook, a neuron dictionary, and a neuron codebook;
where the process of quantizing the weights specifically includes the steps of:
grouping the weights and performing a clustering operation on each group of weights with a clustering algorithm, dividing a group of weights into m classes, m being a positive integer, with each class of weights corresponding to one weight index, and determining the weight dictionary, where the weight dictionary includes weight positions and weight indices, a weight position being the position of a weight in the neural network structure; and
replacing all weights of each class with one center weight, and determining the weight codebook, where the weight codebook includes the weight indices and the center weights.
Referring to FIG. 1B, FIG. 1B is a schematic diagram of a process of quantizing weights according to an embodiment of the present application. As shown in FIG. 1B, the weights are grouped according to a preset grouping strategy to obtain an ordered weight matrix. Intra-group sampling and clustering operations are then performed on the grouped weight matrix, so that weights with similar values are placed in the same class; the center weights of the four classes, computed according to the loss function, are 1.50, -0.13, -1.3, and 0.23, corresponding to the weights of the four classes respectively. In the resulting weight codebook, the weight index of the class whose center weight is -1.3 is 00, the weight index of the class whose center weight is -0.13 is 01, the weight index of the class whose center weight is 0.23 is 10, and the weight index of the class whose center weight is 1.50 is 11. In addition, the weight indices corresponding to the four center weights (00, 01, 10, and 11) are used to represent the weights of the corresponding classes, thereby yielding the weight dictionary. Note that the weight dictionary also includes weight positions, that is, the positions of the weights in the neural network structure; in the weight dictionary, a weight position is the coordinate (p, q) of the p-th row and q-th column, and in this embodiment, 1 ≤ p ≤ 4 and 1 ≤ q ≤ 4.
It can be seen that this quantization process fully exploits the similarity of weights between layers of the neural network and the local similarity of weights within a layer, and obtains the weight distribution characteristics of the neural network so as to perform low-bit quantization, reducing the number of bits used to represent each weight and thereby reducing weight storage overhead and memory-access overhead.
Optionally, the preset grouping strategy includes, but is not limited to, the following: grouping into one group, where all weights of the neural network are placed in a single group; layer-type grouping, where the weights of all convolutional layers, the weights of all fully connected layers, and the weights of all long short-term memory (LSTM) network layers of the neural network are each placed in one group; inter-layer grouping, where the weights of one or more convolutional layers, the weights of one or more fully connected layers, and the weights of one or more LSTM network layers of the neural network are each placed in one group; and intra-layer grouping, where the weights within one layer of the neural network are partitioned and each partition is placed in one group.
The clustering algorithm includes K-means, K-medoids, Clara, and/or Clarans. The center weight of each class is selected so as to minimize the cost function J(w, w_0); the value of w_0 at the minimum is the center weight. The cost function may be the squared distance:

$$J(w, w_0) = \sum_{i=1}^{n} (w_i - w_0)^2$$

where J is the cost function, w denotes all the weights in the class, w_0 is the center weight, n is the number of weights in the class, w_i is the i-th weight in the class, 1 ≤ i ≤ n, and n is a positive integer.
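The clustering step above can be sketched as a one-dimensional K-means over a group of weights. In the sketch below, the initialization and iteration count are assumptions; the squared-distance cost and the replacement of each class by its center weight follow the text.

```python
# Sketch of weight quantization by 1-D K-means under the squared-distance
# cost J(w, w0) above. Initialization and iteration count are assumptions.

def quantize_weights(weights, m, iters=50):
    # crude initialization: m spread-out values from the sorted weights
    centers = sorted(weights)[:: max(1, len(weights) // m)][:m]
    for _ in range(iters):
        classes = [[] for _ in centers]
        for w in weights:                  # assign each weight to nearest center
            k = min(range(len(centers)), key=lambda j: (w - centers[j]) ** 2)
            classes[k].append(w)
        # the class mean minimizes the squared-distance cost J(w, w0)
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(classes)]
    def index_of(w):
        return min(range(len(centers)), key=lambda j: (w - centers[j]) ** 2)
    codebook = dict(enumerate(centers))          # weight index -> center weight
    dictionary = [index_of(w) for w in weights]  # per-position weight index
    return codebook, dictionary
```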
Further, the quantization of the input neurons is described; it includes the steps of:
dividing the input neurons into p segments, each segment of input neurons corresponding to one neuron range and one neuron index, and determining the neuron dictionary, where p is a positive integer; and
encoding the input neurons, replacing all input neurons of each segment with one center neuron, and determining the neuron codebook.
Referring to FIG. 1C, FIG. 1C is a schematic diagram of a process of quantizing input neurons according to an embodiment of the present application. As shown in FIG. 1C, this embodiment takes the quantization of ReLU activation-layer neurons as a concrete example. The ReLU function is first divided into four segments; the center neurons of the four segments are represented by 0.0, 0.2, 0.5, and 0.7 respectively, and the neuron indices are represented by 00, 01, 10, and 11. A neuron codebook containing the neuron indices and the center neurons is then generated, together with a neuron dictionary containing the neuron ranges and the neuron indices, where the neuron ranges and the neuron indices are stored correspondingly and x denotes the value of a neuron before quantization. This input-neuron quantization process can divide the input neurons into multiple segments according to actual requirements and obtain the index of each segment to form the neuron dictionary; then, according to the neuron indices, the input neurons in each segment are replaced with the center neurons in the neuron codebook. This fully exploits the similarity between input neurons and obtains the distribution characteristics of the input neurons so as to perform low-bit quantization, reducing the number of bits representing each input neuron and thereby reducing the storage overhead and memory-access overhead of the input neurons.
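Following this example, neuron quantization reduces to a range lookup. In the sketch below, the segment boundaries are illustrative assumptions chosen to be consistent with the four-segment ReLU example; the four center neurons and the two-bit indices follow the text.

```python
# Sketch of input-neuron quantization as a range lookup. The segment
# boundaries are assumptions; the four center neurons and two-bit indices
# follow the ReLU example above.

import bisect

SEGMENT_BOUNDS = [0.1, 0.35, 0.6]      # assumed boundaries of the 4 segments
CENTER_NEURONS = [0.0, 0.2, 0.5, 0.7]  # neuron codebook: index -> center neuron

def quantize_neuron(x: float):
    """Return (neuron index, center neuron) for an unquantized value x."""
    idx = bisect.bisect_right(SEGMENT_BOUNDS, x)   # which segment x falls in
    return idx, CENTER_NEURONS[idx]

# e.g. quantize_neuron(0.47) -> (2, 0.5), i.e. index 0b10, center neuron 0.5
```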
Step S2: determining an operation codebook according to the weight codebook and the neuron codebook, which specifically includes the following steps:
S21: determining, according to the weight, the corresponding weight index in the weight codebook, and then determining, through the weight index, the center weight corresponding to the weight;
S22: determining, according to the input neuron, the corresponding neuron index in the neuron codebook, and then determining, through the neuron index, the central neuron corresponding to the input neuron; and
S23: performing an operation on the center weight and the central neuron to obtain an operation result, and arranging the operation results into a matrix, thereby determining the operation codebook.
Referring to FIG. 1D, FIG. 1D is a schematic diagram of a process of determining an operation codebook according to an embodiment of the present application. As shown in FIG. 1D, this embodiment takes a multiplication codebook as an example; in other embodiments, the operation codebook may also be an addition codebook, a pooling codebook, or the like, which is not uniquely limited in the present application. First, in the weight dictionary, the weight index corresponding to the weight and the center weight corresponding to that weight index are determined; then, in the neuron codebook, the corresponding neuron index and the central neuron corresponding to that index are determined according to the input neuron. Finally, the neuron index and the weight index are used as the row index and column index of the operation codebook, and the central neurons and center weights are multiplied pairwise to form a matrix, yielding the multiplication codebook.
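A minimal sketch of step S23 for the multiplication case, reusing the codebook layout of the earlier sketches; the dict-of-index-to-center representation is an illustrative assumption.

```python
import numpy as np

def build_multiplication_codebook(neuron_codebook, weight_codebook):
    """Precompute the operation codebook as a matrix.

    Rows are indexed by neuron index, columns by weight index; entry
    (i, j) is central_neuron[i] * center_weight[j].
    """
    neurons = [neuron_codebook[i] for i in sorted(neuron_codebook)]
    weights = [weight_codebook[j] for j in sorted(weight_codebook)]
    return np.outer(neurons, weights)  # shape: (#neuron idx, #weight idx)
```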
After step S2, a step S3 may further be included: retraining the weights and the input neurons. During retraining, only the weight codebook and the neuron codebook are trained, while the contents of the weight dictionary and the neuron dictionary remain unchanged, which simplifies the retraining operation and reduces the workload. Preferably, the retraining uses a back-propagation algorithm.
Referring to FIG. 1E, FIG. 1E is a schematic structural diagram of a processing apparatus according to an embodiment of the present application. As shown in FIG. 1E, the processing apparatus includes:
a memory 51, configured to store an operation instruction; and
a processor 52, configured to execute the operation instruction in the memory 51 and, when executing it, to operate in accordance with the processing method described above. The operation instruction may be a binary number including an operation code and an address code, where the operation code indicates the operation to be performed by the processor 52 and the address code indicates the address in the memory 51 from which the processor 52 reads the data participating in the operation.
With the data processing apparatus of the present application, the processor 52 executes the operation instructions in the memory 51 and operates according to the foregoing processing method, so that disordered weights and input neurons can be quantized into low-bit, normalized center weights and central neurons. The local similarity between weights and between input neurons is exploited to obtain their distribution characteristics, and low-bit quantization is performed accordingly, reducing the number of bits representing each weight and each input neuron and thereby reducing both storage overhead and memory-access overhead.
Referring to FIG. 1F, FIG. 1F is a schematic structural diagram of an arithmetic device according to an embodiment of the present application. As shown in FIG. 1F, the arithmetic device includes an instruction control unit 1 and a lookup table unit 2;
the instruction control unit 1 is configured to decode a received instruction and generate lookup control information;
the lookup table unit 2 is configured to look up output neurons in the operation codebook according to the lookup control information generated by the instruction control unit 1 and the received weight dictionary, neuron dictionary, operation codebook, weights, and input neurons. The weight dictionary includes weight positions (that is, the position of a weight in the neural network structure, denoted (p, q), specifically the position at row p, column q in the weight dictionary) and weight indices; the neuron dictionary includes input neurons and neuron indices; and the operation codebook includes weight indices, neuron indices, and the operation results of input neurons and weights.
The specific working process of the lookup table unit is as follows: determine the weight index according to the weight position corresponding to the weight in the weight dictionary; determine the neuron index according to the neuron range corresponding to the input neuron in the neuron dictionary; use the weight index and the neuron index as the column index and row index of the operation codebook; and look up the value (the operation result) at that row and column of the operation codebook, which is the output neuron.
Referring to FIG. 1B to FIG. 1D, when performing a lookup, suppose the neuron index of a certain neuron is 01 and the weight index of a certain weight is 10. To operate on that neuron and weight, the value at row 2, column 3 of the multiplication codebook, 0.046, is looked up; this value is the output neuron. Similarly, addition and pooling operations are analogous to the multiplication operation and are not described again here. It can be understood that pooling includes, but is not limited to, average pooling, max pooling, and median pooling.
More specifically, according to the different operations, the lookup table may include at least one of the following:
a multiplication lookup table: taking a weight index in1 and a neuron index in2 as inputs, the table-lookup operation mult_lookup completes the multiplication of the center weight data1 corresponding to the weight index and the central neuron data2 corresponding to the neuron index; that is, the lookup operation out = mult_lookup(in1, in2) implements the multiplication function out = data1 * data2; and/or
an addition lookup table: according to an input index in, a stepwise addition lookup table completes, through the table-lookup operation add_lookup, the addition of the center data data corresponding to the index, where in and data are vectors of length N and N is a positive integer; that is, the lookup operation out = add_lookup(in) implements the addition function out = data[1] + data[2] + ... + data[N]; and/or, taking a weight index in1 and a neuron index in2 as inputs, the table-lookup operation completes the addition of the center weight data1 corresponding to the weight index and the central neuron data2 corresponding to the neuron index; that is, the lookup operation out = add_lookup(in1, in2) implements the addition function out = data1 + data2; and/or
a pooling lookup table: used for the pooling operation on the center data data corresponding to an input index; that is, the lookup operation out = pool_lookup(in) completes the pooling operation out = pool(data), where pooling includes average pooling, max pooling, and median pooling.
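A minimal sketch of the three lookup operations, reusing the table built by build_multiplication_codebook above; the function names mirror mult_lookup / add_lookup / pool_lookup from the text, while the argument layouts and 0-based indexing are illustrative assumptions.

```python
import numpy as np

def mult_lookup(in1, in2, mult_table):
    """out = data1 * data2: fetch the precomputed product, with the
    neuron index in2 as row and the weight index in1 as column.
    E.g. neuron index 0b01 and weight index 0b10 address mult_table[1, 2]."""
    return mult_table[in2, in1]

def add_lookup(indices, center_data):
    """out = data[1] + data[2] + ... + data[N]: stepwise addition over
    the center data addressed by the index vector."""
    return sum(center_data[i] for i in indices)

def pool_lookup(indices, center_data, mode="max"):
    """out = pool(data): pooling over the addressed center data."""
    vals = [center_data[i] for i in indices]
    return {"average": np.mean, "max": np.max, "median": np.median}[mode](vals)
```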
Referring to FIG. 1G, FIG. 1G is a schematic structural diagram of another arithmetic device according to an embodiment of the present application. As shown in FIG. 1G, compared with the arithmetic device in FIG. 1F, the arithmetic device of this embodiment further includes a preprocessing unit 4, a storage unit 3, a cache unit 6, and a direct memory access unit 5, which can optimize the processing flow of the present application and make the data processing more orderly.
The preprocessing unit 4 is configured to preprocess externally input information to obtain the weights, input neurons, instructions, weight dictionary, neuron dictionary, and operation codebook; the preprocessing includes, but is not limited to, segmentation, Gaussian filtering, binarization, regularization, and/or normalization.
The storage unit 3 is configured to store the input neurons, weights, weight dictionary, neuron dictionary, operation codebook, and instructions, and to receive the output neurons.
The cache unit 6 is configured to cache the instructions, weight indices, neuron indices, and output neurons, and may include:
an instruction cache 61, configured to cache the instructions and output the cached instructions to the instruction control unit 1;
a weight cache 62, configured to cache the weights and output the cached weights to the lookup table unit 2;
an input neuron cache 63, configured to cache the input neurons and output the cached input neurons to the lookup table unit 2;
an output neuron cache 64, configured to cache the output neurons produced by the lookup table unit 2 and output the cached output neurons to the lookup table unit 2;
a neuron index cache 65, configured to determine the corresponding neuron index according to the input neuron, cache the neuron index, and output the cached neuron index to the lookup table unit 2; and
a weight index cache 66, configured to determine the corresponding weight index according to the weight, cache the weight index, and output the cached weight index to the lookup table unit 2.
The direct memory access unit 5 is configured to read and write data or instructions between the storage unit 3 and the cache unit 6.
Optionally, the instruction may be a neural-network-specific instruction, including all instructions dedicated to completing artificial neural network operations. Neural-network-specific instructions include, but are not limited to, control instructions, data transfer instructions, operation instructions, and logic instructions. Control instructions control the execution flow of the neural network. Data transfer instructions complete data transfers between different storage media; the data formats include, but are not limited to, matrix, vector, and scalar. Operation instructions complete the arithmetic operations of the neural network, including but not limited to matrix operation instructions, vector operation instructions, scalar operation instructions, convolutional neural network operation instructions, fully connected neural network operation instructions, pooling neural network operation instructions, RBM neural network operation instructions, LRN neural network operation instructions, LCN neural network operation instructions, LSTM neural network operation instructions, RNN neural network operation instructions, RELU neural network operation instructions, PRELU neural network operation instructions, SIGMOID neural network operation instructions, TANH neural network operation instructions, and MAXOUT neural network operation instructions. Logic instructions complete the logical operations of the neural network, including but not limited to vector logic operation instructions and scalar logic operation instructions.
The RBM neural network operation instruction is used to implement Restricted Boltzmann Machine (RBM) neural network operations.
The LRN neural network operation instruction is used to implement Local Response Normalization (LRN) neural network operations.
The LSTM neural network operation instruction is used to implement Long Short-Term Memory (LSTM) neural network operations.
The RNN neural network operation instruction is used to implement Recurrent Neural Network (RNN) operations.
The RELU neural network operation instruction is used to implement Rectified Linear Unit (RELU) neural network operations.
The PRELU neural network operation instruction is used to implement Parametric Rectified Linear Unit (PRELU) neural network operations.
The SIGMOID neural network operation instruction is used to implement sigmoid growth curve (SIGMOID) neural network operations.
The TANH neural network operation instruction is used to implement hyperbolic tangent (TANH) neural network operations.
The MAXOUT neural network operation instruction is used to implement MAXOUT neural network operations.
Further, the neural-network-specific instructions include the Cambricon instruction set. The Cambricon instruction set includes at least one Cambricon instruction, each 64 bits long and consisting of an operation code and operands. The Cambricon instruction set contains four types of instructions: Cambricon control instructions, Cambricon data transfer instructions, Cambricon operation instructions, and Cambricon logic instructions.
Optionally, Cambricon control instructions control the execution flow and include jump instructions and conditional branch instructions.
Optionally, Cambricon data transfer instructions complete data transfers between different storage media and include load instructions, store instructions, and move instructions. The load instruction loads data from main memory into a cache; the store instruction stores data from a cache into main memory; the move instruction moves data between caches, between a cache and a register, or between registers. Data transfer instructions support three data organizations: matrix, vector, and scalar.
Optionally, Cambricon operation instructions complete the arithmetic operations of the neural network and include Cambricon matrix operation instructions, Cambricon vector operation instructions, and Cambricon scalar operation instructions.
Optionally, Cambricon matrix operation instructions complete the matrix operations in the neural network, including matrix-multiply-vector, vector-multiply-matrix, matrix-multiply-scalar, outer product, matrix-add-matrix, and matrix-subtract-matrix.
Optionally, Cambricon vector operation instructions complete the vector operations in the neural network, including vector elementary arithmetic, vector transcendental functions, dot product, random vector generation, and maximum/minimum of a vector. Vector elementary arithmetic includes vector add, subtract, multiply, and divide; vector transcendental functions are functions that do not satisfy any polynomial equation with polynomial coefficients, including but not limited to exponential, logarithmic, trigonometric, and inverse trigonometric functions.
Optionally, Cambricon scalar operation instructions complete the scalar operations in the neural network, including scalar elementary arithmetic and scalar transcendental functions. Scalar elementary arithmetic includes scalar add, subtract, multiply, and divide; scalar transcendental functions are functions that do not satisfy any polynomial equation with polynomial coefficients, including but not limited to exponential, logarithmic, trigonometric, and inverse trigonometric functions.
Optionally, Cambricon logic instructions complete the logical operations of the neural network and include Cambricon vector logic operation instructions and Cambricon scalar logic operation instructions. Cambricon vector logic operation instructions complete vector compare, vector logical operations, and vector-greater-than-merge, where vector compare includes but is not limited to greater than, less than, equal to, greater than or equal to, less than or equal to, and not equal to, and vector logical operations include AND, OR, and NOT.
Optionally, Cambricon scalar logic operation instructions complete scalar compare and scalar logical operations, where scalar compare includes but is not limited to greater than, less than, equal to, greater than or equal to, less than or equal to, and not equal to, and scalar logical operations include AND, OR, and NOT.
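To make the 64-bit instruction layout concrete, here is a hedged sketch of packing and unpacking such an instruction word. Only the 64-bit width and the opcode/operand split come from the text above; the field widths, field count, and opcode values are illustrative assumptions, not the actual Cambricon encoding.

```python
# Assumed opcode values for illustration only.
OPCODES = {"jump": 0x01, "load": 0x10, "store": 0x11, "vector_add": 0x20}

def encode(name, op0=0, op1=0, op2=0):
    """Pack an assumed 8-bit opcode and three assumed 16-bit operands
    into one 64-bit word (8 + 3 * 16 = 56 bits used)."""
    word = OPCODES[name] & 0xFF
    word |= (op0 & 0xFFFF) << 8
    word |= (op1 & 0xFFFF) << 24
    word |= (op2 & 0xFFFF) << 40
    return word

def decode(word):
    """Recover (opcode, op0, op1, op2) from a packed word."""
    return (word & 0xFF, (word >> 8) & 0xFFFF,
            (word >> 24) & 0xFFFF, (word >> 40) & 0xFFFF)
```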
Referring to FIG. 1H, FIG. 1H is a schematic flowchart of another operation method according to an embodiment of the present application. As shown in FIG. 1H, the operation method includes the following steps:
S81: receiving the weights, input neurons, instruction, weight dictionary, neuron dictionary, and operation codebook, where the weight dictionary includes weight positions and weight indices, the neuron dictionary includes input neurons and neuron indices, and the operation codebook includes weight indices, neuron indices, and the operation results of input neurons and weights;
S82: decoding the instruction to determine lookup control information; and
S83: looking up the output neurons in the operation codebook according to the lookup control information, weights, weight dictionary, neuron dictionary, and input neurons.
Step S83 is similar to the specific working process of the lookup table unit and specifically includes the following sub-steps:
S831: according to the weights, input neurons, weight dictionary, and neuron dictionary, determining the neuron index by determining the neuron range in the neuron dictionary, and determining the weight index by determining the weight position in the weight dictionary; and
S832: looking up the operation result in the operation codebook according to the weight index and the neuron index, thereby determining the output neuron.
To optimize the operation method of the present application and make the processing more convenient and orderly, an embodiment of the present application provides yet another operation method. FIG. 1I is a schematic flowchart of the operation method of a specific embodiment of the present application; the operation method includes the following steps:
Step S90: preprocessing the externally input information.
Optionally, preprocessing the externally input information specifically includes obtaining the weights, input neurons, instruction, weight dictionary, neuron dictionary, and operation codebook corresponding to the input information; the preprocessing includes segmentation, Gaussian filtering, binarization, regularization, and/or normalization.
Step S91: receiving the weights, input neurons, instruction, weight dictionary, neuron dictionary, and operation codebook.
Step S92: storing the weights, input neurons, instruction, weight dictionary, neuron dictionary, and operation codebook.
Step S93: caching the weights, input neurons, instruction, weight indices, and neuron indices.
Step S94: decoding the instruction to determine lookup control information.
Step S95: according to the weights, input neurons, weight dictionary, and neuron dictionary, determining the neuron index by determining the neuron range in the neuron dictionary, and determining the weight index by determining the weight position in the weight dictionary.
Step S96: looking up the operation result in the operation codebook according to the weight index and the neuron index, thereby determining the output neuron.
Referring to FIG. 2A, FIG. 2A is a schematic structural diagram of a hierarchical storage device according to an embodiment of the present application. As shown in FIG. 2A, the device includes an exact storage unit and an inexact storage unit; the exact storage unit is configured to store the important bits of the data, and the inexact storage unit is configured to store the non-important bits of the data.
The exact storage unit uses error checking and correcting (ECC) memory, and the inexact storage unit uses non-ECC memory.
Further, the data stored in the hierarchical storage device are neural network parameters, including input neurons, weights, and output neurons; the exact storage unit stores the important bits of the input neurons, output neurons, and weights, and the inexact storage unit stores their non-important bits.
Further, the data stored in the hierarchical storage device include floating-point data and fixed-point data. For floating-point data, the sign bit and the exponent part are designated as important bits, and the mantissa part as non-important bits; for fixed-point data, the sign bit and the first x bits of the numerical part are designated as important bits, and the remaining bits of the numerical part as non-important bits, where x is an integer with 0 ≤ x < m and m is the total number of bits of the fixed-point data. The important bits are placed in ECC memory for exact storage, and the non-important bits in non-ECC memory for inexact storage.
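A minimal sketch of this bit partition for IEEE-754 float32 values (sign 1 bit, exponent 8 bits, mantissa 23 bits); the IEEE-754 layout is standard, while applying it to illustrate the hierarchical storage policy, and the function names, are our own framing.

```python
import struct

def split_float32(value):
    """Split a float32 into important (sign + exponent) and
    non-important (mantissa) bits."""
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    important = bits >> 23                  # 9 bits, kept in ECC memory
    nonimportant = bits & ((1 << 23) - 1)   # 23 bits, kept in non-ECC memory
    return important, nonimportant

def merge_float32(important, nonimportant):
    """Splice the two parts back into a full float32 value."""
    bits = (important << 23) | (nonimportant & ((1 << 23) - 1))
    return struct.unpack("<f", struct.pack("<I", bits))[0]
```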
Further, the ECC memory includes dynamic random access memory (DRAM) with ECC checking and static random access memory (SRAM) with ECC checking; the SRAM with ECC checking uses 6T SRAM, and in other embodiments of the present application 4T SRAM or 3T SRAM may also be used.
Further, the non-ECC memory includes DRAM without ECC checking and SRAM without ECC checking; the SRAM without ECC checking uses 6T SRAM, and in other embodiments of the present application 4T SRAM or 3T SRAM may also be used.
In a 6T SRAM, the cell storing each bit consists of six metal-oxide-semiconductor (MOS) transistors; in a 4T SRAM, the cell storing each bit consists of four MOS transistors; and in a 3T SRAM, the cell storing each bit consists of three MOS transistors.
SRAM storing neural network weights generally uses 6T SRAM, which is highly stable but occupies a large area and has high read/write power consumption. Neural network algorithms have a certain fault tolerance that 6T SRAM cannot exploit. Therefore, to fully exploit the fault tolerance of neural networks, this embodiment replaces 6T SRAM with 4T SRAM or 3T SRAM storage technology, increasing SRAM storage density and reducing SRAM access power, while using the fault tolerance of neural network algorithms to mask the weaker noise immunity of 4T SRAM.
Referring to FIG. 2B, FIG. 2B is a schematic structural diagram of a 4T SRAM storage cell according to an embodiment of the present application. As shown in FIG. 2B, the 4T SRAM storage cell consists of four NMOS transistors: M1 (first MOS transistor), M2 (second MOS transistor), M3 (third MOS transistor), and M4 (fourth MOS transistor). M1 and M2 are used for gating, and M3 and M4 for storage.
The gate of M1 is electrically connected to the word line WL and its source to the bit line BL; the gate of M2 is electrically connected to the word line WL and its source to the bit line BLB. The gate of M3 is connected to the source of M4 and the drain of M2, and to the working voltage Vdd through the resistor R2, with the drain of M3 grounded; the gate of M4 is connected to the source of M3 and the drain of M1, and to the working voltage Vdd through the resistor R1, with the drain of M4 grounded. WL controls gated access to the storage cell, and BL performs reads and writes of the storage cell. In a read operation, WL is pulled high and the bit is read from BL. In a write operation, WL is pulled high and BL is pulled high or low; since the drive strength of BL is greater than that of the cell, the original state is forcibly overwritten.
Referring to FIG. 2C, FIG. 2C is a schematic structural diagram of a 3T SRAM storage cell according to an embodiment of the present application. As shown in FIG. 2C, the 3T SRAM storage cell consists of three MOS transistors: M1 (first MOS transistor), M2 (second MOS transistor), and M3 (third MOS transistor). M1 is used for gating, and M2 and M3 for storage.
The gate of M1 is electrically connected to the word line WL and its source to the bit line BL. The gate of M2 is connected to the source of M3, and to the working voltage Vdd through the resistor R2, with the drain of M2 grounded; the gate of M3 is connected to the source of M2 and the drain of M1, and to the working voltage Vdd through the resistor R1, with the drain of M3 grounded. WL controls gated access to the storage cell, and BL performs reads and writes of the storage cell. In a read operation, WL is pulled high and the bit is read from BL. In a write operation, WL is pulled high and BL is pulled high or low; since the drive strength of BL is greater than that of the cell, the original state is forcibly overwritten.
The storage device of the present application adopts approximate storage technology, which can fully exploit the fault tolerance of neural networks: the neural network parameters are stored approximately, with the important bits of each parameter stored exactly and the non-important bits stored inexactly, thereby reducing storage overhead and memory-access energy overhead.
An embodiment of the present application provides a data processing device, an acceleration device corresponding to the approximate storage technology. Referring to FIG. 2D, FIG. 2D is a schematic structural diagram of a data processing device according to an embodiment of the present application. The data processing device includes an inexact computing unit, an instruction control unit, and the hierarchical storage device described above.
The hierarchical storage device receives instructions and operation parameters, stores the important bits of the operation parameters and the instructions in the exact storage unit, and stores the non-important bits of the operation parameters in the inexact storage unit.
The instruction control unit receives the instructions from the hierarchical storage device and decodes them to generate control information that controls the inexact computing unit to perform computation.
The inexact computing unit receives the operation parameters from the hierarchical storage device, performs the computation according to the control information, and transmits the computation result to the hierarchical storage device for storage or output.
Further, the inexact computing unit is a neural network processor. Further, the operation parameters are neural network parameters, and the hierarchical storage device stores the neurons, weights, and instructions of the neural network: the important bits of the neurons, the important bits of the weights, and the instructions are stored in the exact storage unit, while the non-important bits of the neurons and of the weights are stored in the inexact storage unit. The inexact computing unit receives the input neurons and weights from the hierarchical storage device, completes the neural network operation according to the control information to obtain the output neurons, and transmits the output neurons back to the hierarchical storage device for storage or output.
Further, the inexact computing unit may have two compute modes: (1) the inexact computing unit directly receives the important bits of the input neurons and the important bits of the weights from the exact storage unit of the hierarchical storage device and computes with them; (2) the inexact computing unit receives complete input neurons and weights obtained by splicing the important and non-important bits, the splicing being performed when they are read from the storage units.
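A minimal sketch of the two compute modes, reusing merge_float32 from the previous sketch; the function name and mode encoding are illustrative assumptions.

```python
def load_operand(important, nonimportant, mode):
    """Fetch an operand under the two compute modes described above.

    mode 1: use only the important bits (mantissa taken as zero);
    mode 2: splice important and non-important bits into the full value.
    """
    if mode == 1:
        return merge_float32(important, 0)  # truncated, exact-storage value
    return merge_float32(important, nonimportant)  # full spliced value
```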
Further, referring to FIG. 2E, as shown in FIG. 2E, the data processing device further includes a preprocessing module for preprocessing the raw input data and transmitting it to the storage device; the preprocessing includes segmentation, Gaussian filtering, binarization, regularization, normalization, and so on.
Further, the data processing device further includes an instruction cache, an input neuron hierarchical cache, a weight hierarchical cache, and an output neuron hierarchical cache. The instruction cache is disposed between the hierarchical storage device and the instruction control unit to store dedicated instructions. The input neuron hierarchical cache is disposed between the storage device and the inexact computing unit to cache the input neurons, and includes an input neuron exact cache and an input neuron inexact cache, which cache the important bits and non-important bits of the input neurons, respectively. The weight hierarchical cache is disposed between the storage device and the inexact computing unit to cache the weight data, and includes a weight exact cache and a weight inexact cache, which cache the important bits and non-important bits of the weights, respectively. The output neuron hierarchical cache is disposed between the storage device and the inexact computing unit to cache the output neurons, and includes an output neuron exact cache and an output neuron inexact cache, which cache the important bits and non-important bits of the output neurons, respectively.
Further, the data processing device further includes a direct memory access (DMA) unit for reading and writing data or instructions among the storage device, the instruction cache, the weight hierarchical cache, the input neuron hierarchical cache, and the output neuron hierarchical cache.
Further, the instruction cache, input neuron hierarchical cache, weight hierarchical cache, and output neuron hierarchical cache all use 4T SRAM or 3T SRAM.
Further, the inexact computing unit includes, but is not limited to, three parts: a first part, a multiplier; a second part, an adder tree; and a third part, an activation function unit. The first part multiplies input data 1 (in1) by input data 2 (in2) to obtain the output (out): out = in1 * in2. The second part adds the input data in1 stepwise through the adder tree to obtain the output data (out), where in1 is a vector of length N with N > 1: out = in1[1] + in1[2] + ... + in1[N]; or accumulates the input data (in1) through the adder tree and then adds the input data (in2) to obtain the output data (out): out = in1[1] + in1[2] + ... + in1[N] + in2; or adds the input data (in1) and the input data (in2) to obtain the output data (out): out = in1 + in2. The third part applies an activation function (active) to the input data (in) to obtain the activation output data (out): out = active(in), where the activation function active may be sigmoid, tanh, relu, softmax, and the like; besides the activation operation, the third part may also apply other nonlinear functions, obtaining the output data (out) from the input data (in) through an operation (f): out = f(in).
The inexact computing unit may further include a pooling unit, which obtains the output data (out) from the input data (in) through a pooling operation: out = pool(in), where pool denotes the pooling operation, including but not limited to average pooling, max pooling, and median pooling; the input data in is the data in the pooling window associated with the output out.
The operations performed by the inexact computing unit comprise several parts: the first part multiplies input data 1 and input data 2 to obtain the product; the second part performs the adder tree operation, adding input data 1 stepwise through the adder tree, or adding input data 1, accumulated through the adder tree, to input data 2, to obtain the output data; the third part performs the activation function operation, applying the activation function (active) to the input data to obtain the output data. The operations of these parts can be freely combined to implement operations of various functions.
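A minimal sketch of composing the three parts into out = active(sum(in1) + in2); the Python-level structure is illustrative of the datapath described above, not of the hardware itself.

```python
import numpy as np

def inexact_unit(in1, in2=None, active=None):
    """in1: vector of products or inputs; in2: optional extra operand;
    active: optional nonlinearity applied to the accumulated result."""
    out = np.sum(in1)          # adder tree: in1[1] + ... + in1[N]
    if in2 is not None:
        out = out + in2        # ... + in2
    if active is not None:
        out = active(out)      # e.g. relu, sigmoid, tanh
    return out

relu = lambda x: np.maximum(x, 0.0)
# Example: one neuron, out = active(sum(w * x) + b)
w, x, b = np.array([0.5, -0.2]), np.array([1.0, 2.0]), 0.1
y = inexact_unit(w * x, b, relu)
```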
The data processing device of the present application can make full use of approximate storage technology and fully exploit the fault tolerance of neural networks, reducing the computation and memory traffic of the neural network and thereby reducing computation energy and memory-access energy. By adopting dedicated SIMD instructions for multilayer artificial neural network operations and customized computing units, the problems of insufficient CPU and GPU performance and high front-end decoding overhead are solved, and support for multilayer artificial neural network operation algorithms is effectively improved. By adopting a dedicated on-chip cache with inexact storage for the multilayer artificial neural network operation algorithm, the relative importance of the input neuron and weight data is fully exploited, repeated reads of these data from memory are avoided, memory access bandwidth is reduced, and memory bandwidth is prevented from becoming a performance bottleneck for multilayer artificial neural network operations and their training algorithms.
The above is merely an exemplary description, and the present application is not limited thereto. The data processing device may include a non-neural-network processor, for example a general-purpose computing processor; general-purpose computation has corresponding general-purpose instructions and data, for example scalar arithmetic operations and scalar logic operations, and the general-purpose computing processor includes, for example but not limited to, one or more multipliers and one or more adders, performing basic operations such as addition and multiplication.
Yet another embodiment of the present application provides a data storage method that stores data hierarchically in an approximate-storage manner. Referring to FIG. 2F, FIG. 2F is a flowchart of a data storage method according to an embodiment of the present application, including the following steps:
S601: storing the important bits of the data exactly.
S602: storing the non-important bits of the data inexactly.
Specifically, the data storage method includes the following steps:
extracting the important bits and non-important bits of the data;
storing the important bits of the data in ECC memory for exact storage; and
storing the non-important bits of the data in non-ECC memory for inexact storage.
In this embodiment, the stored data are neural network parameters, and the bits representing a neural network parameter are divided into important bits and non-important bits. For example, a parameter of the neural network has m bits in total, of which n bits are important bits and (m - n) bits are non-important bits, where m is an integer greater than 0 and n is an integer greater than 0 and less than or equal to m.
The neural network parameters include input neurons, weights, and output neurons; the important bits of the input neurons, the important bits of the output neurons, and the important bits of the weights are stored exactly, while their non-important bits are stored inexactly.
The data include floating-point data and fixed-point data. For floating-point data, the sign bit and the exponent part are defined as important bits and the mantissa part as non-important bits; for fixed-point data, the sign bit and the first x bits of the numerical part are important bits and the remaining bits of the numerical part are non-important bits, where x is an integer with 0 ≤ x < m and m is the total number of bits of the parameter.
The ECC memory includes SRAM with ECC checking and DRAM with ECC checking; the non-ECC memory includes SRAM without ECC checking and DRAM without ECC checking. The SRAM with ECC checking and the SRAM without ECC checking use 6T SRAM; in other embodiments of the present application, 4T SRAM or 3T SRAM may also be used.
Yet another embodiment of the present application provides a data processing method. FIG. 2G is a flowchart of a data processing method according to an embodiment of the present application; as shown in FIG. 2G, it includes:
S1: receiving instructions and parameters, storing the important bits of the parameters and the instructions exactly, and storing the non-important bits of the parameters inexactly;
S2: receiving the instructions and decoding them to generate control information; and
S3: receiving the parameters, performing the operation according to the control information, and storing the operation result.
The operation is a neural network operation, and the parameters are neural network parameters, including input neurons, weights, and output neurons.
Step S3 further includes: receiving the input neurons and weights, completing the neural network operation according to the control information to obtain the output neurons, and storing or outputting the output neurons.
Further, receiving the input neurons and weights and completing the neural network operation according to the control information to obtain the output neurons includes: receiving the important bits of the input neurons and of the weights for computation; or receiving complete input neurons and weights obtained by splicing the important and non-important bits for computation.
Further, the method includes the following steps: caching dedicated instructions; performing exact and inexact caching of the input neurons; performing exact and inexact caching of the weight data; and performing exact and inexact caching of the output neurons.
Further, before step S1, the method includes preprocessing the parameters.
Yet another embodiment of the present application provides a storage unit, which is a 4T SRAM or a 3T SRAM, for storing neural network parameters. The specific structure of the 4T SRAM is shown in FIG. 2B, and the specific structure of the 3T SRAM is shown in FIG. 2C; they are not described again here.
Referring to FIG. 3A, FIG. 3A is a schematic structural diagram of a dynamic voltage and frequency scaling (DVFS) device 100 according to an embodiment of the present application. As shown in FIG. 3A, the DVFS device 100 includes:
an information acquisition unit 101, configured to acquire in real time the working state information or application scenario information of the chip connected to the DVFS device, where the application scenario information is obtained by the chip through neural network computation or acquired by a sensor connected to the chip; and
a voltage and frequency scaling unit 102, configured to send voltage-frequency scaling information to the chip according to the working state information or application scenario information of the chip, where the voltage-frequency scaling information is used to instruct the chip to adjust its working voltage or working frequency.
In a possible embodiment of the present application, the working state information of the chip includes the running speed of the chip, the voltage-frequency scaling information includes first voltage-frequency scaling information, and the voltage and frequency scaling unit 102 is configured to:
send the first voltage-frequency scaling information to the chip when the running speed of the chip is greater than a target speed, where the first voltage-frequency scaling information instructs the chip to lower its working frequency or working voltage, and the target speed is the running speed of the chip at which user requirements are met.
Specifically, the information acquisition unit 101 acquires in real time the running speed of the chip connected to it. The running speed of the chip may be a different type of speed depending on the task the chip performs: when the chip performs video image processing, the running speed may be the frame rate at which the chip processes video images; when the chip performs speech recognition, the running speed is the speed at which the chip recognizes speech from the input information. When the voltage and frequency scaling unit 102 determines that the running speed of the chip is greater than the target speed, that is, the chip already runs at least as fast as user requirements demand, it sends the first voltage-frequency scaling information to the chip to instruct the chip to lower its working voltage or working frequency, thereby reducing the power consumption of the chip.
For example, suppose the chip performs video image processing and the target speed is 24 frames per second. The information acquisition unit acquires in real time the frame rate at which the chip processes video images, currently 54 frames per second. When the voltage and frequency scaling unit determines that the chip's current video-processing frame rate is greater than the target speed, it sends the first voltage-frequency scaling information to the chip to instruct the chip to lower its working voltage or working frequency, thereby reducing the power consumption of the chip.
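A minimal sketch of this first scaling rule: if the measured speed (e.g. 54 fps) exceeds the target (e.g. 24 fps), lower the frequency and voltage by one step. The step sizes, and the choice to scale both quantities together, are illustrative assumptions.

```python
def dvfs_step(running_speed, target_speed, freq, voltage,
              freq_step=0.05, volt_step=0.02):
    """Lower frequency/voltage by one step whenever the chip runs
    faster than the user-required target speed."""
    if running_speed > target_speed:
        freq *= (1.0 - freq_step)      # lower working frequency
        voltage *= (1.0 - volt_step)   # lower working voltage
    return freq, voltage

freq, volt = dvfs_step(running_speed=54, target_speed=24,
                       freq=1.0e9, voltage=0.9)
```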
在本申请的一可能实施例中,所述芯片至少包括第一单元和第二单元,所述第一单元的输出数据为所述第二单元的输入数据,所述芯片的工作状态信息包括所述第一单元的运行速度和第二单元的运行速度,所述电压频率调控信息包括第二电压频率调控信息,所述调频调压单元102还用于:In a possible embodiment of the present application, the chip includes at least a first unit and a second unit, the output data of the first unit is input data of the second unit, and the working status information of the chip includes The operating speed of the first unit and the operating speed of the second unit, the voltage frequency regulation information includes second voltage frequency regulation information, and the frequency modulation unit 102 is further configured to:
当根据所述第一单元的运行速度和所述第二单元的运行速度确定所述第一单元的运行时间超过所述第二单元的运行时间时,向所述第二单元发送所述第二电压频率调控信息,所述第二电压频率调控信息用于指示所述第二单元降低其工作频率或者工作电压。And when the running time of the first unit exceeds the running time of the second unit according to the running speed of the first unit and the running speed of the second unit, sending the second unit to the second unit Voltage frequency regulation information, the second voltage frequency regulation information is used to instruct the second unit to reduce its operating frequency or operating voltage.
具体地,上述芯片执行任务需要上述第一单元和上述第二单元的配合,并且上述第一单元的输出数据为上述第二单元的输入数据。上述信息采集单元101实时采集上述第一单元和上述第二单元的运行速度。当确定上述第一单元的运行速度小于上述第二单元的运行速度即上述第一单元的运行时间超过上述第二单元的运行时间时,上述调压调频单元102向上述第二单元发送上述第二电压频率调控信息,以指示上述第二单元降低其工作电压或者工作频率,达到在不影响芯片整体的运行速度的前提下,达到降低芯片整体的功耗。Specifically, the chip performing the task requires the cooperation of the first unit and the second unit, and the output data of the first unit is the input data of the second unit. The information collecting unit 101 collects the operating speeds of the first unit and the second unit in real time. When it is determined that the running speed of the first unit is less than the running speed of the second unit, that is, the running time of the first unit exceeds the running time of the second unit, the voltage regulating and frequency converting unit 102 sends the second unit to the second unit. The voltage frequency regulation information is used to instruct the second unit to lower its working voltage or operating frequency, so as to reduce the power consumption of the whole chip without affecting the overall running speed of the chip.
In a possible embodiment of the present application, the voltage-frequency regulation information includes third voltage-frequency regulation information, and the voltage and frequency regulation unit 102 is further configured to:
when it is determined, from the running speed of the first unit and the running speed of the second unit, that the running time of the second unit exceeds the running time of the first unit, send the third voltage-frequency regulation information to the first unit, the third voltage-frequency regulation information instructing the first unit to lower its operating frequency or operating voltage.
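The two cases above are symmetric: whichever stage of the producer-consumer pair finishes sooner can be slowed down at no cost to throughput. A sketch under the same illustrative interface assumptions (runtime() and send_scale_down() are invented names):

```python
# Hedged sketch: throttle whichever pipeline stage would otherwise
# wait. The first unit's output feeds the second unit, so overall
# speed is set by the slower of the two.

def balance_pipeline(regulator, first_unit, second_unit):
    t1 = first_unit.runtime()   # producer time per batch
    t2 = second_unit.runtime()  # consumer time per batch
    if t1 > t2:
        # Consumer idles waiting for data: second voltage-frequency
        # regulation information, sent to the second unit.
        regulator.send_scale_down(second_unit)
    elif t2 > t1:
        # Producer idles waiting for the consumer: third
        # voltage-frequency regulation information, sent to the first unit.
        regulator.send_scale_down(first_unit)
```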
In a possible embodiment of the present application, the chip includes at least N units, and the working state information of the chip includes the working state information of at least S units among the at least N units, where N is an integer greater than 1 and S is an integer less than or equal to N. The voltage-frequency regulation information includes fourth voltage-frequency regulation information, and the voltage and frequency regulation unit 102 is configured to:
when it is determined, from the working state information of a unit A, that the unit A is in an idle state, send the fourth voltage-frequency regulation information to the unit A, the fourth voltage-frequency regulation information instructing the unit A to lower its operating frequency or operating voltage,
where the unit A is any one of the at least S units.
In a possible embodiment of the present application, the voltage-frequency regulation information includes fifth voltage-frequency regulation information, and the voltage and frequency regulation unit 102 is further configured to:
when it is determined, from the working state information of the unit A, that the unit A has returned to a working state, send the fifth voltage-frequency regulation information to the unit A, the fifth voltage-frequency regulation information instructing the unit A to raise its operating voltage or operating frequency.
Specifically, while the chip is working, the information collection unit 101 collects, in real time, the working state information of at least S units inside the chip. When it is determined from the working state information of unit A that unit A is in an idle state, the voltage and frequency regulation unit 102 sends the fourth voltage-frequency regulation information to unit A, instructing it to lower its operating frequency or operating voltage and thereby reduce its power consumption; when it is determined from the working state information of unit A that unit A has returned to a working state, the voltage and frequency regulation unit 102 sends the fifth voltage-frequency regulation information to unit A, instructing it to raise its operating frequency or operating voltage so that unit A's running speed meets the demands of the work.
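For illustration, the idle/active rule can be tracked per unit as below; is_idle(), scale_down(), and scale_up() are invented names, and the edge-triggered gating (acting only on state transitions) is one plausible reading of the embodiment, not a mandated implementation:

```python
# Hedged sketch: per-unit DVFS keyed on idle/active transitions.
# `units` is any iterable of objects exposing the illustrative
# methods is_idle(), scale_down(), and scale_up().

def regulate_units(units, was_idle: dict) -> None:
    for unit in units:
        idle = unit.is_idle()
        if idle and not was_idle.get(unit, False):
            # Fourth voltage-frequency regulation information:
            # the unit just went idle, so lower voltage/frequency.
            unit.scale_down()
        elif not idle and was_idle.get(unit, False):
            # Fifth voltage-frequency regulation information:
            # the unit resumed work, so raise voltage/frequency.
            unit.scale_up()
        was_idle[unit] = idle
```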
In a possible embodiment of the present application, the application scenario of the chip is image recognition, the application scenario information is the number of objects in the image to be recognized, and the voltage-frequency regulation information includes sixth voltage-frequency regulation information. The voltage and frequency regulation unit 102 is further configured to:
when it is determined that the number of objects in the image to be recognized is less than a first threshold, send the sixth voltage-frequency regulation information to the chip, the sixth voltage-frequency regulation information instructing the chip to lower its operating voltage or operating frequency.
Specifically, the chip is applied to image recognition, and the number of objects in the image to be recognized is obtained by the chip through a neural network algorithm. After the information collection unit 101 obtains from the chip the number of objects in the image to be recognized (that is, the application scenario information), the voltage and frequency regulation unit 102 sends the sixth voltage-frequency regulation information to the chip when it determines that this number is less than the first threshold, instructing the chip to lower its operating voltage or operating frequency; when it determines that the number of objects in the image to be recognized is greater than the first threshold, it sends voltage-frequency regulation information instructing the chip to raise its operating voltage or operating frequency.
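As a sketch, this rule is a single threshold comparison on a value the chip itself computes; the threshold value below and the scale_up()/scale_down() interface are illustrative assumptions carried over from the earlier sketches:

```python
# Hedged sketch of the image-recognition scenario rule: fewer objects
# in the image means a lighter workload, so voltage/frequency drops.

FIRST_THRESHOLD = 5  # illustrative; the patent fixes no concrete value

def regulate_for_object_count(chip, object_count: int) -> None:
    if object_count < FIRST_THRESHOLD:
        chip.scale_down()  # sixth voltage-frequency regulation information
    elif object_count > FIRST_THRESHOLD:
        chip.scale_up()    # heavier scene: raise voltage/frequency
```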
In a possible embodiment of the present application, the application scenario information is object tag information, the voltage-frequency regulation information includes seventh voltage-frequency regulation information, and the voltage and frequency regulation unit 102 is further configured to:
when it is determined that the object tag information belongs to a preset object tag set, send the seventh voltage-frequency regulation information to the chip, the seventh voltage-frequency regulation information instructing the chip to raise its operating voltage or operating frequency.
For example, the preset object tag set includes multiple object tags, such as "person", "dog", "tree", and "flower". When the chip determines, through a neural network algorithm, that the current application scenario includes a dog, it transmits the object tag information including "dog" to the information collection unit 101. When the voltage and frequency regulation unit 102 determines that the object tag information includes "dog", it sends the seventh voltage-frequency regulation information to the chip, instructing the chip to raise its operating voltage or operating frequency; when it determines that the object tag information does not belong to the preset object tag set, it sends voltage-frequency regulation information instructing the chip to lower its operating voltage or operating frequency.
In a possible embodiment of the present application, the chip is applied to speech recognition, the application scenario information is the voice input rate, the voltage-frequency regulation information includes eighth voltage-frequency regulation information, and the voltage and frequency regulation unit is further configured to:
when the voice input rate is less than a second threshold, send the eighth voltage-frequency regulation information to the chip, the eighth voltage-frequency regulation information instructing the chip to lower its operating voltage or operating frequency.
Specifically, the application scenario of the chip is speech recognition, and the input unit of the chip feeds voice to the chip at a certain rate. The information collection unit 101 collects the voice input rate in real time and sends the voice input rate information to the voltage and frequency regulation unit 102. When the voltage and frequency regulation unit 102 determines that the voice input rate is less than the second threshold, it sends the eighth voltage-frequency regulation information to the chip, instructing the chip to lower its operating voltage or operating frequency; when it determines that the voice input rate is greater than the second threshold, it sends voltage-frequency regulation information instructing the chip to raise its operating voltage.
In a possible embodiment of the present application, the application scenario information is a keyword obtained by the chip through speech recognition, the voltage-frequency regulation information includes ninth voltage-frequency regulation information, and the voltage and frequency regulation unit is further configured to:
when the keyword belongs to a preset keyword set, send the ninth voltage-frequency regulation information to the chip, the ninth voltage-frequency regulation information instructing the chip to raise its operating voltage or operating frequency.
Further, when the keyword does not belong to the preset keyword set, the voltage and frequency regulation unit 102 sends voltage-frequency regulation information instructing the chip to lower its operating voltage or operating frequency.
For example, the application scenario of the chip is speech recognition, and the preset keyword set includes keywords such as "image beautification", "neural network algorithm", "image processing", and "Alipay". If the application scenario information is "image beautification", the voltage and frequency regulation unit 102 sends the ninth voltage-frequency regulation information to the chip, instructing the chip to raise its operating voltage or operating frequency; if the application scenario information is "take a photo", the voltage and frequency regulation unit 102 sends voltage-frequency regulation information instructing the chip to lower its operating voltage or operating frequency.
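Both the object-tag rule and this keyword rule are set-membership tests. A combined sketch, reusing the illustrative chip interface; the set contents echo the examples in the text, but the data types are assumptions:

```python
# Hedged sketch: raise voltage/frequency when a recognized keyword or
# object tag belongs to a preset set, and lower them otherwise.

PRESET_KEYWORDS = {"image beautification", "neural network algorithm",
                   "image processing", "Alipay"}  # examples from the text

def regulate_for_label(chip, label: str) -> None:
    if label in PRESET_KEYWORDS:
        chip.scale_up()    # ninth (or seventh) regulation information
    else:
        chip.scale_down()  # lighter scenario: save power
```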
In a possible embodiment of the present application, the chip is applied to machine translation, the application scenario information is the text input speed or the number of characters in an image to be translated, and the voltage-frequency regulation information includes tenth voltage-frequency regulation information. The voltage and frequency regulation unit is further configured to:
when the text input speed is less than a third threshold, or the number of characters in the image to be translated is less than a fourth threshold, send the tenth voltage-frequency regulation information to the chip, the tenth voltage-frequency regulation information instructing the chip to lower its operating voltage or operating frequency.
Specifically, the chip is applied to machine translation. The application scenario information collected by the information collection unit 101 is the text input speed or the number of characters in the image to be translated, and this information is transmitted to the voltage and frequency regulation unit 102. When it determines that the text input speed is less than the third threshold or that the number of characters in the image to be translated is less than the fourth threshold, the voltage and frequency regulation unit 102 sends the tenth voltage-frequency regulation information to the chip, instructing the chip to lower its operating voltage; when it determines that the text input speed is greater than the third threshold or that the number of characters in the image to be translated is greater than the fourth threshold, it sends voltage-frequency regulation information instructing the chip to raise its operating voltage.
In a possible embodiment of the present application, the application scenario information is the ambient light intensity, the voltage-frequency regulation information includes eleventh voltage-frequency regulation information, and the voltage and frequency regulation unit is further configured to:
when the ambient light intensity is less than a fifth threshold, send the eleventh voltage-frequency regulation information to the chip, the eleventh voltage-frequency regulation information instructing the chip to lower its operating voltage or operating frequency.
Specifically, the ambient light intensity is acquired by a light sensor connected to the chip. After obtaining the light intensity, the information collection unit 101 transmits it to the voltage and frequency regulation unit 102. When it determines that the light intensity is less than the fifth threshold, the voltage and frequency regulation unit 102 sends the eleventh voltage-frequency regulation information to the chip, instructing the chip to lower its operating voltage; when it determines that the light intensity is greater than the fifth threshold, it sends voltage-frequency regulation information instructing the chip to raise its operating voltage or operating frequency.
In a possible embodiment of the present application, the chip is applied to image beautification, the voltage-frequency regulation information includes twelfth voltage-frequency regulation information and thirteenth voltage-frequency regulation information, and the voltage and frequency regulation unit is further configured to:
when the application scenario information is a face image, send the twelfth voltage-frequency regulation information to the chip, the twelfth voltage-frequency regulation information instructing the chip to raise its operating voltage or operating frequency; and
when the application scenario information is not a face image, send the thirteenth voltage-frequency regulation information to the chip, the thirteenth voltage-frequency regulation information instructing the chip to lower its operating voltage or operating frequency.
In a possible embodiment of the present application, the chip is applied to speech recognition, and the application scenario information is the voice intensity. When the voice intensity is greater than a sixth threshold, the voltage and frequency regulation unit 102 sends voltage-frequency regulation information instructing the chip to lower its operating voltage or operating frequency; when the voice intensity is less than the sixth threshold, the voltage and frequency regulation unit 102 sends voltage-frequency regulation information instructing the chip to raise its operating voltage or operating frequency.
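The scenario rules above (voice input rate, text input speed, ambient light, face image, voice intensity) share one shape: compare a collected signal against a preset condition and choose a scale direction. A table-driven sketch of that dispatch, in which every threshold value and signal name is an illustrative assumption rather than something the patent fixes:

```python
# Hedged sketch: table-driven form of the scenario rules above.
# All thresholds are placeholders; the patent specifies no numbers.

SECOND_THRESHOLD = 1.0  # voice input rate
THIRD_THRESHOLD = 1.0   # text input speed
FIFTH_THRESHOLD = 1.0   # ambient light intensity
SIXTH_THRESHOLD = 1.0   # voice intensity

RULES = {
    # signal name -> predicate that is True when scaling DOWN applies
    "voice_input_rate": lambda v: v < SECOND_THRESHOLD,
    "text_input_speed": lambda v: v < THIRD_THRESHOLD,
    "light_intensity":  lambda v: v < FIFTH_THRESHOLD,
    "voice_intensity":  lambda v: v > SIXTH_THRESHOLD,
    "is_face_image":    lambda v: not v,  # non-face image -> lower
}

def regulate_for_scenario(chip, signal: str, value) -> None:
    if RULES[signal](value):
        chip.scale_down()
    else:
        chip.scale_up()
```

The table form makes the pattern explicit: supporting a further scenario signal, as the embodiments above accumulate them, only adds a row.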
It should be noted that the scenario information may be information about the external scene collected by sensors, such as light intensity or voice intensity. The application scenario information may also be information computed by an artificial intelligence algorithm; for example, in an object recognition task, the chip's real-time computation results are fed back to the information collection unit, and this information includes the number of objects in the scene, face images, object tag keywords, and the like.
Optionally, the artificial intelligence algorithm includes, but is not limited to, a neural network algorithm.
It can be seen that, in the solution of this embodiment of the present invention, the dynamic voltage and frequency scaling device collects, in real time, the working state information of the chip connected to it and of the units inside the chip, or the chip's application scenario information, and adjusts the operating frequency or operating voltage of the chip or of its internal units according to that information, thereby reducing the chip's overall operating power consumption.
Referring to FIG. 3B, FIG. 3B is a schematic diagram of a dynamic voltage and frequency scaling application scenario according to an embodiment of the present application. As shown in FIG. 3B, the convolution operation device includes a dynamic voltage and frequency scaling device 210 and a chip 220 connected to the dynamic voltage and frequency scaling device.
The chip 220 includes a control unit 221, a storage unit 222, and an operation unit 223, and can be used for tasks such as image processing and speech processing.
The dynamic voltage and frequency scaling device 210 collects the working state information of the chip 220 in real time. The working state information of the chip 220 includes the running speed of the chip 220, the running speed of the control unit 221, the running speed of the storage unit 222, and the running speed of the operation unit 223.
In a possible embodiment of the present application, when the chip 220 executes a task, and the dynamic voltage and frequency scaling device 210 determines, from the running speeds of the storage unit 222 and the operation unit 223, that the running time of the storage unit 222 exceeds the running time of the operation unit 223, the device 210 can conclude that the storage unit 222 has become the bottleneck for this task: after completing its current computation, the operation unit 223 must wait for the storage unit 222 to finish its read task and transfer the data it has read before the operation unit 223 can compute on that data. The dynamic voltage and frequency scaling device 210 sends first voltage-frequency regulation information to the operation unit 223, instructing it to lower its operating voltage or operating frequency and thus its running speed, which reduces the overall operating power consumption of the chip 220 without affecting the completion time of the task.
In a possible embodiment of the present application, when the chip 220 executes a task, and the dynamic voltage and frequency scaling device 210 determines, from the running speeds of the storage unit 222 and the operation unit 223, that the running time of the storage unit 222 is less than the running time of the operation unit 223, the device 210 can conclude that the operation unit 223 has become the bottleneck: after the storage unit 222 finishes reading data, the operation unit 223 has not yet completed its current computation, so the storage unit 222 must wait for the operation unit 223 to finish before transferring the data it has read. The dynamic voltage and frequency scaling device 210 sends second voltage-frequency regulation information to the storage unit 222, instructing it to lower its operating voltage or operating frequency and thus its running speed, which reduces the overall operating power consumption of the chip 220 without affecting the completion time of the task.
In a possible embodiment of the present application, the dynamic voltage and frequency scaling device 210 obtains the running speed of the chip 220 in real time. When the device 210 determines that the running speed of the chip 220 is greater than the target running speed, where the target running speed is the running speed that satisfies the user's requirements, it sends third voltage-frequency regulation information to the chip 220, instructing the chip to lower its operating voltage or operating frequency and thereby reduce its operating power consumption.
For example, the chip 220 is used for video processing, and under normal conditions the user requires a video processing frame rate of no less than 30 frames per second. Suppose the chip 220 is actually processing video at 100 frames per second; the dynamic voltage and frequency scaling device then sends voltage-frequency regulation information to the chip 220, instructing it to lower its operating voltage or operating frequency so that the video processing frame rate drops to about 30 frames per second.
In a possible embodiment of the present application, the dynamic voltage and frequency scaling device 210 monitors, in real time, the working state of each unit in the chip 220 (including the control unit 221, the storage unit 222, and the operation unit 223). When the device 210 determines that any of these units is in an idle state, it sends fourth voltage-frequency regulation information to that unit, instructing it to lower its operating voltage or operating frequency and thereby reduce the power consumption of the chip 220. When the unit returns to a working state, the device 210 sends fifth voltage-frequency regulation information to the unit to raise its operating voltage or operating frequency, so that the unit's running speed meets the demands of the work. It can be seen that, in the solution of this embodiment of the application, the dynamic voltage and frequency scaling device 210 collects, in real time, the running speed information of the chip and of its internal units, and lowers the operating frequency or operating voltage of the chip or of its internal units according to that information, thereby reducing the chip's overall operating power consumption.
Referring to FIG. 3C, FIG. 3C is a schematic diagram of another dynamic voltage and frequency scaling application scenario according to an embodiment of the present application. As shown in FIG. 3C, the convolution operation device includes a dynamic voltage and frequency scaling device 317, a register unit 312, an interconnection module 313, an operation unit 314, a control unit 315, and a data access unit 316.
The operation unit 314 includes at least two of an adder, a multiplier, a comparator, and an activation operator.
The interconnection module 313 is configured to control the connection relationships of the calculators in the operation unit 314 so that the at least two kinds of calculators form different computation topologies.
The register unit 312 (which may be a register file, an instruction cache, or a scratchpad memory) is configured to store the operation instruction, the address of the data block in the storage medium, and the computation topology corresponding to the operation instruction.
Optionally, the convolution operation device further includes a storage medium 311.
The storage medium 311 may be an off-chip memory or, in practical applications, an on-chip memory. It is configured to store data blocks, where a data block may be n-dimensional data with n an integer greater than or equal to 1: for example, n = 1 gives one-dimensional data, that is, a vector; n = 2 gives two-dimensional data, that is, a matrix; and n = 3 or more gives multidimensional data.
The control unit 315 is configured to extract from the register unit 312 an operation instruction, the operation field corresponding to the operation instruction, and the first computation topology corresponding to the operation instruction, and to decode the operation instruction into an execution instruction used to control the operation unit 314 to perform the computation; it transmits the operation field to the data access unit 316 and the computation topology to the interconnection module 313.
The data access unit 316 is configured to extract the data block corresponding to the operation field from the storage medium 311 and to transmit the data block to the interconnection module 313.
The interconnection module 313 is configured to receive the data block of the first computation topology.
In a possible embodiment of the present application, the interconnection module 313 also rearranges the data block according to the first computation topology.
The operation unit 314 is configured so that the execution instruction invokes the calculators of the operation unit 314 to perform the computation on the data block, obtaining a computation result; the result is transmitted to the data access unit 316 and stored in the storage medium 311.
In a possible embodiment of the present application, the operation unit 314 is further configured to perform the computation on the rearranged data block according to the first computation topology and the execution instruction, obtaining a computation result, which is transmitted to the data access unit 316 and stored in the storage medium 311.
In a feasible embodiment, the interconnection module 313 is further configured to form the first computation topology according to the connection relationships of the calculators in the operation unit 314.
The dynamic voltage and frequency scaling device 317 is configured to monitor the working state of the entire convolution operation device and to dynamically regulate its voltage and frequency.
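For orientation, one instruction's path through the modules of FIG. 3C can be sketched as below; every class and method name is an illustrative stand-in for the numbered units, not an interface the patent defines:

```python
# Hedged sketch of the dataflow in FIG. 3C: register unit -> control
# unit -> data access unit -> interconnection module -> operation
# unit -> data access unit -> storage medium.

def execute_one(register_unit, control, data_access, interconnect,
                operation_unit, storage):
    instr, operand_field, topology = control.fetch_and_decode(register_unit)
    block = data_access.load(storage, operand_field)  # fetch the data block
    block = interconnect.arrange(block, topology)     # optional rearrangement
    result = operation_unit.run(instr, block)         # perform the computation
    data_access.store(storage, result)                # write the result back
```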
The specific computation method of the convolution operation device is described below using different operation instructions, with the convolution computation instruction as the example here. The convolution computation instruction can be applied in a neural network, so it may also be called a convolutional neural network instruction. The formula that the convolution computation instruction actually needs to execute can be:
s = s(∑ w·x_i + b)
Here, the convolution kernel W (which may include multiple data) is multiplied by the input data x_i and the products are summed; optionally the bias b is then added, and optionally an activation operation s(h) is applied, yielding the final output result S. From this formula the computation topology is obtained as multiplier → adder → (optional) activation operator. The convolution computation instruction may include an instruction set that contains convolutional neural network COMPUTE instructions with different functions, as well as the CONFIG, IO, NOP, JUMP, and MOVE instructions.
In one embodiment, the COMPUTE instructions include:
a convolution operation instruction, according to which the convolution operation device fetches input data of a specified size and a convolution kernel from specified addresses of a memory (preferably a scratchpad memory or a scalar register file) and performs the convolution operation in the convolution operation component;
a convolutional neural network sigmoid instruction, according to which the convolution operation device fetches input data of a specified size and a convolution kernel from specified addresses of a memory (preferably a scratchpad memory or a scalar register file), performs the convolution operation in the convolution operation component, and then applies a sigmoid activation to the output result;
a convolutional neural network TanH instruction, according to which the convolution operation device fetches input data of a specified size and a convolution kernel from specified addresses of a memory (preferably a scratchpad memory), performs the convolution operation in the convolution operation component, and then applies a TanH activation to the output result;
a convolutional neural network ReLU instruction, according to which the convolution operation device fetches input data of a specified size and a convolution kernel from specified addresses of a memory (preferably a scratchpad memory), performs the convolution operation in the convolution operation component, and then applies a ReLU activation to the output result; and
a convolutional neural network group instruction, according to which the convolution operation device fetches input data of a specified size and a convolution kernel from specified addresses of a memory (preferably a scratchpad memory), divides them into groups, performs the convolution operation in the convolution operation component, and then applies an activation to the output result.
The CONFIG instruction configures, before each layer of artificial neural network computation begins, the various constants needed by the current layer's computation.
The IO instruction reads in, from external storage space, the input data needed by the computation, and stores the data back to external space after the computation completes.
The NOP instruction clears the control signals currently held in all control-signal buffer queues inside the convolution operation device, ensuring that all instructions before the NOP instruction have fully completed; the NOP instruction itself does not contain any operation.
The JUMP instruction controls the jump of the next instruction address to be read from the instruction storage unit, and is used to implement jumps in the control flow.
The MOVE instruction moves data at one address in the internal address space of the convolution operation device to another address in the internal address space of the convolution operation device; this process is independent of the operation unit and does not occupy the operation unit's resources during execution.
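For illustration only, the instruction set above can be modeled as a small decode table; the opcode names follow the text, while the enum encoding and the activation mapping are invented for the sketch and are not the patent's instruction format:

```python
# Hedged sketch: the instruction set described above as a decode table.

from enum import Enum, auto

class Opcode(Enum):
    COMPUTE_CONV = auto()     # convolution, output stored as-is
    COMPUTE_SIGMOID = auto()  # convolution, then sigmoid activation
    COMPUTE_TANH = auto()     # convolution, then TanH activation
    COMPUTE_RELU = auto()     # convolution, then ReLU activation
    COMPUTE_GROUP = auto()    # grouped convolution, then activation
    CONFIG = auto()           # set per-layer constants before a layer
    IO = auto()               # load from / store to external space
    NOP = auto()              # drain the control-signal buffer queues
    JUMP = auto()             # jump to the next instruction address
    MOVE = auto()             # copy within the internal address space

# Which activation each COMPUTE variant applies after the convolution.
ACTIVATION_OF = {
    Opcode.COMPUTE_CONV: None,
    Opcode.COMPUTE_SIGMOID: "sigmoid",
    Opcode.COMPUTE_TANH: "tanh",
    Opcode.COMPUTE_RELU: "relu",
}
```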
Specifically, the method by which the convolution operation device executes the convolution computation instruction can be as follows.
The control unit 315 extracts, from the register unit 312, the convolution computation instruction, the operation field corresponding to the instruction, and the first computation topology corresponding to the instruction (multiplier → adder → adder → activation operator). The control unit transmits the operation field to the data access unit and the first computation topology to the interconnection module.
The data access unit 316 extracts, from the storage medium 311, the convolution kernel w and the bias b corresponding to the operation field (when b is 0, the bias b does not need to be extracted), and transmits the convolution kernel w and the bias b to the operation unit 314.
The multiplier of the operation unit 314 multiplies the convolution kernel w by the input data Xi to obtain a first result; the first result is input to the adder, which performs an addition to obtain a second result; the second result and the bias b are added to obtain a third result; the third result is input to the activation operator, which performs the activation operation to obtain the output result s; and the output result s is transmitted to the data access unit and stored in the storage medium. After any of these steps, the intermediate result can instead be transmitted directly to the data access unit and stored in the storage medium, omitting the subsequent steps. The step of adding the second result and the bias b to obtain the third result is optional, that is, when b is 0 this step is not needed. In addition, the order of the addition and multiplication operations can be swapped.
Optionally, the first result may include the results of multiple multiplication operations.
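The multiply → add → (optional) bias → (optional) activation flow above maps directly onto a few lines of code. A sketch that mirrors s = s(∑ w·x_i + b); the list-based data handling is an illustrative simplification of the hardware dataflow:

```python
# Hedged sketch of the execution flow: multiply, accumulate, optionally
# add the bias b, optionally apply the activation operator.

import math

def conv_step(w, x, b=0.0, activation=None):
    first = [wi * xi for wi, xi in zip(w, x)]   # first result: products
    second = sum(first)                         # second result: accumulation
    third = second + b if b != 0.0 else second  # third result: optional bias
    return activation(third) if activation else third

def sigmoid(h):
    return 1.0 / (1.0 + math.exp(-h))

out = conv_step([0.2, -0.5, 0.1], [1.0, 2.0, 3.0], b=0.3,
                activation=sigmoid)
```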
In a possible embodiment of the present application, an embodiment of the present application provides a neural network processor that includes the convolution operation device described above.
The neural network processor is configured to perform artificial neural network operations and to implement artificial intelligence applications such as speech recognition, image recognition, and translation.
In this convolution computation task, the dynamic voltage and frequency scaling device 317 works as follows.
Case 1: While the neural network processor performs the convolution operation, the dynamic voltage and frequency scaling device 317 obtains, in real time, the running speeds of the processor's data access unit 316 and operation unit 314. When the device 317 determines, from these running speeds, that the running time of the data access unit 316 exceeds the running time of the operation unit 314, it can conclude that the data access unit 316 has become the bottleneck of the convolution operation: after completing its current convolution operation, the operation unit 314 must wait for the data access unit 316 to finish its read task and transfer the data it has read before the operation unit 314 can perform the convolution operation on that data. The device 317 sends first voltage-frequency regulation information to the operation unit 314, instructing it to lower its operating voltage or operating frequency so that its running speed matches that of the data access unit 316; this reduces the power consumption of the operation unit 314, avoids leaving it idle, and ultimately lowers the overall operating power consumption of the neural network processor without affecting the completion time of the task.
Case 2: While the neural network processor performs the convolution operation, the dynamic voltage and frequency scaling device 317 obtains, in real time, the running speeds of the data access unit 316 and the operation unit 314. When the device 317 determines, from these running speeds, that the running time of the operation unit 314 exceeds the running time of the data access unit 316, it can conclude that the operation unit 314 has become the bottleneck: after completing its current data read, the data access unit 316 must wait for the operation unit 314 to finish its current convolution operation before transferring the data it has read to the operation unit 314. The device 317 sends second voltage-frequency regulation information to the data access unit 316, instructing it to lower its operating voltage or operating frequency so that its running speed matches that of the operation unit 314; this reduces the power consumption of the data access unit 316, avoids leaving it idle, and ultimately lowers the overall operating power consumption of the neural network processor without affecting the completion time of the task.
When the neural network processor performs artificial neural network operations for artificial intelligence applications, the dynamic voltage and frequency scaling device 317 collects, in real time, the working parameters of the processor's artificial intelligence application and adjusts the processor's operating voltage or operating frequency according to those parameters.
Specifically, the artificial intelligence applications may be video image processing, object recognition, machine translation, speech recognition, image beautification, and so on.
Case 3: When the neural network processor performs video image processing, the dynamic voltage and frequency scaling device 317 collects, in real time, the frame rate at which the processor processes video images. When this frame rate exceeds the target frame rate, where the target frame rate is the video processing frame rate the user normally requires, the device 317 sends third voltage-frequency regulation information to the processor, instructing it to lower its operating voltage or operating frequency; this reduces the processor's power consumption while still satisfying the user's normal video processing needs.
Case 4: When the neural network processor performs speech recognition, the dynamic voltage and frequency scaling device 317 collects the processor's speech recognition speed in real time. When this speed exceeds the speech recognition speed the user actually needs, the device 317 sends fourth voltage-frequency regulation information to the processor, instructing it to lower its operating voltage or operating frequency; this reduces the processor's power consumption while still satisfying the user's normal speech recognition needs.
Case 5: The dynamic voltage and frequency scaling device 317 monitors, in real time, the working state of each unit or module in the neural network processor (including the storage medium 311, the register unit 312, the interconnection module 313, the operation unit 314, the controller unit 315, and the data access unit 316). When any of these units or modules is in an idle state, the device 317 sends fifth voltage-frequency regulation information to it, lowering its operating voltage or operating frequency and thereby reducing its power consumption; when the unit or module returns to a working state, the device 317 sends sixth voltage-frequency regulation information to it, raising its operating voltage or operating frequency so that its running speed meets the demands of the work.
Referring to FIG. 3D, FIG. 3D is a schematic diagram of yet another dynamic voltage and frequency scaling application scenario according to an embodiment of the present application. As shown in FIG. 3D, the convolution operation device includes a dynamic voltage and frequency scaling device 7, an instruction storage unit 1, a controller unit 2, a data access unit 3, an interconnection module 4, a master operation module 5, and multiple slave operation modules 6. The instruction storage unit 1, the controller unit 2, the data access unit 3, the interconnection module 4, the master operation module 5, and the slave operation modules 6 can all be implemented as hardware circuits (including but not limited to FPGAs, CGRAs, application-specific integrated circuits (ASICs), analog circuits, and memristors).
The instruction storage unit 1 reads in instructions through the data access unit 3 and stores the instructions it has read.
The controller unit 2 reads instructions from the instruction storage unit 1, decodes them into control signals that govern the behavior of the other modules, and sends those signals to the other modules, such as the data access unit 3, the master operation module 5, and the slave operation modules 6.
The data access unit 3 can access the external address space, reading and writing data directly to each storage unit inside the convolution operation device to complete the loading and storing of data.
The interconnection module 4 is used to connect the master operation module and the slave operation modules, and can be implemented as different interconnection topologies (such as a tree structure, a ring structure, a grid structure, a hierarchical interconnection, or a bus structure).
The dynamic voltage and frequency scaling device 7 is configured to obtain, in real time, the working state information of the data access unit 3 and the master operation module 5, and to adjust the operating voltage or operating frequency of the data access unit 3 and the master operation module 5 according to that working state information.
In a possible embodiment of the present application, an embodiment of the present invention provides a neural network processor that includes the convolution operation device described above.
The neural network processor is configured to perform artificial neural network operations and to implement artificial intelligence applications such as speech recognition, image recognition, and translation.
In this convolution computation task, the dynamic voltage and frequency scaling device 7 works as follows.
Case 1: While the convolutional neural network processor performs the convolution operation, the dynamic voltage and frequency scaling device 7 obtains, in real time, the running speeds of the processor's data access unit 3 and master operation module 5. When the device 7 determines, from these running speeds, that the running time of the data access unit 3 exceeds the running time of the master operation module 5, it can conclude that the data access unit 3 has become the bottleneck of the convolution operation: after completing its current convolution operation, the master operation module 5 must wait for the data access unit 3 to finish its read task and transfer the data it has read before the master operation module 5 can perform the convolution operation on that data. The device 7 sends first voltage-frequency regulation information to the master operation module 5, instructing it to lower its operating voltage or operating frequency so that its running speed matches that of the data access unit 3; this reduces the power consumption of the master operation module 5, avoids leaving it idle, and ultimately lowers the overall operating power consumption of the neural network processor without affecting the completion time of the task.
Case 2: While the neural network processor performs the convolution operation, the dynamic voltage and frequency scaling device 7 obtains, in real time, the running speeds of the data access unit 3 and the master operation module 5. When the device 7 determines, from these running speeds, that the running time of the master operation module 5 exceeds the running time of the data access unit 3, it can conclude that the master operation module 5 has become the bottleneck: after completing its current data read, the data access unit 3 must wait for the master operation module 5 to finish its current convolution operation before transferring the data it has read to the master operation module 5. The device 7 sends second voltage-frequency regulation information to the data access unit 3, instructing it to lower its operating voltage or operating frequency so that its running speed matches that of the master operation module 5; this reduces the power consumption of the data access unit 3, avoids leaving it idle, and ultimately lowers the overall operating power consumption of the neural network processor without affecting the completion time of the task.
When the neural network processor executes artificial neural network operations for an artificial intelligence application, the DVFS device 7 collects in real time the working parameters of the processor for that application and adjusts the working voltage or working frequency of the processor according to those parameters.
Specifically, the artificial intelligence application may be video image processing, object recognition, machine translation, speech recognition, image beautification, and the like.
Case 3: When the neural network processor performs video image processing, the DVFS device 7 collects in real time the frame rate at which the processor processes video images. When this frame rate exceeds the target frame rate, i.e. the video processing frame rate the user actually requires, the DVFS device 7 sends third voltage-frequency regulation information to the neural network processor, instructing it to lower its working voltage or working frequency, thereby reducing the power consumption of the processor while still satisfying the user's normal video-processing needs.
Case 4: When the neural network processor performs speech recognition, the DVFS device 7 collects in real time the speech recognition speed of the processor. When this speed exceeds the user's actual speech recognition speed, the DVFS device 7 sends fourth voltage-frequency regulation information to the neural network processor, instructing it to lower its working voltage or working frequency, thereby reducing the power consumption of the processor while still satisfying the user's normal speech-recognition needs.
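Cases 3 and 4 share the same rule: whenever the measured processing rate exceeds the rate the user actually needs, the chip can be slowed down. Below is a minimal sketch of that rule; the function name, the numeric rates, and the string return values are illustrative assumptions.

```python
# Sketch of the rate-based rules of Cases 3 and 4: when the measured rate
# exceeds the rate the user actually needs, request a lower voltage/frequency.
# All numbers are illustrative assumptions.

def regulate_by_rate(measured_rate: float, required_rate: float) -> str:
    """Return 'lower' when the processor runs faster than the user needs."""
    return "lower" if measured_rate > required_rate else "keep"

# Case 3: video processed at 75 fps when the user only needs 30 fps.
print(regulate_by_rate(measured_rate=75.0, required_rate=30.0))  # -> lower
# Case 4: speech recognition exactly keeping pace with the speaker.
print(regulate_by_rate(measured_rate=1.0, required_rate=1.0))    # -> keep
```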
Case 5: The DVFS device 7 monitors and acquires in real time the working state information of each unit or module of the neural network processor (including the instruction storage unit 1, the controller unit 2, the data access unit 3, the interconnection module 4, the main operation module 5, and the slave operation modules 6). When any of these units or modules is in an idle state, the DVFS device 7 sends fifth voltage-frequency regulation information to that unit or module to lower its working voltage or working frequency and thereby reduce its power consumption. When the unit or module returns to the working state, the DVFS device 7 sends sixth voltage-frequency regulation information to it to raise its working voltage or working frequency, so that its running speed again meets the demands of the work.
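The per-unit monitoring of Case 5 can be sketched as a comparison of each unit's current state against its previous state; the dictionary-based bookkeeping and unit names below are illustrative assumptions.

```python
# Sketch of Case 5: a unit that goes idle receives the fifth regulation
# information ("lower"); a unit that becomes busy again receives the sixth
# ("raise"). The dictionary-based state store is an illustrative assumption.

def monitor(current: dict, previous: dict) -> None:
    for name, state in current.items():
        if state == "idle" and previous.get(name) == "busy":
            print(f"fifth regulation info: lower voltage/frequency of {name}")
        elif state == "busy" and previous.get(name) == "idle":
            print(f"sixth regulation info: raise voltage/frequency of {name}")

previous = {"data_access_unit_3": "busy", "main_module_5": "busy"}
current = {"data_access_unit_3": "idle", "main_module_5": "busy"}
monitor(current, previous)  # lowers data_access_unit_3 only
```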
Referring to FIG. 3E, FIG. 3E schematically shows one embodiment of the interconnection module 4: an H-tree module. The interconnection module 4 forms the data path between the main operation module 5 and the plurality of slave operation modules 6 and is a binary-tree path composed of multiple nodes: each node forwards upstream data identically to its two downstream nodes, merges the data returned by its two downstream nodes, and returns the merged data upstream. For example, at the start of the computation phase of the convolutional neural network, the neuron data in the main operation module 5 are sent through the interconnection module 4 to each slave operation module 6; after the computation of the slave operation modules 6 completes, the value of the neuron output by each slave operation module is assembled, stage by stage in the interconnection module 4, into one complete vector of neurons. As an example, suppose the device contains N slave operation modules in total. The input data xi is sent to the N slave operation modules, each of which convolves xi with its own convolution kernel to obtain one scalar; the interconnection module 4 then merges the N scalars into an intermediate vector of N elements. If the convolution window traverses A*B positions of input data xi in total (A along the X axis and B along the Y axis, X and Y being coordinate axes of a three-dimensional orthogonal coordinate system), the above convolution is performed for each of the A*B positions, and all resulting vectors are combined in the main operation module into a three-dimensional intermediate result of size A*B*N.
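The stage-by-stage assembly performed by the H-tree can be sketched as a pairwise merge: N per-slave scalars become one N-element vector after log2(N) merge levels. The list-based model below is an illustrative assumption (and assumes N is a power of two).

```python
# Sketch of the H-tree merge: each node concatenates the data returned by its
# two children, so N per-slave scalars become one N-element vector after
# log2(N) levels. Lists stand in for the hardware data path; N is assumed
# to be a power of two.

def h_tree_merge(scalars):
    assert len(scalars) & (len(scalars) - 1) == 0, "N must be a power of two"
    level = [[s] for s in scalars]              # each leaf holds one scalar
    while len(level) > 1:
        level = [level[i] + level[i + 1]        # each node joins two children
                 for i in range(0, len(level), 2)]
    return level[0]

# With N = 4 slave modules producing one scalar each:
print(h_tree_merge([0.5, -1.2, 3.3, 0.0]))  # -> [0.5, -1.2, 3.3, 0.0]
```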
Referring to FIG. 3F, FIG. 3F shows an example block diagram of the structure of the main operation module 5 in the apparatus for performing the convolutional neural network forward operation according to an embodiment of the present application. As shown in FIG. 3F, the main operation module 5 includes a first operation unit 51, a first data dependency determination unit 52, and a first storage unit 53.
The first operation unit 51 includes a vector addition unit 511 and an activation unit 512. The first operation unit 51 receives control signals from the controller unit 2 and carries out the various computational functions of the main operation module 5. The vector addition unit 511 implements the bias-add operation of the convolutional neural network forward computation: it adds the bias data element-wise to the intermediate result to obtain a biased result, and the activation unit 512 applies the activation function to the biased result. The bias data may be read in from the external address space or stored locally.
The first data dependency determination unit 52 is the port through which the first operation unit 51 reads from and writes to the first storage unit 53, and it guarantees read-write consistency of the data in the first storage unit 53. It is also responsible for sending the data read from the first storage unit 53 to the slave operation modules through the interconnection module 4, while the output data of the slave operation modules 6 are sent directly to the first operation unit 51 through the interconnection module 4. The instructions output by the controller unit 2 are sent to the first operation unit 51 and the first data dependency determination unit 52 to control their behavior.
The first storage unit 53 caches the input data and output data used by the main operation module 5 during computation.
Referring to FIG. 3G, FIG. 3G shows an example block diagram of the structure of the slave operation module 6 in the apparatus for performing the convolutional neural network forward operation according to an embodiment of the present application. As shown in FIG. 3G, each slave operation module 6 includes a second operation unit 61, a second data dependency determination unit 62, a second storage unit 63, and a third storage unit 64.
The second operation unit 61 receives the control signals issued by the controller unit 2 and performs the convolution operation. The second operation unit includes a vector multiplication unit 611 and an accumulation unit 612, which are responsible for the vector multiplication and the accumulation in the convolution operation, respectively.
The second data dependency determination unit 62 is responsible for the read and write operations on the second storage unit 63 during computation. Before performing a read or write, it first ensures that no read-write consistency conflict exists among the data used by the instructions. For example, all control signals sent to the data dependency unit 62 are stored in an instruction queue inside the unit; if the read range of a read instruction in this queue conflicts with the write range of a write instruction earlier in the queue, the read instruction may execute only after the write instruction on which it depends has been executed.
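The queue-based conflict test described above can be sketched as an address-range overlap check: a read may issue only if its range does not intersect the range of any earlier, still-pending write. The (start, end) encoding of ranges below is an illustrative assumption.

```python
# Sketch of the read-write consistency check: a read may issue only if its
# address range does not overlap the range of any earlier write still waiting
# in the queue. The (start, end) tuple encoding is an illustrative assumption.

def ranges_overlap(a, b):
    return a[0] < b[1] and b[0] < a[1]

def can_issue_read(read_range, pending_writes):
    """The read waits until every conflicting earlier write has retired."""
    return all(not ranges_overlap(read_range, w) for w in pending_writes)

pending_writes = [(0, 64), (128, 192)]             # earlier writes in the queue
print(can_issue_read((64, 128), pending_writes))   # True: no overlap
print(can_issue_read((100, 160), pending_writes))  # False: hits (128, 192)
```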
The second storage unit 63 caches the input data of the slave operation module 6 and the output scalar it computes.
The third storage unit 64 caches the convolution kernel data needed by the slave operation module 6 during computation.
It can be seen that in the solution of this embodiment of the present invention, the DVFS device collects in real time the running speeds of the neural network processor and of its internal units and modules, and when it determines from these running speeds that the working frequency or working voltage of the neural network processor or of one of its internal units should be lowered, it lowers it, thereby reducing the overall operating power consumption of the chip while still meeting the user's needs in actual operation.
Referring to FIG. 3H, FIG. 3H is a schematic flowchart of a dynamic voltage and frequency scaling method according to an embodiment of the present application. As shown in FIG. 3H, the method includes:
S801: The DVFS device collects in real time the working state information or application scenario information of the chip connected to it, the application scenario information being information obtained by the chip through neural network operations or collected by sensors connected to the chip.
S802: The DVFS device sends voltage-frequency regulation information to the chip according to the chip's working state information or application scenario information, the voltage-frequency regulation information instructing the chip to adjust its working voltage or working frequency.
The working state information of the chip includes the running speed of the chip, and the voltage-frequency regulation information includes first voltage-frequency regulation information. Sending voltage-frequency regulation information to the chip according to its working state information or application scenario information includes:
when the running speed of the chip is greater than a target speed, sending the first voltage-frequency regulation information to the chip, the first voltage-frequency regulation information instructing the chip to lower its working frequency or working voltage, where the target speed is the running speed of the chip that suffices to meet the user's needs.
Further, the chip includes at least a first unit and a second unit, the output data of the first unit being the input data of the second unit. The working state information of the chip includes the running speed of the first unit and the running speed of the second unit, and the voltage-frequency regulation information includes second voltage-frequency regulation information. Sending voltage-frequency regulation information to the chip according to its working state information or application scenario information further includes:
when it is determined from the running speeds of the first unit and the second unit that the running time of the first unit exceeds the running time of the second unit, sending the second voltage-frequency regulation information to the second unit, the second voltage-frequency regulation information instructing the second unit to lower its working frequency or working voltage.
Further, the voltage-frequency regulation information includes third voltage-frequency regulation information, and sending voltage-frequency regulation information to the chip according to its working state information or application scenario information further includes:
when it is determined from the running speeds of the first unit and the second unit that the running time of the second unit exceeds the running time of the first unit, sending the third voltage-frequency regulation information to the first unit, the third voltage-frequency regulation information instructing the first unit to lower its working frequency or working voltage.
Optionally, the chip includes at least N units, and the working state information of the chip includes the working state information of at least S of the N units, where N is an integer greater than 1 and S is an integer less than or equal to N. The voltage-frequency regulation information includes fourth voltage-frequency regulation information, and sending voltage-frequency regulation information to the chip according to its working state information further includes:
when it is determined from the working state information of a unit A that the unit A is in an idle state, sending the fourth voltage-frequency regulation information to the unit A, the fourth voltage-frequency regulation information instructing the unit A to lower its working frequency or working voltage,
where the unit A is any one of the at least S units.
Optionally, the voltage-frequency regulation information includes fifth voltage-frequency regulation information, and sending voltage-frequency regulation information to the chip according to its working state information or application scenario information further includes:
when it is determined from the working state information of the unit A that the unit A has returned to the working state, sending the fifth voltage-frequency regulation information to the unit A, the fifth voltage-frequency regulation information instructing the unit A to raise its working voltage or working frequency.
Optionally, the application scenario of the chip is image recognition, the application scenario information is the number of objects in the image to be recognized, and the voltage-frequency regulation information includes sixth voltage-frequency regulation information; the voltage-frequency regulation unit is further configured to:
when it determines that the number of objects in the image to be recognized is less than a first threshold, send the sixth voltage-frequency regulation information to the chip, the sixth voltage-frequency regulation information instructing the chip to lower its working voltage or working frequency.
Optionally, the application scenario information is object tag information, and the voltage-frequency regulation information includes seventh voltage-frequency regulation information; the voltage-frequency regulation unit is further configured to:
when it determines that the object tag information belongs to a preset set of object tags, send the seventh voltage-frequency regulation information to the chip, the seventh voltage-frequency regulation information instructing the chip to raise its working voltage or working frequency.
Optionally, the chip is applied to speech recognition, the application scenario information is the voice input rate, and the voltage-frequency regulation information includes eighth voltage-frequency regulation information; the voltage-frequency regulation unit is further configured to:
when the voice input rate is less than a second threshold, send the eighth voltage-frequency regulation information to the chip, the eighth voltage-frequency regulation information instructing the chip to lower its working voltage or working frequency.
Optionally, the application scenario information is a keyword obtained by the chip through speech recognition, and the voltage-frequency regulation information includes ninth voltage-frequency regulation information; the voltage-frequency regulation unit is further configured to:
when the keyword belongs to a preset keyword set, send the ninth voltage-frequency regulation information to the chip, the ninth voltage-frequency regulation information instructing the chip to raise its working voltage or working frequency.
Optionally, the chip is applied to machine translation, the application scenario information is the speed of text input or the number of characters in the image to be translated, and the voltage-frequency regulation information includes tenth voltage-frequency regulation information; the voltage-frequency regulation unit is further configured to:
when the text input speed is less than a third threshold or the number of characters in the image to be translated is less than a fourth threshold, send the tenth voltage-frequency regulation information to the chip, the tenth voltage-frequency regulation information instructing the chip to lower its working voltage or working frequency.
Optionally, the application scenario information is the ambient light intensity, and the voltage-frequency regulation information includes eleventh voltage-frequency regulation information; the voltage-frequency regulation unit is further configured to:
when the ambient light intensity is less than a fifth threshold, send the eleventh voltage-frequency regulation information to the chip, the eleventh voltage-frequency regulation information instructing the chip to lower its working voltage or working frequency.
Optionally, the chip is applied to image beautification, and the voltage-frequency regulation information includes twelfth voltage-frequency regulation information and thirteenth voltage-frequency regulation information; the voltage-frequency regulation unit is further configured to:
when the application scenario information is a face image, send the twelfth voltage-frequency regulation information to the chip, the twelfth voltage-frequency regulation information instructing the chip to lower its working voltage; and
when the application scenario information is not a face image, send the thirteenth voltage-frequency regulation information to the chip, the thirteenth voltage-frequency regulation information instructing the chip to lower its working voltage or working frequency.
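The optional scenario-driven branches above (the sixth through thirteenth voltage-frequency regulation information) amount to a policy table mapping scenario information to a raise/lower decision. The sketch below illustrates a few of these branches; all threshold values, key names, and the preset tag set are illustrative assumptions, not values given in this disclosure.

```python
# Sketch of the scenario-driven branches as one policy table. All thresholds,
# key names, and the preset tag set are illustrative assumptions.

FIRST_THRESHOLD = 3       # objects in the image to be recognized
SECOND_THRESHOLD = 2.0    # voice input rate, e.g. words per second
FIFTH_THRESHOLD = 50.0    # ambient light intensity, e.g. lux

PRESET_TAGS = {"face", "licence_plate"}

def decide(scene: dict) -> str:
    if scene.get("objects_in_image", float("inf")) < FIRST_THRESHOLD:
        return "lower"    # sixth voltage-frequency regulation information
    if scene.get("object_tag") in PRESET_TAGS:
        return "raise"    # seventh voltage-frequency regulation information
    if scene.get("voice_input_rate", float("inf")) < SECOND_THRESHOLD:
        return "lower"    # eighth voltage-frequency regulation information
    if scene.get("ambient_light", float("inf")) < FIFTH_THRESHOLD:
        return "lower"    # eleventh voltage-frequency regulation information
    return "keep"

print(decide({"objects_in_image": 1}))   # -> lower
print(decide({"object_tag": "face"}))    # -> raise
```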
It should be noted that, for the specific implementation of the above method embodiment, reference may be made to the related description of the embodiment shown in FIG. 3A, which is not repeated here.
Referring to FIG. 4A, FIG. 4A is a schematic structural diagram of a convolution operation device according to an embodiment of the present application. As shown in FIG. 4A, the convolution operation device includes a DVFS device 7, an instruction storage unit 1, a controller unit 2, a data access unit 3, an interconnection module 4, a main operation module 5, and N slave operation modules 6.
The instruction storage unit 1, the controller unit 2, the data access unit 3, the interconnection module 4, the main operation module 5, and the N slave operation modules 6 may all be implemented as hardware circuits (including, but not limited to, FPGAs, CGRAs, application-specific integrated circuits (ASICs), analog circuits, and memristors).
The instruction storage unit 1 stores the instructions read in by the data access unit 3.
The controller unit 2 reads instructions from the instruction storage unit 1, translates each instruction into control signals that govern the behavior of the other modules, and sends them to those modules, such as the data access unit 3, the main operation module 5, and the N slave operation modules 6.
The data access unit 3 performs data or instruction read and write operations between the external address space and the convolution operation device.
Specifically, the data access unit 3 accesses the external address space and reads data from, and writes data to, each storage unit inside the device directly, completing the loading and storing of data.
The N slave operation modules 6 implement the convolution of the input data with the convolution kernels in the convolutional neural network algorithm.
Specifically, the N slave operation modules 6 compute their respective output scalars in parallel, using the same input data and their respective convolution kernels.
The interconnection module 4 connects the main operation module 5 and the N slave operation modules 6 and can be implemented with different interconnection topologies (such as a tree structure, a ring structure, a grid structure, a hierarchical interconnection, or a bus structure). The interconnection module 4 implements the data transmission between the main operation module 5 and the N slave operation modules 6.
In other words, the interconnection module 4 forms the data path for continuous or discretized data between the main operation module 5 and the N slave operation modules 6, and may be any one of a tree structure, a ring structure, a grid structure, a hierarchical interconnection, or a bus structure.
The main operation module 5 splices the intermediate vectors of all the input data into an intermediate result and performs subsequent operations on that intermediate result.
The main operation module 5 is further configured to add the intermediate result to the bias data and then perform an activation operation. The activation function active used by the main operation module is any one of the nonlinear functions sigmoid, tanh, relu, and softmax.
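The post-processing performed by the main operation module 5, adding the bias and then applying one of the named nonlinear functions, can be sketched as follows; NumPy is used purely for illustration.

```python
# Sketch of the main module's post-processing: add the bias to the spliced
# intermediate result, then apply one of the listed activation functions.
# NumPy is used purely for illustration.

import numpy as np

def activate(x: np.ndarray, kind: str = "relu") -> np.ndarray:
    if kind == "sigmoid":
        return 1.0 / (1.0 + np.exp(-x))
    if kind == "tanh":
        return np.tanh(x)
    if kind == "relu":
        return np.maximum(x, 0.0)
    if kind == "softmax":
        e = np.exp(x - np.max(x))
        return e / e.sum()
    raise ValueError(f"unknown activation: {kind}")

intermediate = np.array([0.2, -1.0, 3.0])
bias = np.array([0.1, 0.1, 0.1])
print(activate(intermediate + bias, kind="relu"))  # bias add, then activation
```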
The main operation module 5 includes:
the first storage unit 53, which caches the input data and output data used by the main operation module 5 during computation;
the first operation unit 51, which carries out the various computational functions of the main operation module 5; and
the first data dependency determination unit 52, which is the port through which the first operation unit 51 reads from and writes to the first storage unit 53, guarantees read-write consistency of the data in the first storage unit 53, reads the input neuron vector from the first storage unit 53 and sends it through the interconnection module 4 to the N slave operation modules 6, and sends the intermediate result vector from the interconnection module 4 to the first operation unit 51.
Each of the N slave operation modules 6 includes:
the second operation unit 61, which receives the control signals issued by the controller unit 2 and performs arithmetic and logic operations;
the second data dependency determination unit 62, which handles the read and write operations on the second storage unit 63 and the third storage unit 64 during computation, guaranteeing their read-write consistency;
the second storage unit 63, which caches the input data and the output scalar computed by the slave operation module; and
the third storage unit 64, which caches the convolution kernel needed by the slave operation module during computation.
Further, the first data dependency determination unit 52 and the second data dependency determination unit 62 guarantee read-write consistency as follows:
they determine whether a dependency exists between a control signal that has not yet been executed and the data of a control signal currently being executed; if not, the control signal is allowed to issue immediately; otherwise, the control signal may issue only after all the control signals on which it depends have completed execution.
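This issue rule can be sketched as a gate that holds a control signal back while any in-flight signal still touches data it depends on. Modelling a signal's data as a set of addresses is an illustrative assumption.

```python
# Sketch of the issue gate in the dependency determination units: a control
# signal launches only when no in-flight signal still touches data it depends
# on. Modelling each signal's data as a set of addresses is an assumption.

def may_issue(signal_deps: set, in_flight: list) -> bool:
    """True if the pending signal shares no data with executing signals."""
    return all(signal_deps.isdisjoint(s) for s in in_flight)

in_flight = [{0x100, 0x104}, {0x200}]
print(may_issue({0x300}, in_flight))  # True: independent, issues immediately
print(may_issue({0x104}, in_flight))  # False: waits for the first signal
```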
Optionally, the data access unit 3 reads in at least one of the input data, the bias data, and the convolution kernel from the external address space.
Before the forward operation of the fully connected layer of the neural network begins, the main operation module 5 delivers the input data to each of the N slave operation modules 6 through the interconnection module 4; after the computation of the N slave operation modules 6 ends, the interconnection module 4 splices the output scalars of the N slave operation modules 6, stage by stage, into an intermediate vector and delivers it back to the main operation module 5.
The specific computation method of the above convolution operation device is described below through different operation instructions, taking the convolution computation instruction as an example. The convolution computation instruction can be applied in a neural network, so it may also be called a convolutional neural network instruction. For the convolution computation instruction, the formula it actually needs to execute is:
s = s(∑w·x_i + b)
That is, the convolution kernel W (which may comprise multiple data) is multiplied by the input data x_i and summed; the bias b may then optionally be added, and an activation operation s(h) may optionally be applied, yielding the final output result s. From this formula, the computation topology is multiplier - adder - (optional) activation operator. The convolution computation instruction may belong to an instruction set that contains convolutional neural network COMPUTE instructions of different functions as well as CONFIG, IO, NOP, JUMP, and MOVE instructions.
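The formula maps directly onto the stated topology: a multiply-accumulate over the convolution window, an optional bias add, and an optional activation. A minimal sketch, with sigmoid standing in for the optional activation:

```python
# Sketch of s = s(sum(w * x_i) + b): a multiply-accumulate over the window,
# an optional bias add, and an optional activation (sigmoid here, one of the
# activations the instruction set names). Python is used for illustration.

import math

def conv_point(window, kernel, bias=0.0, activation=None):
    acc = sum(w * x for w, x in zip(kernel, window))  # multiplier + adder
    acc += bias                                       # optional bias add
    if activation is not None:                        # optional activation
        acc = activation(acc)
    return acc

sigmoid = lambda h: 1.0 / (1.0 + math.exp(-h))
print(conv_point([1.0, 2.0, 3.0], [0.5, -0.25, 0.1], bias=0.2,
                 activation=sigmoid))
```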
In one embodiment, the COMPUTE instructions include:
a convolution operation instruction, according to which the convolution operation device fetches input data of a specified size and a convolution kernel from specified addresses of a memory (preferably a scratchpad memory or a scalar register file) and performs the convolution in the convolution operation component;
a convolutional neural network sigmoid instruction, according to which the convolution operation device fetches input data of a specified size and a convolution kernel from specified addresses of a memory (preferably a scratchpad memory or a scalar register file), performs the convolution in the convolution operation component, and then applies sigmoid activation to the output result;
a convolutional neural network TanH instruction, according to which the convolution operation device fetches input data of a specified size and a convolution kernel from specified addresses of a memory (preferably a scratchpad memory), performs the convolution in the convolution operation component, and then applies TanH activation to the output result;
a convolutional neural network ReLU instruction, according to which the convolution operation device fetches input data of a specified size and a convolution kernel from specified addresses of a memory (preferably a scratchpad memory), performs the convolution in the convolution operation component, and then applies ReLU activation to the output result; and
a convolutional neural network group instruction, according to which the convolution operation device fetches input data of a specified size and a convolution kernel from specified addresses of a memory (preferably a scratchpad memory), divides them into groups, performs the convolution in the convolution operation component, and then, preferably, activates the output result.
The CONFIG instruction configures the various constants needed by the current layer's computation before the computation of each layer of the artificial neural network begins.
The IO instruction reads in from the external storage space the input data needed for computation and stores data back to the external space after the computation completes.
The NOP instruction is responsible for clearing the control signals currently held in all the control signal buffer queues inside the device, guaranteeing that all instructions before the NOP have completed. The NOP instruction itself contains no operation.
The JUMP instruction controls the jump of the address of the next instruction to be read from the instruction storage unit, and is used to implement jumps in the control flow.
The MOVE instruction moves data at one address in the device's internal address space to another address in the internal address space; this process is independent of the operation unit and occupies no operation unit resources during execution.
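One way to picture the controller unit's role is as a decode step that maps each instruction of this set to the modules it drives. The dispatch table below merely paraphrases the instruction descriptions above; its encoding, the module assignments, and the example operand string are assumptions for illustration.

```python
# Illustrative dispatch table: which modules each instruction of the set
# drives. The table paraphrases the descriptions above; the encoding and
# the example operand string are assumptions.

DISPATCH = {
    "COMPUTE": ["data_access_unit_3", "main_module_5", "slave_modules_6"],
    "CONFIG":  ["main_module_5", "slave_modules_6"],  # per-layer constants
    "IO":      ["data_access_unit_3"],                # external loads/stores
    "NOP":     [],                                    # only drains the queues
    "JUMP":    ["controller_unit_2"],                 # next-instruction address
    "MOVE":    ["data_access_unit_3"],                # on-chip copy
}

def decode(instruction: str):
    opcode = instruction.split()[0]
    return DISPATCH.get(opcode, [])

print(decode("COMPUTE conv addr=0x1000 size=256"))
```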
The method by which the above convolution operation device executes a convolution computation instruction may specifically be as follows:
The controller unit 2 extracts from the instruction storage unit 1 the convolution computation instruction, the operation field corresponding to the instruction, and the first computation topology corresponding to the instruction (multiplier - adder - adder - activation operator); the controller unit transmits the operation field to the data access unit and the first computation topology to the interconnection module 4.
The data access unit 3 extracts from the external storage medium the convolution kernel w and the bias b corresponding to the operation field (when b is 0, the bias b need not be extracted) and transmits the convolution kernel w and the bias b to the main operation module 5.
Optionally, the above first result may include the results of multiple multiplication operations.
The DVFS device 7 collects the working state information of the convolution operation device and sends voltage-frequency regulation information to the convolution operation device according to that working state information, the voltage-frequency regulation information instructing the convolution operation device to adjust its working voltage or working frequency.
Specifically, the DVFS device 7 includes:
an information collection unit 71, which collects the working state information of the convolution operation device in real time; and
a voltage-frequency regulation unit 72, which sends voltage-frequency regulation information to the convolution operation device according to its working state information, the voltage-frequency regulation information instructing the convolution operation device to adjust its working voltage or working frequency.
In a possible embodiment of the present application, the working state information of the convolution operation device includes the running speed of the convolution operation device, the voltage-frequency regulation information includes first voltage-frequency regulation information, and the voltage-frequency regulation unit 72 is configured to:
when the running speed of the convolution operation device is greater than a target speed, send the first voltage-frequency regulation information to the convolution operation device, the first voltage-frequency regulation information instructing the device to lower its working frequency or working voltage, where the target speed is the running speed of the convolution operation device that suffices to meet the user's needs.
In a possible embodiment of the present application, the working state information of the convolution operation device includes the running speed of the data access unit 3 and the running speed of the main operation module 5, the voltage-frequency regulation information includes second voltage-frequency regulation information, and the voltage-frequency regulation unit 72 is further configured to:
when it is determined from the running speeds of the data access unit 3 and the main operation module 5 that the running time of the data access unit 3 exceeds the running time of the main operation module 5, send the second voltage-frequency regulation information to the main operation module 5, the second voltage-frequency regulation information instructing the main operation module 5 to lower its working frequency or working voltage.
Further, the voltage-frequency regulation information includes third voltage-frequency regulation information, and the voltage-frequency regulation unit 72 is further configured to:
when it is determined from the running speeds of the data access unit 3 and the main operation module 5 that the running time of the main operation module 5 exceeds the running time of the data access unit 3, send the third voltage-frequency regulation information to the data access unit 3, the third voltage-frequency regulation information instructing the data access unit 3 to lower its working frequency or working voltage.
In a possible embodiment of the present application, the working state information of the convolution operation device includes the working state information of at least S of the following units/modules: the instruction storage unit 1, the controller unit 2, the data access unit 3, the interconnection module 4, the main operation module 5, and the N slave operation modules 6, where S is an integer greater than 1 and less than or equal to N+5. The voltage-frequency regulation information includes fourth voltage-frequency regulation information, and the voltage-frequency regulation unit 72 is configured to:
when it is determined from the working state information of a unit A that the unit A is in an idle state, send the fourth voltage-frequency regulation information to the unit A, the fourth voltage-frequency regulation information instructing the unit A to lower its working frequency or working voltage,
where the unit A is any one of the at least S units/modules.
Further, the voltage-frequency regulation information includes fifth voltage-frequency regulation information, and the voltage-frequency regulation unit 72 is further configured to:
when it is determined from the working state information of the unit A that the unit A has returned to the working state, send the fifth voltage-frequency regulation information to the unit A, the fifth voltage-frequency regulation information instructing the unit A to raise its working voltage or working frequency.
In a possible embodiment of the present application, an embodiment of the present invention provides a neural network processor that includes the above convolution operation device.
The above neural network processor is used to perform artificial neural network operations and to implement artificial intelligence applications such as speech recognition, image recognition, and translation.
In this convolution computation task, the DVFS device 7 of FIG. 4A works as follows:
Case 1: While the convolutional neural network processor is performing a convolution operation, the DVFS device 7 acquires in real time the running speeds of the data access unit 3 and the main operation module 5 of the neural network processor of FIG. 4A. When the DVFS device 7 determines, from these running speeds, that the running time of the data access unit 3 exceeds that of the main operation module 5, it concludes that the data access unit 3 has become the bottleneck of the convolution operation: after finishing its current convolution operation, the main operation module 5 must wait for the data access unit 3 to complete its read task and transfer the read data before it can perform the next convolution operation on that data. The DVFS device 7 therefore sends first voltage-frequency regulation information to the main operation module 5, instructing it to lower its working voltage or working frequency so that its running speed matches that of the data access unit 3. This reduces the power consumption of the main operation module 5, prevents it from sitting idle, and ultimately lowers the overall operating power consumption of the neural network processor without affecting the completion time of the task.
Case 2: While the neural network processor is performing a convolution operation, the DVFS device 7 of FIG. 4A acquires in real time the running speeds of the data access unit 3 and the main operation module 5 of the neural network processor. When the DVFS device 7 determines, from these running speeds, that the running time of the main operation module 5 exceeds that of the data access unit 3, it concludes that the main operation module 5 has become the bottleneck: after finishing its current data read operation, the data access unit 3 must wait for the main operation module 5 to complete its current convolution operation before it can transfer the newly read data to it. The DVFS device 7 therefore sends second voltage-frequency regulation information to the data access unit 3, instructing it to lower its working voltage or working frequency so that its running speed matches that of the main operation module 5. This reduces the power consumption of the data access unit 3, prevents it from sitting idle, and ultimately lowers the overall operating power consumption of the neural network processor without affecting the completion time of the task.
When the neural network processor executes artificial neural network operations for an artificial intelligence application, the DVFS device 7 of FIG. 4A collects in real time the working parameters of the processor for that application and adjusts the working voltage or working frequency of the processor according to those parameters.
Specifically, the artificial intelligence application may be video image processing, object recognition, machine translation, speech recognition, image beautification, and the like.
Case 3: When the neural network processor performs video image processing, the DVFS device 7 of FIG. 4A collects in real time the frame rate at which the processor processes video images. When this frame rate exceeds the target frame rate, i.e. the video processing frame rate the user actually requires, the DVFS device 7 sends third voltage-frequency regulation information to the neural network processor, instructing it to lower its working voltage or working frequency, thereby reducing the power consumption of the processor while still satisfying the user's normal video-processing needs.
Case 4: When the neural network processor performs speech recognition, the DVFS device 7 of FIG. 4A collects in real time the speech recognition speed of the processor. When this speed exceeds the user's actual speech recognition speed, the DVFS device 7 sends fourth voltage-frequency regulation information to the neural network processor, instructing it to lower its working voltage or working frequency, thereby reducing the power consumption of the processor while still satisfying the user's normal speech-recognition needs.
Case 5: The DVFS device 7 of FIG. 4A monitors and acquires in real time the working state information of each unit or module of the neural network processor (including the instruction storage unit 1, the controller unit 2, the data access unit 3, the interconnection module 4, the main operation module 5, and the N slave operation modules 6). When any of these units or modules is in an idle state, the DVFS device 7 sends fifth voltage-frequency regulation information to that unit or module to lower its working voltage or working frequency and thereby reduce its power consumption. When the unit or module returns to the working state, the DVFS device 7 sends sixth voltage-frequency regulation information to it to raise its working voltage or working frequency, so that its running speed again meets the demands of the work.
Referring to FIG. 4E, FIG. 4E schematically shows one embodiment of the interconnection module 4: an H-tree module. The interconnection module 4 forms the data path between the main operation module 5 and the plurality of slave operation modules 6 and is a binary-tree path composed of multiple nodes: each node forwards upstream data identically to its two downstream nodes, merges the data returned by its two downstream nodes, and returns the merged data upstream. For example, at the start of the computation phase of the convolutional neural network, the neuron data in the main operation module 5 are sent through the interconnection module 4 to each slave operation module 6; after the computation of the slave operation modules 6 completes, the value of the neuron output by each slave operation module is assembled, stage by stage in the interconnection module, into one complete vector of neurons. As an example, suppose the convolution operation device contains N slave operation modules in total. The input data xi is sent to the N slave operation modules, each of which convolves xi with its own convolution kernel to obtain one scalar; the interconnection module 4 then merges the N scalars into an intermediate vector of N elements. If the convolution window traverses A*B positions of input data xi in total (A along the X axis and B along the Y axis, X and Y being coordinate axes of a three-dimensional orthogonal coordinate system), the above convolution is performed for each of the A*B positions, and all resulting vectors are combined in the main operation module into a three-dimensional intermediate result of size A*B*N.
Referring to FIG. 4B, FIG. 4B shows an example block diagram of the structure of the main operation module 5 in the apparatus for performing the convolutional neural network forward operation according to an embodiment of the present application. As shown in FIG. 4B, the main operation module 5 includes a first operation unit 51, a first data dependency determination unit 52, and a first storage unit 53.
The first operation unit 51 includes a vector addition unit 511 and an activation unit 512. The first operation unit 51 receives control signals from the controller unit 2 of FIG. 4A and carries out the various computational functions of the main operation module 5. The vector addition unit 511 implements the bias-add operation of the convolutional neural network forward computation: it adds the bias data element-wise to the intermediate result to obtain a biased result, and the activation unit 512 applies the activation function to the biased result. The bias data may be read in from the external address space or stored locally.
The first data dependency determination unit 52 is the port through which the first operation unit 51 reads from and writes to the first storage unit 53, and it guarantees read-write consistency of the data in the first storage unit 53. It is also responsible for sending the data read from the first storage unit 53 to the slave operation modules 6 through the interconnection module 4, while the output data of the slave operation modules 6 are sent directly to the first operation unit 51 through the interconnection module 4. The instructions output by the controller unit 2 are sent to the first operation unit 51 and the first data dependency determination unit 52 to control their behavior.
The first storage unit 53 caches the input data and output data used by the main operation module 5 during computation.
Referring to FIG. 4C, FIG. 4C shows an example block diagram of the structure of a slave operation module 6 in the apparatus for performing the forward operation of a convolutional neural network according to an embodiment of the present application. As shown in FIG. 4C, each slave operation module 6 includes a second operation unit 61, a second data dependency determination unit 62, a second storage unit 63, and a third storage unit 64.
The second operation unit 61 receives the control signal issued by the controller unit 2 in FIG. 4A and performs the convolution operation. The second operation unit includes a vector multiplication unit 611 and an accumulation unit 612, which are responsible for the vector multiplication and accumulation operations in the convolution, respectively.
The second data dependency determination unit 62 is responsible for the read and write operations on the second storage unit 63 during computation. Before performing a read or write, the second data dependency determination unit 62 first ensures that there is no read/write consistency conflict among the data used by the instructions. For example, all control signals sent to the data dependency unit 62 are stored in an instruction queue inside the data dependency unit 62; in this queue, if the read range of a read instruction conflicts with the write range of a write instruction earlier in the queue, the read instruction can be executed only after the write instruction it depends on has been executed.
The second storage unit 63 caches the input data and the output scalar data of the slave operation module 6.
The third storage unit 64 caches the convolution kernel data required by the slave operation module 6 during computation.
In a possible embodiment of the present application, an embodiment of the present application provides a neural network processor that includes the above convolution operation device.
The above neural network processor is configured to perform artificial neural network operations and implement artificial intelligence applications such as speech recognition, image recognition, and translation.
In a specific application scenario, for a convolution computation task, the dynamic voltage and frequency scaling device 7 in FIG. 4A operates as follows:
The information acquisition unit 71 of the dynamic voltage and frequency scaling device 7 acquires in real time the working state information or application scenario information of the neural network processor connected to the dynamic voltage and frequency scaling device 7, where the application scenario information is obtained by the neural network processor through neural network computation or collected by a sensor connected to the neural network processor. The voltage and frequency scaling unit 72 of the dynamic voltage and frequency scaling device 7 sends voltage-frequency regulation information to the neural network processor according to the working state information or application scenario information of the neural network processor, the voltage-frequency regulation information instructing the neural network processor to adjust its working voltage or working frequency.
In a possible embodiment of the present application, the working state information of the neural network processor includes the running speed of the neural network processor, and the voltage-frequency regulation information includes first voltage-frequency regulation information. The voltage and frequency scaling unit 72 is configured to:
send the first voltage-frequency regulation information to the neural network processor when the running speed of the neural network processor is greater than a target speed, the first voltage-frequency regulation information instructing the neural network processor to lower its working frequency or working voltage, where the target speed is the running speed of the neural network processor that meets the user's demand.
Specifically, the information acquisition unit 71 acquires in real time the running speed of the neural network processor connected to it. This running speed may be a different type of speed depending on the task the neural network processor performs: when the operation performed by the neural network processor is video image processing, the running speed may be the frame rate at which the neural network processor processes video images; when the operation is speech recognition, the running speed is the speed at which the input information is recognized as speech. When the voltage and frequency scaling unit 72 determines that the running speed of the neural network processor is greater than the target speed, that is, the running speed already meets the user's demand, it sends the first voltage-frequency regulation information to the neural network processor to instruct it to lower its working voltage or working frequency, thereby reducing the power consumption of the neural network processor.
For example, suppose the operation performed by the neural network processor is video image processing and the target speed is 24 frames per second. The information acquisition unit 71 acquires in real time the frame rate at which the neural network processor processes video images; the current frame rate is 54 frames per second. When the voltage and frequency scaling unit 72 determines that the current frame rate is greater than the target speed, it sends the first voltage-frequency regulation information to the neural network processor to instruct it to lower its working voltage or working frequency, thereby reducing the power consumption of the neural network processor.
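A minimal sketch of this first regulation rule, under the assumption of a simple message interface between the voltage and frequency scaling unit 72 and the processor (the message name and the send() callback are illustrative, not part of the apparatus):

    TARGET_FPS = 24.0   # target speed: the running speed that meets user demand

    def regulate_speed(measured_fps, send):
        # send() stands in for the signal path from unit 72 to the processor.
        if measured_fps > TARGET_FPS:
            # Faster than needed: trade the surplus speed for lower power.
            send("FIRST_VF_REGULATION", action="lower_voltage_or_frequency")

    # The example from the text: 54 fps measured against a 24 fps target.
    regulate_speed(54.0, send=lambda msg, **kw: print(msg, kw))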
In a possible embodiment of the present application, the neural network processor includes at least a first unit and a second unit, the output data of the first unit being the input data of the second unit. The working state information of the neural network processor includes the running speed of the first unit and the running speed of the second unit, and the voltage-frequency regulation information includes second voltage-frequency regulation information. The voltage and frequency scaling unit 72 is further configured to:
send the second voltage-frequency regulation information to the second unit when it determines, from the running speed of the first unit and the running speed of the second unit, that the running time of the first unit exceeds the running time of the second unit, the second voltage-frequency regulation information instructing the second unit to lower its working frequency or working voltage.
Specifically, a task executed by the neural network processor requires the cooperation of the first unit and the second unit, and the output data of the first unit is the input data of the second unit. The information acquisition unit 71 acquires the running speeds of the first unit and the second unit in real time. When it is determined that the running speed of the first unit is lower than that of the second unit, that is, the running time of the first unit exceeds the running time of the second unit, the voltage and frequency scaling unit 72 sends the second voltage-frequency regulation information to the second unit to instruct it to lower its working voltage or working frequency, thereby reducing the overall power consumption of the neural network processor without affecting its overall running speed.
In a possible embodiment of the present application, the voltage-frequency regulation information includes third voltage-frequency regulation information, and the voltage and frequency scaling unit 72 is further configured to:
send the third voltage-frequency regulation information to the first unit when it determines, from the running speed of the first unit and the running speed of the second unit, that the running time of the second unit exceeds the running time of the first unit, the third voltage-frequency regulation information instructing the first unit to lower its working frequency or working voltage.
Specifically, a task executed by the neural network processor requires the cooperation of the first unit and the second unit, and the output data of the first unit is the input data of the second unit. The information acquisition unit 71 acquires the running speeds of the first unit and the second unit in real time. When it is determined that the running speed of the first unit is greater than that of the second unit, that is, the running time of the second unit exceeds the running time of the first unit, the voltage and frequency scaling unit 72 sends the third voltage-frequency regulation information to the first unit to instruct it to lower its working voltage or working frequency, thereby reducing the overall power consumption of the neural network processor without affecting its overall running speed.
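Both directions of this pipeline-balancing rule fit in one sketch. Running time is taken here as workload divided by running speed, an assumption the text leaves implicit; the function and message names are illustrative:

    def balance_pipeline(work1, speed1, work2, speed2, send):
        # Unit 1 feeds unit 2; slow down whichever unit would otherwise idle.
        t1, t2 = work1 / speed1, work2 / speed2     # running times
        if t1 > t2:
            # First unit is the bottleneck, so the second unit waits on input.
            send(unit=2, info="SECOND_VF_REGULATION")   # lower V/f of unit 2
        elif t2 > t1:
            # Second unit is the bottleneck, so the first unit waits to hand off.
            send(unit=1, info="THIRD_VF_REGULATION")    # lower V/f of unit 1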
In a possible embodiment of the present application, the neural network processor includes at least N units, and the working state information of the neural network processor includes the working state information of at least S of the at least N units, where N is an integer greater than 1 and S is an integer less than or equal to N. The voltage-frequency regulation information includes fourth voltage-frequency regulation information, and the voltage and frequency scaling unit 72 is configured to:
send the fourth voltage-frequency regulation information to unit A when it determines, from the working state information of unit A, that unit A is in an idle state, the fourth voltage-frequency regulation information instructing unit A to lower its working frequency or working voltage,
where unit A is any one of the at least S units.
In a possible embodiment of the present application, the voltage-frequency regulation information includes fifth voltage-frequency regulation information, and the voltage and frequency scaling unit 72 is further configured to:
send the fifth voltage-frequency regulation information to unit A when it determines, from the working state information of unit A, that unit A has returned to a working state, the fifth voltage-frequency regulation information instructing unit A to raise its working voltage or working frequency.
Specifically, during operation of the neural network processor, the information acquisition unit 71 acquires in real time the working state information of at least S units inside the neural network processor. When the voltage and frequency scaling unit 72 determines from the working state information of unit A that unit A is in an idle state, it sends the fourth voltage-frequency regulation information to unit A to instruct it to lower its working frequency or working voltage, thereby reducing the power consumption of unit A; when it determines from the working state information of unit A that unit A has returned to a working state, it sends the fifth voltage-frequency regulation information to unit A to instruct it to raise its working frequency or working voltage, so that the running speed of unit A meets the demand of its workload.
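The per-unit idle gating over the S monitored units can be sketched as follows (the Unit class, its fields, and the message names are all hypothetical stand-ins for hardware state and signals):

    class Unit:
        def __init__(self, name):
            self.name, self.state, self.throttled = name, "working", False

    def gate_idle_units(units, send):
        for unit in units:                  # unit A is any of the S units
            if unit.state == "idle" and not unit.throttled:
                send(unit, "FOURTH_VF_REGULATION")   # lower V/f while idle
                unit.throttled = True
            elif unit.state == "working" and unit.throttled:
                send(unit, "FIFTH_VF_REGULATION")    # restore V/f on wake-up
                unit.throttled = False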
In a possible embodiment of the present application, when the application scenario of the neural network processor is image recognition, the application scenario information is the number of objects in the image to be recognized, and the voltage-frequency regulation information includes sixth voltage-frequency regulation information. The voltage and frequency scaling unit 72 is further configured to:
send the sixth voltage-frequency regulation information to the neural network processor when it determines that the number of objects in the image to be recognized is smaller than a first threshold, the sixth voltage-frequency regulation information instructing the neural network processor to lower its working voltage or working frequency.
Specifically, when the neural network processor is applied to image recognition, the number of objects in the image to be recognized is obtained by the neural network processor through a neural network algorithm. After the information acquisition unit 71 obtains from the neural network processor the number of objects in the image to be recognized (that is, the application scenario information above), the voltage and frequency scaling unit 72 sends the sixth voltage-frequency regulation information to the neural network processor when it determines that this number is smaller than the first threshold, instructing the neural network processor to lower its working voltage or working frequency; when it determines that the number of objects in the image to be recognized is greater than the first threshold, the voltage and frequency scaling unit 72 sends to the neural network processor voltage-frequency regulation information instructing it to raise its working voltage or working frequency.
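The same pattern applies to the object-count rule. A brief sketch follows; the threshold value and message names are purely illustrative, since the patent leaves the first threshold unspecified:

    FIRST_THRESHOLD = 5   # illustrative only; the patent does not fix a value

    def regulate_by_object_count(num_objects, send):
        if num_objects < FIRST_THRESHOLD:
            # Sparse scene: less compute is needed, so power can be saved.
            send("SIXTH_VF_REGULATION", action="lower_voltage_or_frequency")
        elif num_objects > FIRST_THRESHOLD:
            # Busy scene: raise voltage/frequency to keep up with the workload.
            send("RAISE_VF_REGULATION", action="raise_voltage_or_frequency")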
In a possible embodiment of the present application, the application scenario information is object tag information, the voltage-frequency regulation information includes seventh voltage-frequency regulation information, and the voltage and frequency scaling unit 72 is further configured to:
send the seventh voltage-frequency regulation information to the neural network processor when it determines that the object tag information belongs to a preset object tag set, the seventh voltage-frequency regulation information instructing the neural network processor to raise its working voltage or working frequency.
For example, the preset object tag set includes multiple object tags, such as "person", "dog", "tree", and "flower". When the neural network processor determines through a neural network algorithm that the current application scenario includes a dog, it transmits the object tag information containing "dog" to the information acquisition unit 71; when the voltage and frequency scaling unit 72 determines that the object tag information includes "dog", it sends the seventh voltage-frequency regulation information to the neural network processor to instruct it to raise its working voltage or working frequency; when it determines that the object tag information does not belong to the preset object tag set, the voltage and frequency scaling unit 72 sends to the neural network processor voltage-frequency regulation information instructing it to lower its working voltage or working frequency.
In a possible embodiment of the present application, when the neural network processor is applied to speech recognition, the application scenario information is the speech input rate, the voltage-frequency regulation information includes eighth voltage-frequency regulation information, and the voltage and frequency scaling unit 72 is further configured to:
send the eighth voltage-frequency regulation information to the neural network processor when the speech input rate is lower than a second threshold, the eighth voltage-frequency regulation information instructing the neural network processor to lower its working voltage or working frequency.
Specifically, in the speech recognition application scenario, the input unit of the neural network processor feeds speech into the neural network processor at a certain rate. The information acquisition unit 71 acquires the speech input rate in real time and sends the speech input rate information to the voltage and frequency scaling unit 72. When the voltage and frequency scaling unit 72 determines that the speech input rate is lower than the second threshold, it sends the eighth voltage-frequency regulation information to the neural network processor to instruct it to lower its working voltage or working frequency. When it determines that the speech input rate is greater than the second threshold, it sends to the neural network processor voltage-frequency regulation information instructing it to raise its working voltage.
In a possible embodiment of the present application, when the application scenario information is a keyword obtained by the neural network processor through speech recognition, the voltage-frequency regulation information includes ninth voltage-frequency regulation information, and the voltage and frequency scaling unit is further configured to:
send the ninth voltage-frequency regulation information to the neural network processor when the keyword belongs to a preset keyword set, the ninth voltage-frequency regulation information instructing the neural network processor to raise its working voltage or working frequency.
Further, when the keyword does not belong to the keyword set, the voltage and frequency scaling unit 72 sends to the neural network processor voltage-frequency regulation information instructing it to lower its working voltage or working frequency.
For example, when the application scenario of the neural network processor is speech recognition, the preset keyword set includes keywords such as "image beautification", "neural network algorithm", "image processing", and "Alipay". If the application scenario information is "image beautification", the voltage and frequency scaling unit 72 sends the ninth voltage-frequency regulation information to the neural network processor to instruct it to raise its working voltage or working frequency; if the application scenario information is "taking a photo", the voltage and frequency scaling unit 72 sends to the neural network processor voltage-frequency regulation information instructing it to lower its working voltage or working frequency.
In a possible embodiment of the present application, when the neural network processor is applied to machine translation, the application scenario information is the text input speed or the number of characters in the image to be translated, and the voltage-frequency regulation information includes tenth voltage-frequency regulation information. The voltage and frequency scaling unit 72 is further configured to:
send the tenth voltage-frequency regulation information to the neural network processor when the text input speed is lower than a third threshold or the number of characters in the image to be translated is smaller than a fourth threshold, the tenth voltage-frequency regulation information instructing the neural network processor to lower its working voltage or working frequency.
Specifically, when the neural network processor is applied to machine translation, the application scenario information acquired by the information acquisition unit 71 is the text input speed or the number of characters in the image to be translated, and this information is transmitted to the voltage and frequency scaling unit 72. When the voltage and frequency scaling unit 72 determines that the text input speed is lower than the third threshold or the number of characters in the image to be translated is smaller than the fourth threshold, it sends the tenth voltage-frequency regulation information to the neural network processor to instruct it to lower its working voltage; when it determines that the text input speed is greater than the third threshold or the number of characters in the image to be translated is greater than the fourth threshold, it sends to the neural network processor voltage-frequency regulation information instructing it to raise its working voltage.
In a possible embodiment of the present application, when the application scenario information is the ambient light intensity, the voltage-frequency regulation information includes eleventh voltage-frequency regulation information, and the voltage and frequency scaling unit 72 is further configured to:
send the eleventh voltage-frequency regulation information to the neural network processor when the ambient light intensity is lower than a fifth threshold, the eleventh voltage-frequency regulation information instructing the neural network processor to lower its working voltage or working frequency.
Specifically, the ambient light intensity is collected by a light sensor connected to the neural network processor. After obtaining the light intensity, the information acquisition unit 71 transmits it to the voltage and frequency scaling unit 72. When the voltage and frequency scaling unit 72 determines that the light intensity is lower than the fifth threshold, it sends the eleventh voltage-frequency regulation information to the neural network processor to instruct it to lower its working voltage; when it determines that the light intensity is greater than the fifth threshold, it sends to the neural network processor voltage-frequency regulation information instructing it to raise its working voltage or working frequency.
In a possible embodiment of the present application, the neural network processor is applied to image beautification, the voltage-frequency regulation information includes twelfth voltage-frequency regulation information and thirteenth voltage-frequency regulation information, and the voltage and frequency scaling unit 72 is further configured to:
send the twelfth voltage-frequency regulation information to the neural network processor when the application scenario information is a face image, the twelfth voltage-frequency regulation information instructing the neural network processor to raise its working voltage or working frequency; and
send the thirteenth voltage-frequency regulation information to the neural network processor when the application scenario information is not a face image, the thirteenth voltage-frequency regulation information instructing the neural network processor to lower its working voltage or working frequency.
In a possible embodiment of the present application, when the neural network processor is applied to speech recognition, the application scenario information is the speech intensity. When the speech intensity is greater than a sixth threshold, the voltage and frequency scaling unit 72 sends to the neural network processor voltage-frequency regulation information instructing it to lower its working voltage or working frequency; when the speech intensity is lower than the sixth threshold, the voltage and frequency scaling unit 72 sends to the neural network processor voltage-frequency regulation information instructing it to raise its working voltage or working frequency.
It should be noted that the above scenario information may be information about the external scene collected by sensors, such as light intensity and speech intensity. The application scenario information may also be information computed by an artificial intelligence algorithm; for example, in an object recognition task, the real-time computation result information of the neural network processor is fed back to the information acquisition unit, the information including the number of objects in the scene, face images, object tag keywords, and so on.
Optionally, the above artificial intelligence algorithm includes, but is not limited to, a neural network algorithm.
Referring to FIG. 4F, FIG. 4F is a schematic structural diagram of another convolution operation device according to an embodiment of the present application. As shown in FIG. 4F, the convolution operation device includes a dynamic voltage and frequency scaling device 617, a register unit 612, an interconnect module 613, an operation unit 614, a control unit 615, and a data access unit 616.
The operation unit 614 includes at least two of an adder, a multiplier, a comparator, and an activation operator.
The interconnect module 613 is configured to control the connection relationship of the calculators in the operation unit 614 so that at least two kinds of calculators are composed into different computation topologies.
The register unit 612 (which may be a register file, an instruction cache, or a scratchpad memory) is configured to store the operation instruction, the address of the data block in the storage medium, and the computation topology corresponding to the operation instruction.
Optionally, the convolution operation device further includes a storage medium 611.
The storage medium 611 may be an off-chip memory; in practical applications it may of course also be an on-chip memory, used to store data blocks. A data block may be n-dimensional data, where n is an integer greater than or equal to 1: when n=1 it is one-dimensional data, i.e., a vector; when n=2 it is two-dimensional data, i.e., a matrix; and when n=3 or more it is multi-dimensional data.
The control unit 615 is configured to extract from the register unit 612 an operation instruction, the operation field corresponding to the operation instruction, and the first computation topology corresponding to the operation instruction, and to decode the operation instruction into an execution instruction used to control the operation unit 614 to perform the operation, transmitting the operation field to the data access unit 616 and the computation topology to the interconnect module 613.
The data access unit 616 is configured to extract the data block corresponding to the operation field from the storage medium 611 and transmit the data block to the interconnect module 613.
The interconnect module 613 is configured to receive the data block of the first computation topology.
In a possible embodiment of the present application, the interconnect module 613 also rearranges the data block according to the first computation topology.
The operation unit 614 is configured to execute the execution instruction, calling the calculators in the operation unit 614 to perform operations on the data block to obtain an operation result, and to transmit the operation result to the data access unit 616 to be stored in the storage medium 611.
In a possible embodiment of the present application, the operation unit 614 is further configured to call the calculators, according to the first computation topology and the execution instruction, to perform operations on the rearranged data block to obtain an operation result, and to transmit the operation result to the data access unit 616 to be stored in the storage medium 611.
In a feasible embodiment, the interconnect module 613 is further configured to form the first computation topology according to the connection relationship of the calculators in the operation unit 614.
The dynamic voltage and frequency scaling device 617 is configured to monitor the working state of the entire convolution operation device and to dynamically regulate its voltage and frequency.
The specific computation method of the above convolution operation device is described below through different operation instructions, taking the convolution computation instruction as an example. The convolution computation instruction can be applied in a neural network, so it may also be called a convolutional neural network instruction. For the convolution computation instruction, the formula it actually needs to execute may be:
s = s(∑ w·x_i + b)
Here the convolution kernel W (which may include multiple data elements) is multiplied by the input data x_i and the products are summed; then the bias b may optionally be added; and an activation operation s(h) may optionally be applied to obtain the final output result S. From this formula, the computation topology is obtained as multiplier - adder - (optional) activation operator. The above convolution computation instruction may include an instruction set that contains convolutional neural network COMPUTE instructions with different functions, as well as the CONFIG, IO, NOP, JUMP, and MOVE instructions.
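As a worked instance of this formula (a sketch only; the weights, inputs, and bias below are illustrative numbers, not values from the patent):

    import math

    def conv_output(w, x, b=0.0, activation=None):
        # s = act(sum_i w_i * x_i + b); bias and activation are optional,
        # matching the multiplier - adder - (optional) activation topology.
        h = sum(wi * xi for wi, xi in zip(w, x)) + b
        return activation(h) if activation else h

    sigmoid = lambda h: 1.0 / (1.0 + math.exp(-h))
    s = conv_output([0.5, -1.0, 0.25], [2.0, 1.0, 4.0], b=0.1, activation=sigmoid)
    # h = 0.5*2.0 - 1.0*1.0 + 0.25*4.0 + 0.1 = 1.1, so s = sigmoid(1.1) ≈ 0.750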
In one embodiment, the COMPUTE instructions include:
a convolution operation instruction, according to which the convolution operation device fetches input data of a specified size and a convolution kernel from specified addresses of the memory (preferably a scratchpad memory or a scalar register file), respectively, and performs the convolution operation in the convolution operation component;
a convolutional neural network sigmoid instruction, according to which the convolution operation device fetches input data of a specified size and a convolution kernel from specified addresses of the memory (preferably a scratchpad memory or a scalar register file), respectively, performs the convolution operation in the convolution operation component, and then applies sigmoid activation to the output result;
a convolutional neural network TanH instruction, according to which the convolution operation device fetches input data of a specified size and a convolution kernel from specified addresses of the memory (preferably a scratchpad memory), respectively, performs the convolution operation in the convolution operation component, and then applies TanH activation to the output result;
a convolutional neural network ReLU instruction, according to which the device fetches input data of a specified size and a convolution kernel from specified addresses of the memory (preferably a scratchpad memory), respectively, performs the convolution operation in the convolution operation component, and then applies ReLU activation to the output result; and
a convolutional neural network group instruction, according to which the convolution operation device fetches input data of a specified size and a convolution kernel from specified addresses of the memory (preferably a scratchpad memory), respectively, divides them into groups, performs the convolution operation in the convolution operation component, and then, preferably, applies activation to the output result.
The CONFIG instruction configures the various constants required by the computation of the current layer before the computation of each layer of the artificial neural network begins.
The IO instruction reads in from the external storage space the input data required by the computation and stores the data back to the external space after the computation is completed.
The NOP instruction is responsible for clearing the control signals in all control signal cache queues inside the current device, ensuring that all instructions before the NOP instruction have completed. The NOP instruction itself does not contain any operation.
The JUMP instruction is responsible for controlling the jump of the address of the next instruction to be read from the instruction storage unit, and is used to implement jumps in the control flow.
The MOVE instruction is responsible for moving data at one address in the internal address space of the convolution operation device to another address in the internal address space of the convolution operation device; this process is independent of the operation unit and does not occupy the resources of the operation unit during execution.
The method by which the above convolution operation device executes the convolution computation instruction may specifically be as follows:
The control unit 615 extracts from the register unit 612 the convolution computation instruction, the operation field corresponding to the convolution computation instruction, and the first computation topology corresponding to the convolution computation instruction (multiplier - adder - adder - activation operator); the control unit transmits the operation field to the data access unit 616 and the first computation topology to the interconnect module 613.
The data access unit 616 extracts from the storage medium 611 the convolution kernel w and the bias b corresponding to the operation field (when b is 0, the bias b does not need to be extracted) and transmits the convolution kernel w and the bias b to the operation unit 614.
The multiplier of the operation unit 614 multiplies the convolution kernel w by the input data Xi to obtain a first result; the first result is input to the adder to perform addition, obtaining a second result; the second result and the bias b are added to obtain a third result; the third result is input to the activation operator, which performs the activation operation to obtain the output result s; and the output result s is transmitted to the data access unit 616 to be stored in the storage medium 611. After each step, the result may instead be output directly and transmitted to the data access unit for storage in the storage medium 611, without the subsequent steps. In addition, the step of adding the second result and the bias b to obtain the third result is optional, that is, when b is 0 this step is not needed. Moreover, the order of the addition and multiplication operations may be exchanged.
Optionally, the above first result may include the results of multiple multiplication operations.
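The decode-and-dispatch path in the paragraphs above can be modeled as follows (a hypothetical software sketch; the patent defines hardware units such as control unit 615 and data access unit 616, not software objects, so every structure and name here is an assumption):

    # Hypothetical model of the convolution-instruction execution path.
    import math
    from dataclasses import dataclass

    @dataclass
    class ConvInstruction:
        opcode: str            # e.g. "CONV_SIGMOID"
        operand_field: dict    # addresses/sizes of x, w, b in storage medium 611
        topology: tuple        # e.g. ("mul", "add", "add_bias", "activate")

    def execute(inst, fetch, mul, add, activate, store):
        data = fetch(inst.operand_field)                 # data access unit 616
        first = [mul(w, x) for w, x in zip(data["w"], data["x"])]  # multiplier
        second = add(first)                              # adder: accumulate
        third = second + data.get("b", 0.0)              # optional bias step
        s = activate(third) if "activate" in inst.topology else third
        store(s)                                         # back via unit 616
        return s

    # Exercising the path with stub components:
    demo = ConvInstruction("CONV_SIGMOID", {}, ("mul", "add", "add_bias", "activate"))
    result = execute(
        demo,
        fetch=lambda _: {"w": [0.5, -1.0], "x": [2.0, 1.0], "b": 0.1},
        mul=lambda a, b: a * b,
        add=sum,
        activate=lambda h: 1.0 / (1.0 + math.exp(-h)),
        store=lambda s: None,
    )  # first=[1.0, -1.0], second=0.0, third=0.1, result = sigmoid(0.1) ≈ 0.525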
In a possible embodiment of the present application, an embodiment of the present application provides a neural network processor that includes the above convolution operation device.
The above neural network processor is configured to perform artificial neural network operations and implement artificial intelligence applications such as speech recognition, image recognition, and translation.
For this convolution computation task, the dynamic voltage and frequency scaling device 617 in FIG. 4F operates as follows:
Case 1: While the neural network processor performs the convolution operation, the dynamic voltage and frequency scaling device 617 in FIG. 4F acquires in real time the running speeds of the data access unit 616 and the operation unit 614 of the neural network processor. When the dynamic voltage and frequency scaling device 617 determines, from the running speeds of the data access unit 616 and the operation unit 614, that the running time of the data access unit 616 exceeds the running time of the operation unit 614, it can conclude that the data access unit 616 has become the bottleneck of the convolution operation: after finishing the current convolution operation, the operation unit 614 must wait for the data access unit 616 to finish its read task and transmit the read data to the operation unit 614 before it can perform the convolution operation on that data. The dynamic voltage and frequency scaling device 617 sends first voltage-frequency regulation information to the operation unit 614, the first voltage-frequency regulation information instructing the operation unit 614 to lower its working voltage or working frequency so as to lower its running speed and match it to the running speed of the data access unit 616. This reduces the power consumption of the operation unit 614, prevents the operation unit 614 from sitting idle, and ultimately reduces the overall operating power consumption of the neural network processor without affecting the completion time of the task.
Case 2: While the neural network processor performs the convolution operation, the dynamic voltage and frequency scaling device 617 acquires in real time the running speeds of the data access unit 616 and the operation unit 614 of the neural network processor. When the dynamic voltage and frequency scaling device 617 determines, from the running speeds of the data access unit 616 and the operation unit 614, that the running time of the operation unit 614 exceeds the running time of the data access unit 616, it can conclude that the operation unit 614 has become the bottleneck of the convolution operation: after finishing the current data read operation, the data access unit 616 must wait for the operation unit 614 to finish the current convolution operation before it can transmit the data it has read to the operation unit 614. The dynamic voltage and frequency scaling device 617 sends second voltage-frequency regulation information to the data access unit 616, the second voltage-frequency regulation information instructing the data access unit 616 to lower its working voltage or working frequency so as to lower its running speed and match it to the running speed of the operation unit 614. This reduces the power consumption of the data access unit 616, prevents the data access unit 616 from sitting idle, and ultimately reduces the overall operating power consumption of the neural network processor without affecting the completion time of the task.
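One way to read Cases 1 and 2 quantitatively is to scale the non-bottleneck unit's frequency down until the two running times match. The proportional rule below is an assumption of this sketch, not something the patent prescribes:

    def match_frequencies(t_access, t_compute, f_access, f_compute):
        # Slow the non-bottleneck unit to the pace of the bottleneck unit.
        if t_access > t_compute:
            # Case 1: memory-bound, so operation unit 614 may run slower.
            f_compute *= t_compute / t_access
        elif t_compute > t_access:
            # Case 2: compute-bound, so data access unit 616 may run slower.
            f_access *= t_access / t_compute
        return f_access, f_compute

    # E.g. if a read takes 10 ms but the convolution on it takes only 6 ms,
    # the compute frequency can drop to ~60% without delaying completion.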
The above neural network processor performs artificial neural network operations. When an artificial intelligence application is running, the dynamic voltage and frequency scaling device 617 acquires in real time the working parameters of the neural network processor for that artificial intelligence application and adjusts the working voltage or working frequency of the neural network processor according to those working parameters.
Specifically, the artificial intelligence application may be video image processing, object recognition, machine translation, speech recognition, image beautification, and so on.
Case 3: When the neural network processor performs video image processing, the dynamic voltage and frequency scaling device 617 acquires in real time the frame rate at which the neural network processor processes video images. When this frame rate exceeds the target frame rate, the target frame rate being the video image processing frame rate normally required by the user, the dynamic voltage and frequency scaling device 617 sends third voltage-frequency regulation information to the neural network processor, the third voltage-frequency regulation information instructing the neural network processor to lower its working voltage or working frequency, which reduces the power consumption of the neural network processor while still meeting the user's normal video image processing demand.
Case 4: When the neural network processor performs speech recognition, the dynamic voltage and frequency scaling device 617 acquires in real time the speech recognition speed of the neural network processor. When the speech recognition speed of the neural network processor exceeds the speech recognition speed the user actually needs, the dynamic voltage and frequency scaling device 617 sends fourth voltage-frequency regulation information to the neural network processor, the fourth voltage-frequency regulation information instructing the neural network processor to lower its working voltage or working frequency, which reduces the power consumption of the neural network processor while still meeting the user's normal speech recognition demand.
Case 5: The dynamic voltage and frequency scaling device 617 monitors in real time the working state of each unit or module in the neural network processor (including the storage medium 611, the register unit 612, the interconnect module 613, the operation unit 614, the control unit 615, and the data access unit 616). When any unit or module of the neural network processor is in an idle state, the dynamic voltage and frequency scaling device sends fifth voltage-frequency regulation information to that unit or module to lower its working voltage or working frequency and thereby reduce its power consumption. When that unit or module returns to a working state, the dynamic voltage and frequency scaling device sends sixth voltage-frequency regulation information to that unit or module to raise its working voltage or working frequency, so that the running speed of that unit or module meets the demand of its workload.
Referring to FIG. 4G, FIG. 4G is a schematic flowchart of a method for performing the forward operation of a single-layer convolutional neural network according to an embodiment of the present application, the method being applied in the above convolution operation device. As shown in FIG. 4G, the method includes the following steps (a pseudocode sketch of the complete flow is given after step S710):
S701: An input/output (IO) instruction is pre-stored at the head address of the instruction storage unit.
S702: The operation starts. The controller unit reads the IO instruction from the head address of the instruction storage unit, and according to the decoded control signal, the data access unit reads all the corresponding convolutional neural network operation instructions from the external address space and caches them in the instruction storage unit.
S703: The controller unit then reads the next IO instruction from the instruction storage unit, and according to the decoded control signal, the data access unit reads all the data needed by the main operation module from the external address space into the first storage unit of the main operation module.
S704: The controller unit then reads the next IO instruction from the instruction storage unit, and according to the decoded control signal, the data access unit reads the convolution kernel data needed by the slave operation modules from the external address space.
S705: The controller unit then reads the next CONFIG instruction from the instruction storage unit, and according to the decoded control signal, the convolution operation device configures the various constants needed by the computation of this layer of the neural network.
S706: The controller unit then reads the next COMPUTE instruction from the instruction storage unit, and according to the decoded control signal, the main operation module first sends the input data within the convolution window through the interconnection module to the N slave operation modules, where it is saved in the second storage units of the N slave operation modules; the convolution window is then moved according to the instruction.
S707: According to the control signal decoded from the COMPUTE instruction, the operation units of the N slave operation modules read the convolution kernels from the third storage units and the input data from the second storage units, complete the convolution of the input data with the convolution kernels, and return the resulting output scalars through the interconnection module.
S708: In the interconnection module, the output scalars returned by the N slave operation modules are spliced, stage by stage, into complete intermediate vectors.
S709: The main operation module obtains the intermediate vectors returned by the interconnection module. Once the convolution window has traversed all the input data, the main operation module splices all the returned vectors into an intermediate result; according to the control signal decoded from the COMPUTE instruction, it reads the bias data from the first storage unit, adds it to the intermediate result in the vector-addition unit to obtain the biased result, activates the biased result in the activation unit, and writes the final output data back to the first storage unit.
S710: The controller unit then reads the next IO instruction from the instruction storage unit, and according to the decoded control signal, the data access unit stores the output data in the first storage unit to the specified address in the external address space, and the operation ends.
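As a non-limiting illustration, the data flow of steps S706-S709 may be sketched in Python as follows. The array shapes, the ReLU activation, and all function and variable names are assumptions made for this example only; they are not the apparatus's actual interfaces.

```python
# Sketch of the S706-S709 dataflow: each input window is broadcast to N
# "slave" modules (one convolution kernel each); their output scalars are
# gathered into a vector, then the "main" module adds the bias and applies
# an activation (ReLU is assumed here).
import numpy as np

def conv_forward(x, kernels, bias, stride=1):
    """x: (H, W) input; kernels: (N, k, k), one kernel per slave module."""
    n, k, _ = kernels.shape
    h = (x.shape[0] - k) // stride + 1
    w = (x.shape[1] - k) // stride + 1
    out = np.empty((h, w, n))
    for i in range(h):                      # S706: move the convolution window
        for j in range(w):
            window = x[i*stride:i*stride+k, j*stride:j*stride+k]
            # S707: each slave module produces one output scalar per window
            scalars = [np.sum(window * kernels[s]) for s in range(n)]
            out[i, j] = np.array(scalars)   # S708: splice scalars into a vector
    return np.maximum(out + bias, 0.0)      # S709: bias addition and activation

y = conv_forward(np.random.rand(8, 8), np.random.rand(4, 3, 3), np.zeros(4))
```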
Optionally, the method further includes:
collecting working state information of the convolution operation device in real time; and
sending voltage-frequency regulation information to the convolution operation device according to the working state information of the convolution operation device, where the voltage-frequency regulation information instructs the convolution operation device to adjust its working voltage or working frequency.
Optionally, the working state information of the convolution operation device includes the running speed of the convolution operation device, and the voltage-frequency regulation information includes first voltage-frequency regulation information. Sending voltage-frequency regulation information to the convolution operation device according to its working state information includes:
when the running speed of the convolution operation device is greater than a target speed, sending the first voltage-frequency regulation information to the convolution operation device, where the first voltage-frequency regulation information instructs the convolution operation device to lower its working frequency or working voltage, and the target speed is the running speed of the chip that satisfies the user's requirements.
Optionally, the working state information of the convolution operation device includes the running speed of the data access unit and the running speed of the main operation unit, and the voltage-frequency regulation information includes second voltage-frequency regulation information. Sending voltage-frequency regulation information to the convolution operation device according to its working state information further includes:
when it is determined from the running speed of the data access unit and the running speed of the main operation unit that the running time of the data access unit exceeds the running time of the main operation unit, sending the second voltage-frequency regulation information to the main operation unit, where the second voltage-frequency regulation information instructs the main operation unit to lower its working frequency or working voltage.
Optionally, the voltage-frequency regulation information includes third voltage-frequency regulation information. Sending voltage-frequency regulation information to the convolution operation device according to its working state information further includes:
when it is determined from the running speed of the data access unit and the running speed of the main operation unit that the running time of the main operation unit exceeds the running time of the data access unit, sending the third voltage-frequency regulation information to the data access unit, where the third voltage-frequency regulation information instructs the data access unit to lower its working frequency or working voltage.
Optionally, the working state information of the convolution operation device includes the working state information of at least S of the units/modules among the instruction storage unit, the controller unit, the data access unit, the interconnection module, the main operation module, and the N slave operation modules, where S is an integer greater than 1 and less than or equal to N+5, and the voltage-frequency regulation information includes fourth voltage-frequency regulation information. Sending voltage-frequency regulation information to the convolution operation device according to its working state information further includes:
when it is determined from the working state information of a unit A that the unit A is in an idle state, sending the fourth voltage-frequency regulation information to the unit A, where the fourth voltage-frequency regulation information instructs the unit A to lower its working frequency or working voltage,
where the unit A is any one of the at least S units/modules.
Optionally, the voltage-frequency regulation information includes fifth voltage-frequency regulation information. Sending voltage-frequency regulation information to the convolution operation device according to its working state information further includes:
when it is determined from the working state information of the unit A that the unit A has returned to a working state, sending the fifth voltage-frequency regulation information to the unit A, where the fifth voltage-frequency regulation information instructs the unit A to raise its working voltage or working frequency.
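As a non-limiting illustration, the five voltage-frequency regulation rules above may be combined into a single monitoring routine, sketched below in Python. The data layout and the "scale_down"/"scale_up" messages are assumptions about how the regulation information might be encoded; they are not the described signal format.

```python
# Sketch of the first to fifth voltage-frequency regulation rules.
from dataclasses import dataclass

@dataclass
class UnitState:
    name: str
    runtime: float      # measured running time of the unit
    idle: bool          # current working state
    was_idle: bool      # state at the previous sampling instant

def regulate(device_speed, target_speed, access, main, units):
    msgs = []
    if device_speed > target_speed:        # rule 1: device faster than needed
        msgs.append(("device", "scale_down"))
    if access.runtime > main.runtime:      # rule 2: main unit waits on memory
        msgs.append((main.name, "scale_down"))
    elif main.runtime > access.runtime:    # rule 3: memory waits on compute
        msgs.append((access.name, "scale_down"))
    for u in units:                        # rules 4 and 5: per-unit gating
        if u.idle:
            msgs.append((u.name, "scale_down"))
        elif u.was_idle:                   # unit has returned to work
            msgs.append((u.name, "scale_up"))
        u.was_idle = u.idle
    return msgs

access = UnitState("data_access", runtime=5.0, idle=False, was_idle=False)
main = UnitState("main_op", runtime=3.0, idle=False, was_idle=False)
idle_unit = UnitState("interconnect", runtime=0.0, idle=True, was_idle=False)
print(regulate(2.0, 1.5, access, main, [idle_unit]))
```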
It should be noted that, for the specific implementation of the foregoing method embodiment, reference may be made to the related description of the embodiments shown in FIG. 4A to FIG. 4F, which is not repeated here.
In a possible embodiment of the present application, a method for performing the forward operation of a multi-layer convolutional neural network is provided, which includes performing the neural network forward operation method shown in FIG. 4G for each layer. After the convolutional neural network of the previous layer has been executed, the operation instruction of the current layer takes the output data address of the previous layer, stored in the main operation module, as the input data address of the current layer, and the convolution kernel address and bias data address in the instruction are changed to the addresses corresponding to the current layer.
In yet another aspect of the present application, an image compression method and a related apparatus are provided, which can train a compression neural network for image compression, improving the effectiveness of image compression and the accuracy of recognition.
Referring to FIG. 5A, FIG. 5A shows a neural network operation process provided by the present application. As shown in FIG. 5A, the dotted arrows indicate the reverse operation and the solid arrows indicate the forward operation. In the forward operation, after the execution of the previous layer of the artificial neural network is completed, the output neurons obtained by the previous layer are used as the input neurons of the next layer for computation (or some operation is first performed on those output neurons before they serve as the input neurons of the next layer); at the same time, the weights are replaced with the weights of the next layer. In the reverse operation, after the reverse operation of the previous layer of the artificial neural network is completed, the input-neuron gradients obtained by the previous layer are used as the output-neuron gradients of the next layer for computation (or some operation is first performed on those input-neuron gradients before they serve as the output-neuron gradients of the next layer); at the same time, the weights are replaced with the weights of the next layer.
The forward propagation phase of the neural network corresponds to the forward operation and is the process from input data to output data. The back propagation phase corresponds to the reverse operation and is the process in which the error between the final result data and the expected output data is propagated back through the path of the forward propagation phase. By repeating forward propagation and back propagation, the weights of each layer are corrected by gradient descent on the error; adjusting the weights of each layer is the learning and training process of the neural network, which reduces the error of the network output.
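A minimal runnable sketch of this layer chaining is given below: in the forward pass the output neurons of layer n become the input neurons of layer n+1, and in the backward pass the gradients flow in the opposite direction while each layer's weights are corrected by gradient descent. The two-layer ReLU network and the learning rate are assumptions made for the example.

```python
import numpy as np

def forward(x, weights):
    acts = [x]
    for w in weights:                        # output of layer n feeds layer n+1
        acts.append(np.maximum(acts[-1] @ w, 0.0))
    return acts

def backward(acts, weights, grad_out, lr=0.01):
    for n in reversed(range(len(weights))):  # gradients flow from layer n+1 to n
        grad_out = grad_out * (acts[n + 1] > 0)   # activation-function derivative
        grad_w = np.outer(acts[n], grad_out)      # weight gradient of layer n
        grad_out = grad_out @ weights[n].T        # input-neuron gradient
        weights[n] -= lr * grad_w                 # gradient-descent correction

ws = [np.random.randn(4, 8), np.random.randn(8, 3)]
acts = forward(np.random.randn(4), ws)
backward(acts, ws, acts[-1] - np.ones(3))    # error against an expected output
```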
In the present application, no limitation is placed on the types of training images in the compression training atlas of the compression neural network or on the number of training images of each type: the more types and the larger the number, the more training iterations there are and the lower the loss rate of image compression, which helps to improve the accuracy of image recognition.
The compression training atlas may cover multiple dimensions, such as images taken from multiple angles, images under various light intensities, or images collected by multiple different types of image acquisition devices. Training the compression neural network on compression training atlases corresponding to these different dimensions improves the effectiveness of image compression under different conditions and broadens the applicable range of the image compression method.
The training images in the compression training atlas include label information; the present application does not limit the specific content of the label information. Marking the image portion to be trained can be used to detect whether the training of the compression neural network is complete. For example, in a driving image captured by road video surveillance, the label information is the target license-plate information: the driving image is input to the compression neural network to obtain a compressed image, the compressed image is recognized based on the recognition neural network model to obtain reference license-plate information, and if the reference license-plate information matches the target license-plate information, it can be determined that the training of the compression neural network is complete; otherwise, if the current number of training iterations of the compression neural network is less than a preset threshold, the compression neural network still needs to be trained.
The present application does not limit the type of label information, which may be license-plate information, face information, traffic-sign information, object classification information, and the like.
The recognition neural network model involved in the present application is the data obtained when the training of the recognition neural network used for image recognition is complete. The training method of the recognition neural network is not limited; it may be trained with the batch gradient descent algorithm (Batch Gradient Descent, BGD), the stochastic gradient descent algorithm (Stochastic Gradient Descent, SGD), the mini-batch gradient descent algorithm (mini-batch SGD), or the like, with one training period completed by a single forward operation and one reverse gradient propagation.
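For concreteness, one training period of the mini-batch SGD variant (a single forward operation followed by one reverse gradient propagation) may be sketched as follows; the linear model, squared-error loss, batch size, and learning rate are assumptions made for the example.

```python
import numpy as np

def sgd_step(w, x_batch, t_batch, lr=0.1):
    y = x_batch @ w                                    # forward operation
    grad = x_batch.T @ (y - t_batch) / len(x_batch)    # reverse gradient propagation
    return w - lr * grad                               # weight update

w = np.zeros(5)
for _ in range(100):                                   # one period per mini-batch
    x = np.random.randn(16, 5)
    w = sgd_step(w, x, x @ np.arange(5.0))             # synthetic targets
```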
Each training image in the recognition training atlas includes at least label information whose type is consistent with the type of the target label information of each training image in the compression training atlas. In other words, the recognition neural network model can recognize the compressed images output by the compression neural network (whether still in training or fully trained).
For example, if the type of the label information of the compression training images is license plates, the types of the label information of the recognition training images include at least license plates, which ensures that the recognition neural network model can recognize the compressed images output by the compression neural network and obtain license-plate information.
Optionally, the compression training atlas includes at least the recognition training atlas.
Since the images in a training atlas are constrained by factors such as angle, lighting, or the image acquisition device, training with the recognition training atlas improves the accuracy of the recognition neural network model and thus the training efficiency of the compression neural network, which in turn helps to improve the effectiveness of image compression.
Referring to FIG. 5B, FIG. 5B is a schematic flowchart of an image compression method provided by an embodiment of the present application. As shown in FIG. 5B, the image compression method includes the following steps:
Step S201: acquire an original image of a first resolution.
The first resolution is the input resolution of the compression neural network, and the second resolution, which is smaller than the first resolution, is the output resolution of the compression neural network. That is, the compression ratio of images input to the compression neural network (the ratio of the second resolution to the first resolution) is fixed; in other words, compressing different images based on the same compression neural network model yields images with the same compression ratio.
The original image is any training image in the compression training atlas of the compression neural network, and the label information of the original image is taken as the target label information. The present application does not limit how the label information is obtained: it may be marked by manual recognition, or obtained by inputting the original image into the recognition neural network and recognizing it based on the recognition neural network model, and so on.
Step S202: compress the original image based on a target model to obtain a compressed image of the second resolution.
The target model is the current neural network model of the compression neural network; that is, the target model is the current set of parameters of the compression neural network. Compressing, based on the target model, an original image whose resolution equals the input resolution of the compression neural network yields a compressed image whose resolution equals the output resolution of the compression neural network.
Optionally, compressing the original image based on the target model to obtain the compressed image of the second resolution includes: recognizing the original image based on the target model to obtain multiple pieces of image information; and compressing the original image based on the target model and the multiple pieces of image information to obtain the compressed image.
As noted above, the training images cover multiple dimensions. By first recognizing the original image based on the target model, the image information corresponding to each dimension can be determined, and the original image is then compressed with respect to each piece of image information, which improves the accuracy of image compression across the different dimensions.
Step S203: recognize the compressed image based on the recognition neural network model to obtain reference label information.
The present application does not limit the recognition method, which may include two parts, feature extraction and feature recognition, with the result of feature recognition taken as the reference label information. For example, after a driving image is compressed, the reference label information obtained for the compressed driving image is the license-plate number; after a face image is compressed, the reference label information obtained for the compressed face image is the face-recognition result.
Optionally, recognizing the compressed image based on the recognition neural network model to obtain the reference label information includes: preprocessing the compressed image to obtain an image to be recognized; and recognizing the image to be recognized based on the recognition neural network model to obtain the reference label information.
The preprocessing includes, but is not limited to, any one or more of the following: data-format conversion (such as normalization or integer data conversion), data deduplication, data exception handling, filling in missing data, and so on. Preprocessing the compressed image improves the efficiency and accuracy of image recognition.
Similarly, acquiring the original image of the first resolution includes: receiving an input image; and preprocessing the input image to obtain the original image. Preprocessing the input image improves the compression efficiency of image compression.
The preprocessing described above also includes size processing. Since a neural network has a fixed size requirement, it can only process images whose size equals the basic image size of that neural network. Taking the basic image size of the compression neural network as the first basic image size and the basic image size of the recognition neural network as the second basic image size: the compression neural network requires its input images to have a size equal to the first basic image size, and the recognition neural network requires its input images to have a size equal to the second basic image size. The compression neural network can compress an image to be compressed that satisfies the first basic image size to obtain a compressed image, and the recognition neural network can recognize an image to be recognized that satisfies the second basic image size to obtain reference label information.
The present application does not limit the specific manner of size processing, which may include cropping or filling pixels, scaling to the basic image size, down-sampling the input image, and so on.
Here, cropping peripheral pixels means cropping non-critical information regions at the periphery of the image. Down-sampling is the process of reducing the sampling rate of a given signal, for example taking the average of four adjacent pixels as the value of the corresponding single pixel in the processed image, thereby reducing the size of the image.
Optionally, preprocessing the compressed image to obtain the image to be recognized includes: when the image size of the compressed image is smaller than the basic image size of the recognition neural network, filling the compressed image with pixels up to the basic image size to obtain the image to be recognized.
The present application does not limit the fill pixels, which may correspond to any color mode, for example rgb(0, 0, 0). The specific positions of the filled pixels are likewise not limited and may be any positions outside the compressed image itself; that is, the compressed image is not modified, and the image is instead extended by filling pixels, which does not deform the compressed image and helps to improve the efficiency and accuracy of image recognition.
For example, as shown in FIG. 5C, the compressed image is placed at the upper left of the image to be recognized, and the positions of the image to be recognized outside the compressed image are filled with pixels.
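A runnable sketch of the padding step of FIG. 5C is given below: the compressed image is placed at the upper left of a canvas of the recognition network's basic image size, and the remaining positions are filled with rgb(0, 0, 0). The concrete image sizes are assumptions made for the example.

```python
import numpy as np

def pad_to_basic_size(compressed, basic_h, basic_w):
    h, w, c = compressed.shape
    assert h <= basic_h and w <= basic_w, "image already exceeds the basic size"
    canvas = np.zeros((basic_h, basic_w, c), dtype=compressed.dtype)  # rgb(0,0,0)
    canvas[:h, :w] = compressed            # compressed image at the upper left
    return canvas

to_recognize = pad_to_basic_size(np.ones((64, 48, 3), np.uint8), 224, 224)
```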
Similarly, preprocessing the input image to obtain the original image includes: when the image size of the input image is smaller than the first basic image size of the compression neural network, filling the input image with pixels up to the first basic image size to obtain the original image. Pixel filling enables the original image to be compressed to be recognized by the recognition neural network to obtain reference label information, and pixel filling does not change the compression ratio of the input image, which helps to improve the efficiency and accuracy of training the compression neural network.
Step S204: obtain a loss function according to the target label information and the reference label information.
In the present application, the loss function describes the magnitude of the error between the target label information and the reference label information. The label information has multiple dimensions, and the loss is generally calculated with the squared-difference formula:
\[ \mathrm{loss} = \sum_{k=1}^{c} (t_k - y_k)^2 \]
where c is the dimensionality of the label information, t_k is the k-th dimension of the reference label information, and y_k is the k-th dimension of the target label information.
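The squared-difference loss can be computed directly, as in the following sketch; the three-dimensional label vectors are illustrative only.

```python
import numpy as np

def squared_difference_loss(t, y):
    """t: reference label information; y: target label information; length c each."""
    return float(np.sum((np.asarray(t) - np.asarray(y)) ** 2))

print(squared_difference_loss([1.0, 0.0, 0.0], [0.8, 0.1, 0.1]))  # 0.06
```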
Step S205: determine whether the loss function converges to a first threshold or whether the current number of training iterations of the compression neural network is greater than or equal to a second threshold; if so, perform step S206; if not, perform step S207.
In the training method of the compression neural network involved in the present application, the training period corresponding to each training image is completed by a single forward operation and one reverse gradient propagation; the threshold of the loss function is set as the first threshold, and the threshold of the number of training iterations of the compression neural network is set as the second threshold. That is, if the loss function converges to the first threshold or the number of training iterations is greater than or equal to the second threshold, the training of the compression neural network is complete, and the target model is taken as the compression neural network model corresponding to the completed training; otherwise, the loss function drives the back propagation phase of the compression neural network, that is, the target model is updated according to the loss function and training continues with the next training image, i.e., steps S202-S205 are performed until the above conditions are met, at which point training ends and step S206 is performed.
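As a non-limiting illustration, the S202-S207 loop with its two stopping criteria may be sketched as follows. The compress, recognize, and update callables stand in for the compression neural network, the recognition neural network model, and the back-propagation update; they, like the parameter names, are assumptions made for the example (squared_difference_loss is the helper from the sketch above).

```python
def train_compression_network(params, images, compress, recognize, update,
                              first_threshold, second_threshold):
    for iterations, (original, target_label) in enumerate(images, start=1):
        compressed = compress(params, original)                        # S202
        reference_label = recognize(compressed)                        # S203
        loss = squared_difference_loss(target_label, reference_label)  # S204
        # S205: stop if the loss converged or the iteration budget is reached
        if loss <= first_threshold or iterations >= second_threshold:
            return params                                              # S206
        params = update(params, loss)                                  # S207
    return params
```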
The present application does not limit the reverse-training method of the compression neural network. Optionally, refer to the schematic flowchart of the single-layer neural network operation method provided in FIG. 5D, which can be applied to the apparatus for performing compression neural network reverse training whose structure is shown in FIG. 5E.
As shown in FIG. 5E, the apparatus includes an instruction cache unit 21, a controller unit 22, a direct memory access unit 23, an H-tree module 24, a main operation module 25, and a plurality of slave operation modules 26; the apparatus may be implemented by a hardware circuit (for example, an application-specific integrated circuit, ASIC).
The instruction cache unit 21 reads instructions in through the direct memory access unit 23 and caches the read instructions. The controller unit 22 reads instructions from the instruction cache unit 21 and decodes them into microinstructions that control the behavior of the other modules, such as the direct memory access unit 23, the main operation module 25, and the slave operation modules 26. The direct memory access unit 23 can access the external address space and read and write data directly to each cache unit inside the apparatus to complete the loading and storing of data.
Referring to FIG. 5F, FIG. 5F shows the structure of the H-tree module 24. As shown in FIG. 5F, the H-tree module 24 forms the data path between the main operation module 25 and the plurality of slave operation modules 26 and has an H-tree structure. The H-tree is a binary-tree path composed of multiple nodes: each node sends the upstream data identically to its two downstream nodes, merges the data returned by its two downstream nodes, and returns the result to its upstream node. For example, in the reverse operation of the neural network, the vectors returned by the two downstream nodes are added into one vector at the current node and returned to the upstream node. At the stage where the computation of each layer of the artificial neural network begins, the input gradients in the main operation module 25 are sent to each slave operation module 26 through the H-tree module 24; after the computation of the slave operation modules 26 is complete, the partial sums of the output gradient vector output by each slave operation module 26 are added pairwise, stage by stage, in the H-tree module 24, that is, all the partial sums of the output gradient vector are summed to form the final output gradient vector.
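The pairwise, stage-by-stage accumulation performed by the H-tree can be mimicked in software as follows; a power-of-two number of slave modules is assumed for simplicity.

```python
import numpy as np

def h_tree_sum(partials):
    level = [np.asarray(p) for p in partials]
    assert (len(level) & (len(level) - 1)) == 0, "power-of-two fan-in assumed"
    while len(level) > 1:                   # one tree stage per iteration
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]

print(h_tree_sum([np.ones(3), 2 * np.ones(3), 3 * np.ones(3), 4 * np.ones(3)]))
# -> [10. 10. 10.]
```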
Referring to FIG. 5G, FIG. 5G is a schematic structural diagram of the main operation module 25. As shown in FIG. 5G, the main operation module 25 includes an operation unit 251, a data dependency determination unit 252, and a neuron cache unit 253.
The neuron cache unit 253 caches the input data and output data used by the main operation module 25 during computation. The operation unit 251 performs the various operation functions of the main operation module. The data dependency determination unit 252 is the port through which the operation unit 251 reads and writes the neuron cache unit 253, and it also guarantees that there are no consistency conflicts in reading and writing the data in the neuron cache unit 253. Specifically, the data dependency determination unit 252 determines whether a dependency exists between the data of a microinstruction that has not yet been executed and the data of a microinstruction that is currently being executed; if not, the microinstruction is allowed to issue immediately; otherwise, the microinstruction may issue only after all the microinstructions on which it depends have completed. For example, all microinstructions sent to the data dependency unit 252 are stored in an instruction queue inside the data dependency unit 252; in this queue, if the read range of a read instruction conflicts with the write range of a write instruction ahead of it in the queue, the read instruction can execute only after the write instruction on which it depends has been executed. The data dependency determination unit 252 is also responsible for reading the input gradient vector from the neuron cache unit 253 and sending it to the slave operation modules 26 through the H-tree module 24, while the output data of the slave operation modules 26 is sent directly to the operation unit 251 through the H-tree module 24. The instructions output by the controller unit 22 are sent to the operation unit 251 and the dependency determination unit 252 to control their behavior.
Referring to FIG. 5H, FIG. 5H is a schematic structural diagram of the slave operation module 26. As shown in FIG. 5H, each slave operation module 26 includes an operation unit 261, a data dependency determination unit 262, a neuron cache unit 263, a weight cache unit 264, and a weight-gradient cache unit 265.
The operation unit 261 receives the microinstructions issued by the controller unit 22 and performs arithmetic and logic operations.
The data dependency determination unit 262 is responsible for the read and write operations on the cache units during computation and guarantees that there are no consistency conflicts in reading from and writing to the cache units. Specifically, the data dependency determination unit 262 determines whether a dependency exists between the data of a microinstruction that has not yet been executed and the data of a microinstruction that is currently being executed; if not, the microinstruction is allowed to issue immediately; otherwise, the microinstruction may issue only after all the microinstructions on which it depends have completed. For example, all microinstructions sent to the data dependency unit 262 are stored in an instruction queue inside the data dependency unit 262; in this queue, if the read range of a read instruction conflicts with the write range of a write instruction ahead of it in the queue, the read instruction can execute only after the write instruction on which it depends has been executed.
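A minimal sketch of the read-after-write check performed by the dependency units 252 and 262 follows: a queued instruction may issue only if its address range does not overlap the write range of any earlier instruction still pending in the queue. The tuple encoding of an instruction is an assumption made for the example.

```python
def may_issue(instr, queue):
    """instr: (op, start_addr, end_addr); queue: earlier, still-pending instructions."""
    _, start, end = instr
    for prev_op, p_start, p_end in queue:
        overlaps = not (end < p_start or p_end < start)
        if prev_op == "write" and overlaps:
            return False      # wait until the earlier write has executed
    return True

print(may_issue(("read", 0, 15), [("write", 8, 31)]))   # False: ranges overlap
print(may_issue(("read", 0, 15), [("write", 16, 31)]))  # True: ranges disjoint
```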
The neuron cache unit 263 caches the input gradient vector data and the partial sums of the output gradient vector computed by this slave operation module 26.
The weight cache unit 264 caches the weight vectors needed by this slave operation module 26 during computation. Each slave operation module stores only the columns of the weight matrix that correspond to that slave operation module 26.
The weight-gradient cache unit 265 caches the weight-gradient data needed by the corresponding slave operation module when updating the weights. The weight-gradient data stored by each slave operation module 26 corresponds to the weight vectors it stores.
The slave operation modules 26 implement the parallelizable first half of the process of computing the output gradient vector in the reverse training of each layer of the artificial neural network, as well as the updating of the weights. Taking a fully connected layer of an artificial neural network (MLP) as an example, the process is out_gradient = w * in_gradient, where the multiplication of the weight matrix w by the input gradient vector in_gradient can be divided into independent parallel computing subtasks; out_gradient and in_gradient are column vectors, and each slave operation module computes only the product of the corresponding partial scalar elements of in_gradient with the corresponding columns of the weight matrix w. Each resulting output vector is a partial sum, still to be accumulated, of the final result, and these partial sums are added pairwise, stage by stage, in the H-tree to produce the final result. The computing process thus becomes a parallel partial-sum computation followed by accumulation: each slave operation module 26 computes a partial sum of the output gradient vector, and all the partial sums are summed in the H-tree module 24 to obtain the final output gradient vector. At the same time, each slave operation module 26 multiplies the input gradient vector by the output value of each layer in the forward operation to compute the weight gradient, in order to update the weights stored in that slave operation module 26. Forward operation and reverse training are the two main processes of a neural network algorithm: to train (update) the weights in the network, the neural network first computes the forward output of the input vector in the network formed by the current weights, which is the forward process, and then trains (updates) the weights of each layer backwards, layer by layer, according to the difference between the output value and the annotated value of the input vector itself. The forward computation saves the output vector of each layer and the derivative values of the activation functions; these data are needed by the reverse-training process, so they are guaranteed to exist when reverse training begins. The output values of each layer in the forward operation already exist when the reverse operation begins; they can be cached in the main operation module through the direct memory access unit and sent to the slave operation modules through the H-tree. The main operation module 25 performs the subsequent computations based on the output gradient vector, for example multiplying the output gradient vector by the derivative of the activation function from the forward operation to obtain the input gradient value of the next layer. The derivatives of the activation functions from the forward operation likewise already exist when the reverse operation begins and can be cached in the main operation module through the direct memory access unit.
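The partitioning of out_gradient = w * in_gradient into per-slave subtasks can be checked numerically with the sketch below, which reuses h_tree_sum from the earlier sketch; the sizes are illustrative.

```python
import numpy as np

w = np.random.randn(4, 4)                  # weight matrix, one column per slave
in_gradient = np.random.randn(4)
partials = [in_gradient[i] * w[:, i] for i in range(4)]  # independent subtasks
out_gradient = h_tree_sum(partials)        # pairwise accumulation in the tree
assert np.allclose(out_gradient, w @ in_gradient)
```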
According to an embodiment of the present invention, an instruction set for performing the artificial neural network forward operation on the aforementioned apparatus is also provided. The instruction set includes the CONFIG instruction, the COMPUTE instruction, the IO instruction, the NOP instruction, the JUMP instruction, and the MOVE instruction (an illustrative software encoding of this instruction set is sketched after the list), where:
the CONFIG instruction configures the various constants needed by the computation of the current layer before the computation of each layer of the artificial neural network begins;
the COMPUTE instruction completes the arithmetic and logic computation of each layer of the artificial neural network;
the IO instruction reads in from the external address space the input data needed by the computation and stores the data back to the external space after the computation is complete;
the NOP instruction is responsible for clearing the microinstructions currently loaded into all the internal microinstruction cache queues, guaranteeing that all instructions preceding the NOP instruction have completed; the NOP instruction itself does not contain any operation;
the JUMP instruction is responsible for the jump of the address of the next instruction that the controller will read from the instruction cache unit, and is used to implement jumps in the control flow;
the MOVE instruction is responsible for moving the data at one address in the apparatus's internal address space to another address in the apparatus's internal address space; this process is independent of the operation unit and does not occupy the operation unit's resources during execution.
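As a non-limiting illustration, the six-instruction set could be encoded as follows in a software simulator of the apparatus; the opcode values and operand layout are assumptions made for the example, not the instruction set's binary format.

```python
from enum import Enum, auto
from dataclasses import dataclass

class Opcode(Enum):
    CONFIG = auto()   # set per-layer constants before computation begins
    COMPUTE = auto()  # arithmetic/logic work of one layer
    IO = auto()       # load from / store to the external address space
    NOP = auto()      # drain the microinstruction queues; no operation
    JUMP = auto()     # redirect the next-instruction address
    MOVE = auto()     # copy data between internal addresses, bypassing the ALU

@dataclass
class Instruction:
    opcode: Opcode
    operands: tuple = ()

program = [
    Instruction(Opcode.IO, ("load", 0x0)),
    Instruction(Opcode.CONFIG, (("learning_rate", 0.01),)),
    Instruction(Opcode.COMPUTE),
    Instruction(Opcode.IO, ("store", 0x100)),
]
```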
Referring to FIG. 5I, FIG. 5I is an example block diagram of compression neural network reverse training provided by an embodiment of the present application. The process of computing the output gradient vector is out_gradient = w * in_gradient, where the matrix-vector multiplication of the weight matrix w by the input gradient vector in_gradient can be divided into independent parallel computing subtasks: each slave operation module 26 computes a partial sum of the output gradient vector, and all the partial sums are summed in the H-tree module 24 to obtain the final output gradient vector. In FIG. 5I, the output gradient vector (input gradient) of the previous layer is multiplied by the corresponding activation-function derivative to obtain the input data of this layer, which is then multiplied by the weight matrix to obtain the output gradient vector. The process of computing the weight-update gradient is dw = x * in_gradient, in which each slave operation module 26 computes the update gradient of the weights of the portion corresponding to that module. The slave operation module 26 multiplies the input gradient by the input neurons from the forward operation to compute the weight-update gradient dw, and then updates the weight w according to the learning rate set by the instruction, using w, dw, and the weight-update gradient dw' used in the previous weight update.
As shown in FIG. 5I, input gradient ([input gradient0, ..., input gradient3] in FIG. 5I) is the output gradient vector of the (n+1)-th layer. This vector is first multiplied by the derivative values of the n-th layer from the forward operation ([f'(out0), ..., f'(out3)] in FIG. 5I) to obtain the input gradient vector of the n-th layer; this step is completed in the main operation module 25, and the result is sent by the H-tree module 24 to the slave operation modules 26 and temporarily stored in the neuron cache units 263 of the slave operation modules 26. Then, the input gradient vector is multiplied by the weight matrix to obtain the output gradient vector of the n-th layer. In this process, the i-th slave operation module computes the product of the i-th scalar of the input gradient vector with the column vector [w_i0, ..., w_iN] of the weight matrix, and the resulting output vectors are added pairwise, stage by stage, in the H-tree module 24 to obtain the final output gradient vector (output gradient, [output gradient0, ..., output gradient3] in FIG. 5I).
At the same time, the slave operation modules 26 also need to update the weights they store. The process of computing the weight-update gradient is dw_ij = x_j * in_gradient_i, where x_j is the j-th element of the input vector of the n-th layer in the forward operation (that is, the output of the (n-1)-th layer), and in_gradient_i is the i-th element of the input gradient vector of the n-th layer in the reverse operation (that is, the product of input gradient and the derivative f' in FIG. 5I). The input of the n-th layer in the forward operation already exists at the beginning of reverse training; it is sent to the slave operation modules 26 through the H-tree module 24 and temporarily stored in the neuron cache units 263. Then, in the slave operation modules 26, after the computation of the partial sums of the output gradient vector is complete, the i-th scalar of the input gradient vector is multiplied by the input vector of the n-th layer of the forward operation to obtain the gradient vector dw for updating the weights, and the weights are updated accordingly.
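The weight-gradient computation dw_ij = x_j * in_gradient_i is an outer product of the layer input with the input gradient. The sketch below also shows one plausible reading, assumed here, of how w, dw, and the previous gradient dw' enter the update (a momentum-style blend with a learning rate); the patent text does not fix this combination.

```python
import numpy as np

def update_weights(w, x, in_gradient, dw_prev, lr=0.01, momentum=0.9):
    dw = np.outer(in_gradient, x)          # dw_ij = x_j * in_gradient_i
    w_new = w - lr * (dw + momentum * dw_prev)   # assumed use of dw' (momentum)
    return w_new, dw                       # dw becomes dw' for the next update

w = np.random.randn(4, 3)
w, dw = update_weights(w, np.random.randn(3), np.random.randn(4), np.zeros((4, 3)))
```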
As shown in FIG. 5D, an IO instruction is pre-stored at the head address of the instruction cache unit. The controller unit reads this IO instruction from the head address of the instruction cache unit, and according to the decoded microinstructions, the direct memory access unit reads from the external address space all the instructions related to this single-layer artificial neural network reverse training and caches them in the instruction cache unit. The controller unit then reads the next IO instruction from the instruction cache unit, and according to the decoded microinstructions, the direct memory access unit reads from the external address space all the data needed by the main operation module into the neuron cache unit of the main operation module; the data include the input neurons and activation-function derivative values from the previous forward operation as well as the input gradient vector. The controller unit then reads the next IO instruction from the instruction cache unit, and according to the decoded microinstructions, the direct memory access unit reads from the external address space all the weight data and weight-gradient data needed by the slave operation modules and stores them in the weight cache units and weight-gradient cache units of the corresponding slave operation modules. The controller unit then reads the next CONFIG instruction from the instruction cache unit, and according to the parameters in the decoded microinstructions, the operation units configure the values of their internal registers, including the various constants needed by the computation of this layer of the neural network, the precision setting of this layer's computation, the learning rate for updating the weights, and so on. The controller unit then reads the next COMPUTE instruction from the instruction cache unit, and according to the decoded microinstructions, the main operation module sends the input gradient vector and the input neurons from the forward operation to each slave operation module through the H-tree module; the input gradient vector and the forward-operation input neurons are stored in the neuron cache units of the slave operation modules. According to the microinstructions decoded from the COMPUTE instruction, the operation units of the slave operation modules read the weight vectors (that is, the portions of the columns of the weight matrix stored by each slave operation module) from the weight cache units, complete the vector-times-scalar operation of the weight vector and the input gradient vector, and return the output-vector partial sums through the H-tree; at the same time, the slave operation modules multiply the input gradient vector by the input neurons to obtain the weight gradients, which are stored in the weight-gradient cache units. In the H-tree module, the output-gradient partial sums returned by the slave operation modules are added pairwise, stage by stage, to obtain the complete output gradient vector. The main operation module obtains the return value of the H-tree module and, according to the microinstructions decoded from the COMPUTE instruction, reads the activation-function derivative values from the forward operation from the neuron cache unit, multiplies the derivative values by the returned output vector to obtain the input gradient vector for the reverse training of the next layer, and writes it back to the neuron cache unit. The controller unit then reads the next COMPUTE instruction from the instruction cache unit, and according to the decoded microinstructions, the slave operation modules read the weight w from the weight cache units, read this iteration's weight gradient dw and the weight gradient dw' used in the previous weight update from the weight-gradient cache units, and update the weight w. The controller unit then reads the next IO instruction from the instruction cache unit, and according to the decoded microinstructions, the direct memory access unit stores the output gradient vector in the neuron cache unit to the specified address in the external address space, and the operation ends.
For a multi-layer artificial neural network, the implementation process is similar to that of a single-layer neural network: after the previous layer of the artificial neural network has been executed, the operation instruction of the next layer takes the output gradient vector computed in the main operation module as the input gradient vector for the training of the next layer and performs the computation process above, with the weight addresses and weight-gradient addresses in the instruction likewise changed to the addresses corresponding to the current layer.
By employing the apparatus for performing neural network reverse training, support for the forward operation of multi-layer artificial neural networks is effectively improved. Moreover, using dedicated on-chip caches for multi-layer neural network reverse training fully exploits the reusability of the input neurons and the weight data, avoids repeatedly reading these data from memory, reduces the memory access bandwidth, and prevents the memory bandwidth from becoming the performance bottleneck of the forward operation of multi-layer artificial neural networks.
Step S206: acquire a target original image of the first resolution, and compress the target original image based on the compression neural network model to obtain a target compressed image of the second resolution.
The target original image is an image whose label-information type is consistent with that of the training images (an image belonging to the same data set). If the loss function converges to the first threshold or the number of training iterations is greater than or equal to the second threshold, the compression neural network has completed training; images can then be input directly into the compression neural network for compression to obtain the target compressed image, and the target compressed image can be recognized by the recognition neural network.
Optionally, after the target original image is compressed based on the compression neural network model to obtain the target compressed image of the second resolution, the method further includes: recognizing the target compressed image based on the recognition neural network model to obtain the label information of the target original image, and storing the label information of the target original image.
That is, after the training of the compression neural network is complete, compressed images can be recognized based on the recognition neural network model, which improves the efficiency and accuracy of identifying label information compared with manual labeling.
Step S207: update the target model according to the loss function to obtain an updated model, take the updated model as the target model, take the next training image as the original image, and perform step S202.
It can be understood that the loss function is obtained from the reference label values produced by the already-trained recognition neural network model and the target label values included in the original images. Training is complete when the loss function satisfies the preset condition or the current number of training iterations of the compression neural network exceeds the preset threshold; otherwise, the weights of the compression neural network are adjusted repeatedly through training, that is, the image content represented by each pixel of the same image is adjusted, reducing the loss of the compression neural network. Performing image compression with the compression neural network model obtained on completion of training improves the effectiveness of image compression and thus helps to improve the accuracy of recognition.
Referring to FIG. 5J, FIG. 5J is a schematic structural diagram of an image compression apparatus 300 provided by an embodiment of the present application. As shown in FIG. 5J, the image compression apparatus 300 includes a processor 301 and a memory 302.
In this embodiment of the present application, the memory 302 is configured to store the first threshold, the second threshold, the current neural network model and number of training iterations of the compression neural network, the compression training atlas of the compression neural network and the label information of each training image in the compression training atlas, the recognition neural network model, and the compression neural network model, with the current neural network model of the compression neural network taken as the target model; the compression neural network model is the target model corresponding to the completed training of the compression neural network, and the recognition neural network model is the neural network model corresponding to the completed training of the recognition neural network.
The processor 301 is configured to: acquire an original image of the first resolution, where the original image is any training image in the compression training atlas, and take the label information of the original image as the target label information; compress the original image based on the target model to obtain a compressed image of the second resolution, where the second resolution is smaller than the first resolution; recognize the compressed image based on the recognition neural network model to obtain reference label information; obtain a loss function according to the target label information and the reference label information; when the loss function converges to the first threshold, or the number of training iterations is greater than or equal to the second threshold, acquire a target original image of the first resolution and confirm the target model as the compression neural network model; and compress the target original image based on the compression neural network model to obtain a target compressed image of the second resolution.
Optionally, the processor 301 is further configured to: when the loss function does not converge to the first threshold, or the number of training iterations is less than the second threshold, update the target model according to the loss function to obtain an updated model, take the updated model as the target model, take the next training image as the original image, and perform the step of acquiring an original image of the first resolution.
Optionally, the processor 301 is specifically configured to preprocess the compressed image to obtain an image to be recognized, and to recognize the image to be recognized based on the recognition neural network model to obtain the reference label information.
Optionally, the preprocessing includes size processing. The memory 302 is further configured to store the base image size of the recognition neural network, and the processor 301 is specifically configured to, when the image size of the compressed image is smaller than the base image size, pad the compressed image with pixels up to the base image size to obtain the image to be recognized.
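As a non-limiting illustration of the size processing described above, the following sketch pads a smaller compressed image up to the base input size of the recognition neural network. The zero fill value and top-left placement of the original content are assumptions made for illustration; the embodiment does not fix them.

```python
import numpy as np

def pad_to_base_size(compressed: np.ndarray, base_h: int, base_w: int) -> np.ndarray:
    """Pad a compressed image of shape (H, W, C) with fill pixels up to
    the recognition network's base input size (base_h, base_w)."""
    h, w, c = compressed.shape
    if h > base_h or w > base_w:
        raise ValueError("compressed image already exceeds the base image size")
    # Zero fill and top-left placement are illustrative assumptions.
    padded = np.zeros((base_h, base_w, c), dtype=compressed.dtype)
    padded[:h, :w, :] = compressed
    return padded
```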
Optionally, the compression training atlas includes at least a recognition training atlas, and the processor 301 is further configured to train the recognition neural network with the recognition training atlas to obtain the recognition neural network model, where each training image in the recognition training atlas includes at least label information whose type is consistent with that of the target label information.
Optionally, the processor 301 is further configured to recognize the target compressed image based on the recognition neural network model to obtain label information of the target original image;
the memory 302 is further configured to store the label information of the target original image.
Optionally, the compression training atlas includes multiple dimensions, and the processor 301 is specifically configured to recognize the original image based on the target model to obtain multiple pieces of image information, one piece of image information per dimension, and to compress the original image based on the target model and the multiple pieces of image information to obtain the compressed image.
It can be understood that a compressed image of the original image is obtained based on the target model, reference label information of the compressed image is obtained based on the recognition neural network model, and a loss function is obtained from the target label information carried by the original image and the reference label information. When the loss function converges to the first threshold, or the current training count of the compression neural network is greater than or equal to the second threshold, training of the compression neural network used for image compression is complete; the target model is then taken as the compression neural network model, and a target compressed image of a target original image can be obtained based on it. In other words, the loss function is obtained from the reference label value produced by the already-trained recognition neural network model and the target label value carried by the original image; training finishes when the loss function satisfies the preset condition or the current training count exceeds the preset threshold, and otherwise the weights of the compression neural network are adjusted iteratively through training, that is, the image content represented by each pixel of the same image is adjusted. This reduces the loss of the compression neural network and improves the effectiveness of image compression, which in turn helps improve recognition accuracy.
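The training procedure summarized above can be sketched as the following loop. All names here (compress, recognize, loss_fn, update) are illustrative placeholders rather than interfaces defined by this disclosure, and the recognition neural network model is assumed to be already trained and held fixed; convergence to the first threshold is approximated by the loss falling to or below it.

```python
def train_compression_network(target_model, recognition_model, atlas,
                              loss_fn, update, first_threshold,
                              second_threshold):
    train_count = 0
    for original_image, target_label in atlas:
        compressed = target_model.compress(original_image)         # second resolution
        reference_label = recognition_model.recognize(compressed)  # reference label info
        loss = loss_fn(target_label, reference_label)
        train_count += 1
        # Training completes when the loss converges (here: at or below
        # the first threshold) or the training count reaches the second
        # threshold; the target model becomes the compression model.
        if loss <= first_threshold or train_count >= second_threshold:
            return target_model
        # Otherwise update the target model from the loss and continue
        # with the next training image as the original image.
        target_model = update(target_model, loss)
    return target_model
```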
In a possible embodiment of the present application, an electronic device 400 is provided. The electronic device 400 includes the image compression apparatus 300. As shown in FIG. 5K, the electronic device 400 includes a processor 401, a memory 402, a communication interface 403, and one or more programs 404, where the one or more programs 404 are stored in the memory 402 and configured to be executed by the processor 401, and the programs 404 include instructions for performing some or all of the steps described in the image compression method above.
It should be noted that each of the above units or modules may be a circuit, including a digital circuit, an analog circuit, and so on. Physical implementations of the above unit or module structures include, but are not limited to, physical devices, which in turn include, but are not limited to, transistors, memristors, and the like. The above chip or neural network processor may be any suitable hardware processor, such as a CPU, GPU, FPGA, DSP, or ASIC. The storage unit may be any suitable magnetic or magneto-optical storage medium, such as resistive random access memory (RRAM), dynamic random access memory (DRAM), static random access memory (SRAM), enhanced dynamic random access memory (EDRAM), high bandwidth memory (HBM), or a hybrid memory cube (HMC).
This application can be used in numerous general-purpose or special-purpose computing system environments or configurations, for example: personal computers (PCs), server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and distributed computing environments that include any of the above systems or devices.
In one embodiment, the present application provides a chip that includes the foregoing computing device. The chip can perform multiple kinds of operations on weights and input neurons simultaneously, achieving diversified computation. In addition, by using a dedicated on-chip cache for the multi-layer artificial neural network operation algorithm, the reusability of input neurons and weight data is fully exploited, repeated reads of these data from memory are avoided, memory access bandwidth is reduced, and memory bandwidth is prevented from becoming a performance bottleneck for multi-layer artificial neural network operations and their training algorithms.
In a possible embodiment of the present application, an embodiment of the present invention provides a chip package structure that includes the above neural network processor.
In a possible embodiment of the present application, an embodiment of the present invention provides a board card that includes the above chip package structure.
In a possible embodiment of the present application, an embodiment of the present invention provides an electronic device that includes the above board card.
The above electronic device includes, but is not limited to, a robot, a computer, a printer, a scanner, a tablet, a smart terminal, a mobile phone, a driving recorder, a navigator, a sensor, a webcam, a cloud server, a camera, a video camera, a projector, a watch, earphones, mobile storage, a wearable device, a vehicle, a household appliance, or medical equipment.
The vehicle includes an airplane, a ship, and/or a road vehicle; the household appliance includes a television, an air conditioner, a microwave oven, a refrigerator, a rice cooker, a humidifier, a washing machine, an electric lamp, a gas stove, or a range hood; the medical equipment includes a nuclear magnetic resonance instrument, a B-mode ultrasound scanner, and/or an electrocardiograph.
Those of ordinary skill in the art may realize that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of the examples have been described above generally in terms of their functions. Whether these functions are performed in hardware or software depends on the specific application and the design constraints of the technical solution. Skilled artisans may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of the present invention.
Those skilled in the art can clearly understand that, for convenience and brevity of description, reference may be made to the corresponding processes in the foregoing method embodiments for the specific working processes of the terminals and units described above, which are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed terminals and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative; the division of the above units is only a division by logical function, and other divisions are possible in actual implementation. For example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be electrical, mechanical, or other forms of connection.
The units described above as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments of the present invention.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist physically on its own, or two or more units may be integrated into one unit. The above integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the above integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present invention. The foregoing storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
It should be noted that implementations not shown or described in the drawings or in the body of the specification are forms known to those of ordinary skill in the art and are not described in detail. In addition, the above definitions of the elements and methods are not limited to the specific structures, shapes, or manners mentioned in the embodiments, which those of ordinary skill in the art may simply modify or replace.
The specific embodiments described above further explain the objectives, technical solutions, and beneficial effects of the present application in detail. It should be understood that the above are merely specific embodiments of the present application and are not intended to limit it; any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present application shall fall within the protection scope of the present application.

Claims (11)

1. A processing method, comprising:
    quantizing weights and input neurons separately, and determining a weight dictionary, a weight codebook, a neuron dictionary, and a neuron codebook; and
    determining an operation codebook according to the weight codebook and the neuron codebook.
2. The processing method according to claim 1, wherein quantizing the weights comprises the steps of:
    grouping the weights, performing a clustering operation on each group of weights with a clustering algorithm, dividing each group of weights into m classes, m being a positive integer, each class of weights corresponding to one weight index, and determining the weight dictionary, wherein the weight dictionary comprises weight positions and weight indices, and a weight position refers to the position of a weight in the neural network structure; and
    replacing all weights of each class with one center weight, and determining the weight codebook, wherein the weight codebook comprises weight indices and center weights.
3. The processing method according to claim 1 or 2, wherein quantizing the input neurons comprises the steps of:
    dividing the input neurons into p segments, each segment of input neurons corresponding to one neuron range and one neuron index, and determining the neuron dictionary, wherein p is a positive integer; and
    encoding the input neurons, replacing all input neurons of each segment with one center neuron, and determining the neuron codebook.
4. The processing method according to claim 3, wherein determining the operation codebook specifically comprises the steps of:
    determining, according to a weight, the corresponding weight index in the weight codebook, and then determining, through the weight index, the center weight corresponding to that weight;
    determining, according to an input neuron, the corresponding neuron index in the neuron codebook, and then determining, through the neuron index, the center neuron corresponding to that input neuron; and
    performing an operation on the center weights and center neurons to obtain operation results, and arranging the operation results into a matrix, thereby determining the operation codebook.
5. The processing method according to claim 4, wherein the operation comprises at least one of addition, multiplication, and pooling, and the pooling comprises average pooling, max pooling, and median pooling.
6. The processing method according to any one of claims 1 to 5, further comprising the step of retraining the weights and input neurons, wherein only the weight codebook and the neuron codebook are trained during retraining, the contents of the weight dictionary and the neuron dictionary remain unchanged, and the retraining uses a back propagation algorithm.
7. The processing method according to claim 2, wherein grouping the weights comprises:
    grouping into one group, in which all weights in the neural network are placed in a single group;
    layer-type grouping, in which the weights of all convolutional layers, the weights of all fully connected layers, and the weights of all long short-term memory (LSTM) network layers in the neural network are each placed in one group;
    inter-layer grouping, in which the weights of one or more convolutional layers, the weights of one or more fully connected layers, and the weights of one or more LSTM network layers in the neural network are each placed in one group; and
    intra-layer grouping, in which the weights within one layer of the neural network are partitioned, each partition forming one group.
8. The processing method according to claim 2, wherein the clustering algorithm comprises K-means, K-medoids, Clara, and/or Clarans.
9. The processing method according to any one of claims 2 to 8, wherein the center weight of each class is selected by determining the value of $w_0$ that minimizes the cost function $J(w, w_0)$, that value of $w_0$ being the center weight of the class,
    where
    $$J(w, w_0) = \sum_{i=1}^{n} (w_i - w_0)^2$$
    $J(w, w_0)$ is the cost function, $w$ denotes all the weights of the class, $w_0$ is the center weight, $n$ is the number of weights in the class, $w_i$ is the $i$-th weight in the class, $1 \le i \le n$, and $i$ is a positive integer.
10. A processing apparatus, comprising:
    a memory configured to store operation instructions; and
    a processor configured to execute the operation instructions in the memory, and to operate according to the processing method of any one of claims 1 to 9 when executing the operation instructions.
11. The apparatus according to claim 10, wherein the operation instruction is a binary number comprising an operation code and an address code, the operation code indicating an operation the processor is about to perform, and the address code instructing the processor to read the data participating in the operation from an address in the memory.
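For illustration only, the following minimal sketch shows one way the quantization recited in claims 1 to 4 could be realized, assuming K-means as the clustering algorithm (one option of claim 8), uniform segmentation of the neuron value range, multiplication as the operation, and the squared-error cost of claim 9, whose minimizing center weight is the class mean. None of this code is part of the claims.

```python
import numpy as np
from sklearn.cluster import KMeans

def quantize_weights(weights: np.ndarray, m: int):
    """Cluster one weight group into m classes (claim 2). Returns the
    per-weight class indices (the weight dictionary maps positions to
    indices) and the per-class center weights (the weight codebook).
    With the squared-error cost of claim 9, the optimal center weight
    is the class mean, which is exactly what K-means computes."""
    km = KMeans(n_clusters=m, n_init=10).fit(weights.reshape(-1, 1))
    return km.labels_.reshape(weights.shape), km.cluster_centers_.ravel()

def quantize_neurons(neurons: np.ndarray, p: int):
    """Split the input-neuron value range into p segments (claim 3).
    Returns per-neuron segment indices (neuron dictionary) and the
    center neuron of each segment (neuron codebook); uniform segments
    and midpoint centers are illustrative assumptions."""
    edges = np.linspace(neurons.min(), neurons.max(), p + 1)
    idx = np.clip(np.digitize(neurons, edges) - 1, 0, p - 1)
    centers = (edges[:-1] + edges[1:]) / 2
    return idx, centers

def operation_codebook(weight_centers: np.ndarray, neuron_centers: np.ndarray):
    """Precompute every center-weight x center-neuron product as an
    m x p matrix (claim 4, taking multiplication as the operation), so
    a runtime multiplication reduces to a table lookup by index pair."""
    return np.outer(weight_centers, neuron_centers)
```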
PCT/CN2018/095548 2017-10-20 2018-07-13 Processing method and apparatus WO2019076095A1 (en)

Priority Applications (9)

Application Number Priority Date Filing Date Title
EP19215858.2A EP3667569A1 (en) 2017-10-20 2018-07-13 Processing method and device, operation method and device
EP19215859.0A EP3660628B1 (en) 2017-10-20 2018-07-13 Dynamic voltage frequency scaling device and method
KR1020197037566A KR102434728B1 (en) 2017-10-20 2018-07-13 Processing method and apparatus
US16/482,710 US11593658B2 (en) 2017-10-20 2018-07-13 Processing method and device
EP19215860.8A EP3660706B1 (en) 2017-10-20 2018-07-13 Convolutional operation device and method
KR1020197037574A KR102434729B1 (en) 2017-10-20 2018-07-13 Processing method and apparatus
EP18868807.1A EP3627397B1 (en) 2017-10-20 2018-07-13 Processing method and apparatus
KR1020197023878A KR102434726B1 (en) 2017-10-20 2018-07-13 Treatment method and device
US16/528,948 US10747292B2 (en) 2017-10-29 2019-08-01 Dynamic voltage frequency scaling device and method

Applications Claiming Priority (12)

Application Number Priority Date Filing Date Title
CN201710989575.4 2017-10-20
CN201710989575.4A CN109697135B (en) 2017-10-20 2017-10-20 Storage device and method, data processing device and method, and electronic device
CN201711061069.5A CN109697509B (en) 2017-10-24 2017-10-24 Processing method and device, and operation method and device
CN201711004974.7 2017-10-24
CN201711004974.7A CN109697507B (en) 2017-10-24 2017-10-24 Processing method and device
CN201711061069.5 2017-10-24
CN201711029543.6A CN109725700A (en) 2017-10-29 2017-10-29 Dynamic voltage adjustment frequency modulation device and method
CN201711118938.3A CN109726353B (en) 2017-10-29 2017-10-29 Convolution operation device and method
CN201711118938.3 2017-10-29
CN201711029543.6 2017-10-29
CN201711289667.8A CN109903350B (en) 2017-12-07 2017-12-07 Image compression method and related device
CN201711289667.8 2017-12-07

Related Child Applications (2)

Application Number Title Priority Date Filing Date
US16/482,710 A-371-Of-International US11593658B2 (en) 2017-10-20 2018-07-13 Processing method and device
US16/528,948 Continuation US10747292B2 (en) 2017-10-29 2019-08-01 Dynamic voltage frequency scaling device and method

Publications (1)

Publication Number Publication Date
WO2019076095A1 true WO2019076095A1 (en) 2019-04-25

Family

ID=66173090

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/095548 WO2019076095A1 (en) 2017-10-20 2018-07-13 Processing method and apparatus

Country Status (1)

Country Link
WO (1) WO2019076095A1 (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0488150A2 (en) * 1990-11-26 1992-06-03 Hitachi, Ltd. Neural network system adapted for non-linear processing
EP0528511A2 (en) * 1991-08-15 1993-02-24 Sony Corporation Neural network quantizers
CN106096723A (en) * 2016-05-27 2016-11-09 北京航空航天大学 A kind of based on hybrid neural networks algorithm for complex industrial properties of product appraisal procedure
CN106485316A (en) * 2016-10-31 2017-03-08 北京百度网讯科技有限公司 Neural network model compression method and device
CN106529609A (en) * 2016-12-08 2017-03-22 郑州云海信息技术有限公司 Image recognition method and device based on neural network structure

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3627397A4 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113095468A (en) * 2019-12-23 2021-07-09 上海商汤智能科技有限公司 Neural network accelerator and data processing method thereof
CN113095468B (en) * 2019-12-23 2024-04-16 上海商汤智能科技有限公司 Neural network accelerator and data processing method thereof
CN113128673A (en) * 2019-12-31 2021-07-16 Oppo广东移动通信有限公司 Data processing method, storage medium, neural network processor and electronic device
CN113128673B (en) * 2019-12-31 2023-08-11 Oppo广东移动通信有限公司 Data processing method, storage medium, neural network processor and electronic device

Similar Documents

Publication Publication Date Title
KR102434726B1 (en) Treatment method and device
US11307865B2 (en) Data processing apparatus and method
CN109478144B (en) Data processing device and method
Sze Designing hardware for machine learning: The important role played by circuit designers
WO2020073211A1 (en) Operation accelerator, processing method, and related device
US20200265300A1 (en) Processing method and device, operation method and device
CN111368993A (en) Data processing method and related equipment
KR102530548B1 (en) neural network processing unit
WO2023231794A1 (en) Neural network parameter quantification method and apparatus
CN113326930A (en) Data processing method, neural network training method, related device and equipment
US20240135174A1 (en) Data processing method, and neural network model training method and apparatus
WO2022088063A1 (en) Method and apparatus for quantizing neural network model, and method and apparatus for processing data
Lu et al. Convolutional autoencoder-based transfer learning for multi-task image inferences
WO2019076095A1 (en) Processing method and apparatus
CN112789627A (en) Neural network processor, data processing method and related equipment
WO2023185209A1 (en) Model pruning
CN116401552A (en) Classification model training method and related device
Chen et al. SmartDeal: Remodeling Deep Network Weights for Efficient Inference and Training
CN114707643A (en) Model segmentation method and related equipment thereof
CN110334359B (en) Text translation method and device
CN113065638A (en) Neural network compression method and related equipment thereof
US20200150971A1 (en) Data processing apparatus and method
US20220121926A1 (en) Tensor ring decomposition for neural networks
Furuta et al. An Efficient Implementation of FPGA-based Object Detection Using Multi-scale Attention
CN117746047A (en) Image processing method and related equipment thereof

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18868807

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 20197023878

Country of ref document: KR

Kind code of ref document: A

ENP Entry into the national phase

Ref document number: 2018868807

Country of ref document: EP

Effective date: 20191216

NENP Non-entry into the national phase

Ref country code: DE