WO2019076095A1 - Processing method and apparatus - Google Patents

Processing method and apparatus

Info

Publication number
WO2019076095A1
Authority
WO
WIPO (PCT)
Prior art keywords
unit
neuron
weight
voltage
instruction
Prior art date
Application number
PCT/CN2018/095548
Other languages
French (fr)
Chinese (zh)
Inventor
刘少礼
周徐达
杜子东
刘道福
张磊
陈天石
胡帅
韦洁
孟小甫
Original Assignee
上海寒武纪信息科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN201710989575.4A external-priority patent/CN109697135B/en
Priority claimed from CN201711061069.5A external-priority patent/CN109697509B/en
Priority claimed from CN201711029543.6A external-priority patent/CN109725700A/en
Priority claimed from CN201711289667.8A external-priority patent/CN109903350B/en
Priority to KR1020197037574A priority Critical patent/KR102434729B1/en
Priority to EP19215860.8A priority patent/EP3660706B1/en
Application filed by 上海寒武纪信息科技有限公司 filed Critical 上海寒武纪信息科技有限公司
Priority to EP18868807.1A priority patent/EP3627397B1/en
Priority to KR1020197023878A priority patent/KR102434726B1/en
Priority to US16/482,710 priority patent/US11593658B2/en
Priority to KR1020197037566A priority patent/KR102434728B1/en
Priority to EP19215859.0A priority patent/EP3660628B1/en
Priority to EP19215858.2A priority patent/EP3667569A1/en
Publication of WO2019076095A1 publication Critical patent/WO2019076095A1/en
Priority to US16/528,948 priority patent/US10747292B2/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F1/00Details not covered by groups G06F3/00 - G06F13/00 and G06F21/00
    • G06F1/26Power supply means, e.g. regulation thereof
    • G06F1/32Means for saving power
    • G06F1/3203Power management, i.e. event-based initiation of a power-saving mode
    • G06F1/3206Monitoring of events, devices or parameters that trigger a change in power modality
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F1/00Details not covered by groups G06F3/00 - G06F13/00 and G06F21/00
    • G06F1/26Power supply means, e.g. regulation thereof
    • G06F1/32Means for saving power
    • G06F1/3203Power management, i.e. event-based initiation of a power-saving mode
    • G06F1/3234Power saving characterised by the action undertaken
    • G06F1/324Power saving characterised by the action undertaken by lowering clock frequency
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F1/00Details not covered by groups G06F3/00 - G06F13/00 and G06F21/00
    • G06F1/26Power supply means, e.g. regulation thereof
    • G06F1/32Means for saving power
    • G06F1/3203Power management, i.e. event-based initiation of a power-saving mode
    • G06F1/3234Power saving characterised by the action undertaken
    • G06F1/3287Power saving characterised by the action undertaken by switching off individual functional units in the computer system
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F1/00Details not covered by groups G06F3/00 - G06F13/00 and G06F21/00
    • G06F1/26Power supply means, e.g. regulation thereof
    • G06F1/32Means for saving power
    • G06F1/3203Power management, i.e. event-based initiation of a power-saving mode
    • G06F1/3234Power saving characterised by the action undertaken
    • G06F1/3296Power saving characterised by the action undertaken by lowering the supply or operating voltage
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/15Correlation function computation including computation of convolution operations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present application relates to the field of data processing, and in particular, to a processing method and apparatus, an operation method, and an apparatus.
  • Neural networks have been applied with great success.
  • However, the large-scale parameters and large-scale computation of neural networks have become a huge challenge for neural network applications.
  • Large-scale parameters place high demands on storage capacity and also lead to a large amount of memory-access energy consumption.
  • Large-scale computation places high demands on the design of the arithmetic unit and also leads to a large amount of computational energy consumption. Therefore, how to reduce the parameters and the computation of neural networks has become an urgent problem to be solved.
  • The purpose of the present application is to provide a processing method and apparatus, and an operation method and apparatus, to solve at least one of the above technical problems.
  • a processing method including:
  • the weights and the input neurons are quantized separately to determine a weight dictionary, a weight codebook, a neuron dictionary, and a neuron codebook;
  • the operation codebook is determined based on the weight codebook and the neuron codebook.
  • quantizing the weights includes the steps of:
  • grouping and clustering the weights to determine the weight dictionary, wherein the weight dictionary includes weight positions and weight indexes, the weight position indicating the position of a weight in the neural network structure;
  • the weight codebook is determined by replacing all the weights of each class with the central weight of that class, wherein the weight codebook includes the weight indexes and the central weights.
  • the step of quantizing the input neurons comprises the steps of:
  • the input neurons are divided into segments and encoded, and all input neurons of each segment are replaced with the central neuron of that segment to determine the neuron codebook.
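
The neuron quantization step can be illustrated with a short Python sketch (illustrative only, not the patent's implementation; uniform segmentation and the segment midpoint as the central neuron are assumptions):

```python
import numpy as np

def quantize_neurons(neurons, num_segments=4):
    """Segment the input-neuron value range and replace every neuron in a
    segment with that segment's central neuron (uniform segments assumed)."""
    edges = np.linspace(neurons.min(), neurons.max(), num_segments + 1)
    centers = (edges[:-1] + edges[1:]) / 2  # neuron codebook: index -> central neuron
    idx = np.clip(np.digitize(neurons, edges) - 1, 0, num_segments - 1)
    return idx, centers                      # idx plays the role of the neuron indexes

neurons = np.array([0.10, 0.35, 0.40, 0.70, 0.90])
idx, neuron_codebook = quantize_neurons(neurons)
decoded = neuron_codebook[idx]               # central neurons used in computation
```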
  • the determining the operation codebook includes the following steps:
  • the central weights and the central neurons are operated on to obtain operation results, and the operation results are arranged into a matrix to determine the operation codebook; a sketch follows below.
  • the operations include at least one of the following: addition, multiplication, and pooling, wherein the pooling includes average pooling, maximum pooling, and median pooling.
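
The sketch below illustrates how such an operation codebook could be precomputed, assuming multiplication as the operation; the function name and the index layout are illustrative, not taken from the patent:

```python
import numpy as np

def build_operation_codebook(weight_codebook, neuron_codebook, op=np.multiply):
    """Precompute op(central weight, central neuron) for every index pair;
    row = weight index, column = neuron index."""
    return op(weight_codebook[:, None], neuron_codebook[None, :])

# At run time the multiply step reduces to a table read:
#   result = operation_codebook[weight_index, neuron_index]
```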
  • the method further includes the steps of: retraining the weights and the input neurons, training only the weight codebook and the neuron codebook during the retraining while the contents of the weight dictionary and the neuron dictionary remain unchanged; the retraining uses the back-propagation algorithm.
  • grouping the weights includes:
  • layer-type grouping: the weights of all convolution layers in the neural network, the weights of all fully connected layers, and the weights of all long short-term memory (LSTM) network layers are each divided into a group;
  • inter-layer grouping: the weights of one or more convolution layers, the weights of one or more fully connected layers, and the weights of one or more LSTM network layers in the neural network are each divided into a group;
  • intra-layer grouping: the weights within one layer of the neural network are segmented, and each part after segmentation is divided into a group.
  • the clustering algorithm comprises K-means, K-medoids, Clara and/or Clarans.
  • the method for selecting the central weight of each class comprises: determining the value w0 at which the cost function J(w, w0) takes its minimum; that value w0 is the central weight;
  • the cost function is J(w, w0) = Σ_{i=1}^{n} (w_i − w0)², where:
  • w is the set of all weights in the class;
  • w0 is the central weight;
  • n is the number of weights in the class;
  • w_i is the i-th weight in the class, 1 ≤ i ≤ n, and i is a positive integer.
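
As a worked note on the formula above: for the squared-error cost, the minimizing w0 of a class is its mean. The following hedged sketch groups weights with a toy 1-D K-means and picks central weights accordingly (the clustering details and all names are assumptions, not the patent's algorithm):

```python
import numpy as np

def central_weight(class_weights):
    """The w0 minimizing J(w, w0) = sum_i (w_i - w0)^2 is the class mean."""
    return class_weights.mean()

def cluster_weights(weights, k=4, iters=20):
    """Toy 1-D K-means: cluster weights into k classes and return
    (weight indexes, weight codebook of central weights)."""
    centers = np.linspace(weights.min(), weights.max(), k)
    for _ in range(iters):
        idx = np.abs(weights[:, None] - centers[None, :]).argmin(axis=1)
        for j in range(k):
            if np.any(idx == j):          # leave empty classes unchanged
                centers[j] = central_weight(weights[idx == j])
    return idx, centers
```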
  • a processing apparatus comprising:
  • a memory for storing an operation instruction
  • the processor is configured to execute an operation instruction in the memory, and operate according to the foregoing processing method when the operation instruction is executed.
  • the operation instruction is a binary number, including an operation code and an address code
  • the operation code indicates an operation to be performed by the processor
  • the address code indicates the address in the memory from which the processor reads the data participating in the operation.
  • an arithmetic device including:
  • An instruction control unit configured to decode the received instruction to generate search control information
  • the lookup table unit is configured to look up the output neurons from the operation codebook according to the search control information and the received weight dictionary, neuron dictionary, operation codebook, weights, and input neurons.
  • the weight dictionary includes weight positions and weight indexes;
  • the neuron dictionary includes input neurons and neuron indexes;
  • the operation codebook includes the weight indexes, the neuron indexes, and the results of the operations on the input neurons and weights.
  • the computing device further includes:
  • a pre-processing unit configured to pre-process the externally input information to obtain the weights, the input neurons, the instructions, the weight dictionary, the neuron dictionary, and the operation codebook;
  • a storage unit for storing the input neurons, weights, weight dictionary, neuron dictionary, operation codebook, and instructions, and for receiving the output neurons;
  • a cache unit for buffering the instruction, input neurons, weights, weight indexes, neuron indexes, and output neurons;
  • a direct memory access unit for reading and writing data or instructions between the storage unit and the cache unit.
  • the cache unit includes:
  • an instruction cache for caching the instructions and outputting the cached instructions to the instruction control unit;
  • an input neuron cache for caching the input neurons;
  • an output neuron cache for caching the output neurons output by the lookup table unit.
  • the cache unit further includes:
  • a weight index cache for caching the weight indexes;
  • a neuron index cache for caching the neuron indexes.
  • the pre-processing unit is specifically configured to perform segmentation, Gaussian filtering, binarization, regularization, and/or normalization when pre-processing the externally input information.
  • the lookup table unit includes:
  • the addition lookup table is used to perform, through the table-lookup operation add_lookup, the addition of the central data data corresponding to the input index vector in;
  • in and data are vectors of length N, where N is a positive integer; that is, out = data[in_1] + data[in_2] + … + data[in_N].
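
Under the reconstructed formula above, the table-lookup addition amounts to a gather followed by a sum; a minimal, illustrative Python sketch (the summing interpretation of add_lookup is itself an assumption):

```python
import numpy as np

def add_lookup(in_idx, data):
    """out = data[in_1] + data[in_2] + ... + data[in_N]:
    gather the central data selected by the index vector, then sum."""
    return data[np.asarray(in_idx)].sum()

data = np.array([0.5, 1.0, 2.0])    # central data (codebook entries)
print(add_lookup([0, 2, 2], data))  # 0.5 + 2.0 + 2.0 = 4.5
```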
  • the instruction is a neural network specific instruction
  • the neural network specific instruction includes:
  • a data transfer instruction for completing data transfer between different storage media, the data formats including a matrix, a vector, and a scalar;
  • an operation instruction for completing the arithmetic operations of the neural network, including the matrix operation instruction, vector operation instruction, scalar operation instruction, convolutional neural network operation instruction, fully connected neural network operation instruction, pooling neural network operation instruction, RBM neural network operation instruction, LRN neural network operation instruction, LCN neural network operation instruction, LSTM neural network operation instruction, RNN neural network operation instruction, RELU neural network operation instruction, PRELU neural network operation instruction, SIGMOID neural network operation instruction, TANH neural network operation instruction, and MAXOUT neural network operation instruction;
  • a logic instruction for completing the logical operations of the neural network, including vector logic operation instructions and scalar logic operation instructions.
  • the neural network dedicated instruction includes at least one Cambricon instruction
  • the Cambricon instruction includes an operation code and an operand
  • the Cambricon instruction includes:
  • Cambricon control instruction for controlling an execution process
  • the Cambricon control instruction includes a jump instruction and a conditional branch instruction
  • the Cambricon data transfer instruction is used to complete data transfer between different storage media, and includes a load instruction, a store instruction, and a move instruction; the load instruction is used to load data from main memory to a cache; the store instruction is used to store data from a cache to main memory; and the move instruction is used to transfer data between caches, between a cache and a register, or between registers;
  • the Cambricon operation instruction is used to complete neural network arithmetic operations, and includes Cambricon matrix operation instructions, Cambricon vector operation instructions, and Cambricon scalar operation instructions; the Cambricon matrix operation instruction is used to complete matrix operations in the neural network, including matrix-multiply-vector, vector-multiply-matrix, matrix-multiply-scalar, outer product, matrix-add-matrix, and matrix-subtract-matrix operations; the Cambricon vector operation instruction is used to complete vector operations in the neural network, including vector basic operations, vector transcendental function operations, inner product operations, vector random generation operations, and maximum/minimum-of-a-vector operations; the Cambricon scalar operation instruction is used to complete scalar operations in the neural network, including scalar basic operations and scalar transcendental function operations;
  • the Cambricon logic instruction is used for the logical operations of the neural network, and includes Cambricon vector logic operation instructions and Cambricon scalar logic operation instructions; the Cambricon vector logic operation instructions are used for vector comparison operations, vector logic operations, and vector greater-than-merge operations, wherein the vector logic operations include AND, OR, and NOT; the Cambricon scalar logic operation instruction is used for scalar comparison operations and scalar logic operations.
  • the Cambricon data transmission instruction supports one or more of the following data organization modes: a matrix, a vector, and a scalar;
  • the vector basic operations include vector addition, subtraction, multiplication, and division;
  • the vector transcendental function refers to a function that does not satisfy any polynomial equation having polynomial coefficients, including exponential functions, logarithmic functions, trigonometric functions, and inverse trigonometric functions;
  • the scalar basic operations include scalar addition, subtraction, multiplication, and division;
  • the scalar transcendental function refers to a function that does not satisfy any polynomial equation having polynomial coefficients, including exponential functions, logarithmic functions, trigonometric functions, and inverse trigonometric functions;
  • the vector comparison includes greater than, less than, equal to, greater than or equal to, less than or equal to, and not equal to;
  • according to the received weights, the weight dictionary, the neuron dictionary, and the input neurons, the output neurons are looked up in the operation codebook.
  • the weight dictionary includes weight positions and weight indexes;
  • the neuron dictionary includes input neurons and neuron indexes;
  • the operation codebook includes the weight indexes, the neuron indexes, and the results of the operations on the weights and input neurons.
  • searching for output neurons in the operation codebook according to the search control information, weights, and input neurons includes the following steps:
  • the operation result is looked up in the operation codebook according to the weight index and the neuron index to determine the output neuron; a sketch of the complete lookup path follows below.
  • the operation result includes the result of at least one of the following operations: addition, multiplication, and pooling, wherein the pooling includes average pooling, maximum pooling, and median pooling.
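
Taken together, the lookup-based inference step can be sketched as follows (recovering indexes by nearest codebook entry is an assumption for illustration; in the described device the indexes come from the dictionaries):

```python
import numpy as np

def lookup_result(weight, input_neuron,
                  weight_codebook, neuron_codebook, operation_codebook):
    """Replace the arithmetic operation with lookups: map the weight and the
    input neuron to their indexes, then read the precomputed result."""
    w_idx = np.abs(weight_codebook - weight).argmin()
    n_idx = np.abs(neuron_codebook - input_neuron).argmin()
    return operation_codebook[w_idx, n_idx]
```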
  • before receiving the weights, the input neurons, the instructions, the weight dictionary, the neuron dictionary, and the operation codebook, the method further includes the step of: pre-processing the externally input information to obtain the weights, input neurons, instructions, weight dictionary, neuron dictionary, and operation codebook;
  • the method further includes the steps of: storing the weights, the input neurons, the instructions, the weight dictionary, the neuron dictionary, and the operation codebook, and receiving the output neurons; and caching the instructions, input neurons, weights, and output neurons.
  • after receiving the weights, the input neurons, the instructions, the weight dictionary, the neuron dictionary, and the operation codebook, the method further includes the step of: caching the weight indexes and the neuron indexes.
  • the pre-processing includes segmentation, Gaussian filtering, binarization, regularization, and/or normalization.
  • the instruction is a neural network specific instruction
  • the neural network specific instruction includes:
  • a data transfer instruction for completing data transfer between different storage media, the data formats including a matrix, a vector, and a scalar;
  • an operation instruction for completing the arithmetic operations of the neural network, including the matrix operation instruction, vector operation instruction, scalar operation instruction, convolutional neural network operation instruction, fully connected neural network operation instruction, pooling neural network operation instruction, RBM neural network operation instruction, LRN neural network operation instruction, LCN neural network operation instruction, LSTM neural network operation instruction, RNN neural network operation instruction, RELU neural network operation instruction, PRELU neural network operation instruction, SIGMOID neural network operation instruction, TANH neural network operation instruction, and MAXOUT neural network operation instruction;
  • a logic instruction for completing the logical operations of the neural network, including vector logic operation instructions and scalar logic operation instructions.
  • the neural network dedicated instruction includes at least one Cambricon instruction
  • the Cambricon instruction includes an operation code and an operand
  • the Cambricon instruction includes:
  • Cambricon control instruction for controlling an execution process
  • the Cambricon control instruction includes a jump instruction and a conditional branch instruction
  • the Cambricon data transfer instruction is used to complete data transfer between different storage media, and includes a load instruction, a store instruction, and a move instruction; the load instruction is used to load data from main memory to a cache; the store instruction is used to store data from a cache to main memory; and the move instruction is used to transfer data between caches, between a cache and a register, or between registers;
  • the Cambricon operation instruction is used to complete neural network arithmetic operations, and includes Cambricon matrix operation instructions, Cambricon vector operation instructions, and Cambricon scalar operation instructions; the Cambricon matrix operation instruction is used to complete matrix operations in the neural network, including matrix-multiply-vector, vector-multiply-matrix, matrix-multiply-scalar, outer product, matrix-add-matrix, and matrix-subtract-matrix operations; the Cambricon vector operation instruction is used to complete vector operations in the neural network, including vector basic operations, vector transcendental function operations, inner product operations, vector random generation operations, and maximum/minimum-of-a-vector operations; the Cambricon scalar operation instruction is used to complete scalar operations in the neural network, including scalar basic operations and scalar transcendental function operations;
  • the Cambricon logic instruction is used for the logical operations of the neural network, and includes Cambricon vector logic operation instructions and Cambricon scalar logic operation instructions; the Cambricon vector logic operation instructions are used for vector comparison operations, vector logic operations, and vector greater-than-merge operations, wherein the vector logic operations include AND, OR, and NOT; the Cambricon scalar logic operation instruction is used for scalar comparison operations and scalar logic operations.
  • the Cambricon data transmission instruction supports one or more of the following data organization modes: a matrix, a vector, and a scalar;
  • the vector basic operations include vector addition, subtraction, multiplication, and division;
  • the vector transcendental function refers to a function that does not satisfy any polynomial equation having polynomial coefficients, including exponential functions, logarithmic functions, trigonometric functions, and inverse trigonometric functions;
  • the scalar basic operations include scalar addition, subtraction, multiplication, and division;
  • the scalar transcendental function refers to a function that does not satisfy any polynomial equation having polynomial coefficients, including exponential functions, logarithmic functions, trigonometric functions, and inverse trigonometric functions;
  • the vector comparison includes greater than, less than, equal to, greater than or equal to, less than or equal to, and not equal to;
  • an arithmetic device comprising:
  • An instruction control unit configured to decode the received instruction to generate search control information
  • the lookup table unit is configured to look up the output neurons from the operation codebook according to the search control information and the received weight dictionary, neuron dictionary, operation codebook, weights, and input neurons.
  • the weight dictionary includes weight positions and weight indexes;
  • the neuron dictionary includes input neurons and neuron indexes;
  • the operation codebook includes the weight indexes, the neuron indexes, and the results of the operations on the input neurons and weights.
  • the computing device further includes:
  • a pre-processing unit configured to pre-process the externally input information to obtain the weights, the input neurons, the instructions, the weight dictionary, the neuron dictionary, and the operation codebook;
  • a storage unit for storing the input neurons, weights, weight dictionary, neuron dictionary, operation codebook, and instructions, and for receiving the output neurons;
  • a cache unit for buffering the instructions, input neurons, weights, weight indexes, neuron indexes, and output neurons;
  • a direct memory access unit for reading or writing data or instructions between the storage unit and the cache unit.
  • the cache unit includes:
  • an instruction cache for caching the instructions and outputting the cached instructions to the instruction control unit;
  • an input neuron cache for caching the input neurons;
  • an output neuron cache for caching the output neurons output by the lookup table unit.
  • the cache unit further includes:
  • a weight index cache for caching the weight indexes;
  • a neuron index cache for caching the neuron indexes.
  • the pre-processing performed by the pre-processing unit on the externally input information includes: segmentation, Gaussian filtering, binarization, regularization, and/or normalization.
  • the lookup table unit includes:
  • an addition lookup table used to perform, through the table-lookup operation add_lookup, the addition of the central data data corresponding to the input index vector in;
  • in and data are vectors of length N, where N is a positive integer; that is, out = data[in_1] + data[in_2] + … + data[in_N].
  • the instruction is a neural network specific instruction
  • the neural network specific instruction includes:
  • a data transfer instruction for completing data transfer between different storage media, the data formats including a matrix, a vector, and a scalar;
  • an operation instruction for completing the arithmetic operations of the neural network, including the matrix operation instruction, vector operation instruction, scalar operation instruction, convolutional neural network operation instruction, fully connected neural network operation instruction, pooling neural network operation instruction, RBM neural network operation instruction, LRN neural network operation instruction, LCN neural network operation instruction, LSTM neural network operation instruction, RNN neural network operation instruction, RELU neural network operation instruction, PRELU neural network operation instruction, SIGMOID neural network operation instruction, TANH neural network operation instruction, and MAXOUT neural network operation instruction;
  • a logic instruction for completing the logical operations of the neural network, including vector logic operation instructions and scalar logic operation instructions.
  • the neural network dedicated instruction includes at least one Cambricon instruction, the Cambricon instruction including an operation code and an operand;
  • the Cambricon instruction includes:
  • Cambricon control instruction for controlling an execution process
  • the Cambricon control instruction includes a jump instruction and a conditional branch instruction
  • the Cambricon data transfer instruction is used to complete data transfer between different storage media, and includes a load instruction, a store instruction, and a move instruction; the load instruction is used to load data from main memory to a cache; the store instruction is used to store data from a cache to main memory; and the move instruction is used to transfer data between caches, between a cache and a register, or between registers;
  • the Cambricon operation instruction is used to complete neural network arithmetic operations, and includes Cambricon matrix operation instructions, Cambricon vector operation instructions, and Cambricon scalar operation instructions; the Cambricon matrix operation instruction is used to complete matrix operations in the neural network, including matrix-multiply-vector, vector-multiply-matrix, matrix-multiply-scalar, outer product, matrix-add-matrix, and matrix-subtract-matrix operations; the Cambricon vector operation instruction is used to complete vector operations in the neural network, including vector basic operations, vector transcendental function operations, inner product operations, vector random generation operations, and maximum/minimum-of-a-vector operations; the Cambricon scalar operation instruction is used to complete scalar operations in the neural network, including scalar basic operations and scalar transcendental function operations;
  • the Cambricon logic instruction is used for the logical operations of the neural network, and includes Cambricon vector logic operation instructions and Cambricon scalar logic operation instructions; the Cambricon vector logic operation instructions are used for vector comparison operations, vector logic operations, and vector greater-than-merge operations, wherein the vector logic operations include AND, OR, and NOT; the Cambricon scalar logic operation instruction is used for scalar comparison operations and scalar logic operations.
  • the Cambricon data transmission instruction supports one or more of the following data organization modes: a matrix, a vector, and a scalar;
  • the vector basic operations include vector addition, subtraction, multiplication, and division;
  • the vector transcendental function refers to a function that does not satisfy any polynomial equation having polynomial coefficients, including exponential functions, logarithmic functions, trigonometric functions, and inverse trigonometric functions;
  • the scalar basic operations include scalar addition, subtraction, multiplication, and division;
  • the scalar transcendental function refers to a function that does not satisfy any polynomial equation having polynomial coefficients, including exponential functions, logarithmic functions, trigonometric functions, and inverse trigonometric functions;
  • the vector comparison includes greater than, less than, equal to, greater than or equal to, less than or equal to, and not equal to;
  • according to the received weights, the weight dictionary, the neuron dictionary, and the input neurons, the output neurons are looked up in the operation codebook.
  • the weight dictionary includes weight positions and weight indexes;
  • the neuron dictionary includes input neurons and neuron indexes;
  • the operation codebook includes the weight indexes, the neuron indexes, and the results of the operations on the weights and input neurons.
  • searching for output neurons in the operation codebook according to the search control information, weights, and input neurons includes the following steps:
  • the operation result is looked up in the operation codebook according to the weight index and the neuron index to determine the output neuron.
  • the operation result includes the result of at least one of the following operations: addition, multiplication, and pooling, wherein the pooling includes average pooling, maximum pooling, and median pooling.
  • before receiving the weights, the input neurons, the instructions, the weight dictionary, the neuron dictionary, and the operation codebook, the method further includes the steps of: pre-processing the externally input information to obtain the weights, input neurons, instructions, weight dictionary, neuron dictionary, and operation codebook;
  • the method further includes the steps of: storing the weights, the input neurons, the instructions, the weight dictionary, the neuron dictionary, and the operation codebook, and receiving the output neurons; and caching the instructions, input neurons, weights, and output neurons.
  • after receiving the weights, the input neurons, the instructions, the weight dictionary, the neuron dictionary, and the operation codebook, the method further includes the step of: caching the weight indexes and the neuron indexes.
  • the pre-processing includes segmentation, Gaussian filtering, binarization, regularization, and/or normalization.
  • the instruction is a neural network specific instruction
  • the neural network specific instruction includes:
  • a data transfer instruction for completing data transfer between different storage media, the data formats including a matrix, a vector, and a scalar;
  • an operation instruction for completing the arithmetic operations of the neural network, including the matrix operation instruction, vector operation instruction, scalar operation instruction, convolutional neural network operation instruction, fully connected neural network operation instruction, pooling neural network operation instruction, RBM neural network operation instruction, LRN neural network operation instruction, LCN neural network operation instruction, LSTM neural network operation instruction, RNN neural network operation instruction, RELU neural network operation instruction, PRELU neural network operation instruction, SIGMOID neural network operation instruction, TANH neural network operation instruction, and MAXOUT neural network operation instruction;
  • a logic instruction for completing the logical operations of the neural network, including vector logic operation instructions and scalar logic operation instructions.
  • the neural network dedicated instruction includes at least one Cambricon instruction
  • the Cambricon instruction includes an operation code and an operand
  • the Cambricon instruction includes:
  • Cambricon control instruction for controlling an execution process
  • the Cambricon control instruction includes a jump instruction and a conditional branch instruction
  • the Cambricon data transfer instruction is used to complete data transfer between different storage media, and includes a load instruction, a store instruction, and a move instruction; the load instruction is used to load data from main memory to a cache; the store instruction is used to store data from a cache to main memory; and the move instruction is used to transfer data between caches, between a cache and a register, or between registers;
  • the Cambricon operation instruction is used to complete neural network arithmetic operations, and includes Cambricon matrix operation instructions, Cambricon vector operation instructions, and Cambricon scalar operation instructions; the Cambricon matrix operation instruction is used to complete matrix operations in the neural network, including matrix-multiply-vector, vector-multiply-matrix, matrix-multiply-scalar, outer product, matrix-add-matrix, and matrix-subtract-matrix operations; the Cambricon vector operation instruction is used to complete vector operations in the neural network, including vector basic operations, vector transcendental function operations, inner product operations, vector random generation operations, and maximum/minimum-of-a-vector operations; the Cambricon scalar operation instruction is used to complete scalar operations in the neural network, including scalar basic operations and scalar transcendental function operations;
  • the Cambricon logic instruction is used for the logical operations of the neural network, and includes Cambricon vector logic operation instructions and Cambricon scalar logic operation instructions; the Cambricon vector logic operation instructions are used for vector comparison operations, vector logic operations, and vector greater-than-merge operations, wherein the vector logic operations include AND, OR, and NOT; the Cambricon scalar logic operation instruction is used for scalar comparison operations and scalar logic operations.
  • the Cambricon data transmission instruction supports one or more of the following data organization modes: a matrix, a vector, and a scalar;
  • the vector basic operations include vector addition, subtraction, multiplication, and division;
  • the vector transcendental function refers to a function that does not satisfy any polynomial equation having polynomial coefficients, including exponential functions, logarithmic functions, trigonometric functions, and inverse trigonometric functions;
  • the scalar basic operations include scalar addition, subtraction, multiplication, and division;
  • the scalar transcendental function refers to a function that does not satisfy any polynomial equation having polynomial coefficients, including exponential functions, logarithmic functions, trigonometric functions, and inverse trigonometric functions;
  • the vector comparison includes greater than, less than, equal to, greater than or equal to, less than or equal to, and not equal to;
  • Neural networks have been applied with great success, but their large-scale parameters place high demands on storage. On the one hand, the large number of neural network parameters requires huge storage capacity. On the other hand, accessing large amounts of neural network data brings huge memory-access energy consumption.
  • ECC: Error Correcting Code.
  • a storage device including:
  • a precise storage unit for storing the important bits of data, and an imprecise storage unit for storing the non-significant bits of data.
  • the precise storage unit uses ECC memory;
  • the imprecise storage unit uses non-ECC memory;
  • the data are neural network parameters, including input neurons, weights, and output neurons;
  • the precise storage unit is configured to store the important bits of the input neurons, the important bits of the output neurons, and the important bits of the weights; the imprecise storage unit is configured to store the non-significant bits of the input neurons, the non-significant bits of the output neurons, and the non-significant bits of the weights.
  • the data include floating point data and fixed point data; for floating point data, the sign bit and the exponent part are the important bits, and the mantissa part constitutes the non-significant bits;
  • for fixed point data, the sign bit and the first x bits of the numerical part are the important bits, and the remaining bits of the numerical part are the non-significant bits, where x is an integer greater than or equal to 0 and less than m, and m is the total number of bits of the data.
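
A hedged sketch of splitting a value into important and non-significant bits (the patent defines the parameter x for the fixed-point numerical part; reusing a top-x cut on the float32 mantissa here is an illustrative assumption):

```python
import struct

def split_float32(value, x=4):
    """Keep sign, exponent, and the top x mantissa bits as 'important';
    the remaining mantissa bits form the 'non-significant' part."""
    bits = struct.unpack('<I', struct.pack('<f', value))[0]
    low = (1 << (23 - x)) - 1          # mask of the low (23 - x) mantissa bits
    return bits & ~low & 0xFFFFFFFF, bits & low

def splice_float32(important, rest):
    """Recombine both parts into the complete float32 (used when the
    operation unit consumes spliced complete values)."""
    return struct.unpack('<f', struct.pack('<I', important | rest))[0]

imp, rest = split_float32(3.14159)
print(splice_float32(imp, rest))       # exact value restored
print(splice_float32(imp, 0))          # approximation from important bits only
```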
  • the ECC memory includes an ECC-checked DRAM and an ECC-checked SRAM; the ECC-checked SRAM uses a 6T SRAM, or a 4T SRAM or a 3T SRAM.
  • the non-ECC memory includes a non-ECC check DRAM and a non-ECC check SRAM; the non-ECC check SRAM uses 6T SRAM, or 4T SRAM or 3T SRAM.
  • in the 6T SRAM, the storage cell storing each bit includes 6 MOS transistors; in the 4T SRAM, the storage cell storing each bit includes 4 MOS transistors; in the 3T SRAM, the storage cell storing each bit includes 3 MOS transistors.
  • the four MOS transistors include a first MOS transistor, a second MOS transistor, a third MOS transistor, and a fourth MOS transistor; the first and second MOS transistors are used for gating, and the third and fourth MOS transistors are used for storage; the gate of the first MOS transistor is electrically connected to the word line WL and its source to the bit line BL; the gate of the second MOS transistor is electrically connected to the word line WL and its source to the bit line BLB; the gate of the third MOS transistor is connected to the source of the fourth MOS transistor and the drain of the second MOS transistor and is connected to the working voltage through resistor R2, and the drain of the third MOS transistor is grounded; the gate of the fourth MOS transistor is connected to the source of the third MOS transistor and the drain of the first MOS transistor and is connected to the working voltage through resistor R1, and the drain of the fourth MOS transistor is grounded; the WL is used to control gated access of the storage cell, and the BL is used for reading and writing the storage cell.
  • the three MOS transistors include a first MOS transistor, a second MOS transistor, and a third MOS transistor; the first MOS transistor is used for gating, and the second and third MOS transistors are used for storage, wherein:
  • the first MOS transistor gate is electrically connected to the word line WL
  • the source is electrically connected to the bit line BL
  • the second MOS transistor gate is connected to the third MOS transistor source, and is connected to the working voltage through the resistor R2.
  • a second MOS transistor drain is grounded;
  • a third MOS transistor gate is connected to the second MOS transistor source and the first MOS transistor drain, and is connected to the working voltage through the resistor R1, and the third MOS transistor drain is grounded;
  • the WL is used to control gated access of the storage cell, and the BL is used for reading and writing the storage cell.
  • a data processing apparatus including:
  • an arithmetic unit, an instruction control unit, and the storage device described above; the storage device is configured to receive input instructions and operation parameters, store the important bits of the operation parameters and the instructions in the precise storage unit, and store the non-important bits of the operation parameters in the imprecise storage unit;
  • the instruction control unit is configured to receive the instructions from the storage device and decode them to generate control information;
  • the operation unit is configured to receive the operation parameters from the storage device, perform the operation according to the control information, and transfer the operation result to the storage device.
  • the computing unit is a neural network processor.
  • the operation parameter is a neural network parameter
  • the operation unit is configured to receive the input neurons and weights from the storage device, complete the neural network operation according to the control information to obtain the output neurons, and transmit the output neurons to the storage device.
  • the operation unit is configured to receive the important bits of the input neurons and the important bits of the weights for calculation; or the operation unit is configured to receive complete input neurons and weights spliced from the important bits and the non-significant bits for calculation.
  • the apparatus further includes: an instruction cache, disposed between the storage device and the instruction control unit, for storing dedicated instructions; an input neuron hierarchical cache, disposed between the storage device and the operation unit, for caching the input neurons,
  • wherein the input neuron hierarchical cache includes an input neuron exact cache and an input neuron inexact cache;
  • a weight hierarchical cache, disposed between the storage device and the operation unit, for caching weight data,
  • wherein the weight hierarchical cache includes a weight exact cache and a weight inexact cache;
  • and an output neuron hierarchical cache, disposed between the storage device and the operation unit, for caching the output neurons, wherein the output neuron hierarchical cache includes an output neuron exact cache and an output neuron inexact cache.
  • a direct memory access unit DMA is further included for reading and writing data or instructions among the storage device, the instruction cache, the weight hierarchical cache, the input neuron hierarchical cache, and the output neuron hierarchical cache.
  • the instruction cache, the input neuron hierarchical cache, the weight hierarchical cache, and the output neuron hierarchical cache use 4T SRAM or 3T SRAM.
  • a preprocessing module is further included for preprocessing the input data and transmitting it to the storage device; the preprocessing includes segmentation, Gaussian filtering, binarization, regularization, and normalization.
  • the operation unit is a general purpose operation processor.
  • an electronic device including the data processing device described above.
  • a storage method comprising: accurately storing the important bits in data; and inexactly storing the non-significant bits in data.
  • accurately storing the important bits in the data specifically includes: extracting the important bits of the data and storing them in ECC memory for accurate storage.
  • inexactly storing the non-significant bits in the data specifically includes: extracting the non-significant bits of the data and storing them in non-ECC memory for inexact storage.
  • the data are neural network parameters, including input neurons, weights, and output neurons; the important bits of the input neurons, the important bits of the output neurons, and the important bits of the weights are accurately stored, while the non-significant bits of the input neurons, the non-significant bits of the output neurons, and the non-significant bits of the weights are stored inexactly.
  • the data include floating point data and fixed point data; for floating point data, the sign bit and the exponent part are the important bits, and the mantissa part constitutes the non-significant bits;
  • for fixed point data, the sign bit and the first x bits of the numerical part are the important bits, and the remaining bits of the numerical part are the non-significant bits, where x is an integer greater than or equal to 0 and less than m, and m is the total number of bits of the parameter.
  • the ECC memory includes an ECC-checked DRAM and an ECC-checked SRAM; and the ECC-checked SRAM uses a 6T SRAM, a 4T SRAM, or a 3T SRAM.
  • the non-ECC memory includes a non-ECC check DRAM and a non-ECC check SRAM; the non-ECC check SRAM uses 6T SRAM, 4T SRAM or 3T SRAM.
  • a data processing method including:
  • the operation is a neural network operation
  • the parameter is a neural network parameter
  • receiving the parameters, performing the operation according to the control information, and storing the operation result includes: receiving the input neurons and the weights, completing the neural network operation according to the control information to obtain the output neurons, and storing or outputting the output neurons.
  • receiving the input neurons and the weights and completing the neural network operation according to the control information to obtain the output neurons includes: receiving the important bits of the input neurons and the important bits of the weights for calculation; or receiving complete input neurons and weights spliced from the important bits and the non-significant bits for calculation.
  • the data processing method further includes: caching dedicated instructions; performing exact caching and inexact caching of the input neurons; performing exact caching and inexact caching of the weight data; and performing exact caching and inexact caching of the output neurons.
  • the operation is a general operation.
  • before receiving the instructions and the parameters, storing the important bits of the parameters and the instructions accurately, and storing the non-important bits of the parameters inexactly, the method further includes:
  • preprocessing the input data and storing it; the preprocessing includes segmentation, Gaussian filtering, binarization, regularization, and normalization.
  • a memory unit is provided, the memory unit being a 4T SRAM or a 3T SRAM for storing neural network parameters.
  • in the 4T SRAM, the storage cell storing each bit includes 4 MOS transistors; in the 3T SRAM, the storage cell storing each bit includes 3 MOS transistors.
  • the four MOS transistors include a first MOS transistor, a second MOS transistor, a third MOS transistor, and a fourth MOS transistor; the first and second MOS transistors are used for gating, and the third and fourth MOS transistors are used for storage; the gate of the first MOS transistor is electrically connected to the word line WL and its source to the bit line BL; the gate of the second MOS transistor is electrically connected to the word line WL and its source to the bit line BLB; the gate of the third MOS transistor is connected to the source of the fourth MOS transistor and the drain of the second MOS transistor and is connected to the working voltage through resistor R2, and the drain of the third MOS transistor is grounded; the gate of the fourth MOS transistor is connected to the source of the third MOS transistor and the drain of the first MOS transistor and is connected to the working voltage through resistor R1, and the drain of the fourth MOS transistor is grounded; the WL is used to control gated access of the storage cell, and the BL is used for reading and writing the storage cell.
  • the three MOS transistors include a first MOS transistor, a second MOS transistor, and a third MOS transistor; the first MOS transistor is used for gating, and the second and third MOS transistors are used for storage, wherein:
  • the first MOS transistor gate is electrically connected to the word line WL
  • the source is electrically connected to the bit line BL
  • the second MOS transistor gate is connected to the third MOS transistor source, and is connected to the working voltage through the resistor R2.
  • a second MOS transistor drain is grounded;
  • a third MOS transistor gate is connected to the second MOS transistor source and the first MOS transistor drain, and is connected to the working voltage through the resistor R1, and the third MOS transistor drain is grounded;
  • the WL is used to control gated access of the storage cell, and the BL is used for reading and writing the storage cell.
  • the neural network parameters include input neurons, weights, and output neurons.
  • DVFS: Dynamic Voltage and Frequency Scaling.
  • the DVFS technique dynamically adjusts the operating frequency and voltage of a chip (for the same chip, the higher the frequency, the higher the required voltage), thereby saving energy.
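
A minimal sketch of the DVFS idea (an illustrative policy, not the patent's controller; the operating points and the utilization-based rule are assumptions):

```python
def pick_operating_point(utilization, levels):
    """Choose the lowest (voltage, frequency) pair whose frequency still
    covers the observed demand; levels are sorted by ascending frequency."""
    demand = utilization * levels[-1][1]     # demand expressed in MHz
    for volt, freq in levels:
        if freq >= demand:
            return volt, freq
    return levels[-1]

levels = [(0.8, 400), (0.9, 800), (1.1, 1200)]   # hypothetical (V, MHz) points
print(pick_operating_point(0.3, levels))          # light load -> (0.8, 400)
```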
  • a dynamic voltage regulation and frequency modulation apparatus including:
  • an information collecting unit configured to collect, in real time, the working state information or application scenario information of a chip connected to the dynamic voltage regulation and frequency modulation apparatus, where the application scenario information is information obtained through neural network computation or collected by a sensor connected to the chip;
  • a voltage regulation and frequency modulation unit configured to send voltage frequency regulation information to the chip according to the working state information or the application scenario information of the chip, where the voltage frequency regulation information is used to instruct the chip to adjust its working voltage or working frequency.
  • the working state information of the chip includes an operating speed of the chip
  • the voltage frequency regulation information includes first voltage frequency regulation information
  • the voltage regulation and frequency modulation unit is configured to: send the first voltage frequency regulation information to the chip when the running speed of the chip is greater than the target speed, the first voltage frequency regulation information being used to instruct the chip to reduce its working frequency or working voltage,
  • where the target speed is the running speed of the chip when the user's demand is met.
  • the chip includes at least a first unit and a second unit, the output data of the first unit being the input data of the second unit; the working state information of the chip includes the operating speed of the first unit and the operating speed of the second unit, the voltage frequency regulation information includes second voltage frequency regulation information, and the voltage regulation and frequency modulation unit is further configured to:
  • send the second voltage frequency regulation information to the second unit when, according to the operating speeds of the first unit and the second unit, the running time of the first unit exceeds the running time of the second unit; the second voltage frequency regulation information is used to instruct the second unit to reduce its working frequency or working voltage.
  • the voltage frequency regulation information includes third voltage frequency regulation information,
  • and the voltage regulation and frequency modulation unit is further configured to:
  • send the third voltage frequency regulation information to the first unit when the running time of the second unit exceeds the running time of the first unit; the third voltage frequency regulation information is used to instruct the first unit to reduce its working frequency or working voltage. A sketch of this producer/consumer balancing follows below.
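
The producer/consumer balancing behind the second and third voltage frequency regulation information can be sketched as follows (the running-time comparison matches the conditions filled in above; the function name and return strings are illustrative):

```python
def pipeline_regulation(t_first, t_second):
    """The first unit feeds the second. Whichever unit finishes earlier
    idles while waiting, so it can be slowed down to save energy."""
    if t_first > t_second:
        return "second unit: lower frequency/voltage"   # 2nd regulation info
    if t_second > t_first:
        return "first unit: lower frequency/voltage"    # 3rd regulation info
    return "balanced: no regulation needed"

print(pipeline_regulation(t_first=2.0, t_second=1.2))
```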
  • the chip includes at least N units, and the working state information of the chip includes the working state information of at least S of the N units, where N is an integer greater than 1 and S is an integer less than or equal to N;
  • the voltage frequency regulation information includes fourth voltage frequency regulation information, and the voltage regulation and frequency modulation unit is configured to:
  • send the fourth voltage frequency regulation information to a unit A when, according to the working state information of unit A, unit A is determined to be in an idle state, the fourth voltage frequency regulation information being used to instruct unit A to reduce its working frequency or working voltage, where unit A is any one of the at least S units.
  • the voltage frequency regulation information includes fifth voltage frequency regulation information,
  • and the voltage regulation and frequency modulation unit is further configured to: send the fifth voltage frequency regulation information to unit A when unit A is determined to have returned to a working state, the fifth voltage frequency regulation information being used to instruct unit A to increase its working voltage or working frequency.
  • the application scenario of the chip is image recognition
  • the application scenario information is the number of objects in the image to be identified
  • the voltage frequency regulation information includes sixth voltage frequency regulation information.
  • the voltage regulation and frequency modulation unit is further configured to:
  • the application scenario information is object tag information
  • the voltage frequency regulation information includes seventh voltage frequency regulation information
  • the voltage regulation and frequency modulation unit is further configured to:
  • the chip is applied to voice recognition
  • the application scenario information is a voice input rate
  • the voltage frequency regulation information includes eighth voltage frequency regulation information
  • the voltage regulation and frequency modulation unit is further configured to:
  • the application scenario information is a keyword obtained by performing speech recognition on the chip
  • the voltage frequency regulation information includes ninth voltage frequency regulation information
  • the voltage regulation and frequency modulation unit is further configured to:
  • send the ninth voltage frequency regulation information to the chip, the ninth voltage frequency regulation information being used to instruct the chip to increase its working voltage or working frequency.
  • the chip is applied to machine translation
  • the application scenario information is a speed of text input or a number of characters in an image to be translated
  • the voltage frequency regulation information includes tenth voltage frequency regulation information.
  • the voltage regulation and frequency modulation unit is further configured to:
  • the application scenario information is ambient light intensity
  • the voltage frequency regulation information includes eleventh voltage frequency regulation information
  • the voltage regulation and frequency modulation unit is further configured to:
  • the eleventh voltage frequency regulation information is used to instruct the chip to reduce its working voltage or working frequency.
  • the chip is applied to image beautification;
  • the voltage frequency regulation information includes twelfth voltage frequency regulation information and thirteenth voltage frequency regulation information;
  • the voltage regulation and frequency modulation unit is further configured to:
  • if the application scenario information is a face image, send the twelfth voltage frequency regulation information to the chip, where the twelfth voltage frequency regulation information is used to instruct the chip to reduce its working voltage.
  • a dynamic voltage regulation and frequency modulation method including:
  • the working state information of the chip includes an operating speed of the chip
  • the voltage frequency regulation information includes first voltage frequency regulation information, and sending the voltage frequency regulation information to the chip according to the working state information or the application scenario information of the chip includes:
  • sending the first voltage frequency regulation information to the chip when the running speed of the chip is greater than the target speed, the first voltage frequency regulation information being used to instruct the chip to reduce its working frequency or working voltage, where the target speed is the running speed of the chip when the user's demand is met.
• the chip includes at least a first unit and a second unit, the output data of the first unit is the input data of the second unit, the working state information of the chip includes the operating speed of the first unit and the operating speed of the second unit, and the voltage frequency regulation information includes second voltage frequency regulation information; the sending the voltage frequency regulation information to the chip according to the working state information or the application scenario information of the chip further includes:
  • the second voltage frequency regulation information is used to instruct the second unit to reduce its operating frequency or operating voltage.
• the voltage frequency regulation information includes third voltage frequency regulation information
  • the sending the voltage frequency regulation information to the chip according to the working state information or the application scenario information of the chip further includes:
  • the third voltage frequency regulation information is used to indicate that the first unit reduces its operating frequency or operating voltage.
• the chip includes at least N units, and the working state information of the chip includes working state information of at least S units of the at least N units, where N is an integer greater than 1.
• the voltage frequency regulation information includes fourth voltage frequency regulation information, and the sending the voltage frequency regulation information to the chip according to the working state information or the application scenario information of the chip further includes:
  • the unit A is any one of the at least S units.
  • the voltage frequency regulation information includes fifth voltage frequency regulation information
  • the sending the voltage frequency regulation information to the chip according to the working state information or the application scenario information of the chip further includes:
  • the application scenario of the chip is image recognition
  • the application scenario information is the number of objects in the image to be identified
  • the voltage frequency regulation information includes sixth voltage frequency regulation information.
  • the sending the voltage frequency regulation information to the chip according to the working state information or the application scenario information of the chip further includes:
  • the application scenario information is object tag information
• the voltage frequency regulation information includes seventh voltage frequency regulation information, and the sending the voltage frequency regulation information to the chip according to the working state information or the application scenario information of the chip further includes:
  • the chip is applied to voice recognition
  • the application scenario information is a voice input rate
• the voltage frequency regulation information includes eighth voltage frequency regulation information, and the sending the voltage frequency regulation information to the chip according to the working state information or the application scenario information of the chip further includes:
  • the application scenario information is a keyword obtained by performing speech recognition on the chip
• the voltage frequency regulation information includes ninth voltage frequency regulation information, and the sending the voltage frequency regulation information to the chip according to the working state information or the application scenario information of the chip further includes:
  • the ninth voltage frequency regulation information is sent to the chip, and the ninth voltage frequency regulation information is used to indicate that the chip increases its working voltage or operating frequency.
  • the chip is applied to machine translation
  • the application scenario information is a speed of text input or a number of characters in an image to be translated
  • the voltage frequency regulation information includes tenth voltage frequency regulation information.
  • the sending the voltage frequency regulation information to the chip according to the working state information or the application scenario information of the chip further includes:
  • the application scenario information is an ambient light intensity
  • the voltage frequency regulation information includes eleventh voltage frequency regulation information
• the sending the voltage frequency regulation information to the chip according to the working state information or the application scenario information of the chip further includes:
• the eleventh voltage frequency regulation information is used to indicate that the chip reduces its working voltage or operating frequency.
  • the chip is applied to image beauty
  • the voltage frequency regulation information includes twelfth voltage frequency regulation information and thirteenth voltage frequency regulation information
• the sending the voltage frequency regulation information to the chip according to the working state information or the application scenario information of the chip further includes:
  • the application scenario information is a face image
• sending the twelfth voltage frequency regulation information to the chip, where the twelfth voltage frequency regulation information is used to indicate that the chip reduces its working voltage
• Dynamic Voltage Frequency Scaling (DVFS) is a dynamic voltage and frequency adjustment technology now widely used in the semiconductor field. DVFS adjusts the operating frequency and voltage of a chip (for the same chip, the higher the frequency, the higher the required voltage), thereby achieving energy savings.
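For orientation only, the following Python sketch shows the generic control loop behind this idea: a monitor compares a measured running speed against a target speed and steps the chip's frequency/voltage operating point up or down. The helper names (`read_speed`, `set_freq_voltage`), the level table, and the constants are hypothetical, not part of the claimed apparatus.

```python
# Illustrative DVFS control loop; all names and values are assumptions.
import time

TARGET_SPEED = 1000.0                     # speed that satisfies the user's demand
FREQ_LEVELS = [0.4, 0.6, 0.8, 1.0, 1.2]   # GHz; each level implies a matching voltage

def read_speed():
    """Stand-in for the information collecting unit: returns measured speed."""
    return 1100.0  # placeholder measurement

def set_freq_voltage(level):
    """Stand-in for applying a frequency/voltage operating point to the chip."""
    print(f"operating point -> {FREQ_LEVELS[level]} GHz")

def dvfs_step(level):
    speed = read_speed()
    if speed > TARGET_SPEED and level > 0:
        level -= 1        # faster than needed: lower frequency and voltage
    elif speed < TARGET_SPEED and level < len(FREQ_LEVELS) - 1:
        level += 1        # too slow: raise frequency and voltage
    set_freq_voltage(level)
    return level

level = len(FREQ_LEVELS) - 1
for _ in range(3):        # in a real chip this loop runs continuously
    level = dvfs_step(level)
    time.sleep(0.001)
```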
• a dynamic voltage regulation and frequency modulation method applied to a smart chip such as a convolution operation device, and a corresponding device design.
  • a convolution operation device includes: a dynamic voltage modulation and frequency modulation device, an instruction storage unit, a controller unit, a data access unit, an interconnection module, a main operation module, and N slave operation modules, N is an integer greater than 1, where:
  • the instruction storage unit is configured to store an instruction read by the data access unit
• the controller unit is configured to read an instruction from the instruction storage unit and translate the instruction into a control signal for controlling the behavior of other modules, where the other modules include the data access unit, the main operation module, and the N slave operation modules;
  • the data access unit is configured to perform data or instruction read and write operations between the external address space and the convolution operation device;
  • the N slave operation modules are configured to implement a convolution operation of the input data and the convolution kernel in the convolutional neural network algorithm
  • the interconnection module is configured to perform data transmission between the main operation module and the slave operation module;
• the main operation module is configured to splice the intermediate vectors of all input data into an intermediate result and perform subsequent operations on the intermediate result;
• the dynamic voltage regulation and frequency modulation device is configured to collect working state information of the convolution operation device and to send voltage frequency regulation information to the convolution operation device according to that working state information, where the voltage frequency regulation information is used to instruct the convolution operation device to adjust its operating voltage or operating frequency.
• the main operation module is further configured to add the intermediate result to the offset data and then perform an activation operation.
  • the N slave operation modules are specifically configured to calculate respective output scalars in parallel by using the same input data and respective convolution kernels.
• the activation function active used by the main operation module is any one of the nonlinear functions sigmoid, tanh, relu, and softmax.
• the interconnection module constitutes a data path for continuous or discretized data between the main operation module and the N slave operation modules, and the interconnection module is any one of a tree structure, a ring structure, a mesh structure, a hierarchical interconnection structure, and a bus structure.
  • the main operation module includes:
  • a first storage unit configured to buffer input data and output data used by the main operation module in the calculation process
  • a first operation unit configured to complete various computing functions of the main operation module
• a first data dependency determining unit, configured as the port through which the first operation unit reads and writes the first storage unit, to ensure the consistency of data read from and written to the first storage unit, and to read data from the first storage unit
  • each of the N slave computing modules includes:
  • a second operation unit configured to receive a control signal sent by the controller unit and perform an arithmetic logic operation
  • a second data dependency determining unit configured to perform read and write operations on the second storage unit and the third storage unit during the calculating process to ensure read and write consistency to the second storage unit and the third storage unit;
• a second storage unit configured to buffer the input data and the output scalar calculated by the slave operation module;
  • a third storage unit configured to cache a convolution kernel required by the slave computing module in the calculation process.
  • the first data dependency determining unit and the second data dependency determining unit ensure read and write consistency by:
  • the data access unit reads at least one of input data, offset data, and a convolution kernel from an external address space.
  • the dynamic voltage regulation and frequency modulation apparatus includes:
  • An information collecting unit configured to collect working state information of the convolution operation device in real time
• a voltage regulation and frequency modulation unit, configured to send voltage frequency regulation information to the convolution operation device according to the working state information of the convolution operation device, where the voltage frequency regulation information is used to instruct the convolution operation device to adjust its working voltage or working frequency.
  • the working state information of the convolution operation device includes an operating speed of the convolution operation device
  • the voltage frequency regulation information includes first voltage frequency regulation information
• the voltage regulation and frequency modulation unit is configured to:
• send the first voltage frequency regulation information to the convolution operation device when the operating speed of the convolution operation device is greater than a target speed, the first voltage frequency regulation information being used to instruct the convolution operation device to reduce its operating frequency or operating voltage, where the target speed is the operating speed of the convolution operation device when the user's demand is met.
  • the working state information of the convolution operation device includes an operating speed of the data access unit and an operating speed of the main computing module
  • the voltage frequency control information includes second voltage frequency control information.
• the voltage regulation and frequency modulation unit is further configured to:
  • the second voltage frequency regulation information is used to instruct the main operation module to reduce its operating frequency or operating voltage.
  • the voltage frequency regulation information includes third voltage frequency regulation information
• the voltage regulation and frequency modulation unit is further configured to:
  • the third voltage frequency regulation information is used to instruct the data access unit to reduce its operating frequency or operating voltage.
• the working state information of the convolution operation device includes working state information of at least S units/modules among the instruction storage unit, the controller unit, the data access unit, the interconnection module, the main operation module, and the N slave operation modules, where S is an integer greater than 1 and less than or equal to N+5
  • the voltage frequency regulation information includes fourth voltage frequency regulation information
  • the voltage regulation and frequency modulation unit is configured to:
  • the unit A is any one of the at least S units/modules.
  • the voltage frequency regulation information includes fifth voltage frequency regulation information
  • the voltage regulation and frequency modulation unit is further configured to:
  • a neural network processor comprising a convolution operation device as described above.
  • an electronic device comprising a neural network processor as described above.
  • a method for performing a forward operation of a single-layer convolutional neural network which is applied to the above-described convolution operation device, and includes:
• the controller unit reads an IO instruction from the first address of the instruction storage unit, and according to the decoded control signal, the data access unit reads all corresponding convolutional neural network operation instructions from the external address space and caches them in the instruction storage unit;
  • the controller unit then reads in the next IO instruction from the instruction storage unit, and according to the decoded control signal, the data access unit reads all data required by the main operation module from the external address space to the main operation module.
• the controller unit then reads the next IO instruction from the instruction storage unit, and according to the decoded control signal, the data access unit reads the convolution kernel data required by the slave operation modules from the external address space;
• the controller unit then reads the next CONFIG instruction from the instruction storage unit, and according to the decoded control signal, the convolution operation device configures the various constants required for the calculation of this layer of the neural network;
• the controller unit then reads the next COMPUTE instruction from the instruction storage unit, and according to the decoded control signal, the main operation module first sends the input data in the convolution window to the N slave operation modules through the interconnection module, where it is saved to the second storage units of the N slave operation modules, and then moves the convolution window according to the instruction;
• the operation units of the N slave operation modules read the convolution kernels from the third storage units, read the input data from the second storage units, complete the convolution operation of the input data and the convolution kernels, and return the obtained output scalars through the interconnection module;
• in the interconnection module, the output scalars returned by the N slave operation modules are successively spliced into a complete intermediate vector;
• the main operation module obtains the intermediate vectors returned by the interconnection module; after the convolution window has traversed all the input data, the main operation module splices all the returned vectors into an intermediate result, reads the offset data from the first storage unit according to the control signal decoded from the COMPUTE instruction, adds the offset data to the intermediate result through the vector addition unit, activates the result through the activation unit, and writes the final output data back to the first storage unit;
• the controller unit then reads the next IO instruction from the instruction storage unit, and according to the decoded control signal, the data access unit stores the output data in the first storage unit to the specified address in the external address space, and the operation ends.
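As a rough illustration of the COMPUTE data flow above (not the instruction-driven hardware itself), the following numpy sketch assumes 1-D input data, one convolution kernel per slave module, and ReLU standing in for the configured activation; the function and variable names are assumptions.

```python
import numpy as np

def single_layer_conv_forward(x, kernels, bias, window, stride=1):
    """Sketch: each 'slave module' computes one output scalar per window
    position with its own kernel; the 'main module' splices the returned
    scalars into intermediate vectors, adds the offset data, and activates."""
    intermediate = []
    for p in range(0, len(x) - window + 1, stride):    # convolution window moves
        patch = x[p:p + window]                        # broadcast via interconnection
        scalars = [np.dot(patch, k) for k in kernels]  # parallel in hardware
        intermediate.append(scalars)                   # splice into a vector
    out = np.asarray(intermediate) + bias              # add offset data
    return np.maximum(out, 0.0)                        # activation (ReLU assumed)

x = np.arange(8, dtype=float)
kernels = [np.ones(3), np.array([1.0, 0.0, -1.0])]     # N = 2 slave modules
print(single_layer_conv_forward(x, kernels, bias=0.1, window=3))
```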
  • the method further includes:
• sending voltage frequency regulation information to the convolution operation device according to the working state information of the convolution operation device, where the voltage frequency regulation information is used to instruct the convolution operation device to adjust its operating voltage or operating frequency.
  • the working state information of the convolution operation device includes an operating speed of the convolution operation device
• the voltage frequency regulation information includes first voltage frequency regulation information, and the sending the voltage frequency regulation information to the convolution operation device according to the working state information of the convolution operation device includes:
• sending the first voltage frequency regulation information to the convolution operation device when the operating speed of the convolution operation device is greater than a target speed, the first voltage frequency regulation information being used to instruct the convolution operation device to reduce its operating frequency or operating voltage, where the target speed is the operating speed of the chip when the user's demand is met.
  • the working state information of the convolution operation device includes an operating speed of the data access unit and an operating speed of the main computing module
  • the voltage frequency control information includes second voltage frequency control information.
  • the transmitting the voltage frequency regulation information to the convolution operation device according to the working state information of the convolution operation device further includes:
  • the second voltage frequency regulation information is used to instruct the main operation module to reduce its operating frequency or operating voltage.
• the voltage frequency regulation information includes third voltage frequency regulation information, and the sending the voltage frequency regulation information to the convolution operation device according to the working state information of the convolution operation device further includes:
  • the third voltage frequency regulation information is used to instruct the data access unit to reduce its operating frequency or operating voltage.
• the working state information of the convolution operation device includes working state information of at least S units/modules among the instruction storage unit, the controller unit, the data access unit, the interconnection module, the main operation module, and the N slave operation modules, where S is an integer greater than 1 and less than or equal to N+5
• the voltage frequency regulation information includes fourth voltage frequency regulation information, and the sending the voltage frequency regulation information to the convolution operation device according to the working state information of the convolution operation device further includes:
  • the unit A is any one of the at least S units/modules.
• the voltage frequency regulation information includes fifth voltage frequency regulation information, and the sending the voltage frequency regulation information to the convolution operation device according to the working state information of the convolution operation device further includes:
  • a method for performing a forward operation of a multi-layer convolutional neural network comprising:
• the operation instruction of this layer takes the output data address of the upper layer, stored in the main operation module, as the input data address of this layer, and the convolution kernel and offset data addresses in the instruction are changed to the addresses corresponding to this layer.
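A minimal sketch of this layer chaining, assuming a placeholder single-layer forward function; `layer_forward` and the per-layer parameter values are hypothetical.

```python
# The output of layer i becomes the input of layer i+1, with each layer's own
# kernel/offset parameters swapped in (address chaining modeled as data flow).
import numpy as np

def layer_forward(x, weight, bias):                  # placeholder layer
    return np.maximum(x * weight + bias, 0.0)

params = [(2.0, 0.1), (0.5, -0.2), (1.5, 0.0)]       # hypothetical per-layer values
data = np.array([1.0, -1.0, 3.0])                    # initial input data
for weight, bias in params:                          # previous output -> next input
    data = layer_forward(data, weight, bias)
print(data)
```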
• Images are the visual basis of human perception of the world and an important means by which humans acquire, express, and transmit information.
  • an image compression method including:
• acquiring an original image of a first resolution, where the original image is any training image in the compressed training atlas of a compressed neural network, and the label information of the original image is used as target label information;
  • the target original image is compressed based on the compressed neural network model to obtain a target compressed image of the second resolution.
  • the image compression method further includes:
• when the loss function does not converge to the first threshold, or the current training count of the compressed neural network is less than the second threshold, the target model is updated according to the loss function to obtain an updated model, the updated model is used as the target model, the next training image is used as the original image, and the step of acquiring an original image of the first resolution is performed again.
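The training control flow described above can be sketched as follows; `compress`, `recognize`, `loss_fn`, and `update` are trivial numeric stand-ins for the compressed and recognition neural networks, not real models, and the threshold values are assumptions.

```python
# Control-flow sketch of the compression training loop (all stand-in functions).
FIRST_THRESHOLD, SECOND_THRESHOLD = 0.01, 1000

def compress(model, image):            # stand-in for the compressed network
    return model * image

def recognize(image):                  # stand-in for the recognition network
    return image

def loss_fn(target, reference):
    return abs(target - reference)

def update(model, loss):               # stand-in for one training update
    return model - 0.1 * loss

def train_compressor(training_set, model):
    for count, (original, target_label) in enumerate(training_set, start=1):
        compressed = compress(model, original)   # first -> second resolution
        reference = recognize(compressed)        # reference label information
        loss = loss_fn(target_label, reference)
        if loss <= FIRST_THRESHOLD or count >= SECOND_THRESHOLD:
            return model                         # compressed NN model is ready
        model = update(model, loss)              # update, take next image
    return model

print(train_compressor([(1.0, 1.0), (2.0, 2.0)], model=1.2))
```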
  • the identifying the compressed image based on the identification neural network model, and obtaining the reference label information specifically includes:
• the pre-processing includes size processing
  • the pre-processing the compressed image to obtain the image to be identified specifically includes:
  • the compressed image is filled with pixels according to the basic image size to obtain the image to be recognized.
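A minimal sketch of this pixel filling, assuming zero-valued fill pixels and a 2-D grayscale image; the function name and fill value are assumptions.

```python
import numpy as np

def pad_to_basic_size(img, basic_h, basic_w, fill=0):
    """Pad an (H, W) image with fill pixels so it reaches the recognition
    network's basic image size; the original pixels are kept top-left."""
    h, w = img.shape
    if h >= basic_h and w >= basic_w:
        return img                   # already at least the basic size
    out = np.full((max(h, basic_h), max(w, basic_w)), fill, dtype=img.dtype)
    out[:h, :w] = img
    return out

print(pad_to_basic_size(np.ones((2, 3), dtype=np.uint8), 4, 4))
```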
  • the compressed training atlas includes at least an identification training atlas, and the method further includes:
  • the identification neural network is trained by using the identification training atlas to obtain the identification neural network model, and each training image in the identification training map set at least includes label information that is consistent with the type of the target label information.
  • the method further includes:
  • the compressed training atlas includes a plurality of dimensions
• the compressing of the original image by the target model to obtain a compressed image of the second resolution includes:
  • the original image is compressed based on the target model and the plurality of image information to obtain the compressed image.
  • an image compression apparatus includes a processor, a memory coupled to the processor, wherein:
• the memory is configured to store a first threshold, a second threshold, the current neural network model and training count of the compressed neural network, the compressed training atlas of the compressed neural network, the label information of each training image in the compressed training atlas, a recognition neural network model, and a compressed neural network model, with the current neural network model of the compressed neural network serving as a target model; the compressed neural network model is the corresponding target model when the training of the compressed neural network is completed, and the recognition neural network model is the corresponding neural network model when the training of the recognition neural network is completed;
  • the processor is configured to acquire an original image of a first resolution, where the original image is any training image in the compressed training map set, and label information of the original image is used as target label information;
• compress the original image based on the target model to obtain a compressed image of a second resolution, the second resolution being smaller than the first resolution; identify the compressed image based on the recognition neural network model to obtain reference label information; obtain a loss function according to the target label information and the reference label information; when the loss function converges to the first threshold, or the training count is greater than or equal to the second threshold, acquire a target original image of the first resolution and confirm that the target model is the compressed neural network model; and compress the target original image based on the compressed neural network model to obtain a target compressed image of the second resolution.
• the processor is further configured to: when the loss function does not converge to the first threshold, or the training count is less than the second threshold, update the target model according to the loss function to obtain an updated model, use the updated model as the target model, use the next training image as the original image, and perform the step of acquiring an original image of the first resolution.
• the processor is specifically configured to preprocess the compressed image to obtain an image to be identified, and to identify the image to be identified based on the recognition neural network model to obtain the reference label information.
  • the pre-processing includes size processing
  • the memory is further configured to store a basic image size of the recognition neural network
• the processor is specifically configured to, when the image size of the compressed image is smaller than the basic image size, fill the compressed image with pixels according to the basic image size to obtain the image to be recognized.
• the compressed training atlas includes at least an identification training atlas, and the processor is further configured to train the recognition neural network by using the identification training atlas to obtain the recognition neural network model, where each training image in the identification training atlas includes at least label information consistent with the type of the target label information.
• the processor is further configured to identify the target compressed image based on the recognition neural network model to obtain the label information of the target original image, and to store the label information of the target original image.
  • the compressed training atlas includes multiple dimensions
• the processor is specifically configured to identify the original image based on the target model to obtain multiple pieces of image information, each dimension corresponding to one piece of image information, and to compress the original image based on the target model and the multiple pieces of image information to obtain the compressed image.
• another electronic device includes a processor, a memory, a communication interface, and one or more programs, where the one or more programs are stored in the memory and configured to be executed by the processor, and the programs include instructions for some or all of the steps described in the image compression method above.
• a computer readable storage medium stores a computer program, the computer program includes program instructions, and the program instructions, when executed by a processor, cause the processor to perform the image compression method described above.
• the processing method and device, and the computing method and device, provided by the present application have at least the following advantages compared with the prior art:
• the neural network processor integrates a lookup-table-based calculation method, optimizes the table lookup operation, and simplifies the structure, reducing the memory access energy consumption and computation energy consumption of the neural network while also achieving diversified operations.
• the neural network can be retrained, and only the codebook needs to be trained during retraining, without training the weight dictionary, which simplifies the retraining operation.
• the dedicated neural network instructions and flexible operation units for locally quantized multi-layer artificial neural network operations solve the problems of insufficient operational performance of the central processing unit (CPU) and the graphics processing unit (GPU) and high front-end decoding overhead, effectively improving support for multi-layer artificial neural network algorithms.
  • FIG. 1A is a schematic flowchart of a processing method according to an embodiment of the present application.
  • FIG. 1B is a schematic diagram of a process for quantifying weights according to an embodiment of the present application.
  • FIG. 1C is a schematic diagram of a process for quantifying input neurons according to an embodiment of the present application.
  • FIG. 1D is a schematic diagram of a process for determining a computing codebook according to an embodiment of the present application.
• FIG. 1E is a schematic structural diagram of a processing apparatus according to an embodiment of the present application.
• FIG. 1F is a schematic structural diagram of an arithmetic device according to an embodiment of the present application.
  • FIG. 1G is a schematic structural diagram of an arithmetic device according to an embodiment of the present application.
  • FIG. 1H is a schematic flowchart diagram of an operation method according to an embodiment of the present application.
• FIG. 1I is a schematic flowchart diagram of another computing method according to a specific embodiment of the present disclosure.
  • FIG. 2A is a schematic structural diagram of a layered storage device according to an embodiment of the present application.
• FIG. 2B is a schematic structural diagram of a 4T SRAM memory unit according to an embodiment of the present application.
• FIG. 2C is a schematic structural diagram of a 3T SRAM memory unit according to an embodiment of the present application.
• FIG. 2D is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application.
• FIG. 2E is a schematic structural diagram of another data processing apparatus according to an embodiment of the present application.
• FIG. 2F is a flowchart of a data storage method according to an embodiment of the present application.
• FIG. 2G is a flowchart of a data processing method according to an embodiment of the present application.
  • FIG. 3A is a schematic structural diagram of a dynamic voltage regulation and frequency modulation apparatus according to an embodiment of the present application.
  • FIG. 3B is a schematic diagram of a dynamic voltage regulation and frequency modulation application scenario according to an embodiment of the present disclosure.
  • FIG. 3C is a schematic diagram of another dynamic voltage regulation and frequency modulation application scenario according to an embodiment of the present disclosure.
  • FIG. 3D is a schematic diagram of another dynamic voltage regulation and frequency modulation application scenario provided by an embodiment of the present application.
  • FIG. 3E is a schematic diagram of an implementation manner of an interconnection module 4 according to an embodiment of the present application.
  • FIG. 3F is a block diagram showing an example of a structure of a main operation module 5 in an apparatus for performing a forward operation of a convolutional neural network according to an embodiment of the present application.
  • FIG. 3G is a block diagram showing an example of a structure of a slave operation module 6 in an apparatus for performing a forward operation of a convolutional neural network according to an embodiment of the present application.
• FIG. 3H is a schematic flowchart of a dynamic voltage regulation and frequency modulation method according to an embodiment of the present application.
  • FIG. 4A is a schematic structural diagram of a convolution operation device according to an embodiment of the present application.
  • FIG. 4B is a block diagram showing an example of a structure of a main operation module in a convolution operation device according to an embodiment of the present application.
• FIG. 4C is a block diagram showing an example of a structure of a slave operation module in a convolution operation device according to an embodiment of the present application.
• FIG. 4D is a block diagram showing an example of a structure of a dynamic voltage regulation and frequency modulation apparatus in a convolution operation device according to an embodiment of the present application.
  • FIG. 4E is a schematic diagram of an implementation manner of the interconnect module 4 according to an embodiment of the present application.
  • FIG. 4F is a schematic structural diagram of another convolution operation device according to an embodiment of the present application.
  • FIG. 4G is a schematic flowchart of a method for performing a forward operation of a single-layer convolutional neural network according to an embodiment of the present application.
  • FIG. 5A is a schematic diagram of operations of a neural network according to an embodiment of the present application.
  • FIG. 5B is a schematic flowchart of an image compression method according to an embodiment of the present application.
  • FIG. 5C is a schematic diagram of a scenario of a size processing method according to an embodiment of the present application.
• FIG. 5D is a schematic flowchart of a single-layer neural network operation method according to an embodiment of the present application.
  • FIG. 5E is a schematic structural diagram of a reverse training device for performing a compressed neural network according to an embodiment of the present application.
• FIG. 5F is a schematic structural diagram of an H-tree module according to an embodiment of the present application.
  • FIG. 5G is a schematic structural diagram of a main operation module according to an embodiment of the present application.
• FIG. 5H is a schematic structural diagram of an operation module according to an embodiment of the present application.
  • FIG. 5I is a block diagram of an example of reverse training of a compressed neural network according to an embodiment of the present application.
• FIG. 5J is a schematic flowchart of an image compression method according to an embodiment of the present application.
  • FIG. 5K is a schematic structural diagram of an electronic device according to an embodiment of the present application.
• the present application provides a processing method and apparatus, and a computing method and apparatus.
• the processing method and device quantize the two kinds of input data, neurons and weights, respectively mining the inter-layer and inter-segment similarity of the data and the intra-layer and intra-segment local similarity of the data, so as to obtain the distribution characteristics of the two kinds of data and perform low-bit quantization, reducing the number of bits used to represent each datum and thereby reducing the data storage overhead and memory access overhead.
• the processing method and device realize the operations on the quantized neurons and weights through table lookup operations, reducing the memory access energy consumption and computation energy consumption of the neural network.
• the input neurons and output neurons mentioned in this application do not refer to the neurons in the input layer and the output layer of the entire neural network; rather, for any two adjacent layers in the network, the neurons in the lower layer of the network feedforward operation are the input neurons, and the neurons in the upper layer of the network feedforward operation are the output neurons.
  • FIG. 1A is a schematic flowchart of a processing method according to an embodiment of the present disclosure. As shown in FIG. 1A, the processing method includes:
• Step S1: quantizing the weights and the input neurons respectively, and determining the weight dictionary, the weight codebook, the neuron dictionary, and the neuron codebook;
  • the process of quantifying the weight includes the following steps:
• the weights are grouped, each group of weights is clustered by a clustering algorithm, and a group of weights is divided into m classes, where m is a positive integer; each class of weights corresponds to a weight index, and the weight dictionary is determined.
  • the weight dictionary includes a weight position and a weight index, and the weight position refers to a position of the weight in the neural network structure;
• all the weights of each class are replaced with a central weight, and the weight codebook is determined, where the weight codebook includes the weight indices and the center weights.
  • FIG. 1B is a schematic diagram of a process for quantifying weights according to an embodiment of the present application.
• the weights are grouped according to a preset grouping strategy to obtain an ordered weight matrix.
• intra-group sampling and clustering operations are performed on the grouped weight matrix, weights with similar values are classified into the same category, and the central weights of the four categories, calculated according to the loss function, are 1.50, -0.13, -1.3, and 0.23, each corresponding to the weights of one category.
  • the weight index of the category with a center weight of -1.3 is 00
  • the weight index of the category with a center weight of -0.13 is 01
• the weight index of the category with a center weight of 0.23 is 10, and the weight index of the category with a center weight of 1.50 is 11.
• the four weight indices (00, 01, 10, and 11) are then used to represent the weights in the corresponding categories, thereby obtaining the weight dictionary.
  • the weight dictionary also includes the weight position, that is, the position of the weight in the neural network structure.
• the weight position refers to the coordinate (p, q), i.e., the p-th row and q-th column; in the present embodiment, 1 ≤ p ≤ 4 and 1 ≤ q ≤ 4.
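For intuition, the following 1-D k-means sketch mirrors this grouping-and-clustering step on a 4x4 weight matrix like the one in the example; the initialization, iteration count, and function names are assumptions, not the patented procedure.

```python
import numpy as np

def quantize_weights(w, m=4, iters=20):
    """Cluster a weight matrix into m classes (1-D k-means), record the class
    index at every weight position (weight dictionary), and return the center
    weight of each class (weight codebook)."""
    flat = w.ravel()
    centers = np.linspace(flat.min(), flat.max(), m)   # assumed initialization
    for _ in range(iters):
        idx = np.argmin(np.abs(flat[:, None] - centers[None, :]), axis=1)
        for c in range(m):
            if np.any(idx == c):
                centers[c] = flat[idx == c].mean()     # minimizes squared distance
    dictionary = idx.reshape(w.shape)  # per-position class index (2 bits for m=4)
    return dictionary, centers

w = np.array([[ 1.40, -0.10,  0.20, -1.20],
              [ 1.60,  0.25, -1.40, -0.15],
              [ 0.20,  1.50, -0.10, -1.30],
              [-0.12,  0.26,  1.45, -1.25]])
dictionary, codebook = quantize_weights(w)
print(dictionary)   # weight dictionary: class index at each (p, q) position
print(codebook)     # weight codebook: center weight of each class
```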
• this quantization process fully exploits the inter-layer similarity and intra-layer local similarity of the neural network weights, and obtains the weight distribution characteristics of the neural network in order to perform low-bit quantization, reducing the number of bits used to represent each weight and thereby reducing the weight storage overhead and memory access overhead.
• the preset grouping strategy includes, but is not limited to, the following: grouping into one group, where all weights of the neural network are grouped into one group; layer-type grouping, where the weights of all convolution layers, the weights of all fully connected layers, and the weights of all long short-term memory network layers in the neural network are each grouped into one group; inter-layer grouping, where the weights of one or more convolution layers, the weights of one or more fully connected layers, and the weights of one or more long short-term memory network layers in the neural network are grouped into one group; and intra-layer grouping, where the weights within one layer of the neural network are segmented, and each segmented part is grouped into one group.
  • Clustering algorithms include K-means, K-medoids, Clara, and/or Clarans.
• the central weight of each class is chosen such that the cost function J(w, w0) is minimized, the value of w0 at the minimum being the center weight; the cost function can be the squared distance, J(w, w0) = Σ_{i=1}^{n} (w_i − w0)², where J is the cost function, w denotes all the weights in the class, w0 is the central weight, n is the number of weights in the class, w_i is the i-th weight in the class, 1 ≤ i ≤ n, and n is a positive integer. (For this cost function, the minimizing w0 is simply the mean of the weights in the class.)
  • the input neurons are quantized, which includes the steps of:
  • the input neurons are encoded, and all input neurons of each segment are replaced with a central neuron to determine a neuron codebook.
  • FIG. 1C is a schematic diagram of a process for quantifying input neurons according to an embodiment of the present application.
  • this embodiment uses a method for quantifying ReLU activation layer neurons as an example.
  • the ReLU function is segmented into four segments.
  • the central neurons representing the four segments are represented by 0.0, 0.2, 0.5, and 0.7, respectively, and the neuron index is represented by 00, 01, 10, and 11.
• a neuron codebook containing the neuron indices and the central neurons is generated, together with a neuron dictionary containing the neuron ranges and the neuron indices, where the neuron ranges and the neuron indices are stored correspondingly and x represents the value of an unquantized neuron. The quantization process of the input neurons can divide the input neurons into multiple segments according to actual needs, obtain an index for each segment to form the neuron dictionary, and then replace the input neurons in each segment with the central neurons in the neuron codebook according to the neuron indices. This fully exploits the similarity between input neurons and obtains the distribution characteristics of the input neurons for low-bit quantization, reducing the number of bits representing each input neuron and thereby reducing the storage overhead and memory access overhead of the input neurons.
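A small numpy sketch of this segment-based neuron quantization, using the central neurons and 2-bit indices from the example above; the segment boundaries chosen here are assumptions.

```python
import numpy as np

BOUNDARIES = np.array([0.1, 0.35, 0.6])            # assumed neuron range edges
CENTRAL_NEURONS = np.array([0.0, 0.2, 0.5, 0.7])   # neuron codebook values

def quantize_neurons(x):
    """Map each input neuron to its segment (neuron index), then replace it
    with the central neuron of that segment."""
    idx = np.searchsorted(BOUNDARIES, x)   # neuron dictionary lookup
    return idx, CENTRAL_NEURONS[idx]

idx, q = quantize_neurons(np.array([0.05, 0.30, 0.55, 0.90]))
print(idx)   # [0 1 2 3] -> 2-bit indices 00, 01, 10, 11
print(q)     # [0.  0.2 0.5 0.7]
```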
• Step S2: determining the operation codebook according to the weight codebook and the neuron codebook, which specifically includes the following steps:
• FIG. 1D is a schematic diagram of a process for determining an operation codebook according to an embodiment of the present application.
• the multiplication codebook is taken as an example in this embodiment; the codebook can also be an addition codebook, a pooling codebook, etc., and this application is not limited thereto.
• in the weight dictionary, the weight index corresponding to a weight and the center weight corresponding to that weight index are determined; in the neuron dictionary and neuron codebook, the neuron index corresponding to an input neuron and the central neuron corresponding to that index are determined; the neuron indices and the weight indices are used as the row indices and the column indices of the operation codebook, and the central neurons and the center weights are multiplied to form a matrix, so that the multiplication codebook is obtained.
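A sketch of building and querying such a multiplication codebook, using the center weights and central neurons from the examples above; the function name is hypothetical. The lookup of row 2, column 3 returning 0.046 matches the worked example later in the text.

```python
import numpy as np

central_neurons = np.array([0.0, 0.2, 0.5, 0.7])      # neuron codebook values
center_weights = np.array([-1.3, -0.13, 0.23, 1.5])   # weight codebook values

# Multiplication codebook: rows indexed by neuron index, columns by weight
# index; each entry is the precomputed product of the two center values.
mult_codebook = np.outer(central_neurons, center_weights)

def lookup_multiply(neuron_idx, weight_idx):
    """Replace a multiplication with a single table lookup."""
    return mult_codebook[neuron_idx, weight_idx]

print(lookup_multiply(1, 2))   # row 2, column 3: 0.2 * 0.23 = 0.046
```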
• Step S3 may further be included: retraining the weights and the input neurons, where only the weight codebook and the neuron codebook are trained during retraining while the contents of the weight dictionary and the neuron dictionary remain unchanged, which simplifies the retraining operation and reduces the workload.
  • the retraining employs a backpropagation algorithm.
  • FIG. 1E is a schematic structural diagram of a processing apparatus according to an embodiment of the present disclosure. As shown in FIG. 1E, the processing apparatus includes:
  • a memory 51 configured to store an operation instruction
  • the processor 52 is configured to execute an operation instruction in the memory 51, and perform an operation according to the foregoing processing method when the operation instruction is executed.
• the operation instruction may be a binary number including an operation code and an address code, where the operation code indicates the operation to be performed by the processor 52, and the address code instructs the processor 52 to read the data participating in the operation from its address in the memory 51.
• by executing the operation instructions in the memory 51, the processor 52 performs operations according to the foregoing processing method, quantizing disordered weights and input neurons into low-bit, normalized center weights and central neurons; by mining the local similarity of the weights and of the input neurons, the distribution characteristics of both are obtained, and low-bit quantization is performed according to these distribution characteristics, reducing the number of bits representing each weight and each input neuron and thereby reducing the storage overhead and memory access overhead of both.
  • FIG. 1F is a schematic structural diagram of an arithmetic device according to an embodiment of the present application.
  • the computing device includes: an instruction control unit 1 and a lookup table unit 2;
  • the instruction control unit 1 is configured to decode the received instruction to generate search control information
  • the lookup table unit 2 is configured to search for output neurons from the operation codebook according to the search control information generated by the instruction control unit 1 and the received weight dictionary, the neuron dictionary, the operation codebook, the weight and the input neurons.
• the weight dictionary includes weight positions (i.e., the position of a weight in the neural network structure, represented by (p, q), indicating the p-th row and q-th column) and weight indices;
  • the neuron dictionary includes an input neuron and a neuron index
  • the operational codebook includes a weight index, a neuron index, and an operation result of the input neuron and the weight.
• the specific working process of the lookup table unit is: determining the weight index according to the weight position of the weight in the weight dictionary; determining the neuron index according to the neuron range of the input neuron in the neuron dictionary; using the weight index and the neuron index as the column index and the row index of the operation codebook; and finding the value (operation result) at that row and column in the operation codebook, which is the output neuron.
• for example, the multiplication codebook is searched, and the value corresponding to the second row and the third column is 0.046, which is the output neuron.
  • pooling includes, but is not limited to, average pooling, maximum pooling, and median pooling.
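For clarity, a one-window numpy sketch of these three pooling variants; the function name and window values are illustrative.

```python
import numpy as np

def pool(window, kind="max"):
    """Average, maximum, or median pooling over one pooling window."""
    return {"avg": np.mean, "max": np.max, "median": np.median}[kind](window)

window = np.array([0.1, 0.7, 0.3, 0.5])
print(pool(window, "avg"), pool(window, "max"), pool(window, "median"))
```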
  • the lookup table may include at least one of the following according to different arithmetic operations:
• an addition lookup table, used to obtain, through the table lookup operation add_lookup, the sum of the central data corresponding to the input indices in; the index vector in and the data vector data are of length N, where N is a positive integer.
  • FIG. 1G is a schematic structural diagram of another computing device according to an embodiment of the present application.
• compared with the computing device in FIG. 1F, the computing device of this embodiment further includes a preprocessing unit 4, a storage unit 3, a cache unit 6, and a direct memory access unit 5, which optimize the processing of the present application and make the processing of data more orderly.
  • the pre-processing unit 4 is configured to pre-process the input information of the external input to obtain the weight, the input neuron, the instruction, the weight dictionary, the neuron dictionary, and the operation codebook, and the pre-processing includes but is not limited to segmentation, Gaussian filtering, binarization, regularization, and/or normalization.
  • the storage unit 3 is configured to store input neurons, weights, weights dictionary, neuron dictionary, operation codebook and instructions, and receive output neurons;
  • the cache unit 6 is configured to cache the instruction, the weight index, the neuron index, and the output neuron, and may include:
  • the instruction cache 61 is configured to buffer the instruction and output the cached instruction to the instruction control unit 1;
  • a weight buffer 62 configured to cache the weight, and output the cached weight to the lookup table unit 2;
  • the input neuron cache 63 is configured to buffer the input neuron and output the buffered input neuron to the lookup table unit 2;
  • the output neuron cache 64 is configured to cache the output neurons output by the lookup table unit 2, and output the buffered output neurons to the lookup table unit 2;
  • a neuron index cache 65 configured to determine a corresponding neuron index according to the input neuron, cache the neuron index, and output the cached neuron index to the lookup table unit 2;
  • the weight index cache 66 is configured to determine a corresponding weight index according to the weight, cache the weight index, and output the cached weight index to the lookup table unit 2.
  • the direct memory access unit 5 is configured to perform data or instruction reading and writing between the storage unit 3 and the cache unit 6.
  • the instruction may be a neural network specific instruction, including all instructions dedicated to completing an artificial neural network operation.
  • Neural network specific instructions include, but are not limited to, control instructions, data transfer instructions, arithmetic instructions, and logic instructions. Among them, the control command controls the execution process of the neural network.
  • Data transfer instructions complete the transfer of data between different storage media, including but not limited to matrices, vectors, and scalars.
• the arithmetic instruction completes the arithmetic operations of the neural network, including but not limited to the matrix operation instruction, vector operation instruction, scalar operation instruction, convolutional neural network operation instruction, fully connected neural network operation instruction, pooled neural network operation instruction, RBM neural network operation instruction, LRN neural network operation instruction, LCN neural network operation instruction, LSTM neural network operation instruction, RNN neural network operation instruction, RELU neural network operation instruction, PRELU neural network operation instruction, SIGMOID neural network operation instruction, TANH neural network operation instruction, and MAXOUT neural network operation instruction.
  • Logic instructions are used to perform logical operations on the neural network, including but not limited to vector logic operations instructions and scalar logic operation instructions.
  • the RBM neural network operation instruction is used to implement the Restricted Boltzmann Machine (RBM) neural network operation.
  • the LRN neural network operation instruction is used to implement the Local Response Normalization (LRN) neural network operation.
  • the LSTM neural network operation instructions are used to implement Long Short-Term Memory (LSTM) neural network operations.
  • the RNN neural network operation instruction is used to implement Recurrent Neural Networks (RNN) neural network operations.
  • the RELU neural network operation instruction is used to implement a Rectified linear unit (RELU) neural network operation.
  • the PRELU neural network operation instruction is used to implement Parametric Rectified Linear Unit (PRELU) neural network operations.
• the SIGMOID neural network operation instruction is used to implement S-type growth curve (SIGMOID) neural network operations.
  • the TANH neural network operation instruction is used to implement a hyperbolic tangent function (TANH) neural network operation.
• the MAXOUT neural network operation instruction is used to implement maxout (MAXOUT) neural network operations.
  • the neural network specific instruction includes a Cambricon instruction set, wherein the Cambricon instruction set includes at least one Cambricon instruction, and the Cambricon instruction has a length of 64 bits, and the Cambricon instruction includes an operation code and an operand.
• the Cambricon instruction set contains four types of instructions, namely Cambricon control instructions, Cambricon data transfer instructions, Cambricon operation instructions, and Cambricon logic instructions.
  • Cambricon control instructions are used to control the execution process.
• the Cambricon control instructions include a jump instruction and a conditional branch instruction.
  • Cambricon data transfer instruction is used to complete data transfer between different storage media.
  • Cambricon data transfer instructions include load instructions, store instructions, and move instructions.
  • the load instruction is used to load data from the main memory to the cache
  • the store instruction is used to store data from the cache to the main memory
• the move instruction is used to move data between caches, between a cache and a register, or between registers.
  • Data transfer instructions support three different ways of organizing data, including matrices, vectors, and scalars.
  • Cambricon operation instruction is used to perform neural network arithmetic operations.
  • Cambricon arithmetic instructions include Cambricon matrix operation instructions, Cambricon vector operation instructions, and Cambricon scalar operation instructions.
• the Cambricon matrix operation instruction is used to complete matrix operations in the neural network, including matrix multiply vector, vector multiply matrix, matrix multiply scalar, outer product, matrix add matrix, and matrix subtract matrix.
• the Cambricon vector operation instruction is used to perform vector operations in the neural network, including vector elementary arithmetics, vector transcendental functions, dot product, random vector generator, and maximum/minimum of a vector.
  • vector basic operations include vector addition, subtraction, multiplication, and division (add, subtract, multiply, divide), and vector transcendental functions are functions that do not satisfy any polynomial equation with polynomials as coefficients, including but not limited to exponential functions.
  • the Cambricon scalar instruction is used to perform scalar operations in neural networks, including scalar elementary arithmetics and scalar transcendental functions.
  • scalar basic operations include scalar addition, subtraction, multiplication, and division (add, subtract, multiply, divide), and scalar transcendental functions are functions that do not satisfy any polynomial equations with polynomials as coefficients, including but not limited to exponential functions.
  • Cambricon logic instructions are used to perform logical operations on the neural network.
  • Cambricon logic operations include Cambricon vector logic operations and Cambricon scalar logic operations.
  • the Cambricon vector logic operation instruction is used to complete vector comparison, vector logical operations, and vector greater than merge.
• the vector comparison includes but is not limited to greater than, less than, equal to, greater than or equal to, less than or equal to, and not equal to.
  • Vector logic operations include AND, OR, and NOT.
• the Cambricon scalar logic operation instruction is used to perform scalar comparison and scalar logical operations.
  • the scalar comparison includes but is not limited to greater than, less than, equal to, greater than or equal to, less than or equal to, and not equal to.
  • Scalar logic operations include AND, OR, and NOT.
  • FIG. 1H is a schematic flowchart of another operation method according to an embodiment of the present application. As shown in FIG. 1H, the operation method includes the following steps:
• receiving a weight, an input neuron, an instruction, a weight dictionary, a neuron dictionary, and an operation codebook, where the weight dictionary includes weight positions and weight indices, the neuron dictionary includes input neurons and neuron indices, and the operation codebook includes weight indices, neuron indices, and operation results of input neurons and weights.
  • Step S83 is similar to the specific working process of the lookup table unit, and specifically includes the following substeps:
• FIG. 1I is a schematic flowchart of another computing method according to an embodiment of the present application.
  • the calculation method includes the following steps:
  • Step S90 Preprocessing external input input information.
  • the pre-processing the input information of the external input specifically includes: obtaining a weight corresponding to the input information, an input neuron, an instruction, a weight dictionary, a neuron dictionary, and an operation codebook; and the preprocessing includes Segmentation, Gaussian filtering, binarization, regularization, or normalization.
  • Step S91 Receive the weight, the input neuron, the instruction, the weight dictionary, the neuron dictionary, and the operation codebook.
  • Step S92 storing the weight, the input neuron, the instruction, the weight dictionary, the neuron dictionary, and the operation codebook.
  • Step S93 buffering the weight, inputting a neuron, an instruction, a weight index, and a neuron index.
  • Step S94 decoding the instruction, and determining the search control information.
• Step S95: according to the weight, the input neuron, the weight dictionary, and the neuron dictionary, look up the neuron dictionary to determine the neuron index, and look up the weight position in the weight dictionary to determine the weight index.
• Step S96: searching for the operation result in the operation codebook according to the weight index and the neuron index, and determining the output neuron.
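• A minimal sketch of steps S95 and S96, assuming the dictionaries and the operation codebook are plain Python dicts; all names and values are hypothetical, and the actual device performs these lookups in a hardware lookup table unit:

```python
# Hypothetical quantization dictionaries: value -> index
weight_dictionary = {0.51: 0, 0.98: 1}
neuron_dictionary = {0.25: 0, 0.75: 1}
# Operation codebook: (weight index, neuron index) -> precomputed result
operation_codebook = {
    (0, 0): 0.1275, (0, 1): 0.3825,
    (1, 0): 0.2450, (1, 1): 0.7350,
}

def lookup_output_neuron(weight, input_neuron):
    w_idx = weight_dictionary[weight]          # step S95: determine the weight index
    n_idx = neuron_dictionary[input_neuron]    # step S95: determine the neuron index
    return operation_codebook[(w_idx, n_idx)]  # step S96: look up the operation result

print(lookup_output_neuron(0.98, 0.75))        # -> 0.735
```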
  • FIG. 2A is a schematic structural diagram of a hierarchical storage device according to an embodiment of the present disclosure.
• the device includes: a precise storage unit and an inexact storage unit, where the precise storage unit is used to store the important bits of the data, and the inexact storage unit is used to store the non-significant bits of the data.
• the precise storage unit uses error checking and correcting (ECC) memory, and the inexact storage unit uses non-ECC memory.
• the data stored by the tiered storage device is neural network parameters, including input neurons, weights, and output neurons; the precise storage unit stores the important bits of the input neurons, output neurons, and weights, and the inexact storage unit stores the non-significant bits of the input neurons, output neurons, and weights.
• the data stored by the tiered storage device includes floating-point data and fixed-point data. For floating-point data, the sign bit and the exponent portion are designated as important bits and the mantissa portion as non-significant bits; for fixed-point data, the sign bit and the first x bits of the value portion are designated as important bits and the remaining bits of the value portion as non-significant bits, where x is an integer greater than or equal to 0 and less than m, and m is the total number of bits of the fixed-point data. The important bits are stored in ECC memory for accurate storage, and the non-significant bits are stored in non-ECC memory for inexact storage.
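• A minimal sketch of this bit split for an m-bit fixed-point word, assuming m = 8 and x = 3 purely for illustration; the helper names are hypothetical and unsigned words are used for simplicity:

```python
def split_fixed_point(word, m=8, x=3):
    """Split an m-bit word: the top 1 + x bits (sign bit plus the first x
    value bits) are the important part destined for ECC memory; the low
    m - 1 - x bits are the non-significant part destined for non-ECC memory."""
    assert 0 <= word < (1 << m)
    low = m - 1 - x
    important = word >> low
    non_significant = word & ((1 << low) - 1)
    return important, non_significant

def splice(important, non_significant, m=8, x=3):
    """Reassemble the complete word when both parts are read back."""
    return (important << (m - 1 - x)) | non_significant

w = 0b10110110
hi, lo = split_fixed_point(w)
assert splice(hi, lo) == w
```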
• the ECC memory includes ECC-checked dynamic random access memory (DRAM) and ECC-checked static random access memory (SRAM);
  • the SRAM with ECC check uses 6T SRAM, and in other embodiments of the present application, 4T SRAM or 3T SRAM can also be used.
• the non-ECC memory includes non-ECC-checked DRAM and non-ECC-checked SRAM, and the non-ECC-checked SRAM uses 6T SRAM. In other embodiments of the present application, 4T SRAM or 3T SRAM may also be employed.
• the cell storing each bit in the 6T SRAM is composed of six MOS transistors (MOSFETs); the cell storing each bit in the 4T SRAM is composed of four MOS transistors; and the cell storing each bit in the 3T SRAM is composed of three MOS transistors.
• SRAMs that store neural network weights generally use 6T SRAM; although 6T SRAM has high stability, it has a large area and high read/write power consumption. The neural network algorithm has a certain fault tolerance that 6T SRAM cannot exploit. Therefore, in this embodiment, in order to fully exploit the fault tolerance of the neural network, 4T SRAM or 3T SRAM storage technology is used instead of 6T SRAM to increase SRAM storage density and reduce SRAM access power consumption, while the fault tolerance of the neural network algorithm is used to mask the weaker noise immunity of 4T SRAM.
  • FIG. 2B is a schematic structural diagram of a 4T SRAM memory cell according to an embodiment of the present application.
• the 4T SRAM memory cell is composed of four NMOS transistors: M1 (first MOS transistor), M2 (second MOS transistor), M3 (third MOS transistor), and M4 (fourth MOS transistor).
  • M1 and M2 are used for gating, and M3 and M4 are used for storage.
• the gate of M1 is electrically connected to the word line WL (Word Line) and its source to the bit line BL (Bit Line); the gate of M2 is electrically connected to the word line WL and its source to the bit line BLB; the connections of the gates of M3 and M4 form the cross-coupled storage structure shown in FIG. 2B.
• WL is used to control gated access to the memory cell, and BL is used to read and write the memory cell. When a read operation is performed, WL is pulled high and the bit is read from BL. When a write operation is performed, WL is pulled high and BL is pulled high or low; since the driving capability of BL is stronger than that of the memory cell, the original state is forcibly overwritten.
  • FIG. 2C is a schematic structural diagram of a 3T SRAM memory cell according to an embodiment of the present application.
• the 3T SRAM memory cell is composed of three MOS transistors: M1 (first MOS transistor), M2 (second MOS transistor), and M3 (third MOS transistor). M1 is used for gating, and M2 and M3 are used for storage.
• the gate of M1 is electrically connected to the word line WL (Word Line) and its source to the bit line BL (Bit Line); the gate of M2 is connected to the source of M3 and, through the resistor R2, to the operating voltage Vdd, and the drain of M2 is grounded; the gate of M3 is connected to the source of M2 and the drain of M1 and, through the resistor R1, to the operating voltage Vdd, and the drain of M3 is grounded.
• WL is used to control gated access to the memory cell, and BL is used to read and write the memory cell. When a read operation is performed, WL is pulled high and the bit is read from BL. When a write operation is performed, WL is pulled high and BL is pulled high or low; since the driving capability of BL is stronger than that of the memory cell, the original state is forcibly overwritten.
• the storage device of the present application adopts an approximate storage technology that fully exploits the fault tolerance of the neural network: the neural network parameters are stored approximately, with the important bits in the parameters stored accurately and the unimportant bits stored inaccurately, thereby reducing both storage overhead and memory-access energy costs.
  • FIG. 2D is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application.
  • the apparatus includes: an inaccurate arithmetic unit, an instruction control unit, and the hierarchical storage device described above.
• the tiered storage device receives the instruction and the operation parameters, stores the instruction and the important bits of the operation parameters in the precise storage unit, and stores the non-significant bits of the operation parameters in the inexact storage unit.
  • the instruction control unit receives the instruction in the tiered storage device and decodes the instruction to generate control information to control the inexact operation unit to perform the calculation operation.
  • the inaccurate operation unit receives the operation parameters in the tiered storage device, performs operations according to the control information, and transmits the operation results to the tiered storage device for storage or output.
  • the inaccurate arithmetic unit is a neural network processor.
  • the operation parameter is a neural network parameter
• the tiered storage device is used to store the neurons, weights, and instructions of the neural network: the important bits of the neurons, the important bits of the weights, and the instructions are stored in the precise storage unit, while the non-significant bits of the neurons and the non-significant bits of the weights are stored in the inexact storage unit.
  • the inexact computing unit receives the input neurons and weights in the tiered storage device, completes the neural network operation according to the control information to obtain the output neurons, and retransmits the output neurons to the tiered storage device for storage or output.
• the inaccurate operation unit may have two computation modes: (1) the inexact operation unit directly receives the important bits of the input neurons and the important bits of the weights from the precise storage unit of the tiered storage device for calculation; (2) the inexact operation unit receives complete input neurons and weights spliced from the important bits and the non-significant bits, where the important bits and non-significant bits of the input neurons and weights are spliced when read from the storage units.
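• A hedged sketch of the two computation modes, reusing the same hypothetical 8-bit split (m = 8, x = 3) on unsigned words; in the real apparatus the splicing happens inside the storage unit when the bits are read:

```python
def splice(hi, lo, m=8, x=3):
    # reassemble a complete word from important and non-significant bits
    return (hi << (m - 1 - x)) | lo

def mode1_multiply(n_hi, w_hi, m=8, x=3):
    # Mode (1): operate directly on the important bits from the precise
    # storage unit, treating the discarded low bits as zero (approximate).
    shift = m - 1 - x
    return (n_hi << shift) * (w_hi << shift)

def mode2_multiply(n_hi, n_lo, w_hi, w_lo, m=8, x=3):
    # Mode (2): splice the complete input neuron and weight, then operate.
    return splice(n_hi, n_lo, m, x) * splice(w_hi, w_lo, m, x)

print(mode1_multiply(0b1011, 0b0100))                    # approximate product
print(mode2_multiply(0b1011, 0b0110, 0b0100, 0b0011))    # exact product
```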
• in an embodiment, the data processing apparatus further includes a pre-processing module for pre-processing the input original data and transmitting it to the storage device; the pre-processing includes segmentation, Gaussian filtering, binarization, regularization, normalization, and so on.
• in an embodiment, the data processing apparatus further includes an instruction cache, an input neuron hierarchical cache, a weight hierarchical cache, and an output neuron hierarchical cache. The instruction cache is disposed between the hierarchical storage device and the instruction control unit for storing dedicated instructions. The input neuron hierarchical cache is disposed between the storage device and the inexact computing unit for buffering the input neurons, and includes an input neuron exact cache and an input neuron inexact cache, which respectively cache the important bits and non-significant bits of the input neurons. The weight hierarchical cache is disposed between the storage device and the inexact computing unit for buffering the weight data, and includes a weight exact cache and a weight inexact cache, which respectively cache the important bits and non-significant bits of the weights. The output neuron hierarchical cache is disposed between the storage device and the inexact computing unit for buffering the output neurons, and includes an output neuron exact cache and an output neuron inexact cache, which respectively cache the important bits and non-significant bits of the output neurons.
• the data processing apparatus further includes a direct memory access (DMA) unit for reading and writing data or instructions between the storage device, the instruction cache, the weight hierarchical cache, the input neuron hierarchical cache, and the output neuron hierarchical cache.
• the instruction cache, input neuron hierarchical cache, weight hierarchical cache, and output neuron hierarchical cache all use 4T SRAM or 3T SRAM.
• the inaccurate arithmetic unit includes, but is not limited to, three parts: a first-part multiplier, a second-part addition tree, and a third-part activation function unit.
• the first part multiplies input data 1 (in1) by input data 2 (in2) to obtain the multiplied output (out), the process being: out = in1 * in2; the second part adds the input data (in1) step by step through an addition tree to obtain the output data (out), where in1 is a vector of length N and N is greater than 1, the process being: out = in1[1] + in1[2] + ... + in1[N]; alternatively, the input data (in1) is accumulated through the addition tree and then added to the input data (in2) to obtain the output data (out), the process being: out = in1[1] + in1[2] + ... + in1[N] + in2.
• the third part passes the input data (in) through an activation function (active) to obtain the activation output data (out), the process being: out = active(in); in addition to the activation operation, the third part can pass the input data (in) through other nonlinear functions (f) to obtain the output data (out), the process being: out = f(in).
• for example, a pooling operation: out = pool(in), where pool is the pooling operation, including but not limited to average pooling, maximum pooling, and median pooling; the input data in is the data in a pooling kernel associated with the output out.
  • the non-precise operation unit performs operations including several parts.
  • the first part is to multiply the input data 1 and the input data 2 to obtain the multiplied data;
• the second part performs the addition tree operation, passing input data 1 through the addition tree step by step, or adding input data 1 step by step through the addition tree and then adding it to input data 2 to obtain the output data;
• the third part performs the activation function operation, passing the input data through an activation function (active) to obtain the output data.
  • the operations of the above parts can be freely combined to realize the operation of various functions.
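• The three parts can be sketched as composable Python functions, assuming a sigmoid activation purely for illustration; the function names are hypothetical:

```python
import math

def part1_multiply(in1, in2):
    # first part: out = in1 * in2 (element-wise)
    return [a * b for a, b in zip(in1, in2)]

def part2_addition_tree(in1, in2=None):
    # second part: accumulate in1 step by step, optionally adding in2
    total = sum(in1)
    return total + in2 if in2 is not None else total

def part3_activate(x, active=lambda v: 1.0 / (1.0 + math.exp(-v))):
    # third part: out = active(in), here a sigmoid by default
    return active(x)

# freely combining the parts yields e.g. a neuron: active(sum(w * x) + bias)
weights, inputs, bias = [0.5, -1.0], [2.0, 1.0], 0.1
print(part3_activate(part2_addition_tree(part1_multiply(weights, inputs), bias)))
```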
  • the data processing device of the present application can fully utilize the approximate storage technology, fully exploit the fault tolerance capability of the neural network, reduce the computational load of the neural network and the amount of neural network access, thereby reducing computational energy consumption and memory consumption.
• By adopting dedicated SIMD instructions for multi-layer artificial neural network operations and a customized operation unit, the problems of insufficient CPU and GPU operational performance and high front-end decoding overhead are solved, and support for the multi-layer artificial neural network operation algorithm is effectively improved;
• By adopting dedicated on-chip inexact-storage caches for the multi-layer artificial neural network operation algorithm, the importance of the input neuron and weight data is fully exploited, repeated reads of these data from memory are avoided, the memory access bandwidth is reduced, and memory bandwidth is prevented from becoming a performance bottleneck for multi-layer artificial neural network operations and their training algorithms.
• in other embodiments, the data processing apparatus may include a non-neural-network processor, such as a general-purpose arithmetic processor, which has corresponding general-purpose arithmetic instructions and data, for example scalar arithmetic operations and scalar logic operations; the general-purpose arithmetic processor includes, but is not limited to, one or more multipliers and one or more adders, and performs basic operations such as addition and multiplication.
• FIG. 2F is a flowchart of a data storage method according to an embodiment of the present application, including the following steps:
• S601: accurately store the important bits in the data.
• S602: inexactly store the non-significant bits in the data.
• specifically, the data storage method includes the following steps: the important bits in the data are stored in ECC memory for accurate storage, and the non-significant bits in the data are stored in non-ECC memory for inexact storage.
• the stored data is neural network parameters, and the bits representing a neural network parameter are divided into important bits and non-significant bits. For example, a parameter of a neural network has a total of m bits, of which n bits are important bits and (m - n) bits are non-significant bits, where m is an integer greater than 0, and n is an integer greater than 0 and less than or equal to m.
• the neural network parameters include input neurons, weights, and output neurons; the important bits of the input neurons, the important bits of the output neurons, and the important bits of the weights are stored accurately, while the non-significant bits of the input neurons, the non-significant bits of the output neurons, and the non-significant bits of the weights are stored inexactly.
• the data includes floating-point data and fixed-point data, wherein the sign bit and the exponent portion of the floating-point data are defined as important bits and the mantissa portion as non-significant bits; for fixed-point data, the sign bit and the first x bits of the value portion are important bits and the remaining bits of the value portion are non-significant bits, where x is an integer greater than or equal to 0 and less than m, and m is the total number of bits of the parameter.
• the ECC memory includes ECC-checked SRAM and ECC-checked DRAM; the non-ECC memory includes non-ECC-checked SRAM and non-ECC-checked DRAM; both the ECC-checked SRAM and the non-ECC-checked SRAM use 6T SRAM, and in other embodiments of the present application, 4T SRAM or 3T SRAM can also be used.
  • FIG. 2G is a flowchart of a data processing method according to an embodiment of the present application. As shown in FIG. 2G, the method includes:
• S1: receiving an instruction and parameters, accurately storing the instruction and the important bits of the parameters, and inaccurately storing the non-significant bits of the parameters;
• S2: decoding the stored instruction to generate control information;
• S3: receiving the parameters, performing operations according to the control information, and storing the operation results.
  • the above operation is a neural network operation, and the parameters are neural network parameters, including input neurons, weights, and output neurons.
  • Step S3 further includes: receiving the input neuron and the weight, completing the neural network operation according to the control information to obtain the output neuron, and storing or outputting the output neuron.
• the receiving the input neuron and the weight and completing the neural network operation according to the control information to obtain the output neuron includes: receiving the important bits of the input neuron and the important bits of the weight for calculation; or receiving complete input neurons and weights spliced from the important bits and the non-significant bits for calculation.
• the method further includes the following steps: caching dedicated instructions; performing exact and inexact caching of the input neurons; performing exact and inexact caching of the weight data; and performing exact and inexact caching of the output neurons.
• before step S1, the method further includes: pre-processing the parameters.
• a further embodiment of the present application is directed to a storage unit, which is a 4T SRAM or a 3T SRAM, for storing neural network parameters; the specific structure of the 4T SRAM is as shown in FIG. 2B, and the specific structure of the 3T SRAM is as shown in FIG. 2C, and will not be described here.
  • FIG. 3A is a schematic structural diagram of a dynamic voltage regulation and frequency modulation apparatus 100 according to an embodiment of the present application. As shown in FIG. 3A, the dynamic voltage regulation and frequency modulation apparatus 100 includes:
• the information collecting unit 101 is configured to collect, in real time, the working state information or application scenario information of the chip connected to the dynamic voltage regulation and frequency modulation apparatus, where the application scenario information is information obtained through the neural network or collected by sensors connected to the chip;
  • the voltage-adjusting and frequency-modulating unit 102 is configured to send voltage frequency regulation information to the chip according to the working state information or the application scenario information of the chip, where the voltage frequency regulation information is used to instruct the chip to adjust its working voltage or operating frequency.
  • the working state information of the chip includes an operating speed of the chip
  • the voltage frequency regulation information includes first voltage frequency regulation information
• the voltage regulating and frequency modulation unit 102 is configured to: send the first voltage frequency regulation information to the chip when the running speed of the chip is greater than a target speed, where the first voltage frequency regulation information is used to instruct the chip to reduce its operating voltage or operating frequency, and the target speed is the running speed of the chip when the user's demand is met.
  • the information collecting unit 101 collects the running speed of the chip connected thereto in real time.
• the running speed of the chip can be a different type of speed depending on the task performed by the chip: when the operation performed by the chip is video image processing, the running speed of the chip may be the frame rate of the video image processing; when the operation performed by the chip is voice recognition, the running speed of the chip is the speed at which voice information is recognized.
• when the voltage regulating and frequency modulation unit 102 determines that the running speed of the chip is greater than the target speed, that is, the chip already meets the user's demand, it sends the first voltage frequency regulation information to the chip to instruct the chip to reduce its operating voltage or operating frequency, thereby reducing the power consumption of the chip.
  • the operation performed by the above chip is video image processing, and the above target speed is 24 frames/second.
  • the information collecting unit collects the frame rate of the video image processing by the chip in real time, and the current frame rate of the video image processing by the chip is 54 frames/second.
• the voltage regulating and frequency modulation unit determines that the current frame rate of the video image processing of the chip is greater than the target speed, and sends the first voltage frequency regulation information to the chip to instruct the chip to reduce its operating voltage or operating frequency, thereby reducing the power consumption of the chip.
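• A minimal sketch of this rule, assuming an illustrative proportional scaling step; the target value matches the example above, but the regulation policy itself is an assumption:

```python
TARGET_FPS = 24.0  # target speed from the example above

def regulate(current_fps, voltage, frequency, step=0.05):
    """If the collected frame rate exceeds the target, emit lowered
    voltage/frequency settings (the first voltage frequency regulation
    information); otherwise leave the settings unchanged."""
    if current_fps > TARGET_FPS:
        return voltage * (1 - step), frequency * (1 - step)
    return voltage, frequency

print(regulate(54.0, 0.9, 1.0e9))  # 54 fps > 24 fps target -> scale down
```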
• in an embodiment, the chip includes at least a first unit and a second unit, the output data of the first unit is the input data of the second unit, the working state information of the chip includes the operating speed of the first unit and the operating speed of the second unit, and the voltage frequency regulation information includes second voltage frequency regulation information; the voltage regulating and frequency modulation unit 102 is further configured to: send the second voltage frequency regulation information to the second unit according to the operating speed of the first unit and the operating speed of the second unit, where the second voltage frequency regulation information is used to instruct the second unit to reduce its operating frequency or operating voltage.
  • the chip performing the task requires the cooperation of the first unit and the second unit, and the output data of the first unit is the input data of the second unit.
  • the information collecting unit 101 collects the operating speeds of the first unit and the second unit in real time.
• when the running speed of the second unit exceeds that of the first unit, so that the second unit must wait for the data output by the first unit, the voltage regulating and frequency modulation unit 102 sends the second voltage frequency regulation information to the second unit, instructing the second unit to lower its operating voltage or operating frequency, so as to reduce the power consumption of the whole chip without affecting its overall running speed.
• in an embodiment, the voltage frequency regulation information includes third voltage frequency regulation information, and the voltage regulating and frequency modulation unit 102 is further configured to: send the third voltage frequency regulation information to the first unit according to the running speeds of the first unit and the second unit, where the third voltage frequency regulation information is used to instruct the first unit to reduce its operating frequency or operating voltage.
• in an embodiment, the chip includes at least N units, and the working state information of the chip includes the working state information of at least S units among the at least N units, where N is an integer greater than 1 and S is an integer less than or equal to N.
  • the unit A is any one of the at least S units.
• the voltage frequency regulation information includes fourth voltage frequency regulation information and fifth voltage frequency regulation information, and the voltage regulating and frequency modulation unit 102 is further configured to:
  • the information collecting unit 101 collects the working state information of at least S units in the chip in real time.
• according to the working state information of the unit A, the voltage regulating and frequency modulation unit 102 sends the fourth voltage frequency regulation information to the unit A to instruct the unit A to lower its operating frequency or operating voltage, so that the power consumption of the unit A is reduced; or the voltage regulating and frequency modulation unit 102 sends the fifth voltage frequency regulation information to the unit A to instruct the unit A to raise its operating frequency or operating voltage, so that the operating speed of the unit A meets the needs of the work.
  • the application scenario of the chip is image recognition
  • the application scenario information is the number of objects in the image to be identified
  • the voltage frequency regulation information includes sixth voltage frequency regulation information.
  • the voltage regulating and frequency modulation unit 102 is also used to:
• the chip is applied to image recognition, the number of objects in the image to be identified is obtained by a neural network algorithm, and the information collecting unit 101 acquires the number of objects in the image to be identified (i.e., the above application scenario information) from the chip. When the voltage regulating and frequency modulation unit 102 determines that the number of objects in the image to be identified is less than a first threshold, it sends the sixth voltage frequency regulation information to the chip to instruct the chip to lower its operating voltage or operating frequency; when it determines that the number of objects in the image to be identified is greater than the first threshold, it sends voltage frequency regulation information to the chip to instruct the chip to raise its operating voltage or operating frequency.
  • the application scenario information is object tag information
  • the voltage frequency regulation information includes seventh voltage frequency regulation information
  • the voltage regulation and frequency modulation unit 102 is further configured to:
  • the preset object tag set includes a plurality of object tags, and the object tags may be “person”, “dog”, “tree”, and “flower”.
• when the chip determines by a neural network algorithm that the current application scenario includes a dog, the chip transmits the object tag information including "dog" to the information collecting unit 101; when the voltage regulating and frequency modulation unit 102 determines that the object tag information includes "dog", it sends the seventh voltage frequency regulation information to the chip to instruct the chip to raise its operating voltage or operating frequency; when it determines that the object tag information does not belong to the preset object tag set, the voltage regulating and frequency modulation unit 102 sends voltage frequency regulation information to the chip to instruct the chip to reduce its operating voltage or operating frequency.
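• A hedged sketch of the object-tag rule, using the tag set from the example above; the returned strings merely stand in for the seventh voltage frequency regulation information and its voltage-lowering counterpart:

```python
PRESET_TAGS = {"person", "dog", "tree", "flower"}

def regulate_by_tags(recognized_tags):
    """Raise voltage/frequency if any recognized tag is in the preset set,
    otherwise lower them (illustrative stand-ins for the regulation info)."""
    if PRESET_TAGS & set(recognized_tags):
        return "raise voltage/frequency"   # seventh voltage frequency regulation
    return "lower voltage/frequency"

print(regulate_by_tags(["dog", "car"]))    # -> raise voltage/frequency
print(regulate_by_tags(["car"]))           # -> lower voltage/frequency
```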
  • the chip is applied to voice recognition
  • the application scenario information is a voice input rate
  • the voltage frequency regulation information includes eighth voltage frequency regulation information
• the voltage regulating and frequency modulation unit 102 is further configured to:
  • the application scenario of the chip is voice recognition, and the input unit of the chip inputs voice to the chip at a certain rate.
  • the information collecting unit 101 collects the voice input rate in real time, and sends the voice input rate information to the voltage regulating and frequency modulation unit 102.
• when the voltage regulating and frequency modulation unit 102 determines that the voice input rate is less than the second threshold, the eighth voltage frequency regulation information is sent to the chip to instruct the chip to lower its operating voltage or operating frequency; when the voltage regulating and frequency modulation unit 102 determines that the voice input rate is greater than the second threshold, voltage frequency regulation information instructing the chip to raise its operating voltage or operating frequency is sent to the chip.
  • the application scenario information is a keyword obtained by performing speech recognition on the chip
  • the voltage frequency regulation information includes ninth voltage frequency regulation information
• the voltage regulating and frequency modulation unit 102 is further configured to:
• when the keyword belongs to a preset keyword set, the ninth voltage frequency regulation information is sent to the chip, where the ninth voltage frequency regulation information is used to instruct the chip to raise its operating voltage or operating frequency; when the keyword does not belong to the preset keyword set, the voltage regulating and frequency modulation unit 102 sends voltage frequency regulation information to the chip to instruct the chip to lower its operating voltage or operating frequency.
  • the application scenario of the above chip is speech recognition
  • the preset keyword set includes keywords such as “image beauty”, “neural network algorithm”, “image processing” and “Alipay”.
• when the keyword obtained by speech recognition belongs to the preset keyword set, the voltage regulating and frequency modulation unit 102 sends the ninth voltage frequency regulation information to the chip to instruct the chip to raise its operating voltage or operating frequency; otherwise, the voltage regulating and frequency modulation unit 102 sends voltage frequency regulation information to the chip to instruct the chip to lower its operating voltage or operating frequency.
  • the chip is applied to machine translation
  • the application scenario information is a speed of text input or a number of characters in an image to be translated
  • the voltage frequency regulation information includes tenth voltage frequency regulation information.
  • the voltage regulating and frequency modulation unit is further configured to:
  • the chip is applied to the machine translation, and the application scene information collected by the information collection unit 101 is the speed of the text input or the number of characters in the image to be translated, and the application scenario information is transmitted to the voltage modulation and frequency modulation unit 102.
• when the speed of text input or the number of characters in the image to be translated is low, the voltage regulating and frequency modulation unit 102 sends the tenth voltage frequency regulation information to the chip to instruct the chip to reduce its operating voltage or operating frequency; otherwise, the voltage regulating and frequency modulation unit 102 sends voltage frequency regulation information to the chip to instruct the chip to raise its operating voltage or operating frequency.
  • the application scenario information is ambient light intensity
  • the voltage frequency regulation information includes eleventh voltage frequency regulation information
  • the voltage regulation and frequency modulation unit is further configured to:
• the eleventh voltage frequency regulation information is used to instruct the chip to reduce its operating voltage or operating frequency.
  • the illumination intensity of the external environment is acquired by an illumination sensor connected to the chip.
  • the information collecting unit 101 transmits the light intensity to the voltage-modulating and frequency-modulating unit 102.
• when determining that the illumination intensity is less than the fifth threshold, the voltage regulating and frequency modulation unit 102 transmits the eleventh voltage frequency regulation information to the chip to instruct the chip to lower its operating voltage or operating frequency; when determining that the illumination intensity is greater than the fifth threshold, the voltage regulating and frequency modulation unit 102 transmits voltage frequency regulation information to the chip to instruct the chip to raise its operating voltage or operating frequency.
  • the chip is applied to image beauty
• the voltage frequency regulation information includes twelfth voltage frequency regulation information and thirteenth voltage frequency regulation information, and the voltage regulating and frequency modulation unit is further configured to: when the application scenario information is a face image, send the twelfth voltage frequency regulation information to the chip, where the twelfth voltage frequency regulation information is used to instruct the chip to raise its operating voltage or operating frequency; when the application scenario information is not a face image, send the thirteenth voltage frequency regulation information to the chip to instruct the chip to lower its operating voltage or operating frequency.
  • the chip is applied to voice recognition, and the application scenario information is voice strength.
• when the voice strength is greater than the sixth threshold, the voltage regulating and frequency modulation unit 102 sends voltage frequency regulation information to the chip to instruct the chip to reduce its operating voltage or operating frequency; when the voice strength is less than the sixth threshold, the voltage regulating and frequency modulation unit 102 sends voltage frequency regulation information to the chip to instruct the chip to raise its operating voltage or operating frequency.
  • the foregoing scene information may be information of an external scene collected by the sensor, such as light intensity, voice intensity, and the like.
• the application scenario information may also be information calculated according to an artificial intelligence algorithm; for example, in an object recognition task, the real-time calculation result information of the chip is fed back to the information collecting unit, where the information includes the number of objects in the scene, face images, object tags, keywords, and the like.
  • the artificial intelligence algorithm described above includes, but is not limited to, a neural network algorithm.
• the dynamic voltage regulation and frequency modulation apparatus collects, in real time, the working state information of the connected chip and each of its internal units, or the application scenario information of the chip, and adjusts the operating frequency or operating voltage of the chip or its internal units according to this information, thereby reducing the overall operating power consumption of the chip.
  • FIG. 3B is a schematic diagram of a dynamic voltage regulation and frequency modulation application scenario according to an embodiment of the present application.
  • the convolution operation device includes a dynamic voltage regulation and frequency modulation device 210, and a chip 220 connected to the dynamic voltage regulation and frequency modulation device.
  • the chip 220 includes a control unit 221, a storage unit 222, and an operation unit 223.
  • the chip 220 described above can be used for tasks such as image processing, voice processing, and the like.
  • the dynamic voltage-modulating and frequency-modulating device 210 collects the working state information of the chip 220 in real time.
  • the operational status information of the chip 220 includes the operating speed of the chip 220, the operating speed of the control unit 221, the operating speed of the storage unit 222, and the operating speed of the computing unit 223.
• when the dynamic voltage regulation and frequency modulation device 210 determines, according to the running speed of the storage unit 222 and the running speed of the operation unit 223, that the running time of the storage unit 222 exceeds the running time of the operation unit 223, the dynamic voltage regulation and frequency modulation device 210 can determine that the storage unit 222 becomes a bottleneck during the execution of the task: after the operation unit 223 completes the current operation, it must wait until the storage unit 222 finishes its read task and transmits the read data to the operation unit 223 before the operation unit 223 can perform an operation based on that data.
• the dynamic voltage regulation and frequency modulation device 210 transmits the first voltage frequency regulation information to the operation unit 223, where the first voltage frequency regulation information is used to instruct the operation unit 223 to lower its operating voltage or operating frequency, reducing the operating speed of the operation unit 223 so that the overall operating power consumption of the chip 220 is reduced without affecting the completion time of the task.
• when the dynamic voltage regulation and frequency modulation device 210 determines, according to the running speed of the storage unit 222 and the running speed of the operation unit 223, that the running time of the storage unit 222 is lower than the running time of the operation unit 223, it can determine that the operation unit 223 becomes a bottleneck during the execution of the task: after the storage unit 222 completes a data read, it must wait for the operation unit 223 to complete the current operation before transferring the read data to the operation unit 223.
• the dynamic voltage regulation and frequency modulation device 210 sends the second voltage frequency regulation information to the storage unit 222, where the second voltage frequency regulation information is used to instruct the storage unit 222 to lower its operating voltage or operating frequency, reducing the operating speed of the storage unit 222 so that the overall operating power consumption of the chip 220 is reduced without affecting the completion time of the task.
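• A hedged sketch of this bottleneck-matching rule, assuming a simple proportional frequency adjustment; the actual regulation policy of the apparatus is not specified here:

```python
def match_speeds(mem_time, compute_time, mem_freq, compute_freq):
    """Throttle whichever unit finishes earlier so its speed matches the
    slower (bottleneck) unit; returns new (mem_freq, compute_freq)."""
    if mem_time > compute_time:
        # storage unit is the bottleneck: slow the operation unit down
        compute_freq *= compute_time / mem_time
    elif compute_time > mem_time:
        # operation unit is the bottleneck: slow the storage unit down
        mem_freq *= mem_time / compute_time
    return mem_freq, compute_freq

print(match_speeds(mem_time=2.0, compute_time=1.0,
                   mem_freq=1.0e9, compute_freq=1.5e9))
```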
  • the dynamic voltage modulation and frequency modulation device 210 acquires the running speed of the chip 220 in real time.
• when the running speed of the chip 220 exceeds a target running speed, where the target running speed is an operating speed that can meet the user's demand, the dynamic voltage regulation and frequency modulation device 210 sends the third voltage frequency regulation information to the chip 220, where the third voltage frequency regulation information is used to instruct the chip 220 to lower its operating voltage or operating frequency, thereby reducing the operating power consumption of the chip 220.
  • the chip 220 is used for video processing.
• the frame rate of video processing required by the user under normal conditions is not less than 30 frames per second. Assuming the actual video processing frame rate of the chip 220 is 100 frames per second, the dynamic voltage regulation and frequency modulation device 210 sends voltage frequency regulation information to the chip 220 to instruct the chip 220 to lower its operating voltage or operating frequency, reducing the frame rate of video processing to about 30 frames per second.
  • the dynamic voltage modulation and frequency modulation device 210 monitors the working states of each unit (including the control unit 221, the storage unit 222, and the operation unit 223) in the chip 220 in real time.
• when the working state of a unit indicates that its running speed exceeds what the task requires, the fourth voltage frequency regulation information is sent to that unit to instruct it to lower its operating voltage or operating frequency, thereby reducing the power consumption of the chip 220; when the running speed of a unit fails to meet the working requirement, the dynamic voltage regulation and frequency modulation device 210 sends the fifth voltage frequency regulation information to that unit to raise its operating voltage or operating frequency, so that its running speed meets the working requirement. It can be seen that, in the solution of this embodiment of the application, the dynamic voltage regulation and frequency modulation device 210 acquires, in real time, the running speed information of the chip and each unit therein, and lowers the operating frequency or operating voltage of the chip or its internal units according to this information, thereby reducing the overall operating power consumption of the chip.
  • FIG. 3C is a schematic diagram of another dynamic voltage regulation and frequency modulation application scenario according to an embodiment of the present application.
• the convolution operation device includes a dynamic voltage regulation and frequency modulation device 317, a register unit 312, an interconnection module 313, an operation unit 314, a control unit 315, and a data access unit 316.
  • the operation unit 314 includes at least two of an addition calculator, a multiplication calculator, a comparator, and an activation operator.
  • the interconnecting module 313 is configured to control the connection relationship of the calculators in the computing unit 314 such that the at least two types of calculators form different computing topologies.
• the register unit 312 (which may be a register file, an instruction cache, or a scratch pad memory) is configured to store the operation instruction, the address of the data block in the storage medium, and the calculation topology corresponding to the operation instruction.
  • the convolution operation device further includes a storage medium 311.
  • the storage medium 311 may be an off-chip memory. Of course, in an actual application, it may also be an on-chip memory for storing data blocks.
• the control unit 315 is configured to extract an operation instruction, the operation domain corresponding to the operation instruction, and the first calculation topology corresponding to the operation instruction from the register unit 312, decode the operation instruction into an execution instruction used to control the operation unit 314 to perform the operation, transmit the operation domain to the data access unit 316, and transmit the first calculation topology to the interconnection module 313.
  • the data access unit 316 is configured to extract a data block corresponding to the operation domain from the storage medium 311, and transmit the data block to the interconnection module 313.
• the interconnection module 313 is configured to receive the data block and the first calculation topology, and to rearrange the data block according to the first calculation topology.
• the operation unit 314 is configured so that the execution instruction calls the calculators of the operation unit 314 to perform operations on the data block to obtain an operation result, which is transmitted to the data access unit 316 and stored in the storage medium 311. In an embodiment, the operation unit 314 is further configured to perform operations on the rearranged data block according to the first calculation topology and the execution instruction to obtain the operation result, and to transmit the operation result to the data access unit 316 for storage in the storage medium 311.
• the interconnection module 313 is further configured to form the first calculation topology by controlling the connection relationship of the calculators in the operation unit 314.
  • the dynamic voltage regulation and frequency modulation device 317 is configured to monitor the working state of the entire convolution operation device and dynamically adjust its voltage and frequency.
  • the specific calculation method of the convolution operation device is described below by using different operation instructions.
• the operation instruction here is exemplified by a convolution calculation instruction, which can be applied to a neural network, so the convolution calculation instruction can also be called a convolutional neural network instruction.
• the formula that it actually needs to execute can be: S = s(Σ W × xi + b), in which the convolution kernel W (which may include a plurality of data) is multiplied by the input data xi and the products are summed, then the bias b is optionally added, and the activation operation s(h) is then optionally performed to obtain the final output result S.
• from this formula, the calculation topology can be obtained as: multiplier - adder - (optional) activation operator.
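• A worked sketch of the formula S = s(Σ W × xi + b), assuming a sigmoid for the optional activation s; the function is hypothetical and mirrors the multiplier - adder - activation pipeline:

```python
import math

def conv_instruction(kernel, window, b=0.0, activation=None):
    h = sum(w * x for w, x in zip(kernel, window))  # multiplier + addition tree
    h += b                                          # optional bias addition
    return activation(h) if activation else h      # optional activation s(h)

sigmoid = lambda v: 1.0 / (1.0 + math.exp(-v))
print(conv_instruction([0.2, -0.5, 0.3], [1.0, 2.0, 3.0],
                       b=0.1, activation=sigmoid))  # ~0.5498
```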
  • the convolution calculation instruction may include an instruction set including a convolutional neural network COMPUTE instruction having different functions, and a CONFIG instruction, an IO instruction, a NOP instruction, a JUMP instruction, and a MOVE instruction.
  • the COMPUTE instruction includes:
• a convolution operation instruction, according to which the convolution operation device extracts input data of a specified size and a convolution kernel from a specified address of a memory (preferably a scratch pad memory or a scalar register file), and performs a convolution operation in the convolution operation unit;
• a convolutional neural network sigmoid instruction, according to which the convolution operation device extracts input data of a specified size and a convolution kernel from a specified address of a memory (preferably a scratch pad memory or a scalar register file), performs a convolution operation in the convolution operation unit, and then applies sigmoid activation to the output;
• a convolutional neural network TanH instruction, according to which the convolution operation device extracts input data of a specified size and a convolution kernel from a specified address of a memory (preferably a scratch pad memory), performs a convolution operation in the convolution operation unit, and then applies TanH activation to the output;
• a convolutional neural network ReLU instruction, according to which the convolution operation device extracts input data of a specified size and a convolution kernel from a specified address of a memory (preferably a scratch pad memory), performs a convolution operation in the convolution operation unit, and then applies ReLU activation to the output;
• a convolutional neural network group instruction, according to which the convolution operation device extracts input data of a specified size and a convolution kernel from a specified address of a memory (preferably a scratch pad memory), divides them into groups, performs a convolution operation in the convolution operation unit, and then activates the output.
  • the CONFIG command is used to configure the various constants required for the current layer calculation before each layer of artificial neural network calculation begins.
  • the IO instruction is used to read the input data required for calculation from the external storage space and store the data back to the external space after the calculation is completed.
• the NOP instruction is used to clear the control signals in all control signal buffer queues of the current convolution operation device, ensuring that all instructions before the NOP instruction are completely executed; the NOP instruction itself does not contain any operation;
  • the JUMP instruction is used to control the jump of the next instruction address to be read from the instruction storage unit, and is used to implement the jump of the control flow;
• the MOVE instruction is used to move data at one address in the internal address space of the convolution operation device to another address in the internal address space; the process is independent of the operation unit and does not occupy resources of the operation unit during execution.
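• A behavioral sketch of dispatching this instruction set, with every handler reduced to a hypothetical stub; the real device decodes these instructions into hardware control signals rather than Python calls:

```python
def run(program):
    pc = 0
    while pc < len(program):
        op, *args = program[pc]
        if op == "JUMP":              # control-flow jump to a new address
            pc = args[0]
            continue
        elif op == "NOP":             # drains control-signal queues; no operation
            pass
        elif op == "CONFIG":          # set the constants for the current layer
            print("configure:", args)
        elif op in ("IO", "MOVE", "COMPUTE"):
            print(op, args)           # stand-ins for memory and compute behavior
        pc += 1

run([("CONFIG", {"layer": 0}), ("IO", "load"), ("COMPUTE", "conv"),
     ("NOP",), ("JUMP", 6)])
```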
  • the method for executing the convolution calculation instruction by the convolution operation device may specifically be:
• the control unit 315 extracts the convolution calculation instruction, the operation domain corresponding to the convolution calculation instruction, and the first calculation topology corresponding to the convolution calculation instruction (multiplier - adder - adder - activation operator) from the register unit 312, transmits the operation domain to the data access unit, and transmits the first calculation topology to the interconnection module.
• the data access unit 316 extracts the convolution kernel w and the offset b corresponding to the operation domain from the storage medium 311 (when b is 0, the offset b does not need to be extracted), and transmits the convolution kernel w and the offset b to the operation unit 314.
• the multiplier of the operation unit 314 multiplies the convolution kernel w by the input data Xi to obtain a first result, the first result is input to the adder and accumulated to obtain a second result, the second result is added to the offset b to obtain a third result, the third result is input to the activation operator to perform the activation operation to obtain the output result s, and the output result s is transmitted to the data access unit for storage in the storage medium.
• after any of the above steps, the output can be directly transferred to the data access unit and stored in the storage medium without performing the subsequent steps.
  • the step of performing the addition of the second result and the offset b to obtain the third result is optional, that is, when b is 0, this step is not required.
  • the order of addition and multiplication operations can be reversed.
  • the first result may include a result of a plurality of multiplication operations.
  • an embodiment of the present application provides a neural network processor including the above convolution operation device.
  • the above neural network processor is used to perform artificial neural network operations, and implements artificial intelligence applications such as speech recognition, image recognition, and translation.
• Case 1: in the process of performing the convolution operation, the dynamic voltage regulation and frequency modulation device 317 acquires the running speeds of the data access unit 316 and the operation unit 314 of the neural network processor in real time. When the dynamic voltage regulation and frequency modulation device 317 determines, according to these running speeds, that the running time of the data access unit 316 exceeds the running time of the operation unit 314, it can determine that the data access unit 316 becomes a bottleneck during the convolution operation: after completing the current convolution operation, the operation unit 314 must wait for the data access unit 316 to finish its read task and transfer the read data before it can perform the next convolution operation.
• the dynamic voltage regulation and frequency modulation device 317 sends the first voltage frequency regulation information to the operation unit 314, where the first voltage frequency regulation information is used to instruct the operation unit 314 to lower its operating voltage or operating frequency, reducing the operating speed of the operation unit 314 so that it matches the running speed of the data access unit 316; this reduces the power consumption of the operation unit 314, prevents the operation unit 314 from idling, and finally reduces the overall operating power consumption of the neural network processor without affecting the completion time of the task.
• Case 2: the dynamic voltage regulation and frequency modulation device 317 acquires the running speeds of the data access unit 316 and the operation unit 314 of the neural network processor in real time. When the dynamic voltage regulation and frequency modulation device 317 determines, according to these running speeds, that the running time of the operation unit 314 exceeds the running time of the data access unit 316, it can determine that the operation unit 314 becomes a bottleneck during the convolution operation: after completing the current data read, the data access unit 316 must wait for the operation unit 314 to complete the current convolution operation before transferring the read data.
• the dynamic voltage regulation and frequency modulation device 317 sends the second voltage frequency regulation information to the data access unit 316, where the second voltage frequency regulation information is used to instruct the data access unit 316 to lower its operating voltage or operating frequency, reducing the running speed of the data access unit 316 so that it matches the running speed of the operation unit 314; this reduces the power consumption of the data access unit 316, prevents the data access unit 316 from idling, and finally reduces the overall operating power consumption of the neural network processor without affecting the completion time of the task.
• Case 3: the dynamic voltage regulation and frequency modulation device 317 collects, in real time, the working parameters of the artificial intelligence application run by the neural network processor, and adjusts the operating voltage or operating frequency of the neural network processor according to the working parameters.
  • the artificial intelligence application may be video image processing, object recognition, machine translation, voice recognition, image beauty, and the like.
• for example, when the neural network processor performs video image processing, the dynamic voltage regulation and frequency modulation device 317 collects the frame rate of the video image processing in real time. When the frame rate exceeds a target frame rate, where the target frame rate is the video image processing frame rate normally required by the user, the dynamic voltage regulation and frequency modulation device 317 sends the third voltage frequency regulation information to the neural network processor, where the third voltage frequency regulation information is used to instruct the neural network processor to lower its operating voltage or operating frequency, reducing the power consumption of the neural network processor while still satisfying the user's normal video image processing requirements.
• Case 4: when the neural network processor performs speech recognition, the dynamic voltage regulation and frequency modulation device 317 collects the speech recognition speed of the neural network processor in real time. When the speech recognition speed of the neural network processor exceeds the speech recognition speed actually required by the user, the dynamic voltage regulation and frequency modulation device 317 sends the fourth voltage frequency regulation information to the neural network processor, where the fourth voltage frequency regulation information is used to instruct the neural network processor to lower its operating voltage or operating frequency, reducing the power consumption of the neural network processor while satisfying the user's normal speech recognition requirements.
• the dynamic voltage regulation and frequency modulation device 317 monitors, in real time, the working state of each unit or module in the neural network processor (including the storage medium 311, the register unit 312, the interconnection module 313, the operation unit 314, the control unit 315, and the data access unit 316).
• when the working state of a unit or module indicates that its running speed exceeds what the task requires, the dynamic voltage regulation and frequency modulation device 317 sends the fifth voltage frequency regulation information to that unit or module to lower its operating voltage or operating frequency, further reducing the power consumption of that unit or module; when the running speed of a unit or module fails to meet the working requirement, the dynamic voltage regulation and frequency modulation device 317 sends the sixth voltage frequency regulation information to that unit or module to raise its operating voltage or operating frequency, so that its running speed meets the needs of the work.
  • FIG. 3D is a schematic diagram of another dynamic voltage-modulation frequency modulation application scenario provided by an embodiment of the present application.
  • the convolution operation device includes a dynamic voltage-modulating frequency modulation device 7, an instruction storage unit 1, a controller unit 2, a data access unit 3, an interconnection module 4, a main operation module 5, and a plurality of slave operation modules 6.
• the instruction storage unit 1, the controller unit 2, the data access unit 3, the interconnection module 4, the main operation module 5, and the slave operation modules 6 may all be implemented by hardware circuits (including but not limited to FPGAs, CGRAs, application-specific integrated circuits (ASICs), analog circuits, memristors, and the like).
  • the instruction storage unit 1 reads in an instruction through the data access unit 3 and stores the read instruction.
  • the controller unit 2 reads an instruction from the instruction storage unit 1, translates the instruction into a control signal that controls the behavior of other modules, and transmits it to other modules such as the data access unit 3, the main operation module 5, and the slave operation module 6.
  • the data access unit 3 can access the external address space, directly read and write data to and from the respective memory cells inside the convolution operation device, and complete data loading and storage.
  • the interconnect module 4 is used to connect the main operation module and the slave operation module, and can be implemented into different interconnection topologies (such as a tree structure, a ring structure, a grid structure, a hierarchical interconnection, a bus structure, etc.).
• the dynamic voltage regulation and frequency modulation device 7 is configured to acquire the working state information of the data access unit 3 and the main operation module 5 in real time, and to adjust the operating voltage or operating frequency of the data access unit 3 and the main operation module 5 according to their working state information.
  • an embodiment of the present invention provides a neural network processor including the above convolution operation device.
  • the above neural network processor is used to perform artificial neural network operations, and implements artificial intelligence applications such as speech recognition, image recognition, and translation.
  • the dynamic voltage modulation and frequency modulation device 7 works as follows:
• Case 1: While the neural network processor performs the convolution operation, the dynamic voltage regulation and frequency modulation device 7 acquires the running speeds of its data access unit 3 and main operation module 5 in real time.
• When the dynamic voltage regulation and frequency modulation device 7 determines from these running speeds that the running time of the data access unit 3 exceeds the running time of the main operation module 5, it can conclude that the data access unit 3 has become the bottleneck of the convolution operation. At this time, the main operation module 5 can only perform the convolution operation after the data access unit 3 has transmitted the data to it.
• The dynamic voltage regulation and frequency modulation device 7 sends first voltage frequency regulation information to the main operation module 5, instructing it to lower its working voltage or operating frequency so as to reduce its running speed. The running speed of the main operation module 5 is thereby matched to that of the data access unit 3, which reduces the power consumption of the main operation module 5 and avoids leaving it idle; in the end, the overall operating power consumption of the neural network processor is reduced without affecting the completion time of the task.
• Case 2: In the process of performing the convolution operation, the dynamic voltage regulation and frequency modulation device 7 acquires the running speeds of the data access unit 3 and the main operation module 5 of the neural network processor in real time. When the device 7 determines from these running speeds that the running time of the main operation module 5 exceeds the running time of the data access unit 3, it can conclude that the main operation module 5 has become the bottleneck of the convolution operation.
• The dynamic voltage regulation and frequency modulation device 7 then sends second voltage frequency regulation information to the data access unit 3, instructing it to lower its working voltage or operating frequency so as to reduce its running speed. The running speed of the data access unit 3 is thereby matched to that of the main operation module 5, which reduces the power consumption of the data access unit 3 and avoids leaving it idle; in the end, the overall operating power consumption of the neural network processor is reduced without affecting the completion time of the task.
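• Cases 1 and 2 together implement a speed-matching rule between a producer and a consumer. A minimal sketch, assuming measured running times per workload slice and a regulate() callback (both assumptions for illustration):

```python
# Speed matching between the data access unit 3 and the main operation
# module 5: whichever unit finishes first is slowed down so that neither
# idles while waiting for the other.
def match_speeds(t_data_access, t_main_op, regulate):
    if t_data_access > t_main_op:
        # Case 1: data access is the bottleneck; slow the main operation
        # module (first voltage frequency regulation information).
        regulate("main_operation_module", "down")
    elif t_main_op > t_data_access:
        # Case 2: the main operation module is the bottleneck; slow the data
        # access unit (second voltage frequency regulation information).
        regulate("data_access_unit", "down")
```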
• Case 3: The dynamic voltage regulation and frequency modulation device 7 collects in real time the working parameters of the neural network processor while it runs an artificial intelligence application, and adjusts the working voltage or operating frequency of the neural network processor according to those working parameters.
  • the artificial intelligence application may be video image processing, object recognition, machine translation, voice recognition, image beauty, and the like.
• For example, when the neural network processor performs video image processing, the dynamic voltage regulation and frequency modulation device 7 collects the frame rate of that video image processing in real time. When the frame rate exceeds the target frame rate, where the target frame rate is the video image processing frame rate normally required by the user, the device 7 sends third voltage frequency regulation information to the neural network processor. The third voltage frequency regulation information instructs the neural network processor to reduce its working voltage or operating frequency, reducing its power consumption while still satisfying the user's normal video image processing requirements.
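• As a concrete reading of Case 3 for video image processing, the sketch below scales down whenever the measured frame rate exceeds the user's target; the 30 fps target value is an assumption for illustration, not a value from this disclosure.

```python
TARGET_FPS = 30.0  # assumed frame rate normally required by the user

def regulate_by_frame_rate(measured_fps, send_regulation_info):
    # Third voltage frequency regulation information: reduce working voltage
    # or operating frequency once throughput exceeds what the user needs.
    if measured_fps > TARGET_FPS:
        send_regulation_info("down")
```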
• Case 4: When the neural network processor performs speech recognition, the dynamic voltage regulation and frequency modulation device 7 collects the speech recognition speed of the neural network processor in real time. When that speed exceeds the speech recognition speed actually required by the user, the device 7 sends fourth voltage frequency regulation information to the neural network processor, instructing it to reduce its working voltage or operating frequency, which reduces the power consumption of the neural network processor while still satisfying the user's normal speech recognition requirements.
• The dynamic voltage regulation and frequency modulation device 7 monitors and acquires in real time the working state information of each unit or module in the above neural network processor, including the instruction storage unit 1, the controller unit 2, the data access unit 3, the interconnection module 4, the main operation module 5, and the slave operation modules 6. When any unit or module of the neural network processor is in an idle state, the device 7 sends fifth voltage frequency regulation information to that unit or module to reduce its working voltage or operating frequency, further reducing its power consumption.
• When the unit or module returns to a working state, the dynamic voltage regulation and frequency modulation device 7 sends sixth voltage frequency regulation information to that unit or module to raise its working voltage or operating frequency, so that its operating speed meets the demands of the work.
• FIG. 3E schematically illustrates an embodiment of the interconnection module 4: an H-tree module.
• The interconnection module 4 constitutes a data path between the main operation module 5 and the plurality of slave operation modules 6, and has a binary tree structure composed of a plurality of nodes: each node transmits upstream data identically to its two downstream nodes, merges the data returned by its two downstream nodes, and returns the result to its upstream node.
• The neuron data in the main operation module 5 is sent to each slave operation module 6 through the interconnection module 4. After the calculation process of the slave operation modules 6 is completed, the value of the neuron output by each slave operation module is progressively assembled into a complete vector of neurons in the interconnection module 4. For example, if there are N slave operation modules in the device, the input data xi is sent to the N slave operation modules, each slave operation module convolves the input data xi with the convolution kernel corresponding to that module to obtain scalar data, and the interconnection module 4 merges the scalar data of all slave operation modules into an intermediate vector containing N elements.
• Assuming the convolution window traverses a total of A*B input data xi (A in the X direction and B in the Y direction, where X and Y are coordinate axes of the three-dimensional orthogonal coordinate system), the above convolution operation is performed for each of the A*B values xi, and all the resulting vectors are combined in the main operation module 5 to obtain a three-dimensional intermediate result of size A*B*N.
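• The reduction described above can be sketched as follows; numpy, the flattened kernels, and the toy shapes are assumptions for illustration, since the real H-tree performs the merge in hardware.

```python
import numpy as np

def intermediate_result(windows, kernels):
    """windows: array of shape (A, B, k), the input data xi at each of the
    A*B convolution-window positions; kernels: shape (N, k), one flattened
    convolution kernel per slave operation module."""
    A, B, _ = windows.shape
    N = kernels.shape[0]
    out = np.empty((A, B, N))
    for a in range(A):
        for b in range(B):
            xi = windows[a, b]        # the same xi is sent to all N slaves
            out[a, b] = kernels @ xi  # N scalars merged into one N-vector
    return out                        # three-dimensional A*B*N result
```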
  • FIG. 3F illustrates an example block diagram of the structure of the main operation module 5 in the apparatus for performing a convolutional neural network forward operation according to an embodiment of the present application.
  • the main operation module 5 includes a first operation unit 51, a first data dependency determination unit 52, and a first storage unit 53.
  • the first operation unit 51 includes a vector addition unit 511 and an activation unit 512.
• The first operation unit 51 receives the control signal from the controller unit 2 and completes the various operation functions of the main operation module 5. The vector addition unit 511 implements the bias-add operation in the forward calculation of the convolutional neural network: it adds the bias data to the intermediate result element by element to obtain a bias result. The activation operation unit 512 then performs the activation function operation on the bias result.
• The bias data may be read in from an external address space or stored locally.
• The first data dependency determination unit 52 is the port through which the first operation unit 51 reads and writes the first storage unit 53, and it ensures read/write consistency of the data in the first storage unit 53. The first data dependency determination unit 52 is also responsible for sending the data read from the first storage unit 53 to the slave operation modules 6 through the interconnection module 4; the output data of the slave operation modules 6 is sent directly to the first operation unit 51 through the interconnection module 4. The commands output by the controller unit 2 are sent to the first operation unit 51 and the first data dependency determination unit 52 to control their behavior.
• The first storage unit 53 is configured to buffer the input data and output data used by the main operation module 5 in the calculation process.
  • FIG. 3G illustrates an example block diagram of the structure of the slave arithmetic module 6 in an apparatus for performing a convolutional neural network forward operation in accordance with an embodiment of the present application.
• Each slave operation module 6 includes a second operation unit 61, a second data dependency determination unit 62, a second storage unit 63, and a third storage unit 64.
  • the second arithmetic unit 61 receives the control signal from the controller unit 2 and performs a convolution operation.
  • the second arithmetic unit includes a vector multiplication unit 611 and an accumulating unit 612, which are respectively responsible for the vector multiplication operation and the accumulation operation in the convolution operation.
• The second data dependency determination unit 62 is responsible for the read and write operations on the second storage unit 63 during the calculation process. Before performing a read or write, it first ensures that there is no read/write consistency conflict among the data used by the instructions. For example, all control signals sent to the second data dependency determination unit 62 are stored in an instruction queue inside the unit; if the range of data read by a read instruction in this queue conflicts with the range of data written by a write instruction earlier in the queue, the read instruction must wait until the write instruction it depends on has been executed.
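• The consistency rule amounts to an address-range overlap check against earlier writes still in the queue. A minimal sketch, in which the Instr record and the half-open address ranges are assumptions:

```python
from dataclasses import dataclass

@dataclass
class Instr:
    op: str     # "read" or "write"
    start: int  # first address touched
    end: int    # one past the last address touched

def may_issue(read, queue):
    """queue: instructions ahead of `read` in program order. The read may
    issue only if its range overlaps no earlier write's range."""
    for earlier in queue:
        overlap = read.start < earlier.end and earlier.start < read.end
        if earlier.op == "write" and overlap:
            return False  # conflict: wait until that write has executed
    return True
```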
  • the second storage unit 63 buffers the input data of the slave arithmetic module 6 and outputs scalar data.
  • the third storage unit 64 buffers the convolution kernel data required by the slave arithmetic module 6 in the calculation process.
• In summary, the dynamic voltage regulation and frequency modulation device collects in real time the running speed of the neural network processor and of its internal units and modules, and determines from those running speeds whether to lower the operating frequency or working voltage of the processor or of its internal units. This reduces the overall operating power consumption of the chip while still meeting the user's needs in actual work.
  • FIG. 3H is a schematic flowchart of a dynamic voltage regulation and frequency modulation method according to an embodiment of the present application. As shown in FIG. 3H, the method includes:
• The dynamic voltage regulation and frequency modulation device collects in real time the working state information or application scenario information of the chip connected to it, where the application scenario information is information obtained by the chip through neural network operations or information collected by sensors connected to the chip.
  • the dynamic voltage regulation and frequency modulation device sends voltage frequency regulation information to the chip according to the working state information or the application scenario information of the chip, where the voltage frequency regulation information is used to instruct the chip to adjust its working voltage or operating frequency.
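• The two steps of the method reduce to a collect-decide-send loop. The following sketch assumes callbacks for collection, decision, and delivery, all of which are illustrative:

```python
def dvfs_loop(collect_info, decide, send_to_chip):
    # Step 1: collect working state or application scenario information in
    # real time; Step 2: derive voltage frequency regulation information and
    # send it so the chip adjusts its working voltage or operating frequency.
    while True:
        info = collect_info()
        regulation = decide(info)   # e.g. "up", "down", or None
        if regulation is not None:
            send_to_chip(regulation)
```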
• The working state information of the chip includes the operating speed of the chip, the voltage frequency regulation information includes first voltage frequency regulation information, and sending the voltage frequency regulation information to the chip according to the working state information or application scenario information of the chip includes: when the operating speed of the chip is greater than a target speed, sending the first voltage frequency regulation information to the chip, the first voltage frequency regulation information being used to instruct the chip to reduce its operating frequency or working voltage, where the target speed is the running speed of the chip when the user's demand is met.
• The chip includes at least a first unit and a second unit, where the output data of the first unit is the input data of the second unit; the working state information of the chip includes the operating speed of the first unit and the operating speed of the second unit; and the voltage frequency regulation information includes second voltage frequency regulation information. Sending the voltage frequency regulation information to the chip according to the working state information or application scenario information of the chip further includes: when the running time of the first unit exceeds the running time of the second unit, sending the second voltage frequency regulation information to the second unit, the second voltage frequency regulation information being used to instruct the second unit to reduce its operating frequency or working voltage.
• The voltage frequency regulation information includes third voltage frequency regulation information, and sending the voltage frequency regulation information to the chip according to the working state information or application scenario information of the chip further includes: when the running time of the second unit exceeds the running time of the first unit, sending the third voltage frequency regulation information to the first unit, the third voltage frequency regulation information being used to instruct the first unit to reduce its operating frequency or working voltage.
• The chip includes at least N units, and the working state information of the chip includes the working state information of at least S of the N units, where N is an integer greater than 1 and S is an integer less than or equal to N. The voltage frequency regulation information includes fourth voltage frequency regulation information, and sending the voltage frequency regulation information to the chip according to the working state information of the chip further includes: when a unit A is determined to be in an idle state according to its working state information, sending the fourth voltage frequency regulation information to the unit A, the fourth voltage frequency regulation information being used to instruct the unit A to reduce its operating frequency or working voltage, where the unit A is any one of the at least S units.
• The voltage frequency regulation information includes fifth voltage frequency regulation information, and sending the voltage frequency regulation information to the chip according to the working state information or application scenario information of the chip further includes: when the unit A is determined to return to a working state according to its working state information, sending the fifth voltage frequency regulation information to the unit A, the fifth voltage frequency regulation information being used to instruct the unit A to raise its working voltage or operating frequency.
• When the application scenario of the chip is image recognition, the application scenario information is the number of objects in the image to be identified, the voltage frequency regulation information includes sixth voltage frequency regulation information, and the voltage regulation and frequency modulation unit is further configured to: when the number of objects in the image to be identified is less than a first threshold, send the sixth voltage frequency regulation information to the chip, the sixth voltage frequency regulation information being used to instruct the chip to reduce its working voltage or operating frequency.
• When the application scenario information is object tag information, the voltage frequency regulation information includes seventh voltage frequency regulation information, and the voltage regulation and frequency modulation unit is further configured to: when the object tag information belongs to a preset object tag set, send the seventh voltage frequency regulation information to the chip, the seventh voltage frequency regulation information being used to instruct the chip to raise its working voltage or operating frequency.
• When the chip is applied to voice recognition, the application scenario information is the voice input rate, the voltage frequency regulation information includes eighth voltage frequency regulation information, and the voltage regulation and frequency modulation unit is further configured to: when the voice input rate is lower than a threshold, send the eighth voltage frequency regulation information to the chip, the eighth voltage frequency regulation information being used to instruct the chip to reduce its working voltage or operating frequency.
• When the application scenario information is a keyword obtained by the chip performing voice recognition, the voltage frequency regulation information includes ninth voltage frequency regulation information, and the voltage regulation and frequency modulation unit is further configured to: when the keyword belongs to a preset keyword set, send the ninth voltage frequency regulation information to the chip, the ninth voltage frequency regulation information being used to instruct the chip to raise its working voltage or operating frequency.
• When the chip is applied to machine translation, the application scenario information is the speed of text input or the number of characters in the image to be translated, the voltage frequency regulation information includes tenth voltage frequency regulation information, and the voltage regulation and frequency modulation unit is further configured to: when the speed of text input or the number of characters in the image to be translated is lower than a threshold, send the tenth voltage frequency regulation information to the chip, the tenth voltage frequency regulation information being used to instruct the chip to reduce its working voltage or operating frequency.
• When the application scenario information is the ambient light intensity, the voltage frequency regulation information includes eleventh voltage frequency regulation information, and the voltage regulation and frequency modulation unit is further configured to: when the ambient light intensity is less than a threshold, send the eleventh voltage frequency regulation information to the chip, the eleventh voltage frequency regulation information being used to instruct the chip to reduce its working voltage or operating frequency.
• When the chip is applied to image beauty, the voltage frequency regulation information includes twelfth voltage frequency regulation information and thirteenth voltage frequency regulation information, and the voltage regulation and frequency modulation unit is further configured to: when the application scenario information is a face image, send the twelfth voltage frequency regulation information to the chip, the twelfth voltage frequency regulation information being used to instruct the chip to reduce its working voltage.
  • FIG. 4A is a schematic structural diagram of a convolution operation device according to an embodiment of the present application.
• The convolution operation device includes a dynamic voltage regulation and frequency modulation device 7, an instruction storage unit 1, a controller unit 2, a data access unit 3, an interconnection module 4, a main operation module 5, and N slave operation modules 6.
• The instruction storage unit 1, the controller unit 2, the data access unit 3, the interconnection module 4, the main operation module 5, and the N slave operation modules 6 may all be implemented by hardware circuits, including but not limited to FPGAs, CGRAs, application-specific integrated circuits (ASICs), analog circuits, and memristors.
  • the instruction storage unit 1 is configured to store an instruction read by the data access unit 3.
• The controller unit 2 is configured to read an instruction from the instruction storage unit 1, translate the instruction into control signals that control the behavior of other modules, and send those signals to the other modules, such as the data access unit 3, the main operation module 5, and the N slave operation modules 6.
  • the data access unit 3 is configured to perform data or instruction read and write operations between the external address space and the convolution operation device.
  • the data access unit 3 accesses the external address space, directly reads and writes data to each storage unit inside the device, and completes loading and storing of the data.
  • N slave arithmetic modules 6 are used to implement convolution operations of input data and convolution kernels in a convolutional neural network algorithm.
  • the N slave operation modules 6 are specifically configured to calculate respective output scalars in parallel by using the same input data and respective convolution kernels.
• The interconnection module 4 is configured to connect the main operation module 5 and the N slave operation modules 6, and can be implemented in different interconnection topologies (such as a tree structure, a ring structure, a grid structure, a hierarchical interconnection, or a bus structure).
  • the interconnection module 4 can implement data transmission between the main operation module 5 and the N slave operation modules 6.
  • the interconnection module 4 constitutes a data path of continuous or discretized data between the main operation module 5 and the N slave operation modules 6, and the interconnection module 4 is a tree structure, a ring structure, a grid structure, Any of a hierarchical interconnection and a bus structure.
  • the main operation module 5 is configured to splicing intermediate vectors of all input data into intermediate results, and performing subsequent operations on the intermediate results.
  • the main operation module 5 is further configured to add the intermediate result and the offset data, and then perform an activation operation.
  • the activation function active used by the main operation module is any nonlinear function of the nonlinear functions sigmoid, tanh, relu, and softmax.
  • the main operation module 5 includes:
  • the first storage unit 53 is configured to cache input data and output data used by the main operation module 5 in the calculation process
  • the first operation unit 51 is configured to complete various computing functions of the main operation module 5;
• The first data dependency determination unit 52 is the port through which the first operation unit 51 reads and writes the first storage unit 53; it ensures read/write consistency of the data in the first storage unit 53, reads the input neuron vector from the first storage unit 53 and sends it to the N slave operation modules 6 through the interconnection module 4, and sends the intermediate result vector from the interconnection module 4 to the first operation unit 51.
  • the slave operation module of each of the N slave operation modules 6 includes:
  • a second operation unit 61 configured to receive a control signal sent by the controller unit 2 and perform an arithmetic logic operation
• the second data dependency determination unit 62 is configured to perform read and write operations on the second storage unit 63 and the third storage unit 64 during the calculation process, ensuring read/write consistency of the second storage unit 63 and the third storage unit 64;
  • a second storage unit 63 configured to buffer input data and an output scalar calculated by the slave computing module
  • the third storage unit 64 is configured to buffer a convolution kernel required by the slave computing module in the calculation process.
• The first data dependency determination unit 52 and the second data dependency determination unit 62 ensure read/write consistency, for example by queuing control signals and holding back any read whose data range conflicts with that of an earlier, unfinished write.
  • the data access unit 3 reads at least one of input data, offset data, and a convolution kernel from an external address space.
• The main operation module 5 first delivers the input data to each of the N slave operation modules 6 through the interconnection module 4; the calculation process is then completed at the N slave operation modules 6.
• The interconnection module 4 progressively splices the output scalars of the N slave operation modules 6 into intermediate vectors and sends them back to the main operation module 5.
• The specific calculation method of the above convolution operation device is described below through different operation instructions. The operation instruction here is exemplified by a convolution calculation instruction; the convolution calculation instruction can be applied to a neural network, so it may also be called a convolutional neural network instruction.
• For the convolution calculation instruction, the formula that actually needs to be executed can be: s = s(Σ w·xi + b), in which the convolution kernel w (which may comprise a plurality of data) is multiplied by the input data xi, the products are summed, the offset b is then optionally added, and an activation operation s(h) is optionally applied to obtain the final output result S. According to this formula, the calculation topology can be obtained as: multiplier - adder - (optional) activation operator.
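• A worked numeric instance of this formula, with numpy, toy values, and a sigmoid chosen for the activation s(h) (all assumptions for illustration):

```python
import numpy as np

w = np.array([1.0, 0.5, -1.0])  # convolution kernel (may comprise many values)
x = np.array([2.0, 4.0, 1.0])   # input data xi under the convolution window
b = 0.5                         # optional offset

h = np.dot(w, x) + b            # multiplier and adder stages: 2 + 2 - 1 + 0.5 = 3.5
s = 1.0 / (1.0 + np.exp(-h))    # optional activation s(h); sigmoid(3.5) ~= 0.97
```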
• The convolution calculation instruction may belong to an instruction set that includes convolutional neural network COMPUTE instructions with different functions, as well as a CONFIG instruction, an IO instruction, a NOP instruction, a JUMP instruction, and a MOVE instruction.
  • the COMPUTE instruction includes:
• a convolution operation instruction, according to which the convolution operation device extracts input data of a specified size and a convolution kernel from specified addresses of a memory (preferably a scratch pad memory or a scalar register file) and performs the convolution operation in the convolution operation unit;
• a convolutional neural network sigmoid instruction, according to which the convolution operation device extracts input data of a specified size and a convolution kernel from specified addresses of a memory (preferably a scratch pad memory or a scalar register file), performs the convolution operation in the convolution operation unit, and then applies sigmoid activation to the output result;
• a convolutional neural network TanH instruction, according to which the convolution operation device extracts input data of a specified size and a convolution kernel from specified addresses of a memory (preferably a scratch pad memory), performs the convolution operation in the convolution operation unit, and then applies TanH activation to the output result;
• a convolutional neural network ReLU instruction, according to which the convolution operation device extracts input data of a specified size and a convolution kernel from specified addresses of a memory (preferably a scratch pad memory), performs the convolution operation in the convolution operation unit, and then applies ReLU activation to the output result; and
• a convolutional neural network group instruction, according to which the convolution operation device extracts input data of a specified size and a convolution kernel from specified addresses of a memory (preferably a scratch pad memory), divides them into groups, performs the convolution operation in the convolution operation unit, and then preferably activates the output result.
  • the CONFIG command is used to configure the various constants required for the current layer calculation before each layer of artificial neural network calculation begins.
  • the IO instruction is used to read the input data required for calculation from the external storage space and store the data back to the external space after the calculation is completed.
• The NOP instruction is responsible for clearing the control signals in all control signal buffer queues of the current device, ensuring that all instructions before the NOP instruction have completed; the NOP instruction itself does not contain any operation.
  • the JUMP instruction is used to control the jump of the next instruction address to be read from the instruction storage unit, and is used to implement the jump of the control flow;
• The MOVE instruction is used to move data at one address in the device's internal address space to another address in the internal address space; this process is independent of the operation unit and does not occupy the resources of the operation unit during execution.
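• One way to picture how the controller unit might dispatch this instruction set is the sketch below; the handler names and the state object are assumptions, not the disclosed microarchitecture.

```python
def dispatch(instr, state):
    # CONFIG/IO/NOP/JUMP/MOVE as described above; everything else is treated
    # as a COMPUTE variant handed to the operation unit.
    if instr.op == "CONFIG":
        state.constants = instr.args              # per-layer constants
    elif instr.op == "IO":
        state.data_access.transfer(instr.args)    # external <-> internal space
    elif instr.op == "NOP":
        state.control_queues.drain()              # barrier: all prior done
    elif instr.op == "JUMP":
        state.next_pc = instr.args["target"]      # control-flow jump
    elif instr.op == "MOVE":
        state.memory.copy(instr.args["src"], instr.args["dst"])
    else:
        state.operation_unit.run(instr)           # COMPUTE instruction
```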
  • the method for executing the convolution calculation instruction by the convolution operation device may specifically be:
• The controller unit 2 extracts from the instruction storage unit 1 the convolution calculation instruction, the operation domain corresponding to the convolution calculation instruction, and the first calculation topology corresponding to the convolution calculation instruction (multiplier - adder - (optional) activation operator); the controller unit transmits the operation domain to the data access unit 3 and transmits the first calculation topology to the interconnection module 4.
• The data access unit 3 extracts the convolution kernel w and the offset b corresponding to the operation domain from the external storage medium (when b is 0, the offset b does not need to be extracted), and transmits the convolution kernel w and the offset b to the main operation module 5.
  • the first result may include a result of a plurality of multiplication operations.
  • the dynamic voltage regulation and frequency modulation device 7 is configured to collect operation state information of the convolution operation device, and send voltage frequency regulation information to the convolution operation device according to the operation state information of the convolution operation device, where the voltage frequency regulation The information is used to instruct the convolution operation device to adjust its operating voltage or operating frequency.
  • the dynamic voltage regulation and frequency modulation device 7 includes:
  • the information collecting unit 71 is configured to collect the working state information of the convolution operation device in real time
• The voltage regulation and frequency modulation unit 72 is configured to send voltage frequency regulation information to the convolution operation device according to the working state information of the convolution operation device, the voltage frequency regulation information being used to instruct the convolution operation device to adjust its working voltage or operating frequency.
• The working state information of the convolution operation device includes the operating speed of the convolution operation device, the voltage frequency regulation information includes first voltage frequency regulation information, and the voltage regulation and frequency modulation unit 72 is configured to: when the operating speed of the convolution operation device is greater than a target speed, send the first voltage frequency regulation information to the convolution operation device, the first voltage frequency regulation information being used to instruct the convolution operation device to reduce its operating frequency or working voltage, where the target speed is the operating speed of the convolution operation device when the user's demand is met.
• The working state information of the convolution operation device includes the operating speed of the data access unit 3 and the operating speed of the main operation module 5, the voltage frequency regulation information includes second voltage frequency regulation information, and the voltage regulation and frequency modulation unit 72 is further configured to: when the running time of the data access unit 3 exceeds the running time of the main operation module 5, send the second voltage frequency regulation information to the main operation module 5, the second voltage frequency regulation information being used to instruct the main operation module 5 to reduce its operating frequency or working voltage.
• The voltage frequency regulation information includes third voltage frequency regulation information, and the voltage regulation and frequency modulation unit 72 is further configured to: when the running time of the main operation module 5 exceeds the running time of the data access unit 3, send the third voltage frequency regulation information to the data access unit 3, the third voltage frequency regulation information being used to instruct the data access unit 3 to reduce its operating frequency or working voltage.
• The working state information of the convolution operation device includes the working state information of at least S units/modules among the instruction storage unit 1, the controller unit 2, the data access unit 3, the interconnection module 4, the main operation module 5, and the N slave operation modules 6; the voltage frequency regulation information includes fourth voltage frequency regulation information; and the voltage regulation and frequency modulation unit 72 is configured to: when a unit A is determined to be in an idle state according to its working state information, send the fourth voltage frequency regulation information to the unit A, the fourth voltage frequency regulation information being used to instruct the unit A to reduce its operating frequency or working voltage, where the unit A is any one of the at least S units/modules.
• The voltage frequency regulation information includes fifth voltage frequency regulation information, and the voltage regulation and frequency modulation unit 72 is further configured to: when the unit A is determined to return to a working state according to its working state information, send the fifth voltage frequency regulation information to the unit A, the fifth voltage frequency regulation information being used to instruct the unit A to raise its working voltage or operating frequency.
  • an embodiment of the present invention provides a neural network processor including the above convolution operation device.
  • the above neural network processor is used to perform artificial neural network operations, and implements artificial intelligence applications such as speech recognition, image recognition, and translation.
• Case 1: While the neural network processor performs the convolution operation, the dynamic voltage regulation and frequency modulation device 7 in FIG. 4A acquires the running speeds of the data access unit 3 and the main operation module 5 of the neural network processor in real time.
• When the dynamic voltage regulation and frequency modulation device 7 determines from these running speeds that the running time of the data access unit 3 exceeds the running time of the main operation module 5, it can conclude that the data access unit 3 has become the bottleneck of the convolution operation. The main operation module 5 can only perform the convolution operation after the data access unit 3 has transmitted the data to it.
• The dynamic voltage regulation and frequency modulation device 7 sends first voltage frequency regulation information to the main operation module 5, instructing it to lower its working voltage or operating frequency so as to reduce its running speed. The running speed of the main operation module 5 is thereby matched to that of the data access unit 3, which reduces the power consumption of the main operation module 5 and avoids leaving it idle; in the end, the overall operating power consumption of the neural network processor is reduced without affecting the completion time of the task.
• Case 2: In the process of performing the convolution operation, the dynamic voltage regulation and frequency modulation device 7 in FIG. 4A acquires the running speeds of the data access unit 3 and the main operation module 5 of the neural network processor in real time. When the device 7 determines from these running speeds that the running time of the main operation module 5 exceeds the running time of the data access unit 3, it can conclude that the main operation module 5 has become the bottleneck of the convolution operation.
• The dynamic voltage regulation and frequency modulation device 7 then sends second voltage frequency regulation information to the data access unit 3, instructing it to lower its working voltage or operating frequency so as to reduce its running speed. The running speed of the data access unit 3 is thereby matched to that of the main operation module 5, which reduces the power consumption of the data access unit 3 and avoids leaving it idle; in the end, the overall operating power consumption of the neural network processor is reduced without affecting the completion time of the task.
• Case 3: The dynamic voltage regulation and frequency modulation device 7 in FIG. 4A collects in real time the working parameters of the neural network processor while it runs an artificial intelligence application, and adjusts the working voltage or operating frequency of the neural network processor according to those working parameters.
  • the artificial intelligence application may be video image processing, object recognition, machine translation, voice recognition, image beauty, and the like.
• For example, when the neural network processor performs video image processing, the dynamic voltage regulation and frequency modulation device 7 in FIG. 4A collects the frame rate of that video image processing in real time. When the frame rate exceeds the target frame rate, where the target frame rate is the video image processing frame rate normally required by the user, the device 7 sends third voltage frequency regulation information to the neural network processor. The third voltage frequency regulation information instructs the neural network processor to reduce its working voltage or operating frequency, reducing its power consumption while still satisfying the user's normal video image processing requirements.
• Case 4: When the neural network processor performs speech recognition, the dynamic voltage regulation and frequency modulation device 7 in FIG. 4A collects the speech recognition speed of the neural network processor in real time. When that speed exceeds the speech recognition speed actually required by the user, the device 7 sends fourth voltage frequency regulation information to the neural network processor, instructing it to reduce its working voltage or operating frequency, which reduces the power consumption of the neural network processor while still satisfying the user's normal speech recognition requirements.
• The dynamic voltage regulation and frequency modulation device 7 in FIG. 4A monitors and acquires in real time the working state information of each unit or module in the above neural network processor, including the instruction storage unit 1, the controller unit 2, the data access unit 3, the interconnection module 4, the main operation module 5, and the N slave operation modules 6. When any unit or module of the neural network processor is in an idle state, the device 7 sends fifth voltage frequency regulation information to that unit or module to reduce its working voltage or operating frequency, reducing its power consumption.
• When the unit or module returns to a working state, the dynamic voltage regulation and frequency modulation device 7 sends sixth voltage frequency regulation information to that unit or module to raise its working voltage or operating frequency, so that its operating speed meets the demands of the work.
• FIG. 4E schematically illustrates an embodiment of the interconnection module 4: an H-tree module.
• The interconnection module 4 constitutes a data path between the main operation module 5 and the plurality of slave operation modules 6, and has a binary tree structure composed of a plurality of nodes: each node transmits upstream data identically to its two downstream nodes, merges the data returned by its two downstream nodes, and returns the result to its upstream node.
• The neuron data in the main operation module 5 is sent to each slave operation module 6 through the interconnection module 4. After the calculation process of the slave operation modules 6 is completed, the value of the neuron output by each slave operation module is progressively assembled into a complete vector of neurons in the interconnection module 4. For example, if there are N slave operation modules in the convolution operation device, the input data xi is sent to the N slave operation modules, each slave operation module convolves the input data xi with the convolution kernel corresponding to that module to obtain scalar data, and the interconnection module 4 merges the scalar data of all slave operation modules into an intermediate vector containing N elements.
• Assuming the convolution window traverses a total of A*B input data xi (A in the X direction and B in the Y direction, where X and Y are coordinate axes of the three-dimensional orthogonal coordinate system), the above convolution operation is performed for each of the A*B values xi, and all the resulting vectors are combined in the main operation module 5 to obtain a three-dimensional intermediate result of size A*B*N.
  • FIG. 4B illustrates an example block diagram of the structure of the main operation module 5 in the apparatus for performing a convolutional neural network forward operation in accordance with an embodiment of the present application.
  • the main operation module 5 includes a first operation unit 51, a first data dependency determination unit 52, and a first storage unit 53.
  • the first operation unit 51 includes a vector addition unit 511 and an activation unit 512.
• The first operation unit 51 receives the control signal from the controller unit 2 in FIG. 4A and completes the various operation functions of the main operation module 5. The vector addition unit 511 implements the bias-add operation in the forward calculation of the convolutional neural network: it adds the bias data to the intermediate result element by element to obtain a bias result. The activation operation unit 512 then performs the activation function operation on the bias result.
• The bias data may be read in from an external address space or stored locally.
• The first data dependency determination unit 52 is the port through which the first operation unit 51 reads and writes the first storage unit 53, and it ensures read/write consistency of the data in the first storage unit 53. The first data dependency determination unit 52 is also responsible for sending the data read from the first storage unit 53 to the slave operation modules 6 through the interconnection module 4; the output data of the slave operation modules 6 is sent directly to the first operation unit 51 through the interconnection module 4. The commands output by the controller unit 2 are sent to the first operation unit 51 and the first data dependency determination unit 52 to control their behavior.
• The first storage unit 53 is configured to buffer the input data and output data used by the main operation module 5 in the calculation process.
  • FIG. 4C illustrates an example block diagram of the structure of the slave arithmetic module 6 in an apparatus for performing a convolutional neural network forward operation in accordance with an embodiment of the present application.
  • each slave arithmetic module 6 includes a second arithmetic unit 61, a second data dependency determining unit 62, a second storage unit 63, and a third storage unit 64.
• The second operation unit 61 receives the control signal from the controller unit 2 in FIG. 4A and performs the convolution operation.
  • the second arithmetic unit includes a vector multiplication unit 611 and an accumulating unit 612, which are respectively responsible for the vector multiplication operation and the accumulation operation in the convolution operation.
• The second data dependency determination unit 62 is responsible for the read and write operations on the second storage unit 63 during the calculation process. Before performing a read or write, it first ensures that there is no read/write consistency conflict among the data used by the instructions. For example, all control signals sent to the second data dependency determination unit 62 are stored in an instruction queue inside the unit; if the range of data read by a read instruction in this queue conflicts with the range of data written by a write instruction earlier in the queue, the read instruction must wait until the write instruction it depends on has been executed.
  • the second storage unit 63 buffers the input data of the slave arithmetic module 6 and outputs scalar data.
  • the third storage unit 64 buffers the convolution kernel data required by the slave arithmetic module 6 in the calculation process.
  • an embodiment of the present application provides a neural network processor including the above convolution operation device.
  • the above neural network processor is used to perform artificial neural network operations, and implements artificial intelligence applications such as speech recognition, image recognition, and translation.
• The information collecting unit 71 of the dynamic voltage regulation and frequency modulation device 7 collects in real time the working state information or application scenario information of the neural network processor connected to the device 7, where the application scenario information is information obtained by the neural network processor through neural network operations or information collected by sensors connected to the neural network processor. The voltage regulation and frequency modulation unit 72 of the device 7 sends voltage frequency regulation information to the neural network processor according to that working state information or application scenario information; the voltage frequency regulation information is used to instruct the neural network processor to adjust its working voltage or operating frequency.
• The working state information of the neural network processor includes the operating speed of the neural network processor, the voltage frequency regulation information includes first voltage frequency regulation information, and the voltage regulation and frequency modulation unit 72 is configured to: when the operating speed of the neural network processor is greater than a target speed, send the first voltage frequency regulation information to the neural network processor, the first voltage frequency regulation information being used to instruct the neural network processor to reduce its operating frequency or working voltage, where the target speed is the operating speed of the neural network processor when the user's needs are met.
  • the information collecting unit 71 collects the running speed of the neural network processor connected thereto in real time.
• The operating speed of the neural network processor can be a different type of speed depending on the task the processor performs. When the neural network processor performs video image processing, its operating speed may be the frame rate of that video image processing; when it performs voice recognition, its operating speed is the speed at which the input information is recognized.
• When the voltage regulation and frequency modulation unit 72 determines that the operating speed of the neural network processor is greater than the target speed, that is, greater than the operating speed needed to meet the user's demand, it sends the first voltage frequency regulation information to the neural network processor to instruct it to reduce its working voltage or operating frequency, thereby reducing the power consumption of the neural network processor.
• For example, the information collecting unit 71 collects in real time the frame rate at which the neural network processor performs video image processing; suppose the current frame rate of video processing is 54 frames per second. When the voltage regulation and frequency modulation unit 72 determines that this frame rate is greater than the target frame rate, it sends the first voltage frequency regulation information to the neural network processor to instruct it to reduce its working voltage or operating frequency, thereby reducing the power consumption of the neural network processor.
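• Carrying the 54 frames/second figure through one possible (assumed) proportional policy shows how much headroom exists; the 30 fps target and the proportional rule are illustrative assumptions, not values from this disclosure.

```python
measured_fps, target_fps = 54.0, 30.0
current_freq_mhz = 1000.0  # assumed current operating frequency

if measured_fps > target_fps:
    # Scale frequency so throughput just meets the target frame rate.
    new_freq_mhz = current_freq_mhz * target_fps / measured_fps  # ~555.6 MHz
```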
• The neural network processor includes at least a first unit and a second unit, where the output data of the first unit is the input data of the second unit. The working state information of the neural network processor includes the operating speed of the first unit and the operating speed of the second unit, the voltage frequency regulation information includes second voltage frequency regulation information, and the voltage regulation and frequency modulation unit 72 is further configured to: when the running time of the first unit exceeds the running time of the second unit, send the second voltage frequency regulation information to the second unit, the second voltage frequency regulation information being used to instruct the second unit to reduce its operating frequency or working voltage.
• Specifically, the first unit and the second unit of the neural network processor cooperate, with the output data of the first unit serving as the input data of the second unit. The information collecting unit 71 collects the operating speeds of the first unit and the second unit in real time. When the running time of the first unit exceeds the running time of the second unit, the voltage regulation and frequency modulation unit 72 sends the second voltage frequency regulation information to the second unit to instruct it to lower its working voltage or operating frequency, so as to reduce the power consumption of the entire neural network processor without affecting its overall operating speed.
• The voltage frequency regulation information includes third voltage frequency regulation information, and the voltage regulation and frequency modulation unit 72 is further configured to: when the running time of the second unit exceeds the running time of the first unit, send the third voltage frequency regulation information to the first unit, the third voltage frequency regulation information being used to instruct the first unit to reduce its operating frequency or working voltage.
• Specifically, the first unit and the second unit of the neural network processor cooperate, with the output data of the first unit serving as the input data of the second unit. The information collecting unit 71 collects the operating speeds of the first unit and the second unit in real time. When the running time of the second unit exceeds the running time of the first unit, the voltage regulation and frequency modulation unit 72 sends the third voltage frequency regulation information to the first unit to instruct it to lower its working voltage or operating frequency, so as to reduce the power consumption of the entire neural network processor without affecting its overall operating speed.
• The neural network processor includes at least N units, and the working state information of the neural network processor includes the working state information of at least S of those N units, where N is an integer greater than 1 and S is an integer less than or equal to N. The voltage frequency regulation information includes fourth voltage frequency regulation information, and the voltage regulation and frequency modulation unit 72 is configured to: when a unit A is determined to be in an idle state according to its working state information, send the fourth voltage frequency regulation information to the unit A, the fourth voltage frequency regulation information being used to instruct the unit A to reduce its operating frequency or working voltage, where the unit A is any one of the at least S units.
• The voltage frequency regulation information includes fifth voltage frequency regulation information, and the voltage regulation and frequency modulation unit 72 is further configured to: when the unit A is determined to return to a working state according to its working state information, send the fifth voltage frequency regulation information to the unit A, the fifth voltage frequency regulation information being used to instruct the unit A to raise its working voltage or operating frequency.
  • the information collecting unit 71 collects the working state information of at least S units inside the neural network processor in real time.
• When any unit A of the at least S units is in an idle state, the voltage regulation and frequency modulation unit 72 sends the fourth voltage frequency regulation information to the unit A to instruct it to lower its operating frequency or working voltage, reducing the power consumption of the unit A; when the unit A returns to a working state, the voltage regulation and frequency modulation unit 72 sends the fifth voltage frequency regulation information to the unit A to instruct it to raise its operating frequency or working voltage, so that the operating speed of the unit A satisfies the demands of the work.
• When the application scenario information is the number of objects in the image to be identified, the voltage frequency regulation information includes sixth voltage frequency regulation information, and the voltage regulation and frequency modulation unit 72 is further configured to:
• Specifically, the neural network processor is applied to image recognition, the number of objects in the image to be identified is obtained by the neural network processor through a neural network algorithm, and the information collecting unit 71 obtains that number from the neural network processor. When the voltage regulation and frequency modulation unit 72 determines that the number of objects in the image to be identified is less than a first threshold, it sends the sixth voltage frequency regulation information to the neural network processor to instruct it to reduce its working voltage or operating frequency; when it determines that the number of objects in the image to be identified is greater than the first threshold, it sends voltage frequency regulation information instructing the neural network processor to raise its working voltage or operating frequency.
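• The image-recognition rule is a two-sided threshold test on the object count. A minimal sketch, assuming a concrete threshold value and a regulate() callback:

```python
FIRST_THRESHOLD = 10  # assumed value of the first threshold

def regulate_by_object_count(num_objects, regulate):
    if num_objects < FIRST_THRESHOLD:
        regulate("down")  # sixth voltage frequency regulation information
    elif num_objects > FIRST_THRESHOLD:
        regulate("up")    # instruct the processor to raise voltage/frequency
```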
• When the application scenario information is object tag information, the voltage frequency regulation information includes seventh voltage frequency regulation information, and the voltage regulation and frequency modulation unit 72 is further configured to:
• For example, the preset object tag set includes a plurality of object tags, such as "person", "dog", "tree", and "flower".
• When the neural network processor determines through the neural network algorithm that the current application scenario includes a dog, it transmits object tag information containing "dog" to the information collecting unit 71. When the voltage regulation and frequency modulation unit 72 determines that this object tag information belongs to the preset object tag set, it sends the seventh voltage frequency regulation information to the neural network processor to instruct it to raise its working voltage or operating frequency; when it determines that the object tag information does not belong to the preset object tag set, it sends voltage frequency regulation information instructing the neural network processor to reduce its working voltage or operating frequency.
• When the application scenario information is the voice input rate, the voltage frequency regulation information includes eighth voltage frequency regulation information, and the voltage regulation and frequency modulation unit 72 is further configured to:
• when the voice input rate is lower than a threshold, send the eighth voltage frequency regulation information to the neural network processor, the eighth voltage frequency regulation information being used to instruct the neural network processor to reduce its working voltage or operating frequency.
• Specifically, the application scenario of the above neural network processor is speech recognition, and the input unit of the neural network processor inputs speech to it at a certain rate. The information collecting unit 71 collects the voice input rate in real time and transmits the voice input rate information to the voltage regulation and frequency modulation unit 72. When the voice input rate is low, the unit 72 sends the eighth voltage frequency regulation information to the neural network processor to instruct it to reduce its working voltage or operating frequency; otherwise, voltage frequency regulation information for instructing the neural network processor to raise its working voltage is sent to the neural network processor.
• When the application scenario information is a keyword obtained by the neural network processor performing voice recognition, the voltage frequency regulation information includes ninth voltage frequency regulation information, and the voltage regulation and frequency modulation unit 72 is further configured to: when the keyword belongs to a preset keyword set, send the ninth voltage frequency regulation information to the neural network processor, the ninth voltage frequency regulation information being used to instruct the neural network processor to raise its working voltage or operating frequency.
  • the frequency modulation unit 72 sends the voltage modulation and frequency modulation information for instructing the neural network processor to reduce its working voltage or operating frequency to the neural network processor.
  • For example, the preset keyword set includes keywords such as "image beauty", "neural network algorithm", "image processing", and "Alipay". If the application scenario information is "image beauty", the voltage regulation and frequency modulation unit 72 sends the ninth voltage frequency regulation information to the neural network processor to instruct it to raise its working voltage or operating frequency; if the application scenario information is "photographing", it sends voltage frequency regulation information instructing the neural network processor to lower its working voltage or operating frequency.
  • the application scenario information is a speed of text input or a number of characters in an image to be translated
  • the voltage frequency regulation information includes tenth voltage frequency regulation information
  • the voltage regulation frequency modulation unit 72 is also used to:
  • When the neural network processor is applied to machine translation, the application scenario information collected by the information collection unit 71 is the speed of text input or the number of characters in the image to be translated, and it is transmitted to the voltage regulation and frequency modulation unit 72.
  • When the speed of text input or the number of characters in the image to be translated is less than a threshold, the voltage regulation and frequency modulation unit 72 sends the tenth voltage frequency regulation information to the neural network processor to instruct it to reduce its working voltage or operating frequency;
  • otherwise, the voltage regulation and frequency modulation unit 72 sends voltage frequency regulation information instructing the neural network processor to raise its working voltage or operating frequency.
  • When the application scenario information is the ambient light intensity, the voltage frequency regulation information includes eleventh voltage frequency regulation information, and the voltage regulation and frequency modulation unit 72 is further configured to:
  • send the eleventh voltage frequency regulation information to the neural network processor when the ambient light intensity is less than a fifth threshold, where the eleventh voltage frequency regulation information is used to instruct the neural network processor to reduce its working voltage or operating frequency.
  • the illumination intensity of the external environment is acquired by an illumination sensor connected to the neural network processor.
  • the information collection unit 71 transmits the illumination intensity to the voltage regulation and frequency modulation unit 72.
  • When the voltage regulation and frequency modulation unit 72 determines that the illumination intensity is less than the fifth threshold, it sends the eleventh voltage frequency regulation information to the neural network processor to instruct it to reduce its working voltage or operating frequency;
  • when it determines that the illumination intensity is greater than the fifth threshold, the voltage regulation and frequency modulation unit 72 sends voltage frequency regulation information instructing the neural network processor to raise its working voltage or operating frequency.
  • the neural network processor is applied to image beauty
  • the voltage frequency regulation information includes twelfth voltage frequency regulation information and thirteenth voltage frequency regulation information
  • the voltage regulation and frequency modulation unit 72 is further configured to:
  • send the twelfth voltage frequency regulation information to the neural network processor when the application scenario information is a face image, where the twelfth voltage frequency regulation information is used to instruct the neural network processor to raise its working voltage or operating frequency; otherwise, send the thirteenth voltage frequency regulation information to instruct the neural network processor to reduce its working voltage or operating frequency;
  • the application scenario information is voice strength
  • When the voice strength is greater than a sixth threshold, the voltage regulation and frequency modulation unit 72 sends voltage frequency regulation information to the neural network processor instructing it to reduce its working voltage or operating frequency; when the voice strength is less than the sixth threshold, the voltage regulation and frequency modulation unit 72 sends voltage frequency regulation
  • information instructing the neural network processor to raise its working voltage or operating frequency. All of the scenario rules above follow the same threshold or set-membership pattern, as the sketch below illustrates.
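  • The scenario rules above all share one shape: compare a collected metric against a preset threshold, or test membership in a preset set, then send regulation information to raise or lower the voltage or frequency. A minimal Python sketch of that shared pattern follows; every name and threshold value in it is an illustrative assumption, not something defined by this application.

```python
# Hypothetical sketch of the threshold / set-membership pattern above.
# Metric names, thresholds, and tag sets are illustrative only.
THRESHOLDS = {
    "object_count": 10,       # "first threshold"
    "voice_input_rate": 50,   # e.g. samples per second
    "ambient_light": 200,     # "fifth threshold", e.g. lux
}
PRESET_TAGS = {"person", "dog", "tree", "flower"}

def regulate_by_metric(metric_name: str, value: float) -> str:
    """Below the threshold -> reduce voltage/frequency; above -> raise."""
    if value < THRESHOLDS[metric_name]:
        return "REDUCE_VOLTAGE_OR_FREQUENCY"
    return "RAISE_VOLTAGE_OR_FREQUENCY"

def regulate_by_tag(tag: str) -> str:
    """Tag in the preset set -> raise; otherwise -> reduce."""
    return ("RAISE_VOLTAGE_OR_FREQUENCY" if tag in PRESET_TAGS
            else "REDUCE_VOLTAGE_OR_FREQUENCY")

print(regulate_by_metric("object_count", 3))    # few objects -> reduce
print(regulate_by_metric("ambient_light", 900)) # bright scene -> raise
print(regulate_by_tag("dog"))                   # preset tag -> raise
```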
  • the foregoing scene information may be information of an external scene collected by the sensor, such as light intensity, voice intensity, and the like.
  • the application scenario information may also be information computed by an artificial intelligence algorithm. For example, in an object recognition task, the real-time calculation result information of the neural network processor is fed back to the information collecting unit, where this information includes the number of objects in the scene, face images, object tag keywords, and so on.
  • the artificial intelligence algorithm described above includes, but is not limited to, a neural network algorithm.
  • FIG. 4F is a schematic structural diagram of another convolution operation device according to an embodiment of the present application.
  • the convolution operation device includes a dynamic voltage modulation and frequency modulation device 617, a register unit 612, an interconnection module 613, an operation unit 614, a control unit 615, and a data access unit 616.
  • the operation unit 614 includes at least two of an addition calculator, a multiplication calculator, a comparator, and an activation operator.
  • the interconnecting module 613 is configured to control the connection relationship of the calculators in the computing unit 614 such that at least two types of calculators form different computing topologies.
  • the register unit 612 (which may be a register unit, an instruction cache, a scratch pad memory) is configured to store an operation instruction, an address of the data block in the storage medium, and a calculation topology corresponding to the operation instruction.
  • the convolution operation device further includes a storage medium 611.
  • the storage medium 611 may be an off-chip memory. Of course, in an actual application, it may also be an on-chip memory for storing data blocks.
  • the control unit 615 is configured to extract the operation instruction, the operation domain corresponding to the operation instruction, and the first calculation topology corresponding to the operation instruction from the register unit 612, and to decode the operation instruction into an execution instruction that controls
  • the operation unit 614 to perform arithmetic operations; the control unit transfers the operation domain to the data access unit 616 and transmits the calculation topology to the interconnection module 613.
  • the data access unit 616 is configured to extract a data block corresponding to the operation domain from the storage medium 611, and transmit the data block to the interconnection module 613.
  • the interconnecting module 613 is configured to receive the data block of the first computing topology.
  • the interconnection module 613 also rearranges the data block according to the first calculation topology.
  • the operation unit 614 is configured to execute the execution instruction, calling its calculators to operate on the data block to obtain an operation result, and to transmit the operation result to the data access unit 616 for storage in the storage medium 611.
  • the operation unit 614 is further configured to, according to the first computing topology and the execution instruction, invoke a calculator to perform an operation operation on the re-arranged data block, obtain an operation result, and transmit the operation result.
  • the data access unit 616 is stored in the storage medium 611.
  • the interconnecting module 613 is further configured to form a first computing topology according to the connection relationship of the calculator in the control computing unit 614.
  • the dynamic voltage regulation and frequency modulation device 617 is configured to monitor the working state of the entire convolution operation device and dynamically adjust its voltage and frequency.
  • the specific calculation method of the above convolution operation device is described by different operation instructions.
  • the operation instruction here is exemplified by a convolution calculation instruction; the convolution calculation instruction can be applied to a neural network, so it can also be called a convolutional neural network instruction.
  • the formula that it actually needs to execute can be: S = s(Σ_i w_i · x_i + b), in which
  • the convolution kernel w (which may include a plurality of data) is multiplied by the input data x_i, the products are summed, the offset b is then optionally added, and the activation operation s(h) is then optionally performed to obtain the final output result S.
  • the calculation topology can be obtained as a multiplier-adder-(optional) activation operator.
  • the convolution calculation instruction may include an instruction set including a convolutional neural network COMPUTE instruction having different functions, and a CONFIG instruction, an IO instruction, a NOP instruction, a JUMP instruction, and a MOVE instruction.
  • the COMPUTE instruction includes:
  • a convolution operation instruction, according to which the convolution operation device extracts input data of a specified size and a convolution kernel from a specified address of a memory (preferably a scratch pad memory or a scalar register file), and performs the convolution operation in the convolution operation unit;
  • a convolutional neural network sigmoid instruction, according to which the convolution operation device respectively extracts input data of a specified size and a convolution kernel from a specified address of a memory (preferably a scratch pad memory or a scalar register file), performs the convolution operation in the convolution operation unit, and then activates the output result with sigmoid;
  • a convolutional neural network TanH instruction, according to which the convolution operation device respectively extracts input data of a specified size and a convolution kernel from a designated address of a memory (preferably a scratch pad memory), performs the convolution operation in the convolution operation unit, and then activates the output result with TanH;
  • a convolutional neural network ReLU instruction, according to which the convolution operation device respectively extracts input data of a specified size and a convolution kernel from a specified address of a memory (preferably a scratch pad memory), performs the convolution operation in the convolution operation unit, and then activates the output result with ReLU;
  • a convolutional neural network group instruction, according to which the convolution operation device respectively extracts input data of a specified size and a convolution kernel from a designated address of a memory (preferably a scratch pad memory), divides them into groups, performs the convolution operation in the convolution operation unit, and then optionally activates the output result.
  • the CONFIG command is used to configure the various constants required for the current layer calculation before each layer of artificial neural network calculation begins.
  • the IO instruction is used to read the input data required for calculation from the external storage space and store the data back to the external space after the calculation is completed.
  • the NOP instruction is responsible for clearing the control signals in all control signal buffer queues of the current device, and ensuring that all instructions before the NOP instruction are all completed.
  • the NOP instruction itself does not contain any operations;
  • the JUMP instruction is used to control the jump of the next instruction address to be read from the instruction storage unit, and is used to implement the jump of the control flow;
  • the MOVE instruction is used to carry data at one address in the internal address space of the convolution operation device to another address in the internal address space of the convolution operation device; the process is independent of the operation unit and does not occupy its resources during execution.
  • the method for executing the convolution calculation instruction by the convolution operation device may specifically be:
  • the control unit 615 extracts the convolution calculation instruction, the operation domain corresponding to the convolution calculation instruction, and the first calculation topology corresponding to the convolution calculation instruction (multiplier - adder - adder - activation operator) from the register unit 612;
  • the control unit transmits the operation domain to the data access unit 616 and the first calculation topology to the interconnection module 613;
  • the data access unit 616 extracts the convolution kernel w and the offset b corresponding to the operation domain from the storage medium 611 (when b is 0, the offset b does not need to be extracted), and transmits the convolution kernel w and the offset b to the operation unit 614.
  • the multiplier of the operation unit 614 performs the multiplication operation on the convolution kernel w and the input data Xi to obtain the first result, and inputs the first result to the adder to perform the addition operation to obtain the second result; the second result and the offset b
  • are then added to obtain the third result;
  • the third result is input to the activation operator to perform an activation operation to obtain an output result s
  • the output result s is transmitted to the data access unit 616 for storage in the storage medium 611.
  • Optionally, the result can be transmitted directly to the data access unit 616 and stored in the storage medium 611 without the following steps.
  • the step of performing the addition of the second result and the offset b to obtain the third result is optional, that is, when b is 0, this step is not required.
  • the order of addition and multiplication operations can be reversed.
  • the first result may include a result of a plurality of multiplication operations.
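  • As a concrete illustration of the multiplier - adder - (adder) - activation dataflow walked through above, the following minimal Python sketch reproduces the arithmetic the convolution calculation instruction expresses; it is not the device's hardware implementation, and the function names are hypothetical.

```python
import math

def conv_instruction(x, w, b=0.0, activation=None):
    """Sketch of S = act(sum_i(w_i * x_i) + b), the dataflow named above."""
    # Multiplier: elementwise products (the "first result" may hold many products).
    products = [wi * xi for wi, xi in zip(w, x)]
    # Adder: accumulate the products into the "second result".
    acc = sum(products)
    # Optional second addition with the offset b, giving the "third result".
    if b:
        acc += b
    # Optional activation operator s(h).
    return activation(acc) if activation else acc

sigmoid = lambda h: 1.0 / (1.0 + math.exp(-h))
print(conv_instruction([1.0, 2.0], [0.5, -0.25], b=0.1, activation=sigmoid))
```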
  • an embodiment of the present invention provides a neural network processor including the above convolution operation device.
  • the above neural network processor is used to perform artificial neural network operations, and implements artificial intelligence applications such as speech recognition, image recognition, and translation.
  • the dynamic voltage modulation and frequency modulation device 617 in FIG. 4F works as follows:
  • Case 1: in the process of performing the convolution operation, the dynamic voltage modulation and frequency modulation device 617 in FIG. 4F acquires the running speeds of the data access unit 616 and the operation unit 614 of the neural network processor in real time. When the dynamic voltage modulation and frequency modulation device 617 determines from these running speeds that the running time of the data access unit 616 exceeds the running time of the operation unit 614, it can determine that during the convolution operation the data access unit 616 has become the bottleneck.
  • the operation unit 614 needs to wait for the data access unit 616 to execute the read task and transmit the read data to the operation unit 614.
  • the arithmetic unit 614 can perform a convolution operation operation based on the data transmitted by the data access unit 616 mentioned above.
  • the dynamic voltage modulation and frequency modulation device 617 sends the first voltage frequency regulation information to the operation unit 614, where the first voltage frequency regulation information is used to instruct the operation unit 614 to lower the operating voltage or the operating frequency thereof to reduce the operating speed of the operation unit 614.
  • In this way, the running speed of the operation unit 614 matches the running speed of the data access unit 616, which reduces the power consumption of the operation unit 614, prevents the operation unit 614 from idling, and finally reduces
  • the overall operating power consumption of the neural network processor without affecting the completion time of the task.
  • Case 2: in the process of performing the convolution operation, the dynamic voltage modulation and frequency modulation device 617 acquires the running speeds of the data access unit 616 and the operation unit 614 of the neural network processor in real time.
  • When the dynamic voltage modulation and frequency modulation device 617 determines from the running speeds of the data access unit 616 and the operation unit 614 that the running time of the operation unit 614 exceeds the running time of the data access unit 616, it can determine that during the convolution operation
  • the operation unit 614 has become the bottleneck.
  • In this case, the data access unit 616 must wait for the operation unit 614 to complete the current convolution operation before it can transmit the data it has read.
  • the dynamic voltage modulation and frequency modulation device 617 sends the second voltage frequency regulation information to the data access unit 616, where the second voltage frequency regulation information is used to instruct the data access unit 616 to lower its working voltage or operating frequency so as to reduce the running speed of the data access unit 616,
  • so that the running speed of the data access unit 616 matches the running speed of the operation unit 614; this reduces the power consumption of the data access unit 616, prevents the data access unit 616 from idling, and finally reduces the overall operating power consumption of the neural network processor without affecting the completion time of the task. The same matching rule, applied in both directions, is sketched below.
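  • Cases 1 and 2 apply the same rule in opposite directions: whichever unit finishes first is throttled so the two pipelines stay matched. A hedged sketch of that decision, with hypothetical timing inputs, follows.

```python
def dvfs_decision(access_time: float, compute_time: float) -> str:
    """Pick which unit to slow down, per Cases 1 and 2 above.

    access_time: measured running time of the data access unit 616.
    compute_time: measured running time of the operation unit 614.
    """
    if access_time > compute_time:
        # Case 1: memory access is the bottleneck -> slow the operation unit.
        return "first regulation info -> operation unit: lower voltage/frequency"
    if compute_time > access_time:
        # Case 2: computation is the bottleneck -> slow the data access unit.
        return "second regulation info -> data access unit: lower voltage/frequency"
    return "balanced: no regulation needed"

print(dvfs_decision(access_time=4.0, compute_time=2.5))
```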
  • the neural network processor performs an artificial neural network operation.
  • the dynamic voltage regulation and frequency modulation device 617 collects in real time the working parameters of the artificial intelligence application run by the neural network processor, and adjusts
  • the working voltage or operating frequency of the neural network processor according to these working parameters.
  • the artificial intelligence application may be video image processing, object recognition, machine translation, voice recognition, image beauty, and the like.
  • When the neural network processor performs video image processing, the dynamic voltage regulation and frequency modulation device 617 collects the frame rate of the video image processing in real time.
  • When the frame rate exceeds the target frame rate, the target frame rate being the video image processing frame rate normally required by the user, the dynamic voltage modulation and frequency modulation device 617 sends the third voltage frequency regulation information to the neural network processor,
  • where the third voltage frequency regulation information is used to instruct the neural network processor to reduce its working voltage or operating frequency, reducing the power consumption of the neural network processor while still satisfying the user's normal video image processing requirements.
  • Case 4: when the neural network processor performs speech recognition, the dynamic voltage modulation and frequency modulation device 617 collects the speech recognition speed of the neural network processor in real time. When the speech recognition speed of the neural network processor exceeds the speech recognition speed actually required by the user, the dynamic voltage regulation and frequency modulation device 617 sends the fourth voltage frequency regulation information to the neural network processor, where the fourth voltage frequency regulation information is used to instruct the neural
  • network processor to reduce its working voltage or operating frequency, reducing the power consumption of the neural network processor while still satisfying the user's normal speech recognition requirements.
  • the dynamic voltage modulation and frequency modulation device 617 monitors in real time the working state of each unit or module in the neural network processor (including the storage medium 611, the register unit 612, the interconnection module 613, the operation unit 614, the controller unit 615, and the data access unit 616).
  • When a unit or module is in an idle state, the dynamic voltage regulation and frequency modulation device sends the fifth voltage frequency regulation information to that unit or module to reduce its working voltage or operating frequency, further reducing the power consumption of the unit or module.
  • When the unit or module returns to a working state, the dynamic voltage regulation and frequency modulation device sends the sixth voltage frequency regulation information to the unit or module to raise its working voltage or operating frequency, so that the running speed of the unit or module meets the demands of the work.
  • FIG. 4G is a schematic flowchart of a method for performing a forward operation of a single-layer convolutional neural network according to an embodiment of the present application, where the method is applied to the convolution operation device. As shown in FIG. 4G, the method includes the following steps:
  • S702: the operation starts; the controller unit reads the IO instruction from the first address of the instruction storage unit, and according to the decoded control signal, the data access unit reads all the corresponding convolutional neural network operation instructions from the external address space and caches them in the instruction storage unit;
  • the controller unit then reads the next IO instruction from the instruction storage unit, and according to the decoded control signal, the data access unit reads all data required by the main operation module from the external address space to the main a first storage unit of the computing module;
  • the controller unit then reads the next IO instruction from the instruction storage unit, and the data access unit reads the convolution kernel data required by the operation module from the external address space according to the decoded control signal;
  • the controller unit then reads the next CONFIG instruction from the instruction storage unit, and the convolution operation device configures various constants required for the calculation of the layer neural network according to the decoded control signal;
  • the controller unit then reads the next COMPUTE instruction from the instruction storage unit, and according to the decoded control signal, the main operation module first sends the input data within the convolution window to the N slave operation modules through the interconnection module,
  • where it is saved to the second storage units of the N slave operation modules, and then moves the convolution window according to the instruction;
  • according to the control signal decoded from the COMPUTE instruction, the operation units of the N slave operation modules read the convolution kernel from the third storage unit, read the input data from the second storage unit, complete the convolution operation of the input data and the convolution kernel, and return the resulting output scalars through the interconnection module;
  • in the interconnection module, the output scalars returned by the N slave operation modules are successively spliced into a complete intermediate vector;
  • the main operation module obtains the intermediate vectors returned by the interconnection module; after the convolution window has traversed all the input data, the main operation module splices all the returned vectors into an intermediate result; according to the control signal decoded from the COMPUTE instruction, it reads the offset data from the first
  • storage unit, adds the offset to the intermediate result by the vector addition unit, then the activation unit activates the result, and the final output data is written back to the first storage unit;
  • the controller unit reads the next IO instruction from the instruction storage unit, and according to the decoded control signal, the data access unit stores the output data in the first storage unit to the specified address in the external address space; the operation then ends. The whole single-layer flow is sketched below.
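  • The following Python sketch condenses the dataflow of the steps above: the convolution window slides over the input, each of N hypothetical slave modules contributes one output scalar per window position, the scalars form an intermediate vector, and the offset and activation finish the layer. Array shapes and names are illustrative assumptions, not the device's actual storage layout.

```python
import numpy as np

def conv_layer_forward(x, kernels, bias, act=np.tanh):
    """x: (H, W) input plane; kernels: (N, kh, kw), one kernel per slave
    module; bias: (N,). Returns (out_h, out_w, N) output data."""
    N, kh, kw = kernels.shape
    out_h, out_w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((out_h, out_w, N))
    for i in range(out_h):                 # move the convolution window
        for j in range(out_w):
            window = x[i:i + kh, j:j + kw]
            # each "slave module" returns one output scalar per window;
            # together they are spliced into the intermediate vector
            out[i, j] = [np.sum(window * k) for k in kernels]
    return act(out + bias)                 # offset, then activation

x = np.random.rand(6, 6)
print(conv_layer_forward(x, np.random.rand(4, 3, 3), np.zeros(4)).shape)
```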
  • the method further includes:
  • the working state information of the convolution operation device includes an operating speed of the convolution operation device
  • the voltage frequency regulation information includes first voltage frequency regulation information
  • sending the voltage frequency regulation information to the convolution operation device according to the working state information of the convolution operation device includes:
  • sending the first voltage frequency regulation information to the convolution operation device when the running speed of the convolution operation device is greater than a target speed, where the first voltage frequency regulation information is used to instruct the convolution operation device to reduce its operating frequency or working voltage, and the target speed is the running speed of the chip when the user's needs are met.
  • the working state information of the convolution operation device includes an operation speed of the data access unit and an operation speed of the main operation unit
  • the voltage frequency regulation information includes second voltage frequency regulation information
  • the second voltage frequency regulation information is used to instruct the main operation unit to lower its operating frequency or operating voltage.
  • the voltage frequency regulation information includes third voltage frequency regulation information
  • the sending the voltage frequency regulation information to the convolution operation device according to the working state information of the convolution operation device further includes:
  • the third voltage frequency regulation information is used to instruct the data access unit to reduce its operating frequency or operating voltage.
  • the working state information of the convolution operation device includes the working state information of at least S units/modules among the instruction storage unit, the controller unit, the data access unit, the interconnection module, the main operation module, and the N slave operation modules,
  • where S is an integer greater than 1 and less than or equal to N+5;
  • the voltage frequency regulation information includes fourth voltage frequency regulation information; according to the working state information of the convolution operation device,
  • sending the voltage frequency regulation information to the convolution operation device further includes: sending the fourth voltage frequency regulation information to a unit A according to the working state information of the unit A,
  • where the unit A is any one of the at least S units/modules.
  • the voltage frequency regulation information includes the fifth voltage frequency regulation information
  • the sending the voltage frequency regulation information to the convolution operation device according to the working state information of the convolution operation device further includes:
  • a method for performing a forward operation of a multi-layer convolutional neural network, comprising: performing the single-layer neural network forward operation method shown in FIG. 4G for each layer; after the execution of the previous layer of the convolutional neural network is completed, the operation instruction of this layer uses the output data address of the previous layer stored in the main operation module as the input data address of this layer, and the convolution kernel address and the offset data address in the instruction are changed to the addresses corresponding to this layer, as sketched below.
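  • The multi-layer method is just the single-layer routine applied in a loop, with each layer's output becoming the next layer's input; a minimal sketch, assuming each layer is packaged as a callable with its own parameters baked in:

```python
def multilayer_forward(first_input, layers):
    """Chain single-layer forward passes, as in the multi-layer method above."""
    data = first_input
    for layer in layers:
        # The output data "address" of the previous layer becomes the input
        # of this layer; kernel and offset addresses are likewise switched
        # to this layer's parameters (here: baked into each callable).
        data = layer(data)
    return data

doubler = lambda xs: [2 * v for v in xs]   # stand-in for a real layer
print(multilayer_forward([1, 2, 3], [doubler, doubler]))  # [4, 8, 12]
```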
  • an image compression method and related apparatus which can train a compressed neural network for image compression, and improve the effectiveness of image compression and the accuracy of recognition.
  • FIG. 5A provides a neural network operation process according to the present application.
  • the dotted arrow indicates the reverse operation
  • the solid arrow indicates the forward operation.
  • In the forward operation, when the execution of the previous layer of the artificial neural network is completed, the output neurons obtained by the previous layer are used as the input neurons of the next layer (or some operations are performed on these output neurons
  • before they are used as the input neurons of the next layer); at the same time, the weights are replaced with the weights of the next layer.
  • In the reverse operation, when the reverse operation of the previous layer is completed, the input neuron gradient obtained by the previous layer is used as the output neuron gradient of the next layer (or some operations are performed on the input neuron
  • gradient before it is used as the output neuron gradient of the next layer); at the same time, the weights are replaced with the weights of the next layer.
  • the forward propagation phase of the neural network corresponds to the forward operation, which is the process of inputting data input to output data.
  • the back propagation phase corresponds to the reverse operation, in which the error between the final result data and the expected output data is propagated backwards layer by layer;
  • the weights of each layer are corrected and adjusted according to the error gradient. This is also the process of neural network learning and training, which reduces the network output error.
  • There is no limitation on the type of the compression training atlas of the compressed neural network or on the number of training images included in each type of training atlas.
  • the compressed training atlas may include multiple dimensions such as images of multiple angles, images of multiple light intensities, or images acquired by multiple different types of image acquisition devices.
  • When the compressed neural network is trained on compression training atlases corresponding to the different dimensions above, the effectiveness of image compression in different situations is improved and the application range of the image compression method is expanded.
  • Each training image in the compression training atlas includes label information.
  • The specific content of the label information is not limited in this application; it marks the image portion to be trained and can be used to detect whether the compressed neural network has completed training.
  • For example, for a driving image whose tag information is the target license plate information,
  • the driving image is input to the compressed neural network to obtain a compressed image,
  • and the compressed image is identified based on the recognition neural network model to obtain reference license plate information. If the reference license plate information matches the target license plate information, it can be determined that the compressed neural network has completed training; otherwise, when the current number of training iterations of the compressed neural network is less than the preset threshold, the compressed neural network continues to be trained.
  • the application does not limit the type of tag information, and may be license plate information, face information, traffic sign information, object classification information, and the like.
  • the recognition neural network model involved in the present application is the data obtained when the training of the recognition neural network used for image recognition is completed. The training method for the recognition neural network is not limited; training may be performed with a Batch Gradient Descent (BGD) algorithm, Stochastic Gradient Descent (SGD), or mini-batch SGD, and one training period is completed by a single forward operation and reverse gradient propagation.
  • Each training image in the identification training atlas includes tag information whose type is consistent with the type of the target tag information of the training images in the compression training atlas. That is to say, the recognition neural network model can identify the compressed images output by the compressed neural network (whether still being trained or having completed training).
  • For example, if the type of the tag information of the compression training images is license plate,
  • the type of the tag information of the identification training images at least includes license plate, thereby ensuring that the recognition neural network model can recognize the compressed images output by the compressed neural network and obtain the license plate information.
  • Optionally, the compression training atlas at least includes the identification training atlas.
  • In this way, the accuracy of the recognition neural network model can be improved, thereby improving the training efficiency of the compressed neural network, that is, the effectiveness of image compression.
  • FIG. 5B is a schematic flowchart of an image compression method according to an embodiment of the present application. As shown in FIG. 5B, the image compression method includes the following steps:
  • Step S201 Acquire an original image of the first resolution.
  • the first resolution is the input resolution of the compressed neural network
  • the second resolution is smaller than the first resolution and is the output resolution of the compressed neural network; that is, the compression ratio of images input to the compressed neural network (the ratio of the second resolution to the first resolution) is fixed, so the same compression ratio is obtained when different images are compressed based on the same compressed neural network model.
  • the original image is any training image in the compressed training map set of the compressed neural network, and the label information of the original image is used as the target label information.
  • the application does not limit how the tag information is obtained; it may be obtained by manual marking, or by inputting the original image into the recognition neural network and performing recognition based on the recognition neural network model.
  • Step S202 compress the original image based on the target model to obtain a compressed image of the second resolution.
  • the target model is the current neural network model of the compressed neural network, that is, the target model is the current parameter of the compressed neural network. Compressing the original image with a resolution equal to the input resolution of the compressed neural network based on the target model yields a compressed image having a resolution equal to the output resolution of the compressed neural network.
  • the compressing the original image based on the target model to obtain the compressed image of the second resolution comprises: identifying the original image based on the target model to obtain a plurality of image information; and based on the target model And compressing the original image with the plurality of image information to obtain the compressed image.
  • As described above, the training images include multiple dimensions.
  • When the original image is identified based on the target model, the image information corresponding to each dimension can be determined, and the original image is compressed according to each piece of image information,
  • thereby improving the accuracy of image compression in different dimensions.
  • Step S203 Identify the compressed image based on the recognition neural network model to obtain reference label information.
  • the present application does not limit the identification method, and may include two parts: feature extraction and feature recognition, and the result of feature recognition is used as reference label information.
  • For example, after a driving image is compressed, the reference label information corresponding to the driving compressed image is the license plate number; after a face image is compressed, the reference tag information corresponding to the face compressed image is the face recognition result.
  • identifying the compressed image by using the recognition neural network model to obtain the reference label information comprises: preprocessing the compressed image to obtain an image to be identified; and identifying the image to be identified based on the recognition neural network model to obtain the reference tag information.
  • the preprocessing includes, but is not limited to, any one or more of the following: data format conversion processing (eg, normalization processing, integer data conversion, etc.), data deduplication processing, data exception processing, data missing padding processing, and the like.
  • the acquiring the original image of the first resolution comprises: receiving an input image; and preprocessing the input image to obtain the original image.
  • the compression efficiency of image compression can be improved by preprocessing the input image.
  • the preprocessing described above also includes size processing, since a neural network has a fixed size requirement: it can only process images whose size equals the basic image size of the neural network.
  • the basic image size of the compressed neural network is taken as the first basic image size, and the basic image size of the recognition neural network as the second basic image size; that is, the compressed neural network requires the size of its input image to be equal to the first basic image size, and
  • the recognition neural network requires that the size of the input image be equal to the second basic image size.
  • the compressed neural network may compress the image to be compressed that satisfies the first basic image size to obtain a compressed image; the recognition neural network may identify the image to be identified that satisfies the second basic image size to obtain reference tag information.
  • the specific manner of the size processing is not limited, and may include a method of cutting or filling pixels, a method of scaling according to a basic image size, a down sampling method for an input image, and the like.
  • Pixel cropping removes non-critical information areas around the image;
  • the downsampling process reduces the sampling rate of a specific signal; for example, four adjacent pixel points are averaged as the value of one pixel at the corresponding position of the processed image, thereby reducing the size of the image, as in the sketch below.
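  • For the four-pixel averaging just mentioned, a minimal NumPy sketch of a 2x2 average downsample follows (it assumes a single-channel image and drops odd edge rows/columns).

```python
import numpy as np

def downsample_2x2(img: np.ndarray) -> np.ndarray:
    """Average each 2x2 block of adjacent pixels into one output pixel."""
    h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2   # drop odd edges
    img = img[:h, :w]
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

img = np.arange(16, dtype=float).reshape(4, 4)
print(downsample_2x2(img))   # 2x2 output; each value is the mean of 4 neighbours
```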
  • the preprocessing of the compressed image to obtain the image to be identified includes: when the image size of the compressed image is smaller than the basic image size of the recognition neural network, filling the compressed image with pixel points according to the basic image size to obtain the image to be identified.
  • the present application does not limit the pixel point, and may correspond to any color mode, for example: rgb (0, 0, 0).
  • the specific position of the pixel padding is not limited and may be any position outside the compressed image; that is, the compressed image itself is not processed, but the image is expanded by filling pixel points, so the compressed image is not deformed, which helps improve the efficiency and accuracy of image recognition.
  • For example, the compressed image is placed at the upper left of the image to be recognized, and the positions of the image to be recognized outside the compressed image are filled with pixel points, as sketched below.
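  • A hedged sketch of that padding step: the compressed image is kept intact at the upper left of a canvas of the basic image size, and the remaining positions are filled with a constant pixel value (zero here, standing in for a color such as rgb(0, 0, 0)); the helper name is hypothetical.

```python
import numpy as np

def pad_to_base_size(img: np.ndarray, base_h: int, base_w: int,
                     fill: float = 0.0) -> np.ndarray:
    """Place img at the upper left of a base_h x base_w canvas and fill
    the remaining positions with a constant pixel value."""
    h, w = img.shape[:2]
    assert h <= base_h and w <= base_w, "image already exceeds base size"
    canvas = np.full((base_h, base_w) + img.shape[2:], fill, dtype=img.dtype)
    canvas[:h, :w] = img          # the compressed image itself is untouched
    return canvas

small = np.ones((2, 3))
print(pad_to_base_size(small, 4, 4))
```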
  • the preprocessing of the input image to obtain the original image includes: when the image size of the input image is smaller than the first basic image size of the compressed neural network, filling
  • the input image with pixel points according to the first basic image size to obtain the original image.
  • The original image obtained by pixel point filling can be compressed and then identified by the recognition neural network to obtain reference label information; the pixel point filling does not change the compression ratio of the input image, which helps improve the efficiency and accuracy of training the compressed neural network.
  • Step S204 Acquire a loss function according to the target tag information and the reference tag information.
  • the loss function is used to describe the magnitude of the error between the target tag information and the reference tag information.
  • Since the tag information includes multiple dimensions, the loss function is generally calculated using a squared difference formula: E = Σ_{k=1}^{c} (t_k − y_k)²,
  • where c is the dimension of the tag information,
  • t_k is the kth dimension of the reference tag information,
  • and y_k is the kth dimension of the target tag information.
  • Step S205 Determine whether the loss function converges to the first threshold or whether the current training number of the compressed neural network is greater than or equal to the second threshold. If yes, go to step S206; if no, go to step S207.
  • the training period corresponding to each training image is completed by a single forward operation and reverse gradient propagation; the threshold of the loss function is set as the first threshold, and the threshold of the number
  • of training iterations of the compressed neural network is set as the second threshold. That is, if the loss function converges to the first threshold or the number of training iterations is greater than or equal to the second threshold, the training of the compressed neural network is completed, and the target model is used as the compressed neural network model corresponding to the trained compressed neural network.
  • the present application is not limited to the reverse training method of the compressed neural network.
  • the apparatus includes an instruction cache unit 21, a controller unit 22, a direct memory access unit 23, an H-tree module 24, a main operation module 25, and a plurality of slave operation modules 26, all of which can be implemented by hardware circuits (for example, an application specific integrated circuit, ASIC).
  • the instruction cache unit 21 reads instructions through the direct memory access unit 23 and caches the read instructions; the controller unit 22 reads instructions from the instruction cache unit 21 and translates them into microinstructions that control the behavior of the other modules,
  • such as the direct memory access unit 23, the main operation module 25, and the slave operation modules 26; the direct memory access unit 23 can access the external address space, directly read and write data to each cache unit inside the device, and complete the loading and storing of data.
  • FIG. 5F shows the structure of the H-tree module 24.
  • the H-tree module 24 constitutes a data path between the main operation module 25 and the plurality of slave operation modules 26, and has an H-tree structure.
  • the H-tree is a binary tree path composed of multiple nodes. Each node sends the upstream data to the downstream two nodes in the same way, and the data returned by the two downstream nodes are combined and returned to the upstream node. For example, in the inverse operation of the neural network, the vectors returned by the two downstream nodes are added to a vector at the current node and returned to the upstream node.
  • the output gradient vector partial sums output by each slave operation module 26 are added pairwise, stage by stage, in the H-tree module 24; that is, all the output gradient vector partial sums are summed to form the final output gradient vector, as in the sketch below.
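  • A small Python sketch of that pairwise combining: each tree level halves the number of partial sums until a single final output gradient vector remains. The list-based reduction is an illustrative stand-in for the hardware tree.

```python
import numpy as np

def htree_reduce(partial_sums):
    """Add partial sums pairwise, level by level, as the H-tree does."""
    level = list(partial_sums)
    while len(level) > 1:
        # pair up neighbours; an odd leftover is passed up unchanged
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])
        level = nxt
    return level[0]

parts = [np.array([1.0, 2.0]), np.array([3.0, 4.0]),
         np.array([5.0, 6.0]), np.array([7.0, 8.0])]
print(htree_reduce(parts))   # [16. 20.], the final output gradient vector
```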
  • FIG. 5G is a schematic structural diagram of the main operation module 25.
  • the main operation module 25 includes an operation unit 251, a data dependency determination unit 252, and a neuron buffer unit 253.
  • the neuron buffer unit 253 is configured to buffer the input data and the output data used by the main operation module 25 in the calculation process.
  • the arithmetic unit 251 performs various arithmetic functions of the main arithmetic module.
  • the data dependency judging unit 252 is the port through which the operation unit 251 reads and writes the neuron buffer unit 253, and at the same time it ensures that there is no consistency conflict in the reading and writing of data in the neuron buffer unit 253.
  • Specifically, the data dependency determining unit 252 determines whether there is a dependency between a microinstruction that has not yet been executed and the data of a microinstruction that is being executed; if not, the microinstruction is allowed to be issued immediately;
  • otherwise, the microinstruction is allowed to be issued only after all the microinstructions on which it depends have been executed.
  • For example, all microinstructions sent to the data dependency unit 252 are stored in an instruction queue inside the data dependency unit 252; in this queue, if the range of data read by a read instruction conflicts with the range of data written by a write instruction that is ahead of it in the queue, the read instruction must wait until the write instruction it depends on has been executed. The sketch below illustrates this range-overlap check.
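  • A hedged sketch of that consistency check: a read is held back whenever its address range overlaps the write range of any earlier instruction still in the queue. The (start, end) range representation is an illustrative assumption.

```python
def can_issue(read_range, pending_writes):
    """read_range: (start, end) addresses a queued read instruction covers.
    pending_writes: (start, end) ranges of earlier, still-unfinished writes.
    The read may issue only if it overlaps none of them."""
    r_start, r_end = read_range
    for w_start, w_end in pending_writes:
        if r_start < w_end and w_start < r_end:    # ranges overlap
            return False                           # wait for the write
    return True

print(can_issue((100, 132), [(0, 64), (96, 128)]))   # False: conflict
print(can_issue((200, 232), [(0, 64), (96, 128)]))   # True: no conflict
```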
  • the data dependency determination unit 252 is also responsible for reading the input gradient vector from the neuron buffer unit 253 and sending it to the slave operation modules 26 through the H-tree module 24, while the output data of the slave operation modules 26 is sent directly to the operation unit 251 through the H-tree module 24.
  • The instructions output by the controller unit 22 are sent to the operation unit 251 and the data dependency determination unit 252 to control their behavior.
  • FIG. 5H is a schematic structural diagram of the operation module 26.
  • each slave operation module 26 includes an operation unit 261, a data dependency determination unit 262, a neuron buffer unit 263, a weight buffer unit 264, and a weight gradient buffer unit 265.
  • the arithmetic unit 261 receives the micro-instructions issued by the controller unit 22 and performs arithmetic logic operations.
  • the data dependency determination unit 262 is responsible for the read and write operations on the cache unit in the calculation process.
  • the data dependency judging unit 262 ensures that there is no consistency conflict in the reading and writing of the cache units. Specifically, it determines whether there is a dependency between a microinstruction that has not yet been executed and the data of a microinstruction that is being executed; if not, the microinstruction is allowed to be issued immediately; otherwise, the microinstruction is allowed to be issued only after all the microinstructions on which it depends have been executed.
  • Likewise, all microinstructions sent to the data dependency unit 262 are stored in an instruction queue inside the data dependency unit 262; in this queue, if the range of data read by a read instruction conflicts with the range of data written by a write instruction that is ahead of it in the queue, the read instruction must wait until the write instruction it depends on has been executed.
  • the neuron buffer unit 263 buffers the input gradient vector data and the output gradient vector partial sums calculated by the slave operation module 26.
  • the weight buffer unit 264 buffers the weight vectors required by the slave operation module 26 in the calculation process; for each slave operation module 26, only the columns of the weight matrix corresponding to that slave operation module 26 are stored.
  • the weight gradient buffer unit 265 buffers the weight gradient data required by the corresponding slave module in updating the weights.
  • The weight gradient data stored by each slave operation module 26 corresponds to the weight vector it stores.
  • In computing the output gradient vector, each slave operation module computes only the products of the corresponding partial scalar elements of in_gradient with the corresponding columns of the weight matrix w; each output vector obtained is a partial sum of the final result, and these partial sums are added pairwise, stage by stage, in the H-tree to obtain the final result. The computation process thus becomes a parallel partial-sum computation followed by an accumulation process.
  • Each of the slave arithmetic modules 26 calculates a partial sum of the output gradient vectors and performs a summation operation in the H-tree module 24 to obtain the final output gradient vector.
  • Each slave arithmetic module 26 simultaneously multiplies the input gradient vector by the output value of each layer in the forward operation to calculate a weight gradient to update the weight stored in the slave arithmetic module 26.
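  • A NumPy sketch of the two per-module computations just described: the i-th module multiplies its scalar in_gradient[i] by its stored weight column to produce a partial sum of the output gradient vector, and multiplies in_gradient[i] by the saved forward input to produce its weight update gradient. The column-per-module layout follows the storage scheme above; the function name is hypothetical.

```python
import numpy as np

def slave_module_step(i, in_gradient, w_col_i, forward_input):
    """Work of the i-th slave module in one reverse step (names illustrative).

    in_gradient: input gradient vector of layer n (one scalar per module).
    w_col_i: the weight-matrix column stored by module i.
    forward_input: input neurons of layer n saved from the forward operation.
    """
    partial_out_gradient = in_gradient[i] * w_col_i    # summed in the H-tree
    dw_i = in_gradient[i] * forward_input              # weight update gradient
    return partial_out_gradient, dw_i

in_grad = np.array([0.1, -0.2, 0.3])
col = np.array([1.0, 2.0])        # column [w_i0, w_i1] held by module 0
x_fwd = np.array([0.5, 0.25])     # forward-pass input of layer n
print(slave_module_step(0, in_grad, col, x_fwd))
```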
  • Forward operation and reverse training are the two main processes of neural network algorithms. To train (update) the weights in the network, the neural network first computes the forward output of the input vector in the network composed of the current weights; this is the forward process. Then, the weights of each layer are trained (updated) layer by layer according to the difference between the output value and the label value of the input vector. During the forward computation, the output vector of each layer and the derivative value of the activation function are saved; these data are required by the reverse training process, so they are guaranteed to exist when reverse training begins.
  • the output value of each layer in the forward operation is data that already exists at the beginning of the reverse operation; it can be cached in the main operation module by the direct memory access unit and sent to the slave operation modules through the H-tree.
  • the main operation module 25 performs subsequent calculation based on the output gradient vector, for example, multiplying the output gradient vector by the derivative of the activation function in the forward operation to obtain the input gradient value of the next layer.
  • the derivative of the activation function in the forward operation is data that already exists at the beginning of the reverse operation, and can be cached in the main operation module by the direct memory access unit.
  • an instruction set for performing an artificial neural network forward operation on the aforementioned apparatus includes the CONFIG instruction, the COMPUTE instruction, the IO instruction, the NOP instruction, the JUMP instruction, and the MOVE instruction, where:
  • the CONFIG command is used to configure various constants required for current layer calculation before each layer of artificial neural network calculation begins;
  • IO instruction which is used to read input data required for calculation from an external address space and store the data back to the external space after the calculation is completed;
  • the NOP instruction is responsible for clearing the microinstructions currently loaded into all internal microinstruction buffer queues, and ensuring that all instructions before the NOP instruction are all completed.
  • the NOP instruction itself does not contain any operations;
  • the JUMP instruction is responsible for the jump of the address of the next instruction that the controller will read from the instruction cache unit, and is used to implement jumps in the control flow;
  • the MOVE instruction is used to carry data of an address in the internal address space of the device to another address in the internal address space of the device.
  • the process is independent of the operation unit and does not occupy the resources of the operation unit during execution.
  • FIG. 5I is a block diagram of an example of the reverse training of the compressed neural network provided by the embodiment of the present application.
  • the output gradient vector input gradient of the upper layer in Fig. 5I is multiplied by the corresponding activation function derivative to obtain the input data of this layer, and then multiplied by the weight matrix to obtain the output gradient vector.
  • the slave operation module 26 multiplies the input gradient by the input neurons in the forward operation to calculate the weight update gradient dw, and then uses w, dw, and the weight update gradient dw' used in the last weight update to update the
  • weight w according to the learning rate set by the instruction.
  • Specifically, the input gradient ([input gradient0, ..., input gradient3] in FIG. 5I) is the output gradient vector of the (n+1)th layer; it is first multiplied by the derivative values of the nth layer in the forward operation ([f'(out0), ..., f'(out3)] in FIG. 5I) to obtain the input gradient vector of the nth layer. This is completed in the main operation module 25, sent through the H-tree module 24 to the slave operation modules 26, and temporarily stored in the neuron buffer units 263 of the slave operation modules 26. Then, the input gradient vector is multiplied by the weight matrix to obtain the output gradient vector of the nth layer.
  • In this process, the i-th slave operation module computes the product of the i-th scalar in the input gradient vector and the column vector [w_i0, ..., w_iN] of the weight matrix; the resulting output vectors are
  • added pairwise, stage by stage, in the H-tree module 24 to obtain the final output gradient vector output gradient ([output gradient0, ..., output gradient3] in FIG. 5I).
  • At the same time, the weight update gradient is computed as dw_ij = x_j × in_gradient_i, where x_j is the jth element of the input vector of the nth layer in the forward operation, and in_gradient_i is the ith element of the input gradient vector of the nth layer in the reverse operation (that is, the product of input gradient and the derivative f' in FIG. 5I).
  • the input of the nth layer in the forward operation is the data existing at the beginning of the reverse training, and is sent to the slave arithmetic module 26 through the H-tree module 24 and temporarily stored in the neuron buffer unit 263.
  • After completing the calculation of the partial sums of the output gradient vector, the slave operation module 26 multiplies the i-th scalar of the input gradient vector by the input vector of the nth layer of the forward operation to obtain the weight update gradient vector dw, and then updates the weight accordingly, as sketched below.
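  • Putting the pieces of FIG. 5I together, the following hedged NumPy sketch performs one layer's reverse step: multiply the incoming gradient by the saved activation derivative, form the output gradient through the weight matrix, compute dw, and update w with a learning rate. The matrix orientation and the plain gradient-descent update are illustrative assumptions.

```python
import numpy as np

def reverse_layer_step(output_grad_next, f_prime, W, x_forward, lr=0.01):
    """One layer of reverse training, following FIG. 5I.

    output_grad_next: gradient arriving from layer n+1.
    f_prime: saved activation derivatives f'(out) from the forward pass.
    W: weight matrix of layer n (rows = inputs, cols = outputs; assumed).
    x_forward: input neurons of layer n saved from the forward pass.
    """
    in_gradient = output_grad_next * f_prime   # main module: elementwise product
    out_gradient = W @ in_gradient             # slave modules + H-tree summation
    dw = np.outer(x_forward, in_gradient)      # dw_ij = x_j * in_gradient_i (transposed layout)
    W -= lr * dw                               # weight update with learning rate
    return out_gradient, W

W = np.random.rand(4, 3)
g, W = reverse_layer_step(np.ones(3), np.full(3, 0.5), W, np.random.rand(4))
print(g.shape, W.shape)   # (4,) (4, 3)
```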
  • For example, an IO instruction is pre-stored at the first address of the instruction cache unit. The controller unit reads this IO instruction from the first address of the instruction cache unit, and according to the translated microinstruction, the direct memory access unit reads
  • from the external address space all the instructions related to the single-layer artificial neural network reverse training and caches them in the instruction cache unit. The controller unit then reads the next IO instruction from the instruction cache unit; according to the translated
  • microinstruction, the direct memory access unit reads all the data required by the main operation module from the external address space into the neuron buffer unit of the main operation module, where the data includes the input neurons, the activation function derivative values from the previous forward operation, and the input gradient vector.
  • The controller unit then reads the next IO instruction from the instruction cache unit;
  • according to the translated microinstruction, the direct memory access unit reads all the weight data and weight gradient data required by the slave operation modules from the external address space, and stores them respectively in the weight buffer units and weight gradient buffer units of the corresponding slave operation modules. The controller unit then reads the next CONFIG instruction from the instruction cache unit, and according to the parameters in the translated microinstruction, the operation units configure the values of their internal registers, including the various constants required by the calculation of this layer of the neural network, the precision of the calculation, and the learning rate used in updating the weights.
  • the controller unit then reads the next COMPUTE instruction from the instruction cache unit and, according to the decoded microinstruction, the main operation module sends the input gradient vector and the input neurons from the forward operation through the H-tree module to each slave operation module, where they are stored in the neuron cache unit of the slave operation module; according to the microinstruction decoded from the COMPUTE instruction, the operation unit of each slave operation module reads the weight vector (i.e., the partial columns of the weight matrix stored by that slave module) from the weight buffer unit, completes the vector-multiply-scalar operation of the weight vector and the input gradient vector, and returns the output vector partial sums through the H-tree; meanwhile, the slave operation module multiplies the input gradient vector by the input neurons to obtain the weight gradient, which is stored in the weight gradient buffer unit.
  • in the H-tree module, the output gradient partial sums returned by each slave operation module are added pairwise, stage by stage, to obtain the complete output gradient vector; the main operation module obtains the return value of the H-tree module and, according to the microinstruction decoded from the COMPUTE instruction, reads the activation function derivative values from the forward operation from the neuron cache unit, multiplies them by the returned output gradient vector to obtain the input gradient vector for the next layer of reverse training, and writes it back to the neuron buffer unit.
  • the controller unit then reads the next COMPUTE instruction from the instruction cache unit and, according to the decoded microinstruction, each slave operation module reads the weight w from the weight buffer unit, reads the weight gradient dw and the weight gradient dw' used in the previous weight update from the weight gradient buffer unit, and updates the weight w; the controller unit then reads the next IO instruction from the instruction cache unit and, according to the decoded microinstruction, the direct memory access unit stores the output gradient vector in the neuron cache unit to the specified address in the external address space, and the operation ends.
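  • Summarizing the flow above, the following is a hedged pseudocode listing of the instruction sequence (the mnemonics IO, CONFIG, and COMPUTE follow the description; the operand summaries are illustrative simplifications, not the exact microinstruction encoding):

```python
# Hypothetical sketch of the single-layer reverse-training instruction sequence.
program = [
    ("IO",      "load all reverse-training instructions into the instruction cache"),
    ("IO",      "load input neurons, activation derivatives, input gradient into the main module"),
    ("IO",      "load weights and weight gradients into each slave module"),
    ("CONFIG",  "set layer constants, calculation precision, and the weight-update learning rate"),
    ("COMPUTE", "broadcast the input gradient; slaves compute partial output gradients and dw"),
    ("COMPUTE", "slaves update weights using w, dw, and the previous update dw'"),
    ("IO",      "store the output gradient vector back to external memory"),
]
for opcode, effect in program:
    print(f"{opcode:8s} {effect}")
```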
  • for a multi-layer artificial neural network, the implementation process is similar to that of a single-layer network: after the previous layer finishes executing, the operation instruction of the next layer takes the output gradient vector calculated in the main operation module as the input gradient vector for the next layer's training and performs the above calculation process again, with the weight address and weight gradient address in the instruction changed to the addresses corresponding to that layer.
  • support for the reverse training of multi-layer artificial neural networks is thereby effectively improved; by using dedicated on-chip buffers for multi-layer neural network reverse training, the reusability of the input neurons and weight data is fully exploited, which avoids repeatedly reading these data from memory, reduces the memory access bandwidth, and prevents memory bandwidth from becoming a performance bottleneck for multi-layer artificial neural network training.
  • Step S206: acquire a target original image of the first resolution, and compress the target original image based on the compressed neural network model to obtain a target compressed image of the second resolution.
  • the target original image is an image whose type matches the tag information of the training images (i.e., an image belonging to the same data set); if the loss function converges to the first threshold, or the training count is greater than or equal to the second threshold, the compressed neural network has completed training, and the target original image can be input directly into the compressed neural network for image compression to obtain the target compressed image, which can then be recognized by the recognition neural network.
  • the method further includes: identifying the target compressed image based on the recognition neural network model to obtain the tag information of the target original image, and storing that tag information; in this way the compressed image is identified by the recognition neural network model, which improves efficiency and accuracy over manual identification of tag information.
  • Step S207: update the target model according to the loss function to obtain an updated model, use the updated model as the target model, use the next training image as the original image, and execute step S202.
  • the loss function is obtained from the reference tag value produced by the trained recognition neural network model and the target tag value carried by the original image; when the loss function satisfies the preset condition, or the current training count of the compressed neural network exceeds the preset threshold, training is completed; otherwise, the weights are repeatedly adjusted by training the compressed neural network, that is, the image content represented by each pixel of the same image is adjusted so as to reduce the loss of the compressed neural network; the compressed neural network model obtained through training is then used for image compression, which improves the effectiveness of image compression and thus the accuracy of recognition.
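  • The loop below is a minimal, hypothetical sketch of this training flow (the callable signatures, the thresholds passed as arguments, and the convergence test `loss <= first_threshold` are illustrative assumptions, not the exact configuration of this embodiment):

```python
from typing import Any, Callable, Iterable, Tuple

def train_compression_network(
    compress: Callable[[Any], Any],        # target model: first resolution -> second resolution
    recognize: Callable[[Any], Any],       # trained recognition network model (held fixed)
    update: Callable[[float], None],       # adjusts the compression weights from the loss
    dataset: Iterable[Tuple[Any, Any]],    # (original_image, target_tag_information) pairs
    loss_fn: Callable[[Any, Any], float],
    first_threshold: float,
    second_threshold: int,
) -> None:
    """Hypothetical sketch of the compressed-network training loop described above."""
    training_count = 0
    for original_image, target_tag in dataset:
        compressed_image = compress(original_image)   # compress to the second resolution
        reference_tag = recognize(compressed_image)   # recognition model yields reference tag
        loss = loss_fn(reference_tag, target_tag)
        training_count += 1
        # Training completes when the loss converges to the first threshold or the
        # training count reaches the second threshold
        if loss <= first_threshold or training_count >= second_threshold:
            break
        update(loss)                                  # otherwise keep adjusting the target model
```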
  • FIG. 5J is a schematic structural diagram of an image compression apparatus 300 according to an embodiment of the present disclosure.
  • the image compression apparatus 300 includes a processor 301 and a memory 302.
  • the memory 302 is configured to store the first threshold, the second threshold, the current neural network model and training count of the compressed neural network, the compressed training atlas of the compressed neural network, the tag information of each training image in the compressed training atlas, the recognition neural network model, and the compressed neural network model, with the current neural network model of the compressed neural network serving as the target model; the compressed neural network model is the target model obtained when training of the compressed neural network is completed, and the recognition neural network model is the corresponding neural network model obtained when training of the recognition neural network is completed.
  • the processor 301 is configured to: acquire an original image of the first resolution, where the original image is any training image in the compressed training atlas, and use the tag information of the original image as the target tag information; compress the original image to obtain a compressed image of the second resolution, the second resolution being smaller than the first resolution; identify the compressed image based on the recognition neural network model to obtain reference tag information; obtain the loss function according to the target tag information and the reference tag information; when the loss function converges to the first threshold, or the training count is greater than or equal to the second threshold, acquire the target original image of the first resolution and confirm that the target model is the compressed neural network model; and compress the target original image based on the compressed neural network model to obtain the target compressed image of the second resolution.
  • the processor 301 is further configured to: when the loss function does not converge to the first threshold, or when the training count is less than the second threshold, update the target model according to the loss function to obtain an updated model, use the updated model as the target model, use the next training image as the original image, and perform the step of acquiring an original image of the first resolution.
  • the processor 301 is specifically configured to preprocess the compressed image to obtain an image to be identified, and to identify the image to be identified based on the recognition neural network model to obtain the reference tag information.
  • the preprocessing includes size processing; the memory 302 is further configured to store the basic image size of the recognition neural network; and the processor 301 is specifically configured to: when the image size of the compressed image is smaller than the basic image size, pad the compressed image with pixels according to the basic image size to obtain the image to be identified, as sketched below.
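  • A minimal sketch of this size processing, assuming the padding fills the right and bottom borders with zero-valued pixels (the fill position and fill value are illustrative assumptions):

```python
import numpy as np

def pad_to_basic_size(compressed: np.ndarray, basic_h: int, basic_w: int) -> np.ndarray:
    """Pad a compressed image up to the recognition network's basic image size."""
    h, w = compressed.shape[:2]
    if h >= basic_h and w >= basic_w:
        return compressed  # already at least the basic size; no padding needed
    pad_h = max(basic_h - h, 0)
    pad_w = max(basic_w - w, 0)
    # Fill the missing rows/columns with zero pixels (assumed fill value)
    pad_spec = ((0, pad_h), (0, pad_w)) + ((0, 0),) * (compressed.ndim - 2)
    return np.pad(compressed, pad_spec, mode="constant", constant_values=0)
```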
  • the compressed training atlas includes at least a recognition training atlas; the processor 301 is further configured to train the recognition neural network with the recognition training atlas to obtain the recognition neural network model, where each training image in the recognition training atlas includes at least tag information consistent with the type of the target tag information.
  • the processor 301 is further configured to identify the target compressed image based on the recognition neural network model to obtain the tag information of the target original image; the memory 302 is also configured to store the tag information of the target original image.
  • the compressed training atlas includes multiple dimensions; the processor 301 is specifically configured to: identify the original image based on the target model to obtain multiple pieces of image information, each dimension corresponding to one piece of image information; and compress the original image based on the target model and the multiple pieces of image information to obtain the compressed image.
  • in this way, the compressed image of the original image is obtained based on the target model, the reference tag information of the compressed image is obtained based on the recognition neural network model, and the loss function is obtained from the target tag information carried by the original image and the reference tag information; when the loss function converges to the first threshold, training of the compressed neural network used for image compression is completed, the target model is used as the compressed neural network model, and the target compressed image of the target original image can be acquired based on the compressed neural network model.
  • the loss function is obtained from the reference tag value produced by the trained recognition neural network model and the target tag value carried by the original image; training is completed as soon as the loss function satisfies the preset condition or the current training count of the compressed neural network exceeds the preset threshold; otherwise, the weights are repeatedly adjusted by training the compressed neural network, that is, the image content represented by each pixel in the same image is adjusted, which reduces the loss of the compressed neural network and improves the effectiveness of image compression, thereby helping to improve the accuracy of recognition.
  • an electronic device 400 is provided.
  • the electronic device 400 includes an image compression device 300.
  • the electronic device 400 includes a processor 401, a memory 402, a communication interface 403, and one or more programs 404, where the one or more programs 404 are stored in the memory 402 and configured to be executed by the processor 401, and the programs include instructions for performing some or all of the steps described in the image compression method above.
  • each of the above units or modules may be a circuit, including a digital circuit, an analog circuit, and the like.
  • physical implementations of the various unit or module structures described above include, but are not limited to, physical devices such as transistors, memristors, and the like.
  • the above chip or the above neural network processor may be any suitable hardware processor such as a CPU, GPU, FPGA, DSP, ASIC, and the like.
  • the storage unit may be any suitable magnetic storage medium or magneto-optical storage medium, such as Resistive Random Access Memory (RRAM), Dynamic Random Access Memory (DRAM), Static Random Access Memory (SRAM), Enhanced Dynamic Random Access Memory (EDRAM), High Bandwidth Memory (HBM), Hybrid Memory Cube (HMC), and so on.
  • This application can be used in numerous general purpose or special purpose computing system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and distributed computing environments including any of the above systems or devices.
  • the present application provides a chip that includes the foregoing computing device, which is capable of performing multiple operations on weights and input neurons simultaneously, thereby achieving diversification of operations; in addition, by using dedicated on-chip caches for the multi-layer artificial neural network operation algorithm, the reusability of the input neurons and weight data is fully exploited, which avoids repeatedly reading these data from memory, reduces the memory access bandwidth, and prevents memory bandwidth from becoming a performance bottleneck for multi-layer artificial neural network operations and their training algorithms.
  • an embodiment of the present invention provides a chip package structure including the above neural network processor.
  • an embodiment of the present invention provides a board that includes the above chip package structure.
  • an embodiment of the present invention provides an electronic device that includes the above board.
  • the above electronic devices include, but are not limited to, robots, computers, printers, scanners, tablets, smart terminals, mobile phones, driving recorders, navigators, sensors, webcams, cloud servers, cameras, video cameras, projectors, watches, headphones, mobile storage, wearable devices, vehicles, household appliances, and medical equipment.
  • the vehicles include airplanes, ships, and/or cars; the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lamps, gas stoves, and range hoods; and the medical equipment includes nuclear magnetic resonance instruments, B-mode ultrasound scanners, and/or electrocardiographs.
  • the disclosed terminal and method may be implemented in other manners.
  • the device embodiments described above are merely illustrative.
  • the division of the above units is only a logical function division; in actual implementation there may be other division manners, for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
  • the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, or an electrical, mechanical, or other form of connection.
  • the units described above as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units; some or all of the units may be selected according to actual needs to achieve the objectives of the embodiments of the present invention.
  • each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • the above integrated unit can be implemented in the form of hardware or in the form of a software functional unit.
  • if the above integrated unit is implemented in the form of a software functional unit and sold or used as a stand-alone product, it may be stored in a computer readable storage medium; based on this understanding, the essence of the technical solution of the present invention, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium.
  • the software product includes a number of instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the various embodiments of the present invention.
  • the foregoing storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.


Abstract

Provided are a processing method and apparatus. The method involves: quantizing weights and input neurons separately to determine a weight dictionary, a weight codebook, a neuron dictionary, and a neuron codebook; and determining an operation codebook according to the weight codebook and the neuron codebook. Because the operation codebook is determined from the quantized data and the two types of quantized data are combined, data processing is facilitated.

Description

Processing method and apparatus

Technical field

The present application relates to the field of data processing, and in particular to a processing method and apparatus, and an operation method and apparatus.

Background

Neural networks have been applied with great success. However, the large-scale parameters and large-scale computation of neural networks pose a huge challenge to their application. On the one hand, large-scale parameters place high demands on storage capacity and lead to substantial memory access energy consumption. On the other hand, large-scale computation places high demands on the design of the operation unit and leads to substantial computational energy consumption. Therefore, how to reduce the parameters and the amount of computation of a neural network has become an urgent problem to be solved.

Summary of the invention

The purpose of the present application is to provide a processing method and apparatus, and an operation method and apparatus, so as to solve at least one of the above technical problems.

In an aspect of the present application, a processing method is provided, including:

quantizing the weights and the input neurons separately, to determine a weight dictionary, a weight codebook, a neuron dictionary, and a neuron codebook; and

determining an operation codebook according to the weight codebook and the neuron codebook.
In a possible embodiment of the present application, quantizing the weights includes the steps of:

grouping the weights, performing a clustering operation on each group of weights with a clustering algorithm, and dividing each group of weights into m classes, where m is a positive integer and each class of weights corresponds to one weight index, to determine the weight dictionary, where the weight dictionary includes weight positions and weight indices, and a weight position refers to the position of a weight in the neural network structure; and

replacing all the weights of each class with one center weight, to determine the weight codebook, where the weight codebook includes weight indices and center weights.
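As an illustration, the following minimal sketch performs the weight quantization just described under simplifying assumptions (a single weight group, a small hand-rolled K-means, and NumPy arrays; none of these choices are mandated by the present application):

```python
import numpy as np

def quantize_weights(weights: np.ndarray, m: int, iters: int = 20, seed: int = 0):
    """Cluster one group of weights into m classes; return the weight dictionary
    (weight position -> weight index) and the weight codebook (index -> center weight)."""
    rng = np.random.default_rng(seed)
    flat = weights.ravel()
    centers = rng.choice(flat, size=m, replace=False)   # initial center weights
    for _ in range(iters):
        # Assign every weight to the nearest center (its weight index)
        idx = np.abs(flat[:, None] - centers[None, :]).argmin(axis=1)
        for k in range(m):
            if np.any(idx == k):
                centers[k] = flat[idx == k].mean()      # minimizes sum (w_i - w0)^2
    weight_dictionary = idx.reshape(weights.shape)      # weight position -> weight index
    weight_codebook = centers                           # weight index -> center weight
    return weight_dictionary, weight_codebook
```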
In a possible embodiment of the present application, quantizing the input neurons includes the steps of:

dividing the input neurons into p segments, where each segment of input neurons corresponds to one neuron range and one neuron index, to determine the neuron dictionary, where p is a positive integer; and

encoding the input neurons, and replacing all the input neurons of each segment with one center neuron, to determine the neuron codebook.
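Similarly, a hedged sketch of the input-neuron quantization, assuming the p segments are equal-width intervals over the observed neuron value range and the center neuron is the segment midpoint (both illustrative assumptions):

```python
import numpy as np

def quantize_neurons(neurons: np.ndarray, p: int):
    """Divide the input-neuron value range into p segments; return the neuron
    indices for each input neuron and the neuron codebook (index -> center neuron)."""
    lo, hi = neurons.min(), neurons.max()
    edges = np.linspace(lo, hi, p + 1)                     # p neuron ranges
    idx = np.clip(np.digitize(neurons, edges[1:-1]), 0, p - 1)
    centers = (edges[:-1] + edges[1:]) / 2                 # one center neuron per segment
    return idx, centers                                    # neuron indices, neuron codebook
```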
In a possible embodiment of the present application, determining the operation codebook specifically includes the steps of:

determining the corresponding weight index in the weight codebook according to the weight, and then determining the center weight corresponding to that weight through the weight index;

determining the corresponding neuron index in the neuron codebook according to the input neuron, and then determining the center neuron corresponding to that input neuron through the neuron index; and

performing the operation on the center weight and the center neuron to obtain the operation result, and composing the operation results into a matrix, thereby determining the operation codebook.
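Combining the two codebooks, the operation codebook can be sketched as a precomputed table, here assuming multiplication as the operation (addition and pooling follow the same pattern):

```python
import numpy as np

def build_operation_codebook(weight_codebook: np.ndarray,
                             neuron_codebook: np.ndarray) -> np.ndarray:
    """Precompute every center-weight x center-neuron product into an m x p matrix;
    entry [i, j] is the operation result for weight index i and neuron index j."""
    return np.outer(weight_codebook, neuron_codebook)

# At run time a table read then replaces the multiplication:
# out = op_codebook[weight_index, neuron_index]
```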
In a possible embodiment of the present application, the operation includes at least one of the following: addition, multiplication, and pooling, where pooling includes average pooling, maximum pooling, and median pooling.

In a possible embodiment of the present application, the method further includes the step of retraining the weights and the input neurons; during retraining, only the weight codebook and the neuron codebook are trained, while the contents of the weight dictionary and the neuron dictionary remain unchanged, and the retraining uses a back-propagation algorithm.

In a possible embodiment of the present application, grouping the weights includes:

grouping into one group, in which all the weights in the neural network are grouped into one group;

layer-type grouping, in which the weights of all convolutional layers, the weights of all fully connected layers, and the weights of all long short-term memory (LSTM) network layers in the neural network are each divided into one group;

inter-layer grouping, in which the weights of one or more convolutional layers, the weights of one or more fully connected layers, and the weights of one or more LSTM network layers in the neural network are each divided into one group; and

intra-layer grouping, in which the weights within one layer of the neural network are segmented, and each segmented part is divided into one group.

In a possible embodiment of the present application, the clustering algorithm includes K-means, K-medoids, Clara, and/or Clarans.
In a possible embodiment of the present application, the method for selecting the center weight corresponding to each class includes determining the value of w0 that minimizes the cost function J(w, w0); that value of w0 is the center weight, where

J(w, w0) = Σ_{i=1}^{n} (w_i − w0)²

in which J(w, w0) is the cost function, w denotes all the weights in the class, w0 is the center weight, n is the number of weights in the class, w_i is the i-th weight in the class, 1 ≤ i ≤ n, and i is a positive integer.
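As a brief aside, minimizing this quadratic cost has a closed form: setting the derivative with respect to w0 to zero shows that the center weight is the mean of the weights in the class:

```latex
\frac{\partial J}{\partial w_0} = -2\sum_{i=1}^{n}(w_i - w_0) = 0
\quad\Longrightarrow\quad
w_0 = \frac{1}{n}\sum_{i=1}^{n} w_i
```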
In another aspect of the present application, a processing apparatus is provided, including:

a memory for storing operation instructions; and

a processor for executing the operation instructions in the memory, operating in accordance with the foregoing processing method when the operation instructions are executed.

In a possible embodiment of the present application, the operation instruction is a binary number including an operation code and an address code, where the operation code indicates the operation the processor is about to perform, and the address code indicates the address in the memory from which the processor reads the data participating in the operation.

In still another aspect of the present application, an operation apparatus is provided, including:

an instruction control unit for decoding received instructions and generating lookup control information; and

a lookup table unit for looking up output neurons from the operation codebook according to the lookup control information and the received weight dictionary, neuron dictionary, operation codebook, weights, and input neurons.

In a possible embodiment of the present application, the weight dictionary includes weight positions and weight indices; the neuron dictionary includes input neurons and neuron indices; and the operation codebook includes weight indices, neuron indices, and the operation results of input neurons and weights.
In a possible embodiment of the present application, the operation apparatus further includes:

a preprocessing unit for preprocessing externally input information to obtain the weights, input neurons, instructions, weight dictionary, neuron dictionary, and operation codebook;

a storage unit for storing the input neurons, weights, weight dictionary, neuron dictionary, operation codebook, and instructions, and for receiving output neurons;

a cache unit for caching the instructions, input neurons, weights, weight indices, neuron indices, and output neurons; and

a direct memory access unit for reading and writing data or instructions between the storage unit and the cache unit.

In a possible embodiment of the present application, the cache unit includes:

an instruction cache for caching the instructions and outputting the cached instructions to the instruction control unit;

a weight cache for caching the weights;

an input neuron cache for caching the input neurons; and

an output neuron cache for caching the output neurons output by the lookup table unit.
In a possible embodiment of the present application, the cache unit further includes:

a weight index cache for caching the weight indices; and

a neuron index cache for caching the neuron indices.

In a possible embodiment of the present application, when preprocessing the externally input information, the preprocessing unit is specifically used for segmentation, Gaussian filtering, binarization, regularization, and/or normalization.
In a possible embodiment of the present application, the lookup table unit includes:

a multiplication lookup table, used to input a weight index in1 and a neuron index in2 and complete, through the table lookup operation mult_lookup, the multiplication of the center weight data1 corresponding to the weight index and the center neuron data2 corresponding to the neuron index; that is, the table lookup operation out = mult_lookup(in1, in2) completes the multiplication function out = data1 * data2; and/or

an addition lookup table, used to complete, according to an input index in and through a stepwise addition lookup table via the table lookup operation add_lookup, the addition of the center data corresponding to the index, where in and data are vectors of length N and N is a positive integer, that is, the table lookup operation out = add_lookup(in) completes the addition function out = data[1] + data[2] + ... + data[N]; and/or used to input a weight index in1 and a neuron index in2 and complete, through the table lookup operation, the addition of the center weight data1 corresponding to the weight index and the center neuron data2 corresponding to the neuron index, that is, the table lookup operation out = add_lookup(in1, in2) completes the addition function out = data1 + data2; and/or

a pooling lookup table, used to complete the pooling operation on the center data corresponding to an input index; that is, the table lookup operation out = pool_lookup(in) completes the pooling operation out = pool(data), where the pooling operation includes average pooling, maximum pooling, and median pooling.
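For illustration, the multiplication lookup can be sketched as a single table read, assuming the two-dimensional operation codebook built earlier (the table layout is an illustrative assumption):

```python
import numpy as np

def mult_lookup(op_codebook: np.ndarray, in1: int, in2: int) -> float:
    """out = mult_lookup(in1, in2): return the precomputed product data1 * data2
    for weight index in1 and neuron index in2."""
    return op_codebook[in1, in2]

# Usage: with op_codebook = np.outer(weight_codebook, neuron_codebook),
# mult_lookup(op_codebook, 3, 7) returns weight_codebook[3] * neuron_codebook[7]
# without performing any multiplication at inference time.
```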
In a possible embodiment of the present application, the instructions are neural network dedicated instructions, which include:

control instructions for controlling the neural network execution process;

data transfer instructions for completing data transfer between different storage media, with data formats including matrices, vectors, and scalars;

operation instructions for completing the arithmetic operations of the neural network, including matrix operation instructions, vector operation instructions, scalar operation instructions, convolutional neural network operation instructions, fully connected neural network operation instructions, pooling neural network operation instructions, RBM neural network operation instructions, LRN neural network operation instructions, LCN neural network operation instructions, LSTM neural network operation instructions, RNN neural network operation instructions, RELU neural network operation instructions, PRELU neural network operation instructions, SIGMOID neural network operation instructions, TANH neural network operation instructions, and MAXOUT neural network operation instructions; and

logic instructions for completing the logical operations of the neural network, including vector logic operation instructions and scalar logic operation instructions.

In a possible embodiment of the present application, the neural network dedicated instructions include at least one Cambricon instruction, where a Cambricon instruction includes an operation code and operands, and the Cambricon instructions include:

Cambricon control instructions for controlling the execution process, the Cambricon control instructions including jump instructions and conditional branch instructions;

Cambricon data transfer instructions for completing data transfer between different storage media, including load instructions, store instructions, and move instructions, where the load instructions load data from main memory into the cache, the store instructions store data from the cache into main memory, and the move instructions move data between caches, between a cache and a register, or between registers;

Cambricon operation instructions for completing neural network arithmetic operations, including Cambricon matrix operation instructions, Cambricon vector operation instructions, and Cambricon scalar operation instructions, where the Cambricon matrix operation instructions complete matrix operations in the neural network, including matrix-multiply-vector, vector-multiply-matrix, matrix-multiply-scalar, outer product, matrix-add-matrix, and matrix-subtract-matrix operations; the Cambricon vector operation instructions complete vector operations in the neural network, including vector elementary arithmetic, vector transcendental functions, inner product, vector random generation, and maximum/minimum of a vector; and the Cambricon scalar operation instructions complete scalar operations in the neural network, including scalar elementary arithmetic and scalar transcendental functions; and

Cambricon logic instructions for the logical operations of the neural network, the Cambricon logic instructions including Cambricon vector logic operation instructions and Cambricon scalar logic operation instructions, where the Cambricon vector logic operation instructions are used for vector comparison, vector logical operations, and vector greater-than-merge operations, the vector logical operations including AND, OR, and NOT, and the Cambricon scalar logic operation instructions are used for scalar comparison and scalar logical operations.

In a possible embodiment of the present application, the Cambricon data transfer instructions support one or more of the following data organization modes: matrix, vector, and scalar; the vector elementary arithmetic includes vector addition, subtraction, multiplication, and division; the vector transcendental functions are functions that do not satisfy any polynomial equation taking polynomials as coefficients, including exponential, logarithmic, trigonometric, and inverse trigonometric functions; the scalar elementary arithmetic includes scalar addition, subtraction, multiplication, and division; the scalar transcendental functions are functions that do not satisfy any polynomial equation taking polynomials as coefficients, including exponential, logarithmic, trigonometric, and inverse trigonometric functions; the vector comparison includes greater than, less than, equal to, greater than or equal to, less than or equal to, and not equal to; the vector logical operations include AND, OR, and NOT; the scalar comparison includes greater than, less than, equal to, greater than or equal to, less than or equal to, and not equal to; and the scalar logical operations include AND, OR, and NOT.
In yet another aspect of the present application, another operation method is provided, including:

receiving weights, input neurons, instructions, a weight dictionary, a neuron dictionary, and an operation codebook;

decoding the instructions to determine lookup control information; and

looking up output neurons in the operation codebook according to the lookup control information, the weights, the weight dictionary, the neuron dictionary, and the input neurons.

In a possible embodiment of the present application, the weight dictionary includes weight positions and weight indices; the neuron dictionary includes input neurons and neuron indices; and the operation codebook includes weight indices, neuron indices, and the operation results of weights and input neurons.

In a possible embodiment of the present application, looking up output neurons in the operation codebook according to the lookup control information, the weights, and the input neurons includes the steps of:

according to the weights, input neurons, weight dictionary, and neuron dictionary, determining the neuron index by determining the neuron range in the neuron dictionary, and determining the weight index by determining the weight position in the weight dictionary; and

looking up the operation result in the operation codebook according to the weight index and the neuron index, to determine the output neuron.
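A hedged sketch of this two-step lookup, reusing the quantization outputs sketched earlier (the argument layout, in particular passing the segment boundaries as `neuron_edges`, is an illustrative assumption):

```python
import numpy as np

def lookup_output_neuron(weight_dictionary, neuron_edges, op_codebook,
                         weight_position, input_neuron):
    """Step 1: the dictionaries map a weight position and a raw neuron value to indices.
    Step 2: the operation codebook returns the precomputed result for that index pair."""
    w_idx = weight_dictionary[weight_position]                 # weight position -> weight index
    n_idx = int(np.clip(np.digitize(input_neuron, neuron_edges[1:-1]),
                        0, len(neuron_edges) - 2))             # neuron range -> neuron index
    return op_codebook[w_idx, n_idx]                           # operation result lookup
```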
In a possible embodiment of the present application, the operation result includes the result of at least one of the following operations: addition, multiplication, and pooling, where pooling includes average pooling, maximum pooling, and median pooling.

In a possible embodiment of the present application, before receiving the weights, input neurons, instructions, weight dictionary, neuron dictionary, and operation codebook, the method further includes the step of preprocessing externally input information to obtain the weights, input neurons, instructions, weight dictionary, neuron dictionary, and operation codebook; and

after receiving the weights, input neurons, instructions, weight dictionary, neuron dictionary, and operation codebook, the method further includes the steps of: storing the weights, input neurons, instructions, weight dictionary, neuron dictionary, and operation codebook, and receiving output neurons; and caching the instructions, input neurons, weights, and output neurons.

In a possible embodiment of the present application, after receiving the weights, input neurons, instructions, weight dictionary, neuron dictionary, and operation codebook, the method further includes the step of caching the weight indices and the neuron indices.

In a possible embodiment of the present application, the preprocessing includes segmentation, Gaussian filtering, binarization, regularization, and/or normalization.
In a possible embodiment of the present application, the instructions are neural network dedicated instructions, which include:

control instructions for controlling the neural network execution process;

data transfer instructions for completing data transfer between different storage media, with data formats including matrices, vectors, and scalars;

operation instructions for completing the arithmetic operations of the neural network, including matrix operation instructions, vector operation instructions, scalar operation instructions, convolutional neural network operation instructions, fully connected neural network operation instructions, pooling neural network operation instructions, RBM neural network operation instructions, LRN neural network operation instructions, LCN neural network operation instructions, LSTM neural network operation instructions, RNN neural network operation instructions, RELU neural network operation instructions, PRELU neural network operation instructions, SIGMOID neural network operation instructions, TANH neural network operation instructions, and MAXOUT neural network operation instructions; and

logic instructions for completing the logical operations of the neural network, including vector logic operation instructions and scalar logic operation instructions.

In a possible embodiment of the present application, the neural network dedicated instructions include at least one Cambricon instruction, where a Cambricon instruction includes an operation code and operands, and the Cambricon instructions include:

Cambricon control instructions for controlling the execution process, the Cambricon control instructions including jump instructions and conditional branch instructions;

Cambricon data transfer instructions for completing data transfer between different storage media, including load instructions, store instructions, and move instructions, where the load instructions load data from main memory into the cache, the store instructions store data from the cache into main memory, and the move instructions move data between caches, between a cache and a register, or between registers;

Cambricon operation instructions for completing neural network arithmetic operations, including Cambricon matrix operation instructions, Cambricon vector operation instructions, and Cambricon scalar operation instructions, where the Cambricon matrix operation instructions complete matrix operations in the neural network, including matrix-multiply-vector, vector-multiply-matrix, matrix-multiply-scalar, outer product, matrix-add-matrix, and matrix-subtract-matrix operations; the Cambricon vector operation instructions complete vector operations in the neural network, including vector elementary arithmetic, vector transcendental functions, inner product, vector random generation, and maximum/minimum of a vector; and the Cambricon scalar operation instructions complete scalar operations in the neural network, including scalar elementary arithmetic and scalar transcendental functions; and

Cambricon logic instructions for the logical operations of the neural network, the Cambricon logic instructions including Cambricon vector logic operation instructions and Cambricon scalar logic operation instructions, where the Cambricon vector logic operation instructions are used for vector comparison, vector logical operations, and vector greater-than-merge operations, the vector logical operations including AND, OR, and NOT, and the Cambricon scalar logic operation instructions are used for scalar comparison and scalar logical operations.

In a possible embodiment of the present application, the Cambricon data transfer instructions support one or more of the following data organization modes: matrix, vector, and scalar; the vector elementary arithmetic includes vector addition, subtraction, multiplication, and division; the vector transcendental functions are functions that do not satisfy any polynomial equation taking polynomials as coefficients, including exponential, logarithmic, trigonometric, and inverse trigonometric functions; the scalar elementary arithmetic includes scalar addition, subtraction, multiplication, and division; the scalar transcendental functions are functions that do not satisfy any polynomial equation taking polynomials as coefficients, including exponential, logarithmic, trigonometric, and inverse trigonometric functions; the vector comparison includes greater than, less than, equal to, greater than or equal to, less than or equal to, and not equal to; the vector logical operations include AND, OR, and NOT; the scalar comparison includes greater than, less than, equal to, greater than or equal to, less than or equal to, and not equal to; and the scalar logical operations include AND, OR, and NOT.
In still another aspect of the present application, another operation apparatus is provided, the operation apparatus including:

an instruction control unit for decoding received instructions and generating lookup control information; and

a lookup table unit for looking up output neurons from the operation codebook according to the lookup control information and the received weight dictionary, neuron dictionary, operation codebook, weights, and input neurons.

In a possible embodiment of the present application, the weight dictionary includes weight positions and weight indices; the neuron dictionary includes input neurons and neuron indices; and the operation codebook includes weight indices, neuron indices, and the operation results of input neurons and weights.

In a possible embodiment of the present application, the operation apparatus further includes:

a preprocessing unit for preprocessing externally input information to obtain the weights, input neurons, instructions, weight dictionary, neuron dictionary, and operation codebook;

a storage unit for storing the input neurons, weights, weight dictionary, neuron dictionary, operation codebook, and instructions, and for receiving output neurons;

a cache unit for caching the instructions, input neurons, weights, weight indices, neuron indices, and output neurons; and

a direct memory access unit for reading and writing data or instructions between the storage unit and the cache unit.
In a possible embodiment of the present application, the cache unit includes:

an instruction cache for caching the instructions and outputting the cached instructions to the instruction control unit;

a weight cache for caching the weights;

an input neuron cache for caching the input neurons; and

an output neuron cache for caching the output neurons output by the lookup table unit.

In a possible embodiment of the present application, the cache unit further includes:

a weight index cache for caching the weight indices; and

a neuron index cache for caching the neuron indices.

In a possible embodiment of the present application, the preprocessing performed by the preprocessing unit on the externally input information includes segmentation, Gaussian filtering, binarization, regularization, and/or normalization.
In a possible embodiment of the present application, the lookup table unit includes:

a multiplication lookup table, used to input a weight index in1 and a neuron index in2 and complete, through the table lookup operation mult_lookup, the multiplication of the center weight data1 corresponding to the weight index and the center neuron data2 corresponding to the neuron index; that is, the table lookup operation out = mult_lookup(in1, in2) completes the multiplication function out = data1 * data2; and/or

an addition lookup table, used to complete, according to an input index in and through a stepwise addition lookup table via the table lookup operation add_lookup, the addition of the center data corresponding to the index, where in and data are vectors of length N and N is a positive integer, that is, the table lookup operation out = add_lookup(in) completes the addition function out = data[1] + data[2] + ... + data[N]; and/or used to input a weight index in1 and a neuron index in2 and complete, through the table lookup operation, the addition of the center weight data1 corresponding to the weight index and the center neuron data2 corresponding to the neuron index, that is, the table lookup operation out = add_lookup(in1, in2) completes the addition function out = data1 + data2; and/or

a pooling lookup table, used to complete the pooling operation on the center data corresponding to an input index; that is, the table lookup operation out = pool_lookup(in) completes the pooling operation out = pool(data), where the pooling operation includes average pooling, maximum pooling, and median pooling.
In a possible embodiment of the present application, the instructions are neural network dedicated instructions, which include:

control instructions for controlling the neural network execution process;

data transfer instructions for completing data transfer between different storage media, with data formats including matrices, vectors, and scalars;

operation instructions for completing the arithmetic operations of the neural network, including matrix operation instructions, vector operation instructions, scalar operation instructions, convolutional neural network operation instructions, fully connected neural network operation instructions, pooling neural network operation instructions, RBM neural network operation instructions, LRN neural network operation instructions, LCN neural network operation instructions, LSTM neural network operation instructions, RNN neural network operation instructions, RELU neural network operation instructions, PRELU neural network operation instructions, SIGMOID neural network operation instructions, TANH neural network operation instructions, and MAXOUT neural network operation instructions; and

logic instructions for completing the logical operations of the neural network, including vector logic operation instructions and scalar logic operation instructions.

In a possible embodiment of the present application, the neural network dedicated instructions include at least one Cambricon instruction, where a Cambricon instruction includes an operation code and operands, and the Cambricon instructions include:

Cambricon control instructions for controlling the execution process, the Cambricon control instructions including jump instructions and conditional branch instructions;

Cambricon data transfer instructions for completing data transfer between different storage media, including load instructions, store instructions, and move instructions, where the load instructions load data from main memory into the cache, the store instructions store data from the cache into main memory, and the move instructions move data between caches, between a cache and a register, or between registers;

Cambricon operation instructions for completing neural network arithmetic operations, including Cambricon matrix operation instructions, Cambricon vector operation instructions, and Cambricon scalar operation instructions, where the Cambricon matrix operation instructions complete matrix operations in the neural network, including matrix-multiply-vector, vector-multiply-matrix, matrix-multiply-scalar, outer product, matrix-add-matrix, and matrix-subtract-matrix operations; the Cambricon vector operation instructions complete vector operations in the neural network, including vector elementary arithmetic, vector transcendental functions, inner product, vector random generation, and maximum/minimum of a vector; and the Cambricon scalar operation instructions complete scalar operations in the neural network, including scalar elementary arithmetic and scalar transcendental functions; and

Cambricon logic instructions for the logical operations of the neural network, the Cambricon logic instructions including Cambricon vector logic operation instructions and Cambricon scalar logic operation instructions, where the Cambricon vector logic operation instructions are used for vector comparison, vector logical operations, and vector greater-than-merge operations, the vector logical operations including AND, OR, and NOT, and the Cambricon scalar logic operation instructions are used for scalar comparison and scalar logical operations.

In a possible embodiment of the present application, the Cambricon data transfer instructions support one or more of the following data organization modes: matrix, vector, and scalar; the vector elementary arithmetic includes vector addition, subtraction, multiplication, and division; the vector transcendental functions are functions that do not satisfy any polynomial equation taking polynomials as coefficients, including exponential, logarithmic, trigonometric, and inverse trigonometric functions; the scalar elementary arithmetic includes scalar addition, subtraction, multiplication, and division; the scalar transcendental functions are functions that do not satisfy any polynomial equation taking polynomials as coefficients, including exponential, logarithmic, trigonometric, and inverse trigonometric functions; the vector comparison includes greater than, less than, equal to, greater than or equal to, less than or equal to, and not equal to; the vector logical operations include AND, OR, and NOT; the scalar comparison includes greater than, less than, equal to, greater than or equal to, less than or equal to, and not equal to; and the scalar logical operations include AND, OR, and NOT.
In still another aspect, the present application provides yet another processing method, including:
receiving a weight, an input neuron, an instruction, a weight dictionary, a neuron dictionary, and an operation codebook;
decoding the instruction to determine lookup control information; and
looking up an output neuron in the operation codebook according to the lookup control information, the weight, the weight dictionary, the neuron dictionary, and the input neuron.
In a possible embodiment of the present application, the weight dictionary includes weight positions and weight indices; the neuron dictionary includes input neurons and neuron indices; and the operation codebook includes weight indices, neuron indices, and operation results of weights and input neurons.
In a possible embodiment of the present application, looking up the output neuron in the operation codebook according to the lookup control information, the weight, and the input neuron includes the following steps (a sketch of the lookup follows the steps):
determining, according to the weight, the input neuron, the weight dictionary, and the neuron dictionary, the neuron index by determining the neuron range in the neuron dictionary, and the weight index by determining the weight position in the weight dictionary; and
looking up the operation result in the operation codebook according to the weight index and the neuron index, so as to determine the output neuron.
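As a concrete illustration of the two lookup steps above, the following minimal Python sketch replaces the weight-by-neuron operation with a table lookup. The dictionary layouts (a quantized weight value keyed to its weight index, a neuron value range keyed to its neuron index) and all names are illustrative assumptions, not the patent's concrete data structures.

```python
# Minimal sketch of the dictionary/codebook lookup; all layouts are assumed.

def lookup_output_neuron(weight, input_neuron,
                         weight_dictionary, neuron_dictionary,
                         operation_codebook):
    # weight_dictionary maps a quantized weight value (its "position" here)
    # to a weight index.
    weight_index = weight_dictionary[weight]
    # neuron_dictionary maps a neuron value range (low, high) to a neuron
    # index; the input neuron is matched to the range containing it.
    neuron_index = next(index
                        for (low, high), index in neuron_dictionary.items()
                        if low <= input_neuron < high)
    # operation_codebook holds precomputed results keyed by the index pair,
    # so the weight-by-neuron operation becomes a single table lookup.
    return operation_codebook[(weight_index, neuron_index)]

# Toy usage: the codebook stores precomputed products.
weight_dict = {0.5: 0, -1.0: 1}
neuron_dict = {(0.0, 1.0): 0, (1.0, 2.0): 1}
codebook = {(0, 0): 0.25, (0, 1): 0.75, (1, 0): -0.5, (1, 1): -1.5}
print(lookup_output_neuron(0.5, 1.2, weight_dict, neuron_dict, codebook))  # 0.75
```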
In a possible embodiment of the present application, the operation result includes the result of at least one of the following operations: addition, multiplication, and pooling, where pooling includes average pooling, max pooling, and median pooling.
In a possible embodiment of the present application, before receiving the weight, the input neuron, the instruction, the weight dictionary, the neuron dictionary, and the operation codebook, the method further includes the step of preprocessing externally input information to obtain the weight, the input neuron, the instruction, the weight dictionary, the neuron dictionary, and the operation codebook; and
after receiving the weight, the input neuron, the instruction, the weight dictionary, the neuron dictionary, and the operation codebook, the method further includes the steps of storing the weight, the input neuron, the instruction, the weight dictionary, the neuron dictionary, and the operation codebook, and receiving the output neuron; and caching the instruction, the input neuron, the weight, and the output neuron.
In a possible embodiment of the present application, after receiving the weight, the input neuron, the instruction, the weight dictionary, the neuron dictionary, and the operation codebook, the method further includes the step of caching the weight index and the neuron index.
In a possible embodiment of the present application, the preprocessing includes segmentation, Gaussian filtering, binarization, regularization, and/or normalization.
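The following minimal sketch walks through several of the preprocessing operations named above on a small array. Each step is an illustrative stand-in (assumed smoothing kernel, threshold, and tile size), not the patent's specific algorithm.

```python
import numpy as np

def preprocess(image, block=2):
    # Normalization: scale values into [0, 1].
    x = (image - image.min()) / max(image.max() - image.min(), 1e-8)
    # Gaussian filtering: a fixed 3x3 kernel applied with edge padding.
    kernel = np.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]], dtype=float) / 16
    padded = np.pad(x, 1, mode="edge")
    smoothed = sum(kernel[i, j] * padded[i:i + x.shape[0], j:j + x.shape[1]]
                   for i in range(3) for j in range(3))
    # Binarization: threshold at the mean.
    binary = (smoothed > smoothed.mean()).astype(float)
    # Segmentation: split into block x block tiles.
    h, w = binary.shape
    return [binary[i:i + block, j:j + block]
            for i in range(0, h, block) for j in range(0, w, block)]

tiles = preprocess(np.arange(16.0).reshape(4, 4))
print(len(tiles))  # 4 tiles of shape (2, 2)
```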
In a possible embodiment of the present application, the instruction is a neural network dedicated instruction, and the neural network dedicated instruction includes:
a control instruction for controlling the neural network execution process;
a data transfer instruction for completing data transfer between different storage media, where the data formats include matrix, vector, and scalar;
an operation instruction for completing arithmetic operations of the neural network, including a matrix operation instruction, a vector operation instruction, a scalar operation instruction, a convolutional neural network operation instruction, a fully connected neural network operation instruction, a pooling neural network operation instruction, an RBM neural network operation instruction, an LRN neural network operation instruction, an LCN neural network operation instruction, an LSTM neural network operation instruction, an RNN neural network operation instruction, a RELU neural network operation instruction, a PRELU neural network operation instruction, a SIGMOID neural network operation instruction, a TANH neural network operation instruction, and a MAXOUT neural network operation instruction; and
a logic instruction for completing logical operations of the neural network, including a vector logical operation instruction and a scalar logical operation instruction.
In a possible embodiment of the present application, the neural network dedicated instruction includes at least one Cambricon instruction, the Cambricon instruction includes an operation code and an operand, and the Cambricon instruction includes:
a Cambricon control instruction for controlling the execution process, where the Cambricon control instruction includes a jump instruction and a conditional branch instruction;
a Cambricon data transfer instruction for completing data transfer between different storage media, including a load instruction, a store instruction, and a move instruction, where the load instruction is used to load data from main memory into a cache, the store instruction is used to store data from a cache into main memory, and the move instruction is used to move data between caches, between a cache and a register, or between registers;
a Cambricon operation instruction for completing arithmetic operations of the neural network, including a Cambricon matrix operation instruction, a Cambricon vector operation instruction, and a Cambricon scalar operation instruction, where the Cambricon matrix operation instruction is used to complete matrix operations in the neural network, including matrix-multiply-vector, vector-multiply-matrix, matrix-multiply-scalar, outer product, matrix-add-matrix, and matrix-subtract-matrix operations; the Cambricon vector operation instruction is used to complete vector operations in the neural network, including vector elementary arithmetic, vector transcendental function, inner product, vector random generation, and maximum/minimum-of-a-vector operations; and the Cambricon scalar operation instruction is used to complete scalar operations in the neural network, including scalar elementary arithmetic and scalar transcendental function operations; and
a Cambricon logic instruction for logical operations of the neural network, where the Cambricon logic instruction includes a Cambricon vector logical operation instruction and a Cambricon scalar logical operation instruction; the Cambricon vector logical operation instruction is used for vector compare operations, vector logical operations, and vector-greater-than-merge operations, where the vector logical operations include AND, OR, and NOT; and the Cambricon scalar logical operation instruction is used for scalar compare operations and scalar logical operations.
In a possible embodiment of the present application, the Cambricon data transfer instruction supports one or more of the following data organization modes: matrix, vector, and scalar. The vector elementary arithmetic includes vector addition, subtraction, multiplication, and division; a vector transcendental function is a function that does not satisfy any polynomial equation whose coefficients are polynomials, and includes exponential functions, logarithmic functions, trigonometric functions, and inverse trigonometric functions. The scalar elementary arithmetic includes scalar addition, subtraction, multiplication, and division; a scalar transcendental function is a function that does not satisfy any polynomial equation whose coefficients are polynomials, and includes exponential functions, logarithmic functions, trigonometric functions, and inverse trigonometric functions. The vector compare operations include greater than, less than, equal to, greater than or equal to, less than or equal to, and not equal to; the vector logical operations include AND, OR, and NOT; the scalar compare operations include greater than, less than, equal to, greater than or equal to, less than or equal to, and not equal to; and the scalar logical operations include AND, OR, and NOT.
Neural networks have achieved very successful applications, but large-scale neural network parameters place high demands on storage. On the one hand, a large number of neural network parameters requires an enormous storage capacity. On the other hand, accessing large amounts of neural network data incurs enormous memory-access energy consumption.
At present, the memory that stores neural network parameters is error checking and correcting (ECC, Error Correcting Code) memory. Although ECC memory can correct errors that occur when data is read, it also incurs additional storage capacity overhead and memory-access power overhead. Neural network algorithms have a certain fault tolerance, and storing all parameters of a neural network in ECC memory ignores this fault tolerance, bringing extra storage overhead, computation overhead, and memory-access overhead. Therefore, how to select memory suitable for neural network processing in light of the fault tolerance of neural networks is a problem that urgently needs to be solved.
In still another aspect, the present application provides a storage device, including:
an accurate storage unit for storing important bits of data; and
an inaccurate storage unit for storing non-important bits of the data.
In a possible embodiment of the present application, the accurate storage unit uses ECC memory, and the inaccurate storage unit uses non-ECC memory.
In a possible embodiment of the present application, the data is a neural network parameter, including input neurons, weights, and output neurons; the accurate storage unit is used to store the important bits of the input neurons, the important bits of the output neurons, and the important bits of the weights; and the inaccurate storage unit is used to store the non-important bits of the input neurons, the non-important bits of the output neurons, and the non-important bits of the weights.
In a possible embodiment of the present application, the data includes floating-point data and fixed-point data. For floating-point data, the sign bit and the exponent part are important bits and the mantissa part is non-important bits; for fixed-point data, the sign bit and the first x bits of the numerical part are important bits and the remaining bits of the numerical part are non-important bits, where x is an integer greater than or equal to 0 and less than m, and m is the total number of bits of the data.
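Under the bit partition just stated, an IEEE-754 float32 splits into 9 important bits (sign plus exponent) and 23 non-important mantissa bits. The sketch below shows one assumed way to split and re-merge such a value; the helper names are illustrative.

```python
import struct

# A minimal sketch of the float32 bit partition: sign + exponent are the
# "important" bits, the mantissa is "non-important".

def split_float32(value):
    bits = struct.unpack('<I', struct.pack('<f', value))[0]
    important = bits >> 23                    # sign (1 bit) + exponent (8 bits)
    non_important = bits & ((1 << 23) - 1)    # mantissa (23 bits)
    return important, non_important

def merge_float32(important, non_important):
    bits = (important << 23) | non_important
    return struct.unpack('<f', struct.pack('<I', bits))[0]

hi, lo = split_float32(3.14159)
print(merge_float32(hi, lo))   # ~3.14159 (full float32 round trip)
print(merge_float32(hi, 0))    # 2.0 (magnitude survives even if the mantissa is lost)
```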
In a possible embodiment of the present application, the ECC memory includes ECC-checked DRAM and ECC-checked SRAM, and the ECC-checked SRAM uses 6T SRAM, 4T SRAM, or 3T SRAM.
In a possible embodiment of the present application, the non-ECC memory includes non-ECC-checked DRAM and non-ECC-checked SRAM, and the non-ECC-checked SRAM uses 6T SRAM, 4T SRAM, or 3T SRAM.
In a possible embodiment of the present application, the memory cell storing each bit in the 6T SRAM includes six MOS transistors, the memory cell storing each bit in the 4T SRAM includes four MOS transistors, and the memory cell storing each bit in the 3T SRAM includes three MOS transistors.
In a possible embodiment of the present application, the four MOS transistors include a first MOS transistor, a second MOS transistor, a third MOS transistor, and a fourth MOS transistor, where the first and second MOS transistors are used for gating and the third and fourth MOS transistors are used for storage. The gate of the first MOS transistor is electrically connected to the word line WL and its source to the bit line BL; the gate of the second MOS transistor is electrically connected to the word line WL and its source to the bit line BLB; the gate of the third MOS transistor is connected to the source of the fourth MOS transistor and the drain of the second MOS transistor and is connected to the working voltage through the resistor R2, while the drain of the third MOS transistor is grounded; the gate of the fourth MOS transistor is connected to the source of the third MOS transistor and the drain of the first MOS transistor and is connected to the working voltage through the resistor R1, while the drain of the fourth MOS transistor is grounded. WL is used to control gated access to the memory cell, and BL is used to read and write the memory cell.
In a possible embodiment of the present application, the three MOS transistors include a first MOS transistor, a second MOS transistor, and a third MOS transistor, where the first MOS transistor is used for gating and the second and third MOS transistors are used for storage. The gate of the first MOS transistor is electrically connected to the word line WL and its source to the bit line BL; the gate of the second MOS transistor is connected to the source of the third MOS transistor and is connected to the working voltage through the resistor R2, while the drain of the second MOS transistor is grounded; the gate of the third MOS transistor is connected to the source of the second MOS transistor and the drain of the first MOS transistor and is connected to the working voltage through the resistor R1, while the drain of the third MOS transistor is grounded. WL is used to control gated access to the memory cell, and BL is used to read and write the memory cell.
In still another aspect, the present application provides a data processing device, including:
an operation unit, an instruction control unit, and the above storage device, where the storage device is configured to receive input instructions and operation parameters, store the important bits of the operation parameters together with the instructions in the accurate storage unit, and store the non-important bits of the operation parameters in the inaccurate storage unit; the instruction control unit is configured to receive the instructions in the storage device and decode them to generate control information; and the operation unit is configured to receive the operation parameters in the storage device, perform operations according to the control information, and transfer the operation results to the storage device.
In a possible embodiment of the present application, the operation unit is a neural network processor.
In a possible embodiment of the present application, the operation parameters are neural network parameters, and the operation unit is configured to receive the input neurons and weights in the storage device, complete the neural network operation according to the control information to obtain output neurons, and transfer the output neurons to the storage device.
In a possible embodiment of the present application, the operation unit is configured to receive the important bits of the input neurons and the important bits of the weights in the storage device for computation; or the operation unit is configured to receive complete input neurons and weights obtained by splicing the important bits and the non-important bits for computation.
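The two computation modes just described can be illustrated as follows: one multiply uses only the accurately stored bits (mantissas treated as zero), the other first splices the important and non-important bits back into complete values. The bit partition repeats the float32 assumption from the previous sketch, and all names are illustrative.

```python
import struct

# Sketch of the two computation modes, under the assumed float32 partition
# (sign + exponent accurate, mantissa approximate).

def _bits(v):
    return struct.unpack('<I', struct.pack('<f', v))[0]

def _value(hi, lo):
    return struct.unpack('<f', struct.pack('<I', (hi << 23) | lo))[0]

def compute_important_only(n_hi, w_hi):
    # Multiply using only the accurately stored bits (mantissas zeroed).
    return _value(n_hi, 0) * _value(w_hi, 0)

def compute_spliced(n_hi, n_lo, w_hi, w_lo):
    # Multiply after splicing important and non-important bits back together.
    return _value(n_hi, n_lo) * _value(w_hi, w_lo)

neuron, weight = _bits(0.8), _bits(1.5)
n_hi, n_lo = neuron >> 23, neuron & 0x7FFFFF
w_hi, w_lo = weight >> 23, weight & 0x7FFFFF
print(compute_important_only(n_hi, w_hi))        # coarse: 0.5 * 1.0 = 0.5
print(compute_spliced(n_hi, n_lo, w_hi, w_lo))   # full precision: ~1.2
```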
In a possible embodiment of the present application, the device further includes: an instruction cache, arranged between the storage device and the instruction control unit, for storing dedicated instructions; an input neuron hierarchical cache, arranged between the storage device and the operation unit, for caching input neurons, where the input neuron hierarchical cache includes an accurate input neuron cache and an inaccurate input neuron cache; a weight hierarchical cache, arranged between the storage device and the operation unit, for caching weight data, where the weight hierarchical cache includes an accurate weight cache and an inaccurate weight cache; and an output neuron hierarchical cache, arranged between the storage device and the operation unit, for caching output neurons, where the output neuron hierarchical cache includes an accurate output neuron cache and an inaccurate output neuron cache.
In a possible embodiment of the present application, the device further includes a direct memory access unit (DMA) for reading and writing data or instructions among the storage device, the instruction cache, the weight hierarchical cache, the input neuron hierarchical cache, and the output neuron hierarchical cache.
In a possible embodiment of the present application, the instruction cache, the input neuron hierarchical cache, the weight hierarchical cache, and the output neuron hierarchical cache use 4T SRAM or 3T SRAM.
In a possible embodiment of the present application, the device further includes a preprocessing module for preprocessing input data and transferring it to the storage device, where the preprocessing includes segmentation, Gaussian filtering, binarization, regularization, and normalization.
In a possible embodiment of the present application, the operation unit is a general-purpose operation processor.
In still another aspect, the present application provides an electronic device including the above data processing device.
In still another aspect, the present application provides a storage method, including: accurately storing important bits of data; and inaccurately storing non-important bits of the data.
In a possible embodiment of the present application, accurately storing the important bits of the data specifically includes: extracting the important bits of the data, and storing these important bits in ECC memory for accurate storage.
In a possible embodiment of the present application, inaccurately storing the non-important bits of the data specifically includes: extracting the non-important bits of the data, and storing these non-important bits in non-ECC memory for inaccurate storage.
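A minimal sketch of this storage method follows, under the same float32 bit-partition assumption: the important bits of each parameter go to a simulated ECC region and the non-important bits to a simulated non-ECC region, with two Python dicts standing in for the physical memories.

```python
import struct

# Sketch of the storage method: 9 "important" bits (sign + exponent) to an
# ECC region, 23 mantissa bits to a non-ECC region. Everything is illustrative.

ecc_memory, non_ecc_memory = {}, {}

def store_parameter(addr, value):
    bits = struct.unpack('<I', struct.pack('<f', value))[0]
    ecc_memory[addr] = bits >> 23              # accurate (ECC) storage
    non_ecc_memory[addr] = bits & 0x7FFFFF     # inaccurate (non-ECC) storage

def load_parameter(addr):
    bits = (ecc_memory[addr] << 23) | non_ecc_memory[addr]
    return struct.unpack('<f', struct.pack('<I', bits))[0]

store_parameter(0x10, 0.8125)
print(load_parameter(0x10))   # 0.8125
```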
In a possible embodiment of the present application, the data is a neural network parameter, including input neurons, weights, and output neurons; the important bits of the input neurons, the important bits of the output neurons, and the important bits of the weights are stored accurately, while the non-important bits of the input neurons, the non-important bits of the output neurons, and the non-important bits of the weights are stored inaccurately.
In a possible embodiment of the present application, the data includes floating-point data and fixed-point data. For floating-point data, the sign bit and the exponent part are important bits and the mantissa part is non-important bits; for fixed-point data, the sign bit and the first x bits of the numerical part are important bits and the remaining bits of the numerical part are non-important bits, where x is an integer greater than or equal to 0 and less than m, and m is the total number of bits of the parameter.
In a possible embodiment of the present application, the ECC memory includes ECC-checked DRAM and ECC-checked SRAM, and the ECC-checked SRAM uses 6T SRAM, 4T SRAM, or 3T SRAM.
In a possible embodiment of the present application, the non-ECC memory includes non-ECC-checked DRAM and non-ECC-checked SRAM, and the non-ECC-checked SRAM uses 6T SRAM, 4T SRAM, or 3T SRAM.
In still another aspect, the present application provides a data processing method, including:
receiving instructions and parameters, accurately storing the important bits of the parameters together with the instructions, and inaccurately storing the non-important bits of the parameters; receiving the instructions and decoding them to generate control information; and receiving the parameters, performing operations according to the control information, and storing the operation results.
In a possible embodiment of the present application, the operation is a neural network operation, and the parameters are neural network parameters.
In a possible embodiment of the present application, receiving the parameters, performing operations according to the control information, and storing the operation results includes: receiving input neurons and weights, completing the neural network operation according to the control information to obtain output neurons, and storing or outputting the output neurons.
In a possible embodiment of the present application, receiving the input neurons and weights and completing the neural network operation according to the control information to obtain the output neurons includes: receiving the important bits of the input neurons and the important bits of the weights for computation; or receiving complete input neurons and weights obtained by splicing the important bits and the non-important bits for computation.
In a possible embodiment of the present application, the data processing method further includes: caching dedicated instructions; performing accurate caching and inaccurate caching of the input neurons; performing accurate caching and inaccurate caching of the weight data; and performing accurate caching and inaccurate caching of the output neurons.
In a possible embodiment of the present application, the operation is a general-purpose operation.
In a possible embodiment of the present application, before receiving the instructions and parameters, accurately storing the important bits of the parameters together with the instructions, and inaccurately storing the non-important bits of the parameters, the method further includes: preprocessing input data and storing it, where the preprocessing includes segmentation, Gaussian filtering, binarization, regularization, and normalization.
In still another aspect, the present application provides a storage unit, where the storage unit is 4T SRAM or 3T SRAM and is used to store neural network parameters.
In a possible embodiment of the present application, the memory cell storing each bit in the 4T SRAM includes four MOS transistors, and the memory cell storing each bit in the 3T SRAM includes three MOS transistors.
In a possible embodiment of the present application, the four MOS transistors include a first MOS transistor, a second MOS transistor, a third MOS transistor, and a fourth MOS transistor, where the first and second MOS transistors are used for gating and the third and fourth MOS transistors are used for storage. The gate of the first MOS transistor is electrically connected to the word line WL and its source to the bit line BL; the gate of the second MOS transistor is electrically connected to the word line WL and its source to the bit line BLB; the gate of the third MOS transistor is connected to the source of the fourth MOS transistor and the drain of the second MOS transistor and is connected to the working voltage through the resistor R2, while the drain of the third MOS transistor is grounded; the gate of the fourth MOS transistor is connected to the source of the third MOS transistor and the drain of the first MOS transistor and is connected to the working voltage through the resistor R1, while the drain of the fourth MOS transistor is grounded. WL is used to control gated access to the memory cell, and BL is used to read and write the memory cell.
In a possible embodiment of the present application, the three MOS transistors include a first MOS transistor, a second MOS transistor, and a third MOS transistor, where the first MOS transistor is used for gating and the second and third MOS transistors are used for storage. The gate of the first MOS transistor is electrically connected to the word line WL and its source to the bit line BL; the gate of the second MOS transistor is connected to the source of the third MOS transistor and is connected to the working voltage through the resistor R2, while the drain of the second MOS transistor is grounded; the gate of the third MOS transistor is connected to the source of the second MOS transistor and the drain of the first MOS transistor and is connected to the working voltage through the resistor R1, while the drain of the third MOS transistor is grounded. WL is used to control gated access to the memory cell, and BL is used to read and write the memory cell.
In a possible embodiment of the present application, the neural network parameters include input neurons, weights, and output neurons.
With the increase of operating frequencies and the continuous development of semiconductor processes, chip power consumption has become an important consideration in deep submicron integrated circuits. Dynamic voltage and frequency scaling (DVFS) is a dynamic voltage and frequency adjustment technique now widely used in the semiconductor field; specifically, DVFS dynamically adjusts the operating frequency and voltage of a chip (for the same chip, the higher the frequency, the higher the required voltage) to save energy. However, the prior art lacks a dynamic voltage and frequency scaling method applied to intelligent chips and the design of a corresponding device, and cannot use application scenario information to adjust the voltage and frequency of a chip in advance.
In still another aspect, the present application provides a dynamic voltage and frequency scaling device, including:
an information acquisition unit for acquiring, in real time, working state information or application scenario information of a chip connected to the dynamic voltage and frequency scaling device, where the application scenario information is information obtained by the chip through neural network computation or collected by a sensor connected to the chip; and
a voltage and frequency scaling unit for sending voltage frequency regulation information to the chip according to the working state information or application scenario information of the chip, where the voltage frequency regulation information is used to instruct the chip to adjust its working voltage or working frequency.
In a possible embodiment of the present application, the working state information of the chip includes the running speed of the chip, the voltage frequency regulation information includes first voltage frequency regulation information, and the voltage and frequency scaling unit is configured to:
send the first voltage frequency regulation information to the chip when the running speed of the chip is greater than a target speed, where the first voltage frequency regulation information is used to instruct the chip to lower its working frequency or working voltage, and the target speed is the running speed of the chip when user requirements are met.
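This first rule can be stated compactly. The sketch below assumes a send() callback that delivers the regulation information to the chip, with speeds as plain numbers; both are illustrative assumptions.

```python
# Minimal sketch of the speed rule: scale down whenever the chip runs
# faster than the target speed that already satisfies the user.

def regulate_speed(running_speed, target_speed, send):
    if running_speed > target_speed:
        # First voltage frequency regulation information.
        send("lower working frequency or working voltage")

regulate_speed(1.3e9, 1.0e9, print)  # prints the scale-down instruction
```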
In a possible embodiment of the present application, the chip includes at least a first unit and a second unit, where the output data of the first unit is the input data of the second unit, the working state information of the chip includes the running speed of the first unit and the running speed of the second unit, the voltage frequency regulation information includes second voltage frequency regulation information, and the voltage and frequency scaling unit is further configured to:
send the second voltage frequency regulation information to the second unit when it is determined, according to the running speed of the first unit and the running speed of the second unit, that the running time of the first unit exceeds the running time of the second unit, where the second voltage frequency regulation information is used to instruct the second unit to lower its working frequency or working voltage.
In a possible embodiment of the present application, the voltage frequency regulation information includes third voltage frequency regulation information, and the voltage and frequency scaling unit is further configured to:
send the third voltage frequency regulation information to the first unit when it is determined, according to the running speed of the first unit and the running speed of the second unit, that the running time of the second unit exceeds the running time of the first unit, where the third voltage frequency regulation information is used to instruct the first unit to lower its working frequency or working voltage.
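Together, the second and third rules balance a two-stage pipeline: whichever unit finishes its batch sooner is told to slow down so both stages finish together. A minimal sketch, assuming measured per-batch runtimes and an illustrative send() callback:

```python
# Minimal sketch of the producer/consumer balancing rules; names are assumed.

def balance_pipeline(runtime_first, runtime_second, send):
    """runtime_*: measured time per batch for each unit; send(unit, info)
    delivers voltage frequency regulation information to a unit."""
    if runtime_first > runtime_second:
        # First unit is the bottleneck: slow the second unit down (2nd info).
        send("second_unit", "lower working frequency or voltage")
    elif runtime_second > runtime_first:
        # Second unit is the bottleneck: slow the first unit down (3rd info).
        send("first_unit", "lower working frequency or voltage")

balance_pipeline(2.0, 1.2, lambda unit, info: print(unit, "->", info))
```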
In a possible embodiment of the present application, the chip includes at least N units, where the working state information of the chip includes the working state information of at least S of the at least N units, N is an integer greater than 1, and S is an integer less than or equal to N; the voltage frequency regulation information includes fourth voltage frequency regulation information, and the voltage and frequency scaling unit is configured to:
send the fourth voltage frequency regulation information to a unit A when it is determined, according to the working state information of the unit A, that the unit A is in an idle state, where the fourth voltage frequency regulation information is used to instruct the unit A to lower its working frequency or working voltage,
where the unit A is any one of the at least S units.
In a possible embodiment of the present application, the voltage frequency regulation information includes fifth voltage frequency regulation information, and the voltage and frequency scaling unit is further configured to:
send the fifth voltage frequency regulation information to the unit A when it is determined, according to the working state information of the unit A, that the unit A is back in a working state, where the fifth voltage frequency regulation information is used to instruct the unit A to raise its working voltage or working frequency.
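The fourth and fifth rules amount to per-unit gating: scale an idle unit down and scale it back up when it resumes work. A minimal sketch with assumed state strings and an illustrative send() callback:

```python
# Minimal sketch of the per-unit idle/active gating rules; names are assumed.

def regulate_unit(unit_name, state, send):
    if state == "idle":
        # Fourth voltage frequency regulation information.
        send(unit_name, "lower working frequency or voltage")
    elif state == "working":
        # Fifth voltage frequency regulation information.
        send(unit_name, "raise working voltage or frequency")

for unit, state in {"unit_0": "idle", "unit_1": "working"}.items():
    regulate_unit(unit, state, lambda u, info: print(u, "->", info))
```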
In a possible embodiment of the present application, the application scenario of the chip is image recognition, the application scenario information is the number of objects in an image to be recognized, the voltage frequency regulation information includes sixth voltage frequency regulation information, and the voltage and frequency scaling unit is further configured to:
send the sixth voltage frequency regulation information to the chip when it is determined that the number of objects in the image to be recognized is less than a first threshold, where the sixth voltage frequency regulation information is used to instruct the chip to lower its working voltage or working frequency.
In a possible embodiment of the present application, the application scenario information is object tag information, the voltage frequency regulation information includes seventh voltage frequency regulation information, and the voltage and frequency scaling unit is further configured to:
send the seventh voltage frequency regulation information to the chip when it is determined that the object tag information belongs to a preset object tag set, where the seventh voltage frequency regulation information is used to instruct the chip to raise its working voltage or working frequency.
In a possible embodiment of the present application, the chip is applied to speech recognition, the application scenario information is a speech input rate, the voltage frequency regulation information includes eighth voltage frequency regulation information, and the voltage and frequency scaling unit is further configured to:
send the eighth voltage frequency regulation information to the chip when the speech input rate is less than a second threshold, where the eighth voltage frequency regulation information is used to instruct the chip to lower its working voltage or working frequency.
In a possible embodiment of the present application, the application scenario information is a keyword obtained by the chip through speech recognition, the voltage frequency regulation information includes ninth voltage frequency regulation information, and the voltage and frequency scaling unit is further configured to:
send the ninth voltage frequency regulation information to the chip when the keyword belongs to a preset keyword set, where the ninth voltage frequency regulation information is used to instruct the chip to raise its working voltage or working frequency.
In a possible embodiment of the present application, the chip is applied to machine translation, the application scenario information is the text input speed or the number of characters in an image to be translated, the voltage frequency regulation information includes tenth voltage frequency regulation information, and the voltage and frequency scaling unit is further configured to:
send the tenth voltage frequency regulation information to the chip when the text input speed is less than a third threshold or the number of characters in the image to be translated is less than a fourth threshold, where the tenth voltage frequency regulation information is used to instruct the chip to lower its working voltage or working frequency.
In a possible embodiment of the present application, the application scenario information is the ambient light intensity, the voltage frequency regulation information includes eleventh voltage frequency regulation information, and the voltage and frequency scaling unit is further configured to:
send the eleventh voltage frequency regulation information to the chip when the ambient light intensity is less than a fifth threshold, where the eleventh voltage frequency regulation information is used to instruct the chip to lower its working voltage or working frequency.
In a possible embodiment of the present application, the chip is applied to image beautification, the voltage frequency regulation information includes twelfth voltage frequency regulation information and thirteenth voltage frequency regulation information, and the voltage and frequency scaling unit is further configured to:
send the twelfth voltage frequency regulation information to the chip when the application scenario information is a face image, where the twelfth voltage frequency regulation information is used to instruct the chip to lower its working voltage; and
send the thirteenth voltage frequency regulation information to the chip when the application scenario information is not a face image, where the thirteenth voltage frequency regulation information is used to instruct the chip to lower its working voltage or working frequency. A consolidated sketch of these scenario rules follows.
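The scenario-driven rules above share one shape: scenario information compared against a threshold or a preset set yields a "lower" or "raise" instruction for the chip. The sketch below collects the sixth through ninth rules into one dispatch function (the remaining rules follow the same pattern); thresholds, tag sets, and keyword sets are illustrative assumptions.

```python
# Minimal sketch of the scenario-driven regulation rules; all values assumed.

FIRST_THRESHOLD = 3      # objects in the image       (6th regulation info)
SECOND_THRESHOLD = 2.0   # speech input rate, words/s (8th regulation info)
PRESET_TAGS = {"person", "vehicle"}        # preset object tag set (7th)
PRESET_KEYWORDS = {"navigate", "call"}     # preset keyword set (9th)

def regulate_by_scenario(scenario):
    kind, value = scenario
    if kind == "object_count" and value < FIRST_THRESHOLD:
        return "lower working voltage or frequency"
    if kind == "object_tag" and value in PRESET_TAGS:
        return "raise working voltage or frequency"
    if kind == "speech_rate" and value < SECOND_THRESHOLD:
        return "lower working voltage or frequency"
    if kind == "keyword" and value in PRESET_KEYWORDS:
        return "raise working voltage or frequency"
    return "keep current working voltage and frequency"

print(regulate_by_scenario(("object_count", 1)))   # lower
print(regulate_by_scenario(("keyword", "call")))   # raise
```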
In still another aspect, the present application provides a dynamic voltage and frequency scaling method, including:
acquiring, in real time, working state information or application scenario information of a chip connected to the dynamic voltage and frequency scaling device, where the application scenario information is information obtained by the chip through neural network computation or collected by a sensor connected to the chip; and
sending voltage frequency regulation information to the chip according to the working state information or application scenario information of the chip, where the voltage frequency regulation information is used to instruct the chip to adjust its working voltage or working frequency.
In a possible embodiment of the present application, the working state information of the chip includes the running speed of the chip, the voltage frequency regulation information includes first voltage frequency regulation information, and sending the voltage frequency regulation information to the chip according to the working state information or application scenario information of the chip includes:
sending the first voltage frequency regulation information to the chip when the running speed of the chip is greater than a target speed, where the first voltage frequency regulation information is used to instruct the chip to lower its working frequency or working voltage, and the target speed is the running speed of the chip when user requirements are met.
In a possible embodiment of the present application, the chip includes at least a first unit and a second unit, where the output data of the first unit is the input data of the second unit, the working state information of the chip includes the running speed of the first unit and the running speed of the second unit, and the voltage frequency regulation information includes second voltage frequency regulation information; sending the voltage frequency regulation information to the chip according to the working state information or application scenario information of the chip further includes:
sending the second voltage frequency regulation information to the second unit when it is determined, according to the running speed of the first unit and the running speed of the second unit, that the running time of the first unit exceeds the running time of the second unit, where the second voltage frequency regulation information is used to instruct the second unit to lower its working frequency or working voltage.
In a possible embodiment of the present application, the voltage frequency regulation information includes third voltage frequency regulation information, and sending the voltage frequency regulation information to the chip according to the working state information or application scenario information of the chip further includes:
sending the third voltage frequency regulation information to the first unit when it is determined, according to the running speed of the first unit and the running speed of the second unit, that the running time of the second unit exceeds the running time of the first unit, where the third voltage frequency regulation information is used to instruct the first unit to lower its working frequency or working voltage.
In a possible embodiment of the present application, the chip includes at least N units, where the working state information of the chip includes the working state information of at least S of the at least N units, N is an integer greater than 1, and S is an integer less than or equal to N; the voltage frequency regulation information includes fourth voltage frequency regulation information, and sending the voltage frequency regulation information to the chip according to the working state information or application scenario information of the chip further includes:
sending the fourth voltage frequency regulation information to a unit A when it is determined, according to the working state information of the unit A, that the unit A is in an idle state, where the fourth voltage frequency regulation information is used to instruct the unit A to lower its working frequency or working voltage,
where the unit A is any one of the at least S units.
In a possible embodiment of the present application, the voltage frequency regulation information includes fifth voltage frequency regulation information, and sending the voltage frequency regulation information to the chip according to the working state information or application scenario information of the chip further includes:
sending the fifth voltage frequency regulation information to the unit A when it is determined, according to the working state information of the unit A, that the unit A is back in a working state, where the fifth voltage frequency regulation information is used to instruct the unit A to raise its working voltage or working frequency.
在本申请的一可能实施例中,所述芯片的应用场景为图像识别,所述应用场景信息为待识别图像中物体的个数,所述电压频率调控信息包括第六电压频率调控信息,所述根据所述芯片的工作状态信息或应用场景信息向所述芯片发送电压频率调控信息还包括:In a possible embodiment of the present application, the application scenario of the chip is image recognition, the application scenario information is the number of objects in the image to be identified, and the voltage frequency regulation information includes sixth voltage frequency regulation information. The sending the voltage frequency regulation information to the chip according to the working state information or the application scenario information of the chip further includes:
当确定所述待识别图像中物体的个数小于第一阈值时,向所述芯片发送所述第六电压频率调控信息,所述第六电压频率调控信息用于指示所述芯片降低其工作电压或者工作频率。When it is determined that the number of objects in the image to be identified is less than a first threshold, sending the sixth voltage frequency regulation information to the chip, where the sixth voltage frequency regulation information is used to indicate that the chip reduces its working voltage Or the working frequency.
在本申请的一可能实施例中,所述应用场景信息为物体标签信息,所述电压频率调控信息包括第七电压频率调控信息,所述根据所述芯片的工作状态信息或应用场景信息向所述芯片发送电压频率调控信息还包括:In a possible embodiment of the present application, the application scenario information is object tag information, and the voltage frequency regulation information includes seventh voltage frequency regulation information, where the device is based on the working state information or the application scenario information of the chip. The chip transmitting voltage frequency regulation information further includes:
当确定所述物体标签信息属于预设物体标签集时,向所述芯片发送所述第七电压频率调控信息,所述第七电压频率调控信息用于指示所述芯片升高其工作电压或者工作频率。When it is determined that the object tag information belongs to the preset object tag set, sending the seventh voltage frequency regulation information to the chip, where the seventh voltage frequency regulation information is used to indicate that the chip raises its working voltage or works frequency.
在本申请的一可能实施例中,所述芯片应用于语音识别,所述应用场景信息为语音输入速率,所述电压频率调控信息包括第八电压频率调控信息,所述根据所述芯片的工作状态信息或应用场景信息向所述芯片发送电压频率调控信息还包括:In a possible embodiment of the present application, the chip is applied to voice recognition, the application scenario information is a voice input rate, and the voltage frequency regulation information includes eighth voltage frequency regulation information, where the operation is performed according to the chip. The sending the voltage frequency regulation information to the chip by the status information or the application scenario information further includes:
When the voice input rate is less than the second threshold, the eighth voltage-frequency regulation information is sent to the chip, where the eighth voltage-frequency regulation information instructs the chip to reduce its working voltage or working frequency.
In a possible embodiment of the present application, the application scenario information is a keyword obtained by the chip through speech recognition, the voltage-frequency regulation information includes ninth voltage-frequency regulation information, and the sending of voltage-frequency regulation information to the chip according to the working state information or the application scenario information of the chip further includes:
when the keyword belongs to a preset keyword set, sending the ninth voltage-frequency regulation information to the chip, where the ninth voltage-frequency regulation information instructs the chip to increase its working voltage or working frequency.
In a possible embodiment of the present application, the chip is applied to machine translation, the application scenario information is a text input speed or a number of characters in an image to be translated, the voltage-frequency regulation information includes tenth voltage-frequency regulation information, and the sending of voltage-frequency regulation information to the chip according to the working state information or the application scenario information of the chip further includes:
when the text input speed is less than a third threshold, or the number of characters in the image to be translated is less than a fourth threshold, sending the tenth voltage-frequency regulation information to the chip, where the tenth voltage-frequency regulation information instructs the chip to reduce its working voltage or working frequency.
In a possible embodiment of the present application, the application scenario information is an ambient light intensity, the voltage-frequency regulation information includes eleventh voltage-frequency regulation information, and the sending of voltage-frequency regulation information to the chip according to the working state information or the application scenario information of the chip further includes:
when the ambient light intensity is less than a fifth threshold, sending the eleventh voltage-frequency regulation information to the chip, where the eleventh voltage-frequency regulation information instructs the chip to reduce its working voltage or working frequency.
In a possible embodiment of the present application, the chip is applied to image beautification, the voltage-frequency regulation information includes twelfth voltage-frequency regulation information and thirteenth voltage-frequency regulation information, and the sending of voltage-frequency regulation information to the chip according to the working state information or the application scenario information of the chip further includes:
when the application scenario information is a face image, sending the twelfth voltage-frequency regulation information to the chip, where the twelfth voltage-frequency regulation information instructs the chip to reduce its working voltage; and
when the application scenario information is not a face image, sending the thirteenth voltage-frequency regulation information to the chip, where the thirteenth voltage-frequency regulation information instructs the chip to reduce its working voltage or working frequency.
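Taken together, the scenario-driven rules above map each piece of application scenario information to a raise/lower decision. The following is a minimal sketch of that mapping; the threshold values, the scenario field names, and the preset keyword set are illustrative assumptions, since the text specifies only the comparisons themselves.

```python
# Minimal sketch of the scenario-driven DVFS decision rules described above.
# All threshold values, field names, and the keyword set are illustrative
# assumptions; only the decision logic follows the text.

THRESHOLDS = {
    "voice_rate": 10.0,        # second threshold, assumed units/s
    "text_speed": 5.0,         # third threshold, assumed chars/s
    "chars_to_translate": 20,  # fourth threshold, assumed
    "light_intensity": 50.0,   # fifth threshold, assumed lux
}
PRESET_KEYWORDS = {"wake", "translate"}  # assumed preset keyword set

def decide(scenario: dict):
    """Return 'lower_vf', 'raise_vf', 'lower_v', or None for a scenario dict."""
    if scenario.get("voice_rate", float("inf")) < THRESHOLDS["voice_rate"]:
        return "lower_vf"                       # eighth regulation information
    if scenario.get("keyword") in PRESET_KEYWORDS:
        return "raise_vf"                       # ninth regulation information
    if (scenario.get("text_speed", float("inf")) < THRESHOLDS["text_speed"]
            or scenario.get("chars_to_translate", float("inf"))
               < THRESHOLDS["chars_to_translate"]):
        return "lower_vf"                       # tenth regulation information
    if scenario.get("light_intensity", float("inf")) < THRESHOLDS["light_intensity"]:
        return "lower_vf"                       # eleventh regulation information
    if scenario.get("is_face_image") is True:
        return "lower_v"                        # twelfth: voltage only
    if scenario.get("is_face_image") is False:
        return "lower_vf"                       # thirteenth regulation information
    return None
```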
As operating frequencies rise and semiconductor processes continue to advance, chip power consumption has become an important consideration in deep sub-micron integrated circuits. Dynamic Voltage and Frequency Scaling (DVFS) is a power-management technique now widely adopted in the semiconductor field: it dynamically adjusts the operating frequency and voltage of a chip (for the same chip, a higher frequency requires a higher voltage) in order to save energy. However, the prior art lacks a DVFS method, and a corresponding device design, applicable to intelligent chips such as convolution operation devices.
In still another aspect of the present application, a convolution operation device is provided, including a dynamic voltage and frequency scaling (DVFS) device, an instruction storage unit, a controller unit, a data access unit, an interconnection module, a main operation module, and N slave operation modules, where N is an integer greater than 1, and where:
the instruction storage unit is configured to store the instructions read in by the data access unit;
the controller unit is configured to read instructions from the instruction storage unit and decode each instruction into control signals that control the behavior of the other modules, the other modules including the data access unit, the main operation module, and the N slave operation modules;
the data access unit is configured to perform data or instruction read/write operations between an external address space and the convolution operation device;
the N slave operation modules are configured to implement the convolution of the input data with the convolution kernels in a convolutional neural network algorithm;
the interconnection module is configured to transfer data between the main operation module and the slave operation modules;
the main operation module is configured to splice the intermediate vectors of all input data into an intermediate result and perform subsequent operations on the intermediate result; and
the DVFS device is configured to collect working state information of the convolution operation device and to send voltage-frequency regulation information to the convolution operation device according to that working state information, where the voltage-frequency regulation information instructs the convolution operation device to adjust its working voltage or working frequency.
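To make the division of responsibilities concrete, the following sketch models this module composition in Python. All class and method names are illustrative assumptions; only the allocation of duties among the modules is taken from the text.

```python
# Illustrative sketch of the module composition described above; names are
# assumptions, the division of responsibilities comes from the text.

class DataAccessUnit:
    """Moves data/instructions between the external address space and the device."""
    def __init__(self, external_memory: dict):
        self.mem = external_memory
    def read(self, addr):
        return self.mem[addr]
    def write(self, addr, value):
        self.mem[addr] = value

class SlaveModule:
    """Convolves shared input data with its own kernel, yielding one scalar."""
    def __init__(self, kernel):
        self.kernel = kernel
    def convolve(self, window):
        return sum(a * b for a, b in zip(window, self.kernel))

class MainModule:
    """Splices the scalars gathered by the interconnect into an intermediate result."""
    def combine(self, scalars):
        return list(scalars)

class ConvolutionDevice:
    def __init__(self, kernels, external_memory):
        assert len(kernels) > 1                  # N is an integer greater than 1
        self.access = DataAccessUnit(external_memory)
        self.slaves = [SlaveModule(k) for k in kernels]
        self.main = MainModule()

    def step(self, window):
        # interconnect: broadcast one window, gather one scalar per slave
        return self.main.combine(s.convolve(window) for s in self.slaves)
```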
In a possible embodiment of the present application, the main operation module is further configured to add the intermediate result to the bias data and then perform an activation operation.
In a possible embodiment of the present application, the N slave operation modules are specifically configured to compute their respective output scalars in parallel, using the same input data and their respective convolution kernels.
In a possible embodiment of the present application, the activation function active used by the main operation module is any one of the nonlinear functions sigmoid, tanh, relu, and softmax.
In a possible embodiment of the present application, the interconnection module forms a data path for continuous or discretized data between the main operation module and the N slave operation modules, and the interconnection module has any one of the following structures: a tree structure, a ring structure, a mesh structure, a hierarchical interconnection structure, and a bus structure.
In a possible embodiment of the present application, the main operation module includes:
a first storage unit, configured to buffer the input data and output data used by the main operation module during computation;
a first operation unit, configured to perform the various computational functions of the main operation module; and
a first data dependency determination unit, which is the port through which the first operation unit reads and writes the first storage unit, and which is configured to guarantee the consistency of data reads and writes to the first storage unit, to read an input neuron vector from the first storage unit and send it to the N slave operation modules through the interconnection module, and to send the intermediate result vector from the interconnection module to the first operation unit.
In a possible embodiment of the present application, each of the N slave operation modules includes:
a second operation unit, configured to receive the control signals issued by the controller unit and perform arithmetic and logic operations;
a second data dependency determination unit, configured to perform the read and write operations on a second storage unit and a third storage unit during computation, so as to guarantee read/write consistency for the second storage unit and the third storage unit;
a second storage unit, configured to buffer the input data and the output scalar computed by that slave operation module; and
a third storage unit, configured to buffer the convolution kernel needed by that slave operation module during computation.
In a possible embodiment of the present application, the first data dependency determination unit and the second data dependency determination unit guarantee read/write consistency as follows:
determining whether a dependency exists between the data of a control signal that has not yet been executed and a control signal that is currently being executed; if no dependency exists, the control signal is allowed to issue immediately; otherwise, the control signal is allowed to issue only after all control signals on which it depends have completed execution.
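As a concrete illustration, the issue rule above can be implemented as a simple scoreboard. In the sketch below, representing each control signal by the address ranges it reads and writes is an assumption; the text specifies only the dependency condition and the wait-until-complete behavior.

```python
# Minimal sketch of the issue rule above. Modeling a control signal as a dict
# of (start, end) address ranges it reads/writes is an assumption; the text
# only requires that a signal wait for every in-flight signal it depends on.

def overlaps(a, b):
    """True if two half-open (start, end) address ranges intersect."""
    return a[0] < b[1] and b[0] < a[1]

def depends_on(new, in_flight):
    """RAW/WAR/WAW: any read/write overlap with an executing control signal."""
    return (any(overlaps(r, w) for r in new["reads"] for w in in_flight["writes"])
            or any(overlaps(w, r) for w in new["writes"] for r in in_flight["reads"])
            or any(overlaps(w, v) for w in new["writes"] for v in in_flight["writes"]))

def may_issue(new, executing):
    """A control signal issues only when it depends on no executing signal."""
    return not any(depends_on(new, s) for s in executing)
```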
In a possible embodiment of the present application, the data access unit reads at least one of the input data, the bias data, and the convolution kernels from the external address space.
In a possible embodiment of the present application, the DVFS device includes:
an information collection unit, configured to collect the working state information of the convolution operation device in real time; and
a voltage-and-frequency scaling unit, configured to send voltage-frequency regulation information to the convolution operation device according to the working state information of the convolution operation device, where the voltage-frequency regulation information instructs the convolution operation device to adjust its working voltage or working frequency.
In a possible embodiment of the present application, the working state information of the convolution operation device includes the running speed of the convolution operation device, the voltage-frequency regulation information includes first voltage-frequency regulation information, and the voltage-and-frequency scaling unit is configured to:
when the running speed of the convolution operation device is greater than a target speed, send the first voltage-frequency regulation information to the convolution operation device, where the first voltage-frequency regulation information instructs the convolution operation device to reduce its working frequency or working voltage, and the target speed is the running speed of the convolution operation device that satisfies the user's requirement.
In a possible embodiment of the present application, the working state information of the convolution operation device includes the running speed of the data access unit and the running speed of the main operation module, the voltage-frequency regulation information includes second voltage-frequency regulation information, and the voltage-and-frequency scaling unit is further configured to:
when it is determined, from the running speed of the data access unit and the running speed of the main operation module, that the running time of the data access unit exceeds the running time of the main operation module, send the second voltage-frequency regulation information to the main operation module, where the second voltage-frequency regulation information instructs the main operation module to reduce its working frequency or working voltage.
In a possible embodiment of the present application, the voltage-frequency regulation information includes third voltage-frequency regulation information, and the voltage-and-frequency scaling unit is further configured to:
when it is determined, from the running speed of the data access unit and the running speed of the main operation module, that the running time of the main operation module exceeds the running time of the data access unit, send the third voltage-frequency regulation information to the data access unit, where the third voltage-frequency regulation information instructs the data access unit to reduce its working frequency or working voltage.
In a possible embodiment of the present application, the working state information of the convolution operation device includes the working state information of at least S units/modules among the instruction storage unit, the controller unit, the data access unit, the interconnection module, the main operation module, and the N slave operation modules, where S is an integer greater than 1 and less than or equal to N+5, the voltage-frequency regulation information includes fourth voltage-frequency regulation information, and the voltage-and-frequency scaling unit is configured to:
when it is determined from the working state information of a unit A that the unit A is idle, send the fourth voltage-frequency regulation information to the unit A, where the fourth voltage-frequency regulation information instructs the unit A to reduce its working frequency or working voltage,
where the unit A is any one of the at least S units/modules.
In a possible embodiment of the present application, the voltage-frequency regulation information includes fifth voltage-frequency regulation information, and the voltage-and-frequency scaling unit is further configured to:
when it is determined from the working state information of the unit A that the unit A has returned to a working state, send the fifth voltage-frequency regulation information to the unit A, where the fifth voltage-frequency regulation information instructs the unit A to increase its working voltage or working frequency.
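The first through fifth pieces of regulation information amount to a producer/consumer balancing rule plus per-unit idle gating. The sketch below summarizes this; the speed/idle probes and the message format are illustrative assumptions, while the comparisons follow the text.

```python
# Minimal sketch of the balancing rules above. The device interface
# (run-time probes, idle/resumed flags) is an assumption; only the
# comparisons come from the text.

def regulate(device):
    msgs = []
    if device.access_run_time() > device.main_run_time():
        msgs.append(("main_module", "lower"))       # second regulation information
    elif device.main_run_time() > device.access_run_time():
        msgs.append(("data_access_unit", "lower"))  # third regulation information
    for unit in device.monitored_units():           # at least S of the N+5 units
        if unit.idle():
            msgs.append((unit.name, "lower"))       # fourth regulation information
        elif unit.resumed():
            msgs.append((unit.name, "raise"))       # fifth regulation information
    return msgs
```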
In yet another aspect of the present application, a neural network processor is provided, including the convolution operation device described above.
In yet another aspect of the present application, an electronic device is provided, including the neural network processor described above.
In still another aspect of the present application, a method for performing a forward operation of a single-layer convolutional neural network is provided, applied to the convolution operation device described above, and including:
pre-storing an input/output (IO) instruction at the first address of the instruction storage unit;
when the operation starts, the controller unit reads the IO instruction from the first address of the instruction storage unit, and, according to the decoded control signals, the data access unit reads all corresponding convolutional neural network operation instructions from the external address space and caches them in the instruction storage unit;
the controller unit then reads the next IO instruction from the instruction storage unit, and, according to the decoded control signals, the data access unit reads all data needed by the main operation module from the external address space into the first storage unit of the main operation module;
the controller unit then reads the next IO instruction from the instruction storage unit, and, according to the decoded control signals, the data access unit reads the convolution kernel data needed by the slave operation modules from the external address space;
the controller unit then reads the next CONFIG instruction from the instruction storage unit, and, according to the decoded control signals, the convolution operation device configures the various constants needed for the computation of this layer of the neural network;
the controller unit then reads the next COMPUTE instruction from the instruction storage unit, and, according to the decoded control signals, the main operation module first sends the input data within the convolution window to the N slave operation modules through the interconnection module and saves it to the second storage units of the N slave operation modules, and then moves the convolution window according to the instruction;
according to the control signals decoded from the COMPUTE instruction, the operation units of the N slave operation modules read the convolution kernels from the third storage units and the input data from the second storage units, complete the convolution of the input data with the convolution kernels, and return the resulting output scalars through the interconnection module;
in the interconnection module, the output scalars returned by the N slave operation modules are spliced stage by stage into a complete intermediate vector;
the main operation module obtains the intermediate vectors returned by the interconnection module; after the convolution window has traversed all input data, the main operation module splices all returned vectors into an intermediate result, reads the bias data from the first storage unit according to the control signals decoded from the COMPUTE instruction, adds the bias data to the intermediate result in the vector-add unit to obtain a biased result, activates the biased result in the activation unit, and writes the final output data back to the first storage unit; and
the controller unit then reads the next IO instruction from the instruction storage unit, and, according to the decoded control signals, the data access unit stores the output data in the first storage unit to the specified address in the external address space, whereupon the operation ends.
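The single-layer forward pass thus reduces to a fixed instruction sequence: IO, IO, IO, CONFIG, COMPUTE, IO. The driver-loop sketch below summarizes it; the handler names and the decode/dispatch interface are hypothetical, while the order of steps is from the text.

```python
# Sketch of the single-layer forward instruction sequence described above.
# Handler names and the device interface are hypothetical; only the
# instruction order is taken from the text.

PROGRAM = [
    ("IO",      "fetch_instructions"),   # cache all CNN instructions on chip
    ("IO",      "load_main_data"),       # input/bias data -> first storage unit
    ("IO",      "load_kernels"),         # kernels -> slave third storage units
    ("CONFIG",  "set_layer_constants"),  # per-layer constants
    ("COMPUTE", "run_convolution"),      # window broadcast, slave dot products,
                                         # splice, bias add, activation
    ("IO",      "store_output"),         # output -> external address space
]

def run(device):
    for opcode, handler in PROGRAM:
        signals = device.decode(opcode)      # controller unit decodes
        getattr(device, handler)(signals)    # dispatch to module behavior
```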
In a possible embodiment of the present application, the method further includes:
collecting the working state information of the convolution operation device in real time; and
sending voltage-frequency regulation information to the convolution operation device according to the working state information of the convolution operation device, where the voltage-frequency regulation information instructs the convolution operation device to adjust its working voltage or working frequency.
In a possible embodiment of the present application, the working state information of the convolution operation device includes the running speed of the convolution operation device, the voltage-frequency regulation information includes first voltage-frequency regulation information, and the sending of voltage-frequency regulation information to the convolution operation device according to its working state information includes:
when the running speed of the convolution operation device is greater than a target speed, sending the first voltage-frequency regulation information to the convolution operation device, where the first voltage-frequency regulation information instructs the convolution operation device to reduce its working frequency or working voltage, and the target speed is the running speed of the convolution operation device that satisfies the user's requirement.
In a possible embodiment of the present application, the working state information of the convolution operation device includes the running speed of the data access unit and the running speed of the main operation module, the voltage-frequency regulation information includes second voltage-frequency regulation information, and the sending of voltage-frequency regulation information to the convolution operation device according to its working state information further includes:
when it is determined, from the running speed of the data access unit and the running speed of the main operation module, that the running time of the data access unit exceeds the running time of the main operation module, sending the second voltage-frequency regulation information to the main operation module, where the second voltage-frequency regulation information instructs the main operation module to reduce its working frequency or working voltage.
In a possible embodiment of the present application, the voltage-frequency regulation information includes third voltage-frequency regulation information, and the sending of voltage-frequency regulation information to the convolution operation device according to its working state information further includes:
when it is determined, from the running speed of the data access unit and the running speed of the main operation module, that the running time of the main operation module exceeds the running time of the data access unit, sending the third voltage-frequency regulation information to the data access unit, where the third voltage-frequency regulation information instructs the data access unit to reduce its working frequency or working voltage.
In a possible embodiment of the present application, the working state information of the convolution operation device includes the working state information of at least S units/modules among the instruction storage unit, the controller unit, the data access unit, the interconnection module, the main operation module, and the N slave operation modules, where S is an integer greater than 1 and less than or equal to N+5, the voltage-frequency regulation information includes fourth voltage-frequency regulation information, and the sending of voltage-frequency regulation information to the convolution operation device according to its working state information further includes:
when it is determined from the working state information of a unit A that the unit A is idle, sending the fourth voltage-frequency regulation information to the unit A, where the fourth voltage-frequency regulation information instructs the unit A to reduce its working frequency or working voltage,
where the unit A is any one of the at least S units/modules.
In a possible embodiment of the present application, the voltage-frequency regulation information includes fifth voltage-frequency regulation information, and the sending of voltage-frequency regulation information to the convolution operation device according to its working state information further includes:
when it is determined from the working state information of the unit A that the unit A has returned to a working state, sending the fifth voltage-frequency regulation information to the unit A, where the fifth voltage-frequency regulation information instructs the unit A to increase its working voltage or working frequency.
In another aspect of the present application, a method for performing a forward operation of a multi-layer convolutional neural network is provided, including:
performing the single-layer convolutional neural network forward operation method described above for each layer, where, after the previous layer of the convolutional neural network has finished executing, the operation instruction of the current layer takes the output data address of the previous layer, stored in the main operation module, as the input data address of the current layer, and the convolution kernel and bias data addresses in the instruction are changed to the addresses corresponding to the current layer.
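A minimal sketch of this layer-to-layer address chaining follows; representing each layer's operation instruction as a record with address fields is an illustrative assumption.

```python
# Sketch of multi-layer chaining: each layer's instruction rebinds its input
# address to the previous layer's output address. Field names are assumptions.

def chain_layers(layer_instructions, first_input_addr):
    prev_output = first_input_addr
    for instr in layer_instructions:
        instr["input_addr"] = prev_output   # reuse the previous layer's output
        # kernel/bias addresses are already per-layer in each instruction
        prev_output = instr["output_addr"]
    return layer_instructions
```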
With the advent of the big-data era, data is growing at an explosive rate; enormous volumes of data carry information between people, and images, as the visual foundation of human perception of the world, are an important means by which humans acquire, express, and transmit information.
In the prior art, image compression effectively reduces the amount of data and increases the transmission rate of an image. However, after an image is compressed it is difficult to retain all of the information of the original image; therefore, how to perform image compression remains a technical problem to be solved by those skilled in the art.
In still another aspect of the present application, an image compression method is provided, including:
acquiring an original image of a first resolution, where the original image is any training image in a compression training image set of a compression neural network, and the label information of the original image is taken as target label information;
compressing the original image based on a target model to obtain a compressed image of a second resolution, where the second resolution is smaller than the first resolution, and the target model is the current neural network model of the compression neural network;
recognizing the compressed image based on a recognition neural network model to obtain reference label information, where the recognition neural network model is the neural network model obtained when the training of the recognition neural network is completed;
obtaining a loss function from the target label information and the reference label information;
when the loss function converges to a first threshold, or when the current number of training iterations of the compression neural network is greater than or equal to a second threshold, acquiring a target original image of the first resolution and taking the target model as the compression neural network model obtained when the training of the compression neural network is completed; and
compressing the target original image based on the compression neural network model to obtain a target compressed image of the second resolution.
In a possible embodiment of the present application, the image compression method further includes:
when the loss function has not converged to the first threshold, or the current number of training iterations of the compression neural network is less than the second threshold, updating the target model according to the loss function to obtain an updated model, taking the updated model as the target model, taking the next training image as the original image, and returning to the step of acquiring the original image of the first resolution.
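The training procedure above amounts to the following loop. In the sketch, the loss form, the optimizer step, and the model interfaces are illustrative assumptions; only the stopping conditions and the use of a fixed recognition network follow the text.

```python
# Sketch of the compression-network training loop described above. The loss
# form, update step, and model interfaces are assumptions; the stopping
# conditions (loss converged to the first threshold, or iteration count at
# least the second threshold) follow the text.

def train_compressor(compressor, recognizer, loss_fn, training_set,
                     first_threshold, second_threshold, update_step):
    iterations = 0
    for original, target_label in training_set:
        compressed = compressor(original)          # first -> second resolution
        reference_label = recognizer(compressed)   # fixed recognition model
        loss = loss_fn(target_label, reference_label)
        iterations += 1
        if loss <= first_threshold or iterations >= second_threshold:
            return compressor                      # training complete
        update_step(compressor, loss)              # e.g., back-propagation
    return compressor
```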
In a possible embodiment of the present application, recognizing the compressed image based on the recognition neural network model to obtain the reference label information specifically includes:
preprocessing the compressed image to obtain an image to be recognized; and
recognizing the image to be recognized based on the recognition neural network model to obtain the reference label information.
In a possible embodiment of the present application, the preprocessing includes size processing, and preprocessing the compressed image to obtain the image to be recognized specifically includes:
when the image size of the compressed image is smaller than the basic image size of the recognition neural network, padding the compressed image with pixels according to the basic image size to obtain the image to be recognized.
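A minimal sketch of this size processing follows; zero-valued fill and top-left anchoring are assumptions, since the text only requires filling pixels up to the basic image size.

```python
# Sketch of the size processing above: pad a compressed image up to the
# recognition network's basic image size. Zero fill and top-left anchoring
# are assumptions.

def pad_to_basic_size(image, basic_h, basic_w, fill=0):
    h, w = len(image), len(image[0])
    if h >= basic_h and w >= basic_w:
        return image                      # already at least the basic size
    padded = [[fill] * max(w, basic_w) for _ in range(max(h, basic_h))]
    for i in range(h):
        for j in range(w):
            padded[i][j] = image[i][j]
    return padded
```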
In a possible embodiment of the present application, the compression training image set includes at least a recognition training image set, and the method further includes:
training the recognition neural network with the recognition training image set to obtain the recognition neural network model, where each training image in the recognition training image set includes at least label information consistent in type with the target label information.
In a possible embodiment of the present application, after the target original image is compressed based on the compression neural network model to obtain the target compressed image of the second resolution, the method further includes:
recognizing the target compressed image based on the recognition neural network model to obtain the label information of the target original image, and storing the label information of the target original image.
In a possible embodiment of the present application, the compression training image set includes a plurality of dimensions, and compressing the original image based on the target model to obtain the compressed image of the second resolution includes:
recognizing the original image based on the target model to obtain a plurality of pieces of image information, each dimension corresponding to one piece of image information; and
compressing the original image based on the target model and the plurality of pieces of image information to obtain the compressed image.
In still another aspect of the present application, an image compression apparatus is provided, including a processor and a memory connected to the processor, where:
the memory is configured to store a first threshold, a second threshold, the current neural network model and number of training iterations of a compression neural network, the compression training image set of the compression neural network and the label information of each training image in the compression training image set, a recognition neural network model, and a compression neural network model, the current neural network model of the compression neural network being taken as a target model, the compression neural network model being the target model obtained when the training of the compression neural network is completed, and the recognition neural network model being the neural network model obtained when the training of the recognition neural network is completed; and
the processor is configured to acquire an original image of a first resolution, where the original image is any training image in the compression training image set, and take the label information of the original image as target label information; compress the original image based on the target model to obtain a compressed image of a second resolution, the second resolution being smaller than the first resolution; recognize the compressed image based on the recognition neural network model to obtain reference label information; obtain a loss function from the target label information and the reference label information; when the loss function converges to the first threshold, or the number of training iterations is greater than or equal to the second threshold, acquire a target original image of the first resolution and confirm the target model as the compression neural network model; and compress the target original image based on the compression neural network model to obtain a target compressed image of the second resolution.
In a possible embodiment of the present application, the processor is further configured to, when the loss function has not converged to the first threshold, or the number of training iterations is less than the second threshold, update the target model according to the loss function to obtain an updated model, take the updated model as the target model, take the next training image as the original image, and return to the step of acquiring the original image of the first resolution.
In a possible embodiment of the present application, the processor is specifically configured to preprocess the compressed image to obtain an image to be recognized, and to recognize the image to be recognized based on the recognition neural network model to obtain the reference label information.
In a possible embodiment of the present application, the preprocessing includes size processing; the memory is further configured to store the basic image size of the recognition neural network; and the processor is specifically configured to, when the image size of the compressed image is smaller than the basic image size, pad the compressed image with pixels according to the basic image size to obtain the image to be recognized.
In a possible embodiment of the present application, the compression training image set includes at least a recognition training image set, and the processor is further configured to train the recognition neural network with the recognition training image set to obtain the recognition neural network model, where each training image in the recognition training image set includes at least label information consistent in type with the target label information.
In a possible embodiment of the present application, the processor is further configured to recognize the target compressed image based on the recognition neural network model to obtain the label information of the target original image, and the memory is further configured to store the label information of the target original image.
In a possible embodiment of the present application, the compression training image set includes a plurality of dimensions, and the processor is specifically configured to recognize the original image based on the target model to obtain a plurality of pieces of image information, each dimension corresponding to one piece of image information, and to compress the original image based on the target model and the plurality of pieces of image information to obtain the compressed image.
In another aspect of the present application, another electronic device is provided, including a processor, a memory, a communication interface, and one or more programs, where the one or more programs are stored in the memory and configured to be executed by the processor, and the programs include instructions for some or all of the steps described in the image compression method above.
In another aspect of the present application, a computer-readable storage medium is provided, the computer storage medium storing a computer program, the computer program including program instructions that, when executed by a processor, cause the processor to perform the image compression method described above.
Compared with the prior art, the processing method and apparatus and the operation method and apparatus provided by the present application have at least the following advantages:
1. A quantization method is used to quantize the neurons and weights of the neural network: the quantized weights are represented by a weight dictionary and a weight codebook, the quantized neurons are represented by a neuron dictionary and a neuron codebook, and the operations in the neural network are then converted into table-lookup operations. This reduces the amount of neural network parameter storage and reduces memory-access and computation energy consumption. The neural network processor integrates a lookup-table-based computation method, which optimizes the table-lookup operation, simplifies the structure, reduces the memory-access and computation energy consumption of the neural network, and also enables diversified operations.
2. The neural network can be retrained, and during retraining only the codebooks need to be trained; the weight dictionary does not need to be trained, which simplifies the retraining operation.
3. Neural-network-specific instructions for locally quantized multi-layer artificial neural network operations, together with a flexible operation unit, solve the problems of insufficient computational performance of CPUs and GPUs and of high front-end decoding overhead, effectively improving support for multi-layer artificial neural network operation algorithms.
4. By using a dedicated on-chip cache for multi-layer artificial neural network operation algorithms, the reusability of input neurons and weight data is fully exploited, avoiding repeated reads of these data from memory, reducing memory-access bandwidth, and preventing memory bandwidth from becoming a performance bottleneck for multi-layer artificial neural network operations and their training algorithms.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1A is a schematic flowchart of a processing method according to an embodiment of the present application.
FIG. 1B is a schematic diagram of a process of quantizing weights according to an embodiment of the present application.
FIG. 1C is a schematic diagram of a process of quantizing input neurons according to an embodiment of the present application.
FIG. 1D is a schematic diagram of a process of determining an operation codebook according to an embodiment of the present application.
FIG. 1E is a schematic structural diagram of a processing apparatus according to an embodiment of the present application.
FIG. 1F is a schematic structural diagram of an operation apparatus according to an embodiment of the present application.
FIG. 1G is a schematic structural diagram of an operation apparatus according to a specific embodiment of the present application.
FIG. 1H is a schematic flowchart of an operation method according to an embodiment of the present application.
FIG. 1I is a schematic flowchart of another operation method according to a specific embodiment of the present application.
FIG. 2A is a schematic structural diagram of a hierarchical storage apparatus according to an embodiment of the present application.
FIG. 2B is a schematic structural diagram of a 4T SRAM storage cell according to an embodiment of the present application.
FIG. 2C is a schematic structural diagram of a 3T SRAM storage cell according to an embodiment of the present application.
FIG. 2D is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application.
FIG. 2E is a schematic structural diagram of another data processing apparatus according to an embodiment of the present application.
FIG. 2F is a flowchart of a data storage method according to an embodiment of the present application.
FIG. 2G is a flowchart of a data processing method according to an embodiment of the present application.
FIG. 3A is a schematic structural diagram of a dynamic voltage and frequency scaling (DVFS) apparatus according to an embodiment of the present application.
FIG. 3B is a schematic diagram of a DVFS application scenario according to an embodiment of the present application.
FIG. 3C is a schematic diagram of another DVFS application scenario according to an embodiment of the present application.
FIG. 3D is a schematic diagram of another DVFS application scenario according to an embodiment of the present application.
FIG. 3E is a schematic diagram of an implementation of the interconnection module 4 according to an embodiment of the present application.
FIG. 3F is an example block diagram of the structure of the main operation module 5 in an apparatus for performing a convolutional neural network forward operation according to an embodiment of the present application.
FIG. 3G is an example block diagram of the structure of a slave operation module 6 in an apparatus for performing a convolutional neural network forward operation according to an embodiment of the present application.
FIG. 3H is a schematic flowchart of a DVFS method according to an embodiment of the present application.
FIG. 4A is a schematic structural diagram of a convolution operation device according to an embodiment of the present application.
FIG. 4B is an example block diagram of the structure of the main operation module in a convolution operation device according to an embodiment of the present application.
FIG. 4C is an example block diagram of the structure of a slave operation module in a convolution operation device according to an embodiment of the present application.
FIG. 4D is an example block diagram of the structure of the DVFS device in a convolution operation device according to an embodiment of the present application.
FIG. 4E is a schematic diagram of an implementation of the interconnection module 4 according to an embodiment of the present application.
FIG. 4F is a schematic structural diagram of another convolution operation device according to an embodiment of the present application.
FIG. 4G is a schematic flowchart of a method for performing a single-layer convolutional neural network forward operation according to an embodiment of the present application.
FIG. 5A is a schematic diagram of the operation of a neural network according to an embodiment of the present application.
FIG. 5B is a schematic flowchart of an image compression method according to an embodiment of the present application.
FIG. 5C is a schematic diagram of a scenario of a size processing method according to an embodiment of the present application.
FIG. 5D is a schematic flowchart of a single-layer neural network operation method according to an embodiment of the present application.
FIG. 5E is a schematic structural diagram of an apparatus for performing reverse training of a compression neural network according to an embodiment of the present application.
FIG. 5F is a schematic structural diagram of an H-tree module according to an embodiment of the present application.
FIG. 5G is a schematic structural diagram of a main operation module according to an embodiment of the present application.
FIG. 5H is a schematic structural diagram of an operation module according to an embodiment of the present application.
FIG. 5I is an example block diagram of reverse training of a compression neural network according to an embodiment of the present application.
FIG. 5J is a schematic flowchart of an image compression method according to an embodiment of the present application.
FIG. 5K is a schematic structural diagram of an electronic device according to an embodiment of the present application.
DETAILED DESCRIPTION
In view of the technical defect in the prior art that the enormous computational load of neural network data processing hinders the application of neural networks, the present application provides a processing method and apparatus and an operation method and apparatus. The processing method and apparatus quantize two kinds of data, input neurons and weights, mining the similarity between inter-layer and inter-segment data as well as the local similarity of intra-layer and intra-segment data, so as to exploit the distribution characteristics of these two kinds of data for low-bit quantization and reduce the number of bits used to represent each datum, thereby reducing data storage overhead and memory-access overhead. The processing method and apparatus carry out the operations on the quantized neurons and weights through table-lookup operations, reducing the memory-access energy consumption and computation energy consumption of the neural network.
The input neurons and output neurons mentioned in the present application do not refer to the neurons in the input layer and the output layer of the entire neural network. Rather, for any two adjacent layers in the network, the neurons in the lower layer of the network feed-forward operation are the input neurons, and the neurons in the upper layer of the network feed-forward operation are the output neurons. Taking a convolutional neural network as an example, suppose a convolutional neural network has L layers, with K = 1, 2, ..., L-1; for the K-th layer and the (K+1)-th layer, the K-th layer is called the input layer and the neurons therein are the input neurons, while the (K+1)-th layer is called the output layer and the neurons therein are the output neurons. That is, except for the top layer, each layer can serve as an input layer, and its next layer is the corresponding output layer.
To make the objects, technical solutions, and advantages of the present application clearer, the present application is further described in detail below with reference to specific embodiments and the accompanying drawings.
Referring to FIG. 1A, FIG. 1A is a schematic flowchart of a processing method according to an embodiment of the present application. As shown in FIG. 1A, the processing method includes:
step S1: quantizing the weights and the input neurons respectively, and determining a weight dictionary, a weight codebook, a neuron dictionary, and a neuron codebook;
where the process of quantizing the weights specifically includes the steps of:
grouping the weights and performing a clustering operation on each group of weights with a clustering algorithm, dividing a group of weights into m classes, m being a positive integer, with each class of weights corresponding to one weight index, and determining the weight dictionary, where the weight dictionary includes weight positions and weight indices, a weight position being the position of a weight in the neural network structure; and
replacing all weights of each class with one center weight, and determining the weight codebook, where the weight codebook includes the weight indices and the center weights.
Referring to FIG. 1B, FIG. 1B is a schematic diagram of a process of quantizing weights according to an embodiment of the present application. As shown in FIG. 1B, the weights are grouped according to a preset grouping strategy to obtain an ordered weight matrix. Intra-group sampling and clustering operations are then performed on the grouped weight matrix, so that weights with similar values are placed in the same class; the center weights of the four classes, computed according to the loss function, are 1.50, -0.13, -1.3, and 0.23, corresponding to the weights of the four classes respectively. In the resulting weight codebook, the weight index of the class whose center weight is -1.3 is 00, the weight index of the class whose center weight is -0.13 is 01, the weight index of the class whose center weight is 0.23 is 10, and the weight index of the class whose center weight is 1.50 is 11. In addition, the weight indices corresponding to the four center weights (00, 01, 10, and 11) are used to represent the weights of the corresponding classes, thereby yielding the weight dictionary. Note that the weight dictionary also includes weight positions, that is, the positions of the weights in the neural network structure; in the weight dictionary, a weight position is the coordinate (p, q) of the p-th row and q-th column, and in this embodiment, 1 ≤ p ≤ 4 and 1 ≤ q ≤ 4.
It can be seen that this quantization process fully exploits the similarity of weights between layers of the neural network and the local similarity of weights within a layer, and obtains the weight distribution characteristics of the neural network so as to perform low-bit quantization, reducing the number of bits used to represent each weight and thereby reducing weight storage overhead and memory-access overhead.
Optionally, the preset grouping strategy includes, but is not limited to, the following: grouping into one group, where all weights of the neural network are placed in a single group; layer-type grouping, where the weights of all convolutional layers, the weights of all fully connected layers, and the weights of all long short-term memory (LSTM) network layers of the neural network are each placed in one group; inter-layer grouping, where the weights of one or more convolutional layers, the weights of one or more fully connected layers, and the weights of one or more LSTM network layers of the neural network are each placed in one group; and intra-layer grouping, where the weights within one layer of the neural network are partitioned and each partition is placed in one group.
The clustering algorithm includes K-means, K-medoids, Clara, and/or Clarans. The center weight of each class is selected so as to minimize the cost function J(w, w_0); the value of w_0 at the minimum is the center weight. The cost function may be the squared distance:

$$J(w, w_0) = \sum_{i=1}^{n} (w_i - w_0)^2$$

where J is the cost function, w denotes all the weights in the class, w_0 is the center weight, n is the number of weights in the class, w_i is the i-th weight in the class, 1 ≤ i ≤ n, and n is a positive integer.
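The clustering step above can be sketched as a one-dimensional K-means over a group of weights. In the sketch below, the initialization and iteration count are assumptions; the squared-distance cost and the replacement of each class by its center weight follow the text.

```python
# Sketch of weight quantization by 1-D K-means under the squared-distance
# cost J(w, w0) above. Initialization and iteration count are assumptions.

def quantize_weights(weights, m, iters=50):
    # crude initialization: m spread-out values from the sorted weights
    centers = sorted(weights)[:: max(1, len(weights) // m)][:m]
    for _ in range(iters):
        classes = [[] for _ in centers]
        for w in weights:                  # assign each weight to nearest center
            k = min(range(len(centers)), key=lambda j: (w - centers[j]) ** 2)
            classes[k].append(w)
        # the class mean minimizes the squared-distance cost J(w, w0)
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(classes)]
    def index_of(w):
        return min(range(len(centers)), key=lambda j: (w - centers[j]) ** 2)
    codebook = dict(enumerate(centers))          # weight index -> center weight
    dictionary = [index_of(w) for w in weights]  # per-position weight index
    return codebook, dictionary
```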
Further, the quantization of the input neurons is described; it includes the steps of:
dividing the input neurons into p segments, each segment of input neurons corresponding to one neuron range and one neuron index, and determining the neuron dictionary, where p is a positive integer; and
encoding the input neurons, replacing all input neurons of each segment with one center neuron, and determining the neuron codebook.
Referring to FIG. 1C, FIG. 1C is a schematic diagram of a process of quantizing input neurons according to an embodiment of the present application. As shown in FIG. 1C, this embodiment takes the quantization of ReLU activation-layer neurons as a concrete example. The ReLU function is first divided into four segments; the center neurons of the four segments are represented by 0.0, 0.2, 0.5, and 0.7 respectively, and the neuron indices are represented by 00, 01, 10, and 11. A neuron codebook containing the neuron indices and the center neurons is then generated, together with a neuron dictionary containing the neuron ranges and the neuron indices, where the neuron ranges and the neuron indices are stored correspondingly and x denotes the value of a neuron before quantization. This input-neuron quantization process can divide the input neurons into multiple segments according to actual requirements and obtain the index of each segment to form the neuron dictionary; then, according to the neuron indices, the input neurons in each segment are replaced with the center neurons in the neuron codebook. This fully exploits the similarity between input neurons and obtains the distribution characteristics of the input neurons so as to perform low-bit quantization, reducing the number of bits representing each input neuron and thereby reducing the storage overhead and memory-access overhead of the input neurons.
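Following this example, neuron quantization reduces to a range lookup. In the sketch below, the segment boundaries are illustrative assumptions chosen to be consistent with the four-segment ReLU example; the four center neurons and the two-bit indices follow the text.

```python
# Sketch of input-neuron quantization as a range lookup. The segment
# boundaries are assumptions; the four center neurons and two-bit indices
# follow the ReLU example above.

import bisect

SEGMENT_BOUNDS = [0.1, 0.35, 0.6]      # assumed boundaries of the 4 segments
CENTER_NEURONS = [0.0, 0.2, 0.5, 0.7]  # neuron codebook: index -> center neuron

def quantize_neuron(x: float):
    """Return (neuron index, center neuron) for an unquantized value x."""
    idx = bisect.bisect_right(SEGMENT_BOUNDS, x)   # which segment x falls in
    return idx, CENTER_NEURONS[idx]

# e.g. quantize_neuron(0.47) -> (2, 0.5), i.e. index 0b10, center neuron 0.5
```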
Step S2: determining an operation codebook according to the weight codebook and the neuron codebook, which specifically includes the following steps:
S21: determining, according to the weight, the corresponding weight index in the weight codebook, and then determining, through the weight index, the center weight corresponding to the weight;
S22: determining, according to the input neuron, the corresponding neuron index in the neuron codebook, and then determining, through the neuron index, the central neuron corresponding to the input neuron; and
S23: performing an operation on the center weight and the central neuron to obtain an operation result, and arranging the operation results into a matrix, thereby determining the operation codebook.
Referring to FIG. 1D, FIG. 1D is a schematic diagram of a process of determining an operation codebook according to an embodiment of the present application. As shown in FIG. 1D, this embodiment takes a multiplication codebook as an example; in other embodiments, the operation codebook may also be an addition codebook, a pooling codebook, or the like, which is not uniquely limited in the present application. First, in the weight dictionary, the weight index corresponding to the weight and the center weight corresponding to that weight index are determined; then, in the neuron codebook, the corresponding neuron index and the central neuron corresponding to that index are determined according to the input neuron. Finally, the neuron index and the weight index are used as the row index and column index of the operation codebook, and the central neurons and center weights are multiplied pairwise to form a matrix, yielding the multiplication codebook.
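A minimal sketch of step S23 for the multiplication case, reusing the codebook layout of the earlier sketches; the dict-of-index-to-center representation is an illustrative assumption.

```python
import numpy as np

def build_multiplication_codebook(neuron_codebook, weight_codebook):
    """Precompute the operation codebook as a matrix.

    Rows are indexed by neuron index, columns by weight index; entry
    (i, j) is central_neuron[i] * center_weight[j].
    """
    neurons = [neuron_codebook[i] for i in sorted(neuron_codebook)]
    weights = [weight_codebook[j] for j in sorted(weight_codebook)]
    return np.outer(neurons, weights)  # shape: (#neuron idx, #weight idx)
```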
After step S2, a step S3 may further be included: retraining the weights and the input neurons. During retraining, only the weight codebook and the neuron codebook are trained, while the contents of the weight dictionary and the neuron dictionary remain unchanged, which simplifies the retraining operation and reduces the workload. Preferably, the retraining uses a back-propagation algorithm.
Referring to FIG. 1E, FIG. 1E is a schematic structural diagram of a processing apparatus according to an embodiment of the present application. As shown in FIG. 1E, the processing apparatus includes:
a memory 51, configured to store an operation instruction; and
a processor 52, configured to execute the operation instruction in the memory 51 and, when executing it, to operate in accordance with the processing method described above. The operation instruction may be a binary number including an operation code and an address code, where the operation code indicates the operation to be performed by the processor 52 and the address code indicates the address in the memory 51 from which the processor 52 reads the data participating in the operation.
With the data processing apparatus of the present application, the processor 52 executes the operation instructions in the memory 51 and operates according to the foregoing processing method, so that disordered weights and input neurons can be quantized into low-bit, normalized center weights and central neurons. The local similarity between weights and between input neurons is exploited to obtain their distribution characteristics, and low-bit quantization is performed accordingly, reducing the number of bits representing each weight and each input neuron and thereby reducing both storage overhead and memory-access overhead.
Referring to FIG. 1F, FIG. 1F is a schematic structural diagram of an arithmetic device according to an embodiment of the present application. As shown in FIG. 1F, the arithmetic device includes an instruction control unit 1 and a lookup table unit 2;
the instruction control unit 1 is configured to decode a received instruction and generate lookup control information;
the lookup table unit 2 is configured to look up output neurons in the operation codebook according to the lookup control information generated by the instruction control unit 1 and the received weight dictionary, neuron dictionary, operation codebook, weights, and input neurons. The weight dictionary includes weight positions (that is, the position of a weight in the neural network structure, denoted (p, q), specifically the position at row p, column q in the weight dictionary) and weight indices; the neuron dictionary includes input neurons and neuron indices; and the operation codebook includes weight indices, neuron indices, and the operation results of input neurons and weights.
The specific working process of the lookup table unit is as follows: determine the weight index according to the weight position corresponding to the weight in the weight dictionary; determine the neuron index according to the neuron range corresponding to the input neuron in the neuron dictionary; use the weight index and the neuron index as the column index and row index of the operation codebook; and look up the value (the operation result) at that row and column of the operation codebook, which is the output neuron.
Referring to FIG. 1B to FIG. 1D, when performing a lookup, suppose the neuron index of a certain neuron is 01 and the weight index of a certain weight is 10. To operate on that neuron and weight, the value at row 2, column 3 of the multiplication codebook, 0.046, is looked up; this value is the output neuron. Similarly, addition and pooling operations are analogous to the multiplication operation and are not described again here. It can be understood that pooling includes, but is not limited to, average pooling, max pooling, and median pooling.
More specifically, according to the different operations, the lookup table may include at least one of the following:
a multiplication lookup table: taking a weight index in1 and a neuron index in2 as inputs, the table-lookup operation mult_lookup completes the multiplication of the center weight data1 corresponding to the weight index and the central neuron data2 corresponding to the neuron index; that is, the lookup operation out = mult_lookup(in1, in2) implements the multiplication function out = data1 * data2; and/or
an addition lookup table: according to an input index in, a stepwise addition lookup table completes, through the table-lookup operation add_lookup, the addition of the center data data corresponding to the index, where in and data are vectors of length N and N is a positive integer; that is, the lookup operation out = add_lookup(in) implements the addition function out = data[1] + data[2] + ... + data[N]; and/or, taking a weight index in1 and a neuron index in2 as inputs, the table-lookup operation completes the addition of the center weight data1 corresponding to the weight index and the central neuron data2 corresponding to the neuron index; that is, the lookup operation out = add_lookup(in1, in2) implements the addition function out = data1 + data2; and/or
a pooling lookup table: used for the pooling operation on the center data data corresponding to an input index; that is, the lookup operation out = pool_lookup(in) completes the pooling operation out = pool(data), where pooling includes average pooling, max pooling, and median pooling.
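A minimal sketch of the three lookup operations, reusing the table built by build_multiplication_codebook above; the function names mirror mult_lookup / add_lookup / pool_lookup from the text, while the argument layouts and 0-based indexing are illustrative assumptions.

```python
import numpy as np

def mult_lookup(in1, in2, mult_table):
    """out = data1 * data2: fetch the precomputed product, with the
    neuron index in2 as row and the weight index in1 as column.
    E.g. neuron index 0b01 and weight index 0b10 address mult_table[1, 2]."""
    return mult_table[in2, in1]

def add_lookup(indices, center_data):
    """out = data[1] + data[2] + ... + data[N]: stepwise addition over
    the center data addressed by the index vector."""
    return sum(center_data[i] for i in indices)

def pool_lookup(indices, center_data, mode="max"):
    """out = pool(data): pooling over the addressed center data."""
    vals = [center_data[i] for i in indices]
    return {"average": np.mean, "max": np.max, "median": np.median}[mode](vals)
```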
Referring to FIG. 1G, FIG. 1G is a schematic structural diagram of another arithmetic device according to an embodiment of the present application. As shown in FIG. 1G, compared with the arithmetic device in FIG. 1F, the arithmetic device of this embodiment further includes a preprocessing unit 4, a storage unit 3, a cache unit 6, and a direct memory access unit 5, which can optimize the processing flow of the present application and make the data processing more orderly.
The preprocessing unit 4 is configured to preprocess externally input information to obtain the weights, input neurons, instructions, weight dictionary, neuron dictionary, and operation codebook; the preprocessing includes, but is not limited to, segmentation, Gaussian filtering, binarization, regularization, and/or normalization.
The storage unit 3 is configured to store the input neurons, weights, weight dictionary, neuron dictionary, operation codebook, and instructions, and to receive the output neurons.
The cache unit 6 is configured to cache the instructions, weight indices, neuron indices, and output neurons, and may include:
an instruction cache 61, configured to cache the instructions and output the cached instructions to the instruction control unit 1;
a weight cache 62, configured to cache the weights and output the cached weights to the lookup table unit 2;
an input neuron cache 63, configured to cache the input neurons and output the cached input neurons to the lookup table unit 2;
an output neuron cache 64, configured to cache the output neurons produced by the lookup table unit 2 and output the cached output neurons to the lookup table unit 2;
a neuron index cache 65, configured to determine the corresponding neuron index according to the input neuron, cache the neuron index, and output the cached neuron index to the lookup table unit 2; and
a weight index cache 66, configured to determine the corresponding weight index according to the weight, cache the weight index, and output the cached weight index to the lookup table unit 2.
The direct memory access unit 5 is configured to read and write data or instructions between the storage unit 3 and the cache unit 6.
Optionally, the instruction may be a neural-network-specific instruction, including all instructions dedicated to completing artificial neural network operations. Neural-network-specific instructions include, but are not limited to, control instructions, data transfer instructions, operation instructions, and logic instructions. Control instructions control the execution flow of the neural network. Data transfer instructions complete data transfers between different storage media; the data formats include, but are not limited to, matrix, vector, and scalar. Operation instructions complete the arithmetic operations of the neural network, including but not limited to matrix operation instructions, vector operation instructions, scalar operation instructions, convolutional neural network operation instructions, fully connected neural network operation instructions, pooling neural network operation instructions, RBM neural network operation instructions, LRN neural network operation instructions, LCN neural network operation instructions, LSTM neural network operation instructions, RNN neural network operation instructions, RELU neural network operation instructions, PRELU neural network operation instructions, SIGMOID neural network operation instructions, TANH neural network operation instructions, and MAXOUT neural network operation instructions. Logic instructions complete the logical operations of the neural network, including but not limited to vector logic operation instructions and scalar logic operation instructions.
The RBM neural network operation instruction is used to implement Restricted Boltzmann Machine (RBM) neural network operations.
The LRN neural network operation instruction is used to implement Local Response Normalization (LRN) neural network operations.
The LSTM neural network operation instruction is used to implement Long Short-Term Memory (LSTM) neural network operations.
The RNN neural network operation instruction is used to implement Recurrent Neural Network (RNN) operations.
The RELU neural network operation instruction is used to implement Rectified Linear Unit (RELU) neural network operations.
The PRELU neural network operation instruction is used to implement Parametric Rectified Linear Unit (PRELU) neural network operations.
The SIGMOID neural network operation instruction is used to implement sigmoid growth curve (SIGMOID) neural network operations.
The TANH neural network operation instruction is used to implement hyperbolic tangent (TANH) neural network operations.
The MAXOUT neural network operation instruction is used to implement MAXOUT neural network operations.
Further, the neural-network-specific instructions include the Cambricon instruction set. The Cambricon instruction set includes at least one Cambricon instruction, each 64 bits long and consisting of an operation code and operands. The Cambricon instruction set contains four types of instructions: Cambricon control instructions, Cambricon data transfer instructions, Cambricon operation instructions, and Cambricon logic instructions.
Optionally, Cambricon control instructions control the execution flow and include jump instructions and conditional branch instructions.
Optionally, Cambricon data transfer instructions complete data transfers between different storage media and include load instructions, store instructions, and move instructions. The load instruction loads data from main memory into a cache; the store instruction stores data from a cache into main memory; the move instruction moves data between caches, between a cache and a register, or between registers. Data transfer instructions support three data organizations: matrix, vector, and scalar.
Optionally, Cambricon operation instructions complete the arithmetic operations of the neural network and include Cambricon matrix operation instructions, Cambricon vector operation instructions, and Cambricon scalar operation instructions.
Optionally, Cambricon matrix operation instructions complete the matrix operations in the neural network, including matrix-multiply-vector, vector-multiply-matrix, matrix-multiply-scalar, outer product, matrix-add-matrix, and matrix-subtract-matrix.
Optionally, Cambricon vector operation instructions complete the vector operations in the neural network, including vector elementary arithmetic, vector transcendental functions, dot product, random vector generation, and maximum/minimum of a vector. Vector elementary arithmetic includes vector add, subtract, multiply, and divide; vector transcendental functions are functions that do not satisfy any polynomial equation with polynomial coefficients, including but not limited to exponential, logarithmic, trigonometric, and inverse trigonometric functions.
Optionally, Cambricon scalar operation instructions complete the scalar operations in the neural network, including scalar elementary arithmetic and scalar transcendental functions. Scalar elementary arithmetic includes scalar add, subtract, multiply, and divide; scalar transcendental functions are functions that do not satisfy any polynomial equation with polynomial coefficients, including but not limited to exponential, logarithmic, trigonometric, and inverse trigonometric functions.
Optionally, Cambricon logic instructions complete the logical operations of the neural network and include Cambricon vector logic operation instructions and Cambricon scalar logic operation instructions. Cambricon vector logic operation instructions complete vector compare, vector logical operations, and vector-greater-than-merge, where vector compare includes but is not limited to greater than, less than, equal to, greater than or equal to, less than or equal to, and not equal to, and vector logical operations include AND, OR, and NOT.
Optionally, Cambricon scalar logic operation instructions complete scalar compare and scalar logical operations, where scalar compare includes but is not limited to greater than, less than, equal to, greater than or equal to, less than or equal to, and not equal to, and scalar logical operations include AND, OR, and NOT.
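To make the 64-bit instruction layout concrete, here is a hedged sketch of packing and unpacking such an instruction word. Only the 64-bit width and the opcode/operand split come from the text above; the field widths, field count, and opcode values are illustrative assumptions, not the actual Cambricon encoding.

```python
# Assumed opcode values for illustration only.
OPCODES = {"jump": 0x01, "load": 0x10, "store": 0x11, "vector_add": 0x20}

def encode(name, op0=0, op1=0, op2=0):
    """Pack an assumed 8-bit opcode and three assumed 16-bit operands
    into one 64-bit word (8 + 3 * 16 = 56 bits used)."""
    word = OPCODES[name] & 0xFF
    word |= (op0 & 0xFFFF) << 8
    word |= (op1 & 0xFFFF) << 24
    word |= (op2 & 0xFFFF) << 40
    return word

def decode(word):
    """Recover (opcode, op0, op1, op2) from a packed word."""
    return (word & 0xFF, (word >> 8) & 0xFFFF,
            (word >> 24) & 0xFFFF, (word >> 40) & 0xFFFF)
```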
Referring to FIG. 1H, FIG. 1H is a schematic flowchart of another operation method according to an embodiment of the present application. As shown in FIG. 1H, the operation method includes the following steps:
S81: receiving the weights, input neurons, instruction, weight dictionary, neuron dictionary, and operation codebook, where the weight dictionary includes weight positions and weight indices, the neuron dictionary includes input neurons and neuron indices, and the operation codebook includes weight indices, neuron indices, and the operation results of input neurons and weights;
S82: decoding the instruction to determine lookup control information; and
S83: looking up the output neurons in the operation codebook according to the lookup control information, weights, weight dictionary, neuron dictionary, and input neurons.
Step S83 is similar to the specific working process of the lookup table unit and specifically includes the following sub-steps:
S831: according to the weights, input neurons, weight dictionary, and neuron dictionary, determining the neuron index by determining the neuron range in the neuron dictionary, and determining the weight index by determining the weight position in the weight dictionary; and
S832: looking up the operation result in the operation codebook according to the weight index and the neuron index, thereby determining the output neuron.
To optimize the operation method of the present application and make the processing more convenient and orderly, an embodiment of the present application provides yet another operation method. FIG. 1I is a schematic flowchart of the operation method of a specific embodiment of the present application; the operation method includes the following steps:
Step S90: preprocessing the externally input information.
Optionally, preprocessing the externally input information specifically includes obtaining the weights, input neurons, instruction, weight dictionary, neuron dictionary, and operation codebook corresponding to the input information; the preprocessing includes segmentation, Gaussian filtering, binarization, regularization, and/or normalization.
Step S91: receiving the weights, input neurons, instruction, weight dictionary, neuron dictionary, and operation codebook.
Step S92: storing the weights, input neurons, instruction, weight dictionary, neuron dictionary, and operation codebook.
Step S93: caching the weights, input neurons, instruction, weight indices, and neuron indices.
Step S94: decoding the instruction to determine lookup control information.
Step S95: according to the weights, input neurons, weight dictionary, and neuron dictionary, determining the neuron index by determining the neuron range in the neuron dictionary, and determining the weight index by determining the weight position in the weight dictionary.
Step S96: looking up the operation result in the operation codebook according to the weight index and the neuron index, thereby determining the output neuron.
Referring to FIG. 2A, FIG. 2A is a schematic structural diagram of a hierarchical storage device according to an embodiment of the present application. As shown in FIG. 2A, the device includes an exact storage unit and an inexact storage unit; the exact storage unit is configured to store the important bits of the data, and the inexact storage unit is configured to store the non-important bits of the data.
The exact storage unit uses error checking and correcting (ECC) memory, and the inexact storage unit uses non-ECC memory.
Further, the data stored in the hierarchical storage device are neural network parameters, including input neurons, weights, and output neurons; the exact storage unit stores the important bits of the input neurons, output neurons, and weights, and the inexact storage unit stores their non-important bits.
Further, the data stored in the hierarchical storage device include floating-point data and fixed-point data. For floating-point data, the sign bit and the exponent part are designated as important bits, and the mantissa part as non-important bits; for fixed-point data, the sign bit and the first x bits of the numerical part are designated as important bits, and the remaining bits of the numerical part as non-important bits, where x is an integer with 0 ≤ x < m and m is the total number of bits of the fixed-point data. The important bits are placed in ECC memory for exact storage, and the non-important bits in non-ECC memory for inexact storage.
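A minimal sketch of this bit partition for IEEE-754 float32 values (sign 1 bit, exponent 8 bits, mantissa 23 bits); the IEEE-754 layout is standard, while applying it to illustrate the hierarchical storage policy, and the function names, are our own framing.

```python
import struct

def split_float32(value):
    """Split a float32 into important (sign + exponent) and
    non-important (mantissa) bits."""
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    important = bits >> 23                  # 9 bits, kept in ECC memory
    nonimportant = bits & ((1 << 23) - 1)   # 23 bits, kept in non-ECC memory
    return important, nonimportant

def merge_float32(important, nonimportant):
    """Splice the two parts back into a full float32 value."""
    bits = (important << 23) | (nonimportant & ((1 << 23) - 1))
    return struct.unpack("<f", struct.pack("<I", bits))[0]
```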
Further, the ECC memory includes dynamic random access memory (DRAM) with ECC checking and static random access memory (SRAM) with ECC checking; the SRAM with ECC checking uses 6T SRAM, and in other embodiments of the present application 4T SRAM or 3T SRAM may also be used.
Further, the non-ECC memory includes DRAM without ECC checking and SRAM without ECC checking; the SRAM without ECC checking uses 6T SRAM, and in other embodiments of the present application 4T SRAM or 3T SRAM may also be used.
In a 6T SRAM, the cell storing each bit consists of six metal-oxide-semiconductor (MOS) transistors; in a 4T SRAM, the cell storing each bit consists of four MOS transistors; and in a 3T SRAM, the cell storing each bit consists of three MOS transistors.
SRAM storing neural network weights generally uses 6T SRAM, which is highly stable but occupies a large area and has high read/write power consumption. Neural network algorithms have a certain fault tolerance that 6T SRAM cannot exploit. Therefore, to fully exploit the fault tolerance of neural networks, this embodiment replaces 6T SRAM with 4T SRAM or 3T SRAM storage technology, increasing SRAM storage density and reducing SRAM access power, while using the fault tolerance of neural network algorithms to mask the weaker noise immunity of 4T SRAM.
Referring to FIG. 2B, FIG. 2B is a schematic structural diagram of a 4T SRAM storage cell according to an embodiment of the present application. As shown in FIG. 2B, the 4T SRAM storage cell consists of four NMOS transistors: M1 (first MOS transistor), M2 (second MOS transistor), M3 (third MOS transistor), and M4 (fourth MOS transistor). M1 and M2 are used for gating, and M3 and M4 for storage.
The gate of M1 is electrically connected to the word line WL and its source to the bit line BL; the gate of M2 is electrically connected to the word line WL and its source to the bit line BLB. The gate of M3 is connected to the source of M4 and the drain of M2, and to the working voltage Vdd through the resistor R2, with the drain of M3 grounded; the gate of M4 is connected to the source of M3 and the drain of M1, and to the working voltage Vdd through the resistor R1, with the drain of M4 grounded. WL controls gated access to the storage cell, and BL performs reads and writes of the storage cell. In a read operation, WL is pulled high and the bit is read from BL. In a write operation, WL is pulled high and BL is pulled high or low; since the drive strength of BL is greater than that of the cell, the original state is forcibly overwritten.
Referring to FIG. 2C, FIG. 2C is a schematic structural diagram of a 3T SRAM storage cell according to an embodiment of the present application. As shown in FIG. 2C, the 3T SRAM storage cell consists of three MOS transistors: M1 (first MOS transistor), M2 (second MOS transistor), and M3 (third MOS transistor). M1 is used for gating, and M2 and M3 for storage.
The gate of M1 is electrically connected to the word line WL and its source to the bit line BL. The gate of M2 is connected to the source of M3, and to the working voltage Vdd through the resistor R2, with the drain of M2 grounded; the gate of M3 is connected to the source of M2 and the drain of M1, and to the working voltage Vdd through the resistor R1, with the drain of M3 grounded. WL controls gated access to the storage cell, and BL performs reads and writes of the storage cell. In a read operation, WL is pulled high and the bit is read from BL. In a write operation, WL is pulled high and BL is pulled high or low; since the drive strength of BL is greater than that of the cell, the original state is forcibly overwritten.
The storage device of the present application adopts approximate storage technology, which can fully exploit the fault tolerance of neural networks: the neural network parameters are stored approximately, with the important bits of each parameter stored exactly and the non-important bits stored inexactly, thereby reducing storage overhead and memory-access energy overhead.
An embodiment of the present application provides a data processing device, an acceleration device corresponding to the approximate storage technology. Referring to FIG. 2D, FIG. 2D is a schematic structural diagram of a data processing device according to an embodiment of the present application. The data processing device includes an inexact computing unit, an instruction control unit, and the hierarchical storage device described above.
The hierarchical storage device receives instructions and operation parameters, stores the important bits of the operation parameters and the instructions in the exact storage unit, and stores the non-important bits of the operation parameters in the inexact storage unit.
The instruction control unit receives the instructions from the hierarchical storage device and decodes them to generate control information that controls the inexact computing unit to perform computation.
The inexact computing unit receives the operation parameters from the hierarchical storage device, performs the computation according to the control information, and transmits the computation result to the hierarchical storage device for storage or output.
Further, the inexact computing unit is a neural network processor. Further, the operation parameters are neural network parameters, and the hierarchical storage device stores the neurons, weights, and instructions of the neural network: the important bits of the neurons, the important bits of the weights, and the instructions are stored in the exact storage unit, while the non-important bits of the neurons and of the weights are stored in the inexact storage unit. The inexact computing unit receives the input neurons and weights from the hierarchical storage device, completes the neural network operation according to the control information to obtain the output neurons, and transmits the output neurons back to the hierarchical storage device for storage or output.
Further, the inexact computing unit may have two compute modes: (1) the inexact computing unit directly receives the important bits of the input neurons and the important bits of the weights from the exact storage unit of the hierarchical storage device and computes with them; (2) the inexact computing unit receives complete input neurons and weights obtained by splicing the important and non-important bits, the splicing being performed when they are read from the storage units.
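A minimal sketch of the two compute modes, reusing merge_float32 from the previous sketch; the function name and mode encoding are illustrative assumptions.

```python
def load_operand(important, nonimportant, mode):
    """Fetch an operand under the two compute modes described above.

    mode 1: use only the important bits (mantissa taken as zero);
    mode 2: splice important and non-important bits into the full value.
    """
    if mode == 1:
        return merge_float32(important, 0)  # truncated, exact-storage value
    return merge_float32(important, nonimportant)  # full spliced value
```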
Further, referring to FIG. 2E, as shown in FIG. 2E, the data processing device further includes a preprocessing module for preprocessing the raw input data and transmitting it to the storage device; the preprocessing includes segmentation, Gaussian filtering, binarization, regularization, normalization, and so on.
Further, the data processing device further includes an instruction cache, an input neuron hierarchical cache, a weight hierarchical cache, and an output neuron hierarchical cache. The instruction cache is disposed between the hierarchical storage device and the instruction control unit to store dedicated instructions. The input neuron hierarchical cache is disposed between the storage device and the inexact computing unit to cache the input neurons, and includes an input neuron exact cache and an input neuron inexact cache, which cache the important bits and non-important bits of the input neurons, respectively. The weight hierarchical cache is disposed between the storage device and the inexact computing unit to cache the weight data, and includes a weight exact cache and a weight inexact cache, which cache the important bits and non-important bits of the weights, respectively. The output neuron hierarchical cache is disposed between the storage device and the inexact computing unit to cache the output neurons, and includes an output neuron exact cache and an output neuron inexact cache, which cache the important bits and non-important bits of the output neurons, respectively.
Further, the data processing device further includes a direct memory access (DMA) unit for reading and writing data or instructions among the storage device, the instruction cache, the weight hierarchical cache, the input neuron hierarchical cache, and the output neuron hierarchical cache.
Further, the instruction cache, input neuron hierarchical cache, weight hierarchical cache, and output neuron hierarchical cache all use 4T SRAM or 3T SRAM.
Further, the inexact computing unit includes, but is not limited to, three parts: a first part, a multiplier; a second part, an adder tree; and a third part, an activation function unit. The first part multiplies input data 1 (in1) by input data 2 (in2) to obtain the output (out): out = in1 * in2. The second part adds the input data in1 stepwise through the adder tree to obtain the output data (out), where in1 is a vector of length N with N > 1: out = in1[1] + in1[2] + ... + in1[N]; or accumulates the input data (in1) through the adder tree and then adds the input data (in2) to obtain the output data (out): out = in1[1] + in1[2] + ... + in1[N] + in2; or adds the input data (in1) and the input data (in2) to obtain the output data (out): out = in1 + in2. The third part applies an activation function (active) to the input data (in) to obtain the activation output data (out): out = active(in), where the activation function active may be sigmoid, tanh, relu, softmax, and the like; besides the activation operation, the third part may also apply other nonlinear functions, obtaining the output data (out) from the input data (in) through an operation (f): out = f(in).
The inexact computing unit may further include a pooling unit, which obtains the output data (out) from the input data (in) through a pooling operation: out = pool(in), where pool denotes the pooling operation, including but not limited to average pooling, max pooling, and median pooling; the input data in is the data in the pooling window associated with the output out.
The operations performed by the inexact computing unit comprise several parts: the first part multiplies input data 1 and input data 2 to obtain the product; the second part performs the adder tree operation, adding input data 1 stepwise through the adder tree, or adding input data 1, accumulated through the adder tree, to input data 2, to obtain the output data; the third part performs the activation function operation, applying the activation function (active) to the input data to obtain the output data. The operations of these parts can be freely combined to implement operations of various functions.
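A minimal sketch of composing the three parts into out = active(sum(in1) + in2); the Python-level structure is illustrative of the datapath described above, not of the hardware itself.

```python
import numpy as np

def inexact_unit(in1, in2=None, active=None):
    """in1: vector of products or inputs; in2: optional extra operand;
    active: optional nonlinearity applied to the accumulated result."""
    out = np.sum(in1)          # adder tree: in1[1] + ... + in1[N]
    if in2 is not None:
        out = out + in2        # ... + in2
    if active is not None:
        out = active(out)      # e.g. relu, sigmoid, tanh
    return out

relu = lambda x: np.maximum(x, 0.0)
# Example: one neuron, out = active(sum(w * x) + b)
w, x, b = np.array([0.5, -0.2]), np.array([1.0, 2.0]), 0.1
y = inexact_unit(w * x, b, relu)
```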
The data processing device of the present application can make full use of approximate storage technology and fully exploit the fault tolerance of neural networks, reducing the computation and memory traffic of the neural network and thereby reducing computation energy and memory-access energy. By adopting dedicated SIMD instructions for multilayer artificial neural network operations and customized computing units, the problems of insufficient CPU and GPU performance and high front-end decoding overhead are solved, and support for multilayer artificial neural network operation algorithms is effectively improved. By adopting a dedicated on-chip cache with inexact storage for the multilayer artificial neural network operation algorithm, the relative importance of the input neuron and weight data is fully exploited, repeated reads of these data from memory are avoided, memory access bandwidth is reduced, and memory bandwidth is prevented from becoming a performance bottleneck for multilayer artificial neural network operations and their training algorithms.
The above is merely an exemplary description, and the present application is not limited thereto. The data processing device may include a non-neural-network processor, for example a general-purpose computing processor; general-purpose computation has corresponding general-purpose instructions and data, for example scalar arithmetic operations and scalar logic operations, and the general-purpose computing processor includes, for example but not limited to, one or more multipliers and one or more adders, performing basic operations such as addition and multiplication.
Yet another embodiment of the present application provides a data storage method that stores data hierarchically in an approximate-storage manner. Referring to FIG. 2F, FIG. 2F is a flowchart of a data storage method according to an embodiment of the present application, including the following steps:
S601: storing the important bits of the data exactly.
S602: storing the non-important bits of the data inexactly.
Specifically, the data storage method includes the following steps:
extracting the important bits and non-important bits of the data;
storing the important bits of the data in ECC memory for exact storage; and
storing the non-important bits of the data in non-ECC memory for inexact storage.
In this embodiment, the stored data are neural network parameters, and the bits representing a neural network parameter are divided into important bits and non-important bits. For example, a parameter of the neural network has m bits in total, of which n bits are important bits and (m - n) bits are non-important bits, where m is an integer greater than 0 and n is an integer greater than 0 and less than or equal to m.
The neural network parameters include input neurons, weights, and output neurons; the important bits of the input neurons, the important bits of the output neurons, and the important bits of the weights are stored exactly, while their non-important bits are stored inexactly.
The data include floating-point data and fixed-point data. For floating-point data, the sign bit and the exponent part are defined as important bits and the mantissa part as non-important bits; for fixed-point data, the sign bit and the first x bits of the numerical part are important bits and the remaining bits of the numerical part are non-important bits, where x is an integer with 0 ≤ x < m and m is the total number of bits of the parameter.
The ECC memory includes SRAM with ECC checking and DRAM with ECC checking; the non-ECC memory includes SRAM without ECC checking and DRAM without ECC checking. The SRAM with ECC checking and the SRAM without ECC checking use 6T SRAM; in other embodiments of the present application, 4T SRAM or 3T SRAM may also be used.
Yet another embodiment of the present application provides a data processing method. FIG. 2G is a flowchart of a data processing method according to an embodiment of the present application; as shown in FIG. 2G, it includes:
S1: receiving instructions and parameters, storing the important bits of the parameters and the instructions exactly, and storing the non-important bits of the parameters inexactly;
S2: receiving the instructions and decoding them to generate control information; and
S3: receiving the parameters, performing the operation according to the control information, and storing the operation result.
The operation is a neural network operation, and the parameters are neural network parameters, including input neurons, weights, and output neurons.
Step S3 further includes: receiving the input neurons and weights, completing the neural network operation according to the control information to obtain the output neurons, and storing or outputting the output neurons.
Further, receiving the input neurons and weights and completing the neural network operation according to the control information to obtain the output neurons includes: receiving the important bits of the input neurons and of the weights for computation; or receiving complete input neurons and weights obtained by splicing the important and non-important bits for computation.
Further, the method includes the following steps: caching dedicated instructions; performing exact and inexact caching of the input neurons; performing exact and inexact caching of the weight data; and performing exact and inexact caching of the output neurons.
Further, before step S1, the method includes preprocessing the parameters.
Yet another embodiment of the present application provides a storage unit, which is a 4T SRAM or a 3T SRAM, for storing neural network parameters. The specific structure of the 4T SRAM is shown in FIG. 2B, and the specific structure of the 3T SRAM is shown in FIG. 2C; they are not described again here.
Referring to FIG. 3A, FIG. 3A is a schematic structural diagram of a dynamic voltage and frequency scaling (DVFS) device 100 according to an embodiment of the present application. As shown in FIG. 3A, the DVFS device 100 includes:
an information acquisition unit 101, configured to acquire in real time the working state information or application scenario information of the chip connected to the DVFS device, where the application scenario information is obtained by the chip through neural network computation or acquired by a sensor connected to the chip; and
a voltage and frequency scaling unit 102, configured to send voltage-frequency scaling information to the chip according to the working state information or application scenario information of the chip, where the voltage-frequency scaling information is used to instruct the chip to adjust its working voltage or working frequency.
In a possible embodiment of the present application, the working state information of the chip includes the running speed of the chip, the voltage-frequency scaling information includes first voltage-frequency scaling information, and the voltage and frequency scaling unit 102 is configured to:
send the first voltage-frequency scaling information to the chip when the running speed of the chip is greater than a target speed, where the first voltage-frequency scaling information instructs the chip to lower its working frequency or working voltage, and the target speed is the running speed of the chip at which user requirements are met.
Specifically, the information acquisition unit 101 acquires in real time the running speed of the chip connected to it. The running speed of the chip may be a different type of speed depending on the task the chip performs: when the chip performs video image processing, the running speed may be the frame rate at which the chip processes video images; when the chip performs speech recognition, the running speed is the speed at which the chip recognizes speech from the input information. When the voltage and frequency scaling unit 102 determines that the running speed of the chip is greater than the target speed, that is, the chip already runs at least as fast as user requirements demand, it sends the first voltage-frequency scaling information to the chip to instruct the chip to lower its working voltage or working frequency, thereby reducing the power consumption of the chip.
For example, suppose the chip performs video image processing and the target speed is 24 frames per second. The information acquisition unit acquires in real time the frame rate at which the chip processes video images, currently 54 frames per second. When the voltage and frequency scaling unit determines that the chip's current video-processing frame rate is greater than the target speed, it sends the first voltage-frequency scaling information to the chip to instruct the chip to lower its working voltage or working frequency, thereby reducing the power consumption of the chip.
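A minimal sketch of this first scaling rule: if the measured speed (e.g. 54 fps) exceeds the target (e.g. 24 fps), lower the frequency and voltage by one step. The step sizes, and the choice to scale both quantities together, are illustrative assumptions.

```python
def dvfs_step(running_speed, target_speed, freq, voltage,
              freq_step=0.05, volt_step=0.02):
    """Lower frequency/voltage by one step whenever the chip runs
    faster than the user-required target speed."""
    if running_speed > target_speed:
        freq *= (1.0 - freq_step)      # lower working frequency
        voltage *= (1.0 - volt_step)   # lower working voltage
    return freq, voltage

freq, volt = dvfs_step(running_speed=54, target_speed=24,
                       freq=1.0e9, voltage=0.9)
```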
在本申请的一可能实施例中,所述芯片至少包括第一单元和第二单元,所述第一单元的输出数据为所述第二单元的输入数据,所述芯片的工作状态信息包括所述第一单元的运行速度和第二单元的运行速度,所述电压频率调控信息包括第二电压频率调控信息,所述调频调压单元102还用于:In a possible embodiment of the present application, the chip includes at least a first unit and a second unit, the output data of the first unit is input data of the second unit, and the working status information of the chip includes The operating speed of the first unit and the operating speed of the second unit, the voltage frequency regulation information includes second voltage frequency regulation information, and the frequency modulation unit 102 is further configured to:
当根据所述第一单元的运行速度和所述第二单元的运行速度确定所述第一单元的运行时间超过所述第二单元的运行时间时,向所述第二单元发送所述第二电压频率调控信息,所述第二电压频率调控信息用于指示所述第二单元降低其工作频率或者工作电压。And when the running time of the first unit exceeds the running time of the second unit according to the running speed of the first unit and the running speed of the second unit, sending the second unit to the second unit Voltage frequency regulation information, the second voltage frequency regulation information is used to instruct the second unit to reduce its operating frequency or operating voltage.
具体地,上述芯片执行任务需要上述第一单元和上述第二单元的配合,并且上述第一单元的输出数据为上述第二单元的输入数据。上述信息采集单元101实时采集上述第一单元和上述第二单元的运行速度。当确定上述第一单元的运行速度小于上述第二单元的运行速度即上述第一单元的运行时间超过上述第二单元的运行时间时,上述调压调频单元102向上述第二单元发送上述第二电压频率调控信息,以指示上述第二单元降低其工作电压或者工作频率,达到在不影响芯片整体的运行速度的前提下,达到降低芯片整体的功耗。Specifically, the chip performing the task requires the cooperation of the first unit and the second unit, and the output data of the first unit is the input data of the second unit. The information collecting unit 101 collects the operating speeds of the first unit and the second unit in real time. When it is determined that the running speed of the first unit is less than the running speed of the second unit, that is, the running time of the first unit exceeds the running time of the second unit, the voltage regulating and frequency converting unit 102 sends the second unit to the second unit. The voltage frequency regulation information is used to instruct the second unit to lower its working voltage or operating frequency, so as to reduce the power consumption of the whole chip without affecting the overall running speed of the chip.
In a possible embodiment of the present application, the voltage-frequency regulation information includes third voltage-frequency regulation information, and the voltage and frequency regulation unit 102 is further configured to:
when it is determined, from the running speed of the first unit and the running speed of the second unit, that the running time of the second unit exceeds the running time of the first unit, send the third voltage-frequency regulation information to the first unit, the third voltage-frequency regulation information instructing the first unit to lower its operating frequency or operating voltage.
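The two cases above are symmetric: whichever stage of the producer-consumer pair finishes sooner can be slowed down at no cost to throughput. A sketch under the same illustrative interface assumptions (runtime() and send_scale_down() are invented names):

```python
# Hedged sketch: throttle whichever pipeline stage would otherwise
# wait. The first unit's output feeds the second unit, so overall
# speed is set by the slower of the two.

def balance_pipeline(regulator, first_unit, second_unit):
    t1 = first_unit.runtime()   # producer time per batch
    t2 = second_unit.runtime()  # consumer time per batch
    if t1 > t2:
        # Consumer idles waiting for data: second voltage-frequency
        # regulation information, sent to the second unit.
        regulator.send_scale_down(second_unit)
    elif t2 > t1:
        # Producer idles waiting for the consumer: third
        # voltage-frequency regulation information, sent to the first unit.
        regulator.send_scale_down(first_unit)
```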
In a possible embodiment of the present application, the chip includes at least N units, and the working state information of the chip includes the working state information of at least S units among the at least N units, where N is an integer greater than 1 and S is an integer less than or equal to N. The voltage-frequency regulation information includes fourth voltage-frequency regulation information, and the voltage and frequency regulation unit 102 is configured to:
when it is determined, from the working state information of a unit A, that the unit A is in an idle state, send the fourth voltage-frequency regulation information to the unit A, the fourth voltage-frequency regulation information instructing the unit A to lower its operating frequency or operating voltage,
where the unit A is any one of the at least S units.
In a possible embodiment of the present application, the voltage-frequency regulation information includes fifth voltage-frequency regulation information, and the voltage and frequency regulation unit 102 is further configured to:
when it is determined, from the working state information of the unit A, that the unit A has returned to a working state, send the fifth voltage-frequency regulation information to the unit A, the fifth voltage-frequency regulation information instructing the unit A to raise its operating voltage or operating frequency.
Specifically, while the chip is working, the information collection unit 101 collects, in real time, the working state information of at least S units inside the chip. When it is determined from the working state information of unit A that unit A is in an idle state, the voltage and frequency regulation unit 102 sends the fourth voltage-frequency regulation information to unit A, instructing it to lower its operating frequency or operating voltage and thereby reduce its power consumption; when it is determined from the working state information of unit A that unit A has returned to a working state, the voltage and frequency regulation unit 102 sends the fifth voltage-frequency regulation information to unit A, instructing it to raise its operating frequency or operating voltage so that unit A's running speed meets the demands of the work.
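For illustration, the idle/active rule can be tracked per unit as below; is_idle(), scale_down(), and scale_up() are invented names, and the edge-triggered gating (acting only on state transitions) is one plausible reading of the embodiment, not a mandated implementation:

```python
# Hedged sketch: per-unit DVFS keyed on idle/active transitions.
# `units` is any iterable of objects exposing the illustrative
# methods is_idle(), scale_down(), and scale_up().

def regulate_units(units, was_idle: dict) -> None:
    for unit in units:
        idle = unit.is_idle()
        if idle and not was_idle.get(unit, False):
            # Fourth voltage-frequency regulation information:
            # the unit just went idle, so lower voltage/frequency.
            unit.scale_down()
        elif not idle and was_idle.get(unit, False):
            # Fifth voltage-frequency regulation information:
            # the unit resumed work, so raise voltage/frequency.
            unit.scale_up()
        was_idle[unit] = idle
```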
In a possible embodiment of the present application, the application scenario of the chip is image recognition, the application scenario information is the number of objects in the image to be recognized, and the voltage-frequency regulation information includes sixth voltage-frequency regulation information. The voltage and frequency regulation unit 102 is further configured to:
when it is determined that the number of objects in the image to be recognized is less than a first threshold, send the sixth voltage-frequency regulation information to the chip, the sixth voltage-frequency regulation information instructing the chip to lower its operating voltage or operating frequency.
Specifically, the chip is applied to image recognition, and the number of objects in the image to be recognized is obtained by the chip through a neural network algorithm. After the information collection unit 101 obtains from the chip the number of objects in the image to be recognized (that is, the application scenario information), the voltage and frequency regulation unit 102 sends the sixth voltage-frequency regulation information to the chip when it determines that this number is less than the first threshold, instructing the chip to lower its operating voltage or operating frequency; when it determines that the number of objects in the image to be recognized is greater than the first threshold, it sends voltage-frequency regulation information instructing the chip to raise its operating voltage or operating frequency.
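As a sketch, this rule is a single threshold comparison on a value the chip itself computes; the threshold value below and the scale_up()/scale_down() interface are illustrative assumptions carried over from the earlier sketches:

```python
# Hedged sketch of the image-recognition scenario rule: fewer objects
# in the image means a lighter workload, so voltage/frequency drops.

FIRST_THRESHOLD = 5  # illustrative; the patent fixes no concrete value

def regulate_for_object_count(chip, object_count: int) -> None:
    if object_count < FIRST_THRESHOLD:
        chip.scale_down()  # sixth voltage-frequency regulation information
    elif object_count > FIRST_THRESHOLD:
        chip.scale_up()    # heavier scene: raise voltage/frequency
```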
In a possible embodiment of the present application, the application scenario information is object tag information, the voltage-frequency regulation information includes seventh voltage-frequency regulation information, and the voltage and frequency regulation unit 102 is further configured to:
when it is determined that the object tag information belongs to a preset object tag set, send the seventh voltage-frequency regulation information to the chip, the seventh voltage-frequency regulation information instructing the chip to raise its operating voltage or operating frequency.
For example, the preset object tag set includes multiple object tags, such as "person", "dog", "tree", and "flower". When the chip determines, through a neural network algorithm, that the current application scenario includes a dog, it transmits the object tag information including "dog" to the information collection unit 101. When the voltage and frequency regulation unit 102 determines that the object tag information includes "dog", it sends the seventh voltage-frequency regulation information to the chip, instructing the chip to raise its operating voltage or operating frequency; when it determines that the object tag information does not belong to the preset object tag set, it sends voltage-frequency regulation information instructing the chip to lower its operating voltage or operating frequency.
In a possible embodiment of the present application, the chip is applied to speech recognition, the application scenario information is the voice input rate, the voltage-frequency regulation information includes eighth voltage-frequency regulation information, and the voltage and frequency regulation unit is further configured to:
when the voice input rate is less than a second threshold, send the eighth voltage-frequency regulation information to the chip, the eighth voltage-frequency regulation information instructing the chip to lower its operating voltage or operating frequency.
Specifically, the application scenario of the chip is speech recognition, and the input unit of the chip feeds voice to the chip at a certain rate. The information collection unit 101 collects the voice input rate in real time and sends the voice input rate information to the voltage and frequency regulation unit 102. When the voltage and frequency regulation unit 102 determines that the voice input rate is less than the second threshold, it sends the eighth voltage-frequency regulation information to the chip, instructing the chip to lower its operating voltage or operating frequency; when it determines that the voice input rate is greater than the second threshold, it sends voltage-frequency regulation information instructing the chip to raise its operating voltage.
In a possible embodiment of the present application, the application scenario information is a keyword obtained by the chip through speech recognition, the voltage-frequency regulation information includes ninth voltage-frequency regulation information, and the voltage and frequency regulation unit is further configured to:
when the keyword belongs to a preset keyword set, send the ninth voltage-frequency regulation information to the chip, the ninth voltage-frequency regulation information instructing the chip to raise its operating voltage or operating frequency.
Further, when the keyword does not belong to the preset keyword set, the voltage and frequency regulation unit 102 sends voltage-frequency regulation information instructing the chip to lower its operating voltage or operating frequency.
For example, the application scenario of the chip is speech recognition, and the preset keyword set includes keywords such as "image beautification", "neural network algorithm", "image processing", and "Alipay". If the application scenario information is "image beautification", the voltage and frequency regulation unit 102 sends the ninth voltage-frequency regulation information to the chip, instructing the chip to raise its operating voltage or operating frequency; if the application scenario information is "take a photo", the voltage and frequency regulation unit 102 sends voltage-frequency regulation information instructing the chip to lower its operating voltage or operating frequency.
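Both the object-tag rule and this keyword rule are set-membership tests. A combined sketch, reusing the illustrative chip interface; the set contents echo the examples in the text, but the data types are assumptions:

```python
# Hedged sketch: raise voltage/frequency when a recognized keyword or
# object tag belongs to a preset set, and lower them otherwise.

PRESET_KEYWORDS = {"image beautification", "neural network algorithm",
                   "image processing", "Alipay"}  # examples from the text

def regulate_for_label(chip, label: str) -> None:
    if label in PRESET_KEYWORDS:
        chip.scale_up()    # ninth (or seventh) regulation information
    else:
        chip.scale_down()  # lighter scenario: save power
```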
In a possible embodiment of the present application, the chip is applied to machine translation, the application scenario information is the text input speed or the number of characters in an image to be translated, and the voltage-frequency regulation information includes tenth voltage-frequency regulation information. The voltage and frequency regulation unit is further configured to:
when the text input speed is less than a third threshold, or the number of characters in the image to be translated is less than a fourth threshold, send the tenth voltage-frequency regulation information to the chip, the tenth voltage-frequency regulation information instructing the chip to lower its operating voltage or operating frequency.
Specifically, the chip is applied to machine translation. The application scenario information collected by the information collection unit 101 is the text input speed or the number of characters in the image to be translated, and this information is transmitted to the voltage and frequency regulation unit 102. When it determines that the text input speed is less than the third threshold or that the number of characters in the image to be translated is less than the fourth threshold, the voltage and frequency regulation unit 102 sends the tenth voltage-frequency regulation information to the chip, instructing the chip to lower its operating voltage; when it determines that the text input speed is greater than the third threshold or that the number of characters in the image to be translated is greater than the fourth threshold, it sends voltage-frequency regulation information instructing the chip to raise its operating voltage.
In a possible embodiment of the present application, the application scenario information is the ambient light intensity, the voltage-frequency regulation information includes eleventh voltage-frequency regulation information, and the voltage and frequency regulation unit is further configured to:
when the ambient light intensity is less than a fifth threshold, send the eleventh voltage-frequency regulation information to the chip, the eleventh voltage-frequency regulation information instructing the chip to lower its operating voltage or operating frequency.
Specifically, the ambient light intensity is acquired by a light sensor connected to the chip. After obtaining the light intensity, the information collection unit 101 transmits it to the voltage and frequency regulation unit 102. When it determines that the light intensity is less than the fifth threshold, the voltage and frequency regulation unit 102 sends the eleventh voltage-frequency regulation information to the chip, instructing the chip to lower its operating voltage; when it determines that the light intensity is greater than the fifth threshold, it sends voltage-frequency regulation information instructing the chip to raise its operating voltage or operating frequency.
In a possible embodiment of the present application, the chip is applied to image beautification, the voltage-frequency regulation information includes twelfth voltage-frequency regulation information and thirteenth voltage-frequency regulation information, and the voltage and frequency regulation unit is further configured to:
when the application scenario information is a face image, send the twelfth voltage-frequency regulation information to the chip, the twelfth voltage-frequency regulation information instructing the chip to raise its operating voltage or operating frequency; and
when the application scenario information is not a face image, send the thirteenth voltage-frequency regulation information to the chip, the thirteenth voltage-frequency regulation information instructing the chip to lower its operating voltage or operating frequency.
In a possible embodiment of the present application, the chip is applied to speech recognition, and the application scenario information is the voice intensity. When the voice intensity is greater than a sixth threshold, the voltage and frequency regulation unit 102 sends voltage-frequency regulation information instructing the chip to lower its operating voltage or operating frequency; when the voice intensity is less than the sixth threshold, the voltage and frequency regulation unit 102 sends voltage-frequency regulation information instructing the chip to raise its operating voltage or operating frequency.
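The scenario rules above (voice input rate, text input speed, ambient light, face image, voice intensity) share one shape: compare a collected signal against a preset condition and choose a scale direction. A table-driven sketch of that dispatch, in which every threshold value and signal name is an illustrative assumption rather than something the patent fixes:

```python
# Hedged sketch: table-driven form of the scenario rules above.
# All thresholds are placeholders; the patent specifies no numbers.

SECOND_THRESHOLD = 1.0  # voice input rate
THIRD_THRESHOLD = 1.0   # text input speed
FIFTH_THRESHOLD = 1.0   # ambient light intensity
SIXTH_THRESHOLD = 1.0   # voice intensity

RULES = {
    # signal name -> predicate that is True when scaling DOWN applies
    "voice_input_rate": lambda v: v < SECOND_THRESHOLD,
    "text_input_speed": lambda v: v < THIRD_THRESHOLD,
    "light_intensity":  lambda v: v < FIFTH_THRESHOLD,
    "voice_intensity":  lambda v: v > SIXTH_THRESHOLD,
    "is_face_image":    lambda v: not v,  # non-face image -> lower
}

def regulate_for_scenario(chip, signal: str, value) -> None:
    if RULES[signal](value):
        chip.scale_down()
    else:
        chip.scale_up()
```

The table form makes the pattern explicit: supporting a further scenario signal, as the embodiments above accumulate them, only adds a row.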
It should be noted that the scenario information may be information about the external scene collected by sensors, such as light intensity or voice intensity. The application scenario information may also be information computed by an artificial intelligence algorithm; for example, in an object recognition task, the chip's real-time computation results are fed back to the information collection unit, and this information includes the number of objects in the scene, face images, object tag keywords, and the like.
Optionally, the artificial intelligence algorithm includes, but is not limited to, a neural network algorithm.
It can be seen that, in the solution of this embodiment of the present invention, the dynamic voltage and frequency scaling device collects, in real time, the working state information of the chip connected to it and of the units inside the chip, or the chip's application scenario information, and adjusts the operating frequency or operating voltage of the chip or of its internal units according to that information, thereby reducing the chip's overall operating power consumption.
Referring to FIG. 3B, FIG. 3B is a schematic diagram of a dynamic voltage and frequency scaling application scenario according to an embodiment of the present application. As shown in FIG. 3B, the convolution operation device includes a dynamic voltage and frequency scaling device 210 and a chip 220 connected to the dynamic voltage and frequency scaling device.
The chip 220 includes a control unit 221, a storage unit 222, and an operation unit 223, and can be used for tasks such as image processing and speech processing.
The dynamic voltage and frequency scaling device 210 collects the working state information of the chip 220 in real time. The working state information of the chip 220 includes the running speed of the chip 220, the running speed of the control unit 221, the running speed of the storage unit 222, and the running speed of the operation unit 223.
In a possible embodiment of the present application, when the chip 220 executes a task, and the dynamic voltage and frequency scaling device 210 determines, from the running speeds of the storage unit 222 and the operation unit 223, that the running time of the storage unit 222 exceeds the running time of the operation unit 223, the device 210 can conclude that the storage unit 222 has become the bottleneck for this task: after completing its current computation, the operation unit 223 must wait for the storage unit 222 to finish its read task and transfer the data it has read before the operation unit 223 can compute on that data. The dynamic voltage and frequency scaling device 210 sends first voltage-frequency regulation information to the operation unit 223, instructing it to lower its operating voltage or operating frequency and thus its running speed, which reduces the overall operating power consumption of the chip 220 without affecting the completion time of the task.
In a possible embodiment of the present application, when the chip 220 executes a task, and the dynamic voltage and frequency scaling device 210 determines, from the running speeds of the storage unit 222 and the operation unit 223, that the running time of the storage unit 222 is less than the running time of the operation unit 223, the device 210 can conclude that the operation unit 223 has become the bottleneck: after the storage unit 222 finishes reading data, the operation unit 223 has not yet completed its current computation, so the storage unit 222 must wait for the operation unit 223 to finish before transferring the data it has read. The dynamic voltage and frequency scaling device 210 sends second voltage-frequency regulation information to the storage unit 222, instructing it to lower its operating voltage or operating frequency and thus its running speed, which reduces the overall operating power consumption of the chip 220 without affecting the completion time of the task.
In a possible embodiment of the present application, the dynamic voltage and frequency scaling device 210 obtains the running speed of the chip 220 in real time. When the device 210 determines that the running speed of the chip 220 is greater than the target running speed, where the target running speed is the running speed that satisfies the user's requirements, it sends third voltage-frequency regulation information to the chip 220, instructing the chip to lower its operating voltage or operating frequency and thereby reduce its operating power consumption.
For example, the chip 220 is used for video processing, and under normal conditions the user requires a video processing frame rate of no less than 30 frames per second. Suppose the chip 220 is actually processing video at 100 frames per second; the dynamic voltage and frequency scaling device then sends voltage-frequency regulation information to the chip 220, instructing it to lower its operating voltage or operating frequency so that the video processing frame rate drops to about 30 frames per second.
In a possible embodiment of the present application, the dynamic voltage and frequency scaling device 210 monitors, in real time, the working state of each unit in the chip 220 (including the control unit 221, the storage unit 222, and the operation unit 223). When the device 210 determines that any of these units is in an idle state, it sends fourth voltage-frequency regulation information to that unit, instructing it to lower its operating voltage or operating frequency and thereby reduce the power consumption of the chip 220. When the unit returns to a working state, the device 210 sends fifth voltage-frequency regulation information to the unit to raise its operating voltage or operating frequency, so that the unit's running speed meets the demands of the work. It can be seen that, in the solution of this embodiment of the application, the dynamic voltage and frequency scaling device 210 collects, in real time, the running speed information of the chip and of its internal units, and lowers the operating frequency or operating voltage of the chip or of its internal units according to that information, thereby reducing the chip's overall operating power consumption.
Referring to FIG. 3C, FIG. 3C is a schematic diagram of another dynamic voltage and frequency scaling application scenario according to an embodiment of the present application. As shown in FIG. 3C, the convolution operation device includes a dynamic voltage and frequency scaling device 317, a register unit 312, an interconnection module 313, an operation unit 314, a control unit 315, and a data access unit 316.
The operation unit 314 includes at least two of an adder, a multiplier, a comparator, and an activation operator.
The interconnection module 313 is configured to control the connection relationships of the calculators in the operation unit 314 so that the at least two kinds of calculators form different computation topologies.
The register unit 312 (which may be a register file, an instruction cache, or a scratchpad memory) is configured to store the operation instruction, the address of the data block in the storage medium, and the computation topology corresponding to the operation instruction.
Optionally, the convolution operation device further includes a storage medium 311.
The storage medium 311 may be an off-chip memory or, in practical applications, an on-chip memory. It is configured to store data blocks, where a data block may be n-dimensional data with n an integer greater than or equal to 1: for example, n = 1 gives one-dimensional data, that is, a vector; n = 2 gives two-dimensional data, that is, a matrix; and n = 3 or more gives multidimensional data.
The control unit 315 is configured to extract from the register unit 312 an operation instruction, the operation field corresponding to the operation instruction, and the first computation topology corresponding to the operation instruction, and to decode the operation instruction into an execution instruction used to control the operation unit 314 to perform the computation; it transmits the operation field to the data access unit 316 and the computation topology to the interconnection module 313.
The data access unit 316 is configured to extract the data block corresponding to the operation field from the storage medium 311 and to transmit the data block to the interconnection module 313.
The interconnection module 313 is configured to receive the data block of the first computation topology.
In a possible embodiment of the present application, the interconnection module 313 also rearranges the data block according to the first computation topology.
The operation unit 314 is configured so that the execution instruction invokes the calculators of the operation unit 314 to perform the computation on the data block, obtaining a computation result; the result is transmitted to the data access unit 316 and stored in the storage medium 311.
In a possible embodiment of the present application, the operation unit 314 is further configured to perform the computation on the rearranged data block according to the first computation topology and the execution instruction, obtaining a computation result, which is transmitted to the data access unit 316 and stored in the storage medium 311.
In a feasible embodiment, the interconnection module 313 is further configured to form the first computation topology according to the connection relationships of the calculators in the operation unit 314.
The dynamic voltage and frequency scaling device 317 is configured to monitor the working state of the entire convolution operation device and to dynamically regulate its voltage and frequency.
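For orientation, one instruction's path through the modules of FIG. 3C can be sketched as below; every class and method name is an illustrative stand-in for the numbered units, not an interface the patent defines:

```python
# Hedged sketch of the dataflow in FIG. 3C: register unit -> control
# unit -> data access unit -> interconnection module -> operation
# unit -> data access unit -> storage medium.

def execute_one(register_unit, control, data_access, interconnect,
                operation_unit, storage):
    instr, operand_field, topology = control.fetch_and_decode(register_unit)
    block = data_access.load(storage, operand_field)  # fetch the data block
    block = interconnect.arrange(block, topology)     # optional rearrangement
    result = operation_unit.run(instr, block)         # perform the computation
    data_access.store(storage, result)                # write the result back
```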
The specific computation method of the convolution operation device is described below using different operation instructions, with the convolution computation instruction as the example here. The convolution computation instruction can be applied in a neural network, so it may also be called a convolutional neural network instruction. The formula that the convolution computation instruction actually needs to execute can be:
s = s(∑ w·x_i + b)
Here, the convolution kernel W (which may include multiple data) is multiplied by the input data x_i and the products are summed; optionally the bias b is then added, and optionally an activation operation s(h) is applied, yielding the final output result S. From this formula the computation topology is obtained as multiplier → adder → (optional) activation operator. The convolution computation instruction may include an instruction set that contains convolutional neural network COMPUTE instructions with different functions, as well as the CONFIG, IO, NOP, JUMP, and MOVE instructions.
In one embodiment, the COMPUTE instructions include:
a convolution operation instruction, according to which the convolution operation device fetches input data of a specified size and a convolution kernel from specified addresses of a memory (preferably a scratchpad memory or a scalar register file) and performs the convolution operation in the convolution operation component;
a convolutional neural network sigmoid instruction, according to which the convolution operation device fetches input data of a specified size and a convolution kernel from specified addresses of a memory (preferably a scratchpad memory or a scalar register file), performs the convolution operation in the convolution operation component, and then applies a sigmoid activation to the output result;
a convolutional neural network TanH instruction, according to which the convolution operation device fetches input data of a specified size and a convolution kernel from specified addresses of a memory (preferably a scratchpad memory), performs the convolution operation in the convolution operation component, and then applies a TanH activation to the output result;
a convolutional neural network ReLU instruction, according to which the convolution operation device fetches input data of a specified size and a convolution kernel from specified addresses of a memory (preferably a scratchpad memory), performs the convolution operation in the convolution operation component, and then applies a ReLU activation to the output result; and
a convolutional neural network group instruction, according to which the convolution operation device fetches input data of a specified size and a convolution kernel from specified addresses of a memory (preferably a scratchpad memory), divides them into groups, performs the convolution operation in the convolution operation component, and then applies an activation to the output result.
The CONFIG instruction configures, before each layer of artificial neural network computation begins, the various constants needed by the current layer's computation.
The IO instruction reads in, from external storage space, the input data needed by the computation, and stores the data back to external space after the computation completes.
The NOP instruction clears the control signals currently held in all control-signal buffer queues inside the convolution operation device, ensuring that all instructions before the NOP instruction have fully completed; the NOP instruction itself does not contain any operation.
The JUMP instruction controls the jump of the next instruction address to be read from the instruction storage unit, and is used to implement jumps in the control flow.
The MOVE instruction moves data at one address in the internal address space of the convolution operation device to another address in the internal address space of the convolution operation device; this process is independent of the operation unit and does not occupy the operation unit's resources during execution.
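For illustration only, the instruction set above can be modeled as a small decode table; the opcode names follow the text, while the enum encoding and the activation mapping are invented for the sketch and are not the patent's instruction format:

```python
# Hedged sketch: the instruction set described above as a decode table.

from enum import Enum, auto

class Opcode(Enum):
    COMPUTE_CONV = auto()     # convolution, output stored as-is
    COMPUTE_SIGMOID = auto()  # convolution, then sigmoid activation
    COMPUTE_TANH = auto()     # convolution, then TanH activation
    COMPUTE_RELU = auto()     # convolution, then ReLU activation
    COMPUTE_GROUP = auto()    # grouped convolution, then activation
    CONFIG = auto()           # set per-layer constants before a layer
    IO = auto()               # load from / store to external space
    NOP = auto()              # drain the control-signal buffer queues
    JUMP = auto()             # jump to the next instruction address
    MOVE = auto()             # copy within the internal address space

# Which activation each COMPUTE variant applies after the convolution.
ACTIVATION_OF = {
    Opcode.COMPUTE_CONV: None,
    Opcode.COMPUTE_SIGMOID: "sigmoid",
    Opcode.COMPUTE_TANH: "tanh",
    Opcode.COMPUTE_RELU: "relu",
}
```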
Specifically, the method by which the convolution operation device executes the convolution computation instruction can be as follows.
The control unit 315 extracts, from the register unit 312, the convolution computation instruction, the operation field corresponding to the instruction, and the first computation topology corresponding to the instruction (multiplier → adder → adder → activation operator). The control unit transmits the operation field to the data access unit and the first computation topology to the interconnection module.
The data access unit 316 extracts, from the storage medium 311, the convolution kernel w and the bias b corresponding to the operation field (when b is 0, the bias b does not need to be extracted), and transmits the convolution kernel w and the bias b to the operation unit 314.
The multiplier of the operation unit 314 multiplies the convolution kernel w by the input data Xi to obtain a first result; the first result is input to the adder, which performs an addition to obtain a second result; the second result and the bias b are added to obtain a third result; the third result is input to the activation operator, which performs the activation operation to obtain the output result s; and the output result s is transmitted to the data access unit and stored in the storage medium. After any of these steps, the intermediate result can instead be transmitted directly to the data access unit and stored in the storage medium, omitting the subsequent steps. The step of adding the second result and the bias b to obtain the third result is optional, that is, when b is 0 this step is not needed. In addition, the order of the addition and multiplication operations can be swapped.
Optionally, the first result may include the results of multiple multiplication operations.
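The multiply → add → (optional) bias → (optional) activation flow above maps directly onto a few lines of code. A sketch that mirrors s = s(∑ w·x_i + b); the list-based data handling is an illustrative simplification of the hardware dataflow:

```python
# Hedged sketch of the execution flow: multiply, accumulate, optionally
# add the bias b, optionally apply the activation operator.

import math

def conv_step(w, x, b=0.0, activation=None):
    first = [wi * xi for wi, xi in zip(w, x)]   # first result: products
    second = sum(first)                         # second result: accumulation
    third = second + b if b != 0.0 else second  # third result: optional bias
    return activation(third) if activation else third

def sigmoid(h):
    return 1.0 / (1.0 + math.exp(-h))

out = conv_step([0.2, -0.5, 0.1], [1.0, 2.0, 3.0], b=0.3,
                activation=sigmoid)
```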
In a possible embodiment of the present application, an embodiment of the present application provides a neural network processor that includes the convolution operation device described above.
The neural network processor is configured to perform artificial neural network operations and to implement artificial intelligence applications such as speech recognition, image recognition, and translation.
In this convolution computation task, the dynamic voltage and frequency scaling device 317 works as follows.
Case 1: While the neural network processor performs the convolution operation, the dynamic voltage and frequency scaling device 317 obtains, in real time, the running speeds of the processor's data access unit 316 and operation unit 314. When the device 317 determines, from these running speeds, that the running time of the data access unit 316 exceeds the running time of the operation unit 314, it can conclude that the data access unit 316 has become the bottleneck of the convolution operation: after completing its current convolution operation, the operation unit 314 must wait for the data access unit 316 to finish its read task and transfer the data it has read before the operation unit 314 can perform the convolution operation on that data. The device 317 sends first voltage-frequency regulation information to the operation unit 314, instructing it to lower its operating voltage or operating frequency so that its running speed matches that of the data access unit 316; this reduces the power consumption of the operation unit 314, avoids leaving it idle, and ultimately lowers the overall operating power consumption of the neural network processor without affecting the completion time of the task.
Case 2: While the neural network processor performs the convolution operation, the dynamic voltage and frequency scaling device 317 obtains, in real time, the running speeds of the data access unit 316 and the operation unit 314. When the device 317 determines, from these running speeds, that the running time of the operation unit 314 exceeds the running time of the data access unit 316, it can conclude that the operation unit 314 has become the bottleneck: after completing its current data read, the data access unit 316 must wait for the operation unit 314 to finish its current convolution operation before transferring the data it has read to the operation unit 314. The device 317 sends second voltage-frequency regulation information to the data access unit 316, instructing it to lower its operating voltage or operating frequency so that its running speed matches that of the operation unit 314; this reduces the power consumption of the data access unit 316, avoids leaving it idle, and ultimately lowers the overall operating power consumption of the neural network processor without affecting the completion time of the task.
When the neural network processor performs artificial neural network operations for artificial intelligence applications, the dynamic voltage and frequency scaling device 317 collects, in real time, the working parameters of the processor's artificial intelligence application and adjusts the processor's operating voltage or operating frequency according to those parameters.
Specifically, the artificial intelligence applications may be video image processing, object recognition, machine translation, speech recognition, image beautification, and so on.
Case 3: When the neural network processor performs video image processing, the dynamic voltage and frequency scaling device 317 collects, in real time, the frame rate at which the processor processes video images. When this frame rate exceeds the target frame rate, where the target frame rate is the video processing frame rate the user normally requires, the device 317 sends third voltage-frequency regulation information to the processor, instructing it to lower its operating voltage or operating frequency; this reduces the processor's power consumption while still satisfying the user's normal video processing needs.
Case 4: When the neural network processor performs speech recognition, the dynamic voltage and frequency scaling device 317 collects the processor's speech recognition speed in real time. When this speed exceeds the speech recognition speed the user actually needs, the device 317 sends fourth voltage-frequency regulation information to the processor, instructing it to lower its operating voltage or operating frequency; this reduces the processor's power consumption while still satisfying the user's normal speech recognition needs.
Case 5: The dynamic voltage and frequency scaling device 317 monitors, in real time, the working state of each unit or module in the neural network processor (including the storage medium 311, the register unit 312, the interconnection module 313, the operation unit 314, the controller unit 315, and the data access unit 316). When any of these units or modules is in an idle state, the device 317 sends fifth voltage-frequency regulation information to it, lowering its operating voltage or operating frequency and thereby reducing its power consumption; when the unit or module returns to a working state, the device 317 sends sixth voltage-frequency regulation information to it, raising its operating voltage or operating frequency so that its running speed meets the demands of the work.
Referring to FIG. 3D, FIG. 3D is a schematic diagram of yet another dynamic voltage and frequency scaling application scenario according to an embodiment of the present application. As shown in FIG. 3D, the convolution operation device includes a dynamic voltage and frequency scaling device 7, an instruction storage unit 1, a controller unit 2, a data access unit 3, an interconnection module 4, a master operation module 5, and multiple slave operation modules 6. The instruction storage unit 1, the controller unit 2, the data access unit 3, the interconnection module 4, the master operation module 5, and the slave operation modules 6 can all be implemented as hardware circuits (including but not limited to FPGAs, CGRAs, application-specific integrated circuits (ASICs), analog circuits, and memristors).
The instruction storage unit 1 reads in instructions through the data access unit 3 and stores the instructions it has read.
The controller unit 2 reads instructions from the instruction storage unit 1, decodes them into control signals that govern the behavior of the other modules, and sends those signals to the other modules, such as the data access unit 3, the master operation module 5, and the slave operation modules 6.
The data access unit 3 can access the external address space, reading and writing data directly to each storage unit inside the convolution operation device to complete the loading and storing of data.
The interconnection module 4 is used to connect the master operation module and the slave operation modules, and can be implemented as different interconnection topologies (such as a tree structure, a ring structure, a grid structure, a hierarchical interconnection, or a bus structure).
The dynamic voltage and frequency scaling device 7 is configured to obtain, in real time, the working state information of the data access unit 3 and the master operation module 5, and to adjust the operating voltage or operating frequency of the data access unit 3 and the master operation module 5 according to that working state information.
In a possible embodiment of the present application, an embodiment of the present invention provides a neural network processor that includes the convolution operation device described above.
The neural network processor is configured to perform artificial neural network operations and to implement artificial intelligence applications such as speech recognition, image recognition, and translation.
In this convolution computation task, the dynamic voltage and frequency scaling device 7 works as follows.
Case 1: While the convolutional neural network processor performs the convolution operation, the dynamic voltage and frequency scaling device 7 obtains, in real time, the running speeds of the processor's data access unit 3 and master operation module 5. When the device 7 determines, from these running speeds, that the running time of the data access unit 3 exceeds the running time of the master operation module 5, it can conclude that the data access unit 3 has become the bottleneck of the convolution operation: after completing its current convolution operation, the master operation module 5 must wait for the data access unit 3 to finish its read task and transfer the data it has read before the master operation module 5 can perform the convolution operation on that data. The device 7 sends first voltage-frequency regulation information to the master operation module 5, instructing it to lower its operating voltage or operating frequency so that its running speed matches that of the data access unit 3; this reduces the power consumption of the master operation module 5, avoids leaving it idle, and ultimately lowers the overall operating power consumption of the neural network processor without affecting the completion time of the task.
Case 2: While the neural network processor performs the convolution operation, the dynamic voltage and frequency scaling device 7 obtains, in real time, the running speeds of the data access unit 3 and the master operation module 5. When the device 7 determines, from these running speeds, that the running time of the master operation module 5 exceeds the running time of the data access unit 3, it can conclude that the master operation module 5 has become the bottleneck: after completing its current data read, the data access unit 3 must wait for the master operation module 5 to finish its current convolution operation before transferring the data it has read to the master operation module 5. The device 7 sends second voltage-frequency regulation information to the data access unit 3, instructing it to lower its operating voltage or operating frequency so that its running speed matches that of the master operation module 5; this reduces the power consumption of the data access unit 3, avoids leaving it idle, and ultimately lowers the overall operating power consumption of the neural network processor without affecting the completion time of the task.
When the neural network processor executes artificial neural network operations for an artificial intelligence application, the DVFS device 7 collects in real time the working parameters of the processor for that application and adjusts the working voltage or working frequency of the processor according to those parameters.
Specifically, the artificial intelligence application may be video image processing, object recognition, machine translation, speech recognition, image beautification, and the like.
Case 3: When the neural network processor performs video image processing, the DVFS device 7 collects in real time the frame rate at which the processor processes video images. When this frame rate exceeds the target frame rate, i.e. the video processing frame rate the user actually requires, the DVFS device 7 sends third voltage-frequency regulation information to the neural network processor, instructing it to lower its working voltage or working frequency, thereby reducing the power consumption of the processor while still satisfying the user's normal video-processing needs.
Case 4: When the neural network processor performs speech recognition, the DVFS device 7 collects in real time the speech recognition speed of the processor. When this speed exceeds the user's actual speech recognition speed, the DVFS device 7 sends fourth voltage-frequency regulation information to the neural network processor, instructing it to lower its working voltage or working frequency, thereby reducing the power consumption of the processor while still satisfying the user's normal speech-recognition needs.
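Cases 3 and 4 share the same rule: whenever the measured processing rate exceeds the rate the user actually needs, the chip can be slowed down. Below is a minimal sketch of that rule; the function name, the numeric rates, and the string return values are illustrative assumptions.

```python
# Sketch of the rate-based rules of Cases 3 and 4: when the measured rate
# exceeds the rate the user actually needs, request a lower voltage/frequency.
# All numbers are illustrative assumptions.

def regulate_by_rate(measured_rate: float, required_rate: float) -> str:
    """Return 'lower' when the processor runs faster than the user needs."""
    return "lower" if measured_rate > required_rate else "keep"

# Case 3: video processed at 75 fps when the user only needs 30 fps.
print(regulate_by_rate(measured_rate=75.0, required_rate=30.0))  # -> lower
# Case 4: speech recognition exactly keeping pace with the speaker.
print(regulate_by_rate(measured_rate=1.0, required_rate=1.0))    # -> keep
```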
Case 5: The DVFS device 7 monitors and acquires in real time the working state information of each unit or module of the neural network processor (including the instruction storage unit 1, the controller unit 2, the data access unit 3, the interconnection module 4, the main operation module 5, and the slave operation modules 6). When any of these units or modules is in an idle state, the DVFS device 7 sends fifth voltage-frequency regulation information to that unit or module to lower its working voltage or working frequency and thereby reduce its power consumption. When the unit or module returns to the working state, the DVFS device 7 sends sixth voltage-frequency regulation information to it to raise its working voltage or working frequency, so that its running speed again meets the demands of the work.
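The per-unit monitoring of Case 5 can be sketched as a comparison of each unit's current state against its previous state; the dictionary-based bookkeeping and unit names below are illustrative assumptions.

```python
# Sketch of Case 5: a unit that goes idle receives the fifth regulation
# information ("lower"); a unit that becomes busy again receives the sixth
# ("raise"). The dictionary-based state store is an illustrative assumption.

def monitor(current: dict, previous: dict) -> None:
    for name, state in current.items():
        if state == "idle" and previous.get(name) == "busy":
            print(f"fifth regulation info: lower voltage/frequency of {name}")
        elif state == "busy" and previous.get(name) == "idle":
            print(f"sixth regulation info: raise voltage/frequency of {name}")

previous = {"data_access_unit_3": "busy", "main_module_5": "busy"}
current = {"data_access_unit_3": "idle", "main_module_5": "busy"}
monitor(current, previous)  # lowers data_access_unit_3 only
```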
Referring to FIG. 3E, FIG. 3E schematically shows one embodiment of the interconnection module 4: an H-tree module. The interconnection module 4 forms the data path between the main operation module 5 and the plurality of slave operation modules 6 and is a binary-tree path composed of multiple nodes: each node forwards upstream data identically to its two downstream nodes, merges the data returned by its two downstream nodes, and returns the merged data upstream. For example, at the start of the computation phase of the convolutional neural network, the neuron data in the main operation module 5 are sent through the interconnection module 4 to each slave operation module 6; after the computation of the slave operation modules 6 completes, the value of the neuron output by each slave operation module is assembled, stage by stage in the interconnection module 4, into one complete vector of neurons. As an example, suppose the device contains N slave operation modules in total. The input data xi is sent to the N slave operation modules, each of which convolves xi with its own convolution kernel to obtain one scalar; the interconnection module 4 then merges the N scalars into an intermediate vector of N elements. If the convolution window traverses A*B positions of input data xi in total (A along the X axis and B along the Y axis, X and Y being coordinate axes of a three-dimensional orthogonal coordinate system), the above convolution is performed for each of the A*B positions, and all resulting vectors are combined in the main operation module into a three-dimensional intermediate result of size A*B*N.
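The stage-by-stage assembly performed by the H-tree can be sketched as a pairwise merge: N per-slave scalars become one N-element vector after log2(N) merge levels. The list-based model below is an illustrative assumption (and assumes N is a power of two).

```python
# Sketch of the H-tree merge: each node concatenates the data returned by its
# two children, so N per-slave scalars become one N-element vector after
# log2(N) levels. Lists stand in for the hardware data path; N is assumed
# to be a power of two.

def h_tree_merge(scalars):
    assert len(scalars) & (len(scalars) - 1) == 0, "N must be a power of two"
    level = [[s] for s in scalars]              # each leaf holds one scalar
    while len(level) > 1:
        level = [level[i] + level[i + 1]        # each node joins two children
                 for i in range(0, len(level), 2)]
    return level[0]

# With N = 4 slave modules producing one scalar each:
print(h_tree_merge([0.5, -1.2, 3.3, 0.0]))  # -> [0.5, -1.2, 3.3, 0.0]
```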
Referring to FIG. 3F, FIG. 3F shows an example block diagram of the structure of the main operation module 5 in the apparatus for performing the convolutional neural network forward operation according to an embodiment of the present application. As shown in FIG. 3F, the main operation module 5 includes a first operation unit 51, a first data dependency determination unit 52, and a first storage unit 53.
The first operation unit 51 includes a vector addition unit 511 and an activation unit 512. The first operation unit 51 receives control signals from the controller unit 2 and carries out the various computational functions of the main operation module 5. The vector addition unit 511 implements the bias-add operation of the convolutional neural network forward computation: it adds the bias data element-wise to the intermediate result to obtain a biased result, and the activation unit 512 applies the activation function to the biased result. The bias data may be read in from the external address space or stored locally.
The first data dependency determination unit 52 is the port through which the first operation unit 51 reads from and writes to the first storage unit 53, and it guarantees read-write consistency of the data in the first storage unit 53. It is also responsible for sending the data read from the first storage unit 53 to the slave operation modules through the interconnection module 4, while the output data of the slave operation modules 6 are sent directly to the first operation unit 51 through the interconnection module 4. The instructions output by the controller unit 2 are sent to the first operation unit 51 and the first data dependency determination unit 52 to control their behavior.
The first storage unit 53 caches the input data and output data used by the main operation module 5 during computation.
Referring to FIG. 3G, FIG. 3G shows an example block diagram of the structure of the slave operation module 6 in the apparatus for performing the convolutional neural network forward operation according to an embodiment of the present application. As shown in FIG. 3G, each slave operation module 6 includes a second operation unit 61, a second data dependency determination unit 62, a second storage unit 63, and a third storage unit 64.
The second operation unit 61 receives the control signals issued by the controller unit 2 and performs the convolution operation. The second operation unit includes a vector multiplication unit 611 and an accumulation unit 612, which are responsible for the vector multiplication and the accumulation in the convolution operation, respectively.
The second data dependency determination unit 62 is responsible for the read and write operations on the second storage unit 63 during computation. Before performing a read or write, it first ensures that no read-write consistency conflict exists among the data used by the instructions. For example, all control signals sent to the data dependency unit 62 are stored in an instruction queue inside the unit; if the read range of a read instruction in this queue conflicts with the write range of a write instruction earlier in the queue, the read instruction may execute only after the write instruction on which it depends has been executed.
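The queue-based conflict test described above can be sketched as an address-range overlap check: a read may issue only if its range does not intersect the range of any earlier, still-pending write. The (start, end) encoding of ranges below is an illustrative assumption.

```python
# Sketch of the read-write consistency check: a read may issue only if its
# address range does not overlap the range of any earlier write still waiting
# in the queue. The (start, end) tuple encoding is an illustrative assumption.

def ranges_overlap(a, b):
    return a[0] < b[1] and b[0] < a[1]

def can_issue_read(read_range, pending_writes):
    """The read waits until every conflicting earlier write has retired."""
    return all(not ranges_overlap(read_range, w) for w in pending_writes)

pending_writes = [(0, 64), (128, 192)]             # earlier writes in the queue
print(can_issue_read((64, 128), pending_writes))   # True: no overlap
print(can_issue_read((100, 160), pending_writes))  # False: hits (128, 192)
```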
The second storage unit 63 caches the input data of the slave operation module 6 and the output scalar it computes.
The third storage unit 64 caches the convolution kernel data needed by the slave operation module 6 during computation.
It can be seen that in the solution of this embodiment of the present invention, the DVFS device collects in real time the running speeds of the neural network processor and of its internal units and modules, and when it determines from these running speeds that the working frequency or working voltage of the neural network processor or of one of its internal units should be lowered, it lowers it, thereby reducing the overall operating power consumption of the chip while still meeting the user's needs in actual operation.
Referring to FIG. 3H, FIG. 3H is a schematic flowchart of a dynamic voltage and frequency scaling method according to an embodiment of the present application. As shown in FIG. 3H, the method includes:
S801: The DVFS device collects in real time the working state information or application scenario information of the chip connected to it, the application scenario information being information obtained by the chip through neural network operations or collected by sensors connected to the chip.
S802: The DVFS device sends voltage-frequency regulation information to the chip according to the chip's working state information or application scenario information, the voltage-frequency regulation information instructing the chip to adjust its working voltage or working frequency.
The working state information of the chip includes the running speed of the chip, and the voltage-frequency regulation information includes first voltage-frequency regulation information. Sending voltage-frequency regulation information to the chip according to its working state information or application scenario information includes:
when the running speed of the chip is greater than a target speed, sending the first voltage-frequency regulation information to the chip, the first voltage-frequency regulation information instructing the chip to lower its working frequency or working voltage, where the target speed is the running speed of the chip that suffices to meet the user's needs.
Further, the chip includes at least a first unit and a second unit, the output data of the first unit being the input data of the second unit. The working state information of the chip includes the running speed of the first unit and the running speed of the second unit, and the voltage-frequency regulation information includes second voltage-frequency regulation information. Sending voltage-frequency regulation information to the chip according to its working state information or application scenario information further includes:
when it is determined from the running speeds of the first unit and the second unit that the running time of the first unit exceeds the running time of the second unit, sending the second voltage-frequency regulation information to the second unit, the second voltage-frequency regulation information instructing the second unit to lower its working frequency or working voltage.
Further, the voltage-frequency regulation information includes third voltage-frequency regulation information, and sending voltage-frequency regulation information to the chip according to its working state information or application scenario information further includes:
when it is determined from the running speeds of the first unit and the second unit that the running time of the second unit exceeds the running time of the first unit, sending the third voltage-frequency regulation information to the first unit, the third voltage-frequency regulation information instructing the first unit to lower its working frequency or working voltage.
Optionally, the chip includes at least N units, and the working state information of the chip includes the working state information of at least S of the N units, where N is an integer greater than 1 and S is an integer less than or equal to N. The voltage-frequency regulation information includes fourth voltage-frequency regulation information, and sending voltage-frequency regulation information to the chip according to its working state information further includes:
when it is determined from the working state information of a unit A that the unit A is in an idle state, sending the fourth voltage-frequency regulation information to the unit A, the fourth voltage-frequency regulation information instructing the unit A to lower its working frequency or working voltage,
where the unit A is any one of the at least S units.
Optionally, the voltage-frequency regulation information includes fifth voltage-frequency regulation information, and sending voltage-frequency regulation information to the chip according to its working state information or application scenario information further includes:
when it is determined from the working state information of the unit A that the unit A has returned to the working state, sending the fifth voltage-frequency regulation information to the unit A, the fifth voltage-frequency regulation information instructing the unit A to raise its working voltage or working frequency.
Optionally, the application scenario of the chip is image recognition, the application scenario information is the number of objects in the image to be recognized, and the voltage-frequency regulation information includes sixth voltage-frequency regulation information; the voltage-frequency regulation unit is further configured to:
when it determines that the number of objects in the image to be recognized is less than a first threshold, send the sixth voltage-frequency regulation information to the chip, the sixth voltage-frequency regulation information instructing the chip to lower its working voltage or working frequency.
Optionally, the application scenario information is object tag information, and the voltage-frequency regulation information includes seventh voltage-frequency regulation information; the voltage-frequency regulation unit is further configured to:
when it determines that the object tag information belongs to a preset set of object tags, send the seventh voltage-frequency regulation information to the chip, the seventh voltage-frequency regulation information instructing the chip to raise its working voltage or working frequency.
Optionally, the chip is applied to speech recognition, the application scenario information is the voice input rate, and the voltage-frequency regulation information includes eighth voltage-frequency regulation information; the voltage-frequency regulation unit is further configured to:
when the voice input rate is less than a second threshold, send the eighth voltage-frequency regulation information to the chip, the eighth voltage-frequency regulation information instructing the chip to lower its working voltage or working frequency.
Optionally, the application scenario information is a keyword obtained by the chip through speech recognition, and the voltage-frequency regulation information includes ninth voltage-frequency regulation information; the voltage-frequency regulation unit is further configured to:
when the keyword belongs to a preset keyword set, send the ninth voltage-frequency regulation information to the chip, the ninth voltage-frequency regulation information instructing the chip to raise its working voltage or working frequency.
Optionally, the chip is applied to machine translation, the application scenario information is the speed of text input or the number of characters in the image to be translated, and the voltage-frequency regulation information includes tenth voltage-frequency regulation information; the voltage-frequency regulation unit is further configured to:
when the text input speed is less than a third threshold or the number of characters in the image to be translated is less than a fourth threshold, send the tenth voltage-frequency regulation information to the chip, the tenth voltage-frequency regulation information instructing the chip to lower its working voltage or working frequency.
Optionally, the application scenario information is the ambient light intensity, and the voltage-frequency regulation information includes eleventh voltage-frequency regulation information; the voltage-frequency regulation unit is further configured to:
when the ambient light intensity is less than a fifth threshold, send the eleventh voltage-frequency regulation information to the chip, the eleventh voltage-frequency regulation information instructing the chip to lower its working voltage or working frequency.
Optionally, the chip is applied to image beautification, and the voltage-frequency regulation information includes twelfth voltage-frequency regulation information and thirteenth voltage-frequency regulation information; the voltage-frequency regulation unit is further configured to:
when the application scenario information is a face image, send the twelfth voltage-frequency regulation information to the chip, the twelfth voltage-frequency regulation information instructing the chip to lower its working voltage; and
when the application scenario information is not a face image, send the thirteenth voltage-frequency regulation information to the chip, the thirteenth voltage-frequency regulation information instructing the chip to lower its working voltage or working frequency.
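The optional scenario-driven branches above (the sixth through thirteenth voltage-frequency regulation information) amount to a policy table mapping scenario information to a raise/lower decision. The sketch below illustrates a few of these branches; all threshold values, key names, and the preset tag set are illustrative assumptions, not values given in this disclosure.

```python
# Sketch of the scenario-driven branches as one policy table. All thresholds,
# key names, and the preset tag set are illustrative assumptions.

FIRST_THRESHOLD = 3       # objects in the image to be recognized
SECOND_THRESHOLD = 2.0    # voice input rate, e.g. words per second
FIFTH_THRESHOLD = 50.0    # ambient light intensity, e.g. lux

PRESET_TAGS = {"face", "licence_plate"}

def decide(scene: dict) -> str:
    if scene.get("objects_in_image", float("inf")) < FIRST_THRESHOLD:
        return "lower"    # sixth voltage-frequency regulation information
    if scene.get("object_tag") in PRESET_TAGS:
        return "raise"    # seventh voltage-frequency regulation information
    if scene.get("voice_input_rate", float("inf")) < SECOND_THRESHOLD:
        return "lower"    # eighth voltage-frequency regulation information
    if scene.get("ambient_light", float("inf")) < FIFTH_THRESHOLD:
        return "lower"    # eleventh voltage-frequency regulation information
    return "keep"

print(decide({"objects_in_image": 1}))   # -> lower
print(decide({"object_tag": "face"}))    # -> raise
```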
It should be noted that, for the specific implementation of the above method embodiment, reference may be made to the related description of the embodiment shown in FIG. 3A, which is not repeated here.
Referring to FIG. 4A, FIG. 4A is a schematic structural diagram of a convolution operation device according to an embodiment of the present application. As shown in FIG. 4A, the convolution operation device includes a DVFS device 7, an instruction storage unit 1, a controller unit 2, a data access unit 3, an interconnection module 4, a main operation module 5, and N slave operation modules 6.
The instruction storage unit 1, the controller unit 2, the data access unit 3, the interconnection module 4, the main operation module 5, and the N slave operation modules 6 may all be implemented as hardware circuits (including, but not limited to, FPGAs, CGRAs, application-specific integrated circuits (ASICs), analog circuits, and memristors).
The instruction storage unit 1 stores the instructions read in by the data access unit 3.
The controller unit 2 reads instructions from the instruction storage unit 1, translates each instruction into control signals that govern the behavior of the other modules, and sends them to those modules, such as the data access unit 3, the main operation module 5, and the N slave operation modules 6.
The data access unit 3 performs data or instruction read and write operations between the external address space and the convolution operation device.
Specifically, the data access unit 3 accesses the external address space and reads data from, and writes data to, each storage unit inside the device directly, completing the loading and storing of data.
The N slave operation modules 6 implement the convolution of the input data with the convolution kernels in the convolutional neural network algorithm.
Specifically, the N slave operation modules 6 compute their respective output scalars in parallel, using the same input data and their respective convolution kernels.
The interconnection module 4 connects the main operation module 5 and the N slave operation modules 6 and can be implemented with different interconnection topologies (such as a tree structure, a ring structure, a grid structure, a hierarchical interconnection, or a bus structure). The interconnection module 4 implements the data transmission between the main operation module 5 and the N slave operation modules 6.
In other words, the interconnection module 4 forms the data path for continuous or discretized data between the main operation module 5 and the N slave operation modules 6, and may be any one of a tree structure, a ring structure, a grid structure, a hierarchical interconnection, or a bus structure.
The main operation module 5 splices the intermediate vectors of all the input data into an intermediate result and performs subsequent operations on that intermediate result.
The main operation module 5 is further configured to add the intermediate result to the bias data and then perform an activation operation. The activation function active used by the main operation module is any one of the nonlinear functions sigmoid, tanh, relu, and softmax.
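The post-processing performed by the main operation module 5, adding the bias and then applying one of the named nonlinear functions, can be sketched as follows; NumPy is used purely for illustration.

```python
# Sketch of the main module's post-processing: add the bias to the spliced
# intermediate result, then apply one of the listed activation functions.
# NumPy is used purely for illustration.

import numpy as np

def activate(x: np.ndarray, kind: str = "relu") -> np.ndarray:
    if kind == "sigmoid":
        return 1.0 / (1.0 + np.exp(-x))
    if kind == "tanh":
        return np.tanh(x)
    if kind == "relu":
        return np.maximum(x, 0.0)
    if kind == "softmax":
        e = np.exp(x - np.max(x))
        return e / e.sum()
    raise ValueError(f"unknown activation: {kind}")

intermediate = np.array([0.2, -1.0, 3.0])
bias = np.array([0.1, 0.1, 0.1])
print(activate(intermediate + bias, kind="relu"))  # bias add, then activation
```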
The main operation module 5 includes:
the first storage unit 53, which caches the input data and output data used by the main operation module 5 during computation;
the first operation unit 51, which carries out the various computational functions of the main operation module 5; and
the first data dependency determination unit 52, which is the port through which the first operation unit 51 reads from and writes to the first storage unit 53, guarantees read-write consistency of the data in the first storage unit 53, reads the input neuron vector from the first storage unit 53 and sends it through the interconnection module 4 to the N slave operation modules 6, and sends the intermediate result vector from the interconnection module 4 to the first operation unit 51.
Each of the N slave operation modules 6 includes:
the second operation unit 61, which receives the control signals issued by the controller unit 2 and performs arithmetic and logic operations;
the second data dependency determination unit 62, which handles the read and write operations on the second storage unit 63 and the third storage unit 64 during computation, guaranteeing their read-write consistency;
the second storage unit 63, which caches the input data and the output scalar computed by the slave operation module; and
the third storage unit 64, which caches the convolution kernel needed by the slave operation module during computation.
Further, the first data dependency determination unit 52 and the second data dependency determination unit 62 guarantee read-write consistency as follows:
they determine whether a dependency exists between a control signal that has not yet been executed and the data of a control signal currently being executed; if not, the control signal is allowed to issue immediately; otherwise, the control signal may issue only after all the control signals on which it depends have completed execution.
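This issue rule can be sketched as a gate that holds a control signal back while any in-flight signal still touches data it depends on. Modelling a signal's data as a set of addresses is an illustrative assumption.

```python
# Sketch of the issue gate in the dependency determination units: a control
# signal launches only when no in-flight signal still touches data it depends
# on. Modelling each signal's data as a set of addresses is an assumption.

def may_issue(signal_deps: set, in_flight: list) -> bool:
    """True if the pending signal shares no data with executing signals."""
    return all(signal_deps.isdisjoint(s) for s in in_flight)

in_flight = [{0x100, 0x104}, {0x200}]
print(may_issue({0x300}, in_flight))  # True: independent, issues immediately
print(may_issue({0x104}, in_flight))  # False: waits for the first signal
```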
Optionally, the data access unit 3 reads in at least one of the input data, the bias data, and the convolution kernel from the external address space.
Before the forward operation of the fully connected layer of the neural network begins, the main operation module 5 delivers the input data to each of the N slave operation modules 6 through the interconnection module 4; after the computation of the N slave operation modules 6 ends, the interconnection module 4 splices the output scalars of the N slave operation modules 6, stage by stage, into an intermediate vector and delivers it back to the main operation module 5.
The specific computation method of the above convolution operation device is described below through different operation instructions, taking the convolution computation instruction as an example. The convolution computation instruction can be applied in a neural network, so it may also be called a convolutional neural network instruction. For the convolution computation instruction, the formula it actually needs to execute is:
s = s(∑w·x_i + b)
That is, the convolution kernel W (which may comprise multiple data) is multiplied by the input data x_i and summed; the bias b may then optionally be added, and an activation operation s(h) may optionally be applied, yielding the final output result s. From this formula, the computation topology is multiplier - adder - (optional) activation operator. The convolution computation instruction may belong to an instruction set that contains convolutional neural network COMPUTE instructions of different functions as well as CONFIG, IO, NOP, JUMP, and MOVE instructions.
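The formula maps directly onto the stated topology: a multiply-accumulate over the convolution window, an optional bias add, and an optional activation. A minimal sketch, with sigmoid standing in for the optional activation:

```python
# Sketch of s = s(sum(w * x_i) + b): a multiply-accumulate over the window,
# an optional bias add, and an optional activation (sigmoid here, one of the
# activations the instruction set names). Python is used for illustration.

import math

def conv_point(window, kernel, bias=0.0, activation=None):
    acc = sum(w * x for w, x in zip(kernel, window))  # multiplier + adder
    acc += bias                                       # optional bias add
    if activation is not None:                        # optional activation
        acc = activation(acc)
    return acc

sigmoid = lambda h: 1.0 / (1.0 + math.exp(-h))
print(conv_point([1.0, 2.0, 3.0], [0.5, -0.25, 0.1], bias=0.2,
                 activation=sigmoid))
```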
In one embodiment, the COMPUTE instructions include:
a convolution operation instruction, according to which the convolution operation device fetches input data of a specified size and a convolution kernel from specified addresses of a memory (preferably a scratchpad memory or a scalar register file) and performs the convolution in the convolution operation component;
a convolutional neural network sigmoid instruction, according to which the convolution operation device fetches input data of a specified size and a convolution kernel from specified addresses of a memory (preferably a scratchpad memory or a scalar register file), performs the convolution in the convolution operation component, and then applies sigmoid activation to the output result;
a convolutional neural network TanH instruction, according to which the convolution operation device fetches input data of a specified size and a convolution kernel from specified addresses of a memory (preferably a scratchpad memory), performs the convolution in the convolution operation component, and then applies TanH activation to the output result;
a convolutional neural network ReLU instruction, according to which the convolution operation device fetches input data of a specified size and a convolution kernel from specified addresses of a memory (preferably a scratchpad memory), performs the convolution in the convolution operation component, and then applies ReLU activation to the output result; and
a convolutional neural network group instruction, according to which the convolution operation device fetches input data of a specified size and a convolution kernel from specified addresses of a memory (preferably a scratchpad memory), divides them into groups, performs the convolution in the convolution operation component, and then, preferably, activates the output result.
The CONFIG instruction configures the various constants needed by the current layer's computation before the computation of each layer of the artificial neural network begins.
The IO instruction reads in from the external storage space the input data needed for computation and stores data back to the external space after the computation completes.
The NOP instruction is responsible for clearing the control signals currently held in all the control signal buffer queues inside the device, guaranteeing that all instructions before the NOP have completed. The NOP instruction itself contains no operation.
The JUMP instruction controls the jump of the address of the next instruction to be read from the instruction storage unit, and is used to implement jumps in the control flow.
The MOVE instruction moves data at one address in the device's internal address space to another address in the internal address space; this process is independent of the operation unit and occupies no operation unit resources during execution.
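One way to picture the controller unit's role is as a decode step that maps each instruction of this set to the modules it drives. The dispatch table below merely paraphrases the instruction descriptions above; its encoding, the module assignments, and the example operand string are assumptions for illustration.

```python
# Illustrative dispatch table: which modules each instruction of the set
# drives. The table paraphrases the descriptions above; the encoding and
# the example operand string are assumptions.

DISPATCH = {
    "COMPUTE": ["data_access_unit_3", "main_module_5", "slave_modules_6"],
    "CONFIG":  ["main_module_5", "slave_modules_6"],  # per-layer constants
    "IO":      ["data_access_unit_3"],                # external loads/stores
    "NOP":     [],                                    # only drains the queues
    "JUMP":    ["controller_unit_2"],                 # next-instruction address
    "MOVE":    ["data_access_unit_3"],                # on-chip copy
}

def decode(instruction: str):
    opcode = instruction.split()[0]
    return DISPATCH.get(opcode, [])

print(decode("COMPUTE conv addr=0x1000 size=256"))
```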
The method by which the above convolution operation device executes a convolution computation instruction may specifically be as follows:
The controller unit 2 extracts from the instruction storage unit 1 the convolution computation instruction, the operation field corresponding to the instruction, and the first computation topology corresponding to the instruction (multiplier - adder - adder - activation operator); the controller unit transmits the operation field to the data access unit and the first computation topology to the interconnection module 4.
The data access unit 3 extracts from the external storage medium the convolution kernel w and the bias b corresponding to the operation field (when b is 0, the bias b need not be extracted) and transmits the convolution kernel w and the bias b to the main operation module 5.
Optionally, the above first result may include the results of multiple multiplication operations.
The DVFS device 7 collects the working state information of the convolution operation device and sends voltage-frequency regulation information to the convolution operation device according to that working state information, the voltage-frequency regulation information instructing the convolution operation device to adjust its working voltage or working frequency.
Specifically, the DVFS device 7 includes:
an information collection unit 71, which collects the working state information of the convolution operation device in real time; and
a voltage-frequency regulation unit 72, which sends voltage-frequency regulation information to the convolution operation device according to its working state information, the voltage-frequency regulation information instructing the convolution operation device to adjust its working voltage or working frequency.
In a possible embodiment of the present application, the working state information of the convolution operation device includes the running speed of the convolution operation device, the voltage-frequency regulation information includes first voltage-frequency regulation information, and the voltage-frequency regulation unit 72 is configured to:
when the running speed of the convolution operation device is greater than a target speed, send the first voltage-frequency regulation information to the convolution operation device, the first voltage-frequency regulation information instructing the device to lower its working frequency or working voltage, where the target speed is the running speed of the convolution operation device that suffices to meet the user's needs.
In a possible embodiment of the present application, the working state information of the convolution operation device includes the running speed of the data access unit 3 and the running speed of the main operation module 5, the voltage-frequency regulation information includes second voltage-frequency regulation information, and the voltage-frequency regulation unit 72 is further configured to:
when it is determined from the running speeds of the data access unit 3 and the main operation module 5 that the running time of the data access unit 3 exceeds the running time of the main operation module 5, send the second voltage-frequency regulation information to the main operation module 5, the second voltage-frequency regulation information instructing the main operation module 5 to lower its working frequency or working voltage.
Further, the voltage-frequency regulation information includes third voltage-frequency regulation information, and the voltage-frequency regulation unit 72 is further configured to:
when it is determined from the running speeds of the data access unit 3 and the main operation module 5 that the running time of the main operation module 5 exceeds the running time of the data access unit 3, send the third voltage-frequency regulation information to the data access unit 3, the third voltage-frequency regulation information instructing the data access unit 3 to lower its working frequency or working voltage.
In a possible embodiment of the present application, the working state information of the convolution operation device includes the working state information of at least S of the following units/modules: the instruction storage unit 1, the controller unit 2, the data access unit 3, the interconnection module 4, the main operation module 5, and the N slave operation modules 6, where S is an integer greater than 1 and less than or equal to N+5. The voltage-frequency regulation information includes fourth voltage-frequency regulation information, and the voltage-frequency regulation unit 72 is configured to:
when it is determined from the working state information of a unit A that the unit A is in an idle state, send the fourth voltage-frequency regulation information to the unit A, the fourth voltage-frequency regulation information instructing the unit A to lower its working frequency or working voltage,
where the unit A is any one of the at least S units/modules.
Further, the voltage-frequency regulation information includes fifth voltage-frequency regulation information, and the voltage-frequency regulation unit 72 is further configured to:
when it is determined from the working state information of the unit A that the unit A has returned to the working state, send the fifth voltage-frequency regulation information to the unit A, the fifth voltage-frequency regulation information instructing the unit A to raise its working voltage or working frequency.
In a possible embodiment of the present application, an embodiment of the present invention provides a neural network processor that includes the above convolution operation device.
The above neural network processor is used to perform artificial neural network operations and to implement artificial intelligence applications such as speech recognition, image recognition, and translation.
In this convolution computation task, the DVFS device 7 of FIG. 4A works as follows:
Case 1: While the convolutional neural network processor is performing a convolution operation, the DVFS device 7 acquires in real time the running speeds of the data access unit 3 and the main operation module 5 of the neural network processor of FIG. 4A. When the DVFS device 7 determines, from these running speeds, that the running time of the data access unit 3 exceeds that of the main operation module 5, it concludes that the data access unit 3 has become the bottleneck of the convolution operation: after finishing its current convolution operation, the main operation module 5 must wait for the data access unit 3 to complete its read task and transfer the read data before it can perform the next convolution operation on that data. The DVFS device 7 therefore sends first voltage-frequency regulation information to the main operation module 5, instructing it to lower its working voltage or working frequency so that its running speed matches that of the data access unit 3. This reduces the power consumption of the main operation module 5, prevents it from sitting idle, and ultimately lowers the overall operating power consumption of the neural network processor without affecting the completion time of the task.
Case 2: While the neural network processor is performing a convolution operation, the DVFS device 7 of FIG. 4A acquires in real time the running speeds of the data access unit 3 and the main operation module 5 of the neural network processor. When the DVFS device 7 determines, from these running speeds, that the running time of the main operation module 5 exceeds that of the data access unit 3, it concludes that the main operation module 5 has become the bottleneck: after finishing its current data read operation, the data access unit 3 must wait for the main operation module 5 to complete its current convolution operation before it can transfer the newly read data to it. The DVFS device 7 therefore sends second voltage-frequency regulation information to the data access unit 3, instructing it to lower its working voltage or working frequency so that its running speed matches that of the main operation module 5. This reduces the power consumption of the data access unit 3, prevents it from sitting idle, and ultimately lowers the overall operating power consumption of the neural network processor without affecting the completion time of the task.
When the neural network processor executes artificial neural network operations for an artificial intelligence application, the DVFS device 7 of FIG. 4A collects in real time the working parameters of the processor for that application and adjusts the working voltage or working frequency of the processor according to those parameters.
Specifically, the artificial intelligence application may be video image processing, object recognition, machine translation, speech recognition, image beautification, and the like.
Case 3: When the neural network processor performs video image processing, the DVFS device 7 of FIG. 4A collects in real time the frame rate at which the processor processes video images. When this frame rate exceeds the target frame rate, i.e. the video processing frame rate the user actually requires, the DVFS device 7 sends third voltage-frequency regulation information to the neural network processor, instructing it to lower its working voltage or working frequency, thereby reducing the power consumption of the processor while still satisfying the user's normal video-processing needs.
Case 4: When the neural network processor performs speech recognition, the DVFS device 7 of FIG. 4A collects in real time the speech recognition speed of the processor. When this speed exceeds the user's actual speech recognition speed, the DVFS device 7 sends fourth voltage-frequency regulation information to the neural network processor, instructing it to lower its working voltage or working frequency, thereby reducing the power consumption of the processor while still satisfying the user's normal speech-recognition needs.
Case 5: The DVFS device 7 of FIG. 4A monitors and acquires in real time the working state information of each unit or module of the neural network processor (including the instruction storage unit 1, the controller unit 2, the data access unit 3, the interconnection module 4, the main operation module 5, and the N slave operation modules 6). When any of these units or modules is in an idle state, the DVFS device 7 sends fifth voltage-frequency regulation information to that unit or module to lower its working voltage or working frequency and thereby reduce its power consumption. When the unit or module returns to the working state, the DVFS device 7 sends sixth voltage-frequency regulation information to it to raise its working voltage or working frequency, so that its running speed again meets the demands of the work.
Referring to FIG. 4E, FIG. 4E schematically shows one embodiment of the interconnection module 4: an H-tree module. The interconnection module 4 forms the data path between the main operation module 5 and the plurality of slave operation modules 6 and is a binary-tree path composed of multiple nodes: each node forwards upstream data identically to its two downstream nodes, merges the data returned by its two downstream nodes, and returns the merged data upstream. For example, at the start of the computation phase of the convolutional neural network, the neuron data in the main operation module 5 are sent through the interconnection module 4 to each slave operation module 6; after the computation of the slave operation modules 6 completes, the value of the neuron output by each slave operation module is assembled, stage by stage in the interconnection module, into one complete vector of neurons. As an example, suppose the convolution operation device contains N slave operation modules in total. The input data xi is sent to the N slave operation modules, each of which convolves xi with its own convolution kernel to obtain one scalar; the interconnection module 4 then merges the N scalars into an intermediate vector of N elements. If the convolution window traverses A*B positions of input data xi in total (A along the X axis and B along the Y axis, X and Y being coordinate axes of a three-dimensional orthogonal coordinate system), the above convolution is performed for each of the A*B positions, and all resulting vectors are combined in the main operation module into a three-dimensional intermediate result of size A*B*N.
Referring to FIG. 4B, FIG. 4B shows an example block diagram of the structure of the main operation module 5 in the apparatus for performing the convolutional neural network forward operation according to an embodiment of the present application. As shown in FIG. 4B, the main operation module 5 includes a first operation unit 51, a first data dependency determination unit 52, and a first storage unit 53.
The first operation unit 51 includes a vector addition unit 511 and an activation unit 512. The first operation unit 51 receives control signals from the controller unit 2 of FIG. 4A and carries out the various computational functions of the main operation module 5. The vector addition unit 511 implements the bias-add operation of the convolutional neural network forward computation: it adds the bias data element-wise to the intermediate result to obtain a biased result, and the activation unit 512 applies the activation function to the biased result. The bias data may be read in from the external address space or stored locally.
The first data dependency determination unit 52 is the port through which the first operation unit 51 reads from and writes to the first storage unit 53, and it guarantees read-write consistency of the data in the first storage unit 53. It is also responsible for sending the data read from the first storage unit 53 to the slave operation modules 6 through the interconnection module 4, while the output data of the slave operation modules 6 are sent directly to the first operation unit 51 through the interconnection module 4. The instructions output by the controller unit 2 are sent to the first operation unit 51 and the first data dependency determination unit 52 to control their behavior.
The first storage unit 53 caches the input data and output data used by the main operation module 5 during computation.
Referring to FIG. 4C, FIG. 4C shows an example block diagram of the structure of a slave operation module 6 in the apparatus for performing the forward operation of a convolutional neural network according to an embodiment of the present application. As shown in FIG. 4C, each slave operation module 6 includes a second operation unit 61, a second data dependency determination unit 62, a second storage unit 63, and a third storage unit 64.
The second operation unit 61 receives the control signal issued by the controller unit 2 in FIG. 4A and performs the convolution operation. The second operation unit includes a vector multiplication unit 611 and an accumulation unit 612, which are responsible for the vector multiplication and accumulation operations in the convolution, respectively.
The second data dependency determination unit 62 is responsible for the read and write operations on the second storage unit 63 during computation. Before performing a read or write, the second data dependency determination unit 62 first ensures that there is no read/write consistency conflict among the data used by the instructions. For example, all control signals sent to the data dependency unit 62 are stored in an instruction queue inside the data dependency unit 62; in this queue, if the read range of a read instruction conflicts with the write range of a write instruction earlier in the queue, the read instruction can be executed only after the write instruction it depends on has been executed.
The second storage unit 63 caches the input data and the output scalar data of the slave operation module 6.
The third storage unit 64 caches the convolution kernel data required by the slave operation module 6 during computation.
In a possible embodiment of the present application, an embodiment of the present application provides a neural network processor that includes the above convolution operation device.
The above neural network processor is configured to perform artificial neural network operations and implement artificial intelligence applications such as speech recognition, image recognition, and translation.
In a specific application scenario, for a convolution computation task, the dynamic voltage and frequency scaling device 7 in FIG. 4A operates as follows:
The information acquisition unit 71 of the dynamic voltage and frequency scaling device 7 acquires in real time the working state information or application scenario information of the neural network processor connected to the dynamic voltage and frequency scaling device 7, where the application scenario information is obtained by the neural network processor through neural network computation or collected by a sensor connected to the neural network processor. The voltage and frequency scaling unit 72 of the dynamic voltage and frequency scaling device 7 sends voltage-frequency regulation information to the neural network processor according to the working state information or application scenario information of the neural network processor, the voltage-frequency regulation information instructing the neural network processor to adjust its working voltage or working frequency.
In a possible embodiment of the present application, the working state information of the neural network processor includes the running speed of the neural network processor, and the voltage-frequency regulation information includes first voltage-frequency regulation information. The voltage and frequency scaling unit 72 is configured to:
send the first voltage-frequency regulation information to the neural network processor when the running speed of the neural network processor is greater than a target speed, the first voltage-frequency regulation information instructing the neural network processor to lower its working frequency or working voltage, where the target speed is the running speed of the neural network processor that meets the user's demand.
Specifically, the information acquisition unit 71 acquires in real time the running speed of the neural network processor connected to it. This running speed may be a different type of speed depending on the task the neural network processor performs: when the operation performed by the neural network processor is video image processing, the running speed may be the frame rate at which the neural network processor processes video images; when the operation is speech recognition, the running speed is the speed at which the input information is recognized as speech. When the voltage and frequency scaling unit 72 determines that the running speed of the neural network processor is greater than the target speed, that is, the running speed already meets the user's demand, it sends the first voltage-frequency regulation information to the neural network processor to instruct it to lower its working voltage or working frequency, thereby reducing the power consumption of the neural network processor.
For example, suppose the operation performed by the neural network processor is video image processing and the target speed is 24 frames per second. The information acquisition unit 71 acquires in real time the frame rate at which the neural network processor processes video images; the current frame rate is 54 frames per second. When the voltage and frequency scaling unit 72 determines that the current frame rate is greater than the target speed, it sends the first voltage-frequency regulation information to the neural network processor to instruct it to lower its working voltage or working frequency, thereby reducing the power consumption of the neural network processor.
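A minimal sketch of this first regulation rule, under the assumption of a simple message interface between the voltage and frequency scaling unit 72 and the processor (the message name and the send() callback are illustrative, not part of the apparatus):

    TARGET_FPS = 24.0   # target speed: the running speed that meets user demand

    def regulate_speed(measured_fps, send):
        # send() stands in for the signal path from unit 72 to the processor.
        if measured_fps > TARGET_FPS:
            # Faster than needed: trade the surplus speed for lower power.
            send("FIRST_VF_REGULATION", action="lower_voltage_or_frequency")

    # The example from the text: 54 fps measured against a 24 fps target.
    regulate_speed(54.0, send=lambda msg, **kw: print(msg, kw))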
In a possible embodiment of the present application, the neural network processor includes at least a first unit and a second unit, the output data of the first unit being the input data of the second unit. The working state information of the neural network processor includes the running speed of the first unit and the running speed of the second unit, and the voltage-frequency regulation information includes second voltage-frequency regulation information. The voltage and frequency scaling unit 72 is further configured to:
send the second voltage-frequency regulation information to the second unit when it determines, from the running speed of the first unit and the running speed of the second unit, that the running time of the first unit exceeds the running time of the second unit, the second voltage-frequency regulation information instructing the second unit to lower its working frequency or working voltage.
Specifically, a task executed by the neural network processor requires the cooperation of the first unit and the second unit, and the output data of the first unit is the input data of the second unit. The information acquisition unit 71 acquires the running speeds of the first unit and the second unit in real time. When it is determined that the running speed of the first unit is lower than that of the second unit, that is, the running time of the first unit exceeds the running time of the second unit, the voltage and frequency scaling unit 72 sends the second voltage-frequency regulation information to the second unit to instruct it to lower its working voltage or working frequency, thereby reducing the overall power consumption of the neural network processor without affecting its overall running speed.
In a possible embodiment of the present application, the voltage-frequency regulation information includes third voltage-frequency regulation information, and the voltage and frequency scaling unit 72 is further configured to:
send the third voltage-frequency regulation information to the first unit when it determines, from the running speed of the first unit and the running speed of the second unit, that the running time of the second unit exceeds the running time of the first unit, the third voltage-frequency regulation information instructing the first unit to lower its working frequency or working voltage.
Specifically, a task executed by the neural network processor requires the cooperation of the first unit and the second unit, and the output data of the first unit is the input data of the second unit. The information acquisition unit 71 acquires the running speeds of the first unit and the second unit in real time. When it is determined that the running speed of the first unit is greater than that of the second unit, that is, the running time of the second unit exceeds the running time of the first unit, the voltage and frequency scaling unit 72 sends the third voltage-frequency regulation information to the first unit to instruct it to lower its working voltage or working frequency, thereby reducing the overall power consumption of the neural network processor without affecting its overall running speed.
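Both directions of this pipeline-balancing rule fit in one sketch. Running time is taken here as workload divided by running speed, an assumption the text leaves implicit; the function and message names are illustrative:

    def balance_pipeline(work1, speed1, work2, speed2, send):
        # Unit 1 feeds unit 2; slow down whichever unit would otherwise idle.
        t1, t2 = work1 / speed1, work2 / speed2     # running times
        if t1 > t2:
            # First unit is the bottleneck, so the second unit waits on input.
            send(unit=2, info="SECOND_VF_REGULATION")   # lower V/f of unit 2
        elif t2 > t1:
            # Second unit is the bottleneck, so the first unit waits to hand off.
            send(unit=1, info="THIRD_VF_REGULATION")    # lower V/f of unit 1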
In a possible embodiment of the present application, the neural network processor includes at least N units, and the working state information of the neural network processor includes the working state information of at least S of the at least N units, where N is an integer greater than 1 and S is an integer less than or equal to N. The voltage-frequency regulation information includes fourth voltage-frequency regulation information, and the voltage and frequency scaling unit 72 is configured to:
send the fourth voltage-frequency regulation information to unit A when it determines, from the working state information of unit A, that unit A is in an idle state, the fourth voltage-frequency regulation information instructing unit A to lower its working frequency or working voltage,
where unit A is any one of the at least S units.
In a possible embodiment of the present application, the voltage-frequency regulation information includes fifth voltage-frequency regulation information, and the voltage and frequency scaling unit 72 is further configured to:
send the fifth voltage-frequency regulation information to unit A when it determines, from the working state information of unit A, that unit A has returned to a working state, the fifth voltage-frequency regulation information instructing unit A to raise its working voltage or working frequency.
Specifically, during operation of the neural network processor, the information acquisition unit 71 acquires in real time the working state information of at least S units inside the neural network processor. When the voltage and frequency scaling unit 72 determines from the working state information of unit A that unit A is in an idle state, it sends the fourth voltage-frequency regulation information to unit A to instruct it to lower its working frequency or working voltage, thereby reducing the power consumption of unit A; when it determines from the working state information of unit A that unit A has returned to a working state, it sends the fifth voltage-frequency regulation information to unit A to instruct it to raise its working frequency or working voltage, so that the running speed of unit A meets the demand of its workload.
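The per-unit idle gating over the S monitored units can be sketched as follows (the Unit class, its fields, and the message names are all hypothetical stand-ins for hardware state and signals):

    class Unit:
        def __init__(self, name):
            self.name, self.state, self.throttled = name, "working", False

    def gate_idle_units(units, send):
        for unit in units:                  # unit A is any of the S units
            if unit.state == "idle" and not unit.throttled:
                send(unit, "FOURTH_VF_REGULATION")   # lower V/f while idle
                unit.throttled = True
            elif unit.state == "working" and unit.throttled:
                send(unit, "FIFTH_VF_REGULATION")    # restore V/f on wake-up
                unit.throttled = False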
In a possible embodiment of the present application, when the application scenario of the neural network processor is image recognition, the application scenario information is the number of objects in the image to be recognized, and the voltage-frequency regulation information includes sixth voltage-frequency regulation information. The voltage and frequency scaling unit 72 is further configured to:
send the sixth voltage-frequency regulation information to the neural network processor when it determines that the number of objects in the image to be recognized is smaller than a first threshold, the sixth voltage-frequency regulation information instructing the neural network processor to lower its working voltage or working frequency.
Specifically, when the neural network processor is applied to image recognition, the number of objects in the image to be recognized is obtained by the neural network processor through a neural network algorithm. After the information acquisition unit 71 obtains from the neural network processor the number of objects in the image to be recognized (that is, the application scenario information above), the voltage and frequency scaling unit 72 sends the sixth voltage-frequency regulation information to the neural network processor when it determines that this number is smaller than the first threshold, instructing the neural network processor to lower its working voltage or working frequency; when it determines that the number of objects in the image to be recognized is greater than the first threshold, the voltage and frequency scaling unit 72 sends to the neural network processor voltage-frequency regulation information instructing it to raise its working voltage or working frequency.
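The same pattern applies to the object-count rule. A brief sketch follows; the threshold value and message names are purely illustrative, since the patent leaves the first threshold unspecified:

    FIRST_THRESHOLD = 5   # illustrative only; the patent does not fix a value

    def regulate_by_object_count(num_objects, send):
        if num_objects < FIRST_THRESHOLD:
            # Sparse scene: less compute is needed, so power can be saved.
            send("SIXTH_VF_REGULATION", action="lower_voltage_or_frequency")
        elif num_objects > FIRST_THRESHOLD:
            # Busy scene: raise voltage/frequency to keep up with the workload.
            send("RAISE_VF_REGULATION", action="raise_voltage_or_frequency")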
In a possible embodiment of the present application, the application scenario information is object tag information, the voltage-frequency regulation information includes seventh voltage-frequency regulation information, and the voltage and frequency scaling unit 72 is further configured to:
send the seventh voltage-frequency regulation information to the neural network processor when it determines that the object tag information belongs to a preset object tag set, the seventh voltage-frequency regulation information instructing the neural network processor to raise its working voltage or working frequency.
For example, the preset object tag set includes multiple object tags, such as "person", "dog", "tree", and "flower". When the neural network processor determines through a neural network algorithm that the current application scenario includes a dog, it transmits the object tag information containing "dog" to the information acquisition unit 71; when the voltage and frequency scaling unit 72 determines that the object tag information includes "dog", it sends the seventh voltage-frequency regulation information to the neural network processor to instruct it to raise its working voltage or working frequency; when it determines that the object tag information does not belong to the preset object tag set, the voltage and frequency scaling unit 72 sends to the neural network processor voltage-frequency regulation information instructing it to lower its working voltage or working frequency.
In a possible embodiment of the present application, when the neural network processor is applied to speech recognition, the application scenario information is the speech input rate, the voltage-frequency regulation information includes eighth voltage-frequency regulation information, and the voltage and frequency scaling unit 72 is further configured to:
send the eighth voltage-frequency regulation information to the neural network processor when the speech input rate is lower than a second threshold, the eighth voltage-frequency regulation information instructing the neural network processor to lower its working voltage or working frequency.
Specifically, in the speech recognition application scenario, the input unit of the neural network processor feeds speech into the neural network processor at a certain rate. The information acquisition unit 71 acquires the speech input rate in real time and sends the speech input rate information to the voltage and frequency scaling unit 72. When the voltage and frequency scaling unit 72 determines that the speech input rate is lower than the second threshold, it sends the eighth voltage-frequency regulation information to the neural network processor to instruct it to lower its working voltage or working frequency. When it determines that the speech input rate is greater than the second threshold, it sends to the neural network processor voltage-frequency regulation information instructing it to raise its working voltage.
In a possible embodiment of the present application, when the application scenario information is a keyword obtained by the neural network processor through speech recognition, the voltage-frequency regulation information includes ninth voltage-frequency regulation information, and the voltage and frequency scaling unit is further configured to:
send the ninth voltage-frequency regulation information to the neural network processor when the keyword belongs to a preset keyword set, the ninth voltage-frequency regulation information instructing the neural network processor to raise its working voltage or working frequency.
Further, when the keyword does not belong to the keyword set, the voltage and frequency scaling unit 72 sends to the neural network processor voltage-frequency regulation information instructing it to lower its working voltage or working frequency.
For example, when the application scenario of the neural network processor is speech recognition, the preset keyword set includes keywords such as "image beautification", "neural network algorithm", "image processing", and "Alipay". If the application scenario information is "image beautification", the voltage and frequency scaling unit 72 sends the ninth voltage-frequency regulation information to the neural network processor to instruct it to raise its working voltage or working frequency; if the application scenario information is "taking a photo", the voltage and frequency scaling unit 72 sends to the neural network processor voltage-frequency regulation information instructing it to lower its working voltage or working frequency.
In a possible embodiment of the present application, when the neural network processor is applied to machine translation, the application scenario information is the text input speed or the number of characters in the image to be translated, and the voltage-frequency regulation information includes tenth voltage-frequency regulation information. The voltage and frequency scaling unit 72 is further configured to:
send the tenth voltage-frequency regulation information to the neural network processor when the text input speed is lower than a third threshold or the number of characters in the image to be translated is smaller than a fourth threshold, the tenth voltage-frequency regulation information instructing the neural network processor to lower its working voltage or working frequency.
Specifically, when the neural network processor is applied to machine translation, the application scenario information acquired by the information acquisition unit 71 is the text input speed or the number of characters in the image to be translated, and this information is transmitted to the voltage and frequency scaling unit 72. When the voltage and frequency scaling unit 72 determines that the text input speed is lower than the third threshold or the number of characters in the image to be translated is smaller than the fourth threshold, it sends the tenth voltage-frequency regulation information to the neural network processor to instruct it to lower its working voltage; when it determines that the text input speed is greater than the third threshold or the number of characters in the image to be translated is greater than the fourth threshold, it sends to the neural network processor voltage-frequency regulation information instructing it to raise its working voltage.
In a possible embodiment of the present application, when the application scenario information is the ambient light intensity, the voltage-frequency regulation information includes eleventh voltage-frequency regulation information, and the voltage and frequency scaling unit 72 is further configured to:
send the eleventh voltage-frequency regulation information to the neural network processor when the ambient light intensity is lower than a fifth threshold, the eleventh voltage-frequency regulation information instructing the neural network processor to lower its working voltage or working frequency.
Specifically, the ambient light intensity is collected by a light sensor connected to the neural network processor. After obtaining the light intensity, the information acquisition unit 71 transmits it to the voltage and frequency scaling unit 72. When the voltage and frequency scaling unit 72 determines that the light intensity is lower than the fifth threshold, it sends the eleventh voltage-frequency regulation information to the neural network processor to instruct it to lower its working voltage; when it determines that the light intensity is greater than the fifth threshold, it sends to the neural network processor voltage-frequency regulation information instructing it to raise its working voltage or working frequency.
In a possible embodiment of the present application, the neural network processor is applied to image beautification, the voltage-frequency regulation information includes twelfth voltage-frequency regulation information and thirteenth voltage-frequency regulation information, and the voltage and frequency scaling unit 72 is further configured to:
send the twelfth voltage-frequency regulation information to the neural network processor when the application scenario information is a face image, the twelfth voltage-frequency regulation information instructing the neural network processor to raise its working voltage or working frequency; and
send the thirteenth voltage-frequency regulation information to the neural network processor when the application scenario information is not a face image, the thirteenth voltage-frequency regulation information instructing the neural network processor to lower its working voltage or working frequency.
In a possible embodiment of the present application, when the neural network processor is applied to speech recognition, the application scenario information is the speech intensity. When the speech intensity is greater than a sixth threshold, the voltage and frequency scaling unit 72 sends to the neural network processor voltage-frequency regulation information instructing it to lower its working voltage or working frequency; when the speech intensity is lower than the sixth threshold, the voltage and frequency scaling unit 72 sends to the neural network processor voltage-frequency regulation information instructing it to raise its working voltage or working frequency.
It should be noted that the above scenario information may be information about the external scene collected by sensors, such as light intensity and speech intensity. The application scenario information may also be information computed by an artificial intelligence algorithm; for example, in an object recognition task, the real-time computation result information of the neural network processor is fed back to the information acquisition unit, the information including the number of objects in the scene, face images, object tag keywords, and so on.
Optionally, the above artificial intelligence algorithm includes, but is not limited to, a neural network algorithm.
Referring to FIG. 4F, FIG. 4F is a schematic structural diagram of another convolution operation device according to an embodiment of the present application. As shown in FIG. 4F, the convolution operation device includes a dynamic voltage and frequency scaling device 617, a register unit 612, an interconnect module 613, an operation unit 614, a control unit 615, and a data access unit 616.
The operation unit 614 includes at least two of an adder, a multiplier, a comparator, and an activation operator.
The interconnect module 613 is configured to control the connection relationship of the calculators in the operation unit 614 so that at least two kinds of calculators are composed into different computation topologies.
The register unit 612 (which may be a register file, an instruction cache, or a scratchpad memory) is configured to store the operation instruction, the address of the data block in the storage medium, and the computation topology corresponding to the operation instruction.
Optionally, the convolution operation device further includes a storage medium 611.
The storage medium 611 may be an off-chip memory; in practical applications it may of course also be an on-chip memory, used to store data blocks. A data block may be n-dimensional data, where n is an integer greater than or equal to 1: when n=1 it is one-dimensional data, i.e., a vector; when n=2 it is two-dimensional data, i.e., a matrix; and when n=3 or more it is multi-dimensional data.
The control unit 615 is configured to extract from the register unit 612 an operation instruction, the operation field corresponding to the operation instruction, and the first computation topology corresponding to the operation instruction, and to decode the operation instruction into an execution instruction used to control the operation unit 614 to perform the operation, transmitting the operation field to the data access unit 616 and the computation topology to the interconnect module 613.
The data access unit 616 is configured to extract the data block corresponding to the operation field from the storage medium 611 and transmit the data block to the interconnect module 613.
The interconnect module 613 is configured to receive the data block of the first computation topology.
In a possible embodiment of the present application, the interconnect module 613 also rearranges the data block according to the first computation topology.
The operation unit 614 is configured to execute the execution instruction, calling the calculators in the operation unit 614 to perform operations on the data block to obtain an operation result, and to transmit the operation result to the data access unit 616 to be stored in the storage medium 611.
In a possible embodiment of the present application, the operation unit 614 is further configured to call the calculators, according to the first computation topology and the execution instruction, to perform operations on the rearranged data block to obtain an operation result, and to transmit the operation result to the data access unit 616 to be stored in the storage medium 611.
In a feasible embodiment, the interconnect module 613 is further configured to form the first computation topology according to the connection relationship of the calculators in the operation unit 614.
The dynamic voltage and frequency scaling device 617 is configured to monitor the working state of the entire convolution operation device and to dynamically regulate its voltage and frequency.
The specific computation method of the above convolution operation device is described below through different operation instructions, taking the convolution computation instruction as an example. The convolution computation instruction can be applied in a neural network, so it may also be called a convolutional neural network instruction. For the convolution computation instruction, the formula it actually needs to execute may be:
s = s(∑ w·x_i + b)
Here the convolution kernel W (which may include multiple data elements) is multiplied by the input data x_i and the products are summed; then the bias b may optionally be added; and an activation operation s(h) may optionally be applied to obtain the final output result S. From this formula, the computation topology is obtained as multiplier - adder - (optional) activation operator. The above convolution computation instruction may include an instruction set that contains convolutional neural network COMPUTE instructions with different functions, as well as the CONFIG, IO, NOP, JUMP, and MOVE instructions.
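As a worked instance of this formula (a sketch only; the weights, inputs, and bias below are illustrative numbers, not values from the patent):

    import math

    def conv_output(w, x, b=0.0, activation=None):
        # s = act(sum_i w_i * x_i + b); bias and activation are optional,
        # matching the multiplier - adder - (optional) activation topology.
        h = sum(wi * xi for wi, xi in zip(w, x)) + b
        return activation(h) if activation else h

    sigmoid = lambda h: 1.0 / (1.0 + math.exp(-h))
    s = conv_output([0.5, -1.0, 0.25], [2.0, 1.0, 4.0], b=0.1, activation=sigmoid)
    # h = 0.5*2.0 - 1.0*1.0 + 0.25*4.0 + 0.1 = 1.1, so s = sigmoid(1.1) ≈ 0.750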
In one embodiment, the COMPUTE instructions include:
a convolution operation instruction, according to which the convolution operation device fetches input data of a specified size and a convolution kernel from specified addresses of the memory (preferably a scratchpad memory or a scalar register file), respectively, and performs the convolution operation in the convolution operation component;
a convolutional neural network sigmoid instruction, according to which the convolution operation device fetches input data of a specified size and a convolution kernel from specified addresses of the memory (preferably a scratchpad memory or a scalar register file), respectively, performs the convolution operation in the convolution operation component, and then applies sigmoid activation to the output result;
a convolutional neural network TanH instruction, according to which the convolution operation device fetches input data of a specified size and a convolution kernel from specified addresses of the memory (preferably a scratchpad memory), respectively, performs the convolution operation in the convolution operation component, and then applies TanH activation to the output result;
a convolutional neural network ReLU instruction, according to which the device fetches input data of a specified size and a convolution kernel from specified addresses of the memory (preferably a scratchpad memory), respectively, performs the convolution operation in the convolution operation component, and then applies ReLU activation to the output result; and
a convolutional neural network group instruction, according to which the convolution operation device fetches input data of a specified size and a convolution kernel from specified addresses of the memory (preferably a scratchpad memory), respectively, divides them into groups, performs the convolution operation in the convolution operation component, and then, preferably, applies activation to the output result.
The CONFIG instruction configures the various constants required by the computation of the current layer before the computation of each layer of the artificial neural network begins.
The IO instruction reads in from the external storage space the input data required by the computation and stores the data back to the external space after the computation is completed.
The NOP instruction is responsible for clearing the control signals in all control signal cache queues inside the current device, ensuring that all instructions before the NOP instruction have completed. The NOP instruction itself does not contain any operation.
The JUMP instruction is responsible for controlling the jump of the address of the next instruction to be read from the instruction storage unit, and is used to implement jumps in the control flow.
The MOVE instruction is responsible for moving data at one address in the internal address space of the convolution operation device to another address in the internal address space of the convolution operation device; this process is independent of the operation unit and does not occupy the resources of the operation unit during execution.
The method by which the above convolution operation device executes the convolution computation instruction may specifically be as follows:
The control unit 615 extracts from the register unit 612 the convolution computation instruction, the operation field corresponding to the convolution computation instruction, and the first computation topology corresponding to the convolution computation instruction (multiplier - adder - adder - activation operator); the control unit transmits the operation field to the data access unit 616 and the first computation topology to the interconnect module 613.
The data access unit 616 extracts from the storage medium 611 the convolution kernel w and the bias b corresponding to the operation field (when b is 0, the bias b does not need to be extracted) and transmits the convolution kernel w and the bias b to the operation unit 614.
The multiplier of the operation unit 614 multiplies the convolution kernel w by the input data Xi to obtain a first result; the first result is input to the adder to perform addition, obtaining a second result; the second result and the bias b are added to obtain a third result; the third result is input to the activation operator, which performs the activation operation to obtain the output result s; and the output result s is transmitted to the data access unit 616 to be stored in the storage medium 611. After each step, the result may instead be output directly and transmitted to the data access unit for storage in the storage medium 611, without the subsequent steps. In addition, the step of adding the second result and the bias b to obtain the third result is optional, that is, when b is 0 this step is not needed. Moreover, the order of the addition and multiplication operations may be exchanged.
Optionally, the above first result may include the results of multiple multiplication operations.
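The decode-and-dispatch path in the paragraphs above can be modeled as follows (a hypothetical software sketch; the patent defines hardware units such as control unit 615 and data access unit 616, not software objects, so every structure and name here is an assumption):

    # Hypothetical model of the convolution-instruction execution path.
    import math
    from dataclasses import dataclass

    @dataclass
    class ConvInstruction:
        opcode: str            # e.g. "CONV_SIGMOID"
        operand_field: dict    # addresses/sizes of x, w, b in storage medium 611
        topology: tuple        # e.g. ("mul", "add", "add_bias", "activate")

    def execute(inst, fetch, mul, add, activate, store):
        data = fetch(inst.operand_field)                 # data access unit 616
        first = [mul(w, x) for w, x in zip(data["w"], data["x"])]  # multiplier
        second = add(first)                              # adder: accumulate
        third = second + data.get("b", 0.0)              # optional bias step
        s = activate(third) if "activate" in inst.topology else third
        store(s)                                         # back via unit 616
        return s

    # Exercising the path with stub components:
    demo = ConvInstruction("CONV_SIGMOID", {}, ("mul", "add", "add_bias", "activate"))
    result = execute(
        demo,
        fetch=lambda _: {"w": [0.5, -1.0], "x": [2.0, 1.0], "b": 0.1},
        mul=lambda a, b: a * b,
        add=sum,
        activate=lambda h: 1.0 / (1.0 + math.exp(-h)),
        store=lambda s: None,
    )  # first=[1.0, -1.0], second=0.0, third=0.1, result = sigmoid(0.1) ≈ 0.525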
In a possible embodiment of the present application, an embodiment of the present application provides a neural network processor that includes the above convolution operation device.
The above neural network processor is configured to perform artificial neural network operations and implement artificial intelligence applications such as speech recognition, image recognition, and translation.
For this convolution computation task, the dynamic voltage and frequency scaling device 617 in FIG. 4F operates as follows:
Case 1: While the neural network processor performs the convolution operation, the dynamic voltage and frequency scaling device 617 in FIG. 4F acquires in real time the running speeds of the data access unit 616 and the operation unit 614 of the neural network processor. When the dynamic voltage and frequency scaling device 617 determines, from the running speeds of the data access unit 616 and the operation unit 614, that the running time of the data access unit 616 exceeds the running time of the operation unit 614, it can conclude that the data access unit 616 has become the bottleneck of the convolution operation: after finishing the current convolution operation, the operation unit 614 must wait for the data access unit 616 to finish its read task and transmit the read data to the operation unit 614 before it can perform the convolution operation on that data. The dynamic voltage and frequency scaling device 617 sends first voltage-frequency regulation information to the operation unit 614, the first voltage-frequency regulation information instructing the operation unit 614 to lower its working voltage or working frequency so as to lower its running speed and match it to the running speed of the data access unit 616. This reduces the power consumption of the operation unit 614, prevents the operation unit 614 from sitting idle, and ultimately reduces the overall operating power consumption of the neural network processor without affecting the completion time of the task.
Case 2: While the neural network processor performs the convolution operation, the dynamic voltage and frequency scaling device 617 acquires in real time the running speeds of the data access unit 616 and the operation unit 614 of the neural network processor. When the dynamic voltage and frequency scaling device 617 determines, from the running speeds of the data access unit 616 and the operation unit 614, that the running time of the operation unit 614 exceeds the running time of the data access unit 616, it can conclude that the operation unit 614 has become the bottleneck of the convolution operation: after finishing the current data read operation, the data access unit 616 must wait for the operation unit 614 to finish the current convolution operation before it can transmit the data it has read to the operation unit 614. The dynamic voltage and frequency scaling device 617 sends second voltage-frequency regulation information to the data access unit 616, the second voltage-frequency regulation information instructing the data access unit 616 to lower its working voltage or working frequency so as to lower its running speed and match it to the running speed of the operation unit 614. This reduces the power consumption of the data access unit 616, prevents the data access unit 616 from sitting idle, and ultimately reduces the overall operating power consumption of the neural network processor without affecting the completion time of the task.
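One way to read Cases 1 and 2 quantitatively is to scale the non-bottleneck unit's frequency down until the two running times match. The proportional rule below is an assumption of this sketch, not something the patent prescribes:

    def match_frequencies(t_access, t_compute, f_access, f_compute):
        # Slow the non-bottleneck unit to the pace of the bottleneck unit.
        if t_access > t_compute:
            # Case 1: memory-bound, so operation unit 614 may run slower.
            f_compute *= t_compute / t_access
        elif t_compute > t_access:
            # Case 2: compute-bound, so data access unit 616 may run slower.
            f_access *= t_access / t_compute
        return f_access, f_compute

    # E.g. if a read takes 10 ms but the convolution on it takes only 6 ms,
    # the compute frequency can drop to ~60% without delaying completion.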
The above neural network processor performs artificial neural network operations. When an artificial intelligence application is running, the dynamic voltage and frequency scaling device 617 acquires in real time the working parameters of the neural network processor for that artificial intelligence application and adjusts the working voltage or working frequency of the neural network processor according to those working parameters.
Specifically, the artificial intelligence application may be video image processing, object recognition, machine translation, speech recognition, image beautification, and so on.
Case 3: When the neural network processor performs video image processing, the dynamic voltage and frequency scaling device 617 acquires in real time the frame rate at which the neural network processor processes video images. When this frame rate exceeds the target frame rate, the target frame rate being the video image processing frame rate normally required by the user, the dynamic voltage and frequency scaling device 617 sends third voltage-frequency regulation information to the neural network processor, the third voltage-frequency regulation information instructing the neural network processor to lower its working voltage or working frequency, which reduces the power consumption of the neural network processor while still meeting the user's normal video image processing demand.
Case 4: When the neural network processor performs speech recognition, the dynamic voltage and frequency scaling device 617 acquires in real time the speech recognition speed of the neural network processor. When the speech recognition speed of the neural network processor exceeds the speech recognition speed the user actually needs, the dynamic voltage and frequency scaling device 617 sends fourth voltage-frequency regulation information to the neural network processor, the fourth voltage-frequency regulation information instructing the neural network processor to lower its working voltage or working frequency, which reduces the power consumption of the neural network processor while still meeting the user's normal speech recognition demand.
Case 5: The dynamic voltage and frequency scaling device 617 monitors in real time the working state of each unit or module in the neural network processor (including the storage medium 611, the register unit 612, the interconnect module 613, the operation unit 614, the control unit 615, and the data access unit 616). When any unit or module of the neural network processor is in an idle state, the dynamic voltage and frequency scaling device sends fifth voltage-frequency regulation information to that unit or module to lower its working voltage or working frequency and thereby reduce its power consumption. When that unit or module returns to a working state, the dynamic voltage and frequency scaling device sends sixth voltage-frequency regulation information to that unit or module to raise its working voltage or working frequency, so that the running speed of that unit or module meets the demand of its workload.
Referring to FIG. 4G, FIG. 4G is a schematic flowchart of a method for performing the forward operation of a single-layer convolutional neural network according to an embodiment of the present application, the method being applied in the above convolution operation device. As shown in FIG. 4G, the method includes the following steps (a pseudocode sketch of the complete flow is given after step S710):
S701: An input/output (IO) instruction is pre-stored at the head address of the instruction storage unit.
S702: The operation starts. The controller unit reads the IO instruction from the head address of the instruction storage unit, and according to the decoded control signal, the data access unit reads all the corresponding convolutional neural network operation instructions from the external address space and caches them in the instruction storage unit.
S703: The controller unit then reads the next IO instruction from the instruction storage unit, and according to the decoded control signal, the data access unit reads all the data needed by the main operation module from the external address space into the first storage unit of the main operation module.
S704: The controller unit then reads the next IO instruction from the instruction storage unit, and according to the decoded control signal, the data access unit reads the convolution kernel data needed by the slave operation modules from the external address space.
S705: The controller unit then reads the next CONFIG instruction from the instruction storage unit, and according to the decoded control signal, the convolution operation device configures the various constants needed by the computation of this layer of the neural network.
S706: The controller unit then reads the next COMPUTE instruction from the instruction storage unit, and according to the decoded control signal, the main operation module first sends the input data within the convolution window through the interconnection module to the N slave operation modules, where it is saved in the second storage units of the N slave operation modules; the convolution window is then moved according to the instruction.
S707: According to the control signal decoded from the COMPUTE instruction, the operation units of the N slave operation modules read the convolution kernels from the third storage units and the input data from the second storage units, complete the convolution of the input data with the convolution kernels, and return the resulting output scalars through the interconnection module.
S708: In the interconnection module, the output scalars returned by the N slave operation modules are spliced, stage by stage, into complete intermediate vectors.
S709: The main operation module obtains the intermediate vectors returned by the interconnection module. Once the convolution window has traversed all the input data, the main operation module splices all the returned vectors into an intermediate result; according to the control signal decoded from the COMPUTE instruction, it reads the bias data from the first storage unit, adds it to the intermediate result in the vector-addition unit to obtain the biased result, activates the biased result in the activation unit, and writes the final output data back to the first storage unit.
S710: The controller unit then reads the next IO instruction from the instruction storage unit, and according to the decoded control signal, the data access unit stores the output data in the first storage unit to the specified address in the external address space, and the operation ends.
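As a non-limiting illustration, the data flow of steps S706-S709 may be sketched in Python as follows. The array shapes, the ReLU activation, and all function and variable names are assumptions made for this example only; they are not the apparatus's actual interfaces.

```python
# Sketch of the S706-S709 dataflow: each input window is broadcast to N
# "slave" modules (one convolution kernel each); their output scalars are
# gathered into a vector, then the "main" module adds the bias and applies
# an activation (ReLU is assumed here).
import numpy as np

def conv_forward(x, kernels, bias, stride=1):
    """x: (H, W) input; kernels: (N, k, k), one kernel per slave module."""
    n, k, _ = kernels.shape
    h = (x.shape[0] - k) // stride + 1
    w = (x.shape[1] - k) // stride + 1
    out = np.empty((h, w, n))
    for i in range(h):                      # S706: move the convolution window
        for j in range(w):
            window = x[i*stride:i*stride+k, j*stride:j*stride+k]
            # S707: each slave module produces one output scalar per window
            scalars = [np.sum(window * kernels[s]) for s in range(n)]
            out[i, j] = np.array(scalars)   # S708: splice scalars into a vector
    return np.maximum(out + bias, 0.0)      # S709: bias addition and activation

y = conv_forward(np.random.rand(8, 8), np.random.rand(4, 3, 3), np.zeros(4))
```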
Optionally, the method further includes:
collecting working state information of the convolution operation device in real time; and
sending voltage-frequency regulation information to the convolution operation device according to the working state information of the convolution operation device, where the voltage-frequency regulation information instructs the convolution operation device to adjust its working voltage or working frequency.
Optionally, the working state information of the convolution operation device includes the running speed of the convolution operation device, and the voltage-frequency regulation information includes first voltage-frequency regulation information. Sending voltage-frequency regulation information to the convolution operation device according to its working state information includes:
when the running speed of the convolution operation device is greater than a target speed, sending the first voltage-frequency regulation information to the convolution operation device, where the first voltage-frequency regulation information instructs the convolution operation device to lower its working frequency or working voltage, and the target speed is the running speed of the chip that satisfies the user's requirements.
Optionally, the working state information of the convolution operation device includes the running speed of the data access unit and the running speed of the main operation unit, and the voltage-frequency regulation information includes second voltage-frequency regulation information. Sending voltage-frequency regulation information to the convolution operation device according to its working state information further includes:
when it is determined from the running speed of the data access unit and the running speed of the main operation unit that the running time of the data access unit exceeds the running time of the main operation unit, sending the second voltage-frequency regulation information to the main operation unit, where the second voltage-frequency regulation information instructs the main operation unit to lower its working frequency or working voltage.
Optionally, the voltage-frequency regulation information includes third voltage-frequency regulation information. Sending voltage-frequency regulation information to the convolution operation device according to its working state information further includes:
when it is determined from the running speed of the data access unit and the running speed of the main operation unit that the running time of the main operation unit exceeds the running time of the data access unit, sending the third voltage-frequency regulation information to the data access unit, where the third voltage-frequency regulation information instructs the data access unit to lower its working frequency or working voltage.
Optionally, the working state information of the convolution operation device includes the working state information of at least S of the units/modules among the instruction storage unit, the controller unit, the data access unit, the interconnection module, the main operation module, and the N slave operation modules, where S is an integer greater than 1 and less than or equal to N+5, and the voltage-frequency regulation information includes fourth voltage-frequency regulation information. Sending voltage-frequency regulation information to the convolution operation device according to its working state information further includes:
when it is determined from the working state information of a unit A that the unit A is in an idle state, sending the fourth voltage-frequency regulation information to the unit A, where the fourth voltage-frequency regulation information instructs the unit A to lower its working frequency or working voltage,
where the unit A is any one of the at least S units/modules.
Optionally, the voltage-frequency regulation information includes fifth voltage-frequency regulation information. Sending voltage-frequency regulation information to the convolution operation device according to its working state information further includes:
when it is determined from the working state information of the unit A that the unit A has returned to a working state, sending the fifth voltage-frequency regulation information to the unit A, where the fifth voltage-frequency regulation information instructs the unit A to raise its working voltage or working frequency.
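As a non-limiting illustration, the five voltage-frequency regulation rules above may be combined into a single monitoring routine, sketched below in Python. The data layout and the "scale_down"/"scale_up" messages are assumptions about how the regulation information might be encoded; they are not the described signal format.

```python
# Sketch of the first to fifth voltage-frequency regulation rules.
from dataclasses import dataclass

@dataclass
class UnitState:
    name: str
    runtime: float      # measured running time of the unit
    idle: bool          # current working state
    was_idle: bool      # state at the previous sampling instant

def regulate(device_speed, target_speed, access, main, units):
    msgs = []
    if device_speed > target_speed:        # rule 1: device faster than needed
        msgs.append(("device", "scale_down"))
    if access.runtime > main.runtime:      # rule 2: main unit waits on memory
        msgs.append((main.name, "scale_down"))
    elif main.runtime > access.runtime:    # rule 3: memory waits on compute
        msgs.append((access.name, "scale_down"))
    for u in units:                        # rules 4 and 5: per-unit gating
        if u.idle:
            msgs.append((u.name, "scale_down"))
        elif u.was_idle:                   # unit has returned to work
            msgs.append((u.name, "scale_up"))
        u.was_idle = u.idle
    return msgs

access = UnitState("data_access", runtime=5.0, idle=False, was_idle=False)
main = UnitState("main_op", runtime=3.0, idle=False, was_idle=False)
idle_unit = UnitState("interconnect", runtime=0.0, idle=True, was_idle=False)
print(regulate(2.0, 1.5, access, main, [idle_unit]))
```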
It should be noted that, for the specific implementation of the foregoing method embodiment, reference may be made to the related description of the embodiments shown in FIG. 4A to FIG. 4F, which is not repeated here.
In a possible embodiment of the present application, a method for performing the forward operation of a multi-layer convolutional neural network is provided, which includes performing the neural network forward operation method shown in FIG. 4G for each layer. After the convolutional neural network of the previous layer has been executed, the operation instruction of the current layer takes the output data address of the previous layer, stored in the main operation module, as the input data address of the current layer, and the convolution kernel address and bias data address in the instruction are changed to the addresses corresponding to the current layer.
In yet another aspect of the present application, an image compression method and a related apparatus are provided, which can train a compression neural network for image compression, improving the effectiveness of image compression and the accuracy of recognition.
Referring to FIG. 5A, FIG. 5A shows a neural network operation process provided by the present application. As shown in FIG. 5A, the dotted arrows indicate the reverse operation and the solid arrows indicate the forward operation. In the forward operation, after the execution of the previous layer of the artificial neural network is completed, the output neurons obtained by the previous layer are used as the input neurons of the next layer for computation (or some operation is first performed on those output neurons before they serve as the input neurons of the next layer); at the same time, the weights are replaced with the weights of the next layer. In the reverse operation, after the reverse operation of the previous layer of the artificial neural network is completed, the input-neuron gradients obtained by the previous layer are used as the output-neuron gradients of the next layer for computation (or some operation is first performed on those input-neuron gradients before they serve as the output-neuron gradients of the next layer); at the same time, the weights are replaced with the weights of the next layer.
The forward propagation phase of the neural network corresponds to the forward operation and is the process from input data to output data. The back propagation phase corresponds to the reverse operation and is the process in which the error between the final result data and the expected output data is propagated back through the path of the forward propagation phase. By repeating forward propagation and back propagation, the weights of each layer are corrected by gradient descent on the error; adjusting the weights of each layer is the learning and training process of the neural network, which reduces the error of the network output.
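A minimal runnable sketch of this layer chaining is given below: in the forward pass the output neurons of layer n become the input neurons of layer n+1, and in the backward pass the gradients flow in the opposite direction while each layer's weights are corrected by gradient descent. The two-layer ReLU network and the learning rate are assumptions made for the example.

```python
import numpy as np

def forward(x, weights):
    acts = [x]
    for w in weights:                        # output of layer n feeds layer n+1
        acts.append(np.maximum(acts[-1] @ w, 0.0))
    return acts

def backward(acts, weights, grad_out, lr=0.01):
    for n in reversed(range(len(weights))):  # gradients flow from layer n+1 to n
        grad_out = grad_out * (acts[n + 1] > 0)   # activation-function derivative
        grad_w = np.outer(acts[n], grad_out)      # weight gradient of layer n
        grad_out = grad_out @ weights[n].T        # input-neuron gradient
        weights[n] -= lr * grad_w                 # gradient-descent correction

ws = [np.random.randn(4, 8), np.random.randn(8, 3)]
acts = forward(np.random.randn(4), ws)
backward(acts, ws, acts[-1] - np.ones(3))    # error against an expected output
```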
In the present application, no limitation is placed on the types of training images in the compression training atlas of the compression neural network or on the number of training images of each type: the more types and the larger the number, the more training iterations there are and the lower the loss rate of image compression, which helps to improve the accuracy of image recognition.
The compression training atlas may cover multiple dimensions, such as images taken from multiple angles, images under various light intensities, or images collected by multiple different types of image acquisition devices. Training the compression neural network on compression training atlases corresponding to these different dimensions improves the effectiveness of image compression under different conditions and broadens the applicable range of the image compression method.
The training images in the compression training atlas include label information; the present application does not limit the specific content of the label information. Marking the image portion to be trained can be used to detect whether the training of the compression neural network is complete. For example, in a driving image captured by road video surveillance, the label information is the target license-plate information: the driving image is input to the compression neural network to obtain a compressed image, the compressed image is recognized based on the recognition neural network model to obtain reference license-plate information, and if the reference license-plate information matches the target license-plate information, it can be determined that the training of the compression neural network is complete; otherwise, if the current number of training iterations of the compression neural network is less than a preset threshold, the compression neural network still needs to be trained.
The present application does not limit the type of label information, which may be license-plate information, face information, traffic-sign information, object classification information, and the like.
The recognition neural network model involved in the present application is the data obtained when the training of the recognition neural network used for image recognition is complete. The training method of the recognition neural network is not limited; it may be trained with the batch gradient descent algorithm (Batch Gradient Descent, BGD), the stochastic gradient descent algorithm (Stochastic Gradient Descent, SGD), the mini-batch gradient descent algorithm (mini-batch SGD), or the like, with one training period completed by a single forward operation and one reverse gradient propagation.
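For concreteness, one training period of the mini-batch SGD variant (a single forward operation followed by one reverse gradient propagation) may be sketched as follows; the linear model, squared-error loss, batch size, and learning rate are assumptions made for the example.

```python
import numpy as np

def sgd_step(w, x_batch, t_batch, lr=0.1):
    y = x_batch @ w                                    # forward operation
    grad = x_batch.T @ (y - t_batch) / len(x_batch)    # reverse gradient propagation
    return w - lr * grad                               # weight update

w = np.zeros(5)
for _ in range(100):                                   # one period per mini-batch
    x = np.random.randn(16, 5)
    w = sgd_step(w, x, x @ np.arange(5.0))             # synthetic targets
```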
Each training image in the recognition training atlas includes at least label information whose type is consistent with the type of the target label information of each training image in the compression training atlas. In other words, the recognition neural network model can recognize the compressed images output by the compression neural network (whether still in training or fully trained).
For example, if the type of the label information of the compression training images is license plates, the types of the label information of the recognition training images include at least license plates, which ensures that the recognition neural network model can recognize the compressed images output by the compression neural network and obtain license-plate information.
Optionally, the compression training atlas includes at least the recognition training atlas.
Since the images in a training atlas are constrained by factors such as angle, lighting, or the image acquisition device, training with the recognition training atlas improves the accuracy of the recognition neural network model and thus the training efficiency of the compression neural network, which in turn helps to improve the effectiveness of image compression.
Referring to FIG. 5B, FIG. 5B is a schematic flowchart of an image compression method provided by an embodiment of the present application. As shown in FIG. 5B, the image compression method includes the following steps:
Step S201: acquire an original image of a first resolution.
The first resolution is the input resolution of the compression neural network, and the second resolution, which is smaller than the first resolution, is the output resolution of the compression neural network. That is, the compression ratio of images input to the compression neural network (the ratio of the second resolution to the first resolution) is fixed; in other words, compressing different images based on the same compression neural network model yields images with the same compression ratio.
The original image is any training image in the compression training atlas of the compression neural network, and the label information of the original image is taken as the target label information. The present application does not limit how the label information is obtained: it may be marked by manual recognition, or obtained by inputting the original image into the recognition neural network and recognizing it based on the recognition neural network model, and so on.
Step S202: compress the original image based on a target model to obtain a compressed image of the second resolution.
The target model is the current neural network model of the compression neural network; that is, the target model is the current set of parameters of the compression neural network. Compressing, based on the target model, an original image whose resolution equals the input resolution of the compression neural network yields a compressed image whose resolution equals the output resolution of the compression neural network.
Optionally, compressing the original image based on the target model to obtain the compressed image of the second resolution includes: recognizing the original image based on the target model to obtain multiple pieces of image information; and compressing the original image based on the target model and the multiple pieces of image information to obtain the compressed image.
As noted above, the training images cover multiple dimensions. By first recognizing the original image based on the target model, the image information corresponding to each dimension can be determined, and the original image is then compressed with respect to each piece of image information, which improves the accuracy of image compression across the different dimensions.
Step S203: recognize the compressed image based on the recognition neural network model to obtain reference label information.
The present application does not limit the recognition method, which may include two parts, feature extraction and feature recognition, with the result of feature recognition taken as the reference label information. For example, after a driving image is compressed, the reference label information obtained for the compressed driving image is the license-plate number; after a face image is compressed, the reference label information obtained for the compressed face image is the face-recognition result.
Optionally, recognizing the compressed image based on the recognition neural network model to obtain the reference label information includes: preprocessing the compressed image to obtain an image to be recognized; and recognizing the image to be recognized based on the recognition neural network model to obtain the reference label information.
The preprocessing includes, but is not limited to, any one or more of the following: data-format conversion (such as normalization or integer data conversion), data deduplication, data exception handling, filling in missing data, and so on. Preprocessing the compressed image improves the efficiency and accuracy of image recognition.
Similarly, acquiring the original image of the first resolution includes: receiving an input image; and preprocessing the input image to obtain the original image. Preprocessing the input image improves the compression efficiency of image compression.
The preprocessing described above also includes size processing. Since a neural network has a fixed size requirement, it can only process images whose size equals the basic image size of that neural network. Taking the basic image size of the compression neural network as the first basic image size and the basic image size of the recognition neural network as the second basic image size: the compression neural network requires its input images to have a size equal to the first basic image size, and the recognition neural network requires its input images to have a size equal to the second basic image size. The compression neural network can compress an image to be compressed that satisfies the first basic image size to obtain a compressed image, and the recognition neural network can recognize an image to be recognized that satisfies the second basic image size to obtain reference label information.
The present application does not limit the specific manner of size processing, which may include cropping or filling pixels, scaling to the basic image size, down-sampling the input image, and so on.
Here, cropping peripheral pixels means cropping non-critical information regions at the periphery of the image. Down-sampling is the process of reducing the sampling rate of a given signal, for example taking the average of four adjacent pixels as the value of the corresponding single pixel in the processed image, thereby reducing the size of the image.
Optionally, preprocessing the compressed image to obtain the image to be recognized includes: when the image size of the compressed image is smaller than the basic image size of the recognition neural network, filling the compressed image with pixels up to the basic image size to obtain the image to be recognized.
The present application does not limit the fill pixels, which may correspond to any color mode, for example rgb(0, 0, 0). The specific positions of the filled pixels are likewise not limited and may be any positions outside the compressed image itself; that is, the compressed image is not modified, and the image is instead extended by filling pixels, which does not deform the compressed image and helps to improve the efficiency and accuracy of image recognition.
For example, as shown in FIG. 5C, the compressed image is placed at the upper left of the image to be recognized, and the positions of the image to be recognized outside the compressed image are filled with pixels.
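A runnable sketch of the padding step of FIG. 5C is given below: the compressed image is placed at the upper left of a canvas of the recognition network's basic image size, and the remaining positions are filled with rgb(0, 0, 0). The concrete image sizes are assumptions made for the example.

```python
import numpy as np

def pad_to_basic_size(compressed, basic_h, basic_w):
    h, w, c = compressed.shape
    assert h <= basic_h and w <= basic_w, "image already exceeds the basic size"
    canvas = np.zeros((basic_h, basic_w, c), dtype=compressed.dtype)  # rgb(0,0,0)
    canvas[:h, :w] = compressed            # compressed image at the upper left
    return canvas

to_recognize = pad_to_basic_size(np.ones((64, 48, 3), np.uint8), 224, 224)
```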
Similarly, preprocessing the input image to obtain the original image includes: when the image size of the input image is smaller than the first basic image size of the compression neural network, filling the input image with pixels up to the first basic image size to obtain the original image. Pixel filling enables the original image to be compressed to be recognized by the recognition neural network to obtain reference label information, and pixel filling does not change the compression ratio of the input image, which helps to improve the efficiency and accuracy of training the compression neural network.
Step S204: obtain a loss function according to the target label information and the reference label information.
In the present application, the loss function describes the magnitude of the error between the target label information and the reference label information. The label information has multiple dimensions, and the loss is generally calculated with the squared-difference formula:
\[ \mathrm{loss} = \sum_{k=1}^{c} (t_k - y_k)^2 \]
where c is the dimensionality of the label information, t_k is the k-th dimension of the reference label information, and y_k is the k-th dimension of the target label information.
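The squared-difference loss can be computed directly, as in the following sketch; the three-dimensional label vectors are illustrative only.

```python
import numpy as np

def squared_difference_loss(t, y):
    """t: reference label information; y: target label information; length c each."""
    return float(np.sum((np.asarray(t) - np.asarray(y)) ** 2))

print(squared_difference_loss([1.0, 0.0, 0.0], [0.8, 0.1, 0.1]))  # 0.06
```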
Step S205: determine whether the loss function converges to a first threshold or whether the current number of training iterations of the compression neural network is greater than or equal to a second threshold; if so, perform step S206; if not, perform step S207.
In the training method of the compression neural network involved in the present application, the training period corresponding to each training image is completed by a single forward operation and one reverse gradient propagation; the threshold of the loss function is set as the first threshold, and the threshold of the number of training iterations of the compression neural network is set as the second threshold. That is, if the loss function converges to the first threshold or the number of training iterations is greater than or equal to the second threshold, the training of the compression neural network is complete, and the target model is taken as the compression neural network model corresponding to the completed training; otherwise, the loss function drives the back propagation phase of the compression neural network, that is, the target model is updated according to the loss function and training continues with the next training image, i.e., steps S202-S205 are performed until the above conditions are met, at which point training ends and step S206 is performed.
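As a non-limiting illustration, the S202-S207 loop with its two stopping criteria may be sketched as follows. The compress, recognize, and update callables stand in for the compression neural network, the recognition neural network model, and the back-propagation update; they, like the parameter names, are assumptions made for the example (squared_difference_loss is the helper from the sketch above).

```python
def train_compression_network(params, images, compress, recognize, update,
                              first_threshold, second_threshold):
    for iterations, (original, target_label) in enumerate(images, start=1):
        compressed = compress(params, original)                        # S202
        reference_label = recognize(compressed)                        # S203
        loss = squared_difference_loss(target_label, reference_label)  # S204
        # S205: stop if the loss converged or the iteration budget is reached
        if loss <= first_threshold or iterations >= second_threshold:
            return params                                              # S206
        params = update(params, loss)                                  # S207
    return params
```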
The present application does not limit the reverse-training method of the compression neural network. Optionally, refer to the schematic flowchart of the single-layer neural network operation method provided in FIG. 5D, which can be applied to the apparatus for performing compression neural network reverse training whose structure is shown in FIG. 5E.
As shown in FIG. 5E, the apparatus includes an instruction cache unit 21, a controller unit 22, a direct memory access unit 23, an H-tree module 24, a main operation module 25, and a plurality of slave operation modules 26; the apparatus may be implemented by a hardware circuit (for example, an application-specific integrated circuit, ASIC).
The instruction cache unit 21 reads instructions in through the direct memory access unit 23 and caches the read instructions. The controller unit 22 reads instructions from the instruction cache unit 21 and decodes them into microinstructions that control the behavior of the other modules, such as the direct memory access unit 23, the main operation module 25, and the slave operation modules 26. The direct memory access unit 23 can access the external address space and read and write data directly to each cache unit inside the apparatus to complete the loading and storing of data.
Referring to FIG. 5F, FIG. 5F shows the structure of the H-tree module 24. As shown in FIG. 5F, the H-tree module 24 forms the data path between the main operation module 25 and the plurality of slave operation modules 26 and has an H-tree structure. The H-tree is a binary-tree path composed of multiple nodes: each node sends the upstream data identically to its two downstream nodes, merges the data returned by its two downstream nodes, and returns the result to its upstream node. For example, in the reverse operation of the neural network, the vectors returned by the two downstream nodes are added into one vector at the current node and returned to the upstream node. At the stage where the computation of each layer of the artificial neural network begins, the input gradients in the main operation module 25 are sent to each slave operation module 26 through the H-tree module 24; after the computation of the slave operation modules 26 is complete, the partial sums of the output gradient vector output by each slave operation module 26 are added pairwise, stage by stage, in the H-tree module 24, that is, all the partial sums of the output gradient vector are summed to form the final output gradient vector.
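The pairwise, stage-by-stage accumulation performed by the H-tree can be mimicked in software as follows; a power-of-two number of slave modules is assumed for simplicity.

```python
import numpy as np

def h_tree_sum(partials):
    level = [np.asarray(p) for p in partials]
    assert (len(level) & (len(level) - 1)) == 0, "power-of-two fan-in assumed"
    while len(level) > 1:                   # one tree stage per iteration
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]

print(h_tree_sum([np.ones(3), 2 * np.ones(3), 3 * np.ones(3), 4 * np.ones(3)]))
# -> [10. 10. 10.]
```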
Referring to FIG. 5G, FIG. 5G is a schematic structural diagram of the main operation module 25. As shown in FIG. 5G, the main operation module 25 includes an operation unit 251, a data dependency determination unit 252, and a neuron cache unit 253.
The neuron cache unit 253 caches the input data and output data used by the main operation module 25 during computation. The operation unit 251 performs the various operation functions of the main operation module. The data dependency determination unit 252 is the port through which the operation unit 251 reads and writes the neuron cache unit 253, and it also guarantees that there are no consistency conflicts in reading and writing the data in the neuron cache unit 253. Specifically, the data dependency determination unit 252 determines whether a dependency exists between the data of a microinstruction that has not yet been executed and the data of a microinstruction that is currently being executed; if not, the microinstruction is allowed to issue immediately; otherwise, the microinstruction may issue only after all the microinstructions on which it depends have completed. For example, all microinstructions sent to the data dependency unit 252 are stored in an instruction queue inside the data dependency unit 252; in this queue, if the read range of a read instruction conflicts with the write range of a write instruction ahead of it in the queue, the read instruction can execute only after the write instruction on which it depends has been executed. The data dependency determination unit 252 is also responsible for reading the input gradient vector from the neuron cache unit 253 and sending it to the slave operation modules 26 through the H-tree module 24, while the output data of the slave operation modules 26 is sent directly to the operation unit 251 through the H-tree module 24. The instructions output by the controller unit 22 are sent to the operation unit 251 and the dependency determination unit 252 to control their behavior.
Referring to FIG. 5H, FIG. 5H is a schematic structural diagram of the slave operation module 26. As shown in FIG. 5H, each slave operation module 26 includes an operation unit 261, a data dependency determination unit 262, a neuron cache unit 263, a weight cache unit 264, and a weight-gradient cache unit 265.
The operation unit 261 receives the microinstructions issued by the controller unit 22 and performs arithmetic and logic operations.
The data dependency determination unit 262 is responsible for the read and write operations on the cache units during computation and guarantees that there are no consistency conflicts in reading from and writing to the cache units. Specifically, the data dependency determination unit 262 determines whether a dependency exists between the data of a microinstruction that has not yet been executed and the data of a microinstruction that is currently being executed; if not, the microinstruction is allowed to issue immediately; otherwise, the microinstruction may issue only after all the microinstructions on which it depends have completed. For example, all microinstructions sent to the data dependency unit 262 are stored in an instruction queue inside the data dependency unit 262; in this queue, if the read range of a read instruction conflicts with the write range of a write instruction ahead of it in the queue, the read instruction can execute only after the write instruction on which it depends has been executed.
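A minimal sketch of the read-after-write check performed by the dependency units 252 and 262 follows: a queued instruction may issue only if its address range does not overlap the write range of any earlier instruction still pending in the queue. The tuple encoding of an instruction is an assumption made for the example.

```python
def may_issue(instr, queue):
    """instr: (op, start_addr, end_addr); queue: earlier, still-pending instructions."""
    _, start, end = instr
    for prev_op, p_start, p_end in queue:
        overlaps = not (end < p_start or p_end < start)
        if prev_op == "write" and overlaps:
            return False      # wait until the earlier write has executed
    return True

print(may_issue(("read", 0, 15), [("write", 8, 31)]))   # False: ranges overlap
print(may_issue(("read", 0, 15), [("write", 16, 31)]))  # True: ranges disjoint
```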
The neuron cache unit 263 caches the input gradient vector data and the partial sums of the output gradient vector computed by this slave operation module 26.
The weight cache unit 264 caches the weight vectors needed by this slave operation module 26 during computation. Each slave operation module stores only the columns of the weight matrix that correspond to that slave operation module 26.
The weight-gradient cache unit 265 caches the weight-gradient data needed by the corresponding slave operation module when updating the weights. The weight-gradient data stored by each slave operation module 26 corresponds to the weight vectors it stores.
The slave operation modules 26 implement the parallelizable first half of the process of computing the output gradient vector in the reverse training of each layer of the artificial neural network, as well as the updating of the weights. Taking a fully connected layer of an artificial neural network (MLP) as an example, the process is out_gradient = w * in_gradient, where the multiplication of the weight matrix w by the input gradient vector in_gradient can be divided into independent parallel computing subtasks; out_gradient and in_gradient are column vectors, and each slave operation module computes only the product of the corresponding partial scalar elements of in_gradient with the corresponding columns of the weight matrix w. Each resulting output vector is a partial sum, still to be accumulated, of the final result, and these partial sums are added pairwise, stage by stage, in the H-tree to produce the final result. The computing process thus becomes a parallel partial-sum computation followed by accumulation: each slave operation module 26 computes a partial sum of the output gradient vector, and all the partial sums are summed in the H-tree module 24 to obtain the final output gradient vector. At the same time, each slave operation module 26 multiplies the input gradient vector by the output value of each layer in the forward operation to compute the weight gradient, in order to update the weights stored in that slave operation module 26. Forward operation and reverse training are the two main processes of a neural network algorithm: to train (update) the weights in the network, the neural network first computes the forward output of the input vector in the network formed by the current weights, which is the forward process, and then trains (updates) the weights of each layer backwards, layer by layer, according to the difference between the output value and the annotated value of the input vector itself. The forward computation saves the output vector of each layer and the derivative values of the activation functions; these data are needed by the reverse-training process, so they are guaranteed to exist when reverse training begins. The output values of each layer in the forward operation already exist when the reverse operation begins; they can be cached in the main operation module through the direct memory access unit and sent to the slave operation modules through the H-tree. The main operation module 25 performs the subsequent computations based on the output gradient vector, for example multiplying the output gradient vector by the derivative of the activation function from the forward operation to obtain the input gradient value of the next layer. The derivatives of the activation functions from the forward operation likewise already exist when the reverse operation begins and can be cached in the main operation module through the direct memory access unit.
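The partitioning of out_gradient = w * in_gradient into per-slave subtasks can be checked numerically with the sketch below, which reuses h_tree_sum from the earlier sketch; the sizes are illustrative.

```python
import numpy as np

w = np.random.randn(4, 4)                  # weight matrix, one column per slave
in_gradient = np.random.randn(4)
partials = [in_gradient[i] * w[:, i] for i in range(4)]  # independent subtasks
out_gradient = h_tree_sum(partials)        # pairwise accumulation in the tree
assert np.allclose(out_gradient, w @ in_gradient)
```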
According to an embodiment of the present invention, an instruction set for performing the artificial neural network forward operation on the aforementioned apparatus is also provided. The instruction set includes the CONFIG instruction, the COMPUTE instruction, the IO instruction, the NOP instruction, the JUMP instruction, and the MOVE instruction (an illustrative software encoding of this instruction set is sketched after the list), where:
the CONFIG instruction configures the various constants needed by the computation of the current layer before the computation of each layer of the artificial neural network begins;
the COMPUTE instruction completes the arithmetic and logic computation of each layer of the artificial neural network;
the IO instruction reads in from the external address space the input data needed by the computation and stores the data back to the external space after the computation is complete;
the NOP instruction is responsible for clearing the microinstructions currently loaded into all the internal microinstruction cache queues, guaranteeing that all instructions preceding the NOP instruction have completed; the NOP instruction itself does not contain any operation;
the JUMP instruction is responsible for the jump of the address of the next instruction that the controller will read from the instruction cache unit, and is used to implement jumps in the control flow;
the MOVE instruction is responsible for moving the data at one address in the apparatus's internal address space to another address in the apparatus's internal address space; this process is independent of the operation unit and does not occupy the operation unit's resources during execution.
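As a non-limiting illustration, the six-instruction set could be encoded as follows in a software simulator of the apparatus; the opcode values and operand layout are assumptions made for the example, not the instruction set's binary format.

```python
from enum import Enum, auto
from dataclasses import dataclass

class Opcode(Enum):
    CONFIG = auto()   # set per-layer constants before computation begins
    COMPUTE = auto()  # arithmetic/logic work of one layer
    IO = auto()       # load from / store to the external address space
    NOP = auto()      # drain the microinstruction queues; no operation
    JUMP = auto()     # redirect the next-instruction address
    MOVE = auto()     # copy data between internal addresses, bypassing the ALU

@dataclass
class Instruction:
    opcode: Opcode
    operands: tuple = ()

program = [
    Instruction(Opcode.IO, ("load", 0x0)),
    Instruction(Opcode.CONFIG, (("learning_rate", 0.01),)),
    Instruction(Opcode.COMPUTE),
    Instruction(Opcode.IO, ("store", 0x100)),
]
```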
Referring to FIG. 5I, FIG. 5I is an example block diagram of compression neural network reverse training provided by an embodiment of the present application. The process of computing the output gradient vector is out_gradient = w * in_gradient, where the matrix-vector multiplication of the weight matrix w by the input gradient vector in_gradient can be divided into independent parallel computing subtasks: each slave operation module 26 computes a partial sum of the output gradient vector, and all the partial sums are summed in the H-tree module 24 to obtain the final output gradient vector. In FIG. 5I, the output gradient vector (input gradient) of the previous layer is multiplied by the corresponding activation-function derivative to obtain the input data of this layer, which is then multiplied by the weight matrix to obtain the output gradient vector. The process of computing the weight-update gradient is dw = x * in_gradient, in which each slave operation module 26 computes the update gradient of the weights of the portion corresponding to that module. The slave operation module 26 multiplies the input gradient by the input neurons from the forward operation to compute the weight-update gradient dw, and then updates the weight w according to the learning rate set by the instruction, using w, dw, and the weight-update gradient dw' used in the previous weight update.
As shown in FIG. 5I, input gradient ([input gradient0, ..., input gradient3] in FIG. 5I) is the output gradient vector of the (n+1)-th layer. This vector is first multiplied by the derivative values of the n-th layer from the forward operation ([f'(out0), ..., f'(out3)] in FIG. 5I) to obtain the input gradient vector of the n-th layer; this step is completed in the main operation module 25, and the result is sent by the H-tree module 24 to the slave operation modules 26 and temporarily stored in the neuron cache units 263 of the slave operation modules 26. Then, the input gradient vector is multiplied by the weight matrix to obtain the output gradient vector of the n-th layer. In this process, the i-th slave operation module computes the product of the i-th scalar of the input gradient vector with the column vector [w_i0, ..., w_iN] of the weight matrix, and the resulting output vectors are added pairwise, stage by stage, in the H-tree module 24 to obtain the final output gradient vector (output gradient, [output gradient0, ..., output gradient3] in FIG. 5I).
At the same time, the slave operation modules 26 also need to update the weights they store. The process of computing the weight-update gradient is dw_ij = x_j * in_gradient_i, where x_j is the j-th element of the input vector of the n-th layer in the forward operation (that is, the output of the (n-1)-th layer), and in_gradient_i is the i-th element of the input gradient vector of the n-th layer in the reverse operation (that is, the product of input gradient and the derivative f' in FIG. 5I). The input of the n-th layer in the forward operation already exists at the beginning of reverse training; it is sent to the slave operation modules 26 through the H-tree module 24 and temporarily stored in the neuron cache units 263. Then, in the slave operation modules 26, after the computation of the partial sums of the output gradient vector is complete, the i-th scalar of the input gradient vector is multiplied by the input vector of the n-th layer of the forward operation to obtain the gradient vector dw for updating the weights, and the weights are updated accordingly.
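The weight-gradient computation dw_ij = x_j * in_gradient_i is an outer product of the layer input with the input gradient. The sketch below also shows one plausible reading, assumed here, of how w, dw, and the previous gradient dw' enter the update (a momentum-style blend with a learning rate); the patent text does not fix this combination.

```python
import numpy as np

def update_weights(w, x, in_gradient, dw_prev, lr=0.01, momentum=0.9):
    dw = np.outer(in_gradient, x)          # dw_ij = x_j * in_gradient_i
    w_new = w - lr * (dw + momentum * dw_prev)   # assumed use of dw' (momentum)
    return w_new, dw                       # dw becomes dw' for the next update

w = np.random.randn(4, 3)
w, dw = update_weights(w, np.random.randn(3), np.random.randn(4), np.zeros((4, 3)))
```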
As shown in FIG. 5D, an IO instruction is pre-stored at the head address of the instruction cache unit. The controller unit reads this IO instruction from the head address of the instruction cache unit, and according to the decoded microinstructions, the direct memory access unit reads from the external address space all the instructions related to this single-layer artificial neural network reverse training and caches them in the instruction cache unit. The controller unit then reads the next IO instruction from the instruction cache unit, and according to the decoded microinstructions, the direct memory access unit reads from the external address space all the data needed by the main operation module into the neuron cache unit of the main operation module; the data include the input neurons and activation-function derivative values from the previous forward operation as well as the input gradient vector. The controller unit then reads the next IO instruction from the instruction cache unit, and according to the decoded microinstructions, the direct memory access unit reads from the external address space all the weight data and weight-gradient data needed by the slave operation modules and stores them in the weight cache units and weight-gradient cache units of the corresponding slave operation modules. The controller unit then reads the next CONFIG instruction from the instruction cache unit, and according to the parameters in the decoded microinstructions, the operation units configure the values of their internal registers, including the various constants needed by the computation of this layer of the neural network, the precision setting of this layer's computation, the learning rate for updating the weights, and so on. The controller unit then reads the next COMPUTE instruction from the instruction cache unit, and according to the decoded microinstructions, the main operation module sends the input gradient vector and the input neurons from the forward operation to each slave operation module through the H-tree module; the input gradient vector and the forward-operation input neurons are stored in the neuron cache units of the slave operation modules. According to the microinstructions decoded from the COMPUTE instruction, the operation units of the slave operation modules read the weight vectors (that is, the portions of the columns of the weight matrix stored by each slave operation module) from the weight cache units, complete the vector-times-scalar operation of the weight vector and the input gradient vector, and return the output-vector partial sums through the H-tree; at the same time, the slave operation modules multiply the input gradient vector by the input neurons to obtain the weight gradients, which are stored in the weight-gradient cache units. In the H-tree module, the output-gradient partial sums returned by the slave operation modules are added pairwise, stage by stage, to obtain the complete output gradient vector. The main operation module obtains the return value of the H-tree module and, according to the microinstructions decoded from the COMPUTE instruction, reads the activation-function derivative values from the forward operation from the neuron cache unit, multiplies the derivative values by the returned output vector to obtain the input gradient vector for the reverse training of the next layer, and writes it back to the neuron cache unit. The controller unit then reads the next COMPUTE instruction from the instruction cache unit, and according to the decoded microinstructions, the slave operation modules read the weight w from the weight cache units, read this iteration's weight gradient dw and the weight gradient dw' used in the previous weight update from the weight-gradient cache units, and update the weight w. The controller unit then reads the next IO instruction from the instruction cache unit, and according to the decoded microinstructions, the direct memory access unit stores the output gradient vector in the neuron cache unit to the specified address in the external address space, and the operation ends.
For a multi-layer artificial neural network, the implementation process is similar to that of a single-layer neural network: after the previous layer of the artificial neural network has been executed, the operation instruction of the next layer takes the output gradient vector computed in the main operation module as the input gradient vector for the training of the next layer and performs the computation process above, with the weight addresses and weight-gradient addresses in the instruction likewise changed to the addresses corresponding to the current layer.
By employing the apparatus for performing neural network reverse training, support for the forward operation of multi-layer artificial neural networks is effectively improved. Moreover, using dedicated on-chip caches for multi-layer neural network reverse training fully exploits the reusability of the input neurons and the weight data, avoids repeatedly reading these data from memory, reduces the memory access bandwidth, and prevents the memory bandwidth from becoming the performance bottleneck of the forward operation of multi-layer artificial neural networks.
Step S206: acquire a target original image of the first resolution, and compress the target original image based on the compression neural network model to obtain a target compressed image of the second resolution.
The target original image is an image whose label-information type is consistent with that of the training images (an image belonging to the same data set). If the loss function converges to the first threshold or the number of training iterations is greater than or equal to the second threshold, the compression neural network has completed training; images can then be input directly into the compression neural network for compression to obtain the target compressed image, and the target compressed image can be recognized by the recognition neural network.
Optionally, after the target original image is compressed based on the compression neural network model to obtain the target compressed image of the second resolution, the method further includes: recognizing the target compressed image based on the recognition neural network model to obtain the label information of the target original image, and storing the label information of the target original image.
That is, after the training of the compression neural network is complete, compressed images can be recognized based on the recognition neural network model, which improves the efficiency and accuracy of identifying label information compared with manual labeling.
Step S207: update the target model according to the loss function to obtain an updated model, take the updated model as the target model, take the next training image as the original image, and perform step S202.
It can be understood that the loss function is obtained from the reference label values produced by the already-trained recognition neural network model and the target label values included in the original images. Training is complete when the loss function satisfies the preset condition or the current number of training iterations of the compression neural network exceeds the preset threshold; otherwise, the weights of the compression neural network are adjusted repeatedly through training, that is, the image content represented by each pixel of the same image is adjusted, reducing the loss of the compression neural network. Performing image compression with the compression neural network model obtained on completion of training improves the effectiveness of image compression and thus helps to improve the accuracy of recognition.
Referring to FIG. 5J, FIG. 5J is a schematic structural diagram of an image compression apparatus 300 provided by an embodiment of the present application. As shown in FIG. 5J, the image compression apparatus 300 includes a processor 301 and a memory 302.
In this embodiment of the present application, the memory 302 is configured to store the first threshold, the second threshold, the current neural network model and number of training iterations of the compression neural network, the compression training atlas of the compression neural network and the label information of each training image in the compression training atlas, the recognition neural network model, and the compression neural network model, with the current neural network model of the compression neural network taken as the target model; the compression neural network model is the target model corresponding to the completed training of the compression neural network, and the recognition neural network model is the neural network model corresponding to the completed training of the recognition neural network.
The processor 301 is configured to: acquire an original image of the first resolution, where the original image is any training image in the compression training atlas, and take the label information of the original image as the target label information; compress the original image based on the target model to obtain a compressed image of the second resolution, where the second resolution is smaller than the first resolution; recognize the compressed image based on the recognition neural network model to obtain reference label information; obtain a loss function according to the target label information and the reference label information; when the loss function converges to the first threshold, or the number of training iterations is greater than or equal to the second threshold, acquire a target original image of the first resolution and confirm the target model as the compression neural network model; and compress the target original image based on the compression neural network model to obtain a target compressed image of the second resolution.
Optionally, the processor 301 is further configured to: when the loss function does not converge to the first threshold, or the number of training iterations is less than the second threshold, update the target model according to the loss function to obtain an updated model, take the updated model as the target model, take the next training image as the original image, and perform the step of acquiring an original image of the first resolution.
Optionally, the processor 301 is specifically configured to preprocess the compressed image to obtain an image to be recognized, and to recognize the image to be recognized based on the recognition neural network model to obtain the reference label information.
Optionally, the preprocessing includes size processing. The memory 302 is further configured to store the base image size of the recognition neural network, and the processor 301 is specifically configured to, when the image size of the compressed image is smaller than the base image size, pad the compressed image with pixels up to the base image size to obtain the image to be recognized.
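As a non-limiting illustration of the size processing described above, the following sketch pads a smaller compressed image up to the base input size of the recognition neural network. The zero fill value and top-left placement of the original content are assumptions made for illustration; the embodiment does not fix them.

```python
import numpy as np

def pad_to_base_size(compressed: np.ndarray, base_h: int, base_w: int) -> np.ndarray:
    """Pad a compressed image of shape (H, W, C) with fill pixels up to
    the recognition network's base input size (base_h, base_w)."""
    h, w, c = compressed.shape
    if h > base_h or w > base_w:
        raise ValueError("compressed image already exceeds the base image size")
    # Zero fill and top-left placement are illustrative assumptions.
    padded = np.zeros((base_h, base_w, c), dtype=compressed.dtype)
    padded[:h, :w, :] = compressed
    return padded
```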
Optionally, the compression training atlas includes at least a recognition training atlas, and the processor 301 is further configured to train the recognition neural network with the recognition training atlas to obtain the recognition neural network model, where each training image in the recognition training atlas includes at least label information whose type is consistent with that of the target label information.
Optionally, the processor 301 is further configured to recognize the target compressed image based on the recognition neural network model to obtain label information of the target original image;
the memory 302 is further configured to store the label information of the target original image.
Optionally, the compression training atlas includes multiple dimensions, and the processor 301 is specifically configured to recognize the original image based on the target model to obtain multiple pieces of image information, one piece of image information per dimension, and to compress the original image based on the target model and the multiple pieces of image information to obtain the compressed image.
It can be understood that a compressed image of the original image is obtained based on the target model, reference label information of the compressed image is obtained based on the recognition neural network model, and a loss function is obtained from the target label information carried by the original image and the reference label information. When the loss function converges to the first threshold, or the current training count of the compression neural network is greater than or equal to the second threshold, training of the compression neural network used for image compression is complete; the target model is then taken as the compression neural network model, and a target compressed image of a target original image can be obtained based on it. In other words, the loss function is obtained from the reference label value produced by the already-trained recognition neural network model and the target label value carried by the original image; training finishes when the loss function satisfies the preset condition or the current training count exceeds the preset threshold, and otherwise the weights of the compression neural network are adjusted iteratively through training, that is, the image content represented by each pixel of the same image is adjusted. This reduces the loss of the compression neural network and improves the effectiveness of image compression, which in turn helps improve recognition accuracy.
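The training procedure summarized above can be sketched as the following loop. All names here (compress, recognize, loss_fn, update) are illustrative placeholders rather than interfaces defined by this disclosure, and the recognition neural network model is assumed to be already trained and held fixed; convergence to the first threshold is approximated by the loss falling to or below it.

```python
def train_compression_network(target_model, recognition_model, atlas,
                              loss_fn, update, first_threshold,
                              second_threshold):
    train_count = 0
    for original_image, target_label in atlas:
        compressed = target_model.compress(original_image)         # second resolution
        reference_label = recognition_model.recognize(compressed)  # reference label info
        loss = loss_fn(target_label, reference_label)
        train_count += 1
        # Training completes when the loss converges (here: at or below
        # the first threshold) or the training count reaches the second
        # threshold; the target model becomes the compression model.
        if loss <= first_threshold or train_count >= second_threshold:
            return target_model
        # Otherwise update the target model from the loss and continue
        # with the next training image as the original image.
        target_model = update(target_model, loss)
    return target_model
```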
In a possible embodiment of the present application, an electronic device 400 is provided. The electronic device 400 includes the image compression apparatus 300. As shown in FIG. 5K, the electronic device 400 includes a processor 401, a memory 402, a communication interface 403, and one or more programs 404, where the one or more programs 404 are stored in the memory 402 and configured to be executed by the processor 401, and the programs 404 include instructions for performing some or all of the steps described in the image compression method above.
It should be noted that each of the above units or modules may be a circuit, including a digital circuit, an analog circuit, and so on. Physical implementations of the above unit or module structures include, but are not limited to, physical devices, which in turn include, but are not limited to, transistors, memristors, and the like. The above chip or neural network processor may be any suitable hardware processor, such as a CPU, GPU, FPGA, DSP, or ASIC. The storage unit may be any suitable magnetic or magneto-optical storage medium, such as resistive random access memory (RRAM), dynamic random access memory (DRAM), static random access memory (SRAM), enhanced dynamic random access memory (EDRAM), high bandwidth memory (HBM), or a hybrid memory cube (HMC).
This application can be used in numerous general-purpose or special-purpose computing system environments or configurations, for example: personal computers (PCs), server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and distributed computing environments that include any of the above systems or devices.
In one embodiment, the present application provides a chip that includes the foregoing computing device. The chip can perform multiple kinds of operations on weights and input neurons simultaneously, achieving diversified computation. In addition, by using a dedicated on-chip cache for the multi-layer artificial neural network operation algorithm, the reusability of input neurons and weight data is fully exploited, repeated reads of these data from memory are avoided, memory access bandwidth is reduced, and memory bandwidth is prevented from becoming a performance bottleneck for multi-layer artificial neural network operations and their training algorithms.
In a possible embodiment of the present application, an embodiment of the present invention provides a chip package structure that includes the above neural network processor.
In a possible embodiment of the present application, an embodiment of the present invention provides a board card that includes the above chip package structure.
In a possible embodiment of the present application, an embodiment of the present invention provides an electronic device that includes the above board card.
The above electronic device includes, but is not limited to, a robot, a computer, a printer, a scanner, a tablet, a smart terminal, a mobile phone, a driving recorder, a navigator, a sensor, a webcam, a cloud server, a camera, a video camera, a projector, a watch, earphones, mobile storage, a wearable device, a vehicle, a household appliance, or medical equipment.
The vehicle includes an airplane, a ship, and/or a road vehicle; the household appliance includes a television, an air conditioner, a microwave oven, a refrigerator, a rice cooker, a humidifier, a washing machine, an electric lamp, a gas stove, or a range hood; the medical equipment includes a nuclear magnetic resonance instrument, a B-mode ultrasound scanner, and/or an electrocardiograph.
Those of ordinary skill in the art may realize that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of the examples have been described above generally in terms of their functions. Whether these functions are performed in hardware or software depends on the specific application and the design constraints of the technical solution. Skilled artisans may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of the present invention.
Those skilled in the art can clearly understand that, for convenience and brevity of description, reference may be made to the corresponding processes in the foregoing method embodiments for the specific working processes of the terminals and units described above, which are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed terminals and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative; the division of the above units is only a division by logical function, and other divisions are possible in actual implementation. For example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be electrical, mechanical, or other forms of connection.
The units described above as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments of the present invention.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist physically on its own, or two or more units may be integrated into one unit. The above integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the above integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present invention. The foregoing storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
It should be noted that implementations not shown or described in the drawings or in the body of the specification are forms known to those of ordinary skill in the art and are not described in detail. In addition, the above definitions of the elements and methods are not limited to the specific structures, shapes, or manners mentioned in the embodiments, which those of ordinary skill in the art may simply modify or replace.
The specific embodiments described above further explain the objectives, technical solutions, and beneficial effects of the present application in detail. It should be understood that the above are merely specific embodiments of the present application and are not intended to limit it; any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present application shall fall within the protection scope of the present application.

Claims (11)

1. A processing method, comprising:
    quantizing weights and input neurons separately, and determining a weight dictionary, a weight codebook, a neuron dictionary, and a neuron codebook; and
    determining an operation codebook according to the weight codebook and the neuron codebook.
2. The processing method according to claim 1, wherein quantizing the weights comprises the steps of:
    grouping the weights, performing a clustering operation on each group of weights with a clustering algorithm, dividing each group of weights into m classes, m being a positive integer, each class of weights corresponding to one weight index, and determining the weight dictionary, wherein the weight dictionary comprises weight positions and weight indices, and a weight position refers to the position of a weight in the neural network structure; and
    replacing all weights of each class with one center weight, and determining the weight codebook, wherein the weight codebook comprises weight indices and center weights.
3. The processing method according to claim 1 or 2, wherein quantizing the input neurons comprises the steps of:
    dividing the input neurons into p segments, each segment of input neurons corresponding to one neuron range and one neuron index, and determining the neuron dictionary, wherein p is a positive integer; and
    encoding the input neurons, replacing all input neurons of each segment with one center neuron, and determining the neuron codebook.
4. The processing method according to claim 3, wherein determining the operation codebook specifically comprises the steps of:
    determining, according to a weight, the corresponding weight index in the weight codebook, and then determining, through the weight index, the center weight corresponding to that weight;
    determining, according to an input neuron, the corresponding neuron index in the neuron codebook, and then determining, through the neuron index, the center neuron corresponding to that input neuron; and
    performing an operation on the center weights and center neurons to obtain operation results, and arranging the operation results into a matrix, thereby determining the operation codebook.
5. The processing method according to claim 4, wherein the operation comprises at least one of addition, multiplication, and pooling, and the pooling comprises average pooling, max pooling, and median pooling.
6. The processing method according to any one of claims 1 to 5, further comprising the step of retraining the weights and input neurons, wherein only the weight codebook and the neuron codebook are trained during retraining, the contents of the weight dictionary and the neuron dictionary remain unchanged, and the retraining uses a back propagation algorithm.
7. The processing method according to claim 2, wherein grouping the weights comprises:
    grouping into one group, in which all weights in the neural network are placed in a single group;
    layer-type grouping, in which the weights of all convolutional layers, the weights of all fully connected layers, and the weights of all long short-term memory (LSTM) network layers in the neural network are each placed in one group;
    inter-layer grouping, in which the weights of one or more convolutional layers, the weights of one or more fully connected layers, and the weights of one or more LSTM network layers in the neural network are each placed in one group; and
    intra-layer grouping, in which the weights within one layer of the neural network are partitioned, each partition forming one group.
8. The processing method according to claim 2, wherein the clustering algorithm comprises K-means, K-medoids, Clara, and/or Clarans.
9. The processing method according to any one of claims 2 to 8, wherein the center weight of each class is selected by determining the value of $w_0$ that minimizes the cost function $J(w, w_0)$, that value of $w_0$ being the center weight of the class,
    where
    $$J(w, w_0) = \sum_{i=1}^{n} (w_i - w_0)^2$$
    $J(w, w_0)$ is the cost function, $w$ denotes all the weights of the class, $w_0$ is the center weight, $n$ is the number of weights in the class, $w_i$ is the $i$-th weight in the class, $1 \le i \le n$, and $i$ is a positive integer.
10. A processing apparatus, comprising:
    a memory configured to store operation instructions; and
    a processor configured to execute the operation instructions in the memory, and to operate according to the processing method of any one of claims 1 to 9 when executing the operation instructions.
11. The apparatus according to claim 10, wherein the operation instruction is a binary number comprising an operation code and an address code, the operation code indicating an operation the processor is about to perform, and the address code instructing the processor to read the data participating in the operation from an address in the memory.
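For illustration only, the following minimal sketch shows one way the quantization recited in claims 1 to 4 could be realized, assuming K-means as the clustering algorithm (one option of claim 8), uniform segmentation of the neuron value range, multiplication as the operation, and the squared-error cost of claim 9, whose minimizing center weight is the class mean. None of this code is part of the claims.

```python
import numpy as np
from sklearn.cluster import KMeans

def quantize_weights(weights: np.ndarray, m: int):
    """Cluster one weight group into m classes (claim 2). Returns the
    per-weight class indices (the weight dictionary maps positions to
    indices) and the per-class center weights (the weight codebook).
    With the squared-error cost of claim 9, the optimal center weight
    is the class mean, which is exactly what K-means computes."""
    km = KMeans(n_clusters=m, n_init=10).fit(weights.reshape(-1, 1))
    return km.labels_.reshape(weights.shape), km.cluster_centers_.ravel()

def quantize_neurons(neurons: np.ndarray, p: int):
    """Split the input-neuron value range into p segments (claim 3).
    Returns per-neuron segment indices (neuron dictionary) and the
    center neuron of each segment (neuron codebook); uniform segments
    and midpoint centers are illustrative assumptions."""
    edges = np.linspace(neurons.min(), neurons.max(), p + 1)
    idx = np.clip(np.digitize(neurons, edges) - 1, 0, p - 1)
    centers = (edges[:-1] + edges[1:]) / 2
    return idx, centers

def operation_codebook(weight_centers: np.ndarray, neuron_centers: np.ndarray):
    """Precompute every center-weight x center-neuron product as an
    m x p matrix (claim 4, taking multiplication as the operation), so
    a runtime multiplication reduces to a table lookup by index pair."""
    return np.outer(weight_centers, neuron_centers)
```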
PCT/CN2018/095548 2017-10-20 2018-07-13 Processing method and apparatus WO2019076095A1 (en)

Priority Applications (9)

Application Number Priority Date Filing Date Title
EP19215858.2A EP3667569A1 (en) 2017-10-20 2018-07-13 Processing method and device, operation method and device
EP19215859.0A EP3660628B1 (en) 2017-10-20 2018-07-13 Dynamic voltage frequency scaling device and method
KR1020197037566A KR102434728B1 (en) 2017-10-20 2018-07-13 Processing method and apparatus
US16/482,710 US11593658B2 (en) 2017-10-20 2018-07-13 Processing method and device
EP19215860.8A EP3660706B1 (en) 2017-10-20 2018-07-13 Convolutional operation device and method
KR1020197037574A KR102434729B1 (en) 2017-10-20 2018-07-13 Processing method and apparatus
EP18868807.1A EP3627397B1 (en) 2017-10-20 2018-07-13 Processing method and apparatus
KR1020197023878A KR102434726B1 (en) 2017-10-20 2018-07-13 Treatment method and device
US16/528,948 US10747292B2 (en) 2017-10-29 2019-08-01 Dynamic voltage frequency scaling device and method

Applications Claiming Priority (12)

Application Number Priority Date Filing Date Title
CN201710989575.4 2017-10-20
CN201710989575.4A CN109697135B (en) 2017-10-20 2017-10-20 Storage device and method, data processing device and method, and electronic device
CN201711061069.5A CN109697509B (en) 2017-10-24 2017-10-24 Processing method and device, and operation method and device
CN201711004974.7 2017-10-24
CN201711004974.7A CN109697507B (en) 2017-10-24 2017-10-24 Processing method and device
CN201711061069.5 2017-10-24
CN201711029543.6A CN109725700A (en) 2017-10-29 2017-10-29 Dynamic voltage adjustment frequency modulation device and method
CN201711118938.3A CN109726353B (en) 2017-10-29 2017-10-29 Convolution operation device and method
CN201711118938.3 2017-10-29
CN201711029543.6 2017-10-29
CN201711289667.8A CN109903350B (en) 2017-12-07 2017-12-07 Image compression method and related device
CN201711289667.8 2017-12-07

Related Child Applications (2)

Application Number Title Priority Date Filing Date
US16/482,710 A-371-Of-International US11593658B2 (en) 2017-10-20 2018-07-13 Processing method and device
US16/528,948 Continuation US10747292B2 (en) 2017-10-29 2019-08-01 Dynamic voltage frequency scaling device and method

Publications (1)

Publication Number Publication Date
WO2019076095A1 true WO2019076095A1 (en) 2019-04-25

Family

ID=66173090

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/095548 WO2019076095A1 (en) 2017-10-20 2018-07-13 Processing method and apparatus

Country Status (1)

Country Link
WO (1) WO2019076095A1 (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0488150A2 (en) * 1990-11-26 1992-06-03 Hitachi, Ltd. Neural network system adapted for non-linear processing
EP0528511A2 (en) * 1991-08-15 1993-02-24 Sony Corporation Neural network quantizers
CN106096723A (en) * 2016-05-27 2016-11-09 北京航空航天大学 A kind of based on hybrid neural networks algorithm for complex industrial properties of product appraisal procedure
CN106485316A (en) * 2016-10-31 2017-03-08 北京百度网讯科技有限公司 Neural network model compression method and device
CN106529609A (en) * 2016-12-08 2017-03-22 郑州云海信息技术有限公司 Image recognition method and device based on neural network structure

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3627397A4 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113095468A (en) * 2019-12-23 2021-07-09 上海商汤智能科技有限公司 Neural network accelerator and data processing method thereof
CN113095468B (en) * 2019-12-23 2024-04-16 上海商汤智能科技有限公司 Neural network accelerator and data processing method thereof
CN113128673A (en) * 2019-12-31 2021-07-16 Oppo广东移动通信有限公司 Data processing method, storage medium, neural network processor and electronic device
CN113128673B (en) * 2019-12-31 2023-08-11 Oppo广东移动通信有限公司 Data processing method, storage medium, neural network processor and electronic device

Similar Documents

Publication Publication Date Title
KR102434726B1 (en) Treatment method and device
US11307865B2 (en) Data processing apparatus and method
CN109478144B (en) Data processing device and method
Sze Designing hardware for machine learning: The important role played by circuit designers
WO2020073211A1 (en) Operation accelerator, processing method, and related device
US20200265300A1 (en) Processing method and device, operation method and device
CN111368993A (en) Data processing method and related equipment
KR102530548B1 (en) neural network processing unit
WO2023231794A1 (en) Neural network parameter quantification method and apparatus
CN113326930A (en) Data processing method, neural network training method, related device and equipment
US20240135174A1 (en) Data processing method, and neural network model training method and apparatus
WO2022088063A1 (en) Method and apparatus for quantizing neural network model, and method and apparatus for processing data
Lu et al. Convolutional autoencoder-based transfer learning for multi-task image inferences
WO2019076095A1 (en) Processing method and apparatus
CN112789627A (en) Neural network processor, data processing method and related equipment
WO2023185209A1 (en) Model pruning
CN116401552A (en) Classification model training method and related device
Chen et al. SmartDeal: Remodeling Deep Network Weights for Efficient Inference and Training
CN114707643A (en) Model segmentation method and related equipment thereof
CN110334359B (en) Text translation method and device
CN113065638A (en) Neural network compression method and related equipment thereof
US20200150971A1 (en) Data processing apparatus and method
US20220121926A1 (en) Tensor ring decomposition for neural networks
Furuta et al. An Efficient Implementation of FPGA-based Object Detection Using Multi-scale Attention
CN117746047A (en) Image processing method and related equipment thereof

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18868807

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 20197023878

Country of ref document: KR

Kind code of ref document: A

ENP Entry into the national phase

Ref document number: 2018868807

Country of ref document: EP

Effective date: 20191216

NENP Non-entry into the national phase

Ref country code: DE