WO2019165939A1 - Computing device and related product - Google Patents

Computing device and related product

Info

Publication number
WO2019165939A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
unit
instruction
input
computing device
Prior art date
Application number
PCT/CN2019/075975
Other languages
English (en)
French (fr)
Inventor
Zhou Xuda (周徐达)
Guo Xueting (郭雪婷)
Du Zidong (杜子东)
Original Assignee
Shanghai Cambricon Information Technology Co., Ltd. (上海寒武纪信息科技有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN201810161816.0A external-priority patent/CN110196734A/zh
Priority claimed from CN201810161636.2A external-priority patent/CN110196735A/zh
Application filed by Shanghai Cambricon Information Technology Co., Ltd.
Publication of WO2019165939A1 publication Critical patent/WO2019165939A1/zh

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 Arrangements for executing machine instructions, e.g. instruction decode

Definitions

  • the present application relates to the field of information processing technologies, and in particular, to a computing device and related products.
  • the terminal acquires and processes information based on a general-purpose processor.
  • the embodiment of the present application provides a data processing method and related products, which can improve the processing speed of the computing device and improve the efficiency.
  • a method for data processing using a computing device comprising an arithmetic unit, an instruction control unit, a storage unit, and a compression unit, the method comprising:
  • the instruction control unit acquires an operation instruction, decodes the operation instruction into a first microinstruction and a second microinstruction, sends the first microinstruction to the compression unit, and sends the second microinstruction to the arithmetic unit;
  • the compression unit processes the acquired input data according to the first microinstruction to obtain processed input data; wherein the input data includes at least one input neuron and/or at least one input weight, and the processed input data includes processed input neurons and/or processed input weights;
  • the operation unit processes the processed input data according to the second microinstruction to obtain an operation result.
  • a method for data processing using a computing device, the computing device comprising an arithmetic unit, an instruction control unit, and a compression unit; the computing device acquires a training sample and a training model; the training model is a neural network training model and/or a non-neural network training model, the training model comprising an n-layer structure; the method comprising:
  • the instruction control unit acquires a training instruction, obtains a forward operation instruction and a reverse operation instruction according to the training instruction, and sends the forward operation instruction and the reverse operation instruction to the operation unit respectively;
  • the operation unit performs an n-th layer forward operation on the training sample according to the forward operation instruction to obtain an nth layer forward operation result; and obtains an nth layer output data gradient according to the nth layer forward operation result;
  • the operation unit performs an n-th layer inverse operation on the n-th layer output data gradient according to the reverse operation instruction to obtain a weight gradient of the training model; the weight gradient of the training model includes the weight gradient of each layer in the n layers;
  • the compression unit processes the weight gradient of the training model to obtain a processed weight gradient;
  • the computing device updates the weight of the training model according to the processed weight gradient to complete the training.
  • a computing device comprising a hardware unit for performing the method of the first aspect or the second aspect described above.
  • a computer readable storage medium storing a computer program for electronic data exchange, wherein the computer program causes a computer to perform the method provided by the first aspect or the second aspect.
  • a chip comprising the computing device provided in the third aspect above.
  • a chip package structure comprising the chip provided in the fifth aspect above.
  • a board comprising the chip package structure as provided in the sixth aspect above.
  • an electronic device comprising the board provided in the seventh aspect above.
  • the electronic device includes a data processing device, a robot, a computer, a printer, a scanner, a tablet, a smart terminal, a mobile phone, a driving recorder, a navigator, a sensor, a camera, a server, a cloud server, a video camera, a projector, a watch, a headset, a mobile storage device, a wearable device, a vehicle, a household appliance, and/or a medical device.
  • the vehicle includes an airplane, a ship, and/or a car;
  • the household appliance includes a television, an air conditioner, a microwave oven, a refrigerator, a rice cooker, a humidifier, a washing machine, an electric lamp, a gas stove, and/or a range hood;
  • the medical device includes a nuclear magnetic resonance instrument, a B-ultrasound scanner, and/or an electrocardiograph.
  • the computing device provided by the present application has a compression unit that performs compression processing on data, thereby saving transmission resources and computing resources; it therefore has the advantages of low power consumption and a small amount of calculation.
  • FIG. 1a-1c are schematic structural diagrams of several computing devices provided by embodiments of the present application.
  • FIG. 1d is a schematic structural diagram of a compression unit according to an embodiment of the present invention.
  • FIG. 1e is a schematic diagram of control state transition of a control unit according to an embodiment of the present invention.
  • FIG. 2 is a schematic diagram of a partial structure of a compression unit according to an embodiment of the present invention.
  • FIG. 3 is a schematic structural diagram of a neural network according to an embodiment of the present invention.
  • FIG. 4 is a schematic diagram of a partial structure of another compression unit according to an embodiment of the present invention.
  • FIG. 5 is a schematic diagram of a partial structure of another compression unit according to an embodiment of the present invention.
  • FIG. 6a is a schematic flowchart of a data processing method using a computing device according to an embodiment of the present invention.
  • FIG. 6b is a schematic flowchart of another data processing method using a computing device according to an embodiment of the present invention.
  • FIG. 7a is a schematic structural diagram of a chip device provided by the present disclosure.
  • FIG. 7b is a schematic structural diagram of a main processing circuit provided by the present disclosure.
  • FIG. 7c is a schematic diagram of data distribution of the chip device provided by the present disclosure.
  • FIG. 7d is a schematic diagram of data return of the chip device provided by the present disclosure.
  • references to "an embodiment" herein mean that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the present application.
  • the appearances of this phrase in various places in the specification do not necessarily all refer to the same embodiment, nor are they separate or alternative embodiments mutually exclusive with other embodiments. Those skilled in the art will explicitly and implicitly understand that the embodiments described herein can be combined with other embodiments.
  • FIG. 1a is a schematic structural diagram of a possible computing device according to the present application.
  • the computing device includes a compression unit 101, a storage unit 102, an instruction control unit 107, and an operation unit 108.
  • the computing device as shown in FIG. 1b may further include a first input buffer unit 105 and a second input buffer unit 106.
  • the computing device may further include a direct memory access (DMA) unit 103, an instruction cache unit 104, and an output buffer unit 109.
  • the storage unit 102 is configured to store input data, operation instructions (specifically including, but not limited to, neural network operation instructions, non-neural network operation instructions, addition instructions, convolution instructions, etc.), processed input data, the positional relationship data of the input data, the operation result, and the intermediate data generated in other neural network operations, etc., which are not limited in this application.
  • the input data includes, but is not limited to, input weights and input neurons, and the amount of the input data is not limited in this application, that is, the input data includes at least one input weight and/or at least one input neuron.
  • the positional relationship data of the input data is used to represent the position of the input data. For example, if the input data is a matrix A, then taking the element Aij as an example, the position information of Aij is the i-th row and the j-th column of the matrix A.
  • the location relationship data of the input data may also represent a positional relationship of the input data in which the absolute value is greater than or equal to a preset threshold, and the input data may be an input neuron or an input weight.
  • the preset threshold may be a threshold set by the user side or the device side, for example, 0.2, 0.5, and the like.
  • the positional relationship data of the input data may be represented by a direct index or a step index, which will be described in detail later.
  • For example, assume the input data is a matrix and the preset threshold is 0.5; the positional relationship data of the input data is then a matrix of the same shape in which the positions holding values whose absolute value is greater than or equal to 0.5 are marked 1 and all other positions are marked 0 (the example matrices themselves are omitted from this text).
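  • As an illustrative sketch only (the helper name and example matrix below are not from the patent), the direct-index mask of a matrix under a preset threshold can be computed as follows:

```python
# Minimal sketch: mark positions whose absolute value is >= the preset
# threshold with 1 and all other positions with 0 (direct index form).
import numpy as np

def direct_index_mask(data: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Return a 0/1 mask marking entries with |value| >= threshold."""
    return (np.abs(data) >= threshold).astype(np.uint8)

A = np.array([[0.1, 0.9],
              [0.7, 0.2]])
print(direct_index_mask(A))  # [[0 1]
                             #  [1 0]]
```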
  • the storage unit 102 is configured to store a training model, a training sample, a training instruction, intermediate data generated by the operation unit, and an operation result, processed data generated by the compression unit, and positional relationship data, and the like.
  • a training model includes, but is not limited to, a combination of any one or more of the following: a neural network model, a non-neural network model, and other device side or user side custom set mathematical models, and the like.
  • the training samples include training input data (ie, input neurons) used to train the training model.
  • the storage unit further stores training output data (ie, result data corresponding to the input data) used when training the training model.
  • the training sample and the training output result are specifically real data acquired by the device side.
  • the positional relationship data is used to characterize the location of the data, and can also be used to characterize the location of data whose absolute value is greater than or equal to a predetermined threshold.
  • the input data and the training samples are collectively referred to as data in the following.
  • the data referred to below may be input data or may be a training sample.
  • the computing device may store the data according to location relationship data of the data.
  • Optionally, only the positional relationship data of the data and the data whose absolute value is greater than or equal to the preset threshold are cached; the data at the other positions, that is, the data whose absolute value is less than the preset threshold, is treated as 0 by default. In other words, the present application stores data in a dense manner, such as buffering input neurons or input weights (but not limited thereto) in a dense manner in the first/second cache unit.
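  • A minimal sketch of this dense storage scheme (illustrative names; not the patent's hardware layout): keep only the mask and the surviving values, and treat every other position as 0 when reconstructing.

```python
# Sketch of dense storage: cache the positional relationship data (mask) plus
# the densely packed values whose absolute value is >= the preset threshold.
import numpy as np

def to_dense_storage(data: np.ndarray, threshold: float = 0.5):
    mask = np.abs(data) >= threshold
    values = data[mask]                      # densely packed surviving values
    return mask.astype(np.uint8), values

def from_dense_storage(mask: np.ndarray, values: np.ndarray) -> np.ndarray:
    out = np.zeros(mask.shape, dtype=values.dtype)
    out[mask.astype(bool)] = values          # all other positions default to 0
    return out
```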
  • the direct memory access (DMA) unit 103 is configured to read and write data between the storage unit 102 and the instruction cache unit 104, the compression unit 101, the first input buffer unit 105, and the output buffer unit 109.
  • the DMA unit 103 reads an operation instruction from the storage unit 102, and transmits the operation instruction to the instruction control unit 107, or to the instruction buffer unit 104 or the like.
  • the DMA unit 103 reads the training instruction from the storage unit 102 and transmits the training instruction to the instruction control unit 107, or to the instruction buffer unit 104 or the like.
  • the DMA unit 103 can also read input weights or processed input weights from the storage unit 102 and transmit them to the first input buffer unit 105 or the second input buffer unit 106 for caching. Accordingly, the DMA unit 103 can also read input neurons or processed input neurons from the storage unit 102 and transmit them to the first input buffer unit 105 or the second input buffer unit 106.
  • the data buffered in the first input storage unit 105 and the second input storage unit 106 are different.
  • For example, the first input buffer unit 105 stores input neurons or processed input neurons, and the second input buffer unit 106 stores input weights or processed input weights; or vice versa.
  • the instruction cache unit 104 is used to cache an operation instruction or to cache a training instruction.
  • the first input buffer unit is configured to cache first cache data
  • the second input buffer unit is configured to cache second cache data.
  • the first cache data and the second cache data are different.
  • For the first cache data and the second cache data, refer to the foregoing description; for example, the second cache data may be unprocessed input neurons, processed input neurons, or the like.
  • the instruction control unit 107 can be configured to retrieve an operation instruction from the instruction cache unit or the storage unit, and further decode the operation instruction into a corresponding micro instruction, so that the calculation unit The relevant components in it can be identified and executed.
  • the instruction control unit in the present application can decode the operation instruction into a first microinstruction and a second microinstruction.
  • the first microinstruction is used to instruct the compression unit to perform corresponding data processing by using the processing manner indicated by the first microinstruction.
  • the second microinstruction is used to instruct the operation unit to perform an operation process corresponding to the second micro instruction, such as a multiplication operation, a convolution operation, or the like.
  • the instruction control unit 107 may be configured to acquire a training instruction (specifically, a forward operation instruction and/or a reverse operation instruction) from the instruction cache unit or the storage unit, and further The training instructions are decoded into corresponding microinstructions so that relevant components in the computing unit can be identified and executed.
  • the instruction control unit in the present application can decode the forward operation instruction into the first micro instruction and the second micro instruction.
  • the first microinstruction is used to instruct the compression unit to perform corresponding data processing by using the processing manner indicated by the first microinstruction.
  • the second microinstruction is used to instruct the operation unit to perform an operation process corresponding to the second micro instruction, such as a multiplication operation, a convolution operation, or the like.
  • the output buffer unit may be configured to buffer an operation result output by the operation unit.
  • the operation unit is configured to perform corresponding data operation processing according to an instruction sent by the instruction control unit to obtain an operation result.
  • the compression unit is configured to perform compression processing on the data to reduce the data dimension, reduce the amount of data calculation in the operation unit, and improve data processing efficiency.
  • the operation unit may store the operation result in the output buffer unit 109, and the output buffer unit 109 stores the operation result in the storage unit 102 through the direct storage access unit 103.
  • the computing device can also include a pre-processing module 110, as shown in Figure 1c.
  • the pre-processing module can be used to pre-process data to obtain pre-processed data.
  • the processed data can be stored in the storage unit.
  • the input data buffered in the storage unit may be input data and the like processed by the pre-processing module.
  • the pre-processing includes, but is not limited to, a combination of any one or more of the following processes: Gaussian filtering, binarization, normalization, regularization, abnormal data screening, and the like, which are not limited herein.
  • the first input buffer unit and the second input buffer unit in the present application may each be split into two cache units, one of which is used to store input weights or input neurons, and the other of which is used to store the positional relationship data of the input weights or the positional relationship data of the input neurons; this is not shown in the figures of the present application.
  • the design location of the compression unit 101 is not limited herein.
  • the compression unit can be placed behind the first input buffer unit and the second input buffer unit.
  • the compression unit may be placed in front of the first input buffer unit and the second input buffer unit, behind the DMA unit.
  • the compression unit may also be placed in the operation unit, etc., without limitation.
  • FIG. 1d is a schematic structural diagram of a compression unit provided by the present invention.
  • the compression unit 101 may comprise a combination of any one or more of the following: a pruning unit 201, a quantization unit 202, and an encoding unit 203.
  • the compression unit 101 may further include a control unit 204.
  • the pruning unit 201 is specifically configured to perform pruning processing on the received data, and specific implementations of the pruning processing will be described in detail below.
  • the quantization unit 202 is specifically configured to perform quantization processing on the received data, and specific implementations of the quantization processing will be described in detail below.
  • the encoding unit 203 is specifically configured to perform encoding processing on the received data, and specific implementations of the encoding processing will be described in detail below.
  • the control unit 204 is specifically configured to complete processing manners of the corresponding data by using at least one of the foregoing three units according to the indication of the receiving instruction, and the specific implementation will be described in detail below.
  • the compression unit 101 may acquire input data, and process the input data according to a processing manner indicated by the first microinstruction. Get processed input data.
  • the processing manner includes, but is not limited to, a combination of any one or more of the following: pruning processing, quantization processing, encoding processing, or another processing manner for reducing the data dimension or reducing the data volume, which is not limited in this application.
  • the compression unit may use the pruning unit to retain the input data whose absolute value is greater than or equal to the first threshold, and delete the input data whose absolute value is less than the first threshold.
  • the input data includes, but is not limited to, at least one input neuron, at least one input weight, and the like.
  • the first threshold is customized on the user side or the device side, for example, 0.5 or the like.
  • For example, if the input data is a vector P = (0.02, 0.05, 1, 2, 0.07, 2.1, 0.89) and the first microinstruction indicates that the compression unit uses pruning processing, the compression unit performs pruning processing on the vector P to delete (ie, zero out) the data whose absolute value is less than 0.5, thereby obtaining the processed input data (ie, the processed vector P) as (0, 0, 1, 2, 0, 2.1, 0.89).
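  • The pruning example above can be reproduced with the following sketch (the function name is illustrative):

```python
# Pruning: retain values whose absolute value is >= the first threshold,
# and delete (ie, zero out) values whose absolute value is below it.
def prune(data, first_threshold=0.5):
    return [x if abs(x) >= first_threshold else 0 for x in data]

P = [0.02, 0.05, 1, 2, 0.07, 2.1, 0.89]
print(prune(P))  # [0, 0, 1, 2, 0, 2.1, 0.89]
```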
  • the compression unit may employ a quantization unit to cluster and quantize the input data, thereby obtaining processed input data.
  • the quantization refers to mapping the original data to new data close to the original data; the new data may be customized on the user side, or chosen so that the error between the original data and the new data is less than a preset value, for example, quantizing 0.5 to 1, and so on.
  • the quantization and clustering algorithm specifically used in the quantization processing method is not limited in this application, for example, the K-means algorithm is used to cluster each layer of input weights or input neurons in the neural network model.
  • For example, when the input data is a matrix, the processed input data obtained after quantization is a matrix of the same shape in which each element has been replaced by the cluster center to which it is quantized (the example matrices themselves are omitted from this text).
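  • The following is a hedged sketch of such clustering-based quantization (a simple 1-D K-means written inline; the cluster count k and iteration count are illustrative parameters, not values from the patent):

```python
# Quantization sketch: cluster the values with K-means and replace each value
# by its cluster center, keeping the quantized data close to the original.
import numpy as np

def kmeans_quantize(data: np.ndarray, k: int = 4, iters: int = 20) -> np.ndarray:
    flat = data.ravel().astype(float)
    centers = np.linspace(flat.min(), flat.max(), k)      # simple initialization
    for _ in range(iters):
        labels = np.argmin(np.abs(flat[:, None] - centers[None, :]), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = flat[labels == j].mean()     # move center to mean
    labels = np.argmin(np.abs(flat[:, None] - centers[None, :]), axis=1)
    return centers[labels].reshape(data.shape)            # value -> its center
```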
  • the compression unit may encode the input data by using a preset encoding format by the encoding unit, thereby obtaining the processed input data.
  • the preset encoding format is custom set on the user side or the device side, and may include, but is not limited to, Huffman coding, non-return-to-zero code, Manchester code, and the like.
  • When the first microinstruction indicates multiple processing manners, the execution sequence of the multiple processing manners may be left unrestricted in this application, or may be defined in the first microinstruction. That is, the first microinstruction may indicate both the processing manners that the compression unit needs to adopt and the execution order of those processing manners.
  • For example, the processing manners indicated by the first microinstruction include pruning processing, quantization processing, and encoding processing. If the first microinstruction specifies an execution order for these three processing manners, the compression unit processes the input data in the execution order and the processing manners indicated by the first microinstruction. If the execution order is not specified in the first microinstruction, the compression unit may perform the three processing manners in an arbitrary order to process the input data and obtain the processed input data.
  • a control unit is designed to control the migration between various processing modes in order to support data processing of multiple processing modes.
  • Specifically, multiple operation states may be defined for the control unit, each of which corresponds to one processing mode.
  • For example, the first state indicates the initial state of the control unit, in which no data processing is performed; the second state indicates the pruning processing of data; the third state indicates the quantization processing of data; and the fourth state indicates the encoding processing of data.
  • Optionally, a fifth state may indicate sequentially performing the pruning, quantization, and encoding processing of data; a sixth state may indicate sequentially performing the pruning and quantization processing of data; and a seventh state may indicate sequentially performing the pruning and encoding processing of data.
  • The number of operation states and the data processing manners they indicate can be customized by the user side or the device side, which is not limited in this application.
  • the operation state may be specifically a first state to a fourth state.
  • each state may be represented by a preset value; for example, the first state is indicated by 00, the second state by 01, the third state by 10, and the fourth state by 11.
  • a schematic diagram of the transition between various states using a control unit is shown in Figure 1e.
  • the control unit may be a controller, an accumulator, or other physical components for indicating various data processing modes, which is not limited in this application.
  • the operation state is associated with the processing mode.
  • the first state 00 can be represented as an initial state, and no data processing is performed.
  • the second state 01 is associated with the pruning process, and is used to indicate that the compression unit can perform data processing by using a pruning processing manner.
  • the third state 10 is associated with the quantization process for indicating that the compression unit can perform processing of data using a quantization process.
  • the fourth state 11 is associated with the encoding process, and is used to indicate that the compression unit can perform encoding processing of data by using a preset encoding format.
  • Before the compression unit performs data processing, the control unit is in the first state (ie, the initial state); when the compression unit receives the first microinstruction, it may use the control unit to reset the state of the control unit according to the processing mode indicated by the first microinstruction, so that the compression unit enters the state to complete the data processing corresponding to that processing mode. That is, in the present application, the operation state may be reset/modified by the control unit in the compression unit to complete the data processing corresponding to the processing mode in the compression unit.
  • the first microinstruction is used to indicate that the processing of data is performed by sequentially performing pruning processing, quantization processing, and encoding processing in the compression unit.
  • Specifically, the compression unit sets the first (initial) state 00 to the second state 01 by using the control unit (such as an accumulator), and the pruning processing is then completed in the pruning unit.
  • the second state 01 can be set to the third state 10 by the control unit.
  • the third state 10 can be set to the fourth state 11 by the control unit.
  • the fourth state 11 can be set to the first (initial) state 00 by the control unit, and the process can be ended.
  • When the control unit is an accumulator, the migration between the four states can be realized by sequentially adding 1 to the accumulator; after the processing of the fourth state is completed, the accumulator can be reset to 00, that is, returned to the initial state.
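  • The accumulator-based control unit can be modeled with the following sketch (the class and state names are illustrative):

```python
# Control unit modeled as a 2-bit accumulator: 00 (initial) -> 01 (pruning)
# -> 10 (quantization) -> 11 (encoding) -> back to 00 (initial).
STATES = {0b00: "initial", 0b01: "pruning", 0b10: "quantization", 0b11: "encoding"}

class ControlUnit:
    def __init__(self):
        self.state = 0b00                    # first (initial) state

    def advance(self) -> str:
        """Add 1 to migrate to the next state; wrap around to the initial state."""
        self.state = (self.state + 1) & 0b11
        return STATES[self.state]

ctrl = ControlUnit()
print([ctrl.advance() for _ in range(4)])
# ['pruning', 'quantization', 'encoding', 'initial']
```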
  • the number of operation states involved in the control unit is not limited in this application, and it depends on the processing manner of the data.
  • For example, the control unit may involve three operation states, corresponding to the first state to the third state above.
  • The computing device sequentially performs the pruning processing and the quantization processing indicated by the second state and the third state under the control of the control unit; after the data processing is completed, the operation state of the control unit is set back to the first state (the initial state), and so on.
  • the processing of the input data by the compression unit of the present application may specifically be processing the input weights and/or the input neurons to reduce the amount of calculation on the input weights or the input neurons in the operation unit, thereby improving data processing efficiency.
  • Correspondingly, the operation unit may perform corresponding operation processing on the processed input data according to the received second microinstruction to obtain an operation result.
  • For example, the operation unit may perform a corresponding arithmetic operation on the processed input neurons and the input weights according to the second microinstruction to obtain the operation result; or perform a corresponding arithmetic operation on the processed input weights and the input neurons; or perform a corresponding arithmetic operation on the processed input weights and the processed input neurons to obtain the operation result.
  • the operation unit may store the operation result in the output buffer unit 109, and the output buffer unit 109 stores the operation result in the storage unit 102 through the direct memory access unit 103.
  • the instruction cache unit 104, the first input buffer unit 105, the second input buffer unit 106, and the output buffer unit 109 may all be on-chip caches.
  • the above-mentioned operation unit 108 includes, but is not limited to, three parts: a multiplier, one or more adders (optionally, a plurality of adders constituting an addition tree), and an activation function unit/activation function operator.
  • pooling operations include, but are not limited to: average pooling, maximum pooling, and median pooling, where the input data in is the data in a pooling kernel associated with the output out.
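  • A minimal sketch of these pooling operations (the function name is illustrative), where the argument is the data inside one pooling kernel:

```python
# Reduce the data in one pooling kernel to a single output value.
import numpy as np

def pool(kernel_data: np.ndarray, mode: str = "max") -> float:
    if mode == "average":
        return float(kernel_data.mean())
    if mode == "max":
        return float(kernel_data.max())
    if mode == "median":
        return float(np.median(kernel_data))
    raise ValueError(f"unknown pooling mode: {mode}")

print(pool(np.array([[1.0, 2.0], [3.0, 4.0]]), "average"))  # 2.5
```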
  • the computing device includes an arithmetic unit, an instruction control unit, and a compression unit; the computing device acquires a training sample and a training model; the training model is: a neural network training model and/or a non-neural network training model, and the training model includes N-layer structure, n is a positive integer;
  • the instruction control unit is configured to obtain a training instruction, and obtain a forward operation instruction and a reverse operation instruction according to the training instruction; and send the forward operation instruction and the reverse operation instruction to the operation unit respectively;
  • the operation unit is configured to perform n-th layer forward operation on the training sample according to the forward operation instruction to obtain an n-th layer forward operation result; and obtain an n-th layer output data gradient according to the n-th layer forward operation result ;
  • the operation unit is further configured to perform an n-layer inverse operation on the n-th layer output data gradient according to the reverse operation instruction to obtain a weight gradient of the training model; the weight gradient of the training model includes the weight gradient of each layer in the n layers;
  • the compression unit is configured to process the weight gradient of the training model to obtain a processed weight gradient;
  • the computing device is configured to update the weight of the training model according to the processed weight gradient to complete the training.
  • the computing device may first acquire the training sample and the training model, and optionally obtain the training output data used in the training.
  • the training model includes a neural network model and/or a non-neural network model, the training model including an n-layer structure.
  • the neural network model includes n neural network layers, that is, n layers.
  • the neural network layer includes, but is not limited to, a convolution layer, an activation layer, a pooling layer, and the like, which are not described or limited in detail herein.
  • the instruction control unit may acquire a training instruction from the storage unit, and parse the training instruction to obtain a forward operation instruction and a reverse operation instruction, and send them to the operation unit respectively.
  • the operation unit may perform an n-layer forward operation on the training sample (ie, the input neurons used during training) according to the forward operation instruction, thereby obtaining a forward operation result; the forward operation result includes the forward operation result of each layer in the n-layer forward operation, that is, the n-th layer forward operation result, the n-1th layer forward operation result, ..., and the layer 1 forward operation result.
  • the forward operation instruction includes a first micro instruction and a second micro instruction.
  • the forward operation instruction may be decoded to obtain the first micro instruction and the second micro instruction.
  • the first microinstruction and the second microinstruction may be sent to the operation unit; or the first microinstruction may be sent to the compression unit, and the second microinstruction may be sent to the operation unit.
  • the first microinstruction is used to instruct the compression unit to process the training sample to obtain a processed training sample; the second microinstruction is used to instruct the operation unit to perform the n-layer forward operation on the processed training sample to obtain the forward operation result.
  • the computing device invokes/processes the training samples by the compression unit according to the indication of the first microinstruction to obtain a processed training sample. Then, the operation unit performs an n-layer forward operation on the processed training samples according to the instruction of the second micro-instruction to obtain a forward operation result.
  • the forward operation instruction is used to instruct the operation unit to perform an n-layer forward operation on the training sample to obtain a forward operation result.
  • the operation unit may further obtain an output data gradient of each layer in the n layer according to the forward operation result (ie, a forward operation result of each layer in the n layer).
  • the operation unit may obtain an nth layer output data gradient according to the nth layer forward operation result.
  • For example, the operation unit scales the n-th layer forward operation result by a preset ratio, or multiplies it by a preset percentage, or the like, to obtain the n-th layer output data gradient.
  • The output data gradient of the previous layer is used as the input data gradient of the next layer; for example, the n-1th layer output data gradient is the n-th layer input data gradient, and the operation unit may perform the n-th layer forward operation using the n-th layer input data gradient to obtain the n-th layer forward operation result.
  • the forward operation result of the unprocessed previous layer may also be used to calculate the forward operation result of the next layer. That is, the result of the forward operation of the previous layer can be used as the input data (training sample) of the next layer.
  • For example, the result of the n-1th layer forward operation is the input data of the n-th layer; the operation unit performs the n-th layer forward operation using this input data and correspondingly obtains the n-th layer forward operation result.
  • the operation unit may receive a reverse operation instruction, perform an n-layer inverse operation on the n-th layer output data gradient according to the reverse operation instruction, thereby obtaining a weight gradient of the training model,
  • the weight gradient of the training model includes the weight gradient for each layer in the n-layer structure.
  • the computing device may further invoke the compression unit to process a weight gradient of the training model to correspondingly obtain a processed weight gradient in the training model.
  • the computing device may use the processed weight gradient to update the weight of the training model to complete the training of the training model.
  • Specifically, after the operation unit receives the reverse operation instruction, it performs an n-layer inverse operation on the n-th layer output data gradient to obtain the weight gradient of the training model; the weight gradient of the training model includes the weight gradient of each layer in the n layers; the compression unit is then invoked to process the weight gradient of the training model (ie, the weight gradient of the n layers) to obtain the processed weight gradient, that is, the processed weight gradient of the n layers.
  • the computing device uses the weight gradient of the processed n-layer to update the weight of the n-layer in the training model, thereby completing the training of the training model.
  • Optionally, the operation unit may acquire the n-th layer output data gradient and the training output data, and then perform an n-th layer inverse operation on the n-th layer output data gradient and the training output data according to the reverse operation instruction to obtain the weight gradient of the training model.
  • In this embodiment, the computing device invokes the compression unit once to process the weight gradient of the training model and then updates the training model by using the processed weight gradient, which can save training time and improve data processing efficiency.
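  • The overall training flow can be summarized with the following high-level sketch (the model/compression interfaces are hypothetical stand-ins for the hardware units, not the patent's circuits):

```python
# One training step: n-layer forward operation, output data gradient, n-layer
# inverse operation producing per-layer weight gradients, compression of those
# gradients, then the weight update that completes the step.
def train_step(model, sample, compress, lr=0.01):
    activations = model.forward(sample)               # n-layer forward operation
    out_grad = model.output_data_gradient(activations)
    weight_grads = model.backward(out_grad)           # weight gradient per layer
    processed = [compress(g) for g in weight_grads]   # compression unit, once
    for layer, g in zip(model.layers, processed):
        layer.weights -= lr * g                       # update the training model
```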
  • the reverse operation instruction may include two micro instructions, such as a third micro instruction and a fourth micro instruction.
  • the instruction control unit may decode the reverse operation instruction to obtain a third micro instruction and a fourth micro instruction; correspondingly, the third micro instruction and the fourth micro The instruction is sent to the operation unit; or the third micro instruction is sent to the operation unit, and the fourth micro instruction is sent to the compression unit.
  • the third microinstruction is used to instruct the computing unit to calculate a weight gradient of the n layer in the training model; and the fourth microinstruction is used to instruct the compression unit to weight the n layer in the training model. The gradient is processed.
  • Specifically, the operation unit may calculate the weight gradient of the training model (ie, the weight gradient of the n layers) according to the indication of the third microinstruction; then, according to the indication of the fourth microinstruction, the compression unit is invoked to process the weight gradient of the training model, so that the training model can be correspondingly updated with the processed weight gradient.
  • Optionally, the operation unit sequentially acquires, according to the reverse operation instruction, the n-th layer output data gradient, the n-1th layer output data gradient, ..., and the layer 1 output data gradient.
  • In layer order, the n-th layer inverse operation is performed on the n-th layer output data gradient to obtain the weight gradient of the n-th layer; the n-1th layer inverse operation is performed on the n-1th layer output data gradient to obtain the weight gradient of the n-1th layer; and so on, until the first layer inverse operation is performed on the first layer output data gradient to obtain the weight gradient of the first layer.
  • For any layer, the operation unit may perform an inverse operation on the output data gradient of the layer and the training output data of the layer to obtain the weight gradient of the layer.
  • For example, the operation unit may perform an inverse operation on the n-th layer output data gradient and the n-th layer training output data (ie, the training output data described above) to obtain the weight gradient of the n-th layer.
  • After the weight gradient of a layer is obtained, the computing device may correspondingly invoke the compression unit to process the weight gradient of the layer to obtain the processed weight gradient of the layer; correspondingly, the computing device may update the weight of the layer in the training model by using the processed weight gradient of the layer, until the weight of each layer in the training model has been updated, so as to complete the training of the training model.
  • For example, after obtaining the weight gradient of the n-th layer, the operation unit may call the compression unit to process the weight gradient of the n-th layer to obtain the processed weight gradient of the n-th layer; further, the computing device may update the weight of the n-th layer in the training model by using the processed weight gradient of the n-th layer.
  • Similarly, the operation unit may invoke the compression unit to process the weight gradient of the n-1th layer to obtain the processed weight gradient of the n-1th layer; further, the computing device may update the weight of the n-1th layer in the training model by using the processed weight gradient of the n-1th layer.
  • Alternatively, the computing device may use the compression unit to process the calculated weight gradients of the n layers to obtain the processed weight gradients of the n layers, and then correspondingly update the weights of the n layers in the training model according to the processed weight gradients of the n layers, thereby completing the training of the training model.
  • the operation unit calculates the weight gradients of the n layers, and the compression unit processes the weight gradients of the n layers. These two units may execute in a parallel mode or a serial mode when implemented. For example, in the serial mode, after calculating the weight gradient of the n-th layer, the operation unit must wait until the compression unit has processed the weight gradient of the n-th layer to obtain the processed weight gradient of the n-th layer, and the computing device has updated the weight of the n-th layer in the training model by using the processed weight gradient of the n-th layer, before the operation unit calculates the weight gradient of the n-1th layer.
  • In the parallel mode, the operation unit and the compression unit respectively perform their corresponding processing in layer order, and the two units do not affect each other: while the compression unit processes the weight gradient of the n-th layer, the operation unit can already calculate the weight gradient of the n-1th layer without waiting for the compression unit to finish processing the n-th layer, and so on by analogy until the weight gradient of the first layer is calculated. Understandably, in the parallel mode, the data processing efficiency of the computing device can be improved, as shown in the sketch below.
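  • The parallel mode can be sketched as a two-stage pipeline (the callables and the thread pool below are illustrative stand-ins for the operation unit and the compression unit):

```python
# While the compression unit processes the weight gradient of layer k, the
# operation unit already computes the weight gradient of layer k-1.
from concurrent.futures import ThreadPoolExecutor

def backward_pipelined(layers, output_grad, compute_grad, compress):
    futures = {}
    with ThreadPoolExecutor(max_workers=1) as compressor:     # one compression unit
        grad_in = output_grad
        for k in reversed(range(len(layers))):                # layer n down to layer 1
            w_grad, grad_in = compute_grad(layers[k], grad_in)
            futures[k] = compressor.submit(compress, w_grad)  # overlaps next layer
        return {k: fut.result() for k, fut in futures.items()}
```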
  • the reverse operation instruction includes a third micro instruction and a fourth micro instruction.
  • the instruction control unit decodes the reverse operation instruction to obtain the third micro instruction and the fourth micro instruction.
  • the third micro instruction and the fourth micro instruction are sent to the operation unit; or the third micro instruction is sent to the operation unit, and the fourth micro instruction is sent to the compression unit.
  • the third microinstruction is used to instruct the operation unit to calculate the weight gradients of the n layers according to the layer order; the fourth microinstruction is used to instruct the compression unit to process the calculated weight gradients of the n layers according to the layer order, so as to correspondingly obtain the processed weight gradients of the n layers.
  • Specifically, the operation unit may sequentially calculate the weight gradient of each layer in the n layers according to the layer order; then, according to the instruction of the fourth microinstruction, the compression unit is invoked to process the weight gradient of each layer to obtain the processed weight gradient of each layer.
  • In an optional embodiment, the compression unit 101 may acquire input data and process the input data according to the data processing manner indicated by the target microinstruction to obtain processed input data.
  • the input data may be the training sample or the weight gradients of the n layers obtained by the operation unit in the foregoing embodiment (specifically, the weight gradient of the n-th layer, the weight gradient of the n-1th layer, ..., and the weight gradient of the first layer).
  • the target microinstruction may be the first microinstruction or the fourth microinstruction in the foregoing embodiments, and is used to instruct the compression unit to process the weight gradient of a certain layer or of the n layers in the training model; details are not repeated here.
  • the compression unit may further process the input data according to the position relationship data to obtain the processed input data.
  • the following specific implementation modes exist.
  • the input data includes first input data and second input data.
  • the compression unit 101 includes:
  • the first sparse processing unit 1011 is configured to process the second input data to obtain third output data and second output data, and transmit the third output data to the first data processing unit 1012.
  • the first data processing unit 1012 is configured to receive the first input data and receive the third output data, and output the first output data according to the third output data and the first input data.
  • When the second input data is a weight, the first output data is a processed input neuron, the second output data is a processed weight, and the third output data is the positional relationship data of the weight; when the second input data is an input neuron, the first output data is a processed weight, the second output data is a processed input neuron, and the third output data is the positional relationship data of the input neuron.
  • Here, wij represents the weight between the i-th input neuron and the j-th output neuron; the first sparse processing unit 1011 determines the first positional relationship data (ie, the third output data) according to these weights.
  • Correspondingly, the first sparse processing unit 1011 obtains the positional relationship data according to the input neurons, and deletes the input neurons whose absolute values are less than or equal to the first threshold to obtain the processed input neurons.
  • the first threshold may be 0.1, 0.08, 0.05, 0.02, 0.01, 0 or other values.
  • the second threshold may be 0.1, 0.08, 0.06, 0.05, 0.02, 0.01, 0 or other values. It should be noted that the first threshold and the second threshold may be the same or may not be consistent.
  • the location relationship data may be expressed in the form of a step index or a direct index.
  • the positional relationship data represented by the direct index form is a character string composed of 0 and 1.
  • When the second input data is a weight, 0 indicates that the absolute value of the weight is less than or equal to the second threshold, that is, there is no connection between the input neuron corresponding to the weight and the output neuron; 1 indicates that the absolute value of the weight is greater than the second threshold, that is, the input neuron corresponding to the weight is connected to the output neuron.
  • Positional relationship data expressed in direct index form has two representation orders: the connection states of each output neuron with all the input neurons form a string of 0s and 1s to represent the connection relationship of the weights; or the connection states of each input neuron with all the output neurons form a string of 0s and 1s to represent the connection relationship of the weights.
  • When the second input data is an input neuron, 0 indicates that the absolute value of the input neuron is less than or equal to the first threshold, and 1 indicates that the absolute value of the input neuron is greater than the first threshold.
  • When the second input data is a weight, the positional relationship data represented by the step index is a character string composed of the distances between each input neuron connected to an output neuron and the previous input neuron connected to that output neuron; when the second input data is an input neuron, the data represented by the step index is a character string composed of the distances between each input neuron whose absolute value is greater than the first threshold and the previous input neuron whose absolute value is greater than the first threshold. Both representations are sketched below.
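  • Both representations can be reproduced with the following sketch (helper names are illustrative; the example matches the values used later in this description):

```python
# Direct index: a string of 0s and 1s marking surviving positions.
# Step index: the distances between consecutive surviving positions.
def direct_index(values, threshold):
    return "".join("1" if abs(v) > threshold else "0" for v in values)

def step_index(values, threshold):
    positions = [i for i, v in enumerate(values) if abs(v) > threshold]
    steps, prev = [], 0
    for p in positions:
        steps.append(p - prev)   # distance to the previous surviving position
        prev = p
    return "".join(str(s) for s in steps)

neurons = [1, 0, 3, 5]
print(direct_index(neurons, 0))  # "1011"
print(step_index(neurons, 0))    # "021"
```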
  • FIG. 3 is a schematic diagram of a neural network according to an embodiment of the present invention.
  • the first input data is an input neuron, including input neurons i1, i2, i3, and i4, and the second input data is a weight.
  • For the output neuron o1, the weights are w11, w21, w31, and w41; for the output neuron o2, the weights are w12, w22, w32, and w42. The weights w21, w12, and w42 have the value 0, and their absolute values are less than the first threshold 0.01, so the first sparse processing unit 1011 determines that the input neuron i2 is not connected to the output neuron o1, and that the input neurons i1 and i4 are not connected to the output neuron o2.
  • If the positional relationship data is represented by the connection state of each output neuron with all the input neurons, the positional relationship data of the output neuron o1 is "1011" and the positional relationship data of the output neuron o2 is "0110" (ie, the overall positional relationship data is "10110110"); if it is represented by the connection relationship of each input neuron with all the output neurons, the positional relationship data of the input neuron i1 is "10", that of the input neuron i2 is "01", that of the input neuron i3 is "11", and that of the input neuron i4 is "10" (ie, the overall positional relationship data is "10011110").
  • For the output neuron o1, the compression unit 101 takes each of the pairs i1 and w11, i3 and w31, and i4 and w41 as a data set and stores these data sets in the storage unit 102; for the output neuron o2, the compression unit 101 takes each of the pairs i2 and w22 and i3 and w32 as a data set and stores these data sets in the storage unit 102.
  • the second output data is w11, w31 and w41 for the output neuron o1; and the second output data is w22 and w32 for the output neuron o2.
  • When the second input data is the input neurons i1, i2, i3, and i4, with values of 1, 0, 3, and 5 respectively, the positional relationship data (ie, the third output data) is "1011", and the second output data is 1, 3, 5.
  • the first input data includes input neurons i1, i2, i3, and i4, and the second input data is a weight.
  • For the output neuron o1, the weights are w11, w21, w31, and w41; for the output neuron o2, the weights are w12, w22, w32, and w42. The weights w21, w12, and w42 have the value 0, so the sparse processing unit 1011 determines that the input neurons i1, i3, and i4 are connected to the output neuron o1, and that the input neurons i2 and i3 are connected to the output neuron o2.
  • the positional relationship data between the output neuron o1 and the input neuron is "021".
  • The first digit "0" in the positional relationship data indicates that the distance between the first input neuron connected to the output neuron o1 and the first input neuron is 0, that is, the first input neuron connected to the output neuron o1 is the input neuron i1; the second digit "2" indicates that the distance between the second input neuron connected to the output neuron o1 and the first input neuron connected to the output neuron o1 (ie, the input neuron i1) is 2, that is, the second input neuron connected to the output neuron o1 is the input neuron i3; the third digit "1" indicates that the distance between the third input neuron connected to the output neuron o1 and the second input neuron connected to the output neuron o1 is 1, that is, the third input neuron connected to the output neuron o1 is the input neuron i4.
  • the positional relationship data between the output neuron o2 and the input neuron is "11".
  • The first digit "1" in the positional relationship data indicates that the distance between the first input neuron connected to the output neuron o2 and the first input neuron (ie, the input neuron i1) is 1, that is, the first input neuron connected to the output neuron o2 is the input neuron i2; the second digit "1" indicates that the distance between the second input neuron connected to the output neuron o2 and the first input neuron connected to the output neuron o2 is 1, that is, the second input neuron connected to the output neuron o2 is the input neuron i3.
  • For the output neuron o1, the compression unit 101 takes each of the pairs i1 and w11, i3 and w31, and i4 and w41 as a data set and stores these data sets in the storage unit 102; for the output neuron o2, the compression unit 101 takes each of the pairs i2 and w22 and i3 and w32 as a data set and stores these data sets in the storage unit 102.
  • the second output data is w11, w31 and w41 for the output neuron o1; and the second output data is w22 and w32 for the output neuron o2.
  • When the second input data is the input neurons i1, i2, i3, and i4, with values of 1, 0, 3, and 5 respectively, the positional relationship data (ie, the third output data) is "021", and the second output data is 1, 3, 5.
  • When the first input data is an input neuron and the second input data is a weight, the third output data is the positional relationship data between the output neurons and the input neurons.
  • The first data processing unit 1012 culls the input neurons whose absolute values are less than or equal to the second threshold, selects, according to the positional relationship data, the input neurons associated with the above weights from the culled result, and outputs them as the first output data.
  • For example, the input neurons i1, i2, i3, and i4 have values of 1, 0, 3, and 5, respectively; for the output neuron o1, the third output data (ie, the positional relationship data) is "021", and the second output data is w11, w31, and w41.
  • The first data processing unit 1012 culls the input neuron having the value 0 among the input neurons i1, i2, i3, and i4 to obtain the input neurons i1, i3, and i4. The first data processing unit 1012 determines, according to the third output data "021", that the input neurons i1, i3, and i4 are all connected to the output neuron, so the data processing unit 1012 outputs the input neurons i1, i3, and i4 as the first output data, that is, outputs 1, 3, 5.
  • When the first input data is a weight and the second input data is an input neuron, the third output data is the positional relationship data of the input neurons. After receiving the weights w11, w21, w31, and w41, the first data processing unit 1012 culls the weights whose absolute values are smaller than the first threshold, selects, according to the positional relationship data, the weights associated with the input neurons from the culled result, and outputs them as the first output data.
  • For example, the weights w11, w21, w31, and w41 have values of 1, 0, 3, and 4, respectively; for the output neuron o1, the third output data (ie, the positional relationship data) is "1011", and the second output data is i1, i3, and i4.
  • The first data processing unit 1012 culls the weight having the value 0 among the weights w11, w21, w31, and w41 to obtain the weights w11, w31, and w41. The first data processing unit 1012 determines, according to the third output data "1011", that the value of the input neuron i2 among the input neurons i1, i2, i3, and i4 is 0, so the first data processing unit 1012 outputs the weights having the values 1, 3, and 4 as the first output data.
  • When the third input data and the fourth input data are respectively at least one weight and at least one input neuron, the compression unit 101 determines the positions of the input neurons whose absolute values are greater than the first threshold among the at least one input neuron and acquires the positional relationship data of the input neurons; the compression unit 101 also determines the positions of the weights whose absolute values are greater than the second threshold among the at least one weight and acquires the positional relationship data of the weights. The compression unit 101 then obtains new positional relationship data from the positional relationship data of the weights and the positional relationship data of the input neurons; the new positional relationship data represents the relationship between the input neurons whose absolute values are greater than the first threshold and the output neurons, together with the values of the corresponding weights. The compression unit 101 acquires the processed input neurons and the processed weights based on the new positional relationship data, the at least one input neuron, and the at least one weight.
  • the compression unit 101 stores the processed input neurons and the processed weights in the one-to-one correspondence format in the storage unit 102.
  • Specifically, the manner in which the compression unit 101 stores the processed input neurons and the processed weights in one-to-one correspondence is to take each processed input neuron and its corresponding processed weight as a data set and store that data set in the storage unit 102, as sketched below.
  • the sparse processing unit 1011 in the compression unit 101 thins out the input neurons or weights, reducing the weight or inputting the nerve
  • the number of elements which in turn reduces the number of operations performed by the arithmetic unit, improves the computational efficiency.
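For illustration only, the mask-combination step described above can be sketched in a few lines of Python; the function name, the array values, and the zero thresholds below are assumptions made for the example and are not taken from the disclosure.

```python
import numpy as np

def compress(neurons, weights, t1, t2):
    # Direct-index positional data for the neurons and for the weights
    neuron_mask = np.abs(neurons) > t1
    weight_mask = np.abs(weights) > t2
    # New positional relationship data: both conditions must hold
    combined = neuron_mask & weight_mask
    # Processed neurons and weights, kept as one-to-one data sets
    pairs = list(zip(neurons[combined].tolist(), weights[combined].tolist()))
    return combined, pairs

neurons = np.array([1.0, 0.0, 3.0, 5.0])   # i1..i4
weights = np.array([1.0, 0.0, 3.0, 4.0])   # w11, w21, w31, w41
mask, pairs = compress(neurons, weights, t1=0.0, t2=0.0)
print(mask.astype(int))   # [1 0 1 1]
print(pairs)              # [(1.0, 1.0), (3.0, 3.0), (5.0, 4.0)]
```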
In yet another specific embodiment, the input data includes first input data and second input data. Specifically, as shown in FIG. 4, the compression unit 101 includes:

a second sparse processing unit 1013, configured to: after receiving the third input data, obtain first positional relationship data according to the third input data, and transmit the first positional relationship data to the connection relationship processing unit 1015;

a third sparse processing unit 1014, configured to: after receiving the fourth input data, obtain second positional relationship data according to the fourth input data, and transmit the second positional relationship data to the connection relationship processing unit 1015;

the connection relationship processing unit 1015, configured to obtain third positional relationship data according to the first positional relationship data and the second positional relationship data, and to transmit the third positional relationship data to the second data processing unit 1016;

the second data processing unit 1016, configured to: after receiving the third input data, the fourth input data, and the third positional relationship data, process the third input data and the fourth input data according to the third positional relationship data to obtain fourth output data and fifth output data.

When the third input data includes at least one input neuron and the fourth input data includes at least one weight, the first positional relationship data is the positional relationship data of the input neurons, the second positional relationship data is the positional relationship data of the weights, the fourth output data is the processed input neurons, and the fifth output data is the processed weights. Conversely, when the third input data includes at least one weight and the fourth input data includes at least one input neuron, the first positional relationship data is the positional relationship data of the weights, the second positional relationship data is the positional relationship data of the input neurons, the fourth output data is the processed weights, and the fifth output data is the processed input neurons.
When the third input data includes at least one input neuron, the first positional relationship data is a character string indicating the positions of the input neurons whose absolute value is greater than the first threshold; when the third input data includes at least one weight, the first positional relationship data is a character string indicating whether there is a connection between the input neurons and the output neurons. Likewise, when the fourth input data includes at least one input neuron, the second positional relationship data is a character string indicating the positions of the input neurons whose absolute value is greater than the first threshold; when the fourth input data includes at least one weight, the second positional relationship data is a character string indicating whether there is a connection between the input neurons and the output neurons.
It should be noted that the first positional relationship data, the second positional relationship data, and the third positional relationship data may each be represented by a step index or a direct index. In other words, the connection relationship processing unit 1015 processes the first positional relationship data and the second positional relationship data to obtain the third positional relationship data, which may be expressed in the form of a direct index or a step index.

Specifically, when the first positional relationship data and the second positional relationship data are both expressed in the form of a direct index, the connection relationship processing unit 1015 performs an AND operation on the first positional relationship data and the second positional relationship data to obtain the third positional relationship data, which is expressed in the form of a direct index. It should be noted that the character strings representing the first positional relationship data and the second positional relationship data are stored in memory in order of physical address, either from high to low or from low to high.
When the first positional relationship data and the second positional relationship data are both expressed in the form of a step index, and the character strings representing them are stored in ascending order of physical address, the connection relationship processing unit 1015 accumulates each element in the character string of the first positional relationship data with the elements stored at physical addresses lower than that of the element; the new elements thus obtained constitute fourth positional relationship data. Similarly, the connection relationship processing unit 1015 performs the same processing on the character string of the second positional relationship data to obtain fifth positional relationship data. The connection relationship processing unit 1015 then selects the elements common to the character string of the fourth positional relationship data and the character string of the fifth positional relationship data, and sorts them in ascending order of value to form a new character string. Finally, the connection relationship processing unit 1015 subtracts from each element in the new character string its adjacent preceding element, whose value is smaller; performing this operation on every element of the new string yields the third positional relationship data.
For example, assume that the first positional relationship data and the second positional relationship data are expressed in the form of a step index, the character string of the first positional relationship data being "01111" and that of the second being "022". The connection relationship processing unit 1015 adds each element in the character string of the first positional relationship data to its adjacent preceding element, obtaining the fourth positional relationship data "01234"; processing the character string of the second positional relationship data in the same way yields the fifth positional relationship data "024". The connection relationship processing unit 1015 selects the elements common to "01234" and "024", obtaining the new character string "024", and then subtracts from each element in the new string its adjacent preceding element, that is, 0, (2-0), (4-2), to obtain the third positional relationship data "022".
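As a minimal sketch of the step-index AND just described (assuming single-digit steps and a non-empty intersection, as in the example; the function name is chosen for the sketch only):

```python
def and_step_index(a, b):
    # Prefix-sum a step-index string into absolute positions
    def positions(s):
        out, acc = [], 0
        for step in s:
            acc += int(step)
            out.append(acc)
        return out
    # Keep the common positions, sorted in ascending order
    common = sorted(set(positions(a)) & set(positions(b)))
    # Re-encode as differences between adjacent positions
    steps = [common[0]] + [q - p for p, q in zip(common, common[1:])]
    return "".join(str(s) for s in steps)

print(and_step_index("01111", "022"))   # -> "022"
```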
When one of the first positional relationship data and the second positional relationship data is expressed in the form of a step index and the other in the form of a direct index, the connection relationship processing unit 1015 converts the positional relationship data represented by the step index into the direct index representation, or converts the positional relationship data represented by the direct index into the step index representation. The connection relationship processing unit 1015 then performs the processing according to the method described above to obtain the third positional relationship data (that is, the fifth output data).

Optionally, when the first positional relationship data and the second positional relationship data are both expressed in the form of a direct index, the connection relationship processing unit 1015 may convert both into positional relationship data expressed in the form of a step index, and then process the first positional relationship data and the second positional relationship data according to the method described above to obtain the third positional relationship data.
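The two conversions mentioned above are straightforward. The following sketch (illustrative only, with hypothetical function names) converts the direct index "1011" of the FIG. 3 example into its step-index form "021" and back:

```python
def direct_to_step(direct):
    # Positions of the '1' bits, then distances between adjacent positions
    pos = [i for i, bit in enumerate(direct) if bit == "1"]
    steps = [pos[0]] + [q - p for p, q in zip(pos, pos[1:])]
    return "".join(str(s) for s in steps)

def step_to_direct(step, length):
    # Walk the cumulative distances and set the corresponding bits
    bits, acc = ["0"] * length, 0
    for s in step:
        acc += int(s)
        bits[acc] = "1"
    return "".join(bits)

print(direct_to_step("1011"))     # -> "021"
print(step_to_direct("021", 4))   # -> "1011"
```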
Specifically, the third input data may be input neurons or weights, the fourth input data may be input neurons or weights, and the third input data and the fourth input data are of different kinds. According to the third positional relationship data, the second data processing unit 1016 selects from the third input data (that is, the input neurons or the weights) the data related to the third positional relationship data as the fourth output data; according to the third positional relationship data, the second data processing unit 1016 selects from the fourth input data the data related to the third positional relationship data as the fifth output data.

Further, the second data processing unit 1016 takes each processed input neuron and the processed weight corresponding to it as a data set and stores the data set in the storage unit 102.

For example, assume that the third input data includes the input neurons i1, i2, i3, and i4, that the fourth input data includes the weights w11, w21, w31, and w41, and that the third positional relationship data, expressed by direct index, is "1010". The fourth output data output by the second data processing unit 1016 are then the input neurons i1 and i3, and the fifth output data are the weights w11 and w31. The second data processing unit 1016 takes the input neuron i1 with the weight w11, and the input neuron i3 with the weight w31, each as a data set, and stores the data sets in the storage unit 102.
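The pairing in this example can be reproduced with a one-line selection; the snippet below is purely illustrative:

```python
neurons = ["i1", "i2", "i3", "i4"]
weights = ["w11", "w21", "w31", "w41"]
index = "1010"   # third positional relationship data, direct-index form

# Keep the (neuron, weight) data sets whose index bit is 1
data_sets = [(n, w) for n, w, bit in zip(neurons, weights, index) if bit == "1"]
print(data_sets)   # [('i1', 'w11'), ('i3', 'w31')]
```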
In the case where the compression unit 101 includes the second sparse processing unit 1013, the third sparse processing unit 1014, the connection relationship processing unit 1015, and the second data processing unit 1016, the sparse processing units in the compression unit 101 thin out both the input neurons and the weights, further reducing their number, which in turn reduces the computational load of the arithmetic unit and improves computational efficiency.
In yet another specific embodiment, the input data includes at least one input weight or at least one input neuron. As shown in FIG. 5, the compression unit 601 includes:

an input data buffer unit 6011, configured to buffer the input data, the input data including at least one input neuron or at least one weight;

a connection relationship buffer unit 6012, configured to buffer the positional relationship data of the input data, that is, the positional relationship data of the input neurons or the positional relationship data of the weights.
Here, the positional relationship data of the input neurons is a character string indicating whether the absolute value of each input neuron is less than or equal to the first threshold; the positional relationship data of the weights is a character string indicating whether the absolute value of each weight is less than or equal to the first threshold, or a character string indicating whether there is a connection between the input neuron and the output neuron corresponding to the weight. The positional relationship data of the input neurons and the positional relationship data of the weights may be expressed in the form of a direct index or a step index.

The compression unit 601 further includes a fourth sparse processing unit 6013, configured to process the input data according to the positional relationship data of the input data to obtain the processed input data, and to store the processed input data in the first input buffer unit 605.
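A sketch of what the fourth sparse processing unit does with a cached direct-index string (illustrative only; the function name is an assumption):

```python
def apply_position_data(values, direct_index):
    # Keep only the buffered values whose direct-index bit is '1'
    return [v for v, bit in zip(values, direct_index) if bit == "1"]

print(apply_position_data([1, 0, 3, 5], "1011"))   # -> [1, 3, 5]
```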
In the above three specific embodiments, before the compression unit 101 processes the input data, the compression unit 101 is further configured to: group the at least one input neuron to obtain M groups of input neurons, M being an integer greater than or equal to 1; judge whether each of the M groups of input neurons satisfies a first preset condition, the first preset condition being that the number of input neurons in the group whose absolute value is less than or equal to a third threshold is less than or equal to a fourth threshold; and delete any group of input neurons that does not satisfy the first preset condition. Likewise, the compression unit 101 is configured to: group the at least one weight to obtain N groups of weights, N being an integer greater than or equal to 1; judge whether each of the N groups of weights satisfies a second preset condition, the second preset condition being that the number of weights in the group whose absolute value is less than or equal to a fifth threshold is less than or equal to a sixth threshold; and delete any group of weights that does not satisfy the second preset condition.

Optionally, the third threshold may be 0.5, 0.2, 0.1, 0.05, 0.025, 0.01, 0, or another value, and the fourth threshold is related to the number of input neurons in a group of input neurons; optionally, the fourth threshold equals the number of input neurons in the group minus 1, or the fourth threshold is another value. Optionally, the fifth threshold may be 0.5, 0.2, 0.1, 0.05, 0.025, 0.01, 0, or another value, and the sixth threshold is related to the number of weights in a group of weights; optionally, the sixth threshold equals the number of weights in the group minus 1, or the sixth threshold is another value. It should be noted that the third threshold and the fifth threshold may be the same or different, and the fourth threshold and the sixth threshold may be the same or different.
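For illustration, the group-screening step might look like the following sketch, where the group size and both threshold values are arbitrary example values rather than values taken from the disclosure:

```python
import numpy as np

def filter_groups(values, group_size, small_thresh, max_small):
    # Split the data into equally sized groups and delete any group in which
    # more than `max_small` entries have an absolute value <= `small_thresh`
    groups = np.split(values, len(values) // group_size)
    return [g for g in groups if np.sum(np.abs(g) <= small_thresh) <= max_small]

data = np.array([0.9, 0.0, 0.0, 0.0,    # mostly near zero -> group deleted
                 0.7, 0.3, 0.0, 1.2])   # group kept
print(filter_groups(data, group_size=4, small_thresh=0.1, max_small=2))
```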
It should be noted that, in addition to the direct index and the step index, the positional relationship data involved in the present application may also be represented in the following formats: List of Lists (LIL), Coordinate list (COO), Compressed Sparse Row (CSR), Compressed Sparse Column (CSC), ELL Pack (ELL), and Hybrid (HYB).
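Of these formats, CSR is perhaps the most widely used. As a brief illustration (not part of the disclosure), the three CSR arrays for a small weight matrix can be built as follows:

```python
def to_csr(matrix):
    # data: the non-zero values; indices: their column positions;
    # indptr: where each row's non-zeros start and end inside `data`
    data, indices, indptr = [], [], [0]
    for row in matrix:
        for j, v in enumerate(row):
            if v != 0:
                data.append(v)
                indices.append(j)
        indptr.append(len(data))
    return data, indices, indptr

w = [[1, 0, 3, 4],
     [0, 2, 0, 0]]
print(to_csr(w))   # ([1, 3, 4, 2], [0, 2, 3, 1], [0, 3, 4])
```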
It should be pointed out that the input neurons and output neurons mentioned in the embodiments of the present invention do not refer to the neurons in the input layer and the output layer of the entire neural network, but to the neurons in any two adjacent layers of the network: the neurons in the lower layer of the network feedforward operation are the input neurons, and the neurons in the upper layer of the network feedforward operation are the output neurons. Taking a convolutional neural network as an example, suppose the network has L layers, with K = 1, 2, ..., L-1. For the K-th layer and the (K+1)-th layer, the K-th layer is called the input layer and its neurons are the above input neurons, while the (K+1)-th layer is called the output layer and its neurons are the above output neurons. That is, except for the top layer, each layer can serve as an input layer, and the next layer is the corresponding output layer.
The present application also provides a data processing method using the above computing device. As shown in FIG. 6a, the method includes the following steps.

Step S6a01: the instruction control unit acquires an operation instruction, decodes the operation instruction into a first microinstruction and a second microinstruction, sends the first microinstruction to the compression unit, and sends the second microinstruction to the arithmetic unit.

Step S6a02: the compression unit processes the acquired input data according to the first microinstruction to obtain processed input data, the input data including at least one input neuron and/or at least one weight, and the processed input data including processed input neurons and/or processed weights.

Step S6a03: the arithmetic unit processes the processed input data according to the second microinstruction to obtain an operation result.
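Functionally, the three steps amount to a decode, compress, and operate pipeline. The sketch below is a loose software analogy of that flow under assumed names and a pruning-style compression; none of it is taken from the disclosure:

```python
import numpy as np

def run(instruction, neurons, weights, threshold=0.1):
    # S6a01: "decode" the instruction into its two micro-operations
    compress, operate = {
        "matvec": (lambda x: np.where(np.abs(x) > threshold, x, 0.0),
                   lambda w, x: w @ x),
    }[instruction]
    # S6a02: the compression unit processes the input data
    neurons, weights = compress(neurons), compress(weights)
    # S6a03: the arithmetic unit produces the operation result
    return operate(weights, neurons)

w = np.array([[1.0, 0.02], [0.5, 2.0]])
x = np.array([3.0, 0.01])
print(run("matvec", x, w))   # pruned entries contribute nothing
```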
The present application also provides another data processing method using the above computing device, the computing device including an arithmetic unit, an instruction control unit, and a compression unit. The computing device acquires training samples and a training model; the training model is a neural network training model and/or a non-neural network training model and includes an n-layer structure. As specifically shown in FIG. 6b, the method includes the following steps.

Step S6b01: the instruction control unit acquires a training instruction, obtains a forward operation instruction and a reverse operation instruction according to the training instruction, and sends the forward operation instruction and the reverse operation instruction to the arithmetic unit respectively.

Step S6b02: the arithmetic unit performs an n-layer forward operation on the training samples according to the forward operation instruction to obtain an n-th layer forward operation result; obtains an n-th layer output data gradient according to the n-th layer forward operation result; and performs an n-layer reverse operation on the n-th layer output data gradient according to the reverse operation instruction to obtain the weight gradients of the training model, the weight gradients of the training model including the weight gradient of each of the n layers.

Step S6b03: the compression unit processes the weight gradients of the training model to obtain the corresponding processed weight gradients.

Step S6b04: the computing device updates the weights of the training model according to the processed weight gradients to complete the training.
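A compact software analogy of steps S6b01 to S6b04, using a single linear layer as a stand-in for the n-layer model (the learning rate, pruning threshold, and all names are assumptions made for the sketch):

```python
import numpy as np

def train_step(w, x, target, lr=0.01, threshold=1e-3):
    y = w @ x                                   # forward operation
    grad_y = y - target                         # output data gradient
    grad_w = np.outer(grad_y, x)                # reverse operation: weight gradient
    grad_w[np.abs(grad_w) <= threshold] = 0.0   # compression: prune small gradients
    return w - lr * grad_w                      # update with the processed gradient

w = np.array([[0.5, -0.2], [0.1, 0.4]])
w = train_step(w, x=np.array([1.0, 0.0]), target=np.array([1.0, 0.0]))
```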
The compression unit described in the above embodiments of the present application can also be applied to a chip device (also referred to as an arithmetic circuit or a computing unit) to implement data compression, reducing the amount of data transmitted and computed in the circuit and thereby improving data processing efficiency.

Referring to FIG. 7a, FIG. 7a is a schematic structural diagram of such a chip device. As shown in FIG. 7a, the arithmetic circuit includes a main processing circuit, basic processing circuits, and branch processing circuits. Specifically, the main processing circuit is connected to the branch processing circuits, and each branch processing circuit is connected to at least one basic processing circuit. The branch processing circuits are configured to transmit and receive data of the main processing circuit or of the basic processing circuits.
Referring to FIG. 7b, FIG. 7b is a schematic structural diagram of the main processing circuit. As shown in FIG. 7b, the main processing circuit may include a register and/or an on-chip cache circuit, and may further include a control circuit, a vector operator circuit, an ALU (arithmetic and logic unit) circuit, an accumulator circuit, a DMA (Direct Memory Access) circuit, and other circuits. Of course, in practical applications, other circuits such as a conversion circuit (for example, a matrix transposition circuit), a data rearrangement circuit, or an activation circuit may also be added to the main processing circuit.

The main processing circuit further includes a data transmitting circuit and a data receiving circuit or interface. The data transmitting circuit may integrate a data distributing circuit and a data broadcasting circuit, although in practical applications the data distributing circuit and the data broadcasting circuit may also be set separately; in practical applications, the data transmitting circuit and the data receiving circuit may likewise be integrated together to form a data transceiver circuit.
Broadcast data is data that needs to be sent to every basic processing circuit. Distribution data is data that needs to be sent selectively to some of the basic processing circuits, the specific selection being determined by the main processing circuit according to the load and the calculation manner. In the broadcast transmission mode, the broadcast data is transmitted to every basic processing circuit in broadcast form; in practical applications, the broadcast data may be sent to every basic processing circuit by a single broadcast or by multiple broadcasts, and the specific embodiments of the present application do not limit the number of broadcasts. In the distribution transmission mode, the distribution data is selectively transmitted to some of the basic processing circuits.
When distributing data, the control circuit of the main processing circuit transmits data to some or all of the basic processing circuits; the data may be the same or different. Specifically, if the data is transmitted by way of distribution, the data received by each receiving basic processing circuit may be different, although some basic processing circuits may also receive the same data. When broadcasting data, the control circuit of the main processing circuit transmits data to some or all of the basic processing circuits, and each receiving basic processing circuit receives the same data; that is, the broadcast data may include data that all basic processing circuits need to receive, while the distribution data may include data that only a portion of the basic processing circuits need to receive. The main processing circuit may send the broadcast data to all branch processing circuits by one or more broadcasts, and the branch processing circuits forward the broadcast data to all the basic processing circuits.
Optionally, the vector operator circuit of the main processing circuit may perform vector operations, including but not limited to: addition, subtraction, multiplication, and division of two vectors; addition, subtraction, multiplication, and division of a vector and a constant; or an arbitrary operation on each element of a vector. Continuous operations may specifically be vector-and-constant addition, subtraction, multiplication, or division, activation operations, accumulation operations, and the like.

Each basic processing circuit may include a base register and/or a base on-chip cache circuit, and may further include one of, or any combination of, an inner product operator circuit, a vector operator circuit, an accumulator circuit, and the like. The inner product operator circuit, the vector operator circuit, and the accumulator circuit may be integrated circuits, or they may be separately provided circuits.
The connection structure of the branch processing circuits and the basic circuits may be arbitrary and is not limited to the H-type structure of FIG. 7a. Optionally, the structure from the main processing circuit to the basic circuits is a broadcast or distribution structure, and the structure from the basic circuits back to the main processing circuit is a gather structure. Broadcast, distribution, and gather are defined as follows.

The data transfer manner from the main processing circuit to the basic circuits may include the following: the main processing circuit is connected to a plurality of branch processing circuits, and each branch processing circuit is in turn connected to a plurality of basic circuits; or the main processing circuit is connected to one branch processing circuit, which is connected to a further branch processing circuit, and so on, a plurality of branch processing circuits being connected in series, with each branch processing circuit then connected to a plurality of basic circuits; or the main processing circuit is connected to a plurality of branch processing circuits, with each branch processing circuit connected in series with a plurality of basic circuits; or the main processing circuit is connected to one branch processing circuit, which is connected to a further branch processing circuit, and so on, a plurality of branch processing circuits being connected in series, with each branch processing circuit then connected in series with a plurality of basic circuits.
When distributing data, the main processing circuit transmits data to some or all of the basic circuits, and the data received by each receiving basic circuit may be different. When broadcasting data, the main processing circuit transmits data to some or all of the basic circuits, and each receiving basic circuit receives the same data. When gathering data, some or all of the basic circuits transmit data back to the main processing circuit.
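A software analogy of the three transfer modes for a matrix-vector product, in which the row blocks are distributed, the vector is broadcast, and the partial results are gathered; the split into two branches and all names are assumptions made for the sketch:

```python
import numpy as np

def matvec(w, x, n_branches=2):
    distributed = np.array_split(w, n_branches)             # distribution: different rows
    broadcast = x                                           # broadcast: same vector to all
    partial = [chunk @ broadcast for chunk in distributed]  # basic circuits compute
    return np.concatenate(partial)                          # gather back to the main circuit

w = np.arange(12.0).reshape(4, 3)
print(matvec(w, np.array([1.0, 0.0, 2.0])))   # [ 4. 13. 22. 31.]
```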
It should be noted that the chip device shown in FIG. 7a may be a single physical chip; in practical applications, the chip device may also be integrated into another chip (for example, a CPU or a GPU), and the specific embodiments of the present application do not limit the physical form of the chip device.

Referring to FIG. 7c, FIG. 7c is a schematic diagram of data distribution of the chip device. As shown by the arrows in FIG. 7c, which indicate the direction of data distribution, after the main processing circuit receives external data, it splits the external data and distributes the parts to a plurality of branch processing circuits, and the branch processing circuits send the split data to the basic processing circuits.
Referring to FIG. 7d, FIG. 7d is a schematic diagram of data return of the chip device. As shown by the arrows in FIG. 7d, which indicate the direction of data return, the basic processing circuits pass the computed data (for example, an inner product result) back to the branch processing circuits, and the branch processing circuits pass it back to the main processing circuit.

The input data may specifically be a vector, a matrix, or multi-dimensional (three-dimensional, four-dimensional, or higher) data, and a specific value of the input data may be referred to as an element of the input data.
The embodiment of the present disclosure further provides a calculation method for the computing unit shown in FIG. 7a, applied to neural network computation. Specifically, the computing unit may be used to perform operations on the input data and the weight data of one or more layers of a multilayer neural network: the computing unit is configured to perform operations on the input data and the weight data of one or more layers of a multilayer neural network being trained, or to perform operations on the input data and the weight data of one or more layers of a multilayer neural network in the forward operation.

The above operations include, but are not limited to, one of, or any combination of, a convolution operation, a matrix-multiply-matrix operation, a matrix-multiply-vector operation, a bias operation, a fully connected operation, a GEMM operation, a GEMV operation, and an activation operation.
A GEMM calculation refers to the matrix-matrix multiplication operation in the BLAS library. The usual form of this operation is C = alpha*op(S)*op(P) + beta*C, where S and P are the two input matrices, C is the output matrix, alpha and beta are scalars, and op denotes some operation on the matrix S or P; in addition, some auxiliary integers serve as parameters describing the width and height of S and P. A GEMV calculation refers to the matrix-vector multiplication operation in the BLAS library. The usual form of this operation is C = alpha*op(S)*P + beta*C, where S is the input matrix, P is the input vector, C is the output vector, alpha and beta are scalars, and op denotes some operation on the matrix S.
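These two BLAS forms translate directly into code; the following NumPy sketch (illustrative only, with op defaulting to the identity) mirrors the formulas above:

```python
import numpy as np

def gemm(alpha, s, p, beta, c, op_s=lambda m: m, op_p=lambda m: m):
    # C = alpha*op(S)*op(P) + beta*C
    return alpha * (op_s(s) @ op_p(p)) + beta * c

def gemv(alpha, s, p, beta, c, op_s=lambda m: m):
    # C = alpha*op(S)*P + beta*C, with P and C vectors
    return alpha * (op_s(s) @ p) + beta * c

S = np.array([[1.0, 2.0], [3.0, 4.0]])
print(gemm(2.0, S, np.eye(2), 1.0, np.zeros((2, 2))))   # 2*S
print(gemv(1.0, S, np.array([1.0, 1.0]), 0.0, np.zeros(2),
           op_s=np.transpose))                           # S^T @ [1, 1] = [4, 6]
```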
In practical applications, the compression unit of the present application may be specifically designed into, or applied to, any one or more of the following circuits of the embodiments of the present application: the main processing circuit, the branch processing circuits, and the basic processing circuits.

The embodiment of the present application further provides a computer storage medium, wherein the computer storage medium stores a computer program for electronic data exchange, and the computer program causes a computer to perform some or all of the steps of any one of the data processing methods described in the foregoing method embodiments. The embodiment of the present application further provides a computer program product, comprising a non-transitory computer-readable storage medium storing a computer program, the computer program being operable to cause a computer to perform some or all of the steps of any one of the data processing methods described in the foregoing method embodiments.
In some embodiments, a chip is also disclosed, which includes the above neural network processor for performing the data processing method. In some embodiments, a chip package structure is disclosed, which includes the above chip. In some embodiments, a board card is disclosed, which includes the above chip package structure. In some embodiments, an electronic device is disclosed, which includes the above board card.

The electronic device includes a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet, a smart terminal, a mobile phone, a driving recorder, a navigator, a sensor, a webcam, a server, a cloud server, a camera, a video camera, a projector, a watch, earphones, mobile storage, a wearable device, a vehicle, a household appliance, and/or a medical device.

The vehicle includes an airplane, a ship, and/or a car; the household appliance includes a television, an air conditioner, a microwave oven, a refrigerator, a rice cooker, a humidifier, a washing machine, an electric lamp, a gas stove, and a range hood; the medical device includes a nuclear magnetic resonance instrument, a B-ultrasound scanner, and/or an electrocardiograph.

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The present disclosure provides a computing device. The computing device includes: an instruction control unit configured to acquire an operation instruction, decode the operation instruction into a first microinstruction and a second microinstruction, send the first microinstruction to the compression unit, and send the second microinstruction to the arithmetic unit; a storage unit configured to store input data, processed input data, operation instructions, and operation results, the input data including at least one input neuron and/or at least one weight, and the processed input data including processed input neurons and/or processed weights; a compression unit configured to process the input data according to the first microinstruction to obtain the processed input data; and an arithmetic unit configured to process the processed input data according to the second microinstruction to obtain the operation result.

Description

一种计算装置及相关产品
相关申请:
本申请要求2018年02月27日提交,申请号为201810161636.2,发明名称为“一种计算装置及相关产品”的优先权;
本申请要求2018年02月27日提交,申请号为201810161816.0,发明名称为“一种计算装置及相关产品”的优先权。
技术领域
本申请涉及信息处理技术领域,具体涉及一种计算装置及相关产品。
背景技术
随着信息技术的不断发展和人们日益增长的需求,人们对信息及时性的要求越来越高了。目前,终端对信息的获取以及处理均是基于通用处理器获得的。
在实践中发现,这种基于通用处理器运行软件程序来处理信息的方式,受限于通用处理器的运行速率,特别是在通用处理器负荷较大的情况下,信息处理效率较低、时延较大。
发明内容
本申请实施例提供了一种数据处理方法及相关产品,可提升计算装置的处理速度,提高效率。
第一方面,提供一种使用计算装置进行数据处理的方法,所述计算装置包括运算单元、指令控制单元、存储单元以及压缩单元,所述方法包括:
所述指令控制单元获取运算指令,将所述运算指令译码为第一微指令和第二微指令,并将所述第一微指令发送给所述压缩单元,将所述第二微指令发送给所述运算单元;
所述压缩单元根据所述第一微指令对获取的输入数据进行处理,得到处理后的输入数据;其中,所述输入数据包括至少一个输入神经元和/或至少一个输入数据,所述处理后的输入数据包括处理后的输入神经元和/或处理后的输入数据;
所述运算单元根据所述第二微指令对所述处理后的输入数据进行处理,得到运算结果。
第二方面,提供一种使用计算装置进行数据处理的方法,所述计算装置包括运算单元、指令控制单元以及压缩单元;所述计算装置获取训练样本、训练模型;所述训练模型为:神经网络训练模型和/或非神经网络训练模型,所述训练模型包括n层结构;所述方法包括:
所述指令控制单元获取训练指令,根据所述训练指令得到正向运算指令和反向运算指令,并将所述正向运算指令和所述反向运算指令分别发送给所述运算单元;
所述运算单元根据所述正向运算指令对所述训练样本进行n层正向运算得到第n层正向运算结果;根据所述第n层正向运算结果获得第n层输出数据梯度;根据所述反向运算指令对所述第n层输出数据梯度执行n层反向运算得到所述训练模型的权值梯度;所述训练模型的权值梯度包括n层中每层的权值梯度;
所述压缩单元对所述训练模型的权值梯度进行处理,以对应得到处理后的权值梯度;
所述计算装置依据所述处理后的权值梯度对所述训练模型的权值进行更新,以完成训练。
第三方面,提供一种计算装置,所述计算装置包括用于执行上述第一方面或第二方面的方法的硬件单元。
第四方面,提供一种计算机可读存储介质,其存储用于电子数据交换的计算机程序,其中,所述计算机程序使得计算机执行第一方面或第二方面提供的方法。
第五方面,提供了一种芯片,所述芯片包括如上第三方面提供的计算装置。
第六方面,提供了一种芯片封装结构,所述芯片封装结构包括如上第五方面提供的芯片。
第七方面,提供了一种板卡,所述板卡包括如上第六方面提供的芯片封装结构。
第八方面,提供了一种电子设备,所述电子设备包括如上第七方面提供的板卡。
在一些实施例中,所述电子设备包括数据处理装置、机器人、电脑、打印机、扫描仪、平板电脑、智能终端、手机、行车记录仪、导航仪、传感器、摄像头、服务器、云端服务器、相机、摄像机、投影仪、手表、耳机、移动存储、可穿戴设备、交通工具、家用电器、和/或医疗设备。
在一些实施例中,所述交通工具包括飞机、轮船和/或车辆;所述家用电器包括电视、空调、微波炉、冰箱、电饭煲、加湿器、洗衣机、电灯、燃气灶、油烟机;所述医疗设备包括核磁共振仪、B超仪和/或心电图仪。
本申请提供的计算装置设置了压缩单元,将数据进行压缩处理后再运算,节省了传输资源以及计算资源,所以其具有功耗低,计算量小的优点。
附图说明
为了更清楚地说明本申请实施例中的技术方案,下面将对实施例描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图是本申请的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1a-1c是本申请实施例提供的几种计算装置的结构示意图。
图1d是本发明实施例提供的一种压缩单元的结构示意图。
图1e是本发明实施例提供的一种控制单元控制状态迁移的示意图。
图2为本发明实施例提供的一种压缩单元的局部结构示意图。
图3为本发明实施例提供的一种神经网络结构示意图。
图4为本发明实施例提供的另一种压缩单元的局部结构示意图。
图5为本发明实施例提供的另一种压缩单元的局部结构示意图。
图6a为本发明实施例提供的一种使用计算装置的数据处理方法的流程示意图。
图6b为本发明实施例提供的一种使用计算装置的数据处理方法的流程示意图。
图7a是本披露提供的一种芯片装置的结构示意图。
图7b是本披露提供的一种主处理电路的结构示意图。
图7c是本披露提供的芯片装置的数据分发示意图。
图7d为一种芯片装置的数据回传示意图。
附图中“/”表示“或”。
具体实施方式
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有作出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。
本申请的说明书和权利要求书及所述附图中的术语“第一”、“第二”、“第三”和“第四”等是用于区别不同对象,而不是用于描述特定顺序。此外,术语“包括”和“具有”以及它们任何变形,意图在于覆盖不排他的包含。例如包含了一系列步骤或单元的过程、方法、系统、产品或设备没有限定于已列出的步骤或单元,而是可选地还包括没有列出的步骤或单元,或可选地还包括对于这些过程、方法、产品或设备固有的其它步骤或单元。
在本文中提及“实施例”意味着,结合实施例描述的特定特征、结构或特性可以包含在本申请的至少一个实施例中。在说明书中的各个位置出现该短语并不一定均是指相同的实施例,也不是与其它实施例互斥的独立的或备选的实施例。本领域技术人员显式地和隐式地理解的是,本文所描述的实施例可以与其它实施例相结合。
本申请提供一种使用计算装置进行数据处理的方法,该方法应用于计算装置中。下面阐述所述计算装置涉及的相关实施例,请参见图1a是本申请一种可能的计算装置的结构示意图。如图1a所示,该计算装置包括:压缩单元101、存储单元102、指令控制单元107以及运算单元108。可选的,如图1b所述计算装置还可包括第一输入缓存单元105、第二输入缓存单元106。进一步可选的,所述计算装置还可包括直接存储访问(Direct Memory Access,DMA)单元103、指令缓存单元104以及输出缓存单元109。下面结合图1a和1b示出的计算装置,具体阐述本申请实施例。其中,
在一种可能的实施例中,所述存储单元102用于存储输入数据、运算指令(具体可包括但不限于神经网络运算指令、非神经网络运算指令、加法指令、卷积指令等等)、处理后的输入数据、输入数据的位置关系数据、运算结果以及其他神经网络运算中产生的中间数据等等,本申请不做限定。所述输入数据包括但不限于输入权值和输入神经元,且所述输入数据的数量本申请也不做限定,即所述输入数据包括至少一个输入权值和/或至少一个输入神经元。
所述输入数据的位置关系数据用于表征输入数据的位置,例如输入数据为一个矩阵A,则以数据Aij为例,该数据Aij的位置信息为位于矩阵A中的第i行第j列。
可选的本申请中,所述输入数据的位置关系数据还可表示所述输入数据中绝对值大于或等于预设阈值的输入数据的位置关系,该输入数据可为输入神经元或输入权值。其中,所述预设阈值可为用户侧或设备侧自定义设置的阈值,例如0.2、0.5等等。所述输入数据的位置关系数据可用直接索引或者步长索引的方式表示,具体将在后文进行详细阐述。
以直接索引为例,假设输入数据为矩阵
Figure PCTCN2019075975-appb-000001
预设阈值为0.5,则所述输入数据的位置关系数据为
Figure PCTCN2019075975-appb-000002
在又一种可能的实施例中,所述存储单元102用于存储训练模型、训练样本、训练指令、运算单元产生的中间数据以及运算结果、压缩单元产生的处理后的数据以及位置关系数据等等,本申请不做限定。所述训练模型包括但不限于以下中的任一项或多项的组合:神经网络模型、非神经网络模型以及其他装置侧或用户侧自定义设置的数学模型等等。
所述训练样本包括用于对所述训练模型进行训练时所使用的训练输入数据(即输入神经元)。可选的,所述存储单元还存储有对所述训练模型进行训练时所使用的训练输出数据(即与所述输入数据对应的结果数据)。所述训练样本以及训练输出结果具体为装置侧获取的真实数据。关于所述存储单元中存储的各种数据将在下文进行详述,这里不过需叙述。
位置关系数据用于表征数据的位置,也可用于表征绝对值大于或等于预设阈值的数据的位置。其中,为方便描述,本申请下文将输入数据和训练样本统称为数据。在一些情况下,下文涉及的数据可为输入数据,或者可为训练样本。
可选实施例中,本申请为节省存储空间,所述计算装置可根据所述数据的位置关系数据来存储所述数据。具体的,缓存该数据的位置关系数据以及缓存绝对值大于或等于预设阈值的数据,其他位置的数据默认为是:绝对值小于该预设阈值的数据,或默认为0,本申请不做限定。即是本申请采用稠密方式来存储数据,例如在第一/二缓存单元中以稠密方式缓存数据,所述数据包括但不限于输入神经元或输入权值等等。
所述直接存储访问DMA单元103用于在上述存储单元102与上述指令缓存单元104、上述映射单元101、上述第一输入缓存单元105和上述输出缓存单元109之间进行数据读写。
例如,DMA单元103从所述存储单元102中读取运算指令,并将该运算指令发送给指令控制单元107,或缓存至指令缓存单元104等。或者,DMA单元103从所述存储单元102中读取训练指令,并将该训练指令发送给指令控制单元107,或缓存至指令缓存单元104等。
又如,DMA单元103还可从所述存储单元102中读取输入权值或处理后的输入权值,以发送至第一输入存储单元105或第二输入存储单元106中进行缓存。相应地,DMA单元103还可从所述存储单元102中读取输入神经元或处理后的输入神经元,以发送至第一输入存储单元105或第二输入存储单元106中。其中,所述第一输入存储单元105和第二输入存储单元106中缓存的数据不同,例如第一输入缓存单元105存储有输入神经元或处理后的输入神经元,则第二输入缓存单元106中存储输入权值或处理后的权值;反之亦然。
所述指令缓存单元104用于缓存运算指令,或者用于缓存训练指令。所述第一输入缓存单元用于缓存第一缓存数据,所述第二输入缓存数据用于缓存第二缓存数据。所述第一缓存数据和第二缓存数据不同,关于所述第一缓存数据和第二缓存数据可参见前文所述。例如,如果所述第一缓存数据为处理后的输入权值,则第二缓存数据可为未处理的输入神经元,也可为处理后的输入神经元等等。
在一些实施例中,所述指令控制单元107可用于从所述指令缓存单元或存储单元中获取运算指令,进一步地可将所述运算指令译码为相应地的微指令,以便所述计算单元中的相关部件能够识别和执行。例如,本申请中指令控制单元可将运算单元译码为第一微指令和第二微指令。其中,第一微指令用于指示所述压缩单元采用所述第一微指令所指示的处理方式进行对应的数据处理。第二微指令用于指示所述运算单元执行所述第二微指令对应的运算处理,例如乘法运算、卷积运算等等。
在另一些实施例中,所述指令控制单元107可用于从所述指令缓存单元或存储单元中获取训练指令(具体可为正向运算指令和/或反向运算指令),进一步地可将该训练指令译码为相应地的微指令,以便 所述计算单元中的相关部件能够识别和执行。例如,本申请中指令控制单元可将正向运算指令译码为第一微指令和第二微指令。其中,第一微指令用于指示所述压缩单元采用所述第一微指令所指示的处理方式进行对应的数据处理。第二微指令用于指示所述运算单元执行所述第二微指令对应的运算处理,例如乘法运算、卷积运算等等。
所述输出缓存单元可用于缓存所述运算单元输出的运算结果。所述运算单元用于根据指令控制单元发送的指令进行相应的数据运算处理,以获得运算结果。所述压缩单元用于对数据进行压缩处理,以降低数据维度,减少运算单元中的数据运算量,提高数据处理效率。可选的,所述运算单元可将该运算结果存储到上述输出缓存单元109中,该输出缓存单元109通过上述直接存储访问单元103将该运算结果存储到上述存储单元102中。
在可选实施例中,所述计算装置还可包括预处理模块110,具体如图1c所示。所述预处理模块可用于对数据进行预处理,以获得预处理后的数据。相应地,可将处理后的数据存储至存储单元中。例如本申请中,所述存储单元中缓存的输入数据即可为经过该预处理模块处理后的输入数据等。所述预处理包括但不限于以下处理中的任一项或多项的组合:高斯滤波、二值化、归一化、正则化、异常数据筛选等等,本申请不做限定。
在可选实施例中,本申请中的第一输入缓存单元和第二输入缓存单元均可被拆分为两个缓存单元,其中一个缓存单元用于存储输入权值或输入神经元,另一个缓存单元对应用于存储输入权值的位置关系数据或输入神经元的位置关系数据等等,本申请图未示。
在可选实施例中,所述压缩单元101的设计位置本申请并不做限定。如图1b所示,所述压缩单元可放在所述第一输入缓存单元和第二输入缓存单元的后面。可选的,所述压缩单元可放在所述第一输入缓存单元和第二输入缓存单元的前面,位于DMA单元的后面。可选的,所述压缩单元还可放在所述运算单元内等等,不做限定。
结合前述图1a-图1c所述的计算装置,下面介绍所述计算装置各部件涉及的几个具体实施例。
第一个:
如图1d是本发明提供的一种压缩单元的结构示意图。如图1d,所述压缩单元101可包括以下中的任一项或多项的组合:剪枝单元201、量化单元202以及编码单元203。可选的,所述压缩单元101还可包括控制单元204。其中,所述剪枝单元201具体用于对接收的数据进行剪枝处理,关于所述剪枝处理的具体实现将在下文详述。所述量化单元202具体用于对接收的数据进行量化处理,关于所述量化处理的具体实现将在下文详述。所述编码单元203具体用于对接收的数据进行编码处理,关于所述编码处理的具体实现将在下文详述。所述控制单元204具体用于根据接收指令的指示利用上述三个单元中的至少一个单元完成对应的数据的处理方式,具体实现将在下文详述。
下面阐述所述压缩单元和所述运算单元涉及的具体实施例。具体的,所述压缩单元101在接收到所述指令控制单元107发送的第一微指令后,可获取输入数据,根据所述第一微指令所指示的处理方式对所述输入数据进行处理,以获得处理后的输入数据。
其中,所述处理方式包括但不限于以下中的任一项或多项的组合:剪枝处理、量化处理、编码处理或者其他用于降低数据维度或减少数据量的处理方式,本申请不做限定。
具体的,如果所述处理方式为剪枝处理,则所述压缩单元可采用剪枝单元将绝对值大于或等于第一阈值的输入数据进行保留,将绝对值小于第一阈值的输入数据进行删除,从而获得处理后的输入数据。 所述输入数据包括但不限于至少一个输入神经元、至少一个输入权值等。所述第一阈值为用户侧或设备侧自定义设置的,例如0.5等。
例如,输入数据为向量P(0.02,0.05,1,2,0.07,2.1,0.89),如果所述第一微指令指示所述压缩单元采用剪枝处理来处理数据,则所述压缩单元将对向量P进行剪枝处理,删除向量P中绝对值小于0.5的数据,从而获得处理后的输入数据(即处理后的向量P)为(0,0,1,2,0,2.1,0.89)。
如果所述处理方式为量化处理,则所述压缩单元可采用量化单元将所述输入数据进行聚类和量化,从而获得处理后的输入数据。其中所述量化是指将原数据量化至接近该原数据的新数据,该新数据可为用户侧自定义设置的,或者保证所述原数据和新数据之间的误差小于预设值,例如将0.5量化至1等等。所述量化处理方式具体所采用的量化和聚类算法本申请不做限定,例如采用K-means算法对神经网络模型中的每一层输入权值或输入神经元进行聚类等。
例如输入数据为矩阵
Figure PCTCN2019075975-appb-000003
采用量化处理方式后可获得处理后的输入数据,例如可为
Figure PCTCN2019075975-appb-000004
如果所述处理方式为编码处理,则所述压缩单元可通过编码单元采用预设编码格式对所述输入数据进行编码,从而获得处理后的输入数据。所述预设编码格式为用户侧或设备侧自定义设置的,其可包括但不限于霍夫曼huffman编码、非归零码、曼切斯特编码等等。
本申请中,当所述第一微指令所指示的处理方式包括多种处理方式时,所述多种处理方式的执行顺序本申请可以不做限定,也可在所述第一微指令中定义。即,所述第一微指令中指示有所述压缩单元所需采用的处理方式以及所述处理方式的执行顺序等。
例如,所述第一微指令所指示的处理方式包括:剪枝处理、量化处理以及编码处理,如果所述第一微指令指示有上述三种处理方式的执行顺序,则所述压缩单元将按照所述第一微指令所指示的执行顺序以及处理方式,对所述输入数据进行处理。如果所述第一微指令中没有规定所述处理方式的执行顺序,则所述压缩单元可按照任意执行顺序执行上述三种处理方式,对所述输入数据进行处理,以获得处理后的输入数据。
在所述压缩单元内部,为支持多种处理方式的数据处理,设计了控制单元来控制各种处理方式间的迁移。具体的,控制单元可利用控制单元的概念,提出多种运算状态,每一种运算状态对应一种处理方式。例如,用第一状态表示控制单元的初始状态,不进行任何数据处理;用第二状态表示/指示进行数据的剪枝处理;用第三状态指示进行数据的量化处理,第四状态指示进行数据的编码处理。可选的,还可用第五状态指示依次执行数据的剪枝、量化以及编码处理,用第六状态指示依次执行数据的剪枝和量化状态,用第七状态指示依次执行数据的剪枝和编码处理等等,所述运算状态的数量以及所指示的数据的处理方式均可由用户侧或装置侧自定义设置的,本申请并不做限定。
下面以所述运算状态包括四个状态为例,阐述所述控制单元实现多种处理方式的具体实施。具体的,所述运算状态具体可为第一状态至第四状态。所述状态可用预设数值或预设数值表示,例如第一状态用00表示,第二状态用01表示,第三状态用10表示,第四状态用11表示。如图1e示出利用控制单元实现各种状态之间的迁移示意图。在实际应用中,所述控制单元具体可为控制器、累加器、或者其他用于指示多种数据处理方式的物理元器件,本申请不做限定。
其中,所述运算状态与所述处理方式关联。具体的,本申请中,所述第一状态00可表示为初始状态,不进行任何数据处理。第二状态01与所述剪枝处理关联,用于表示所述压缩单元可采用剪枝处理方式进行数据的处理。第三状态10与所述量化处理关联,用于表示所述压缩单元可采用量化处理方式 进行数据的处理。第四状态11与所述编码处理关联,用于表示所述压缩单元可采用预设编码格式进行数据的编码处理。
应理解的,在所述压缩单元进行数据处理之前,所述控制单元处于第一状态(即初始状态);当所述压缩单元接收到所述第一微指令后,可根据所述第一微指令所指示的处理方式利用控制单元重新设置该控制单元的状态,使得所述压缩单元处于该状态下完成所述处理方式所对应的数据处理。即是,本申请中在所述压缩单元内部可通过所述控制单元重新设置/修改所述运算状态,以在所述压缩单元中完成所述处理方式对应的数据处理。
例如,所述第一微指令用于指示在所述压缩单元中依次采用剪枝处理、量化处理以及编码处理进行数据的处理。相应地所述压缩单元接收到所述第一微指令后,利用控制单元(如累加器)将第一(初始)状态00设置为第二状态01,在所述处理单元完成所述剪枝处理(即根据剪枝处理,对所述输入数据进行处理后),可利用该控制单元将第二状态01设置为第三状态10。相应地,在所述压缩单元完成所述量化处理后,可通过控制单元将第三状态10设置为第四状态11。相应地,在所述压缩单元完成所述第四状态关联的编码处理后,可通过控制单元将所述第四状态11设置为第一(初始)状态00,可结束流程。在本例中,如果所述控制单元为累加器,在状态改变过程中,通过累加器依次加1可实现四个状态之间的相互迁移,在完成第四状态后可将累加器重置为00,即恢复至初始状态。
需要说明的,所述控制单元涉及的运算状态的数量本申请不做限定,其取决于数据的处理方式。例如,当运算指令所指示的数据处理方式包括剪枝处理和量化处理,则所述控制单元涉及的运算状态可对应包括3个状态,上例中对应为第一状态至第三状态。相应地,所述计算装置接收所述运算指令后,根据所述控制单元的控制,依次完成所述第二状态和第三状态所指示的剪枝处理和量化处理,在数据处理完成后,可将所述控制单元的运算状态设置为第一状态(初始状态)等。
需要说明的,本申请所述压缩单元对输入数据的处理,具体可为对输入权值和/或输入神经元的处理,以减少运算单元中对输入权值或者输入神经元的计算量,从而提高数据处理效率。
相应地,在所述运算单元中可根据接收的第二微指令对所述处理后的输入数据执行相应地的运算处理,以获得运算结果。具体存在以下几种具体实施方式。
在一种实施方式中,当所述处理后的输入数据包括处理后的输入神经元时,所述运算单元可根据第二微指令对所述处理后的输入神经元和输入权值执行相应地的运算操作,从而获得运算结果。
在又一种实施方式中,当所述处理后的输入数据包括处理后的输入权值时,所述运算单元可根据第二微指令对所述处理后的输入权值以及输入神经元执行相应地的运算操作,以获得运算结果。
在又一种实施方式中,当所述处理后的输入数据包括处理后的输入神经元和处理后的输入权值,则所述运算单元可根据第二微指令对所述处理后的输入权值和处理后的输入神经元执行相应的运算操作,以获得运算结果。
可选的,所述压缩单元可将该运算结果存储到上述输出缓存单元109中,该输出缓存单元109通过上述直接存储访问单元103将该运算结果存储到上述存储单元102中。
需要指出的是,上述指令缓存单元104、上述第一输入缓存单元105、上述第二输入缓存单元106和上述输出缓存单元109均可为片上缓存。
进一步地,上述运算单元108包括但不限定于三个部分,分别为乘法器、一个或多个加法器(可选地,多个加法器组成加法树)和激活函数单元/激活函数运算器。上述乘法器将第一输入数据(in1)和第二输入数据(in2)相乘得到第一输出数据(out1),过程为:out1=in1*in2;上述加法树将第三输入数据(in3)通过加法树逐级相加得到第二输出数据(out2),其中in3是一个长度为N的向量,N大于1,过称为:out2=in3[1]+in3[2]+...+in3[N],和/或将第三输入数据(in3)通过加法树累加之后得到的结果和第四输入数据(in4)相加得到第二输出数据(out2),过程为:out2=in3[1]+in3[2]+...+in3[N]+in4,或者将第三输入数据(in3)和第四输入数据(in4)相加得到第二输出数据(out2),过称为:out2=in3+in4;上述激活函数单元将第五输入数据(in5)通过激活函数(active)运算得到第三输出数据(out3),过程为:out3=active(in5),激活函数active可以是sigmoid、tanh、relu、softmax等函数,除了做激活操作,激活函数单元可以实现其他的非线性函数运算,可将输入数据(in)通过函数(f)运算得到输出数据(out),过程为:out=f(in)。
上述运算单元108还可以包括池化单元,池化单元将输入数据(in)通过池化运算得到池化操作之后的输出数据(out),过程为out=pool(in),其中pool为池化操作,池化操作包括但不限于:平均值池化,最大值池化,中值池化,输入数据in是和输出out相关的一个池化核中的数据。
第二个:
所述计算装置包括运算单元、指令控制单元以及压缩单元;所述计算装置获取训练样本、训练模型;所述训练模型为:神经网络训练模型和/或非神经网络训练模型,所述训练模型包括n层结构,n为正整数;
所述指令控制单元用于获取训练指令,根据所述训练指令得到正向运算指令和反向运算指令;将所述正向运算指令和所述反向运算指令分别发送给所述运算单元;
所述运算单元用于根据所述正向运算指令对所述训练样本进行n层正向运算得到第n层正向运算结果;根据所述第n层正向运算结果获得第n层输出数据梯度;
所述运算单元还用于根据所述反向运算指令对所述第n层输出数据梯度执行n层反向运算得到所述训练模型的权值梯度;所述训练模型的权值梯度包括n层中每层的权值梯度;
所述压缩单元用于对所述训练模型的权值梯度进行处理,以对应得到处理后的权值梯度;
所述计算装置用于依据所述处理后的权值梯度对所述训练模型的权值进行更新,以完成训练。
具体的,所述计算装置可先获取训练样本以及训练模型,可选的还可获取训练时所用的训练输出数据。所述训练模型包括神经网络模型和/或非神经网络模型,所述训练模型包括有n层结构。以所述训练模型为神经网络模型为例,该神经网络模型包括n个神经网络层,即n层。该神经网络层包括但不限于卷积层、激活层、池化层等等功能层,本申请不做详述和限定。
进一步地,所述指令控制单元可从所述存储单元中获取训练指令,并解析所述训练指令获得正向运算指令和反向运算指令,并将它们分别发送给所述运算单元。
相应地,所述运算单元接收所述正向运算指令后,可利用所述正向运算指令对所述训练样本(即训练时使用的输入神经元)进行n层正向运算,从而获得正向运算结果。其中,所述正向运算包括n层正向运算中各层的正向运算结果,当然包括第n层正向运算结果、第n-1层正向运算结果…以及第1层正向运算结果。具体实现中存在以下两种具体实施方式:
第一种实施方式中,所述正向运算指令包括第一微指令和第二微指令。具体的,所述指令控制单元接收所述正向运算指令后,可对该正向运算指令进行译码以获得所述第一微指令和第二微指令。相应地,可将所述第一微指令以及第二微指令发送给所述运算单元;或者将所述第一微指令发送给所述压缩单元,将所述第二微指令发送给所述运算单元。其中,所述第一微指令用于指示所述压缩单元对所述训练样本进行处理,以获得处理后的训练样本;所述第二微指令用于指示所述运算单元对所述处理后的训练样本进行n层正向运算,以获得正向运算结果。
相应地,所述计算装置根据所述第一微指令的指示调用/通过所述压缩单元先对所述训练样本进行处理,以获得处理后的训练样本。然后,所述运算单元根据所述第二微指令的指示,对所述处理后的训练样本进行n层正向运算,以得到正向运算结果。
第二种实施方式中,所述正向运算指令用于指示所述运算单元对所述训练样本进行n层正向运算以获得正向运算结果。关于所述正向运算结果具体可参见上文所述,这里不再赘述。
可选的,所述运算单元还可根据所述正向运算结果(即n层中每层的正向运算结果)获得n层中每层的输出数据梯度。以第n层正向运算结果为例,所述运算单元可根据所述第n层正向运算结果,获得第n层输出数据梯度。例如,所述运算单元对所述第n层正向运算结果进行预设比例的缩小或者乘以预设百分比等等,以得到所述第n层输出数据梯度。
应理解的,在正向运算过程中,前一层的输出数据梯度作为下一层的输入数据梯度,例如第n-1层输出数据梯度即为第n层输入数据梯度,所述运算单元可利用该第n层输入数据梯度进行第n层正向运算,以获得第n层正向运算结果。可选,在正向运算过程中,也可使用未处理的前一层的正向运算结果来计算下一层的正向运算结果。即前一层的正向运算结果可作为下一层的输入数据(训练样本),例如第n-1层正向运算结果即为第n层的输入数据,所述运算单元利用该第n层的输入数据进行第n层正向运算,对应获得第n层正向运算结果。
进一步地,所述运算单元可接收反向运算指令,根据所述反向运算指令对所述第n层输出数据梯度执行n层反向运算,从而获得所述训练模型的权值梯度,所述训练模型的权值梯度包括n层结构中每层的权值梯度。进一步地,所述计算装置还可调用所述压缩单元对所述训练模型的权值梯度进行处理,以对应获得所述训练模型中处理后的权值梯度。相应地,所述计算装置可利用所述处理后的权值梯度来对应更新所述训练模型的权值,以完成所述训练模型的训练。具体存在以下两种具体实施方式:
第一种实施例方式中,所述运算单元接收所述反向运算指令后,对第n层输出数据梯度执行n层反向运算得到所述训练模型的权值梯度;所述训练模型的权值梯度包括n层中每层的权值梯度;然后,调用所述压缩单元对所述训练模型的权值梯度(即n层的权值梯度)进行处理,以得到处理后的权值梯度,即处理后n层的权值梯度。进一步地,所述计算装置利用所述处理后的n层的权值梯度,对应更新所述训练模型中n层的权值,从而完成所述训练模型的训练。
具体实现时,所述运算单元获取所述反向运算指令后,可获取第n层输出数据梯度以及训练输出数据;然后,根据所述反向运算指令对所述第n层输出数据梯度以及训练输出结果执行第n层反向运算得到所述训练模型的权值梯度。
可理解的,本实施例中,所述计算装置调用一次所述压缩单元,即可完成对所述训练模型的权值梯度进行处理,以利用处理后的权值梯度来更新所述训练模型,可节省训练时间,提升数据处理效率。
在可选实施例中,上述实施方式在具体实施时,所述反向运算指令可包括两个微指令,例如第三微指令和第四微指令。所述指令控制单元在获取所述反向运算指令后,可对该反向运算指令进行译码得到第三微指令和第四微指令;相应地可将所述第三微指令和第四微指令发送给所述运算单元;或者,将所述第三微指令发送给所述运算单元,将所述第四微指令发送给所述压缩单元。其中,这里的第三微指令用于指示所述运算单元计算所述训练模型中n层的权值梯度;第四微指令用于指示所述压缩单元对所述训练模型中n层的权值梯度进行处理。
相应地,所述运算单元可根据第三微指令的指示,计算所述训练模型的权值梯度(即n层的权值梯度);然后,根据所述第四微指令的指示调用/通过所述压缩单元对所述训练模型的权值梯度进行处理,以利用处理后的权值梯度来对应更新所述训练模型。关于所述运算单元和所述运算单元的具体实施可参见第一种实施中的相关阐述,这里不再赘述。
第二种实施方式中,所述运算单元接收所述反向运算指令后,依据该反向运算指令依次获取第n层输出数据梯度,第n-1层输出数据梯度…第1层输出数据梯度;相应地,按照层次数先对该第n层输出数据梯度执行第n层反向运算得到第n层的权值梯度,然后对第n-1层输出数据梯度执行第n-1层反向运算得到第n-1层的权值梯度,依次类推,可对第1层输出数据梯度执行第1层反向运算得到第1层的权值梯度。在计算每一层的权值梯度时,所述运算单元可对该层输出权值梯度以及该层的训练输出结果执行反向运算,以得到该层的权值梯度。以计算第n层的权值梯度为例,所述运算单元可根据第n层输出数据梯度以及第n层的训练输出数据(即前文所述的训练输出数据)执行反向运算,得到第n层的权值梯度。
进一步地,在所述运算单元计算得到每一层的权值梯度后,所述计算装置都可对应调用所述压缩单元对该层的权值梯度进行处理,以对应得到处理后的该层的权值梯度;相应地,所述计算装置可利用所述处理后的该层的权值梯度对应更新所述训练模型中该层的权值,直至更新完所述训练模型中每层的权值,以完成所述训练模型的训练。
例如,所述运算单元在计算得到第n层的权值梯度后,可调用所述压缩单元对所述第n层的权值梯度进行处理,以得到处理后的第n层的权值梯度;进一步地,所述计算装置可用所述处理后的第n层的权值梯度,来更新所述训练模型中第n层的权值。相应地,所述运算单元在计算得到第n-1层的权值梯度后,可调用/通过所述压缩单元对所述第n-1层的权值梯度进行处理,以得到处理后的第n-1层的权值梯度;进一步地,所述计算装置可用所述处理后的第n-1层的权值梯度,来更新所述训练模型中第n-1层的权值。依次类推,所述计算装置可采用n次所述压缩单元对计算的n层的权值梯度进行处理,以对应得到处理后的n层的权值梯度,然后依据该处理后的n层的权值梯度,来对应更新所述训练模型中n层的权值,从而完成了所述训练模型的训练。
应理解的,所述运算单元计算获得n层的权值梯度,以及利用压缩单元对n层的权值梯度进行处理。这两个单元在具体实施时,可以采用并行或串行的执行方式进行执行。例如,以串行方式为例,所述运算单元在计算到第n层的权值梯度后,需得到压缩单元对该第n层的权值梯度进行处理得到处理后的第n层的权值梯度、以及计算装置利用该处理后的第n层的权值梯度更新了所述训练模型中第n层的权值后,进一步所述运算单元才计算第n-1层的权值梯度。
以并行方式为例,所述运算单元和所述压缩单元各自按照层次序进行对应处理,两个单元互不影响。例如,所述运算单元计算获得第n层的权值梯度后,可接着计算第n-1层的权值梯度。依次类推计算第1层的权值梯度。所述压缩单元在获取第n层的权值梯度后,可对该第n层的权值梯度进行处理,此时所述运算单元无需在得到所述压缩单元在对第n层的权值梯度进行处理后,再计算第n-1层的权值梯度。可理解的,采用并行方式,可提高计算装置的数据处理效率。
在可选实施例中,上述实施方式在具体实施时,所述反向运算指令包括第三微指令和第四微指令。所述指令控制单元对所述反向运算指令进行译码获得所述第三微指令和所述第四微指令。相应地,将所述第三微指令和第四微指令发送给所述运算单元;或者,将所述第三微指令发送给所述运算单元,将第四微指令发送给所述压缩单元。其中,第三微指令用于指示所述运算单元按照层次序计算n层的权值梯度;所述第四微指令用于指示所述压缩单元按照层次数对计算的n层的权值梯度进行处理,以对应获得处理后的n层的权值梯度。
相应地,所述运算单元接收所述第三微指令后,可根据层次序依次计算n层中每层的权值梯度;然后根据第四微指令的指示,调用所述压缩单元对每层的权值梯度进行处理,以对应得到处理后的每层的权值梯度。关于所述运算单元和所述压缩单元可参见前述第一个实施例中的相关阐述,这里不再赘述。
在一些实施例中,所述压缩单元101在接收到目标微指令后,可获取输入数据,根据所述目标微指令所指示的数据处理方式对所述输入数据进行处理,以获得处理后的输入数据。其中,所述输入数据可为前述实施例中的训练样本、所述运算单元计算获得的n层的权值梯度(具体可为第n层的权值梯度、第n-1层的权值梯度….第1层的权值梯度)。所述目标微指令可为前述实施例中的第一微指令或第四微指令等,用于指示所述压缩单元对所述训练模型中的某一层或n层的权值梯度进行处理,这里不再详述。
在可选实施例中,本发明上文涉及的剪枝处理过程中,所述压缩单元还可根据位置关系数据,对所述输入数据进行处理,以获得所述处理后的输入数据。具体存在以下几种具体实施方式。
在一种具体实施方式中,所述输入数据包括第一输入数据和第二输入数据。具体如图2,上述压缩单元101包括:
第一稀疏处理单元1011,用于对第二输入数据进行处理,以得到第三输出数据和第二输出数据,并将所述第三输出数据传输至第一数据处理单元1012。
第一数据处理单元1012,用于接收第一输入数据和接收所述第三输出数据,并根据上述第三输出数据和第一输入数据输出第一输出数据。
其中,当所述第一输入数据包括至少一个输入神经元,所述第二输入数据包括至少一个权值时,所述第一输出数据为处理后的输入神经元,所述第二输出数据为处理后的权值,所述第三输出数据为权值的位置关系数据;当所述第一输入数据包括至少一个权值,所述第二输入数据包括至少一个输入神经元时,所述第一输出数据为处理后的权值,所述第二输出数据为处理后的输入神经元,所述第三输出数据为输入神经元的位置关系数据。
具体地,当上述第二输入数据为权值时,且权值的形式为wij,该wij表示第i个输入神经元与第j个输出神经元之间的权值;上述第一稀疏处理单元1011根据权值确定上述位置关系数据(即上述第三输出数据),并将上述权值中绝对值小于或者等于第二阈值的权值删除,得到处理后的权值(即上述第二输出数据);当上述第二输入数据为输入神经元时,上述第一稀疏处理单元1011根据输入神经元得到位置关系数据,并将该输入神经元中的绝对值小于或等于上述第一阈值的输入神经元删除,以得到处理后的输入神经元。
可选地,上述第一阈值可为0.1、0.08、0.05、0.02、0.01、0或者其他值。上述第二阈值可为0.1、0.08、0.06、0.05、0.02、0.01、0或者其他值。需要指出的是,上述第一阈值和上述第二阈值可以一致,也可以不一致。
其中,上述位置关系数据可以步长索引或者直接索引的形式表示。
具体地,以直接索引形式表示的位置关系数据为由0和1组成的字符串,当上述第二输入数据为权值时,0表示该权值的绝对值小于或者等于上述第二阈值,即该权值对应的输入神经元与输出神经元之间没有连接,1表示该权值的绝对值大于上述第二阈值,即该权值对应的输入神经元与输出神经元之间有连接。以直接索引形式表示的位置关系数据有两种表示顺序:以每个输出神经元与所有输入神经元的连接状态组成一个0和1的字符串来表示权值的连接关系;或者每个输入神经元与所有输出神经元的连接状态组成一个0和1的字符串来表示权值的连接关系。当上述第二输入数据为输入神经元时,0表示该输入神经元的绝对值小于或者等于上述第一阈值,1表示该输入神经元的绝对值大于上述第一阈值。
当上述第二输入数据为权值时,以步长索引形式表示的位置关系数据为与输出神经元有连接的输入神经元与上一个与该输出神经元有连接的输入神经元之间的距离值组成的字符串;当上述第二输入数据为输入神经元时,以步长索引表示的数据以当前绝对值大于上述第一阈值的输入神经元与上一个绝对值大于上述第一阈值的输入神经元之间的距离值组成的字符串表示。
举例说明,假设上述第一阈值和上述第二阈值均为为0.01,参见图3,图3为本发明实施例提供的一种神经网络的示意图。如图3中的a图所示,上述第一输入数据为输入神经元,包括输入神经元i1、 i2、i3和i4,上述第二输入数据为权值。对于输出神经元o1,权值为w11,w21,w31和w41;对于输出神经元o2,权值w12,w22,w32和w42,其中权值w21,w12和w42的值为0,其绝对值均小于上述第一阈值0.01,上述第一稀疏处理单元1011确定上述输入神经元i2和输出神经元o1没有连接,上述输入神经元i1和i4与输出神经元o2没有连接,上述输入神经元i1、i3和i4与上述输出神经元o1有连接,上述输入神经元i2和i3与输出神经元o2有连接。以每个输出神经元与所有输入神经元的连接状态表示上述位置关系数据,则上述输出神经元o1的位置关系数据为“1011”,输出神经元o2的位置关系数据为“0110”(即上述位置关系数据为“10110110”);以每个输入神经元与所有输出神经元的连接关系,则输入神经元i1的位置关系数据为“10”,输入神经元i2的位置关系数据为“01”,输入神经元i3的位置关系数据为“11”,输入神经元i4的位置关系数据为“10”(即上述位置关系数据为“10011110”)。
对于上述输出神经元o1,上述压缩单元101将上述i1与w11,i3与w31和i4与w41分别作为一个数据集,并将该数据集存储到上述存储单元102中;对于输出神经元o2,上述压缩单元101将上述i2与w22和i3与w32分别作为一个数据集,并将该数据集存储到上述存储单元102中。
针对上述输出神经元o1,上述第二输出数据为w11,w31和w41;针对上述输出神经元o2,上述第二输出数据为w22和w32。
当上述第二输入数据为输入神经元i1、i2、i3和i4,且该输入神经元的值分别为1,0,3,5则上述位置关系数据(即上述第三输出数据)为“1011”,上述第二输出数据为1,3,5。
如图3中的b图所示,上述第一输入数据包括输入神经元i1、i2、i3和i4,上述第二输入数据为权值。对于输出神经元o1,权值为w11,w21,w31和w41;对于输出神经元o2,权值w12,w22,w32和w42,其中权值w21,w12和w42的值为0,上述稀疏处理单元1011确定上述输入神经元i1、i3和i4与上述输出神经元o1有连接,上述输入神经元i2和i3与输出神经元o2有连接。上述输出神经元o1与输入神经元之间的位置关系数据为“021”。其中,该位置关系数据中第一个数字“0”表示第一个与输出神经元o1有连接的输入神经元与第一个输入神经元之间的距离为0,即第一个与输出神经元o1有连接的输入神经元为输入神经元i1;上述位置关系数据中第二个数字“2”表示第二个与输出神经元o1有连接的输入神经元与第一个与输出神经元o1有连接的输入神经元(即输入神经元i1)之间的距离为2,即第二个与输出神经元o1有连接的输入神经元为输入神经元i3;上述位置关系数据中第三个数字“1”表示第三个与输出神经元o1有连接的输入神经元与第二个与该输出神经元o1有连接的输入神经元之间的距离为1,即第三个与输出神经元o1有连接的输入神经元为输入神经元i4。
上述输出神经元o2与输入神经元之间的位置关系数据为“11”。其中,该位置关系数据中的第一数字“1”表示第一个与输出神经元o2有连接的输入神经元与第一个输入神经元(即输入神经元i1)之间的距离为1,即该第一个与输出神经元o2有连接关系的输入神经元为输出神经元i2;上述位置关系数据中的第二数字“1”表示第二个与输出神经元o2有连接的输入神经元与第一个与输出神经元o2有连接的输入神经元的距离为1,即第二个与输出神经元o2有连接的输入神经元为输入神经元i3。
对于上述输出神经元o1,上述压缩单元101将上述i1与w11,i3与w31和i4与w41分别作为一个数据集,并将该数据集存储到上述存储单元102中;对于输出神经元o2,上述压缩单元101将上述i2与w22和i3与w32分别作为一个数据集,并将该数据集存储到上述存储单元102中。
针对上述输出神经元o1,上述第二输出数据为w11,w31和w41;针对上述输出神经元o2,上述第二输出数据为w22和w32。
当上述第二输入数据为输入神经元i1、i2、i3和i4,且该输入神经元的值分别为1,0,3,5则上述位置关系数据即上述第三输出数据为“021”,上述第二输出数据为1,3,5。
当上述第一输入数据为输入神经元时,则上述第二输入数据为权值,上述第三输出数据为输出神经元与上述输入神经元之间的位置关系数据。上述第一数据处理单元1012接收到上述输入神经元后,将该输入神经元中绝对值小于或等于上述第二阈值的输入神经元剔除,并根据上述位置关系数据,从剔除后的输入神经元中选择与上述权值相关的输入神经元,作为第一输出数据输出。
举例说明,假设上述第一阈值为0,上述输入神经元i1、i2、i3和i4,其值分别为1,0,3和5,对于输出神经元o1,上述第三输出数据(即位置关系数据)为“021”,上述第二输出数据为w11,w31和w41。上述第一数据处理单元1012将上述输入神经元i1、i2、i3和i4中值为0的输入神经元剔除,得到输入神经元i1、i3和i4。该第一数据处理单元1012根据上述第三输出数据“021”确定上述输入神经元i1、i3和i4均与上述输出神经元均有连接,故上述数据处理单元1012将上述输入神经元i1、i3和i4作为第一输出数据输出,即输出1,3,5。
当上述第一输入数据为权值,上述第二输入数据为输入神经元时,上述第三输出数据为上述输入神经元的位置关系数据。上述第一数据处理单元1012接收到上述权值w11,w21,w31和w41后,将该权值中绝对值小于上述第一阈值的权值剔除,并根据上述位置关系数据,从上述剔除后的权值中选择与该上述输入神经元相关的权值,作为第一输出数据并输出。
举例说明,假设上述第二阈值为0,上述权值w11,w21,w31和w41,其值分别为1,0,3和4,对于输出神经元o1,上述第三输出数据(即位置关系数据)为“1011”,上述第二输出数据为i1,i3和i5。上述第一数据处理单元1012将上述权值w11,w21,w31和w41中值为0的输入神经元剔除,得到权值w11,w21,w31和w41。该第一数据处理单元1012根据上述第三输出数据“1011”确定上述输入神经元i1、i2,i3和i4中的输入神经元i2的值为0,故上述第一数据处理单元1012将上述输入神经元1,3和4作为第一输出数据输出。
在一种可行的实施例中,第三输入数据和第四输入数据分别为至少一个权值和至少一个输入神经元,上述压缩单元101确定上述至少一个输入神经元中绝对值大于上述第一阈值的输入神经元的位置,并获取输入神经元的位置关系数据;上述压缩单元101确定上述至少一个权值中绝对值大于上述第二阈值的权值的位置,并获取权值的位置关系数据。上述压缩单元101根据上述权值的位置关系数据和输入神经元的位置关系数据得到一个新的位置关系数据,该位置关系数据表示上述至少一个输入神经元中绝对值大于上述第一阈值的输入神经元与输出神经元之间的关系和对应的权值的值。压缩单元101根据该新的位置关系数据、上述至少一个输入神经元和上述至少一个权值获取处理后的输入神经元和处理后的权值。
进一步地,上述压缩单元101将上述处理后的输入神经元和处理后的权值按照一一对应的格式存储到上述存储单元102中。
具体地,上述压缩单元101对上述处理后的输入神经元和上述处理后的权值按照一一对应的格式进行存储的具体方式是将上述处理后的输入神经元中的每个处理后的输入神经元和与其对应的处理后的权值作为一个数据集,并将该数据集存储到上述存储单元102中。
对于压缩单元101包括第一稀疏处理单元1011和第一数据处理单元1012的情况,压缩单元101中的稀疏处理单元1011对输入神经元或者权值进行稀疏化处理,减小了权值或者输入神经元的数量,进而减小了运算单元进行运算的次数,提高了运算效率。
在又一种具体实施方式中,所述输入数据包括第一输入数据和第二输入数据。具体如图4,上述压缩单元101包括:
第二稀疏处理单元1013,用于接收到第三输入数据后,根据所述第三输入数据得到第一位置关系数据,并将该第一位置关系数据传输至连接关系处理单元1015;
第三稀疏处理单元1014,用于接收到第四输入数据后,根据所述第四输入数据得到第二位置关系数据,并将该第二位置关系数据传输至所述连接关系处理单元1015;
所述连接关系处理单元1015,用于根据所述第一位置关系数据和所述第二位置关系数据,以得到第三位置关系数据,并将该第三位置关系数据传输至第二数据处理单元1016;
所述第二数据处理单元1016,用于在接收到所述第三输入数据,所述第四输入数据和所述第三位置关系数据后,根据所述第三位置关系数据对所述第三输入数据和所述第四输入数据进行处理,以得到第四输出数据和第五输出数据;
其中,当所述第三输入数据包括至少一个输入神经元,第四输入数据包括至少一个权值时,所述第一位置关系数据为输入神经元的位置关系数据,所述第二位置关系数据为权值的位置关系数据,所述第四输出数据为处理后的输入神经元,所述第五输出数据为处理后的权值;当所述第三输入数据包括至少一个权值,所述第四输入数据包括至少一个输入神经元时,所述第一位置关系数据为权值的位置关系数据,所述第二位置关系数据为输入神经元的位置关系数据,所述第四输出数据为处理后的权值,所述第五输出数据为处理后的输入神经元。
当上述第三输入数据包括至少一个输入神经元时,上述第一位置关系数据为用于表示该至少一个输入神经元中绝对值大于上述第一阈值的输入神经元的位置的字符串;当上述第三输入数据包括至少一个权值时,上述第一位置关系数据为用于表示输入神经元与输出神经元之间是否有连接的字符串。
当上述第四输入数据包括至少一个输入神经元时,上述第二位置关系数据为用于表示该至少一个输入神经元中绝对值大于上述第一阈值的输入神经元的位置的字符串;当上述第四输入数据包括至少一个权值时,上述第二位置关系数据为用于表示输入神经元与输出神经元之间是否有连接的字符串。
需要说明的是,上述第一位置关系数据、第二位置关系数据和第三位置关系数据均可以步长索引或者直接索引的形式表示,具体可参见上述相关描述。换句话说,上述连接关系处理单元1015对上述第一位置关系数据和上述第二位置关系数据进行处理,以得到第三位置关系数据。该第三位置关系数据可以直接索引或者步长索引的形式表示。
具体地,当上述第一位置关系数据和上述第二位置关系数据均以直接索引的形式表示时,上述连接关系处理单元1015对上述第一位置关系数据和上述第二位置关系数据进行与操作,以得到第三位置关系数据,该第三位置关系数据是以直接索引的形式表示的。需要说明的是,表示上述第一位置关系数据和第二位置关系数据的字符串在内存中是按照物理地址高低的顺序存储的,可以是由高到低的顺序存储的,也可以是由低到高的顺序存储的。
当上述第一位置关系数据和上述第二位置关系数据均以步长索引的形式表示,且表示上述第一位置关系数据和第二位置关系数据的字符串是按照物理地址由低到高的顺序存储时,上述连接关系处理单元1015将上述第一位置关系数据的字符串中的每一个元素与存储物理地址低于该元素存储的物理地址的元素进行累加,得到的新的元素组成第四位置关系数据;同理,上述连接关系处理单元1015对上述第二位置关系数据的字符串进行同样的处理,得到第五位置关系数据。然后上述连接关系处理单元1015从上述第四位置关系数据的字符串和上述第五位置关系数据的字符串中,选取相同的元素,按照元素值从小到大的顺序排序,组成一个新的字符串。上述连接关系处理单元1015将上述新的字符串中每一个元素与其相邻且值小于该元素值的元素进行相减,以得到一个新的元素。按照该方法,对上述新的字串中的每个元素进行相应的操作,以得到上述第三位置关系数据。
举例说明,假设以步长索引的形式表示上述第一位置关系数据和上述第二位置关系数据,上述第一位置关系数据的字符串为“01111”,上述第二位置关系数据的字符串为“022”,上述连接关系处理单元1015将上述第一位置关系数据的字符串中的每个元素与其相邻的前一个元素相加,得到第四位置关系数据“01234”;同理,上述连接关系处理单元1015对上述第二位置关系数据的字符串进行相同的处理后得到的第五位置关系数据为“024”。上述连接关系处理单元1015从上述第四位置关系数据“01234”和上述第五位置关系数据“024”选取相同的元素,以得到新的字符串“024”。上述连接关系处理单元1015将该新的字符串中的每个元素与其相邻的前一个元素进行相减,即0,(2-0),(4-2),以得到上述第三连接数据“022”。
当上述第一位置关系数据和上述第二位置关系数据中的任意一个以步长索引的形式表示,另一个以直接索引的形式表示时,上述连接关系处理单元1015将上述以步长索引表示的位置关系数据转换成以直接索引的表示形式或者将以直接索引表示的位置关系数据转换成以步长索引表示的形式。然后上述连接关系处理单元1015按照上述方法进行处理,以得到上述第三位置关系数据(即上述第五输出数据)。
可选地,当上述第一位置关系数据和上述第二位置关系数据均以直接索引的形式表示时,上述连接关系处理单元1015将上述第一位置关系数据和上述第二位置关系数据均转换成以步长索引的形式表示的位置关系数据,然后按照上述方法对上述第一位置关系数据和上述第二位置关系数据进行处理,以得到上述第三位置关系数据。
具体地,上述第三输入数据可为输入神经元或者权值、第四输入数据可为输入神经元或者权值,且上述第三输入数据和第四输入数据不一致。上述第二数据处理单元1016根据上述第三位置关系数据从上述第三输入数据(即输入神经元或者权值)中选取与该第三位置关系数据相关的数据,作为第四输出数据;上述第二数据处理单元1016根据上述第三位置关系数据从上述第四输入数据中选取与该第三位置关系数据相关的数据,作为第五输出数据。
进一步地,上述第二数据处理单元1016将上述处理后的输入神经元中的每个处理后的输入神经元与其对应的处理后的权值作为一个数据集,将该数据集存储出上述存储单元102中。
举例说明,假设上述第三输入数据包括输入神经元i1,i2,i3和i4,上述第四输入数据包括权值w11,w21,w31和w41,上述第三位置关系数据以直接索引方式表示,为“1010”,则上述第二数据处理单元1016输出的第四输出数据为输入神经元i1和i3,输出的第五输出数据为权值w11和w31。上述第二数据处理单元1016将输入神经元i1与权值w11和输入神经元i3与权值w31分别作为一个数据集,并将该数据集存储到上述存储单元102中。
对于压缩单元101包括第二稀疏处理单元1013,第三稀疏处理单元1014、连接关系处理单元1015和第二数据处理单元1016的情况,压缩单元101中的稀疏处理单元对输入神经元和权值均进行稀疏化处理,使得输入神经元和权值的数量进一步减小,进而减小了运算单元的运算量,提高了运算效率。
在又一种具体实施方式中,所述输入数据包括至少一个输入权值或至少一个输入神经元。如图5,上述压缩单元601包括:
输入数据缓存单元6011,用于缓存所述输入数据,该输入数据包括至少一个输入神经元或者至少一个权值。
连接关系缓存单元6012,用于缓存输入数据的位置关系数据,即上述输入神经元的位置关系数据或者上述权值的位置关系数据。
其中,上述输入神经元的位置关系数据为用于表示该输入神经元中绝对值是否小于或者等于第一阈值的字符串,上述权值的位置关系数据为表示该权值绝对值是否小于或者等于上述第一阈值的字符串,或者为表示该权值对应的输入神经元和输出神经元之间是否有连接的字符串。该输入神经元的位置关系数据和权值的位置关系数据可以直接索引或者步长索引的形式表示。
第四稀疏处理单元6013,用于根据所述输入数据的位置关系数据对所述输入数据进行处理,以得到处理后的输入数据,并将该处理后的输入数据存储到上述第一输入缓存单元中605。
在上述三个具体实施方式中,在所述压缩单元101对所述输入数据进行处理之前,所述压缩单元101还用于:
对所述至少一个输入神经元进行分组,以得到M组输入神经元,所述M为大于或者等于1的整数;
判断所述M组输入神经元的每一组输入神经元是否满足第一预设条件,所述第一预设条件包括一组输入神经元中绝对值小于或者等于第三阈值的输入神经元的个数小于或者等于第四阈值;
当所述M组输入神经元任意一组输入神经元不满足所述第一预设条件时,将该组输入神经元删除;
对所述至少一个权值进行分组,以得到N组权值,所述N为大于或者等于1的整数;
判断所述N组权值的每一组权值是否满足第二预设条件,所述第二预设条件包括一组权值中绝对值小于或者等于第五阈值的权值的个数小于或者等于第六阈值;
当所述N组权值任意一组权值不满足所述第二预设条件时,将该组权值删除。
可选地,上述第三阈值可为0.5,0.2,0.1,0.05,0.025,0.0,0或者其他值。其中,上述第四阈值与上述一组输入神经元中输入神经元的个数相关。可选地,该第四阈值=一组输入神经元中的输入神经元个数-1或者该第四阈值为其他值。可选地,上述第五阈值可为0.5,0.2,0.1,0.05,0.025,0.01,0或者其他值。其中,上述第六阈值与上述一组权值中的权值个数相关。可选地,该第六阈值=一组权值中的权值个数-1或者该第六阈值为其他值。需要说明的是,上述第三阈值和上述第五阈值可相同或者不同,上述第四阈值和上述第六阈值可相同或者不同。
需要说明的是,本申请中涉及的位置关系数据的表示方式除了直接索引和步长索引之外,还可为以下几种情况:列表的列表(List of Lists,LIL)、坐标列表(Coordinate list,COO)、压缩稀疏行(Compressed Sparse Row,CSR)、压缩稀疏列(Compressed Sparse Column,CSC)、(ELL Pack,ELL)以及混合(Hybird,HYB)。
需要指出的是,本发明实施例中提到的输入神经元和输出神经元并非是指整个神经网络的输入层中的神经元和输出层中的神经元,而是对于神经网络中任意相邻的两层神经元,处于网络前馈运算下层中的神经元即为输入神经元,处于网络前馈运算上层中的神经元即为输出神经元。以卷积神经网络为例,假设一个卷积神经网络有L层,K=1,2,3…L-1,对于第K层和第K+1层来说,第K层被称为输入层,该层中的神经元为上述输入神经元,第K+1层被称为输入层,该层中的神经元为上述输出神经元,即除了顶层之外,每一层都可以作为输入层,其下一层为对应的输出层。
此外,本申请还提供了使用上述计算装置的数据处理方法,具体如图6a所示,所述方法包括:
步骤S6a01、所述指令控制单元获取运算指令,将所述运算指令译码为第一微指令和第二微指令,并将所述第一微指令发送给所述压缩单元,将所述第二微指令发送给所述运算单元;
步骤S6a02、所述压缩单元根据所述第一微指令对获取的输入数据进行处理,得到处理后的输入数据;其中,所述输入数据包括至少一个输入神经元和/或至少一个输入数据,所述处理后的输入数据包括处理后的输入神经元和/或处理后的输入数据;
步骤S6a03、所述运算单元根据所述第二微指令对所述处理后的输入数据进行处理,得到运算结果。
关于本发明未示出或未描述的部分可具体参见前述所有或者部分实施例中的相关阐述,这里不再赘述。
本申请还提供了另一种使用上述计算装置的数据处理方法,所述计算装置包括运算单元、指令控制单元以及压缩单元;所述计算装置获取训练样本、训练模型;所述训练模型为:神经网络训练模型和/或非神经网络训练模型,所述训练模型包括n层结构;具体如图6b所示,所述方法包括:
步骤S6b01、所述指令控制单元获取训练指令,根据所述训练指令得到正向运算指令和反向运算指令,并将所述正向运算指令和所述反向运算指令分别发送给所述运算单元;
步骤S6b02、所述运算单元根据所述正向运算指令对所述训练样本进行n层正向运算得到第n层正向运算结果;根据所述第n层正向运算结果获得第n层输出数据梯度;根据所述反向运算指令对所述第n层输出数据梯度执行n层反向运算得到所述训练模型的权值梯度;所述训练模型的权值梯度包括n层中每层的权值梯度;
步骤S6b03、所述压缩单元对所述训练模型的权值梯度进行处理,以对应得到处理后的权值梯度;
步骤S6b04、所述计算装置依据所述处理后的权值梯度对所述训练模型的权值进行更新,以完成训练。
关于本发明未示出或未描述的部分可具体参见前述所有或者部分实施例中的相关阐述,这里不再赘述。
本申请上述实施例阐述的压缩单元,还可应用到如下芯片装置(也可称为运算电路或计算单元)中,以实现数据的压缩处理,降低电路中数据的传输量以及数据的计算量,从而提升数据处理效率。
参阅图7a,图7a为一种芯片装置的结构示意图,如图7a所示,该运算电路包括:主处理电路、基本处理电路和分支处理电路。具体的,主处理电路与分支处理电路连接,分支处理电路连接至少一个基本处理电路。
该分支处理电路,用于收发主处理电路或基本处理电路的数据。
参阅图7b,图7b为主处理电路的一种结构示意图,如图7b所示,主处理电路可以包括寄存器和/或片上缓存电路,该主处理电路还可以包括:控制电路、向量运算器电路、ALU(arithmetic and logic unit,算数逻辑电路)电路、累加器电路、DMA(Direct Memory Access,直接内存存取)电路等电路,当然在实际应用中,上述主处理电路还可以添加,转换电路(例如矩阵转置电路)、数据重排电路或激活电路等等其他的电路。
主处理电路还包括数据发送电路、数据接收电路或接口,该数据发送电路可以集成数据分发电路以及数据广播电路,当然在实际应用中,数据分发电路以及数据广播电路也可以分别设置;在实际应用中上述数据发送电路以及数据接收电路也可以集成在一起形成数据收发电路。对于广播数据,即需要发送给每个基础处理电路的数据。对于分发数据,即需要有选择的发送给部分基础处理电路的数据,具体的选择方式可以由主处理电路依据负载以及计算方式进行具体的确定。对于广播发送方式,即将广播数据以广播形式发送至每个基础处理电路。(在实际应用中,通过一次广播的方式将广播数据发送至每个基础处理电路,也可以通过多次广播的方式将广播数据发送至每个基础处理电路,本申请具体实施方式并不限制上述广播的次数),对于分发发送方式,即将分发数据有选择的发送给部分基础处理电路。
在实现分发数据时,主处理电路的控制电路向部分或者全部基础处理电路传输数据(该数据可以相同,也可以不同,具体的,如果采用分发的方式发送数据,各个接收数据的基础处理电路收到的数据可以不同,当然也可以有部分基础处理电路收到的数据相同;
具体地,广播数据时,主处理电路的控制电路向部分或者全部基础处理电路传输数据,各个接收数据的基础处理电路可以收到相同的数据,即广播数据可以包括所有基础处理电路均需要接收到的数据。分发数据可以包括:部分基础处理电路需要接收到的数据。主处理电路可以通过一次或多次广播将该广播数据发送给所有分支处理电路,分支处理电路该广播数据转发给所有的基础处理电路。
可选的,上述主处理电路的向量运算器电路可以执行向量运算,包括但不限于:两个向量加减乘除,向量与常数加、减、乘、除运算,或者对向量中的每个元素执行任意运算。其中,连续的运算具体可以为,向量与常数加、减、乘、除运算、激活运算、累加运算等等。
每个基础处理电路可以包括基础寄存器和/或基础片上缓存电路;每个基础处理电路还可以包括:内积运算器电路、向量运算器电路、累加器电路等中一个或任意组合。上述内积运算器电路、向量运算器电路、累加器电路都可以是集成电路,上述内积运算器电路、向量运算器电路、累加器电路也可以为单独设置的电路。
分支处理电路和基础电路的连接结构可以是任意的,不局限在图1b的H型结构。可选的,主处理电路到基础电路是广播或分发的结构,基础电路到主处理电路是收集(gather)的结构。广播,分发和收集的定义如下:
所述主处理电路到基础电路的数据传递方式可以包括:
主处理电路与多个分支处理电路分别相连,每个分支处理电路再与多个基础电路分别相连。
主处理电路与一个分支处理电路相连,该分支处理电路再连接一个分支处理电路,依次类推,串联多个分支处理电路,然后,每个分支处理电路再与多个基础电路分别相连。
主处理电路与多个分支处理电路分别相连,每个分支处理电路再串联多个基础电路。
主处理电路与一个分支处理电路相连,该分支处理电路再连接一个分支处理电路,依次类推,串联多个分支处理电路,然后,每个分支处理电路再串联多个基础电路。
分发数据时,主处理电路向部分或者全部基础电路传输数据,各个接收数据的基础电路收到的数据可以不同;
广播数据时,主处理电路向部分或者全部基础电路传输数据,各个接收数据的基础电路收到相同的数据。
收集数据时,部分或全部基础电路向主处理电路传输数据。需要说明的,如图7a所示的芯片装置可以是一个单独的物理芯片,当然在实际应用中,该芯片装置也可以集成在其他的芯片内(例如CPU,GPU),本申请具体实施方式并不限制上述芯片装置的物理表现形式。
参阅图7c,图7c为一种芯片装置的数据分发示意图,如图7 c的箭头所示,该箭头为数据的分发方向,如图7c所示,主处理电路接收到外部数据以后,将外部数据拆分以后,分发至多个分支处理电路,分支处理电路将拆分数据发送至基本处理电路。
参阅图7d,图7d为一种芯片装置的数据回传示意图,如图1d的箭头所示,该箭头为数据的回传方向,如图7d所示,基本处理电路将数据(例如内积计算结果)回传给分支处理电路,分支处理电路在回传至主处理电路。
对于输入数据,具体的可以为向量、矩阵、多维(三维或四维及以上)数据,对于输入数据的一个具体的值,可以称为该输入数据的一个元素。
本披露实施例还提供一种如图7a所示的计算单元的计算方法,该计算方法应用于神经网络计算中,具体的,该计算单元可以用于对多层神经网络中一层或多层的输入数据与权值数据执行运算。
具体的,上述所述计算单元用于对训练的多层神经网络中一层或多层的输入数据与权值数据执行运算;
或所述计算单元用于对正向运算的多层神经网络中一层或多层的输入数据与权值数据执行运算。
上述运算包括但不限于:卷积运算、矩阵乘矩阵运算、矩阵乘向量运算、偏执运算、全连接运算、GEMM运算、GEMV运算、激活运算中的一种或任意组合。
GEMM计算是指:BLAS库中的矩阵-矩阵乘法的运算。该运算的通常表示形式为:C=alpha*op(S)*op(P)+beta*C,其中,S和P为输入的两个矩阵,C为输出矩阵,alpha和beta为标量,op代表对矩阵S或P的某种操作,此外,还会有一些辅助的整数作为参数来说明矩阵的S和P的宽高;
GEMV计算是指:BLAS库中的矩阵-向量乘法的运算。该运算的通常表示形式为:C=alpha*op(S)*P+beta*C,其中,S为输入矩阵,P为输入的向量,C为输出向量,alpha和beta为标量,op代表对矩阵S的某种操作。
在实际应用中,本申请上述压缩单元具体可设计或应用至本申请实施例中以下电路中的任一个或多个中:主处理电路、分支处理电路连接以及基础处理电路。关于所述压缩单元以及所述压缩单元涉及的数据处理具体可参见前述实施例中所述,这里不再赘述。
申请实施例还提供一种计算机存储介质,其中,该计算机存储介质存储用于电子数据交换的计算机程序,该计算机程序使得计算机执行如上述方法实施例中记载的任何一种数据处理方法的部分或全部步骤。
本申请实施例还提供一种计算机程序产品,所述计算机程序产品包括存储了计算机程序的非瞬时性计算机可读存储介质,所述计算机程序可操作来使计算机执行如上述方法实施例中记载的任何一种数据处理方法的部分或全部步骤。
在一些实施例里,还公开了一种芯片,其包括了上述用于执行数据处理方法所对应的神经网络处理器。
在一些实施例里,公开了一种芯片封装结构,其包括了上述芯片。
在一些实施例里,公开了一种板卡,其包括了上述芯片封装结构。
在一些实施例里,公开了一种电子设备,其包括了上述板卡。
电子设备包括数据处理装置、机器人、电脑、打印机、扫描仪、平板电脑、智能终端、手机、行车记录仪、导航仪、传感器、摄像头、服务器、云端服务器、相机、摄像机、投影仪、手表、耳机、移动存储、可穿戴设备、交通工具、家用电器、和/或医疗设备。
所述交通工具包括飞机、轮船和/或车辆;所述家用电器包括电视、空调、微波炉、冰箱、电饭煲、加湿器、洗衣机、电灯、燃气灶、油烟机;所述医疗设备包括核磁共振仪、B超仪和/或心电图仪。
以上所述的具体实施例,对本发明实施例的目的、技术方案和有益效果进行了进一步详细说明,所应理解的是,以上所述仅为本披露的具体实施例而已,并不用于限制本披露,凡在本披露的精神和原则之内,所做的任何修改、等同替换、改进等,均应包含在本披露的保护范围之内。

Claims (29)

  1. A computing device, comprising an arithmetic unit, an instruction control unit, a storage unit, and a compression unit, wherein:
    the instruction control unit is configured to acquire an operation instruction, decode the operation instruction into a first microinstruction and a second microinstruction, send the first microinstruction to the compression unit, and send the second microinstruction to the arithmetic unit;
    the storage unit is configured to store input data, processed input data, operation instructions, and operation results, the input data including at least one input neuron and/or at least one weight, and the processed input data including processed input neurons and/or processed weights;
    the compression unit is configured to process the input data according to the first microinstruction to obtain the processed input data; and
    the arithmetic unit is configured to process the processed input data according to the second microinstruction to obtain the operation result.
  2. The computing device according to claim 1, wherein:
    the arithmetic unit is specifically configured to acquire first cache data and second cache data, and to process the first cache data and the second cache data according to the second microinstruction to obtain the operation result;
    wherein the first cache data and/or the second cache data are related to the processed input data, and the first cache data and the second cache data are different.
  3. The computing device according to claim 1, wherein
    the compression unit is specifically configured to determine, according to the first microinstruction, a processing manner for the input data, the processing manner including at least one of the following: pruning processing, quantization processing, and encoding processing; and
    the compression unit is further configured to perform corresponding processing on the input data according to the processing manner to obtain the processed input data.
  4. The computing device according to claim 3, wherein the compression unit further includes a control unit, and within the compression unit the control unit modifies an operation state corresponding to the processing manner so as to implement data processing in multiple processing manners; the operation state includes at least one of the following: a first state, a second state, a third state, and a fourth state, wherein
    the first state is used to indicate that the compression unit is in an initial state and performs no data processing;
    the second state is associated with the pruning processing and is used to indicate that the compression unit will perform pruning processing on data;
    the third state is associated with the quantization processing and is used to indicate that the compression unit will perform quantization processing on data; and
    the fourth state is associated with the encoding processing and is used to indicate that the compression unit will perform encoding processing on data.
  5. The computing device according to claim 3, wherein, when the processing manner is pruning processing,
    the compression unit is specifically configured to delete, according to the pruning processing, the input data whose absolute value is greater than a first threshold, to obtain the processed input data.
  6. The computing device according to claim 5, wherein
    the compression unit is specifically configured to delete, according to position relationship data and using the pruning processing, the input data whose absolute value is greater than the first threshold, to obtain the processed input data;
    wherein the position relationship data includes any one of the following: position relationship data of the input neurons, position relationship data of the input weights, and position relationship data derived from the position relationship data of the input neurons and the position relationship of the input weights.
  7. The computing device according to claim 6, wherein the position relationship data can be represented in the form of a direct index or a stride index.
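For illustration only, one plausible reading of the two index forms is sketched below; the claim names the forms but does not fix their encodings, so the details here (keeping nonzero entries, a virtual start position of -1 for the stride index) are assumptions:

    # Illustrative sketch of "direct index" vs. "stride index" position data.

    def direct_index(values):
        # Direct index: one 0/1 flag per position, plus the surviving values.
        mask = [1 if v != 0 else 0 for v in values]
        kept = [v for v in values if v != 0]
        return mask, kept

    def stride_index(values):
        # Stride index: for each surviving value, the step from the previous
        # surviving position (counted from a virtual position -1).
        strides, kept, prev = [], [], -1
        for i, v in enumerate(values):
            if v != 0:
                strides.append(i - prev)
                kept.append(v)
                prev = i
        return strides, kept

    data = [0.0, 3.0, 0.0, 0.0, -2.0, 1.0]
    print(direct_index(data))   # ([0, 1, 0, 0, 1, 1], [3.0, -2.0, 1.0])
    print(stride_index(data))   # ([2, 3, 1], [3.0, -2.0, 1.0])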
  8. The computing device according to claim 3, wherein, when the processing manner is quantization processing,
    the compression unit is specifically configured to cluster and quantize the input data according to the quantization processing, to obtain the processed input data.
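For illustration only, "cluster and quantize" can be realized with a small one-dimensional k-means codebook; the choice of k-means (and the parameters below) is an assumption, as the claim only requires that the data be clustered and quantized:

    # Illustrative sketch: quantization by clustering with a 1-D k-means codebook.
    import random

    def cluster_quantize(values, k=3, iters=10, seed=0):
        random.seed(seed)
        centers = random.sample(values, k)              # initial codebook
        for _ in range(iters):
            buckets = [[] for _ in range(k)]
            for v in values:                            # assign to nearest center
                buckets[min(range(k), key=lambda j: abs(v - centers[j]))].append(v)
            centers = [sum(b) / len(b) if b else centers[j]
                       for j, b in enumerate(buckets)]  # recompute codebook
        # Quantize: replace each value by the index of its nearest center.
        codes = [min(range(k), key=lambda j: abs(v - centers[j])) for v in values]
        return centers, codes

    weights = [0.10, 0.12, -0.50, -0.48, 0.90, 0.88, 0.11, -0.51]
    centers, codes = cluster_quantize(weights)
    print(centers)   # codebook of cluster centers, e.g. near 0.11, -0.50, 0.89
    print(codes)     # per-value codebook indices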
  9. The computing device according to claim 3, wherein, when the processing manner is encoding processing,
    the compression unit is specifically configured to encode the input data in a preset encoding format according to the encoding processing, to obtain the processed input data; wherein the preset encoding format is custom-set on the user side or the device side.
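For illustration only, run-length encoding of repeated values (common for sparse data) is one hypothetical "preset encoding format"; the claim leaves the actual format to be custom-set on the user side or the device side:

    # Illustrative sketch of a hypothetical preset encoding format:
    # run-length encoding, where long runs of zeros compress well.

    def rle_encode(values):
        # Encode the sequence as (value, run_length) pairs.
        encoded, prev, run = [], None, 0
        for v in values:
            if v == prev:
                run += 1
            else:
                if prev is not None:
                    encoded.append((prev, run))
                prev, run = v, 1
        if prev is not None:
            encoded.append((prev, run))
        return encoded

    def rle_decode(encoded):
        # Invert the encoding to recover the original sequence.
        return [v for v, run in encoded for _ in range(run)]

    data = [0, 0, 0, 5, 0, 0, 7, 7, 0, 0, 0, 0]
    packed = rle_encode(data)    # [(0, 3), (5, 1), (0, 2), (7, 2), (0, 4)]
    assert rle_decode(packed) == data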
  10. The computing device according to any one of claims 2 to 9, wherein the computing device further includes an instruction cache unit, a direct memory access unit, a first input cache unit, and a second input cache unit;
    the instruction cache unit is configured to cache the operation instruction;
    the direct memory access unit is configured to read and write data between the storage unit and the instruction cache unit, the first input cache unit, the second input cache unit, and the output cache unit;
    the instruction cache unit is configured to cache the neural network instruction read by the direct memory access unit;
    the first input cache unit is configured to cache the first cache data read by the direct memory access unit; and
    the second input cache unit is configured to cache the second cache data read by the direct memory access unit, the first cache data and/or the second cache data being related to the processed input data, and the first cache data and the second cache data being different.
  11. The computing device according to claim 10, wherein the computing device further includes an output cache unit;
    the output cache unit is configured to cache the operation result.
  12. A computing device, comprising an arithmetic unit, an instruction control unit, and a compression unit; the computing device acquires a training sample and a training model, the training model being a neural network training model and/or a non-neural network training model and including an n-layer structure; wherein
    the instruction control unit is configured to acquire a training instruction, obtain a forward operation instruction and a reverse operation instruction according to the training instruction, and send the forward operation instruction and the reverse operation instruction to the arithmetic unit respectively;
    the arithmetic unit is configured to perform an n-layer forward operation on the training sample according to the forward operation instruction to obtain an nth-layer forward operation result, and to obtain an nth-layer output data gradient according to the nth-layer forward operation result;
    the arithmetic unit is further configured to perform an n-layer reverse operation on the nth-layer output data gradient according to the reverse operation instruction to obtain weight gradients of the training model, the weight gradients of the training model including a weight gradient of each of the n layers;
    the compression unit is configured to process the weight gradients of the training model to correspondingly obtain processed weight gradients; and
    the computing device is configured to update the weights of the training model according to the processed weight gradients, to complete the training.
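For illustration only, the training flow of this claim can be paraphrased in a few lines of NumPy; the two-layer linear model, the squared-error output gradient, and the use of magnitude pruning as the compression step are all assumptions introduced for the sketch:

    # Illustrative sketch of the claimed flow: n-layer forward operation,
    # n-layer reverse operation, gradient compression, weight update.
    import numpy as np

    rng = np.random.default_rng(0)

    def prune(grad, threshold):
        # Assumed compression step: zero out small-magnitude entries.
        out = grad.copy()
        out[np.abs(out) < threshold] = 0.0
        return out

    # A toy two-layer linear model (the layer structure is an assumption).
    weights = [rng.normal(size=(4, 8)), rng.normal(size=(8, 1))]

    def train_step(weights, x, target, lr=0.01, threshold=1e-3):
        # n-layer forward operation: keep each layer's input for the backward pass.
        acts = [x]
        for W in weights:
            acts.append(acts[-1] @ W)

        # nth-layer output data gradient (squared-error loss assumed).
        grad = acts[-1] - target

        # n-layer reverse operation: one weight gradient per layer, last layer first.
        weight_grads = []
        for W, a in zip(reversed(weights), reversed(acts[:-1])):
            weight_grads.append(a.T @ grad)   # gradient w.r.t. this layer's weights
            grad = grad @ W.T                 # output data gradient for the layer below

        # Compression step, then weight update with the processed gradients.
        for W, g in zip(reversed(weights), (prune(g, threshold) for g in weight_grads)):
            W -= lr * g

    x = rng.normal(size=(16, 4))
    target = rng.normal(size=(16, 1))
    train_step(weights, x, target)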
  13. The computing device according to claim 12, wherein
    the instruction control unit is further configured to decode the forward operation instruction to obtain a first microinstruction and a second microinstruction;
    the compression unit is further configured to process the training sample according to the first microinstruction to obtain a processed training sample; and
    the arithmetic unit is specifically configured to perform the n-layer forward operation on the processed training sample according to the second microinstruction to obtain the nth-layer forward operation result.
  14. The computing device according to claim 12, wherein
    the arithmetic unit is specifically configured to perform an nth-layer reverse operation on the nth-layer output data gradient according to the reverse operation instruction to obtain the weight gradient of the nth layer of the training model;
    the compression unit is further configured to process the weight gradient of the nth layer to obtain a processed weight gradient of the nth layer; and
    the computing device is specifically configured to update the weights of the nth layer of the training model using the processed weight gradient of the nth layer.
  15. The computing device according to claim 14, wherein
    the arithmetic unit is further configured to acquire an (n-1)th-layer output data gradient, and perform an (n-1)th-layer reverse operation on the (n-1)th-layer output data gradient to obtain the weight gradient of the (n-1)th layer;
    the compression unit is further configured to process the weight gradient of the (n-1)th layer to obtain a processed weight gradient of the (n-1)th layer; and
    the computing device is specifically configured to update the weights of the (n-1)th layer of the training model according to the processed weight gradient of the (n-1)th layer, and so on; the computing device updates the weights of the training model correspondingly according to the processed weight gradient of each layer of the training model, to complete the training.
  16. The computing device according to any one of claims 12 to 15, wherein
    the compression unit is configured to determine a data processing manner and to process the weight gradients of the training model according to the data processing manner, to correspondingly obtain the processed weight gradients; the data processing manner includes at least one of the following: pruning processing, quantization processing, and encoding processing.
  17. The computing device according to claim 16, wherein the compression unit further includes a control unit, and within the compression unit the control unit modifies an operation state corresponding to the processing manner so as to implement data processing in multiple processing manners; the operation state includes at least one of the following: a first state, a second state, a third state, and a fourth state, wherein
    the first state is used to indicate that the compression unit is in an initial state and performs no data processing;
    the second state is associated with the pruning processing and is used to indicate that the compression unit will perform pruning processing on data;
    the third state is associated with the quantization processing and is used to indicate that the compression unit will perform quantization processing on data; and
    the fourth state is associated with the encoding processing and is used to indicate that the compression unit will perform encoding processing on data.
  18. The computing device according to claim 16, wherein, when the data processing manner is pruning processing,
    the compression unit is specifically configured to delete, according to the pruning processing, the weight gradients whose absolute value is greater than or equal to a first threshold, to correspondingly obtain the processed weight gradients.
  19. The computing device according to claim 18, wherein
    the compression unit is specifically configured to delete, according to position relationship data of the weight gradients of the training model and using the pruning processing, the weight data whose absolute value is greater than or equal to the first threshold, to obtain the processed weight gradients.
  20. The computing device according to claim 19, wherein the position relationship data can be represented in the form of a direct index or a stride index.
  21. The computing device according to claim 16, wherein, when the data processing manner is quantization processing,
    the compression unit is specifically configured to cluster and quantize the weight gradients of the training model according to the quantization processing, to correspondingly obtain the processed weight gradients.
  22. The computing device according to claim 16, wherein, when the data processing manner is encoding processing,
    the compression unit is specifically configured to encode the weight gradients of the training model in a preset encoding format according to the encoding processing, to correspondingly obtain the processed weight gradients; wherein the preset encoding format is custom-set on the user side or the device side.
  23. The computing device according to claim 12, wherein the computing device further includes a storage unit,
    the storage unit being configured to store the training sample, the training model, and the training instruction.
  24. The computing device according to claim 12, wherein the computing device further includes an instruction cache unit, a direct memory access unit, a first input cache unit, a second input cache unit, and an output cache unit;
    the direct memory access unit is configured to read and write data between the storage unit and the instruction cache unit, the first input cache unit, the second input cache unit, and the output cache unit;
    the instruction cache unit is configured to cache the training instruction read by the direct memory access unit;
    the first input cache unit is configured to cache first cache data read by the direct memory access unit, the first cache data being the weights of the training model or the training sample;
    the second input cache unit is configured to cache second cache data read by the direct memory access unit, the second cache data being the weights of the training model or the training sample, the first cache data and the second cache data being different; and
    the output cache unit is configured to cache the operation result of the arithmetic unit, the operation result including the nth-layer forward operation result, the nth-layer output data gradient, and the weight gradients of the training model.
  25. A method for data processing using a computing device, the computing device including an arithmetic unit, an instruction control unit, a storage unit, and a compression unit, the method comprising:
    acquiring, by the instruction control unit, an operation instruction, decoding the operation instruction into a first microinstruction and a second microinstruction, sending the first microinstruction to the compression unit, and sending the second microinstruction to the arithmetic unit;
    processing, by the compression unit, the acquired input data according to the first microinstruction to obtain processed input data, wherein the input data includes at least one input neuron and/or at least one input data, and the processed input data includes a processed input neuron and/or processed input data; and
    processing, by the arithmetic unit, the processed input data according to the second microinstruction to obtain an operation result.
  26. A method for data processing using a computing device, the computing device including an arithmetic unit, an instruction control unit, and a compression unit, the computing device acquiring a training sample and a training model, the training model being a neural network training model and/or a non-neural network training model and including an n-layer structure; the method comprising:
    acquiring, by the instruction control unit, a training instruction, obtaining a forward operation instruction and a reverse operation instruction according to the training instruction, and sending the forward operation instruction and the reverse operation instruction to the arithmetic unit respectively;
    performing, by the arithmetic unit, an n-layer forward operation on the training sample according to the forward operation instruction to obtain an nth-layer forward operation result, obtaining an nth-layer output data gradient according to the nth-layer forward operation result, and performing an n-layer reverse operation on the nth-layer output data gradient according to the reverse operation instruction to obtain weight gradients of the training model, the weight gradients of the training model including a weight gradient of each of the n layers;
    processing, by the compression unit, the weight gradients of the training model to correspondingly obtain processed weight gradients; and
    updating, by the computing device, the weights of the training model according to the processed weight gradients, to complete the training.
  27. A chip, wherein the chip includes the computing device according to any one of claims 1 to 11, or the computing device according to any one of claims 12 to 24.
  28. An electronic device, wherein the electronic device includes the chip according to claim 27.
  29. A computer-readable storage medium, wherein the computer storage medium stores a computer program, the computer program including program instructions that, when executed by a processor, cause the processor to perform the method according to claim 25 or 26.
PCT/CN2019/075975 2018-02-27 2019-02-23 Computing device and related product WO2019165939A1 (zh)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN201810161816.0A CN110196734A (zh) 2018-02-27 2018-02-27 Computing device and related product
CN201810161816.0 2018-02-27
CN201810161636.2 2018-02-27
CN201810161636.2A CN110196735A (zh) 2018-02-27 2018-02-27 Computing device and related product

Publications (1)

Publication Number Publication Date
WO2019165939A1 true WO2019165939A1 (zh) 2019-09-06

Family

ID=67805667

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/075975 WO2019165939A1 (zh) 2018-02-27 2019-02-23 一种计算装置及相关产品

Country Status (1)

Country Link
WO (1) WO2019165939A1 (zh)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106203627A * 2016-07-08 2016-12-07 中国电子科技集团公司电子科学研究院 Method for evaluating a cyber range
CN106529395A * 2016-09-22 2017-03-22 文创智慧科技(武汉)有限公司 Signature image authentication method based on a deep belief network and k-means clustering
CN106991477A * 2016-01-20 2017-07-28 南京艾溪信息科技有限公司 Artificial neural network compression encoding apparatus and method

Similar Documents

Publication Publication Date Title
CN109104876B (zh) Arithmetic device and related product
CN109062610B (zh) Neural network processing apparatus and method thereof for executing a Givens rotation instruction
KR102434728B1 (ko) Processing method and apparatus
WO2018214913A1 (zh) Processing method and acceleration apparatus
US11307865B2 (en) Data processing apparatus and method
WO2019157812A1 (zh) Computing device and method
US20190250860A1 (en) Integrated circuit chip device and related product thereof
CN110909870B (zh) Training apparatus and method
WO2022088063A1 (zh) Quantization method and apparatus for neural network model, and data processing method and apparatus
CN113238989A (zh) Device and method for quantizing data, and computer-readable storage medium
CN113238987B (zh) Statistical quantizer for quantized data, storage device, processing device, and board card
CN111045726B (zh) Deep learning processing apparatus and method supporting encoding and decoding
CN111047020B (zh) Neural network operation apparatus and method supporting compression and decompression
CN109389210B (zh) Processing method and processing apparatus
TWI768167B (zh) Integrated circuit chip device and related product
CN110196735A (zh) Computing device and related product
WO2019165939A1 (zh) Computing device and related product
CN113238976B (zh) Cache controller, integrated circuit device, and board card
CN113238988A (zh) Processing system, integrated circuit, and board card for optimizing parameters of a deep neural network
CN109102074B (zh) Training apparatus
CN110196734A (zh) Computing device and related product
CN110197275B (zh) Integrated circuit chip device and related product
CN111967588A (zh) Quantization operation method and related product
CN111382848A (zh) Computing device and related product
TWI768168B (zh) Integrated circuit chip device and related product

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19760508

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19760508

Country of ref document: EP

Kind code of ref document: A1