WO2022134688A1 - Data processing circuit, data processing method and related products - Google Patents

Data processing circuit, data processing method and related products

Info

Publication number
WO2022134688A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
circuit
sparse
structured
processing
Application number
PCT/CN2021/119946
Other languages
English (en)
French (fr)
Inventor
高钰峰
朱时兵
何皓源
Original Assignee
中科寒武纪科技股份有限公司
Application filed by 中科寒武纪科技股份有限公司
Priority to US18/257,723 (published as US20240070445A1)
Publication of WO2022134688A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003 Arrangements for executing specific machine instructions
    • G06F9/30007 Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/30021 Compare instructions, e.g. Greater-Than, Equal-To, MINMAX
    • G06F9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation using electronic means
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G06N20/00 Machine learning

Definitions

  • the present disclosure relates generally to the field of processors. More particularly, the present disclosure relates to data processing circuits, methods, chips and boards for processing data using data processing circuits.
  • network parameter sparsification reduces the redundant components in a larger network by appropriate methods, so as to lower the network's demand for computation and storage space.
  • Existing hardware and/or instruction sets cannot efficiently support sparse processing.
  • the solutions of the present disclosure provide a data processing circuit, a data processing method, a chip and a board.
  • the present disclosure discloses a data processing circuit comprising a control circuit, a storage circuit and an operation circuit, wherein: the control circuit is configured to control the storage circuit and the operation circuit to perform structured sparse processing on one dimension of at least one tensor data; the storage circuit is configured to store information, the information including at least information before and/or after sparsification; and the operation circuit is configured to perform structured sparse processing on one dimension of the tensor data under the control of the control circuit.
  • the present disclosure provides a chip including the data processing circuit of any embodiment of the foregoing first aspect.
  • the present disclosure provides a board including the chip of any embodiment of the foregoing second aspect.
  • the present disclosure provides a method of processing data using the data processing circuit of any of the foregoing embodiments of the first aspect.
  • the embodiments of the present disclosure provide a hardware circuit that supports structured sparse operations on data, which can perform structured sparse processing on one dimension of tensor data.
  • the data processing circuit may support structured sparsity in the inference and training processes of a neural network, and the dimension being sparsified is the input channel dimension.
  • FIG. 1 is a structural diagram illustrating a board according to an embodiment of the present disclosure
  • FIG. 2 is a block diagram illustrating a combined processing apparatus according to an embodiment of the present disclosure
  • FIG. 3 is a schematic diagram showing the internal structure of a processor core of a single-core or multi-core computing device according to an embodiment of the present disclosure
  • FIG. 4 shows a structural block diagram of a data processing circuit according to an embodiment of the present disclosure
  • FIGS. 5A-5D are schematic diagrams showing partial structures of the operation circuit according to embodiments of the present disclosure.
  • FIG. 6 illustrates an exemplary operational pipeline for structured sparse processing according to an embodiment of the present disclosure.
  • the term “if” may be contextually interpreted as “when” or “once” or “in response to determining” or “in response to detecting”.
  • FIG. 1 shows a schematic structural diagram of a board 10 according to an embodiment of the present disclosure.
  • the board 10 includes a chip 101, which is a system-on-chip (SoC) integrating one or more combined processing devices.
  • the combined processing device is an artificial intelligence computing unit used to support various deep learning and machine learning algorithms, meeting the intelligent processing requirements of complex scenarios in fields such as computer vision, speech, natural language processing and data mining.
  • deep learning technology is widely used in the field of cloud intelligence.
  • a notable feature of cloud intelligence applications is the large amount of input data, which has high requirements on the storage capacity and computing capacity of the platform.
  • the board 10 in this embodiment is suitable for cloud intelligence applications, with huge off-chip storage, on-chip storage and powerful computing capability.
  • the chip 101 is connected to an external device 103 through an external interface device 102 .
  • the external device 103 is, for example, a server, a computer, a camera, a monitor, a mouse, a keyboard, a network card or a wifi interface, and the like.
  • the data to be processed can be transmitted to the chip 101 by the external device 103 through the external interface device 102 .
  • the calculation result of the chip 101 can be transmitted back to the external device 103 via the external interface device 102 .
  • the external interface device 102 may have different interface forms, such as a PCIe interface and the like.
  • the board 10 also includes a storage device 104 for storing data, which includes one or more storage units 105 .
  • the storage device 104 is connected to the control device 106 and the chip 101 through a bus and performs data transmission.
  • the control device 106 in the board 10 is configured to control the state of the chip 101 .
  • the control device 106 may include a microcontroller (Micro Controller Unit, MCU).
  • FIG. 2 is a block diagram showing a combined processing device in the chip 101 of this embodiment.
  • the combined processing device 20 includes a computing device 201 , an interface device 202 , a processing device 203 and a storage device 204 .
  • the computing device 201 is configured to perform operations specified by the user, and is mainly implemented as a single-core intelligent processor or a multi-core intelligent processor to perform deep learning or machine learning calculations; it can interact with the processing device 203 through the interface device 202 to work together to complete a user-specified operation.
  • the interface device 202 is used to transmit data and control instructions between the computing device 201 and the processing device 203 .
  • the computing device 201 may obtain input data from the processing device 203 via the interface device 202 and write the input data into the on-chip storage device of the computing device 201.
  • the computing device 201 can obtain the control instruction from the processing device 203 via the interface device 202 and write it into the control cache on the computing device 201 .
  • the interface device 202 can also read the data in the storage device of the computing device 201 and transmit it to the processing device 203 .
  • the processing device 203 performs basic control including but not limited to data transfer, starting and/or stopping the computing device 201, and the like.
  • the processing device 203 may be one or more types of central processing unit (CPU), graphics processing unit (GPU), or other general-purpose and/or special-purpose processors.
  • these processors include but are not limited to digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc., and their number can be determined according to actual needs.
  • the computing device 201 of the present disclosure can be regarded as having a single-core structure or a homogeneous multi-core structure. However, when the computing device 201 and the processing device 203 are considered together, the two are considered to form a heterogeneous multi-core structure.
  • the storage device 204 is used to store data to be processed; it may be a DRAM, specifically DDR memory, typically 16G or larger in size, and is used to save the data of the computing device 201 and/or the processing device 203.
  • FIG. 3 shows a schematic diagram of the internal structure of the processor core when the computing device 201 is a single-core or multi-core device.
  • the computing device 301 is used to process input data such as computer vision, speech, natural language, and data mining.
  • the computing device 301 includes three modules: a control module 31 , an arithmetic module 32 and a storage module 33 .
  • the control module 31 is used to coordinate and control the work of the operation module 32 and the storage module 33 to complete the task of deep learning, and it comprises an instruction fetch unit (IFU) 311 and an instruction decode unit (IDU) 312.
  • the instruction fetching unit 311 is used to acquire the instruction from the processing device 203 , and the instruction decoding unit 312 decodes the acquired instruction, and sends the decoding result to the operation module 32 and the storage module 33 as control information.
  • the operation module 32 includes a vector operation unit 321 and a matrix operation unit 322 .
  • the vector operation unit 321 is used to perform vector operations, and can support complex operations such as vector multiplication, addition, and nonlinear transformation;
  • the matrix operation unit 322 is responsible for the core calculation of the deep learning algorithm, that is, matrix multiplication and convolution.
  • the storage module 33 is used to store or transport related data, including a neuron storage unit (neuron RAM, NRAM) 331, a parameter storage unit (weight RAM, WRAM) 332, and a direct memory access module (direct memory access, DMA) 333.
  • NRAM 331 is used to store input neurons, output neurons and intermediate results after calculation;
  • WRAM 332 is used to store convolution kernels of deep learning networks, namely weights;
  • DMA 333 is connected to DRAM 204 through bus 34, and is responsible for data transfer between the computing device 301 and DRAM 204.
  • the embodiments of the present disclosure provide a data processing circuit that supports structured sparse operation of data.
  • various processes related to structured sparsity can be simplified and accelerated.
  • FIG. 4 shows a structural block diagram of a data processing circuit 400 according to an embodiment of the present disclosure.
  • the data processing circuit 400 may be implemented, for example, in the computing device 201 of FIG. 2 .
  • the data processing circuit 400 may include a control circuit 410, a storage circuit 420 and an operation circuit 430.
  • the function of the control circuit 410 may be similar to that of the control module 31 of FIG. 3; it may include, for example, an instruction fetch unit for fetching instructions from, for example, the processing device 203 of FIG. 2, and an instruction decode unit for decoding the fetched instructions and sending the decoded results to the operation circuit 430 and the storage circuit 420 as control information.
  • the control circuit 410 may be configured to control the storage circuit 420 and the operation circuit 430 to perform structured sparse processing on one dimension of at least one tensor data.
  • the storage circuit 420 may be configured to store information before and/or after sparsification.
  • the data to be structured and sparsely processed may be data in a neural network, such as weights, neurons, and the like.
  • the storage circuit may be, for example, the WRAM 332 and NRAM 331 of FIG. 3 .
  • the operation circuit 430 may be configured to perform structured sparse processing on one dimension of the above-mentioned tensor data under the control of the control circuit 410 .
  • the operation circuit 430 may include a structured sparse circuit 432, which may be configured to select, according to a sparse rule, n data elements as valid data elements from every m data elements of the dimension to be sparsified of the input data, where m>n. In one implementation, m=4 and n=2.
  • when m=4, n can also take other values, such as 1 or 3.
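  • as an illustration of this m-choose-n rule, the following minimal Python sketch (not the patent's hardware implementation; the function name and the m=4, n=2 defaults are only illustrative) selects the n largest-magnitude elements from each group of m and returns the data part and index part described later:

```python
def structured_sparsify(row, m=4, n=2):
    """Keep the n largest-magnitude elements in each group of m.

    Returns (data_part, index_part): the kept elements and their
    positions within each group, mirroring the two-part result
    produced by the structured sparse circuit.
    """
    assert len(row) % m == 0
    data_part, index_part = [], []
    for g in range(0, len(row), m):
        group = row[g:g + m]
        # Sort positions by descending magnitude; ties keep the lower
        # index first, matching a fixed low-to-high priority order.
        order = sorted(range(m), key=lambda i: (-abs(group[i]), i))
        keep = sorted(order[:n])
        data_part.extend(group[i] for i in keep)
        index_part.extend(keep)
    return data_part, index_part

# m=4, n=2: keeps 5 and -7 (positions 0 and 2) from [5, -1, -7, 3].
print(structured_sparsify([5, -1, -7, 3]))  # ([5, -7], [0, 2])
```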
  • the operation circuit 430 may further include a convolution circuit 433, which may be configured to perform a convolution operation on the structurally sparsified data.
  • the convolution circuit 433 may be configured to receive the data to be convolved and perform a convolution operation thereon.
  • the data to be convolved includes at least the sparsified input data received from the structured sparse circuit 432.
  • the structured sparse circuit 432 may perform structured sparse processing according to different sparse rules.
  • the structured sparse circuit 510 may include a first structured sparse sub-circuit 512 and/or a second structured sparse sub-circuit 514 therein.
  • the first structured sparse subcircuit 512 may be configured to perform structured sparse processing on the input data according to a specified sparse mask.
  • the second structured sparse subcircuit 514 may be configured to perform structured sparse processing on the input data according to a predetermined sparse rule.
  • in the following scenarios, the case where convolution needs to be performed on structurally sparsified data is described as an example. These scenarios can occur, for example, during the inference or training of a neural network.
  • in the first scenario, one of the data to be convolved (assume it is the first data) may have been structurally sparsified in advance, while the other (assume it is the second data) needs to be structurally sparsified according to the sparsity pattern of the first data.
  • the first scenario can be, for example, the inference process of a neural network.
  • the first data is the weights, which have been structurally sparsified offline, so the sparsified weights can be used directly as a convolution input.
  • the second data is the neurons, which have not yet been sparsified and need to be structurally sparsified according to the index part obtained from the sparsification of the weights.
  • the first scenario can also be, for example, the forward training process of a neural network, also called the forward pass of training.
  • the first data is the weights, which are structurally sparsified in advance at the beginning of training; the resulting data part and index part are stored, for example, in DDR for later use when performing convolution with the second data.
  • the second data is the neurons, which have not yet been sparsified and need to be structurally sparsified according to the index part of the sparsified weights.
  • the first scenario may also be, for example, the backward weight-gradient training process of a neural network, also called the weight-gradient backward pass of training.
  • the first data is the weights, which have already been sparsified during forward training and can be updated directly based on the sparsified weights.
  • the second data is the weight gradients, which have not yet been sparsified and need to be structurally sparsified according to the index part obtained from the sparsification of the weights, so as to be added to the sparsified weights to update them.
  • in this case, the second data may be sparsified by using the first structured sparse subcircuit.
  • FIG. 5A shows a partial structural schematic diagram of the operation circuit in the first scenario.
  • the structured sparse circuit 510 includes a first structured sparse subcircuit 512, which receives the input second data and the index part of the pre-sparsified first data.
  • the first structured sparse subcircuit 512 uses the index part of the first data as a sparse mask to perform structured sparse processing on the second data. Specifically, according to the valid data positions indicated by the index part of the first data, the first structured sparse subcircuit 512 extracts the data at the corresponding positions from the second data as valid data.
  • the first structured sparse subcircuit 512 may be implemented by, for example, a circuit such as vector multiplication or matrix multiplication.
  • the convolution circuit 520 receives the structurally sparsified first data and the sparsified second data output from the first structured sparse subcircuit 512, and performs convolution on both.
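  • as noted above, the first structured sparse subcircuit may be realized with vector or matrix multiplication circuits; the following Python sketch (hypothetical names; the per-group layout of the index part is an assumption) only mirrors the selection semantics of applying an existing index part as a mask to a second tensor:

```python
def apply_sparse_mask(row, index_part, m=4, n=2):
    """Gather from `row` the elements whose positions the index part
    marks as valid, group by group (n indices per group of m)."""
    out = []
    groups = [index_part[i:i + n] for i in range(0, len(index_part), n)]
    for g, keep in enumerate(groups):
        out.extend(row[g * m + i] for i in keep)
    return out

weight_index = [0, 2]        # index part of pre-sparsified weights
neurons = [10, 20, 30, 40]   # un-sparsified second data
print(apply_sparse_mask(neurons, weight_index))  # [10, 30]
```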
  • in the second scenario, neither of the two data to be convolved has been structurally sparsified; both the first data and the second data may be sparsified by using the second structured sparse subcircuit.
  • FIG. 5B shows a partial structural schematic diagram of the operation circuit in the second scenario.
  • the structured sparse circuit 510 may include two second structured sparse subcircuits 514, which respectively receive the first data and the second data to be convolved, perform structured sparse processing on them simultaneously and independently, and output the sparsified data to the convolution circuit 520.
  • the second structured sparse subcircuit 514 may be configured to perform structured sparse processing according to a predetermined filtering rule, for example filtering out, from every m data elements, the n data elements with larger absolute values as valid data elements.
  • the second structured sparse sub-circuit 514 can implement the above-mentioned processing by, for example, configuring a multi-stage operation pipeline composed of circuits such as comparators.
  • the structured sparse circuit 510 may also include only one second structured sparse sub-circuit 514, which sequentially performs structured sparse processing on the first data and the second data.
  • in the third scenario, neither of the two data to be convolved has been structurally sparsified; both need structured sparse processing before convolution, and one of the data (for example, the first data) needs to use the sparsified index part of the other data as a sparse mask.
  • the third scenario may be, for example, the forward training process of a neural network, also called the forward pass of training.
  • the first data is the neurons, which have not yet been sparsified and need to be structurally sparsified according to the index part obtained from the sparsification of the weights (i.e., the second data).
  • the second data is the weights, which need to be structurally sparsified to obtain the sparsified data part and index part; the index part serves as the sparse mask when sparsifying the first data, and the data part is convolved with the sparsified first data.
  • FIG. 5C shows a partial structural schematic diagram of the operation circuit in the third scenario.
  • the structured sparse circuit 510 may include a first structured sparse subcircuit 512 and a second structured sparse subcircuit 514 .
  • the second structured sparse subcircuit 514 may be configured to perform structured sparse processing on, for example, the second data according to a predetermined filtering rule, and provide the index part of the sparsified second data to the first structured sparse subcircuit 512.
  • the first structured sparse subcircuit 512 uses the index portion of the second data as a sparse mask to perform structured sparse processing on the first data.
  • the convolution circuit 520 receives the structurally sparsified first data and second data from the first structured sparse subcircuit 512 and the second structured sparse subcircuit 514, respectively, and performs convolution on both.
  • in the fourth scenario, neither of the two data to be convolved has been structurally sparsified, and the same sparse mask needs to be applied to both data for structured sparse processing before convolution.
  • the fourth scenario may be, for example, the forward training process of a neural network, where the sparse masks have been pre-generated and stored, for example, in the DDR.
  • the first data is the neurons, which have not yet been sparsified and need to be structurally sparsified according to the sparse mask.
  • the second data is the weights, which have also not yet been sparsified and need to be structurally sparsified according to the sparse mask.
  • FIG. 5D shows a partial structural schematic diagram of the operation circuit in the fourth scenario.
  • the structured sparse circuit 510 includes two first structured sparse subcircuits 512, which respectively receive the input first data and second data as well as a pre-generated sparse mask.
  • the first structured sparse subcircuits 512 perform structured sparse processing on the first data and the second data respectively according to the sparse mask.
  • the convolution circuit 520 receives the structurally sparsified first data and second data, and performs convolution on both.
  • in other application scenarios, for example when the same sparse mask needs to be applied to the two data to be convolved, the structured sparse circuit may likewise include two first structured sparse subcircuits for the processing.
  • FIG. 6 illustrates an exemplary operational pipeline for structured sparse processing according to one embodiment of the present disclosure.
  • the pipeline includes at least one multi-stage pipeline operation circuit, which includes a plurality of operators arranged in stages and is configured to perform structured sparse processing of selecting, from m data elements, the n data elements with larger absolute values as valid data elements; it may be used, for example, to implement the aforementioned second structured sparse subcircuit.
  • in the embodiment of FIG. 6, with m=4 and n=2, the 2 data elements with larger absolute values are filtered out of the 4 data elements A, B, C and D. The above-described structured sparse processing can be performed using a multi-stage pipeline operation circuit composed of absolute value operators, comparators, and the like.
  • the first pipeline stage may include m (4) absolute value operators 610 for synchronously performing absolute value operations on the 4 input data elements A, B, C and D, respectively.
  • to facilitate the final output of valid data elements, in some embodiments the first pipeline stage simultaneously outputs the original data elements (i.e., A, B, C and D) and the data after the absolute value operation (i.e., |A|, |B|, |C| and |D|).
  • the second pipeline stage may include a permutation and combination circuit 620 for permuting and combining the m absolute values to generate m groups of data, wherein each group of data includes the m absolute values, and the m absolute values are in each group The locations in the data are different from each other.
  • the permutation and combination circuit may be a cyclic shifter, which performs m-1 cyclic shifts on a permutation of the m absolute values (e.g., |A|, |B|, |C| and |D|) to generate the m groups of data. In the illustrated example, four groups of data are generated, namely: {|A|,|B|,|C|,|D|}, {|B|,|C|,|D|,|A|}, {|C|,|D|,|A|,|B|} and {|D|,|A|,|B|,|C|}.
  • when each group of data is output, the corresponding original data element is also output, each group of data corresponding to one original data element.
  • the third pipeline stage includes a comparison circuit 630 for comparing absolute values in the m sets of data and generating a comparison result.
  • the third pipeline stage may include m comparison circuits, each comparison circuit including m-1 comparators (631, 632, 633), where the m-1 comparators in the i-th comparison circuit are used to sequentially compare one absolute value in the i-th group of data with the other three absolute values and generate comparison results, where 1≤i≤m.
  • the third pipeline stage can also be considered as m-1(3) sub-pipeline stages.
  • each sub-pipeline stage includes m comparators, each comparing the first absolute value of its group with one of the other absolute values.
  • the m-1 sub-pipeline stages thus sequentially compare the corresponding absolute value with the other m-1 absolute values.
  • the four comparators 631 in the first sub-pipeline stage are used to compare the first absolute value with the second absolute value in the four groups of data, respectively, and output the comparison results w0, x0, y0 and z0.
  • the four comparators 632 in the second sub-pipeline stage are used to compare the first absolute value with the third absolute value in the four groups of data, respectively, and output the comparison results w1, x1, y1 and z1.
  • the four comparators 633 in the third sub-pipeline stage are used to compare the first absolute value and the fourth absolute value of the four sets of data respectively, and output the comparison results w2, x2, y2 and z2 respectively.
  • the comparison result of each absolute value and the other m-1 absolute values can be obtained.
  • the comparison result may be represented using a bitmap. For example, at the first comparator of the first comparison circuit, when |A|≥|B|, w0=1; at the second comparator of the first circuit, when |A|<|C|, w1=0; at the third comparator of the first circuit, when |A|≥|D|, w2=1. Thus, the output of the first comparison circuit is {A, w0, w1, w2}, in this case {A, 1, 0, 1}.
  • similarly, the output of the second comparison circuit is {B, x0, x1, x2}, the output of the third comparison circuit is {C, y0, y1, y2}, and the output of the fourth comparison circuit is {D, z0, z1, z2}.
  • the fourth pipeline stage includes a screening circuit 640 for selecting, according to the comparison results of the third stage, the n data elements with larger absolute values from the m data elements as valid data elements, and outputting these valid data elements and the corresponding indices.
  • the indices are used to indicate the positions of these valid data elements within the input m data elements. For example, when A and C are filtered out from the four data elements A, B, C and D, their corresponding indices may be 0 and 2.
  • based on the comparison results, appropriate logic can be designed to select the n data elements with larger absolute values.
  • when there are data elements with equal absolute values, the selection is performed according to a specified priority order.
  • for example, the priority may be fixed in order of increasing index, so that A has the highest priority and D the lowest.
  • when the absolute values of the three numbers A, C and D are equal and greater than the absolute value of B, the selected data are then A and C.
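  • as a behavioral illustration of the four pipeline stages for m=4, n=2, the following Python sketch (illustrative only; its screening stage uses a simplified count-then-priority rule rather than the exact boolean conditions given in the description's implementation example) ties the stages together:

```python
def pipeline_select(elems):
    """Four-stage behavioral model of FIG. 6 for m=4, n=2."""
    m, n = 4, 2
    # Stage 1: absolute values (original elements are carried along).
    mags = [abs(e) for e in elems]
    # Stage 2: cyclic shifts yield m groups; group i starts with mags[i].
    groups = [mags[i:] + mags[:i] for i in range(m)]
    # Stage 3: each lane compares its first magnitude with the other
    # three, producing a bitmap such as {A, w0, w1, w2}.
    bitmaps = [[int(g[0] >= g[j]) for j in range(1, m)] for g in groups]
    # Stage 4: screening. Keep the n lanes with the most comparison
    # wins, breaking ties by lower index (fixed priority A..D).
    wins = [sum(b) for b in bitmaps]
    keep = sorted(sorted(range(m), key=lambda i: (-wins[i], i))[:n])
    return [elems[i] for i in keep], keep

# |A| = |C| = |D| > |B|: the fixed priority selects A and C.
print(pipeline_select([4, 1, -4, 4]))  # ([4, -4], [0, 2])
```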
  • the result after sparse processing consists of two parts: the data part and the index part.
  • the data part includes the data after sparse processing, that is, the valid data elements extracted according to the filtering rules of the structured sparse processing.
  • the index part is used to indicate, for the sparsified data, the original positions of the valid data elements in the original data before sparsification (that is, the data to be sparsified).
  • Structured sparse data can be represented and/or stored in a variety of forms.
  • the structured sparse processed data may be in the form of a structure.
  • the data part and the index part are bound to each other.
  • each bit in the index part may correspond to one data element.
  • for example, when the data type is fix8, one data element is 8 bits, so each bit in the index part may correspond to 8 bits of data.
  • alternatively, each bit in the index part of the structure may be set to correspond to the position of N bits of data, where N is determined at least in part based on the hardware configuration; for example, with each bit corresponding to 4 bits of data, every 2 bits in the index part correspond to one fix8 data element.
  • the data part in the structure may be aligned according to the first alignment requirement, and the index part in the structure may be aligned according to the second alignment requirement, so that the entire structure also meets the alignment requirement.
  • the data part can be aligned according to 64B
  • the index part can be aligned according to 32B
  • the entire structure can be aligned according to 96B (64B+32B).
  • by using such a structure, the data part and the index part can be used in a unified way. Since, in structured sparse processing, the proportion of valid data elements among the original data elements is fixed, for example n/m, the size of the sparsified data is also fixed and predictable. Thus, structures can be densely stored in memory circuits without performance loss.
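  • because the n/m ratio fixes the output size, the storage footprint of such a structure can be computed up front; the following Python sketch (a hypothetical helper, using the 64B/32B alignment example above as defaults) shows the bookkeeping:

```python
def align_up(x, a):
    """Round x up to a multiple of a."""
    return (x + a - 1) // a * a

def struct_bytes(num_elems, elem_bits=8, m=4, n=2,
                 idx_bits_per_elem=1, data_align=64, idx_align=32):
    """Size of one data+index structure for n:m structured sparsity."""
    kept = num_elems * n // m                  # valid elements kept
    data = align_up(kept * elem_bits // 8, data_align)
    index = align_up(kept * idx_bits_per_elem // 8, idx_align)
    return data + index

# 256 fix8 elements at 2:4 sparsity: 128 data bytes + 32 index bytes.
print(struct_bytes(256))  # 160
```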
  • the data part and the index part obtained after sparsification may also be represented and/or stored separately for separate use.
  • for example, the index part of the structurally sparsified second input data may be provided to the first structured sparse subcircuit 512 to be used as a mask for performing structured sparse processing on the first input data.
  • in this case, each bit in the separately provided index part may indicate whether one data element is valid.
  • the operation circuit 430 may further include a pre-processing circuit 431 and a post-processing circuit 434.
  • the preprocessing circuit 431 can be configured to preprocess the data before the operation performed by the structured sparse circuit 432 and/or the convolution circuit 433;
  • the postprocessing circuit 434 can be configured to perform postprocessing on the data after the operation by the convolution circuit 433 .
  • the aforementioned preprocessing and postprocessing may include, for example, data splitting and/or data splicing operations.
  • the pre-processing circuit may divide the data to be sparsified into segments of m data elements each, and then send them to the structured sparse circuit 432 for structured sparse processing.
  • the preprocessing circuit can also control the rate at which data is fed to the structured sparse circuit. For example, in processing involving structured sparsity, the preprocessing circuit may feed data to the structured sparse circuit at a first rate, while in processing not involving structured sparsity, the preprocessing circuit may feed data to the convolution circuit at a second rate.
  • the first rate is greater than the second rate, for example by a ratio equal to the sparse ratio m/n of the structured sparse processing.
  • for example, when m=4 and n=2, the first rate is twice the second rate.
  • the post-processing circuit 434 may perform fusion processing on the output of the convolution circuit, such as addition, subtraction, multiplication, and the like.
  • the data for structured sparse processing can be data in a neural network, such as weights, neurons, etc.
  • Data in neural networks often contains multiple dimensions, also known as tensor data.
  • data may exist in four dimensions: input channels, output channels, length, and width.
  • the structured sparse described above is performed for one dimension of multidimensional data in a neural network.
  • the above structured sparsity can be used in the forward pass of a neural network (e.g., inference or forward training), where the structured sparse processing is performed on the input channel dimension of the multidimensional data in the neural network.
  • the above structured sparsity can also be used in the backward pass of a neural network (e.g., backward weight-gradient training), where the structured sparse processing is likewise aimed at the input channel dimension of the multidimensional data in the neural network.
  • the embodiments of the present disclosure provide at least two scheduling solutions.
  • in one scheduling scheme, the weights can be structurally sparsified online in advance during forward training, and the obtained index part and data part can be used directly in the subsequent structured sparse convolution with the neurons.
  • the task of structurally sparsifying the weights can be split for parallel execution, from the system-on-chip to different clusters and then from each cluster to different processor cores, so as to improve execution speed.
  • the index part and the data part generated by all the processor cores are written back to the memory (eg DDR).
  • when the convolution is subsequently performed, the index part and the data part of the weights are re-distributed to the different processor cores to perform operations related to structured sparsity (for example, structured sparse convolution operations), where the index part is used as the sparse mask during execution and the data part directly participates in the convolution operation.
  • this scheduling scheme occupies DDR space to store the data part and the index part of the sparsified weights (for example, in the form of a structure), but does not need to re-execute the sparsification when the weights participating in the operation are loaded repeatedly.
  • in another scheduling scheme, the weights and neurons can be sparsified online during forward training, with the sparse mask used in the structured sparsification generated in advance. Since the sparse mask is used over a long period, it can be generated once every K iterations; there is no need to regenerate it every time, and it can be cached in DDR as the structured sparse input for each iteration.
  • the unsparsified weights and neurons and the sparse mask required for the operation can be sent directly to each processor core, and the sparsified weights and neurons then directly participate in the operation.
  • this embodiment does not need to additionally occupy DDR to cache the sparsified weight data, which is instead used directly for subsequent operations. Therefore, when the neuron data is small enough, for example, to be loaded at one time so that the weights need not be loaded repeatedly, execution performance can be improved by using the scheduling scheme provided by this embodiment.
  • as can be seen from the above, the embodiments of the present disclosure provide a hardware solution for performing processing related to structured sparsity.
  • this solution can support the inference and training processes of neural networks by performing structured sparse processing on one dimension (e.g., the input channel dimension) of the tensor data in the neural network.
  • by providing dedicated structured-sparsity hardware to perform the related operations, processing can be simplified, thereby increasing the processing efficiency of the machine.
  • the electronic devices or apparatuses of the present disclosure may include servers, cloud servers, server clusters, data processing devices, robots, computers, printers, scanners, tablet computers, smart terminals, PC devices, IoT terminals, mobile terminals, mobile phones, driving recorders, navigators, sensors, cameras, video cameras, projectors, watches, headphones, mobile storage, wearable devices, visual terminals, autonomous driving terminals, vehicles, household appliances, and/or medical equipment.
  • the vehicles include airplanes, ships and/or vehicles;
  • the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lamps, gas stoves, and range hoods;
  • the medical equipment includes nuclear magnetic resonance instruments, B-mode ultrasound scanners and/or electrocardiographs.
  • the electronic devices or apparatuses of the present disclosure can also be applied to fields such as the Internet, the Internet of Things, data centers, energy, transportation, public administration, manufacturing, education, power grids, telecommunications, finance, retail, construction sites and medical care. Further, the electronic device or apparatus of the present disclosure can also be used in application scenarios related to artificial intelligence, big data and/or cloud computing, such as cloud, edge and terminal scenarios.
  • the electronic device or apparatus with high computing power according to the solution of the present disclosure can be applied to cloud devices (e.g., cloud servers), while electronic devices or apparatuses with low power consumption can be applied to terminal devices and/or edge devices (such as smartphones or cameras).
  • the hardware information of the cloud device and the hardware information of the terminal device and/or edge device are compatible with each other, so that appropriate hardware resources can be matched from the hardware resources of the cloud device according to the hardware information of the terminal device and/or edge device, to simulate the hardware resources of the terminal device and/or edge device and thereby complete unified management, scheduling and collaborative work of device-cloud integration or cloud-edge-device integration.
  • the present disclosure expresses some methods and their embodiments as a series of actions and combinations thereof, but those skilled in the art can understand that the solutions of the present disclosure are not limited by the order of the described actions. Accordingly, those of ordinary skill in the art, based on the disclosure or teachings of the present disclosure, will appreciate that some of the steps may be performed in other orders or concurrently. Further, those skilled in the art can understand that the embodiments described in the present disclosure may be regarded as optional embodiments, that is, the actions or modules involved therein are not necessarily required for the realization of one or some solutions of the present disclosure. In addition, depending on the solution, the description of some embodiments in the present disclosure also has different emphases. In view of this, those skilled in the art can refer to the related descriptions of other embodiments for the parts not described in detail in a certain embodiment of the present disclosure.
  • units illustrated as separate components may or may not be physically separate, and components shown as units may or may not be physical units.
  • the aforementioned components or elements may be co-located or distributed over multiple network elements.
  • some or all of the units may be selected to achieve the purpose of the solutions described in the embodiments of the present disclosure.
  • multiple units in the embodiments of the present disclosure may be integrated into one unit or each unit physically exists independently.
  • the above-mentioned integrated units may also be implemented in the form of hardware, that is, specific hardware circuits, which may include digital circuits and/or analog circuits, and the like.
  • the physical implementation of the hardware structure of the circuit may include, but is not limited to, physical devices, and the physical devices may include, but are not limited to, devices such as transistors or memristors.
  • various types of devices described herein (e.g., computing devices or other processing devices) may be implemented by suitable hardware processors, such as central processing units, GPUs, FPGAs, DSPs, ASICs, and the like.
  • the aforementioned storage unit or storage device can be any suitable storage medium (including magnetic storage media, magneto-optical storage media, etc.), which can be, for example, a resistive random access memory (RRAM), a dynamic random access memory (DRAM), a static random access memory (SRAM), an enhanced dynamic random access memory (EDRAM), a high bandwidth memory (HBM), a hybrid memory cube (HMC), a ROM, a RAM, and the like.
  • Clause 1 A data processing circuit, comprising a control circuit, a storage circuit and an operation circuit, wherein:
  • the control circuit is configured to control the storage circuit and the operation circuit to perform structured sparse processing on one dimension of at least one tensor data;
  • the storage circuit is configured to store information, the information including at least information before and/or after sparsification; and
  • the operation circuit is configured to perform structured sparse processing on one dimension of the tensor data under the control of the control circuit.
  • Clause 2 The data processing circuit of clause 1, wherein the operation circuit comprises a structured sparse circuit configured to select, according to a sparse rule, n data elements as valid data elements from every m data elements of the dimension to be sparsified of the input data, where m>n.
  • Clause 3 The data processing circuit of clause 2, wherein the structured sparse circuit comprises:
  • a first structured sparse subcircuit, configured to perform structured sparse processing on the input data according to a specified sparse mask.
  • Clause 4 The data processing circuit of clause 2 or 3, wherein the structured sparse circuit comprises:
  • a second structured sparse subcircuit, configured to perform structured sparse processing on the input data according to a predetermined sparse rule.
  • Clause 5 The data processing circuit of clause 3 or 4, wherein the at least one tensor data comprises first data, and the operation circuit is configured to:
  • perform the structured sparse processing on the first data by using, as a sparse mask, the index part corresponding to structurally sparsified second data, wherein the index part indicates the positions of the valid data elements in the performed structured sparsification.
  • Clause 6 The data processing circuit of clause 5, wherein:
  • the structurally sparsified second data is sparsified in advance, online or offline, and stored in the storage circuit; or
  • the structurally sparsified second data is generated online by using the second structured sparse subcircuit to perform structured sparse processing.
  • Clause 7 The data processing circuit of any one of clauses 5-6, wherein the operation circuit further comprises:
  • a convolution circuit configured to perform a convolution operation on the structurally sparsified first data and second data.
  • Clause 8 The data processing circuit of any one of clauses 5-7, wherein the structurally sparsified second data is in the form of a structure, the structure comprising a data part and an index part bound to each other, the data part comprising the valid data elements after structured sparse processing, and the index part indicating the positions of the sparsified data in the data before sparsification.
  • Clause 9 The data processing circuit of any one of clauses 4-8, wherein the second structured sparse subcircuit further comprises: at least one multi-stage pipeline operation circuit comprising a plurality of operators arranged in stages and configured to perform structured sparse processing of selecting, from m data elements, the n data elements with larger absolute values as valid data elements.
  • Clause 10 The data processing circuit of clause 9, wherein the multi-stage pipeline operation circuit comprises four pipeline stages, wherein:
  • the first pipeline stage includes m absolute value operators for respectively taking the absolute values of the m data elements to be sparsified to generate m absolute values;
  • the second pipeline stage includes a permutation and combination circuit for permuting and combining the m absolute values to generate m groups of data, wherein each group of data includes the m absolute values and the positions of the m absolute values in each group of data differ from one another;
  • the third pipeline stage includes m comparison circuits for comparing the absolute values in the m groups of data and generating comparison results; and
  • the fourth pipeline stage includes a screening circuit for selecting, according to the comparison results, the n data elements with larger absolute values as valid data elements, and outputting the valid data elements and a corresponding index, the index indicating the positions of the valid data elements within the m data elements.
  • Clause 11 The data processing circuit of clause 10, wherein each comparison circuit in the third pipeline stage includes m-1 comparators, and the m-1 comparators in the i-th comparison circuit are used to sequentially compare one absolute value in the i-th group of data with the other three absolute values and generate comparison results, where 1≤i≤m.
  • Clause 12 The data processing circuit of any one of clauses 10-11, wherein the screening circuit is further configured to, when there are data elements with the same absolute value, select in accordance with a specified priority order.
  • Clause 13 The data processing circuit of any one of clauses 1-12, wherein the one dimension includes an input channel dimension, and the unit of the structured sparse processing is a data row of the input channel dimension of the tensor data.
  • Clause 14 The data processing circuit of any of clauses 1-13, the data processing circuit being used in any of the following processes: the inference process of a neural network, the forward training process of a neural network, and/or the backward weight-gradient training process of a neural network.
  • Clause 15 A chip comprising the data processing circuit of any of clauses 1-14.
  • Clause 16 A board comprising the chip of clause 15.
  • Clause 17 A method of processing data using the data processing circuit of any of clauses 1-14.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Neurology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Complex Calculations (AREA)

Abstract

A data processing circuit, a data processing method and related products. The data processing circuit may be implemented as a computing device (201) included in a combined processing device (20), and the combined processing device (20) may further include an interface device (202) and other processing devices (203). The computing device (201) interacts with the other processing devices (203) to jointly complete a computing operation specified by a user. The combined processing device (20) may further include a storage device (204), which is connected to the computing device (201) and the other processing devices (203) respectively and is used to store data of the computing device (201) and the other processing devices (203). A hardware implementation for operations related to structured sparsity is provided, which can simplify processing and improve the processing efficiency of a machine.

Description

Data processing circuit, data processing method and related products
Cross-reference to related applications
This application claims priority to the Chinese patent application No. 2020115661591, filed on December 25, 2020 and entitled "Data processing circuit, data processing method and related products".
Technical field
The present disclosure relates generally to the field of processors. More specifically, the present disclosure relates to a data processing circuit, a method of processing data using the data processing circuit, a chip and a board.
Background
In recent years, the rapid development of deep learning has brought leapfrog progress in algorithm performance across a range of fields such as computer vision and natural language processing. However, deep learning algorithms are computation-intensive and storage-intensive tools. As information processing tasks become increasingly complex and the requirements on algorithm real-time performance and accuracy keep rising, neural networks are often designed deeper and deeper, so that their computation and storage requirements grow ever larger. As a result, existing deep-learning-based artificial intelligence technology is difficult to apply directly on mobile phones, satellites or embedded devices with limited hardware resources.
Therefore, the compression, acceleration and optimization of deep neural network models become particularly important. A great deal of research has attempted to reduce the computation and storage requirements of neural networks without affecting model accuracy, which is of great significance for the engineering application of deep learning technology on embedded and mobile devices. Sparsification is precisely one of these model lightweighting methods.
Network parameter sparsification reduces the redundant components in a larger network by appropriate methods, so as to lower the network's demand for computation and storage space. Existing hardware and/or instruction sets cannot efficiently support sparsification processing.
Summary of the invention
In order to at least partially solve one or more technical problems mentioned in the background, the solutions of the present disclosure provide a data processing circuit, a data processing method, a chip and a board.
In a first aspect, the present disclosure discloses a data processing circuit, comprising a control circuit, a storage circuit and an operation circuit, wherein: the control circuit is configured to control the storage circuit and the operation circuit to perform structured sparse processing on one dimension of at least one tensor data; the storage circuit is configured to store information, the information including at least information before and/or after sparsification; and the operation circuit is configured to perform structured sparse processing on one dimension of the tensor data under the control of the control circuit.
In a second aspect, the present disclosure provides a chip comprising the data processing circuit of any embodiment of the foregoing first aspect.
In a third aspect, the present disclosure provides a board comprising the chip of any embodiment of the foregoing second aspect.
In a fourth aspect, the present disclosure provides a method of processing data using the data processing circuit of any embodiment of the foregoing first aspect.
With the data processing circuit, the method of processing data using the data processing circuit, the chip and the board provided above, the embodiments of the present disclosure provide a hardware circuit that supports structured sparse operations on data, which can perform structured sparse processing on one dimension of tensor data. In some embodiments, the data processing circuit of the embodiments of the present disclosure can support structured sparsity in the inference and training processes of neural networks, and the dimension being sparsified is the input channel dimension. By providing dedicated structured-sparsity hardware to perform operations related to structured sparsity, processing can be simplified, thereby improving the processing efficiency of the machine.
Brief description of the drawings
The above and other objects, features and advantages of exemplary embodiments of the present disclosure will become readily understood by reading the following detailed description with reference to the accompanying drawings. In the drawings, several embodiments of the present disclosure are shown by way of example and not limitation, and the same or corresponding reference numerals indicate the same or corresponding parts, in which:
FIG. 1 is a structural diagram illustrating a board according to an embodiment of the present disclosure;
FIG. 2 is a structural diagram illustrating a combined processing device according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram illustrating the internal structure of a processor core of a single-core or multi-core computing device according to an embodiment of the present disclosure;
FIG. 4 shows a structural block diagram of a data processing circuit according to an embodiment of the present disclosure;
FIGS. 5A-5D show partial structural schematic diagrams of the operation circuit according to embodiments of the present disclosure; and
FIG. 6 shows an exemplary operation pipeline for structured sparse processing according to an embodiment of the present disclosure.
Detailed description
The technical solutions in the embodiments of the present disclosure will be described below clearly and completely with reference to the accompanying drawings in the embodiments of the present disclosure. Obviously, the described embodiments are some, but not all, of the embodiments of the present disclosure. Based on the embodiments in the present disclosure, all other embodiments obtained by those skilled in the art without creative effort fall within the protection scope of the present disclosure.
It should be understood that the terms "first", "second", "third", "fourth" and the like in the claims, description and drawings of the present disclosure are used to distinguish different objects, rather than to describe a particular order. The terms "comprise" and "include" used in the description and claims of the present disclosure indicate the presence of the described features, wholes, steps, operations, elements and/or components, but do not exclude the presence or addition of one or more other features, wholes, steps, operations, elements, components and/or collections thereof.
It should also be understood that the terms used in the description of the present disclosure are for the purpose of describing particular embodiments only and are not intended to limit the present disclosure. As used in the description and claims of the present disclosure, the singular forms "a", "an" and "the" are intended to include the plural forms unless the context clearly indicates otherwise. It should be further understood that the term "and/or" used in the description and claims of the present disclosure refers to any and all possible combinations of one or more of the associated listed items, and includes these combinations.
As used in this description and the claims, the term "if" may be contextually interpreted as "when", "once", "in response to determining" or "in response to detecting".
Specific embodiments of the present disclosure are described in detail below with reference to the accompanying drawings.
FIG. 1 shows a schematic structural diagram of a board 10 according to an embodiment of the present disclosure. As shown in FIG. 1, the board 10 includes a chip 101, which is a system-on-chip (SoC) integrating one or more combined processing devices. The combined processing device is an artificial intelligence computing unit used to support various deep learning and machine learning algorithms, meeting the intelligent processing requirements of complex scenarios in fields such as computer vision, speech, natural language processing and data mining. In particular, deep learning technology is widely applied in the field of cloud intelligence; a notable feature of cloud intelligence applications is the large amount of input data, which places high demands on the storage and computing capabilities of the platform. The board 10 of this embodiment is suitable for cloud intelligence applications, with huge off-chip storage, on-chip storage and powerful computing capability.
The chip 101 is connected to an external device 103 through an external interface device 102. The external device 103 is, for example, a server, a computer, a camera, a monitor, a mouse, a keyboard, a network card or a WiFi interface. Data to be processed can be transferred by the external device 103 to the chip 101 through the external interface device 102, and calculation results of the chip 101 can be transmitted back to the external device 103 via the external interface device 102. Depending on the application scenario, the external interface device 102 may have different interface forms, such as a PCIe interface.
The board 10 also includes a storage device 104 for storing data, which includes one or more storage units 105. The storage device 104 is connected to a control device 106 and the chip 101 through a bus for data transfer. The control device 106 in the board 10 is configured to regulate the state of the chip 101. To this end, in one application scenario, the control device 106 may include a microcontroller unit (MCU).
FIG. 2 is a structural diagram showing the combined processing device in the chip 101 of this embodiment. As shown in FIG. 2, the combined processing device 20 includes a computing device 201, an interface device 202, a processing device 203 and a storage device 204.
The computing device 201 is configured to perform operations specified by a user, and is mainly implemented as a single-core intelligent processor or a multi-core intelligent processor to perform deep learning or machine learning computations. It can interact with the processing device 203 through the interface device 202 to jointly complete the operations specified by the user.
The interface device 202 is used to transfer data and control instructions between the computing device 201 and the processing device 203. For example, the computing device 201 may obtain input data from the processing device 203 via the interface device 202 and write it into the on-chip storage device of the computing device 201. Further, the computing device 201 may obtain control instructions from the processing device 203 via the interface device 202 and write them into the on-chip control cache of the computing device 201. Alternatively or optionally, the interface device 202 may also read data in the storage device of the computing device 201 and transmit it to the processing device 203.
As a general-purpose processing device, the processing device 203 performs basic control including, but not limited to, data transfer and starting and/or stopping the computing device 201. Depending on the implementation, the processing device 203 may be one or more types of processors among central processing units (CPUs), graphics processing units (GPUs) or other general-purpose and/or special-purpose processors, including but not limited to digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc., and their number can be determined according to actual needs. As mentioned above, the computing device 201 of the present disclosure considered alone may be regarded as having a single-core structure or a homogeneous multi-core structure; however, when the computing device 201 and the processing device 203 are considered together, the two are regarded as forming a heterogeneous multi-core structure.
The storage device 204 is used to store data to be processed. It may be a DRAM, specifically DDR memory, typically 16G or larger in size, and is used to save the data of the computing device 201 and/or the processing device 203.
FIG. 3 shows a schematic diagram of the internal structure of a processor core when the computing device 201 is a single-core or multi-core device. The computing device 301 is used to process input data for computer vision, speech, natural language, data mining and the like. The computing device 301 includes three modules: a control module 31, an operation module 32 and a storage module 33.
The control module 31 is used to coordinate and control the work of the operation module 32 and the storage module 33 to complete deep learning tasks. It includes an instruction fetch unit (IFU) 311 and an instruction decode unit (IDU) 312. The instruction fetch unit 311 is used to obtain instructions from the processing device 203, and the instruction decode unit 312 decodes the obtained instructions and sends the decoding results as control information to the operation module 32 and the storage module 33.
The operation module 32 includes a vector operation unit 321 and a matrix operation unit 322. The vector operation unit 321 is used to perform vector operations and can support complex operations such as vector multiplication, addition and nonlinear transformation; the matrix operation unit 322 is responsible for the core computations of deep learning algorithms, namely matrix multiplication and convolution.
The storage module 33 is used to store or transfer related data, and includes a neuron storage unit (neuron RAM, NRAM) 331, a parameter storage unit (weight RAM, WRAM) 332 and a direct memory access module (DMA) 333. The NRAM 331 is used to store input neurons, output neurons and intermediate results after computation; the WRAM 332 is used to store the convolution kernels of the deep learning network, namely the weights; and the DMA 333 is connected to the DRAM 204 through a bus 34 and is responsible for data transfer between the computing device 301 and the DRAM 204.
Based on the foregoing hardware environment, the embodiments of the present disclosure provide a data processing circuit that supports structured sparse operations on data. By providing a hardware solution for structured sparse processing, various kinds of processing related to structured sparsity can be simplified and accelerated.
FIG. 4 shows a structural block diagram of a data processing circuit 400 according to an embodiment of the present disclosure. The data processing circuit 400 may be implemented, for example, in the computing device 201 of FIG. 2. As shown in the figure, the data processing circuit 400 may include a control circuit 410, a storage circuit 420 and an operation circuit 430.
The function of the control circuit 410 may be similar to that of the control module 31 of FIG. 3. It may include, for example, an instruction fetch unit for obtaining instructions from, for example, the processing device 203 of FIG. 2, and an instruction decode unit for decoding the obtained instructions and sending the decoding results as control information to the operation circuit 430 and the storage circuit 420.
In one embodiment, the control circuit 410 may be configured to control the storage circuit 420 and the operation circuit 430 to perform structured sparse processing on one dimension of at least one tensor data.
The storage circuit 420 may be configured to store information before and/or after sparsification. In one embodiment, the data to be structurally sparsified may be data in a neural network, such as weights and neurons. In this embodiment, the storage circuit may be, for example, the WRAM 332 and NRAM 331 of FIG. 3.
The operation circuit 430 may be configured to perform structured sparse processing on one dimension of the above tensor data under the control of the control circuit 410.
In some embodiments, the operation circuit 430 may include a structured sparse circuit 432, which may be configured to select, according to a sparse rule, n data elements as valid data elements from every m data elements of the dimension to be sparsified of the input data, where m>n. In one implementation, m=4 and n=2. In other implementations, when m=4, n may also take other values, such as 1 or 3.
In some embodiments, other processing, such as multiplication or convolution, needs to be performed on the sparsified data. The figure exemplarily shows that the operation circuit 430 may further include a convolution circuit 433, which may be configured to perform a convolution operation on structurally sparsified data. For example, the convolution circuit 433 may be configured to receive data to be convolved and perform a convolution operation thereon. The data to be convolved includes at least the sparsified input data received from the structured sparse circuit 432.
Depending on the application scenario, the structured sparse circuit 432 may perform structured sparse processing according to different sparse rules.
图5A-图5C示出了本披露实施例的运算电路的部分结构示意图。如图所示,结构化稀疏电路510中可以包括第一结构化稀疏子电路512和/或第二结构化稀疏子电路514。第一结构化稀疏子电路512可以配置用于按照指定的稀疏掩码来对输入数据执行结构化稀疏处理。第二结构化稀疏子电路514则可以配置用于按照预定的稀疏规则对输入数据执行结构化稀疏处理。在下面的场景中,以需要对结构化稀疏后的数据进行卷积处理为例进行描述。这些场景例如可以发生在神经网络的推理或训练过程中。
在第一场景中,待卷积的数据之一(不防假设为第一数据)可能事先已经进行了结构化稀疏处理,而待处理的数据中另一(假设为第二数据)需要按照第一数据的稀疏方式进行结构化稀疏。
第一场景例如可以是神经网络的推理过程,第一数据是权值,其已经离线进行了结构化稀疏,因此可以直接使用稀疏后的权值作为卷积输入。第二数据是神经元,其尚未进行稀疏化处理,需要根据权值的稀疏化得到的索引部分来进行结构化稀疏。
第一场景例如也可以是神经网络的正向训练过程,也称为训练的正向过程。在此场景 中,第一数据是权值,其在训练开始时预先进行了结构化稀疏,得到的数据部分和索引部分存储在例如DDR中,以便后续与第二数据执行卷积时使用。第二数据是神经元,其尚未进行稀疏化处理,需要根据权值稀疏化后的索引部分来进行结构化稀疏。
第一场景例如还可以是神经网络的反向权值梯度训练过程,或称训练的反向权值梯度过程。在此场景中,第一数据是权值,其在正向训练过程中已经进行了稀疏化,可以直接基于该稀疏化的权值进行更新。第二数据是权值梯度,其尚未进行稀疏化处理,需要根据权值的稀疏化得到的索引部分来进行结构化稀疏,从而与稀疏化后的权值相加,以更新权值。此时,可以利用第一结构化稀疏子电路对第二数据进行稀疏化处理。
图5A示出了第一场景中运算电路的部分结构示意图。如图所示,在此第一场景中,结构化稀疏电路510包括一个第一结构化稀疏子电路512,其接收输入的第二数据以及预先已结构化稀疏的第一数据的索引部分。第一结构化稀疏子电路512将第一数据的索引部分用作稀疏掩码,来对第二数据进行结构化稀疏处理。具体地,第一结构化稀疏子电路512根据第一数据的索引部分所指示的有效数据位置,从第一数据中提取对应位置的数据作为有效数据。在这些实施例中,第一结构化稀疏子电路512例如可以通过向量乘法或矩阵乘法等电路来实现。卷积电路520接收已结构化稀疏的第一数据以及从第一结构化稀疏子电路512输出的已稀疏化处理的第二数据,对二者执行卷积。
在第二场景中,待卷积的两个数据都未进行结构化稀疏处理,需要在卷积之前对两个数据分别进行结构化稀疏处理。此时,可以利用第二结构化稀疏子电路对第一数据和第二数据进行稀疏化处理。
图5B示出了第二场景中运算电路的部分结构示意图。如图所示,在此第二场景中,结构化稀疏电路510可以包括两个第二结构化稀疏子电路514,其分别接收待卷积的第一数据和第二数据,以便同时、独立地分别对第一数据和第二数据进行结构化稀疏处理,并将稀疏化后的数据输出给卷积电路520。第二结构化稀疏子电路514可以配置用于按照预定筛选规则来执行结构化稀疏处理,例如按照筛选绝对值较大的规则,从每m个数据元素中筛选出n个绝对值较大的数据元素作为有效数据元素。在这些实施例中,第二结构化稀疏子电路514例如可以通过配置由比较器等电路构成的多级运算流水线来实现上述处理。本领域技术人员可以理解,结构化稀疏电路510也可以只包括一个第二结构化稀疏子电路514,其顺次对第一数据和第二数据执行结构化稀疏处理。
在第三场景中,待卷积的两个数据都未进行结构化稀疏处理,需要在卷积之前对两个数据分别进行结构化稀疏处理,并且其中一个数据(例如,第一数据)需要使用另一数据(例如,第二数据)的稀疏化处理后的索引部分作为稀疏掩码。
第三场景例如可以是神经网络的正向训练过程,或称为训练的正向过程。第一数据是神经元,其尚未进行稀疏化处理,需要根据权值(也即,第二数据)的稀疏化得到的索引部分来进行结构化稀疏。第二数据是权值,需要进行结构化稀疏以得到稀疏化后的数据部分和索引部分,其中索引部分用作第一数据的稀疏化处理时的稀疏掩码,数据部分用于与稀疏化后的第一数据进行卷积运算。
FIG. 5C shows a partial structural schematic diagram of the operation circuit in the third scenario. As shown, in this third scenario the structured sparsity circuit 510 may include a first structured sparsity sub-circuit 512 and a second structured sparsity sub-circuit 514. The second structured sparsity sub-circuit 514 may be configured to perform structured sparsity processing on, for example, the second data according to a predetermined screening rule, and to provide the index part of the sparsified second data to the first structured sparsity sub-circuit 512. The first structured sparsity sub-circuit 512 uses the index part of the second data as a sparsity mask to perform structured sparsity processing on the first data. The convolution circuit 520 receives the structurally sparsified first data and second data from the first structured sparsity sub-circuit 512 and the second structured sparsity sub-circuit 514, respectively, and performs convolution on the two.
In a fourth scenario, neither of the two pieces of data to be convolved has undergone structured sparsity processing, and the same sparsity mask needs to be applied to both before convolution. The fourth scenario may be, for example, the forward training process of a neural network, where the sparsity mask has been generated in advance and stored, for example, in DDR. The first data is the neurons, which have not yet been sparsified and need to be structurally sparsified according to the sparsity mask. The second data is the weights, which have also not yet been sparsified and need to be structurally sparsified according to the same sparsity mask.
FIG. 5D shows a partial structural schematic diagram of the operation circuit in the fourth scenario. As shown, in this fourth scenario the structured sparsity circuit 510 includes two first structured sparsity sub-circuits 512, which respectively receive the input first data and second data as well as the pre-generated sparsity mask. The first structured sparsity sub-circuits 512 perform structured sparsity processing on the first data and the second data, respectively, according to that sparsity mask. The convolution circuit 520 receives the structurally sparsified first data and the sparsified second data, and performs convolution on the two.
Those skilled in the art may also contemplate other application scenarios and design the structured sparsity circuit accordingly. For example, when the same sparsity mask needs to be applied to the two pieces of data to be convolved, the structured sparsity circuit may include two first structured sparsity sub-circuits for the processing.
FIG. 6 shows an exemplary operation pipeline for structured sparsity processing according to one embodiment of the present disclosure. The pipeline includes at least one multi-stage pipelined operation circuit, which comprises a plurality of operators arranged stage by stage and is configured to perform structured sparsity processing that selects, from m data elements, the n data elements with larger absolute values as valid data elements; it may be used, for example, to implement the aforementioned second structured sparsity sub-circuit. The embodiment of FIG. 6 shows, for m=4 and n=2, the structured sparsity processing that screens out the 2 data elements with larger absolute values from the 4 data elements A, B, C, and D. As shown, this processing can be performed by a multi-stage pipelined operation circuit composed of absolute-value operators, comparators, and the like.
The first pipeline stage may include m (4) absolute-value operators 610 for synchronously taking the absolute values of the 4 input data elements A, B, C, and D. To facilitate the final output of the valid data elements, in some embodiments the first pipeline stage also outputs the original data elements (i.e., A, B, C, and D) together with the results of the absolute-value operation (i.e., |A|, |B|, |C|, and |D|).
The second pipeline stage may include a permutation circuit 620 for permuting these m absolute values to generate m groups of data, where each group contains all m absolute values and the positions of the m absolute values differ from group to group.
In some embodiments, the permutation circuit may be a cyclic shifter that performs m-1 cyclic shifts on the arrangement of the m absolute values (e.g., |A|, |B|, |C|, and |D|) to generate the m groups of data. For example, in the illustrated example, 4 groups of data are generated, namely {|A|,|B|,|C|,|D|}, {|B|,|C|,|D|,|A|}, {|C|,|D|,|A|,|B|}, and {|D|,|A|,|B|,|C|}. Likewise, each group of data is output together with its corresponding original data element, one original data element per group.
The third pipeline stage includes a comparison circuit 630 for comparing the absolute values in these m groups of data and generating comparison results.
In some embodiments, the third pipeline stage may include m comparison paths, each comparison path including m-1 comparators (631, 632, 633). The m-1 comparators in the i-th comparison path compare one absolute value in the i-th group of data with the other m-1 absolute values in turn and generate comparison results, where 1≤i≤m.
As can be seen from the figure, the third pipeline stage can also be regarded as m-1 (3) sub-pipeline stages. Each sub-pipeline stage includes m comparators, each comparing its corresponding absolute value with one other absolute value; across the m-1 sub-pipeline stages, each path's absolute value is thus compared in turn with the other m-1 absolute values.
For example, in the illustrated example, the 4 comparators 631 of the first sub-pipeline stage compare the first absolute value of each of the 4 groups with the second absolute value of that group, outputting the comparison results w0, x0, y0, and z0, respectively. The 4 comparators 632 of the second sub-pipeline stage compare the first absolute value of each group with the third, outputting w1, x1, y1, and z1. The 4 comparators 633 of the third sub-pipeline stage compare the first absolute value of each group with the fourth, outputting w2, x2, y2, and z2. In this way, the comparison results of every absolute value against the other m-1 absolute values are obtained.
In some embodiments, the comparison results may be represented as a bitmap. For example, at the first comparator of the first comparison path, w0=1 when |A|≥|B|; at the second comparator of the first path, w1=0 when |A|<|C|; and at the third comparator of the first path, w2=1 when |A|≥|D|. The output of the first comparison path is thus {A, w0, w1, w2}, here {A, 1, 0, 1}. Similarly, the output of the second comparison path is {B, x0, x1, x2}, that of the third is {C, y0, y1, y2}, and that of the fourth is {D, z0, z1, z2}.
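To make the first three stages concrete, the following Python simulation (a sketch of the described dataflow, not of the hardware; ties are resolved with ≥ comparators as in the example above) reproduces the bitmaps for m=4:

    def pipeline_stages_1_to_3(elems):
        # Stage 1: absolute values. Stage 2: m cyclic rotations. Stage 3:
        # in path i, compare the first value of rotation i with the other
        # m-1 values using >=, so ties favor the earlier operand.
        m = len(elems)
        absval = [abs(e) for e in elems]
        groups = [absval[i:] + absval[:i] for i in range(m)]
        return [[1 if g[0] >= g[j] else 0 for j in range(1, m)] for g in groups]

    w, x, y, z = pipeline_stages_1_to_3([3, -1, 4, 2])  # A, B, C, D
    # w == [1, 0, 1]: |A|>=|B|, |A|<|C|, |A|>=|D| -- the {A, 1, 0, 1} case above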
The fourth pipeline stage includes a screening circuit 640 for selecting, according to the comparison results of the third stage, the n data elements with larger absolute values from the m data elements as valid data elements, and for outputting these valid data elements and the corresponding indices. The indices indicate the positions of the valid data elements among the m input data elements. For example, when A and C are screened out of the four data elements A, B, C, and D, their corresponding indices may be 0 and 2.
Based on the comparison results, suitable logic can be designed to select the n data elements with larger absolute values. Since several absolute values may be equal, in a further embodiment, when data elements with equal absolute values exist, the selection follows a specified priority order. For example, priority may be fixed from low index to high, with A having the highest priority and D the lowest. In one example, when the absolute values of A, C, and D are all equal and greater than the absolute value of B, the selected data are A and C.
From the foregoing comparison results, w0, w1, and w2 reveal how many of {|B|, |C|, |D|} are exceeded by |A|. If w0, w1, and w2 are all 1, then |A| is greater than |B|, |C|, and |D| and is the maximum of the four, so A is selected. If exactly two of w0, w1, and w2 are 1, then |A| is the second largest of the four absolute values, so A is also selected. Otherwise, A is not selected. Therefore, in some embodiments, the decision can be made by counting these occurrences.
In one implementation, the valid data elements can be selected based on the following logic. First, count how many times each data element is greater than the others; for example, define N_A = sum_w = w0+w1+w2, N_B = sum_x = x0+x1+x2, N_C = sum_y = y0+y1+y2, and N_D = sum_z = z0+z1+z2. Then decide the selection by the following conditions:
A is selected if: N_A=3; or N_A=2 and exactly one of N_B/N_C/N_D is 3.
B is selected if: N_B=3; or N_B=2 and exactly one of N_A/N_C/N_D is 3 and N_A≠2.
C is selected if: N_C=3 and at most one of N_A/N_B is 3; or N_C=2 and exactly one of N_A/N_B/N_D is 3 and neither N_A nor N_B is 2.
D is selected if: N_D=3 and at most one of N_A/N_B/N_C is 3; or N_D=2 and exactly one of N_A/N_B/N_C is 3 and none of N_A/N_B/N_C is 2.
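The following Python sketch (a software replay of the conditions just listed, with invented names; not the hardware implementation) can be used to check the logic, including the tie example given above:

    def select_two_of_four(bitmaps):
        # Replay the stage-4 selection conditions for m=4, n=2. `bitmaps`
        # are the stage-3 outputs [w, x, y, z]; returns indices 0..3 (A..D).
        n_a, n_b, n_c, n_d = (sum(b) for b in bitmaps)
        pick = []
        if n_a == 3 or (n_a == 2 and [n_b, n_c, n_d].count(3) == 1):
            pick.append(0)
        if n_b == 3 or (n_b == 2 and [n_a, n_c, n_d].count(3) == 1 and n_a != 2):
            pick.append(1)
        if (n_c == 3 and [n_a, n_b].count(3) <= 1) or \
           (n_c == 2 and [n_a, n_b, n_d].count(3) == 1 and 2 not in (n_a, n_b)):
            pick.append(2)
        if (n_d == 3 and [n_a, n_b, n_c].count(3) <= 1) or \
           (n_d == 2 and [n_a, n_b, n_c].count(3) == 1 and 2 not in (n_a, n_b, n_c)):
            pick.append(3)
        return pick

    # Tie example from above: |A| = |C| = |D| > |B| yields
    # w=[1,1,1], x=[0,0,0], y=[1,1,1], z=[1,1,1]
    print(select_two_of_four([[1, 1, 1], [0, 0, 0], [1, 1, 1], [1, 1, 1]]))  # [0, 2]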
Those skilled in the art will appreciate that, to ensure selection according to the predetermined priority, there is some redundancy in the above logic. Based on the magnitude and order information provided by the comparison results, other logic can also be designed to screen the valid data elements, and the present disclosure is not limited in this respect. Thus, the four-choose-two structured sparsity processing can be implemented by the multi-stage pipelined operation circuit of FIG. 6.
Those skilled in the art will appreciate that other forms of pipelined operation circuits can also be designed to implement structured sparsity processing, and the present disclosure is not limited in this respect.
The result of sparsification includes two parts: a data part and an index part. The data part contains the sparsified data, i.e., the valid data elements extracted according to the screening rule of the structured sparsity processing. The index part indicates the original positions of the sparsified data, i.e., of the valid data elements, in the pre-sparsification original data (i.e., the data to be sparsified).
The structurally sparsified data can be represented and/or stored in various forms. In one implementation, the structurally sparsified data may take the form of a structure in which the data part and the index part are bound to each other. In some embodiments, each bit in the index part may correspond to one data element. For example, when the data type is fix8, one data element is 8 bits, and each bit in the index part corresponds to 8 bits of data. In other embodiments, considering the hardware-level implementation when the structure is used later, each bit in the index part of the structure may be set to correspond to the position of N bits of data, where N is determined at least in part by the hardware configuration. For example, each bit in the index part of the structure may correspond to the position of 4 bits of data; then, when the data type is fix8, every 2 bits in the index part correspond to one fix8 data element. In some embodiments, the data part of the structure may be aligned according to a first alignment requirement and the index part according to a second alignment requirement, so that the whole structure also satisfies an alignment requirement. For example, the data part may be aligned to 64B and the index part to 32B, so that the whole structure is aligned to 96B (64B+32B). With such alignment requirements, the number of memory accesses in subsequent use can be reduced and processing efficiency improved.
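As a back-of-the-envelope model of these sizes (following the 1-index-bit-per-4-data-bits and 64B/32B alignment examples above; the function name is invented for this sketch):

    def sparse_struct_sizes(num_elems, m=4, n=2, elem_bits=8,
                            data_bits_per_index_bit=4):
        # Data part: n of every m elements survive. Index part: one bit per
        # data_bits_per_index_bit (here 4) bits of original data, i.e. two
        # index bits per fix8 element.
        data_bytes = num_elems * n // m * elem_bits // 8
        index_bytes = num_elems * elem_bits // data_bits_per_index_bit // 8
        return data_bytes, index_bytes

    print(sparse_struct_sizes(256))
    # (128, 64): two aligned chunks of 64 B data + 32 B index = 96 B each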
By using such a structure, the data part and the index part can be used as a unit. Since in structured sparsity processing the proportion of valid data elements to original data elements is fixed, e.g., n/m, the size of the sparsified data is also fixed or predictable. The structure can therefore be stored densely in the storage circuit without performance loss.
In other implementations, the data part and the index part obtained after sparsification may also be represented and/or stored separately for separate use. For example, the index part of the structurally sparsified second input data may be provided to the first structured sparsity sub-circuit 512 to be used as a mask for performing structured sparsity processing on the first input data. In this case, to accommodate different data types, each bit in the separately provided index part may indicate whether one data element is valid.
Returning to FIG. 4, in some embodiments the operation circuit 430 may further include a pre-processing circuit 431 and a post-processing circuit 434. The pre-processing circuit 431 may be configured to pre-process the data before operations performed by the structured sparsity circuit 432 and/or the convolution circuit 433; the post-processing circuit 434 may be configured to post-process the data after operation by the convolution circuit 433.
In some application scenarios, the aforementioned pre-processing and post-processing may include, for example, data splitting and/or data splicing operations. In structured sparsity processing, the pre-processing circuit may split the data to be sparsified into segments of m data elements each and then feed them to the structured sparsity circuit 432 for structured sparsity processing. In addition, the pre-processing circuit may also control the rate at which data is fed to the structured sparsity circuit. For example, in processing that involves structured sparsity, the pre-processing circuit may feed data to the structured sparsity circuit at a first rate, while in processing that does not involve structured sparsity, it may feed data to the convolution circuit at a second rate. The first rate is greater than the second rate, their ratio being, for example, equal to the sparsity ratio of the structured sparsity processing, e.g., m/n. For example, in four-choose-two structured sparsity processing, the first rate is twice the second rate. The first rate is thus determined at least in part by the processing capability of the convolution circuit and the sparsity ratio of the structured sparsity processing. The post-processing circuit 434 may perform fusion processing, such as addition, subtraction, and multiplication, on the output results of the convolution circuit.
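For illustration (a sketch under the stated m/n rate rule; the names and numbers are hypothetical, not hardware timing):

    def split_into_groups(data, m=4):
        # Pre-processing: segment the stream into m-element groups.
        return [data[i:i + m] for i in range(0, len(data), m)]

    def sparsifier_feed_rate(conv_rate, m=4, n=2):
        # First rate = second rate * m/n, so the convolution circuit stays
        # fully fed after the m-to-n pruning.
        return conv_rate * m // n

    print(split_into_groups(list(range(8))))  # [[0, 1, 2, 3], [4, 5, 6, 7]]
    print(sparsifier_feed_rate(64))           # 128: double for 4-choose-2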
As mentioned above, the data subjected to structured sparsity processing may be data in a neural network, such as weights and neurons. Data in a neural network usually has multiple dimensions and is also called tensor data. For example, in a convolutional neural network, data may have four dimensions: input channel, output channel, length, and width. In some embodiments, the above structured sparsity is performed on one dimension of the multi-dimensional data in the neural network. Specifically, in one implementation, it may be used for structured sparsification in the forward pass of a neural network (e.g., inference or forward training), in which case the structured sparsity processing is performed on the input channel dimension of the multi-dimensional data. In another implementation, it may be used for structured sparsification in the backward pass of a neural network (e.g., backward weight-gradient training), in which case the structured sparsity processing is likewise performed on the input channel dimension of the multi-dimensional data.
As the various application scenarios above show, embodiments of the present disclosure provide at least two scheduling schemes for the structured sparsification of weights.
Specifically, in one embodiment, the weights can be structurally sparsified online in advance during forward training, and the resulting index part and data part can be used directly in the subsequent convolution that performs structured sparsity with the neurons. For example, in a multi-core computing device organized in a system-on-chip/cluster/processor-core hierarchy, the task of structurally sparsifying the weights can be split for parallel execution: from the system-on-chip to different clusters, and from the clusters to different processor cores, thereby increasing execution speed. After each processor core finishes its computation, the index parts and data parts generated by all processor cores are written back to memory (e.g., DDR). Then, during forward training, the index part and data part of the weights are sent in full to the different processor cores to perform operations related to structured sparsity (e.g., structured sparse convolution), where the index part serves as the sparsification mask during execution and the data part directly participates in the convolution operation. By structurally sparsifying the weights in advance, the exposed overhead of executing sparsification serially with subsequent operations (e.g., convolution) can be reduced. This scheduling scheme occupies DDR space to store the data part and index part of the sparsified weights (e.g., in the structure form described above), and the sparsification process does not need to be re-executed when the weights participating in the operation are loaded repeatedly. When the weights are loaded more than twice, there can be a significant I/O benefit, as well as a considerable benefit in total time overhead. Therefore, when the neuron data is large and cannot all be used at once, so that the weights need to be loaded repeatedly, the scheduling scheme of this embodiment can improve execution performance.
In another embodiment, the weights and neurons can be structurally sparsified online during forward training, with the sparsity mask used for the sparsification generated in advance. Since the sparsity mask has a long usage cycle, it may be generated once every K iterations; there is thus no need to regenerate the mask each time, and it can instead be cached in DDR as the input to the structured sparsification of each iteration. In this embodiment, the unsparsified weights, the neurons, and the sparsity mask required for the operation can be sent directly to each processor core, and the sparsified weights and neurons directly participate in the operation. Compared with the previous embodiment, this embodiment does not occupy additional DDR to cache the sparsified weight data, which is instead used directly in subsequent operations. Therefore, when the neuron data is small, e.g., can be loaded in a single pass so that the weights do not need to be loaded repeatedly, the scheduling scheme of this embodiment can improve execution performance.
As can be seen from the above description, embodiments of the present disclosure provide a hardware solution for performing processing related to structured sparsity. This structured sparsity processing can support the inference and training processes of neural networks, performing structured sparsity processing on one dimension (e.g., the input channel dimension) of tensor data in a neural network. By providing dedicated structured-sparsity hardware to perform the operations related to structured sparsity, the processing can be simplified, thereby improving the processing efficiency of the machine.
Depending on the application scenario, the electronic device or apparatus of the present disclosure may include servers, cloud servers, server clusters, data processing apparatuses, robots, computers, printers, scanners, tablet computers, intelligent terminals, PC devices, Internet-of-Things terminals, mobile terminals, mobile phones, dashcams, navigators, sensors, webcams, cameras, video cameras, projectors, watches, earphones, mobile storage, wearable devices, visual terminals, autonomous-driving terminals, vehicles, household appliances, and/or medical devices. The vehicles include airplanes, ships, and/or cars; the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lamps, gas stoves, and range hoods; the medical devices include nuclear magnetic resonance instruments, B-mode ultrasound scanners, and/or electrocardiographs. The electronic device or apparatus of the present disclosure can also be applied to fields such as the Internet, the Internet of Things, data centers, energy, transportation, public administration, manufacturing, education, power grids, telecommunications, finance, retail, construction sites, and healthcare. Further, the electronic device or apparatus of the present disclosure can also be used in cloud, edge, and terminal application scenarios related to artificial intelligence, big data, and/or cloud computing. In one or more embodiments, electronic devices or apparatuses with high computing power according to the solution of the present disclosure can be applied to cloud devices (e.g., cloud servers), while electronic devices or apparatuses with low power consumption can be applied to terminal devices and/or edge devices (e.g., smartphones or webcams). In one or more embodiments, the hardware information of the cloud device and the hardware information of the terminal and/or edge device are mutually compatible, so that suitable hardware resources can be matched from the hardware resources of the cloud device, according to the hardware information of the terminal and/or edge device, to simulate the hardware resources of the terminal and/or edge device, thereby achieving unified management, scheduling, and collaborative work of device-cloud integration or cloud-edge-device integration.
It should be noted that, for brevity, the present disclosure describes some methods and their embodiments as a series of actions and combinations thereof, but those skilled in the art will understand that the solution of the present disclosure is not limited by the order of the described actions. Accordingly, based on the disclosure or teaching of the present disclosure, those skilled in the art will understand that certain steps therein may be performed in other orders or simultaneously. Further, those skilled in the art will understand that the embodiments described in the present disclosure may be regarded as optional embodiments, i.e., the actions or modules involved therein are not necessarily required for the implementation of one or more solutions of the present disclosure. In addition, depending on the solution, the descriptions of some embodiments in the present disclosure have different emphases. In view of this, those skilled in the art will understand that, for parts not described in detail in one embodiment of the present disclosure, reference may be made to the relevant descriptions of other embodiments.
In terms of specific implementation, based on the disclosure and teaching of the present disclosure, those skilled in the art will understand that several embodiments disclosed herein may also be implemented in other ways not disclosed herein. For example, the units in the electronic device or apparatus embodiments described above are divided herein on the basis of logical function, but there may be other ways of division in actual implementation. As another example, multiple units or components may be combined or integrated into another system, or some features or functions in a unit or component may be selectively disabled. As far as the connection relationships between different units or components are concerned, the connections discussed above in conjunction with the drawings may be direct or indirect couplings between units or components. In some scenarios, the aforementioned direct or indirect couplings involve communication connections using interfaces, where the communication interfaces may support electrical, optical, acoustic, magnetic, or other forms of signal transmission.
In the present disclosure, units described as separate components may or may not be physically separated, and components shown as units may or may not be physical units. The aforementioned components or units may be co-located or distributed over multiple network units. In addition, according to actual needs, some or all of the units may be selected to achieve the purposes of the solutions described in the embodiments of the present disclosure. In addition, in some scenarios, multiple units in the embodiments of the present disclosure may be integrated into one unit, or each unit may exist physically separately.
In other implementation scenarios, the above integrated units may also be implemented in the form of hardware, i.e., as specific hardware circuits, which may include digital circuits and/or analog circuits, etc. The physical realization of the hardware structure of a circuit may include but is not limited to physical devices, and the physical devices may include but are not limited to devices such as transistors or memristors. In view of this, the various apparatuses described herein (e.g., computing apparatuses or other processing apparatuses) may be implemented by appropriate hardware processors, such as central processing units, GPUs, FPGAs, DSPs, and ASICs. Further, the aforementioned storage unit or storage apparatus may be any suitable storage medium (including magnetic or magneto-optical storage media, etc.), which may be, for example, resistive random access memory (RRAM), dynamic random access memory (DRAM), static random access memory (SRAM), enhanced dynamic random access memory (EDRAM), high bandwidth memory (HBM), hybrid memory cube (HMC), ROM, RAM, etc.
The foregoing may be better understood in light of the following clauses:
Clause 1. A data processing circuit, comprising a control circuit, a storage circuit, and an operation circuit, wherein:
the control circuit is configured to control the storage circuit and the operation circuit to perform structured sparsity processing on one dimension of at least one piece of tensor data;
the storage circuit is configured to store information, the information at least including information before and/or after sparsification; and
the operation circuit is configured to perform, under the control of the control circuit, structured sparsity processing on one dimension of the tensor data.
Clause 2. The data processing circuit of clause 1, wherein the operation circuit includes a structured sparsity circuit configured to select, according to a sparsity rule, n data elements out of every m data elements along the dimension to be sparsified of the input data as valid data elements, where m>n.
Clause 3. The data processing circuit of clause 2, wherein the structured sparsity circuit includes:
a first structured sparsity sub-circuit configured to perform structured sparsity processing on input data according to a specified sparsity mask.
Clause 4. The data processing circuit of clause 2 or 3, wherein the structured sparsity circuit includes:
a second structured sparsity sub-circuit configured to perform structured sparsity processing on input data according to a predetermined sparsity rule.
Clause 5. The data processing circuit of clause 3 or 4, wherein the at least one piece of tensor data includes first data, and the operation circuit is configured to:
use the first structured sparsity sub-circuit to perform structured sparsity processing on the first data, taking as a sparsity mask the index part corresponding to second data that has undergone structured sparsity processing, wherein the index part indicates the positions of the valid data elements in the structured sparsification to be performed.
Clause 6. The data processing circuit of clause 5, wherein:
the structurally sparsified second data has been structurally sparsified online in advance or offline and stored in the storage circuit, or
the structurally sparsified second data is generated online by structured sparsity processing using the second structured sparsity sub-circuit.
Clause 7. The data processing circuit of any of clauses 5-6, wherein the operation circuit further includes:
a convolution circuit configured to perform a convolution operation on the structurally sparsified first data and second data.
Clause 8. The data processing circuit of any of clauses 5-7, wherein the structurally sparsified second data is in the form of a structure, the structure including a data part and an index part bound to each other, the data part including the valid data elements after structured sparsity processing, and the index part indicating the positions of the sparsified data in the pre-sparsification data.
Clause 9. The data processing circuit of any of clauses 4-8, wherein the second structured sparsity sub-circuit further includes: at least one multi-stage pipelined operation circuit, which includes a plurality of operators arranged stage by stage and is configured to perform structured sparsity processing that selects, from m data elements, the n data elements with larger absolute values as valid data elements.
Clause 10. The data processing circuit of clause 9, wherein the multi-stage pipelined operation circuit includes four pipeline stages, wherein:
the first pipeline stage includes m absolute-value operators for respectively taking the absolute values of the m data elements to be sparsified, to generate m absolute values;
the second pipeline stage includes a permutation circuit for permuting the m absolute values to generate m groups of data, wherein each group of data includes the m absolute values and the positions of the m absolute values differ from group to group;
the third pipeline stage includes m comparison paths for comparing the absolute values in the m groups of data and generating comparison results; and
the fourth pipeline stage includes a screening circuit for selecting, according to the comparison results, the n data elements with larger absolute values as valid data elements, and for outputting the valid data elements and the corresponding indices, the indices indicating the positions of the valid data elements among the m data elements.
Clause 11. The data processing circuit of clause 10, wherein each comparison path in the third pipeline stage includes m-1 comparators, and the m-1 comparators in the i-th comparison path are used to compare one absolute value in the i-th group of data with the other m-1 absolute values in turn and generate comparison results, 1≤i≤m.
Clause 12. The data processing circuit of any of clauses 10-11, wherein the screening circuit is further configured to select according to a specified priority order when there are data elements with equal absolute values.
条款14、根据条款1-13任一所述的数据处理电路,所述数据处理电路用于以下任一过程:
神经网络的推理过程;
神经网络的正向训练过程;以及
神经网络的反向权值梯度训练过程。
Clause 15. A chip, comprising the data processing circuit of any of clauses 1-14.
Clause 16. A board card, comprising the chip of clause 15.
Clause 17. A method of processing data using the data processing circuit of any of clauses 1-14.
The embodiments of the present disclosure have been described in detail above. Specific examples have been applied herein to explain the principles and implementations of the present disclosure, and the descriptions of the above embodiments are intended only to help understand the method of the present disclosure and its core idea. Meanwhile, those of ordinary skill in the art, based on the idea of the present disclosure, may make changes to the specific implementation and the scope of application. In summary, the contents of this specification should not be construed as limiting the present disclosure.

Claims (17)

  1. A data processing circuit, comprising a control circuit, a storage circuit, and an operation circuit, wherein:
    the control circuit is configured to control the storage circuit and the operation circuit to perform structured sparsity processing on one dimension of at least one piece of tensor data;
    the storage circuit is configured to store information, the information at least including information before and/or after sparsification; and
    the operation circuit is configured to perform, under the control of the control circuit, structured sparsity processing on one dimension of the tensor data.
  2. The data processing circuit of claim 1, wherein the operation circuit includes a structured sparsity circuit configured to select, according to a sparsity rule, n data elements out of every m data elements along the dimension to be sparsified of the input data as valid data elements, where m>n.
  3. The data processing circuit of claim 2, wherein the structured sparsity circuit includes:
    a first structured sparsity sub-circuit configured to perform structured sparsity processing on input data according to a specified sparsity mask.
  4. The data processing circuit of claim 2 or 3, wherein the structured sparsity circuit includes:
    a second structured sparsity sub-circuit configured to perform structured sparsity processing on input data according to a predetermined sparsity rule.
  5. The data processing circuit of claim 3 or 4, wherein the at least one piece of tensor data includes first data, and the operation circuit is configured to:
    use the first structured sparsity sub-circuit to perform structured sparsity processing on the first data, taking as a sparsity mask the index part corresponding to second data that has undergone structured sparsity processing, wherein the index part indicates the positions of the valid data elements in the structured sparsification to be performed.
  6. The data processing circuit of claim 5, wherein:
    the structurally sparsified second data has been structurally sparsified online in advance or offline and stored in the storage circuit, or
    the structurally sparsified second data is generated online by structured sparsity processing using the second structured sparsity sub-circuit.
  7. The data processing circuit of any of claims 5-6, wherein the operation circuit further includes:
    a convolution circuit configured to perform a convolution operation on the structurally sparsified first data and second data.
  8. The data processing circuit of any of claims 5-7, wherein the structurally sparsified second data is in the form of a structure, the structure including a data part and an index part bound to each other, the data part including the valid data elements after structured sparsity processing, and the index part indicating the positions of the sparsified data in the pre-sparsification data.
  9. The data processing circuit of any of claims 4-8, wherein the second structured sparsity sub-circuit further includes: at least one multi-stage pipelined operation circuit, which includes a plurality of operators arranged stage by stage and is configured to perform structured sparsity processing that selects, from m data elements, the n data elements with larger absolute values as valid data elements.
  10. The data processing circuit of claim 9, wherein the multi-stage pipelined operation circuit includes four pipeline stages, wherein:
    the first pipeline stage includes m absolute-value operators for respectively taking the absolute values of the m data elements to be sparsified, to generate m absolute values;
    the second pipeline stage includes a permutation circuit for permuting the m absolute values to generate m groups of data, wherein each group of data includes the m absolute values and the positions of the m absolute values differ from group to group;
    the third pipeline stage includes m comparison paths for comparing the absolute values in the m groups of data and generating comparison results; and
    the fourth pipeline stage includes a screening circuit for selecting, according to the comparison results, the n data elements with larger absolute values as valid data elements, and for outputting the valid data elements and the corresponding indices, the indices indicating the positions of the valid data elements among the m data elements.
  11. The data processing circuit of claim 10, wherein each comparison path in the third pipeline stage includes m-1 comparators, and the m-1 comparators in the i-th comparison path are used to compare one absolute value in the i-th group of data with the other m-1 absolute values in turn and generate comparison results, 1≤i≤m.
  12. The data processing circuit of any of claims 10-11, wherein the screening circuit is further configured to select according to a specified priority order when there are data elements with equal absolute values.
  13. The data processing circuit of any of claims 1-12, wherein the one dimension includes an input channel dimension, and the unit of the structured sparsity processing is a data line in the input channel dimension of the tensor data.
  14. The data processing circuit of any of claims 1-13, the data processing circuit being used for any of the following processes:
    an inference process of a neural network;
    a forward training process of a neural network; and
    a backward weight-gradient training process of a neural network.
  15. A chip, comprising the data processing circuit of any of claims 1-14.
  16. A board card, comprising the chip of claim 15.
  17. A method of processing data using the data processing circuit of any of claims 1-14.
PCT/CN2021/119946 2020-12-25 2021-09-23 Data processing circuit, data processing method, and related products WO2022134688A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US18/257,723 US20240070445A1 (en) 2020-12-25 2021-09-23 Data processing circuit, data processing method, and related products

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011566159.1A CN114692847B (zh) 2020-12-25 2020-12-25 Data processing circuit, data processing method, and related products
CN202011566159.1 2020-12-25

Publications (1)

Publication Number Publication Date
WO2022134688A1 true WO2022134688A1 (zh) 2022-06-30

Family

ID=82130854

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/119946 WO2022134688A1 (zh) 2021-09-23 2020-12-25 Data processing circuit, data processing method, and related products

Country Status (3)

Country Link
US (1) US20240070445A1 (zh)
CN (1) CN114692847B (zh)
WO (1) WO2022134688A1 (zh)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170277628A1 (en) * 2016-03-24 2017-09-28 Somnath Paul Technologies for memory management of neural networks with sparse connectivity
CN107341544A (zh) * 2017-06-30 2017-11-10 清华大学 一种基于可分割阵列的可重构加速器及其实现方法
CN107909148A (zh) * 2017-12-12 2018-04-13 北京地平线信息技术有限公司 用于执行卷积神经网络中的卷积运算的装置
CN108510066A (zh) * 2018-04-08 2018-09-07 清华大学 一种应用于卷积神经网络的处理器
CN109598338A (zh) * 2018-12-07 2019-04-09 东南大学 一种基于fpga的计算优化的卷积神经网络加速器
CN109740739A (zh) * 2018-12-29 2019-05-10 北京中科寒武纪科技有限公司 神经网络计算装置、神经网络计算方法及相关产品

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11055063B2 (en) * 2016-05-02 2021-07-06 Marvell Asia Pte, Ltd. Systems and methods for deep learning processor
US11551028B2 (en) * 2017-04-04 2023-01-10 Hailo Technologies Ltd. Structured weight based sparsity in an artificial neural network
CN108388446A (zh) * 2018-02-05 2018-08-10 上海寒武纪信息科技有限公司 运算模块以及方法
CN111160516B (zh) * 2018-11-07 2023-09-05 杭州海康威视数字技术股份有限公司 一种深度神经网络的卷积层稀疏化方法及装置
US11126690B2 (en) * 2019-03-29 2021-09-21 Intel Corporation Machine learning architecture support for block sparsity
CN111027691B (zh) * 2019-12-25 2023-01-17 上海寒武纪信息科技有限公司 用于神经网络运算、训练的装置、设备及板卡

Also Published As

Publication number Publication date
CN114692847B (zh) 2024-01-09
US20240070445A1 (en) 2024-02-29
CN114692847A (zh) 2022-07-01

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 21908689; Country of ref document: EP; Kind code of ref document: A1)
WWE WIPO information: entry into national phase (Ref document number: 18257723; Country of ref document: US)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 21908689; Country of ref document: EP; Kind code of ref document: A1)