WO2022134872A1 - Data processing apparatus, data processing method and related product - Google Patents

Data processing apparatus, data processing method and related product

Info

Publication number
WO2022134872A1
Authority
WO
WIPO (PCT)
Prior art keywords: data, sparse, structured, circuit, convolution
Prior art date
Application number
PCT/CN2021/128187
Other languages
English (en)
Chinese (zh)
Inventor
郑万凯
陈伟伦
高钰峰
Original Assignee
中科寒武纪科技股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN202011566134.1A external-priority patent/CN114692844A/zh
Priority claimed from CN202011566148.3A external-priority patent/CN114692846A/zh
Application filed by 中科寒武纪科技股份有限公司 filed Critical 中科寒武纪科技股份有限公司
Publication of WO2022134872A1 publication Critical patent/WO2022134872A1/fr

Classifications

    • G06F12/06 Addressing a physical block of locations, e.g. base addressing, module addressing, memory dedication
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G06N20/00 Machine learning
    • G06N3/04 Neural networks: architecture, e.g. interconnection topology
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G06N3/08 Neural networks: learning methods

Definitions

  • the present disclosure relates generally to the field of processors. More specifically, the present disclosure relates to a data processing apparatus, a data processing method, a chip and a board.
  • the network parameter sparsification is to reduce the redundant components in the larger network by appropriate methods, so as to reduce the network's demand for computation and storage space.
  • Existing hardware and/or instruction sets cannot effectively support sparsification and the operations related to it.
  • the solution of the present disclosure provides a data processing apparatus, a data processing method, a chip and a board.
  • the present disclosure discloses a data processing apparatus, comprising: a control circuit configured to parse a convolution instruction, the convolution instruction including a sparse flag bit for indicating whether to perform a structured sparse convolution operation; a storage circuit configured to store pre-convolution and/or post-convolution information; and an arithmetic circuit configured to perform a corresponding convolution operation according to the convolution instruction.
  • the present disclosure provides a chip including the data processing apparatus of any embodiment of the foregoing first aspect.
  • the present disclosure provides a board including the chip of any embodiment of the foregoing second aspect.
  • the present disclosure provides a data processing method, the method comprising: parsing a convolution instruction, the convolution instruction including a sparse flag bit for indicating whether to perform a structured sparse convolution operation; reading a corresponding operand according to the convolution instruction; and performing a corresponding convolution operation on the operand according to the convolution instruction.
  • the embodiments of the present disclosure provide a convolution instruction, which includes a sparse flag bit for indicating whether to perform a structured sparse convolution operation.
  • the corresponding operation circuit can be configured according to the value of the flag bit to perform the corresponding convolution operation.
  • the arithmetic circuit may be configured to perform structured sparse processing first, and then perform convolution on the sparsified data.
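As a rough illustration of this flow (not the disclosed hardware itself), the following Python sketch first applies structured sparsification to a group of weights and then runs a plain 1-D convolution over the sparsified result; the m = 4, n = 2 parameters and all function names are assumptions made for illustration.

```python
# Illustrative sketch only: structured sparsification of a weight group
# followed by a simple 1-D convolution over the sparsified weights.

def sparsify_2_of_4(weights):
    """Keep the 2 largest-magnitude elements of every group of 4, zero the rest."""
    out = list(weights)
    for g in range(0, len(weights), 4):
        group = weights[g:g + 4]
        keep = sorted(range(len(group)),
                      key=lambda i: abs(group[i]), reverse=True)[:2]
        for i in range(len(group)):
            if i not in keep:
                out[g + i] = 0.0
    return out

def conv1d(data, kernel):
    """Valid-mode 1-D convolution (cross-correlation form)."""
    k = len(kernel)
    return [sum(data[i + j] * kernel[j] for j in range(k))
            for i in range(len(data) - k + 1)]

weights = [0.1, -2.0, 0.3, 1.5]           # one group of m = 4 weights
sparse_w = sparsify_2_of_4(weights)       # [0.0, -2.0, 0.0, 1.5]
result = conv1d([1.0, 2.0, 3.0, 4.0, 5.0], sparse_w)   # [2.0, 1.5]
```

Zeroing is used here only to make the effect visible; the hardware described below instead compresses the kept elements and carries an index alongside them.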
  • FIG. 1 is a structural diagram illustrating a board according to an embodiment of the present disclosure
  • FIG. 2 is a block diagram illustrating a combined processing apparatus according to an embodiment of the present disclosure
  • FIG. 3 is a schematic diagram showing the internal structure of a single-core computing device according to an embodiment of the present disclosure
  • FIG. 4 is a schematic diagram showing the internal structure of a multi-core computing device according to an embodiment of the present disclosure
  • FIG. 5 is a schematic diagram showing the internal structure of a processor core according to an embodiment of the present disclosure.
  • FIG. 6 is a schematic structural diagram illustrating a data processing apparatus according to an embodiment of the present disclosure.
  • FIGS. 7A-7C show partial structural schematic diagrams of the operation circuit according to the embodiment of the present disclosure.
  • FIG. 8 shows an exemplary pipeline diagram of structured sparse processing according to an embodiment of the present disclosure
  • FIG. 9 is an exemplary flowchart illustrating a data processing method of an embodiment of the present disclosure.
  • FIG. 10 shows a schematic diagram of a data storage space according to an embodiment of the present disclosure
  • FIG. 11 shows a schematic diagram of data partitioning in a data storage space according to an embodiment of the present disclosure
  • FIG. 12 is a schematic structural diagram illustrating a data processing apparatus according to another embodiment of the present disclosure.
  • FIG. 13 is an exemplary flowchart illustrating a data processing method of another embodiment of the present disclosure.
  • the term “if” may be contextually interpreted as “when” or “once” or “in response to determining” or “in response to detecting”.
  • FIG. 1 shows a schematic structural diagram of a board 10 according to an embodiment of the present disclosure.
  • the board 10 includes a chip 101, which is a system-on-chip (SoC), or a system-on-a-chip, and integrates one or more combined processing devices.
  • the combined processing device is an artificial intelligence computing unit used to support various deep learning and machine learning algorithms, meeting the intelligent processing demands of complex scenarios in the fields of computer vision, speech, natural language processing, and data mining.
  • deep learning technology is widely used in the field of cloud intelligence.
  • a notable feature of cloud intelligence applications is the large amount of input data, which has high requirements on the storage capacity and computing capacity of the platform.
  • the board 10 in this embodiment is suitable for cloud intelligence applications, featuring large off-chip storage, large on-chip storage, and powerful computing capability.
  • the chip 101 is connected to an external device 103 through an external interface device 102 .
  • the external device 103 is, for example, a server, a computer, a camera, a monitor, a mouse, a keyboard, a network card or a wifi interface, and the like.
  • the data to be processed can be transmitted to the chip 101 by the external device 103 through the external interface device 102 .
  • the calculation result of the chip 101 can be transmitted back to the external device 103 via the external interface device 102 .
  • the external interface device 102 may have different interface forms, such as a PCIe interface and the like.
  • the board 10 also includes a storage device 104 for storing data, which includes one or more storage units 105 .
  • the storage device 104 is connected to the control device 106 and the chip 101 through a bus and performs data transmission.
  • the control device 106 in the board 10 is configured to control the state of the chip 101 .
  • the control device 106 may include a microcontroller (Micro Controller Unit, MCU).
  • FIG. 2 is a block diagram showing a combined processing device in the chip 101 of this embodiment.
  • the combined processing device 20 includes a computing device 201 , an interface device 202 , a processing device 203 and a storage device 204 .
  • the computing device 201 is configured to perform operations specified by the user, and is mainly implemented as a single-core or multi-core intelligent processor for deep learning or machine learning calculations; it can interact with the processing device 203 through the interface device 202 to jointly complete a user-specified operation.
  • the interface device 202 is used to transmit data and control instructions between the computing device 201 and the processing device 203 .
  • the computing device 201 may obtain input data from the processing device 203 via the interface device 202 and write the input data into the on-chip storage device of the computing device 201.
  • the computing device 201 can obtain the control instruction from the processing device 203 via the interface device 202 and write it into the control cache on the computing device 201 .
  • the interface device 202 can also read the data in the storage device of the computing device 201 and transmit it to the processing device 203 .
  • the processing device 203 performs basic control including but not limited to data transfer, starting and/or stopping the computing device 201, and the like.
  • the processing device 203 may be one or more types of central processing unit (CPU), graphics processing unit (GPU), or other general-purpose and/or special-purpose processors.
  • such processors include, but are not limited to, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc., and their number can be determined according to actual needs.
  • the computing device 201 of the present disclosure can be regarded as having a single-core structure or a homogeneous multi-core structure. However, when the computing device 201 and the processing device 203 are considered together, the two are considered to form a heterogeneous multi-core structure.
  • the storage device 204 is used to store the data to be processed; it may be a DRAM or DDR memory with a capacity of 16 GB or more, and saves the data of the computing device 201 and/or the processing device 203.
  • FIG. 3 shows a schematic diagram of the internal structure of the computing device 201 as a single core.
  • the single-core computing device 301 is used to process input data from fields such as computer vision, speech, natural language processing, and data mining.
  • the single-core computing device 301 includes three modules: a control module 31 , an arithmetic module 32 and a storage module 33 .
  • the control module 31 is used to coordinate and control the work of the operation module 32 and the storage module 33 to complete deep learning tasks; it comprises an instruction fetch unit (IFU) 311 and an instruction decode unit (IDU) 312.
  • the instruction fetching unit 311 is used to acquire the instruction from the processing device 203 , and the instruction decoding unit 312 decodes the acquired instruction, and sends the decoding result to the operation module 32 and the storage module 33 as control information.
  • the operation module 32 includes a vector operation unit 321 and a matrix operation unit 322 .
  • the vector operation unit 321 is used to perform vector operations, and can support complex operations such as vector multiplication, addition, and nonlinear transformation;
  • the matrix operation unit 322 is responsible for the core calculation of the deep learning algorithm, that is, matrix multiplication and convolution.
  • the storage module 33 is used to store or transport related data, including a neuron storage unit (neuron RAM, NRAM) 331, a parameter storage unit (weight RAM, WRAM) 332, and a direct memory access module (direct memory access, DMA) 333.
  • NRAM 331 is used to store input neurons, output neurons and intermediate results after calculation;
  • WRAM 332 is used to store the convolution kernel of the deep learning network, that is, weights;
  • DMA 333 is connected to DRAM 204 through bus 34 and is responsible for data transfer between the single-core computing device 301 and the DRAM 204.
  • FIG. 4 shows a schematic diagram of the internal structure of the computing device 201 with multiple cores.
  • the multi-core computing device 41 adopts a layered structure design: it is a system-on-chip that includes at least one cluster, and each cluster includes multiple processor cores.
  • in other words, the multi-core computing device 41 is organized as a system-on-chip / cluster / processor-core hierarchy.
  • the multi-core computing device 41 includes an external storage controller 401 , a peripheral communication module 402 , an on-chip interconnect module 403 , a synchronization module 404 and multiple clusters 405 .
  • the peripheral communication module 402 is used for receiving a control signal from the processing device 203 through the interface device 202 to start the computing device 201 to perform tasks.
  • the on-chip interconnection module 403 connects the external storage controller 401 , the peripheral communication module 402 and the multiple clusters 405 to transmit data and control signals among the modules.
  • the synchronization module 404 is a global synchronization barrier controller (GBC), which is used to coordinate the work progress of each cluster and ensure the synchronization of information.
  • the plurality of clusters 405 are the computing cores of the multi-core computing device 41, of which 4 are exemplarily shown in the figure; with the development of hardware, the multi-core computing device 41 of the present disclosure may include 8, 16, 64, or even more clusters 405. The clusters 405 are used to efficiently execute deep learning algorithms.
  • each cluster 405 includes multiple processor cores (IPU cores) 406 and one memory core (MEM core) 407 .
  • the processor cores 406 are exemplarily shown as four in the figure; the present disclosure does not limit their number. The internal structure of a processor core 406 is shown in FIG. 5.
  • Each processor core 406 is similar to the single-core computing device 301 in FIG. 3 , and also includes three major modules: a control module 51 , an arithmetic module 52 and a storage module 53 .
  • the functions and structures of the control module 51 , the arithmetic module 52 and the storage module 53 are substantially the same as those of the control module 31 , the arithmetic module 32 and the storage module 33 , and will not be described again.
  • the storage module 53 includes an input/output direct memory access (IODMA) 533 and a move direct memory access (MVDMA) 534.
  • the IODMA 533 controls the memory access of the NRAM 531/WRAM 532 and the DRAM 204 through the broadcast bus 409; the MVDMA 534 is used to control the memory access of the NRAM 531/WRAM 532 and the storage unit (SRAM) 408.
  • the storage core 407 is mainly used for storage and communication, that is, to store the shared data or intermediate results between the processor cores 406, and to execute the communication between the cluster 405 and the DRAM 204, the communication between the clusters 405, and the communication among the processor cores 406, etc.
  • the memory core 407 has scalar operation capability for performing scalar operations.
  • the storage core 407 includes an SRAM 408 , a broadcast bus 409 , a cluster direct memory access (CDMA) 410 and a global direct memory access (GDMA) 411 .
  • the SRAM 408 assumes the role of a high-performance data transfer station.
  • the data multiplexed between different processor cores 406 in the same cluster 405 does not need to be fetched from the DRAM 204 by each processor core 406 individually, but is instead relayed between the processor cores through the SRAM 408.
  • the storage core 407 only needs to quickly distribute the multiplexed data from the SRAM 408 to the multiple processor cores 406, so as to improve the communication efficiency between the cores and greatly reduce the on-chip and off-chip I/O accesses.
  • the broadcast bus 409, the CDMA 410 and the GDMA 411 are used to perform the communication between the processor cores 406, the communication between the clusters 405 and the data transmission between the clusters 405 and the DRAM 204, respectively. They will be explained separately below.
  • the broadcast bus 409 is used to complete high-speed communication among the processor cores 406 in the cluster 405.
  • the broadcast bus 409 in this embodiment supports inter-core communication methods including unicast, multicast and broadcast.
  • Unicast refers to point-to-point data transmission (e.g., from a single processor core to a single processor core); multicast is a communication method that transmits a piece of data from the SRAM 408 to a specific set of processor cores 406; and broadcast, which transmits a copy of the data from the SRAM 408 to all processor cores 406, is a special case of multicast.
  • the CDMA 410 is used to control the memory access of the SRAM 408 between different clusters 405 within the same computing device 201.
  • the GDMA 411 cooperates with the external memory controller 401 to control the memory access from the SRAM 408 of the cluster 405 to the DRAM 204, or to read data from the DRAM 204 to the SRAM 408.
  • the communication between the DRAM 204 and the NRAM 431 or the WRAM 432 can be implemented through two channels.
  • the first channel directly connects the DRAM 204 with the NRAM 431 or WRAM 432 through the IODMA 433; the second channel first transfers data between the DRAM 204 and the SRAM 408 through the GDMA 411, and then transfers it between the SRAM 408 and the NRAM 431 or WRAM 432 through the MVDMA 534.
  • the bandwidth of the second channel is much larger than that of the first channel, so communication between the DRAM 204 and the NRAM 431 or WRAM 432 may be more efficient through the second channel.
  • the embodiments of the present disclosure can select data transmission channels according to their own hardware conditions.
  • GDMA 411 and the functionality of IODMA 533 may be integrated in the same component.
  • GDMA 411 and IODMA 533 are regarded as different components.
  • the function of GDMA 411, the function of IODMA 533, the function of CDMA 410, and the function of MVDMA 534 can also be realized by the same component.
  • an embodiment of the present disclosure provides a data processing solution that performs a structured sparse convolution operation according to a sparse flag bit included in a convolution instruction.
  • FIG. 6 shows a structural block diagram of a data processing apparatus 600 according to an embodiment of the present disclosure.
  • the data processing device 600 may be implemented, for example, in the computing device 201 of FIG. 2 .
  • the data processing apparatus 600 may include a control circuit 610 , a storage circuit 620 and an arithmetic circuit 630 .
  • control circuit 610 may be similar to the control module 31 of FIG. 3 or the control module 51 of FIG. 5; it may include, for example, an instruction fetch unit to obtain instructions from, for example, the processing device 203 of FIG. 2, and an instruction decode unit to decode the acquired instructions and send the decoding results to the operation circuit 630 and the storage circuit 620 as control information.
  • control circuit 610 may be configured to parse a convolution instruction, wherein the convolution instruction includes a sparse flag bit for indicating whether to perform a structured sparse convolution operation.
  • the sparse flag may take the value "1" to indicate that the current convolution instruction performs a structured sparse convolution operation, and the value "0" to indicate that it performs a conventional convolution operation; the opposite assignment is equally possible.
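A hypothetical software model of this flag bit, purely for illustration: the bit position (bit 0) and the encoding below are invented, not the actual instruction format of the disclosure.

```python
# Hypothetical model: dispatching on a sparse flag bit inside an encoded
# convolution instruction word. Bit position 0 is an assumed layout.

SPARSE_FLAG_BIT = 0

def parse_conv_instruction(instr_word):
    """Return which convolution path the instruction selects."""
    if (instr_word >> SPARSE_FLAG_BIT) & 1:
        return "structured_sparse_conv"
    return "conventional_conv"
```

With this encoding, `parse_conv_instruction(0b1)` selects the structured sparse path and `parse_conv_instruction(0b0)` the conventional one.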
  • Storage circuitry 620 may be configured to store pre-convolution and/or post-convolution information.
  • the operands of the convolution instruction include the weights of the convolutional layers in the neural network and the neuron data of the neural network.
  • the storage circuit may include, for example, the WRAM 332 of FIG. 3 or the WRAM 532 of FIG. 5 , for storing weights; and the NRAM 331 of FIG. 3 or the NRAM 531 of FIG. 5 , for storing neuron data.
  • the arithmetic circuit 630 may be configured to perform a corresponding convolution operation according to the convolution instruction.
  • arithmetic circuit 630 may include structured sparse circuit 632 and convolution circuit 633 .
  • according to the value of the sparse flag bit, the operation circuit 630 may be configured accordingly.
  • the structured sparse circuit 632 in the arithmetic circuit 630 may be configured to perform structured sparse processing on at least one input data, and to output the sparsified input data to the convolution circuit 633.
  • the convolution circuit 633 may be configured to receive the data to be convolved and perform a convolution operation thereon.
  • the data to be convolved includes at least the sparsified input data received from the structured sparse circuit 632.
  • the structured sparse circuit 632 and the convolution circuit 633 can implement the structured sparse convolution process.
  • the input data may include neuron data for the neural network and weights for the neural network.
  • the structured sparse circuit 632 is used to perform structured sparse processing, which selects n data elements out of every m data elements as valid data elements, where m > n.
  • for example, with m = 4, n may take the value 2, or other values such as 1 or 3.
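The n-out-of-m selection can be sketched in Python as follows; selecting by largest absolute value and emitting a 0/1 index bitmap are assumptions drawn from the later description of the second structured sparse subcircuit, and the function name is hypothetical.

```python
# Sketch of generic n-out-of-m structured selection (m = 4, n = 2 by
# default; n = 1 or n = 3 work the same way). Returns the compressed
# valid elements plus a 0/1 bitmap of the positions kept.

def structured_select(data, m=4, n=2):
    """Per group of m elements, keep the n largest-magnitude ones."""
    values, index = [], []
    for g in range(0, len(data), m):
        group = data[g:g + m]
        keep = sorted(sorted(range(len(group)),
                             key=lambda i: abs(group[i]),
                             reverse=True)[:n])
        values.extend(group[i] for i in keep)
        index.extend(1 if i in keep else 0 for i in range(len(group)))
    return values, index

vals, idx = structured_select([3, -1, 0, 5, 2, 2, -4, 1])
# vals -> [3, 5, 2, -4]; idx -> [1, 0, 0, 1, 1, 0, 1, 0]
```

Note that ties (the two 2s in the second group) are broken by position here; the real circuit may use a different tie-breaking rule.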
  • the convolution circuit 633 is used to perform a convolution operation on the input data.
  • the sparse flag is set to "1" that is, when the structured sparse convolution operation is performed
  • the data received by the convolution circuit 633 at least includes the sparsed input data from the structured sparse circuit 632 .
  • the thinning flag is set to "0" that is, when a conventional convolution operation is performed
  • the data received by the convolution circuit 633 is the data that has not been thinned out.
  • the input data to be convolved may exist in various forms, so the structured sparse circuit and the convolution circuit may need to perform structured sparse convolution processing according to different requirements.
  • the structured sparse circuit 710 may include a first structured sparse sub-circuit 712 and/or a second structured sparse sub-circuit 714 therein.
  • the first structured sparse subcircuit 712 may be configured to perform structured sparse processing on the input data according to a specified sparse mask.
  • the second structured sparse subcircuit 714 may then be configured to perform structured sparse processing on the input data according to predetermined sparse rules.
  • in a first scenario, one of the data to be convolved (assume it is the first data) has already been subjected to structured sparse processing in advance, while the other (assume it is the second data) still needs to be sparsified according to the structured sparse pattern of the first data.
  • in this case, the second data may be sparsified using the first structured sparse subcircuit.
  • FIG. 7A shows a partial structural schematic diagram of the arithmetic circuit in the first scenario.
  • the structured sparse circuit 710 includes a first structured sparse subcircuit 712 that receives the input second data and the index portion of the pre-sparsified first data.
  • the first structured sparse subcircuit 712 uses the index portion of the first data as a sparse mask to perform structured sparse processing on the second data; specifically, it extracts the elements of the second data at the valid-data positions indicated by that index portion.
  • the first structured sparse sub-circuit 712 may be implemented by a circuit such as vector multiplication or matrix multiplication, for example.
  • the convolution circuit 720 receives the pre-sparsified first data and the sparsified second data output by the first structured sparse subcircuit 712, and performs convolution on both.
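A minimal sketch of this first scenario, assuming the index portion is a 0/1 bitmap over the original element positions; the helper name is hypothetical.

```python
# First scenario: the first operand arrives already sparsified together
# with an index bitmap; the second operand is compressed at the positions
# that bitmap marks valid.

def apply_sparse_mask(dense, index_bitmap):
    """Keep the elements of `dense` whose bitmap position is 1."""
    return [v for v, bit in zip(dense, index_bitmap) if bit == 1]

first_index = [1, 0, 0, 1]        # index part of the pre-sparsified first data
second = [10, 20, 30, 40]         # dense second data
second_sparse = apply_sparse_mask(second, first_index)   # [10, 40]
```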
  • in a second scenario, both the first data and the second data may be sparsified using the second structured sparse subcircuit.
  • FIG. 7B shows a schematic diagram of a partial structure of the arithmetic circuit in the second scenario.
  • the structured sparse circuit 710 may include two second structured sparse subcircuits 714, which receive the first and second data to be convolved respectively, perform structured sparse processing on them simultaneously and independently, and output the sparsified data to the convolution circuit 720.
  • the second structured sparse subcircuit 714 may be configured to perform structured sparse processing according to a predetermined filtering rule; for example, under a largest-absolute-value rule, it selects from every m data elements the n elements with the largest absolute values as the valid data elements.
  • the second structured sparse subcircuit 714 can implement the above processing by, for example, a multi-stage operation pipeline composed of comparators and similar circuits.
  • alternatively, the structured sparse circuit 710 may include only one second structured sparse subcircuit 714, which performs structured sparse processing on the first data and the second data sequentially.
  • in a third scenario, neither of the two data to be convolved has been subjected to structured sparse processing; both need to be sparsified before convolution, and one of them (for example, the first data) needs to use the index portion obtained from sparsifying the other data as its sparse mask.
  • FIG. 7C shows a schematic diagram of part of the structure of the arithmetic circuit in the third scenario.
  • structured sparse circuit 710 may include a first structured sparse sub-circuit 712 and a second structured sparse sub-circuit 714 .
  • the second structured sparse subcircuit 714 may be configured to perform structured sparse processing on, for example, the second data according to predetermined filtering rules, and to provide the index portion of the sparsified second data to the first structured sparse subcircuit 712.
  • the first structured sparse subcircuit 712 uses the index portion of the second data as a sparse mask to perform structured sparse processing on the first data.
  • the convolution circuit 720 receives the sparsified first data and the sparsified second data from the first structured sparse subcircuit 712 and the second structured sparse subcircuit 714, respectively, and performs convolution on both.
  • the structured sparse circuit may include two first structured sparse sub-circuits for processing.
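The third scenario can be modeled as below, assuming m = 4, n = 2 and a largest-absolute-value rule; both helper functions are illustrative stand-ins for the second and first structured sparse subcircuits respectively.

```python
# Third scenario: the second data is sparsified by a largest-absolute-value
# rule, and the resulting index bitmap is reused as the mask for the first
# data, so both operands end up sparse at the same positions.

def select_largest(group, n=2):
    """Bitmap marking the n largest-magnitude positions of one group."""
    keep = sorted(range(len(group)), key=lambda i: abs(group[i]), reverse=True)[:n]
    return [1 if i in keep else 0 for i in range(len(group))]

def compress(group, bitmap):
    """Keep only the elements at positions marked 1."""
    return [v for v, bit in zip(group, bitmap) if bit == 1]

second = [0.5, -3.0, 2.0, 0.1]
index = select_largest(second)            # [0, 1, 1, 0]
first = [7.0, 8.0, 9.0, 6.0]
first_sparse = compress(first, index)     # [8.0, 9.0]
second_sparse = compress(second, index)   # [-3.0, 2.0]
```

Because both operands are compressed with the same bitmap, the downstream convolution can multiply them position-by-position without re-aligning.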
  • FIG. 8 illustrates an exemplary operational pipeline for structured sparse processing according to one embodiment of the present disclosure.
  • This pipeline can be used, for example, to implement the aforementioned second structured sparse subcircuit.
  • the above-described structured sparse processing can be performed using a multi-stage pipeline operation circuit composed of an absolute value operator, a comparator, and the like.
  • the first pipeline stage may include m (here, 4) absolute-value operators 810 for synchronously computing the absolute values of the 4 input data elements A, B, C and D, respectively.
  • the first pipeline stage simultaneously outputs the original data elements (i.e., A, B, C and D) and their absolute values (i.e., |A|, |B|, |C| and |D|).
  • the second pipeline stage may include a permutation-combination circuit 820 for permuting and combining the m absolute values to generate m groups of data, wherein each group contains all m absolute values and each absolute value occupies a different position in each group.
  • the permutation-combination circuit may be a cyclic shifter that performs m-1 cyclic shifts on a permutation of the m absolute values (e.g., |A|, |B|, |C|, |D|), so that four groups of data are generated, namely: {|A|, |B|, |C|, |D|}, {|B|, |C|, |D|, |A|}, {|C|, |D|, |A|, |B|} and {|D|, |A|, |B|, |C|}.
  • when each group of data is output, the corresponding original data element is also output; each group of data corresponds to one original data element.
  • the third pipeline stage includes a comparison circuit 830 for comparing the absolute values in the m sets of data and generating a comparison result.
  • the third pipeline stage may include m comparison circuits, each comparison circuit includes m-1 comparators (831, 832, 833), and m-1 comparators in the i-th comparison circuit are used for One absolute value in the i-th group of data is compared with the other three absolute values in turn and a comparison result is generated, where 1 ⁇ i ⁇ m.
  • the third pipeline stage can also be considered as m-1 (here, 3) sub-pipeline stages.
  • Each sub-pipeline stage includes m comparators for comparing one of its corresponding absolute values with other absolute values.
  • the m-1 sub-pipeline stages are to sequentially compare a corresponding absolute value with the other m-1 absolute values.
  • the four comparators 831 in the first sub-pipeline stage are used to compare the first absolute value and the second absolute value of the four sets of data respectively, and output the comparison results w0, x0, y0 and z0 respectively.
  • the four comparators 832 in the second sub-pipeline stage are used to compare the first absolute value with the third absolute value in the four sets of data respectively, and output the comparison results w1, x1, y1 and z1 respectively.
  • the four comparators 833 in the third sub-pipeline stage are used to compare the first absolute value and the fourth absolute value of the four sets of data respectively, and output the comparison results w2, x2, y2 and z2 respectively.
  • the comparison result of each absolute value and the other m-1 absolute values can be obtained.
  • the comparison result may be represented using a bitmap. For example, at the first comparator of the first comparison circuit, when |A| ≥ |B|, w0 = 1; at the second comparator of the first comparison circuit, when |A| < |C|, w1 = 0; at the third comparator of the first comparison circuit, when |A| ≥ |D|, w2 = 1. Thus, the output result of the first comparison circuit is {A, w0, w1, w2}, in this case {A, 1, 0, 1}.
  • the output of the second comparison circuit is {B, x0, x1, x2}, the output of the third comparison circuit is {C, y0, y1, y2}, and the output of the fourth comparison circuit is {D, z0, z1, z2}.
  • the fourth pipeline stage includes a screening circuit 840 for selecting the n data elements with the larger absolute values from the m data elements as valid data elements according to the comparison results of the third stage, and outputting these valid data elements and corresponding indexes.
  • the index is used to indicate the position of these valid data elements within the input m data elements. For example, when A and C are selected from the four data elements A, B, C, and D, their corresponding indices can be 0 and 2.
  • appropriate logic can be designed to select n data elements with larger absolute values.
  • selection is performed according to a specified priority order.
  • the priorities can be fixed in order, for example with A set to the highest priority and D to the lowest. When the absolute values of the three numbers A, C, and D are the same and greater than the absolute value of B, the selected data are A and C.
  • the structured sparse processing of selecting two out of four can also be realized by the multi-stage pipeline operation circuit of FIG. 8 .
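As a behavioral sketch of the four pipeline stages described above (absolute value, cyclic permutation, pairwise comparison, screening), the following Python model selects n = 2 of m = 4 elements by absolute value, breaking ties by the fixed priority A > B > C > D. The function name and the win-counting screening logic are illustrative, not the hardware implementation:

```python
def structured_sparse_2of4(elems, m=4, n=2):
    """Select the n elements with the largest absolute values out of m,
    mimicking the 4-stage pipeline of FIG. 8 (behavioral sketch only).
    Ties are broken by a fixed priority: earlier elements win."""
    # Stage 1: absolute values (original elements are carried alongside)
    absv = [abs(e) for e in elems]
    # Stage 2: cyclic shifter -- m rotations of the absolute values
    groups = [absv[i:] + absv[:i] for i in range(m)]
    # Stage 3: each comparison circuit compares its first absolute value
    # with the other m-1; bit = 1 when this element "wins" the pair
    wins = []
    for i, g in enumerate(groups):
        bits = []
        for j in range(1, m):
            other = (i + j) % m
            if other > i:                      # tie goes to the earlier element
                bits.append(1 if g[0] >= g[j] else 0)
            else:
                bits.append(1 if g[0] > g[j] else 0)
        wins.append(sum(bits))
    # Stage 4: screening -- keep the n elements with the most wins,
    # and output their indices within the original m elements
    order = sorted(range(m), key=lambda i: -wins[i])
    idx = sorted(order[:n])
    return [elems[i] for i in idx], idx
```

For inputs where |A| = |C| = |D| > |B|, the model selects A and C with indices 0 and 2, matching the priority example above.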
  • the result after sparse processing consists of two parts: the data part and the index part.
  • the data part includes the data after sparse processing, that is, the valid data elements extracted according to the filtering rules of the structured sparse processing.
  • the index part is used to indicate the original positions of the sparse-processed data, that is, of the valid data elements, in the original data before thinning (that is, the data to be thinned).
  • Structured sparse data can be represented and/or stored in a variety of forms.
  • the structured sparse processed data may be in the form of a structure.
  • the data part and the index part are bound to each other.
  • each 1 bit in the index portion may correspond to one data element.
  • each 1 bit in the index part may correspond to 8 bits of data.
  • each 1 bit in the index part in the structure may be set as a position corresponding to N-bit data, and N is determined based at least in part on the hardware configuration.
  • the data part in the structure may be aligned according to the first alignment requirement, and the index part in the structure may be aligned according to the second alignment requirement, so that the entire structure also meets the alignment requirement.
  • the data part can be aligned according to 64B
  • the index part can be aligned according to 32B
  • the entire structure can be aligned according to 96B (64B+32B).
  • the data part and the index part can be used uniformly. Since in structured sparse processing, the ratio of valid data elements to original data elements is fixed, such as n/m, the size of data after sparse processing is also fixed or predictable. Thus, structures can be densely stored in memory circuits without performance loss.
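Because the sparse ratio n/m is fixed, the size of each structure is predictable. A small sketch of the byte layout, using the 64B/32B alignment example above and assuming 8-bit data elements with a one-bit-per-original-element index bitmap (both assumptions, not mandated by the disclosure):

```python
def align_up(x, a):
    """Round x up to the nearest multiple of a."""
    return (x + a - 1) // a * a

def sparse_struct_layout(orig_elems, elem_bytes=1, m=4, n=2,
                         data_align=64, index_align=32):
    """Byte sizes of the data part, index part, and whole structure for one
    m:n structured-sparse result. Assumes a 1-bit-per-original-element
    index bitmap; all parameters here are illustrative."""
    kept = orig_elems * n // m                 # only n of every m elements survive
    data_part = align_up(kept * elem_bytes, data_align)
    index_part = align_up((orig_elems + 7) // 8, index_align)
    return data_part, index_part, data_part + index_part
```

With 128 original 8-bit elements and 4:2 sparsity this yields a 64B data part, a 32B index part, and a 96B structure, consistent with the alignment example above.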
  • the data portion and the index portion obtained after the thinning process may also be represented and/or stored separately for separate use.
  • the index portion of the structured thinning-processed second input data may be provided to the first structured thinning circuit 712 to be used as a mask to perform the structured thinning-processing on the first input data.
  • each 1 bit in the separately provided index part can indicate whether a data element is valid or not.
  • Convolution Circuits A variety of circuit configurations can be used to implement convolution operations.
  • the convolution circuit can share the same processing circuit for regular convolution and structured sparse convolution, and the convolution circuit can also be implemented by configuring separate processing circuits for regular convolution and structured sparse convolution operations. Embodiments of the present disclosure are not limited in this regard.
  • the arithmetic circuit 630 may further include a pre-processing circuit 631 and a post-processing circuit 634 .
  • the pre-processing circuit 631 can be configured to preprocess the data before the operation performed by the structured sparse circuit 632 and/or the convolution circuit 633 according to the instruction;
  • the post-processing circuit 634 can be configured to perform the operation on the data after the convolution circuit 633. post-processing.
  • the preprocessing circuit 631 may read the input data from the storage circuit 620 and output the input data to the structured sparse circuit 632 at the first rate.
  • the preprocessing circuit 631 may read the input data from the storage circuit 620 and output the input data to the convolution circuit 633 at the second rate.
  • the first rate is greater than the second rate, for example, by a ratio equal to the sparse ratio in the structured sparse process, eg, m/n. For example, in 4-to-2 structured sparse processing, the first rate is twice the second rate.
  • the first rate is determined based, at least in part, on the processing power of the convolution circuit 633 and the sparse ratio of the structured sparse processing.
  • the aforementioned preprocessing and postprocessing may also include, for example, data splitting and/or data splicing operations.
  • the post-processing circuit 634 may perform fusion processing on the output of the convolution circuit, such as addition, subtraction, multiplication, and the like.
  • the operands of the convolution instruction can be data in the neural network, such as weights, neurons, etc., that is, the convolution instruction is used for structured and sparse convolution operations in the neural network.
  • Data in a neural network usually contains multiple dimensions.
  • data may exist in four dimensions: input channels, output channels, length, and width.
  • the structured sparse in the above-described convolution instructions may be performed for at least one dimension of multidimensional data in a neural network.
  • convolution instructions may be used for structured sparse convolution operations in the forward process (eg, inference, or forward training) of a neural network, where the structured sparse processing is specific to the neural network The input channel dimension of the multidimensional data in is performed.
  • the convolution instruction may be used for structured sparse convolution operations in the reverse process (eg, reverse training) of the neural network, where the structured sparse processing is performed simultaneously for the input channel dimension and the output channel dimension of the multidimensional data in the neural network.
  • the aforementioned convolution instruction may be a microinstruction or control signal that runs inside one or more multi-stage operation pipelines, and may include (or indicate) the arithmetic operations to be executed by the one or more multi-stage operation pipelines.
  • FIG. 9 shows an exemplary flowchart of a data processing method 900 according to an embodiment of the present disclosure.
  • in step 910, a convolution instruction is parsed, and the convolution instruction includes a sparse flag bit for indicating whether to perform a structured sparse convolution operation.
  • This step may be performed, for example, by the control circuit 610 of FIG. 6 .
  • in step 920, the corresponding operand is read according to the convolution instruction. This step may be performed, for example, by the control circuit 610 of FIG. 6 controlling the storage circuit 620 .
  • in step 930, a corresponding convolution operation is performed on the read operand according to the convolution instruction.
  • This step can be performed, for example, by the arithmetic circuit 630 of FIG. 6 .
  • depending on the sparse flag in the convolution instruction, there can be different convolution operations. For example, when the sparse flag has a value of "0", a regular convolution operation can be performed. When the sparse flag takes a value of "1", a structured sparse convolution operation is performed.
  • the structured sparse convolution operation may include: performing structured sparse processing on at least one input data; and performing a convolution operation on the sparsed input data.
  • performing structured sparse processing on at least one input data includes any of the following: performing structured sparse processing on the first input data and the second input data to be convolved respectively, and outputting the sparsed first input data and second input data to a convolution circuit to perform a convolution operation; or using the index part corresponding to the structured-sparse-processed first or second input data as a sparse mask to perform structured sparse processing on the second or first input data, and outputting the sparsed data to a convolution circuit to perform a convolution operation with the structured-sparse-processed second or first input data, wherein the index part indicates the positions of valid data elements in the structured sparse processing.
  • the structured sparse process includes selecting n data elements from every m data elements as valid data elements, as indicated by the index portion, where m>n.
  • the first or second input data that has undergone structured sparse processing may have been subjected to structured sparse processing in advance and stored in a storage circuit, or may be directly provided to the convolution operation after online structured sparse processing.
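The mask path can be sketched as follows: the index bitmap produced by a previous m:n sparsification of one operand selects the matching elements of the other operand. The function name and the per-group validity check are illustrative:

```python
def mask_sparsify(data, index_bits, m=4, n=2):
    """Sparsify `data` using the index bitmap of an already-sparsified operand.
    Each group of m index bits must mark exactly n elements as valid
    (behavioral sketch, not the first structured sparse circuit itself)."""
    assert len(data) == len(index_bits)
    out = []
    for g in range(0, len(data), m):
        grp_bits = index_bits[g:g + m]
        assert sum(grp_bits) == n, "each group of m must keep exactly n elements"
        # keep the elements whose index bit is 1
        out.extend(d for d, b in zip(data[g:g + m], grp_bits) if b)
    return out
```

The output is dense (n of every m elements) and can be fed directly to the convolution circuit alongside the already-sparsified operand.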
  • Structured, sparsely processed data can be provided in a variety of forms.
  • the structured sparse processed data is in the form of a structure, the structure includes a data part and an index part bound to each other, the data part includes valid data elements after the structured sparse process, and the index part is used to indicate The original position of the thinned data in the pre-sparse data.
  • the method may further include: delivering the input data at a first rate to perform the structured sparse processing, wherein the first rate is based at least in part on the processing power of the hardware performing the convolution operation and the structured sparse processing The sparse ratio of processing is determined.
  • the first input data may be neuron data of a neural network
  • the second input data may be weights of convolutional layers in the neural network; and vice versa.
  • the instructions of conventional processors are designed to perform basic single-data scalar operations.
  • a single-data scalar operation means that each operand of the instruction is a scalar data.
  • in deep learning, the operands are often multi-dimensional (ie, tensor) data types, and using only scalar operations cannot make the hardware complete the operation task efficiently. Therefore, how to efficiently perform multi-dimensional tensor data processing is also an urgent problem to be solved in the current computing field.
  • a convolution instruction for performing a convolution operation related to tensor data, especially a structured sparse convolution operation of tensor data.
  • At least one descriptor is included in at least one operand of the convolution instruction, and information related to tensor data can be obtained through the descriptor.
  • the descriptor may indicate at least one of the following information: shape information of tensor data, and spatial information of tensor data.
  • the shape information of the tensor data can be used to determine the data address in the data storage space of the tensor data corresponding to the operand. Spatial information of tensor data can be used to determine dependencies between instructions, which in turn can determine, for example, the execution order of instructions.
  • spatial information of tensor data may be indicated by a spatial identification (ID).
  • ID can also be called a space alias, which refers to a space area used to store the corresponding tensor data.
  • the space area can be a continuous space or multiple spaces. The present disclosure does not limit the specific composition of the space area. Different spatial IDs indicate that there is no dependency between the pointed spatial regions.
  • Tensors can contain many forms of data composition. Tensors can be of different dimensions. For example, scalars can be regarded as 0-dimensional tensors, vectors can be regarded as 1-dimensional tensors, and matrices can be 2-dimensional or more than 2-dimensional tensors.
  • the shape of a tensor includes information such as the dimension of the tensor and the dimensions of each dimension of the tensor. For example, for a three-dimensional tensor:
  • the shape of the tensor data cannot be determined according to its data address (or storage area) alone, and related information such as the relationship between multiple tensor data cannot be determined either, resulting in low efficiency when the processor accesses the tensor data.
  • the three-dimensional tensor in the example above can be represented as (2, 2, 3) with descriptors. It should be noted that the present disclosure does not limit the manner in which the descriptor indicates the shape of the tensor.
  • the value of N may be determined according to the dimension (also referred to as the order) of the tensor data, and may also be set according to the usage needs of the tensor data. For example, when the value of N is 3, the tensor data is three-dimensional tensor data, and the descriptor can be used to indicate the shape (eg offset, size, etc.) of the three-dimensional tensor data in the three-dimensional direction. It should be understood that those skilled in the art can set the value of N according to actual needs, which is not limited in the present disclosure.
  • tensor data can be multi-dimensional, because the layout of memory is always one-dimensional, there is a correspondence between tensors and storage on memory.
  • Tensor data is usually allocated in contiguous storage space, that is, the tensor data can be expanded one-dimensionally (eg, row-major manner) and stored on the memory.
  • This relationship between tensors and the underlying storage can be represented by the offset of the dimension (offset), the size of the dimension (size), the stride of the dimension (stride), and so on.
  • the offset of a dimension refers to the offset relative to the reference position in that dimension.
  • the size of a dimension refers to the size of the dimension, that is, the number of elements in the dimension.
  • the step size (stride) of a dimension refers to the interval between adjacent elements in this dimension. For example, the stride of the three-dimensional tensor above is (6, 3, 1), that is, the stride of the first dimension is 6, the stride of the second dimension is 3, and the stride of the third dimension is 1.
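The stride relationship can be sketched as a small helper; for the row-major tensor of shape (2, 2, 3) above it reproduces the strides (6, 3, 1):

```python
def row_major_strides(shape):
    """Element strides of each dimension for row-major (C-order) storage:
    the stride of the last dimension is 1, and each earlier stride is the
    product of all later dimension sizes."""
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return tuple(strides)
```

This is the correspondence between a multi-dimensional tensor and its one-dimensional expansion in memory that the descriptor's shape parameters capture.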
  • FIG. 10 shows a schematic diagram of a data storage space according to an embodiment of the present disclosure.
  • the data storage space 101 stores two-dimensional data in a row-first manner, which can be represented by (x, y) (where the X axis extends horizontally to the right, and the Y axis extends vertically downward).
  • the size in the X-axis direction (the size of each row, or the total number of columns) is ori_x (not shown in the figure)
  • the size in the Y-axis direction (the total number of rows) is ori_y (not shown in the figure)
  • the starting address PA_start (reference address) of 101 is the physical address of the first data block 102 .
  • the data block 103 is part of the data in the data storage space 101, the offset 105 in the X-axis direction is represented as offset_x, the offset 104 in the Y-axis direction is represented as offset_y, and the size in the X-axis direction is represented by is size_x, and the size in the Y-axis direction is represented by size_y.
  • the data reference point of the descriptor may use the first data block of the data storage space 101, and the reference address of the descriptor may be agreed as the starting address PA_start of the data storage space 101. The content of the descriptor of the data block 103 can then be determined from the size ori_x of the data storage space 101 on the X axis, its size ori_y on the Y axis, together with the offset offset_y of the data block 103 in the Y-axis direction, the offset offset_x in the X-axis direction, the size size_x in the X-axis direction, and the size size_y in the Y-axis direction.
  • the content of the descriptor represents a two-dimensional space
  • those skilled in the art can set the specific dimension of the content of the descriptor according to the actual situation, which is not limited in the present disclosure.
  • the reference address of the data reference point of the descriptor in the data storage space may be agreed upon.
  • the base address PA_base of the data base point of the descriptor in the data storage space may be agreed.
  • the physical address of a piece of data (eg, the data at position (2, 2)) in the data storage space may be used as the reference address PA_base.
  • the content of the descriptor of the data block 103 in FIG. 10 can be determined according to the positions of the two vertices at the diagonal positions relative to the data reference point.
  • the positions of at least two vertices at diagonal positions of the data block 103 relative to the data reference point are determined; for example, using the diagonal vertices in the upper-left to lower-right direction, the relative position of the upper-left vertex is (x_min, y_min) and the relative position of the lower-right vertex is (x_max, y_max). The content of the descriptor of the data block 103 is then determined from the relative position (x_min, y_min) of the upper-left vertex and the relative position (x_max, y_max) of the lower-right vertex.
  • the following formula (2) can be used to represent the content of the descriptor (the base address is PA_base):
  • the data address of the tensor data indicated by the descriptor can be determined according to the reference address of the data reference point of the descriptor in the data storage space and the mapping relationship between the data description position and the data address. The content of the descriptor, ie, this mapping relationship, can be set according to actual needs. For example, when the tensor data indicated by the descriptor is three-dimensional space data, the function f(x, y, z) can be used to define the mapping relationship between the data description position and the data address.
  • the descriptor is further used to indicate the address of N-dimensional tensor data, wherein the content of the descriptor further includes at least one address parameter representing the address of the tensor data, for example, the content of the descriptor may be is the following formula (4):
  • PA is the address parameter.
  • the address parameter can be a logical address or a physical address.
  • PA can be used as any one of a vertex, a middle point, or a preset point of the tensor shape, and the corresponding data address can be obtained by combining it with the shape parameters in the X direction and the Y direction.
  • the address parameter of the tensor data includes a reference address of the data reference point of the descriptor in the data storage space of the tensor data, and the reference address includes a start address of the data storage space.
  • the descriptor may further include at least one address parameter representing the address of the tensor data, for example, the content of the descriptor may be the following formula (5):
  • PA_start is a reference address parameter, which is not repeated here.
  • mapping relationship between the data description location and the data address can be set according to the actual situation, which is not limited in the present disclosure.
  • a predetermined reference address may be set in a task, the descriptors in the instructions under this task all use the reference address, and the content of the descriptor may include shape parameters based on the reference address.
  • the base address can be determined by setting the environment parameters for this task. For the relevant description and usage of the reference address, reference may be made to the foregoing embodiments.
  • the content of the descriptor can be mapped to the data address more quickly.
  • a reference address may be included in the content of each descriptor, and the reference address of each descriptor may be different. Compared with the way of setting a common reference address by using environment parameters, each descriptor in this way can describe data more flexibly and use a larger data address space.
  • the data address in the data storage space of the data corresponding to the operand of the processing instruction may be determined according to the content of the descriptor.
  • the calculation of the data address is automatically completed by the hardware, and when the representation of the content of the descriptor is different, the calculation method of the data address is also different. This disclosure does not limit the specific calculation method of the data address.
  • the content of the descriptor in the operand is represented by formula (1)
  • the offsets of the tensor data indicated by the descriptor in the data storage space are offset_x and offset_y respectively
  • the size is size_x*size_y
  • the starting data address PA1(x, y) of the tensor data indicated by the descriptor in the data storage space can be determined using the following formula (6):
  • PA1(x, y) = PA_start + (offset_y - 1) * ori_x + offset_x    (6)
  • the data address in the data storage space of the data corresponding to the operand can be determined according to the content of the descriptor and the data description location. In this way, part of the data (eg, one or more data) in the tensor data indicated by the descriptor can be processed.
  • the content of the descriptor in the operand is represented by formula (2).
  • the offsets of the tensor data indicated by the descriptor in the data storage space are offset_x and offset_y respectively, and the size is size_x*size_y.
  • the operand includes the data description position (x_q, y_q) for the descriptor; then, the data address PA2(x, y) of the tensor data indicated by the descriptor in the data storage space can be determined using the following formula (7):
  • PA2(x, y) = PA_start + (offset_y + y_q - 1) * ori_x + (offset_x + x_q)    (7)
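Formulas (6) and (7) can be transcribed directly, with addresses counted in element units and the -1 in the Y term kept exactly as the document states it (the function names are illustrative):

```python
def pa1_start(pa_start, ori_x, offset_x, offset_y):
    """Formula (6): starting address of the block indicated by the descriptor,
    for row-major two-dimensional storage of total row width ori_x."""
    return pa_start + (offset_y - 1) * ori_x + offset_x

def pa2_element(pa_start, ori_x, offset_x, offset_y, x_q, y_q):
    """Formula (7): address of the element at data description position
    (x_q, y_q) inside the block indicated by the descriptor."""
    return pa_start + (offset_y + y_q - 1) * ori_x + (offset_x + x_q)
```

Setting (x_q, y_q) = (0, 0) in formula (7) reduces it to formula (6), as expected: the hardware can therefore share one address-generation path for whole-block and per-element accesses.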
  • the descriptor may indicate chunked data.
  • Data blocking can effectively speed up operations and improve processing efficiency in many applications. For example, in graphics processing, convolution operations often use data blocks for fast processing.
  • FIG. 11 shows a schematic diagram of a data block in a data storage space according to an embodiment of the present disclosure.
  • the data storage space 1100 also stores two-dimensional data in a row-first manner, which can be represented by (x, y) (where the X axis is horizontally to the right, and the Y axis is vertically downward).
  • the size in the X-axis direction (the size of each row, or the total number of columns) is ori_x (not shown in the figure), and the size in the Y-axis direction (the total number of rows) is ori_y (not shown in the figure).
  • the tensor data stored in Figure 11 includes multiple data blocks.
  • the descriptor requires more parameters to represent these data chunks.
  • taking the X dimension as an example, the following parameters can be involved: ori_x, x.tile.size (the size 1102 within a block), x.tile.stride (the step size 1104 within a block, that is, the distance between the first point of the first small block and the first point of the second small block), x.tile.num (the number of blocks, shown as 3 blocks in the figure), and x.stride (the overall step size, that is, the distance from the first point of the first row to the first point of the second row).
  • Other dimensions may similarly include corresponding parameters.
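A sketch of how the X-dimension tiling parameters enumerate the column offsets covered by the blocks of one row (the parameter names mirror x.tile.size, x.tile.stride, and x.tile.num above; the values in the test are illustrative, not taken from FIG. 11):

```python
def x_tile_columns(x_tile_size, x_tile_stride, x_tile_num):
    """Column offsets, within one row, of every element covered by the
    x_tile_num blocks along the X dimension (behavioral sketch)."""
    cols = []
    for t in range(x_tile_num):
        base = t * x_tile_stride          # first point of the t-th small block
        cols.extend(range(base, base + x_tile_size))
    return cols
```

Combined with x.stride for the row-to-row distance, these offsets give the full addresses of a chunked tensor; other dimensions follow the same pattern with their corresponding parameters.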
  • the descriptor may include the identifier of the descriptor and/or the content of the descriptor.
  • the identifier of the descriptor is used to distinguish the descriptor, for example, the identifier of the descriptor may be numbered; the content of the descriptor may include at least one shape parameter representing the shape of the tensor data.
  • when the tensor data is 3-dimensional data and, among the three dimensions of the tensor data, the shape parameters of two dimensions are fixed, the content of the descriptor may include only the shape parameter representing the other dimension of the tensor data.
  • the identifier and/or content of the descriptor may be stored in a descriptor storage space (internal memory), such as a register, on-chip SRAM or other medium caches, and the like.
  • the tensor data indicated by the descriptor can be stored in the data storage space (internal memory or external memory), such as on-chip cache or off-chip memory, etc.
  • the present disclosure does not limit the specific locations of the descriptor storage space and the data storage space.
  • the identifier, content of the descriptor, and tensor data indicated by the descriptor can be stored in the same area of the internal memory, for example, a continuous area of the on-chip cache can be used to store the related information of the descriptor content, its address is ADDR0-ADDR1023.
  • the addresses ADDR0-ADDR63 can be used as the descriptor storage space to store the identifier and content of the descriptor
  • the addresses ADDR64-ADDR1023 can be used as the data storage space to store the tensor data indicated by the descriptor.
  • addresses ADDR0-ADDR31 can be used to store the identifier of the descriptor
  • addresses ADDR32-ADDR63 can be used to store the content of the descriptor.
  • the address ADDR is not limited to 1 bit or one byte, and is used here to represent an address, which is an address unit.
  • Those skilled in the art can determine the descriptor storage space, the data storage space and their specific addresses according to actual conditions, which are not limited in this disclosure.
  • the identifier, content of the descriptor, and tensor data indicated by the descriptor may be stored in different areas of the internal memory.
  • a register can be used as a descriptor storage space to store the identifier and content of the descriptor in the register
  • an on-chip cache can be used as a data storage space to store the tensor data indicated by the descriptor.
  • the number of the register may be used to represent the identifier of the descriptor. For example, when the number of the register is 0, the identifier of the descriptor it stores is set to 0. When the descriptor in the register is valid, an area can be allocated in the cache space for storing the tensor data according to the size of the tensor data indicated by the descriptor.
  • the identifier and content of the descriptor may be stored in an internal memory, and the tensor data indicated by the descriptor may be stored in an external memory.
  • the identifier and content of the descriptor can be stored on-chip, and the tensor data indicated by the descriptor can be stored off-chip.
  • the data address of the data storage space corresponding to each descriptor may be a fixed address.
  • a separate data storage space can be divided for tensor data, and the starting address of each tensor data in the data storage space corresponds to a descriptor one-to-one.
  • the circuit or module responsible for parsing the computing instruction (eg, an entity external to the computing device of the present disclosure) can then determine the data address corresponding to the descriptor from this fixed correspondence.
  • when the data address of the data storage space corresponding to the descriptor is a variable address, the descriptor can also be used to indicate the address of N-dimensional tensor data, wherein the content of the descriptor can also include at least one address parameter representing the address of the tensor data.
  • the content of the descriptor may include an address parameter indicating the address of the tensor data, such as the starting physical address of the tensor data, It may also include multiple address parameters of the address of the tensor data, such as the start address + address offset of the tensor data, or address parameters of the tensor data based on each dimension.
  • the address parameter of the tensor data may include the reference address of the data reference point of the descriptor in the data storage space of the tensor data.
  • the reference address can be different according to the change of the data reference point. This disclosure does not limit the selection of data benchmarks.
  • the reference address may include the start address of the data storage space.
  • the reference address of the descriptor is the starting address of the data storage space.
  • the reference address of the descriptor is the address of the data block in the data storage space.
  • the shape parameter of the tensor data includes at least one of the following: the size of the data storage space in at least one of the N dimension directions, the size of the storage area in at least one of the N dimension directions, the offset of the storage area in at least one of the N dimension directions, the positions of at least two vertices at diagonal positions in the N dimension directions relative to the data reference point, and the mapping relationship between the data description position of the tensor data indicated by the descriptor and the data address.
  • the data description position is the mapping position of the point or area in the tensor data indicated by the descriptor.
  • For example, for three-dimensional tensor data, a position in the descriptor can be represented by three-dimensional space coordinates (x, y, z).
  • Given the shape of the tensor data, the data description position of the tensor data may be the position of a point or region in the three-dimensional space to which the tensor data is mapped, represented by three-dimensional space coordinates (x, y, z).
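The address parameters above (a reference address plus per-dimension sizes and offsets) suffice to map a data description position to a concrete data address. The following is a minimal sketch of such a mapping under a row-major layout assumption; the function and parameter names (`base`, `space_shape`, `region_offset`) are illustrative, not taken from the disclosure.

```python
# Hypothetical sketch: map an N-dimensional data description position to a
# linear data address using descriptor-style parameters. Assumes row-major
# storage; all names are illustrative.

def descriptor_address(base, space_shape, region_offset, position):
    """Row-major address of `position` inside a storage space of shape
    `space_shape`, for a storage region starting at `region_offset`."""
    addr = base
    stride = 1
    # accumulate row-major strides from the innermost dimension outward
    for dim in reversed(range(len(space_shape))):
        addr += (region_offset[dim] + position[dim]) * stride
        stride *= space_shape[dim]
    return addr

# e.g. a 2-D storage space of 8x10 elements, storage region starting at (2, 3)
addr = descriptor_address(base=1000, space_shape=(8, 10),
                          region_offset=(2, 3), position=(1, 4))
```

With these parameters the tensor interface circuit can resolve operand addresses without the instruction carrying explicit pointers for every element.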
  • FIG. 12 shows a structural block diagram of a data processing apparatus 1200 according to another embodiment of the present disclosure.
  • the data processing device 1200 may be implemented, for example, in the computing device 201 of FIG. 2 .
  • The data processing apparatus 1200 of FIG. 12 differs from the apparatus of FIG. 6 in that it further includes a tensor interface circuit 1212 for implementing functions related to descriptors of tensor data.
  • the data processing apparatus 1200 may further include a control circuit 1210, a storage circuit 1220, and an operation circuit 1230, and the specific functions and implementations of these circuits are similar to those in FIG. 6, and thus will not be repeated here.
  • The control circuit 1210 may be configured to parse a convolution instruction, wherein the convolution instruction includes a sparse flag bit for indicating whether to perform a structured sparse convolution operation, at least one operand of the convolution instruction includes at least one descriptor, and the descriptor indicates at least one of the following information: shape information of the tensor data and spatial information of the tensor data.
  • The sparse flag may take a value of "1" to indicate that the current convolution instruction performs a structured sparse convolution operation, and correspondingly a value of "0" to indicate that the current convolution instruction performs a conventional convolution operation; the two values may also be reversed.
  • a tensor interface unit (TIU) 1212 may be configured to implement operations associated with the descriptor under the control of the control circuit 1210. These operations may include, but are not limited to, registration, modification, cancellation, and parsing of descriptors; reading and writing of content of descriptors.
  • the present disclosure does not limit the specific hardware type of the tensor interface circuit. In this way, operations associated with descriptors can be implemented through dedicated hardware, which further improves the access efficiency of tensor data.
  • the tensor interface circuit 1212 may be configured to parse the shape information of the tensor data included in the operand of the instruction to determine the data address in the data storage space of the data corresponding to the operand.
  • The tensor interface circuit 1212 may be configured to compare the spatial information (e.g., a spatial ID) of the tensor data included in the operands of two instructions to determine the dependency between the two instructions, and then determine out-of-order execution, synchronization, and other handling of the instructions.
  • Although the control circuit 1210 and the tensor interface circuit 1212 are shown as two separate modules in FIG. 12, those skilled in the art will understand that these two circuits may also be implemented as one module or more modules, and the present disclosure is not limited in this regard.
  • the arithmetic circuit 1230 may be configured to perform a corresponding convolution operation according to the convolution instruction based on the parsed descriptor.
  • arithmetic circuit 1230 may include structured sparse circuit 1232 and convolution circuit 1233 .
  • the operation circuit 1230 may be configured accordingly.
  • The structured sparse circuit 1232 in the operation circuit 1230 may be configured to perform structured sparse processing on at least one input data and output the sparsified input data to the convolution circuit 1233.
  • the convolution circuit 1233 may be configured to receive the data to be convoluted and to perform a convolution operation.
  • The data to be convolved includes at least the sparsified input data received from the structured sparse circuit 1232. Therefore, when the sparse flag is set, the structured sparse circuit 1232 and the convolution circuit 1233 together realize the structured sparse convolution processing.
  • the input data may include neuron data and weights for the neural network.
  • The structured sparse circuit 1232 is used to perform structured sparse processing, which includes selecting n data elements from every m data elements as valid data elements, where m > n.
  • n can also take other values, such as 1 or 3.
  • the convolution circuit 1233 is used to perform a convolution operation on the input data.
  • When the sparse flag is set to "1", that is, when a structured sparse convolution operation is performed, the data received by the convolution circuit 1233 at least includes the sparsified input data from the structured sparse circuit 1232.
  • When the sparse flag is set to "0", that is, when a conventional convolution operation is performed, the data received by the convolution circuit 1233 is data that has not been sparsified.
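The n-of-m selection described above can be sketched in software. The following minimal example keeps the n elements with the largest absolute values out of every m, and records their positions as an index part; m=4, n=2 matches the common 2:4 pattern, though the disclosure only requires m > n. The function name is illustrative.

```python
# Minimal sketch of n-of-m structured sparsification (here n=2, m=4).
# Keeps the n largest-|x| elements per group of m; returns per-group
# value lists and the original positions (the "index part").

def structured_sparsify(data, m=4, n=2):
    values, indices = [], []
    for g in range(0, len(data), m):
        group = data[g:g + m]
        # positions of the n largest-|x| elements, restored to position order;
        # the stable sort breaks |x| ties by earlier position (a fixed priority)
        keep = sorted(sorted(range(len(group)),
                             key=lambda i: -abs(group[i]))[:n])
        values.append([group[i] for i in keep])
        indices.append(keep)
    return values, indices

vals, idx = structured_sparsify([0.1, -3.0, 2.0, 0.5, 1.0, -0.2, 0.0, 4.0])
# first group keeps -3.0 and 2.0 (positions 1 and 2)
```

Note that ties in absolute value are resolved toward the earlier position here, loosely mirroring the "specified priority order" mentioned later for the screening circuit.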
  • the input data to be convoluted may exist in various forms, so the structured sparse circuit and the convolution circuit may need to perform structured sparse convolution processing according to different requirements.
  • the specific implementation of the arithmetic circuit may refer to the foregoing description in conjunction with FIGS. 7A-7C
  • the specific implementation of the structured sparse circuit may refer to the foregoing description in conjunction with FIG. 8 , which will not be repeated here.
  • FIG. 13 shows an exemplary flowchart of a data processing method 1300 according to an embodiment of the present disclosure.
  • In step 1310, a convolution instruction is parsed, wherein the convolution instruction includes a sparse flag bit for indicating whether to perform a structured sparse convolution operation, and at least one operand of the convolution instruction includes at least one descriptor, the descriptor indicating at least one of the following information: shape information of tensor data and spatial information of tensor data.
  • This step may be performed, for example, by the control circuit 1210 of FIG. 12 .
  • In step 1320, the descriptor is parsed.
  • This step may be performed, for example, by the tensor interface circuit 1212 of FIG. 12 .
  • the data address of the tensor data corresponding to the operand in the data storage space can be determined according to the shape information of the tensor data; and/or the dependency relationship between the instructions can be determined according to the space information of the tensor data.
  • In step 1330, the corresponding operand is read based at least in part on the parsed descriptor.
  • When the operand is tensor data, the data address can be obtained according to the parsed descriptor so as to read the corresponding data.
  • This step may be performed, for example, by the control circuit 1210 of FIG. 12 controlling the storage circuit 1220.
  • In step 1340, the corresponding convolution operation is performed on the read operand according to the convolution instruction.
  • This step can be performed, for example, by the arithmetic circuit 1230 of FIG. 12 .
  • Depending on the sparse flag in the convolution instruction, different convolution operations can be performed. For example, when the sparse flag has a value of "0", a regular convolution operation is performed; when the sparse flag has a value of "1", a structured sparse convolution operation is performed.
  • the structured sparse convolution operation may include: performing structured sparse processing on at least one input data; and performing a convolution operation on the sparsed input data.
  • Performing structured sparse processing on at least one input data includes any of the following: performing structured sparse processing on the first input data and the second input data to be convolved, respectively, and outputting the sparsified first input data and second input data to a convolution circuit to perform a convolution operation; or using the index part corresponding to the structured-sparse-processed first or second input data as a sparse mask to perform structured sparse processing on the second or first input data, and outputting the sparsified first input data to a convolution circuit to perform a convolution operation with the structured-sparse-processed second input data, wherein the index part indicates the positions of valid data elements in the structured sparse processing to be performed.
  • the structured sparse process includes selecting n data elements from every m data elements as valid data elements, as indicated by the index portion, where m>n.
  • The structured-sparse-processed first or second input data may have been sparsified in advance and stored in a storage circuit, or may be sparsified online and directly provided to the convolution operation.
  • Structured-sparse-processed data can be provided in a variety of forms.
  • In some embodiments, the structured-sparse-processed data is in the form of a structure that includes a data part and an index part bound to each other, where the data part includes the valid data elements retained by the structured sparse processing, and the index part indicates the original positions of the sparsified data in the pre-sparsification data.
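The bound data-part/index-part structure, and the reuse of the index part as a sparse mask for the other operand, can be sketched as follows. All names here (`SparseStruct`, `apply_mask`) are illustrative, and m=4 is assumed for the group size.

```python
# Sketch of the "structure" form: the sparsified data part and its index
# part stay bound together, and the index part can be reused as a sparse
# mask for the other convolution operand. Names are illustrative.

from dataclasses import dataclass

@dataclass
class SparseStruct:
    data: list    # valid elements kept by structured sparsification
    index: list   # original positions of those elements, per m-group

def apply_mask(dense, index, m=4):
    """Sparsify `dense` data using an existing index part as the mask."""
    out = []
    for g, keep in enumerate(index):
        group = dense[g * m:(g + 1) * m]
        out.extend(group[i] for i in keep)
    return out

# weights were sparsified offline; their index masks the neuron data
weights = SparseStruct(data=[-3.0, 2.0], index=[[1, 2]])
neurons = [10.0, 20.0, 30.0, 40.0]
masked = apply_mask(neurons, weights.index)   # picks positions 1 and 2
```

Masking the second operand with the first operand's index keeps the two sparsified streams aligned element-for-element, which is what allows the convolution circuit to multiply them directly.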
  • In some embodiments, the method may further include delivering the input data at a first rate to perform the structured sparse processing, wherein the first rate is determined based at least in part on the processing capability of the hardware performing the convolution operation and the sparsity ratio of the structured sparse processing.
  • the first input data may be neuron data of a neural network
  • the second input data may be weights of convolutional layers in the neural network; and vice versa.
  • An embodiment of the present disclosure provides a convolution instruction, which includes a sparse flag bit used to indicate whether to perform a structured sparse convolution operation.
  • the corresponding operation circuit can be configured according to the value of the flag bit to perform the corresponding convolution operation.
  • the arithmetic circuit may be configured to perform structured sparse processing, and then perform convolution on the sparse processed data.
  • The electronic devices or apparatuses of the present disclosure may include servers, cloud servers, server clusters, data processing apparatuses, robots, computers, printers, scanners, tablet computers, smart terminals, PC equipment, IoT terminals, mobile terminals, mobile phones, driving recorders, navigators, sensors, cameras, video cameras, projectors, watches, headphones, mobile storage, wearable devices, visual terminals, autonomous driving terminals, vehicles, home appliances, and/or medical equipment.
  • the vehicles include airplanes, ships and/or vehicles;
  • the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lamps, gas stoves, and range hoods;
  • the medical equipment includes nuclear magnetic resonance instruments, B-ultrasound machines, and/or electrocardiographs.
  • the electronic equipment or device of the present disclosure can also be applied to the Internet, Internet of Things, data center, energy, transportation, public management, manufacturing, education, power grid, telecommunications, finance, retail, construction site, medical care and other fields. Further, the electronic device or device of the present disclosure can also be used in application scenarios related to artificial intelligence, big data and/or cloud computing, such as cloud, edge terminal, and terminal.
  • the electronic device or device with high computing power according to the solution of the present disclosure can be applied to a cloud device (eg, a cloud server), while the electronic device or device with low power consumption can be applied to a terminal device and/or Edge devices (such as smartphones or cameras).
  • The hardware information of the cloud device and the hardware information of the terminal device and/or edge device are compatible with each other, so that appropriate hardware resources can be matched from the hardware resources of the cloud device according to the hardware information of the terminal device and/or edge device to simulate the hardware resources of the terminal device and/or edge device, thereby completing unified management, scheduling, and collaborative work of device-cloud integration or cloud-edge-device integration.
  • The present disclosure expresses some methods and their embodiments as a series of actions and combinations thereof, but those skilled in the art can understand that the solutions of the present disclosure are not limited by the order of the described actions. Accordingly, those of ordinary skill in the art, based on the disclosure or teachings of this disclosure, will appreciate that some of the steps may be performed in other orders or concurrently. Further, those skilled in the art can understand that the embodiments described in the present disclosure may be regarded as optional embodiments, that is, the actions or modules involved therein are not necessarily indispensable for the realization of one or some solutions of the present disclosure. In addition, according to different solutions, the present disclosure also places different emphases in the description of some embodiments. In view of this, those skilled in the art can refer to the related descriptions of other embodiments for the parts that are not described in detail in a certain embodiment of the present disclosure.
  • units illustrated as separate components may or may not be physically separate, and components shown as units may or may not be physical units.
  • the aforementioned components or elements may be co-located or distributed over multiple network elements.
  • some or all of the units may be selected to achieve the purpose of the solutions described in the embodiments of the present disclosure.
  • multiple units in the embodiments of the present disclosure may be integrated into one unit or each unit physically exists independently.
  • the above-mentioned integrated units may also be implemented in the form of hardware, that is, specific hardware circuits, which may include digital circuits and/or analog circuits, and the like.
  • the physical implementation of the hardware structure of the circuit may include, but is not limited to, physical devices, and the physical devices may include, but are not limited to, devices such as transistors or memristors.
  • The various types of devices described herein (e.g., computing devices or other processing devices) may be implemented by suitable hardware processors, such as central processing units, GPUs, FPGAs, DSPs, ASICs, and the like.
  • The aforementioned storage unit or storage device can be any suitable storage medium (including a magnetic storage medium or magneto-optical storage medium, etc.), such as Resistive Random Access Memory (RRAM), Dynamic Random Access Memory (DRAM), Static Random Access Memory (SRAM), Enhanced Dynamic Random Access Memory (EDRAM), High Bandwidth Memory (HBM), Hybrid Memory Cube (HMC), ROM, RAM, and the like.
  • Clause 1 A data processing apparatus, comprising:
  • a control circuit configured to parse a convolution instruction, the convolution instruction including a sparse flag bit for indicating whether to perform a structured sparse convolution operation; and
  • an operation circuit configured to perform a corresponding convolution operation according to the convolution instruction.
  • Clause 2 The data processing apparatus of clause 1, wherein at least one operand of the convolution instruction includes at least one descriptor indicating at least one of the following: shape information of tensor data and spatial information of tensor data, and the data processing apparatus further includes:
  • Tensor interface circuitry configured to parse the descriptor
  • the arithmetic circuit is further configured to perform a corresponding convolution operation according to the convolution instruction based on the parsed descriptor.
  • Clause 3 The data processing apparatus of clause 2, wherein:
  • the tensor interface circuit is configured to determine, according to the shape information, a data address in the data storage space of the tensor data corresponding to the operand; and/or
  • the tensor interface circuit is configured to determine dependencies between instructions according to the spatial information.
  • Clause 4 The data processing apparatus according to any one of clauses 2-3, wherein the shape information of the tensor data includes at least one shape parameter representing the shape of N-dimensional tensor data, N is a positive integer, and the shape parameters of the tensor data include at least one of the following:
  • the size of the data storage space in which the tensor data is located in at least one of the N dimension directions; the size of the storage area of the tensor data in at least one of the N dimension directions; the offset of the storage area in at least one of the N dimension directions; the positions of at least two vertices at diagonal positions of the N dimension directions relative to the data reference point; and the mapping relationship between the data description position of the tensor data and the data address.
  • Clause 5 The data processing apparatus according to any one of clauses 2-4, wherein the shape information of the tensor data indicates at least one shape parameter of the shape of N-dimensional tensor data comprising a plurality of data blocks, N is a positive integer, and the shape parameters include at least one of the following:
  • the size of the data storage space where the tensor data is located in at least one of the N dimension directions; the size of the storage area of a single data block in at least one of the N dimension directions; and the size of the data block in at least one of the N dimension directions.
  • Clause 6 The data processing apparatus according to any one of clauses 1-5, wherein the operation circuit comprises a structured sparse circuit and a convolution circuit, and when the sparse flag bit indicates to perform a structured sparse convolution operation,
  • the structured sparse circuit is configured to perform structured sparse processing on at least one input data and output the sparsified input data to the convolution circuit; and
  • the convolution circuit is configured to receive data to be convolved and perform a convolution operation thereon, wherein the data to be convolved includes at least the sparsified input data.
  • Clause 7 The data processing apparatus of clause 6, wherein the structured sparse circuit comprises:
  • a first structured sparse subcircuit configured to perform structured sparse processing on the input data according to a specified sparse mask; and/or
  • a second structured sparse subcircuit configured to perform structured sparse processing on the input data according to a predetermined sparse rule.
  • Clause 8 The data processing apparatus of clause 7, wherein the structured sparse circuit is further configured to perform any of the following:
  • performing structured sparse processing on the first input data and the second input data to be convolved, respectively, and outputting the sparsified first and second input data to the convolution circuit to perform a convolution operation; or
  • the first structured sparse subcircuit uses the index part corresponding to the structured-sparse-processed first or second input data as a sparse mask to perform structured sparse processing on the second or first input data, and the sparsified first input data is output to the convolution circuit to perform a convolution operation with the structured-sparse-processed second input data, wherein the index part indicates the positions of valid data elements in the structured sparse processing to be performed.
  • Clause 9 The data processing apparatus of clause 8, wherein:
  • the structured-sparse-processed first or second input data is sparsified in advance and stored in the storage circuit, or
  • the structured-sparse-processed first or second input data is generated online by the second structured sparse subcircuit performing structured sparse processing.
  • Clause 10 The data processing apparatus according to any one of clauses 8-9, wherein the structured-sparse-processed first or second input data is in the form of a structure, the structure comprising a data part and an index part bound to each other, where the data part includes the valid data elements retained by the structured sparse processing, and the index part indicates the positions of the sparsified data in the pre-sparsification data.
  • Clause 11 The data processing apparatus of any of clauses 7-10, wherein the structured sparse processing comprises selecting n data elements from every m data elements as valid data elements, where m>n.
  • Clause 12 The data processing apparatus of any one of clauses 7-11, wherein the second structured sparse subcircuit further comprises: at least one multi-stage pipeline operation circuit comprising a plurality of operators arranged in stages and configured to perform a structured sparse process of selecting, from every m data elements, the n data elements with larger absolute values as valid data elements.
  • Clause 13 The data processing apparatus of clause 12, wherein the multi-stage pipeline operation circuit includes four pipeline stages, wherein: the first pipeline stage includes m absolute value operators for respectively taking absolute values of the m data elements to be sparsified to generate m absolute values;
  • the second pipeline stage includes a permutation and combination circuit for permuting and combining the m absolute values to generate m sets of data, wherein each set of data includes the m absolute values, and the positions of the m absolute values in each set of data are different from each other;
  • the third pipeline stage includes m comparison circuits for comparing absolute values in the m sets of data and generating a comparison result
  • the fourth pipeline stage includes a screening circuit for selecting n data elements with larger absolute values as valid data elements according to the comparison result, and outputting the valid data elements and a corresponding index, the index indicating the valid data the position of the element within the m data elements.
  • Clause 14 The data processing apparatus of clause 13, wherein each comparison circuit in the third pipeline stage includes m-1 comparators, and the m-1 comparators in the i-th comparison circuit are used to compare one absolute value in the i-th set of data with the other m-1 absolute values in turn and generate a comparison result, where 1 ≤ i ≤ m.
  • Clause 15 The data processing apparatus of any one of clauses 13-14, wherein the screening circuit is further configured to select according to a specified priority order when there are data elements with the same absolute value.
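The four pipeline stages of clauses 13-15 (absolute value, permutation, comparison, screening) can be modeled in software to clarify the data flow. The sketch below assumes m=4, n=2 and uses cyclic rotation as the permutation and a win count as the comparison result; these are one plausible reading, not the definitive hardware design, and the names are illustrative.

```python
# Software model of the four pipeline stages (abs -> permute -> compare
# -> screen) for m=4, n=2. Hardware performs these stages in parallel;
# this sketch only mirrors the data flow described in clauses 13-15.

def pipeline_select(group, n=2):
    m = len(group)
    # stage 1: m absolute-value operators
    absv = [abs(x) for x in group]
    # stage 2: rotate so each of the m sets leads with a different value
    sets = [absv[i:] + absv[:i] for i in range(m)]
    # stage 3: the i-th comparison circuit compares the leading value of
    # the i-th set with the other m-1 values (m-1 comparators each)
    wins = [sum(s[0] >= other for other in s[1:]) for s in sets]
    # stage 4: screening picks the n elements with the most wins; ties
    # fall back to position order (a fixed priority, per clause 15)
    keep = sorted(sorted(range(m), key=lambda i: (-wins[i], i))[:n])
    return [group[i] for i in keep], keep

vals, idx = pipeline_select([0.5, -3.0, 2.0, 0.5])
```

Using `>=` in the comparators, plus the position-order tie-break in the screening stage, ensures exactly n winners even when several elements share the same absolute value.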
  • Clause 16 The data processing apparatus according to any one of clauses 6-15, wherein the operation circuit further comprises a pre-processing circuit, when the sparse flag bit indicates to perform a structured sparse convolution operation,
  • the preprocessing circuit reads input data from the storage circuit and outputs the input data to the structured sparse circuit at a first rate, wherein the first rate is determined based at least in part on the processing capability of the convolution circuit and the sparsity ratio of the structured sparse processing.
  • Clause 17 The data processing apparatus of any of clauses 6-16, wherein the input data comprises neuron data and weights of a neural network.
  • Clause 18 The data processing apparatus of any one of clauses 1-17, wherein the convolution instruction is for a structured sparse convolution operation in a neural network, and the structured sparsification is performed on at least one dimension of multidimensional data.
  • Clause 19 The data processing apparatus of clause 18, wherein the at least one dimension is selected from the input channel dimension and the output channel dimension.
  • Clause 20 A chip comprising a data processing device according to any of clauses 1-19.
  • Clause 22 A data processing method, comprising:
  • parsing a convolution instruction, wherein the convolution instruction includes a sparse flag bit for indicating whether to perform a structured sparse convolution operation; and
  • performing a corresponding convolution operation on the operand according to the convolution instruction.
  • Clause 23 The data processing method of clause 22, wherein at least one operand of the convolution instruction includes at least one descriptor indicating at least one of the following: shape information of tensor data and spatial information of tensor data; and the method further includes:
  • parsing the descriptor; and reading the corresponding operand based at least in part on the parsed descriptor.
  • Clause 24 The data processing method of clause 23, wherein parsing the descriptor comprises:
  • determining, according to the shape information, the data address in the data storage space of the tensor data corresponding to the operand; and/or
  • determining the dependencies between instructions according to the spatial information.
  • Clause 25 The data processing method according to any one of clauses 23-24, wherein the shape information of the tensor data includes at least one shape parameter representing the shape of N-dimensional tensor data, N is a positive integer, and the shape parameters of the tensor data include at least one of the following:
  • the size of the data storage space in which the tensor data is located in at least one of the N dimension directions; the size of the storage area of the tensor data in at least one of the N dimension directions; the offset of the storage area in at least one of the N dimension directions; the positions of at least two vertices at diagonal positions of the N dimension directions relative to the data reference point; and the mapping relationship between the data description position of the tensor data and the data address.
  • Clause 26 The data processing method according to any one of clauses 23-25, wherein the shape information of the tensor data indicates at least one shape parameter of the shape of N-dimensional tensor data comprising a plurality of data blocks, and the shape parameters include at least one of the following: the size of the data storage space where the tensor data is located in at least one of the N dimension directions; the size of the storage area of a single data block in at least one of the N dimension directions; and the size of the data block in at least one of the N dimension directions.
  • Clause 27 The data processing method according to any one of Clauses 22-26, when the sparse flag indicates to perform a structured sparse convolution operation, the method further comprises:
  • performing structured sparse processing on at least one input data by using a structured sparse circuit; and performing a convolution operation on the data to be convolved by using a convolution circuit, wherein the data to be convolved includes at least the sparsified input data.
  • Clause 28 The data processing method of clause 27, wherein the structured sparse processing comprises:
  • performing structured sparse processing on the input data according to a specified sparse mask; and/or performing structured sparse processing on the input data according to a predetermined sparse rule.
  • Clause 29 The data processing method of clause 28, wherein the structured sparse processing further comprises any of the following:
  • performing structured sparse processing on the first input data and the second input data to be convolved, respectively, and outputting the sparsified first and second input data to the convolution circuit to perform a convolution operation; or
  • the first structured sparse subcircuit uses the index part corresponding to the structured-sparse-processed first or second input data as a sparse mask to perform structured sparse processing on the second or first input data, and the sparsified first input data is output to the convolution circuit to perform a convolution operation with the structured-sparse-processed second input data, wherein the index part indicates the positions of valid data elements in the structured sparse processing to be performed.
  • Clause 30 A data processing method according to Clause 29, wherein:
  • the structured-sparse-processed first or second input data is sparsified in advance and stored in a storage circuit, or
  • the structured sparse processed first or second input data is generated by using the second structured sparse subcircuit to perform structured sparse processing online.
  • Clause 31 The data processing method according to any one of clauses 28-29, wherein the structured-sparse-processed first or second input data is in the form of a structure, the structure comprising a data part and an index part bound to each other, where the data part includes the valid data elements retained by the structured sparse processing, and the index part indicates the positions of the sparsified data in the pre-sparsification data.
  • Clause 32 The data processing method of any of clauses 28-31, wherein the structured sparse processing comprises selecting n data elements from every m data elements as valid data elements, where m>n.
  • Clause 33 The data processing method of any of clauses 28-32, wherein the second structured sparse subcircuit further comprises: at least one multi-stage pipeline operation circuit comprising a plurality of operators arranged in stages and configured to perform a structured sparse process of selecting, from every m data elements, the n data elements with larger absolute values as valid data elements.
  • Clause 34 The data processing method of clause 33, wherein the multi-stage pipelined circuit includes four pipeline stages, wherein:
  • the first pipeline stage includes m absolute value operators for respectively taking absolute values of m data elements to be sparsed to generate m absolute values;
  • the second pipeline stage includes a permutation and combination circuit for permuting and combining the m absolute values to generate m sets of data, wherein each set of data includes the m absolute values, and the positions of the m absolute values in each set of data are different from each other;
  • the third pipeline stage includes m comparison circuits for comparing absolute values in the m sets of data and generating a comparison result
  • the fourth pipeline stage includes a screening circuit for selecting n data elements with larger absolute values as valid data elements according to the comparison result, and outputting the valid data elements and a corresponding index, the index indicating the valid data the position of the element within the m data elements.
  • Clause 35 The data processing method of clause 34, wherein each comparison circuit in the third pipeline stage includes m-1 comparators, and the m-1 comparators in the i-th comparison circuit are used to compare one absolute value in the i-th set of data with the other m-1 absolute values in turn and generate a comparison result, where 1 ≤ i ≤ m.
  • Clause 36 The data processing method of any one of clauses 34-35, wherein the screening circuit is further configured to select according to a specified priority order when there are data elements with the same absolute value.
  • Clause 37 The data processing method according to any one of Clauses 27-36, the method further comprising:
  • the input data is fed at a first rate to perform the structured sparse processing, wherein the first rate is determined based at least in part on the processing capability of the hardware performing the convolution operation and the sparsity ratio of the structured sparse processing.
  • Clause 38 The data processing method of any of clauses 27-37, wherein the input data comprises neuron data and weights of a neural network.
  • Clause 39 The data processing method of any one of clauses 22-38, wherein the convolution instruction is for a structured sparse convolution operation in a neural network, and the structured sparsification is performed on at least one dimension of multidimensional data.
  • Clause 40 The data processing method of Clause 39, wherein the at least one dimension is selected from an input channel dimension and an output channel dimension.
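The four-stage pipeline of clauses 34-36 keeps, for every group of m data elements, the n elements with the largest absolute values and emits their positions as an index, resolving ties by a specified priority order. The sketch below is a purely illustrative software model of that selection, not the claimed circuit; the function name and the 2-out-of-4 configuration in the example are assumptions.

```python
def structured_sparsify(group, n):
    """Keep the n elements of `group` with the largest absolute values.

    Ties in absolute value are resolved by position (earlier position
    wins), standing in for the 'specified priority order' of clause 36.
    Returns the kept values and an index giving their positions within
    the group of m data elements.
    """
    m = len(group)
    # Rank positions by descending absolute value; Python's sort is
    # stable, so equal absolute values keep their original order.
    order = sorted(range(m), key=lambda i: -abs(group[i]))
    kept = sorted(order[:n])  # report kept positions in ascending order
    return [group[i] for i in kept], kept

# Example: 2-out-of-4 structured sparsity on one group of data elements.
values, index = structured_sparsify([0.1, -3.0, 2.5, 0.1], n=2)
# values == [-3.0, 2.5], index == [1, 2]
```

In the claimed hardware this work is spread over the pipeline stages: stage two permutes the m absolute values into m sets, stage three compares each value against the other m-1, and stage four screens out the n winners; the software model collapses those stages into a single sort.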

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Neurology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Complex Calculations (AREA)

Abstract

Disclosed are a data processing apparatus, a data processing method and a related product. The data processing apparatus may serve as a computing apparatus included in a combined processing apparatus. The combined processing apparatus may also include an interface apparatus and other processing apparatuses. The computing apparatus interacts with the other processing apparatuses to jointly complete a computing operation specified by a user. The combined processing apparatus may further include a storage apparatus, which is respectively connected to the computing apparatus and the other processing apparatuses and is used for storing data of the computing apparatus and the other processing apparatuses. A dedicated instruction is provided for a structured sparse convolution operation; the dedicated instruction can simplify processing, thereby improving the processing efficiency of the machine.
PCT/CN2021/128187 2020-12-25 2021-11-02 Data processing apparatus, data processing method and related product WO2022134872A1 (fr)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN202011566134.1A CN114692844A (zh) Data processing apparatus, data processing method and related product
CN202011566148.3A CN114692846A (zh) Data processing apparatus, data processing method and related product
CN202011566148.3 2020-12-25
CN202011566134.1 2020-12-25

Publications (1)

Publication Number Publication Date
WO2022134872A1 true WO2022134872A1 (fr) 2022-06-30

Family

ID=82157427

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/128187 WO2022134872A1 (fr) 2020-12-25 2021-11-02 Appareil de traitement de données, procédé de traitement de données et produit associé

Country Status (1)

Country Link
WO (1) WO2022134872A1 (fr)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107909148A (zh) * 2017-12-12 2018-04-13 Beijing Horizon Information Technology Co., Ltd. Apparatus for performing convolution operations in a convolutional neural network
US20180204110A1 (en) * 2017-01-16 2018-07-19 Electronics And Telecommunications Research Institute Compressed neural network system using sparse parameters and design method thereof
CN110197267A (zh) * 2018-02-27 2019-09-03 Shanghai Cambricon Information Technology Co., Ltd. Neural network processor board card and related products
CN110991631A (zh) * 2019-11-28 2020-04-10 Fuzhou University FPGA-based neural network acceleration system


Similar Documents

Publication Publication Date Title
WO2023045445A1 (fr) Data processing device, data processing method and related product
WO2022134873A1 (fr) Data processing device, data processing method and related product
WO2023045446A1 (fr) Computing apparatus, data processing method and related product
CN113469336 (zh) Compilation method and execution method for optimizing a neural network model, and related products
WO2024149112A1 (fr) Compilation method for convolution operator, and related product
WO2023236929A1 (fr) Method and device for reading target data from data on the basis of an instruction
WO2023030507A1 (fr) Compilation optimization method and apparatus, computing device and storage medium
WO2022134872A1 (fr) Data processing apparatus, data processing method and related product
CN114692844A (zh) Data processing apparatus, data processing method and related product
WO2022001500A1 (fr) Computing apparatus, integrated circuit chip, circuit board, electronic device and computing method
WO2022095675A1 (fr) Neural network sparsification apparatus and method, and related device
CN113469337B (zh) Compilation method for optimizing a neural network model and related products
CN114281561A (zh) Processing unit, synchronization method for a processing unit and corresponding products
CN114692838A (zh) Data processing apparatus, data processing method and related product
WO2022134688A1 (fr) Data processing circuit, data processing method and related products
WO2022257980A1 (fr) Computing apparatus, method for implementing a convolution operation by using a computing apparatus, and related product
CN114692841A (zh) Data processing apparatus, data processing method and related product
WO2022001454A1 (fr) Integrated computing apparatus, integrated circuit chip, circuit board and computing method
CN113742266B (zh) Integrated circuit apparatus, electronic device, board card and computing method
WO2022001499A1 (fr) Computing apparatus, chip, circuit board, electronic device and computing method
WO2022135599A1 (fr) Device, board and method for merging branch structures, and readable storage medium
CN113791996B (zh) Integrated circuit apparatus, electronic device, board card and computing method
WO2023087698A1 (fr) Computing apparatus and method for executing a convolution operation, and related products
WO2022135600A1 (fr) Neural network computing apparatus, board, method and readable storage medium
WO2022063183A1 (fr) Device and method for neural computing, and board and readable storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21908873

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21908873

Country of ref document: EP

Kind code of ref document: A1