CN114692846A

CN114692846A - Data processing device, data processing method and related product

Info

Publication number: CN114692846A
Application number: CN202011566148.3A
Authority: CN
Inventors: Not disclosed
Original assignee: Cambricon Technologies Corp Ltd
Current assignee: Cambricon Technologies Corp Ltd
Priority date: 2020-12-25
Filing date: 2020-12-25
Publication date: 2022-07-01

Abstract

The present disclosure discloses a data processing apparatus, a data processing method, and a related product. The data processing apparatus may be implemented as a computing apparatus included in a combined processing apparatus, which may also include interface apparatus and other processing apparatus. The computing device interacts with other processing devices to jointly complete computing operations specified by a user. The combined processing device may further comprise a storage device connected to the computing device and the other processing device, respectively, for storing data of the computing device and the other processing device. The disclosed solution provides dedicated instructions for structured sparse convolution operations that can simplify processing and improve the processing efficiency of the machine.

Description

Data processing device, data processing method and related product

Technical Field

The present disclosure relates generally to the field of processors. More particularly, the present disclosure relates to a data processing apparatus, a data processing method, a chip, and a board.

Background

In recent years, with the rapid development of deep learning, algorithm performance in fields such as computer vision and natural language processing has advanced by leaps and bounds. However, deep learning algorithms are both computation-intensive and storage-intensive. As information processing tasks grow more complex and the requirements on real-time performance and accuracy rise, neural networks are designed to be ever deeper, which increases their demands on computation and storage space. As a result, existing deep-learning-based artificial intelligence technology is difficult to apply directly to mobile phones, satellites, or embedded devices with limited hardware resources.

Therefore, compression, acceleration, and optimization of deep neural network models have become critically important. A large body of research attempts to reduce the computation and storage requirements of neural networks without affecting model accuracy, which is of great significance for the engineering application of deep learning technology on embedded and mobile devices. Sparsification is one such model lightweighting method.

The network parameter sparsification is to reduce redundant components in a larger network by a proper method so as to reduce the requirement of the network on the calculation amount and the storage space. Existing hardware and/or instruction sets do not efficiently support sparsification and operations related to post-sparsification.

Disclosure of Invention

In order to at least partially solve one or more technical problems mentioned in the background, the present disclosure provides a data processing apparatus, a data processing method, a chip, and a board.

In a first aspect, the present disclosure discloses a data processing apparatus comprising: control circuitry configured to parse a convolution instruction, the convolution instruction including a sparse flag bit to indicate whether to perform a structured sparse convolution operation; a storage circuit configured to store pre-convolution and/or post-convolution information; and an arithmetic circuit configured to perform a corresponding convolution operation according to the convolution instruction.

In a second aspect, the present disclosure provides a chip comprising the data processing apparatus of any of the embodiments of the first aspect.

In a third aspect, the present disclosure provides a board card comprising the chip of any of the embodiments of the second aspect.

In a fourth aspect, the present disclosure provides a data processing method, the method comprising: parsing a convolution instruction, the convolution instruction including a sparse flag bit to indicate whether to perform a structured sparse convolution operation; reading corresponding operands according to the convolution instruction; and performing corresponding convolution operation on the operand according to the convolution instruction.

With the data processing apparatus, the data processing method, the integrated circuit chip and the board card provided above, the embodiments of the present disclosure provide a convolution instruction that includes a sparse flag bit indicating whether to perform a structured sparse convolution operation. By setting the sparse flag bit, the corresponding arithmetic circuit can be configured according to the value of the flag bit to execute the corresponding convolution operation. In some embodiments, when the sparse flag bit indicates a structured sparse convolution operation, the arithmetic circuit may be configured to perform structured sparse processing and then perform convolution on the sparsified data. By reusing the instruction fields of the convolution instruction and adding a structured sparse enable flag bit, processing can be simplified and the processing efficiency of the machine improved.

Drawings

The above and other objects, features and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which like reference numerals refer to like or corresponding parts and in which:

FIG. 1 is a block diagram illustrating a board card of an embodiment of the present disclosure;

FIG. 2 is a block diagram illustrating a combined processing device according to an embodiment of the present disclosure;

FIG. 3 is a schematic diagram illustrating an internal architecture of a single core computing device according to an embodiment of the present disclosure;

FIG. 4 is a schematic diagram illustrating the internal structure of a multi-core computing device according to embodiments of the present disclosure;

FIG. 5 is an internal block diagram illustrating a processor core of an embodiment of the disclosure;

FIG. 6 is a schematic diagram showing the structure of a data processing apparatus of an embodiment of the present disclosure;

FIGS. 7A-7C are schematic diagrams illustrating a portion of an operational circuit according to an embodiment of the disclosure;

FIG. 8 illustrates an exemplary pipeline circuit diagram for structured sparseness processing of an embodiment of the present disclosure; and

FIG. 9 is an exemplary flowchart illustrating a data processing method of an embodiment of the present disclosure.

Detailed Description

The technical solutions in the embodiments of the present disclosure will be described below clearly and completely with reference to the accompanying drawings in the embodiments of the present disclosure, and it is to be understood that the described embodiments are only some embodiments of the present disclosure, but not all embodiments. All other embodiments, which can be derived by one skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the scope of protection of the present disclosure.

It should be understood that the terms "first," "second," "third," and "fourth," etc. as may appear in the claims, specification, and drawings of the present disclosure, are used for distinguishing between different objects and not for describing a particular order. The terms "comprises" and "comprising," when used in the specification and claims of this disclosure, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

It is also to be understood that the terminology used in the description of the disclosure is for the purpose of describing particular embodiments only, and is not intended to be limiting of the disclosure. As used in the specification and claims of this disclosure, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the specification and claims of this disclosure refers to any and all possible combinations of one or more of the associated listed items and includes such combinations.

As used in this specification and claims, the term "if" may be interpreted contextually as "when", "upon" or "in response to a determination" or "in response to a detection".

Specific embodiments of the present disclosure are described in detail below with reference to the accompanying drawings.

FIG. 1 shows a schematic structural diagram of a board card 10 according to an embodiment of the disclosure. As shown in FIG. 1, the board card 10 includes a chip 101, which is a System-on-Chip (SoC) integrated with one or more combined processing devices. The combined processing device is an artificial intelligence arithmetic unit that supports various deep learning and machine learning algorithms and meets the intelligent processing requirements of complex scenarios in fields such as computer vision, speech, natural language processing, and data mining. Deep learning technology in particular is widely applied in cloud intelligence, a notable characteristic of which is a large input data size that places high demands on the storage capacity and computing capacity of the platform.

The chip 101 is connected to an external device 103 through an external interface device 102. The external device 103 is, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card, a wifi interface, or the like. The data to be processed may be transferred by the external device 103 to the chip 101 through the external interface device 102. The calculation result of the chip 101 may be transmitted back to the external device 103 via the external interface device 102. The external interface device 102 may have different interface forms, such as a PCIe interface, according to different application scenarios.

The board card 10 also includes a memory device 104 for storing data, which includes one or more memory cells 105. The memory device 104 is connected to the control device 106 and the chip 101 through a bus and exchanges data with them. The control device 106 in the board card 10 is configured to regulate the state of the chip 101. To this end, in one application scenario, the control device 106 may include a microcontroller unit (MCU).

Fig. 2 is a structural diagram showing a combined processing device in the chip 101 of this embodiment. As shown in fig. 2, the combination processing device 20 includes a computing device 201, an interface device 202, a processing device 203, and a storage device 204.

The computing device 201 is configured to perform user-specified operations and is mainly implemented as a single-core or multi-core intelligent processor that performs deep learning or machine learning computations. It may interact with the processing device 203 through the interface device 202 to jointly complete the user-specified operations.

The interface device 202 is used for transferring data and control instructions between the computing device 201 and the processing device 203. For example, the computing device 201 may obtain input data from the processing device 203 via the interface device 202 and write it to a storage device on the computing device 201. Further, the computing device 201 may obtain control instructions from the processing device 203 via the interface device 202 and write them into a control cache on the computing device 201. Alternatively or additionally, the interface device 202 may also read data in a storage device of the computing device 201 and transmit the data to the processing device 203.

The processing device 203, as a general-purpose processing device, performs basic control including, but not limited to, data transfer and starting and/or stopping of the computing device 201. Depending on the implementation, the processing device 203 may be one or more types of processors among a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), or other general-purpose and/or special-purpose processors, including but not limited to a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, etc., and their number may be determined according to actual needs. As previously mentioned, the computing device 201 of the present disclosure, considered on its own, may be viewed as having a single-core structure or a homogeneous multi-core structure. However, when considered together, the computing device 201 and the processing device 203 form a heterogeneous multi-core structure.

The storage device 204 is used to store data to be processed. It may be a DRAM, such as a DDR memory, typically 16 GB or larger in size, and stores data of the computing device 201 and/or the processing device 203.

Fig. 3 shows an internal structure diagram of the computing apparatus 201 as a single core. The single-core computing device 301 is used for processing input data such as computer vision, voice, natural language, data mining, and the like, and the single-core computing device 301 includes three modules: a control module 31, an operation module 32 and a storage module 33.

The control module 31 is used for coordinating and controlling the operations of the operation module 32 and the storage module 33 to complete the task of deep learning, and includes an Instruction Fetch Unit (IFU) 311 and an Instruction Decode Unit (IDU) 312. The instruction fetch unit 311 is used for obtaining an instruction from the processing device 203, and the instruction decoding unit 312 decodes the obtained instruction and sends the decoded result as control information to the operation module 32 and the storage module 33.

The operation module 32 includes a vector operation unit 321 and a matrix operation unit 322. The vector operation unit 321 is used for performing vector operation, and can support complex operations such as vector multiplication, addition, nonlinear transformation, etc.; the matrix operation unit 322 is responsible for the core calculation of the deep learning algorithm, i.e., matrix multiplication and convolution.

The storage module 33 is used to store or transport related data, and includes a neuron storage unit (neuron RAM, NRAM) 331, a parameter storage unit (weight RAM, WRAM) 332, and a direct memory access (DMA) module 333. NRAM 331 is used to store input neurons, output neurons, and intermediate results of computation; WRAM 332 is used to store the convolution kernels of the deep learning network, i.e., the weights; the DMA 333 is connected to the DRAM 204 via the bus 34 and is responsible for data transfer between the single-core computing device 301 and the DRAM 204.

FIG. 4 shows a schematic diagram of the internal structure of the computing device 201 with multiple cores. The multi-core computing device 41 adopts a hierarchical design: it is a system on chip that includes at least one cluster, and each cluster includes a plurality of processor cores. In other words, the multi-core computing device 41 is organized in a system-on-chip, cluster, processor-core hierarchy.

At the system-on-chip level, as shown in FIG. 4, the multi-core computing device 41 includes an external storage controller 401, a peripheral communication module 402, an on-chip interconnect module 403, a synchronization module 404, and a plurality of clusters 405.

There may be multiple external memory controllers 401, two of which are shown in the figure by way of example. In response to an access request issued by a processor core, an external memory controller 401 accesses an external memory device, such as the DRAM 204 in FIG. 2, to read data from or write data to off-chip memory. The peripheral communication module 402 is used for receiving a control signal from the processing device 203 through the interface device 202 and starting the computing device 201 to execute a task. The on-chip interconnect module 403 connects the external memory controller 401, the peripheral communication module 402 and the plurality of clusters 405, and transmits data and control signals between these modules. The synchronization module 404 is a global synchronization barrier controller (GBC) for coordinating the operation progress of the clusters and ensuring the synchronization of information. The plurality of clusters 405 are the computing cores of the multi-core computing device 41; four are shown in the figure by way of example, and as hardware advances, the multi-core computing device 41 of the present disclosure may include 8, 16, 64, or even more clusters 405. The clusters 405 are used to efficiently execute deep learning algorithms.

At the cluster level, as shown in FIG. 4, each cluster 405 includes multiple processor cores (IPU cores) 406 and a memory core (MEM core) 407.

Four processor cores 406 are shown in the figure by way of example; the present disclosure does not limit the number of processor cores 406. Their internal architecture is shown in FIG. 5. Each processor core 406 is similar to the single-core computing device 301 of FIG. 3 and likewise includes three major modules: a control module 51, an operation module 52 and a storage module 53. The functions and structures of the control module 51, the operation module 52 and the storage module 53 are substantially the same as those of the control module 31, the operation module 32 and the storage module 33, and are not described again. It should be noted that the storage module 53 includes an input/output direct memory access (IODMA) module 533 and a move direct memory access (MVDMA) module 534. IODMA 533 controls the access of NRAM 531/WRAM 532 to the DRAM 204 through the broadcast bus 409; the MVDMA 534 controls access between the NRAM 531/WRAM 532 and the memory cell (SRAM) 408.

Returning to FIG. 4, the memory core 407 is primarily used for storage and communication, i.e., storing shared data or intermediate results among the processor cores 406 and carrying out communication between the cluster 405 and the DRAM 204, communication among the clusters 405, communication among the processor cores 406, and the like. In other embodiments, the memory core 407 is also capable of performing scalar operations.

The memory core 407 includes an SRAM 408, a broadcast bus 409, a Cluster Direct Memory Access (CDMA) module 410, and a Global Direct Memory Access (GDMA) module 411. The SRAM 408 serves as a high-performance data transfer station. Data reused by different processor cores 406 in the same cluster 405 need not be fetched from the DRAM 204 by each processor core 406 individually, but is instead relayed among the processor cores 406 through the SRAM 408. The memory core 407 only needs to quickly distribute the reused data from the SRAM 408 to the plurality of processor cores 406, which improves inter-core communication efficiency and greatly reduces on-chip and off-chip input/output accesses.

The broadcast bus 409, CDMA 410, and GDMA 411 are used, respectively, for communication among the processor cores 406, communication among the clusters 405, and data transfer between the cluster 405 and the DRAM 204. Each is described in turn below.

The broadcast bus 409 is used to complete high-speed communication among the processor cores 406 in the cluster 405, and the broadcast bus 409 of this embodiment supports inter-core communication modes including unicast, multicast and broadcast. Unicast refers to point-to-point (e.g., from a single processor core to a single processor core) data transfer, multicast is a communication that transfers a copy of data from SRAM 408 to a particular number of processor cores 406, and broadcast is a communication that transfers a copy of data from SRAM 408 to all processor cores 406, which is a special case of multicast.

CDMA 410 is used to control access to SRAM 408 between different clusters 405 within the same computing device 201.

GDMA 411 cooperates with the external memory controller 401 to control access of the SRAM 408 of the cluster 405 to the DRAM 204, or to read data from the DRAM 204 into the SRAM 408. As can be seen from the foregoing, communication between the DRAM 204 and the NRAM 531 or WRAM 532 may be achieved via two channels. The first channel connects the DRAM 204 directly with the NRAM 531 or WRAM 532 through the IODMA 533; in the second channel, data is transferred between the DRAM 204 and the SRAM 408 through the GDMA 411, and then between the SRAM 408 and the NRAM 531 or WRAM 532 through the MVDMA 534. Although the second channel seemingly requires more components and a longer data path, in some embodiments its bandwidth is substantially greater than that of the first channel, so communication between the DRAM 204 and the NRAM 531 or WRAM 532 may be more efficient over the second channel. Embodiments of the present disclosure may select a data transmission channel according to their own hardware conditions.

In other embodiments, the functionality of GDMA 411 and the functionality of IODMA 533 may be integrated in the same component. For convenience of description, the present disclosure treats GDMA 411 and IODMA 533 as different components; any implementation by those skilled in the art that achieves functions and technical effects similar to those of the present disclosure falls within its scope of protection. Further, the functions of GDMA 411, IODMA 533, CDMA 410 and MVDMA 534 may be implemented by the same component.

One embodiment of the present disclosure provides a data processing scheme based on the foregoing hardware environment, to perform a structured sparse convolution operation according to a sparse flag included in a convolution instruction.

Fig. 6 shows a block diagram of a data processing apparatus 600 according to an embodiment of the present disclosure. The data processing apparatus 600 may be implemented, for example, in the computing apparatus 201 of fig. 2. As shown, the data processing apparatus 600 may include a control circuit 610, a memory circuit 620, and an arithmetic circuit 630.

The control circuit 610 may function similarly to the control module 31 of fig. 3 or the control module 51 of fig. 5, and may include, for example, an instruction fetch unit to fetch an instruction from, for example, the processing device 203 of fig. 2, and an instruction decode unit to decode the fetched instruction and send the decoded result as control information to the operation circuit 630 and the storage circuit 620.

In one embodiment, the control circuit 610 may be configured to parse a convolution instruction that includes a sparse flag bit indicating whether to perform a structured sparse convolution operation. In one implementation, a sparse flag bit value of "1" indicates that the current convolution instruction performs a structured sparse convolution operation, and a value of "0" indicates that it performs a conventional convolution operation; the opposite assignment may also be used.
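The flag decoding described above can be sketched in software as follows. This is only an illustration: the disclosure does not fix where the flag sits in the instruction word, so the bit position `SPARSE_FLAG_BIT`, the `ConvInstruction` container, and the `flag_active_high` switch (which models the "opposite assignment") are all hypothetical names introduced here.

```python
from dataclasses import dataclass

# Assumption for illustration only: the sparse flag occupies bit 0 of the word.
SPARSE_FLAG_BIT = 0


@dataclass(frozen=True)
class ConvInstruction:
    sparse: bool   # True: structured sparse convolution; False: conventional convolution
    payload: int   # remaining instruction bits (operand descriptors etc.), kept opaque


def parse_conv_instruction(word: int, flag_active_high: bool = True) -> ConvInstruction:
    """Extract the sparse flag bit from a (hypothetical) convolution instruction word."""
    flag = (word >> SPARSE_FLAG_BIT) & 1
    # flag_active_high=False models the reversed encoding ("0" means sparse).
    sparse = bool(flag) if flag_active_high else not bool(flag)
    return ConvInstruction(sparse=sparse, payload=word >> (SPARSE_FLAG_BIT + 1))


if __name__ == "__main__":
    print(parse_conv_instruction(0b1011))  # flag bit is 1: sparse convolution
    print(parse_conv_instruction(0b1010))  # flag bit is 0: conventional convolution
```

Because the flag is just one extra field of the existing convolution instruction, the rest of the word is passed through unchanged in this sketch.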

Storage circuitry 620 may be configured to store pre-convolution and/or post-convolution information. In one embodiment, the operands of the convolution instruction include weights of convolution layers in the neural network and neuron data of the neural network. In this embodiment, the storage circuit may include, for example, the WRAM 332 of fig. 3 or the WRAM 532 of fig. 5, for storing the weight; and NRAM 331 of fig. 3 or NRAM 531 of fig. 5 for storing neuron data.

The arithmetic circuitry 630 may be configured to perform a corresponding convolution operation in accordance with the convolution instruction.

In some embodiments, the operational circuitry 630 may include structured sparseness circuitry 632 and convolution circuitry 633.

When the sparse flag bit in a convolution instruction indicates that the current convolution instruction needs to perform a structured sparse convolution operation, the operational circuitry 630 may be configured accordingly. For example, the structured sparse circuit 632 in the operational circuitry 630 may be configured to perform structured sparse processing on at least one input data and output the sparsified input data to the convolution circuit 633. The convolution circuit 633 may be configured to receive data to be convolved and perform a convolution operation on it. The data to be convolved includes at least the sparsified input data received from the structured sparse circuit 632. Thus, when the sparse flag bit is set, structured sparse convolution processing can be realized by the structured sparse circuit 632 and the convolution circuit 633. In some implementations, the input data may include neuron data of the neural network and weights of the neural network.

The structured sparse circuit 632 performs structured sparse processing, which includes selecting n data elements out of every m data elements as valid data elements, where m > n. In one implementation, m is 4 and n is 2. In other implementations, when m is 4, n may also take other values, such as 1 or 3.

The convolution circuit 633 is used to perform a convolution operation on input data. When the sparse flag bit is set to "1", that is, when a structured sparse convolution operation is performed, the data received by the convolution circuit 633 includes at least the sparsified input data from the structured sparse circuit 632. When the sparse flag bit is set to "0", that is, when a conventional convolution operation is performed, the data received by the convolution circuit 633 is data that has not been sparsified.
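The two data paths selected by the flag can be illustrated with a small software model. It is a sketch under several assumptions that are not taken from the disclosure: only the weights are sparsified, pruning keeps the larger absolute values, the pruned weights stay in dense form with zeros, and a 1-D "valid" cross-correlation stands in for the convolution circuit. The function names are hypothetical.

```python
from typing import List


def prune_n_of_m(values: List[float], m: int = 4, n: int = 2) -> List[float]:
    """Zero all but the n largest-magnitude elements in each group of m (dense form)."""
    if len(values) % m != 0:
        raise ValueError("length must be a multiple of m")
    out: List[float] = []
    for start in range(0, len(values), m):
        group = values[start:start + m]
        # Positions of the n largest |x|; ties are resolved in favour of the earlier position.
        keep = set(sorted(range(m), key=lambda i: (-abs(group[i]), i))[:n])
        out.extend(x if i in keep else 0.0 for i, x in enumerate(group))
    return out


def conv1d_valid(signal: List[float], kernel: List[float]) -> List[float]:
    """Plain 1-D 'valid' cross-correlation, standing in for the convolution circuit."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]


def execute_conv(sparse_flag: int, neurons: List[float], weights: List[float]) -> List[float]:
    """Route the operands through the sparse path only when the flag is set."""
    if sparse_flag == 1:
        weights = prune_n_of_m(weights)    # structured sparse path
    return conv1d_valid(neurons, weights)  # conventional path leaves the weights untouched


if __name__ == "__main__":
    neurons = [1.0] * 5
    weights = [1.0, -3.0, 2.0, 0.5]
    print(execute_conv(1, neurons, weights))  # pruned weights are [0, -3, 2, 0]
    print(execute_conv(0, neurons, weights))  # all four weights take part
```

With these inputs the sparse path yields `[-1.0, -1.0]` and the conventional path `[0.5, 0.5]`, which shows that the same instruction produces different results depending only on the flag.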

The input data to be convolved may exist in various forms depending on different application scenarios, and thus the structured sparse circuit and the convolution circuit may need to perform the structured sparse convolution processing according to different requirements.

Fig. 7A-7C are schematic diagrams illustrating a portion of an operational circuit according to an embodiment of the disclosure. As shown, a first structured sparse sub-circuit 712 and/or a second structured sparse sub-circuit 714 may be included in the structured sparse circuit 710. The first structured sparseness sub-circuit 712 may be configured to perform structured sparseness on the input data according to a specified sparseness mask. The second structured sparse sub-circuit 714 may then be configured to perform structured sparse processing on the input data according to a predetermined sparse rule.

In a first scenario, one of the data to be convolved (assumed here, without loss of generality, to be the first data) may have been subjected to structured sparse processing in advance, while the other (assumed to be the second data) needs to be sparsified in the same pattern as the first data. In this case, the second data may be sparsified using the first structured sparse sub-circuit.

FIG. 7A shows a schematic diagram of a part of the structure of the arithmetic circuit in the first scenario. As shown, in this first scenario, the structured sparse circuit 710 includes a first structured sparse sub-circuit 712 that receives the input second data and the index portion of the first data, which has been structurally sparsified in advance. The first structured sparse sub-circuit 712 performs structured sparse processing on the second data using the index portion of the first data as a sparse mask. Specifically, the first structured sparse sub-circuit 712 extracts the data at the corresponding positions from the second data as valid data, according to the valid data positions indicated by the index portion of the first data. In these embodiments, the first structured sparse sub-circuit 712 may be implemented by, for example, a vector multiplication or matrix multiplication circuit. The convolution circuit 720 receives the first data, which has already been structurally sparsified, and the sparsified second data output from the first structured sparse sub-circuit 712, and performs convolution on the two.
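A minimal sketch of mask-based sparsification in the first scenario is given below. The index layout is an assumption made for illustration: the index portion is modelled as one list of kept positions (0 to m-1) per group of m elements, and `apply_sparse_mask` is a hypothetical name. The disclosure itself suggests a multiplication-based circuit; plain element gathering is used here for readability.

```python
from typing import List, Sequence


def apply_sparse_mask(data: Sequence[float],
                      index: Sequence[Sequence[int]],
                      m: int = 4) -> List[float]:
    """Keep, in each group of m elements, only the positions listed in index[g]."""
    if len(data) != m * len(index):
        raise ValueError("data length must equal m times the number of index groups")
    kept: List[float] = []
    for g, positions in enumerate(index):
        base = g * m
        kept.extend(data[base + p] for p in positions)
    return kept


if __name__ == "__main__":
    # First data: already sparsified (2 of every 4 kept), stored as values plus index.
    first_values = [5, -7, 3, 9]
    first_index = [[0, 2], [1, 3]]
    # Second data: still dense, sparsified with the first data's index as the mask.
    second = [10, 11, 12, 13, 14, 15, 16, 17]
    second_values = apply_sparse_mask(second, first_index)
    print(second_values)
    # Both operands now line up element by element for the multiply-accumulate.
    print(sum(a * b for a, b in zip(first_values, second_values)))
```

Here the second data shrinks to `[10, 12, 15, 17]`, matching the positions retained in the first data, and the aligned multiply-accumulate gives 164.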

In a second scenario, neither of the two data to be convolved has been subjected to structured sparse processing, and both need to be sparsified before convolution. In this case, the first data and the second data may be sparsified using the second structured sparse sub-circuit.

FIG. 7B shows a partial structural schematic diagram of the arithmetic circuit in the second scenario. As shown, in this second scenario, the structured sparse circuit 710 may include two second structured sparse sub-circuits 714 that respectively receive the first data and the second data to be convolved, perform structured sparse processing on them simultaneously and independently, and output the sparsified data to the convolution circuit 720. The second structured sparse sub-circuit 714 may be configured to perform structured sparse processing according to a predetermined screening rule, for example, selecting the n data elements with the larger absolute values out of every m data elements as valid data elements. In these embodiments, the second structured sparse sub-circuit 714 may implement the above processing by, for example, configuring a multi-stage operation pipeline composed of circuits such as comparators. It will be appreciated by those skilled in the art that the structured sparse circuit 710 may also include only one second structured sparse sub-circuit 714 that performs structured sparse processing on the first data and the second data in sequence.
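The "larger absolute value" screening rule can be modelled as below. This is a software sketch, not the comparator pipeline itself; the output format (compacted values plus a per-group list of kept positions) and the tie-breaking rule (the earlier position wins) are assumptions introduced for illustration.

```python
from typing import List, Sequence, Tuple


def sparsify_by_magnitude(data: Sequence[float],
                          m: int = 4,
                          n: int = 2) -> Tuple[List[float], List[List[int]]]:
    """Select the n largest-magnitude elements from every m; return (values, index)."""
    if not 0 < n < m:
        raise ValueError("require 0 < n < m")
    if len(data) % m != 0:
        raise ValueError("data length must be a multiple of m")
    values: List[float] = []
    index: List[List[int]] = []
    for start in range(0, len(data), m):
        group = data[start:start + m]
        # Rank positions by descending |x|; equal magnitudes keep the earlier position.
        ranked = sorted(range(m), key=lambda i: (-abs(group[i]), i))
        keep = sorted(ranked[:n])  # restore the original element order
        index.append(keep)
        values.extend(group[i] for i in keep)
    return values, index


if __name__ == "__main__":
    vals, idx = sparsify_by_magnitude([1, -3, 2, 0.5, 4, 4, -4, 0])
    print(vals)  # kept elements, two per group of four
    print(idx)   # positions of the kept elements inside each group
```

For the sample input the first group keeps `-3` and `2` (positions 1 and 2), and the second group, where three magnitudes tie, keeps the two earliest elements.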

In a third scenario, neither of the two data to be convolved has been subjected to structured sparse processing, both need to be sparsified before convolution, and one of the data (e.g., the first data) needs to use the index portion of the other, sparsified data (e.g., the second data) as its sparse mask.

FIG. 7C shows a schematic diagram of a part of the structure of the arithmetic circuit in the third scenario. As shown, in this third scenario, the structured sparse circuit 710 may include a first structured sparse sub-circuit 712 and a second structured sparse sub-circuit 714. The second structured sparse sub-circuit 714 may be configured to perform structured sparse processing on, for example, the second data according to a predetermined screening rule, and provide the index portion of the sparsified second data to the first structured sparse sub-circuit 712. The first structured sparse sub-circuit 712 performs structured sparse processing on the first data using the index portion of the second data as a sparse mask. The convolution circuit 720 receives the structurally sparsified first data and second data from the first structured sparse sub-circuit 712 and the second structured sparse sub-circuit 714, respectively, and performs convolution on the two.
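The data flow of the third scenario can be sketched end to end as follows. The sketch makes simplifying assumptions: the screening rule is "larger absolute value", ties go to the earlier position, and the convolution is reduced to a single multiply-accumulate over two equally long vectors. `select_top_n` and `sparse_dot` are hypothetical names.

```python
from typing import List, Sequence


def select_top_n(group: Sequence[float], n: int) -> List[int]:
    """Positions of the n largest-magnitude elements of one group, in original order."""
    ranked = sorted(range(len(group)), key=lambda i: (-abs(group[i]), i))
    return sorted(ranked[:n])


def sparse_dot(first: Sequence[float], second: Sequence[float],
               m: int = 4, n: int = 2) -> float:
    """Sparsify `second` by rule, mask `first` with the resulting index, then accumulate."""
    if len(first) != len(second) or len(first) % m != 0:
        raise ValueError("operands must have equal length, a multiple of m")
    acc = 0.0
    for start in range(0, len(second), m):
        # Second sub-circuit: rule-based selection on the second data.
        keep = select_top_n(second[start:start + m], n)
        for p in keep:
            # First sub-circuit: the same positions are taken from the first data.
            # Convolution circuit: multiply-accumulate over the surviving pairs.
            acc += first[start + p] * second[start + p]
    return acc


if __name__ == "__main__":
    first = [1, 2, 3, 4, 5, 6, 7, 8]
    second = [0.1, -3, 2, 0, 9, 0, 0, -1]
    print(sparse_dot(first, second))
```

In the example the second data keeps positions 1 and 2 in the first group and positions 0 and 3 in the second, so the result is 2*(-3) + 3*2 + 5*9 + 8*(-1) = 37.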

Other application scenarios can be considered by those skilled in the art, and the structured sparse circuit can be designed accordingly. For example, it may be necessary to apply the same sparse mask to two data to be convolved, in which case two first structured sparse sub-circuits may be included in the structured sparse circuit for processing.

FIG. 8 illustrates an exemplary operation pipeline for structured sparse processing according to one embodiment of the present disclosure. This pipeline may be used, for example, to implement the aforementioned second structured sparse sub-circuit. The embodiment of FIG. 8 shows structured sparse processing for m = 4 and n = 2, in which the 2 data elements with the larger absolute values are screened out from 4 data elements A, B, C and D. As shown in the figure, this structured sparse processing can be performed by a multi-stage pipelined arithmetic circuit composed of absolute value operators, comparators, and the like.

The first pipeline stage may include m (4) absolute value operators 810 for synchronously performing absolute value operations on the 4 input data elements A, B, C and D, respectively. To facilitate the final output of valid data elements, in some embodiments, the first pipeline stage may output both the original data elements (i.e., A, B, C and D) and the data after the absolute value operation (i.e., | A |, | B |, | C | and | D |).

The second pipeline stage may include a permutation and combination circuit 820 for permutation and combination of the m absolute values to generate m sets of data, wherein each set of data includes the m absolute values, and the m absolute values are different from each other in position in each set of data.

In some embodiments, the permutation combining circuit may be a cyclic shifter that cyclically shifts the permutation of m absolute values (e.g., | A |, | B |, | C |, and | D |) m-1 times, thereby generating m sets of data. For example, in the example shown in the figure, 4 sets of data are generated, respectively: { | A |, | B |, | C |, | D | }, { | B |, | C |, | D |, | A | }, { | C |, | D |, | A |, | B | } and { | D |, | A |, | B |, | C | }. Similarly, each group of data is output, and simultaneously, the corresponding original data element is also output, and each group of data corresponds to one original data element.
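As a rough illustration of the cyclic shifter, the following Python sketch produces the m rotated groups from one list of absolute values. The function name `cyclic_shift_groups` is hypothetical, and a list rotation in software merely stands in for the hardware shifter.

```python
def cyclic_shift_groups(abs_values):
    """Return m rotations of a list of m values (rotation k starts at element k)."""
    m = len(abs_values)
    return [abs_values[k:] + abs_values[:k] for k in range(m)]


groups = cyclic_shift_groups(["|A|", "|B|", "|C|", "|D|"])
for group in groups:
    print(group)
```

The first entry of each group identifies the original data element that the group corresponds to.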

The third pipeline stage includes a comparison circuit 830 for comparing absolute values in the m sets of data and generating a comparison result.

In some embodiments, the third pipeline stage may include m comparison circuits, each comparison circuit including m-1 comparators (831, 832, 833), the m-1 comparators of the i-th comparison circuit being configured to sequentially compare one absolute value of the i-th group of data with the other m-1 absolute values (three in the illustrated example) and generate comparison results, where 1 ≤ i ≤ m.

As can be seen, the third pipeline stage can also be regarded as m-1 (here, 3) sub-pipeline stages. Each sub-pipeline stage comprises m comparators, each comparing its corresponding absolute value with one of the other absolute values. The m-1 sub-pipeline stages thus compare each absolute value with the other m-1 absolute values in turn.

For example, in the example shown in the figure, the 4 comparators 831 in the first sub-pipeline stage are configured to compare the first absolute value with the second absolute value in each of the 4 groups of data, and output comparison results w0, x0, y0, and z0, respectively. The 4 comparators 832 in the second sub-pipeline stage are used to compare the first absolute value with the third absolute value in each of the 4 groups of data, and output comparison results w1, x1, y1, and z1, respectively. The 4 comparators 833 in the third sub-pipeline stage are used to compare the first absolute value with the fourth absolute value in each of the 4 groups of data, and output comparison results w2, x2, y2, and z2, respectively. Thus, a comparison of each absolute value with the other m-1 absolute values can be obtained.

In some embodiments, the comparison results may be represented using a bitmap. For example, at the 1st comparator of the 1st comparison circuit, when |A| ≥ |B|, w0 = 1; at the 2nd comparator of the 1st comparison circuit, when |A| < |C|, w1 = 0; at the 3rd comparator of the 1st comparison circuit, when |A| ≥ |D|, w2 = 1. The output result of the 1st comparison circuit is therefore {A, w0, w1, w2}, which in this case is {A, 1, 0, 1}. Similarly, the output result of the 2nd comparison circuit is {B, x0, x1, x2}, the output result of the 3rd comparison circuit is {C, y0, y1, y2}, and the output result of the 4th comparison circuit is {D, z0, z1, z2}.
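The bitmap generation can be modeled in a few lines of Python. This is a minimal sketch, assuming the "greater than or equal" comparison used in the example above; the function name `comparison_bitmaps` and the sample input values are hypothetical.

```python
def comparison_bitmaps(elements):
    """For each element, compare its magnitude with the other m-1 magnitudes.

    Lane i holds the rotation that starts at element i, so its bits are
    [first >= second, first >= third, ...] within that rotation.
    """
    mags = [abs(e) for e in elements]
    m = len(mags)
    lanes = []
    for i in range(m):
        rotated = mags[i:] + mags[:i]
        bits = [1 if rotated[0] >= other else 0 for other in rotated[1:]]
        lanes.append((elements[i], bits))
    return lanes


# Chosen so that |A| >= |B|, |A| < |C| and |A| >= |D|, as in the {A, 1, 0, 1} example.
lanes = comparison_bitmaps([-5, 3, 8, -5])
for element, bits in lanes:
    print(element, bits)
```

Each lane keeps the original (signed) data element alongside its bits, mirroring the way the pipeline carries the original elements forward for final output.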

The fourth pipeline stage includes a screening circuit 840 for selecting n data elements with larger absolute values from the m data elements as valid data elements according to the comparison results of the third stage, and outputting the valid data elements and corresponding indexes. The index is used to indicate the position of these valid data elements among the input m data elements. For example, when A and C are screened out from the four data elements A, B, C, and D, their corresponding indexes may be 0 and 2.

Based on the comparison results, appropriate logic can be designed to select the n data elements with larger absolute values. Since multiple data elements may have the same absolute value, in a further embodiment, when data elements have the same absolute value, the selection is made in a specified priority order. For example, the priority may be fixed by index from low to high, so that A has the highest priority and D the lowest. In one example, when the absolute values of A, C, and D are all equal and greater than the absolute value of B, the data elements selected are A and C.

From the foregoing comparison results, w0, w1, and w2 indicate how many of |B|, |C|, and |D| are not greater than |A|. If w0, w1, and w2 are all 1, then |A| is greater than or equal to |B|, |C|, and |D|, i.e., it is the maximum of the four, and therefore A is selected. If exactly two of w0, w1, and w2 are 1, then |A| is the second largest of the four absolute values, and therefore A is also selected. Otherwise, A is not selected. Thus, in some embodiments, the selection may be determined by counting the number of 1s among these comparison results.

In one implementation, the valid data elements may be selected based on the following logic. First, the number of times each data element is greater than or equal to the other data elements may be counted. For example, define N_A = sum_w = w0 + w1 + w2, N_B = sum_x = x0 + x1 + x2, N_C = sum_y = y0 + y1 + y2, and N_D = sum_z = z0 + z1 + z2. Subsequently, the judgment and selection are performed under the following conditions.

The condition for selecting A is: N_A = 3; or N_A = 2 and exactly one of N_B, N_C, N_D is 3.

The condition for selecting B is: N_B = 3; or N_B = 2, exactly one of N_A, N_C, N_D is 3, and N_A ≠ 2.

The condition for selecting C is: N_C = 3 and at most one of N_A, N_B is 3; or N_C = 2, exactly one of N_A, N_B, N_D is 3, and neither N_A nor N_B is 2.

The condition for selecting D is: N_D = 3 and at most one of N_A, N_B, N_C is 3; or N_D = 2, exactly one of N_A, N_B, N_C is 3, and none of N_A, N_B, N_C is 2.
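The selection conditions above can be transcribed directly into Python to check how they behave, including in tie cases. This is a sketch rather than the circuit itself: the names `counts`, `threes`, and `select_2_of_4` are hypothetical, and the counts are computed with "greater than or equal" comparisons as in the bitmap example.

```python
def counts(elems):
    """N_A..N_D: how many of the other three magnitudes each element is >= to."""
    mags = [abs(e) for e in elems]
    return [sum(mags[i] >= mags[j] for j in range(4) if j != i) for i in range(4)]


def threes(*ns):
    """Number of counts equal to 3 among the given counts."""
    return sum(n == 3 for n in ns)


def select_2_of_4(elems):
    """Return (valid elements, indexes) following the four selection conditions."""
    na, nb, nc, nd = counts(elems)
    pick_a = na == 3 or (na == 2 and threes(nb, nc, nd) == 1)
    pick_b = nb == 3 or (nb == 2 and threes(na, nc, nd) == 1 and na != 2)
    pick_c = (nc == 3 and threes(na, nb) <= 1) or (
        nc == 2 and threes(na, nb, nd) == 1 and 2 not in (na, nb))
    pick_d = (nd == 3 and threes(na, nb, nc) <= 1) or (
        nd == 2 and threes(na, nb, nc) == 1 and 2 not in (na, nb, nc))
    indexes = [i for i, picked in enumerate([pick_a, pick_b, pick_c, pick_d]) if picked]
    return [elems[i] for i in indexes], indexes


# |A| = |C| = |D| > |B|: the priority order keeps A and C.
print(select_2_of_4([3, 1, -3, 3]))   # ([3, -3], [0, 2])
```

In this model the conditions always yield exactly two elements, and ties resolve toward the lower index, which is consistent with the priority example given earlier.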

Those skilled in the art will appreciate that there is some redundancy in the above logic in order to ensure selection according to the predetermined priority. Based on the magnitude and ordering information provided by the comparison results, one skilled in the art may devise other logic to implement the screening of valid data elements, and the disclosure is not limited in this respect. Thus, the multi-stage pipelined arithmetic circuit of fig. 8 can realize 2-out-of-4 structured sparse processing.

Those skilled in the art will appreciate that other forms of pipelined arithmetic circuits may also be designed to implement structured sparseness, and the present disclosure is not limited in this respect.

The result after the sparsification processing comprises two parts: a data portion and an index portion. The data part comprises data after sparse processing, namely effective data elements extracted according to the screening rule of structured sparse processing. The index portion is used to indicate the original positions of the thinned data, i.e., the effective data elements, in the original data before thinning (i.e., the data to be thinned).

The structured-sparsified data may be represented and/or stored in a variety of forms. In one implementation, the structured-sparsified data may be in the form of a structure. In the structure, the data portion and the index portion are bound to each other. In some embodiments, each bit in the index portion may correspond to one data element. For example, when the data type is fix8, one data element is 8 bits, and each bit in the index portion may correspond to 8 bits of data. In other embodiments, to allow for the subsequent hardware-level implementation when the structure is used, each bit in the index portion of the structure may be set to correspond to N bits of data, N being determined based at least in part on the hardware configuration. For example, each bit of the index portion in the structure may be set to correspond to 4 bits of data. In that case, when the data type is fix8, every 2 bits in the index portion correspond to one data element of the fix8 type. In some embodiments, the data portion of the structure may be aligned according to a first alignment requirement and the index portion of the structure may be aligned according to a second alignment requirement, such that the entire structure also satisfies an alignment requirement. For example, the data portion may be aligned to 64B, the index portion may be aligned to 32B, and the entire structure is then aligned to 96B (64B + 32B). With such alignment requirements, the number of memory accesses can be reduced during subsequent use, and the processing efficiency is improved.
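The arithmetic behind the index granularity and alignment examples can be made concrete with a short sketch. The helper names below are hypothetical, the payload sizes (40 bytes of data, 10 bytes of index) are invented purely for illustration, and padding each portion up to its own alignment is only one possible reading of the alignment example.

```python
def align_up(size_bytes, alignment):
    """Round size_bytes up to the next multiple of alignment."""
    return (size_bytes + alignment - 1) // alignment * alignment


def index_bits_per_element(element_bits, data_bits_per_index_bit):
    """Number of index bits that describe one data element."""
    return element_bits // data_bits_per_index_bit


def structure_size(data_bytes, index_bytes, data_align=64, index_align=32):
    """Total size when each portion is padded to its own alignment."""
    return align_up(data_bytes, data_align) + align_up(index_bytes, index_align)


print(index_bits_per_element(8, 8))   # fix8, one index bit per element -> 1
print(index_bits_per_element(8, 4))   # fix8, one index bit per 4 data bits -> 2
print(structure_size(40, 10))         # hypothetical payload sizes -> 96
```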

By using such a structure, the data part and the index part can be used collectively. Since the proportion of the valid data elements occupying the original data elements in the structured thinning-out process is fixed, for example, n/m, the data size after the thinning-out process is also fixed or predictable. Thus, the structural body can be densely stored in the memory circuit without a performance loss.

In other implementations, the data portion and the index portion resulting from the sparsification may also be represented and/or stored separately for separate use. For example, the index portion of the structured-sparsified second input data may be provided to the first structured sparse sub-circuit 712 to be used as a mask for performing structured sparse processing on the first input data. In this case, to accommodate different data types, each bit in the separately provided index portion may indicate whether one data element is valid.
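A one-bit-per-element index of this kind can be modeled as a packed bit mask. The sketch below is an assumption-laden illustration: the function names are hypothetical, and placing element 0 in the least significant bit is an arbitrary choice that the text does not specify.

```python
def pack_mask(mask_bits):
    """Pack a list of 0/1 validity flags into an integer, element 0 in bit 0."""
    word = 0
    for position, bit in enumerate(mask_bits):
        word |= (bit & 1) << position
    return word


def apply_packed_mask(values, word):
    """Keep the values whose bit is set in the packed mask."""
    return [v for i, v in enumerate(values) if (word >> i) & 1]


mask_word = pack_mask([1, 0, 1, 0])                     # A and C are valid
kept = apply_packed_mask([11, 22, 33, 44], mask_word)   # elements of any data type
print(bin(mask_word), kept)
```

Since each bit marks one element regardless of the element's width, the same mask can be applied to operands of different data types.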

Convolution circuits may use a variety of circuit configurations to implement convolution operations. For example, the convolution circuit may share the same processing circuit for both conventional convolution and structured sparse convolution, or the convolution circuit may be implemented by allocating a separate processing circuit for both conventional convolution and structured sparse convolution operations. The disclosed embodiments are not limited in this respect.

Returning to fig. 6, in some embodiments, the arithmetic circuitry 630 may also include pre-processing circuitry 631 and post-processing circuitry 634. The preprocessing circuit 631 may be configured to preprocess, according to the instruction, data before the structured sparseness circuit 632 and/or the convolution circuit 633 perform the operation; post-processing circuit 634 may be configured to post-process the data operated on by convolution circuit 633.

In some implementations, when the sparse flag bit in the convolution instruction indicates that a structured sparse convolution operation is to be performed, the preprocessing circuitry 631 can read input data from the storage circuitry 620 and output the input data to the structured sparse circuitry 632 at the first rate. And when the sparse flag indicates conventional convolution, the preprocessing circuit 631 can read input data from the storage circuit 620 and output the input data to the convolution circuit 633 at a second rate. The first rate is greater than the second rate, and the proportion thereof is, for example, equal to the thinning proportion in the structured thinning process, for example, m/n. For example, in a 2-out-of-4 structured sparse processing, the first rate is 2 times the second rate. Thus, the first rate is determined based at least in part on the processing power of convolution circuit 633 and the sparseness ratio of the structured sparseness processing.
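The relation between the two rates can be written out as a one-line calculation. This is a minimal sketch: the function name `sparse_feed_rate` is hypothetical, and the figure of 64 elements per cycle for the convolution circuit is an invented example value.

```python
from fractions import Fraction


def sparse_feed_rate(conv_rate, m, n):
    """Input rate needed so that the n-of-m sparsified output matches conv_rate."""
    return conv_rate * Fraction(m, n)


second_rate = 64                                       # hypothetical elements per cycle
first_rate = sparse_feed_rate(second_rate, m=4, n=2)
print(first_rate)                                      # twice the second rate for 2-out-of-4
```

The reasoning is that only n of every m elements survive sparsification, so the sparsifier must be fed m/n times faster to keep the convolution circuit fully supplied.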

In some application scenarios, the aforementioned pre-processing and post-processing may also include, for example, data splitting and/or data splicing operations. For example, post-processing circuit 634 may perform fusion processing, such as addition, subtraction, multiplication, etc., on the output results of the convolution circuit.

As mentioned before, the operands of the convolution instruction may be data in the neural network, such as weights, neurons, etc., i.e. the convolution instruction is used for a structured sparse convolution operation in the neural network. Data in neural networks typically contain multiple dimensions. For example, in a convolutional neural network, data may exist in four dimensions: input channel, output channel, length, and width. In some embodiments, the structured sparse processing in the convolution instruction described above may be performed for at least one dimension of multidimensional data in a neural network. In particular, in one implementation, the convolution instruction may be used for a structured sparse convolution operation in a forward process (e.g., inference, or forward training) of the neural network, where the structured sparse processing is performed for the input channel dimension of the multidimensional data in the neural network. In another implementation, the convolution instruction may be used for a structured sparse convolution operation in a backward process (e.g., backward training) of the neural network, where the structured sparse processing is performed simultaneously for the input channel dimension and the output channel dimension of the multidimensional data in the neural network.

In the context of the present disclosure, the aforementioned convolution instructions may be microinstructions or control signals that are executed within one or more multi-stage operation pipelines, and may include (or otherwise indicate) one or more operations to be performed by the multi-stage operation pipelines.

FIG. 9 illustrates an exemplary flow diagram of a data processing method 900 according to an embodiment of the disclosure.

As shown in fig. 9, in step 910, a convolution instruction is parsed, the convolution instruction including a sparse flag bit for indicating whether to perform a structured sparse convolution operation. This step may be performed, for example, by control circuitry 610 of fig. 6.

Next, in step 920, the corresponding operand is read according to the convolution instruction. This step may be performed, for example, by control circuitry 610 of fig. 6 controlling storage circuitry 620.

Finally, in step 930, the corresponding convolution operation is performed on the read operands according to the convolution instruction. This step may be performed, for example, by the arithmetic circuitry 630 of fig. 6.

Different convolution operations can be performed according to the value of the sparse flag bit in the convolution instruction. For example, when the sparse flag takes the value "0", a conventional convolution operation may be performed. When the sparse flag takes the value "1", a structured sparse convolution operation is performed.

At this time, the convolution operation of the structured sparsity may include: performing structured sparseness processing on at least one input data; and performing convolution operation on the input data after the sparsification.
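The flow of method 900 can be sketched as a small software model. Everything named here is hypothetical: the `ConvInstruction` fields, the dictionary used as storage, and the `execute` function are illustrative stand-ins, the instruction is assumed to be already parsed (step 910), and a dot product stands in for the convolution.

```python
from dataclasses import dataclass


@dataclass
class ConvInstruction:
    """Hypothetical decoded instruction; field names are illustrative only."""
    sparse_flag: int   # 0: conventional convolution, 1: structured sparse convolution
    src0: str          # storage key of the first operand
    src1: str          # storage key of the second operand


def top_n_mask(values, m=4, n=2):
    """Per-element 0/1 mask keeping the n largest magnitudes in each group of m."""
    mask = []
    for start in range(0, len(values), m):
        group = values[start:start + m]
        ranked = sorted(range(len(group)), key=lambda i: (-abs(group[i]), i))
        keep = set(ranked[:n])
        mask.extend(1 if i in keep else 0 for i in range(len(group)))
    return mask


def execute(instr, storage):
    # Step 920: read the operands named by the instruction.
    first, second = storage[instr.src0], storage[instr.src1]
    # Step 930: choose the operation according to the sparse flag bit.
    if instr.sparse_flag == 1:
        mask = top_n_mask(second)
        first = [v for v, bit in zip(first, mask) if bit]
        second = [v for v, bit in zip(second, mask) if bit]
    # A dot product stands in for the convolution itself.
    return sum(a * b for a, b in zip(first, second))


storage = {"neurons": [10, 20, 30, 40], "weights": [1, -7, 3, 0]}
dense = execute(ConvInstruction(0, "neurons", "weights"), storage)
sparse = execute(ConvInstruction(1, "neurons", "weights"), storage)
print(dense, sparse)
```

The same instruction format serves both cases; only the flag value changes which path the operands take before the multiply-accumulate.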

In particular, in some implementations, performing structured sparseness on the at least one input data includes any of: respectively executing structural sparse processing on the first input data and the second input data to be convolved, and outputting the first input data and the second input data after sparse processing to a convolution circuit to execute convolution operation; or taking an index part corresponding to the first or second input data subjected to structured sparse processing as a sparse mask, performing structured sparse processing on the second or first input data, and outputting the first input data subjected to sparse processing to a convolution circuit so as to perform convolution operation with the second input data subjected to structured sparse processing, wherein the index part indicates the position of an effective data element in structured sparse to be performed. Accordingly, structured sparseness processing includes selecting n data elements as valid data elements from every m data elements as indicated by the index portion, where m > n.

The first or second input data subjected to structured sparsification may be previously subjected to structured sparsification and stored in the storage circuit, or the first or second input data subjected to structured sparsification may be directly provided for use in a convolution operation after being subjected to structured sparsification online.

Structured-sparsified data may be provided in a variety of forms. In one implementation, the structured-sparsified data is in the form of a structure; the structure comprises a data portion and an index portion that are bound to each other, the data portion comprises the valid data elements after structured sparse processing, and the index portion indicates the original positions of the sparsified data in the data before sparsification.

Further, before performing structured sparseness, the method may further include: the input data is conveyed at a first rate to perform structured sparsification, wherein the first rate is determined based at least in part on processing power of hardware performing the convolution operation and a sparseness ratio of the structured sparsification.

In some embodiments, the first input data may be neuron data of a neural network, and the second input data may be weight values of convolutional layers in the neural network; and vice versa.

It will be appreciated by a person skilled in the art that the steps described in the method flow chart correspond to the individual circuits of the data processing device described above in connection with fig. 6, and therefore the features described above apply equally to the method steps and are not repeated here.

From the foregoing description, it can be seen that the disclosed embodiments provide a convolution instruction that includes a sparse flag bit to indicate whether to perform a structured sparse convolution operation. By setting the sparse flag bit, the corresponding arithmetic circuit can be configured according to the value of the flag bit to execute the corresponding convolution operation. In some embodiments, when the sparse flag bit indicates that a structured sparse convolution operation is to be performed, the arithmetic circuit may be configured to perform structured sparse processing and then perform convolution on the sparsified data. By reusing the instruction fields of the convolution instruction and adding a structured sparse enable flag bit, the processing can be simplified, and the processing efficiency of the machine is improved.

According to different application scenarios, the electronic device or apparatus of the present disclosure may include a server, a cloud server, a server cluster, a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet computer, a smart terminal, a PC device, a terminal of the internet of things, a mobile terminal, a mobile phone, a vehicle recorder, a navigator, a sensor, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a visual terminal, an autopilot terminal, a vehicle, a household appliance, and/or a medical device. The vehicle comprises an airplane, a ship and/or a vehicle; the household appliances comprise a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove and a range hood; the medical equipment comprises a nuclear magnetic resonance apparatus, a B-ultrasonic apparatus and/or an electrocardiograph. The electronic device or apparatus of the present disclosure may also be applied to the fields of the internet, the internet of things, data centers, energy, transportation, public management, manufacturing, education, power grid, telecommunications, finance, retail, construction site, medical, and the like. Further, the electronic device or apparatus disclosed herein may also be used in application scenarios related to artificial intelligence, big data, and/or cloud computing, such as a cloud end, an edge end, and a terminal. In one or more embodiments, a computationally powerful electronic device or apparatus according to the present disclosure may be applied to a cloud device (e.g., a cloud server), while a less power-consuming electronic device or apparatus may be applied to a terminal device and/or an edge-end device (e.g., a smartphone or a camera). 
In one or more embodiments, the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that appropriate hardware resources can be matched from the hardware resources of the cloud device to simulate the hardware resources of the terminal device and/or the edge device according to the hardware information of the terminal device and/or the edge device, and uniform management, scheduling and cooperative work of end-cloud integration or cloud-edge-end integration can be completed.

It is noted that, for the sake of brevity, the present disclosure describes some methods and embodiments thereof as a series of acts and combinations thereof, but those skilled in the art will appreciate that the aspects of the present disclosure are not limited by the order of the acts described. Accordingly, one of ordinary skill in the art will appreciate that certain steps may be performed in other orders or simultaneously, in accordance with the disclosure or teachings of the present disclosure. Further, those skilled in the art will appreciate that the embodiments described in this disclosure may be regarded as optional embodiments, in that the acts or modules involved are not necessarily required to practice one or more aspects of the disclosure. In addition, depending on the solution, the descriptions of some embodiments in the present disclosure have different emphases. In view of the above, those skilled in the art will understand that portions not described in detail in one embodiment may be found in the description of other embodiments.

In terms of specific implementation, based on the disclosure and teachings of the present disclosure, one skilled in the art will appreciate that the several embodiments disclosed in the present disclosure may also be implemented in other ways not disclosed herein. For example, the units in the foregoing embodiments of the electronic device or apparatus are divided based on logical function, and other divisions are possible in an actual implementation. As another example, multiple units or components may be combined or integrated into another system, or some features or functions of a unit or component may be selectively disabled. In terms of the connections between different units or components, the connections discussed above in connection with the figures may be direct or indirect couplings between the units or components. In some scenarios, the aforementioned direct or indirect coupling involves a communication connection utilizing an interface, where the communication interface may support electrical, optical, acoustic, magnetic, or other forms of signal transmission.

In the present disclosure, units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units. The aforementioned components or units may be co-located or distributed across multiple network elements. In addition, according to actual needs, part or all of the units can be selected to achieve the purpose of the solution of the embodiment of the present disclosure. In addition, in some scenarios, multiple units in embodiments of the present disclosure may be integrated into one unit or each unit may exist physically separately.

In other implementation scenarios, the integrated unit may also be implemented in hardware, that is, a specific hardware circuit, which may include a digital circuit and/or an analog circuit, etc. The physical implementation of the hardware structure of the circuit may include, but is not limited to, physical devices, which may include, but are not limited to, transistors or memristors and like devices. In this regard, the various devices described herein (e.g., computing devices or other processing devices) may be implemented by suitable hardware processors, such as CPUs, GPUs, FPGAs, DSPs, ASICs, and the like. Further, the aforementioned storage unit or storage device may be any suitable storage medium (including magnetic storage medium or magneto-optical storage medium, etc.), and may be, for example, a Resistive Random Access Memory (RRAM), a Dynamic Random Access Memory (DRAM), a Static Random Access Memory (SRAM), an Enhanced Dynamic Random Access Memory (EDRAM), a High Bandwidth Memory (HBM), a Hybrid Memory Cube (HMC), a ROM, a RAM, or the like.

The foregoing may be better understood in light of the following clauses:

Clause 1, a data processing apparatus, comprising:

control circuitry configured to parse a convolution instruction, the convolution instruction including a sparse flag bit to indicate whether to perform a structured sparse convolution operation;

a storage circuit configured to store pre-convolution and/or post-convolution information; and

an arithmetic circuit configured to perform a corresponding convolution operation according to the convolution instruction.

Clause 2, the data processing apparatus of clause 1, wherein the arithmetic circuitry comprises structured sparseness circuitry and convolution circuitry, when the sparseness flag indicates that a structured sparseness convolution operation is to be performed,

the structured sparse circuit is configured to perform structured sparse processing on at least one input data and output the sparsified input data to the convolution circuit; and

the convolution circuit is configured to receive data to be convolved and perform a convolution operation on the data, wherein the data to be convolved at least includes the sparsified input data.

Clause 3, the data processing apparatus of clause 2, wherein the structured sparse circuit comprises:

a first structured sparse sub-circuit configured to perform structured sparse processing on input data according to a specified sparse mask; and/or

a second structured sparse sub-circuit configured to perform structured sparse processing on input data according to a predetermined sparse rule.

Clause 4, the data processing apparatus of clause 3, wherein the structured sparse circuit is further configured to perform any one of:

respectively executing structured sparse processing on the first input data and the second input data to be convolved by utilizing a second structured sparse sub-circuit, and outputting the first input data and the second input data after sparse processing to the convolution circuit so as to execute convolution operation; or

by utilizing a first structured sparse sub-circuit, performing structured sparse processing on the second or first input data by taking an index portion corresponding to the first or second input data subjected to structured sparse processing as a sparse mask, and outputting the first input data subjected to the structured sparse processing to the convolution circuit so as to perform a convolution operation with the second input data subjected to the structured sparse processing, wherein the index portion indicates the positions of valid data elements in the structured sparse processing to be performed.

Clause 5, the data processing apparatus of clause 4, wherein:

the first or second input data subjected to structured thinning processing is previously subjected to structured thinning processing and stored in the storage circuit, or

The structured sparsified first or second input data is generated online by structured sparsifying the first input data with the second structured sparsifying subcircuit.

Clause 6, the data processing apparatus according to any of clauses 4-5, wherein the structured sparsified first or second input data is in the form of a structure comprising a data portion and an index portion bound to each other, the data portion comprising the structured sparsified valid data elements, the index portion indicating a location of the structured sparsified data in the pre-sparsified data.

Clause 7, the data processing apparatus according to any of clauses 3-6, wherein the structured sparsification comprises selecting n data elements from every m data elements as valid data elements, wherein m > n.

Clause 8, the data processing apparatus of any of clauses 3-7, wherein the second structured sparse sub-circuit further comprises: at least one multi-stage pipelined arithmetic circuit including a plurality of operators arranged stage by stage and configured to perform structured thinning-out processing of selecting n data elements having larger absolute values as valid data elements from the m data elements.

Clause 9, the data processing apparatus of clause 8, wherein the multi-stage pipelined arithmetic circuit comprises four pipelined stages, wherein:

the first pipeline stage comprises m absolute value operators for respectively taking absolute values of the m data elements to be sparsified, so as to generate m absolute values;

the second pipeline stage comprises a permutation and combination circuit, which is used for permutation and combination of the m absolute values to generate m groups of data, wherein each group of data comprises the m absolute values and the positions of the m absolute values in each group of data are different from each other;

the third pipeline stage comprises m paths of comparison circuits for comparing absolute values in the m groups of data and generating comparison results; and

the fourth pipeline stage comprises a screening circuit for selecting n data elements with larger absolute values as valid data elements according to the comparison result, and outputting the valid data elements and corresponding indexes, wherein the indexes indicate the positions of the valid data elements in the m data elements.

Clause 10, the data processing apparatus of clause 9, wherein each comparison circuit in the third pipeline stage includes m-1 comparators, and the m-1 comparators in the i-th comparison circuit are configured to sequentially compare one absolute value in the i-th group of data with the other three absolute values and generate a comparison result, where 1 ≤ i ≤ m.

Clause 11, the data processing apparatus according to any of clauses 9-10, wherein the screening circuit is further configured to select according to a specified priority order when there are data elements that are identical in absolute value.
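The four pipeline stages of clauses 9-11 can be illustrated in software. The following is a minimal sketch and not the disclosed circuit: the function name `select_n_of_m`, the use of rotations to form the m groups, and the tie-breaking rule (the lower position wins) are assumptions made for illustration only.

```python
from typing import List, Sequence, Tuple


def select_n_of_m(block: Sequence[float], n: int) -> Tuple[List[float], List[int]]:
    """Keep the n elements of largest absolute value out of the m in `block`."""
    m = len(block)
    if not 0 < n < m:
        raise ValueError("require 0 < n < m")

    # Stage 1: m absolute-value operators.
    abs_vals = [abs(x) for x in block]

    # Stage 2: permutation/combination. Group i is a rotation that places
    # element i first, so every group holds all m values in a different order.
    groups = [[(i + k) % m for k in range(m)] for i in range(m)]

    # Stage 3: m comparison circuits with m-1 comparators each. Circuit i
    # compares its leading element with the other m-1 elements and counts
    # how many of them outrank it.
    # Assumed priority rule for equal absolute values: lower position wins.
    outranked_by = []
    for group in groups:
        lead, others = group[0], group[1:]
        count = 0
        for j in others:
            if abs_vals[j] > abs_vals[lead] or (
                abs_vals[j] == abs_vals[lead] and j < lead
            ):
                count += 1
        outranked_by.append(count)

    # Stage 4: screening. An element is valid if fewer than n others outrank it.
    indexes = [i for i in range(m) if outranked_by[i] < n]
    values = [block[i] for i in indexes]
    return values, indexes
```

Because the assumed priority rule makes the ranking strict, exactly n elements are selected for every input block, including blocks that contain equal absolute values.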

Clause 12, the data processing apparatus according to any of clauses 2-11, wherein the arithmetic circuitry further comprises preprocessing circuitry, and when the sparse flag bit indicates that a structured sparse convolution operation is to be performed,

the preprocessing circuitry reads input data from the storage circuit and outputs the input data to the structured sparse circuit at a first rate, wherein the first rate is based at least in part on the processing capability of the convolution circuit and the sparsity ratio of the structured sparse processing.
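One plausible reading of the first rate, stated here as an assumption rather than as the disclosed rule, is that the pre-sparsification stream must arrive m/n times faster than the convolution circuit consumes valid elements, so that the convolution circuit is never starved. The function name and the unit (elements per cycle) are hypothetical.

```python
from fractions import Fraction


def first_rate(conv_elements_per_cycle: int, n: int, m: int) -> Fraction:
    """Input rate needed to keep the convolution circuit fully fed.

    Assumption: the convolution circuit consumes `conv_elements_per_cycle`
    valid elements per cycle, and n of every m input elements survive
    the structured sparse processing.
    """
    if not 0 < n < m:
        raise ValueError("require 0 < n < m")
    return Fraction(conv_elements_per_cycle * m, n)
```

Under this reading, a convolution circuit that consumes 64 elements per cycle with 2-of-4 sparsity would need 128 input elements per cycle.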

Clause 13, the data processing apparatus according to any of clauses 2-12, wherein the input data comprises neuron data and weights for a neural network.

Clause 14, the data processing apparatus of any of clauses 1-13, wherein the convolution instruction is used for a structured sparse convolution operation in a neural network, and the structured sparse processing is performed along at least one dimension of multidimensional data in the neural network.

Clause 15, the data processing apparatus of clause 14, wherein:

the at least one dimension is selected from an input channel dimension and an output channel dimension.

Clause 16, a chip comprising the data processing apparatus of any of clauses 1-15.

Clause 17, a board comprising the chip of clause 16.

Clause 18, a data processing method, comprising:

parsing a convolution instruction, the convolution instruction including a sparse flag bit to indicate whether to perform a structured sparse convolution operation;

reading corresponding operands according to the convolution instruction; and

executing a corresponding convolution operation on the operands according to the convolution instruction.

Clause 19, the data processing method of clause 18, wherein, when the sparse flag bit indicates that a structured sparse convolution operation is to be performed, the method further comprises:

performing structured sparse processing on at least one input data by using a structured sparse circuit to obtain sparse input data; and

performing a convolution operation on data to be convolved by using a convolution circuit, wherein the data to be convolved comprises at least the sparsified input data.
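The flow of clauses 18 and 19 (parse the instruction, read the operands, then sparsify and convolve when the flag is set) can be sketched as follows. The instruction layout is entirely hypothetical: bit 0 is taken as the sparse flag bit, bits 1-8 as the neuron address and bits 9-16 as the weight address. A single dot product stands in for the convolution, and 2-of-4 magnitude-based masking of the weights stands in for the structured sparse circuit.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ConvInstruction:
    sparse: bool       # sparse flag bit (hypothetical position: bit 0)
    neuron_addr: int   # hypothetical operand address field, bits 1-8
    weight_addr: int   # hypothetical operand address field, bits 9-16


def parse(word: int) -> ConvInstruction:
    """Decode the hypothetical instruction word."""
    return ConvInstruction(
        sparse=bool(word & 1),
        neuron_addr=(word >> 1) & 0xFF,
        weight_addr=(word >> 9) & 0xFF,
    )


def sparsify_2_of_4(weights: List[float]) -> List[float]:
    """Zero all but the 2 largest-magnitude values in each block of 4."""
    out: List[float] = []
    for start in range(0, len(weights), 4):
        block = weights[start:start + 4]
        # Ties are resolved in favour of the lower position (assumption).
        order = sorted(range(len(block)), key=lambda i: (-abs(block[i]), i))
        keep = set(order[:2])
        out.extend(w if i in keep else 0.0 for i, w in enumerate(block))
    return out


def execute(word: int, memory: Dict[int, List[float]]) -> float:
    inst = parse(word)                      # parse the instruction
    neurons = memory[inst.neuron_addr]      # read the operands
    weights = memory[inst.weight_addr]
    if inst.sparse:                         # structured sparse path
        weights = sparsify_2_of_4(weights)
    # One output point of a convolution: a multiply-accumulate.
    return sum(x * w for x, w in zip(neurons, weights))
```

The same instruction word with the flag cleared runs the dense path, which shows that a single flag bit is enough to switch between the two operations.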

Clause 20, the data processing method of clause 19, wherein the structured sparsification process comprises:

performing structured sparse processing on input data according to a specified sparse mask by using a first structured sparse sub-circuit; and/or

performing structured sparse processing on the input data according to a preset sparse rule by using a second structured sparse sub-circuit.

Clause 21, the data processing method of clause 20, wherein the structured sparsification process further comprises any one of:

Clause 22, the data processing method of clause 21, wherein:

the structured-sparsified first or second input data has been subjected to structured sparse processing in advance and stored in a storage circuit, or

Clause 23, the data processing method according to any of clauses 20-21, wherein the structured sparsified first or second input data is in the form of a structure comprising a data portion and an index portion bound to each other, the data portion comprising the structured sparsified valid data elements, the index portion indicating a location of the structured sparsified data in the pre-sparsified data.
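The bound data portion and index portion of clause 23 can be pictured with a small software model. This is a sketch under stated assumptions: the names `SparseBlock`, `pack` and `unpack` are hypothetical, the index is stored as positions within each group of m elements, and valid elements are chosen by magnitude with ties going to the lower position.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class SparseBlock:
    data: List[float]   # data portion: the n valid data elements
    index: List[int]    # index portion: their positions among the m originals


def pack(dense: Sequence[float], n: int, m: int) -> List[SparseBlock]:
    """Split `dense` into groups of m and keep n elements from each group."""
    if not 0 < n < m or len(dense) % m != 0:
        raise ValueError("require 0 < n < m and len(dense) divisible by m")
    blocks = []
    for start in range(0, len(dense), m):
        group = dense[start:start + m]
        # Rank by descending magnitude; ties go to the lower position.
        ranked = sorted(range(m), key=lambda i: (-abs(group[i]), i))
        keep = sorted(ranked[:n])
        blocks.append(SparseBlock([group[i] for i in keep], keep))
    return blocks


def unpack(blocks: Sequence[SparseBlock], m: int) -> List[float]:
    """Rebuild the dense layout, writing zeros where elements were dropped."""
    dense: List[float] = []
    for block in blocks:
        group = [0.0] * m
        for value, pos in zip(block.data, block.index):
            group[pos] = value
        dense.extend(group)
    return dense
```

Keeping the two portions in one structure means that a consumer can always recover where each valid element came from without a separate lookup.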

Clause 24, the data processing method according to any of clauses 20-23, wherein the structured sparse processing comprises selecting n data elements from every m data elements as valid data elements, wherein m > n.

Clause 25, the data processing method of any of clauses 20-24, wherein the second structured sparse sub-circuit further comprises: at least one multi-stage pipelined arithmetic circuit including a plurality of operators arranged stage by stage and configured to perform structured sparse processing that selects, from the m data elements, the n data elements having larger absolute values as valid data elements.

Clause 26, the data processing method of clause 25, wherein the multi-stage pipelined arithmetic circuit comprises four pipelined stages, wherein:

the second pipeline stage comprises a permutation and combination circuit, wherein the permutation and combination circuit is used for permutation and combination of the m absolute values to generate m groups of data, each group of data comprises the m absolute values, and the positions of the m absolute values in the groups of data are different from each other;

Clause 27, the data processing method according to clause 26, wherein each of the comparison circuits in the third pipeline stage includes m-1 comparators, and the m-1 comparators in the i-th comparison circuit are configured to sequentially compare one absolute value in the i-th group of data with the other three absolute values and generate a comparison result, where 1 ≤ i ≤ m.

Clause 28, the data processing method of any of clauses 26-27, wherein the screening circuit is further configured to select according to a specified priority order when there are data elements that are identical in absolute value.

Clause 29, the data processing method according to any of clauses 19-28, further comprising:

conveying the input data at a first rate for the structured sparse processing, wherein the first rate is based at least in part on the processing capability of the hardware performing the convolution operation and the sparsity ratio of the structured sparse processing.

Clause 30, the data processing method according to any of clauses 19-29, wherein the input data comprises neuron data and weights of a neural network.

Clause 31, the data processing method of any of clauses 18-30, wherein the convolution instruction is used for a structured sparse convolution operation in a neural network, and the structured sparse processing is performed along at least one dimension of multidimensional data in the neural network.

Clause 32, the data processing method of clause 31, wherein:

The foregoing detailed description of the embodiments of the present disclosure has been presented for purposes of illustration and description; it is exemplary only and is not intended to be exhaustive or to limit the disclosure to the precise forms disclosed. A person skilled in the art may, following the idea of the present disclosure, vary the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as limiting the present disclosure.

Claims

1. A data processing apparatus comprising:

2. The data processing apparatus according to claim 1, wherein the arithmetic circuitry comprises a structured sparse circuit and a convolution circuit, and when the sparse flag bit indicates that a structured sparse convolution operation is to be performed,

the structured sparse circuit is configured to perform structured sparse processing on at least one input data and output the sparsified input data to the convolution circuit; and

3. The data processing device of claim 2, wherein the structured sparse circuit comprises:

a second structured sparse sub-circuit configured to perform structured sparse processing on the input data according to a preset sparse rule.

4. The data processing apparatus of claim 3, wherein the structured sparse circuit is further configured to perform any of:

5. The data processing apparatus of claim 4, wherein:

6. The data processing apparatus according to any of claims 4 to 5, wherein the structured sparsified first or second input data is in the form of a structure comprising a data portion and an index portion bound to each other, the data portion comprising the structured sparsified valid data elements, the index portion indicating a location of the structured sparsified data in the pre-sparsified data.

7. The data processing apparatus according to any of claims 3-6, wherein the structured sparse processing comprises selecting n data elements from every m data elements as valid data elements, wherein m > n.

8. The data processing apparatus according to any of claims 3-7, wherein the second structured sparse sub-circuit further comprises: at least one multi-stage pipelined arithmetic circuit including a plurality of operators arranged stage by stage and configured to perform structured sparse processing that selects, from the m data elements, the n data elements having larger absolute values as valid data elements.

9. The data processing apparatus of claim 8, wherein the multi-stage pipelined arithmetic circuit comprises four pipelined stages, wherein:

10. The data processing apparatus according to claim 9, wherein each of the comparison circuits in the third pipeline stage comprises m-1 comparators, and the m-1 comparators in the i-th comparison circuit are configured to sequentially compare one absolute value in the i-th group of data with the other three absolute values and generate a comparison result, where 1 ≤ i ≤ m.

11. The data processing apparatus according to any of claims 9 to 10, wherein the screening circuit is further configured to select according to a specified priority order when there are data elements that are identical in absolute value.

12. The data processing apparatus according to any of claims 2-11, wherein said arithmetic circuitry further comprises preprocessing circuitry, when said sparse flag indicates that a structured sparse convolution operation is to be performed,

13. The data processing apparatus according to any one of claims 2 to 12, wherein the input data comprises neuron data and weights of a neural network.

14. The data processing apparatus according to any of claims 1-13, wherein the convolution instruction is used for a structured sparse convolution operation in a neural network, and the structured sparse processing is performed along at least one dimension of multidimensional data in the neural network.

15. The data processing apparatus of claim 14, wherein:

16. A chip comprising a data processing device according to any one of claims 1 to 15.

17. A board comprising the chip of claim 16.

18. A method of data processing, comprising:

reading corresponding operands according to the convolution instruction; and

19. The data processing method of claim 18, wherein, when the sparse flag bit indicates that a structured sparse convolution operation is to be performed, the method further comprises:

20. The data processing method of claim 19, wherein the structured sparse processing comprises:

21. The data processing method of claim 20, wherein the structured sparse processing further comprises any one of:

22. The data processing method of claim 21, wherein:

the structured-sparsified first or second input data has been subjected to structured sparse processing in advance and stored in a storage circuit, or

23. The data processing method according to any one of claims 20 to 21, wherein the structured sparsified first or second input data is in the form of a structure comprising a data portion and an index portion bound to each other, the data portion comprising the structured sparsified valid data elements, the index portion indicating a location of the structured sparsified data in the pre-sparsified data.

24. The data processing method of any of claims 20 to 23, wherein the structured sparse processing comprises selecting n data elements from every m data elements as valid data elements, wherein m > n.

25. The data processing method of any of claims 20-24, wherein the second structured sparse sub-circuit further comprises: at least one multi-stage pipelined arithmetic circuit including a plurality of operators arranged stage by stage and configured to perform structured sparse processing that selects, from the m data elements, the n data elements having larger absolute values as valid data elements.

26. The data processing method of claim 25, wherein the multi-stage pipelined arithmetic circuit comprises four pipelined stages, wherein:

27. The data processing method of claim 26, wherein each of the comparison circuits in the third pipeline stage comprises m-1 comparators, and the m-1 comparators in the i-th comparison circuit are configured to sequentially compare one absolute value in the i-th group of data with the other three absolute values and generate a comparison result, where 1 ≤ i ≤ m.

28. The data processing method according to any of claims 26 to 27, wherein the screening circuit is further configured to select according to a specified priority order when there are data elements that are identical in absolute value.

29. The data processing method according to any of claims 19 to 28, the method further comprising:

30. The data processing method according to any of claims 19 to 29, wherein the input data comprises neuron data and weights of a neural network.

31. The data processing method of any of claims 18 to 30, wherein the convolution instruction is used for a structured sparse convolution operation in a neural network, and the structured sparse processing is performed along at least one dimension of multidimensional data in the neural network.

32. The data processing method of claim 31, wherein: