CN114692845A - Data processing device, data processing method and related product - Google Patents
- Publication number
- CN114692845A (application number CN202011566138.XA)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
- G06F9/30021—Compare instructions, e.g. Greater-Than, Equal-To, MINMAX
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06N20/00—Machine learning
- G06N3/045—Combinations of networks
- G06N3/084—Backpropagation, e.g. using gradient descent
Abstract
The present disclosure discloses a data processing apparatus, a data processing method, and related products. The data processing apparatus may be implemented as a computing apparatus included in a combined processing apparatus, which may also include an interface apparatus and other processing apparatus. The computing apparatus interacts with the other processing apparatus to jointly complete computing operations specified by a user. The combined processing apparatus may further comprise a storage apparatus connected to the computing apparatus and the other processing apparatus, respectively, for storing data of both. The disclosed solution provides dedicated instructions for operations related to structured sparsity, which can simplify processing and improve the processing efficiency of the machine.
Description
Technical Field
The present disclosure relates generally to the field of processors. More particularly, the present disclosure relates to a data processing apparatus, a data processing method, a chip, and a board.
Background
In recent years, with the rapid development of deep learning, algorithm performance in fields such as computer vision and natural language processing has improved by leaps and bounds. However, deep learning algorithms are both computation-intensive and storage-intensive. As information processing tasks grow more complex and requirements on real-time performance and accuracy rise, neural networks are designed ever deeper, increasing their demands on computation and storage space. As a result, existing deep-learning-based artificial intelligence technology is difficult to apply directly to mobile phones, satellites, or embedded devices with limited hardware resources.
Compression, acceleration, and optimization of deep neural network models have therefore become very important. A large body of research attempts to reduce the computation and storage requirements of neural networks without affecting model accuracy, which is of great significance for engineering applications of deep learning technology on embedded and mobile devices. Sparsification is one such model-lightweighting method.
Network parameter sparsification reduces redundant components in a large network by appropriate methods, thereby reducing the network's demands on computation and storage space. However, existing hardware and/or instruction sets do not support sparsification efficiently.
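As a concrete illustration of the idea (a minimal plain-Python sketch only, not the hardware mechanism of the disclosure), 2-of-4 structured sparsification keeps the two largest-magnitude elements in every group of four, halving the number of stored values while recording a mask from which the dense layout can be recovered:

```python
def sparsify_2_of_4(weights):
    """Keep the 2 largest-magnitude elements in every group of 4 (2:4
    structured sparsity); return compacted values plus a boolean mask
    from which the dense layout can be recovered."""
    assert len(weights) % 4 == 0
    values, mask = [], []
    for g in range(0, len(weights), 4):
        group = weights[g:g + 4]
        # indices of the two largest |w| in this group, in original order
        keep = sorted(sorted(range(4), key=lambda i: -abs(group[i]))[:2])
        mask.extend(i in keep for i in range(4))
        values.extend(group[i] for i in keep)  # 50% of original storage
    return values, mask
```

Here the mask costs 1 bit per original element, so the storage saving approaches 50% for wide data types such as FP16 or FP32 weights.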
Disclosure of Invention
In order to at least partially solve one or more technical problems mentioned in the background, the present disclosure provides a data processing apparatus, a data processing method, a chip, and a board.
In a first aspect, the present disclosure discloses a data processing apparatus comprising: control circuitry configured to parse a sparse instruction, the sparse instruction indicating an operation related to structured sparsity; a storage circuit configured to store pre-sparsification and/or post-sparsification information; and an arithmetic circuit configured to perform a corresponding operation according to the sparse instruction.
In a second aspect, the present disclosure provides a chip comprising the data processing apparatus of any of the embodiments of the first aspect.
In a third aspect, the present disclosure provides a board comprising the chip of any of the embodiments of the second aspect.
In a fourth aspect, the present disclosure provides a data processing method, the method comprising: parsing a sparse instruction, the sparse instruction indicating an operation related to structured sparsity; reading corresponding operands according to the sparse instruction; performing the structured sparsity-related operation on the operands; and outputting the operation result.
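The four steps of the method can be sketched in software as follows. This is an illustrative model only: the instruction fields, the mode name "sparsify", and the dictionary-based storage are assumptions made for exposition, not taken from the disclosure.

```python
from dataclasses import dataclass

@dataclass
class SparseInstruction:
    # All field names are hypothetical.
    mode: str   # operating mode, e.g. "sparsify"
    src: str    # operand address in the storage circuit
    dst: str    # destination address for the result

def execute(inst, storage):
    """Parse the instruction, read operands, operate, and output the result."""
    # Step 1: parse the sparse instruction (here already decoded into fields)
    mode, src, dst = inst.mode, inst.src, inst.dst
    # Step 2: read the corresponding operands
    operands = storage[src]
    # Step 3: perform the structured-sparsity-related operation
    if mode == "sparsify":  # keep 2 of every 4 elements by magnitude
        result = []
        for g in range(0, len(operands), 4):
            group = operands[g:g + 4]
            keep = sorted(sorted(range(4), key=lambda i: -abs(group[i]))[:2])
            result.extend(group[i] for i in keep)
    else:
        raise NotImplementedError(mode)
    # Step 4: output the operation result
    storage[dst] = result
    return result
```

In hardware these four steps map onto the control circuit (parse), storage circuit (read/write), and arithmetic circuit (operate) described in the first aspect.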
With the data processing apparatus, data processing method, chip, and board provided above, embodiments of the present disclosure provide a sparse instruction for performing operations related to structured sparsity. In some embodiments, the sparse instruction may include an operating-mode bit that indicates which of several operating modes the instruction runs in, and hence which operation it performs. In other embodiments, multiple sparse instructions may be provided, each corresponding to one or more operating modes, to perform various operations related to structured sparsity. By providing dedicated sparse instructions for operations related to structured sparsity, processing can be simplified, thereby increasing the processing efficiency of the machine.
Drawings
The above and other objects, features, and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings, in which like reference numerals refer to similar or corresponding parts:
fig. 1 is a block diagram illustrating a board card according to an embodiment of the present disclosure;
FIG. 2 is a block diagram illustrating a combined processing device according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram illustrating an internal architecture of a single core computing device according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram illustrating the internal architecture of a multi-core computing device according to an embodiment of the disclosure;
FIG. 5 is an internal block diagram illustrating a processor core of an embodiment of the disclosure;
FIG. 6 is a schematic diagram showing the structure of a data processing apparatus of an embodiment of the present disclosure;
FIG. 7A is a diagram illustrating an exemplary pipelined arithmetic circuit for structured sparsification according to an embodiment of the present disclosure;
FIG. 7B is a diagram illustrating an exemplary pipelined arithmetic circuit for structured sparsification according to another embodiment of the present disclosure; and
fig. 8 is an exemplary flowchart illustrating a data processing method of an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be described clearly and completely below with reference to the accompanying drawings. It is to be understood that the described embodiments are only some, not all, embodiments of the present disclosure. All other embodiments that can be derived by one skilled in the art from the embodiments disclosed herein without creative effort shall fall within the scope of protection of the present disclosure.
It should be understood that the terms "first," "second," "third," and "fourth," etc. in the claims, description, and drawings of the present disclosure are used to distinguish between different objects and are not used to describe a particular order. The terms "comprises" and "comprising," when used in the specification and claims of this disclosure, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the disclosure is for the purpose of describing particular embodiments only, and is not intended to be limiting of the disclosure. As used in the specification and claims of this disclosure, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the specification and claims of this disclosure refers to any and all possible combinations of one or more of the associated listed items and includes such combinations.
As used in this specification and claims, the term "if" may be interpreted contextually as "when", "upon" or "in response to a determination" or "in response to a detection".
Specific embodiments of the present disclosure are described in detail below with reference to the accompanying drawings.
Fig. 1 shows a schematic structural diagram of a board 10 according to an embodiment of the disclosure. As shown in fig. 1, the board 10 includes a chip 101, which is a system-on-chip (SoC) integrating one or more combined processing devices. The combined processing device is an artificial intelligence arithmetic unit that supports various deep learning and machine learning algorithms and meets intelligent processing requirements in complex scenarios in fields such as computer vision, speech, natural language processing, and data mining. Deep learning technology in particular is widely applied in cloud intelligence, a notable characteristic of which is a large input data volume and high demands on the storage capacity and computing capability of the platform.
The chip 101 is connected to an external device 103 through an external interface device 102. The external device 103 is, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card, a wifi interface, or the like. The data to be processed may be transferred by the external device 103 to the chip 101 through the external interface device 102. The calculation result of the chip 101 may be transmitted back to the external device 103 via the external interface device 102. The external interface device 102 may have different interface forms, such as a PCIe interface, according to different application scenarios.
The board 10 also includes a storage device 104 comprising one or more storage units 105 for storing data. The storage device 104 is connected to, and exchanges data with, the control device 106 and the chip 101 via a bus. The control device 106 on the board 10 is configured to regulate the state of the chip 101. To this end, in one application scenario, the control device 106 may include a microcontroller unit (MCU).
Fig. 2 is a structural diagram showing a combined processing device in the chip 101 of this embodiment. As shown in fig. 2, the combination processing device 20 includes a computing device 201, an interface device 202, a processing device 203, and a storage device 204.
The computing device 201 is configured to perform user-specified operations, mainly implemented as a single-core smart processor or a multi-core smart processor, to perform deep learning or machine learning computations, which may interact with the processing device 203 through the interface device 202 to collectively perform the user-specified operations.
The interface device 202 is used for transmitting data and control instructions between the computing device 201 and the processing device 203. For example, the computing device 201 may obtain input data from the processing device 203 via the interface device 202, and write to a storage device on the computing device 201. Further, the computing device 201 may obtain the control instruction from the processing device 203 via the interface device 202, and write the control instruction into a control cache on the computing device 201. Alternatively or optionally, the interface device 202 may also read data from a storage device of the computing device 201 and transmit the data to the processing device 203.
The processing device 203, as a general-purpose processing device, performs basic control including, but not limited to, data transfer and starting and/or stopping of the computing device 201. Depending on the implementation, the processing device 203 may be one or more types of central processing unit (CPU), graphics processing unit (GPU), or other general-purpose and/or special-purpose processor, including but not limited to a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components, and their number may be determined according to actual needs. As previously mentioned, the computing device 201 of the present disclosure may be viewed on its own as having a single-core structure or a homogeneous multi-core structure. However, when the computing device 201 and the processing device 203 are considered together in integration, the two form a heterogeneous multi-core structure.
The storage device 204 is used to store data to be processed. It may be a DRAM with a DDR interface, typically 16 GB or larger in size, and stores data of the computing device 201 and/or the processing device 203.
Fig. 3 shows an internal structure diagram of the computing device 201 as a single core. The single-core computing device 301 is used to process input data in fields such as computer vision, speech, natural language, and data mining, and includes three modules: a control module 31, an operation module 32, and a storage module 33.
The control module 31 is used for coordinating and controlling the operations of the operation module 32 and the storage module 33 to complete the task of deep learning, and includes an Instruction Fetch Unit (IFU) 311 and an Instruction Decode Unit (IDU) 312. The instruction fetch unit 311 is used for obtaining an instruction from the processing device 203, and the instruction decode unit 312 decodes the obtained instruction and sends the decoded result to the operation module 32 and the storage module 33 as control information.
The operation module 32 includes a vector operation unit 321 and a matrix operation unit 322. The vector operation unit 321 is used for performing vector operations, and can support complex operations such as vector multiplication, addition, nonlinear transformation, and the like; the matrix operation unit 322 is responsible for the core calculation of the deep learning algorithm, i.e., matrix multiplication and convolution.
The storage module 33 is used to store or transport related data, and includes a neuron storage unit (neuron RAM, NRAM)331, a parameter storage unit (weight RAM, WRAM)332, and a Direct Memory Access (DMA) 333. NRAM 331 is used to store input neurons, output neurons, and intermediate results after computation; the WRAM 332 is used for storing a convolution kernel, namely a weight, of the deep learning network; the DMA 333 is connected to the DRAM204 via the bus 34, and is responsible for data transfer between the single-core computing device 301 and the DRAM 204.
Fig. 4 shows a schematic diagram of the internal structure of the computing apparatus 201 with multiple cores. The multi-core computing device 41 adopts a hierarchical design: as a system-on-chip (SoC) it includes at least one cluster, and each cluster includes a plurality of processor cores. In other words, the multi-core computing device 41 is organized in a SoC-cluster-processor core hierarchy.
In a system-on-chip hierarchy, as shown in FIG. 4, the multi-core computing device 41 includes an external storage controller 401, a peripheral communication module 402, an on-chip interconnect module 403, a synchronization module 404, and a plurality of clusters 405.
There may be multiple external memory controllers 401 (2 are shown by way of example in the figure), which access an external memory device, such as the DRAM 204 in fig. 2, in response to access requests issued by the processor cores, to read data from or write data to off-chip memory. The peripheral communication module 402 receives control signals from the processing device 203 through the interface device 202 and starts the computing device 201 to execute tasks. The on-chip interconnect module 403 connects the external memory controllers 401, the peripheral communication module 402, and the plurality of clusters 405, and transmits data and control signals between the modules. The synchronization module 404 is a global synchronization barrier controller (GBC) that coordinates the work progress of the clusters and ensures synchronization of information. The plurality of clusters 405 are the computing cores of the multi-core computing device 41; 4 are shown by way of example in the figure, and as hardware evolves, the multi-core computing device 41 of the present disclosure may include 8, 16, 64, or even more clusters 405. The clusters 405 are used to efficiently execute deep learning algorithms.
Viewed at the cluster level, as shown in FIG. 4, each cluster 405 includes a plurality of processor cores (IPU core)406 and a memory core (MEM core) 407.
Four processor cores 406 are shown by way of example in the figure; the present disclosure does not limit their number. The internal architecture of a processor core is shown in fig. 5. Each processor core 406 is similar to the single-core computing device 301 of fig. 3 and likewise includes three modules: a control module 51, an operation module 52, and a storage module 53. The functions and structures of these modules are substantially the same as those of the control module 31, operation module 32, and storage module 33, and are not described again. It should be particularly noted that the storage module 53 includes an input/output direct memory access (IODMA) module 533 and a move direct memory access (MVDMA) module 534. The IODMA 533 controls access between the NRAM 531/WRAM 532 and the DRAM 204 through the broadcast bus 409; the MVDMA 534 controls access between the NRAM 531/WRAM 532 and the shared memory (SRAM) 408.
Returning to FIG. 4, the memory core 407 is primarily used for storage and communication, i.e., storing data shared among the processor cores 406 and intermediate results, as well as performing communication between the cluster 405 and the DRAM 204, communication among clusters 405, communication among processor cores 406, and the like. In other embodiments, the memory core 407 has scalar operation capability for performing scalar operations.
The memory core 407 includes an SRAM 408, a broadcast bus 409, a cluster direct memory access (CDMA) module 410, and a global direct memory access (GDMA) module 411. The SRAM 408 serves as a high-performance data transfer station: data reused by different processor cores 406 in the same cluster 405 need not be fetched from the DRAM 204 by each processor core 406 individually, but is instead relayed among the processor cores 406 through the SRAM 408. The memory core 407 only needs to distribute the reused data from the SRAM 408 to the processor cores 406 quickly, which improves inter-core communication efficiency and greatly reduces on-chip/off-chip input/output accesses.
The broadcast bus 409 is used to complete high-speed communication among the processor cores 406 in the cluster 405, and the broadcast bus 409 of this embodiment supports inter-core communication modes including unicast, multicast and broadcast. Unicast refers to point-to-point (e.g., from a single processor core to a single processor core) data transfer, multicast is a communication that transfers a copy of data from SRAM 408 to a particular number of processor cores 406, and broadcast is a communication that transfers a copy of data from SRAM 408 to all processor cores 406, which is a special case of multicast.
In other embodiments, the functionality of GDMA 411 and the functionality of IODMA 533 may be integrated in the same component. For convenience of description, the GDMA 411 and the IODMA 533 are considered as different components, and it is within the scope of the disclosure for those skilled in the art to achieve the same functions and achieve the same technical effects as the present disclosure. Further, the functions of GDMA 411, IODMA 533, CDMA 410 and MVDMA 534 may be implemented by the same component.
One embodiment of the present disclosure provides a data processing scheme based on the foregoing hardware environment to perform structured sparsity-related operations according to specialized sparsity instructions.
Fig. 6 shows a block diagram of a data processing apparatus 600 according to an embodiment of the present disclosure. The data processing apparatus 600 may be implemented, for example, in the computing apparatus 201 of fig. 2. As shown, the data processing apparatus 600 may include a control circuit 610, a memory circuit 620, and an arithmetic circuit 630.
The control circuit 610 may function similarly to the control module 31 of fig. 3 or the control module 51 of fig. 5, and may include, for example, an instruction fetch unit to fetch an instruction from, for example, the processing device 203 of fig. 2, and an instruction decode unit to decode the fetched instruction and send the decoded result as control information to the operation circuit 630 and the storage circuit 620.
In one embodiment, control circuitry 610 may be configured to parse sparse instructions, where the sparse instructions indicate operations related to structured sparsity.
The storage circuit 620 may be configured to store pre-sparsification and/or post-sparsification information. In one embodiment, the operands of the sparse instructions are data in a neural network, such as weights or neurons. In this embodiment, the storage circuit may be, for example, the WRAM 332 and NRAM 331 of fig. 3 or the WRAM 532 and NRAM 531 of fig. 5.
The arithmetic circuitry 630 may be configured to perform corresponding operations in accordance with the sparse instruction.
In some embodiments, the arithmetic circuitry 630 may include one or more sets of pipelined arithmetic circuitry 631, wherein each set of pipelined arithmetic circuitry 631 may include one or more operators. When each set of pipelined arithmetic circuits includes a plurality of operators, the plurality of operators may be configured to perform a multi-stage pipelined arithmetic, i.e., to form a multi-stage arithmetic pipeline.
In some application scenarios, the pipelined arithmetic circuits of the present disclosure may support operations related to structured sparsity. For example, when performing structured sparsification, a multi-stage pipelined arithmetic circuit built from circuits such as comparators may be employed to extract n data elements from every m data elements as valid data elements, where m > n. In one implementation, m is 4 and n is 2. In other implementations, n may take other values, such as 1 or 3.
In one embodiment, the arithmetic circuit 630 may further include an arithmetic processing circuit 632, which may be configured, according to an operation instruction, to pre-process data before the pipelined arithmetic circuit 631 performs an operation or to post-process data after the operation. In some application scenarios, the pre-processing and post-processing may include, for example, data splitting and/or data splicing operations. For structured sparsification, the arithmetic processing circuit may segment the data to be sparsified into groups of m data elements and then send the groups to the pipelined arithmetic circuit 631 for processing.
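The pre-processing step of segmenting data into groups of m elements can be sketched as follows. Zero-padding of a trailing partial group is an assumption made here for illustration; the disclosure does not specify a padding rule.

```python
def split_into_groups(data, m=4):
    """Pre-processing sketch: segment the data to be sparsified into
    groups of m elements before handing each group to the pipelined
    circuit. The last group is zero-padded if the length is not a
    multiple of m (an illustrative assumption)."""
    padded = list(data) + [0] * (-len(data) % m)
    return [padded[i:i + m] for i in range(0, len(padded), m)]
```

Each resulting group is then an independent unit of work for the pipeline, which is what allows multiple groups to be processed concurrently in hardware.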
FIG. 7A illustrates an exemplary operation pipeline for structured sparsification according to one embodiment of the present disclosure. The embodiment of fig. 7A shows a structured sparsification process that, with m = 4 and n = 2, screens out the 2 data elements with the larger absolute values from the 4 data elements A, B, C, and D.
As shown in fig. 7A, this structured sparsification can be performed by a 4-stage pipelined arithmetic circuit comprising absolute-value operators and comparators.
The first stage pipelined arithmetic circuitry may include 4 absolute value operators 710 for synchronously performing absolute value operations on the 4 input data elements A, B, C and D, respectively.
The second stage pipelined arithmetic circuit may comprise two comparators for performing a grouped comparison of the 4 absolute values output by the previous stage. For example, the first comparator 721 may compare the absolute values of data elements A and B and output the larger value Max00, and the second comparator 722 may compare the absolute values of data elements C and D and output the larger value Max10.
The third stage pipeline operation circuit may include a third comparator 730 for comparing the 2 larger values Max00 and Max10 output by the previous stage and outputting the larger value Max0 (the smaller of the two being Min0). This larger value Max0 is the value with the largest absolute value among the 4 data elements.
The fourth stage pipeline operation circuit may include a fourth comparator 740 that compares the smaller value Min0 from the previous stage with the other value in the group containing the maximum value Max0, and outputs the larger value Max1. This larger value Max1 is the second largest of the absolute values of the 4 data elements.
Therefore, the two-out-of-four structured sparsification can be realized by the 4-stage pipeline arithmetic circuit.
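The four stages above can be simulated in software; the sketch below is a hypothetical Python rendering of the Fig. 7A datapath, returning the largest and second-largest absolute values:

```python
def second_largest_abs_pipeline(a, b, c, d):
    """Simulate the 4-stage pipeline of Fig. 7A: find the largest and
    second-largest absolute values among four inputs."""
    # Stage 1: absolute value operators
    A, B, C, D = abs(a), abs(b), abs(c), abs(d)
    # Stage 2: pairwise comparison within the two groups
    max00 = max(A, B)            # winner of group {|A|, |B|}
    max10 = max(C, D)            # winner of group {|C|, |D|}
    # Stage 3: compare the two group winners
    max0 = max(max00, max10)     # largest absolute value
    min0 = min(max00, max10)     # loser of the final comparison
    # Stage 4: compare min0 with the loser inside the winning group
    loser_in_winning_group = min(A, B) if max00 >= max10 else min(C, D)
    max1 = max(min0, loser_in_winning_group)
    return max0, max1

# e.g. inputs 3, -7, 2, 5 -> largest |.| is 7, second largest is 5
```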
FIG. 7B illustrates an exemplary operation pipeline for structured sparsification processing according to another embodiment of the present disclosure. Likewise, the embodiment of fig. 7B shows, for m = 4 and n = 2, a structured sparsification process that screens out the 2 data elements with the larger absolute values from the 4 data elements A, B, C and D.
As shown in fig. 7B, the structured thinning-out processing described above can be performed by a multistage pipelined arithmetic circuit composed of an absolute value calculator, a comparator, and the like.
The first pipeline stage may include m (4) absolute value operators 750 for synchronously performing absolute value operations on the 4 input data elements A, B, C and D, respectively. To facilitate the final output of valid data elements, in some embodiments, the first pipeline stage may output both the original data elements (i.e., A, B, C and D) and the data after the absolute value operation (i.e., | A |, | B |, | C | and | D |).
The second pipeline stage may include a permutation and combination circuit 760 for permuting and combining the m absolute values to generate m sets of data, wherein each set of data includes the m absolute values, and the m absolute values are located at different positions in the sets of data.
In some embodiments, the permutation combining circuit may be a cyclic shifter that cyclically shifts the permutation of m absolute values (e.g., | A |, | B |, | C |, and | D |) m-1 times, thereby generating m sets of data. For example, in the example shown in the figure, 4 sets of data are generated, respectively: { | A |, | B |, | C |, | D | }, { | B |, | C |, | D |, | A | }, { | C |, | D |, | A |, | B | } and { | D |, | A |, | B |, | C | }. Similarly, each group of data is output, and simultaneously, the corresponding original data element is also output, and each group of data corresponds to one original data element.
The third pipeline stage comprises a comparison circuit 770 for comparing the absolute values in the m sets of data and generating comparison results.
In some embodiments, the third pipeline stage may include m comparison circuits, each comparison circuit including m-1 comparators (771, 772, 773), where the m-1 comparators in the ith comparison circuit are configured to sequentially compare one absolute value in the ith set of data with the other m-1 absolute values and generate comparison results, where 1 ≤ i ≤ m.
As can be seen, the third pipeline stage can also be viewed as m-1 (here, 3) sub-pipeline stages. Each sub-pipeline stage includes m comparators, each comparing its corresponding absolute value with one of the other absolute values. The m-1 sub-pipeline stages thus compare each absolute value with the other m-1 absolute values in turn.
For example, in the example shown in the figure, the 4 comparators 771 in the first sub-pipeline stage compare the first absolute value with the second absolute value in each of the 4 sets of data, and output comparison results w0, x0, y0 and z0, respectively. The 4 comparators 772 in the second sub-pipeline stage compare the first absolute value with the third absolute value in each of the 4 sets of data, and output comparison results w1, x1, y1 and z1. The 4 comparators 773 in the third sub-pipeline stage compare the first absolute value with the fourth absolute value in each of the 4 sets of data, and output comparison results w2, x2, y2 and z2.
Thus, a comparison of each absolute value with the other m-1 absolute values can be obtained.
In some embodiments, the comparison results may be represented using a bitmap. For example, in the 1st comparator of the 1st-way comparison circuit, when |A| ≥ |B|, w0 = 1; in the 2nd comparator of the 1st way, when |A| < |C|, w1 = 0; in the 3rd comparator of the 1st way, when |A| ≥ |D|, w2 = 1. The output of the 1st-way comparison circuit, {A, w0, w1, w2}, is therefore {A, 1, 0, 1}. Similarly, the output of the 2nd-way comparison circuit is {B, x0, x1, x2}, the output of the 3rd-way comparison circuit is {C, y0, y1, y2}, and the output of the 4th-way comparison circuit is {D, z0, z1, z2}.
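The bitmap generation of the third stage can be sketched as below (a software illustration, with the cyclic ordering of the shifted groups assumed from the description of the second stage); the example values 5, 4, 9, 2 reproduce the {A, 1, 0, 1} output discussed above:

```python
def compare_bitmaps(values):
    """For each data element, compare its absolute value with the other
    m-1 absolute values (in cyclic order, as after the shift stage) and
    record each result as one bit: 1 if |self| >= |other|, else 0."""
    m = len(values)
    mags = [abs(v) for v in values]
    results = []
    for i in range(m):
        bits = [1 if mags[i] >= mags[(i + k) % m] else 0
                for k in range(1, m)]
        results.append((values[i], bits))
    return results

# For A=5, B=4, C=9, D=2: way 1 outputs {A, w0, w1, w2} = (5, [1, 0, 1])
```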
The fourth pipeline stage includes a screening circuit 780 configured to select, according to the comparison results of the third stage, the n data elements with larger absolute values from the m data elements as valid data elements, and to output the valid data elements and corresponding indexes. The index indicates the positions of these valid data elements among the m input data elements. For example, when A and C are screened out of the four data elements A, B, C and D, their corresponding indices may be 0 and 2.
Based on the comparison results, appropriate logic can be designed to select the n data elements with larger absolute values. Since multiple data elements may share the same absolute value, in a further embodiment, when data elements have equal absolute values, the selection is made in a specified priority order. For example, priority may be fixed in index order, with A having the highest priority and D the lowest. In one example, when the absolute values of A, C and D are all equal and greater than the absolute value of B, the data selected are A and C.
From the foregoing comparison results, w0, w1 and w2 reveal how many of {|B|, |C|, |D|} are less than (or equal to) |A|. If w0, w1 and w2 are all 1, then |A| compares greater than each of |B|, |C| and |D|; it is the maximum of the four numbers, so A is selected. If exactly two of w0, w1 and w2 are 1, then |A| is the second largest of the four absolute values, so A is also selected. Otherwise, A is not selected. Thus, in some embodiments, the determination can be made based on the number of 1s among these comparison bits.
In one implementation, the valid data elements may be selected based on the following logic. First, count the number of times each data element compares greater than the other data elements. For example, define N_A = sum_w = w0 + w1 + w2, N_B = sum_x = x0 + x1 + x2, N_C = sum_y = y0 + y1 + y2, and N_D = sum_z = z0 + z1 + z2. Selection is then made under the following conditions.
The condition for selecting A is: N_A = 3, or N_A = 2 and exactly one of N_B/N_C/N_D is 3.
The condition for selecting B is: N_B = 3, or N_B = 2 and exactly one of N_A/N_C/N_D is 3 and N_A ≠ 2.
The condition for selecting C is: N_C = 3 and at most one of N_A/N_B is 3, or N_C = 2 and exactly one of N_A/N_B/N_D is 3 and neither N_A nor N_B is 2.
The condition for selecting D is: N_D = 3 and at most one of N_A/N_B/N_C is 3, or N_D = 2 and exactly one of N_A/N_B/N_C is 3 and none of N_A/N_B/N_C is 2.
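The counting conditions can be rendered in software as a sketch; the function below is a hypothetical Python transcription of the screening logic, assuming ≥ comparisons in cyclic order as in the comparator description:

```python
def select_two_of_four(values):
    """Select 2 of 4 data elements using the counting rule: N_i is the
    number of 1s among element i's comparator outputs (1 when its
    absolute value is >= the compared value, in cyclic order)."""
    mags = [abs(v) for v in values]
    N = [sum(1 for k in range(1, 4) if mags[i] >= mags[(i + k) % 4])
         for i in range(4)]
    NA, NB, NC, ND = N
    selected = []
    if NA == 3 or (NA == 2 and [NB, NC, ND].count(3) == 1):
        selected.append(0)  # select A
    if NB == 3 or (NB == 2 and [NA, NC, ND].count(3) == 1 and NA != 2):
        selected.append(1)  # select B
    if ((NC == 3 and [NA, NB].count(3) <= 1) or
            (NC == 2 and [NA, NB, ND].count(3) == 1
             and 2 not in (NA, NB))):
        selected.append(2)  # select C
    if ((ND == 3 and [NA, NB, NC].count(3) <= 1) or
            (ND == 2 and [NA, NB, NC].count(3) == 1
             and 2 not in (NA, NB, NC))):
        selected.append(3)  # select D
    return selected

# e.g. equal |A| = |C| = |D| > |B| selects A and C by priority
```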
Those skilled in the art will appreciate that there is some redundancy in the above logic in order to guarantee selection according to the predetermined priority. Based on the magnitude and ordering information provided by the comparison results, one skilled in the art may devise other logic to implement the screening of valid data elements, and the disclosure is not limited in this respect. Thus, the multi-stage pipeline arithmetic circuit of fig. 7B can likewise realize the two-out-of-four structured sparsification.
Those skilled in the art will appreciate that other forms of pipelined arithmetic circuits may also be designed to implement structured sparseness, and the present disclosure is not limited in this respect.
As mentioned previously, the operands of the sparse instructions may be data in a neural network, such as weights, neurons, and the like. Data in neural networks typically contain multiple dimensions. For example, in a convolutional neural network, data may exist in four dimensions: input channel, output channel, length, and width. In some embodiments, the sparse instruction may be used for structured sparse processing of at least one dimension of multidimensional data in a neural network. In particular, in one implementation, the sparse instructions may be used for structured sparse processing of the input channel dimension of multidimensional data in a neural network, for example in an inference process or a forward training process of the neural network. In another implementation, the sparse instructions may be used to apply structured sparsity to the input channel and output channel dimensions of multidimensional data in a neural network simultaneously, for example during reverse training of the neural network.
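For illustration only (not the disclosed hardware, and using a zero-mask representation rather than the packed structure described later), sparsifying the input channel dimension of a weight matrix might look like:

```python
def sparsify_input_channels(weights, m=4, n=2):
    """Mask-style illustration: within each group of m consecutive
    input channels of a [out_channels][in_channels] weight matrix,
    keep the n weights with the largest absolute values and zero the
    rest. Ties are broken toward the lower index."""
    result = []
    for row in weights:
        new_row = list(row)
        for g in range(0, len(row), m):
            block = row[g:g + m]
            keep = sorted(sorted(range(len(block)),
                                 key=lambda i: -abs(block[i]))[:n])
            for i in range(len(block)):
                if i not in keep:
                    new_row[g + i] = 0
        result.append(new_row)
    return result

# [[3, -7, 2, 5]] -> [[0, -7, 0, 5]]
```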
In one embodiment, in response to receiving a plurality of sparse instructions, one or more multi-stage pipelined arithmetic circuits of the present disclosure may be configured to perform multiple data operations, such as executing single instruction multiple data ("SIMD") instructions. In another embodiment, the plurality of operations performed by the operation circuits of each stage are predetermined according to functions supported by the plurality of operation circuits arranged stage by stage in the multistage operation pipeline.
In the context of the present disclosure, the aforementioned plurality of sparse instructions may be microinstructions or control signals executed within one or more multi-stage operation pipelines, which may include (or indicate) one or more operations to be performed by the multi-stage operation pipelines. Depending on the operational scenario, these operations may include, but are not limited to, arithmetic operations such as convolution and matrix multiplication, logical operations such as AND, XOR and OR operations, shift operations, or any combination of the foregoing.
FIG. 8 illustrates an exemplary flow diagram of a data processing method 800 according to an embodiment of the disclosure.
As shown in fig. 8, in step 810, a sparse instruction is parsed, the sparse instruction indicating operations related to structured sparsity. This step may be performed, for example, by control circuit 610 of fig. 6.
Next, in step 820, the corresponding operand is read according to the sparse instruction. The sparse instructions may indicate different modes of operation, with corresponding operands being different, as will be described in detail below. This step may be performed, for example, by control circuitry 610 of fig. 6 for storage circuitry 620.
Next, in step 830, structured sparsity-related operations are performed on the read operands. This step may be performed, for example, by the arithmetic circuitry 630 of fig. 6.
Finally, in step 840, the result of the operation is output. For example, the result of the operation may be output by the arithmetic circuit 630 to the storage circuit 620 for subsequent use.
Operations related to structured sparsity may exist in various forms, such as structured sparsity processing, anti-sparsity processing, and the like. Various instruction schemes may be devised to implement the structured sparsity-related operations.
In one scheme, a sparse instruction may be designed, and an operation mode bit may be included in the instruction to indicate different operation modes of the sparse instruction, so as to perform different operations.
In another scheme, a plurality of sparse instructions may be designed, each instruction corresponding to one or more different operation modes, so as to execute different operations. In one implementation, a corresponding sparse instruction may be designed for each mode of operation. In another implementation, the operation modes can be classified according to their characteristics, and a sparse instruction is designed for each type of operation mode. Further, when multiple operating modes are included in a class of operating modes, an operating mode bit may be included in the sparse instruction to indicate the respective operating mode.
Regardless of the scheme, the sparse instruction may indicate its corresponding mode of operation via an operating mode bit and/or the instruction itself.
In one embodiment, the sparse instruction may indicate the first mode of operation. In a first mode of operation, the operands of the sparse instruction include data to be thinned out. At this time, the arithmetic circuit 630 may be configured to perform the structural thinning processing on the data to be thinned according to the thinning instruction, and output the thinned structural body to the storage circuit 620.
The structured thinning-out processing in the first operation mode may be structured thinning-out processing of a predetermined filtering rule, for example, according to a rule for filtering out a larger absolute value, n data elements with a larger absolute value are filtered out from every m data elements as valid data elements. The operational circuitry 630 may be configured, for example, as the pipelined operational circuitry described with reference to fig. 7A and 7B to perform this structured sparseness processing.
The result after the sparsification processing comprises two parts: a data portion and an index portion. The data part comprises data after the data to be thinned are thinned, namely effective data elements extracted according to a screening rule of structured thinning processing. The index portion is used to indicate the position of the thinned-out data, i.e., the effective data elements, in the data before thinning-out (i.e., the data to be thinned).
The structure in the embodiments of the present disclosure includes a data portion and an index portion that are bound to each other. In some embodiments, every 1 bit in the index portion may correspond to one data element. For example, when the data type is fix8, one data element is 8 bits, and every 1 bit in the index portion corresponds to 8 bits of data. In other embodiments, every 1 bit in the index portion of the structure may be set to correspond to the position of N bits of data, N being determined based at least in part on the hardware configuration, which facilitates subsequent implementation at the hardware level when the structure is used. For example, it may be set that every 1 bit of the index portion in the structure corresponds to the position of 4 bits of data; when the data type is fix8, every 2 bits in the index portion then correspond to one fix8 data element. In some embodiments, the data portion of the structure may be aligned according to a first alignment requirement and the index portion according to a second alignment requirement, such that the entire structure also satisfies the alignment requirements. For example, the data portion may be aligned to 64B, the index portion to 32B, and the entire structure to 96B (64B + 32B). These alignment requirements reduce the number of memory accesses during subsequent use and improve processing efficiency.
By using such a structure, the data part and the index part can be used collectively. Since the proportion of the valid data elements occupying all the data elements in the structured thinning-out process is fixed, for example, n/m, the data size after the thinning-out process is also fixed or predictable. Thus, the structural body can be densely stored in the memory circuit without a performance loss.
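A minimal sketch of producing such a bound structure (plain Python, with the 1-bit-per-element index convention and lower-index tie-breaking assumed):

```python
def pack_structure(blocks, n=2):
    """Pack sparsified blocks into the bound 'structure': a dense data
    part holding only the valid elements, plus an index part with one
    bitmap per block (1 bit per data element marks a kept position)."""
    data_part, index_part = [], []
    for block in blocks:
        m = len(block)
        keep = sorted(sorted(range(m), key=lambda i: -abs(block[i]))[:n])
        data_part.extend(block[i] for i in keep)
        mask = 0
        for i in keep:
            mask |= 1 << i        # set the bit for each kept position
        index_part.append(mask)
    return data_part, index_part

data, index = pack_structure([[3, -7, 2, 5], [1, 0, 0, -2]])
# data = [-7, 5, 1, -2]; index = [0b1010, 0b1001]
```

Because the kept fraction is always n/m, the packed sizes are fixed (here len(data_part) == 2 * number of blocks), which is what permits dense storage without padding.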
In another embodiment, the sparse instruction may indicate the second mode of operation. The second operation mode differs from the first operation mode in that the content of the output is different, and the second operation mode outputs only the structured thinning-out processed data portion without outputting the index portion.
Similarly, in the second mode of operation, the operands of the sparse instruction include data to be thinned out. At this time, the operation circuit 630 may be configured to perform the structured thinning-out processing on the data to be thinned out according to the thinning-out instruction, and output the thinned-out data portion to the storage circuit 620. The data part comprises data after the data to be thinned are thinned. The data part is densely stored in the storage circuit. The output data portion is aligned by n elements. For example, in the example where m is 4 and n is 2, the input data to be thinned is aligned by 4 elements, and the output data portion is aligned by 2 elements.
In yet another embodiment, the sparse instruction may indicate a third mode of operation. The third operation mode differs from the first operation mode in that the content of the output is different, and the third operation mode outputs only the index portion after the structured thinning processing, and does not output the data portion.
Similarly, in a third mode of operation, the operands of the sparse instruction include data to be thinned out. At this time, the operation circuit 630 may be configured to perform structured sparse processing on the data to be thinned according to the sparse instruction, and output the index portion after the sparse processing to the storage circuit 620. The index portion indicates the location of the thinned data in the data to be thinned. The index part is densely stored in the storage circuit. Every 1 bit in the output index portion corresponds to the position of one data element. Since the index portion may be used alone, for example, for structured sparseness of neurons in subsequent convolution processing, while the data type of the neurons may be uncertain, the separately stored index portion may be adapted to various data types by corresponding every 1 bit in the index portion to the position of one data element.
In yet another embodiment, the sparse instruction may indicate a fourth mode of operation. The fourth operation mode is different from the first operation mode in that the fourth operation mode specifies a filtering rule of the structured thinning-out processing, instead of performing the structured thinning-out processing in accordance with a predetermined filtering rule (for example, the foregoing rule of larger absolute value). At this time, the sparse instruction has two operands: data to be thinned and a sparse index. The operand of the added sparse index is used to indicate the position of the valid data element in the structured sparsity to be performed, i.e. to specify the filtering rules for the structured sparsity processing. Each 1 bit in the sparse index corresponds to the position of one data element, so that the sparse index can be suitable for data to be thinned of various data types.
In a fourth operation mode, the operation circuit 630 may be configured to perform structured sparse processing on the data to be thinned according to the sparse instruction and according to the position indicated by the sparse index, and output a result after the sparse processing to the storage circuit. In one implementation, the output result may be the thinned structure. In another implementation, the output result may be a thinned out portion of the data.
The meaning of the structure is the same as that in the first operation mode, and the structure comprises a data part and an index part which are mutually bound, wherein the data part comprises data after the data to be thinned is subjected to thinning processing, and the index part is used for indicating the position of the thinned data in the data to be thinned. Alignment requirements, correspondence, etc. for the data portion, the index portion, and the like in the structure are the same as in the first operation mode, and are not repeated here.
The above four operation modes provide structured thinning processing of data, for example, processing according to a predetermined filtering rule or according to a filtering rule specified by an operand of an instruction, and provide different output contents, for example, an output structure body, an output-only data portion, an output-only index portion, and the like, respectively. The instruction design can well support structured sparse processing, and provides various output options to adapt to different scene requirements, for example, when data and an index are required to be bound for use, an output structure body can be selected, and when the index part or the data part is required to be used independently, only the index part or the data part can be selected to be output.
In yet another embodiment, the sparse instruction may indicate a fifth mode of operation. The fifth operation mode does not need structured sparse processing, and only needs to bind the separated or independent data part and the index part into a structure.
In a fifth mode of operation, the operands of the sparse instruction include the sparsified data portion and a corresponding index portion. The data portion and the index portion are each in a densely stored format, but are not bound. The input data portion is aligned by n elements. For example, in the example where m is 4 and n is 2, the input data portion is aligned by 2 elements. The index portion indicates the position of the data portion in the data before the sparsification processing, wherein each 1 bit of the index portion corresponds to one data element.
At this time, the arithmetic circuit 630 may be configured to bind the data portion and the index portion into a structure according to the sparse instruction, and output the structure to the storage circuit. The meaning of the structure, the alignment requirements for the data portion and the index portion, the correspondence, and the like are the same as in the first operation mode, and are not repeated here. Depending on the data type of the data elements, the index portion in the structure needs to be generated from the input index portion according to the data type and the bit correspondence used in the structure. For example, when the input index portion is 0011, where each 1 bit corresponds to one data element, and the data type is fix8 (each data element is 8 bits), then, under the convention that each 1 bit of the index portion in the structure corresponds to 4 bits of data, the index portion in the structure should be 00001111, i.e., 2 bits per data element.
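The index re-encoding in the example above can be sketched as follows (an illustration; the 4-bits-per-index-bit convention is the one assumed in the example, not a fixed requirement):

```python
def expand_index(index_bits, bits_per_element, bits_per_index_bit=4):
    """Re-encode a 1-bit-per-element index into the structure's
    convention where each index bit covers a fixed number of data bits
    (assumed 4 here). A fix8 element (8 bits) therefore takes
    8 // 4 = 2 index bits."""
    repeat = bits_per_element // bits_per_index_bit
    # Repeat each input bit once per index bit the element now spans.
    return ''.join(bit * repeat for bit in index_bits)

# expand_index('0011', 8) -> '00001111'
```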
In yet another embodiment, the sparse instruction may indicate a sixth mode of operation. The sixth operation mode is used to perform anti-sparsification processing, that is, to restore the data after sparsification to the data format or scale before sparsification.
In a sixth mode of operation, the operands of the sparse instruction include a thinned data portion and a corresponding index portion, which are each in a dense storage format, but are not bound. The input data portion is aligned by n elements. For example, in the example where m is 4 and n is 2, the input data portion is aligned by 2 elements, and the output data portion is aligned by 4 elements. The index portion indicates a position of the data portion in the data before the thinning-out process, wherein each 1 bit of the index portion corresponds to one data element.
At this time, the arithmetic circuit 630 may be configured to perform, in accordance with the thinning instruction, the anti-thinning processing on the input data portion in accordance with the position indicated by the input index portion to generate the restored data having the data format before the thinning processing, and output the restored data to the storage circuit.
In one implementation, the anti-sparsification process may include: according to the position indicated by the index part, according to the data format before the thinning processing, the data elements in the data part are respectively placed at the corresponding positions of the data format before the thinning processing, and the rest positions of the data format are filled with predetermined information (for example, 0) to generate the recovery data.
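The scatter-and-fill step can be sketched as below (plain Python, assuming the 1-bit-per-element bitmap index of this mode and 0 as the predetermined fill value):

```python
def unsparsify(data_part, index_part, m=4, fill=0):
    """Reverse the sparsification: scatter each block's valid elements
    back to the positions marked in its index bitmap and fill the
    remaining positions with a predetermined value (0 here)."""
    restored, cursor = [], 0
    for mask in index_part:
        block = [fill] * m
        for i in range(m):
            if mask >> i & 1:          # position i held a valid element
                block[i] = data_part[cursor]
                cursor += 1
        restored.extend(block)
    return restored

# unsparsify([-7, 5], [0b1010]) -> [0, -7, 0, 5]
```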
From the foregoing description, it can be seen that the disclosed embodiments provide a sparse instruction for performing operations related to structured sparsity. These operations may include forward structured sparsification operations, and may also include anti-sparsification operations, and may also include some associated format conversion operations. In some embodiments, an operating mode bit may be included in the sparse instruction to indicate different operating modes of the sparse instruction to perform different operations. In other embodiments, multiple sparse instructions may be provided directly, each corresponding to one or more different modes of operation, to perform various operations related to structured sparsity. By providing specialized sparse instructions to perform operations related to structured sparsity, processing may be simplified, thereby increasing the processing efficiency of the machine.
According to different application scenarios, the electronic device or apparatus of the present disclosure may include a server, a cloud server, a server cluster, a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet computer, a smart terminal, a PC device, a terminal of the internet of things, a mobile terminal, a mobile phone, a vehicle recorder, a navigator, a sensor, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a visual terminal, an autopilot terminal, a vehicle, a household appliance, and/or a medical device. The vehicle comprises an airplane, a ship and/or a vehicle; the household appliances comprise a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove and a range hood; the medical equipment comprises a nuclear magnetic resonance instrument, a B ultrasonic instrument and/or an electrocardiograph. The electronic device or apparatus of the present disclosure may also be applied to the fields of the internet, the internet of things, data centers, energy, transportation, public management, manufacturing, education, power grid, telecommunications, finance, retail, construction site, medical, and the like. Further, the electronic device or apparatus disclosed herein may also be used in application scenarios related to artificial intelligence, big data, and/or cloud computing, such as a cloud end, an edge end, and a terminal. In one or more embodiments, a computationally powerful electronic device or apparatus according to the present disclosure may be applied to a cloud device (e.g., a cloud server), while a less power-consuming electronic device or apparatus may be applied to a terminal device and/or an edge-end device (e.g., a smartphone or a camera). 
In one or more embodiments, the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that appropriate hardware resources can be matched from the hardware resources of the cloud device to simulate the hardware resources of the terminal device and/or the edge device according to the hardware information of the terminal device and/or the edge device, and uniform management, scheduling and cooperative work of end-cloud integration or cloud-edge-end integration can be completed.
It is noted that for the sake of brevity, the present disclosure describes some methods and embodiments thereof as a series of acts and combinations thereof, but those skilled in the art will appreciate that the aspects of the present disclosure are not limited by the order of the acts described. Accordingly, one of ordinary skill in the art will appreciate that certain steps may be performed in other sequences or simultaneously, in accordance with the disclosure or teachings of the present disclosure. Further, those skilled in the art will appreciate that the embodiments described in this disclosure are capable of alternative embodiments, in which acts or modules are involved, which are not necessarily required to practice one or more aspects of the disclosure. In addition, the present disclosure may focus on the description of some embodiments, depending on the solution. In view of the above, those skilled in the art will understand that portions of the disclosure that are not described in detail in one embodiment may also be referred to in the description of other embodiments.
In particular implementation, based on the disclosure and teachings of the present disclosure, one skilled in the art will appreciate that the several embodiments disclosed in the present disclosure may be implemented in other ways not disclosed herein. For example, as for the units in the foregoing embodiments of the electronic device or apparatus, the units are split based on the logic function, and there may be another splitting manner in the actual implementation. Also for example, multiple units or components may be combined or integrated with another system or some features or functions in a unit or component may be selectively disabled. The connections discussed above in connection with the figures may be direct or indirect couplings between the units or components in terms of connectivity between the different units or components. In some scenarios, the aforementioned direct or indirect coupling involves a communication connection utilizing an interface, where the communication interface may support electrical, optical, acoustic, magnetic, or other forms of signal transmission.
In the present disclosure, units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units. The aforementioned components or units may be co-located or distributed across multiple network elements. In addition, according to actual needs, part or all of the units can be selected to achieve the purpose of the solution of the embodiment of the present disclosure. In addition, in some scenarios, multiple units in embodiments of the present disclosure may be integrated into one unit or each unit may exist physically separately.
In other implementation scenarios, the integrated unit may also be implemented in hardware, that is, a specific hardware circuit, which may include a digital circuit and/or an analog circuit, etc. The physical implementation of the hardware structure of the circuit may include, but is not limited to, physical devices, which may include, but are not limited to, transistors or memristors, among other devices. In this regard, the various devices described herein (e.g., computing devices or other processing devices) may be implemented by suitable hardware processors, such as central processing units, GPUs, FPGAs, DSPs, ASICs, and the like. Further, the aforementioned storage unit or storage device may be any suitable storage medium (including magnetic storage medium or magneto-optical storage medium, etc.), and may be, for example, a variable Resistive Memory (RRAM), a Dynamic Random Access Memory (DRAM), a Static Random Access Memory (SRAM), an Enhanced Dynamic Random Access Memory (EDRAM), a High Bandwidth Memory (HBM), a Hybrid Memory Cube (HMC), a ROM, a RAM, or the like.
The foregoing may be better understood in light of the following clauses:
clause 1, a data processing apparatus, comprising:
control circuitry configured to parse a sparse instruction, the sparse instruction indicating an operation related to structured sparsity;
a storage circuit configured to store pre-sparsification and/or post-sparsification information; and
an arithmetic circuit configured to perform a corresponding operation according to the sparse instruction.
Clause 2, the data processing apparatus of clause 1, wherein the sparse instruction indicates a first mode of operation and an operand of the sparse instruction comprises data to be thinned out,
the operation circuit is configured to perform structured sparsification on the data to be thinned according to the sparse instruction, and output a sparsified structure to the storage circuit, wherein the structure comprises a data portion and an index portion that are bound to each other, the data portion comprises the sparsified data of the data to be thinned, and the index portion indicates the position of the sparsified data in the data to be thinned.
Clause 3, the data processing apparatus of clause 1, wherein the sparse instruction indicates a second mode of operation and an operand of the sparse instruction includes data to be thinned out,
the operation circuit is configured to perform structured sparsification on the data to be thinned according to the sparse instruction, and output a sparsified data portion to the storage circuit, wherein the data portion comprises the sparsified data of the data to be thinned.
Clause 4, the data processing apparatus of clause 1, wherein the sparse instruction indicates a third mode of operation and an operand of the sparse instruction comprises data to be thinned out,
the operation circuit is configured to perform structured sparse processing on the data to be thinned according to the sparse instruction, and output an index part subjected to sparse processing to the storage circuit, wherein the index part indicates the position of the data subjected to sparse processing in the data to be thinned.
Clause 5, the data processing apparatus of clause 1, wherein the sparse instruction indicates a fourth mode of operation, and operands of the sparse instruction include data to be thinned out and a sparse index indicating a location of a valid data element in a structured sparse to be performed,
the operation circuit is configured to perform structured sparsification on the data to be thinned according to the sparse instruction and the positions indicated by the sparse index, and output a sparsified structure or a sparsified data portion to the storage circuit, wherein the structure comprises a data portion and an index portion that are bound to each other, the data portion comprises the sparsified data of the data to be thinned, and the index portion indicates the position of the sparsified data in the data to be thinned.
Clause 6, the data processing apparatus of clause 1, wherein the sparse instruction indicates a fifth mode of operation and operands of the sparse instruction include a sparsified data portion and a corresponding index portion, the index portion indicating a location of the data portion in the data prior to the sparsifying,
the arithmetic circuit is configured to bind the data portion and the index portion into a structure according to the sparse instruction, and output the structure to the storage circuit.
Clause 7, the data processing apparatus of clause 1, wherein the sparse instruction indicates a sixth mode of operation and operands of the sparse instruction include a sparsified data portion and a corresponding index portion, the index portion indicating a location of the data portion in the data prior to the sparsifying,
the arithmetic circuit is configured to perform, according to the sparse instruction and according to the position indicated by the index portion, anti-sparsification processing on the data portion to generate recovered data having a data format before sparsification processing, and output the recovered data to the storage circuit.
Clause 8, the data processing apparatus according to any of clauses 2-5, wherein the structured sparsification includes selecting n data elements from every m data elements as valid data elements, wherein m > n.
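The m:n selection of clause 8 can be illustrated with a short software sketch. This is only an illustrative model of the clause, not the claimed hardware; the concrete values m=4, n=2 and the tie-breaking rule (lower position wins) are assumptions made for the example.

```python
# Illustrative software model of clause 8's structured sparsification:
# from every group of m elements, keep the n with the largest absolute
# values and record their positions within the group. m=4, n=2 and the
# tie-breaking rule are assumptions for this sketch.
def structured_sparsify(data, m=4, n=2):
    """Return (kept values, their positions within each group)."""
    assert len(data) % m == 0
    values, indices = [], []
    for g in range(0, len(data), m):
        group = data[g:g + m]
        # Positions of the n largest-magnitude elements; Python's stable
        # sort resolves equal magnitudes in favor of the lower position.
        keep = sorted(sorted(range(m), key=lambda i: -abs(group[i]))[:n])
        values.extend(group[i] for i in keep)
        indices.extend(keep)
    return values, indices
```

For the group [1.0, -3.0, 0.5, 2.0] this keeps -3.0 and 2.0 at positions 1 and 3, matching the data portion and index portion described in clauses 2-5.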
Clause 9, the data processing apparatus of clause 8, wherein the arithmetic circuitry further comprises: at least one multi-stage pipelined arithmetic circuit including a plurality of operators arranged stage by stage and configured to perform structured thinning-out processing of selecting n data elements having larger absolute values from among the m data elements as effective data elements in accordance with the thinning-out instruction.
Clause 10, the data processing apparatus of clause 9, wherein the multi-stage pipelined arithmetic circuit comprises four pipelined stages, wherein:
the first pipeline stage comprises m absolute-value units, configured to take the absolute value of each of the m data elements to be thinned so as to generate m absolute values;
the second pipeline stage comprises a permutation circuit, configured to permute the m absolute values into m groups of data, wherein each group of data comprises the m absolute values and the positions of the absolute values differ from group to group;
the third pipeline stage comprises m comparison circuits, configured to compare the absolute values within the m groups of data and generate comparison results; and
the fourth pipeline stage comprises a screening circuit, configured to select the n data elements with the larger absolute values as valid data elements according to the comparison results, and to output the valid data elements and corresponding indexes, wherein the indexes indicate the positions of the valid data elements among the m data elements.
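A behavioral sketch of the four pipeline stages may help fix the data flow. The stage decomposition follows the clause text, but the Python model below is an assumption-laden illustration (rotations for the permutation stage, win-counting for the comparison stage), not the claimed circuit.

```python
# Behavioral model of the four pipeline stages of clause 10 (illustrative,
# not the claimed circuit):
#   stage 1 - m absolute-value units,
#   stage 2 - permutation: m rotations, each leading with a different element,
#   stage 3 - m comparison lanes, each with m-1 comparators,
#   stage 4 - screening: keep the n lanes with the most comparison "wins".
def pipeline_select(group, n=2):
    m = len(group)
    abs_vals = [abs(x) for x in group]                       # stage 1
    lanes = [abs_vals[i:] + abs_vals[:i] for i in range(m)]  # stage 2
    # Stage 3: lane i compares its leading value against the other m-1.
    wins = [sum(lane[0] >= other for other in lane[1:]) for lane in lanes]
    # Stage 4: more wins means a larger magnitude; ties go to the lower
    # position, i.e. a fixed priority order as in clause 12.
    keep = sorted(sorted(range(m), key=lambda i: (-wins[i], i))[:n])
    return [group[i] for i in keep], keep
```

On [1.0, -3.0, 0.5, 2.0] the lanes accumulate 1, 3, 0, and 2 wins respectively, so positions 1 and 3 survive the screening stage.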
Clause 11, the data processing apparatus according to clause 10, wherein each of the comparison circuits in the third pipeline stage comprises m-1 comparators, and the m-1 comparators in the ith comparison circuit are configured to sequentially compare one absolute value in the ith group of data with the other m-1 absolute values and generate comparison results, wherein 1 ≤ i ≤ m.
Clause 12, the data processing apparatus according to any of clauses 10-11, wherein the screening circuit is further configured to select according to a specified priority order when there are data elements with identical absolute values.
Clause 13, the data processing apparatus of clause 7, wherein the anti-sparsification process comprises:
placing, according to the positions indicated by the index portion, each data element in the data portion at its corresponding position in the pre-sparsification data format, and filling the remaining positions of the data format with predetermined information to generate the recovered data.
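As an illustration, the anti-sparsification of clause 13 amounts to a scatter operation. In this sketch the positions are flat offsets into the restored layout and zero stands in for the "predetermined information"; both are assumptions.

```python
# Sketch of clause 13's anti-sparsification: scatter each kept element back
# to the position recorded in the index portion and fill every other
# position with a predetermined value (zero here, by assumption).
def desparsify(values, positions, length, fill=0.0):
    out = [fill] * length
    for v, p in zip(values, positions):
        out[p] = v
    return out
```

For example, desparsify([-3.0, 2.0], [1, 3], 4) restores the pre-sparsification layout [0.0, -3.0, 0.0, 2.0].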
Clause 14, the data processing apparatus according to clause 2, 5 or 6, wherein,
each bit in the index portion of the structure corresponds to the position of N bits of data, wherein N is determined at least in part by the hardware configuration; and/or
The data portions in the structure are aligned according to a first alignment requirement and the index portions in the structure are aligned according to a second alignment requirement.
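One plausible packing of such a structure uses a bitmask index plus padded data. The alignment values and the bitmask granularity chosen below (one index bit per original element position, rather than per N data bits) are hypothetical choices for this sketch, not taken from the disclosure.

```python
# Hypothetical packing of clause 14's structure: a bitmask index (one bit
# per original element position) plus a data part padded to an alignment
# boundary. data_align and index_align are assumed values.
def pack_structure(values, positions, length, data_align=8, index_align=4):
    mask = 0
    for p in positions:
        mask |= 1 << p                       # set bit = element kept
    # First alignment requirement: pad the data part to data_align elements.
    padded = values + [0.0] * (-len(values) % data_align)
    # Second alignment requirement: round the index size up to index_align bytes.
    index_bytes = -(-length // 8)            # ceil(length / 8)
    index_bytes += -index_bytes % index_align
    return {"data": padded, "index": mask, "index_bytes": index_bytes}
```

Binding the two parts into one record in this way is what lets modes five and six (clauses 6 and 7) consume the data portion and index portion together.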
Clause 15, the data processing apparatus of any of clauses 1-14, wherein the sparse instruction is used for structured sparsification of at least one dimension of multidimensional data in a neural network.
Clause 16, the data processing apparatus of clause 15, wherein the at least one dimension is selected from an input channel dimension and an output channel dimension.
Clause 17, the data processing apparatus according to any of clauses 1-16, wherein
The sparse instruction includes an operation mode bit therein to indicate an operation mode of the sparse instruction, or
The sparse instruction includes a plurality of instructions, each instruction corresponding to one or more different operating modes.
Clause 18, a chip comprising the data processing apparatus of any of clauses 1-17.
Clause 19, a board comprising the chip of clause 18.
Clause 20, a data processing method, comprising:
parsing a sparse instruction, the sparse instruction indicating an operation related to structured sparsity;
reading corresponding operands according to the sparse instruction;
performing the structured sparsity-related operation on the operands; and
outputting an operation result.
Clause 21, the method of data processing according to clause 20, wherein the sparse instruction indicates a first mode of operation and an operand of the sparse instruction comprises data to be thinned out, the method further comprising:
according to the sparse instruction, performing structured sparse processing on the data to be thinned; and
outputting a sparsified structure, wherein the structure comprises a data portion and an index portion bound to each other, the data portion comprises the sparsified data of the data to be thinned, and the index portion indicates the position of the sparsified data in the data to be thinned.
Clause 22, the method of data processing according to clause 20, wherein the sparse instruction indicates a second mode of operation and an operand of the sparse instruction comprises data to be thinned out, the method further comprising:
according to the sparse instruction, performing structured sparse processing on the data to be thinned; and
outputting a sparsified data portion, wherein the data portion comprises the sparsified data of the data to be thinned.
Clause 23, the method of data processing according to clause 20, wherein the sparse instruction indicates a third mode of operation and an operand of the sparse instruction comprises data to be thinned out, the method further comprising:
according to the sparse instruction, performing structured sparse processing on the data to be thinned; and
outputting a thinned index part, wherein the index part indicates the position of the thinned data in the data to be thinned.
Clause 24, the data processing method of clause 20, wherein the sparse instruction indicates a fourth mode of operation and operands of the sparse instruction include data to be thinned out and a sparse index indicating a location of a valid data element in a structured sparse to be performed, the method further comprising:
according to the sparse instruction and the position indicated by the sparse index, performing structured sparse processing on the data to be thinned; and
outputting a sparsified structure or a sparsified data portion, wherein the structure comprises a data portion and an index portion bound to each other, the data portion comprises the sparsified data of the data to be thinned, and the index portion indicates the position of the sparsified data in the data to be thinned.
Clause 25, the method of data processing according to clause 20, wherein the sparse instruction indicates a fifth mode of operation and operands of the sparse instruction include a sparsified data portion and a corresponding index portion, the index portion indicating a location of the data portion in the data prior to the sparsifying, the method further comprising:
binding the data part and the index part into a structure according to the sparse instruction; and
outputting the structure.
Clause 26, the data processing method of clause 20, wherein the sparse instruction indicates a sixth mode of operation and the operands of the sparse instruction comprise a sparsified data portion and a corresponding index portion, the index portion indicating a location of the data portion in the data prior to the sparsifying, the method further comprising:
according to the sparse instruction, according to the position indicated by the index part, performing anti-sparsification processing on the data part to generate recovery data with a data format before sparsification processing; and
outputting the recovered data.
Clause 27, the data processing method according to any one of clauses 21-24, wherein the structured sparsification comprises selecting n data elements from every m data elements as valid data elements, wherein m > n.
Clause 28, the data processing method of clause 27, wherein the structured sparsification process is implemented using an arithmetic circuit comprising: at least one multi-stage pipelined arithmetic circuit including a plurality of operators arranged stage by stage and configured to perform structured thinning-out processing of selecting n data elements having larger absolute values from among the m data elements as effective data elements in accordance with the thinning-out instruction.
Clause 29, the data processing method of clause 28, wherein the multi-stage pipelined arithmetic circuit comprises four pipelined stages, wherein:
the first pipeline stage comprises m absolute-value units, configured to take the absolute value of each of the m data elements to be thinned so as to generate m absolute values;
the second pipeline stage comprises a permutation circuit, configured to permute the m absolute values into m groups of data, wherein each group of data comprises the m absolute values and the positions of the absolute values differ from group to group;
the third pipeline stage comprises m comparison circuits, configured to compare the absolute values within the m groups of data and generate comparison results; and
the fourth pipeline stage comprises a screening circuit, configured to select the n data elements with the larger absolute values as valid data elements according to the comparison results, and to output the valid data elements and corresponding indexes, wherein the indexes indicate the positions of the valid data elements among the m data elements.
Clause 30, the data processing method according to clause 29, wherein each of the comparison circuits in the third pipeline stage comprises m-1 comparators, and the m-1 comparators in the ith comparison circuit are configured to sequentially compare one absolute value in the ith group of data with the other m-1 absolute values and generate comparison results, wherein 1 ≤ i ≤ m.
Clause 31, the data processing method according to any of clauses 29-30, wherein the screening circuit is further configured to select according to a specified priority order when there are data elements with identical absolute values.
Clause 32, the data processing method of clause 26, wherein the anti-sparsification processing comprises:
placing, according to the positions indicated by the index portion, each data element in the data portion at its corresponding position in the pre-sparsification data format, and filling the remaining positions of the data format with predetermined information to generate the recovered data.
Clause 33, the data processing method of clause 21, 24 or 25, wherein
each bit in the index portion of the structure corresponds to the position of N bits of data, wherein N is determined at least in part by the hardware configuration; and/or
the data portions in the structure are aligned according to a first alignment requirement, and the index portions in the structure are aligned according to a second alignment requirement.
Clause 34, the data processing method of any of clauses 20-33, wherein the sparse instruction is used for structured sparsification of at least one dimension of multidimensional data in a neural network.
Clause 35, the data processing method of clause 34, wherein the at least one dimension is selected from an input channel dimension and an output channel dimension.
Clause 36, the data processing method according to any one of clauses 20 to 35, wherein
The sparse instruction includes an operation mode bit therein to indicate an operation mode of the sparse instruction, or
The sparse instructions include a plurality of instructions, each instruction corresponding to one or more different operating modes.
The foregoing detailed description of the embodiments of the present disclosure has been presented for purposes of illustration and description only; it is not intended to be exhaustive or to limit the disclosure to the precise forms disclosed. For those skilled in the art, variations may be made to the specific embodiments and the scope of application based on the ideas of the present disclosure. In summary, the contents of this specification should not be construed as limiting the present disclosure.
Claims (36)
1. A data processing apparatus comprising:
control circuitry configured to parse a sparse instruction, the sparse instruction indicating an operation related to structured sparsity;
a storage circuit configured to store pre-sparsification and/or post-sparsification information; and
an arithmetic circuit configured to perform a corresponding operation according to the sparse instruction.
2. The data processing apparatus according to claim 1, wherein the sparse instruction indicates a first mode of operation and an operand of the sparse instruction comprises data to be thinned out,
the operation circuit is configured to perform structured sparsification on the data to be thinned according to the sparse instruction, and output a sparsified structure to the storage circuit, wherein the structure comprises a data portion and an index portion that are bound to each other, the data portion comprises the sparsified data of the data to be thinned, and the index portion indicates the position of the sparsified data in the data to be thinned.
3. The data processing apparatus according to claim 1, wherein the sparse instruction indicates a second mode of operation and an operand of the sparse instruction comprises data to be thinned out,
the operation circuit is configured to execute structured sparse processing on the data to be thinned according to the sparse instruction, and output a thinned data portion to the storage circuit, where the data portion includes the data of the data to be thinned after the thinning processing.
4. The data processing apparatus according to claim 1, wherein the sparse instruction indicates a third mode of operation and an operand of the sparse instruction comprises data to be thinned out,
the operation circuit is configured to perform structured sparse processing on the data to be thinned according to the sparse instruction, and output an index part subjected to sparse processing to the storage circuit, wherein the index part indicates the position of the data subjected to sparse processing in the data to be thinned.
5. The data processing apparatus according to claim 1, wherein the sparse instruction indicates a fourth mode of operation, and operands of the sparse instruction comprise data to be thinned out and a sparse index indicating a position of an active data element in a structured sparse to be performed,
the operation circuit is configured to execute structured sparse processing on the data to be thinned according to the position indicated by the sparse index according to the sparse instruction, and output a structure body after sparse processing or a data part after sparse processing to the storage circuit, wherein the structure body comprises a data part and an index part which are bound with each other, the data part comprises the data after sparse processing on the data to be thinned, and the index part is used for indicating the position of the data after sparse processing in the data to be thinned.
6. The data processing apparatus according to claim 1, wherein the sparse instruction indicates a fifth mode of operation and operands of the sparse instruction comprise a thinned-out data portion and a corresponding index portion, the index portion indicating a position of the data portion in the data prior to thinning-out,
the arithmetic circuit is configured to bind the data portion and the index portion into a structure according to the sparse instruction, and output the structure to the storage circuit.
7. The data processing apparatus according to claim 1, wherein the sparse instruction indicates a sixth mode of operation and operands of the sparse instruction comprise a thinned-out data portion and a corresponding index portion, the index portion indicating a location of the data portion in the data prior to thinning-out,
the arithmetic circuit is configured to perform, according to the sparse instruction and according to the position indicated by the index portion, anti-sparsification processing on the data portion to generate recovered data having a data format before sparsification processing, and output the recovered data to the storage circuit.
8. The data processing apparatus according to any of claims 2-5, wherein said structured sparseness processing comprises selecting n data elements from every m data elements as valid data elements, wherein m > n.
9. The data processing device of claim 8, wherein the arithmetic circuitry further comprises: at least one multi-stage pipeline operation circuit including a plurality of operators arranged stage by stage and configured to perform structured thinning-out processing of selecting n data elements having larger absolute values as effective data elements from among the m data elements according to the thinning-out instruction.
10. The data processing apparatus of claim 9, wherein the multi-stage pipelined arithmetic circuit comprises four pipelined stages, wherein:
the first pipeline stage comprises m absolute-value units, configured to take the absolute value of each of the m data elements to be thinned so as to generate m absolute values;
the second pipeline stage comprises a permutation circuit, configured to permute the m absolute values into m groups of data, wherein each group of data comprises the m absolute values and the positions of the absolute values differ from group to group;
the third pipeline stage comprises m comparison circuits, configured to compare the absolute values within the m groups of data and generate comparison results; and
the fourth pipeline stage comprises a screening circuit, configured to select the n data elements with the larger absolute values as valid data elements according to the comparison results, and to output the valid data elements and corresponding indexes, wherein the indexes indicate the positions of the valid data elements among the m data elements.
11. The data processing apparatus according to claim 10, wherein each of the comparison circuits in the third pipeline stage comprises m-1 comparators, and the m-1 comparators in the ith comparison circuit are configured to sequentially compare one absolute value in the ith group of data with the other m-1 absolute values and generate comparison results, wherein 1 ≤ i ≤ m.
12. A data processing apparatus according to any of claims 10 to 11, wherein the screening circuit is further configured to select in a specified priority order when there are data elements that are the same in absolute value.
13. The data processing device of claim 7, wherein the anti-sparsification process comprises:
placing, according to the positions indicated by the index portion, each data element in the data portion at its corresponding position in the pre-sparsification data format, and filling the remaining positions of the data format with predetermined information to generate the recovered data.
14. The data processing apparatus according to claim 2, 5 or 6,
each bit in the index portion of the structure corresponds to the position of N bits of data, wherein N is determined at least in part by the hardware configuration; and/or
The data portions in the structure are aligned according to a first alignment requirement, and the index portions in the structure are aligned according to a second alignment requirement.
15. The data processing apparatus according to any of claims 1 to 14, wherein the sparse instruction is for structured sparse processing of at least one dimension of multidimensional data in a neural network.
16. The data processing apparatus of claim 15, wherein the at least one dimension is selected from an input channel dimension and an output channel dimension.
17. A data processing apparatus as claimed in any one of claims 1 to 16, wherein
The sparse instruction includes an operation mode bit therein to indicate an operation mode of the sparse instruction, or
The sparse instruction includes a plurality of instructions, each instruction corresponding to one or more different operating modes.
18. A chip comprising a data processing device according to any one of claims 1 to 17.
19. A board comprising the chip of claim 18.
20. A method of data processing, comprising:
parsing a sparse instruction, the sparse instruction indicating an operation related to structured sparsity;
reading corresponding operands according to the sparse instruction;
performing the structured sparsity-related operation on the operands; and
outputting the operation result.
21. The data processing method of claim 20, wherein the sparse instruction indicates a first mode of operation and an operand of the sparse instruction includes data to be thinned out, the method further comprising:
according to the sparse instruction, performing structured sparse processing on the data to be thinned; and
outputting a sparsified structure, wherein the structure comprises a data portion and an index portion bound to each other, the data portion comprises the sparsified data of the data to be thinned, and the index portion indicates the position of the sparsified data in the data to be thinned.
22. The data processing method of claim 20, wherein the sparse instruction indicates a second mode of operation and an operand of the sparse instruction comprises data to be thinned out, the method further comprising:
according to the sparse instruction, performing structured sparse processing on the data to be thinned; and
outputting a sparsified data portion, wherein the data portion comprises the sparsified data of the data to be thinned.
23. The data processing method of claim 20, wherein the sparse instruction indicates a third mode of operation and an operand of the sparse instruction comprises data to be thinned out, the method further comprising:
according to the sparse instruction, performing structured sparse processing on the data to be thinned; and
outputting a thinned index part, wherein the index part indicates the position of the thinned data in the data to be thinned.
24. The data processing method of claim 20, wherein the sparse instruction indicates a fourth mode of operation and operands of the sparse instruction include data to be thinned out and a sparse index indicating a location of a valid data element in a structured sparse to be performed, the method further comprising:
according to the sparse instruction and the position indicated by the sparse index, performing structured sparse processing on the data to be thinned; and
outputting a sparsified structure or a sparsified data portion, wherein the structure comprises a data portion and an index portion bound to each other, the data portion comprises the sparsified data of the data to be thinned, and the index portion indicates the position of the sparsified data in the data to be thinned.
25. The method of data processing according to claim 20, wherein the sparse instruction indicates a fifth mode of operation and operands of the sparse instruction comprise a thinned-out data portion and a corresponding index portion, the index portion indicating a location of the data portion in the data prior to thinning-out, the method further comprising:
binding the data part and the index part into a structure according to the sparse instruction; and
outputting the structure.
26. The data processing method of claim 20, wherein the sparse instruction indicates a sixth mode of operation and operands of the sparse instruction comprise a sparsified data portion and a corresponding index portion, the index portion indicating a location of the data portion in the data prior to sparsifying, the method further comprising:
according to the sparse instruction, according to the position indicated by the index part, performing anti-sparsification processing on the data part to generate recovery data with a data format before sparsification processing; and
outputting the recovered data.
27. A method of data processing according to any of claims 21-24, wherein said structured sparseness processing comprises selecting n data elements from every m data elements as valid data elements, wherein m > n.
28. The data processing method of claim 27, wherein the structured sparseness processing is implemented using an operational circuit comprising: at least one multi-stage pipelined arithmetic circuit including a plurality of operators arranged stage by stage and configured to perform structured thinning-out processing of selecting n data elements having larger absolute values from among the m data elements as effective data elements in accordance with the thinning-out instruction.
29. The data processing method of claim 28, wherein the multi-stage pipeline arithmetic circuit comprises four pipeline stages, wherein:
the first pipeline stage comprises m absolute-value units, configured to take the absolute value of each of the m data elements to be thinned so as to generate m absolute values;
the second pipeline stage comprises a permutation circuit, configured to permute the m absolute values into m groups of data, wherein each group of data comprises the m absolute values and the positions of the absolute values differ from group to group;
the third pipeline stage comprises m comparison circuits, configured to compare the absolute values within the m groups of data and generate comparison results; and
the fourth pipeline stage comprises a screening circuit, configured to select the n data elements with the larger absolute values as valid data elements according to the comparison results, and to output the valid data elements and corresponding indexes, wherein the indexes indicate the positions of the valid data elements among the m data elements.
30. The data processing method of claim 29, wherein each of the comparison circuits in the third pipeline stage comprises m-1 comparators, and the m-1 comparators in the ith comparison circuit are used for sequentially comparing one absolute value in the ith group of data with the other m-1 absolute values and generating comparison results, wherein 1 ≤ i ≤ m.
31. A data processing method according to any of claims 29 to 30, wherein the screening circuit is further configured to select in a specified priority order when there are data elements of the same absolute value.
32. The data processing method of claim 26, wherein the anti-sparsification process comprises:
placing, according to the positions indicated by the index portion, each data element in the data portion at its corresponding position in the pre-sparsification data format, and filling the remaining positions of the data format with predetermined information to generate the recovered data.
33. The data processing method of claim 21, 24 or 25,
each bit in the index portion of the structure corresponds to the position of N bits of data, wherein N is determined at least in part by the hardware configuration; and/or
The data portions in the structure are aligned according to a first alignment requirement and the index portions in the structure are aligned according to a second alignment requirement.
34. The data processing method of any of claims 20 to 33, wherein the sparse instruction is for structured sparse processing of at least one dimension of multidimensional data in a neural network.
35. The data processing method of claim 34, wherein the at least one dimension is selected from an input channel dimension and an output channel dimension.
36. The data processing method of any one of claims 20 to 35, wherein
the sparse instruction includes an operation mode bit indicating the operation mode of the sparse instruction, or
the sparse instruction comprises a plurality of instructions, each instruction corresponding to one or more different operation modes.
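Claim 36's operation mode bit can be illustrated as a small field inside the instruction word that selects among modes. The field position, width, and mode encodings below are invented purely for illustration; the patent does not specify an encoding:

```python
# Hypothetical mode encodings for a sparse instruction (illustrative only).
MODE_SPARSIFY = 0     # perform structured sparsification
MODE_INDEX_ONLY = 1   # output only the index portion
MODE_DESPARSIFY = 2   # reverse operation: recover pre-sparsified data

def decode_mode(instr_word, mode_shift=8, mode_mask=0x3):
    """Extract the operation-mode field from an instruction word."""
    return (instr_word >> mode_shift) & mode_mask

# An instruction word carrying MODE_DESPARSIFY in the assumed mode field:
print(decode_mode((MODE_DESPARSIFY << 8) | 0x2A))  # -> 2
```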
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011566138.XA CN114692845A (en) | 2020-12-25 | 2020-12-25 | Data processing device, data processing method and related product |
PCT/CN2021/128189 WO2022134873A1 (en) | 2020-12-25 | 2021-11-02 | Data processing device, data processing method, and related product |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011566138.XA CN114692845A (en) | 2020-12-25 | 2020-12-25 | Data processing device, data processing method and related product |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114692845A true CN114692845A (en) | 2022-07-01 |
Family
ID=82130310
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011566138.XA Pending CN114692845A (en) | 2020-12-25 | 2020-12-25 | Data processing device, data processing method and related product |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114692845A (en) |
- 2020-12-25: CN application CN202011566138.XA filed (CN114692845A); status: Pending
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110163358B (en) | Computing device and method | |
JP2020126597A (en) | Computing device and computing method | |
CN111047022A (en) | Computing device and related product | |
WO2022134873A1 (en) | Data processing device, data processing method, and related product | |
CN113469337B (en) | Compiling method for optimizing neural network model and related products thereof | |
CN115952848A (en) | Convolution operation circuit, compiling method and related product | |
CN114692844A (en) | Data processing device, data processing method and related product | |
CN114692845A (en) | Data processing device, data processing method and related product | |
CN114692846A (en) | Data processing device, data processing method and related product | |
WO2022134688A1 (en) | Data processing circuit, data processing method, and related products | |
CN113469365B (en) | Reasoning and compiling method based on neural network model and related products thereof | |
CN114692841A (en) | Data processing device, data processing method and related product | |
CN113742266B (en) | Integrated circuit device, electronic apparatus, board and computing method | |
WO2022134872A1 (en) | Data processing apparatus, data processing method and related product | |
CN115438777A (en) | Device for performing Winograd convolution forward transform on neuron data | |
CN114691561A (en) | Data processing circuit, data processing method and related product | |
CN115237371A (en) | Computing device, data processing method and related product | |
CN114692811A (en) | Device and board card for executing Winograd convolution | |
CN113867797A (en) | Computing device, integrated circuit chip, board card, electronic equipment and computing method | |
CN114647442A (en) | Apparatus operating according to instruction set | |
CN115437693A (en) | Computing device operating according to multi-operation instruction and single-operation instruction | |
CN114692074A (en) | Matrix multiplication circuit, method and related product | |
CN114692849A (en) | Inverse transformation unit, device and board card for inverse transformation of Winograd convolution bit-to-bit multiplication data | |
CN114692848A (en) | Device and board card for obtaining convolution result | |
CN113469328A (en) | Device, board card, method and readable storage medium for executing revolution crossing |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||