CN114329324A - Data processing circuit, data processing method and related product - Google Patents


Info

Publication number
CN114329324A
CN114329324A (application CN202111642096.8A)
Authority
CN
China
Prior art keywords
data
convolution
dimension
input data
input
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111642096.8A
Other languages
Chinese (zh)
Inventor
Inventor not disclosed (不公告发明人)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Cambrian Jixingge Nanjing Technology Co ltd
Original Assignee
Cambrian Jixingge Nanjing Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Cambrian Jixingge Nanjing Technology Co ltd
Priority to CN202111642096.8A
Publication of CN114329324A
Priority to PCT/CN2022/100306 (WO2023123919A1)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/15 Correlation function computation including computation of convolution operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/491 Computations with decimal numbers radix 12 or 20.
    • G06F7/498 Computations with decimal numbers radix 12 or 20. using counter-type accumulators
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Computational Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Pure & Applied Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Complex Calculations (AREA)

Abstract

The present disclosure provides a data processing circuit, a data processing method, and related products. The data processing circuit may be implemented as a computing device included in a combined processing device, which may also include an interface device and other processing devices. The computing device interacts with the other processing devices to jointly complete a computing operation specified by a user. The combined processing device may further comprise a storage device, connected to the computing device and the other processing devices respectively, for storing data of the computing device and the other processing devices. The disclosed scheme provides a convolution processing scheme for sparse data, which simplifies the processing and improves the processing efficiency of the machine.

Description

Data processing circuit, data processing method and related product
Technical Field
The present disclosure relates generally to the field of data processing. More particularly, the present disclosure relates to a data processing circuit, a data processing method, a chip, and a board.
Background
In recent years, great progress has been made in convolutional neural network-based object detection, instance segmentation, and keypoint detection. These detection tasks are typically based on Light Detection and Ranging (LiDAR) data or RGB-D data, and can be applied in fields such as autonomous driving and robot vision.
Unlike image data, which is dense, LiDAR point cloud data is typically sparse, and the point density varies dramatically due to factors such as uneven sampling of the 3D space, the effective range of the sensor, occlusion, and relative pose. Therefore, conventional convolutional neural networks designed for dense data become very inefficient when applied to such sparse data; in particular, when convolution operations are involved, a large amount of resources such as computing power is wasted on zero-valued data points.
In view of this, it is desirable to provide an improved convolution scheme suited to sparse data such as point cloud data, thereby improving processing efficiency.
Disclosure of Invention
To at least partially solve one or more technical problems mentioned in the background, the present disclosure provides a data processing circuit, a data processing method, a chip and a board.
In a first aspect, the present disclosure discloses a data processing circuit comprising a control circuit, a storage circuit, and an operation circuit, wherein:
the control circuit is configured to control the storage circuit and the operation circuit to perform N-dimensional convolution processing on input data and a convolution kernel, where N > 1 and N represents the number of convolution dimensions over which sliding accumulation is performed in the convolution operation, and the input data is sparse data represented in a compact form;
the storage circuit is configured to store information, the information comprising at least information before, during, and/or after processing; and
the operation circuit is configured to perform, under the control of the control circuit, a plurality of one-dimensional convolution operations on the input data and the convolution kernel to obtain multiple paths of operation results and the output point coordinates in a corresponding first convolution dimension; and to merge the multiple paths of operation results into one path of fused data, as the result of the convolution operation, according to the corresponding output point coordinates, wherein operation results with the same output point coordinates are accumulated.
In a second aspect, the present disclosure provides a chip comprising the data processing circuit of any of the embodiments of the first aspect.
In a third aspect, the present disclosure provides a board card comprising the chip of any of the embodiments of the second aspect.
In a fourth aspect, the present disclosure provides a method of processing data using the aforementioned data processing circuit.
With the data processing circuit, the method of processing data using the data processing circuit, the chip, and the board card provided above, the embodiments of the present disclosure provide a convolution scheme suitable for sparse data, which greatly reduces the amount of computation and improves processing efficiency by computing only non-zero/non-null data with the convolution kernel. The sparse convolution scheme provided by the disclosed embodiments is applicable to multidimensional convolution operations, including but not limited to two-dimensional and three-dimensional convolution, and is thus suitable for processing LiDAR point cloud data.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which like reference numerals refer to similar or corresponding parts and in which:
fig. 1 shows a block diagram of a board card of an embodiment of the present disclosure;
FIG. 2 shows a block diagram of a combined processing device of an embodiment of the disclosure;
FIG. 3a illustrates an internal schematic diagram of a single core computing device according to an embodiment of the disclosure;
FIG. 3b illustrates a schematic diagram of the internal structure of a multi-core computing device according to an embodiment of the disclosure;
FIG. 4a illustrates the principle of operation of a conventional convolution scheme;
FIG. 4b illustrates an exemplary principle of a sparse convolution operation scheme according to an embodiment of the present disclosure;
FIG. 5 illustrates an exemplary representation of input data according to embodiments of the present disclosure;
FIG. 6 illustrates an exemplary process of a sparse convolution scheme according to an embodiment of the present disclosure;
FIG. 7 illustrates an example of splitting of an input data block in accordance with an embodiment of the present disclosure;
FIG. 8 illustrates a flowchart of an exemplary method of screening valid input data points, according to an embodiment of the present disclosure;
FIG. 9 illustrates a schematic diagram of a scan traversal of a third input parameter in accordance with an embodiment of the present disclosure;
FIG. 10 shows a schematic diagram of constructing a Q matrix according to an embodiment of the present disclosure;
FIG. 11 illustrates exemplary logic for computing wo coordinates of output data in accordance with embodiments of the present disclosure; and
FIG. 12 shows a schematic block diagram of a data processing circuit according to an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are some, not all embodiments of the present disclosure. All other embodiments, which can be derived by one skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the scope of protection of the present disclosure.
It should be understood that the terms "first," "second," "third," and "fourth," etc. in the claims, description, and drawings of the present disclosure are used to distinguish between different objects and are not used to describe a particular order. The terms "comprises" and "comprising," when used in the specification and claims of this disclosure, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the disclosure is for the purpose of describing particular embodiments only, and is not intended to be limiting of the disclosure. As used in the specification and claims of this disclosure, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the specification and claims of this disclosure refers to any and all possible combinations of one or more of the associated listed items and includes such combinations.
As used in this specification and claims, the term "if" may be interpreted contextually as "when", "upon" or "in response to a determination" or "in response to a detection".
Specific embodiments of the present disclosure are described in detail below with reference to the accompanying drawings.
Exemplary hardware Environment
Fig. 1 shows a schematic structural diagram of a board card 10 according to an embodiment of the disclosure. As shown in fig. 1, the board card 10 includes a chip 101, which is a System-on-Chip (SoC) integrated with one or more combined processing devices. A combined processing device is an artificial intelligence computing unit that supports various deep learning and machine learning algorithms and meets intelligent processing requirements in complex scenarios in fields such as computer vision, speech, natural language processing, and data mining. Deep learning technology in particular is widely applied in the field of cloud intelligence; one notable characteristic of cloud intelligence applications is the large input data size, which places high demands on the storage and computing capacity of the platform.
The chip 101 is connected to an external device 103 through an external interface device 102. The external device 103 is, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card, or a Wi-Fi interface. The data to be processed may be transferred by the external device 103 to the chip 101 through the external interface device 102, and the calculation result of the chip 101 may be transmitted back to the external device 103 via the external interface device 102. The external interface device 102 may take different interface forms, such as a PCIe interface, according to the application scenario.
The board card 10 also includes a storage device 104 for storing data, which includes one or more storage units 105. The storage device 104 is connected to, and transfers data with, the control device 106 and the chip 101 through a bus. The control device 106 in the board card 10 is configured to regulate the state of the chip 101. To this end, in one application scenario, the control device 106 may include a micro controller unit (MCU).
Fig. 2 is a structural diagram showing a combined processing device in the chip 101 of this embodiment. As shown in fig. 2, the combination processing device 20 includes a computing device 201, an interface device 202, a processing device 203, and a storage device 204.
The computing device 201 is configured to perform user-specified operations, mainly implemented as a single-core smart processor or a multi-core smart processor, to perform deep learning or machine learning computations, which may interact with the processing device 203 through the interface device 202 to collectively perform the user-specified operations.
The interface device 202 is used for transmitting data and control instructions between the computing device 201 and the processing device 203. For example, the computing device 201 may obtain input data from the processing device 203 via the interface device 202 and write it to an on-chip storage device of the computing device 201. Further, the computing device 201 may obtain control instructions from the processing device 203 via the interface device 202 and write them into a control cache on the computing device 201. Alternatively or additionally, the interface device 202 may read data from a storage device of the computing device 201 and transmit it to the processing device 203.
The processing device 203, as a general-purpose processing device, performs basic control including, but not limited to, data transfer and the starting and/or stopping of the computing device 201. Depending on the implementation, the processing device 203 may be one or more types of Central Processing Unit (CPU), Graphics Processing Unit (GPU), or other general-purpose and/or special-purpose processor, including but not limited to a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic devices, discrete hardware components, etc., and the number thereof may be determined according to actual needs. As previously mentioned, the computing device 201 of the present disclosure alone may be viewed as having a single-core structure or a homogeneous multi-core structure; however, when considered together, the computing device 201 and the processing device 203 form a heterogeneous multi-core structure.
The storage device 204 is used to store data to be processed. It may be a DRAM or DDR memory, is typically 16 GB or larger in capacity, and stores data of the computing device 201 and/or the processing device 203.
Fig. 3a shows the internal structure of a processing core when the computing device 201 is a single-core device. The computing device 301 is used for processing input data from fields such as computer vision, speech, natural language, and data mining, and includes three major modules: a control module 31, an operation module 32, and a storage module 33.
The control module 31 is used for coordinating and controlling the operations of the operation module 32 and the storage module 33 to complete the task of deep learning, and includes an Instruction Fetch Unit (IFU) 311 and an Instruction Decode Unit (IDU) 312. The instruction fetch unit 311 is used for obtaining an instruction from the processing device 203, and the instruction decode unit 312 decodes the obtained instruction and sends the decoded result to the operation module 32 and the storage module 33 as control information.
The operation module 32 includes a vector operation unit 321 and a matrix operation unit 322. The vector operation unit 321 is used for performing vector operations, and can support complex operations such as vector multiplication, addition, nonlinear transformation, and the like; the matrix operation unit 322 is responsible for the core calculation of the deep learning algorithm, i.e., matrix multiplication and convolution.
The storage module 33 is used to store or transfer related data, and includes a neuron storage unit (neuron RAM, NRAM) 331, a weight storage unit (weight RAM, WRAM) 332, and a Direct Memory Access unit (DMA) 333. NRAM 331 stores input neurons, output neurons, and intermediate results after computation; WRAM 332 stores the convolution kernels, i.e. the weights, of the deep learning network; DMA 333 is connected to the DRAM 204 via the bus 34 and is responsible for data transfer between the computing device 301 and the DRAM 204.
Fig. 3b shows a simplified schematic diagram of the internal structure of a multi-core computing device 201. The multi-core computing device may be abstracted with a hierarchical hardware model. As shown, it may be abstracted into four levels: the board level (Card) 350, the chip level (Chip) 360, the processor cluster level (Cluster) 370, and the processor core level (Core) 380. The embodiments of the present disclosure mainly concern the data transfer and compute unit portions of the storage hierarchy, so the drawing and description only briefly show the related compute structures and omit the other parts.
At the board level, each board contains local DDR storage, and each processor chip serves as a compute and control unit.
At the chip level, each processor chip contains multiple processors as computing units.
At the computing cluster level, each multiprocessor comprises multiple accelerator cores as control and computing units, plus a shared SRAM as a storage unit.
At the processor core level, each accelerator core includes local storage and an array of local processing units. The NFU (Neuron Function Unit) performs the convolution calculations.
In the multi-core computing device, the storage model includes the board global memory, the SRAM (shared memory) on each Cluster, and the NRAM, WRAM, registers, etc. on each Core. For better performance, the data movement between the storage levels below the Card, and the balance between memory access and computation, can be explicitly controlled. The SRAM is included in a memory processing unit (Memory Process Unit Core, abbreviated MPU or Mem Core). Core refers to an intelligent processing core (IPU Core, or Core for short) in the multi-core computing device; one IPU Core contains NRAM, WRAM, an NFU, and so on. Cluster refers to a processor cluster or computing cluster; generally, a multi-core computing device comprises several Clusters, and one Cluster comprises one Mem Core plus N IPU Cores.
Exemplary principles of convolution operation
Embodiments of the present disclosure provide a data processing circuit supporting convolution operations of sparse data based on the aforementioned hardware environment. By providing an optimized convolution scheme, convolution processing associated with sparse data, such as LiDAR point cloud data, may be simplified and accelerated. The sparse convolution scheme provided by the embodiments of the present disclosure may be applicable to multidimensional convolution operations including, but not limited to, two-dimensional convolution and three-dimensional convolution. For simplicity and ease of understanding, in some embodiments two-dimensional convolution is used as an example for illustration.
In the term "N-dimensional convolution" as used in the embodiments of the present disclosure, N represents the number of convolution dimensions over which sliding accumulation is performed in the convolution operation. For example, when N is 2, the convolution kernel performs sliding accumulation in two dimensions (e.g., width W and height H) according to the corresponding convolution strides; when N is 3, the convolution kernel performs sliding accumulation in three dimensions (e.g., width W, height H, and depth D) according to the corresponding convolution strides. A "non-convolution dimension", in the embodiments of the present disclosure, refers to a dimension over which the convolution kernel does not slide and accumulate. Different non-convolution dimensions may require different operations. For example, for conventional convolution, the input channel dimension Ci requires accumulation while the output channel dimension Co does not; for depth-wise convolution (depth-wise conv), the input channel dimension Ci is not accumulated.
In order to more clearly understand the convolution scheme of the embodiment of the present disclosure, the operation principle of the conventional convolution scheme is described by taking two-dimensional convolution as an example.
Fig. 4a shows the operational principle of a conventional convolution scheme. In this example, the convolution kernel 410 is a dense 3 × 3 matrix, and the numbers in the kernel are the corresponding weight data. The input data 420 is a 7 × 7 matrix, which is sparse with only four non-zero values: 2, 3, 5, and 6, as indicated by the dark squares. In this exemplary convolution, the stride in both dimensions is 2, the padding is 0, and there is no dilation. The grey 3 × 3 squares in the figure represent the sliding accumulation of the convolution kernel over the input data: 430 shows the calculation at the start of the convolution, 440 the calculation after sliding once to the right (stride 2), and 450 the calculation after sliding once downward (stride 2). In each step, the weight data of the convolution kernel and the input data are multiplied element by element at aligned positions and accumulated. 460 is the final calculation result, i.e. the output data, a 3 × 3 matrix. It can be seen that the calculation at 430 corresponds to the value at coordinate (0,0) of the output data, the calculation at 440 to coordinate (0,1), and the calculation at 450 to coordinate (1,0).
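The sliding accumulation just described can be sketched directly in NumPy (an illustrative sketch, not the circuit implementation; the coordinates of the four non-zero points are hypothetical, since fig. 4a is not reproduced here):

```python
import numpy as np

def conv2d(x, k, stride=2):
    """Direct 2-D convolution: no padding, no dilation."""
    kh, kw = k.shape
    oh = (x.shape[0] - kh) // stride + 1
    ow = (x.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # multiply the kernel against the aligned window, then accumulate
            win = x[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(win * k)
    return out

k = np.arange(1, 10, dtype=float).reshape(3, 3)   # dense 3x3 kernel
x = np.zeros((7, 7))
x[0, 1], x[1, 3], x[4, 2], x[5, 5] = 2, 3, 5, 6   # hypothetical sparse points
y = conv2d(x, k)
print(y.shape)  # (3, 3), as in the 3x3 output data 460
```

A 7 × 7 input with a 3 × 3 kernel and stride 2 indeed yields the 3 × 3 output of the example: (7 − 3) / 2 + 1 = 3 in each dimension.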
In the sparse convolution operation of the disclosed embodiments, the convolution kernel is dense and its input format may be the same as in conventional convolution, while the input data is sparse and its input format can differ from that of conventional convolution input data, thereby saving storage space.
As can be seen from the description of fig. 4a, the final result of the sparse convolution operation is only related to the operation result of the non-zero input data elements, and therefore, the multiply-add operation with the convolution kernel can be performed only for these non-zero input data elements, thereby reducing the invalid operations.
Further, as can be seen from the output data 460 of fig. 4a, these multiply-add operations can be split into partial sums produced by one-dimensional convolutions over the rows, followed by column-wise accumulation. In other words, an N-dimensional convolution operation can be implemented by splitting it into a plurality of one-dimensional convolution operations.
FIG. 4b illustrates an exemplary principle of a sparse convolution scheme according to an embodiment of the present disclosure. FIG. 4b depicts the sparse convolution operation scheme of an embodiment of the present disclosure, again taking the data of FIG. 4a as an example.
As shown in fig. 4b, the whole convolution operation can be divided into the convolution operation results calculated row by row. In this example, one row of operation results 460 corresponds to three rows of input data 420, for example, the first three rows of input data may be used to calculate the first row of output data, the middle three rows (overlapping the first three rows and the last three rows) of input data may be used to calculate the second row of output data, and the last three rows of input data may be used to calculate the last row of output data.
As can be seen from this splitting of the operation, the sparse data can be screened at the granularity of the input data required for one row of operation results (three rows of input data in the figure), referred to herein as the "first screening granularity", so as to reduce invalid operations. For example, in the example of fig. 4b, the middle three rows of input data are all 0, i.e., there are no non-sparse points, so the convolution operations over those three rows may be omitted.
Further, in the calculation of each row of output results, the convolution over two dimensions (the H and W dimensions) can be split into three one-dimensional convolutions (over the W dimension).
As shown in the figure, one row of output data was originally obtained by sliding a 3 × 3 convolution window (shown by the black box) over three rows of input data. This can be converted into sliding a 1 × 3 convolution window (shown by the dashed box) over each of the three rows of input data respectively (470) to obtain three rows of partial sums (480), which are then accumulated at aligned positions to obtain the final row of output data (460).
As can be seen from this splitting of the convolution operation, since the two-dimensional convolution is decomposed into a plurality of one-dimensional convolutions, the sparse data can be screened with a single row as the granularity (referred to herein as the "second screening granularity") during the convolution operation, thereby reducing invalid operations. For example, in the example of fig. 4b, the third row of input data is all 0, i.e., there are no non-sparse points, so the convolution operation for that row may be omitted.
Further, in each one-dimensional convolution operation, the sparse data may also be screened at the granularity of the one-dimensional convolution window (referred to herein as the "third screening granularity"), thereby further reducing invalid operations. For example, in the example of fig. 4b, there are no non-sparse points within the last two one-dimensional convolution windows of the first row of input data, so the operations for these two windows can be omitted; likewise, the last one-dimensional convolution window of the second row of input data contains no non-sparse points, so its operation can be omitted.
Therefore, through three screening granularities with different sizes, sparse data can be filtered quickly and effectively, and invalid operation is reduced as much as possible.
Accordingly, in the sparse convolution scheme of the embodiments of the present disclosure, the sparse convolution operation may include the steps of: performing a plurality of one-dimensional convolution operations on the convolution kernels and sparse input data to obtain a multi-path operation result (such as a product result or a multiply-add result, considering that some dimensions of the data need to be accumulated, such as an input channel dimension Ci) and output point coordinates on a corresponding first convolution dimension; and then merging the multi-path operation results into a path of fusion data according to the corresponding output point coordinates, and using the fusion data as the result of the sparse convolution operation. In the merging process, the operation results with the same output point coordinate are accumulated.
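The steps above can be sketched in NumPy as follows (an illustrative sketch under the simplifying assumption of a single channel, not the circuit implementation). Each output row is produced by kh = 3 one-dimensional convolutions whose partial results are merged by output coordinate; all-zero rows and all-zero windows are skipped, corresponding to the second and third screening granularities:

```python
import numpy as np

def conv2d_by_1d(x, k, stride=2):
    """2-D convolution decomposed into row-wise 1-D convolutions."""
    kh, kw = k.shape
    oh = (x.shape[0] - kh) // stride + 1
    ow = (x.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):                      # one output row at a time
        for r in range(kh):                  # kh 1-D convolutions per output row
            row = x[i * stride + r]
            if not row.any():                # skip all-zero rows (2nd granularity)
                continue
            for j in range(ow):              # slide the 1-D window along W
                win = row[j * stride:j * stride + kw]
                if not win.any():            # skip all-zero windows (3rd granularity)
                    continue
                out[i, j] += np.dot(win, k[r])  # merge/accumulate by output coordinate
    return out

x = np.zeros((7, 7))
x[0, 1], x[1, 3], x[4, 2], x[5, 5] = 2, 3, 5, 6   # hypothetical sparse points
print(conv2d_by_1d(x, np.ones((3, 3))).shape)     # (3, 3)
```

Because the skipped rows and windows contribute only zeros, the merged result is identical to the direct two-dimensional convolution.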
It will be appreciated that while the exemplary principles of the sparse convolution operation scheme of the disclosed embodiments have been illustrated above in terms of two-dimensional convolution, the above scheme may also be applied to three-dimensional convolution operations and even higher-dimensional convolution operations.
In accordance with the principles described above, when applied to an N-dimensional convolution operation with N > 1, the N-dimensional convolution can be split into M one-dimensional convolution operations, where M equals the product of the sizes of the convolution kernel in the N-1 convolution dimensions other than the first convolution dimension (the dimension of the one-dimensional convolution). For example, in the two-dimensional convolution example above, where the convolution kernel size is kx × ky and the first convolution dimension is W (kx), M = ky; in the three-dimensional convolution example, where the convolution kernel size is kx × ky × kz and the first convolution dimension is W (kx), M = ky × kz.
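As a quick arithmetic check of this split count (illustrative only):

```python
from math import prod

def split_count(kernel_shape):
    """Number of 1-D convolutions an N-D convolution splits into:
    the product of the kernel sizes over the N-1 dimensions other
    than the first convolution dimension (kernel_shape[0])."""
    return prod(kernel_shape[1:])

print(split_count((3, 3)))     # 2-D, kx x ky = 3 x 3  -> M = ky = 3
print(split_count((3, 3, 3)))  # 3-D, 3 x 3 x 3        -> M = ky * kz = 9
```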
The operation of implementing the sparse convolution scheme of the disclosed embodiments is described in detail below.
Exemplary Structure of input and output data
A sparse convolution operation involves input data (also referred to as input neurons in a neural network), convolution kernels, and output data (referred to as output neurons).
In convolution operations on LiDAR point cloud data, the convolution kernel is dense, and its input format may be the same as in conventional convolution. In two-dimensional convolution, the size of the convolution kernel is typically 3 × 3, and a single convolution requires the cumulative summation of 3 × 3 × Ci numbers; in three-dimensional convolution, the size of the convolution kernel is typically 3 × 3 × 3, and a single convolution requires the cumulative summation of 3 × 3 × 3 × Ci numbers. The convolution stride is typically 2 in each dimension, e.g. SH = SW = SD = 2.
The input data to be convolved may comprise multidimensional data and is sparse in multiple dimensions. For example, in LiDAR data-based target detection, the input data is detection data within a three-dimensional space that characterizes, for example, grayscale values, RGB, signal strength, etc. for each three-dimensional space coordinate point, so the input data element at each coordinate point may be one-dimensional, two-dimensional, three-dimensional, or higher-dimensional data, depending on the information content it is to characterize. Due to the nature of the point cloud data, coordinate points with non-zero value data elements are sparse, i.e., they are sparse in three spatial dimensions (e.g., width W, height H, and depth D).
Depending on the initial state of the input data, pre-processing may be performed before the sparse input data is provided to the arithmetic circuitry for operation. In some embodiments, such pre-processing may include, for example: merging the sparse dimensions into one dimension; densifying sparse data points in the input data in a consolidated dimension; and using a number of input parameters to represent the values and coordinates of the input data after densification.
Fig. 5 shows an exemplary representation of input data (sparse neurons). The input data may be data after the padding process is completed according to the requirement of the convolution operation. In this example, the sparse input data may be converted to a dense input data representation by a preprocessing operator, thereby saving storage space.
As shown, the sparse-form input data 510 includes five dimensions: a batch (B) dimension, the HWD three-dimensional space dimensions, and an input channel (Ci) dimension. The input data is sparse in the B dimension and in the HWD three-dimensional space; the dark squares in the HWD stereo matrix in the figure represent positions that hold values (called valid input points), and all other positions are zero. There are multiple such HWD stereo matrices along the B dimension, and the sparse pattern (i.e., the locations of the dark squares) may differ from matrix to matrix. The input data is dense in the input channel Ci dimension, which is the lowest dimension. Due to the limited expressive power of the figure, only four dimensions are drawn at 510, but the Ci dimension can be understood as the thickness of each dark square. The Ci-dimension size is uniform, i.e., every dark square has the same thickness.
In some embodiments, the sparse form of input data 510 may be represented with reference to the CSR format in a sparse matrix when converted into dense input data.
In the storage of the sparse matrix, only non-zero element values (sometimes referred to as valid element values) are stored for the purpose of compression, but the positions of the non-zero elements are also reserved for recovery. Thus, the storage of the sparse matrix stores not only the non-zero element values, but also its coordinate positions (row index, column index).
CSR stands for compressed sparse row format. The CSR approach uses three arrays to store a sparse matrix, holding the row pointers, column indices, and values, respectively. The length of the column index array and of the value array is the number of non-zero elements in the sparse matrix. The row pointer array stores the offset of the first non-zero element of each row from the first non-zero element of the sparse matrix, and its last element stores the total number of non-zero elements in the sparse matrix, so the length of the row pointer array is the number of rows of the sparse matrix plus 1. It will be appreciated that, by the definition of the row pointer array, subtracting a row's pointer value from the next row's pointer value yields the number of non-zero elements in that row.
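A minimal sketch of building the three CSR arrays for a small example matrix (illustrative only; the matrix values are made up):

```python
import numpy as np

a = np.array([[0, 5, 0, 0],
              [0, 0, 0, 0],
              [7, 0, 0, 2]])

values, col_idx, row_ptr = [], [], [0]
for row in a:
    nz = np.nonzero(row)[0]
    col_idx.extend(nz.tolist())
    values.extend(row[nz].tolist())
    row_ptr.append(len(values))       # offset of the next row's first non-zero

print(values)    # [5, 7, 2]
print(col_idx)   # [1, 0, 3]
print(row_ptr)   # [0, 1, 1, 3] -- length = rows + 1; last entry = total non-zeros
# Non-zeros in row i: row_ptr[i + 1] - row_ptr[i]
```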
Similarly, in embodiments of the present disclosure, three input parameters may be used to represent the sparse-form input data 510 when converting it into dense input data.
The first input parameter is the valid-input dense data, i.e., the compactly arranged sparse data, denoted Min. The shape of Min is ain × ci, where ain is the number of non-sparse points in the input data and ci is the size of the input channel dimension.
During preprocessing, the four sparse dimensions of the input data (the B dimension and the HWD three-dimensional dimensions) may be merged into one dimension ain, and the sparse data points (the dark squares in the figure) are densified in the merged dimension to form densified input data elements. That is, each HWD stereo matrix in the B dimension undergoes the same dimension merging and densification, resulting in the preprocessed dense-form input data 521, a two-dimensional matrix whose lower dimension is Ci and whose higher dimension is the merged BHWD dimension ain. In this example, ain = 4.
In some embodiments, for the case where there are multiple batches, a batch-by-batch processing method may be adopted, splitting the batches across different processor cores (e.g., the cores in fig. 3b) for processing. For example, SWIFT has 12 batches, so based on the hardware environment illustrated in FIG. 3b, up to 12 batches can be processed in parallel at a time. In these embodiments, for each batch, only the valid input points of the three sparse dimensions of HWD need to be merged, where ain corresponds to the number of valid input points in the three HWD dimensions.
The second input parameter is the coordinate (index) of each valid input point in the W dimension, denoted wi_coord, whose shape is 1 × ain. As shown at 522, for the 4 valid input points in the figure, wi_coord = [1, 2, 0, 6].
The third input parameter is the input data's CSR (compressed sparse row) format data along the H dimension (H direction), denoted hin. hin stores the offset of the first non-zero element of each row from the first non-zero element of the input data, and its last element stores the total number of non-zero elements in the input data. Depending on the number of dimensions of the input data, the third input parameter may itself have multiple dimensions. For example, when the input data is the five-dimensional data illustrated in fig. 5, the dimensions of the third input parameter are, from high to low: batch (B), depth in_D, and height in_H, with shape B × Din × (Hin + 1), where B, Din, and Hin are the B-dimension, D-dimension, and H-dimension sizes, respectively, of the input data in sparse form.
Without loss of generality, the shape of the third input parameter hin may be represented as X × (Hin + 1), where X represents the product of the sizes of any dimensions of the input data other than H, W, and Ci. For example, when the input data is three-dimensional data comprising H, W, and Ci, X does not exist or X = 1; when the input data is four-dimensional data comprising D, H, W, and Ci, X = D.
As shown at 523, hin in the example is [0, 1, 2, 2, 2, 2, 2, 4] for B = 0 and di = 0. Specifically, the first element is "0": the offset of the first non-zero element of the row hi = 0 from the first non-zero element of the input data is 0, because it is the first non-zero element itself. The second element "1" indicates that the first non-zero element of the row hi = 1 is offset by 1 from the first non-zero element of the input data, which equals the number of non-zero elements in the row hi = 0. The third element "2" indicates that the first non-zero element (if any) of the row hi = 2 is offset by 2, which equals the number of non-zero elements in the first two rows hi = 0 and hi = 1. The fourth element "2" indicates that the first non-zero element (if any) of the row hi = 3 is offset by 2, which equals the number of non-zero elements in the first three rows hi = 0–2. By analogy, the last element "4" represents the total number of all non-zero elements.
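The wi_coord and hin parameters of the Fig. 5 example can be reconstructed as follows (a sketch assuming a 7 × 7 H × W slice with the four valid points at the stated coordinates; the Ci dimension is omitted for brevity):

```python
import numpy as np

x = np.zeros((7, 7), dtype=int)
x[0, 1] = x[1, 2] = x[6, 0] = x[6, 6] = 1   # the 4 valid input points

wi_coord, hin = [], [0]
for row in x:
    nz = np.nonzero(row)[0]
    wi_coord.extend(nz.tolist())            # W-dim coordinate of each valid point
    hin.append(hin[-1] + len(nz))           # running offset = CSR row pointers

print(wi_coord)  # [1, 2, 0, 6]
print(hin)       # [0, 1, 2, 2, 2, 2, 2, 4]
```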
Similarly, in the embodiments of the present disclosure, since the output data of the convolution operation is also sparse, the output data can be represented using three output parameters as well.
The first output parameter is the valid-output dense data, i.e., the compactly arranged sparse data, denoted Mout, whose shape is aout × co, where aout is the number of non-sparse points in the output data and co is the size of the output channel dimension.
The second output parameter is the coordinate (index) of each valid output point in the W dimension, denoted wo_coord, whose shape is 1 × aout.
The third output parameter is the output data's CSR (compressed sparse row) format data in the H direction, denoted hout. hout stores the offset of the first non-zero element of each row from the first non-zero element of the output data, and its last element stores the total number of non-zero elements in the output data. Like the third input parameter, the third output parameter may also have multiple dimensions, for example, from high to low: batch (B), depth out_D, and height out_H, with shape B × Dout × (Hout + 1), where B, Dout, and Hout are the B-dimension, D-dimension, and H-dimension sizes, respectively, of the output data in sparse form.
For example, continuing with the example in fig. 5, assuming that the convolution kernel is 3 × 3 and the convolution step is 2, the output data includes 4 valid output points, whose corresponding W-dimension coordinates wo_coord 532 are [0, 1, 0, 2], and whose corresponding H-direction CSR data hout 533 is [0, 2, 2, 4]; the first output parameter is not shown in the figure.
Exemplary sparse convolution operation procedure
As can be seen from the data structure of the input and output data, given the convolution operation rules (e.g., convolution kernel size, convolution step size, etc.), the coordinates of the valid output points in the output data can be determined from the coordinates of the input data. Therefore, in the sparse convolution operation scheme of the disclosed embodiment, the operation process can be decomposed into several steps of coordinate calculation and numerical calculation.
FIG. 6 illustrates an exemplary process of a sparse convolution scheme according to an embodiment of the present disclosure.
As shown, in step 610, data that each processor core needs to process each time is first screened out based on the coordinates of the input data. The input data here is data in CSR format that has undergone zero padding and sparse to dense conversion.
In the case of performing the sparse convolution operation on the multi-core computing apparatus shown in fig. 3b, the data needs to be split in view of the limited storage space of the memory (e.g., the SRAM in fig. 3b). In some embodiments, the splitting may be performed according to the dimensions of the output data; as known from the convolution principle described above in connection with fig. 4b, the convolution result can be calculated row by row. Therefore, the splitting may be: each processor core computes one row of Wo-dimension output data points at a time. Accordingly, the shape of the input data block corresponding to one row of Wo-dimension output data points is kz × ky × wi × ci, where wi is the W-dimension size of the input data, ci is the Ci-dimension size of the input data, and the convolution kernel sizes in the W, H, and D dimensions are kx, ky, and kz, respectively.
FIG. 7 illustrates an example of splitting an input data block: after the convolution accumulation operation is performed on the gray input data block, it corresponds to exactly one row of Wo-dimension output data points.
To extract the valid input data points needed to compute one row of Wo-dimension output data points, taking the sparsity of the data into account, in some embodiments the filtering may be performed according to the third input parameter of the input data, namely hin. From the meaning of hin, the number of non-zero elements (valid input data points) in the ith row equals the (i+1)th value minus the ith value. The screening can therefore be performed by determining, based on this property of hin, whether valid input data points exist.
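Using the hin of the Fig. 5 example, the per-row counts follow directly from adjacent differences:

```python
import numpy as np

hin = np.array([0, 1, 2, 2, 2, 2, 2, 4])   # Fig. 5 example
row_counts = np.diff(hin)                  # hin[i+1] - hin[i] per H row
print(row_counts.tolist())  # [1, 1, 0, 0, 0, 0, 2] -- valid input points per row
```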
FIG. 8 illustrates a flowchart of an exemplary method of screening valid input data points, according to an embodiment of the present disclosure.
As shown, in step 811, the third input parameter is first loaded from the external storage circuit into the on-chip memory (e.g., SRAM). The storage space required by the third input parameter is (Hin + 1 + 2ph) × (Din + 2pd) × dwidth, where Din and Hin are respectively the D-dimension and H-dimension sizes of the input data in sparse form, ph and pd are the single-side padding amounts in the H and D dimensions, and dwidth is the data bit width. The shape of the third input parameter may be a two-dimensional matrix of (Din + 2pd) × (Hin + 1 + 2ph), where batch B is assumed to be 1, since processing is performed batch by batch.
Next, at step 812, the third input parameter is traversed with a specified scanning window (the first scanning window) and a specified scanning step (the first scanning step) to find valid input data points.
The size of the first scanning window corresponds to the input data required to compute one row of Wo-dimension output data points, i.e., to the aforementioned "first screening granularity", and is determined by the size of the convolution kernel. Specifically, the first scanning window size is kz × (ky + 1): kz corresponds to the window size in the D dimension, and ky + 1 to the window size in the H dimension, because the third input parameter uses the CSR format, in which ky rows of H-dimension data require ky + 1 values to represent. The first scanning step equals the H-dimension convolution step Sy.
During the scanning traversal, it is detected whether the data of each row of the third input parameter changes; if any row changes, a valid input data point exists. A scanning window in which a valid input data point is detected may be referred to as a block (range block). After a block is detected, its corresponding hi and di coordinates can be recorded, so that the corresponding ho and do coordinates in the output data can be calculated from them.
After N_IPU blocks have been detected in succession, at step 813 the found blocks can be sent to the processor cores (IPUs), e.g., the N_IPU blocks are sent to N_IPU different IPUs. At the same time, each IPU can be informed of the H- and D-dimension coordinates, i.e., the ho and do coordinates, of the output points it processes in the output data. It will be appreciated that when each IPU calculates the values of the output points (wo_x, ho, do), wo_x varies from 0 to (wo - 1), while ho and do are fixed.
Fig. 9 shows a schematic diagram of a scan traversal of the third input parameter. It will be appreciated that only a portion of the third input parameter is shown, namely the first 3 rows (di = 0–2) of the hin data 910. The scanning window size in this example is 3 × 4, the scanning step is 2, and the scan proceeds sequentially along the hin direction. It can be seen that, in the first scanning window 901, corresponding to output point coordinate ho = 0, all 3 rows of hin data change, i.e., valid input data points exist therein, so the scanning window 901 is recorded as block 0, and its corresponding ho = 0 and do = 0 can be calculated. Moving 2 positions to the right, in scanning window 902 the 3 rows of hin data do not change, i.e., no valid input data point exists, so no processing is needed and the next scan can continue. In this way, 4 blocks can be detected in succession, each sent to one of 4 different IPUs, i.e., N_IPU = 4 in this example. The do coordinates of the 4 blocks are all 0, and their ho coordinates are 0, 2, 3, and 4, respectively.
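The block detection of step 812 can be sketched as below. This is a hypothetical helper under assumed parameters ky = 3 and Sy = 2, run on made-up hin data (the figure's data 910 is not reproduced here); hin_2d holds one row of Hin + 1 CSR offsets per di, a window of ky H-rows spans ky + 1 offsets, and a window contains a valid input point iff the offsets increase inside it:

```python
import numpy as np

def find_blocks(hin_2d, ky=3, sy=2):
    """Return the ho index of each H-dim scan window containing a valid point."""
    blocks = []
    n_offsets = hin_2d.shape[1]                # Hin + 1 CSR offsets per di row
    for ho, h in enumerate(range(0, n_offsets - ky, sy)):
        win = hin_2d[:, h:h + ky + 1]          # ky H-rows need ky + 1 offsets
        if np.any(win[:, -1] > win[:, 0]):     # some row gained a non-zero
            blocks.append(ho)
    return blocks

hin_2d = np.array([[0, 1, 1, 1, 1, 1, 2, 2, 2],    # di = 0 (made-up offsets)
                   [0, 0, 0, 0, 0, 0, 0, 0, 0]])   # di = 1 (empty)
print(find_blocks(hin_2d))  # [0, 2] -- windows ho = 0 and ho = 2 hold valid points
```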
Continuing with fig. 6, in step 620, each IPU assigned to the data to be processed (indicated by the aforementioned tile) may retrieve the corresponding data and construct a matrix to be convolved, hereinafter referred to as Q matrix, according to the indication of the tile, while calculating the output point coordinate information. For example, the Q matrix is constructed by fetching from a shared memory SRAM to which input data is loaded in advance from an external memory circuit.
In some embodiments, the wi_coord vectors of the valid input data points in the input data block corresponding to the allocated block may first be fetched from the second input parameter wi_coord. Then, according to the wi_coord vectors, the input data corresponding to each vector is traversed with a second scanning window and a second scanning step to extract the corresponding input data points from the valid-input dense data Min (i.e., the first input parameter) to construct the Q matrix.
As previously described, the block is taken from the third input parameter hin, which records the offset of each row's first valid input data point from a designated point (the first valid input data point of the entire input data); based on this information, a designated amount of data can be fetched from the designated location of the second input parameter wi_coord to form the corresponding wi_coord vector. Here a wi_coord vector means the vector of wi coordinates of all valid input data points in an entire W-dimension row of the input data. For example, if a row has 34 valid input data points in the W dimension, the vector length is 34, and each vector element is the wi coordinate of the corresponding data point.
The number of wi_coord vectors fetched is kz (D dimension) × ky (H dimension), i.e., corresponding to the data range indicated by the block, and also to the number M of one-dimensional convolution operations after the transformation described above. Thus, the M wi_coord vectors cover the range of one convolution window in the H and D dimensions, while the W dimension spans the whole wi. For example, in the foregoing example, kz = ky = 3, so 9 wi_coord vectors are fetched. The fetched kz × ky wi_coord vectors can be concatenated in order, and the matrix to be convolved can be constructed on that basis.
Likewise, during construction of the wi_coord vectors, they may be filtered according to the meaning of hin. Since the difference between two consecutive elements of hin gives the number of valid input data points in the corresponding row, the wi_coord vector is empty for a row whose difference is 0, and such empty wi_coord vectors can be filtered out. This screening step corresponds to the "second screening granularity" described previously.
Next, in some embodiments, the matrix Q may be constructed by traversing the input data corresponding to the fetched wi_coord vectors with a second scanning window and a second scanning step, extracting the corresponding input data points from the first input parameter Min.
FIG. 10 shows a schematic diagram of constructing a Q matrix according to an embodiment of the disclosure. For simplicity, only the construction of the Q matrix from 3 H-dimension rows is shown; other rows can be constructed similarly.
As shown, for clarity, 1010 shows the sparse input data, 1020 shows the fetched wi_coord vectors (corresponding to the three rows di = 0, hi = 4–6), 1030 shows the portion of hin of the block assigned to the current IPU (di = 0), and 1040 shows the Q matrix constructed based on the 3 rows di = 0, hi = 4–6.
As previously described, the wi_coord vectors may be constructed based on the information in the third input parameter hin, i.e., by determining whether there are valid input data points in the row currently being scanned. Specifically, the number of valid input data points in the ith row is the difference between the (i+1)th and ith values of hin.
For example, in the example of the figure, according to the information of the block (1030): from 4 - 2 = 2 it can be determined that the row hi = 4 has 2 valid input data points; from 4 - 4 = 0 that the row hi = 5 has no valid input data points; and from 7 - 4 = 3 that the row hi = 6 has 3 valid input data points. Therefore, only the 2 rows hi = 4 and hi = 6 need to be scanned. Thus, in this example, since the row hi = 5 is empty, it has no corresponding wi_coord vector. According to the starting position and count indicated by hin 1030, the corresponding data can be fetched from wi_coord to construct the wi_coord vectors; as shown at 1020, there are only 2 wi_coord vectors.
Then, according to the fetched wi_coord vectors, the input data corresponding to each vector is traversed with a second scanning window and a second scanning step to extract the corresponding valid input data points and construct the matrix Q.
Specifically, scanning is performed row by row to construct the corresponding rows of the Q matrix. In the example in the figure, the row hi = 4 is scanned first, the row hi = 5 is skipped, and then the row hi = 6 is scanned. During scanning, the data covered by each second scanning window in which valid input data points are detected is extracted and tiled in sequence to form the corresponding row of matrix Q, while second scanning windows in which no valid input data points are detected are skipped. It can be understood that, since the sliding scan is performed row by row, the size of the second scanning window corresponds to the size of the convolution window in the first convolution dimension (e.g., the W-dimension size kx), and the second scanning step corresponds to the convolution step Sx in the W dimension. This scanning/screening step corresponds to the "third screening granularity" described previously.
While scanning, the input data is scanned row by row along the W dimension using the second scanning window (of size 1 × 3 in this example) with the W-dimension convolution step Sx (Sx = 2 in this example). When a valid input data point exists within the scanning window, the input data covered by the window is extracted. In this way, the window data extracted each time is unrolled and tiled in sequence along the W dimension to construct the Q matrix.
As shown in fig. 10, the row hi = 4 may be scanned first, as shown by the first scanning window 1001. Since a valid input data point exists in window 1001, the data in the window is extracted to form the first 3 columns of row 1 of the Q matrix 1040. Then, sliding 2 positions to the right, the next scanning window 1002 is scanned; valid input data points also exist in this window, so its data is extracted to form the next 3 columns of row 1 of the Q matrix. Sliding another 2 positions to the right, the next scanning window 1003 also contains a valid input data point, and its data is extracted to form the last 3 columns of row 1 of the Q matrix; the scan of this row then ends. During scanning, if no valid input data point exists in a scanning window, the scan jumps to the next window.
Next, the row hi = 5 is skipped, and the data of the row hi = 6 is scanned. Similarly, the scan uses a 1 × 3 window with a step of 2. The scanning results are: 2 valid input data points in scanning window 1004, 2 valid input data points in scanning window 1005, and no valid input data points in scanning window 1006. The result of extracting the data to construct the Q matrix is shown in row 2 of 1040, which consists of the data covered by scanning windows 1004 and 1005.
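The row-wise window scan just described can be sketched as follows. This is a hypothetical helper (not the patented implementation): instead of tiling actual Min values, it returns, per retained window, the wo index and the wi coordinates the window covers, using the Fig. 10 rows:

```python
def build_q_row(wi_coord, wi, kx=3, sx=2):
    """Slide a kx-wide window with step sx over one W row; keep only windows
    that contain at least one valid input point (given by wi_coord)."""
    kept = []
    for wo, w in enumerate(range(0, wi - kx + 1, sx)):
        pts = [c for c in wi_coord if w <= c < w + kx]
        if pts:                                # window contributes to output wo
            kept.append((wo, pts))
    return kept

print(build_q_row([1, 4], wi=7))      # windows 1001-1003: [(0, [1]), (1, [4]), (2, [4])]
print(build_q_row([0, 2, 3], wi=7))   # windows 1004-1005: [(0, [0, 2]), (1, [2, 3])]
```

Note how the point wi = 4 appears in two consecutive windows, matching the later observation that even wi coordinates fall into two adjacent scanning windows.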
Since the input data is in dense form, when determining whether a valid input data point lies in a scanning window, the determination can be made from the coordinate information of the input data. Specifically, the determination may be based on the block (1030) assigned to the IPU and the constructed wi_coord vectors (1020). It will be appreciated that a scanning window essentially corresponds to (a partial sum of) an output point; therefore, the wo coordinates of the output data points to which a valid input data point contributes can be inferred from its wi coordinate, to determine whether it falls within one or more scanning windows.
FIG. 11 illustrates exemplary logic for computing the wo coordinates of an output data point in accordance with embodiments of the present disclosure. In this example, assuming a convolution kernel size of 3 × 3 × 3, the convolution steps are all 2 in the H, W, and D directions.
As can be seen from the figure, according to the convolution operation rule and the corresponding convolution parameters (convolution kernel size and convolution step), the following mapping from wi_coord to wo_coord exists:
if wi_coord is odd, the mapping is one-to-one: wo_coord = (wi_coord - 1)/2;
if wi_coord is even, the mapping may be one-to-one (for boundary points, where wo_coord is 0 or wo - 1) or one-to-two (for non-boundary points), depending on whether the wi coordinate lies on the boundary. The mapping relationships are wo_coord = wi_coord/2 - 1 and wo_coord = wi_coord/2.
From this mapping, wo_coord can be calculated. For example, referring to the example of fig. 10, from the first wi_coord vector [1, 4], wo_coord = [0, 1, 2] can be calculated, giving 3 valid output data points; from the second wi_coord vector [0, 2, 3], wo_coord = [0, 0, 1, 1] can be calculated, which after deduplication gives 2 valid output data points [0, 1]. During the calculation, the number of wo coordinates in each row, i.e., the number of valid output data points per row, can be counted for subsequent use.
It will be appreciated that depending on the convolution parameters, the mapping may change accordingly. The mapping between the wi coordinates of the input data and the wo coordinates of the output data can be derived by a person skilled in the art from the specific convolution parameters (in particular the convolution kernel kx size and the convolution step Sx).
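Under the kernel-3/stride-2 assumption above, the mapping can be sketched in Python (the helper name and the boundary clamp are illustrative, not from the original):

```python
def wo_from_wi(wi_coord, wo_size):
    """wo coordinate(s) an input point at wi_coord contributes to (kx=3, Sx=2)."""
    if wi_coord % 2 == 1:                        # odd: one-to-one
        return [(wi_coord - 1) // 2]
    cands = [wi_coord // 2 - 1, wi_coord // 2]   # even: up to two outputs
    return [wo for wo in cands if 0 <= wo < wo_size]  # drop out-of-range boundaries

# Fig. 10 example, assuming wo_size = 3:
print(sorted({w for c in [1, 4] for w in wo_from_wi(c, 3)}))      # [0, 1, 2]
print(sorted({w for c in [0, 2, 3] for w in wo_from_wi(c, 3)}))   # [0, 1]
```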
Once the wo coordinates of the output data points to which a valid input data point contributes are deduced from its wi coordinate, it can be determined which one or more second scanning windows it falls into.
For example, in the example of fig. 10, for the first wi_coord vector, it can be inferred that the valid input data point with coordinate 1 falls into the second scanning window with wo = 0, and the valid input data point with coordinate 4 falls into the two second scanning windows with wo = 1 and wo = 2.
In the above example, when valid input data points are extracted to construct the Q matrix, the Q matrix may also be constructed case by case according to whether wi_coord is odd or even. In particular, the specific location of an input data point within a second scanning window may be determined from the parity of its wi coordinate. For example, an input data point with an odd wi coordinate must fall in the middle of a second scanning window, while an input data point with an even wi coordinate falls at two adjacent positions of two adjacent second scanning windows.
Thus, the value of the valid input data point can be read from, for example, a shared memory SRAM, and stored onto an on-chip memory (e.g., NRAM) according to this rule.
Specifically, a valid input data point whose wi coordinate is even needs to be stored in 2 copies, because it falls into two adjacent scanning windows, at the tail of the current scanning window and the head of the next; a valid input data point whose wi coordinate is odd needs only 1 copy, its position being in the middle of the current scanning window.
Thus, by scanning the valid input data points row by row, the Q matrix can be constructed. For each of the M non-empty wi_coord vectors, this process can be performed in turn as described above, whereby the Q matrix is constructed with M rows, each row consisting of Li second scanning windows, where Li depends on the number of output data points of that row, as counted previously when calculating the wo coordinates.
The counted numbers of output data points may be used to calculate the third output parameter hout of the output data. By the definition of the third output parameter hout, the ith value represents the total number of valid output points in the preceding i-1 rows. Therefore, the values of hout can be obtained by accumulating (prefix-summing) the per-row counts of wo coordinates counted previously from wo_coord. For example, in the example of fig. 5, since the output data has 2 wo coordinates in row 0, none in row 1, and 2 in row 2, the corresponding hout is [0, 2, 2, 4].
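The prefix-sum construction of hout can be sketched directly from the per-row counts of the Fig. 5 example:

```python
import numpy as np

counts = [2, 0, 2]                    # valid output points per H row (Fig. 5)
hout = np.concatenate(([0], np.cumsum(counts))).tolist()
print(hout)  # [0, 2, 2, 4] -- leading 0, then running totals; last = total points
```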
Returning to fig. 6, in step 630, after the matrix Q is constructed, M one-dimensional convolution operations may be performed on the matrix Q, where M one-dimensional convolution kernels of the M one-dimensional convolution operations are obtained by splitting the original N-dimensional convolution kernel according to the first convolution dimension, and a convolution step size of the one-dimensional convolution operation is equal to the size of the second scanning window, that is, equal to the size of the N-dimensional convolution kernel in the first convolution dimension. In the disclosed embodiment, the first convolution dimension is the W dimension, so the convolution step size of the one-dimensional convolution operation is equal to kx.
Thus, M partial sums can be obtained from the M one-dimensional convolution operations, all corresponding to the same row wo of output points. Each partial-sum result also determines the wo coordinate corresponding to each partial sum.
Next, in step 640, the M partial-sum streams are merged into one stream of fused data according to their corresponding output data point coordinates, so as to obtain the final result of the corresponding row wo of output data points. During merging, partial sums with the same wo coordinate are accumulated.
The merging process described above can be performed in a number of ways.
In some embodiments, the merge fusion process may be implemented in hardware. In these embodiments, this may be implemented by a hardware instruction, MERGE instruction. The basic function of the MERGE instruction is to MERGE multiple paths of data to be merged into one path of merged data according to the index sequence of the data, and accumulate the data with the same index.
In other embodiments, the merge-fusion process may be implemented in software. In these embodiments, the sorting in the merge-fusion process may be implemented, for example, by an omnidirectional quantization sorting algorithm on the multi-core processor. Then, a bang_add operator is called to traverse the sorted data: when the coordinates are the same, the values are accumulated directly; when they differ, traversal continues without accumulation.
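A software sketch of the merge-fusion step, with plain NumPy standing in for the MERGE instruction / bang_add operator named in the text (the function name and sample data are illustrative): the M partial-sum streams are sorted by wo index, and entries sharing an index are accumulated.

```python
import numpy as np

def merge_partial_sums(indices, values):
    """Merge M streams of (wo index, partial sum) into one, summing duplicates."""
    idx = np.concatenate(indices)
    val = np.concatenate(values)
    order = np.argsort(idx, kind="stable")     # sort by output coordinate
    idx, val = idx[order], val[order]
    uniq, inverse = np.unique(idx, return_inverse=True)
    out = np.zeros(len(uniq))
    np.add.at(out, inverse, val)               # accumulate same-index partial sums
    return uniq, out

uniq, out = merge_partial_sums([np.array([0, 1, 2]), np.array([0, 1])],
                               [np.array([1.0, 2.0, 3.0]), np.array([10.0, 20.0])])
print(uniq.tolist())  # [0, 1, 2]
print(out.tolist())   # [11.0, 22.0, 3.0]
```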
The convolution operation scheme of sparse data of the embodiments of the present disclosure is described above in various aspects. Compared with a conventional convolution scheme, the scheme disclosed by the embodiment of the disclosure only performs operation processing on non-zero/non-null data in sparse data, so that excessive invalid operation can be avoided, the operation amount is greatly saved, and the processing efficiency is improved. Further, by filtering the input data with different levels (e.g., three levels) of filtering granularity, the data required to perform the convolution operation can be extracted quickly and efficiently. The sparse convolution scheme provided by embodiments of the present disclosure may be particularly applicable to LiDAR point cloud data-based processing.
The disclosed embodiments also provide a data processing circuit for performing convolution operation of the sparse data, and a data processing method implemented by the data processing circuit.
FIG. 12 illustrates a schematic block diagram of a data processing circuit in which embodiments of the present disclosure may be implemented. As shown in fig. 12, the data processing circuit 1200 includes a control circuit 1210, a memory circuit 1220, and an arithmetic circuit 1230.
The control circuit 1210 is responsible for handling various functions on the data processing circuit 1200 including, but not limited to, control, instruction fetching, decoding, computing, and the like. The control circuit 1210 may include, for example, the control module 31 of fig. 3.
In some embodiments, the control circuit 1210 may be configured to control the storage circuit 1220 and the operation circuit 1230 to perform an N-dimensional convolution operation on the input data and the convolution kernel, where N > 1 and N represents the number of convolution dimensions along which sliding accumulation is performed in the convolution operation. In this convolution operation, the input data is sparse and is characterized in a dense form.
Further, the control circuit 1210 may be configured to filter the input data blocks allocated to the operation circuit 1230 for the current round according to the input parameters of the input data, wherein the operation circuit 1230 computes one row of output data in the W dimension in each round.
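The block filtering performed on the control side can be sketched as below, under the assumption (consistent with claims 3-5) that the third input parameter is a CSR-style row-pointer array over the H dimension; the function and variable names here are illustrative only.

```python
# Sketch of control-side block filtering: slide a window of kh rows over
# the CSR row-pointer array with step equal to the H-direction stride,
# keeping only windows that contain at least one valid input point.

def filter_blocks(hin_ptr, kh, stride_h):
    """hin_ptr has length Hin+1; a window of kh rows starting at row h
    covers the valid points hin_ptr[h] .. hin_ptr[h+kh]."""
    blocks = []
    for h in range(0, len(hin_ptr) - kh, stride_h):
        start, end = hin_ptr[h], hin_ptr[h + kh]
        if end > start:              # window holds valid input points
            blocks.append((h, start, end))
    return blocks                    # empty windows are simply skipped

# Row pointers for 7 input rows; only rows 2 and 3 hold valid points.
print(filter_blocks([0, 0, 0, 3, 5, 5, 5, 5], kh=3, stride_h=1))
# → [(0, 0, 3), (1, 0, 5), (2, 0, 5), (3, 3, 5)]
```

The window starting at row 4 covers no valid points and is dropped, so the operation circuit never sees it.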
The storage circuit 1220 may be used to store information, including at least information before and/or after processing, as well as intermediate information that needs to be buffered during processing; it may be, for example, one of the various RAMs shown in fig. 3, or an on-chip cache. In some embodiments, the storage circuit 1220 may be configured to store the input data, the convolution kernels, and the convolution operation results, and/or to cache intermediate results such as partial sums, or to provide the cache space required during execution of a MERGE instruction.
The operation circuit 1230 may be configured to perform various operations according to the associated instructions. Specifically, the operation circuit 1230 may be configured to perform, under the control of the control circuit 1210, a plurality of one-dimensional convolution operations on the input data and the convolution kernel to obtain multiple paths of operation results together with their corresponding output point coordinates in a first convolution dimension; and to merge the multiple paths of operation results into one path of fused data as the convolution operation result according to the corresponding output point coordinates, wherein operation results with the same output point coordinates are accumulated.
In some embodiments, the above-described N-dimensional convolution operation is split into M one-dimensional convolution operations, M being equal to the product of the sizes of the convolution kernels in N-1 convolution dimensions other than the first convolution dimension.
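The split of an N-dimensional convolution into M one-dimensional convolutions can be illustrated on dense data: for a 2-D kernel of shape Kh × Kw, M = Kh, and each 1-D convolution slides a single kernel row along W, with the partial sums of the Kh paths accumulated per output point. This dense sketch only illustrates the decomposition; the sparse scheme of this disclosure additionally skips invalid points.

```python
# Decompose a 2-D convolution (cross-correlation form) into Kh
# one-dimensional convolutions along W and accumulate their partial sums.

import numpy as np

def conv2d_via_1d(x, k):                      # x: (H, W), k: (Kh, Kw)
    kh, kw = k.shape
    ho, wo = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((ho, wo))
    for r in range(kh):                       # one 1-D conv per kernel row
        for h in range(ho):
            row = x[h + r]                    # input row feeding this path
            for w in range(wo):               # 1-D sliding dot product
                out[h, w] += row[w:w + kw] @ k[r]
    return out

x = np.arange(16, dtype=float).reshape(4, 4)
k = np.ones((2, 2))
ref = np.array([[x[i:i + 2, j:j + 2].sum() for j in range(3)]
                for i in range(3)])
assert np.allclose(conv2d_via_1d(x, k), ref)  # matches the direct 2-D sum
```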
In one embodiment, the operation circuit 1230 may further include a processing circuit (not shown), which may be configured to pre-process data before the operation or to post-process data after the operation, according to the operation instruction. In some application scenarios, such pre-processing and post-processing may include, for example, data splitting and/or data splicing operations.
In some embodiments, the operation circuit 1230 may include multiple processor cores, each of which processes one input data block allocated by the control circuit 1210 at a time, e.g., computes one row of output points in the W dimension per round.
Specifically, each processor core may be further configured to perform operations as follows: constructing a matrix Q to be subjected to the one-dimensional convolution operations, based on the allocated block indicating the input data block to be processed; computing the coordinates, in the first convolution dimension, of the output points to which the partial sums of the one-dimensional convolution operations contribute; performing a plurality of one-dimensional convolution operations on the matrix Q to obtain multiple paths of partial sums; and merging and fusing the multiple paths of partial sums to obtain the final convolution operation result.
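The matrix-Q construction performed by each core can be sketched as follows under simplifying assumptions: stride 1 along W, a single wi_coord vector, and one scanning window per output point (claim 10 actually tiles the windows so that the subsequent 1-D convolution uses a stride equal to the kernel's W size). All names here are illustrative, not the patented interface.

```python
# Sketch of building matrix Q from a wi_coord vector: scan windows of
# width kw along W, densify the valid points covered by each window into
# a row of Q, record the window's output coordinate wo, and skip windows
# that contain no valid points.

def build_q_rows(wi_coord, min_vals, kw, stride_w):
    rows, wo_coords = [], []
    n_windows = (max(wi_coord) // stride_w + 1) if wi_coord else 0
    for wo in range(n_windows):
        lo = wo * stride_w                  # window covers [lo, lo + kw)
        row, hit = [0.0] * kw, False
        for wi, v in zip(wi_coord, min_vals):
            if lo <= wi < lo + kw:
                row[wi - lo] = v            # densify into the Q row
                hit = True
        if hit:                             # skip empty windows entirely
            rows.append(row)
            wo_coords.append(wo)
    return rows, wo_coords

rows, wos = build_q_rows([0, 4], [1.0, 3.0], kw=3, stride_w=1)
print(wos)   # → [0, 2, 3, 4]  (window wo=1 covers no valid point)
```

Each surviving row of Q then feeds one 1-D convolution, and the recorded wo coordinates travel with the resulting partial sums into the merge-fusion step.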
Although in the above description the step of determining the coordinates is described as being performed by the operation circuit, those skilled in the art will understand that this step may also be performed in software, for example by the control circuit. Further, although the various processing steps have generally been described as being performed on the operation circuit, the operation circuit here may also be distributed, e.g., comprising operation circuits in a heterogeneous system, such that a portion of the operations is performed on a CPU and another portion on a GPU. In one implementation, pre-processing of the input data, which may include, for example, densification of the sparse input data, may be performed on the CPU, while the one-dimensional convolution operations on the input data and the convolution kernels and the merging and fusing of the multiple paths of partial sums may be performed on the GPU, thereby fully exploiting the advantages of the heterogeneous system.
Those skilled in the art will appreciate that the foregoing description, made with reference to the drawings, of convolution operation processing on sparse data according to the embodiments of the present disclosure applies equally to the data processing circuit of fig. 12, and is therefore not repeated here.
The present disclosure also provides a chip, which may comprise the data processing circuit of any of the embodiments described above in connection with the figures. Further, the disclosure also provides a board card, which may include the aforementioned chip.
According to different application scenarios, the electronic device or apparatus of the present disclosure may include a server, a cloud server, a server cluster, a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet computer, a smart terminal, a PC device, a terminal of the internet of things, a mobile terminal, a mobile phone, a vehicle recorder, a navigator, a sensor, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a visual terminal, an autopilot terminal, a vehicle, a household appliance, and/or a medical device. The vehicle comprises an airplane, a ship and/or a vehicle; the household appliances comprise a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove and a range hood; the medical equipment comprises a nuclear magnetic resonance apparatus, a B-ultrasonic apparatus and/or an electrocardiograph. The electronic device or apparatus of the present disclosure may also be applied to the fields of the internet, the internet of things, data centers, energy, transportation, public management, manufacturing, education, power grid, telecommunications, finance, retail, construction site, medical, and the like. Further, the electronic device or apparatus disclosed herein may also be used in application scenarios related to artificial intelligence, big data, and/or cloud computing, such as a cloud end, an edge end, and a terminal. In one or more embodiments, a computationally powerful electronic device or apparatus according to the present disclosure may be applied to a cloud device (e.g., a cloud server), while a less power-consuming electronic device or apparatus may be applied to a terminal device and/or an edge-end device (e.g., a smartphone or a camera). 
In one or more embodiments, the hardware information of the cloud device and that of the terminal device and/or the edge device are compatible with each other, so that, according to the hardware information of the terminal device and/or the edge device, appropriate hardware resources can be matched from the hardware resources of the cloud device to simulate the hardware resources of the terminal device and/or the edge device, thereby achieving unified management, scheduling, and cooperative work of end-cloud integration or cloud-edge-end integration.
It is noted that for the sake of brevity, the present disclosure describes some methods and embodiments thereof as a series of acts and combinations thereof, but those skilled in the art will appreciate that the aspects of the present disclosure are not limited by the order of the acts described. Accordingly, one of ordinary skill in the art will appreciate that certain steps may be performed in other sequences or simultaneously, in accordance with the disclosure or teachings of the present disclosure. Further, those skilled in the art will appreciate that the embodiments described in this disclosure are capable of alternative embodiments, in which acts or modules are involved, which are not necessarily required to practice one or more aspects of the disclosure. In addition, the present disclosure may focus on the description of some embodiments, depending on the solution. In view of the above, those skilled in the art will understand that portions of the disclosure that are not described in detail in one embodiment may also be referred to in the description of other embodiments.
In particular implementation, based on the disclosure and teachings of the present disclosure, one skilled in the art will appreciate that the several embodiments disclosed in the present disclosure may be implemented in other ways not disclosed herein. For example, as for the units in the foregoing embodiments of the electronic device or apparatus, the units are split based on the logic function, and there may be another splitting manner in the actual implementation. Also for example, multiple units or components may be combined or integrated with another system or some features or functions in a unit or component may be selectively disabled. The connections discussed above in connection with the figures may be direct or indirect couplings between the units or components in terms of connectivity between the different units or components. In some scenarios, the aforementioned direct or indirect coupling involves a communication connection utilizing an interface, where the communication interface may support electrical, optical, acoustic, magnetic, or other forms of signal transmission.
In the present disclosure, units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units. The aforementioned components or units may be co-located or distributed across multiple network elements. In addition, according to actual needs, part or all of the units can be selected to achieve the purpose of the solution of the embodiment of the present disclosure. In addition, in some scenarios, multiple units in embodiments of the present disclosure may be integrated into one unit or each unit may exist physically separately.
In other implementation scenarios, the integrated unit may also be implemented in hardware, that is, a specific hardware circuit, which may include a digital circuit and/or an analog circuit, etc. The physical implementation of the hardware structure of the circuit may include, but is not limited to, physical devices, which may include, but are not limited to, transistors or memristors, among other devices. In this regard, the various devices described herein (e.g., computing devices or other processing devices) may be implemented by suitable hardware processors, such as central processing units, GPUs, FPGAs, DSPs, ASICs, and the like. Further, the aforementioned storage unit or storage device may be any suitable storage medium (including magnetic storage medium or magneto-optical storage medium, etc.), and may be, for example, a variable Resistive Memory (RRAM), a Dynamic Random Access Memory (DRAM), a Static Random Access Memory (SRAM), an Enhanced Dynamic Random Access Memory (EDRAM), a High Bandwidth Memory (HBM), a Hybrid Memory Cube (HMC), a ROM, a RAM, or the like.
The foregoing detailed description of the embodiments of the present disclosure has been presented for purposes of illustration and description only; it is exemplary and is not intended to be exhaustive or to limit the disclosure to the precise forms described. To those skilled in the art, variations in the specific embodiments and the scope of application may be made based on the ideas of the present disclosure. In summary, the contents of this specification should not be construed as limiting the present disclosure.

Claims (16)

1. A data processing circuit comprising a control circuit, a storage circuit and an arithmetic circuit, wherein:
the control circuit is configured to control the storage circuit and the operation circuit to perform N-dimensional convolution operation processing on input data and convolution kernels, wherein N is greater than 1, and N represents the number of convolution dimensions for performing sliding accumulation in convolution operation, and the input data are sparse data and are represented in a compact form;
the storage circuitry is configured to store information, the information comprising at least pre-processing, during processing, and/or post-processing information; and
the operation circuit is configured to perform a plurality of one-dimensional convolution operations on the input data and the convolution kernel under the control of the control circuit to obtain a plurality of paths of operation results and output point coordinates on a corresponding first convolution dimension; and merging the multi-path operation results into a path of fusion data as the result of the convolution operation according to the corresponding output point coordinates, wherein the operation results with the same output point coordinates are accumulated.
2. The data processing circuit of claim 1, wherein the N-dimensional convolution operation is split into M of the one-dimensional convolution operations, M being equal to a product of sizes of the convolution kernels in N-1 convolution dimensions other than the first convolution dimension, the first convolution dimension being a width W dimension.
3. The data processing circuit of claim 2, wherein the input data comprises at least width W, height H, and input channel Ci dimensions, the input data being dense and uniform in size at least in Ci dimensions, the dense form of the input data comprising three input parameters:
a first input parameter Min representing effectively input dense data, the shape of which is ain × Ci, wherein ain is the number of effective input data points in the input data, and Ci is the size of Ci dimension of the input data;
a second input parameter wi_coord, representing the coordinate of each of the valid input data points in the W dimension, the shape of which is 1 × ain; and
a third input parameter Hin, representing the input data in the compressed sparse row (CSR) format in the H dimension, the shape of which is X × (Hin + 1), where Hin represents the size of the H dimension of the input data, and X represents the product of the sizes of dimensions other than H, W, and Ci that the input data may have.
4. The data processing circuit of claim 3, wherein the control circuit is further configured to:
and screening the input data blocks distributed to the operation of the operation circuit in the current round according to the input parameters of the input data, wherein the operation circuit calculates the output data of a W dimension in one row in each round.
5. The data processing circuit of claim 4, wherein the control circuit is further configured to filter the input data block as follows:
traversing the third input parameter by a first scanning window and a first scanning step length to detect whether effective input data points exist in the scanning window, wherein the first scanning window corresponds to the size of an input data block required by calculating output data of a W dimension, and the first scanning step length is equal to the convolution step length of the convolution operation in an H dimension; and
and allocating, to the operation circuit, the blocks in the third input parameter that correspond to first scanning windows in which valid input data points are detected.
6. The data processing circuit of claim 5, wherein the operation circuit comprises N_IPU processor cores, and the control circuit is further configured to:
send N_IPU consecutively detected blocks to the N_IPU processor cores for processing.
7. The data processing circuit of any of claims 5-6, wherein the arithmetic circuitry is further configured to:
constructing a matrix Q to be subjected to the one-dimensional convolution operation based on the received blocks; and
and calculating coordinates of each part of the output point of the one-dimensional convolution operation and the result on the first convolution dimension.
8. The data processing circuit of claim 7, wherein the arithmetic circuitry is further to construct the matrix Q as follows:
extracting, from the second input parameter wi_coord, the wi_coord vector of the valid input data points in the input data block corresponding to the block; and
traversing, according to the wi_coord vector, the input data corresponding to the wi_coord vector with a second scanning window and a second scanning step, so as to fetch the corresponding input data points from the first input parameter Min to construct the matrix Q.
9. The data processing circuit of claim 8, wherein the arithmetic circuit is further for:
fetching, according to the indication of the block, a specified amount of data from a specified position of wi_coord to form wi_coord vectors, the number of wi_coord vectors fetched being equal to M; and
filtering out the wi_coord vectors that are empty.
10. A data processing circuit according to any of claims 8-9, wherein the operation circuit is further configured to scan row by row according to the wi_coord vectors to construct the corresponding rows of the matrix Q as follows:
extracting data covered by a second scanning window in which valid input data points are detected, and tiling the data in sequence to construct corresponding rows of the matrix Q; and
skipping a second scan window in which no valid input data points are detected;
wherein the second scan window corresponds to a size of an N-dimensional convolution window of the convolution operation in a W dimension, the second scan step being equal to a convolution step of the convolution operation in the W dimension.
11. A data processing circuit according to any of claims 7-10, wherein the arithmetic circuitry is further configured to:
determining a mapping relation between the W-dimension coordinate of the input data and the W-dimension coordinate of the output data according to the convolution kernel size and the convolution step length of the convolution operation; and
and determining the W-dimension coordinates of the corresponding one or more output points according to the W-dimension coordinates of the effective input points on the basis of the mapping relation to serve as the W-dimension coordinates of the partial sum.
12. A data processing circuit according to any of claims 7-11, wherein the arithmetic circuitry is further configured to:
performing the one-dimensional convolution operation on each row of the matrix Q respectively to obtain multiple paths of partial sums, wherein the one-dimensional convolution kernel of the one-dimensional convolution operation corresponds to a W-dimension row of the N-dimensional convolution kernel of the N-dimensional convolution operation, and the convolution stride of the one-dimensional convolution operation is equal to the size of the N-dimensional convolution kernel in the W dimension.
13. The data processing circuit of claim 12, wherein the arithmetic circuit is further for:
merging the multiple paths of partial sums into one path of fused data according to their corresponding W-dimension coordinates, and accumulating, during merging, partial sums with the same W-dimension coordinate, so as to obtain the final results of the output data points of the corresponding row.
14. A chip comprising a data processing circuit according to any of claims 1-13.
15. A board comprising the chip of claim 14.
16. A method of processing data using a data processing circuit as claimed in any of claims 1 to 13.