WO2023123919A1 - Data processing circuit, data processing method and related product - Google Patents

Data processing circuit, data processing method and related product

Info

Publication number
WO2023123919A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
convolution
input data
dimension
dimensional
Prior art date
Application number
PCT/CN2022/100306
Other languages
English (en)
Chinese (zh)
Inventor
郑鎏韬
徐健
孙尧
Original Assignee
寒武纪行歌(南京)科技有限公司
Priority date
Filing date
Publication date
Application filed by 寒武纪行歌(南京)科技有限公司
Publication of WO2023123919A1

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/15Correlation function computation including computation of convolution operations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/491Computations with decimal numbers radix 12 or 20.
    • G06F7/498Computations with decimal numbers radix 12 or 20. using counter-type accumulators
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology

Definitions

  • the present disclosure relates generally to the field of data processing. More specifically, the present disclosure relates to a data processing circuit, a data processing method, a chip and a board.
  • LiDAR point cloud data is usually sparse, and the point density varies drastically due to factors such as uneven sampling of 3D space, the effective range of sensors, occlusions, and relative poses. Traditional convolutional neural networks designed for dense data therefore become very inefficient when applied to such sparse data; in particular, when convolution operations are involved, a large amount of computing power and other resources is wasted on zero-valued data points.
  • the solution disclosed in the present disclosure provides a data processing circuit, a data processing method, a chip, and a board.
  • the present disclosure discloses a data processing circuit, including: a control circuit, a storage circuit, and an operation circuit, wherein:
  • the control circuit is configured to control the storage circuit and the operation circuit to perform N-dimensional convolution processing on the input data and the convolution kernel, where N>1 and N represents the number of convolution dimensions over which sliding accumulation is performed in the convolution operation, and wherein the input data is sparse data represented in a dense form;
  • the storage circuit is configured to store information, including at least information before, during and/or after the processing; and
  • the operation circuit is configured to, under the control of the control circuit, perform multiple one-dimensional convolution operations on the input data and the convolution kernel to obtain multiple channels of operation results and the corresponding output point coordinates in the first convolution dimension, and to merge the multiple channels of operation results into one channel of fused data according to the corresponding output point coordinates as the result of the convolution operation, wherein operation results with the same output point coordinates are accumulated.
  • the present disclosure provides a chip, including the data processing circuit of any one of the embodiments of the foregoing first aspect.
  • the present disclosure provides a board, including the chip in any embodiment of the second aspect.
  • the present disclosure provides a method of processing data using the aforementioned data processing circuit.
  • the embodiments of the present disclosure provide a convolution scheme suitable for sparse data, which performs operations between the convolution kernel and only the non-zero/non-empty data, and can therefore greatly reduce the amount of computation and improve processing efficiency.
  • the sparse convolution scheme provided by the embodiments of the present disclosure can be applied to multi-dimensional convolution operations, including but not limited to two-dimensional convolution and three-dimensional convolution, and thus can be applied to the processing of LiDAR point cloud data.
  • Fig. 1 shows the structural diagram of the board card of the disclosed embodiment
  • FIG. 2 shows a structural diagram of a combination processing device according to an embodiment of the present disclosure
  • Fig. 3a shows a schematic diagram of the internal structure of a single-core computing device according to an embodiment of the present disclosure
  • Fig. 3b shows a schematic diagram of the internal structure of a multi-core computing device according to an embodiment of the present disclosure
  • Figure 4a shows the operational principle of a conventional convolution scheme
  • Figure 4b shows an exemplary principle of a sparse convolution operation scheme according to an embodiment of the present disclosure
  • Fig. 5 shows an exemplary representation method of input data according to an embodiment of the present disclosure
  • FIG. 6 shows an exemplary process of a sparse convolution scheme according to an embodiment of the present disclosure
  • Figure 7 shows an example of splitting of an input data block according to an embodiment of the present disclosure
  • FIG. 8 shows a flowchart of an exemplary method for screening valid input data points according to an embodiment of the present disclosure
  • FIG. 9 shows a schematic diagram of scanning traversal for a third input parameter according to an embodiment of the present disclosure.
  • FIG. 10 shows a schematic diagram of constructing a Q matrix according to an embodiment of the present disclosure
  • FIG. 11 illustrates exemplary logic for calculating wo coordinates of output data according to an embodiment of the present disclosure.
  • Fig. 12 shows a schematic structural diagram of a data processing circuit according to an embodiment of the present disclosure.
  • the term “if” may be interpreted as “when” or “once” or “in response to determining” or “in response to detecting” depending on the context.
  • FIG. 1 shows a schematic structural diagram of a board 10 according to an embodiment of the present disclosure.
  • the board 10 includes a chip 101, which is a system-on-chip (SoC) integrated with one or more combined processing devices; the combined processing device is an artificial intelligence computing unit used to support various deep learning and machine learning algorithms and to meet the intelligent processing requirements of complex scenarios in fields such as computer vision, speech, natural language processing, and data mining.
  • deep learning technology is widely used in the field of cloud intelligence.
  • a notable feature of cloud intelligence applications is the large amount of input data, which has high requirements for the storage capacity and computing power of the platform.
  • the board 10 of this embodiment is suitable for cloud intelligence applications, with huge off-chip storage, huge on-chip storage and powerful computing capabilities.
  • the chip 101 is connected to an external device 103 through an external interface device 102 .
  • the external device 103 is, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card or a wifi interface, and the like.
  • the data to be processed can be transmitted to the chip 101 by the external device 103 through the external interface device 102 .
  • the calculation result of the chip 101 can be sent back to the external device 103 via the external interface device 102 .
  • the external interface device 102 may have different interface forms, such as a PCIe interface and the like.
  • the board 10 also includes a storage device 104 for storing data, which includes one or more storage units 105 .
  • the storage device 104 is connected to the control device 106 and the chip 101 through a bus and transfers data with them.
  • the control device 106 in the board 10 is configured to regulate the state of the chip 101 .
  • the control device 106 may include a microcontroller (Micro Controller Unit, MCU).
  • FIG. 2 is a block diagram showing the combined processing means in the chip 101 of this embodiment.
  • the combined processing device 20 includes a computing device 201 , an interface device 202 , a processing device 203 and a storage device 204 .
  • the computing device 201 is configured to perform operations specified by the user, and is mainly implemented as a single-core or multi-core intelligent processor for performing deep learning or machine learning calculations; it can interact with the processing device 203 through the interface device 202 to jointly complete the user-specified operations.
  • the interface device 202 is used to transmit data and control instructions between the computing device 201 and the processing device 203 .
  • the computing device 201 may obtain input data from the processing device 203 via the interface device 202 and write it into a storage device on the computing device 201 .
  • the computing device 201 may obtain control instructions from the processing device 203 via the interface device 202 and write them into the control cache on the chip of the computing device 201 .
  • the interface device 202 can also read data in the storage device of the computing device 201 and transmit it to the processing device 203 .
  • the processing device 203 performs basic control including but not limited to data transfer, starting and/or stopping the computing device 201 .
  • the processing device 203 may include one or more types of processors among central processing units (CPUs), graphics processing units (GPUs) or other general-purpose and/or special-purpose processors, including but not limited to digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc., and their number can be determined according to actual needs.
  • the computing device 201 of the present disclosure can be regarded as having a single-core structure or a homogeneous multi-core structure. However, when the computing device 201 and the processing device 203 are considered together as an integrated whole, they are regarded as forming a heterogeneous multi-core structure.
  • the storage device 204 is used to store the data to be processed; it may be a DRAM, such as a DDR memory, typically 16 GB or larger in size, and is used to store data of the computing device 201 and/or the processing device 203.
  • Fig. 3a shows a schematic diagram of the internal structure of a processing core when the computing device 201 is a single-core device.
  • the computing device 301 is used to process input data such as computer vision, speech, natural language, data mining, etc.
  • the computing device 301 includes three modules: a control module 31 , a computing module 32 and a storage module 33 .
  • the control module 31 is used to coordinate and control the work of the operation module 32 and the storage module 33 to complete deep learning tasks, and includes an instruction fetch unit (IFU) 311 and an instruction decode unit (IDU) 312.
  • the instruction fetching unit 311 is used to obtain instructions from the processing device 203 , and the instruction decoding unit 312 decodes the obtained instructions and sends the decoding results to the computing module 32 and the storage module 33 as control information.
  • the operation module 32 includes a vector operation unit 321 and a matrix operation unit 322 .
  • the vector operation unit 321 is used to perform vector operations, and can support complex operations such as vector multiplication, addition, and nonlinear transformation;
  • the matrix operation unit 322 is responsible for the core calculation of the deep learning algorithm, namely matrix multiplication and convolution.
  • the storage module 33 is used to store or transfer relevant data, including a neuron storage unit (neuron RAM, NRAM) 331, a weight storage unit (weight RAM, WRAM) 332, and a direct memory access module (direct memory access, DMA) 333.
  • NRAM 331 is used to store input neurons, output neurons and calculated intermediate results;
  • WRAM 332 is used to store convolution kernels of deep learning networks, that is, weights;
  • DMA 333 is connected to DRAM 204 through bus 34 and is responsible for data transfer between the computing device 301 and the DRAM 204.
  • Fig. 3b shows a simplified schematic diagram of the multi-core internal structure of the computing device 201.
  • Multi-core computing devices can be abstracted using a hierarchical hardware model. As shown in the figure, the multi-core computing device can be abstracted into four levels, namely card level (Card) 350 , chip level (Chip) 360 , processor cluster level (Cluster) 370 and processor core level (Core) 380 .
  • the embodiments of the present disclosure mainly involve the data transmission of the storage unit and the calculation unit, so the drawings and description briefly show and introduce the relevant calculation structure, and other parts are omitted.
  • each board contains local DDR storage, and each processor chip acts as a computing and control unit.
  • each processor chip contains multiple multiprocessors as computing units.
  • each multiprocessor includes multiple accelerator cores as control and computing units, and a shared storage SRAM as a storage unit.
  • each accelerator core contains local storage and an array of local processing units.
  • NFU refers to the Neuron Function Unit, which is used for convolution calculations.
  • the storage model includes board global memory, SRAM (shared memory) on the Cluster, NRAM, WRAM and registers on the Core, and the like.
  • SRAM is included in the storage processing unit MPU (Memory Process Unit Core, referred to as MPU, or Mem Core).
  • Core refers to an intelligent processing core (Intelligent Process Unit Core, IPU Core or Core for short) in a multi-core computing device.
  • IPU Core contains NRAM, WRAM, NFU and so on.
  • Cluster refers to a processor cluster or a computing cluster.
  • a multi-core computing device includes several Clusters, and a Cluster includes 1 Mem Core+N IPU Cores.
  • the embodiments of the present disclosure provide a data processing circuit based on the aforementioned hardware environment, which supports convolution operations on sparse data.
  • the convolution processing associated with sparse data such as LiDAR point cloud data can be simplified and accelerated.
  • the sparse convolution scheme provided by the embodiments of the present disclosure can be applied to multi-dimensional convolution operations, including but not limited to two-dimensional convolution and three-dimensional convolution.
  • two-dimensional convolution is used as an example for illustration.
  • N represents the number of convolution dimensions for performing sliding accumulation in the convolution operation.
  • the convolution kernel performs translation and accumulation in two dimensions (eg, width W and height H) according to corresponding convolution steps.
  • the convolution kernel performs translation and accumulation in three dimensions (for example, width W, height H, and depth D) according to corresponding convolution steps.
  • the "non-convolution dimension” mentioned in the embodiments of the present disclosure refers to a dimension on which the convolution kernel does not perform sliding accumulation. There may be different required operations on different non-convolutional dimensions. For example, for conventional convolution, the input channel dimension Ci is required to be accumulated, and the output channel dimension Co is not accumulated; for another example, for depth-wise conv, the input channel dimension Ci is not accumulated.
  • Figure 4a shows the operational principle of a conventional convolution scheme.
  • the convolution kernel 410 is dense and is a 3 ⁇ 3 matrix, and the numbers in the convolution kernel are corresponding weight data.
  • the input data 420 is a 7 ⁇ 7 matrix, which is sparse with only four non-zero data: 2, 3, 5 and 6, as shown by the dark squares.
  • the convolution stride is set to 2 in both dimensions, the padding is 0, and there is no dilation.
  • the 3 ⁇ 3 gray square in the figure represents the sliding accumulation process of the convolution kernel on the input data.
  • 430 shows the calculation at the beginning of the convolution
  • 440 shows the calculation of one slide to the right (step size 2)
  • 450 shows the calculation of one slide down (step size 2).
  • the weight data of the convolution kernel and the input data are multiplied and accumulated.
  • 460 is the final calculation result as output data.
  • the output data is a matrix of size 3×3. It can be seen that the calculation of 430 corresponds to the data at coordinates (0,0) in the output data, the calculation of 440 corresponds to the data at coordinates (0,1), and the calculation of 450 corresponds to the data at coordinates (1,0).
  • the convolution kernel is dense, and its input format can be the same as that of conventional convolution; the input data, however, is sparse, so its input format can differ from that of conventional convolution, which saves storage space.
  • these multiplication and addition operations can be further decomposed into partial sums obtained by performing one-dimensional convolutions row by row, which are then accumulated element-wise across the rows.
  • the N-dimensional convolution operation can be divided into multiple one-dimensional convolution operations for implementation.
  • FIG. 4b shows an exemplary principle of a sparse convolution scheme according to an embodiment of the present disclosure.
  • FIG. 4b still uses the data in FIG. 4a as an example to describe the sparse convolution operation scheme of the embodiment of the present disclosure.
  • one row of the operation result 460 corresponds to three rows of the input data 420: the first three rows of input data are used to calculate the first row of output data, the middle three rows (overlapping with both the first three and the last three rows) are used to calculate the second row of output data, and the last three rows are used to calculate the last row of output data.
  • therefore, the sparse data can be filtered at the granularity of the input data required for one row of operation results (three rows of input data in the example of the figure), referred to herein as the "first filtering granularity", thereby reducing invalid operations.
  • for example, if the input data of the middle three rows are all zeros, that is, there are no non-sparse points, the convolution operations for these three rows can be omitted.
  • the two-dimensional (H, W dimension) convolution operation can be split into three one-dimensional (W dimension) convolution operations.
  • the original 3*3 convolution window, which slides over three rows of input data to compute one row of output data, can thus be converted into three 1*3 convolution windows (shown with dotted boxes) that slide over the three rows of input data respectively (as shown in 470) to obtain three rows of partial sums (as shown in 480), which are then accumulated element-wise to obtain the final row of output data (460).
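  • As an illustration of this decomposition, the following minimal Python/NumPy sketch (the non-zero positions and kernel weights are placeholders assumed for the demonstration, not taken from the figure) reproduces one output row of a 2-D convolution as the element-wise sum of three 1-D row convolutions:

```python
import numpy as np

x = np.zeros((7, 7), dtype=np.int32)                 # sparse 7x7 input (assumed positions)
x[0, 1], x[0, 2], x[2, 0], x[4, 6] = 2, 3, 5, 6
k = np.arange(1, 10).reshape(3, 3)                   # dense 3x3 kernel (placeholder weights)
stride = 2

def conv1d_row(row, krow, stride):
    # slide a 1x3 window along one input row -> one row of partial sums
    return np.array([np.dot(row[j:j + 3], krow) for j in range(0, len(row) - 2, stride)])

# output row 0 uses input rows 0..2: three 1-D convolutions, then element-wise accumulation
partial = [conv1d_row(x[r], k[r], stride) for r in range(3)]
out_row0 = np.sum(partial, axis=0)

# reference: direct 2-D convolution for the same output row
ref = np.array([np.sum(x[0:3, j:j + 3] * k) for j in range(0, 5, stride)])
assert np.array_equal(out_row0, ref)
```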
  • since the two-dimensional convolution operation is decomposed into multiple one-dimensional convolution operations, sparse data can also be filtered during the convolution at the granularity of one dimension (herein referred to as the "second filtering granularity"), thereby reducing invalid operations.
  • the input data of the third row is all 0, that is, there is no non-sparse point, so the convolution operation of this row can be omitted.
  • sparse data may also be filtered at a granularity of a one-dimensional convolution window (herein referred to as “the third filtering granularity”), thereby further reducing invalid operations.
  • the sparse convolution operation may include the following steps: performing multiple one-dimensional convolution operations on the convolution kernel and the sparse input data to obtain multiple channels of operation results (such as product results or multiply-accumulate results, considering that some dimensions of the data, such as the input channel dimension Ci, need to be accumulated) and the corresponding output point coordinates in the first convolution dimension; and then merging the multiple channels of operation results into one channel of fused data according to their corresponding output point coordinates, as the result of the sparse convolution operation. During the merging process, operation results with the same output point coordinates are accumulated.
  • when applied to an N-dimensional convolution operation with N>1, the N-dimensional convolution operation can be split into M one-dimensional convolution operations, where M is equal to the product of the sizes of the convolution kernel in the N-1 convolution dimensions other than the first convolution dimension on which the one-dimensional convolution operations are performed.
  • a sparse convolution operation involves input data, a convolution kernel and output data.
  • the input data is also called input neuron in the neural network, and the output data is called output neuron.
  • in convolution operations involving LiDAR point cloud data, the convolution kernel is dense, and its input format can be the same as that of regular convolution. In two-dimensional convolution the kernel size is usually 3*3, and a single convolution needs to multiply and accumulate 3*3*ci numbers; in three-dimensional convolution the kernel size is usually 3*3*3, and a single convolution needs to multiply and accumulate 3*3*3*ci numbers.
  • the input data to be convolutionally processed may include multi-dimensional data, and it is sparse in multiple dimensions.
  • the input data is detection data in three-dimensional space, which characterizes the gray value, RGB value, signal strength, etc. of each coordinate point in the three-dimensional space; therefore, depending on the information content to be represented, the data element at each coordinate point can be one-dimensional, two-dimensional, three-dimensional or higher-dimensional. Due to the characteristics of point cloud data, coordinate points with non-zero data elements are sparse, that is, the data is sparse in the three spatial dimensions (e.g., width W, height H, and depth D).
  • preprocessing may be performed before the sparse input data is provided to the operation circuit for operation.
  • preprocessing may include, for example: combining the sparse dimensions into one dimension; densifying the sparse data points of the input data in the combined dimension; and using several input parameters to represent the values and coordinates of the densified input data.
  • Fig. 5 shows an exemplary representation method of input data (sparse neurons).
  • the input data here may be the data after filling processing according to the requirements of the convolution operation.
  • a preprocessing operator can be used to convert the sparse input data into a dense input data representation, thereby saving storage space.
  • the input data 510 in sparse form includes five dimensions: the batch (B) dimension, the three HWD spatial dimensions and the input channel (Ci) dimension.
  • the input data is sparse in the B dimension and the HWD three-dimensional space.
  • the dark squares in the HWD three-dimensional matrix in the figure represent places with values (called valid input points), and all other parts are zero values.
  • There are multiple such HWD three-dimensional matrices along the B dimension, and the sparse pattern (that is, the positions of the dark squares) may differ between them.
  • the input data is dense in the input channel Ci dimension, which is the lowest dimension. Due to the limited expressive ability of the drawing, 510 in the figure only shows four dimensions, but the Ci dimension can be understood as the thickness of each dark square.
  • the size of the Ci dimension is uniform, that is, the thickness of each dark square is the same.
  • when converting the sparse input data 510 into dense input data, the representation can follow the CSR format used for sparse matrices.
  • the storage of a sparse matrix stores not only the values of the non-zero elements (sometimes called effective element values), but also their coordinate positions (row index, column index).
  • the CSR storage method is called compressed sparse row format.
  • the CSR method uses three arrays to store the sparse matrix, which respectively store row pointers, column indexes and values.
  • the length of the column index array and the numeric array is the number of nonzero elements in the sparse matrix.
  • the row pointer array stores, for each row, the offset of the first non-zero element of that row relative to the first non-zero element of the sparse matrix, and its last element stores the total number of non-zero elements in the matrix, so the length of the row pointer array is the number of matrix rows plus 1. It follows from this definition that the number of non-zero elements in a row can be obtained by subtracting that row's pointer value from the next row's pointer value.
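  • The following minimal Python sketch illustrates the CSR arrays described above for a small example matrix (the matrix values are arbitrary):

```python
import numpy as np

m = np.array([[0, 2, 0],
              [0, 0, 0],
              [5, 0, 6]])

values, col_index, row_ptr = [], [], [0]
for row in m:
    nz = np.nonzero(row)[0]
    col_index.extend(nz.tolist())
    values.extend(row[nz].tolist())
    row_ptr.append(len(values))      # offset of the next row's first non-zero element

print(values)     # [2, 5, 6]
print(col_index)  # [1, 0, 2]
print(row_ptr)    # [0, 1, 1, 3] -> row_ptr[i+1] - row_ptr[i] = non-zeros in row i
```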
  • three input parameters may be used for representation.
  • the first input parameter is the effective dense input data, that is, the densely arranged sparse data, denoted Min; the shape of Min is ain*ci, where ain is the number of non-sparse points in the input data and ci is the size of the input channel dimension.
  • a batch-by-batch method may be adopted, and the batches may be split into different processor cores (such as cores in FIG. 3b ) for processing.
  • suppose, for example, that there are 12 batches; then, based on the hardware environment of the example in Figure 3b, up to 12 batches can be processed in parallel at one time.
  • ain corresponds to the number of valid input points of the three dimensions of HWD.
  • the second input parameter is the coordinate or index of each valid input point in the W dimension, represented by wi_coord, and its shape is 1*ain. As shown in 522 in the figure, for the 4 valid input points in the figure, wi_coord is [1,2,0,6].
  • the third input parameter is data in CSR (compressed sparse row) format of input data in H dimension or H direction, represented by hin. hin stores the offset position of the first non-zero element in each row from the first non-zero element in the input data, and its last element stores the total number of non-zero elements in the input data.
  • the third input parameter may have multiple dimensions.
  • the dimensions of the third input parameter, from high to low, are: batch (B), depth in_d and height in_h, and its shape is B*Din*(Hin+1), where B, Din, and Hin are the B-dimension, D-dimension, and H-dimension sizes of the input data in sparse form, respectively.
  • the last element "4" represents the total number of all non-zero elements.
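  • As a hedged illustration of these three input parameters, the sketch below builds Min, wi_coord and hin for a single batch and depth slice of an H*W*Ci tensor; the valid-point positions are assumptions chosen so that wi_coord reproduces the [1,2,0,6] example above, and padding is ignored:

```python
import numpy as np

x = np.zeros((5, 7, 4), dtype=np.float32)                  # assumed H=5, W=7, Ci=4
x[0, 1], x[0, 2], x[2, 0], x[4, 6] = 1.0, 2.0, 3.0, 4.0    # placeholder valid points

mask = np.any(x != 0, axis=-1)                  # valid points are dense along Ci
Min = x[mask]                                   # shape: ain x ci
wi_coord = np.nonzero(mask)[1]                  # W coordinate of each valid point
hin = np.concatenate([[0], np.cumsum(mask.sum(axis=1))])   # CSR row pointers over H

print(Min.shape)   # (4, 4)
print(wi_coord)    # [1 2 0 6]
print(hin)         # [0 2 2 3 3 4]
```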
  • the first output parameter is the effective dense output data, that is, the densely arranged sparse output data, denoted Mout; the shape of Mout is aout*co, where aout is the number of non-sparse points in the output data and co is the size of the output channel dimension.
  • the second output parameter is the coordinate or index of each effective output point in the W dimension, represented by wo_coord, and its shape is 1*aout.
  • the third output parameter is data in CSR (compressed sparse row) format of the output data in the H direction, represented by hout.
  • hout stores the offset of the first non-zero element of each output row relative to the first non-zero element in the output data, and its last element stores the total number of non-zero elements in the output data.
  • the third output parameter is similar to the third input parameter and may also have multiple dimensions, for example, from high to low: batch (B), depth out_d and height out_h, with shape B*Dout*(Hout+1), where B, Dout, and Hout are the B-dimension, D-dimension, and H-dimension sizes of the output data in sparse form, respectively.
  • the output data includes 4 effective output points; the corresponding W-dimension coordinates wo_coord 532 are [0,1,0,2], the corresponding H-direction CSR data hout 533 is [0,2,2,4], and the first output parameter is not shown in the figure.
  • the coordinates of the effective output points in the output data can be determined from the coordinates of the input data. Therefore, in the sparse convolution scheme of the embodiments of the present disclosure, the operation process can be decomposed into several steps of coordinate calculation and value calculation.
  • FIG. 6 shows an exemplary process of a sparse convolution scheme according to an embodiment of the present disclosure.
  • in step 610, the data to be processed by each processor core in each round is first screened out based on the coordinates of the input data.
  • the input data here is data in CSR format that has been zero-filled and converted from sparse to dense.
  • splitting may be performed according to the dimensions of the output data.
  • the convolution operation results may be calculated row by row. Therefore, the splitting method may be: each processor core calculates a row of output data points of the Wo dimension each time.
  • the shape of the input data block corresponding to one row of Wo-dimension output data points is (kz*ci)*ky*wi, where wi is the W-dimension size of the input data, ci is the Ci-dimension size of the input data, and kx, ky, and kz are the sizes of the convolution kernel in the W, H, and D dimensions, respectively.
  • Fig. 7 shows an example of splitting the input data block, where the gray input data block exactly corresponds to one row of Wo-dimension output data points after the convolution accumulation is completed.
  • screening may be performed according to the third input parameter of the input data, namely hin. By the definition of hin, subtracting the i-th value from the (i+1)-th value gives the number of non-zero elements (valid input data points) in the i-th row. Based on this property of hin, it can therefore be determined whether a row contains valid input data points, for screening purposes.
  • FIG. 8 shows a flowchart of an exemplary method for screening valid input data points according to an embodiment of the disclosure.
  • the third input parameter is loaded from an external storage circuit to an on-chip memory (such as SRAM).
  • the size of the storage space required by the third input parameter is (Hin+1+2*ph)*(Din+2*pd)*dwidth, where Din and Hin are the size of D dimension and H dimension of the input data in sparse form, respectively, ph and pd are the one-sided padding amounts of the H dimension and D dimension respectively, and dwidth is the data bit width.
  • in step 812, the third input parameter is traversed with a specified scanning window (the first scanning window) and a specified scanning step (the first scanning step), so as to find valid input data points.
  • the size of the first scanning window corresponds to the input data required to calculate a row of Wo-dimensional output data points, that is, corresponds to the "first screening granularity" mentioned above.
  • the size of the first scanning window is determined according to the size of the convolution kernel. Specifically, the size of the first scanning window is kz*(ky+1), where kz corresponds to the size of the scanning window in the D dimension and ky+1 to its size in the H dimension, because the third input parameter uses the CSR format, in which ky rows of H-dimension data require ky+1 values to represent.
  • the first scanning step is equal to the H-dimensional convolution step Sy.
  • the scan window in which valid input data points are detected may be referred to as a range_block.
  • the hi and di coordinates corresponding to the block can be recorded, so that the corresponding ho and do coordinates in the output data can be calculated according to the hi and di coordinates.
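  • A simplified sketch of this first-level screening for the two-dimensional case (the D dimension and padding are omitted, and the function and field names are illustrative rather than the patent's implementation) could look as follows:

```python
def find_blocks(hin, ky=3, sy=2):
    # slide a window of ky+1 consecutive hin entries with step Sy; if no valid
    # input point falls inside, the corresponding output row can be skipped
    blocks = []                                  # each block -> one row of Wo output points
    ho = 0
    for top in range(0, len(hin) - ky, sy):      # top = hi coordinate of the window
        window = hin[top:top + ky + 1]           # ky rows need ky+1 CSR entries
        if window[-1] - window[0] > 0:           # at least one valid input point
            blocks.append({"hi": top, "ho": ho, "start": window[0], "end": window[-1]})
        ho += 1
    return blocks

print(find_blocks([0, 2, 2, 3, 3, 4]))           # -> two blocks: (hi=0, ho=0) and (hi=2, ho=1)
```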
  • the found blocks may be distributed to the processor cores (IPUs); for example, NIPU blocks are respectively sent to NIPU different IPUs.
  • each IPU may be notified of the H- and D-dimension coordinates corresponding to the output points it processes in the output data, that is, the ho and do coordinates. It can be understood that when an IPU calculates the value of an output point (wo_x, ho, do), wo_x varies from 0 to (wo-1), while ho and do are fixed.
  • in this example, the size of the scanning window is 3*4 values; when the hin data of the 3 rows covered by a window do not change, there is no valid input data point in that window, and scanning can continue to the next position without further processing.
  • the do coordinates corresponding to these 4 blocks are all 0, and the ho coordinates are 0, 2, 3, 4 respectively.
  • each IPU that is allocated data to be processed (indicated by the aforementioned block) can fetch the corresponding data according to the block's indication, construct the matrix to be convolved, hereafter referred to as the Q matrix, and calculate the output point coordinate information at the same time.
  • the Q matrix is constructed by fetching data from the shared memory SRAM, which is preloaded with input data from an external storage circuit.
  • the wi_coord vector of valid input data points in the input data block corresponding to the allocated block may first be obtained from the second input parameter wi_coord. Next, according to the wi_coord vector, the input data corresponding to the vector can be traversed with the second scan window and the second scan step to extract the corresponding input data points from the effectively input dense data Min (that is, the first input parameter) to construct the Q matrix.
  • since the block is obtained from the third input parameter hin, which records the offset of the first valid input data point of each row relative to a specified point (namely the first valid input data point of the entire input data), this information can be used to take the specified amount of data from the specified position of the second input parameter wi_coord to form the corresponding wi_coord vector.
  • the wi_coord vector refers to the vector formed by the wi coordinates of all valid input data points in one W-dimension row of the input data. For example, if a row contains 34 valid input data points in the W dimension, the vector has length 34, and each vector element is the wi coordinate of the corresponding data point.
  • filtering can be performed while constructing the wi_coord vectors: since the difference between two adjacent elements of hin gives the number of valid input data points in the corresponding row, the wi_coord vector of a row whose difference is 0 is empty and can be filtered out.
  • This screening step corresponds to the "second screening granularity" described above.
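  • A minimal sketch of this per-row construction and second-level filtering, reusing the example hin and wi_coord values given earlier (names are illustrative):

```python
def row_wi_vectors(hin, wi_coord, row_range):
    # slice the per-row wi coordinates out of the global wi_coord array using hin,
    # skipping rows whose non-zero count is 0 (empty wi_coord vectors)
    vecs = []
    for i in row_range:                          # rows covered by the allocated block
        start, end = hin[i], hin[i + 1]
        if end > start:                          # non-empty row only
            vecs.append((i, wi_coord[start:end]))
    return vecs

hin = [0, 2, 2, 3, 3, 4]
wi_coord = [1, 2, 0, 6]
print(row_wi_vectors(hin, wi_coord, range(0, 3)))   # [(0, [1, 2]), (2, [0])] -> row 1 filtered out
```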
  • the input data corresponding to the wi_coord vector can be traversed with the second scan window and the second scan step to extract the corresponding input data point from the first input parameter Min to construct the matrix Q.
  • FIG. 10 shows a schematic diagram of constructing a Q matrix according to an embodiment of the present disclosure.
  • the figure only shows the Q matrix construction of 3 rows w, and similar constructions can be carried out for w of other rows.
  • the wi_coord vector can be constructed based on the information in the third input parameter hin, that is, it is determined whether there is a valid input data point in the current row to be scanned. Specifically, the number of valid input data points in the i-th row can be determined according to the difference between the i+1th value and the i-th value in hin.
  • the input data corresponding to the wi_coord vector is traversed with the second scan window and the second scan step to extract corresponding valid input data points to construct a matrix Q.
  • scanning is performed row by row to construct corresponding rows of the Q matrix.
  • the data covered by second scanning windows in which valid input data points are detected is extracted and tiled in order to form the corresponding row of matrix Q, while second scanning windows in which no valid input data points are detected are skipped.
  • the size of the second scanning window corresponds to the size of the convolution window of the convolution operation in the first convolution dimension (for example, kx in the W dimension), and the second scanning step corresponds to the convolution stride Sx of the convolution operation in the W dimension.
  • This step of scanning and screening corresponds to the "third screening granularity" described above.
  • in this example, a second scanning window of size 1*3 and a scanning step of 2 are used. The input data is scanned and traversed row by row along the W dimension; when valid input data points exist within the scanning window, the input data covered by that window is extracted. In this way, the window data extracted each time is sequentially expanded and tiled along the W dimension to construct the Q matrix.
  • the scanning result is: there are 2 valid input data points in the scanning window 1004 , there are 2 valid input data points in the scanning window 1005 , and there is no valid input data point in the scanning window 1006 .
  • the result of extracting the data and constructing the Q matrix is shown in the second row of 1040, which is composed of the data covered by the scanning windows 1004 and 1005.
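  • The construction of one row of the Q matrix can be sketched as below; this is an illustrative Python example assuming kx=3, Sx=2 and a single input channel (each valid point carries one value instead of a ci-length vector), not the exact hardware procedure:

```python
import numpy as np

def build_q_row(wi_vec, vals, wi_size, kx=3, sx=2):
    # walk the W dimension with a 1 x kx window and step Sx, keep only windows that
    # contain a valid input point, and tile the values of those windows side by side
    row, wo_list = [], []
    dense = dict(zip(wi_vec, vals))              # wi coordinate -> value
    for wo, start in enumerate(range(0, wi_size - kx + 1, sx)):
        if any(w in dense for w in range(start, start + kx)):
            row.extend(dense.get(w, 0) for w in range(start, start + kx))
            wo_list.append(wo)                   # remember which output point this window feeds
    return np.array(row), wo_list

print(build_q_row([1, 2], [2, 3], wi_size=7))    # (array([0, 2, 3, 3, 0, 0]), [0, 1])
```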
  • since the input data is in a compact form, when judging whether a valid input data point lies in a scanning window, the coordinate information of the input data can be used to determine whether it falls within a given scanning window. Specifically, the judgment can be made according to the block (1030) allocated to the IPU and the constructed wi_coord vector (1020). It can be understood that a scanning window essentially corresponds to one output point (one partial sum), so the wo coordinates of the output data points to which a valid input data point contributes can be deduced from its wi coordinate, thereby determining whether it falls into one or more scanning windows.
  • FIG. 11 illustrates exemplary logic for calculating wo coordinates of output data points according to an embodiment of the present disclosure.
  • in this example, the convolution kernel size is 3*3*3 and the convolution stride is 2 in each of the H, W and D directions; for other convolution parameters (specifically, the kernel size kx and the convolution stride Sx), the mapping relationship changes accordingly.
  • when valid input data points are extracted to construct the Q matrix, the Q matrix can also be constructed according to whether the wi coordinate is odd or even. Specifically, the position of an input data point within the second scanning window can be determined by the parity of its wi coordinate: in this example (kx=3, Sx=2), an input data point whose wi coordinate is odd must fall in the middle of a single second scanning window, while an input data point whose wi coordinate is even falls into two adjacent positions of two adjacent second scanning windows.
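  • The wi-to-wo mapping can be written generically as follows; this sketch assumes zero padding in the W dimension and recovers the odd/even rule above as the special case kx=3, Sx=2:

```python
def wo_coords(wi, kx=3, sx=2, px=0, wo_max=None):
    # an input point at wi falls into output window wo iff wo*sx - px <= wi <= wo*sx - px + kx - 1
    lo = max(0, -(-(wi + px - kx + 1) // sx))    # ceil((wi + px - kx + 1) / sx)
    hi = (wi + px) // sx                         # floor((wi + px) / sx)
    if wo_max is not None:
        hi = min(hi, wo_max - 1)
    return list(range(lo, hi + 1))

print(wo_coords(3))   # odd wi  -> a single window: [1]
print(wo_coords(4))   # even wi -> two adjacent windows: [1, 2]
```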
  • the value of the effective input data point can be read from, for example, the shared memory SRAM, and stored in the on-chip memory (such as NRAM).
  • in this way the Q matrix can be constructed: each of the M non-empty wi_coord vectors is processed sequentially in the above manner, and the resulting Q matrix has M rows, where each row is composed of Li second scanning windows and Li depends on the number of output data points of that row, as previously counted when calculating the wo coordinates.
  • the counted number of output data points can be used to calculate the third output parameter hout in the output data.
  • in step 630, after the matrix Q is constructed, M one-dimensional convolution operations can be performed on it; the M one-dimensional convolution kernels of these operations are obtained by splitting the original N-dimensional convolution kernel along the first convolution dimension, and the convolution stride of each one-dimensional convolution operation is equal to the size of the second scanning window, that is, equal to the size of the N-dimensional convolution kernel in the first convolution dimension.
  • the first convolution dimension is the W dimension, so the convolution step of the one-dimensional convolution operation is equal to kx.
  • through the M one-dimensional convolution operations, M channels of partial sums can be obtained, all of which correspond to output points of the same wo row.
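  • A minimal sketch of one such 1-D convolution pass over a Q-matrix row (placeholder values, single input channel), showing why the stride equals kx:

```python
import numpy as np

def conv1d_over_q_row(q_row, kernel_row, kx=3):
    # q_row is the concatenation of L scan windows of width kx -> L partial sums;
    # because the windows were tiled side by side, the stride over q_row is kx
    L = len(q_row) // kx
    return np.array([np.dot(q_row[i * kx:(i + 1) * kx], kernel_row) for i in range(L)])

q_row = np.array([0, 2, 3, 3, 0, 0])             # two tiled windows (placeholder values)
k_row = np.array([1, 2, 3])                      # one 1x3 slice of the original kernel
print(conv1d_over_q_row(q_row, k_row))           # [13  3] -> two partial sums
```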
  • the corresponding wo coordinate of each partial sum is also determined within each partial-sum result.
  • in step 640, the M channels of partial sums are merged into one channel of fused data according to the corresponding output data point coordinates, yielding the final result for the corresponding row of wo output data points.
  • partial sums with the same wo coordinate are accumulated.
  • the merging process described above can be performed in a number of ways.
  • the merging and fusion process may be implemented in hardware.
  • for example, it can be realized by a hardware MERGE instruction.
  • the basic function of the MERGE instruction is to merge multiple channels of data to be fused into one channel of fused data according to their index order, and to accumulate the data with the same index.
  • software may be used to implement the merging and fusion processing.
  • the sorting in the merge and fusion process may be implemented by a fully vectorized sorting algorithm on a multi-core processor; the bang_add operator is then called to traverse the sorted data, directly adding entries whose coordinates are the same and simply continuing the traversal when they differ.
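  • A software sketch of the merge/fuse step (illustrative only; it mimics the described behavior of sorting by output coordinate and accumulating duplicates, without the vectorized sort or the bang_add operator):

```python
def merge_partial_sums(coords, values):
    # sort partial sums by output coordinate and accumulate entries with equal coordinates
    order = sorted(range(len(coords)), key=lambda i: coords[i])
    out_coords, out_values = [], []
    for i in order:
        if out_coords and out_coords[-1] == coords[i]:
            out_values[-1] += values[i]          # same output point -> accumulate
        else:
            out_coords.append(coords[i])
            out_values.append(values[i])
    return out_coords, out_values

coords = [0, 2, 0, 1]                            # wo coordinates from several partial-sum channels
values = [13, 3, 4, 7]
print(merge_partial_sums(coords, values))        # ([0, 1, 2], [17, 7, 3])
```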
  • the convolution operation scheme of sparse data in the embodiment of the present disclosure has been described above from various aspects. Compared with conventional convolution schemes, the schemes of the disclosed embodiments only perform operations on non-zero/non-null data in sparse data, which can avoid excessive invalid operations, greatly save computation, and improve processing efficiency. Further, the input data is screened through different levels (for example, three levels) of screening granularity, so that data that needs to perform convolution operations can be extracted quickly and efficiently.
  • the sparse convolution scheme provided by the embodiments of the present disclosure can be especially suitable for processing based on LiDAR point cloud data.
  • Embodiments of the present disclosure also provide a data processing circuit for performing the convolution operation of the aforementioned sparse data, and a data processing method implemented by the data processing circuit.
  • FIG. 12 exemplarily shows a schematic structural diagram of a data processing circuit that can implement an embodiment of the present disclosure.
  • the data processing circuit 1200 includes a control circuit 1210 , a storage circuit 1220 and an operation circuit 1230 .
  • the control circuit 1210 is responsible for processing various functions on the data processing circuit 1200, including but not limited to control, instruction fetching, decoding, calculation and so on.
  • the control circuit 1210 may include, for example, the control module 31 in FIG. 3 .
  • the control circuit 1210 can be configured to control the storage circuit 1220 and the operation circuit 1230 to perform N-dimensional convolution processing on the input data and the convolution kernel, where N>1 and N indicates the number of convolution dimensions over which sliding accumulation is performed in the convolution operation.
  • the input data is sparse data represented in a dense form.
  • control circuit 1210 may be configured to: filter the input data blocks allocated to the operation circuit 1230 in the current round according to the input parameters of the input data, wherein the operation circuit 1230 calculates a row of W-dimensional output data in each round.
  • the storage circuit 1220 can be used to store information, including at least information before and/or after processing, and may also include intermediate information that needs to be cached during processing; it can be, for example, the various RAMs shown in FIG. 3, or an on-chip cache. In some embodiments, the storage circuit 1220 may be configured to store the input data, the convolution kernel, the convolution operation results and/or intermediate results, for example caching partial sums, or to provide the cache space required during execution of the MERGE instruction.
  • the arithmetic circuit 1230 may be configured to perform various arithmetic operations according to related instructions.
  • the operation circuit 1230 can be configured to, under the control of the control circuit 1210, perform multiple one-dimensional convolution operations on the input data and the convolution kernel to obtain multiple channels of operation results and the corresponding output point coordinates in the first convolution dimension, and to merge the multiple channels of operation results into one channel of fused data according to their corresponding output point coordinates as the result of the convolution operation, where operation results with the same output point coordinates are accumulated.
  • the above-mentioned N-dimensional convolution operation is split into M one-dimensional convolution operations, where M is equal to the product of the sizes of the convolution kernel in the N-1 convolution dimensions other than the first convolution dimension.
  • the arithmetic circuit 1230 may further include an arithmetic processing circuit (not shown), which may be configured to preprocess the data before the arithmetic circuit performs the arithmetic or post-process the data after the arithmetic according to the arithmetic instruction.
  • the foregoing preprocessing and postprocessing may, for example, include data splitting and/or data splicing operations.
  • the computing circuit 1230 may include multiple processor cores, and each processor core may process the input data block allocated to it by the control circuit 1210 each time, for example calculating one row of W-dimension output points each time.
  • each processor core may be further configured to perform the following operations: constructing the matrix Q, on which one-dimensional convolution operations are to be performed, based on an allocated block indicating the block of input data to be processed; computing the output point coordinates of each partial sum of the one-dimensional convolution operations in the first convolution dimension; performing multiple one-dimensional convolution operations on the matrix Q to obtain the multiple channels of partial sums; and merging and fusing the multiple channels of partial sums to obtain the final convolution operation result.
  • although the step of determining the coordinates is described as being performed by the operation circuit, those skilled in the art can understand that it can also be performed in software, for example by the control circuit.
  • although each processing step is generally described as being executed on the operation circuit, the operation circuit here may also be distributed, for example comprising operation circuits in a heterogeneous system, so that one part of the operations is executed on a CPU and another part is executed on a GPU, for example.
  • preprocessing of the input data, which may include, for example, densification of the input data in sparse form, may be performed on a CPU.
  • the one-dimensional convolution operations of the input data with the convolution kernel and the merging and fusion of the multiple channels of partial sums can be performed on the GPU, thereby fully exploiting the advantages of the heterogeneous system.
  • the present disclosure also provides a chip, which may include the data processing device of any embodiment described above with reference to the accompanying drawings. Further, the present disclosure also provides a board, which may include the aforementioned chip.
  • the electronic equipment or devices disclosed in this disclosure may include servers, cloud servers, server clusters, data processing devices, robots, computers, printers, scanners, tablet computers, smart terminals, PC equipment, Internet of Things terminals, mobile terminals, mobile phones, driving recorders, navigators, sensors, cameras, video cameras, projectors, watches, earphones, mobile storage, wearable devices, visual terminals, automatic driving terminals, vehicles, household appliances, and/or medical equipment.
  • Said vehicles include airplanes, ships and/or vehicles;
  • said household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lights, gas stoves, range hoods;
  • said medical equipment includes nuclear magnetic resonance instruments, ultrasound scanners and/or electrocardiographs.
  • the electronic equipment or device disclosed herein can also be applied to fields such as the Internet, the Internet of Things, data centers, energy, transportation, public management, manufacturing, education, power grids, telecommunications, finance, retail, construction sites, and medical treatment. Further, the electronic device or device disclosed herein can also be used in application scenarios related to artificial intelligence, big data and/or cloud computing, such as cloud, edge, and terminal.
  • electronic devices or devices with high computing power according to the disclosed solutions can be applied to cloud devices (such as cloud servers), while electronic devices or devices with low power consumption can be applied to terminal devices and/or Edge devices (such as smartphones or cameras).
  • the hardware information of the cloud device and that of the terminal device and/or edge device are compatible with each other, so that, according to the hardware information of the terminal device and/or edge device, appropriate hardware resources can be matched from the hardware resources of the cloud device to simulate the hardware resources of the terminal device and/or edge device, thereby completing the unified management, scheduling and collaborative work of device-cloud integration or cloud-edge-device integration.
  • the present disclosure expresses some methods and their embodiments as a series of actions and combinations thereof, but those skilled in the art can understand that the solution of the present disclosure is not limited by the order of the described actions . Therefore, according to the disclosure or teaching of the present disclosure, those skilled in the art may understand that certain steps may be performed in other orders or simultaneously. Further, those skilled in the art can understand that the embodiments described in the present disclosure can be regarded as optional embodiments, that is, the actions or modules involved therein are not necessarily required for the realization of one or some solutions of the present disclosure. In addition, according to different schemes, the description of some embodiments in this disclosure also has different emphases. In view of this, those skilled in the art may understand the part that is not described in detail in a certain embodiment of the present disclosure, and may also refer to related descriptions of other embodiments.
  • a unit described as a separate component may or may not be physically separated, and a component shown as a unit may or may not be a physical unit.
  • the aforementioned components or units may be located at the same location or distributed over multiple network units.
  • some or all of the units may be selected to achieve the purpose of the solutions described in the embodiments of the present disclosure.
  • multiple units in the embodiments of the present disclosure may be integrated into one unit, or each unit exists physically independently.
  • the above-mentioned integrated units may also be implemented in the form of hardware, that is, specific hardware circuits, which may include digital circuits and/or analog circuits.
  • the physical realization of the hardware structure of the circuit may include but not limited to physical devices, and the physical devices may include but not limited to devices such as transistors or memristors.
  • various devices such as computing devices or other processing devices described herein may be implemented by appropriate hardware processors, such as central processing units, GPUs, FPGAs, DSPs, and ASICs.
  • the aforementioned storage unit or storage device can be any suitable storage medium (including a magnetic or magneto-optical storage medium, etc.), and can be, for example, a resistive random access memory (RRAM), a dynamic random access memory (DRAM), a static random access memory (SRAM), an enhanced dynamic random access memory (EDRAM), a high bandwidth memory (HBM), a hybrid memory cube (HMC), a ROM, a RAM, etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Computational Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Pure & Applied Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Complex Calculations (AREA)

Abstract

Disclosed are a data processing circuit, a data processing method and a related product. The data processing circuit can be implemented as a computing device in a combined processing device, and the combined processing device may further comprise an interface device and another processing device. The computing device interacts with the other processing device to jointly complete a computing operation specified by a user. The combined processing device may further comprise a storage device, which is connected separately to the computing device and the other processing device and is used to store data of the computing device and the other processing device. A solution of the present disclosure provides a convolution processing scheme for sparse data, which can simplify processing and improve the processing efficiency of a machine.
PCT/CN2022/100306 2021-12-29 2022-06-22 Circuit de traitement de données, procédé de traitement de données et produit associé WO2023123919A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111642096.8 2021-12-29
CN202111642096.8A CN114329324A (zh) 2021-12-29 2021-12-29 数据处理电路、数据处理方法及相关产品

Publications (1)

Publication Number Publication Date
WO2023123919A1 true WO2023123919A1 (fr) 2023-07-06

Family

ID=81016541

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/100306 WO2023123919A1 (fr) 2021-12-29 2022-06-22 Circuit de traitement de données, procédé de traitement de données et produit associé

Country Status (2)

Country Link
CN (1) CN114329324A (fr)
WO (1) WO2023123919A1 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117831596A (zh) * 2024-03-05 2024-04-05 悦芯科技股份有限公司 一种存储芯片稀疏失效单元电路的修复方法

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114329324A (zh) * 2021-12-29 2022-04-12 寒武纪行歌(南京)科技有限公司 数据处理电路、数据处理方法及相关产品
CN116828070B (zh) * 2023-08-28 2023-11-07 无锡市锡容电力电器有限公司 一种智慧电网数据优化传输方法

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108052989A (zh) * 2018-02-07 2018-05-18 深圳市唯特视科技有限公司 一种基于样条卷积神经网络的图像分类方法
US20180181857A1 (en) * 2016-12-27 2018-06-28 Texas Instruments Incorporated Reduced Complexity Convolution for Convolutional Neural Networks
CN109084796A (zh) * 2018-08-27 2018-12-25 深圳市烽焌信息科技有限公司 路径导航方法及相关产品
CN109840585A (zh) * 2018-01-10 2019-06-04 中国科学院计算技术研究所 一种面向稀疏二维卷积的运算方法和系统
CN114329324A (zh) * 2021-12-29 2022-04-12 寒武纪行歌(南京)科技有限公司 数据处理电路、数据处理方法及相关产品

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180181857A1 (en) * 2016-12-27 2018-06-28 Texas Instruments Incorporated Reduced Complexity Convolution for Convolutional Neural Networks
CN109840585A (zh) * 2018-01-10 2019-06-04 中国科学院计算技术研究所 一种面向稀疏二维卷积的运算方法和系统
CN108052989A (zh) * 2018-02-07 2018-05-18 深圳市唯特视科技有限公司 一种基于样条卷积神经网络的图像分类方法
CN109084796A (zh) * 2018-08-27 2018-12-25 深圳市烽焌信息科技有限公司 路径导航方法及相关产品
CN114329324A (zh) * 2021-12-29 2022-04-12 寒武纪行歌(南京)科技有限公司 数据处理电路、数据处理方法及相关产品

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117831596A (zh) * 2024-03-05 2024-04-05 悦芯科技股份有限公司 一种存储芯片稀疏失效单元电路的修复方法
CN117831596B (zh) * 2024-03-05 2024-05-24 悦芯科技股份有限公司 一种存储芯片稀疏失效单元电路的修复方法

Also Published As

Publication number Publication date
CN114329324A (zh) 2022-04-12

Similar Documents

Publication Publication Date Title
WO2023123919A1 (fr) Circuit de traitement de données, procédé de traitement de données et produit associé
CN107895191B (zh) 一种信息处理方法及相关产品
CN109284823B (zh) 一种运算装置及相关产品
CN109871510B (zh) 二维卷积运算处理方法、系统、设备及计算机存储介质
CN109189473A (zh) 神经网络处理装置及其执行向量交换指令的方法
JP2024510265A (ja) 高解像度ニューラル・レンダリング
WO2023045445A1 (fr) Dispositif de traitement de données, procédé de traitement de données et produit associé
WO2023093623A1 (fr) Procédé d'optimisation de graphe de calcul, procédé de traitement de données et produit associé
CN112799599B (zh) 一种数据存储方法、计算核、芯片和电子设备
WO2023045446A1 (fr) Appareil informatique, procédé de traitement de données et produit associé
WO2022111002A1 (fr) Procédé et appareil permettant d'entraîner un réseau neuronal et support de stockage lisible par ordinateur
CN112633490A (zh) 执行神经网络模型的数据处理装置、方法及相关产品
WO2022134873A1 (fr) Dispositif de traitement de données, procédé de traitement de données et produit associé
US20240160689A1 (en) Method for optimizing convolution operation of system on chip and related product
Xiao et al. FPGA-based scalable and highly concurrent convolutional neural network acceleration
WO2019127926A1 (fr) Procédé de calcul et dispositif de calcul pour réseau neuronal épars, dispositif électronique, support de stockage lisible par ordinateur et produit programme informatique
CN115221103A (zh) 计算装置、数据处理方法及相关产品
WO2022257980A1 (fr) Appareil informatique, procédé de mise en œuvre d'une opération de convolution à l'aide d'un appareil informatique, et produit associé
CN111125627A (zh) 用于池化多维矩阵的方法及相关产品
CN114692838A (zh) 数据处理装置、数据处理方法及相关产品
WO2022134872A1 (fr) Appareil de traitement de données, procédé de traitement de données et produit associé
WO2022135599A1 (fr) Dispositif, carte et procédé pour fusionner des structures de ramification, et support de stockage lisible
Jiang et al. DeepGCNs-Att: Point cloud semantic segmentation with contextual point representations
CN115221105A (zh) 数据处理装置、数据处理方法及相关产品
WO2023087698A1 (fr) Appareil de calcul et procédé pour exécuter une opération de convolution, et produits associés

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22913208

Country of ref document: EP

Kind code of ref document: A1