CN116108902B

CN116108902B - Sampling operation implementation system, method, electronic device and storage medium

Info

Publication number: CN116108902B
Application number: CN202310154737.8A
Authority: CN
Inventors: 陈帅; 金贝贝
Original assignee: Chengdu Denglin Technology Co ltd
Current assignee: Chengdu Denglin Technology Co ltd
Priority date: 2023-02-22
Filing date: 2023-02-22
Publication date: 2024-01-05
Anticipated expiration: 2043-02-22
Also published as: CN116108902A

Abstract

The application provides a sampling operation implementation system, a sampling operation implementation method, an electronic device and a storage medium, and relates to the technical field of computers. The sampling operation implementation system comprises a first processing unit with a systolic array accelerator and a second processing unit connected to the first processing unit. The data to be convolved and the convolution kernel are processed respectively, the data to be convolved being converted into an input feature matrix and the convolution kernel being converted into a kernel matrix; a matrix multiplication operation is then performed on the input feature matrix and the kernel matrix by the systolic array accelerator to obtain a processing result corresponding to the sampling operation. The sampling operation implementation system provided by the application accelerates the computation of the sampling operation and greatly improves data processing efficiency.

Description

Sampling operation implementation system, method, electronic device and storage medium

Technical Field

The present disclosure relates to the field of computer technologies, and in particular, to a system, a method, an electronic device, and a storage medium for implementing sampling operation.

Background

Deep learning (Deep Learning) is a machine learning (Machine Learning) method that learns the inherent regularities and representation hierarchy of sample data, with the goal of enabling a machine to analyze and learn like a person and to recognize data such as text, images and sound.

In a deep learning task, it is often necessary to sample feature data (e.g., a feature map) in order to resize the input feature data to a desired resolution. When the resize operation is used to increase the feature size, the original feature map is usually enlarged, leaving a number of areas to be filled in, and the values of those areas are then calculated by an interpolation algorithm; the principle of a resize operation that reduces the feature size is similar and also requires calculation by an interpolation algorithm.

Common interpolation algorithms for resize include the nearest neighbor interpolation (Nearest Interpolation) algorithm, the bilinear interpolation (Bilinear Interpolation) algorithm, the bicubic interpolation (Bicubic Interpolation) algorithm and the like. These interpolation methods adopt the same interpolation kernel throughout the image interpolation process and do not distinguish between the positions of the pixels to be interpolated. For example, the nearest neighbor interpolation algorithm maps a point in the target image back to the original image, finds the pixel value of the nearest integer coordinate point, and outputs it as the pixel value of that point; the bilinear interpolation algorithm can perform two linear interpolations in the horizontal direction (e.g., the X-axis direction) and then one interpolation in the vertical direction (the Y-axis direction); the bicubic interpolation algorithm is applicable to two-dimensional space and produces smoother image edges than bilinear interpolation.

However, the resize operation is generally computationally intensive, and processing it on a computing device such as a conventional CPU (Central Processing Unit) or a conventional GPU (Graphics Processing Unit) takes a relatively long time and is inefficient.
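As a minimal illustration of the nearest neighbor and bilinear schemes described above, the following Python sketch samples a small two-dimensional image at a fractional coordinate. The function names, the row-major list-of-lists layout and the clamping at the borders are assumptions made for illustration only; they are not taken from the application, and coordinates are assumed to lie inside the image.

```python
import math

def nearest_sample(img, x, y):
    """Return the pixel at the integer coordinate nearest to (x, y)."""
    h, w = len(img), len(img[0])
    # floor(v + 0.5) rounds half up and avoids Python's banker's rounding.
    xi = min(max(math.floor(x + 0.5), 0), w - 1)
    yi = min(max(math.floor(y + 0.5), 0), h - 1)
    return img[yi][xi]

def bilinear_sample(img, x, y):
    """Two horizontal linear interpolations followed by one vertical one."""
    h, w = len(img), len(img[0])
    x0 = min(max(math.floor(x), 0), w - 1)
    y0 = min(max(math.floor(y), 0), h - 1)
    x1 = min(x0 + 1, w - 1)
    y1 = min(y0 + 1, h - 1)
    tx, ty = x - x0, y - y0
    top = img[y0][x0] * (1 - tx) + img[y0][x1] * tx      # horizontal, upper row
    bottom = img[y1][x0] * (1 - tx) + img[y1][x1] * tx   # horizontal, lower row
    return top * (1 - ty) + bottom * ty                  # vertical

image = [[0, 10],
         [20, 30]]
```

With this 2×2 image, sampling at the centre (0.5, 0.5) blends all four pixels equally under the bilinear scheme, whereas the nearest neighbor scheme simply copies one source pixel.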

Disclosure of Invention

The application provides a sampling operation implementation system, a sampling operation implementation method, an electronic device and a storage medium, in which a second processing unit cooperates with a first processing unit having a systolic array accelerator, so that the sampling operation can be computed with acceleration and data processing efficiency is greatly improved.

In a first aspect, embodiments of the present application provide a sampling operation implementation system that includes a first processing unit having a systolic array accelerator, and a second processing unit connected to the first processing unit. The second processing unit is configured to: acquire data to be processed; convert the sampling operation on the data to be processed into a convolution operation, and determine the data to be convolved and the convolution kernel corresponding to the convolution operation; process the data to be convolved and the convolution kernel respectively, converting the data to be convolved into an input feature matrix and converting the convolution kernel into a kernel matrix; send the input feature matrix and the kernel matrix to the first processing unit; and, in response to receiving the matrix multiplication result of the input feature matrix and the kernel matrix sent by the first processing unit, convert the matrix multiplication result into a processing result corresponding to the sampling operation. The first processing unit is configured to: in response to receiving the input feature matrix and the kernel matrix sent by the second processing unit, perform a matrix multiplication operation on the input feature matrix and the kernel matrix through the systolic array accelerator to obtain a matrix multiplication result. The first processing unit is further configured to: send the matrix multiplication result to the second processing unit, or convert the matrix multiplication result into a processing result corresponding to the sampling operation.

The sampling operation in this application refers to a resize operation that may be used to resize data to be processed (for example, a feature map to be processed is scaled in at least one dimension, and the scaling of each dimension may be the same or different). The data to be processed by the sampling operation may be any one of image feature data, text feature data and voice feature data.

In a second aspect, embodiments of the present application provide a sampling operation implementation method applied to a sampling operation implementation system, the system including a first processing unit having a systolic array accelerator, and a second processing unit connected to the first processing unit. The method includes: acquiring data to be processed through the second processing unit; converting, through the second processing unit, the sampling operation on the data to be processed into a convolution operation, and determining the data to be convolved and the convolution kernel corresponding to the convolution operation; processing the data to be convolved and the convolution kernel respectively through the second processing unit, converting the data to be convolved into an input feature matrix, converting the convolution kernel into a kernel matrix, and sending the input feature matrix and the kernel matrix to the first processing unit; performing, by the first processing unit in response to receiving the input feature matrix and the kernel matrix sent by the second processing unit, a matrix multiplication operation on the input feature matrix and the kernel matrix through the systolic array accelerator to obtain a matrix multiplication result; and then either sending, by the first processing unit, the matrix multiplication result to the second processing unit, which in response to receiving it converts the matrix multiplication result into a processing result corresponding to the sampling operation, or converting, by the first processing unit after it obtains the matrix multiplication result, the matrix multiplication result into a processing result corresponding to the sampling operation.

In a third aspect, an embodiment of the present application provides an electronic device, including: a memory and a processor; the memory has stored therein a computer program which, when executed by the processor, performs the method provided by the second aspect described above.

In a fourth aspect, embodiments of the present application provide a computer-readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the method provided in the second aspect above.

The beneficial effects achievable by at least one of the technical solutions adopted in the embodiments of this application are as follows: the sampling operation implementation system can convert a sampling operation, which is not suited to execution on a systolic array accelerator, into a convolution operation, which is suited to execution on a systolic array accelerator, and thereby obtain the processing result corresponding to the sampling operation. By using the first processing unit with the systolic array accelerator (for executing the matrix multiplication) together with the second processing unit (for executing the operations other than the matrix multiplication), the computation of the sampling operation is accelerated, and the data processing efficiency of executing the sampling operation is greatly improved.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and, together with the description, serve to explain the technical aspects of the application.

Fig. 1 shows a block diagram of a sampling operation implementation system provided in an embodiment of the present application.

Fig. 2 is a schematic diagram illustrating a correspondence relationship between input features and output features according to an embodiment of the present application.

Fig. 3 shows a schematic diagram of a structured convolution kernel provided by an embodiment of the present application.

Fig. 4 shows a dimension conversion schematic diagram of data to be processed according to an embodiment of the present application.

Fig. 5 shows a schematic diagram for converting a sampling operation into a convolution operation according to an embodiment of the present application.

Fig. 6 shows a schematic diagram of converting a convolution operation into a matrix multiplication operation provided by an embodiment of the present application.

Fig. 7 shows a schematic diagram of a systolic array accelerator provided in an embodiment of the present application.

Fig. 8 shows a convolution kernel segmentation schematic diagram provided in an embodiment of the present application.

Fig. 9 shows a schematic diagram of a sampling operation implementation method provided in an embodiment of the present application.

Detailed Description

Various exemplary embodiments, features and aspects of the present application will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numerals designate identical or similar elements and objects.

The word "exemplary" is used herein to mean "serving as an example, embodiment, or illustration". Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.

The term "and/or" herein merely describes an association relationship between associated objects and means that three relationships may exist; e.g., "A and/or B" may represent: A exists alone, A and B exist together, or B exists alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality; for example, "including at least one of A, B and C" may mean including any one or more elements selected from the group consisting of A, B and C.

Furthermore, numerous specific details are set forth in the following detailed description in order to provide a better understanding of the present application. It will be understood by those skilled in the art that the present application may be practiced without some of these specific details. In some instances, methods, means, elements, and circuits have not been described in detail so as not to unnecessarily obscure the present application.

Neural networks are the basis for artificial intelligence (Artificial Intelligence, AI) applications and have become popular in application fields such as speech recognition, image recognition, video processing and autonomous driving; in many of these fields, the accuracy of neural networks has even exceeded that of humans. A neural network may include a plurality of network layers with different functions, such as a convolutional layer, a pooling layer, an activation layer, and the like.

The convolution operation involves a large amount of data reuse and regular computation, and therefore leaves considerable room for optimization in terms of architecture design and computation, but it may be difficult for existing mainstream computing devices (using a CPU or a conventional GPU) to complete the convolution operation efficiently. Accordingly, acceleration chips for convolution operations, for example systolic array accelerators (Systolic Arrays), have been created in the related art. Systolic array accelerators are designed to implement matrix multiplication and can be combined with on-chip storage to accelerate convolution operations in neural networks efficiently, providing higher computational power at lower energy consumption. A systolic array accelerator is a processor with an array architecture in which data is transferred in a pipelined fashion between the processing units of the array, each of which can process its data concurrently in parallel.
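To make the pipelined data movement concrete, the following Python sketch simulates a small systolic array computing a matrix product in software. It is only a sketch under stated assumptions: the application does not specify a dataflow, so an output-stationary arrangement is assumed here, in which each processing element keeps one accumulator, rows of the left matrix enter from the left edge, columns of the right matrix enter from the top edge, and both are skewed by one cycle per row or column. The function name `systolic_matmul` is hypothetical.

```python
def systolic_matmul(a, b):
    """Simulate an output-stationary systolic array computing a x b."""
    m, k_dim, n = len(a), len(b), len(b[0])
    acc = [[0] * n for _ in range(m)]     # one accumulator per processing element
    a_reg = [[0] * n for _ in range(m)]   # operand currently moving rightwards
    b_reg = [[0] * n for _ in range(m)]   # operand currently moving downwards
    for t in range(m + n + k_dim - 2):
        # Pass operands to the neighbouring element (right and down).
        for i in range(m):
            for j in range(n - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
        for j in range(n):
            for i in range(m - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
        # Feed the array edges; row i and column j are delayed by i and j cycles.
        for i in range(m):
            k = t - i
            a_reg[i][0] = a[i][k] if 0 <= k < k_dim else 0
        for j in range(n):
            k = t - j
            b_reg[0][j] = b[k][j] if 0 <= k < k_dim else 0
        # Every element multiplies and accumulates in the same cycle.
        for i in range(m):
            for j in range(n):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc

product = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

Because of the skew, the element at row i and column j sees the operands with inner index t - i - j at cycle t, so each pair of matching operands meets exactly once and the accumulators end up holding the ordinary matrix product.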

In the embodiments of the present application, the sampling operation refers to a resize operation that can be used to resize data to be processed, in which a weighted value of the elements within a neighborhood range of the data to be processed is calculated as an element of the sampling output (the output feature matrix). The sampling operation may include downsampling, which shrinks the data to be processed, reduces the number of sampling points and lowers the resolution of the data to be processed, and upsampling, which enlarges the data to be processed, increases the number of sampling points and raises the resolution of the data to be processed.

For the resize operation, processing on a CPU, a conventional GPU or another such computing device may take a relatively long time and be inefficient; however, no acceleration chip specially adapted to the resize operation that can effectively improve resize processing efficiency has yet been found.

In view of this, an embodiment of the present application provides a sampling operation implementation system including a first processing unit with a systolic array accelerator, and a second processing unit connected to the first processing unit. The second processing unit is used in conjunction with the first processing unit with the systolic array accelerator, so that the sampling operation on the data to be processed can be converted into a convolution operation, the data to be convolved and the convolution kernel corresponding to the convolution operation can be determined, and the data to be convolved and the convolution kernel can be processed respectively to convert the data to be convolved into an input feature matrix and the convolution kernel into a kernel matrix; a matrix multiplication operation can then be performed on the input feature matrix and the kernel matrix by the systolic array accelerator to obtain the processing result corresponding to the sampling operation.

The sampling operation implementation system can convert a sampling operation, which is not suited to execution on a systolic array accelerator, into a convolution operation, which is suited to execution on a systolic array accelerator. Through the cooperation of the first processing unit with the systolic array accelerator (for executing the matrix multiplication) and the second processing unit (for executing the operations other than the matrix multiplication), the computation of the sampling operation on the data to be processed is accelerated, and the data processing efficiency of executing the sampling operation is greatly improved.

The scheme provided by the embodiment of the application can be applied to various technical fields, and can be applied to products or technologies in the technical fields of automatic driving, intelligent security, face recognition and the like.

Fig. 1 shows a block diagram of a sampling operation implementation system provided in an embodiment of the present application, and as shown in fig. 1, the system includes a first processing unit 11 with a systolic array accelerator, and a second processing unit 12 connected to the first processing unit 11. The first processing unit 11 and the second processing unit 12 may be directly connected or indirectly connected.

The second processing unit 12 is configured to: acquiring data to be processed; converting the sampling operation of the data to be processed into convolution operation, and determining the data to be convolved and a convolution kernel corresponding to the convolution operation; processing the data to be convolved and the convolution kernel respectively, converting the data to be convolved into an input feature matrix, and converting the convolution kernel into a kernel matrix; the input feature matrix and the kernel matrix are sent to the first processing unit 11.

The first processing unit 11 is configured to: in response to receiving the input feature matrix and the kernel matrix sent by the second processing unit 12, perform a matrix multiplication operation on the input feature matrix and the kernel matrix through the systolic array accelerator to obtain a matrix multiplication result.

The first processing unit 11 is further operable to: after the matrix multiplication result is obtained, the matrix multiplication result is sent to the second processing unit 12.

The second processing unit 12 is further operable to: in response to receiving the matrix multiplication result of the input feature matrix and the kernel matrix sent by the first processing unit 11, the matrix multiplication result is converted into a processing result corresponding to the sampling operation.

In some embodiments, the conversion process of converting the matrix multiplication result into the processing result corresponding to the sampling operation may also be performed by the first processing unit 11, that is, in some embodiments, the first processing unit 11 may further be configured to: and after the matrix multiplication operation result is obtained, converting the matrix multiplication operation result into a processing result corresponding to the sampling operation.

In one possible implementation, the sampling operation implementation system may be an electronic device that includes a systolic array accelerator, which may be any electronic product that can interact with a user, such as a personal computer, a tablet, a smart phone, a personal digital assistant (Personal Digital Assistant, PDA), a gaming machine, an interactive web television (Internet Protocol Television, IPTV), a handheld device, a server device, a vehicle-mounted device, a smart wearable device, etc.

In a possible implementation, the electronic device may be a device capable of automatically performing numerical calculation and/or information processing according to instructions set or stored in advance, and the processing component of the electronic device may include a first processing unit 11 having a systolic array accelerator, and a second processing unit 12 of another type connected to the first processing unit 11.

The first processing unit 11 and the second processing unit 12 in the sampling operation implementation system may be completely new designs, or may be modified from an existing processing unit or an existing processor, for example from an existing central processing unit (Central Processing Unit, CPU), a graphics processing unit (Graphics Processing Unit, GPU), a general-purpose graphics processing unit (General-Purpose computing on Graphics Processing Units, GPGPU), a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a tensor processing unit (Tensor Processing Unit, TPU), a field programmable gate array (Field Programmable Gate Array, FPGA), or the like. By way of example, the second processing unit 12 may be a central processing unit (CPU), a general purpose processing unit (CU), or a general-purpose graphics processing unit (GPGPU). The first processing unit 11 may be any type of processing unit with a systolic array accelerator (in some application scenarios the first processing unit may be the systolic array accelerator itself). The specific types of the first processing unit 11 and the second processing unit 12 are not limited in the present application, as long as the acceleration processing for the sampling operation can be completed by the systolic array accelerator.

In the embodiment of the present application, the first processing unit 11 and the second processing unit 12 in the sampling operation implementation system may be different processing units integrated on the same chip, or may be processing units independently distributed on different chips.

As an embodiment, the sampling operation implementation system may be presented as an on-chip heterogeneous processor chip, with the first processing unit 11 as one module packaged in the chip and the second processing unit 12 as another module packaged in the chip. As another implementation, the sampling operation implementation system may be presented as an electronic device, and the first processing unit 11, the second processing unit 12 may be different processing modules or different processor chips in the electronic device. As yet another implementation, the sampling operation implementation system may be presented as a cluster with a plurality of processing units, the first processing unit 11, the second processing unit 12 may be different processor chips or electronic devices in the cluster.

The sampling operation implementation system may integrate a plurality of first processing units 11 and a plurality of second processing units 12, and the number of the first processing units 11 and the number of the second processing units 12 that can be integrated in the same chip are not limited in this application.

In the embodiment of the present application, the second processing unit 12 (e.g. a CPU module) connected to the first processing unit 11 with the systolic array accelerator may be configured to convert the sampling operation of the data to be processed into a matrix multiplication operation between the input feature matrix and the kernel matrix, and send the input feature matrix and the kernel matrix to the first processing unit 11 with the systolic array accelerator, so as to perform the matrix multiplication operation on the input feature matrix and the kernel matrix according to the received input feature matrix and the kernel matrix through the systolic array accelerator.

The data to be processed, which is acquired by the sampling operation implementation system through the second processing unit 12, may be feature data in a deep learning task, where the feature data may be any one of image feature data, voice feature data, and text feature data, and the embodiment of the present application does not limit the type of the data to be processed. After sampling operation is carried out on the data to be processed, new characteristic data with specified resolution can be obtained.

For example, in a scenario in which face recognition is performed on a target object using a deep neural network, a sampling operation (i.e., a resize operation) may be required on image feature data (e.g., a face feature map) of the target object to obtain the resolution required by the deep neural network (e.g., to obtain a higher-resolution feature map with more detail, or to obtain a lower-resolution thumbnail). In this case, the sampling operation implementation system may acquire the image feature data of the target object as the data to be processed through the second processing unit 12.

For example, in a scenario in which a deep neural network is used to perform speech recognition on a target object, a sampling operation (i.e., a resize operation) may be performed on voice feature data (e.g., a voice feature matrix) of the target object to obtain the resolution required by the deep neural network. In this case, the sampling operation implementation system may acquire the voice feature data of the target object through the second processing unit 12, and take the voice feature data of the target object as the data to be processed.

Similarly, in a scenario in which a deep neural network is used to perform text recognition on a target document, a sampling operation (i.e., a resize operation) may be performed on text feature data (for example, a text feature matrix) of the target document. In this case, the sampling operation implementation system may acquire the text feature data of the target document through the second processing unit 12, and use the text feature data of the target document as the data to be processed.

In one possible implementation, the data to be processed acquired by the second processing unit 12 may be a multidimensional tensor (tensor), which may be regarded as a multidimensional matrix, and the data to be processed may be three-dimensional data, four-dimensional data or other multidimensional data, etc., and the dimension of the data to be processed is not limited in the embodiments of the present application.

As described above, considering that the systolic array accelerator combined with on-chip storage can accelerate convolution operations efficiently, and in order to improve the processing efficiency of the sampling operation, the sampling operation implementation system may, through the second processing unit 12, convert the sampling operation on the data to be processed into a convolution operation and determine the data to be convolved and the convolution kernel corresponding to the convolution operation, and may then perform the matrix multiplication operation with the systolic array accelerator of the first processing unit 11. In this way the sampling operation (i.e., resize) is implemented by a convolution operation and is accelerated.

Specifically, according to the embodiment of the application, aiming at the calculation characteristics of the sampling operation, a corresponding relation between each element position in the output characteristic matrix of the sampling operation and each element position in the data to be processed is established, a convolution kernel is constructed according to the corresponding relation, and the sampling operation process is converted into a convolution operation process for processing. The second processing unit 12 converts the sampling operation of the data to be processed into the convolution operation of the data to be convolved and the convolution kernel by constructing the convolution kernel for the sampling operation of the data to be processed, so that the sampling operation result of the data to be processed corresponds to the convolution operation result between the data to be convolved and the convolution kernel.

In one possible implementation, the second processing unit 12 of the sampling operation implementation system is configured to: in the process of converting the sampling operation of the data to be processed into convolution operation and determining the data to be convolved and the convolution kernel corresponding to the convolution operation, establishing a corresponding relation between each element position in the output feature matrix and each element position in the data to be processed according to the size of the data to be processed, the size of the output feature matrix corresponding to the sampling operation and a sampling mode aiming at the sampling operation of any sampling dimension; constructing a convolution kernel for the data to be processed and the output feature matrix corresponding to the sampling operation according to the corresponding relation, wherein the convolution kernel is assigned based on the weight of the data to be processed under the corresponding relation; and carrying out dimension transformation on the data to be processed, and transforming the sampling dimension into a channel dimension to obtain the data to be convolved.

The size of the output feature matrix can be determined according to the size of the data to be processed and the sampling ratio, or can be directly acquired or designated; the size of the convolution kernel is determined according to the size of the data to be processed and the size of the output feature matrix.

Illustratively, the data to be processed is three-dimensional data whose size is (n, c, w) = (1, 1, 3), where n represents the batch (number) dimension, c represents the channel dimension (e.g., a black-and-white image has c = 1 channel and an RGB color image has c = 3 channels), and w represents the width dimension; w = 3 means that there are 3 elements in the width dimension.

The w dimension of the data to be processed can be sampled by nearest neighbor sampling (nearest), with rounding performed in floor mode; floor-mode rounding is used to determine the coordinate position when the transformed coordinate is a non-integer during the sampling operation.

The size of the output feature matrix of the sampling operation is (n, c, w) = (1, 1, 6). In practical applications, the size of the output feature matrix of the sampling operation may be obtained directly, or it may be determined from the size of the data to be processed and the sampling ratio. For example, if the size of the data to be processed is (n, c, w) = (1, 1, 3) and the sampling ratio is scale = (1, 1, 2), the size of the output feature matrix is (n, c, w) = (1, 1, 3×2) = (1, 1, 6). This is not limited in this application.
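The size calculation in this example can be sketched in a few lines of Python. The sketch assumes that the example sizes are read as three-element tuples, i.e. an input of (n, c, w) = (1, 1, 3) and a sampling ratio of (1, 1, 2), and that a fractional product is rounded down; the function name `output_size` and the rounding choice are assumptions, not something the application specifies.

```python
import math

def output_size(input_size, scale):
    """Multiply each dimension of the input size by its sampling ratio."""
    if len(input_size) != len(scale):
        raise ValueError("input_size and scale must have the same number of dimensions")
    # Assumption: a fractional result is rounded down to a whole element count.
    return tuple(math.floor(dim * ratio) for dim, ratio in zip(input_size, scale))

size = output_size((1, 1, 3), (1, 1, 2))
```

Only the width dimension changes here, because the ratios for n and c are both 1.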

Fig. 2 is a schematic diagram illustrating a correspondence relationship between input features and output features according to an embodiment of the present application. As shown in fig. 2, according to the size (n, c, w) = (1, 3) of the data to be processed, the size (n, c, w) = (1, 6) of the output feature matrix, and the sampling manner, a corresponding relationship between each element position in the output feature matrix and each element position in the input data to be processed can be established. For example, an element at the 0 position in the output feature matrix corresponds to the f loor (2/5×0) =0 position of the data to be processed, where "2" here represents the position of the last element of the data to be processed in the sampling dimension w, and "5" here represents the position of the last element of the output feature matrix in the sampling dimension w; outputting the f loor (2/5 multiplied by 1) =0 position of the 1 position element in the feature matrix corresponding to the data to be processed; outputting f loor (2/5×2) =0 positions of the data to be processed corresponding to elements in the 2 positions in the feature matrix; outputting the f loor (2/5×3) =1 position of the data to be processed corresponding to the element at the 3 position in the feature matrix; outputting f loor (2/5×4) =1 positions of the data to be processed corresponding to elements at 4 positions in the feature matrix; the element of 5 positions in the output feature matrix corresponds to the f loor (2/5×5) =2 position of the data to be processed.

In this way, a correspondence between each element position in the output feature matrix and each element position in the input data to be processed is established: the elements at positions 0, 1 and 2 in the output feature matrix correspond to position 0 of the data to be processed, the elements at positions 3 and 4 correspond to position 1 of the data to be processed, and the element at position 5 corresponds to position 2 of the data to be processed. Correspondingly, the element values at positions 0, 1 and 2 in the output feature matrix depend on the element value at position 0 of the data to be processed, the element values at positions 3 and 4 depend on the element value at position 1, and the element value at position 5 depends on the element value at position 2.

It should be understood that if the transformed coordinates are non-integers, other rounding methods may be used; for example, rounding up with the ceil method, in which case the correspondence changes (i.e., differs from the correspondence illustrated in fig. 2). For example, the element at position 0 in the output feature matrix may correspond to position 0 of the data to be processed, the elements at positions 1 and 2 may correspond to position 1 of the data to be processed, and the elements at positions 3, 4 and 5 may correspond to position 2 of the data to be processed.
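By way of illustration only, the position correspondence described above can be sketched as follows. The function name `index_map` and its `rounding` parameter are hypothetical and are not part of the application; integer arithmetic is used so that floor and ceil are exact.

```python
def index_map(in_size: int, out_size: int, rounding: str = "floor") -> list[int]:
    """Map each output position to an input position along the sampling dimension.

    Implements round((in_size - 1) / (out_size - 1) * i) with floor or ceil rounding,
    where in_size - 1 and out_size - 1 are the positions of the last elements.
    """
    last_in, last_out = in_size - 1, out_size - 1
    if last_out == 0:
        return [0]
    if rounding == "floor":
        return [(i * last_in) // last_out for i in range(out_size)]
    if rounding == "ceil":
        # Ceiling division expressed through floor division on the negated value.
        return [-((-i * last_in) // last_out) for i in range(out_size)]
    raise ValueError("rounding must be 'floor' or 'ceil'")


floor_map = index_map(3, 6, "floor")
ceil_map = index_map(3, 6, "ceil")
print(floor_map)  # [0, 0, 0, 1, 1, 2]
print(ceil_map)   # [0, 1, 1, 2, 2, 2]
```

The two printed lists match the floor correspondence of fig. 2 and the ceil correspondence described in the preceding paragraph.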

After determining the correspondence between each element in the output feature matrix and each element in the data to be processed, the second processing unit 12 may set the value of the convolution kernel to the weight value of the corresponding data to be processed according to the correspondence; the size of the convolution kernel may be determined according to the size of the data to be processed and the size of the output feature matrix.

Fig. 3 shows a schematic diagram of a constructed convolution kernel provided by an embodiment of the present application. As shown in fig. 3, a convolution kernel filter (oc, ic, 1) = (6, 3, 1) may be constructed according to the size (n, c, w) = (1, 1, 3) of the data to be processed and the size (n, c, w) = (1, 1, 6) of the output feature matrix, where oc = 6 represents the size of the sampling dimension output after the sampling operation and ic = 3 represents the size of the dimension to be sampled that is input before the sampling operation.

Based on the above-mentioned correspondence, since the elements at positions 0, 1 and 2 in the output feature matrix all correspond to position 0 of the data to be processed, the weight [1, 0, 0] corresponding to position 0 of the data to be processed can be determined as the value of part 31 of the convolution kernel shown in fig. 3. In the nearest neighbor sampling (nearest) method, the element value of the data to be processed at the corresponding position can be used directly as the weight.

Since the elements at the 3,4 positions in the output feature matrix all correspond to the 1 position of the data to be processed, the weight [0,1,0] corresponding to the 1 position of the data to be processed can be determined as the value of the 32 part of the convolution kernel as shown in fig. 3.

Since the element at position 5 in the output feature matrix corresponds to position 2 of the data to be processed, the weight [0, 0, 1] corresponding to position 2 of the data to be processed can be determined as the value of part 33 of the convolution kernel shown in fig. 3.
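The assignment of parts 31, 32 and 33 can be sketched as below, purely as an illustration. The helper name `build_nearest_kernel` is hypothetical, and the trailing kernel dimension of size 1 in filter (oc, ic, 1) is dropped, so the kernel is held as an oc × ic nested list.

```python
def build_nearest_kernel(mapping: list[int], in_size: int) -> list[list[int]]:
    """Build an (oc, ic) kernel whose row i selects input position mapping[i]."""
    kernel = []
    for src in mapping:
        row = [0] * in_size
        row[src] = 1  # weight 1 at the corresponding input position, 0 elsewhere
        kernel.append(row)
    return kernel


# Correspondence of fig. 2: outputs 0,1,2 -> input 0; outputs 3,4 -> input 1; output 5 -> input 2.
kernel = build_nearest_kernel([0, 0, 0, 1, 1, 2], in_size=3)
for row in kernel:
    print(row)
```

Rows 0 to 2 are [1, 0, 0] (part 31), rows 3 and 4 are [0, 1, 0] (part 32), and row 5 is [0, 0, 1] (part 33).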

Fig. 4 shows a schematic diagram of the dimension conversion of the data to be processed according to an embodiment of the present application. As shown in fig. 4, the data to be processed (n, c, w) = (1, 1, 3) may be subjected to dimension conversion, in which the sampling dimension w is converted into the channel dimension c, so as to obtain the data to be convolved Conv (n, w, c) = (1, 3, 1).

Fig. 5 shows a schematic diagram of converting a sampling operation into a convolution operation. As shown in fig. 5, a convolution operation may be performed on the convolution kernel filter (6, 3, 1) and the data to be convolved Conv (1, 3, 1) to obtain a convolution result Output (1, 6, 1). A dimension transformation is then performed on the convolution result Output (1, 6, 1), transforming its channel dimension back to the sampling dimension (so that c = 1 and w = 6), whereby the sampling result (n, c, w) = (1, 1, 6) of the data to be processed is obtained.
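The flow of fig. 4 and fig. 5 can be sketched end to end as follows. This is a minimal sketch in plain Python lists, not the implementation of the application: the helper names and the sample values 10, 20, 30 are assumptions chosen for illustration, and the convolution is written as a pointwise (kernel width 1) convolution over the channel dimension.

```python
def swap_axes_1_2(x):
    """Swap the second and third dimensions of a 3-D nested list."""
    return [[list(col) for col in zip(*x_n)] for x_n in x]


def pointwise_conv(x, kernel):
    """out[n][oc][p] = sum over ic of kernel[oc][ic] * x[n][ic][p]."""
    return [[[sum(k_row[ic] * x_n[ic][p] for ic in range(len(x_n)))
              for p in range(len(x_n[0]))]
             for k_row in kernel]
            for x_n in x]


# Data to be processed, (n, c, w) = (1, 1, 3); the values are illustrative.
data = [[[10, 20, 30]]]
# Kernel (oc, ic) = (6, 3) as in fig. 3, trailing size-1 dimension dropped.
kernel = [[1, 0, 0], [1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 1, 0], [0, 0, 1]]

to_convolve = swap_axes_1_2(data)           # Conv (n, w, c) = (1, 3, 1)
output = pointwise_conv(to_convolve, kernel)  # Output (1, 6, 1)
result = swap_axes_1_2(output)              # sampling result (n, c, w) = (1, 1, 6)
print(result)  # [[[10, 10, 10, 20, 20, 30]]]
```

Each output element is a copy of the input element selected by the corresponding kernel row, which is exactly nearest neighbor sampling under the floor correspondence.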

According to the sampling operation implementation system provided by the embodiment of the application, a sampling operation that is not originally suitable for execution by the systolic array accelerator can be converted into a convolution operation that is suitable for execution by the systolic array accelerator, so that the sampling operation is realized through the systolic array accelerator, the data processing speed is improved, and the efficiency of resize processing is improved.

It should be understood that, besides the nearest neighbor sampling described above, the sampling manner may be adjacent point sampling, bi-cubic sampling, etc., and the constructed convolution kernel may be assigned according to the sampling manner. In the nearest neighbor manner, the element value of the data to be processed may be assigned directly as a weight to the convolution kernel; in other sampling manners, the weight may first be determined based on the element values of the data to be processed (for example, the weight may be a weighted sum of the element values) and then assigned to the convolution kernel. For details, reference may be made to the nearest neighbor sampling manner, which is not repeated herein.
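As a hedged illustration of how a kernel might be assigned for a sampling manner other than nearest neighbor, the sketch below uses linear interpolation between the two neighbouring input positions. Linear interpolation is an assumption chosen for the example and is not a manner named in this application; the function name `build_linear_kernel` and the sample values are likewise hypothetical.

```python
import math


def build_linear_kernel(in_size: int, out_size: int) -> list[list[float]]:
    """Build an (oc, ic) kernel whose rows hold linear interpolation weights."""
    kernel = []
    for i in range(out_size):
        # Same coordinate transform as the nearest neighbor example, without rounding.
        src = i * (in_size - 1) / (out_size - 1) if out_size > 1 else 0.0
        lo = min(math.floor(src), in_size - 1)
        hi = min(lo + 1, in_size - 1)
        frac = src - lo
        row = [0.0] * in_size
        row[lo] += 1.0 - frac  # weight of the left neighbour
        row[hi] += frac        # weight of the right neighbour
        kernel.append(row)
    return kernel


linear_kernel = build_linear_kernel(3, 5)
data = [10.0, 20.0, 30.0]
resized = [sum(w * x for w, x in zip(row, data)) for row in linear_kernel]
print(resized)  # [10.0, 15.0, 20.0, 25.0, 30.0]
```

Only the weights in the kernel change; the dimension transformation and the convolution remain the same as in the nearest neighbor case.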

The second processing unit 12 of the sampling operation implementation system is further operable to: in the process of converting the sampling operation into the convolution operation, perform cyclic replication processing on the data to be convolved and on the convolution kernel of the convolution operation respectively, converting the data to be convolved into a two-dimensional input feature matrix and converting the convolution kernel into a two-dimensional kernel matrix.

In one possible implementation manner, the second processing unit 12 of the sampling operation implementation system may perform a cyclic replication process on the convolution kernel, and sequentially convert the data included in the convolution kernel into a column vector or a row vector, and arrange the column vector or the row vector into a two-dimensional kernel matrix; and performing cyclic replication processing on the data to be convolved, and sequentially converting the data to be convolved corresponding to the convolution kernel receptive field into column vectors or row vectors, and arranging the column vectors or the row vectors into a two-dimensional input feature matrix. When the data to be convolved is converted into a column vector or a row vector, the convolution kernel can be moved on the data to be convolved according to a preset step length, the data to be convolved in the coverage area of the convolution kernel receptive field is sequentially converted into the column vector or the row vector, and the column vector or the row vector is arranged into an input feature matrix.

Illustratively, the second processing unit 12 may utilize an im2col (image to column) operation, which is a commonly used optimization of the convolution operation, to unfold the data to be convolved and the convolution kernel into two-dimensional matrices respectively, namely an input feature matrix and a kernel matrix. Multiplying the two matrices yields the correct convolution result.

Illustratively, as shown in fig. 6, the convolution kernel corresponding to the convolution operation is four-dimensional data of size (n, c, h, w) = (2, 3, 2, 2), with 2 elements in the number dimension n, 3 elements in the channel dimension c, and 2 elements in each of the height dimension h and the width dimension w. The first convolution kernel and the second convolution kernel therefore each comprise first channel data, second channel data and third channel data, each of which is a 2×2 matrix (for the element values, see fig. 6).

The second processing unit 12 may, following the convolution loop, expand the convolution kernel into a kernel matrix through cyclic replication; that is, each channel data of the first convolution kernel and of the second convolution kernel is sequentially converted into a column vector, and the column vectors are arranged into a two-dimensional kernel matrix (see fig. 6).

The data to be convolved corresponding to the convolution operation is three-dimensional data of size (c, h, w) = (3, 3, 3), with 3 elements in each of the channel dimension c, the height dimension h and the width dimension w. The first channel data, the second channel data and the third channel data of the data to be convolved are each a 3×3 matrix (for the element values, see fig. 6).

The second processing unit 12 may, following the convolution loop, expand the input data to be convolved through cyclic replication corresponding to the kernel matrix elements. As shown in fig. 6, the convolution kernel may be moved over the data to be convolved with a preset step of "1"; the data of each channel within the coverage area of the convolution kernel (such as the region circled on the data to be convolved in fig. 6) is sequentially converted into a row vector, and the row vectors are stacked to form an input feature matrix suitable for matrix multiplication (see fig. 6). It should be understood that fig. 6 only takes a preset step of "1" as an example.

In this way, the sampling operation implementation system can implement the convolution operation as a matrix multiplication (GEMM, General Matrix Multiplication): the data to be convolved and the convolution kernel are each unfolded into a two-dimensional matrix, and the two matrices are multiplied to obtain the correct convolution result. The computation involves a large amount of data reuse and is highly regular, which leaves considerable room for optimization in hardware architecture design and in the computation itself, and is beneficial to realizing the sampling operation using a systolic array accelerator.
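The unfolding described above can be sketched as follows, using the same sizes as the example of fig. 6 (a kernel of size (2, 3, 2, 2), data of size (3, 3, 3), step 1). The element values are placeholders chosen for illustration, not the values of fig. 6, and all function names are hypothetical. The sketch checks that the product of the two unfolded matrices equals a directly computed convolution.

```python
def im2col(x, kh, kw, stride=1):
    """x[c][h][w] -> one row per receptive field, flattened in (c, kh, kw) order."""
    c, h, w = len(x), len(x[0]), len(x[0][0])
    return [[x[ch][i + di][j + dj]
             for ch in range(c) for di in range(kh) for dj in range(kw)]
            for i in range(0, h - kh + 1, stride)
            for j in range(0, w - kw + 1, stride)]


def kernel_to_matrix(k):
    """k[n][c][kh][kw] -> matrix with one column per convolution kernel."""
    columns = [[v for channel in kern for row in channel for v in row] for kern in k]
    return [list(r) for r in zip(*columns)]


def matmul(a, b):
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]


def direct_conv(x, k, stride=1):
    """Reference convolution: out[n][i][j], computed without unfolding."""
    c, h, w = len(x), len(x[0]), len(x[0][0])
    kh, kw = len(k[0][0]), len(k[0][0][0])
    return [[[sum(x[ch][i + di][j + dj] * kern[ch][di][dj]
                  for ch in range(c) for di in range(kh) for dj in range(kw))
              for j in range(0, w - kw + 1, stride)]
             for i in range(0, h - kh + 1, stride)]
            for kern in k]


# Placeholder values: data (c, h, w) = (3, 3, 3), kernels (n, c, h, w) = (2, 3, 2, 2).
x = [[[ch * 9 + r * 3 + col for col in range(3)] for r in range(3)] for ch in range(3)]
k = [[[[1, 0], [0, -1]] for _ in range(3)],
     [[[0, 1], [1, 0]] for _ in range(3)]]

input_matrix = im2col(x, 2, 2)        # 4 rows (output positions) x 12 columns
kernel_matrix = kernel_to_matrix(k)   # 12 rows x 2 columns (one per kernel)
gemm = matmul(input_matrix, kernel_matrix)

direct = direct_conv(x, k)
flat = [[direct[n][i][j] for n in range(2)] for i in range(2) for j in range(2)]
print(gemm == flat)  # True
```

Each element of the data appears in several rows of the input feature matrix, which is the data reuse mentioned in the preceding paragraph.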

After the second processing unit 12 of the sampling operation implementation system obtains the input feature matrix and the kernel matrix, it may send them to the first processing unit 11 having the systolic array accelerator, so that the first processing unit 11, in response to receiving the input feature matrix and the kernel matrix sent by the second processing unit 12, performs a matrix multiplication operation on the input feature matrix and the kernel matrix through the systolic array accelerator to obtain a matrix multiplication result. In this way, the input feature matrix and the kernel matrix can be fed into the systolic array, and the matrix multiplication of the input feature matrix and the kernel matrix is completed by the systolic array, which improves data processing efficiency.

In a possible implementation manner, the first processing unit 11 performs a matrix multiplication operation on the input feature matrix and the kernel matrix through the systolic array accelerator to obtain a matrix multiplication result, and sends the matrix multiplication result to the second processing unit 12. In response to receiving the matrix multiplication result of the input feature matrix and the kernel matrix sent by the first processing unit 11, the second processing unit 12 performs a dimension transformation on the matrix multiplication result, transforming its channel dimension to the sampling dimension, to obtain a processing result; the processing result is used as the result of the sampling operation performed on the data to be processed by the sampling operation implementation system.

Fig. 7 shows a schematic diagram of a systolic array accelerator provided in an embodiment of the present application. As shown in fig. 7, the systolic array accelerator of the first processing unit 11 may include a two-dimensional systolic array, a first interface and a second interface. The two-dimensional systolic array may include a plurality of operation units (such as multiplier-adders), and data can flow through and be reused within the array formed by the operation units. This reduces the number of memory accesses, makes the hardware structure of the chip more regular and the wiring more uniform, and thereby raises the computation frequency, effectively alleviating the problem that data access is far slower than data processing.

In the process of performing matrix multiplication operation by using the systolic array accelerator, the kernel matrix can be preloaded to the second interface, so that each group of convolution kernels can be multiplied and added with the input feature matrix of the first interface in the subsequent process.

Illustratively, in the transverse direction of the systolic array, each element in the input feature matrix is sequentially propagated in clock cycles, specifically, the rows of the input feature matrix correspond to the columns of the kernel matrix, multiplication of two corresponding elements is completed in each clock cycle, and propagation into the systolic array is started at intervals of one clock cycle between the rows of the input feature matrix.

In each operation unit, the element of the input feature matrix propagated to that operation unit is multiplied by the element of the kernel matrix, and the partial sum of the previous operation unit is added, to obtain the partial sum of the current operation unit in the current clock cycle.

In the longitudinal direction of the systolic array, the partial sum calculation results of the individual arithmetic units are propagated and accumulated in clock cycles.

As the partial sum calculation results propagate out of the systolic array, the elements of the matrix multiplication result are obtained one after another; once all of them have been obtained, they constitute the matrix multiplication result.
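The propagation described above can be modelled cycle by cycle as in the sketch below. It assumes one common weight-stationary arrangement, in which operation unit (k, n) holds the preloaded kernel matrix element w[k][n], row k of the array receives column k of the input feature matrix delayed by k clock cycles, input elements move one unit to the right per cycle, and partial sums move one unit down per cycle. The exact dataflow of fig. 7 may differ; the function name and the matrices are illustrative.

```python
def systolic_matmul(a, w):
    """Simulate a K x N weight-stationary systolic array computing a (M x K) times w (K x N)."""
    m_rows, k_dim, n_cols = len(a), len(w), len(w[0])
    a_reg = [[0] * n_cols for _ in range(k_dim)]  # input value held by each unit
    p_reg = [[0] * n_cols for _ in range(k_dim)]  # partial sum produced by each unit
    out = [[0] * n_cols for _ in range(m_rows)]
    for t in range(m_rows + k_dim + n_cols - 2):  # clock cycles
        new_a = [[0] * n_cols for _ in range(k_dim)]
        new_p = [[0] * n_cols for _ in range(k_dim)]
        for k in range(k_dim):
            for n in range(n_cols):
                if n == 0:
                    m = t - k  # row k is skewed by k cycles at the left edge
                    a_in = a[m][k] if 0 <= m < m_rows else 0
                else:
                    a_in = a_reg[k][n - 1]  # value passed on by the left neighbour
                p_in = p_reg[k - 1][n] if k > 0 else 0  # partial sum from the unit above
                new_a[k][n] = a_in
                new_p[k][n] = p_in + a_in * w[k][n]  # multiply-accumulate
        a_reg, p_reg = new_a, new_p
        for n in range(n_cols):
            m = t - n - (k_dim - 1)  # output row leaving the bottom of column n
            if 0 <= m < m_rows:
                out[m][n] = p_reg[k_dim - 1][n]
    return out


a = [[1, 2, 3], [4, 5, 6]]    # input feature matrix, 2 x 3
w = [[1, 0], [0, 1], [1, 1]]  # kernel matrix, 3 x 2, preloaded into the array
product = systolic_matmul(a, w)
print(product)  # [[4, 5], [10, 11]]
```

Under these assumptions the full result is available after M + K + N - 2 clock cycles, and each kernel matrix element is loaded once and reused for every row of the input feature matrix.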

After the first processing unit 11 obtains the matrix multiplication result by using the systolic array accelerator, the matrix multiplication result may be sent to the second processing unit 12, so that the second processing unit 12 may perform dimensional transformation on the matrix multiplication result, and transform the channel dimension of the matrix multiplication result into the sampling dimension, to obtain the processing result.

In a possible implementation manner, in a case where the first processing unit 11 further integrates a dimension processing module for performing dimension transformation, the first processing unit 11 performs a matrix multiplication operation on the input feature matrix and the kernel matrix through the systolic array accelerator to obtain a matrix multiplication result, then performs a dimension transformation on the matrix multiplication result, transforming its channel dimension to the sampling dimension, to obtain a processing result, and sends the processing result to the second processing unit 12, so that the second processing unit 12 uses the processing result sent by the first processing unit 11 as the result of the sampling operation performed on the data to be processed by the sampling operation implementation system.

It should be understood that in the above process, if the first processing unit 11 with the systolic array accelerator further integrates a dimension processing module for performing dimension transformation, the dimension transformation between the sampling dimension and the channel dimension may be performed either by the second processing unit 12 or by the first processing unit 11. For example, when the systolic array accelerator of the first processing unit 11 obtains the matrix multiplication result, the dimension processing module of the first processing unit 11 may transform the channel dimension of the matrix multiplication result to the sampling dimension to obtain the processing result, and directly transmit the processing result corresponding to the sampling operation back to the second processing unit 12.

In the sampling operation implementation system of the application, the second processing unit 12 cooperates with the first processing unit 11 having the systolic array accelerator. For a given sampling dimension on which the sampling operation is to be performed, a correspondence between each element position of the output feature matrix and each element position in the data to be processed can be established, and the current sampling dimension of the data to be processed is transformed into the channel dimension to obtain the data to be convolved. A convolution kernel can also be predefined, with its values set to the weights of the data to be processed at the corresponding positions, and the constructed convolution kernel is used to perform a convolution operation on the data to be convolved. In this way, a sampling operation that is not originally suitable for execution by the systolic array accelerator can be converted into a convolution operation that is suitable for execution by the systolic array accelerator, so that the sampling operation is accelerated by the systolic array accelerator and the data processing efficiency of executing the sampling operation (resize) is greatly improved.

In one possible implementation, the second processing unit 12 may be further configured to: in a case where the sampling operation is sampling in N dimensions, convert the sampling operation on the data to be processed into N convolution operations, and determine the data to be convolved and the convolution kernel corresponding to each convolution operation, where N is an integer greater than 1.

Illustratively, a certain data D (n, c, h, w) to be processed is four-dimensional data, where n represents a dimension in which the number is located, c represents a dimension in which the number of channels is located, h represents a dimension in which the height is located, and w represents a dimension in which the width is located.

If the data D (n, c, h, w) to be processed is sampled only in the height dimension h, the sampling operation is single-dimension sampling (N = 1); the second processing unit 12 may convert the sampling operation on the data D (n, c, h, w) to be processed into a single convolution operation, and determine the data to be convolved and the convolution kernel corresponding to the sampling operation in the height dimension h.

If the data D (n, c, h, w) to be processed is sampled in the height dimension h and the width dimension w, the sampling operation is 2-dimensional sampling (N = 2); the second processing unit 12 may convert the sampling operation on the data D (n, c, h, w) to be processed into 2 convolution operations, and determine the data to be convolved and the convolution kernel corresponding to the sampling operation in the height dimension h, and the data to be convolved and the convolution kernel corresponding to the sampling operation in the width dimension w.

By analogy, if the data D (n, c, h, w) to be processed is sampled in the number dimension n, the channel dimension c, the height dimension h and the width dimension w, the sampling operation is 4-dimensional sampling (N = 4); the second processing unit 12 may convert the sampling operation on the data D (n, c, h, w) to be processed into 4 convolution operations, and determine the data to be convolved and the convolution kernel corresponding to the sampling operation in each of the number dimension n, the channel dimension c, the height dimension h and the width dimension w.

In this way, the second processing unit 12 may convert the sampling operation of any multiple dimensions of the data to be processed into a convolution operation of a plurality of times, so as to expand the application range of the sampling operation implementation system and improve the applicability of the sampling operation implementation system.

In one possible implementation, the N dimensions include a first sampling dimension and a second sampling dimension, and the sampling operation implementation system is further operable to: performing a first sampling operation of a first sampling dimension on the data to be processed through the first processing unit 11 and the second processing unit 12 to obtain a first processing result, wherein the first processing result is used as a result corresponding to the sampling operation of the data to be processed in the first sampling dimension; and performing second sampling operation of a second sampling dimension on the first processing result through the first processing unit 11 and the second processing unit 12 to obtain a second processing result, wherein the second processing result is used as a result corresponding to the sampling operation of the data to be processed in the first sampling dimension and the second sampling dimension.

Performing, by the first processing unit 11 and the second processing unit 12, the first sampling operation of the first sampling dimension on the data to be processed to obtain the first processing result may include: converting, by the second processing unit 12, the first sampling operation on the data to be processed into a first convolution operation, and determining first data to be convolved and a first convolution kernel corresponding to the first convolution operation; processing, by the second processing unit 12, the first data to be convolved and the first convolution kernel respectively, converting the first data to be convolved into a first input feature matrix and converting the first convolution kernel into a first kernel matrix; sending, by the second processing unit 12, the first input feature matrix and the first kernel matrix to the first processing unit 11, so that the first processing unit 11 performs a matrix multiplication operation on the first input feature matrix and the first kernel matrix through the systolic array accelerator and sends the matrix multiplication result to the second processing unit 12; and converting, by the second processing unit 12 in response to receiving the matrix multiplication result of the first input feature matrix and the first kernel matrix sent by the first processing unit 11, the matrix multiplication result into the first processing result corresponding to the first sampling operation.

Performing, by the first processing unit 11 and the second processing unit 12, the second sampling operation of the second sampling dimension on the first processing result to obtain the second processing result may include: converting, by the second processing unit 12, the second sampling operation on the first processing result into a second convolution operation, and determining second data to be convolved and a second convolution kernel corresponding to the second convolution operation; processing, by the second processing unit 12, the second data to be convolved and the second convolution kernel respectively, converting the second data to be convolved into a second input feature matrix and converting the second convolution kernel into a second kernel matrix; sending, by the second processing unit 12, the second input feature matrix and the second kernel matrix to the first processing unit 11, so that the first processing unit 11 performs a matrix multiplication operation on the second input feature matrix and the second kernel matrix through the systolic array accelerator and sends the matrix multiplication result to the second processing unit 12; and converting, by the second processing unit 12 in response to receiving the matrix multiplication result of the second input feature matrix and the second kernel matrix sent by the first processing unit 11, the matrix multiplication result into the second processing result, which is used as the result corresponding to the sampling operation on the data to be processed in the first sampling dimension and the second sampling dimension.

For example, assuming that sampling operations are performed on two dimensions of the data D (n, c, h, w) to be processed, a first sampling dimension may be a height dimension h, a second sampling dimension may be a width dimension w, the sampling operations of the data D (n, c, h, w) to be processed may be converted into 2 convolution operations, the first processing unit 11 and the second processing unit 12 perform first sampling operations on the height dimension h to obtain a first processing result, and then the first processing unit 11 and the second processing unit 12 perform second sampling operations on the width dimension w to obtain a final second processing result, where the final second processing result is used as a result corresponding to the sampling operations of the data D (n, c, h, w) to be processed in the height dimension h and the width dimension w.

For the first sampling operation with the sampling dimension being the height dimension H, the second processing unit 12 may first establish a corresponding relationship between each element position in the output feature matrix H1 (n, c, oh, w) of the first sampling operation and each element position in the data D (n, c, H, w) to be processed according to the size of the data D (n, c, H, w) to be processed, the size of the output feature matrix H1 (n, c, oh, w) of the first sampling operation, and the sampling manner.

The value of oh represents the dimension of the output feature matrix in the height dimension, and since the sampling operation is currently performed on the height dimension H of the data D (n, c, H, w) to be processed, the dimension oh of the feature matrix H1 (n, c, oh, w) in the height dimension is the dimension required by the target task, and the dimensions of the remaining dimensions (n, c and w) are consistent with the data D (n, c, H, w) to be processed. In a specific implementation, the size oh of the output feature matrix of the first sampling operation in the height dimension may be directly obtained, or the size oh may be determined according to the size h of the data to be processed in the height dimension and the sampling proportion sh of the height dimension (for example, sh=oh/h).

The second processing unit 12 constructs a first convolution kernel filter1 according to the correspondence between the position of each element in the output feature matrix H1 (n, c, oh, w) and the position of each element in the data to be processed D (n, c, h, w), and assigns values to the first convolution kernel; it also performs a dimension transformation on the data D (n, c, h, w) to be processed, transforming the height dimension h into the channel dimension c, to obtain first data to be convolved C1 (n, h, c, w).

In this way, the first data to be convolved C1 (n, h, c, w) and the first convolution kernel filter1 corresponding to the first convolution operation can be obtained.

The second processing unit 12 may send the first data to be convolved C1 (n, h, c, w) and the first convolution kernel filter1 to the first processing unit 11, so that the first processing unit 11 performs the first convolution operation S1 (n, oh, c, w) = C1 (n, h, c, w) * filter1 by using the systolic array accelerator, where * denotes the convolution operation, and returns S1 (n, oh, c, w) to the second processing unit 12. The second processing unit 12 then performs a dimension transformation on S1 (n, oh, c, w), transforming its channel dimension to the height dimension oh, to obtain a first processing result S1 (n, c, oh, w).
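For four-dimensional data, the first sampling operation on the height dimension can be sketched as follows. This is an illustrative sketch only: nearest neighbor sampling with floor rounding is assumed, the sizes (n, c, h, w) = (1, 2, 2, 2) and oh = 4 and the element values are chosen for the example, and the helper names are hypothetical.

```python
def swap_axes_1_2(x):
    """For a 4-D nested list, return y with y[n][b][a] = x[n][a][b] (inner lists kept)."""
    d1, d2 = len(x[0]), len(x[0][0])
    return [[[x_n[a][b] for a in range(d1)] for b in range(d2)] for x_n in x]


def channel_conv(x, kernel):
    """Pointwise convolution: out[n][o][c][w] = sum over i of kernel[o][i] * x[n][i][c][w]."""
    return [[[[sum(k_row[i] * x_n[i][c][w] for i in range(len(x_n)))
               for w in range(len(x_n[0][0]))]
              for c in range(len(x_n[0]))]
             for k_row in kernel]
            for x_n in x]


def nearest_kernel(in_size, out_size):
    """Row i selects input position floor((in_size - 1) / (out_size - 1) * i)."""
    rows = []
    for i in range(out_size):
        src = (i * (in_size - 1)) // (out_size - 1) if out_size > 1 else 0
        rows.append([1 if j == src else 0 for j in range(in_size)])
    return rows


D = [[[[1, 2], [3, 4]],      # channel 0, rows h = 0 and h = 1
      [[5, 6], [7, 8]]]]     # channel 1
filter1 = nearest_kernel(in_size=2, out_size=4)  # (oh, h) = (4, 2)

C1 = swap_axes_1_2(D)                # C1 (n, h, c, w)
S1_conv = channel_conv(C1, filter1)  # S1 (n, oh, c, w)
S1 = swap_axes_1_2(S1_conv)          # first processing result S1 (n, c, oh, w)
print(S1[0][0])  # [[1, 2], [1, 2], [1, 2], [3, 4]]
print(S1[0][1])  # [[5, 6], [5, 6], [5, 6], [7, 8]]
```

Only the height dimension changes size (from 2 to 4); the n, c and w dimensions are carried through unchanged.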

The second processing unit 12 may convert the first data to be convolved C1 (n, h, c, w) into a first input feature matrix and convert the first convolution kernel filter1 into a first kernel matrix, so that the systolic array accelerator of the first processing unit 11 performs a matrix multiplication operation on the first input feature matrix and the first kernel matrix to obtain the first processing result S1 (n, c, oh, w). For the specific principle, reference may be made to the description above, which is not repeated herein.

After the sampling operation implementation system obtains the first processing result S1 (n, c, oh, w), the first processing unit 11 and the second processing unit 12 may perform a second sampling operation on the first processing result S1 (n, c, oh, w) with a sampling dimension being a width dimension w.

The second processing unit 12 may first establish a corresponding relationship between each element position in the output feature matrix H2 (n, c, oh, ow) of the second sampling operation and each element position in the first processing result S1 (n, c, oh, w) according to the size of the first processing result S1 (n, c, oh, w), the size of the output feature matrix H2 (n, c, oh, ow) of the second sampling operation, and the sampling manner.

The value of ow represents the size of the output feature matrix in the width dimension, and since the sampling operation is currently performed on the width dimension w of the first processing result S1 (n, c, oh, w), the size of the feature matrix H2 (n, c, oh, ow) in the width dimension ow is the size required by the target task, and the sizes of the remaining dimensions (n, c and oh) are consistent with the first processing result S1 (n, c, oh, w). In a specific implementation, the dimension ow of the output feature matrix of the second sampling operation in the width dimension may be directly obtained, or the dimension ow may be determined according to the dimension w of the data to be processed in the width dimension and the sampling proportion sw in the width dimension (for example, sw=ow/w).

The second processing unit 12 constructs a second convolution kernel filter2 according to the correspondence between the position of each element in the output feature matrix H2 (n, c, oh, ow) and the position of each element in the first processing result S1 (n, c, oh, w), and assigns values to the second convolution kernel; it also performs a dimension transformation on the first processing result S1 (n, c, oh, w), transforming the width dimension w into the channel dimension c, to obtain second data to be convolved C2 (n, w, oh, c).

In this way, the second data to be convolved C2 (n, w, oh, c) and the second convolution kernel filter2 corresponding to the second convolution operation can be obtained.

The second processing unit 12 may send the second data to be convolved C2 (n, w, oh, c) and the corresponding second convolution kernel filter2 to the first processing unit 11, so that the first processing unit 11 performs the second convolution operation S2 (n, ow, oh, c) = C2 (n, w, oh, c) * filter2 by using the systolic array accelerator, where * denotes the convolution operation, and returns S2 (n, ow, oh, c) to the second processing unit 12. The second processing unit 12 then performs a dimension transformation on S2 (n, ow, oh, c), transforming its channel dimension to the width dimension ow, to obtain a second processing result S2 (n, c, oh, ow), which is the result of performing the sampling operation on the data to be processed D (n, c, h, w) in the height dimension h and the width dimension w.

The second processing unit 12 may convert the second data to be convolved C2 (n, w, oh, c) into a second input feature matrix, and convert the second convolution kernel filter2 into a second kernel matrix, so that the systolic array accelerator of the first processing unit 11 performs a matrix multiplication operation on the second input feature matrix and the second kernel matrix to obtain the second processing result S2 (n, c, oh, ow). The specific principle is the same as described above and is not repeated here.

It should be appreciated that the dimension transformation between the sampling dimension and the channel dimension may be performed by the second processing unit 12, or by the first processing unit 11 (if the first processing unit 11 has the processing capability to perform the dimension transformation operation). After the first processing unit 11 performs the first convolution operation S1 (n, oh, c, w) = C1 (n, h, c, w) * filter1 by using the systolic array accelerator, it may also perform the dimension transformation directly on S1 (n, oh, c, w), transforming its channel dimension back to the height dimension oh, to obtain the first processing result S1 (n, c, oh, w), and then return the first processing result S1 (n, c, oh, w) to the second processing unit 12. Similarly, after the first processing unit 11 performs the second convolution operation S2 (n, ow, oh, c) = C2 (n, w, oh, c) * filter2 by using the systolic array accelerator, it may perform the dimension transformation directly on S2 (n, ow, oh, c), transforming its channel dimension back to the width dimension ow, to obtain the second processing result S2 (n, c, oh, ow), and then return the second processing result S2 (n, c, oh, ow) to the second processing unit 12.

It may be understood that, in some application scenarios, the first processing unit 11 and the second processing unit 12 may instead first perform the sampling operation on the width dimension w to obtain a first processing result, and then perform the sampling operation on the height dimension h of the first processing result to obtain the final processing result, which is not described again here.

It should be understood that the embodiments of the present application take the sampling operation on the height dimension h and the width dimension w only as an example. In practical applications, the sampling operation may be performed on any or all of the 4 dimensions of the data to be processed as required, and neither the total number of dimensions of the data to be processed nor the number N of dimensions to be sampled is limited.

In this way, a sampling operation on any number of dimensions of the data to be processed can be converted into multiple convolution operations, which expands the application range of the sampling operation implementation system and improves its applicability.
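As an illustration of converting an N-dimensional sampling operation into N successive passes, the following minimal sketch resamples one axis at a time. The helper names `nearest_kernel`, `resample_axis` and `resample` are hypothetical, nearest-neighbor sampling with floor rounding is assumed, and `np.moveaxis` plus a batched matrix product stands in for the dimension transformation and the accelerator.

```python
import numpy as np

def nearest_kernel(in_size, out_size, dtype=np.float64):
    """One-hot kernel matrix (in_size, out_size) for floor-rounded nearest sampling."""
    src = (np.arange(out_size) * in_size) // out_size
    kernel = np.zeros((in_size, out_size), dtype=dtype)
    kernel[src, np.arange(out_size)] = 1
    return kernel

def resample_axis(x, axis, out_size):
    """One pass: move the sampled axis into the contraction position,
    multiply by the kernel matrix, then move the axis back."""
    kernel = nearest_kernel(x.shape[axis], out_size, x.dtype)
    moved = np.moveaxis(x, axis, -1)      # dimension transformation
    out = moved @ kernel                  # (..., in) @ (in, out)
    return np.moveaxis(out, -1, axis)     # inverse dimension transformation

def resample(x, out_sizes):
    """Apply one pass per entry of out_sizes, a dict {axis: new_size}."""
    for axis, size in out_sizes.items():
        x = resample_axis(x, axis, size)
    return x

d = np.random.default_rng(0).random((2, 3, 4, 5))   # D (n, c, h, w)
h_then_w = resample(d, {2: 8, 3: 10})
w_then_h = resample(d, {3: 10, 2: 8})
print(h_then_w.shape, np.allclose(h_then_w, w_then_h))
```

For this nearest-neighbor case the two pass orders give the same result, consistent with the remark above that the width dimension may also be sampled before the height dimension.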

In practical applications, it has been found that when the size of the sampling dimension of the data to be processed is large (that is, the sampling dimension contains a large number of elements), the convolution kernel obtained by the conversion may contain a large number of 0 values, so that data processing utilization during the convolution operation is low. Therefore, to further improve the processing efficiency of the sampling operation implementation system for data with a large sampling dimension, the convolution kernel may be sliced after it has been prepared, and the slices that are entirely 0 need not be computed at all.

In a possible implementation, the second processing unit 12 is further configured to: in the case that the convolution kernel includes all-zero data blocks, slice the convolution kernel to obtain all-zero data blocks and non-zero data blocks; determine the data sub-block to be convolved corresponding to each non-zero data block; perform cyclic replication processing on the data sub-blocks to be convolved and the non-zero data blocks respectively, converting each data sub-block to be convolved into a two-dimensional input feature sub-matrix and each non-zero data block into a two-dimensional kernel sub-matrix; send each kernel sub-matrix and the corresponding input feature sub-matrix to the first processing unit 11, and, in response to receiving the matrix multiplication result of each kernel sub-matrix and the corresponding input feature sub-matrix sent by the first processing unit 11, splice the plurality of matrix multiplication results to obtain a spliced result; and perform dimension transformation on the spliced result to obtain a processing result, which is used as the result of the sampling operation on the data to be processed. The first processing unit 11 is further configured to: in response to receiving each kernel sub-matrix and the corresponding input feature sub-matrix, perform a matrix multiplication operation on each kernel sub-matrix and the corresponding input feature sub-matrix through the systolic array accelerator to obtain a plurality of matrix multiplication results; and send the plurality of matrix multiplication results to the second processing unit 12.

The sampling operation implementation system can thus perform a matrix multiplication operation on each kernel sub-matrix and the corresponding input feature sub-matrix through the systolic array accelerator to obtain a plurality of matrix multiplication results. After these results are spliced and dimension-transformed, a processing result is obtained, which is used as the result of the sampling operation on the data to be processed.

Fig. 8 shows a schematic diagram of convolution kernel segmentation provided in an embodiment of the present application. As shown in fig. 8, assume that the sampling mode is nearest-neighbor sampling (nearest), that non-integer transformed coordinates are rounded in floor mode, that the size of the sampling dimension of the data to be processed is iw=512, and that the size after the sampling transformation is ow=1024. The constructed convolution kernel can then be sliced along the iw direction with a segment length of 256, because in the nearest-neighbor sampling mode only one effective value exists in each convolution kernel block. The weight value 1 is located in the data block at iw=256 (in this example, the corresponding ow=513), and the first segment is cut at iw=255, ow=512. The next segment of length 256 is then cut along the iw direction; it would end at iw=512, which exceeds the maximum index iw=511, so the remaining iw and ow ranges are taken directly as the last segment.
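The segment boundaries in this example can be reproduced with a short sketch. The function name `segment_kernel` is hypothetical, floor-rounded nearest-neighbor sampling is assumed, and the ranges are reported as half-open intervals [start, end), which differs from the inclusive indices used in the description above.

```python
import numpy as np

def segment_kernel(in_size, out_size, seg_len):
    """For each segment of the input (iw) axis, find the range of output (ow)
    positions whose single non-zero weight falls inside that segment."""
    src = (np.arange(out_size) * in_size) // out_size   # source index per output
    blocks = []
    for start in range(0, in_size, seg_len):
        end = min(start + seg_len, in_size)             # last segment may be shorter
        cols = np.nonzero((src >= start) & (src < end))[0]
        if cols.size:                                   # skip segments with no weights
            blocks.append(((start, end), (int(cols[0]), int(cols[-1]) + 1)))
    return blocks

print(segment_kernel(512, 1024, 256))
# [((0, 256), (0, 512)), ((256, 512), (512, 1024))]
```

With iw=512, ow=1024 and a segment length of 256, this yields two non-zero blocks of size 256 x 512; the two remaining off-diagonal blocks of the kernel are all zero.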

As shown in fig. 8, slicing the convolution kernel yields two all-zero data blocks (the white areas in fig. 8) and two non-zero data blocks. In this case, the convolution operation may be performed only on the two non-zero data blocks, and the all-zero data blocks in the convolution kernel need not be computed. The data sub-block to be convolved corresponding to each non-zero data block is determined, each of the two non-zero data blocks is convolved with its corresponding data sub-block to be convolved, and the obtained results are spliced to obtain the final convolution result.

Further, the second processing unit 12 may perform cyclic replication processing on each sliced non-zero data block and the corresponding data sub-block to be convolved, converting the data sub-block to be convolved into an input feature sub-matrix and the non-zero data block into a kernel sub-matrix, so that the systolic array accelerator of the first processing unit 11 performs a matrix multiplication operation on the kernel sub-matrix and the corresponding input feature sub-matrix to obtain a matrix multiplication result. The second processing unit 12 may splice the plurality of matrix multiplication results and transform the channel dimension of the spliced result to the sampling dimension, so as to obtain the transformed processing result, that is, the sampling operation result of the data to be processed corresponding to the convolution kernel before slicing.

In this way, in the case where the convolution kernel includes an all-zero data block, the all-zero data block does not participate in the convolution operation, and the processing result corresponding to the sampling operation is determined according to the convolution operation of the data to be convolved and the non-zero data block in the convolution kernel. By the method, the all-zero data block can not participate in convolution operation, so that the calculated amount of sampling operation executed by the sampling operation realization system is further reduced, the consumption of hardware resources is reduced, and the data processing efficiency of the sampling operation realization system is improved.
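A minimal sketch of skipping all-zero blocks is given below. The function name `tiled_matmul_skip_zero` and the fixed tile sizes are assumptions made for illustration; the accumulation over row tiles is added for generality and goes beyond the pure splicing described above (in the nearest-neighbor example each column tile has exactly one non-zero block, so the accumulation adds a single term).

```python
import numpy as np

def tiled_matmul_skip_zero(rows, kernel, tile_in, tile_out):
    """Multiply the input feature matrix `rows` (m, in) by `kernel` (in, out)
    tile by tile, skipping kernel tiles that contain only zeros."""
    m = rows.shape[0]
    in_size, out_size = kernel.shape
    pieces, skipped = [], 0
    for c0 in range(0, out_size, tile_out):
        c1 = min(c0 + tile_out, out_size)
        acc = np.zeros((m, c1 - c0), dtype=rows.dtype)
        for r0 in range(0, in_size, tile_in):
            r1 = min(r0 + tile_in, in_size)
            block = kernel[r0:r1, c0:c1]
            if not block.any():            # all-zero data block: no computation
                skipped += 1
                continue
            acc += rows[:, r0:r1] @ block  # sub-matrix product (accelerator stand-in)
        pieces.append(acc)
    return np.concatenate(pieces, axis=1), skipped   # splice the partial results

# Nearest-neighbor kernel for iw=512 -> ow=1024 (floor rounding assumed).
src = (np.arange(1024) * 512) // 1024
kernel = np.zeros((512, 1024))
kernel[src, np.arange(1024)] = 1.0
rows = np.random.default_rng(0).random((6, 512))
out, skipped = tiled_matmul_skip_zero(rows, kernel, 256, 512)
print(skipped, np.allclose(out, rows @ kernel))
```

In this configuration half of the four kernel tiles are skipped, while the spliced result still equals the full matrix product.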

In summary, in the systolic-array-accelerator-based sampling operation implementation system provided in the embodiments of the present application, the second processing unit 12 and the first processing unit 11 with the systolic array accelerator cooperate to convert the sampling operation on the data to be processed into a convolution operation, determine the data to be convolved and the convolution kernel corresponding to the convolution operation, and convert the data to be convolved and the convolution kernel into an input feature matrix and a kernel matrix, respectively. A matrix multiplication operation is then performed on the input feature matrix and the kernel matrix through the systolic array accelerator to obtain the processing result corresponding to the sampling operation (resize). A sampling operation that is not suited to the systolic array accelerator is thereby converted into a convolution operation that is suited to it, so that the cooperation of the first processing unit 11 and the second processing unit 12 accelerates the sampling operation and greatly improves the data processing efficiency of the sampling operation implementation system when executing it.

It will be appreciated that the above-mentioned embodiments of the present application may be combined with each other to form combined embodiments without departing from their principles and logic; for brevity, such combinations are not described again in the present application. It will be appreciated by those skilled in the art that, in the methods of the above embodiments, the particular order of execution of the steps should be determined by their functions and possible inherent logic.

Fig. 9 is a schematic diagram of a method for implementing a sampling operation according to an embodiment of the present application, as shown in fig. 9, where the method is applied to a sampling operation implementing system, and the system includes a first processing unit 11 with a systolic array accelerator, and a second processing unit 12 connected to the first processing unit 11, and the method includes:

in step S11, data to be processed is acquired by the second processing unit 12;

in step S12, the second processing unit 12 converts the sampling operation of the data to be processed into a convolution operation, and determines the data to be convolved and a convolution kernel corresponding to the convolution operation;

in step S13, the data to be convolved and the convolution kernel are processed by the second processing unit 12, the data to be convolved is converted into an input feature matrix, the convolution kernel is converted into a kernel matrix, and the input feature matrix and the kernel matrix are sent to the first processing unit 11;

in step S14, in response to receiving the input feature matrix and the kernel matrix sent by the second processing unit 12, the first processing unit 11 performs a matrix multiplication operation on the input feature matrix and the kernel matrix by using the systolic array accelerator, to obtain a matrix multiplication operation result;

in step S15, the first processing unit 11 sends the matrix multiplication result to the second processing unit 12, and the second processing unit 12 converts the matrix multiplication result into a processing result corresponding to the sampling operation in response to receiving the matrix multiplication result of the input feature matrix and the kernel matrix sent by the first processing unit 11; or after the first processing unit obtains the matrix multiplication operation result, converting the matrix multiplication operation result into a processing result corresponding to the sampling operation by the first processing unit.

Optionally, step S12 may include: the second processing unit 12 establishes a corresponding relation between each element position in the output feature matrix and each element position in the data to be processed according to the size of the data to be processed, the size of the output feature matrix corresponding to the sampling operation and the sampling mode aiming at the sampling operation of any sampling dimension; the second processing unit 12 constructs a convolution kernel for the data to be processed and the output feature matrix corresponding to the sampling operation according to the corresponding relation, wherein the convolution kernel is assigned based on the weight of the data to be processed under the corresponding relation; the second processing unit 12 performs dimension transformation on the data to be processed, and transforms the sampling dimension into a channel dimension to obtain the data to be convolved.
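The construction of a convolution kernel from the position correspondence and the sampling mode can be sketched as follows. This is a sketch under stated assumptions: the function name `build_kernel` is hypothetical, the coordinate mapping pos = o * in_size / out_size (without a half-pixel offset) is an assumed convention, and the "linear" mode is included only to illustrate weights other than 0 and 1.

```python
import numpy as np

def build_kernel(in_size, out_size, mode="nearest"):
    """Return a kernel matrix (in_size, out_size). Column o holds the weights
    of the input elements that contribute to output element o."""
    kernel = np.zeros((in_size, out_size), dtype=np.float64)
    cols = np.arange(out_size)
    pos = cols * in_size / out_size                 # assumed coordinate mapping
    i0 = np.minimum(np.floor(pos).astype(int), in_size - 1)
    if mode == "nearest":
        kernel[i0, cols] = 1.0                      # one effective value per column
    elif mode == "linear":
        i1 = np.minimum(i0 + 1, in_size - 1)        # clamp at the border
        frac = pos - i0
        kernel[i0, cols] += 1.0 - frac              # weight of the left neighbor
        kernel[i1, cols] += frac                    # weight of the right neighbor
    else:
        raise ValueError(f"unsupported mode: {mode}")
    return kernel

k = build_kernel(4, 8, "linear")
print(k[:, 1])    # [0.5 0.5 0.  0. ]
```

In both modes every column sums to 1, so multiplying the dimension-transformed data by this matrix produces a weighted combination of the input elements for each output position.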

Optionally, in the case that the sampling operation is sampling in N dimensions, the sampling operation implementation method further includes: the second processing unit 12 converts the sampling operation of the data to be processed into N convolution operations, and determines the data to be convolved and the convolution kernel corresponding to each convolution operation, where N is an integer greater than 1.

Optionally, step S13 may include: the second processing unit 12 performs cyclic replication processing on the convolution kernel, and sequentially converts data included in the convolution kernel into column vectors or row vectors, and the column vectors or the row vectors are arranged into a two-dimensional kernel matrix; and performing cyclic replication processing on the data to be convolved, and sequentially converting the data to be convolved corresponding to the convolution kernel receptive field into corresponding column vectors or row vectors, and arranging the corresponding column vectors or row vectors into a two-dimensional input feature matrix.
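The conversion of the convolution kernel and the data to be convolved into two-dimensional matrices resembles what is commonly called im2col. The sketch below assumes that the cyclic replication processing corresponds to this technique; the function names, the stride of 1 and the absence of padding are illustrative assumptions rather than details given in the application.

```python
import numpy as np

def im2col(x, kh, kw):
    """x: (c, h, w) -> (oh*ow, c*kh*kw). Each row is the flattened receptive
    field of one output position (stride 1, no padding)."""
    c, h, w = x.shape
    oh, ow = h - kh + 1, w - kw + 1
    rows = np.empty((oh * ow, c * kh * kw), dtype=x.dtype)
    for i in range(oh):
        for j in range(ow):
            rows[i * ow + j] = x[:, i:i + kh, j:j + kw].ravel()
    return rows

def kernel_to_matrix(k):
    """k: (oc, c, kh, kw) -> (c*kh*kw, oc). Each column is one flattened kernel."""
    return k.reshape(k.shape[0], -1).T

def conv_as_matmul(x, k):
    """Convolution expressed as one matrix multiplication."""
    oc, _, kh, kw = k.shape
    oh, ow = x.shape[1] - kh + 1, x.shape[2] - kw + 1
    out = im2col(x, kh, kw) @ kernel_to_matrix(k)    # (oh*ow, oc)
    return out.T.reshape(oc, oh, ow)

rng = np.random.default_rng(0)
x = rng.random((3, 5, 6))
k = rng.random((4, 3, 2, 2))
print(conv_as_matmul(x, k).shape)    # (4, 4, 5)
```

For the 1x1 kernels produced by the sampling-to-convolution conversion, each receptive field is a single position, so the input feature matrix reduces to a reshape of the data to be convolved.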

Optionally, the method for implementing the sampling operation further includes: in the case that the convolution kernel includes all-zero data blocks, the second processing unit 12 slices the convolution kernel to obtain all-zero data blocks and non-zero data blocks; the second processing unit 12 determines the data sub-block to be convolved corresponding to each non-zero data block; the second processing unit 12 performs cyclic replication processing on the data sub-blocks to be convolved and the non-zero data blocks respectively, converting each data sub-block to be convolved into a two-dimensional input feature sub-matrix and each non-zero data block into a two-dimensional kernel sub-matrix; the second processing unit 12 sends each kernel sub-matrix and the corresponding input feature sub-matrix to the first processing unit 11; the first processing unit 11, in response to receiving each kernel sub-matrix and the corresponding input feature sub-matrix, performs a matrix multiplication operation on each kernel sub-matrix and the corresponding input feature sub-matrix through the systolic array accelerator to obtain a plurality of matrix multiplication results; the first processing unit 11 sends the plurality of matrix multiplication results to the second processing unit 12; the second processing unit 12, in response to receiving the matrix multiplication result of each kernel sub-matrix and the corresponding input feature sub-matrix sent by the first processing unit 11, splices the plurality of matrix multiplication results to obtain a spliced result; and the second processing unit 12 performs dimension transformation on the spliced result to obtain a processing result, which is used as the result of the sampling operation on the data to be processed.

Optionally, step S14 may include: the first processing unit 11 performs a matrix multiplication operation on the input feature matrix and the kernel matrix through the systolic array accelerator to obtain a matrix multiplication operation result; step S15 may include: the second processing unit 12, in response to receiving the matrix multiplication result of the input feature matrix and the kernel matrix sent by the first processing unit 11, performs dimension transformation on the matrix multiplication result, transforming its channel dimension to the sampling dimension, to obtain a processing result, which is used as the result of performing the sampling operation on the data to be processed.

In some embodiments, in step S14 after step S13, in response to receiving the input feature matrix and the kernel matrix sent by the second processing unit 12, the first processing unit 11 performs a matrix multiplication operation on the input feature matrix and the kernel matrix by using the systolic array accelerator, to obtain a matrix multiplication operation result, and converts the matrix multiplication operation result into a processing result corresponding to the sampling operation.

Optionally, in the case that the first processing unit 11 further integrates a dimension processing module for performing dimension transformation, the method further comprises: the first processing unit 11 performs a matrix multiplication operation on the input feature matrix and the kernel matrix through the systolic array accelerator to obtain a matrix multiplication operation result, and performs dimension transformation on the matrix multiplication operation result through the dimension processing module, transforming the channel dimension of the matrix multiplication operation result to the sampling dimension, to obtain a processing result; the first processing unit 11 sends the processing result to the second processing unit 12; the second processing unit 12 takes the processing result sent by the first processing unit 11 as the result of the sampling operation performed on the data to be processed by the sampling operation implementation system.

Optionally, in the case that the N dimensions include a first sampling dimension and a second sampling dimension, the sampling operation implementation method further includes: performing a first sampling operation of a first sampling dimension on the data to be processed through the first processing unit 11 and the second processing unit 12 to obtain a first processing result, wherein the first processing result is used as a result corresponding to the sampling operation of the data to be processed in the first sampling dimension; and performing second sampling operation of a second sampling dimension on the first processing result through the first processing unit 11 and the second processing unit 12 to obtain a second processing result, wherein the second processing result is used as a result corresponding to the sampling operation of the data to be processed in the first sampling dimension and the second sampling dimension.

For further details of the method, reference is made to the foregoing description of the system, which is not repeated here for brevity. The functional modules in the embodiments of the present application may be integrated together to form a single part, or each module may exist alone, or two or more modules may be integrated to form a single part, where the functions, if implemented in the form of software functional modules and sold or used as a single product, may be stored in a computer-readable storage medium.

Accordingly, embodiments of the present application also provide a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-provided method. The embodiments of the present application also provide a computer program product, which includes software functional modules for implementing the above-mentioned methods, and when the computer readable program instructions corresponding to the software functional modules are executed in a processor of an electronic device, the above-mentioned methods are executed.

The embodiments of the present application also provide an electronic device comprising a memory and a processor, wherein the memory stores a computer program which, when executed by the processor, performs the above method. The electronic device may be provided as a computer, a mobile phone, a server, an in-vehicle device, or another type of device.

The processor (which may include the first processing unit and the second processing unit) in the electronic device for executing the method may be a new design, or may be a modified processing unit of an existing processor, where the types of the existing processing unit may include, but are not limited to: a central processing unit (Central Processing Unit, CPU), a graphics processing unit (Graphics Processing Unit, GPU), a general-purpose graphics processing unit (General-Purpose Computing on Graphics Processing Units, GPGPU), a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a tensor processing unit (Tensor Processing Unit, TPU), a field programmable gate array (Field Programmable Gate Array, FPGA), or other programmable logic device, and may also include a processing unit of a micro-processing unit or other conventional processor.

The memory of the electronic device may be used to store program instructions (e.g., program instructions of an application, a driver, or even an operating system) executable by the processor, and the processor in the electronic device for performing the above-described methods is configured to implement the above-described methods by executing the computer program stored in the memory.

By way of example, the processor of the electronic device may include two processing units: the first processing unit 11 with the systolic array accelerator (which may be used to perform the matrix multiplication operation in the above method) and the second processing unit 12 without the systolic array accelerator (which may be used to perform the steps of the above method other than the matrix multiplication operation). The method may be implemented through the cooperation of the two processing units, which may be integrated in the same processor or distributed over different processors of the electronic device. Of course, in some cases, a systolic array accelerator may also be provided separately for the second processing unit; this is still regarded as falling within the processing idea and principle of the aforementioned sampling operation implementation system and method.

Optionally, the electronic device may also include further components, such as a power supply component, a wired or wireless network interface, an input-output interface. The exemplary components of the electronic device should not be construed as limiting the present application.

It should be noted that the embodiments in this description are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and for identical or similar parts (implementation principles and technical effects) the embodiments may be referred to one another.

The embodiments of the present application have been described above, the foregoing description is exemplary, not exhaustive, and not limited to the embodiments disclosed. Variations and substitutions will be readily apparent to those skilled in the art within the scope and spirit of the various embodiments described and disclosed herein, and are intended to be encompassed within the scope of the present application.

Claims

1. A sampling operation realization system, comprising a first processing unit with a systolic array accelerator, and a second processing unit connected to the first processing unit;

the second processing unit is used for:

acquiring data to be processed;

converting the sampling operation of the data to be processed into convolution operation, and determining the data to be convolved and a convolution kernel corresponding to the convolution operation;

processing the data to be convolved and the convolution kernel respectively, converting the data to be convolved into an input feature matrix, and converting the convolution kernel into a kernel matrix;

transmitting the input feature matrix and the kernel matrix to the first processing unit, and converting the matrix multiplication result into a processing result corresponding to the sampling operation in response to receiving the matrix multiplication result of the input feature matrix and the kernel matrix transmitted by the first processing unit;

the first processing unit is used for:

in response to receiving the input feature matrix and the kernel matrix sent by the second processing unit, performing matrix multiplication operation on the input feature matrix and the kernel matrix through the systolic array accelerator to obtain a matrix multiplication operation result;

the matrix multiplication operation result is sent to a second processing unit, or the matrix multiplication operation result is converted into a processing result corresponding to the sampling operation;

wherein converting the sampling operation of the data to be processed into a convolution operation, and determining the data to be convolved and the convolution kernel corresponding to the convolution operation, comprises:

aiming at sampling operation of any sampling dimension, establishing a corresponding relation between each element position in the output characteristic matrix and each element position in the data to be processed according to the size of the data to be processed, the size of the output characteristic matrix corresponding to the sampling operation and a sampling mode;

constructing a convolution kernel for the data to be processed and the output feature matrix corresponding to the sampling operation according to the corresponding relation, wherein the convolution kernel is assigned based on the weight of the data to be processed under the corresponding relation;

and carrying out dimension transformation on the data to be processed, and transforming the sampling dimension into a channel dimension to obtain the data to be convolved.

2. The sampling operation implementation system according to claim 1, wherein processing the data to be convolved and the convolution kernel, respectively, converts the data to be convolved into an input feature matrix, and converts the convolution kernel into a kernel matrix, comprises:

performing cyclic replication processing on the convolution kernel, sequentially converting data included in the convolution kernel into column vectors or row vectors, and arranging the column vectors or the row vectors into a two-dimensional kernel matrix;

and performing cyclic replication processing on the data to be convolved, and sequentially converting the data to be convolved corresponding to the convolution kernel receptive field into corresponding column vectors or row vectors, and arranging the corresponding column vectors or row vectors into a two-dimensional input feature matrix.

3. The sampling operation implementation system according to claim 1, wherein converting the matrix multiplication result into a processing result corresponding to the sampling operation comprises:

carrying out dimension transformation on the matrix multiplication operation result, and transforming the channel dimension of the matrix multiplication operation result into a sampling dimension to obtain a processing result which is used as a result of carrying out the sampling operation on the data to be processed.

4. A sampling operation implementing system according to any one of claims 1-3, wherein the second processing unit is further configured to:

under the condition that the convolution kernel comprises all-zero data blocks, performing segmentation processing on the convolution kernel to obtain all-zero data blocks and non-zero data blocks;

respectively determining a data sub-block to be convolved corresponding to each non-zero data block;

respectively carrying out cyclic replication processing on the data sub-blocks to be convolved and the non-zero data blocks, converting each data sub-block to be convolved into a two-dimensional input feature sub-matrix, and converting each non-zero data block into a two-dimensional kernel sub-matrix;

transmitting each kernel sub-matrix and the corresponding input feature sub-matrix to the first processing unit, and, in response to receiving the matrix multiplication result of each kernel sub-matrix and the corresponding input feature sub-matrix transmitted by the first processing unit, performing splicing processing on the plurality of matrix multiplication results to obtain a spliced result;

performing dimension transformation on the spliced result to obtain a processing result, wherein the processing result is used as a result of sampling operation on the data to be processed;

the first processing unit is further configured to:

in response to receiving each kernel sub-matrix and the corresponding input feature sub-matrix, performing matrix multiplication operation on each kernel sub-matrix and the corresponding input feature sub-matrix through the systolic array accelerator to obtain a plurality of matrix multiplication results;

and transmitting the plurality of matrix multiplication results to the second processing unit.

5. A sampling operation implementing system according to any one of claims 1-3, wherein the second processing unit is further configured to: and under the condition that the sampling operation is sampling of N dimensions, converting the sampling operation of the data to be processed into N times of convolution operation, respectively determining the data to be convolved and the convolution kernel corresponding to each convolution operation, wherein N is an integer larger than 1.

6. The sampling operation implementation system of claim 5, wherein the N dimensions comprise a first sampling dimension and a second sampling dimension, the sampling operation implementation system further to:

performing a first sampling operation of a first sampling dimension on the data to be processed through the first processing unit and the second processing unit to obtain a first processing result, wherein the first processing result is used as a result corresponding to the sampling operation of the data to be processed in the first sampling dimension;

and performing a second sampling operation of a second sampling dimension on the first processing result through the first processing unit and the second processing unit to obtain a second processing result, wherein the second processing result is used as a result corresponding to the sampling operation of the data to be processed in the first sampling dimension and the second sampling dimension.

7. A sampling operation implementation method, wherein the sampling operation implementation method is applied to a sampling operation implementation system, the sampling operation implementation system including a first processing unit having a systolic array accelerator, and a second processing unit connected to the first processing unit, the method comprising:

acquiring data to be processed through the second processing unit;

converting the sampling operation on the data to be processed into a convolution operation through the second processing unit, and determining the data to be convolved and the convolution kernel corresponding to the convolution operation;

processing the data to be convolved and the convolution kernel respectively through the second processing unit, converting the data to be convolved into an input feature matrix, converting the convolution kernel into a kernel matrix, and sending the input feature matrix and the kernel matrix to the first processing unit;

the first processing unit, in response to receiving the input feature matrix and the kernel matrix sent by the second processing unit, performs a matrix multiplication operation on the input feature matrix and the kernel matrix through the systolic array accelerator to obtain a matrix multiplication operation result;

the first processing unit sends the matrix multiplication operation result to the second processing unit, and the second processing unit, in response to receiving the matrix multiplication operation result of the input feature matrix and the kernel matrix sent by the first processing unit, converts the matrix multiplication operation result into a processing result corresponding to the sampling operation;

or after the first processing unit obtains the matrix multiplication operation result, converting the matrix multiplication operation result into a processing result corresponding to the sampling operation by the first processing unit;

wherein the second processing unit is configured to, in the process of converting the sampling operation on the data to be processed into a convolution operation and determining the data to be convolved and the convolution kernel corresponding to the convolution operation:

8. An electronic device, comprising: a memory and a processor;

the memory stores a computer program which, when executed by the processor, performs the sampling operation implementation method of claim 7.

9. A computer readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the sampling operation implementation method of claim 7.
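The data flow recited in claim 7 (sampling operation, convolution, input feature matrix and kernel matrix, matrix multiplication) can be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: the sampling operation is assumed to be one-dimensional down-sampling by averaging, the helper names `im2col_1d`, `matmul` and `downsample_by_averaging` are hypothetical, and an ordinary software matrix product stands in for the systolic array accelerator.

```python
def im2col_1d(signal, kernel_size, stride):
    """Unfold a 1-D signal into rows of sliding windows (the input feature matrix)."""
    last_start = len(signal) - kernel_size
    return [signal[start:start + kernel_size] for start in range(0, last_start + 1, stride)]


def matmul(a, b):
    """Plain matrix product; a stand-in for the systolic array accelerator."""
    return [[sum(x * y for x, y in zip(row, column)) for column in zip(*b)] for row in a]


def downsample_by_averaging(signal, factor):
    """Express average down-sampling as a strided convolution, then as a matrix product."""
    # Sampling as convolution: averaging kernel with stride equal to the factor (assumption).
    kernel = [1.0 / factor] * factor
    # Data to be convolved -> input feature matrix, shape (windows, factor).
    feature_matrix = im2col_1d(signal, factor, factor)
    # Convolution kernel -> kernel matrix, a column vector of shape (factor, 1).
    kernel_matrix = [[weight] for weight in kernel]
    # Matrix multiplication yields shape (windows, 1).
    product = matmul(feature_matrix, kernel_matrix)
    # Convert the matrix multiplication result back into the sampled sequence.
    return [row[0] for row in product]


print(downsample_by_averaging([1.0, 3.0, 5.0, 7.0], 2))
```

Unfolding the input into windows is what lets a single matrix product replace the per-window arithmetic, which is the form of workload a systolic array is built to accelerate.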