CN112861846B - Method and device for processing tensor data - Google Patents

Method and device for processing tensor data

Info

Publication number
CN112861846B
Authority
CN
China
Prior art keywords
tensor
tensors
instruction
local
processing
Prior art date
Legal status
Active
Application number
CN201911100157.0A
Other languages
Chinese (zh)
Other versions
CN112861846A (en)
Inventor
Li Jun (李军)
Current Assignee
Beijing Horizon Robotics Technology Research and Development Co Ltd
Original Assignee
Beijing Horizon Robotics Technology Research and Development Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Horizon Robotics Technology Research and Development Co Ltd
Priority to CN201911100157.0A
Publication of CN112861846A
Application granted
Publication of CN112861846B

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/20 - Image preprocessing
    • G06V10/25 - Determination of region of interest [ROI] or a volume of interest [VOI]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Executing Machine-Instructions (AREA)
  • Image Processing (AREA)

Abstract

A method and apparatus for processing tensor data are disclosed. The method may include: storing a plurality of instruction sequences, each instruction sequence being used to process a number of tensors corresponding to that instruction sequence; for each region of interest of an input tensor, determining a local tensor corresponding to the region of interest from the input tensor; and, for each group of a first number of acquired local tensors, executing a first instruction sequence for processing the first number of tensors. This reduces the amount of data transferred and increases execution speed.

Description

Method and device for processing tensor data
Technical Field
The present disclosure relates generally to the field of artificial intelligence, and in particular to a method and apparatus for processing tensor data.
Background
In region-of-interest (Region of Interest, ROI) based target detection schemes proposed in the related art, a large number of ROIs, for example thousands, tens of thousands, hundreds of thousands, or even more, can be generated for each input tensor (tensor) by, for example, a region proposal network (Region Proposal Network, RPN).
Then, various processes such as sorting, non-maximum suppression, interpolation, and alignment may be performed on each local tensor corresponding to each ROI generated for the input tensor, and a detection result for the input tensor may further be obtained by, for example, a region-based convolutional neural network (Region-based Convolutional Neural Network, RCNN).
Disclosure of Invention
According to one aspect of the present disclosure, a method of processing tensor data is provided. The method may include: storing a plurality of instruction sequences, each instruction sequence being used to process a number of tensors corresponding to that instruction sequence; for each region of interest of an input tensor, determining a local tensor corresponding to the region of interest from the input tensor; and, for each group of a first number of acquired local tensors, executing a first instruction sequence, of the plurality of instruction sequences, for processing the first number of tensors.
According to another aspect of the present disclosure, there is also provided an apparatus for processing tensor data. The apparatus may include: an instruction storage module configured to store a plurality of instruction sequences, each instruction sequence being used to process a number of tensors corresponding to that instruction sequence; a tensor acquisition module configured to acquire, for each region of interest of an input tensor, a local tensor corresponding to the region of interest from the input tensor; and a first instruction execution module configured to execute, for each group of a first number of local tensors acquired by the tensor acquisition module, a first instruction sequence, of the plurality of instruction sequences, for processing the first number of tensors.
According to another aspect of the present disclosure, there is also provided a computer-readable storage medium having stored thereon a computer program for performing the above-described method.
According to another aspect of the disclosure, there is also provided an electronic device that may include a processor and a memory for storing instructions executable by the processor, the processor being configured to read the instructions from the memory and execute the instructions to implement the above-described method.
The method, the apparatus, and the electronic device described above can reduce the amount of data transferred during tensor data processing and increase the tensor data processing speed.
Drawings
The above and other objects, features, and advantages of the present disclosure will become more apparent from the following detailed description of its embodiments with reference to the accompanying drawings. The accompanying drawings are included to provide a further understanding of embodiments of the disclosure, are incorporated in and constitute a part of this specification, and together with the description serve to explain the disclosure; they do not limit the disclosure. In the drawings, like reference numerals generally refer to like parts or steps.
FIG. 1 illustrates an example apparatus for processing tensor data.
Fig. 2 illustrates an example method for processing tensor data according to an embodiment of this disclosure.
Fig. 3 illustrates an example of local tensors and ROIs according to an embodiment of the disclosure.
Fig. 4 shows an example of a process according to an embodiment of the present disclosure.
Fig. 5 shows an example of an operation to be performed for a tensor.
Fig. 6 shows an example of a procedure for the operation of fig. 5.
Fig. 7 illustrates an example of a process for the operation of fig. 5, according to an embodiment of the present disclosure.
Fig. 8 illustrates an example method for processing tensor data according to an embodiment of this disclosure.
Fig. 9 illustrates an example method for processing tensor data according to an embodiment of this disclosure.
Fig. 10 illustrates an example method for processing tensor data according to an embodiment of this disclosure.
Fig. 11 illustrates an example apparatus for processing tensor data according to an embodiment of the disclosure.
Fig. 12 illustrates an example apparatus for processing tensor data according to an embodiment of the disclosure.
Fig. 13 illustrates an example apparatus for processing tensor data according to an embodiment of the disclosure.
Fig. 14 illustrates an example apparatus for processing tensor data according to an embodiment of the disclosure.
Fig. 15 illustrates an example of an electronic device according to an embodiment of the disclosure.
Detailed Description
Hereinafter, example embodiments according to the present disclosure will be described in detail with reference to the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present disclosure and not all of the embodiments of the present disclosure, and that the present disclosure is not limited by the example embodiments described herein.
Summary of the application
In ROI-based object detection and labeling schemes, the number of generated ROIs tends to be quite large, is not known in advance, and varies with the input tensor. For the processing that is performed, after ROI generation, on the local tensors corresponding to the individual ROIs, it is therefore difficult to determine and deploy in advance an instruction sequence capable of efficiently processing all the local tensors. If the same instruction sequence is repeatedly executed for each local tensor, the relevant data and parameters (e.g., data and parameters related to the layer operations of a convolutional neural network) need to be repeatedly loaded; these parameters occupy considerable memory resources, and the repeated loading wastes bandwidth and thus seriously degrades execution efficiency.
To this end, as shown in fig. 1, an instruction template 150 may be generated in advance based on an instruction sequence for processing a single local tensor, and the generated instruction template 150 may be stored in the memory 120 of the example apparatus 100 used to perform the object detection task. In operation of the example apparatus 100, a processor 110 developed based on, for example, a central processing unit (Central Processing Unit, CPU), an Advanced RISC Machine (ARM) processor, or a field programmable gate array (Field Programmable Gate Array, FPGA) may generate, from the instruction template 150 read from the memory 120 and the number of ROIs actually generated, an instruction sequence 160 for processing all local tensors, and store the instruction sequence 160 in the memory 140. A processor 130 in the example apparatus 100, such as a convolutional neural network acceleration core, may then read the generated instruction sequence 160 from the memory 140 and execute it for all the local tensors.
However, the processors 110 and 130 in the example apparatus 100 are heterogeneous, which introduces an inherent systematic risk into the example apparatus 100. For example, it is difficult to detect an execution error of the processor 110 during the processing performed by the processor 130. In addition, data also flows from the memory 120 to the memory 140 during operation of the example apparatus 100, which causes additional bandwidth overhead.
Exemplary method
Fig. 2 illustrates an example method 200 for processing tensor data according to an embodiment of this disclosure. The example method 200 may include steps 210, 220, and 230.
In step 210, M instruction sequences may be stored, where M may be any natural number, and each of the M instruction sequences may be used to process a number of tensors corresponding to that instruction sequence.
For example, the M instruction sequences may include an instruction sequence for batch-processing N1 tensors, an instruction sequence for batch-processing N2 tensors, ..., and an instruction sequence for batch-processing NM tensors. For example, if M > 1, then N1, ..., NM may be natural numbers that differ from each other, with 1 ≤ N1 < ... < NM; if M = 1, then NM (i.e., N1) may be any natural number greater than 1. For example, NM may be an integer multiple of M, and the batch sizes may be set as N1 = NM/M, N2 = 2NM/M, and so on.
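As a concrete illustration of the batch-size scheme above, the following minimal Python sketch derives N1, ..., NM from M and NM under the assumption that NM is an integer multiple of M; the function name and the power-of-two alternative noted in the comment are illustrative, not part of the disclosure.

```python
# Minimal sketch (illustrative only): derive the M batch sizes N_1 < ... < N_M
# under the assumption N_i = i * N_M / M stated above.
def make_batch_sizes(m: int, n_max: int) -> list[int]:
    assert n_max % m == 0, "N_M is assumed to be an integer multiple of M"
    return [i * n_max // m for i in range(1, m + 1)]

print(make_batch_sizes(4, 8))   # [2, 4, 6, 8]
# A power-of-two family such as [1, 2, 4, 8], used in the later examples with
# IS1..IS4, is another possible choice.
```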
Each of the M instruction sequences 240 may be used to process the same task or tasks, such as sorting, non-maximum suppression, interpolation, alignment, executing an RCNN model, and so on. In addition, the M instruction sequences may process the same task or tasks in the same or different ways, and each may also be optimized differently in advance for its particular tensor number.
The "tensor" in the present disclosure may include, for example, "feature data" or "feature map" in the field of artificial intelligence, and the like, and corresponding examples may include, but are not limited to, colored or monochromatic images, pictures or video frames that can be visually discerned or viewed by the human eye, and abstract data that cannot be visually discerned or viewed by the human eye, and the like. For example, the tensor may be a color image, a gray-scale image, a frame image in an image sequence, or the like, or may be an intermediate result of a layer of a neural network, or the like. In addition, the tensor may include one or more channels, and each channel may carry features relating to one or more aspects of the tensor. For example, the tensor may include color information channels respectively characterizing different colors of red, yellow, blue, etc., gray information channels characterizing gray information, and one or more other channels characterizing other characteristic information such as sharpness information, edge information, etc. The present disclosure is not limited to the type, form, shape, content, etc. of tensors.
Then, in step 220, for each ROI of the input tensor, a local tensor corresponding to the ROI may be determined from the input tensor.
FIG. 3 illustrates an example tensor 300 in which each small square represents an element point in the example tensor 300. The individual ROIs of the example tensor 300 may be obtained in any suitable manner, which is not limiting of the present disclosure. As shown in fig. 3, from one ROI 310 of the example tensor 300, one local tensor 320 corresponding to the ROI 310 may be determined from the example tensor 300, wherein the local tensor 320 is one data slice of the example tensor 300, which may have a different width and height than the example tensor 300, but the same depth (i.e., number of channels).
In the example of fig. 3, the example tensor 300 and the local tensor 320 are represented by a data cube, and the ROI 310 is represented by a two-dimensional rectangle. However, the present disclosure is not limited to the example of fig. 3. For example, the input tensor and the local tensor may also be a two-dimensional data array (e.g., a monochromatic image such as a gray-scale image), etc., and the ROI determined for the input tensor may also have a circular, triangular, polygonal, or other arbitrary shape, and so on.
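For concreteness, the following sketch shows one way a local tensor could be cut out of an input tensor for a rectangular ROI, assuming a (height, width, channels) layout and an axis-aligned box in element coordinates; the helper name and ROI format are assumptions, not the patent's interface.

```python
import numpy as np

# Minimal sketch (not the patent's implementation): extract the local tensor
# for one ROI given as (x0, y0, x1, y1) in element coordinates.
def crop_local_tensor(input_tensor: np.ndarray, roi: tuple[int, int, int, int]) -> np.ndarray:
    x0, y0, x1, y1 = roi
    # The slice keeps the full channel depth, matching the description of
    # local tensor 320 as a data slice of example tensor 300.
    return input_tensor[y0:y1, x0:x1, :]

tensor = np.random.rand(64, 64, 16).astype(np.float32)   # toy input tensor
local = crop_local_tensor(tensor, (10, 12, 30, 40))       # shape (28, 20, 16)
```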
In addition, an ROI corresponds to a local tensor: sorting ROI data is in effect sorting the local tensors described by (or otherwise corresponding to) that ROI data, and the confidence carried in the ROI data is in effect the confidence of the corresponding local tensor. Thus, in the present disclosure, an ROI and its local tensor may not be strictly differentiated unless otherwise specified. For example, "region of interest" or "ROI" is used when focusing on attributes of the tensor, such as boundaries and confidence, while "local tensor" is used when focusing on the tensor data itself, such as the individual elements within the tensor.
Then, in step 230, a first instruction sequence, of the above M instruction sequences, for processing a first number C1 of tensors may be executed for each group of a first number of acquired local tensors.
The first number C1 may be set to any of the values N1, ..., NM described above, for example C1 = NM. As shown in fig. 4, for one set of NM local tensors 420 of the input tensor, a first instruction sequence 410, of the M instruction sequences 400, for processing the first number C1 = NM of tensors may be executed. Similarly, the first instruction sequence 410 may then be executed for another set of NM local tensors 430 of the input tensor, and so on until all local tensors of the input tensor have been processed.
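A minimal sketch of the grouping described in step 230, assuming C1 = NM and a hypothetical table `instruction_sequences` that maps each supported batch size to a callable which batch-processes that many local tensors:

```python
# Sketch (illustrative names) of step 230 as a plain loop.
def process_in_batches(local_tensors, instruction_sequences, c1):
    buffer = []
    for t in local_tensors:
        buffer.append(t)
        if len(buffer) == c1:
            instruction_sequences[c1](buffer)  # the first instruction sequence
            buffer.clear()
    return buffer  # any remainder (< C1) is handled separately, see step 810 below
```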
In the example method 200, M instruction sequences are set and stored, each of which may be used to process a number of tensors corresponding to that instruction sequence; then, for each group of a first number of local tensors of the input tensor, the first instruction sequence, of the M instruction sequences, for processing the first number of tensors is executed. Each of these M instruction sequences may have been compiled offline in advance, by a high-performance processor such as a CPU, into a form suitable for execution by a processor such as a convolutional neural network acceleration core. Thus, an apparatus for processing tensors need not include multiple processors of differing implementations and processing capabilities, which avoids the systematic risks caused by module heterogeneity and also avoids data flow between the memories associated with different processors.
In addition, as previously described, each of the M instruction sequences in the example method 200 may be an instruction sequence that is pre-optimized for a different tensor number, allowing for higher execution efficiency with relatively fewer memory resources. This is described in more detail below with reference to fig. 5 to 7.
Fig. 5 shows an example operation 500 to be performed for the tensor corresponding to each ROI generated, including a convolution operation 510 using parameters 520, a convolution operation 530 using parameters 540, and a convolution operation 550 using parameters 560.
FIG. 6 illustrates an example process 600 for the example operation 500. The example process 600 may be repeatedly performed for each local tensor corresponding to each ROI generated.
As shown in fig. 6, for each local tensor corresponding to each ROI, the example process 600 includes: step 610, loading the local tensor from the memory; step 620, loading parameters 520 for convolution operation 510 from memory; step 630, performing convolution operation 510; step 640, loading parameters 540 for convolution operation 530 from memory; step 650, performing convolution operation 530; step 660, loading parameters 560 for convolution operation 550 from memory; step 670, performing convolution operation 550; in step 680, the resulting tensor obtained by the convolution operation 550 is stored in memory.
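The repetition can be seen in the following sketch of the example process 600; `load_tensor`, `load_params`, `conv`, and `store` are placeholders standing in for the memory and compute operations of fig. 6, not functions defined by the patent, and they only model the call pattern.

```python
# Placeholder helpers that only model the call pattern of fig. 6.
def load_tensor(roi): return roi
def load_params(name): return name
def conv(x, params): return (x, params)
def store(x): return x

def process_600(rois):
    for roi in rois:
        x = load_tensor(roi)            # step 610
        p1 = load_params("conv_510")    # step 620, repeated for every ROI
        x = conv(x, p1)                 # step 630
        p2 = load_params("conv_530")    # step 640, repeated for every ROI
        x = conv(x, p2)                 # step 650
        p3 = load_params("conv_550")    # step 660, repeated for every ROI
        x = conv(x, p3)                 # step 670
        store(x)                        # step 680
```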
FIG. 7 illustrates an example process 700 for the example operation 500. The example process 700 may be regarded as one example of step 230 of the example method 200 of fig. 2, obtained by adjusting the steps of the example process 600.
As shown in fig. 7, in the example process 700, after the parameters 520 for the convolution operation 510 are loaded in step 620, a loop 710 over the local tensors is entered. In loop 710, for each local tensor, the local tensor is loaded via step 610, the convolution operation 510 is performed via step 630, and the first result obtained by the convolution operation 510 is stored via step 720. Then, after the parameters 540 for the convolution operation 530 are loaded via step 640, a loop 730 over the first results is entered. In loop 730, for each first result, the first result is loaded via step 740, the convolution operation 530 is performed via step 650, and the second result obtained by the convolution operation 530 is stored via step 750. Then, after the parameters 560 for the convolution operation 550 are loaded via step 660, a loop 760 over the second results is entered. In loop 760, for each second result, the second result is loaded, the convolution operation 550 is performed via step 670, and the result tensor obtained by the convolution operation 550 is stored via step 680.
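The restructuring is summarized by the following sketch, using the same placeholder helpers as above plus a `load_result` placeholder, and assuming `store` returns a handle that can later be read back; every parameter set is now loaded exactly once.

```python
# Placeholders as in the previous sketch; they only model the call pattern.
def load_tensor(roi): return roi
def load_params(name): return name
def conv(x, params): return (x, params)
def store(x): return x            # assumed to return a handle to the stored data
def load_result(handle): return handle

def process_700(rois):
    p1 = load_params("conv_510")                                    # step 620, once
    firsts = [store(conv(load_tensor(r), p1)) for r in rois]        # loop 710
    p2 = load_params("conv_530")                                    # step 640, once
    seconds = [store(conv(load_result(f), p2)) for f in firsts]     # loop 730
    p3 = load_params("conv_550")                                    # step 660, once
    for s in seconds:                                               # loop 760
        store(conv(load_result(s), p3))                             # steps 670, 680
```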
Assume that, in the above example processes 600 and 700, the time spent loading the parameters of one convolution operation from memory is TP, the time spent loading a tensor from memory or storing a tensor into memory is TF, and the time spent performing one convolution operation is TC. Then, for the K local tensors corresponding to K ROIs (K being any natural number), the time spent executing the example process 600 may be T600 = 2K·TF + 3K·TP + 3K·TC, and the time spent executing the example process 700 may be T700 = 6K·TF + 3TP + 3K·TC. Thus, the difference between the two execution times is ΔT = T600 − T700 = 3(K−1)·TP − 4K·TF.
In practical applications, such as an actual RCNN model, the total size of the parameters of a convolution operation is often much larger than the size of the tensor input to that convolution operation. Accordingly, for each convolution operation, the time TP spent loading its parameters may be much greater than the time TF spent loading its input tensor or storing its output result, for example TP ≥ 10TF, and TP may also be several times the time TC spent performing the convolution itself. Assuming TP ≈ 10TF and TP ≈ TC, the example process 700 has a higher execution speed than the example process 600 whenever K ≥ 2. For example, when K = 3, the speedup of the example process 700 relative to the example process 600 may be approximately (T600 − T700)/T600 ≈ (13K − 15)/(31K) = 24/93 ≈ 25.8%, and the larger K is, the larger this speedup becomes.
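For clarity, the arithmetic behind the 25.8% figure can be written out explicitly under the stated assumptions TP ≈ 10TF and TC ≈ TP:

```latex
% Worked substitution with T_P \approx 10\,T_F,\; T_C \approx T_P,\; K = 3:
\[
\begin{aligned}
T_{600} &= 2K T_F + 3K T_P + 3K T_C \approx (2K + 30K + 30K)\,T_F = 62K\,T_F,\\
T_{700} &= 6K T_F + 3 T_P + 3K T_C \approx (6K + 30 + 30K)\,T_F = (36K + 30)\,T_F,\\
\frac{T_{600}-T_{700}}{T_{600}} &\approx \frac{26K - 30}{62K} = \frac{13K - 15}{31K}
  \;\overset{K=3}{=}\; \frac{24}{93} \approx 25.8\%.
\end{aligned}
\]
```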
In addition, during actual processing, the capacity of the high-speed memory associated with a processor such as a convolutional neural network acceleration core is often quite limited, so the first results obtained in step 630 and the second results obtained in step 650 also tend to have to be stored, even in the example process 600, in a larger-capacity but slower off-chip memory (e.g., a dynamic random access memory). In view of this, the improvement in execution efficiency of the example process 700 relative to the example process 600 may actually be even higher.
As previously described, each of the M instruction sequences of the example method 200 may be an instruction sequence optimized, in the manner of the example process 700, for a particular number of tensors, for example a first instruction sequence for processing a first number C1 of tensors. For example, if L convolution operations (L being any natural number) need to be performed for each local tensor, then, estimated in a manner similar to the example process 600, the time spent processing K local tensors may be approximated as T1 = LK·TP + 2K·TF + LK·TC, while the time spent executing the example method 200 may be approximated as T2 = (LK/C1)·TP + 2LK·TF + LK·TC. Thus, the improvement in execution efficiency of the example method 200 may be (T1 − T2)/T1 ≈ (4LC1 + C1 − 5L)/(10LC1 + C1). If the number of convolution layers is large and the number of local tensors per batch is also made as large as possible, the improvement in execution efficiency of the example method 200 may approach 4LC1/(10LC1) = 40%. Thus, with the example method 200, the local tensors corresponding to the respective ROIs can be processed with greater efficiency.
As previously described, the number of ROIs generated for the input tensor is uncertain, so the number of ROIs ultimately generated may not be an integer multiple of the first number C1.
As shown in fig. 8, in one embodiment, the example method 200 may further include step 810: for an acquired second number C2 of local tensors, executing a second instruction sequence, of the above M instruction sequences, for processing a third number C3 of tensors, where the third number C3 is greater than or equal to the second number C2 and less than or equal to the first number C1.
For example, assume that the M instruction sequences include an instruction sequence IS1 for processing 8 ROIs, an instruction sequence IS2 for processing 4 ROIs, an instruction sequence IS3 for processing 2 ROIs, and an instruction sequence IS4 for processing 1 ROI, and that the first number is set to C1 = 8. Then, according to steps 220 and 230 of the example method 200, the instruction sequence IS1 may be executed for the 8 local tensors corresponding to every 8 ROIs acquired.
If, after the instruction sequence IS1 has been executed several times, all ROIs for the input tensor have been generated and the counter has re-counted only to 4, the 4 local tensors corresponding to the remaining 4 ROIs can be determined by step 220. Then, for these 4 local tensors, the second number may be set to C2 = 4 and the third number to C3 = 4, and the instruction sequence IS2 may be executed once; alternatively, the second number may be set to C2 = 2 and the third number to C3 = 2, and the instruction sequence IS3 may be executed twice; it is also possible to first set the second number to C2 = 2 and the third number to C3 = 2, and then the second number to C2 = 1 and the third number to C3 = 1, thereby executing the instruction sequence IS3 once and the instruction sequence IS4 twice.
For another example, if, after the instruction sequence IS1 has been executed several times, all ROIs for the input tensor have been generated and the counter has re-counted only to 7, the 7 local tensors corresponding to the remaining 7 ROIs can be determined by step 220. Then, for these 7 local tensors, the second number may be set to C2 = 4 and the third number to C3 = 4, and the instruction sequence IS2 executed once; then the second number may be set to C2 = 2 and the third number to C3 = 2, and the instruction sequence IS3 executed once; and finally the second number may be set to C2 = 1 and the third number to C3 = 1, and the instruction sequence IS4 executed once.
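The splitting described in the two examples above amounts to a greedy, largest-first decomposition of the remaining count into the supported batch sizes; the following sketch is one way to express it, with illustrative names and the {1, 2, 4, 8} sizes of the example.

```python
# Illustrative sketch: greedy, largest-first split of a remaining ROI count
# into the supported batch sizes.
def split_remainder(remaining: int, batch_sizes: list[int]) -> list[int]:
    plan = []
    for size in sorted(batch_sizes, reverse=True):
        while remaining >= size:
            plan.append(size)
            remaining -= size
    return plan

print(split_remainder(4, [1, 2, 4, 8]))  # [4]       -> execute IS2 once
print(split_remainder(7, [1, 2, 4, 8]))  # [4, 2, 1] -> IS2, IS3, IS4 once each
```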
The example method 200 is capable of processing any number of ROIs, via step 810.
In another embodiment, as shown in fig. 9, the example method 200 may further include step 910: acquiring at least one copy of at least one of the acquired second number C2 of local tensors, such that the sum of the second number C2 and the number of copies equals the third number C3.
Continuing the above assumptions, if, after the instruction sequence IS1 has been executed several times, all ROIs for the input tensor have been generated and the counter has re-counted only to 6, the 6 local tensors corresponding to the remaining 6 ROIs may be determined by step 220. Then, for these 6 local tensors, the second number may be set to C2 = 6 and the third number to C3 = 8, and 2 copies of any one local tensor may be acquired in step 910; alternatively, for any two of these 6 local tensors, one copy of each (2 copies in total) may be acquired in step 910. Either way, 8 local tensors (6 local tensors plus 2 copies) are obtained in total, and the instruction sequence IS1 may then be executed for these 8 local tensors in step 810.
For another example, if, after the instruction sequence IS1 has been executed several times, all ROIs for the input tensor have been generated and the counter has re-counted only to 3, the 3 local tensors corresponding to the remaining 3 ROIs can be determined by step 220. Then, for these 3 local tensors, the second number may be set to C2 = 3 and the third number to C3 = 4, and 1 copy of any one local tensor may be acquired in step 910, yielding 4 local tensors in total (3 local tensors plus 1 copy). The instruction sequence IS2 may then be executed for these 4 local tensors in step 810.
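A minimal sketch of step 910 under the same assumptions: the remainder is padded with duplicates of an arbitrary local tensor until the count reaches the smallest supported batch size that is not smaller than C2. The helper name is hypothetical.

```python
# Illustrative sketch of step 910: pad the remaining C2 local tensors with
# duplicates of one of them until the count reaches the smallest supported
# batch size C3 >= C2 (e.g. C2 = 6 -> C3 = 8, C2 = 3 -> C3 = 4).
def pad_with_copies(local_tensors: list, batch_sizes: list[int]) -> list:
    c2 = len(local_tensors)
    c3 = min(size for size in batch_sizes if size >= c2)
    return local_tensors + [local_tensors[0]] * (c3 - c2)
```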
In this embodiment, instead of splitting the remaining count as in the previous embodiment, a qualifying number of local tensors is obtained by acquiring at least one copy of one or more local tensors, which avoids the potentially more complex decision process required for splitting the count. Through step 910, it can be ensured, at little cost, that the example method 200 is applicable to any number of ROIs.
In one embodiment, as shown in fig. 10, the example method 200 may further include: in step 1010, each local tensor acquired is adjusted to a predetermined size. In step 1010, each local tensor obtained may be adjusted to a predetermined size using any suitable method, such as bilinear interpolation, dilation convolution, or the like.
The local tensors corresponding to the respective ROIs generated for the input tensor may have different sizes and shapes. By adjusting each local tensor to a predetermined size and shape in step 1010, each of the aforementioned M instruction sequences can process the local tensors with unified, simple processing logic, which in turn allows the instruction sequences to be optimized by appropriate optimization techniques and thereby provides higher execution efficiency.
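As one possible realization of step 1010 (bilinear interpolation is only one of the suitable methods mentioned above, and this NumPy sketch is not the patent's implementation), a local tensor of shape (h, w, c) can be resized to a fixed (out_h, out_w, c):

```python
import numpy as np

# Sketch of step 1010 using bilinear interpolation over the height and width
# axes while keeping the channel depth unchanged.
def resize_bilinear(local: np.ndarray, out_h: int, out_w: int) -> np.ndarray:
    h, w, _ = local.shape
    ys = np.linspace(0.0, h - 1, out_h)        # sample positions along height
    xs = np.linspace(0.0, w - 1, out_w)        # sample positions along width
    y0 = np.floor(ys).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x0 = np.floor(xs).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None, None]              # fractional weights along height
    wx = (xs - x0)[None, :, None]              # fractional weights along width
    top = local[y0][:, x0] * (1 - wx) + local[y0][:, x1] * wx
    bottom = local[y1][:, x0] * (1 - wx) + local[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy

fixed = resize_bilinear(np.random.rand(28, 20, 16), 7, 7)   # shape (7, 7, 16)
```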
It should be understood that the method according to embodiments of the present disclosure may not be limited to the example method 200 described above, and that the order and implementation details of the various steps may not be limited to the examples described above.
Exemplary apparatus
Fig. 11 illustrates an example apparatus 1100 for processing tensor data, which example apparatus 1100 may include an instruction storage module 1110, a tensor acquisition module 1120, and a first instruction execution module 1130, according to an embodiment of the disclosure.
The instruction storage module 1110 may be configured to store the M instruction sequences described above, where each of the M instruction sequences may be used to process a number of tensors corresponding to that instruction sequence; for example, the M instruction sequences may include an instruction sequence for batch-processing N1 tensors, an instruction sequence for batch-processing N2 tensors, ..., and an instruction sequence for batch-processing NM tensors. For example, the M instruction sequences may be stored into the instruction storage module 1110 by step 210 of the example method 200. The instruction storage module 1110 may be any suitable memory that can store instruction sequences.
The tensor acquisition module 1120 may be configured to perform step 220 of the example method 200. For example, the tensor acquisition module 1120 may receive the input tensor TIN and an ROI of the input tensor TIN, and output the local tensor TROI corresponding to the ROI.
The first instruction execution module 1130 may be configured to perform step 230 of the example method 200. In the example of fig. 11, the first instruction execution module 1130 may include a counter 1140 and a buffer memory 1150. For example, if 1 ≤ N1 < ... < NM and the first number is set to C1 = NM, the maximum count value of the counter 1140 and the maximum capacity of the buffer memory 1150 may be set to C1 = NM. The first instruction execution module 1130 may then buffer the local tensor TROI from the tensor acquisition module 1120 into the buffer memory 1150 and count the number of local tensors in the buffer memory 1150 with the counter 1140. When the count value of the counter 1140 reaches the first number C1 = NM, the instruction storage module 1110 may be notified, and the instruction sequence for batch-processing NM tensors may be loaded from the instruction storage module 1110 and executed. After that instruction sequence has been executed, the counter 1140 may be reset and the buffer memory 1150 emptied. In another example, only the buffer memory 1150 may be used: the instruction storage module 1110 may be notified when the buffer memory 1150 reaches its maximum capacity, and the instruction sequence for batch-processing NM tensors may then be loaded and executed. In further examples, the first instruction execution module 1130 may also include an instruction buffer memory for buffering instruction sequences.
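The counter-and-buffer behavior described above can be summarized by the following behavioral sketch; the class and the `run_sequence` callable are hypothetical stand-ins for the modules 1130, 1140, and 1150 and for loading and executing the stored instruction sequence, not a hardware description.

```python
# Behavioural sketch (not hardware) of the first instruction execution module
# 1130 with its counter 1140 and buffer memory 1150, assuming C1 = N_M.
class FirstInstructionExecutionModule:
    def __init__(self, c1, run_sequence):
        self.c1 = c1                  # maximum count value / buffer capacity
        self.buffer = []              # buffer memory 1150
        self.run_sequence = run_sequence

    def push(self, local_tensor):
        self.buffer.append(local_tensor)
        if len(self.buffer) == self.c1:   # counter 1140 reaches C1 = N_M
            self.run_sequence(self.buffer)
            self.buffer.clear()           # reset the counter, empty the buffer

# Usage: module = FirstInstructionExecutionModule(8, lambda batch: None)
```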
In further embodiments, the tensor acquisition module 1120 and the first instruction execution module 1130 may be implemented by a suitable processor developed, for example, based on a graphics processor, FPGA, or the like.
The example apparatus 1100 may be used to implement the example method 200. As shown in fig. 11, in an example apparatus 1100, the same processor may be used to avoid systematic risks caused by module heterogeneity while also avoiding data flow between memories associated with different processors. In addition, as described above, each of the M instruction sequences stored in the instruction storage module 1110 may be an instruction sequence that is optimized in advance for a different tensor number, so that the example apparatus 1100 may have a higher execution efficiency.
In one embodiment, as shown in fig. 12, the example apparatus 1100 may further include a second instruction execution module 1210, which second instruction execution module 1210 may be configured to perform step 810 described above, thereby enabling the example apparatus 1100 to process any number of ROIs. In one example, similar to the first instruction execution module 1130, the second instruction execution module 1210 may include a counter and a buffer memory for buffering local tensors. In further examples, the second instruction execution module 1210 may also include an instruction buffer memory for buffering instruction sequences. In further examples, the second instruction execution module 1210 may reuse circuitry in the first instruction execution module 1130 or may be implemented integrally with the first instruction execution module 1130.
In another embodiment, as shown in fig. 13, the example apparatus 1100 may further include a tensor replication module 1310, the tensor replication module 1310 may be configured to perform step 910 of the example method 200. In this example, the local tensors are supplemented by the tensor replication module 1310, such that the example apparatus 1100 is able to avoid implementing potentially more complex decision processes required in the number splitting process, thereby reducing design complexity of the example apparatus 1100 and improving execution efficiency of the example apparatus 1100.
In another embodiment, as shown in fig. 14, the example apparatus 1100 may further include a tensor adjustment module 1410. The tensor adjustment module 1410 may be configured to perform step 1010 of the example method 200, so as to provide the first instruction execution module 1130 and the second instruction execution module 1210 (not shown in fig. 14) with local tensors of a predetermined size, thereby allowing the M instruction sequences in the instruction storage module 1110 to be instruction sequences optimized by suitable optimization techniques.
In one embodiment, the various modules in the example apparatus 1100 may be implemented integrally with a suitable processor developed, for example, based on a graphics processor, FPGA, or the like.
It should be appreciated that an apparatus according to an embodiment of the present disclosure is not limited to the example apparatus 1100 described above. For example, other modules, such as a convolution operation module, may also be included in the example apparatus 1100. In addition, the modules in the apparatus may be connected or coupled together in any suitable manner; in the examples of figs. 11 to 14 above, the arrows between modules merely indicate the flow of the data or signals of interest and do not indicate that data or signals can flow between the modules only in the direction of the arrows.
Exemplary electronic device
Fig. 15 illustrates an electronic device 1500, which electronic device 1500 may include one or more processors 1510 and memory 1520, in accordance with embodiments of the disclosure.
The processor 1510 may be a CPU or other form of processing unit having data processing and/or instruction execution capabilities and may control other components in the electronic device 1500 to perform the desired functions.
Memory 1520 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or nonvolatile memory. The volatile memory may include, for example, random access memory (RAM) and/or cache memory (cache). The nonvolatile memory may include, for example, read-only memory (ROM), a hard disk, flash memory, and the like. One or more computer program instructions may be stored on the computer-readable storage medium and executed by the processor 1510 to implement the example method 200 described above and/or other desired functions.
In one example, the electronic device 1500 may also include an input device 1530 and an output device 1540, which are interconnected by a bus system and/or other forms of connection mechanisms (not shown). The input device 1530 may include, for example, a keyboard, a mouse, and the like. The output device 1540 may include, for example, a display, speakers, a printer, and a communication network and the remote output devices connected thereto, and the like. In addition, the electronic device 1500 may also include any other suitable components or modules.
Exemplary computer program product and computer readable storage Medium
In addition to the methods and apparatus described above, embodiments of the present disclosure may also be a computer program product comprising computer program instructions which, when executed by a processor, cause the processor to perform the steps of the method of processing tensor data according to various embodiments of the present disclosure described in the "Exemplary method" section above.
The computer program product may write program code for performing the operations of embodiments of the present disclosure in any combination of one or more programming languages, including object-oriented programming languages such as Java or C++ and conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present disclosure may also be a computer-readable storage medium having stored thereon computer program instructions which, when executed by a processor, cause the processor to perform the steps of the method of processing tensor data according to various embodiments of the present disclosure described in the "Exemplary method" section above.
The computer readable storage medium may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may include, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium would include the following: an electrical connection having one or more wires, a portable disk, a hard disk, a RAM, a ROM, an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The basic principles of the present disclosure have been described above in connection with specific embodiments, but it should be noted that the advantages, benefits, effects, etc. mentioned in the present disclosure are merely examples and not limiting, and these advantages, benefits, effects, etc. are not to be considered as necessarily possessed by the various embodiments of the present disclosure. Furthermore, the specific details disclosed herein are for purposes of illustration and understanding only, and are not intended to be limiting, since the disclosure is not necessarily limited to practice with the specific details described.
The block diagrams of the devices, apparatuses, equipment, and systems referred to in this disclosure are merely illustrative examples and are not intended to require or imply that the connections, arrangements, or configurations must be made in the manner shown in the block diagrams. As will be appreciated by those skilled in the art, such devices, apparatuses, equipment, and systems may be connected, arranged, or configured in any manner. Words such as "including", "comprising", and "having" are open-ended and mean "including but not limited to", and are used interchangeably therewith. The term "or" as used herein refers to, and is used interchangeably with, the term "and/or", unless the context clearly indicates otherwise. The term "such as" as used herein refers to, and is used interchangeably with, the phrase "such as, but not limited to".
It is also noted that in the apparatus, devices and methods of the present disclosure, components or steps may be disassembled and/or assembled. Such decomposition and/or recombination should be considered equivalent to the present disclosure.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the disclosure. Thus, the present disclosure is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, this description is not intended to limit the embodiments of the disclosure to the form disclosed herein. Although a number of example aspects and embodiments have been discussed above, a person of ordinary skill in the art will recognize certain variations, modifications, alterations, additions, and subcombinations thereof.

Claims (8)

1. A method of processing tensor data, comprising:
storing a plurality of instruction sequences, each instruction sequence for processing a number of tensors corresponding to the instruction sequence;
for each region of interest of an input tensor, determining a local tensor corresponding to the region of interest from the input tensor; and
for each first number of local tensors acquired, if the number of regions of interest is an integer multiple of the first number, executing a first instruction sequence of the plurality of instruction sequences for processing the first number of tensors; and
if the number of regions of interest is not an integer multiple of the first number, executing, for the acquired second number of local tensors, a second instruction sequence of the plurality of instruction sequences for processing a third number of tensors, the third number being greater than or equal to the second number and less than or equal to the first number.
2. The method of claim 1, further comprising:
At least one copy of at least one of the second number of local tensors is acquired such that a sum of the second number and the number of at least one copy is equal to the third number.
3. The method of any of claims 1-2, further comprising:
Each acquired local tensor is adjusted to a predetermined size.
4. An apparatus for processing tensor data, comprising:
an instruction storage module configured to store a plurality of instruction sequences, each instruction sequence for processing a number of tensors corresponding to the instruction sequence;
A tensor acquisition module configured to acquire, for each region of interest of an input tensor, a local tensor corresponding to the region of interest from the input tensor; and
A first instruction execution module configured to execute a first instruction sequence of the plurality of instruction sequences for processing the first number of tensors, if the number of regions of interest is an integer multiple of the first number, for each first number of local tensors acquired by the tensor acquisition module,
If the number of regions of interest is not an integer multiple of the first number, further comprising:
A second instruction execution module configured to execute, for a second number of local tensors acquired by the tensor acquisition module, a second sequence of instructions of the plurality of sequences of instructions for processing a third number of tensors, the third number being greater than or equal to the second number and less than or equal to the first number.
5. The apparatus of claim 4, further comprising:
a tensor replication module configured to acquire at least one copy of at least one of the second number of local tensors acquired by the tensor acquisition module, such that a sum of the second number and the number of the at least one copy is equal to the third number.
6. The apparatus of any of claims 4 to 5, further comprising:
a tensor adjustment module configured to adjust each local tensor acquired by the tensor acquisition module to a predetermined size.
7. A computer readable storage medium having stored thereon a computer program for performing the method according to any of claims 1 to 3.
8. An electronic device, comprising:
A processor; and
A memory for storing instructions executable by the processor;
The processor is configured to read the instructions from the memory and execute the instructions to implement the method according to any one of claims 1 to 3.
CN201911100157.0A 2019-11-12 2019-11-12 Method and device for processing tensor data Active CN112861846B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911100157.0A CN112861846B (en) 2019-11-12 2019-11-12 Method and device for processing tensor data

Publications (2)

Publication Number Publication Date
CN112861846A CN112861846A (en) 2021-05-28
CN112861846B true CN112861846B (en) 2024-04-19

Family

ID=75984194

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911100157.0A Active CN112861846B (en) 2019-11-12 2019-11-12 Method and device for processing tensor data

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106529511A (en) * 2016-12-13 2017-03-22 北京旷视科技有限公司 Image structuring method and device
CN107980140A (en) * 2017-10-16 2018-05-01 厦门中控智慧信息技术有限公司 A kind of recognition methods of vena metacarpea and device
CN108197705A (en) * 2017-12-29 2018-06-22 国民技术股份有限公司 Convolutional neural networks hardware accelerator and convolutional calculation method and storage medium
CN109919311A (en) * 2019-03-13 2019-06-21 北京地平线机器人技术研发有限公司 The method for generating instruction sequence, the method and apparatus for executing neural network computing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant