EP4133368A1

EP4133368A1 - Device and method for data processing

Info

Publication number: EP4133368A1
Application number: EP20726086.0A
Authority: EP
Inventors: Nicola BRANDONISIO; Stephen Busch; Eric Badi
Original assignee: Huawei Technologies Co Ltd
Current assignee: Huawei Technologies Co Ltd
Priority date: 2020-05-14
Filing date: 2020-05-14
Publication date: 2023-02-15
Also published as: WO2021228392A1

Abstract

The present disclosure relates to a device for data processing. The device includes a processor core comprising a plurality of data-paths for processing data. Each data-path comprises at least one operator, and at least some of the operators of different data-paths are connected by hard-wiring. The processor core is configured to process a plurality of input vectors in parallel, wherein each input vector is processed by a different data-path.

Description

DEVICE AND METHOD FOR DATA PROCESSING

TECHNICAL FIELD

The present disclosure relates generally to the field of data processing, and particularly, to a device comprising a processor core for data processing. The processor core (of the device) may comprise a plurality of data-paths, which may process a plurality of input vectors in parallel. For instance, each input vector may be processed by a different data-path of the processor core.

BACKGROUND

In many application scenarios, implementing a compute-intensive algorithm or program entirely in hardware may become obsolete. In particular, hard-wiring such an algorithm or program rules out a later adaptation thereof. For example, an Image Signal Processor (ISP) is nowadays required to be adaptable to an image sensor, since the development cycles of image sensors are much shorter than the development cycles of ISPs. Thus, a device for data processing that comprises programmable hardware is generally needed.

Conventional devices for data processing, which address the above need, are based on processor architectures such as Single Instruction Multiple Data (SIMD), Multiple Instruction Multiple Data (MIMD), Graphics Processing Unit (GPU), Very Long Instruction Word (VLIW), and Algorithm Instruction Specific Processor (AISP) architectures, on arrays of processors, or even on Convolutional Neural Networks (CNNs).

However, an issue of the conventional devices is that they are limited with regard to, e.g., their computational resources (the target is 1 Tera Operations (Top)/s), the number of parallel instructions, the power consumption, the number of cycles needed to run an algorithm (latency of thousands of cycles), etc.

Furthermore, some conventional devices have an instruction decoder in addition to a control flow, which generally takes up 30% of the processor; this is a waste of energy.

SUMMARY

In view of the above-mentioned problems and disadvantages, embodiments of the present invention aim to improve the conventional devices and methods for data processing. An objective is to provide a device for data processing with a new programmable processor core. In particular, by means of the programmable processor core, the device should be reconfigurable to carry out a new computer algorithm or program. That is, the device should enable a programmer to adapt it by programming or re-programming.

The objective is achieved by the embodiments of the invention as described in the enclosed independent claims. Advantageous implementations of the embodiments of the invention are further defined in the dependent claims.

In particular, embodiments of the invention may provide a data processing device with both programmable and non-programmable hardware. The device has the programmable processor core, by which the device may be configurable or re-configurable for performing changes in its data processing functionality. The programmability, in particular the re-configurability of the device, may include changes in the operation of hardwired data-paths of the processor core (e.g., by selecting different operators of the data-paths), changes in an execution of instructions for a specific application, adapting to a new algorithm, etc.

For example, the device of the present disclosure may provide a programmable (e.g., thread-based) hardware accelerator. The device of the present disclosure enables flexibility when developing hardwired compute-intensive algorithms.

A first aspect of the present disclosure provides a device for data processing, the device comprising a processor core comprising a plurality of data-paths for processing data, wherein each data-path comprises at least one operator, and wherein at least some of the operators of different data-paths are connected by hard-wiring, wherein the processor core is configured to process a plurality of input vectors in parallel, wherein each input vector is processed by a different data-path.

The device may be, or may be incorporated in, for example, an electronic device such as a personal computer, a desktop computer, a laptop, a tablet, a mobile phone, a smart phone, a digital camera, etc.

The device comprising the processor core may be used for an Image Signal Processor (ISP), which may be adaptable to different types of image sensors (e.g., the device comprising the processor core may be adaptable to different patterns of a camera’s image sensor). That is, the device is reconfigurable.

In some embodiments, the reconfigurability of the device may solve an issue of providing new hardwired compute-intensive algorithms. The reconfigurability of the device may enable hardware reconfiguration, software programmability, etc. For instance, the development cycle time of an image sensor may be shorter than the development cycle time of an ISP. Furthermore, programmable hardware may be needed, e.g., when an algorithm is changing or adapting to new inputs, or when the algorithm is replaced by another algorithm, etc.

In some embodiments, the reconfigurability of the device may target one or more classes of algorithms. Moreover, the device comprising the processor core may be implemented such that it consumes low power; for example, the consumed power may be as low as that of a hardwired accelerator, and the device may execute approximately 2000 operations per cycle.

In some embodiments, the device may be based on a post-silicon changeable instruction decoder, which may provide (e.g., infinitely) a higher degree of freedom.

The device may comprise circuitry. The circuitry may comprise hardware and software. The hardware may comprise analog or digital circuitry, or both analog and digital circuitry. In some embodiments, the circuitry comprises one or more processors and a non-volatile memory connected to the one or more processors. The non-volatile memory may carry executable program code which, when executed by the one or more processors, causes the device to perform the operations or methods described herein.

In an implementation form of the first aspect, at least some operators of the plurality of data paths are controllable, in particular are programmable to perform one or more arithmetic and/or logic operations.

In some embodiments, the plurality of data-paths may be connected such that there may be no branches. Moreover, the plurality of data-paths may be controlled by a program. The program may be executed linearly, which may provide a simple implementation of the pipeline and the control flow. For instance, “if” statements may be handled with conditional stream selection from two parallel computing paths.

In a further implementation form of the first aspect, the processor core comprises a plurality of groups of data-paths, wherein at least some of the groups are connected by hard-wiring.

For example, in some embodiments, a large number of potentially parallel data-paths (threads) (e.g., 32) may run; further, each data-path may comprise potentially two or three operators. This may enable a high use ratio of the computing resources, enable computation reuse, etc.

In some embodiments, the plurality of groups of data-paths may comprise, for example, 128 parallel data-paths (threads).

In particular, the groups may be connected by hard-wiring (e.g., partially pre-wired) and reconfigurable compute trees, which may minimize data movements. Moreover, the thread concept may enable simplifying the program code, because an algorithm mapped onto hardware (HW) can easily be expressed as threads communicating with each other.

In a further implementation form of the first aspect, the processor core comprises a plurality of clusters, wherein each cluster comprises a set of groups of data-paths, and wherein at least some of the clusters are connected by hard-wiring.

In a further implementation form of the first aspect, the device further comprises at least one router configured to route the plurality of input vectors to the different data-paths.

In a further implementation form of the first aspect, the device further comprises a memory for storing one or more control vectors, wherein the device is configured to use each control vector to control at least one of: a set of the operators; a set of the data-paths; a distribution of the input vectors to the data-paths; an operation of one or more operators.

For example, the control vectors may be generated by a Python tool and may further be stored in a memory such as a static random-access memory (SRAM), without limiting the present disclosure. Hereinafter, the “instructions” are referred to as “Py-Templates”. Moreover, the programs may have any size. For example, in some embodiments, the program size may be on the order of 100 instructions; in some embodiments, a compression process may be used, and a size on the order of more than 1000 instructions may be obtained.

In a further implementation form of the first aspect, the device is further configured to use at least one control vector per each processing cycle.

In some embodiments, the device may execute wide vectors that may directly control computing resources or data routing at a low level. For example, the instructions may be replaced by vectors of control bits assigned to each resource, so that the configurability is not limited by instruction formats as it is in processors.
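As a minimal sketch of the idea of control bits assigned to each resource, the following Python snippet packs per-resource values into one wide control vector and splits it again. The field names and bit widths are purely hypothetical and are not taken from the present disclosure.

```python
# Hypothetical field layout: (resource name, width in bits).
# These names and widths are illustrative assumptions only.
FIELDS = [("op_111", 2), ("op_112", 2), ("op_113", 2), ("route_sel", 3)]

def pack(values):
    """Pack per-resource control values into one wide control vector."""
    word, shift = 0, 0
    for name, width in FIELDS:
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word |= value << shift
        shift += width
    return word

def unpack(word):
    """Split a control vector back into the bits each resource receives."""
    out, shift = {}, 0
    for name, width in FIELDS:
        out[name] = (word >> shift) & ((1 << width) - 1)
        shift += width
    return out
```

In such a layout, each resource reads only its own slice of the vector, so adding a resource widens the vector instead of requiring a new instruction format.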

In a further implementation form of the first aspect, the device is further configured to perform a synchronization of one or more data-paths inside a group of data-paths; and/or perform a synchronization of one or more data-paths between one or more clusters of groups of data paths.

In a further implementation form of the first aspect, one or more data-paths are forked into sub-data-paths, wherein respective input vectors of the one or more data-paths are each processed by each sub-data-path of the respective data-paths.

In a further implementation form of the first aspect, the device is further configured to process the plurality of input vectors according to a processing tree, which is implemented by the data paths and the operators.

In a further implementation form of the first aspect, the device is further configured to process image data of an image sensor comprising a block of pixels.

In a further implementation form of the first aspect, the device is further configured to organize the image data of the block of pixels into the plurality of input vectors, wherein each input vector is based on image data of a set of vertical pixels.

In a further implementation form of the first aspect, the device is further configured to obtain results of processing from two or more data-paths; and combine the obtained results, for obtaining an output result.
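The organization of a block of pixels into vertical input vectors, as described above, might be sketched in Python as follows. The function name and the row-major list-of-lists representation of the block are assumptions made for illustration.

```python
def block_to_column_stream(block):
    """Yield one vertical vector (one column of the pixel block) at a time.

    `block` is assumed to be a row-major list of rows: block[y][x].
    """
    height = len(block)
    width = len(block[0]) if height else 0
    for x in range(width):
        yield [block[y][x] for y in range(height)]
```

For instance, a block of two rows and three columns yields three vectors of two pixels each, in left-to-right order.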

Moreover, the device according to the first aspect and its implementation forms may provide one or more of the following advantages:

• Minimizing the HW overhead for the flexibility (+50%).

• Full entitlement of the architecture may be obtained by removing instructions.

• Minimizing the data movement, e.g., streams of ordered data in first-in, first-out (FIFO) buffers may be spread across the data-paths and a router (which may result in a lower power consumption).

• Hardwired trees of operators may remove congestion at the SRAM when accessing the data.

• There may be no need for a compiler; a scheduler may generate the control vectors.

A second aspect of the invention provides a method for data processing, the method comprising processing, by a processor core, a plurality of input vectors in parallel, wherein each input vector is processed by a different data-path, wherein the processor core comprises a plurality of data-paths for processing data, wherein each data-path comprises at least one operator, and wherein at least some of the operators of different data-paths are connected by hard-wiring.

In an implementation form of the second aspect, at least some operators of the plurality of data paths are controllable, in particular are programmable to perform one or more arithmetic and/or logic operations.

In a further implementation form of the second aspect, the processor core comprises a plurality of groups of data-paths, wherein at least some of the groups are connected by hard-wiring.

In a further implementation form of the second aspect, the processor core comprises a plurality of clusters, wherein each cluster comprises a set of groups of data-paths, and wherein at least some of the clusters are connected by hard-wiring.

In a further implementation form of the second aspect, the method further comprises routing, by at least one router, the plurality of input vectors to the different data-paths.

In a further implementation form of the second aspect, the method further comprises storing, by a memory, one or more control vectors, and controlling, by using each control vector, at least one of: a set of the operators; a set of the data-paths; a distribution of the input vectors to the data-paths; an operation of one or more operators.

In a further implementation form of the second aspect, the method further comprises using at least one control vector per each processing cycle.

In a further implementation form of the second aspect, the method further comprises performing a synchronization of one or more data-paths inside a group of data-paths; and/or performing a synchronization of one or more data-paths between one or more clusters of groups of data paths.

In a further implementation form of the second aspect, one or more data-paths are forked into sub-data-paths, wherein respective input vectors of the one or more data-paths are each processed by each sub-data-path of the respective data-paths.

In a further implementation form of the second aspect, the method further comprises processing the plurality of input vectors according to a processing tree, which is implemented by the data paths and the operators.

In a further implementation form of the second aspect, the method further comprises processing image data of an image sensor comprising a block of pixels.

In a further implementation form of the second aspect, the method further comprises organizing the image data of the block of pixels into the plurality of input vectors, wherein each input vector is based on image data of a set of vertical pixels.

In a further implementation form of the second aspect, the method further comprises obtaining results of processing from two or more data-paths; and combining the obtained results, for obtaining an output result.

The method of the second aspect and its implementation forms achieve the same advantages as the device of the first aspect and its respective implementation forms.

A third aspect of the present disclosure provides a computer program comprising a program code for performing the method according to the second aspect or any of its implementation forms.

A fourth aspect of the present disclosure provides a non-transitory storage medium storing executable program code which, when executed by a processor, causes the method according to the second aspect or any of its implementation forms to be performed.

It has to be noted that all devices, elements, units and means described in the present application could be implemented in the software or hardware elements or any kind of combination thereof. All steps which are performed by the various entities described in the present application as well as the functionalities described to be performed by the various entities are intended to mean that the respective entity is adapted to or configured to perform the respective steps and functionalities. Even if, in the following description of specific embodiments, a specific functionality or step to be performed by external entities is not reflected in the description of a specific detailed element of that entity which performs that specific step or functionality, it should be clear for a skilled person that these methods and functionalities can be implemented in respective software or hardware elements, or any kind of combination thereof.

BRIEF DESCRIPTION OF DRAWINGS

The above described aspects and implementation forms will be explained in the following description of specific embodiments in relation to the enclosed drawings, in which

FIG. 1 shows a schematic view of a device for data processing, according to an embodiment of the invention;

FIG. 2 shows a schematic view of a diagram illustrating the device processing a plurality of input vectors according to a processing tree;

FIGS. 3A-3B show schematic views of diagrams illustrating the device comprising an instruction decoder (FIG. 3A) and the device comprising a memory for storing control vectors (FIG. 3B);

FIG. 4 shows a schematic view of a diagram illustrating the device comprising a plurality of clusters for processing a block of pixels;

FIG. 5 shows a schematic view of a diagram illustrating the device comprising a cluster of four groups;

FIG. 6 shows a schematic view of a diagram illustrating the reconfigurability of the device based on image sensors; and

FIG. 7 shows a method for data processing, according to an embodiment of the invention.

DETAILED DESCRIPTION OF EMBODIMENTS

FIG. 1 shows a schematic view of a device 100 for data processing, according to an embodiment of the invention.

The device 100 may be an electronic device such as a personal computer, a laptop, a digital mobile camera, a smart phone, etc. For example, the device 100 may be used for an ISP of a digital camera.

The device 100 comprises a processor core 10 comprising a plurality of data-paths 110, 120 for processing data, wherein each data-path 110, 120 comprises at least one operator 111, 112, 113, 121, 122, 123, and wherein at least some of the operators 112, 113, 122, 123 of different data paths 110, 120 are connected by hard-wiring.

For example, the data-path 110 comprises the operators 111, 112, 113 and the data-path 120 comprises the operators 121, 122, 123. Moreover, the operator 112 of the data-path 110 is connected by hard-wiring to the operator 122 of the data-path 120. Furthermore, the operator 113 of the data-path 110 is connected by hard-wiring to the operator 123 of the data-path 120.

Moreover, the processor core 10 is configured to process a plurality of input vectors in parallel, wherein each input vector is processed by a different data-path 110, 120.

The device 100, particularly the processor core 10, may comprise processing circuitry (not shown in detail in FIG. 1) configured to perform, conduct or initiate the various operations of the device 100 described herein. The processing circuitry may comprise hardware and software. The hardware may comprise analog circuitry or digital circuitry, or both analog and digital circuitry. The digital circuitry may comprise components such as application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), or multi-purpose processors. In one embodiment, the processing circuitry comprises one or more processors and a non-transitory memory connected to the one or more processors. The non-transitory memory may carry executable program code which, when executed by the one or more processors, causes the device 100 to perform, conduct or initiate the operations or methods described herein.

FIG. 2 shows a schematic view of a diagram illustrating the device 100 processing a plurality of input vectors according to a processing tree 200.

The processing tree 200 is implemented by the data-paths 110, 120, 210, 220 and their corresponding operators.

For example, the device 100 may process image data of an image sensor, the image data comprising at least one block of pixels. The device 100 may implement the processing tree 200 by connecting some of the operators. In FIG. 2, the G_xy are pixels located in an image sensor at the corresponding x and y coordinates, and the device 100 is configured to compute an operation according to Eq. (1) as follows:

\[
\begin{aligned}
&\operatorname{abs}\!\left(\frac{G_{03}\cdot 340 + G_{13}\cdot 684}{1024} - G_{12}\right)
+ \operatorname{abs}\!\left(\frac{G_{21}\cdot 340 + G_{31}\cdot 684}{1024} - G_{30}\right) \\
&+ \operatorname{abs}\!\left(\frac{G_{25}\cdot 340 + G_{35}\cdot 684}{1024} - G_{34}\right)
+ \operatorname{abs}\!\left(\frac{G_{43}\cdot 340 + G_{53}\cdot 684}{1024} - G_{52}\right)
\end{aligned}
\tag{1}
\]

The device 100 may organize the computation tree 200, as shown in FIG. 2, in order to compute the above operation. For instance, the device 100 may organize all computations as a tree, and may further optimize the computation tree 200 by including hard routing between (some of) the operators.
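As an illustration only, Eq. (1) can be evaluated in Python with one function per data-path and a pairwise combination of the four partial results. The sketch assumes that the third term uses the same weighted sum as the other three, that the division by 1024 is an integer division, and that pixel G_xy is stored at `G[x][y]`; none of these details is fixed by the text, and the function names are hypothetical.

```python
def interp_error(a, b, c):
    """One data-path: abs((a*340 + b*684) / 1024 - c).

    Integer (floor) division is an assumption; the weights sum to 1024.
    """
    return abs((a * 340 + b * 684) // 1024 - c)

def eq1(G):
    """Evaluate Eq. (1); G[x][y] is assumed to hold pixel G_xy."""
    partial = [
        interp_error(G[0][3], G[1][3], G[1][2]),
        interp_error(G[2][1], G[3][1], G[3][0]),
        interp_error(G[2][5], G[3][5], G[3][4]),
        interp_error(G[4][3], G[5][3], G[5][2]),
    ]
    # Combine the four partial results pairwise, as an adder tree would.
    return (partial[0] + partial[1]) + (partial[2] + partial[3])
```

Because 340 + 684 = 1024, each term compares a weighted average of two pixels with a third pixel, so a uniform patch gives a result of zero.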

In FIG. 2, as an example, the operators of the data-path 110 are indicated by references 111, 112, 113 and 201. Moreover, the operator 113 of the data-path 110 may be connected by a hardwired connection to a respective operator of the data-path 120.

Moreover, the device 100 is further configured to obtain results of processing of the data-paths 110, 120, 210, 220, and combine the obtained results, for obtaining an output result. Furthermore, the processing may be applied to a stream of ordered vectors of pixels, and there may be no need for random data fetch accesses.

Table I compares information for processing an instruction when the operand fetching is performed based on a first-in, first-out (FIFO) buffer of vectors and when it is performed based on random accesses.

Table I: exemplary information for processing an instruction.

Reference is now made to FIG. 3A and FIG. 3B, which are schematic views of diagrams illustrating an implementation of the device comprising an instruction decoder (FIG. 3A) and the device comprising a memory for storing control vectors (FIG. 3B).

Generally, the implementations of FIG. 3A and FIG. 3B may be used for controlling one or more hardware data-paths and selecting the respective data-paths. In FIG. 3A, an instruction decoder and a control flow are used for controlling the data-paths.

The device 100 may also be, for example, programmable hardware with limited area overhead that includes neither an instruction decoder nor a control flow.

For example, the device 100 of FIG. 3B comprises the memory 310, which stores the control vectors. The device may use one or more of the stored control vectors for controlling, e.g., a set of the operators 111, 112, 113, 201, 121, 122, 123, a set of the data-paths 110, 120, 210, 220, a distribution of the input vectors to the data-paths, an operation of one or more operators 111, 112, 113, 201, 121, 122, 123, etc. As an example, the control vectors may be used for controlling the hardware data-path and data selection (~100 vectors per algorithm). There may be no instruction decoder, and the control vectors may be built offline.
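A minimal Python sketch of such decoder-free control, under stated assumptions, is given below: a program is a list of control vectors built offline, and one control vector is consumed per cycle to select the operation applied to two input streams. The operation table `OPS`, the `"op"` field and the function `run` are hypothetical names chosen for illustration.

```python
import operator

# Hypothetical operation table; the selector values are illustrative only.
OPS = {0: operator.add, 1: operator.sub, 2: operator.mul}

def run(program, stream_a, stream_b):
    """Apply one control vector per cycle to two input streams.

    `program` is a list of control vectors (here: dicts with an "op"
    selector). It is executed linearly, without branches.
    """
    results = []
    for control, a, b in zip(program, stream_a, stream_b):
        op = OPS[control["op"]]  # the control bits index the operator table
        results.append([op(x, y) for x, y in zip(a, b)])
    return results
```

For example, a two-vector program with selectors 0 and 2 adds the first pair of vectors and multiplies the second pair element by element.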

FIG. 4 shows a schematic view of a diagram illustrating the device 100 comprising a plurality of clusters 401 for processing a block of pixels.

The device 100 may process a block of pixels in a very limited number of cycles (~400 pixels in 100 cycles), and the processing may be organized in threads. For example, the block of pixels may be organized in a succession of vertical vectors of pixels that are accessed sequentially. Hereinafter, this succession is called a “stream” of vectors. Furthermore, a thread may be a stream processed by a data-path including an operand fetch and the operators. The device may combine the threads together to build a computation tree. Hereinafter, the computation tree is called a “Py-template”.

In FIG. 4, the device 100 includes eight clusters 401, each comprising four groups 402. Each group 402 comprises four data-paths 110, 120, 210, 220, each having two or three operators 111, 112, 113, 121, 122, 123.
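As a worked example, and assuming that every cluster and group of FIG. 4 is fully populated, the number of data-paths follows from the hierarchy as:

\[
8 \ \text{clusters} \times 4 \ \text{groups per cluster} \times 4 \ \text{data-paths per group} = 128 \ \text{data-paths}
\]

This agrees with the 128 parallel data-paths (threads) mentioned above for some embodiments. With two or three operators per data-path, the same assumption gives between \(128 \times 2 = 256\) and \(128 \times 3 = 384\) operators.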

Moreover, the device 100 further comprises at least one router 403 to connect, for example, some data-paths 110, 120, 210, 220 in the groups 402, or to connect some groups 402 in the cluster 401, or to connect some clusters 401 together. The processing data-paths are seen as threads.

The device 100 may further, for example, synchronize the threads inside a group 402, synchronize the threads between the clusters 401, fork the threads to process the same data differently, and select the result between two or more threads. Furthermore, the data may be located (mostly) in the data-paths 110, 120, 210, 220, and in the synchronization resources.
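Forking a thread and selecting between the results, which is also how an “if” statement may be handled without a branch, might be sketched in Python as follows. The function and parameter names are hypothetical, and the per-element selection mask is an assumption.

```python
def fork_and_select(vector, path_true, path_false, condition):
    """Process the same vector on two forked paths, then select per element.

    Both paths always run (no branch); `condition` returns one boolean
    per element and decides which result is kept.
    """
    out_true = path_true(vector)
    out_false = path_false(vector)
    mask = condition(vector)
    return [t if m else f for t, f, m in zip(out_true, out_false, mask)]
```

For example, doubling positive elements and negating the others is expressed as two parallel computations followed by a single selection step.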

FIG. 5 shows a schematic view of a diagram illustrating the device 100 comprising a cluster 401 of four groups 402, 501, 502, 503.

For example, the device 100 of FIG. 5 may be programmable HW with no instructions, but including control vectors, which may be generated by a tool.

The device 100 is further configured to use at least one control vector per each processing cycle and may further control all of the HW resources.

The diagram of FIG. 5 illustrates a highly parallel architecture of the device 100. The device may process streams of vectors based on mapping pyramidal hardware computation trees.

The pyramidal computation tree is first implemented at the block level, wherein four data-paths 110, 120, 210, 220 are interconnected together; then the computation tree is implemented in the clusters 401, wherein four blocks are interconnected together; and afterward at the “router” level, wherein the clusters 401 are interconnected together in an infinite loop by the router 403. Moreover, the streams of vectors travel on the connections and are processed by the operators 111, 112, 113, 121, 122, 123. Furthermore, the operators 111, 112, 113, 121, 122, 123 may be based on arithmetic units such as adders, multipliers, dividers, etc.

For example, the operand fetch may be obtained by manipulation of vectors in the streams. Before each operand, there may be a “column arrange” unit and a vector assembly unit. These two units may be in charge of manipulating the vectors of the stream in order to, for example, pull two vectors from a single stream and use the two vectors as operands of the operators, and/or shift one of the two vector operands vertically to change the alignment of the two operands, etc. The column arrange unit can also accept two different streams.
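The two operand manipulations named above, pulling two vectors from a single stream and shifting one operand vertically, might look as follows in Python. This is a sketch under assumptions: consecutive (overlapping) pairs are used, vacated positions are padded with a fill value, and the function names are hypothetical.

```python
def pull_pairs(stream):
    """Pull two consecutive vectors from a single stream as operand pairs."""
    previous = None
    for vector in stream:
        if previous is not None:
            yield previous, vector
        previous = vector

def shift_vertical(vector, offset, fill=0):
    """Shift a vertical vector by `offset` rows (positive = down).

    Vacated positions are padded with `fill`, which is an assumption.
    """
    n = len(vector)
    if offset >= 0:
        return ([fill] * min(offset, n) + list(vector))[:n]
    return (list(vector) + [fill] * min(-offset, n))[-n:]
```

Shifting one operand by one row before an element-wise operation lets an operator combine each pixel with its vertical neighbor in the adjacent column.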

The device 100 may be capable of processing a patch of 34x34 pixels, without limiting the present disclosure. Moreover, when a patch has been processed, the device 100 may process the next patch with an overlap, for seamless computations.
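The patch-by-patch processing with an overlap can be illustrated by computing the start coordinates of the patches along one image axis. In this Python sketch the overlap of 2 pixels is a hypothetical value (the text does not state the overlap), and the function name is likewise an assumption.

```python
def patch_origins(image_size, patch=34, overlap=2):
    """Return the start coordinates of patches along one axis.

    `overlap` is a hypothetical value; the source does not state it.
    """
    if not 0 <= overlap < patch:
        raise ValueError("overlap must be in [0, patch)")
    step = patch - overlap
    origins = list(range(0, max(image_size - patch, 0) + 1, step))
    # Add a final patch flush with the border if the stride left a gap.
    if image_size > patch and origins[-1] + patch < image_size:
        origins.append(image_size - patch)
    return origins
```

Applying the same function to both axes gives the grid of patch positions; neighboring patches then share `overlap` rows or columns.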

Without limiting the present disclosure, an example of a reconfigurability of the device 100, in particular a programmability of the device 100, is discussed with respect to FIG. 6.

FIG. 6 shows a schematic view of a diagram illustrating the device 100 being reconfigured according to a type of an image sensor.

The device 100 comprises the processor core 10 comprising the plurality of data-paths 110, 120. The processor core 10 may be used for an ISP, which may be adaptable to different types of one or more image sensors 611, 612, 613, 614, 615, 616 included in a camera 600.

As an example of the reconfigurability, the device 100 may enable a user to adapt the device 100 by programming or re-programming it. For example, the device 100 may comprise a memory 310 and the processor core 10. The processor core 10 of the device is reconfigurable for different patterns of the image sensors 611, 612, 613, 614, 615, 616.

For instance, the device 100 may enable the user to store a set of programs 601, 602, 603, 604, 605, 606 in the memory 310. Moreover, each program may enable a specific configuration of the processor core 10 for processing data of a specific image sensor. For example, the program 601 (e.g., stored by the user in the memory, during operation of the device) may select the operators 111, 112, 113, 121, 122, 123 of the data-paths 110, 120 such that the processor core 10 is adapted to process data of the image sensor 611 according to its pattern. Hence, the device 100 is a programmable device 100 that includes reconfigurable hardware (the processor core 10 may be reconfigured).

FIG. 7 shows a method 700 according to an embodiment of the invention for data processing. The method 700 may be carried out by the device 100, as it is described above.

The method 700 comprises a step S701 of processing, by a processor core 10, a plurality of input vectors in parallel, wherein each input vector is processed by a different data-path 110, 120, wherein the processor core 10 comprises a plurality of data-paths 110, 120 for processing data, wherein each data-path 110, 120 comprises at least one operator 111, 112, 113, 121, 122, 123, and wherein at least some of the operators 112, 113, 122, 123 of different data-paths are connected by hard-wiring.

The present invention has been described in conjunction with various embodiments as examples as well as implementations. However, other variations can be understood and effected by those persons skilled in the art and practicing the claimed invention, from the studies of the drawings, this disclosure and the independent claims. In the claims as well as in the description the word “comprising” does not exclude other elements or steps and the indefinite article “a” or “an” does not exclude a plurality. A single element or other unit may fulfill the functions of several entities or items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used in an advantageous implementation.

Claims

1. A device (100) for data processing, the device (100) comprising: a processor core (10) comprising a plurality of data-paths (110, 120) for processing data, wherein each data-path (110, 120) comprises at least one operator (111, 112, 113, 121, 122, 123), and wherein at least some of the operators (112, 113, 122, 123) of different data-paths (110, 120) are connected by hard-wiring, wherein the processor core (10) is configured to process a plurality of input vectors in parallel, wherein each input vector is processed by a different data-path (110, 120).

2. The device (100) according to claim 1, wherein: at least some operators (111, 112, 113, 121, 122, 123) of the plurality of data-paths (110, 120) are controllable, in particular are programmable to perform one or more arithmetic operations.

3. The device (100) according to claim 1 or 2, wherein: the processor core (10) comprises a plurality of groups of data-paths (110, 120, 210, 220), wherein at least some of the groups are connected by hard-wiring.

4. The device (100) according to claim 3, wherein: the processor core (10) comprises a plurality of clusters (401), wherein each cluster (401) comprises a set of groups (402, 501, 502, 503) of data-paths (110, 120, 210, 220), and wherein at least some of the clusters are connected by hard-wiring.

5. The device (100) according to one of the claims 1 to 4, further comprising: at least one router (403) configured to route the plurality of input vectors to the different data-paths (110, 120, 210, 220).

6. The device (100) according to one of the claims 1 to 5, further comprising: a memory (310) for storing one or more control vectors, wherein the device (100) is configured to use each control vector to control at least one of:

- a set of the operators (111, 112, 113, 121, 122, 123);
- a set of the data-paths (110, 120, 210, 220);
- a distribution of the input vectors to the data-paths (110, 120, 210, 220);
- an operation of one or more operators (111, 112, 113, 121, 122, 123).

7. The device (100) according to claim 6, wherein: the device (100) is further configured to use at least one control vector per each processing cycle.

8. The device (100) according to one of the claims 1 to 7 when depending on claim 3 or 4, further configured to: perform a synchronization of one or more data-paths inside a group of data-paths; and/or perform a synchronization of one or more data-paths between one or more clusters (401) of groups of data-paths.

9. The device (100) according to one of the claims 1 to 8, wherein: one or more data-paths (110, 120, 210, 220) are forked into sub-data-paths, wherein respective input vectors of the one or more data-paths are each processed by each sub-data-path of the respective data-paths.

10. The device (100) according to one of the claims 1 to 9, configured to: process the plurality of input vectors according to a processing tree (200), which is implemented by the data-paths (110, 120, 210, 220) and the operators (111, 112, 113, 121, 122, 123).

11. The device (100) according to one of the claims 1 to 10, further configured to: process image data of an image sensor comprising a block of pixels.

12. The device (100) according to claim 11, further configured to: organize the image data of the block of pixels into the plurality of input vectors, wherein each input vector is based on image data of a set of vertical pixels.

13. The device (100) according to one of the claims 1 to 12, further configured to: obtain results of processing from two or more data-paths (110, 120, 210, 220); and combine the obtained results, for obtaining an output result.

14. A method (700) for data processing, the method (700) comprising: processing (S701), by a processor core (10), a plurality of input vectors in parallel, wherein each input vector is processed by a different data-path (110, 120), wherein the processor core (10) comprising a plurality of data-paths (110, 120) for processing data, wherein each data-path (110, 120) comprises at least one operator (111, 112, 113, 121, 122, 123), and wherein at least some of the operators (112, 113, 122, 123) of different data-paths are connected by hard-wiring.

15. A computer program which, when executed by a computer, causes the method (700) of claim 14 to be performed.