EP4133368A1 - Device and method for data processing - Google Patents

Device and method for data processing

Info

Publication number
EP4133368A1
EP4133368A1 (application EP20726086.0A)
Authority
EP
European Patent Office
Prior art keywords
data
paths
processing
operators
processor core
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
EP20726086.0A
Other languages
English (en)
French (fr)
Inventor
Nicola BRANDONISIO
Stephen Busch
Eric Badi
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Publication of EP4133368A1
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 - Arrangements for program control, e.g. control units
    • G06F 9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 - Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/38 - Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F 9/3836 - Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
    • G06F 9/3851 - Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution from multiple instruction streams, e.g. multistreaming
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 15/00 - Digital computers in general; Data processing equipment in general
    • G06F 15/76 - Architectures of general purpose stored program computers
    • G06F 15/80 - Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
    • G06F 15/8053 - Vector processors
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 - Arrangements for program control, e.g. control units
    • G06F 9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 - Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/38 - Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F 9/3885 - Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
    • G06F 9/3889 - Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units controlled by multiple instructions, e.g. MIMD, decoupled access or execute
    • G06F 9/3891 - Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units controlled by multiple instructions, e.g. MIMD, decoupled access or execute organised in groups of units sharing resources, e.g. clusters

Definitions

  • the present disclosure relates generally to the field of data processing, and particularly, to a device comprising a processor core for data processing.
  • the processor core (of the device) may comprise a plurality of data-paths, which may process a plurality of input vectors in parallel. For instance, each input vector may be processed by a different data-path of the processor core.
  • many processor architectures exist, such as Single Instruction Multiple Data (SIMD), Multiple Instruction Multiple Data (MIMD), Graphics Processing Unit (GPU), Very Long Instruction Word (VLIW), and Algorithm Instruction Specific Processor (AISP) architectures, as well as arrays of processors or even Convolutional Neural Networks (CNN).
  • some conventional devices have an instruction decoder and a control flow that together typically account for about 30% of the processor, which wastes energy.
  • embodiments of the present invention aim to improve the conventional devices and methods for data processing.
  • An objective is to provide a device for data processing with a new programmable processor core.
  • the device should be reconfigurable to carry out a new computer algorithm or program. That is, the device should enable a programmer to adapt it by programming or re-programming.
  • embodiments of the invention may provide a data processing device with both programmable and non-programmable hardware.
  • the device has the programmable processor core, by which the device may be configurable or re-configurable for performing changes in its data processing functionality.
  • the programmability, in particular the re-configurability, of the device may include changes in the operation of hardwired data-paths of the processor core (e.g., by selecting different operators of the data-paths), changes in an execution of instructions for a specific application, adapting to a new algorithm, etc.
  • the device of the present disclosure may provide a programmable (e.g., thread-based) hardware accelerator.
  • the device of the present disclosure provides flexibility when developing hardwired, computing-intensive algorithms.
  • a first aspect of the present disclosure provides a device for data processing, the device comprising a processor core comprising a plurality of data-paths for processing data, wherein each data-path comprises at least one operator, and wherein at least some of the operators of different data-paths are connected by hard-wiring, wherein the processor core is configured to process a plurality of input vectors in parallel, wherein each input vector is processed by a different data-path.
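The parallelism described in the first aspect can be sketched as follows. This is a hypothetical software model, not taken from the patent: the operator chains and input values are invented for illustration.

```python
# Hypothetical model of the first aspect: each input vector is processed by a
# different data-path, where a data-path is a chain of operators.
def make_datapath(operators):
    def run(vector):
        for op in operators:           # apply the hard-wired operator chain
            vector = [op(x) for x in vector]
        return vector
    return run

# Two data-paths with different operator chains (illustrative).
datapaths = [
    make_datapath([lambda x: x + 1, lambda x: x * 2]),
    make_datapath([lambda x: x * 3]),
]

# Each input vector is routed to a different data-path.
input_vectors = [[1, 2, 3], [4, 5, 6]]
outputs = [dp(v) for dp, v in zip(datapaths, input_vectors)]
```

In hardware the data-paths run concurrently; the sequential loop here only models the per-path behavior.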
  • the device may be, or may be incorporated in, for example, an electronic device such as a personal computer, a desktop computer, a laptop, a tablet, a mobile phone, a smart phone, a digital camera, etc.
  • the device comprising the processor core may be used for an Image Signal Processor (ISP), which may be adaptable to different types of image sensors (e.g., to different patterns of a camera's image sensor). That is, the device is reconfigurable.
  • the reconfigurability of the device may solve the issue of providing new hardwired, computing-intensive algorithms.
  • the reconfigurability of the device may enable hardware reconfiguration, software programmability, etc.
  • the development cycle time of an image sensor may be shorter than the development cycle time of an ISP.
  • programmable hardware may be needed, e.g., when an algorithm is changing or adapting to new inputs, or when the algorithm is replaced by another algorithm, etc.
  • the reconfigurability of the device may target one or more classes of algorithms.
  • the device comprising the processor core may be implemented such that it consumes low power; for example, the power consumption may be as low as that of a hardwired accelerator, while the device may execute approximately 2000 operations per cycle.
  • the device may be based on a post-silicon changeable instruction decoder, which may provide a (virtually unlimited) higher degree of freedom.
  • the device may comprise circuitry.
  • the circuitry may comprise hardware and software.
  • the hardware may comprise analog or digital circuitry, or both analog and digital circuitry.
  • the circuitry comprises one or more processors and a non-volatile memory connected to the one or more processors.
  • the non-volatile memory may carry executable program code which, when executed by the one or more processors, causes the device to perform the operations or methods described herein.
  • At least some operators of the plurality of data paths are controllable, in particular are programmable to perform one or more arithmetic and/or logic operations.
  • the plurality of data-paths may be connected such that there are no branches; moreover, the plurality of data-paths may be controlled by a program.
  • the program may be executed linearly which may provide a simple implementation of the pipeline and the control flow. For instance, “if” statements may be handled with conditional stream selection from two parallel computing paths.
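The conditional stream selection mentioned above can be sketched as follows. The condition and the two computing paths are hypothetical examples; the point is that both paths are computed and the result is selected per element, with no branch in the control flow.

```python
# Hypothetical sketch: an "if" is handled without a branch in the control flow
# by computing both parallel paths for every element and selecting per element.
def cond_select(condition, path_a, path_b, vector):
    results_a = [path_a(x) for x in vector]   # "then" computing path
    results_b = [path_b(x) for x in vector]   # "else" computing path
    return [a if condition(x) else b
            for x, a, b in zip(vector, results_a, results_b)]

out = cond_select(lambda x: x >= 0,
                  lambda x: x * 2,     # path selected when x >= 0
                  lambda x: x + 10,    # path selected otherwise
                  [-2, 3, -1])
```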
  • the processor core comprises a plurality of groups of data-paths, wherein at least some of the groups are connected by hard-wiring.
  • a large number of parallel data-paths (e.g., 32) may run; further, each data-path may comprise two or three operators. This may enable a high use ratio of the computing resources, enable computation reuse, etc.
  • the plurality of groups of data-paths may comprise, for example, 128 parallel data-paths (threads).
  • the groups may be connected by hard-wiring (e.g., partially pre-wired) and reconfigurable compute trees which may minimize data movements.
  • the thread concept may simplify the program code, because an algorithm mapped on hardware (HW) can easily be expressed as threads communicating with each other.
  • the processor core comprises a plurality of clusters, wherein each cluster comprises a set of groups of data-paths, and wherein at least some of the clusters are connected by hard-wiring.
  • the device further comprises at least one router configured to route the plurality of input vectors to the different data-paths.
  • the device further comprises a memory for storing one or more control vectors, wherein the device is configured to use each control vector to control at least one of: a set of the operators; a set of the data-paths; a distribution of the input vectors to the data-paths; an operation of one or more operators.
  • control vectors may be generated by a Python tool and may further be stored in a memory such as a static random-access memory (SRAM), without limiting the present disclosure.
  • the “instructions” are referred to as “Py-Templates”.
  • the programs may have any size (e.g., on the order of 100 instructions in some embodiments; moreover, in some embodiments, a compression process may be used, so that an order of magnitude of more than 1000 instructions may be obtained).
  • the device is further configured to use at least one control vector for each processing cycle.
  • the device may execute wide vectors that may directly control computing resources or data routing at low level.
  • the instructions may be replaced by vectors of control bits assigned to each resource; the configurability is then not limited by instruction formats, as it is in conventional processors.
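The idea of a wide control vector assigning control bits directly to each resource, with no instruction decoder, can be sketched as below. The field names and bit layout are invented for illustration; the patent does not specify a format.

```python
# Hypothetical control-vector layout: each hardware resource owns a fixed
# slice of the wide vector of control bits (no instruction decoder needed).
CONTROL_FIELDS = {"op_select": (0, 2), "route": (2, 4), "enable": (4, 5)}

def split_control_vector(bits):
    # Slice the wide control vector into one field per resource.
    return {name: bits[lo:hi] for name, (lo, hi) in CONTROL_FIELDS.items()}

fields = split_control_vector("10011")
```

Because each resource reads its own slice directly, the configurability is bounded only by the vector width, not by an instruction format.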
  • the device is further configured to perform a synchronization of one or more data-paths inside a group of data-paths; and/or perform a synchronization of one or more data-paths between one or more clusters of groups of data paths.
  • one or more data-paths are forked into sub-data-paths, wherein the respective input vector of each forked data-path is processed by each of its sub-data-paths.
  • the device is further configured to process the plurality of input vectors according to a processing tree, which is implemented by the data paths and the operators.
  • the device is further configured to process image data of an image sensor comprising a block of pixels.
  • the device is further configured to organize the image data of the block of pixels into the plurality of input vectors, wherein each input vector is based on image data of a set of vertical pixels.
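Organizing a block of pixels into input vectors of vertical pixels can be sketched as follows; the block contents are invented for illustration.

```python
# Hypothetical sketch: a block of pixels is organized into a succession of
# vertical vectors (one per column) that can be streamed sequentially.
def to_vertical_vectors(block):
    rows, cols = len(block), len(block[0])
    return [[block[r][c] for r in range(rows)] for c in range(cols)]

block = [[1, 2, 3],
         [4, 5, 6]]
columns = to_vertical_vectors(block)   # one input vector per pixel column
```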
  • the device is further configured to obtain results of processing from two or more data-paths; and combine the obtained results, for obtaining an output result.
  • the device according to the first aspect and its implementation forms may provide one or more of the following advantages:
  • Minimizing data movement: e.g., streams of ordered data in first-in, first-out buffers (FIFOs) may be spread across the data-paths and a router (which may result in lower power consumption).
  • a scheduler may generate the control vectors.
  • a second aspect of the invention provides a method for data processing, the method comprising processing, by a processor core, a plurality of input vectors in parallel, wherein each input vector is processed by a different data-path, wherein the processor core comprises a plurality of data-paths for processing data, wherein each data-path comprises at least one operator, and wherein at least some of the operators of different data-paths are connected by hard-wiring.
  • At least some operators of the plurality of data paths are controllable, in particular are programmable to perform one or more arithmetic and/or logic operations.
  • the processor core comprises a plurality of groups of data-paths, wherein at least some of the groups are connected by hard-wiring.
  • the processor core comprises a plurality of clusters, wherein each cluster comprises a set of groups of data-paths, and wherein at least some of the clusters are connected by hard-wiring.
  • the method further comprises routing, by at least one router, the plurality of input vectors to the different data-paths.
  • the method further comprises storing by a memory, one or more control vectors, and controlling, by using each control vector, at least one of: a set of the operators; a set of the data-paths; a distribution of the input vectors to the data-paths; an operation of one or more operators.
  • the method further comprises using at least one control vector for each processing cycle.
  • the method further comprises performing a synchronization of one or more data-paths inside a group of data-paths; and/or performing a synchronization of one or more data-paths between one or more clusters of groups of data paths.
  • one or more data-paths are forked into sub-data-paths, wherein the respective input vector of each forked data-path is processed by each of its sub-data-paths.
  • the method further comprises processing the plurality of input vectors according to a processing tree, which is implemented by the data paths and the operators.
  • the method further comprises processing image data of an image sensor comprising a block of pixels.
  • the method further comprises organizing the image data of the block of pixels into the plurality of input vectors, wherein each input vector is based on image data of a set of vertical pixels.
  • the method further comprises obtaining results of processing from two or more data-paths; and combining the obtained results, for obtaining an output result.
  • a third aspect of the present disclosure provides a computer program comprising a program code for performing the method according to the second aspect or any of its implementation forms.
  • a fourth aspect of the present disclosure provides a non-transitory storage medium storing executable program code which, when executed by a processor, causes the method according to the second aspect or any of its implementation forms to be performed.
  • FIG. 1 shows a schematic view of a device for data processing, according to an embodiment of the invention;
  • FIG. 2 shows a schematic view of a diagram illustrating the device processing a plurality of input vectors according to a processing tree;
  • FIGS. 3A-3B show schematic views of diagrams illustrating the device comprising an instruction decoder (FIG. 3A) and the device comprising a memory for storing control vectors (FIG. 3B);
  • FIG. 4 shows a schematic view of a diagram illustrating the device comprising a plurality of clusters for processing a block of pixels;
  • FIG. 5 shows a schematic view of a diagram illustrating the device comprising a cluster of four groups;
  • FIG. 6 shows a schematic view of a diagram illustrating the reconfigurability of the device based on image sensors;
  • FIG. 7 shows a method for data processing, according to an embodiment of the invention.
  • FIG. 1 shows a schematic view of a device 100 for data processing, according to an embodiment of the invention.
  • the device 100 may be an electronic device such as a personal computer, a laptop, a digital mobile camera, a smart phone, etc.
  • the device 100 may be used for an ISP of a digital camera.
  • the device 100 comprises a processor core 10 comprising a plurality of data-paths 110, 120 for processing data, wherein each data-path 110, 120 comprises at least one operator 111, 112, 113, 121, 122, 123, and wherein at least some of the operators 112, 113, 122, 123 of different data paths 110, 120 are connected by hard-wiring.
  • the data-path 110 comprises the operators 111, 112, 113 and the data-path 120 comprises the operators 121, 122, 123.
  • the operator 112 of the data-path 110 is connected by hard-wiring to the operator 122 of the data-path 120.
  • the operator 113 of the data-path 110 is connected by hard-wiring to the operator 123 of the data-path 120.
  • the processor core 10 is configured to process a plurality of input vectors in parallel, wherein each input vector is processed by a different data-path 110, 120.
  • the device 100 may comprise processing circuitry (not shown in detail in FIG. 1) configured to perform, conduct or initiate the various operations of the device 100 described herein.
  • the processing circuitry may comprise hardware and software.
  • the hardware may comprise analog circuitry or digital circuitry, or both analog and digital circuitry.
  • the digital circuitry may comprise components such as application-specific integrated circuits (ASICs), field-programmable arrays (FPGAs), digital signal processors (DSPs), or multi-purpose processors.
  • the processing circuitry comprises one or more processors and a non-transitory memory connected to the one or more processors.
  • the non- transitory memory may carry executable program code which, when executed by the one or more processors, causes the device 100 to perform, conduct or initiate the operations or methods described herein.
  • FIG. 2 shows a schematic view of a diagram illustrating the device 100 processing a plurality of input vectors according to a processing tree 200.
  • the processing tree 200 is implemented by the data-paths 110, 120, 210, 220 and their corresponding operators.
  • the device 100 may process image data of an image sensor, the image data comprising at least one block of pixels.
  • the device 100 may implement the processing tree 200 by connecting some of the operators.
  • the Gxy are pixels located in an image sensor at corresponding x and y coordinates, and the device 100 is configured to compute an operation according to Eq. (1) as follows: abs((G03 * 340 + G13 * 684) / 1024 - G12)
  • the device 100 may organize the computation tree 200, as shown in FIG. 2, in order to compute the above operation. For instance, the device 100 may organize all computation as a tree, and may further optimize the computation tree 200 by including hard routing between (some of) the operators.
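A worked example of Eq. (1) evaluated as a small computation tree. The pixel values are invented, and the use of integer division is an assumption (the division by 1024, a power of two, suggests a fixed-point scaling shift).

```python
# Eq. (1): abs((G03*340 + G13*684)/1024 - G12), evaluated as a small tree:
# two multiplies feed an add, which feeds a scale, a subtract, and an abs.
def eq1(g03, g13, g12):
    weighted = (g03 * 340 + g13 * 684) // 1024   # weighted sum, then scale
    return abs(weighted - g12)                   # difference, then abs

# Invented pixel values; note 340 + 684 = 1024, so equal inputs pass through
# the weighted average unchanged.
result = eq1(100, 100, 90)
```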
  • the operators of the data-path 110 are indicated by references 111, 112, 113 and 201.
  • the operator 113 of the data-path 110 may be connected by a hardwired connection to a respective operator of the data-path 120.
  • the device 100 is further configured to obtain results of processing of the data-paths 110, 120, 210, 220, and combine the obtained results to obtain an output result. Furthermore, the processing may be applied on streams of ordered vectors of pixels, and there may be no need for random data-fetch accesses.
  • Table I compares information for processing an instruction, when the operand fetching is performed based on first-in, first-out (FIFO) streaming of vectors and based on random accesses.
  • Table I: exemplary information for processing an instruction.
  • FIG. 3A and FIG. 3B are schematic views of diagrams illustrating an implementation of the device comprising an instruction decoder (FIG. 3A) and the device comprising a memory for storing control vectors (FIG. 3B).
  • the implementations of FIG. 3A and FIG. 3B may be used for controlling one or more hardware data-paths and selecting the respective data-paths.
  • an instruction decoder and a control flow are used for controlling the data-paths.
  • the device 100 may also be, for example, programmable hardware with limited area overhead that includes neither an instruction decoder nor a control flow.
  • the device 100 of FIG. 3B comprises the memory 310, which stores the control vectors.
  • the device may use one or more of the stored control vectors for controlling, e.g., a set of the operators 111, 112, 113, 201, 121, 122, 123, a set of the data-paths 110, 120, 210, 220, a distribution of the input vectors to the data-paths, an operation of one or more operators 111, 112, 113, 201, 121, 122, 123, etc.
  • the control vectors may be used for controlling the hardware data-path and data selection (approximately 100 vectors per algorithm). There may be no instruction decoder, and the control vectors may be built offline.
  • FIG. 4 shows a schematic view of a diagram illustrating the device 100 comprising a plurality of clusters 401 for processing a block of pixels.
  • the device 100 may process a block of pixels in a very limited number of cycles (~400 pixels in 100 cycles), and the processing may be organized in threads.
  • the block of pixels may be organized in a succession of vertical vectors of pixels that are accessed sequentially.
  • a thread may be a stream processed by a data-path including an operand fetch and the operators.
  • the device may combine the threads together to build a computation tree.
  • the computation tree is called a “Py-template”.
  • the device 100 includes eight clusters 401, each comprising four groups 402.
  • Each group 402 comprises four data-paths 110, 120, 210, 220, each of two or three operators 111, 112, 113, 121, 122, 123.
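The resource counts described for FIG. 4 can be checked directly. The operator bounds assume the stated two to three operators per data-path.

```python
# Resource counts from the FIG. 4 description: eight clusters, each with four
# groups, each group with four data-paths of two or three operators.
clusters = 8
groups_per_cluster = 4
datapaths_per_group = 4

total_datapaths = clusters * groups_per_cluster * datapaths_per_group
min_operators = total_datapaths * 2   # two operators per data-path
max_operators = total_datapaths * 3   # three operators per data-path
```

The total of 128 data-paths matches the "128 parallel data-paths (threads)" figure mentioned earlier in the disclosure.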
  • the device 100 further comprises at least one router 403 to connect, for example, some data-paths 110, 120, 210, 220 in the groups 402, or to connect some groups 402 in the cluster 401, or to connect some clusters 401 together.
  • the processing data-paths are seen as threads.
  • the device 100 may further, for example, synchronize the threads inside a group 402, synchronize the threads between the clusters 401, fork the threads to process the same data differently, and select the result between two or more threads.
  • the data may be located (mostly) in the data-path 110, 120, 210, 220, and in the synchronization resources.
  • FIG. 5 shows a schematic view of a diagram illustrating the device 100 comprising a cluster 401 of four groups 402, 501, 502, 503.
  • the device 100 of FIG. 5 may be programmable HW with no instructions, but with control vectors, which may be generated by a tool.
  • the device 100 is further configured to use at least one control vector for each processing cycle, and may further control all of the HW resources.
  • the diagram of FIG. 5 illustrates a high paralleled architecture of the device 100.
  • the device may process streams of vectors based on mapping pyramidal hardware computation trees.
  • the pyramidal computation tree is first implemented at the block level, wherein four data-paths 110, 120, 210, 220 are interconnected; then the computation tree is implemented in the clusters 401, wherein four blocks are interconnected; and afterwards at the “router” level, wherein the clusters 401 are interconnected in an infinite loop by the router 403.
  • the streams of vectors travel on the connections and are processed by the operators 111, 112, 113, 121, 122, 123.
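The pyramidal combination across block, cluster, and router levels can be sketched as a levelled pairwise reduction. The combine function and values are hypothetical, and the sketch assumes a power-of-two number of inputs.

```python
# Hypothetical sketch: partial results are combined pairwise, level by level
# (block level, then cluster level, then router level), forming a pyramid.
def pyramid_reduce(values, combine):
    level = list(values)
    while len(level) > 1:
        level = [combine(level[i], level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]

total = pyramid_reduce([1, 2, 3, 4], lambda a, b: a + b)
```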
  • the operators 111, 112, 113, 121, 122, 123 may be based on arithmetic units such as adders, multipliers, dividers, etc.
  • the operand fetch may be obtained by manipulation of vectors in the streams.
  • for each operand there may be a “column arrange” unit and a vector assembly unit. These two units may be in charge of manipulating the vectors of the stream in order to, for example, pull two vectors from a single stream and use them as operands of the operators, and/or shift one of the two vector operands vertically to change the alignment of the two operands, etc.
  • the column arrange unit can also accept two different streams.
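The operand fetch by vector manipulation can be sketched as below. The stream layout and the rotate-style shift semantics are assumptions made for illustration.

```python
# Hypothetical model of the "column arrange"/"vector assembly" units: pull two
# vectors from one stream and optionally shift one vertically to realign it.
def fetch_operands(stream, shift=0):
    a, b = stream[0], stream[1]        # pull two vectors from a single stream
    if shift:
        b = b[shift:] + b[:shift]      # vertical shift changes the alignment
    return a, b

stream = [[1, 2, 3, 4], [5, 6, 7, 8]]
a, b = fetch_operands(stream, shift=1)
```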
  • the device 100 may be capable of processing a patch of 34x34 pixels, without limiting the present disclosure. Moreover, when a patch has been processed, the device 100 may process the next patch with an overlap for seamless computations.
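Patch-by-patch processing with an overlap can be sketched as follows. The image width and the overlap value are invented; the patent only mentions a 34x34 patch.

```python
# Hypothetical sketch: successive 34-wide patches overlap so computations at
# patch borders are seamless; the last patch is aligned with the image border.
def patch_starts(width, patch, overlap):
    step = patch - overlap
    starts, x = [], 0
    while x + patch < width:
        starts.append(x)
        x += step
    starts.append(width - patch)   # final patch flush with the border
    return starts

starts = patch_starts(width=130, patch=34, overlap=2)
```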
  • FIG. 6 shows a schematic view of a diagram illustrating the device 100 being reconfigured according to a type of an image sensor.
  • the device 100 comprises the processor core 10 comprising the plurality of data-paths 110, 120.
  • the processor core 10 may be used for an ISP, which may be adaptable to different types of one or more image sensors 611, 612, 613, 614, 615, 616 included in a camera 600.
  • the device 100 may enable a user to adapt the device 100 by programming or re-programming it.
  • the device 100 may comprise a memory 310 and the processor core 10.
  • the processor core 10 of the device is reconfigurable for different patterns of the image sensors 611, 612, 613, 614, 615, 616.
  • the device 100 may enable the user to store a set of programs 601, 602, 603, 604, 605, 606 in the memory 310.
  • each program may enable a specific configuration of the processor core 10 for processing data of a specific image sensor.
  • based on the program 601, e.g., stored by the user in the memory, during operation of the device, the device 100 may select the operators 111, 112, 113, 121, 122, 123 of the data-paths 110, 120 such that the processor core 10 is adapted to process data of the image sensor 611 according to its pattern.
  • the device 100 is a programmable device 100 that includes reconfigurable hardware (the processor core 10 may be reconfigured).
  • FIG. 7 shows a method 700 according to an embodiment of the invention for data processing. The method 700 may be carried out by the device 100, as it is described above.
  • the method 700 comprises a step S701 of processing, by a processor core 10, a plurality of input vectors in parallel, wherein each input vector is processed by a different data-path 110, 120, wherein the processor core 10 comprises a plurality of data-paths 110, 120 for processing data, wherein each data-path 110, 120 comprises at least one operator 111, 112, 113, 121, 122, 123, and wherein at least some of the operators 112, 113, 122, 123 of different data-paths are connected by hard-wiring.
  • the present invention has been described in conjunction with various embodiments as examples as well as implementations.

EP20726086.0A 2020-05-14 2020-05-14 Vorrichtung und verfahren zur datenverarbeitung Pending EP4133368A1 (de)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2020/063420 WO2021228392A1 (en) 2020-05-14 2020-05-14 Device and method for data processing

Publications (1)

Publication Number Publication Date
EP4133368A1 2023-02-15

Family

ID=70738555

Family Applications (1)

Application Number Title Priority Date Filing Date
EP20726086.0A Pending EP4133368A1 (de) 2020-05-14 2020-05-14 Vorrichtung und verfahren zur datenverarbeitung

Country Status (2)

Country Link
EP (1) EP4133368A1 (de)
WO (1) WO2021228392A1 (de)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7100026B2 (en) * 2001-05-30 2006-08-29 The Massachusetts Institute Of Technology System and method for performing efficient conditional vector operations for data parallel architectures involving both input and conditional vector values
WO2007143278A2 (en) * 2006-04-12 2007-12-13 Soft Machines, Inc. Apparatus and method for processing an instruction matrix specifying parallel and dependent operations
US8180998B1 (en) * 2007-09-10 2012-05-15 Nvidia Corporation System of lanes of processing units receiving instructions via shared memory units for data-parallel or task-parallel operations
US9977677B2 (en) * 2016-04-07 2018-05-22 International Business Machines Corporation Execution slice with supplemental instruction port for an instruction using a source operand from another instruction port

Also Published As

Publication number Publication date
WO2021228392A1 (en) 2021-11-18


Legal Events

Date Code Title Description
STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: UNKNOWN

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE

PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE

17P Request for examination filed

Effective date: 20221110

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

DAV Request for validation of the european patent (deleted)
DAX Request for extension of the european patent (deleted)