WO2010083723A1 - Reconfigurable data processing platform - Google Patents

Reconfigurable data processing platform

Info

Publication number
WO2010083723A1
WO2010083723A1 PCT/CN2010/000072 CN2010000072W WO2010083723A1 WO 2010083723 A1 WO2010083723 A1 WO 2010083723A1 CN 2010000072 W CN2010000072 W CN 2010000072W WO 2010083723 A1 WO2010083723 A1 WO 2010083723A1
Authority
WO
WIPO (PCT)
Prior art keywords
reconfigurable
data processing
memory
configuration
processing platform
Prior art date
Application number
PCT/CN2010/000072
Other languages
English (en)
French (fr)
Inventor
林正浩
Original Assignee
上海芯豪微电子有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海芯豪微电子有限公司
Publication of WO2010083723A1
Priority to US13/187,841 (US8468335B2)


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/60 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding
    • H04N19/61 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding in combination with predictive coding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3885 Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
    • G06F9/3893 Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units controlled in tandem, e.g. multiplier-accumulator
    • G06F9/3895 Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units controlled in tandem, e.g. multiplier-accumulator for complex operations, e.g. multidimensional or interleaved address generators, macros
    • G06F9/3897 Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units controlled in tandem, e.g. multiplier-accumulator for complex operations, e.g. multidimensional or interleaved address generators, macros with adaptable data path
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/12 Selection from among a plurality of transforms or standards, e.g. selection between discrete cosine transform [DCT] and sub-band transform or selection between H.263 and H.264
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/12 Selection from among a plurality of transforms or standards, e.g. selection between discrete cosine transform [DCT] and sub-band transform or selection between H.263 and H.264
    • H04N19/122 Selection of transform size, e.g. 8x8 or 2x4x8 DCT; Selection of sub-band transforms of varying structure or type
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/42 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation

Definitions

  • the present invention relates to the field of integrated circuit design.
  • Background art
  • the chip can process stream data at high speed, can perform a large number of high-speed additions, multiplications, fast Fourier transforms, discrete cosine transforms and other operations, and can be updated promptly to meet rapidly changing market needs.
  • CPUs general-purpose processors
  • DSPs digital signal processors
  • the capacity for processing stream data is insufficient and the throughput is low, which limits the application.
  • the digital signal processor is optimized in computing resources, and some arithmetic units are added, but the computing resources are still limited. In some chips such as UC.
  • the ASIC (Application Specific Integrated Circuit) chip processes stream data well and has a high throughput rate, which can meet the requirements of large volumes of high-speed data computation.
  • the ASIC chip has a long design time and high design cost.
  • NRE non-recurring engineering cost
  • ASIC chips lack flexibility: their functions cannot easily be changed when market demand changes, and a new chip must be designed. If different modes of operation are implemented on one ASIC chip, for example for different video decoding standards, a separate module must be designed for each standard and all of them integrated on the same chip, which increases the cost.
  • the FPGA Field Programmable Gate Array
  • LUT Lookup Table
  • the interconnect delay dominates the latency of the FPGA. For example, when a CRC (cyclic redundancy check) is computed on Lattice's LFSC25 device, the interconnect delay accounts for 78.3% of the total delay. The interconnect delay of the FPGA therefore severely limits performance improvement. Moreover, the interconnect delay can only be known after the design is mapped to the FPGA; the approximate delay cannot be known at design time. As a result, an FPGA design may have to be changed several times to reach timing convergence, which extends the design cycle.
  • the present invention is directed to the deficiencies of the prior art, and proposes a reconfigurable data processing platform for implementing arithmetic of different bit widths/word lengths by different configurations of regular versatile ubiquitous digital signal processing blocks.
  • the logic operation can support high-order radix multiplication.
  • the basic unit (cell) avoids the use of a look-up table (LUT) and avoids complicated internal wiring. The present invention therefore offers the high throughput and high performance of an ASIC while also providing the flexibility and low cost of an FPGA.
  • a reconfigurable data processing platform can implement different functions by configuring a specific configuration memory.
  • the reconfigurable data processing platform is composed of one or more of some or all of the following modules:
  • a reconfigurable control unit, which can be used alone to generate control signals and can also implement configurable logic functions together with other logic, including but not limited to reconfigurable ubiquitous digital data processing blocks and reconfigurable memory.
  • the reconfigurable ubiquitous digital data processing block of the present invention is composed of a plurality of basic units (cells), each realizing a basic single-bit operation, and a regularly connected interconnection structure.
  • the entire ubiquitous digital signal processing block has a preset connection that is not axisymmetric, can implement a specific function, can be reconstructed by configuring a specific configuration memory, and is divided, with the basic unit as the minimum unit, into one or more digital data processors.
  • the configuration memory may be located wholly or partially within the reconfigurable ubiquitous digital data processing block, or wholly or partially outside it;
  • the plurality of digital data processors may be independent of each other, or may be partially or completely interconnected to perform serial, parallel, or mixed serial-parallel digital signal processing; when the digital data processor is a separate digital data processor, it can bypass other digital data processors through the vertical bus and/or the horizontal bus or a bypass path
  • the digital data processing includes, but is not limited to, arithmetic operations, logical operations, and combinations of arithmetic and logical operations; the arithmetic operations include fixed-point and floating-point operations of any bit width less than or equal to the total bit width of the ubiquitous digital signal processing block;
  • the basic unit of the present invention can, through configuration, perform a one-bit logical operation or a one-bit addition.
  • a row, formed by the connection relationship of basic units that is configured and determined by the configuration information, can implement multi-bit logic operations and specific multi-bit arithmetic operations that a basic unit cannot implement independently, including but not limited to addition, subtraction, and shifting; an array, formed by the connection relationship of basic units that is configured and determined by the configuration information, can implement specific multi-bit operations that a row cannot implement independently, including but not limited to shifting, multiplication, multiply-add, multiply-subtract, limiting, extraction, absolute value, and rounding.
  • the row is an arithmetic device connected by the basic unit, which is generally a row in the sense of physical location, but the concept of the row in the present invention is only to more clearly illustrate the technical idea and implementation scheme of the present invention.
  • the positional relationship between rows and rows is also only a logical relationship, and does not specifically refer to a relationship at a physical location.
  • the shape of the row may be arbitrary, and the shape of the ubiquitous digital signal processing block formed by the rows may also be arbitrary.
  • the basic unit of the present invention can implement different functions with the same set of logic without using a lookup table (LUT). Such functions include, but are not limited to, logical operations and addition operations.
  • the basic unit includes at least carry generation logic, sum generation logic, a plurality of configuration memories, a plurality of input ports, and a plurality of output ports.
  • the bit width of the input ports and the output ports may be a single bit or multiple bits.
  • the data on the input ports is selected in whole or in part by input multiplexers and sent into the carry generation logic and/or the sum generation logic.
  • the select control signal of the multiplexer is derived from a configuration memory.
  • the input multiplexer includes at least one carry input multiplexer and a multiplexer for selecting an input according to a high radix multiplication re-encoding result.
  • the data of the output port is derived in whole or in part from an output delay (delay) device with a bypass function.
  • the output delay device with bypass function includes, but is not limited to, a multiplexer and a register/latch.
  • the multiplexer in the output delay device with bypass function selects either the direct output of the carry generation logic and sum generation logic or the delayed output as the final output; its select control signal is derived from the configuration memory.
  • the delay device used for output delay in the basic unit, with its corresponding multiplexer and configuration memory, may be connected directly at the output, or may instead be connected at the corresponding input.
  • a delay device with its corresponding multiplexer and configuration memory may be added at the input port, and the one at the output port omitted.
  • the carry generation logic of the present invention may generate a carry output of the full adder according to the configuration, or may generate an output of the logical operation according to the configuration.
  • the sum generation logic may generate a full adder output according to the configuration, or may generate an output of the logic operation according to the configuration.
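A behavioural sketch may make the configurable basic unit more concrete. The Python below is only an illustration under stated assumptions: the names `CellConfig` and `cell` and the mode strings are hypothetical and do not come from the application, and the model says nothing about how the shared logic is built without a lookup table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CellConfig:
    """Stand-in for the configuration memory bits of one basic unit (hypothetical)."""
    mode: str  # one of "add", "and", "or", "xor"


def cell(a: int, b: int, carry_in: int, cfg: CellConfig) -> tuple[int, int]:
    """Return (sum_out, carry_out) for single-bit inputs under the given configuration."""
    if cfg.mode == "add":
        # Full adder: sum generation logic and carry generation logic.
        sum_out = a ^ b ^ carry_in
        carry_out = (a & b) | (carry_in & (a ^ b))
        return sum_out, carry_out
    if cfg.mode == "and":
        return a & b, 0
    if cfg.mode == "or":
        return a | b, 0
    if cfg.mode == "xor":
        return a ^ b, 0
    raise ValueError(f"unknown mode: {cfg.mode}")
```

In this sketch the logic modes ignore the carry input and drive the carry output to 0; that is a simplification chosen for the example, not a statement about the actual cell.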
  • the configuration memory in the ubiquitous digital signal processing block of the present invention is used to store configuration information; it may be located wholly or partially in the basic unit, or wholly or partially outside the basic unit; when located wholly or partially outside the basic unit, the configuration memory can be used to control a single corresponding basic unit or a plurality of corresponding basic units.
  • the input of the carry input multiplexer in the basic unit of the present invention may, depending on the configuration, come from the output port of the adjacent basic unit on the right in the same row, from the output port of a specific basic unit in the upper part of the array, or from the configuration information in the corresponding configuration memory; it may also be assigned logic "0" or logic "1".
  • the input of the multiplexer that selects an input according to the high-radix multiplication recoding result may, depending on the configuration, come from the output port of a specific basic unit above it in the array; it may also be the multiplicand and its inverse, the left-shifted multiplicand and its inverse, or logic "0"; or it may be coefficient configuration information from the corresponding configuration memory.
  • the inputs of the other input multiplexers may, depending on the configuration, come from the operand data, or may be coefficient configuration information from the corresponding configuration memory.
  • the basic units of the present invention can, through configuration and interconnection, form a multiplier that uses the shift-and-add method, and can also form a high-radix multiplier, which significantly reduces the area of the multiplier and improves performance.
  • the high-radix multiplier includes, but is not limited to, a radix-2 multiplier; the recoder includes, but is not limited to, a Booth coder; the recoder may exist independently within the reconfigurable data processing platform and serve one or more multipliers, may be formed by configuring the reconfigurable control unit, or may be located outside the reconfigurable data processing platform.
  • the encoded result of the recoder may be transmitted as an input to the multiplier to control the multiplexer that selects the multiplication input according to the recoding result, or may be stored as a coefficient in a configuration memory that controls that multiplexer.
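The recoding idea can be illustrated with radix-4 Booth recoding. This particular scheme is an assumption chosen for the example, because its digit set {-2, -1, 0, 1, 2} matches the multiplexer choices listed above (the multiplicand, the left-shifted multiplicand, their inverses, or zero); the application does not state that this exact scheme is used. The function names are hypothetical.

```python
def booth_radix4_digits(multiplier: int, width: int) -> list[int]:
    """Recode a signed `width`-bit multiplier into radix-4 Booth digits, LSB first.

    Each digit is in {-2, -1, 0, 1, 2} and is computed from three adjacent
    bits as  b[2i] + b[2i-1] - 2*b[2i+1], with b[-1] = 0.
    """
    if width <= 0 or width % 2:
        raise ValueError("width must be a positive even number")
    bits = multiplier & ((1 << width) - 1)  # two's-complement bit pattern
    digits = []
    prev = 0
    for i in range(0, width, 2):
        b0 = (bits >> i) & 1
        b1 = (bits >> (i + 1)) & 1
        digits.append(b0 + prev - 2 * b1)
        prev = b1
    return digits


def booth_multiply(multiplicand: int, multiplier: int, width: int) -> int:
    """Multiply by summing one selected partial product per Booth digit."""
    total = 0
    for k, digit in enumerate(booth_radix4_digits(multiplier, width)):
        # digit selects 0, +/-M or +/-2M; the shift places it at weight 4**k.
        total += (digit * multiplicand) << (2 * k)
    return total
```

For a 4-bit multiplier this produces two partial products instead of four, which is the source of the area saving that high-radix schemes aim for.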
  • the row and array of the present invention can be constructed by the basic unit.
  • different arithmetic devices can be constructed by different connection methods between the basic units.
  • the arithmetic devices include, but are not limited to, adders, subtractors, shifters, and logic operators.
  • the array generally occupies a plurality of rows in the sense of physical position, and is an arithmetic device formed by connecting basic units. Depending on the internal configuration of the basic units, different arithmetic devices can be formed by different connections between the basic units.
  • the arithmetic devices include, but are not limited to, multipliers, multiply-adders, and multiply-subtractors.
  • the configuration memory in the row or array stores configuration information of rows or arrays for corresponding configuration.
  • the arithmetic unit in the basic unit is configured as an adding unit, and the input and output ports of the basic unit are configured, according to the definition of addition, as operand input ports, a low-order carry input port, a local sum output port, and a local carry output port.
  • a ripple-carry adder can be constructed by connecting, between each pair of adjacent left and right basic units in a row, the low-order carry input port of one to the local carry output port of the other.
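The carry chain described above can be sketched in software as follows. This is a minimal behavioural model, not the hardware: `full_adder`, `ripple_add`, and `ripple_sub` are hypothetical names, and the subtractor uses the usual invert-and-add-one construction as an assumption about what "similar to the adder" could mean.

```python
def full_adder(a: int, b: int, carry_in: int) -> tuple[int, int]:
    """One basic unit configured as an adder: returns (sum bit, carry out)."""
    s = a ^ b ^ carry_in
    carry_out = (a & b) | (carry_in & (a ^ b))
    return s, carry_out


def ripple_add(x: int, y: int, width: int, carry_in: int = 0) -> tuple[int, int]:
    """Add two `width`-bit values by chaining carries from bit 0 upward."""
    carry = carry_in
    result = 0
    for i in range(width):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result, carry  # carry is the carry out of the most significant cell


def ripple_sub(x: int, y: int, width: int) -> int:
    """Compute (x - y) mod 2**width as x + ~y + 1 on the same carry chain."""
    mask = (1 << width) - 1
    result, _ = ripple_add(x, ~y & mask, width, carry_in=1)
    return result
```

Each loop iteration stands for one cell in the row; the loop-carried `carry` variable plays the role of the wire between neighbouring cells.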
  • the delay devices in the basic units can be configured flexibly according to actual needs, so that the addition is segmented and the row is configured as a pipelined adder to increase the clock frequency. The basic unit can also be configured with an additional output port which, together with carry-lookahead logic, implements a carry-lookahead adder to increase the clock frequency.
  • the connections between basic units can also be configured to form other types of adders.
  • the method of constructing the subtractor is similar to the method of constructing the adder.
  • the arithmetic unit in the basic unit is configured as a bypass, and the input and output ports of the basic unit are configured as an operand input port and an operand output port according to the addition definition; the operand input port of a basic unit in a specific positional relationship is then connected to the operand output port to form a shifter.
  • a large-span shifter can be composed of a plurality of small-span shifters, or can be realized in a single shift through connections between basic units in a specific positional relationship.
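One way to picture a large-span shifter built from small-span shifters is a cascade of stages, each of which either shifts by a fixed power of two or passes its input through. The sketch below assumes that arrangement (a logarithmic, barrel-style cascade), which is a common technique and not necessarily the one the application intends; `staged_shift_left` is a hypothetical name.

```python
def staged_shift_left(value: int, amount: int, width: int) -> int:
    """Shift a `width`-bit value left by `amount` using power-of-two stages.

    Stage k shifts by 2**k when bit k of `amount` is set, otherwise it is
    bypassed. Bits shifted past `width` are discarded.
    """
    if not 0 <= amount < width:
        raise ValueError("amount must satisfy 0 <= amount < width")
    mask = (1 << width) - 1
    value &= mask
    stage = 0
    while (1 << stage) < width:
        if (amount >> stage) & 1:
            value = (value << (1 << stage)) & mask
        stage += 1
    return value
```

With a width of 8, three stages (shift by 1, 2, and 4) cover every shift amount from 0 to 7.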
  • the arithmetic unit in the basic unit is configured for the AND operation, and the input and output ports of the basic unit are configured as operand input ports and an operand output port according to the addition definition, thereby forming a logical AND device.
  • other logical operations, including but not limited to OR and XOR, can be implemented in a similar manner.
  • the arithmetic unit in the basic unit is configured as an adding unit, and the input and output ports of the basic unit are configured, according to the definition of addition, as operand input ports, a low-order carry input port, a local sum output port, and a local carry output port.
  • an array multiplier can be constructed by connecting, according to a certain rule, the low-order carry input ports, local sum output ports, local carry output ports, and operand input ports of basic units in a specific positional relationship.
  • connections between basic units configured according to other rules can form other types of multipliers, including but not limited to the radix-2 multiplier.
  • taking the multiply-adder as an example: by building a multiplier array as described above and then configuring one additional row of addition, a multiply-adder can be constructed. It can also be configured according to other rules to obtain a more efficient multiply-adder. The method of constructing a multiply-subtractor is similar to that of the multiply-adder.
  • the row or array may contain one or more valid bits, or no valid bits, depending on the configuration information.
  • the one or more valid bits can be used to control whether some or all of the corresponding rows or arrays run, and to identify the validity of the result; the valid bits are implemented in memory, including but not limited to registers, latches, and random access memory.
  • the valid information in the valid bits of a row or array can be transmitted to other rows or arrays or to the outside of the reconfigurable ubiquitous digital data processing block. If the transmitted valid-bit information is valid, the row or array performs the corresponding operation and sends the valid information to the corresponding next row or array or to the outside of the reconfigurable ubiquitous digital data processing block.
  • if the transmitted valid-bit information is invalid, the row or array does not perform the corresponding operation, and sends the invalid information to the corresponding next row or array or to the outside of the reconfigurable ubiquitous digital data processing block.
  • methods of keeping the row or array from operating include, but are not limited to, turning off the corresponding clock or turning off the corresponding power supply.
  • the valid bit can also be used only to identify validity.
  • the valid bit can also be bypassed by bypass logic.
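The valid-bit behaviour can be sketched as a chain of stages in which each stage computes only when it receives a set valid bit and always forwards the bit. This is a minimal model under stated assumptions: `run_stage` and `run_pipeline` are hypothetical names, and returning `None` merely stands in for a stage whose clock or power has been turned off.

```python
from typing import Callable, Optional


def run_stage(op: Callable[[int], int], data: Optional[int], valid: bool):
    """One row or array: operate only on valid input, and pass the valid bit on."""
    if not valid:
        # Stage stays idle (for example clock-gated); invalid is forwarded.
        return None, False
    return op(data), True


def run_pipeline(stages: list[Callable[[int], int]], data: Optional[int], valid: bool):
    """Feed data and its valid bit through successive rows or arrays."""
    for op in stages:
        data, valid = run_stage(op, data, valid)
    return data, valid
```

For example, `run_pipeline([lambda v: v + 1, lambda v: v * 2], 3, True)` computes through both stages, while the same call with `valid=False` leaves every stage idle and reports an invalid result.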
  • a specific data processing function can be implemented by one or more of the rows or arrays according to the configuration information stored in the configuration memory; the rows or arrays can process data independently, or can process data serially, in parallel, or in a mix of serial and parallel with other rows or arrays.
  • the rows or arrays that make up the particular data processing function may be isomorphic or heterogeneous.
  • the matrix multiplication can be equivalent to the accumulation of multiple multiplications and additions in the algorithm, and the matrix multiplication function can be realized by connecting a plurality of multiplier arrays according to a specific rule. Similar methods can be used to implement digital data processing algorithms such as filtering and fast Fourier transform (FFT).
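The equivalence between matrix multiplication and repeated multiply-add can be written out directly. In the sketch below, `multiply_accumulate` stands in for one multiply-add array and `matmul` for the way such arrays might be chained; both names are hypothetical and the code describes only the arithmetic, not the hardware connection rule.

```python
def multiply_accumulate(acc: int, a: int, b: int) -> int:
    """One multiply-add step: acc + a * b."""
    return acc + a * b


def matmul(A: list[list[int]], B: list[list[int]]) -> list[list[int]]:
    """Matrix product in which every output element is a chain of multiply-adds."""
    inner = len(B)
    if any(len(row) != inner for row in A):
        raise ValueError("inner dimensions do not match")
    cols = len(B[0])
    C = [[0] * cols for _ in A]
    for i, row in enumerate(A):
        for j in range(cols):
            acc = 0
            for k in range(inner):
                acc = multiply_accumulate(acc, row[k], B[k][j])
            C[i][j] = acc
    return C
```

A 2x2 product therefore needs eight multiply-add steps, two per output element; FIR filtering has the same accumulate-of-products structure.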
  • FFT fast Fourier transform
  • the row or array and the basic unit of the present invention are all part of the reconfigurable ubiquitous digital data processing block, and the division thereof is relative, and the hierarchy is listed only for more clearly expressing the technical solution of the present invention. The levels are not all necessary.
  • the rows may consist of a single basic unit, and the ubiquitous digital signal processing block may also consist of a single row or array.
  • the reconfigurable ubiquitous digital data processing block is configured as a digital data processor with a specific function.
  • when the digital data processor with a specific function is an independent digital data processor, it can bypass other digital data processors with specific functions through a vertical bus and/or a horizontal bus or a bypass path.
  • the external link of the reconfigurable ubiquitous digital data processing block or the reconfigurable data processing platform is described.
  • single instruction, multiple data (SIMD) operation can be realized; when a plurality of the digital data processors with specific functions have different functions, are independent of each other, and run in parallel, multiple issue can be realized.
  • the digital data processor that implements the specific function, reconstructed according to the configuration information, may include one or more valid bits according to the configuration information, or may include no valid bit. The one or more valid bits can be used to control whether the corresponding digital data processor runs and to identify the validity of the result.
  • the valid bits are implemented in memory, including but not limited to registers, latches, and random access memory.
  • the valid information in the valid bits may be transmitted to other digital data processors in the reconfigurable ubiquitous digital data processing block, or to the outside of the reconfigurable ubiquitous digital data processing block. If the information in the valid bits is valid, the digital data processor, or the row or array within it, performs the corresponding operation and sends the valid information to the corresponding other digital data processor, to a row or array in that digital data processor, or to the outside of the reconfigurable ubiquitous digital data processing block.
  • if the information received in the valid bits is invalid, the digital data processor does not perform the corresponding operation, and sends the invalid information to the corresponding other digital data processor or to the outside of the reconfigurable ubiquitous digital data processing block.
  • methods of keeping the digital data processor from operating include, but are not limited to, turning off the corresponding clock or turning off the corresponding power supply.
  • the valid bit can also be used only to identify validity.
  • the valid bit can also be bypassed by bypass logic so that it has no effect.
  • the interconnect structure in the reconfigurable ubiquitous digital data processing block of the present invention is divided into two levels: a high speed local connection and a global bus.
  • the high speed local connection is used for high speed connection between adjacent basic units; it is a hard-wired, short-range fixed connection used for most delay-critical paths;
  • configuring the multiplexers can reconstruct the connection relationship between the basic units.
  • the connection relationship includes, but is not limited to, the carry transfer relationship of addition and subtraction within the same row and the partial-product transfer relationship of multiplication between different rows.
  • the reconfigurable local bus can be configured to form addition/subtraction carry chains and multiplication partial-product transfers for constructing arithmetic components, including but not limited to adders, subtractors, and multipliers.
  • the global bus includes a reconfigurable vertical bus structure for vertically connecting rows or arrays and transferring data, and a reconfigurable horizontal shift structure for horizontal connection;
  • the reconfigurable vertical bus structure is for vertical transmission of data/data streams in the reconfigurable ubiquitous digital data processing block;
  • the reconfigurable horizontal shift structure may, according to the configuration, shift the data output by a specific row or array to the left or to the right on its way to the next row or array, and may also implement large-span data shift operations and data cross-transfer operations.
  • the reconfigurable vertical bus structure and the reconfigurable horizontal shift structure can be reconstructed by configuring a specific configuration memory; together they form a complete bus structure and realize the data flow of the reconfigurable ubiquitous digital data processing block.
  • the reconfigurable vertical bus structure for vertical connection is divided into two directions of a reconfigurable down bus and a reconfigurable up bus.
  • the number of reconfigurable down buses and reconfigurable up buses can be different.
  • a single or multiple rows or arrays form a vertical transmission unit.
  • the reconfigurable down bus may input data external to the reconfigurable ubiquitous digital data processing block to each of the vertical transmission units.
  • the reconfigurable down bus may be disconnected at any of the vertical transmission units according to a configuration.
  • the output of any of the vertical transmission units may also transfer data to other vertical transmission units below it using the reconfigurable down bus.
  • the reconfigurable up bus is coupled to the data output of each of the vertical transmission units, and the output data of the particular vertical transmission unit can be sent to the outside of the reconfigurable ubiquitous digital data processing block.
  • the output of any of the vertical transmission units may also utilize the reconfigurable up bus to transmit data to other vertical transmission units thereon.
  • a single or multiple rows or arrays form a horizontal transmission unit.
  • the reconfigurable horizontal shift structure for horizontal connection may, according to a configuration, shift the data output by a horizontal transmission unit and send it to the next horizontal transmission unit.
  • the shift may be by a single bit or by multiple bits, or the data may be left unshifted.
  • the reconfigurable memory of the present invention can be reconstructed by configuring a specific configuration memory, which can be used for storing data and having a data reordering function; the bit width/word length of the reconfigurable memory can be fixed.
  • the configuration memory may be located in whole or in part in the reconfigurable memory, or may be located wholly or partially outside the reconfigurable memory.
  • the address mapping converts an input address into a new output address, which enables data reordering.
  • there may be a single address mapping or multiple mappings. The specific mapping relationships and the number of mappings can be determined through configuration. Through different mapping relationships, different physical addresses of the same memory can be accessed by the same logical address, or the memory can be written and/or read in different physical address orders using the same logical address order, which conveniently implements data address conversion and can form a first-in first-out buffer (FIFO).
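The reordering effect of an address mapping can be illustrated with a small model in which writes and reads pass through separately configured logical-to-physical maps. Everything here is a hypothetical sketch: `MappedMemory`, `identity_map`, and `transpose_read_map` are names invented for the example, and a matrix transpose is only one possible reordering.

```python
from typing import Callable


class MappedMemory:
    """Memory whose logical addresses are translated by configurable maps."""

    def __init__(self, size: int,
                 write_map: Callable[[int], int],
                 read_map: Callable[[int], int]) -> None:
        self.cells = [None] * size
        self.write_map = write_map  # logical -> physical on writes
        self.read_map = read_map    # logical -> physical on reads

    def write(self, logical: int, value) -> None:
        self.cells[self.write_map(logical)] = value

    def read(self, logical: int):
        return self.cells[self.read_map(logical)]


def identity_map(address: int) -> int:
    return address


def transpose_read_map(rows: int, cols: int) -> Callable[[int], int]:
    """Map sequential reads onto a row-major rows x cols matrix, column by column."""
    def mapping(n: int) -> int:
        col, row = divmod(n, rows)
        return row * cols + col
    return mapping


def transpose_demo() -> list[int]:
    """Write a 2 x 3 matrix in row-major order and read it back transposed."""
    mem = MappedMemory(6, identity_map, transpose_read_map(rows=2, cols=3))
    for address, value in enumerate([0, 1, 2, 3, 4, 5]):
        mem.write(address, value)
    return [mem.read(n) for n in range(6)]  # [0, 3, 1, 4, 2, 5]
```

The same logical read sequence 0..5 therefore visits physical addresses 0, 3, 1, 4, 2, 5, which is the kind of reordering the text describes.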
  • the reconfigurable memory can also be configured as a plurality of independent memories according to different bit widths/word lengths, the independent memories having independent address decode/address logic and word lines. Any of the independent memories may have independent address mappings, or may share address mappings with other of the independent memories.
  • the reconfigurable memory of the present invention may be located outside or below the reconfigurable ubiquitous digital data processing block, or may be located inside the reconfigurable ubiquitous digital data processing block.
  • the reconfigurable memory can be connected to the reconfigurable ubiquitous digital data processing block according to a specific rule according to a configuration for data storage and buffering.
  • the reconfigurable control unit of the present invention comprises a reconfigurable random logic and a reconfigurable finite state machine, which can be reconstructed by configuring a specific configuration memory, and the basic components are used to generate control signals and finite states required for processing different transactions.
  • the configuration memory may be located wholly or partially in the reconfigurable control unit, or wholly or partially outside the reconfigurable control unit.
  • the reconfigurable random logic includes reconfigurable functional units and reconfigurable connections; a reconfigurable functional unit can implement any kind of logical function through configuration; a reconfigurable connection can implement any kind of connection between multiple reconfigurable functional units through configuration; one or more reconfigurable functional units and one or more reconfigurable connections can realize random logic reconstruction through a specific configuration; depending on the configuration, registers/latches can also be inserted into the random logic formed by the reconfigurable functional units and reconfigurable connections to meet timing requirements.
  • the reconfigurable finite state machine is comprised of a randomly accessible memory, a current state register, a reconfigurable multiplexer, and reconfigurable random logic.
  • the random access memory stores an input definition value (qualification), a state transition value, and an output control signal value.
  • Each row in the randomly accessible memory stores one or more groups of input definition values, one or more groups of state transition values, and one or more groups of output control signal values.
  • the bit width/word length of each of the input defined value, the state transition value, and the output control signal value in each row of the randomly accessible memory is variable.
  • the boundary information between different values in each row of the randomly accessible memory may be fixed or dynamically variable. When the boundary information between different values in each row in the randomly accessible memory can be dynamically variable, the boundary flag memory stores the corresponding boundary information for each row.
  • The current state register stores the current state value and implements the transition from the next state to the current state.
  • The reconfigurable multiplexer selects the next state from the multiple groups of state transition values, and selects the control signal output that satisfies the condition from the multiple groups of output control signal values.
  • the reconfigurable random logic can be reconfigured by configuring a particular configuration memory for performing specific random logic functions.
  • the reconfigurable finite state machine may further include a counter composed of reconfigurable random logic to implement a function of state transition according to the counting result.
  • The value in the current state register is used directly as an address pointing to a row in the randomly accessible memory. The input definition values in the corresponding output are sent, together with the signals input from outside the reconfigurable finite state machine, to the reconfigurable random logic, which generates a selection signal. The selection signal selects the next state from the state transition values, and the output control signal from the output control signal values, in the corresponding output of the randomly accessible memory.
  • The next state is stored in the current state register and serves as the address for the next access to the randomly accessible memory.
  • After configuration, the reconfigurable finite state machine can support alternating concurrent operation of multiple state machines (multi-thread). In this case, a newly added multiplexer selects the value in the current state register of the running finite state machine as the current state.
  • When the state machine is switched because of the alternating concurrency, the state of the previously running finite state machine is saved in its corresponding current state register, and the newly added multiplexer selects the value in the current state register of the finite state machine that is to run as the current state, from which operation continues, completing the switch.
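  • The memory-based state machine described above can be sketched in software as follows. This is a minimal illustration, not the patented circuit: the names `Row`, `RamFsm`, `step`, and `switch` are hypothetical, the reconfigurable random logic is reduced to an equality comparison between the external input and each input definition value, and holding the state when nothing matches is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Row:
    """One row of the randomly accessible memory (field names are illustrative)."""
    qualifiers: list   # input definition values, one per candidate transition
    next_states: list  # state transition values, same length as qualifiers
    outputs: list      # output control signal values, same length as qualifiers


class RamFsm:
    def __init__(self, rows, start_states):
        self.rows = rows
        # One current state register per interleaved state machine ("thread").
        self.state_regs = list(start_states)
        self.active = 0

    def switch(self, thread):
        # The paused machine's state simply stays in its own register;
        # the added multiplexer now selects another register.
        self.active = thread

    def step(self, external_input):
        # The current state is used directly as the memory address.
        row = self.rows[self.state_regs[self.active]]
        for qualifier, nxt, out in zip(row.qualifiers, row.next_states, row.outputs):
            if qualifier == external_input:      # stand-in for the random logic
                self.state_regs[self.active] = nxt
                return out
        return None  # assumption: no match means hold state, no output


# A two-state example table: state 0 is idle, state 1 is busy.
ROWS = [
    Row(qualifiers=[0, 1], next_states=[0, 1], outputs=["idle", "go"]),
    Row(qualifiers=[0, 1], next_states=[1, 0], outputs=["busy", "done"]),
]
```

  • In this sketch two machines share the same table: calling `switch(1)` leaves machine 0's state in its register, so a later `switch(0)` resumes it where it stopped.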
  • The reconfigurable control unit of the present invention can implement reconfigurable logic functions in conjunction with other logic; the reconfigurable logic functions include, but are not limited to, a reconfigurable input/output external interface.
  • The reconfigurable input/output external interface can configure a finite state machine in the reconfigurable control unit so that the same set of hardware implements one or more interface protocols. If the hardware interface must implement multiple interface protocols at the same time, a finite state machine can be configured as an arbitration module, and multiple relatively low-speed interface protocols can be supported simultaneously through time-division multiplexing at the platform's high clock frequency.
  • The configuration information for the reconfigurable modules in the reconfigurable data processing platform of the present invention may be stored in the platform or outside the platform. If the configuration information is already stored in the reconfigurable data processing platform, each reconfigurable module can be configured directly. When the configuration information is stored outside the reconfigurable data processing platform, the platform can be treated as a memory, and all configuration information is transmitted to each reconfigurable module in the platform in the same way that data is stored to a memory. The configuration information can also be transmitted to the various configuration memories while the reconfigurable platform is running, enabling dynamic switching between different functions.
  • The configuration information transmitted to the reconfigurable modules according to the present invention may be unencoded original configuration information or encoded information.
  • the manner of encoding includes, but is not limited to, encryption and compression.
  • the key used to decrypt the encrypted encoded information may be present in hardware form with the reconfigurable data processing platform, or may be input to the reconfigurable data processing platform via configuration information.
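  • As a rough illustration of treating the platform as a memory and of compressed configuration information, the sketch below stores a configuration payload at an address and optionally decompresses it first. The `Platform` class, its `store` method, and `load_configuration` are hypothetical names; `zlib` is only one possible compression scheme, chosen here because it ships with Python, and decryption is left out.

```python
import zlib


class Platform:
    """Hypothetical stand-in for the platform seen as a plain memory."""

    def __init__(self, size):
        self.config_memory = bytearray(size)

    def store(self, address, data):
        # Configuration is written exactly like an ordinary memory store.
        if address < 0 or address + len(data) > len(self.config_memory):
            raise ValueError("configuration write outside configuration memory")
        self.config_memory[address:address + len(data)] = data


def load_configuration(platform, address, payload, compressed=False):
    """Decode the payload if it was compressed, then store it; returns bytes written."""
    data = zlib.decompress(payload) if compressed else payload
    platform.store(address, data)
    return len(data)
```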
  • the configuration information described in the present invention may be manually created or automatically generated by an automatic tool according to a mapping rule.
  • the mapping includes, but is not limited to, mapping of hardware description language (HDL) to configuration information, mapping of computer programming language to configuration information, mapping of computer modeling to configuration information, and mapping of algorithm descriptions to configuration information.
  • Hardware description languages include, but are not limited to, Verilog HDL and VHDL;
  • computer programming languages include, but are not limited to, C, C++, and JAVA;
  • computer modeling includes, but is not limited to, Matlab modeling;
  • algorithm descriptions include, but are not limited to, pseudo-instruction descriptions of specific algorithms.
  • Configuration information templates for commonly used operations can be prepared in advance. When configuration information is generated automatically by the automatic tool according to the mapping rules, the tool can directly call the configuration information in the templates.
  • the configuration memory of the present invention may be volatile or non-volatile.
  • When the configuration memory is non-volatile, a configured digital data processing function remains unchanged until the next configuration, which is convenient for the user.
  • The non-volatile memory includes, but is not limited to, flash memory. The configuration can also be fixed directly (hard-wired) when the reconfigurable data processing platform is designed and produced.
  • Methods of fixing the configuration directly include, but are not limited to, replacing a particular layer of the layout at production time. Although the reconfigurable data processing platform can then realize only certain functions, this can still significantly shorten the development cycle.
  • the reconfigurable data processing platform of the present invention may further comprise an expansion module to adapt to a wider range of requirements;
  • The expansion module includes, but is not limited to: a random logic controller, an analog unit, a central processing unit, a digital signal processor (DSP), a packet header detector, and a logic zero number detector.
  • the delay between each input and output of the basic unit is known in advance, and all delay critical paths can be connected to each other through high-speed interconnections, and the delay of the interconnection is also known.
  • Taking a 16x16-bit multiplier as an example, the final delay depends on a 32-bit carry-lookahead full adder. When the delay of the partial accumulation is less than the delay of the 32-bit full adder, the partial accumulation result is sent directly to the full adder.
  • Otherwise, the number of pipeline stages required by the multiplier is increased automatically: the intermediate value of the partial accumulation is registered, the partial accumulation then continues, and the result is finally output to the 32-bit full adder for the final addition.
  • The delay of a multiplier configured in the reconfigurable ubiquitous digital data processing block can thus be predicted before design and can be implemented in a template. For designs based on the reconfigurable memory and the reconfigurable control unit, the same method can predict the delay before design and avoid timing closure.
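  • Because all unit and interconnect delays are known in advance, the pipelining decision can be estimated with simple arithmetic. The sketch below is one possible way to do that, under stated assumptions: the function name and parameters are hypothetical, delays are integers in picoseconds, every accumulation row is assumed to have the same delay, and the rule of cutting the accumulation into segments no longer than the adder delay is an illustrative choice rather than the method of the source.

```python
import math


def pipeline_registers(rows, row_delay_ps, adder_delay_ps):
    """Estimate how many register stages to insert into the partial accumulation.

    If the whole accumulation fits within the final adder's delay, the result
    goes straight to the adder. Otherwise the accumulation is cut into
    segments, each no longer than the adder delay (an assumed policy).
    """
    accumulate_delay = rows * row_delay_ps
    if accumulate_delay <= adder_delay_ps:
        return 0
    return math.ceil(accumulate_delay / adder_delay_ps) - 1
```

  • For example, with an assumed 400 ps adder and 100 ps per row, three rows need no extra register, while eight rows need one.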
  • the reconfigurable data processing platform of the present invention can be self-tested, and can perform self-test of the chip without relying on external devices in the case of power-on operation.
  • One or more specific basic units, rows, or arrays in the reconfigurable ubiquitous digital data processing block can be configured as comparators.
  • Stimuli having a specific relationship are applied to multiple groups of other basic units, rows, arrays, or combinations thereof in the reconfigurable ubiquitous digital data processing block, and the comparators check whether the outputs of those groups satisfy the corresponding specific relationship.
  • The stimuli may come from a particular module in the reconfigurable data processing platform or from outside the reconfigurable data processing platform.
  • the specific relationships include, but are not limited to, equal, opposite, reciprocal, and complementary.
  • The test result can be sent outside the reconfigurable data processing platform, or stored in a memory in the reconfigurable data processing platform.
  • The self-test may be performed during wafer testing, during post-package integrated circuit testing, or at system startup during chip use; self-test conditions and periods may also be set manually, with self-tests performed periodically during operation. The memory used for the self-test can be volatile or non-volatile.
  • the reconfigurable data processing platform of the present invention can have self-repair capability under the premise of self-test capability.
  • The failed basic unit, failed row, or failed array may be marked, and the marking is carried out within the reconfigurable data processing platform.
  • The failed basic unit, failed row, or failed array can be bypassed according to the corresponding mark, so that the reconfigurable data processing platform can still work normally, thereby realizing self-repair.
  • The self-repair may be performed after wafer testing, after post-package integrated circuit testing, or after the test at system startup during chip use; self-test and self-repair conditions and periods may also be set manually, with self-repair performed after the periodic self-tests during operation.
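  • The self-test and self-repair idea can be illustrated with a small software model. It is only a sketch under assumptions: rows are modelled as Python callables, the "specific relationship" used is equality (identical stimuli must give identical outputs), a row is marked as failed when it disagrees with both of its neighbours, and `self_test` and `bypass` are hypothetical names. The scheme assumes at least three rows and isolated faults.

```python
def self_test(rows, stimuli):
    """Return the set of row indices whose outputs disagree with both neighbours."""
    n = len(rows)
    outputs = [[row(a, b) for a, b in stimuli] for row in rows]
    failed = set()
    for i in range(n):
        left = outputs[(i - 1) % n]
        right = outputs[(i + 1) % n]
        # Comparator: equal stimuli should produce equal outputs.
        if outputs[i] != left and outputs[i] != right:
            failed.add(i)
    return failed


def bypass(num_rows, failed):
    """Map logical rows onto the physical rows that passed (self-repair)."""
    return [i for i in range(num_rows) if i not in failed]


def good_adder(a, b):
    return (a + b) & 0xFF


def stuck_adder(a, b):
    # Illustrative fault: output bit 0 stuck at 1.
    return ((a + b) & 0xFF) | 1
```

  • With rows `[good_adder, good_adder, stuck_adder, good_adder]` and a stimulus whose sum is even, only the third row is marked, and `bypass` maps the logical rows onto the remaining three physical rows.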
  • The basic unit of the reconfigurable ubiquitous digital data processing block in the present invention need not include a lookup table, which simplifies the design; different operations are realized through different connections between the outputs of a basic unit and adjacent basic units located in different rows, which is a considerable improvement over the prior art;
  • the basic unit of the reconfigurable universal digital data processing block can be directly connected to form a high-order multiplier, which does not require an additional decoding module, can significantly reduce the area of the multiplier, and improve performance;
  • The connections between the basic units that realize a specific function in the reconfigurable ubiquitous digital data processing block of the present invention are very tight, usually direct connections to adjacent basic units in the same row or in adjacent rows, so that the signal transmission delay between basic units on the delay-critical path is minimized and performance is improved;
  • the reconfigurable memory of the present invention can implement data reordering by configuring an address mapping relationship and changing the addressing mode of the memory, which is not involved in the prior art;
  • the reconfigurable finite state machine of the present invention is implemented on the basis of a randomly accessible memory, and can support alternate concurrency, which is different from the existing reconfigurable finite state machine implementation technology;
  • the reconfigurable data processing platform of the present invention can implement different functions with a set of hardware, and the present invention can effectively reduce the performance compared with an application specific integrated circuit (ASIC) that integrates multiple hardwares that implement different functions.
  • The reconfigurable ubiquitous digital data processing block of the present invention contains a very large number of arithmetic units, giving it throughput that a general-purpose processor or digital signal processor cannot match and much wider multi-issue capability;
  • The reconfigurable ubiquitous digital data processing block of the present invention is more regular in structure, the connections between related logic are tighter, and the logical connection relationships are much simpler. It can therefore achieve higher performance, change configuration more quickly, and switch rapidly from one working mode to another, so that a high clock frequency and time-division multiplexing can provide simultaneous support for multiple relatively low-speed functions.
  • Because the timing of a basic unit, row, or array is known for each function it is configured to perform, and the basic units on the timing-critical path are all connected over short distances, the delay of the rows or arrays in a preset template is deterministic. The delay can be calculated before design from the scale and bit width of the specific design, so timing closure is unnecessary, which avoids possible design rework and effectively shortens the development cycle.
  • The configuration information according to the present invention can be generated automatically by the automatic tool, according to the mapping rules, from a hardware description language, a computer programming language, computer modeling, an algorithm description, and so on. This not only shortens the development cycle but also lets algorithm engineers who do not know a hardware description language or a computer programming language map a more abstract model or algorithm description directly to configuration information, eliminating many design steps and improving productivity.
  • FIG. 1 is a structural block diagram of a reconfigurable data processing platform according to the present invention.
  • Figure 2 is an embodiment of the internal structure of the basic unit.
  • Figure 3 (a) is an embodiment of the internal configuration of a single basic unit when implementing a normal addition operation.
  • Figure 3 (b) is an embodiment of the basic inter-cell connection when implementing a normal 8-bit addition operation.
  • Figure 4 (a) is an embodiment of the internal configuration of a single basic unit when implementing logical operations.
  • Figure 4 (b) is an embodiment of the basic inter-cell connection when implementing logical operations.
  • Figure 5 is an embodiment of a basic unit connection implementing a 4-bit multiplication.
  • Figure 6 is an embodiment of a 4-bit general multiplier.
  • Figure 7 is an embodiment in which data channels of similar width implement single instruction stream single data stream (SISD) and single instruction stream multiple data stream (SIMD) instructions.
  • Figure 8 is an embodiment of the division of functional modules in a data channel when implementing a floating point multiply operation.
  • Figure 9 is an embodiment of implementing a multiplier.
  • Figure 10 (a) is an embodiment of the basic unit connection when the right shift is performed one bit.
  • Figure 10 (b) is an embodiment of the basic unit connection when the left shift is performed one bit.
  • Figure 11 (a) is an embodiment of the basic unit of the vertical direction data stream.
  • Figure 11 (b) is an embodiment of the bus structure.
  • Figure 12 (a) is an embodiment of a lateral shifting unit.
  • Figure 12 (b) is an embodiment of a reconfigurable ubiquitous digital data processing block.
  • Figure 13 (a) shows a reconfigurable ubiquitous digital data signal processing block that has not been configured.
  • Figure 13 (b) is a reconfigurable ubiquitous digital data signal processing block configured as a single instruction stream multiple data stream (SIMD).
  • Figure 13 (c) is a reconfigurable ubiquitous digital data signal processing block configured for serial operation.
  • Figure 13 (d) shows the input and output trend of a configured reconfigurable ubiquitous digital data signal processing block.
  • Figure 13 (e) shows the input and output trend of a configured reconfigurable ubiquitous digital data signal processing block.
  • Figure 14 (a) is an embodiment of the reconfigurable memory address translation in the present invention.
  • Figure 14 (b) is an embodiment of the reconfigurable memory in the present invention.
  • Figure 15 (a) shows an embodiment of a finite state machine.
  • Figure 15 (b) is an embodiment of the finite state machine of Figure 15 (a) implemented with the reconfigurable finite state machine of the present invention.
  • Figure 15 (c) is an embodiment in which the reconfigurable finite state machine of the present invention is configured to alternately concurrently multi-state machines.
  • Figure 16 is an embodiment of a reconfigurable output port of the present invention.
  • Figure 17 is an embodiment of a flow diagram description by video decoding.
  • Fig. 18 is a configuration example of a valid bit when configured as a multiplier.
  • Figure 19 is an embodiment of an absolute value, clipping, comparison selection operation.
  • Figure 20 is an embodiment of a reconfigurable random logic unit.
  • the reconfigurable data processing platform (101) of the present invention is composed of a reconfigurable ubiquitous digital data processing block (102), a reconfigurable memory (103), and a reconfigurable control unit (104).
  • The reconfigurable ubiquitous digital data processing block (102) is built from basic units (105). The entire reconfigurable ubiquitous digital data processing block (102) has non-isotropic preset connections to achieve specific functions, and can be reconfigured by configuring a specific configuration memory.
  • Figure 2 (a) is an embodiment of the internal structure of the basic unit.
  • the three sets of data input ports of the basic unit (201) are defined as A, B, and C, respectively, and the two sets of data output ports are defined as Co and S, respectively.
  • The input port of group A is six bits, corresponding to logic "0", A1, A2, A1N, A2N, and A3, where A1, A2, A1N, and A2N are the partial product inputs when multiplication is performed and serve as global signal inputs for other functions.
  • A six-to-one multiplexer (210) selects, from the six-bit group A input, the data actually fed to the carry generation logic (208) and the sum generation logic (209).
  • The six-to-one multiplexer (210) is controlled by configuration memory (202). The group B input port is three bits, corresponding to B1, B2, and configuration memory (205); a three-to-one multiplexer (211) selects, from the three-bit group B input, the data actually fed to the carry generation logic (208) and the sum generation logic (209).
  • The three-to-one multiplexer (211) is controlled by configuration memory (203). The group C input port is four bits, corresponding to C1, C2, logic "1", and logic "0"; a four-to-one multiplexer (213) selects, from the four-bit group C input, the data actually fed to the carry generation logic (208) and the sum generation logic (209).
  • The four-to-one multiplexer (213) is controlled by configuration memory (207).
  • The two-to-one multiplexer (212) selects either the current output value of the carry generation logic (208) or the value stored in register (215) as the Co output of the basic unit (201).
  • Register (215) is controlled by configuration memory (204), which determines whether the output of the carry generation logic (208) is latched.
  • The two-to-one multiplexer (214) selects either the current output value of the sum generation logic (209) or the value stored in register (216) as the S output of the basic unit (201).
  • Register (216) is controlled by configuration memory (206), which determines whether the output of the sum generation logic (209) is latched.
  • The two-to-one multiplexer (214) is controlled by configuration memory (206).
  • Configuration memories (202), (203), (204), (205), (206), and (207) can all be shared with the corresponding memories of adjacent basic units to simplify the logic; when all basic units in a row perform the same function, a single set of memories can configure all basic units in that row.
  • the specific configuration method is shown in the following embodiment.
  • Figure 3 (a) is an embodiment of the internal configuration of a single basic unit when implementing a normal addition operation.
  • The six-to-one multiplexer (210) is controlled by configuration memory (202) to select A2 as the group A input.
  • The three-to-one multiplexer (211) is controlled by configuration memory (203) to select the output of configuration memory (205) as the group B input.
  • The four-to-one multiplexer (213) is controlled by configuration memory (207) to select C2 as the group C input.
  • Configuration memory (206) is configured so that the current output of the sum generation logic (209) is latched in register (216), and the two-to-one multiplexer (214) is configured by configuration memory (206) to select the value stored in register (216) as the S output of the basic unit (303).
  • The two-to-one multiplexer (212) is configured to select the current output value of the carry generation logic (208) as the Co output of the basic unit (303).
  • Group B inputs can also be connected to external global signal inputs.
  • Figure 3 (b) is an embodiment of the basic inter-cell connection when implementing a normal 8-bit addition.
  • The A terminal inputs of all basic units (301-308) are a group of signals G7-G0 output from a non-adjacent row, and the B terminal inputs are the values stored in the respective configuration memories (205).
  • All basic units (301-308) form a carry chain through lines (311-317), each connecting the Co terminal of one basic unit to the C terminal of the basic unit adjacent on its left; the carry chain may contain carry-lookahead logic.
  • A SIMD addition instruction can be implemented by appropriately configuring configuration memory (207) so that the carry chain is broken into segments.
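  • The adder row and its SIMD variant can be modelled bit by bit. In this sketch, `unit` stands for the carry generation logic (208) and sum generation logic (209) of one basic unit, assumed here to behave as a standard full adder, and `add_row` chains the units from the least significant bit upward. The function names and the `segment` parameter are hypothetical; breaking the chain is modelled by forcing the C input to logic "0" at every segment boundary.

```python
def unit(a, b, c):
    """One basic unit as a full adder: returns (Co, S) for single-bit inputs."""
    s = a ^ b ^ c
    co = (a & b) | (a & c) | (b & c)
    return co, s


def add_row(x, y, width=8, segment=None):
    """Add x and y on a row of `width` units.

    `segment` is the lane width for SIMD operation; by default the carry
    chain runs across the whole row.
    """
    segment = segment or width
    result, carry = 0, 0
    for bit in range(width):
        if bit % segment == 0:
            carry = 0  # C input multiplexer selects logic "0": chain is broken
        carry, s = unit((x >> bit) & 1, (y >> bit) & 1, carry)
        result |= s << bit
    return result
```

  • With the full chain, `add_row(0x37, 0x2C)` gives `0x63`; with `segment=4` the same row performs two independent 4-bit additions and gives `0x53`, because the carry out of the low nibble is not propagated.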
  • Figure 4 (a) is an embodiment of the internal configuration of a single basic unit when implementing logical operations.
  • The six-to-one multiplexer (210) is controlled by configuration memory (202) to select A3 as the group A input.
  • The three-to-one multiplexer (211) is controlled by configuration memory (203) to select B2 as the group B input.
  • The four-to-one multiplexer (213) is controlled by configuration memory (207) to select logic "0" as the group C input.
  • The two-to-one multiplexer (214) is configured by configuration memory (206) to select the current output value of the sum generation logic (209) as the S output of the basic unit (404).
  • The two-to-one multiplexer (212) is configured to select the current output value of the carry generation logic (208) as the Co output of the basic unit (404).
  • The S terminal output is then the exclusive-OR (XOR) of input A and input B,
  • and the Co terminal output is the logical AND of input A and input B.
  • Figure 4(b) is an embodiment of the basic inter-cell connection when implementing logical operations.
  • The C terminal inputs of all basic units (401-408) are logic "0", the A terminal inputs are the addition results SUM7-SUM0 of the previous row, and the B terminal inputs are a group of signals G7-G0 output from a non-adjacent row.
  • The S terminal outputs are the bitwise XOR results XOR7-XOR0 of input A and input B, and the Co terminal outputs are the bitwise AND results AND7-AND0 of input A and input B.
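  • The reason the same unit yields XOR and AND can be checked directly: with the C input tied to logic "0", a full adder's sum reduces to A XOR B and its carry reduces to A AND B. The sketch below assumes the basic unit behaves as a standard full adder; `unit` and `logic_row` are hypothetical names.

```python
def unit(a, b, c):
    """One basic unit as a full adder: returns (Co, S) for single-bit inputs."""
    s = a ^ b ^ c
    co = (a & b) | (a & c) | (b & c)
    return co, s


def logic_row(x, y, width=8):
    """Run a row of units with C = 0 and collect the S and Co outputs."""
    xor_out, and_out = 0, 0
    for bit in range(width):
        co, s = unit((x >> bit) & 1, (y >> bit) & 1, 0)
        xor_out |= s << bit    # S terminal: bitwise XOR
        and_out |= co << bit   # Co terminal: bitwise AND
    return xor_out, and_out
```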
  • Figure 5 is an embodiment of a 4-bit multiplier consisting of 500 and 506.
  • The connections shown are only those configured to implement the multiplication; all unrelated connections used for other functions are omitted.
  • This multiplier is characterized by a changing multiplicand and a relatively fixed multiplier. After Booth encoding of the multiplier, the relationship between the multiplicand and each partial product generated at a given weight is relatively stable, namely ±2 times the multiplicand, ±1 times the multiplicand, or 0. The multiplier can then effectively reduce the number of partial products and improve operating efficiency.
  • The Radix2 Booth encoding method is used: the partial products are compressed by a linear array into two partial sums, and the final result is then obtained by a carry adder.
  • The size of the linear array falls into two basic cases, depending on whether the operation is signed. For a signed operation, N/2 rows of 201 units are required, where N is the smallest even number not less than n; for an unsigned operation, N/2+1 rows of 201 units are required, where N is the smallest even number greater than n. In both cases, the number of columns of 201 units required is m × 2 + 2.
  • The adder used to obtain the final result requires one row of m+n 201 units.
  • An unsigned number can be extended with a high-order "0" to turn it into a signed number; whether signed and unsigned numbers are treated differently can be decided flexibly according to the user's needs.
  • Array 500 is composed of 3 × 10 units 201 and forms the partial product compression array.
  • The multiplicand Y is sign-extended by two bits and padded with four "0" bits at the low end, so that it is represented as a 10-bit number Y_in<9:0>. Y_in, the result of shifting it left by one bit, and the bitwise inverses of these two values correspond to the A port inputs A1, A2, A1N, and A2N of the 201 modules in array 500.
  • The ports of the 201 modules are determined as follows:
  • Port A: determined by the configuration information of 202.
  • The configuration is based on the Radix2 Booth encoding of the multiplier X, which selects the corresponding partial product as 0, +1 times the multiplicand, -1 times the multiplicand, +2 times the multiplicand, or -2 times the multiplicand.
  • The logic "0", A1, A2, A1N, or A2N signal of the A port is selected accordingly as the input through the multiplexer, and then takes part in the summation after the specific logic processing required.
  • The configured A port input in 501 implements the partial product sign-extension logic.
  • The A port input in 502 is used to generate the correct partial product, and the A port input in 503 is set to logic "0" by configuration to ensure that the low-order multiplication result is correct.
  • Port B: determined by the configuration information of 203.
  • The B port input in 501 is set to logic "1" by configuration to implement the partial product sign-extension logic.
  • 504 is the first row of the multiplication, in which the configured B port input implements part of the logic of the inversion (ones' complement) operation required during partial product summation, that is, part of the sign-extension logic.
  • In 505, the configuration is fixed to select the Co output of the upper-level 201 shifted right by one bit, so as to align and accumulate the partial products.
  • Port C: determined by the configuration information of 207.
  • The C port input in 501, like the B port input, is set to logic "1" by configuration to implement the partial product sign-extension logic.
  • The C port input in 504 is set to logic "0" by configuration.
  • The configuration is otherwise fixed to select the S output of the upper-level 201 shifted right by two bits, so as to align and accumulate the partial products; the high-order bits vacated by the shift are filled with logic "0".
  • The final stage yields two groups of partial sums, which after alignment are added by adder 506 to obtain the final result.
  • 506 is the adder used in the multiplication, and it is configured as follows:
  • Port A: the 202 configuration is fixed to select the Co output of the upper-level 201 shifted left by one bit, so as to align and accumulate the partial sums.
  • Through the 207 configuration, the Co output of the next lower-order 201 in the same row is selected to form the adder carry chain.
  • The S output yields a 10-bit sum Pdt<9:0>. The two highest bits of the result are sign extension and can be discarded, giving a final multiplication result Pdt<7:0> with an effective width of 8 bits.
  • The trade-off can be made according to the actual application.
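  • The partial product selection described above can be illustrated arithmetically. The sketch below recodes a signed multiplier into digits from the set {0, ±1, ±2}, which is the selection the text attributes to its Booth encoding (this digit set is what is commonly called radix-4 or modified Booth recoding), and then sums the selected multiples of the multiplicand. It models only the arithmetic, not the 500/506 array wiring; `booth_digits` and `booth_multiply` are hypothetical names.

```python
def booth_digits(x, n):
    """Recode an n-bit two's-complement multiplier (n even) into digits in {-2..2}.

    Digit k has weight 4**k, so an n-bit multiplier yields n/2 partial products.
    """
    x &= (1 << n) - 1
    digits, prev = [], 0
    for i in range(0, n, 2):
        b0 = (x >> i) & 1
        b1 = (x >> (i + 1)) & 1
        digits.append(b0 + prev - 2 * b1)
        prev = b1
    return digits


def booth_multiply(x, y, n=4):
    """Multiply signed n-bit x by y by summing the selected multiples of y."""
    return sum(d * y * 4 ** k for k, d in enumerate(booth_digits(x, n)))
```

  • For a 4-bit multiplier, `booth_digits(-3, 4)` gives `[1, -1]`, that is, +1 times the multiplicand at weight 1 and -1 times the multiplicand at weight 4, so only two partial products are needed.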
  • Figure 6 is an embodiment of a general multiplier in which both inputs can change at the same time. Still taking 4-bit signed multiplication as an example, the main difference from the multiplier of Figure 5 is that the partial products are no longer determined from the multiplier X through the 202 configuration information in 201; instead, the multiplier X selects the partial products directly through a set of tri-state-gate-controlled buses. Only the control lines by which the tri-state gates select the partial products are shown.
  • The tri-state gate control buses are set up as follows: through 601, 602, 603, and 604 respectively, the corresponding bits of the multiplier X control whether each partial product is 1 times the multiplicand (the negative can be obtained by subsequent logic, and a negative can appear only in the last partial product) or is forced to "0".
  • The two intermediate results obtained by the 201 units are shifted right and added by the 201 units to the partial product generated by the next stage, and so on, until the final product Pdt is obtained.
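  • The gating scheme of the general multiplier can be modelled as a shift-and-add loop in which each multiplier bit either passes the multiplicand or forces the partial product to zero. Negating only the last partial product follows from two's complement, where the sign bit carries negative weight. The function name is hypothetical, and the sketch accumulates with left shifts for readability, whereas the array described above shifts the intermediate results right.

```python
def gated_multiply(x, y, n=4):
    """Multiply signed n-bit x by y using one gated partial product per bit of x."""
    x &= (1 << n) - 1
    product = 0
    for i in range(n):
        # Tri-state gate: pass 1 times the multiplicand, or force "0".
        partial = y if (x >> i) & 1 else 0
        if i == n - 1:
            partial = -partial  # sign bit of x has negative weight
        product += partial << i
    return product
```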
  • Figure 7 implements the SISD and SIMD instructions with data channels of similar width, which is the focus of the present invention.
  • Figure 7 (a) is an 8-bit multiplier of the same structure as Figure 5, and the resulting multiplication result is 16 bits.
  • the entire data channel used for the operation has a width of 18 bits and a height of 6 lines.
  • In Figure 7 (b), by dividing and reconfiguring the data path, four 4-bit multipliers 701, 702, 703, and 704 of identical structure are realized on a data path 20 bits wide and 8 rows high. Each multiplier is 10 bits wide and 4 rows high, and the four multiplication results are produced simultaneously for use as operands by subsequent operations.
  • Figure 8 is an embodiment of the division of functional modules in a data channel when implementing a floating point multiply operation.
  • Floating point multiplication is implemented in three parts: the sign bit, the exponent, and the mantissa.
  • the sign bit of the product is the exclusive OR (XOR) of the two operand sign bits
  • the exponent of the product is the sum of the two operand exponents
  • the mantissa of the product is the product of the two operand mantissas.
  • 801 is configured as exclusive-OR logic and performs the sign operation; the sign result is passed along stage by stage until 802 and 803 have produced their final results;
  • 802 is configured as an adder and sums the two exponents; its result is passed along until 803 has produced its final result.
  • 803 is configured as a multiplier to achieve unsigned multiplication, that is, to obtain the product of the mantissa.
  • The precision of the product mantissa can be set according to the user's requirements. Suppose the multiplication result produced by 804 must have the same bit width as the operand mantissas; multiplying the two operands, however, widens the product.
  • The excess bits can be handled by various schemes, including but not limited to truncation, rounding up, and rounding to nearest.
  • This can be implemented by configuring the 203 contained in the 201 units that precede adder 806 in 805: it suffices to set the B port input of the 210 unit to logic "1".
  • Figure 9 is an embodiment of a multiply-adder.
  • 901 is a multiplier of the structure of Fig. 5, and 902 is an adder of the same structure as Fig. 3.
  • Several multipliers of structure 901 and adders of structure 902 are connected as shown in the figure. 903 to 906 are configuration control lines that serve as the input of one operand of each 901 multiplier, which is supplied as configuration information;
  • 907 to 910 are global buses, which are inputs to the other operand of the 901 multiplier;
  • 911, 912, and 913 are also global buses, used to pass the intermediate results of multiplication and addition, thereby realizing the multiply-adder.
  • The multiplier width can be configured so that the intermediate and final results of the summation do not overflow, or do not overflow within the required range.
  • The multiply-adder shown in Figure 9 performs the four multiplications aX, bY, cZ, and dW and then sums the four products with three successive additions, so the final result output by 914 is: aX + bY + cZ + dW
  • the above formula is a mathematical expression of vector multiplication.
  • If X, Y, Z, and W are a set of relatively fixed numbers forming a vector that changes infrequently, the multiply-adder described above yields the result of the vector multiplication.
  • Figures 10(a) and 10(b) are embodiments of a shifter.
  • the two figures show the connection relationship between the Co terminal output of 500 and 506 and the B terminal input of the next basic unit 201, respectively.
  • the shifter function can be implemented, and the port configuration methods are:
  • Port A: selects, by configuration, the operand to be shifted.
  • Port B: set to logic "1" by the 202 configuration.
  • Port C: set to logic "0" by the 207 configuration. Because the output port Co is connected to input port B of both the basic unit 201 one bit to the left and the basic unit 201 one bit to the right in the next stage, the corresponding shift is obtained simply by configuring the 202 in the next-stage 201 to select the appropriate input.
  • The basic unit 201 of that stage can then be configured: if the inputs of ports B and C are both set to logic "0", the output S of the stage retains the shift result, and S is connected to input port A of the basic unit 201 in the following stage. Configuring the 202 in that unit makes its A port input the desired shift result. When shifting, "0" or "1" can be filled in as needed at the high or low end of the shift result through the configuration of the highest and lowest bits.
  • Figure 11 (a) is an embodiment of the basic unit of the vertical direction data stream.
  • IN0 is the output of one of four basic-unit rows
  • IN1 is the global signal input
  • the two-to-one selector (1112) is controlled by the configuration memory (1101) and passes either the basic-unit output or the global signal input downward.
  • the two-to-one multiplexer (1120) is controlled by the configuration memory (1116) and selects either the current output of the selector (1112) or the value stored in the register (1121) as the output OUT1.
  • the register (1121) is controlled by the configuration memory (1116), which determines whether it latches the output of the two-to-one selector (1112).
  • the two-to-one selector (1113) is controlled by the configuration memory (1105) and selects either the input (1114) generated by the basic-unit row below or the output of the two-to-one selector (1112).
  • the two-to-one multiplexer (1123) is controlled by the configuration memory (1117) and selects either the current output of the two-to-one selector (1113) or the value stored in the register (1122) as the output OUT2.
  • the register (1122) is controlled by the configuration memory (1117), which determines whether it latches the output of the two-to-one selector (1113).
  • Figure 11 (b) is an embodiment of the bus structure. There are four sets of buses (1124-1127) that pass signals downward and one set of buses (1123) that passes signals back upward.
  • A four-to-one selector (1108) selects one of the four sets of downward buses (1124-1127) as the global signal input of the basic-unit row (1140).
  • the other inputs to the base unit row (1140) are generated from the output of the previous row.
  • The outputs of every four basic-unit rows are tied together through tri-state gates (1141) to form the IN0 input of the vertical data-flow basic unit (1104).
  • Signals are passed in the vertical direction through the vertical data-flow basic unit (1104).
  • Figure 12 (a) is an embodiment of a lateral shifting unit.
  • The 4-bit input signals (1209, 1210, etc.) can either be combined, from high to low, into one group of signals or be treated as separate groups of input signals.
  • The corresponding bits of each group of signals pass through tri-state gates.
  • Of the tri-state gates (1205) whose outputs are tied together, only one conducts at any time.
  • the tri-state gate (1205) is controlled by the configuration memory (1206).
  • a four-to-one selector (1207) implements shifting, which is controlled by configuration memory (1208).
  • As required, a cyclic left (or right) shift of the whole word by an arbitrary number of bits can be performed, or a single group of input signals can be sent to all group outputs (1211, 1212, etc.).
  • Figure 12 (b) is an embodiment of a reconfigurable universal digital data processing block.
  • Data enters the reconfigurable universal digital data processing block from above. The block is composed of an input/output unit (1206), a storage unit (1213), a lateral shift unit (1202), and a logic unit (1203).
  • The logic unit (1203) is configured for different functions as needed and may be a row or an array.
  • The lateral shift unit (1202) shifts the signals as required.
  • Figure 13 (a) shows a reconfigurable universal digital data signal processing block that has not been configured.
  • A reconfigurable universal digital signal processing block (1301) contains a plurality of basic units (1302).
  • Figure 13 (b) is a reconfigurable universal digital data signal processing block configured for single instruction stream, multiple data stream (SIMD) operation.
  • The reconfigurable universal digital data signal processing block (1301) is configured as a SIMD processing module that performs four operations simultaneously, taking input data from 1315, 1316, 1317, and 1318 and delivering the results from 1319, 1320, 1321, and 1322, respectively.
  • array 1322 is configured for multiplication
  • row 1323 is configured for logical AND operation
  • row 1324 is configured for subtraction operation
  • row 1325 is configured for exclusive OR operation
  • row 1326 is configured for addition operation
  • array 1327 is configured for right shift operation
  • arrays 1328, 1329, 1330, 1331 are configured for multiply-accumulate operations
  • arrays 1332, 1334 are configured for saturation operations
  • row 1333 is configured for addition operations.
  • Figure 13 (c) is a reconfigurable universal digital data signal processing block configured for serial operation.
  • the input to the processing block (1340) is 1341.
  • the output of the current column is sent to the input of array 1345 via path 1343.
  • via path 1349, the entire processing block is configured as a serially operating module whose output is 1351.
  • the basic units (1302) in the processing block are configured into processing rows and processing arrays with different functions: for example, row 1353 implements a subtraction, row 1355 implements a clipping operation, array 1357 implements a multiply-accumulate operation, and array 1359 implements a shift operation.
  • the processing block (1340) is configured as three columns, each with a corresponding column of valid bits (1380, 1381, 1382).
  • when a valid bit (1380, 1381, 1382) indicates invalid, the corresponding rows and arrays are turned off, saving power.
  • Figure 13 (d) shows the input and output data-flow directions of a reconfigurable universal digital data signal processing block.
  • the output of the processing block (1390) is 1391.
  • after processing by array 1392, its output 1393 is sent to arrays 1394 and 1395, and the outputs of the entire block are 1396 and 1397.
  • data passed downward (1335) is input to the configured arrays or rows for arithmetic processing.
  • upward transfer (1336) passes the data along the reverse bus without any arithmetic processing.
  • Figure 13 (e) shows the input and output data-flow directions of a configured reconfigurable universal digital data signal processing block.
  • the processing block (1398) contains two independent arrays, 1344 and 1346, whose inputs and outputs are independent of each other.
  • the input of array 1344 is 1350 and its output is 1352; the input of array 1346 is 1354 and its output is 1356.
  • input passed downward goes to the configured arrays or rows for arithmetic processing, while upward transfer passes the data back along the bus without arithmetic processing.
  • in this embodiment the input 1354 is first carried to the array over the upward bus (1337) and then undergoes arithmetic processing.
  • Figure 14 (a) is an embodiment of the reconfigurable memory address translation in the present invention.
  • the reconfigurable memory (1401) is composed of a plurality of words, each of which has a corresponding address.
  • Suppose the address input (1402) is binary 1001.
  • Without mapping, the address input (1402) is decoded (1403) to find the word (1404) at binary address 1001;
  • with mapping, the address input (1402) is decoded (1403) and then passed to the mapping module, which can map it to the word (1406) at the actual binary address 0010, thereby converting between different addresses.
  • Figure 14 (b) is an embodiment of the reconfigurable memory in the present invention.
  • The reconfigurable memory (1401) consists of a plurality of words and implements a memory whose write addressing and read addressing differ.
  • Suppose the address input (1402) is binary 1001; the write map maps it to binary 0010, while the read map maps it to binary 1010.
  • If write is valid, the write map (1407) operates and maps the address passed from the decoding module (1403) to the word at binary address 0010, so that logical address 1001 addresses the word at physical address 0010.
  • If read is valid (1412), the read map (1408) operates and maps the address passed from the decoding module (1403) to the word at binary address 1010, so that logical address 1001 reads the word at physical address 1010.
  • Based on this embodiment, pixel reordering (zigzag) in image compression algorithms can be implemented conveniently, as can a FIFO.
  • the reconfigurable finite state machine of the present invention can be configured as a finite state machine as in Figure 15(a). When in state A
  • the finite state machine of Figure 15(a) can be implemented using the reconfigurable finite state machine of the present invention.
  • the configured reconfigurable finite state machine is shown in Figure 15 (b).
  • This embodiment consists of a randomly accessible memory (1515), a current state register (1521), reconfigurable multiplexers, and reconfigurable random logic (1519).
  • For the randomly accessible memory (1515) in this embodiment, the boundary flag memory can be omitted.
  • When the value in the current state register (1521) represents state B (1502), the machine is in state B (1502), which corresponds to row B (1524) of the randomly accessible memory.
  • The value stored in row B (1524) of the randomly accessible memory is output: the r-bit output is sent to the reconfigurable combinational logic (1519) as six sets of input state information, the s-bit output is sent to the next-state multiplexer as six sets of state transition information,
  • and the t-bit output is sent to the output control multiplexer (1522) as six sets of output control signal information.
  • The reconfigurable combinational logic (1519) performs a logic operation on the r-bit data and the external input (1525) to obtain a selection signal (1526).
  • The logic operation determines which of the six sets of input state information matches the external input (1525) and generates the corresponding selection signal (1526), which is used as the select signal of the next-state multiplexer.
  • The generated selection signal selects state C (1503) in the state table and out2 (1527) in the output table; state C (1503) is then written to the current state register (1521), which points to row C of the randomly accessible memory (1515) in the next cycle.
  • When the bit widths/word lengths of the input-defining value (1516), the state transition value (1517), and the output control signal (1518) are fixed, the boundary flag memory can be omitted.
  • The bit widths/word lengths of the input-defining value, the state transition value, and the output control signal can also be adjusted dynamically; that is, the boundaries (1529) in the randomly accessible memory can change according to the value in the boundary flag memory.
  • After configuration, the reconfigurable finite state machine of the present invention can support several state machines running alternately and concurrently.
  • Here the reconfigurable finite state machine is configured to run three state machines alternately and concurrently. Compared with Figure 15 (b), Figure 15 (c) enlarges the randomly accessible memory (1515) and adds an external input multiplexer (1536), two current state registers (1530), a current state multiplexer (1531), and two control signal registers (1535).
  • The external input multiplexer (1536) selects the current input according to the configuration and feeds it to the reconfigurable combinational logic. Only one finite state machine runs at any time, and the current states of the other two finite state machines are held in their current state registers.
  • The added current state multiplexer (1531) selects the value in the current state register of the running finite state machine as the current state.
  • To switch state machines, the state of the finite state machine that was running is saved in its current state register, and the added current state multiplexer (1531) selects the current state register of the finite state machine that is to run; its value becomes the current state and that state machine continues running, which completes the switch.
  • The control outputs corresponding to the states of the three state machines can be output as the control signals (1532, 1533, 1534) shown in the figure, or the control lines can be multiplexed as needed; there may be one or several control signal registers (1535).
  • When the reconfigurable finite state machine that supports alternately concurrent state machines is configured to time-multiplex the finite state machines of different interface protocols, the same set of hardware interface pins can be defined as the corresponding input or output ports of several interface protocols, which gives reconfigurable multiplexing of input and output (I/O) ports. Take the output ports of four different protocols as an example, see Figure 16.
  • the output data of the different protocols is transferred from the output register (1601) to the output logical channel (1602).
  • The output logical channel (1602) may be built from basic units in the reconfigurable universal digital signal processing block configured according to the requirements of the different protocols, or it may be random logic implemented by the reconfigurable control unit through configuration.
  • There may be one or several output logical channels (1602).
  • The multi-state concurrent state machine (1603), configured from the reconfigurable finite state machine, implements the state machines of the four different protocols and generates the corresponding control signals (1608).
  • a portion of the control signal (1608) is written to the FIFO (1604) in a particular output by the control output (1601) in the output logic channel (1602).
  • Stored in the FIFO ( 1604) is the output value of each of the four protocols corresponding to the corresponding port.
  • the input/output port multiplexing proposed by the present invention may be implemented by the same finite state machine controlling different physical ports to implement different interface protocols, or the same finite state machine controlling the same/group physical port to implement different interface protocols.
  • The output port (1606) is a physical port used by one interface protocol alone.
  • The output port (1607) is a physical port shared by the other three interface protocols.
  • The multi-state finite state machine (1603) time-multiplexes it through the port multiplex selector (1605).
  • The external ports may all be physical ports dedicated to specific interface protocols, may be partly dedicated ports and partly ports shared by several protocols, or may all be ports shared by several protocols. This is highly flexible and applies to interface protocols of the same or different bit widths. Only the output port is described in this embodiment;
  • the input port and the bidirectional input/output port work similarly to the output port.
  • Figure 17 is a flow chart of a typical video decompression.
  • the present invention can support video decompression of different video compression standards such as AVS, MPEG2, H.264/AVC and the like.
  • Video decompression is divided into entropy decoding (1701), inverse quantization (1703), inverse transform (1705), intra prediction (1707), inter prediction (1709), motion compensation (1711), deblocking filtering (1713), and so on.
  • The inverse transform (1705) of each standard is implemented mainly by matrix multiplication, addition, and shifting.
  • The matrix size to be processed differs between video compression standards: AVS uses an 8x8 matrix and H.264/AVC uses a 4x4 matrix, for example. Matrices of different sizes can be accommodated through configuration.
  • The intra prediction (1707) of each standard is implemented mainly by multiplication and shifting. The intra prediction modes differ between standards, so the coefficients required and the numbers of multiplications and additions also differ; these are handled by configuring different coefficients and different numbers of multiply and add operations.
  • The inter prediction (1709) and motion compensation (1711) of the standards differ considerably and can be configured differently for each video compression standard.
  • In the deblocking filter (1713) of each standard the blocks are divided differently, such as 4x4 blocks in AVS and 8x8 blocks in H.264/AVC, and the filtering conditions and the processing to be performed also differ, which likewise makes it suitable for implementation through configuration.
  • Fig. 18 is a configuration example of a valid bit when configured as a multiplier.
  • The basic units (1801-1815) in the figure are similar in structure to the basic unit (201) in Embodiment 2; parts unrelated to the present embodiment are omitted from the drawing.
  • The rows containing basic units 1806, 1807, 1808, and 1809 perform the partial-product accumulation within one clock cycle, and the row containing 1810 performs the final addition of the multiplication, completed in another clock cycle.
  • The propagation of the valid bit (1816) is configured differently depending on the function to be implemented.
  • In basic units 1801, 1802, and 1803 the valid bit (1816) is not stored in a register (omitted in the figure) and is selected directly through input 1 of the multiplexer (1817); in basic units 1804 and 1805 the valid bit (1816) is stored in the register (1818) and is selected in the next clock cycle through input 0 of the multiplexer (1817). The select signal of the multiplexer (1817) is generated by the configuration memory (1819).
  • The valid bit is passed to the corresponding three-input AND gate (1821), which controls whether the registers (1822) in the basic units (1806-1815) used for the operation store their data.
  • The other two inputs of the three-input AND gate (1821) are the configuration information of the multiplexer (1817) of the corresponding row and the clock signal (1823).
  • The registers (1822) in basic units 1806, 1807, 1808, 1811, 1812, and 1813 do not latch, and the multiplexer (1824) selects its input 1;
  • the registers (1822) in basic units 1809, 1810, 1814, and 1815 latch, and the multiplexer (1824) selects its input 0, the output of the register (1822).
  • the three-input AND gate (1821) may be a dedicated column or may be configured by a basic unit.
  • Figure 19 is an embodiment of an absolute value, clipping, comparison selection operation.
  • The tri-state gates (1903) are turned on or set to high impedance according to the number of bits required: when the input data is N bits wide, the tri-state gate (1903) of bit N-1 is turned on, and the tri-state gates (1903) of the remaining bits (N-2, ..., 0) are all in the high-impedance state.
  • The tri-state gates (1903) are controlled by the configuration memory (1904).
  • When configured for the absolute-value operation, the first row of basic units (1905) outputs the input data inverted and incremented by one.
  • All the two-to-one selectors (1902) of the second row of basic units (1906) then select, according to the highest-bit carry signal of the first row (1901, which can be configured as the sign bit of the input number), either the input data or the invert-and-add-one result of the first row as the output.
  • When configured for clipping, the first row of basic units (1905) performs an addition or subtraction, and all the two-to-one selectors (1902) of the second row of basic units (1906) select, according to the highest-bit carry signal of the first row (1901), either the output of the first row or the clipping maximum or minimum as the output. When configured for compare-and-select, the first row of basic units (1905) performs a comparison, and all the two-to-one selectors (1902) of the second row of basic units (1906) select either input A or input B as the output according to the highest-bit carry signal of the first row (1901, the result of the comparison).
  • Figure 20 is an embodiment of a reconfigurable random logic unit.
  • Configuration memories 2015 and 2016 control the four-to-one selectors (2005, 2006), which select among the inputs (2001-2004); the two output bits (2007) of the four-to-one selectors (2005, 2006) serve as the control signal of the four-to-one selector (2008),
  • which selects one value from the configuration memory (2009-2013) as the output (2014) of the reconfigurable random logic unit.
  • Reconfigurable random logic is implemented by connecting the output of a row of reconfigurable random logic cells to the input of the next row of reconfigurable random logic cells.
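
The reconfigurable random logic unit of Figure 20 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the bit ordering, and the use of a four-entry table are hypothetical (the text lists the configuration memory as 2009-2013), and the sketch models behaviour, not circuitry.

```python
def random_logic_unit(inputs, sel_a, sel_b, table):
    """Model one reconfigurable random logic unit (behavioural sketch).

    inputs       : the four input bits (2001-2004 in the figure)
    sel_a, sel_b : configuration values (0..3) for the two input selectors
    table        : four configuration bits, indexed by the two selected bits
    """
    a = inputs[sel_a]            # first four-to-one input selector
    b = inputs[sel_b]            # second four-to-one input selector
    return table[(a << 1) | b]   # output selector addressed by the two bits


# Hypothetical configurations: truth tables for XOR and AND of two inputs.
XOR_TABLE = [0, 1, 1, 0]
AND_TABLE = [0, 0, 0, 1]
```

Chaining the output of one such unit into an input of a unit in the next row gives multi-level logic, as the last bullet above describes.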

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Optimization (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Analysis (AREA)
  • Computational Mathematics (AREA)
  • Discrete Mathematics (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Logic Circuits (AREA)

Description

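Reconfigurable Data Processing Platform
Technical Field
The present invention relates to the field of integrated circuit design.
Background Art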
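With the development of multimedia technology, new demands are placed on the functions of integrated circuits: chips must process streaming data at high speed, perform large numbers of high-speed additions, multiplications, fast Fourier transforms, discrete cosine transforms, and similar operations, and allow their functions to be updated in time to meet rapidly changing market demands.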
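Traditional general-purpose processors (CPUs) and digital signal processors (DSPs) are functionally very flexible: they can be targeted at different applications and meet users' needs simply by updating the corresponding application program. However, because a general-purpose processor has limited computing resources, its capacity for processing streaming data is insufficient and its throughput is low, which limits its application. Even with a multi-core structure, computing resources remain a constraint; in addition, the degree of parallelism is limited by the application program and the allocation of computing resources is also restricted, so throughput is still unsatisfactory. Compared with general-purpose processors, digital signal processors are optimized in terms of computing resources and add some arithmetic units, but their computing resources are still limited. In some chips, such as UC Davis's AsAP (Asynchronous Array of Simple Processors), multipliers, adders, shifters, and other components are built directly into one module, and that module is then replicated so that the chip has abundant computing resources; however, the configuration method of such a chip is restricted and its flexibility is insufficient.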
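Application-specific integrated circuit (ASIC) chips handle streaming data well, offer very high throughput, and can meet the demand for large volumes of high-speed data operations. However, ASIC chips take a long time and cost a great deal to design: for a 90 nm ASIC chip, the non-recurring engineering (NRE) cost can easily exceed several million US dollars. ASIC chips also lack flexibility: their function cannot easily be changed when market demand changes, and a new chip must be designed. To implement different modes of operation on one ASIC chip, for example to support different video decoding standards, a different module must be designed for each video decoding standard and all of them integrated on the same chip, which raises the cost.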
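The FPGAs (field-programmable gate arrays) currently on the market are mainly based on look-up tables (LUTs). Compared with ASICs they are very flexible, can be configured freely for different applications, and have very low NRE and design costs. However, because FPGAs are aimed mainly at random logic, they conveniently implement logic operations such as two-input or three-input NAND and NOR, but their performance is severely limited for arithmetic operations such as multiplication, and synthesized arithmetic components such as multipliers occupy a large area. Although FPGAs come with built-in multipliers, such as 18x18 multipliers, configuring one as a 32x32 or 8x8 multiplier is very difficult. In FPGA designs, interconnect delay dominates the total delay: for example, when a cyclic redundancy check (CRC) computation is implemented on a Lattice LFSC25 FPGA, interconnect delay accounts for 78.3% of the total delay. Interconnect delay thus severely limits FPGA performance. Moreover, the interconnect delay of an FPGA is known only after the design has been mapped onto the device; the approximate delay cannot be known at design time, so an FPGA design may well have to be modified repeatedly to achieve timing closure, which lengthens the design cycle.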
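Summary of the Invention
To address the shortcomings of the prior art, the present invention proposes a reconfigurable data processing platform. Through different configurations of a regular, reconfigurable universal digital signal processing block, it implements arithmetic and logic operations of different bit widths/word lengths and supports high-radix multiplication. The basic unit (cell) need not use a look-up table (LUT), and complex internal wiring is avoided, so the invention offers the high throughput and high performance of an ASIC together with the flexibility and low cost of an FPGA.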
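The reconfigurable data processing platform of the present invention can implement different functions by setting specific configuration memories (setting memory). The platform consists of one or more instances of some or all of the following modules:
(1) a reconfigurable universal digital data processing block, used for digital data processing;
(2) a reconfigurable memory, used for storage and data reordering;
(3) a reconfigurable controller, which can be used alone to generate control signals or together with other logic to implement configurable logic functions; the other logic includes, but is not limited to, the reconfigurable universal digital data processing block and the reconfigurable memory.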
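The reconfigurable universal digital data processing block of the present invention is composed of a large number of basic units (cells), of one kind or a small number of kinds, each implementing the most basic one-bit operations and connected regularly by an interconnect structure. The whole universal digital signal processing block has preset connections that are axis-asymmetrical and can implement specific functions. It can be reconfigured by setting specific configuration memories and split, with the basic unit as the smallest unit, into one or more digital data processors, each of which implements digital data processing of a different bit width according to its configuration; the block can thus implement one or more related or unrelated algorithms at the same time. The configuration memories may be located wholly or partly inside the block, or wholly or partly outside it. The digital data processors may each be independent, or some or all of them may be interconnected, and they may process digital signals serially, in parallel, or in a mixed serial-parallel fashion. An independent digital data processor can communicate with the outside through vertical buses and/or horizontal buses or bypass paths that cross or go around other digital data processors. Digital data processing includes, but is not limited to, arithmetic operations, logic operations, and combinations of the two; the arithmetic operations include fixed-point and floating-point operations of any bit width less than or equal to the total bit width of the universal digital signal processing block; digital data processing includes, but is not limited to, digital signal processing.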
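Through configuration, the basic unit of the present invention can perform one-bit logic operations and one-bit addition. A row, formed by basic units that are configured and connected according to the connection relationships determined by the configuration information, can implement multi-bit logic operations and specific multi-bit arithmetic operations that a basic unit cannot implement alone, including but not limited to addition, subtraction, and shifting. An array, formed by basic units configured and connected in the same way, can implement specific multi-bit operations that a row cannot implement alone, including but not limited to shifting, multiplication, multiply-add, multiply-subtract, clipping, extraction, absolute value, and rounding. A row is an arithmetic device formed by connected basic units and is usually one row in the physical sense. The concept of a row is used here, however, only to explain the technical idea and implementation of the invention more clearly; the positional relationship between rows is only a logical one and does not specifically refer to physical position. In an actual implementation a row may have any shape, and so may a universal digital signal processing block built from rows.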
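The basic unit of the present invention need not use a look-up table (LUT); it implements different functions with the same set of logic. These functions include, but are not limited to, logic operations and addition. The basic unit contains at least carry generation logic, sum generation logic, a plurality of configuration memories, a plurality of input ports, and a plurality of output ports.
The input and output ports may be one bit or several bits wide. For all or some of the input ports, an input multiplexer selects the data sent to the carry generation logic/sum generation logic. The select signals of these multiplexers come from configuration memory. The input multiplexers include at least one carry-input multiplexer and one multiplexer that selects its input according to the recoding result of a high-radix multiplication. The data at the output ports comes wholly or partly from output delay devices with a bypass function. Such a device includes, but is not limited to, a multiplexer and a register/latch. The multiplexer in the output delay device selects, as the final output, either the output of the carry generation logic/sum generation logic or the delayed output; its select signal comes from configuration memory. Depending on the specific implementation of the reconfigurable data processing platform, the delay device used for output delay in the basic unit, together with its multiplexer and configuration memory, may be connected directly at the output or at the corresponding input. For example, when a group of input ports and a group of output ports of adjacent basic units are connected to each other, the delay device with its multiplexer and configuration memory can be added at the input ports, and the corresponding delay device, multiplexer, and configuration memory at the output ports can be omitted.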
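Depending on configuration, the carry generation logic of the present invention can produce either the carry output of a full adder or the output of a logic operation. Depending on configuration, the sum generation logic can produce either the sum output of a full adder or the output of a logic operation.
The configuration memories in the universal digital signal processing block store configuration information and may be located wholly or partly inside the basic units or wholly or partly outside them. A configuration memory located wholly or partly outside the basic units can control the corresponding reconfigurable parts of one or more basic units.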
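Depending on configuration, the input of the carry-input multiplexer in a basic unit may come from the output port of the adjacent basic unit on the right in the same row, from the output port of a specific basic unit above it in the array, or from configuration information held in the corresponding configuration memory and used as a coefficient; it may also be assigned logic "0" or logic "1". Depending on configuration, the input of the multiplexer that selects according to the high-radix multiplication recoding result may come from the output port of a specific basic unit above it in the array; may be the multiplicand or its bitwise inverse, the left-shifted multiplicand or its bitwise inverse, or logic "0"; or may be configuration information from the corresponding configuration memory used as a coefficient. Depending on configuration, the inputs of the other input multiplexers may come from operand data inputs or from configuration information in the corresponding configuration memory used as a coefficient.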
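Through configuration and interconnection, the basic units of the present invention can form a multiplier implemented by shift-and-add or a high-radix multiplier, which significantly reduces multiplier area and improves performance. High-radix multipliers include, but are not limited to, radix-2 multipliers; recoders include, but are not limited to, Booth encoders. The recoder may exist independently in the reconfigurable data processing platform and serve one or more multipliers, may be implemented through configuration by the reconfigurable controller instead of existing independently, or may be implemented by logic or a processor outside the platform. The recoder's result may be sent as an input to the basic units that make up the multiplier, where it controls the multiplexers that select the multiplication input according to the recoding result, or it may be stored in advance as a coefficient in the configuration memory that controls those multiplexers.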
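
The recoding step can be illustrated with a short Python sketch. It assumes radix-4 Booth recoding, chosen here only because its digit set {-2, -1, 0, +1, +2} corresponds to selecting among zero, the multiplicand, the left-shifted multiplicand, and their negations; the source does not fix a particular radix, and the function names are hypothetical.

```python
def booth_radix4_digits(x, bits):
    """Recode a signed two's-complement value into radix-4 Booth digits.

    x    : multiplier, in the range -2**(bits-1) .. 2**(bits-1) - 1
    bits : operand width, assumed even
    Returns the digits least-significant first; each digit is in -2..2.
    """
    assert bits % 2 == 0
    ux = x & ((1 << bits) - 1)       # two's-complement bit pattern
    digits = []
    prev = 0                         # implicit bit to the right of bit 0
    for i in range(0, bits, 2):
        b0 = (ux >> i) & 1
        b1 = (ux >> (i + 1)) & 1
        digits.append(b0 + prev - 2 * b1)
        prev = b1
    return digits


def booth_multiply(m, x, bits):
    """Sum one partial product per Booth digit: 0, +/-m, or +/-2m."""
    total = 0
    for i, digit in enumerate(booth_radix4_digits(x, bits)):
        total += (digit * m) << (2 * i)   # each digit has weight 4**i
    return total
```

For a 4-bit multiplier value of 7 the recoder yields the digits -1 and +2 (7 = -1 + 2*4), so two partial products are summed instead of four. Because the digits depend only on the multiplier, they could equally be computed ahead of time and stored as coefficients, as the paragraph above allows.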
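The basic units can form the rows and arrays of the present invention. Depending on the internal configuration of the basic units, different connections between basic units form different arithmetic devices, including but not limited to adders, subtractors, shifters, and logic operators. An array generally occupies several rows in the physical sense and is likewise an arithmetic device formed by connected basic units; depending on the internal configuration of the basic units and the connections between them, it forms different arithmetic devices, including but not limited to multipliers, multiply-adders, and multiply-subtractors. The configuration memories in a row or array store the configuration information of that row or array and are used to configure it.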
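Take an adder row as an example. The arithmetic element in each basic unit is configured as an addition element; the input and output ports of the basic unit are configured, following the definition of addition, as operand input ports, a carry-in port from the lower bit, a sum output port, and a carry-out port; and the carry-in port of each basic unit is connected to the carry-out port of its neighbour. This forms a ripple-carry adder. The delay devices in the basic units can be configured flexibly as needed to divide the addition into segments, forming a pipelined adder that raises the clock frequency. Basic units can also be configured with additional output ports that, together with carry-lookahead logic, implement a carry-lookahead adder, again raising the clock frequency. Other kinds of adder can be formed by configuring the basic units and their connections differently. A subtractor is built in a similar way.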
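
A minimal Python sketch of the adder row described above, with each basic unit modelled as a one-bit full adder whose carry-out feeds the carry-in of its higher-order neighbour. The function names and the LSB-first loop are illustrative assumptions, not taken from the source; the subtractor uses the usual two's-complement identity x - y = x + ~y + 1.

```python
def cell_add(a, b, carry_in):
    """One basic unit configured as a full adder: returns (sum, carry_out)."""
    s = a ^ b ^ carry_in
    carry_out = (a & b) | (a & carry_in) | (b & carry_in)
    return s, carry_out


def ripple_add(x, y, width, carry_in=0):
    """A row of `width` cells chained through their carry ports."""
    carry = carry_in
    result = 0
    for i in range(width):               # bit 0 is the rightmost cell
        s, carry = cell_add((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result, carry                 # carry is the row's final carry-out


def ripple_sub(x, y, width):
    """Subtract by adding the inverted operand with carry-in set to 1."""
    mask = (1 << width) - 1
    return ripple_add(x, ~y & mask, width, carry_in=1)
```

With a 4-bit row, `ripple_add(15, 1, 4)` wraps to a sum of 0 with carry-out 1, which is the carry a wider adder would pass on to the next segment.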
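Take a shifter row as an example. The arithmetic element in each basic unit is configured as a bypass; the input and output ports of the basic unit are configured as operand input ports and operand output ports; and the operand input ports and operand output ports of basic units in a specific positional relationship are connected, which forms a shifter. A shifter with a long shift distance can be built from several shifters with short shift distances, or implemented in a single shift through connections between basic units in a specific positional relationship.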
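
The shifter row can be sketched as follows, assuming each basic unit simply forwards the bit it receives from the unit `offset` positions away. The helper names, the LSB-first list representation, and the `fill` parameter are hypothetical choices made for this illustration.

```python
def to_bits(value, width):
    """Unpack an integer into a list of bits, least-significant first."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    """Pack an LSB-first list of bits back into an integer."""
    return sum(bit << i for i, bit in enumerate(bits))


def shift_row(bits, offset, fill=0):
    """One row of bypass cells; cell i takes its input from cell i - offset.

    offset > 0 shifts towards the high end (left shift),
    offset < 0 shifts towards the low end (right shift).
    Positions with no source cell receive `fill`.
    """
    width = len(bits)
    return [bits[i - offset] if 0 <= i - offset < width else fill
            for i in range(width)]
```

Applying `shift_row(word, 1)` three times gives the same result as a single `shift_row(word, 3)`, which mirrors the two options in the text: several short shifts in sequence, or one long shift through suitably placed connections.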
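Take a logical-AND row as an example. The arithmetic element in each basic unit is configured as an AND operation and the input and output ports of the basic unit are configured as operand input ports and operand output ports, which forms a logical-AND device. Other logic operations, including but not limited to OR, NOT, XOR, and XNOR, can be implemented in a similar way.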
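Take a multiplier array as an example. The arithmetic element in each basic unit is configured as an addition element; the input and output ports of the basic unit are configured, following the definition of addition, as operand input ports, a carry-in port from the lower bit, a sum output port, and a carry-out port; and the carry-in ports, sum output ports, carry-out ports, and operand input ports of basic units in specific positional relationships are connected according to certain rules, which forms an array multiplier. Configuring the basic units and their connections according to other rules yields other types of multiplier, including but not limited to radix-2 multipliers.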
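
As a rough illustration of the array multiplier described above, the sketch below stacks one row of full-adder cells per multiplier bit, each row adding one aligned partial product to the running sum. It assumes unsigned operands and a simple ripple carry within each row; the names are hypothetical and the wiring rules of the actual array are not given in this passage.

```python
def full_adder(a, b, carry_in):
    """One basic unit configured as an addition element."""
    return a ^ b ^ carry_in, (a & b) | (a & carry_in) | (b & carry_in)


def array_multiply(m, x, width):
    """Unsigned width-by-width multiply built from rows of one-bit cells."""
    acc = [0] * (2 * width)                  # running partial sum, LSB first
    for row in range(width):                 # one adder row per multiplier bit
        x_bit = (x >> row) & 1
        carry = 0
        for col in range(width):
            pp = ((m >> col) & 1) & x_bit    # partial-product bit for this cell
            acc[row + col], carry = full_adder(acc[row + col], pp, carry)
        acc[row + width] = carry             # row carry-out enters the next column
    return sum(bit << i for i, bit in enumerate(acc))
```

Each row is offset by one column from the row above, which is the alignment that the fixed connections between specific basic units would provide in hardware.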
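Take a multiply-adder array as an example: it is built in the same way as a multiplier array, with one final row configured as an adder. It can also be configured according to certain rules to obtain a more efficient multiply-adder. A multiply-subtractor is built in a similar way.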
所述行或阵可以根据配置信息包含单数个或复数个有效位,或不包含有效位。所述单数 个或复数个有效位可以用于控制是否运行对应的部分或全部行或阵并标识运行结果的有效 性; 所述有效位用存储器实现, 包括但不限于寄存器、 锁存器和随机访问存储器。所述行或 阵有效位中的有效信息可以传输到其他行或阵中,也可以传输到所述可重构普适数字数据处 理块的外部。如果传输来的有效位中的信息为有效, 则所述行或阵需要进行相应运算, 且将 有效信息送往相应的下一个行或阵或所述可重构普适数字数据处理块的外部;如果传输来的 有效位中的信息为无效,则所述行或阵不进行相应的运算,并将无效信息送往相应的下一个 行或阵或所述可重构普适数字数据处理块的外部。控制所述行或阵不运行的方法包括但不限 于关断相应时钟或关断相应电源。所述有效位也可以只用于标识有效性。所述有效位也可以 被旁路逻辑旁路而不起作用。
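One or more of the rows or arrays can, according to the configuration information stored in the configuration memory, implement a specific data processing function and complete data processing independently, or process data together with other rows or arrays in series, in parallel, or in a mixed serial-parallel manner. The rows or arrays making up the specific data processing function may be homogeneous or heterogeneous.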
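Taking matrix multiplication as an example, it can be treated algorithmically as an accumulation of repeated multiplications and additions, so connecting a plurality of multiply-add arrays according to specific rules implements matrix multiplication. Digital data processing algorithms such as filtering and the fast Fourier transform (FFT) can be implemented in a similar way.
The rows or arrays and the basic cells of the invention are all parts of the reconfigurable universal digital data processing block. Their division is relative; the levels are listed only to express the technical solution more clearly and are not all mandatory. For a specific function, a row may consist of a single basic cell, and the universal digital signal processing block may consist of a single row or array.
After the reconfigurable universal digital data processing block is partitioned into rows or arrays and basic cells according to the configuration information and these are connected, the block is configured as digital data processors with specific functions. When such a processor is an independent digital data processor, it can communicate with the outside of the block or of the reconfigurable data processing platform through the vertical bus and/or the horizontal bus, or through a bypass path that crosses or goes around other digital data processors with specific functions. When a plurality of such processors have the same function and run independently and simultaneously in parallel, single instruction multiple data (SIMD) operation is achieved; when they have different functions and run independently and simultaneously in parallel, multi-issue operation is achieved.
A digital data processor with a specific function reconstructed according to the configuration information may, according to the configuration information, contain one or more valid flags, or none. The valid flags can be used to control whether the corresponding digital data processor runs and to mark the validity of its results. The valid flags are implemented with memory, including but not limited to registers, latches and random access memory. The validity information may be transmitted to other digital data processors in the block or to the outside of the block. If the received valid flag indicates valid, the digital data processor, or the rows or arrays within it, performs the corresponding operation and passes valid information on to the corresponding other digital data processor, to rows or arrays within the processor, or to the outside of the block. If the received valid flag indicates invalid, the digital data processor does not perform the operation and passes invalid information on to the corresponding other digital data processor or to the outside of the block. Methods of keeping a digital data processor from running include but are not limited to gating off the corresponding clock or switching off the corresponding power supply. The valid flags may also be used only to mark validity, and may be bypassed by bypass logic so that they have no effect.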
本发明所述的可重构普适数字数据处理块中的互联结构分为高速局部连接和全局总线 两个层次。
所述高速局部连接用于邻近基本单元间的高速连接;所述高速局部连接是硬连接(hard wired ) 的近距离固定连线, 用于大部分时延关键路径; 通过对基本单元内输入多路选择器 的配置可以重构基本单元间的连接关系。所述连接关系包括但不限于同一行内的加减法的进 位关系和不同行间的乘法部分积传递关系。所述可重构局部总线通过配置可以构成加减法进 位链和乘法部分积进位, 用于构成运算部件, 包括但不限于加法器、 减法器、 乘法器。
所述全局总线包括纵向连接行或阵的用于传输数据的可重构纵向总线结构和横向的可 重构横向移位结构; 所述可重构纵向总线结构用于数据 /数据流可重构普适数字数据处理块 中的纵向传输;所述可重构横向移位结构可以根据配置将特定行或阵输出的数据移位后送往 左侧、 右侧或下一个行或阵, 还可以实现大跨度的数据移位操作和数据交叉换位操作。
所述可重构纵向总线结构和可重构横向移位结构可以通过配置特定的配置存储器实现 重构,共同构成完整的总线结构,实现所述可重构普适数字数据处理块的数据流的各种走向。
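The reconfigurable vertical bus structure used for vertical connection has two directions, reconfigurable downward buses and reconfigurable upward buses, whose numbers may differ. One or more rows or arrays form a vertical transfer unit. The downward bus can bring data from outside the reconfigurable universal digital data processing block into each vertical transfer unit, and can be disconnected at any vertical transfer unit according to configuration. The output of any vertical transfer unit can also borrow the downward bus to send data to other vertical transfer units below it. The upward bus is connected to the data output of every vertical transfer unit and can send the output data of a specific vertical transfer unit to the outside of the block. The output of any vertical transfer unit can also borrow the upward bus to send data to other vertical transfer units above it.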
单数个或复数个行或阵构成一个横向传输单位。所述用于横向连接的可重构横向移位结 构可以根据配置将所述横向传输单位输出的数据移位后送往下一个所述横向传输单位。所述 移位的位数可以是单数位, 也可以是复数位, 还可以不移位。
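The reconfigurable memory of the invention can be reconfigured by configuring specific configuration memories; it can store data and provides data reordering. Its bit width/word length may be fixed or variable. The configuration memory may be located wholly or partly inside the reconfigurable memory, or wholly or partly outside it.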
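Address mapping converts an input address into a new address and outputs it, thereby reordering data. The mapping may be applied once or several times; the specific mapping relation and the number of mappings are determined by configuration. With different mapping relations, the same logical address can access different physical addresses of one memory, and the same sequence of logical addresses can write to and/or read from the memory in different physical address orders, which makes data address translation and first-in first-out (FIFO) buffers easy to implement.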
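The reconfigurable memory can also be configured as multiple independent memories of different bit widths/word lengths, each with its own address decoding/addressing logic and word lines. Any independent memory may have its own address mapping or share an address mapping with other independent memories.
The reconfigurable memory of the invention may be located outside and above or below the reconfigurable universal digital data processing block, or inside it. According to configuration it can be connected to the block according to specific rules for data storage and buffering.
The reconfigurable control unit of the invention comprises reconfigurable random logic and a reconfigurable finite state machine. It can be reconfigured by configuring specific configuration memories, using basic components to produce the control signals and finite state machines needed for different tasks. The configuration memory may be located wholly or partly inside the reconfigurable control unit, or wholly or partly outside it.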
所述可重构随机逻辑包括可重构功能单元和可重构连接;所述可重构功能单元通过配置 能实现任意种逻辑功能;所述可重构连接通过配置能实现复数个可重构功能单元的任意种连 接;单数个或复数个可重构功能单元及单数个或复数个可重构连接通过特定的配置, 即能实 现随机逻辑的重构;所述可重构随机逻辑还可以根据配置,在由可重构功能单元和可重构连 接构成的随机逻辑中插入寄存器 /锁存器以保证时延要求。
所述可重构有限状态机由可随机访问的存储器 (randomly accessible memory ), 当前 状态寄存器、 可重构多路选择器和可重构随机逻辑构成。
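The randomly accessible memory stores input qualification values, state transition values and output control signal values. Each row stores one or more groups of qualification values, one or more groups of transition values and one or more groups of output control signal values. The bit width/word length occupied by each of these in a row is variable. The boundary information between the different values in a row may be fixed or dynamically variable; when it is dynamically variable, a boundary flag memory stores the boundary information for each row. The current state register stores the current state value and realizes the transition from the next state to the current state. The reconfigurable multiplexer selects the next state from the groups of transition values and selects the control signal output that satisfies the condition from the groups of output control signal values. The reconfigurable random logic can be reconfigured by configuring specific configuration memories and performs a specific random logic function. The reconfigurable finite state machine may further include a counter built from reconfigurable random logic, so that state transitions can depend on a count.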
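The value in the current state register is used directly as an address pointing to one row of the randomly accessible memory. The qualification values in that row's output are sent, together with signals input from outside the reconfigurable finite state machine, to the reconfigurable random logic, which generates a select signal. The select signal picks the next state from the transition values in the row's output and picks the output control signal from the output control signal values in the row's output. The next state is stored in the current state register and serves as the address for the next access to the randomly accessible memory.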
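The lookup cycle described above can be illustrated in software. The following is a minimal Python sketch, assuming each memory row is a list of (qualification value, next state, output) groups and that the select logic is a simple equality match; the class name `MemoryFSM`, the state names and the signal names are hypothetical, and holding the state when nothing matches is an assumption rather than something the specification states.

```python
class MemoryFSM:
    """Finite state machine driven by a table held in a randomly accessible memory."""

    def __init__(self, memory, start_state):
        self.memory = memory        # one row per state
        self.state = start_state    # models the current state register

    def step(self, external_input):
        row = self.memory[self.state]               # the state value addresses one row
        for qualifier, next_state, output in row:   # compare the input with each qualifier
            if qualifier == external_input:
                self.state = next_state             # next state becomes the current state
                return output                       # selected output control signal
        return None  # assumption: no match keeps the state and emits nothing


# Hypothetical table: three states, inputs named i0, j0, j2, k2.
memory = {
    "A": [("i0", "B", "out0")],
    "B": [("j0", "A", "out1"), ("j2", "C", "out2")],
    "C": [("k2", "B", "out3")],
}
fsm = MemoryFSM(memory, "A")
```

Each call to `fsm.step(...)` corresponds to one access of the memory followed by an update of the current state register.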
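By adding more current state registers together with a state multiplexer and an input multiplexer, the reconfigurable finite state machine can, after configuration, support alternating concurrent execution of multiple state machines (multi-thread). According to the technical solution of the invention, only one finite state machine runs at any moment; the current values of the other finite state machines are held in their respective current state registers, and the added multiplexer selects the value in the current state register of the running finite state machine as the current state. When the state machine is switched, the state of the previously running finite state machine is kept in its current state register, and the added multiplexer selects the value in the current state register of the finite state machine about to run as the current state, so that this state machine continues from there; this completes the switch.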
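The switching scheme above can be sketched as one shared table plus one state register per machine. This is a minimal Python illustration under assumptions: the table is a dictionary from state to {input: (next state, output)}, and the names `MultiThreadFSM`, `switch` and the example states are hypothetical.

```python
class MultiThreadFSM:
    """Several state machines share one table; only the active one advances."""

    def __init__(self, table, start_states):
        self.table = table
        self.state_regs = list(start_states)  # one current state register per machine
        self.active = 0                        # models the state multiplexer select

    def switch(self, index):
        # The paused machine's state simply stays in its own register.
        self.active = index

    def step(self, external_input):
        current = self.state_regs[self.active]
        # Assumption: an unmatched input holds the state and produces no output.
        next_state, output = self.table[current].get(external_input, (current, None))
        self.state_regs[self.active] = next_state
        return output


table = {
    "A0": {"go": ("A1", "x")},
    "A1": {"go": ("A0", "y")},
    "B0": {"tick": ("B1", "p")},
    "B1": {"tick": ("B0", "q")},
}
fsm = MultiThreadFSM(table, ["A0", "B0"])
```

Switching back to a paused machine resumes it from the state left in its register, which is the behaviour the paragraph describes.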
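The reconfigurable control unit of the invention can implement reconfigurable logic functions together with other logic; these functions include but are not limited to a reconfigurable external input/output interface. For the reconfigurable external input/output interface, the finite state machine in the reconfigurable control unit can be configured so that one set of hardware implements one or more interface protocols at the same time. If one set of hardware must implement several interface protocols at the same time, a finite state machine can be configured as an arbitration module, and time-division multiplexing at the platform's high internal clock frequency supports the several relatively slow interface protocols simultaneously.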
本发明所述的可重构数据处理平台中用于配置可重构的模块的配置信息可以存储在平 台内, 也可以存储在平台外。如果配置信息已经存储在所述可重构数据处理平台内, 则可以 直接配置各个可重构的模块。当配置信息存储在所述可重构数据处理平台外时, 则可以将所 述可重构数据处理平台视为存储器,用向存储器存储数据的方式将所有配置信息传输到所述 可重构数据处理平台中的各个可重构的模块。还可以在所述可重构平台运行时将所述配置信 息传输到各个配置存储器, 实现不同功能间的动态切换。
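The configuration information sent to the reconfigurable modules may be raw, unencoded configuration information or encoded information. Encoding methods include but are not limited to encryption and compression. The key used to decrypt encrypted configuration information may exist in hardware form in the reconfigurable data processing platform, or may be input to the platform through the configuration information.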
本发明所述的配置信息可以通过人工制定, 也可以通过自动工具根据映射规则自动产 生。 所述映射包括但不限于硬件描述语言 (HDL) 到配置信息的映射、 计算机程序语言到配 置信息的映射、计算机建模到配置信息的映射和算法描述到配置信息的映射。硬件描述语言 包括但不限于 Veri logHDL和 VHDL; 计算机程序语言包括但不限于 C、 C++和 JAVA; 计算机 建模包括但不限于 Matlab 建模; 算法描述包括但不限于对特定算法的伪指令 (pseudo instruction ) 描述。 可以预先实现常用的运算的配置信息模板。 当使用自动工具根据映射 规则自动产生配置信息时, 所述自动工具可以直接调用所述模板中的配置信息。
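The configuration memory of the invention may be volatile or non-volatile. When it is non-volatile, a specific digital data processing function formed by one configuration is retained until the next configuration, which is convenient for the user. Non-volatile memory includes but is not limited to flash memory. The configuration can also be fixed directly when the reconfigurable data processing platform is designed and manufactured; methods of doing so include but are not limited to replacing specific layers of the layout at production time. Although the platform can then implement only a specific function, the development cycle is still greatly shortened.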
本发明所述的可重构数据处理平台还可以包括扩展模块, 以适应更广泛的需求;所述扩 展模块包括但不限于: 随机逻辑控制器、 模拟单元、 中央处理器、 数字信号处理器 (DSP)、 数据包头检测器和逻辑零个数检测器。
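Designs based on the reconfigurable universal digital signal processing block, the reconfigurable memory and the reconfigurable control unit, configured for different applications, do not require timing closure. This is because the invention has very few kinds of basic cell, the delay between each input and output of a basic cell is known in advance, and all timing-critical paths can be connected through high-speed interconnect whose delay is also known. Taking a 16x16-bit multiplication as an example, the final delay is determined by a 32-bit carry-lookahead full adder. When the delay of partial-product accumulation is less than the delay of the 32-bit full adder, the accumulated partial products are sent directly to the full adder for the final addition. When the delay of partial-product accumulation exceeds that of the 32-bit full adder, the pipeline stages needed by the multiplier are added automatically: intermediate values of the partial-product accumulation are registered, accumulation then continues, and the result is finally sent to the 32-bit full adder for the final addition. The delay of a multiplier configured from the reconfigurable universal digital signal processing block can therefore be predicted before design and built into a template. Designs based on the reconfigurable memory and the reconfigurable control unit can likewise predict delay before design and avoid timing closure.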
本发明所述的可重构数据处理平台可以具备自测试能力,能够在加电工作的情况下不依 赖于外部设备进行芯片的自测试。
当所述可重构数据处理平台具备自测试能力时,可以将可重构普适数字数据处理块中特 定的单数个基本单元或行或阵,或者复数个基本单元或行或阵配置成比较器,对可重构普适 数字数据处理块中相应的复数组其他基本单元或行或阵及基本单元或行或阵的组合给予具 有特定关系的激励,并用所述比较器比较所述复数组其他行或阵及行或阵的组合的输出是否 符合相应的特定关系。
所述激励可以来自所述可重构数据处理平台中的特定模块,也可以来自所述可重构数据 处理平台外部。所述特定关系包括但不限于相等、 相反、 互逆、 互补。 所述测试结果可以被 送到所述可重构数据处理平台外部。 也可以保存在所述可重构数据处理平台中的存储器中。 所述的自测试可以是在晶圆测试,封装后集成电路测试或者芯片使用时在系统启动时进 行测试; 也可以人为设定自测试条件及周期, 在工作期间定期进行自测试; 所述存储器可以 是挥发性的, 也可以是非挥发性的。
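Provided it has self-test capability, the reconfigurable data processing platform of the invention can also have self-repair capability.
When the test results are stored in a memory in the reconfigurable data processing platform, failed basic cells, failed rows or failed arrays can be marked. When the platform is configured, the failed basic cells, rows or arrays can be bypassed according to the marks, so that the platform still works normally and self-repair is achieved.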
所述自修复可以是在晶圆测试后进行,封装后集成电路测试后进行或者芯片使用时在系 统启动时进行测试后进行;也可以人为设定自测试自修复条件及周期,在工作期间定期进行 自测试后进行。
本发明提出的可重构数据处理平台与现有技术的本质区别在于:
1、 本发明中可重构普适数字数据处理块基本单元中可以不包含查找表, 简化了设计, 通过基本单元的输出与位于不同行的相邻基本单元的不同连接即可实现不同的运算功能,与 现有技术相比, 有很大的改进;
2、 本发明中可重构普适数字数据处理块基本单元相互连接可以直接构成高阶乘法器, 不需要额外的译码模块, 能显著减小乘法器的面积, 并提高性能;
3、 本发明中可重构普适数字数据处理块中用于实现特定功能的基本单元间的连接非常 紧密,通常是一行中或相邻行中的相邻基本单元直接连接,使在时延关键路径上的基本单元 间信号传输时延能减少到最小, 以提高性能;
4、 本发明中可重构存储器能通过配置地址映射关系, 改变存储器的寻址方式, 实现数 据重排序, 这是现有技术中未涉及的;
5、 本发明中可重构有限状态机是以可随机访问的存储器为基础实现的, 且能支持交替 并发, 与现有的可重构有限状态机实现技术都不同;
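6. Designs based on the reconfigurable data processing platform, configured for different applications, do not require timing closure, which is also not addressed in the prior art.
Beneficial effects: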
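First, the reconfigurable data processing platform of the invention can implement different functions with one set of hardware. Compared with an application-specific integrated circuit (ASIC) that integrates several pieces of hardware for different functions, the invention effectively reduces chip area and leakage current and lowers chip cost. Compared with a general-purpose processor or a digital signal processor, the reconfigurable universal digital data processing block contains a very large number of arithmetic units and offers throughput and a breadth of multi-issue capability that those processors cannot reach. Compared with an FPGA, the block is more structured, related logic is more tightly connected and the connection relations between logic are much simpler, so it achieves higher performance; it can also change configuration faster, switch quickly from one operating mode to another, and support several relatively slow functions simultaneously by time-division multiplexing at a high clock frequency.
Second, when designing on the platform of the invention, the unit timing of the basic cells, rows and arrays configured for different functions is known, the basic cells on timing-critical paths are all connected over short distances, and the timing of rows or arrays in preset templates is deterministic. Timing can therefore be calculated conveniently before design from the size and bit width of the logic in the specific design, with no need for timing closure afterwards, which avoids possible design rework and effectively shortens the development cycle.
Finally, the configuration information of the invention can be generated automatically by tools, according to mapping rules, directly from hardware description languages, computer programming languages, computer models, algorithm descriptions and the like. This not only shortens the development cycle but also lets algorithm engineers who do not know a hardware description language or a programming language map more abstract models or algorithm descriptions directly to configuration information, eliminating many design steps and improving productivity.
Brief Description of the Drawings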
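Although the invention can be extended through modifications and substitutions in many forms, the specification lists some specific embodiments and describes them in detail. It should be understood that the inventor's intent is not to limit the invention to the particular embodiments described; on the contrary, the intent is to protect all improvements, equivalent transformations and modifications made within the spirit or scope defined by the claims.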
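Fig. 1 is a structural block diagram of the reconfigurable data processing platform of the invention.
Fig. 2 is an embodiment of the internal structure of a basic cell.
Fig. 3(a) is an embodiment of the internal configuration of a single basic cell for ordinary addition.
Fig. 3(b) is an embodiment of the connections between basic cells for ordinary 8-bit addition.
Fig. 4(a) is an embodiment of the internal configuration of a single basic cell for logic operations.
Fig. 4(b) is an embodiment of the connections between basic cells for logic operations.
Fig. 5 is an embodiment of the basic cell connections implementing a 4-bit multiplication.
Fig. 6 is an embodiment of a 4-bit general multiplier.
Fig. 7 is an embodiment implementing single instruction single data (SISD) and single instruction multiple data (SIMD) instructions with data paths of similar width.
Fig. 8 is an embodiment of the partitioning of functional modules in the data path for floating-point multiplication.
Fig. 9 is an embodiment of a multiply-accumulate unit.
Fig. 10(a) is an embodiment of the basic cell connections for a right shift by one bit.
Fig. 10(b) is an embodiment of the basic cell connections for a left shift by one bit.
Fig. 11(a) is an embodiment of the vertical data-flow basic unit.
Fig. 11(b) is an embodiment of the bus structure.
Fig. 12(a) is an embodiment of the horizontal shift unit.
Fig. 12(b) is an embodiment of the reconfigurable universal digital data processing block.
Fig. 13(a) shows a reconfigurable universal digital data signal processing block that has not yet been configured.
Fig. 13(b) shows a reconfigurable universal digital data signal processing block configured for single instruction multiple data (SIMD) operation.
Fig. 13(c) shows a reconfigurable universal digital data signal processing block configured for serial operation.
Fig. 13(d) is a diagram of input and output routing in a configured reconfigurable universal digital data signal processing block.
Fig. 13(e) is a diagram of input and output routing in a configured reconfigurable universal digital data signal processing block.
Fig. 14(a) is an embodiment of address translation in the reconfigurable memory of the invention.
Fig. 14(b) is an embodiment of the reconfigurable memory of the invention in use.
Fig. 15(a) is an embodiment of a finite state machine.
Fig. 15(b) is an embodiment implementing the finite state machine of Fig. 15(a) with the reconfigurable finite state machine of the invention.
Fig. 15(c) is an embodiment in which the reconfigurable finite state machine of the invention is configured for alternating concurrent execution of multiple state machines.
Fig. 16 is an embodiment of the reconfigurable output port of the invention.
Fig. 17 is an embodiment described by a video decoding flow chart.
Fig. 18 is an embodiment of valid flag configuration when configured as a multiplier.
Fig. 19 is an embodiment implementing absolute value, saturation and compare-select operations.
Fig. 20 is an embodiment of a reconfigurable random logic unit.
Detailed Description of the Embodiments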
如图 1 所示, 本发明所述可重构数据处理平台 (101) 由可重构普适数字数据处理块 (102)、 可重构存储器(103)和可重构控制单元(104)构成。 可重构普适数字数据处理块 (102) 由基本单元 (105) 规整地构成, 整个可重构普适数字数据处理块 (102) 具有非各 向同性的预设连接, 能实现特定的功能, 并可以通过配置特定的配置存储器实现重构。
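Fig. 2 is an embodiment of the internal structure of a basic cell. The three groups of data input ports of the basic cell (201) are defined as A, B and C, and the two groups of data output ports as Co and S. Group A has six inputs: logic "0", A1, A2, A1N, A2N and A3, where A1, A2, A1N and A2N serve as partial-product generation inputs during multiplication and as global signal inputs for other functions. A six-to-one multiplexer (210), controlled by configuration memory (202), selects from these the one-bit group A input that actually enters the carry generation logic (208) and the sum generation logic (209). Group B has three inputs: B1, B2 and configuration memory (205); a three-to-one multiplexer (211), controlled by configuration memory (203), selects the one-bit group B input that enters the carry generation logic (208) and the sum generation logic (209). Group C has four inputs: C1, C2, logic "1" and logic "0"; a four-to-one multiplexer (213), controlled by configuration memory (207), selects the one-bit group C input that enters the carry generation logic (208) and the sum generation logic (209). A two-to-one multiplexer (212) selects either the current output value of the carry generation logic (208) or the value stored in register (215) as the Co output of the basic cell (201); configuration memory (204) controls whether register (215) latches the output of the carry generation logic (208). A two-to-one multiplexer (214) selects either the current output value of the sum generation logic (209) or the value stored in register (216) as the S output of the basic cell (201); configuration memory (206) controls whether register (216) latches the output of the sum generation logic (209) and also controls the two-to-one multiplexer (214). Configuration memories (202), (203), (204), (205), (206) and (207) can each be shared with the corresponding memories of adjacent basic cells to simplify the logic; when all basic cells in a row perform the same function, a single set of memories can configure the whole row. The specific configuration methods are given in the embodiments below.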
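Fig. 3(a) is an embodiment of the internal configuration of a single basic cell for ordinary addition. Here the six-to-one multiplexer (210) is controlled by configuration memory (202) to select A2 as the group A input. The three-to-one multiplexer (211) is controlled by configuration memory (203) to select the output of configuration memory (205) as the group B input. The four-to-one multiplexer (213) is controlled by configuration memory (207) to select C2 as the group C input. Configuration memory (206) is configured so that the current output of the sum generation logic (209) is latched in register (216), and the two-to-one multiplexer (214) is configured by configuration memory (206) to select the value stored in register (216) as the S output of the basic cell (303). The two-to-one multiplexer (212) is configured by configuration memory (204) to select the current output value of the carry generation logic (208) as the Co output of the basic cell (303). The group B input may also be connected to an external global signal input.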
图 3 (b) 是实现普通 8 位加法运算时基本单元间连接的一个实施例。 所有基本单元 (301-308)的 A端输入为从非相邻行输出的一组信号 G7-G0, B端输入为由各自配置存储器 (205) 所存储的值。 所有基本单元 (301-308) 通过单个基本单元 Co端与左边相邻基本单 元的 C端之间连线 (311-317) 形成进位链, 且进位链可包含超前进位逻辑。 通过配置存储 器 (207) 的适当配置可将进位链断开为几段, 从而实现 SIMD加法指令。
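Breaking the carry chain into segments, as described above for SIMD addition, can be modeled as follows. This is a minimal Python sketch, assuming the chain is cut by forcing the carry input to logic "0" at every lane boundary; the function name `simd_add` and the `lane` parameter are illustrative only.

```python
def simd_add(x, y, width=8, lane=4):
    """Add two packed words lane by lane; carries never cross a lane boundary."""
    carry = 0
    result = 0
    for i in range(width):
        if i % lane == 0:
            carry = 0  # C input mux selects logic "0": the carry chain is cut here
        a = (x >> i) & 1
        b = (y >> i) & 1
        s = a ^ b ^ carry
        carry = (a & b) | (a & carry) | (b & carry)
        result |= s << i
    return result
```

With `lane` equal to `width` the same row behaves as one ordinary adder, so the choice between one wide addition and several narrow ones is purely a matter of configuration.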
图 4 (a) 是实现逻辑运算时单个基本单元内部配置的一个实施例。 此时, 六选一多路 选择器 (210) 由配置存储器 (202) 控制选择 A3作为 A组输入。 三选一多路选择器 (211) 由配置存储器 (203) 控制选择 B2作为 B组输入。 四选一多路选择器 (213) 由配置存储器 (207) 控制选择逻辑 "0"作为 C组输入。 二选一多路选择器 (214) 被配置存储器 (206) 配置为选择和产生逻辑(209) 的当前输出值作为基本单元(404)的 S输出。 二选一多路选 择器 (212) 被配置存储器 (204) 配置为选择进位产生逻辑 (208) 的当前输出值作为基本 单元(404)的 Co输出。 此时, S端输出即为输入 A和输入 B的异或逻辑结果, Co端输出即 为输入 A和输入 B的与逻辑结果。
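The reason a full-adder cell yields XOR and AND when its C input is tied to logic "0" can be checked directly. The sketch below is a minimal Python model, assuming the sum and carry generation logic are those of a standard full adder; the function names are illustrative and not from the specification.

```python
def basic_cell(a, b, c):
    """Sum and carry generation logic of one cell, modeled as a full adder."""
    s = a ^ b ^ c
    co = (a & b) | (a & c) | (b & c)
    return s, co


def row_logic(a_word, b_word, width=8):
    """Apply the cell bitwise with C = 0: S gives XOR, Co gives AND."""
    xor_out = 0
    and_out = 0
    for i in range(width):
        s, co = basic_cell((a_word >> i) & 1, (b_word >> i) & 1, 0)
        xor_out |= s << i
        and_out |= co << i
    return xor_out, and_out
```

By the same full-adder identities, tying C to logic "1" instead makes Co equal to A OR B, which is one way the other logic operations could be obtained.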
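Fig. 4(b) is an embodiment of the connections between basic cells for logic operations. Here the C inputs of all basic cells (401-408) are logic "0", the A inputs are the addition results SUM7-SUM0 of the previous row, and the B inputs are a group of signals G7-G0 output from a non-adjacent row. The S outputs are the bitwise XOR results XOR7-XOR0 of inputs A and B, and the Co outputs are the bitwise AND results AND7-AND0 of inputs A and B.
In Fig. 5, 500 and 506 form an embodiment of a 4-bit multiplier. The connections shown are only those used, through configuration, for multiplication; all unrelated wiring for other functions is omitted. This multiplier is characterized by a multiplicand that changes continually and a multiplier that is relatively fixed. After Booth encoding of the multiplier, the relation between each partial product at a given weight and the multiplicand is relatively stable: it is ±2 times or ±1 times the multiplicand, or 0. The multiplier can thus effectively reduce the number of partial products and improve efficiency.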
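For a multiplication Ym × Xn with operand widths m and n, Radix2 Booth encoding is used, a linear array compresses the partial products into two partial sums, and a carry-propagate adder then produces the final result. The size of the linear array falls into two basic cases depending on whether the operands are signed. For signed operation, the number of rows of 201 cells required is N/2, where N is the smallest even number not less than n; for unsigned operation it is N/2+1, where N is the smallest even number greater than n. In both cases the number of columns of 201 cells required is m×2+2. The adder that computes the final result needs one row of m+n 201 cells.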
由于可以对无符号数高位进行 "0 "扩展, 使之成为符号数, 所以对有无符号数是否要 区别对待, 完全可以根据用户的需求而灵活决定。
以一个 Y4 X X4的 4位有符号乘法为例。 500是由 3 X 10个 201组成的, 可以完成部分 积压縮的阵列。
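First the multiplicand Y is sign-extended by two bits and four "0" bits are appended at its low end. The multiplicand Y is then represented as a 10-bit number Y_in<9:0>. This number, its one-bit left shift, and the bitwise inversions of both correspond to the A-port inputs A1, A2, A1N and A2N of the 201 modules in the 500 array.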
在进行乘法操作时, 为实现正确逻辑, 须确定 201模块的各端口输入, 方式如下:
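Port A: determined by the configuration information in 202. The configuration is based on Radix2 Booth encoding of the multiplier X, which chooses the corresponding partial product as 0, +1 times, -1 times, +2 times or -2 times the multiplicand; the multiplexer accordingly selects among the 0, A1, A2, A1N and A2N signals of port A as input. After the required specific logic processing, the input takes part in the summation. In 501 the configured A input implements the sign-extension logic of the partial product; in 502 the A input generates the correct partial product; in 503 the A input is set to logic "0" by configuration to ensure the low-order multiplication result is correct.
Port B: determined by the configuration information in 203. In 501 the B input is set to logic "1" by configuration to implement the partial-product sign-extension logic. 504 is the first row of the multiplication; its configured B-port input implements part of the logic of the complementing operation required in summing the partial products, that is, part of the sign-extension logic. In 505 the B input is fixed by configuration to the Co output of the previous level's 201 shifted right by one bit, to align and accumulate the partial products.
Port C: determined by the configuration information in 207. In 501 the C input, like the B input, is set to logic "1" by configuration to implement the partial-product sign-extension logic. In 504 the C input is set to logic "0" by configuration. In 505 it is fixed by configuration to the S output of the previous level's 201 shifted right by two bits, to align and accumulate the partial products; the high-order bits shifted in are filled with logic "0".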
通过以上配置, 最后一级得到两组部分和, 对齐后通过加法器 506相加得到最终结果。 506是乘法操作中使用加法器, 其配置方法为:
A端口: 通过 202配置固定选取上一级 201中 C端输出左移一位的结果, 以实现部分和 的对齐和累加。
B端口: 通过 203配置固定选取上一级 201中 S端输出的结果。
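Port C: configured through 207 to select the Co output of the lower-order 201 in the same level, forming the adder carry chain.
The output S then yields a 10-bit sum Pdt<9:0>. The top two bits of the result are sign extension and can be discarded, giving the final multiplication result Pdt<7:0> with an effective width of 8 bits. The low-order bits Pdt<3:0> of the result may be kept or dropped depending on the application.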
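The arithmetic behind the Booth-encoded multiplier can be checked with a short model. The sketch below is a minimal Python illustration of Booth recoding into partial products of 0, ±1 or ±2 times the multiplicand (the scheme the text calls Radix2, often referred to elsewhere as modified or radix-4 Booth recoding). It models only the arithmetic, not the cell array or its port wiring, and the function names are hypothetical.

```python
def booth_digits(x, n):
    """Recode an n-bit two's-complement multiplier into digits in {-2, -1, 0, 1, 2}."""
    def bit(i):
        # Python's >> on negative integers sign-extends, so high bits come for free.
        return (x >> i) & 1 if i >= 0 else 0

    # Each digit looks at three overlapping bits: x[i+1], x[i], x[i-1].
    return [bit(i - 1) + bit(i) - 2 * bit(i + 1) for i in range(0, n, 2)]


def booth_multiply(y, x, n=4):
    """Sum the partial products digit * y, each weighted by 4**k."""
    total = 0
    for k, digit in enumerate(booth_digits(x, n)):
        total += (digit * y) << (2 * k)   # digit * y is 0, +/-y or +/-2y
    return total
```

For a 4-bit multiplier only two partial products are summed instead of four, which is the reduction in partial products that the text attributes to Booth encoding.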
由 500和 506构成的乘法器,对于两个操作数即被乘数和乘数同时变化的乘法运算,其 性能受到配置信息产生方式的影响, 会有较大程度的降低, 致使其应用受到局限。 图 6则是 一个可实现两个输入同时变化的一般乘法器的实施例。仍以 4位的有符号乘法为例,该乘法 器与上述乘法器的主要不同是, 部分积不再根据乘数 X进行 Booth编码后通过 201中的 202 配置信息确定,而是将乘数 X直接由一组三态门控制的总线实现对部分积的选择, 图中只给 出了三态门对部分积选择的控制连接线。由于不进行 Booth编码, 因此将产生的部分积累加 得到的中间结果向右平移的位数较图 5的乘法器实施例少一位。而对于符号扩展的处理和相 应端口的输入则采用相同的方法。 三态门控制总线的设定方法是: 用乘数 X 的相应位通过 601、 602、 603和 604分别控制部分积为 1倍被乘数(可由随后逻辑求其相反数, 相反数只 可能出现在最后一个部分积)或强制为 "0 "。将由 201求得的两个中间结果, 向右平移与下 一级产生的部分积通过 201再相加, 以此类推, 得到最终乘积 Pdt。
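In addition, the optimal configuration can be chosen flexibly according to the type of multiplication, the data width, whether the operation is SIMD, and so on. Fig. 7 implements SISD and SIMD instructions with data paths of similar width, a concrete illustration of the invention's advantage and flexibility in reconfiguring the data path. Fig. 7(a) is an 8-bit multiplier with the same structure as Fig. 5; the product is 16 bits, and the data path used is 18 bits wide and 6 rows high. Fig. 7(b) partitions and reconfigures the data path so that, on a data path 20 bits wide and 8 rows high, four 4-bit multipliers 701, 702, 703 and 704 of the same structure are implemented, each 10 bits wide and 4 rows high, producing at the same time four 8-bit products that can be used in subsequent operations.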
图 8 是实现浮点乘法操作时数据通道中各功能模块划分的实施例。 浮点乘法分为三个 部分来实现: 符号位, 指数和尾数。浮点乘法所得积的符号位是由两个操作数符号位的逻辑 同或, 积的指数是两操作数指数的和, 而尾数的乘积就是积的尾数。
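In the figure the horizontal connections between 801, 802 and 803 are broken, and the three implement the logic for the sign bit, the exponent and the mantissa respectively. 801 is configured as XNOR logic and performs the sign operation; its result is passed on level by level until 802 and 803 produce their final results. 802 is configured as an adder and sums the two exponents; this result is also passed on level by level until 803 produces its final result. 803 is configured as a multiplier performing unsigned multiplication, which yields the product of the mantissas. Once the results of 801, 802 and 803 have all been produced and output, the floating-point product is obtained.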
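The precision of the mantissa of the floating-point product can be set as the user requires. Suppose 804 produces the multiplication result with the same width as the mantissa, whereas multiplying the two operands increases the width of the product. The mantissa of the product can be handled by various rounding schemes, including but not limited to truncation, rounding up and rounding to nearest. Such a scheme, for example rounding to nearest, can be implemented by configuring the 203 contained in the 201 cells of 805 located before adder 806; here it suffices to set the B-port input of these 201 cells to logic "1".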
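Fig. 9 is an embodiment of a multiply-accumulate unit. 901 is a multiplier with the structure of Fig. 5, and 902 is an adder with the structure of Fig. 3. Several multipliers and adders of the 901 and 902 structures are connected as shown. 903 to 906 are configuration control lines serving as the input for one operand of each 901 multiplier; this input is supplied through configuration. 907 to 910 are global buses serving as the input for the other operand of each 901 multiplier. 911, 912 and 913 are also global buses, used to pass the intermediate results of multiplication and addition. Together these implement the multiply-accumulate unit. Depending on the number of consecutive multiply-add operations, the multiplier width can be configured so that intermediate and final sums do not overflow, or do not overflow within the required range. Assuming the input operands of 903 to 910 are X, Y, Z, W, a, b, c and d respectively, the unit in Fig. 9 performs the four multiplications aX, bY, cZ and dW and the three additions that sum the four products in turn, and the final result output from 914 is:
aX + bY + cZ + dW
Note that the multiplication operands X, Y, Z and W supplied on the configuration control lines should be a relatively fixed set of numbers; otherwise the advantage of the reconfigurable data path of the invention is not realized.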
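With the multiply-accumulate unit, vector multiplication can be implemented. For example:
\[ \begin{bmatrix} a & b & c & d \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \\ W \end{bmatrix} = \begin{bmatrix} aX + bY + cZ + dW \end{bmatrix} \]
The expression above is the mathematical form of vector multiplication. Again let X, Y, Z and W be a relatively fixed set of numbers forming an infrequently changing vector; then, by configuration, the multiply-accumulate unit described above yields the result of the vector multiplication.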
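Multiplication of a matrix by a vector consists of several vector multiplications, as shown below:
\[ \begin{bmatrix} a_0 & b_0 & c_0 & d_0 \\ a_1 & b_1 & c_1 & d_1 \\ a_2 & b_2 & c_2 & d_2 \\ a_3 & b_3 & c_3 & d_3 \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \\ W \end{bmatrix} = \begin{bmatrix} a_0 X + b_0 Y + c_0 Z + d_0 W \\ a_1 X + b_1 Y + c_1 Z + d_1 W \\ a_2 X + b_2 Y + c_2 Z + d_2 W \\ a_3 X + b_3 Y + c_3 Z + d_3 W \end{bmatrix} \]
Each of the four elements of the final result is obtained with four multiplications and three additions. Following the vector multiplication method, the four results are computed in turn and output in order, which implements multiplication of a matrix by a vector. By extension, matrix multiplication can be completed in a similar way.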
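The decomposition of a matrix-vector product into repeated multiply-accumulate passes can be written out briefly. The following is a minimal Python sketch, assuming the relatively fixed vector (X, Y, Z, W) plays the role of the configured coefficients and each matrix row arrives as the changing operands; the names `mac_chain` and `mat_vec` are illustrative.

```python
def mac_chain(coefficients, operands):
    """Four multiplications followed by three additions, as in the multiply-accumulate unit."""
    acc = 0
    for k, v in zip(coefficients, operands):
        acc += k * v
    return acc


def mat_vec(matrix, fixed_vector):
    """Run one multiply-accumulate pass per matrix row and output the results in order."""
    return [mac_chain(fixed_vector, row) for row in matrix]


matrix = [
    [1, 2, 3, 4],
    [0, 1, 0, 1],
    [2, 0, 2, 0],
    [1, 1, 1, 1],
]
result = mat_vec(matrix, [1, 2, 3, 4])
```

Because the coefficients stay fixed across the four passes, only the row operands change between passes, which matches the text's requirement that X, Y, Z and W be relatively stable.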
图 10 (a)和图 10 (b)是移位器的实施例。两图分别表示出了 500与 506中 Co端输出与下 一级基本单元 201的 B端输入的连接关系。经过配置, 可以实现移位器功能, 端口配置方法 均为:
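Port A: configured to select the operand to be shifted.
Port B: set to logic "1" through configuration of 202.
Port C: set to logic "0" through configuration of 207.
Because the output port Co is connected to input port B of the basic cell 201 one position to the left and of the basic cell 201 one position to the right in the next level, configuring 202 in the next-level 201 to select the appropriate input suffices to implement the corresponding shift.
If the operation after the shift requires the operand on input port A, this level's basic cell 201 can be configured so that the inputs of ports A and C are both set to logic "0"; the output S of this level then still holds the shifted result. Since S is connected to input port A of the basic cell 201 in the level after that, configuring its 202 makes the A-port input of that level the required shifted result. During shifting, the highest and lowest bits can be configured so that "0" or "1" is filled into the high or low end of the result as needed.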
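Fig. 11(a) is an embodiment of the vertical data-flow basic unit. IN0 is the output of one of four rows of basic cells, and IN1 is a global signal input. A two-to-one selector (1112), controlled by configuration memory (1101), selects whether the basic cell output or the global signal input is passed downward. A two-to-one multiplexer (1120), controlled by configuration memory (1116), selects either the current output of selector (1112) or the value stored in register (1121) as output OUT1; configuration memory (1116) controls whether register (1121) latches the output of selector (1112). Meanwhile a two-to-one selector (1113), controlled by configuration memory (1105), selects whether the input (1114) produced by the rows of basic cells below or the output of selector (1112) is passed upward. A two-to-one multiplexer (1123), controlled by configuration memory (1117), selects either the current output of selector (1113) or the value stored in register (1122) as output OUT2; configuration memory (1117) controls whether register (1122) latches the output of selector (1113).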
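Fig. 11(b) is an embodiment of the bus structure. There are four groups of buses (1124-1127) that pass signals downward and one group (1123) that passes signals in the reverse direction. A four-to-one selector (1108) selects one of the four downward buses (1124-1127) as the global signal input of a row of basic cells (1140). The other inputs of the row (1140) come from the outputs of the previous row. The outputs of every four rows of basic cells are tied together through tri-state gates (1141) and serve as the IN0 input of the vertical data-flow basic unit (1104), through which signals are passed vertically.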
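Fig. 12(a) is an embodiment of the horizontal shift unit. The 4-bit inputs (1209, 1210, etc.) can be combined into one group of signals running from high to low order, or treated as separate groups of input signals. Corresponding bits of each group are tied together through tri-state gates (1205); among tri-state gates (1205) whose outputs are tied together, only one conducts at any moment. The tri-state gates (1205) are controlled by configuration memory (1206). A four-to-one selector (1207), controlled by configuration memory (1208), performs the shift. With the structure shown, an overall left (or right) circular shift by any number of bits can be implemented as required, or the input signals of a single group can be output to all group outputs (1211, 1212, etc.).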
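Fig. 12(b) is an embodiment of the reconfigurable universal digital data processing block. Data enters the block from above. The block consists of input/output units (1206), storage units (1213), horizontal shift units (1202) and logic units (1203). A logic unit (1203) is configured for different functions as needed and may be a row or an array. The horizontal shift unit (1202) shifts signals as required.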
图 13 ( a) 为一块尚未进行配置的可重构普适数字数据信号处理块。 一块可重构数字信 号处理块 (1301 ) 中包含复数个基本单元 (1302 )。
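Fig. 13(b) shows a reconfigurable universal digital data signal processing block configured for single instruction multiple data (SIMD) operation. The block (1301) is configured as a SIMD processing module performing four operations at once, with data input at 1315, 1316, 1317 and 1318 and results output at 1319, 1320, 1321 and 1322. Array 1322 is configured for multiplication, row 1323 for logical AND, row 1324 for subtraction, row 1325 for XOR, row 1326 for addition, array 1327 for right shift, arrays 1328, 1329, 1330 and 1331 for multiply-add, arrays 1332 and 1334 for saturation, and row 1333 for addition. Each row has a corresponding valid flag (1360~1372); if the valid flag of a row or array is invalid, that row or array is shut off to save power.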
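Fig. 13(c) shows a reconfigurable universal digital data signal processing block configured for serial operation. The input of the processing block (1340) is 1341; after the data is processed, the output of the current column is sent along path 1343 to the input of array 1345. Likewise, path 1349 makes the whole block a serially operating module whose output is 1351. The basic cells (1302) in the block are configured into processing rows and arrays with different functions: for example, row 1353 performs subtraction, row 1355 saturation, array 1357 multiply-add and array 1359 shifting. According to the bit width, the block (1340) is configured as three columns, each with a column of valid flags (1380, 1381, 1382). When a valid flag (1380, 1381, 1382) is invalid, the corresponding rows and arrays are shut off to save power.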
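Fig. 13(d) is a diagram of input and output routing in a configured reconfigurable universal digital data signal processing block. The input of the processing block (1390) is 1391; after processing by array 1392, the output 1393 is sent to arrays 1394 and 1395, and the outputs of the whole block are 1396 and 1397. Data passed downward (1335) enters the configured arrays or rows for processing; data passed upward (1336) travels over the return bus and is not processed.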
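Fig. 13(e) is a diagram of input and output routing in a configured reconfigurable universal digital data signal processing block. The processing block (1398) has two independent arrays, 1344 and 1346, with independent inputs and outputs. Array 1344 has input 1350 and output 1352; array 1346 has input 1354 and output 1356. As in Fig. 13(d), data passed downward enters the configured arrays or rows for processing, while data passed upward travels over the return bus and is not processed. In this embodiment the input 1354 first travels over the upward bus (1337) to reach the array and is then processed.
Fig. 14(a) is an embodiment of address translation in the reconfigurable memory of the invention. Here the reconfigurable memory (1401) consists of a plurality of words, each with a corresponding address. When the address input (1402) is binary 1001 and the memory (1401) is configured without mapping, the address input (1402) is decoded (1403) and reaches the word (1404) at binary address 1001. If the memory (1401) is configured with mapping and binary 1001 is mapped to binary 0010, the address input (1402) is decoded (1403), enters the mapping module, and after mapping reaches the word (1406) at the actual binary address 0010, thereby translating between different addresses.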
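Fig. 14(b) is an embodiment of the reconfigurable memory of the invention in use. Here the memory (1401) consists of a plurality of words and implements a memory whose write addressing and read addressing differ. Taking binary 1001 as an example, the write mapping maps it to binary 0010 and the read mapping maps it to binary 1010. When the address input (1402) is binary 1001 and write is asserted (1409), the write mapping (1407) operates and maps the address from the decoding module (1403) to the word at binary address 0010, so logical address 1001 writes the word at physical address 0010. When read is asserted (1412), the read mapping (1408) operates and maps the address from the decoding module (1403) to the word at binary address 1010, so logical address 1001 reads the word at physical address 1010. On this basis, pixel reordering (zigzag) in image compression algorithms and FIFOs are both easy to implement.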
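Separate write and read mappings, and their use for zigzag reordering, can be illustrated in software. This is a minimal Python sketch under assumptions: each mapping is a plain list from logical to physical address, the class name `RemapMemory` and the helper `zigzag_order` are hypothetical, and a 3x3 block is used only to keep the example small.

```python
class RemapMemory:
    """Word memory with independent logical-to-physical maps for writes and reads."""

    def __init__(self, size, write_map=None, read_map=None):
        self.words = [None] * size
        identity = list(range(size))
        self.write_map = list(write_map) if write_map is not None else identity
        self.read_map = list(read_map) if read_map is not None else identity

    def write(self, addr, value):
        self.words[self.write_map[addr]] = value   # write mapping applied

    def read(self, addr):
        return self.words[self.read_map[addr]]     # read mapping applied


def zigzag_order(n):
    """Physical (row-major) addresses of an n x n block visited in zigzag order."""
    order = []
    for s in range(2 * n - 1):                     # s indexes the anti-diagonals
        diag = [(r, s - r) for r in range(n) if 0 <= s - r < n]
        if s % 2 == 0:
            diag.reverse()                         # even diagonals run bottom-left to top-right
        order.extend(r * n + c for r, c in diag)
    return order


mem = RemapMemory(9, read_map=zigzag_order(3))
for address in range(9):
    mem.write(address, address * 10)               # written in row-major order
scanned = [mem.read(address) for address in range(9)]
```

Reading with consecutive logical addresses returns the block in zigzag order without moving any data, which is the reordering effect the embodiment describes.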
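The reconfigurable finite state machine of the invention can be configured as the finite state machine of Fig. 15(a). In state A (1501), input i(0) (1507) causes a transition to state B (1502). In state B (1502), input j(0) (1508) causes a transition to state A (1501); input j(1) (1509) keeps state B (1502) unchanged; input j(2) (1510) causes a transition to state C (1503); input j(3) (1511) causes a transition to state D (1504); input j(4) (1512) causes a transition to state E (1505); input j(5) (1513) causes a transition to state F (1506). In state C (1503), input k(2) (1514) causes a transition to state B (1502). For clarity, the other states and transition conditions are omitted from the figure.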
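The finite state machine of Fig. 15(a) can be implemented with the reconfigurable finite state machine of the invention; the configured machine is shown in Fig. 15(b). This embodiment consists of a randomly accessible memory (1515), a current state register (1521), reconfigurable multiplexers and reconfigurable random logic (1519). Each row of the memory (1515) has bit width/word length q, in which the widths of the qualification values (1516), transition values (1517) and output control signal values (1518) are fixed at r, s and t respectively, so q = r + s + t and the boundary flag memory can be omitted. When the value in the current state register (1521) represents state B (1502), the machine is in state B (1502), which corresponds to row B (1524) of the memory. The values stored in row B (1524) are output: the r bits go to the reconfigurable combinational logic (1519) as six groups of input state information, the s bits go to the next-state multiplexer (1520) as six groups of state transition information, and the t bits go to the output control multiplexer (1522) as six groups of output control signal information. The combinational logic (1519) operates on the r bits and the external input (1525) to produce a select signal (1526). In this embodiment the operation determines which of the six groups of input state information equals the external input (1525) and produces the corresponding select signal (1526), which selects the matching inputs of the next-state multiplexer (1520) and the output control multiplexer (1522). For example, when the external input is j(2), the select signal picks state C (1503) from the state table and out2 (1527) from the output table; state C (1503) is then written into the current state register (1521) and in the next cycle points to row C (1528) of the memory (1515), realizing the state transition and control signal output of the finite state machine.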
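In this embodiment the widths of the qualification values (1516), transition values (1517) and output control signals (1518) are fixed, so no boundary flag memory is needed. In other cases the widths can be adjusted dynamically; that is, the boundaries (1529) in the randomly accessible memory can change according to the values in the boundary flag memory.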
本发明所述可重构有限状态机在配置后可以支持多状态机交替并发。 在图 15 (c) 中, 可重构有限状态机被配置为三个状态机交替并发, 图 15 (c) 与图 15 (b) 相比, 增大了可 随机访问的存储器 (1515) 并增加了一个外部输入多路选择器 (1536)、 两个当前状态寄存 器 (1530)、 一个当前状态多路选择器 (1531)、 两个控制信号寄存器 (1535)。 外部输入多 路选择器(1536)根据配置选择当前的输入进入可重构组合逻辑中, 任意时刻只运行一个有 限状态机,其他两个有限状态机的当前值被存储在相应的当前状态寄存器中,新增的当前状 态多路选择器(1531)选择运行中有限状态机的当前状态寄存器中的值作为当前状态。 当切 换状态机时,将原运行中的有限状态机的状态保存在相应当前状态寄存器中,新增的当前状 态多路选择器(1531)选择即将运行的有限状态机相应的当前状态寄存器中的值作为当前状 态继续运行状态机, 即可完成状态机切换。三个状态机各状态的相应控制输出可以如图中控 制信号(1532、 1533、 1534) 分别输出到各处, 也可以根据需要复用控制线; 其中, 控制信 号寄存器 (1535) 可以是复数个也可以是单数个。 将本发明的支持多状态机交替并发的可重构有限状态机配置成对应与不同接口协议的 有限状态机分时复用, 并将同一套硬件接口端定义为多种接口协议中的对应输入输出端口, 即可实现输入输出端口 (I/O) 的可重构复用。 以四种不同协议的输出端口为例, 请参阅图 16。在本实施例中,不同协议的输出数据从输出寄存器(1601 )中传输到输出逻辑通道 ( 1602)。 输出逻辑通道(1602)根据协议要求不同, 可以由可重构普适数字信号处理块中的基本单元 通过配置构成, 也可以是可重构控制单元通过配置实现的随机逻辑。 输出逻辑通道 (1602) 可以是复数个也可以是单数个。 由可重构有限状态机配置而成的多状态并发状态机 (1603) 能实现所述四种不同协议对应的状态机,并能产生相应的控制信号( 1608)。控制信号( 1608) 中的一部分在输出逻辑通道(1602 )中控制输出寄存器(1601 )传输来的输出数据按特定规 则写入 FIFO ( 1604 )。 FIFO ( 1604) 中存储的即是相应端口对应四种协议各自的输出值。 本 发明提出的输入输出端口复用,可以是由同一个有限状态机控制不同的物理端口实现不同的 接口协议, 也可以由同一个有限状态机控制同一个 /组物理端口实现不同的接口协议。 在本 实施例中, 输出端口 (1606)是单独用于一种接口协议的物理端口, 输出端口 ( 1607 )是由 其他三种接口协议复用的物理端口,由多状态有限状态机( 1603 )通过端口复用选择器( 1605 ) 选择实现分时复用。在本发明中, 可以根据配置, 使对外端口全部都是单独用于特定接口协 议的物理端口,也可以部分是单独用于特定接口协议的物理端口、部分是多种协议复用的物 理端口, 还可以全部是多种协议复用的物理端口, 具有很强的灵活性, 适用与相同位宽或不 同位宽的各种接口协议。本实施例只对输出端口做了说明,输入端口及输入输出双向端口的 情况与输出端口类似。
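Fig. 17 is a typical video decompression flow chart. The invention can support video decompression for different video compression standards such as AVS, MPEG2 and H.264/AVC. Video decompression mainly comprises entropy decoding (1701), inverse quantization (1703), inverse transform (1705), intra prediction (1707), inter prediction (1709), motion compensation (1711) and deblocking filtering (1713).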
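The basic operation of inverse quantization (1703) in each standard is t_ij = (c_ij × A) << B, where A and B take different values in different standards. Different values can be passed to the arithmetic units through configuration to implement inverse quantization (1703) for different standards.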
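The inverse quantization formula can be exercised with a few lines of code. This is a minimal Python sketch, assuming the garbled shift symbol in the formula denotes a left shift and that A and B are scalar parameters supplied by configuration; the function name `dequantize` and the sample values are illustrative and not taken from any standard.

```python
def dequantize(block, a, b):
    """Apply t_ij = (c_ij * A) << B to every coefficient of a block."""
    return [[(c * a) << b for c in row] for row in block]


coefficients = [
    [1, -2],
    [3, 0],
]
scaled = dequantize(coefficients, a=5, b=2)
```

Only the two parameters change between standards, so the same multiply-and-shift data path can serve each of them once the values are loaded.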
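The inverse transform (1705) in each standard is implemented mainly with matrix multiplication together with addition and shifting, but the matrix sizes the standards need to process differ: AVS uses an 8x8 matrix and H.264/AVC a 4x4 matrix, so matrices of different sizes can be configured to suit each.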
各标准的帧内预测(1707)主要通过乘加及移位实现,且各标准的帧内预测的模式不同, 导致其运算需要的系数不同, 需要的乘法、加法的个数也不相同, 可以通过配置来传递不同 的系数以及进行乘法、 加法等运算的次数。
各标准的帧间预测 (1709), 运动补偿 (1711) 差异较大, 可以面对不同的视频压縮标 准进行不同的配置。
各标准的去块滤波 (1713) 中, 块的划分方式不同, 如 AVS中的 4x4块, H.264/AVC中 的 8x8块, 滤波的条件与所要做的处理也不相同, 也适合进行配置来实现。
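From the analysis above, the invention is highly flexible and can implement the decompression operations of different video compression standards.
Fig. 18 is an embodiment of valid flag configuration when configured as a multiplier. The basic cells (1801~1815) in the figure are similar in structure to the basic cell (201) of Fig. 2; parts unrelated to this embodiment are omitted from the figure. The row containing basic cells 1806, 1807, 1808 and 1809 performs partial-product accumulation and completes in one clock cycle; the row containing 1810 performs the final full addition of the multiplication and completes in another clock cycle. The propagation of the valid flag (1816) is configured according to the function to be implemented. In this embodiment, in basic cells 1801, 1802 and 1803 the valid flag (1816) is not registered (the register is omitted from the figure) and is selected directly through input 1 of the multiplexer (1817); in basic cells 1804 and 1805 the valid flag (1816) is stored in register (1818) and selected in the next clock cycle through input 0 of the multiplexer (1817). The select signal of the multiplexer (1817) is generated by configuration memory (1819). As the valid flag (1816) propagates, it is passed to the corresponding three-input AND gate (1821) to control whether the register (1822) in the basic cells used for computation (1806~1815) stores; the other two inputs of the three-input AND gate (1821) are the configuration information of the corresponding row's multiplexer (1817) and the clock signal (1823). In this embodiment, after configuration, the registers (1822) in basic cells 1806, 1807, 1808, 1811, 1812 and 1813 do not latch, and the multiplexer (1824) takes its input 1; the registers (1822) in basic cells 1809, 1810, 1814 and 1815 latch, and the multiplexer (1824) takes its input 0, that is, the output of register (1822). The three-input AND gates (1821) may form a dedicated column or may be configured from basic cells.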
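Fig. 19 is an embodiment implementing absolute value, saturation and compare-select operations. The tri-state gates (1903) conduct or present high impedance according to the required bit width: when the input data is N bits wide, the tri-state gate (1903) of bit N-1 conducts and those of the remaining bits (N-2, ..., 0) are in the high-impedance state. The tri-state gates (1903) are controlled by configuration memory (1904). When configured for absolute value, the first row of basic cells (1905) outputs the input data inverted plus one, and all two-to-one selectors (1902) in the second row of basic cells (1906) select, according to the highest-bit carry signal of the first row (1901, which can be configured as the sign bit of the input), either the original input data or the inverted-plus-one result of the previous row as output. When configured for saturation, the first row of basic cells (1905) performs addition or subtraction, and all two-to-one selectors (1902) in the second row (1906) select, according to the highest-bit carry signal of the first row (1901), either the output of the first row or the maximum or minimum limit value as output. When configured for compare-select, the first row of basic cells (1905) performs the comparison, and all two-to-one selectors (1902) in the second row (1906) select, according to the highest-bit carry signal of the first row (1901, the comparison result), input A or input B as output.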
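The two-row pattern above (a first row that computes, a second row of selectors steered by one flag) can be modeled compactly. The following is a minimal Python sketch under assumptions: values are N-bit words, saturation is shown for unsigned addition only, compare-select returns the larger unsigned input, and all function names are hypothetical.

```python
def abs_value(x, n=8):
    """Row 1 forms invert-plus-one; row 2 picks it when the sign bit is set."""
    mask = (1 << n) - 1
    negated = (~x + 1) & mask          # first row: invert and add one
    sign = (x >> (n - 1)) & 1          # select signal for the second row
    return negated if sign else x & mask


def saturating_add(x, y, n=8):
    """Row 1 adds; row 2 picks the maximum value when the carry-out is set."""
    total = x + y
    carry = total >> n                 # highest-bit carry of the first row
    return (1 << n) - 1 if carry else total


def compare_select_max(a, b, n=8):
    """Row 1 computes a - b as a + ~b + 1; the carry-out steers the selectors."""
    mask = (1 << n) - 1
    diff = a + ((~b) & mask) + 1
    carry = diff >> n                  # 1 means a >= b for unsigned operands
    return a if carry else b
```

In all three cases the second row is the same bank of two-to-one selectors; only the first-row configuration and the alternative input presented to the selectors differ.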
图 20是可重构随机逻辑单元的一个实施例。 配置存储器 (2015, 2016 ) 分别控制四选 一选择器(2005, 2006 )选择输入(2001-2004), 四选一选择器(2005, 2006 ) 的两位输出 (2007)作为四选一选择器 (2008)的控制信号选择配置存储器 (2009-2013) 中的某个值作为 该可重构随机逻辑单元的输出 (2014)。 通过将一行可重构随机逻辑单元的输出接到下一行 可重构随机逻辑单元的输入这种方式, 实现可重构随机逻辑。
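The behaviour of one reconfigurable random logic unit can be sketched as two input selectors followed by a selector over stored configuration values. This is a minimal Python illustration under assumptions: four stored values are used (one per combination of the two selected bits), and the function name `logic_unit` and its parameters are hypothetical.

```python
def logic_unit(inputs, sel0, sel1, stored_values):
    """Select two of the inputs, then use them to pick one stored configuration value."""
    bit0 = inputs[sel0]                # first four-to-one selector
    bit1 = inputs[sel1]                # second four-to-one selector
    return stored_values[(bit1 << 1) | bit0]


AND_CONFIG = [0, 0, 0, 1]              # stored values that make the unit an AND gate
XOR_CONFIG = [0, 1, 1, 0]              # stored values that make the unit an XOR gate

signals = [1, 0, 1, 1]
and_result = logic_unit(signals, 0, 1, AND_CONFIG)
xor_result = logic_unit(signals, 0, 1, XOR_CONFIG)
```

Changing only the stored values changes the function of the unit, and feeding the outputs of one row of units into the next row builds up larger random logic.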

Claims

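1. A reconfigurable data processing platform that can implement different functions by setting specific configuration memories (setting memory); the reconfigurable data processing platform is composed of one or more of some or all of the following modules:
(1) a reconfigurable universal digital data processing block, used for digital data processing;
(2) a reconfigurable memory, used for storage and data reordering;
(3) a reconfigurable control unit (reconfigurable controller), which can be used alone to generate control signals or together with other logic to implement configurable logic functions; the other logic may be the reconfigurable universal digital data processing block and the reconfigurable memory.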
2、 根据权利耍求 1所述的可重构数据处理平台, 其特征在于所述可重构普适数字数据 处理块, 大量实现一位 (bit ) 最基本运算的单数种或少量复数种基本单元 (cell ) 由互联 结构规整地 (regularly ) 连接构成, 整个普适数字信号处理块具有非各轴同性 (axi s asymmetrical )的预设连接, 能实现特定的功能, 并可以通过配置特定的配置存储器实现重 构, 以基本单元为最小单位拆分成单数个或复数个数字类数据处理器 (digi tal data processor ),每个数字类数据处理器根据相应配置实现不同位宽的数字类数据处理功能,所 述可重构普适数字数据处理块即可同时实现单数个或复数个相关或互不相关的算法;所述配 置存储器可以全部或部分位于所述可重构普适数字数据处理块内,也可以全部或部分位于所 述可重构普适数字数据处理块外;所述复数个数字类数据处理器可以各自独立,也可以部分 或全部有相互联系, 串行、 并行或串并混合地进行数字类信号的处理; 当所述数字类数据处 理器为独立的数字类数据处理器时,能通过纵向总线和 /或横向总线或旁路路径越过 /绕过其 他数字类数据处理器与外部联系; 所述数字类数据处理可以是算术、逻辑运算及算术、逻辑 运算的组合;所述算术运算包括任意位宽小于等于所述普适数字信号处理块总位宽的定点运 算和浮点运算; 所述数字类数据处理可以是数字信号处理。
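3. The reconfigurable data processing platform according to claim 2, characterized in that one basic cell can, through configuration, perform a one-bit logic operation and a one-bit addition; a row formed by basic cells that are configured and connected according to connection relations determined by the configuration information can perform multi-bit logic operations and specific multi-bit arithmetic operations that a basic cell cannot perform alone, which operations may be addition, subtraction and shifting; an array formed by basic cells that are configured and connected according to connection relations determined by the configuration information can perform specific multi-bit operations that a row cannot perform alone, which operations may be shifting, multiplication, multiply-add, multiply-subtract, saturation, extraction, absolute value and rounding.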
4、 根据权利要求 2所述的可重构数据处理平台, 其特征在于所述基本单元可以不使用 查找表(LUT) , 用同一套逻辑实现不同的功能; 所述功能可以是逻辑运算和加法运算; 所述 基本单元至少包含进位(carry )产生逻辑、 和 (sum)产生逻辑、 复数个配置存储器、 复数 个输入端口和复数个输出端口;所述输入端口全部或部分由输入多路选择器选择出送入进位 产生逻辑 /和产生逻辑的数据; 所述多路选择器的选择端控制信号来源于配置存储器; 所述 输入多路选择器至少包含一个进位输入多路选择器和一个用于根据高阶(hi gh radix )乘法 重编码结果选择输入的多路选择器;所述输出端口的数据全部或部分来源于带旁路功能的输 出延迟(delay )装置;所述带旁路功能的输出延迟装置可以是多路选择器和寄存器 /锁存器; 所述带旁路功能的输出延迟装置中的多路选择器用于从进位产生逻辑 /和产生逻辑的输出和 延迟后的输出中选择数据作为最终的输出, 选择端控制信号来源于配置存储器。
5、 根据权利要求 4所述的可重构数据处理平台, 其特征在于所述进位产生逻辑可以根 据配置产生全加器的进位输出,也可以根据配置产生逻辑运算的输出;所述和产生逻辑可以 根据配置产生全加器的和输出, 也可以根据配置产生逻辑运算的输出。
6、 根据权利要求 4所述的可重构数据处理平台, 其特征在于所述配置存储器用于存储 配置信息, 可以全部或部分位于所述基本单元中, 也可以全部或部分位于所述基本单元外; 当所述配置存储器全部或部分位于所述基本单元外时,所述配置存储器可以用于控制单数个 或复数个基本单元中的相应可重构部分。
7、 根据权利要求 4所述的可重构数据处理平台, 其特征在于所述基本单元中的进位输 入多路选择器的输入根据配置不同可以来自行内右侧相邻基本单元的输出端口,也可以来自 阵内上方特定基本单元的输出端口, 也可以是来自相应配置存储器的作系数用的配置信息, 还可以被赋值为逻辑 "0 "或逻辑 " 1 "; 所述用于根据高阶乘法重编码结果选择输入的多路 选择器的输入根据配置不同可以来自阵内上方特定基本单元的输出端口,也可以是被乘数及 其反码、左移后的被乘数及其反码或逻辑" 0 ", 也可以是来自相应配置存储器的作系数用的 配置信息;所述其他输入端多路选择器的输入根据配置不同可以来自操作数的数据输入,也 可以是来自相应配置存储器的作系数用的配置信息。
8、 根据权利耍求 4所述的可重构数据处理平台, 其特征在于所述基本单元通过配置、 相互连接, 可以构成以移位加方式实现的乘法器, 也可以构成高阶乘法器; 所述高阶乘法器 可以是阶二(radix- 2 )乘法器; 所述重编码器可以是布斯(booth )编码器; 所述重编码器 可以是独立存在于所述可重构数据处理平台中,供单数个或复数个乘法器使用,也可以不独 立存在而由所述可重构控制单元通过配置实现,还可以由所述可重构数据处理平台外部的逻 辑或处理器实现;所述重编码器的编码结果可以作为输入传输到所述构成乘法器的基本单元 中控制根据重编码结果选择乘法 ¾5入的多路选择器, 也可以作为系数 (coeffici ent ) 预先 存储在用于控制根据重编码结果选择乘法输入的多路选择器的配置存储器中。
9、 根据权利要求 2所述的可重构数据处理平台, 其特征在于所述可重构普适数字数据 处理块中的互联结构分为高速局部连接和全局总线两个层次。
10、根据权利要求 9所述的可重构数据处理平台,其特征在于所述高速局部连接用于邻 近基本单元间的高速连接; 所述高速局部连接是硬连接 (hard wired ) 的近距离固定连线, 用于大部分时延关键路径;通过对基本单元内输入多路选择器的配置可以重构基本单元间的 连接关系;所述连接关系可以是同一行内的加减法的进位关系和不同行间的乘法部分积传递 关系。
11、根据权利要求 9所述的可重构数据处理平台,其特征在于所述全局总线包括纵向连 接行或阵的用于传输数据的可重构纵向总线结构和横向的可重构横向移位结构;所述可重构 纵向总线结构用于数据 /数据流可重构普适数字数据处理块中的纵向传输; 所述可重构横向 移位结构可以根据配置将特定行或阵输出的数据移位后送往左侧、右侧或下一个行或阵,还 可以实现大跨度的数据移位操作和数据交叉换位操作。
12、根据权利要求 2所述的可重构数据处理平台,其特征在于所述根据配置信息重构而 成的实现特定功能的数字类数据处理器可以根据配置信息包含单数个或复数个有效位
( val id flag), 或不包含有效位; 所述单数个或复数个有效位可以用于控制是否运行对应 数字类数据处理器并标识运行结果的有效性,也可以用于控制是否运行对应数字类数据处理 器中部分或全部行或阵并标识运行结果的有效性;所述有效位用存储器实现,可以是寄存器、 锁存器和随机访问存储器;控制所述行或阵不运行的方法可以是关断相应时钟或关断相应电 源; 所述有效位也可以只用于标识有效性; 所述有效位也可以被旁路逻辑旁路而不起作用。
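13. The reconfigurable data processing platform according to claim 1, characterized in that the reconfigurable memory can be reconfigured by configuring specific configuration memories, can store data and provides data reordering; the bit width/word length of the reconfigurable memory may be fixed or variable; the configuration memory may be located wholly or partly inside the reconfigurable memory, or wholly or partly outside the reconfigurable memory.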
14、根据权利要求 1所述的可重构数据处理平台,其特征在于所述可重构控制单元包括 可重构随机逻辑和可重构有限状态机,可以通过配置特定的配置存储器实现重构,用基本部 件产生处理不同事务所需的控制信号及有限状态机;所述配置存储器可以全部或部分位于所 述可重构控制单元内, 也可以全部或部分位于所述可重构控制单元外。
15、 根据权利要求 14所述的可重构数据处理平台, 所述可重构随机逻辑包括可重构功 能单元和可重构连接;所述可重构功能单元通过配置能实现任意种逻辑功能;所述可重构连 接通过配置能实现复数个可重构功能单元的任意种连接;单数个或复数个可重构功能单元及 单数个或复数个可重构连接通过特定的配置, 即能实现随机逻辑的重构;所述可重构随机逻 辑还可以根据配置, 在由可重构功能单元和可重构连接构成的随机逻辑中插入寄存器 /锁存 器以保证时延要求。
16、 根据权利要求 14所述的可重构数据处理平台, 所述可重构有限状态机由可随机访 问的存储器 (randomly accessible memory ), 当前状态寄存器、 可重构多路选择器和可重 构随机逻辑构成; 所述可随机访问的存储器内存储有输入界定值(qual ifi cat i on ) 状态迁 移(transiti on )值和输出控制信号值; 所述可随机访问的存储器中的每一行存储对应的单 数组或复数组输入界定值、 单数组或复数组状态迁移值和单数组或复数组输出控制信号值; 所述可随机访问的存储器每一行中输入界定值、状态迁移值和输出控制信号值各自占的位宽 /字长是不变 /可变的; 所述可随机访问的存储器内每行中不同值间的边界信息是不变 /可变 的; 当所述可随机访问的存储器内每行中不同值间的边界信息可变时, 由边界标志存储器存 储每行相应的边界信息;所述当前状态寄存器用于存储当前状态值,实现下一状态到当前状 态的迁移;所述可重构多路选择器用于从复数组状态迁移值中选出下一状态, 并从复数组输 出控制信号值中选出满足条件的控制信号输出;所述可重构随机逻辑可以通过配置特定的配 置存储器实现重构,用于完成特定的随机逻辑功能;所述可重构有限状态机还可以包括由可 重构随机逻辑构成的计数器(counter ) , 以实现根据计数结果进行状态迁移的功能; 所述可 随机访问的存储器可以是可挥发的, 也可以是不可挥发的。
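17. The reconfigurable data processing platform according to claim 16, wherein the reconfigurable finite state machine can, after configuration, support alternating concurrent execution of multiple state machines (multi-thread).
18. The reconfigurable data processing platform according to claims 15, 16 and 17, characterized in that the reconfigurable control unit can implement reconfigurable logic functions together with other logic; the reconfigurable logic function may be a reconfigurable external input/output interface; the reconfigurable external input/output interface can, through configuration, implement one or more interface protocols at the same time with one set of hardware.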
19、根据权利要求 1所述的可重构数据处理平台,其特征在于用于配置可重构的模块的 配置信息可以存储在平台内, 也可以存储在平台外; 当配置信息存储在平台外时, 可以将所 述可重构数据处理平台视为存储器,用向存储器存储数据的方式将所有配置信息并行传输到 所述可重构数据处理平台中的各个配置存储器,或者用扫描链的方式将所有配置信息串行移 位输入到所述可重构数据处理平台中的各个配置存储器;还可以在所述可重构平台运行时将 所述配置信息传输到各个配置存储器, 实现不同功能间的动态切换。
20、 根据权利耍求 19所述的可重构数据处理平台, 其特征在于所述配置信息可以是未 经编码的原始配置信息, 也可以是经过编码后的信息; 所述编码的方式可以是加密和压縮; 对所述加密编码信息解密所用的密钥可以以硬件形式存在与所述可重构数据处理平台中,也 可以通过配置信息输入所述可重构数据处理平台。
21、根据权利耍求 1所述的可重构数据处理平台,其特征在于所述配置信息可以通过人 工制定, 也可以通过工具根据映射规则自动产生; 所述映射可以是硬件描述语言 (HDL) 到 配置信息的映射、计算机程序语言到配置信息的映射、计算机建模到配置信息的映射和算法 描述到配置信息的映射;在所述可重构数据处理平台中,可以预先实现常用的运算的配置信 息模板, 当使用自动工具根据映射规则自动产生配置信息时,所述自动工具可以直接调用所 述模板中的配置信息。
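22. The reconfigurable data processing platform according to claim 1, characterized in that the configuration memory may be volatile or non-volatile; the configuration may also be fixed directly when the reconfigurable data processing platform is designed and manufactured, so that the platform can implement only a specific function.
23. The reconfigurable data processing platform according to claim 1, characterized in that the reconfigurable data processing platform may further include extension modules to meet a wider range of needs; the extension modules may be: a random logic controller, an analog unit, a central processing unit, a digital signal processor (DSP), a packet header detector and a logic-zero count detector.
24. The reconfigurable data processing platform according to claim 1, characterized in that designs based on the reconfigurable universal digital signal processing block, the reconfigurable memory and the reconfigurable control unit, configured for different applications, do not require timing closure.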
25、根据权利耍求 1所述的可重构数据处理平台,其特征在于所述可重构数据处理平台 可以具备自测试能力,能够在加电工作的情况下不依赖于外部设备进行芯片的自测试; 当所 述可重构数据处理平台具备自测试能力时,可以将可重构普适数字数据处理块中特定的单数 个基本单元或行或阵,或者复数个基本单元或行或阵配置成比较器,对可重构普适数字数据 处理块中相应的复数组其他基本单元或行或阵及基本单元或行或阵的组合给予具有特定关 系的激励,并用所述比较器比较所述复数组其他行或阵及行或阵的组合的输出是否符合相应 的特定关系;所述激励可以来自所述可重构数据处理平台中的特定模块,也可以来自所述可 重构数据处理平台外部; 所述特定关系可以是相等、 相反、 互逆、 互补; 所述测试结果可以 被送到所述可重构数据处理平台外部, 也可以保存在所述可重构数据处理平台中的存储器 中;所述的自测试可以是在晶圆测试,封装后集成电路测试或者芯片使用时在系统启动时进 行测试; 也可以人为设定自测试条件及周期, 在工作期间定期进行自测试; 所述存储器可以 是挥发性的, 也可以是非挥发性的。
26、 根据权利要求 25所述的可重构数据处理平台, 其特征在于所述可重构数据处理平 台在具备自测试能力的前提下,可以具备自修复能力; 当所述测试结果保存在所述可重构数 据处理平台中的存储器中时,可以对失效基本单元或失效行或失效阵作标记,在对所述可重 构数据处理平台进行配置时,可以根据相应标记绕过失效基本单元或失效行或失效阵,使所 述可重构数据处理平台依然能正常工作,实现自修复;所述自修复可以是在晶圆测试后进行, 封装后集成电路测试后进行或者芯片使用时在系统启动时进行测试后进行;也可以人为设定 自测试自修复条件及周期, 在工作期间定期进行自测试后进行。
PCT/CN2010/000072 2009-01-21 2010-01-15 可重构数据处理平台 WO2010083723A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US13/187,841 US8468335B2 (en) 2009-01-21 2011-07-21 Reconfigurable system having plurality of basic function units with each unit having a plurality of multiplexers and other logics for performing at least one of a logic operation or arithmetic operation

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN200910045980.6 2009-01-21
CN200910045980.6A CN101782893B (zh) 2009-01-21 2009-01-21 可重构数据处理平台

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US13/187,841 Continuation US8468335B2 (en) 2009-01-21 2011-07-21 Reconfigurable system having plurality of basic function units with each unit having a plurality of multiplexers and other logics for performing at least one of a logic operation or arithmetic operation

Publications (1)

Publication Number Publication Date
WO2010083723A1 true WO2010083723A1 (zh) 2010-07-29

Family

ID=42355527

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2010/000072 WO2010083723A1 (zh) 2009-01-21 2010-01-15 可重构数据处理平台

Country Status (3)

Country Link
US (1) US8468335B2 (zh)
CN (1) CN101782893B (zh)
WO (1) WO2010083723A1 (zh)

Families Citing this family (56)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8539016B1 (en) 2010-02-09 2013-09-17 Altera Corporation QR decomposition in an integrated circuit device
US8539014B2 (en) 2010-03-25 2013-09-17 Altera Corporation Solving linear matrices in an integrated circuit device
US8577951B1 (en) * 2010-08-19 2013-11-05 Altera Corporation Matrix operations in an integrated circuit device
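CN102207927B (zh) * 2011-05-27 2016-01-13 清华大学 Data transmission method between dynamically reconfigurable processors, processor, and system
CN102236632B (zh) * 2011-05-27 2013-05-22 清华大学 Method for hierarchically describing configuration information of a dynamically reconfigurable processor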
US8862835B2 (en) * 2011-06-14 2014-10-14 Texas Instruments Incorporated Multi-port register file with an input pipelined architecture and asynchronous read data forwarding
US8862836B2 (en) * 2011-06-14 2014-10-14 Texas Instruments Incorporated Multi-port register file with an input pipelined architecture with asynchronous reads and localized feedback
US8812576B1 (en) 2011-09-12 2014-08-19 Altera Corporation QR decomposition in an integrated circuit device
US8949298B1 (en) 2011-09-16 2015-02-03 Altera Corporation Computing floating-point polynomials in an integrated circuit device
US9053045B1 (en) 2011-09-16 2015-06-09 Altera Corporation Computing floating-point polynomials in an integrated circuit device
US8762443B1 (en) 2011-11-15 2014-06-24 Altera Corporation Matrix operations in an integrated circuit device
WO2013103382A1 (en) 2012-01-06 2013-07-11 Ge Intelligent Platforms, Inc. Apparatus and method for creating and presenting control logic
US9239786B2 (en) 2012-01-18 2016-01-19 Samsung Electronics Co., Ltd. Reconfigurable storage device
US20130346985A1 (en) * 2012-06-20 2013-12-26 Microsoft Corporation Managing use of a field programmable gate array by multiple processes in an operating system
US9230091B2 (en) 2012-06-20 2016-01-05 Microsoft Technology Licensing, Llc Managing use of a field programmable gate array with isolated components
US9424019B2 (en) 2012-06-20 2016-08-23 Microsoft Technology Licensing, Llc Updating hardware libraries for use by applications on a computer system with an FPGA coprocessor
US9298438B2 (en) 2012-06-20 2016-03-29 Microsoft Technology Licensing, Llc Profiling application code to identify code portions for FPGA implementation
US8898480B2 (en) 2012-06-20 2014-11-25 Microsoft Corporation Managing use of a field programmable gate array with reprogammable cryptographic operations
US9207909B1 (en) 2012-11-26 2015-12-08 Altera Corporation Polynomial calculations optimized for programmable integrated circuit device structures
US9189200B1 (en) 2013-03-14 2015-11-17 Altera Corporation Multiple-precision processing block in a programmable integrated circuit device
US9015643B2 (en) 2013-03-15 2015-04-21 Nvidia Corporation System, method, and computer program product for applying a callback function to data values
US20140278328A1 (en) * 2013-03-15 2014-09-18 Nvidia Corporation System, method, and computer program product for constructing a data flow and identifying a construct
US9323502B2 (en) 2013-03-15 2016-04-26 Nvidia Corporation System, method, and computer program product for altering a line of code
US9015646B2 (en) 2013-04-10 2015-04-21 Nvidia Corporation System, method, and computer program product for translating a hardware language into a source database
US9171115B2 (en) 2013-04-10 2015-10-27 Nvidia Corporation System, method, and computer program product for translating a common hardware database into a logic code model
US9021408B2 (en) 2013-04-10 2015-04-28 Nvidia Corporation System, method, and computer program product for translating a source database into a common hardware database
US9348795B1 (en) 2013-07-03 2016-05-24 Altera Corporation Programmable device using fixed and configurable logic to implement floating-point rounding
US9535936B2 (en) * 2013-09-05 2017-01-03 The Boeing Company Correlation of maximum configuration data sets
US20160228066A1 (en) * 2013-11-20 2016-08-11 Intel Corporation Binarized frequency transform
CN104699575B (zh) * 2013-12-09 2018-04-20 华为技术有限公司 FPGA chip and FPGA system
US9501591B2 (en) * 2013-12-09 2016-11-22 International Business Machines Corporation Dynamically modifiable component model
CN103914429B (zh) * 2014-04-18 2016-11-23 东南大学 Multi-mode data transmission interconnect for coarse-grained dynamically reconfigurable arrays
US10838719B2 (en) * 2014-11-14 2020-11-17 Marvell Asia Pte, LTD Carry chain for SIMD operations
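CN106294278B (zh) * 2016-08-01 2019-03-12 东南大学 Adaptive hardware pre-configuration controller for dynamically reconfigurable array computing systems
JP6786955B2 (ja) * 2016-08-25 2020-11-18 富士ゼロックス株式会社 Reconfigurable logic circuit
CN114168526B (zh) * 2017-03-14 2024-01-12 珠海市芯动力科技有限公司 Reconfigurable parallel processing
CN106970775A (zh) * 2017-03-27 2017-07-21 南京大学 Reconfigurable general-purpose fixed-point and floating-point adder
CN106951211B (zh) * 2017-03-27 2019-10-18 南京大学 Reconfigurable general-purpose fixed-point and floating-point multiplier
KR102477516B1 (ko) * 2017-05-17 2022-12-14 구글 엘엘씨 Performing matrix multiplication in hardware
CN107368459B (zh) * 2017-06-24 2021-01-22 中国人民解放军信息工程大学 Scheduling method for a reconfigurable computing structure based on matrix multiplication of arbitrary dimensions
CN107301032A (zh) * 2017-07-02 2017-10-27 郑州云海信息技术有限公司 Digital signal processing method and device
CN107679012A (zh) * 2017-09-27 2018-02-09 清华大学无锡应用技术研究院 Method and device for configuring a reconfigurable processing system
CN107832844A (zh) * 2017-10-30 2018-03-23 上海寒武纪信息科技有限公司 Information processing method and related products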
GB201719355D0 (en) * 2017-11-22 2018-01-03 Univ Leuven Kath Reconfigerable logic circuit
US10657292B2 (en) * 2017-12-18 2020-05-19 Xilinx, Inc. Security for programmable devices in a data center
CN109932953A (zh) * 2017-12-19 2019-06-25 陈新 Intelligent supercomputing programmable controller
US10831702B2 (en) * 2018-09-20 2020-11-10 Ceva D.S.P. Ltd. Efficient utilization of systolic arrays in computational processing
WO2020215124A1 (en) * 2019-04-26 2020-10-29 The University Of Sydney An improved hardware primitive for implementations of deep neural networks
CN110531954A (zh) * 2019-08-30 2019-12-03 上海寒武纪信息科技有限公司 Multiplier, data processing method, chip, and electronic device
CN110737628A (zh) 2019-10-17 2020-01-31 辰芯科技有限公司 Reconfigurable processor and reconfigurable processor system
EP4085354A4 (en) * 2019-12-30 2024-03-13 Star Ally International Ltd PROCESSOR FOR CONFIGURABLE PARALLEL CALCULATIONS
CN111429944B (zh) * 2020-04-17 2023-06-02 北京百瑞互联技术有限公司 Codec development and test optimization method and system
WO2023069074A1 (en) * 2021-10-19 2023-04-27 Picoai Limited Method and apparatus of programmable quantization/de-quantization engine
CN114489475A (zh) * 2021-12-01 2022-05-13 阿里巴巴(中国)有限公司 Distributed storage system and data storage method thereof
US11853596B2 (en) 2021-12-06 2023-12-26 Taiwan Semiconductor Manufacturing Company, Ltd. Data sequencing circuit and method
CN117235007B (zh) * 2023-11-13 2024-01-26 中科芯磁科技(珠海)有限责任公司 Interconnect module control method, interconnect module, and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
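CN1900927A (zh) * 2006-07-14 2007-01-24 中国电子科技集团公司第三十八研究所 Reconfigurable digital signal processor
CN101169866A (zh) * 2006-10-26 2008-04-30 朱明程 Self-reconfigurable on-chip multimedia processing system and self-reconfiguration implementation method thereof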

Family Cites Families (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5956518A (en) * 1996-04-11 1999-09-21 Massachusetts Institute Of Technology Intermediate-grain reconfigurable processing device
US6622233B1 (en) * 1999-03-31 2003-09-16 Star Bridge Systems, Inc. Hypercomputer
US6959316B2 (en) * 2001-02-01 2005-10-25 Nokia Mobile Phones Limited Dynamically configurable processor
US7089436B2 (en) * 2001-02-05 2006-08-08 Morpho Technologies Power saving method and arrangement for a configurable processor array
US7444531B2 (en) * 2001-03-05 2008-10-28 Pact Xpp Technologies Ag Methods and devices for treating and processing data
US7571303B2 (en) * 2002-10-16 2009-08-04 Akya (Holdings) Limited Reconfigurable integrated circuit
CN100412801C (zh) * 2003-09-30 2008-08-20 三洋电机株式会社 Processing device provided with a reconfigurable circuit, and integrated circuit device
US7584345B2 (en) * 2003-10-30 2009-09-01 International Business Machines Corporation System for using FPGA technology with a microprocessor for reconfigurable, instruction level hardware acceleration
JP2006011825A (ja) * 2004-06-25 2006-01-12 Fujitsu Ltd Reconfigurable arithmetic device and semiconductor device
US20100122105A1 (en) * 2005-04-28 2010-05-13 The University Court Of The University Of Edinburgh Reconfigurable instruction cell array
JP4720436B2 (ja) * 2005-11-01 2011-07-13 株式会社日立製作所 Reconfigurable processor or device
GB0605349D0 (en) * 2006-03-17 2006-04-26 Imec Inter Uni Micro Electr Reconfigurable multi-processing coarse-grain array
US20080320293A1 (en) * 2007-01-31 2008-12-25 Broadcom Corporation Configurable processing core
CN100550002C (zh) * 2007-08-23 2009-10-14 顾士平 Configuration and communication control device for a dynamically reconfigurable instruction processor
WO2009035185A1 (en) * 2007-09-11 2009-03-19 Core Logic Inc. Reconfigurable array processor for floating-point operations
CN201163400Y (zh) * 2008-03-12 2008-12-10 山东泉清通信有限责任公司 Novel reconfigurable circuit

Also Published As

Publication number Publication date
US8468335B2 (en) 2013-06-18
CN101782893B (zh) 2014-12-24
US20120191967A1 (en) 2012-07-26
CN101782893A (zh) 2010-07-21

Similar Documents

Publication Publication Date Title
WO2010083723A1 (zh) Reconfigurable data processing platform
US7340562B2 (en) Cache for instruction set architecture
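KR100948512B1 (ko) Floating-point unit-processing element (FPU-PE) structure supporting floating-point operations, reconfigurable array processor (RAP) including the FPU-PE structure, and multimedia platform including the RAP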
US7721069B2 (en) Low power, high performance, heterogeneous, scalable processor architecture
US8078833B2 (en) Microprocessor with highly configurable pipeline and executional unit internal hierarchal structures, optimizable for different types of computational functions
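JPH11296345A (ja) Processor
KR20060056855A (ko) Processor
CN110716707A (zh) Prefix network-directed addition
JPH04128982A (ja) Processor element, processing unit, processor, and arithmetic processing method thereof
KR20200139178A (ko) Data processing engine tile architecture for an integrated circuit
CN113468102A (zh) Mixed-granularity computing circuit module and computing system
JP3558119B2 (ja) Information processing system, method for forming circuit information of a programmable logic circuit, and method for reconfiguring a programmable logic circuit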
Furuta et al. Spatial-temporal mapping of real applications on a dynamically reconfigurable logic engine (DRLE) LSI
US10747531B1 (en) Core for a data processing engine in an integrated circuit
Hazarika et al. High performance multiplierless serial pipelined VLSI architecture for real-valued FFT
US11061673B1 (en) Data selection network for a data processing engine in an integrated circuit
JP5633303B2 (ja) Reconfigurable LSI
Denholm et al. Maximising Parallel Memory Access for Low Latency FPGA Designs
WO2005038644A1 (ja) Data processing device
Yang et al. Configuration approaches to enhance computing efficiency of coarse-grained reconfigurable array
JP2004234407A (ja) Data processing device
Samanth et al. Design and Implementation of 32-bit Functional Unit for RISC architecture applications
Zhang et al. Super K: A Superscalar CRYSTALS KYBER Processor Based on Efficient Arithmetic Array
Lin et al. High performance architecture for unified forward and inverse transform of HEVC
Chen et al. VLSI architecture of the reconfigurable computing engine for digital signal processing applications

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 10733196

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 10733196

Country of ref document: EP

Kind code of ref document: A1