CN203241983U

CN203241983U - Device for carrying out signal processing operation

Info

Publication number: CN203241983U
Application number: CN 201220352023
Authority: CN
Inventors: 朱鹏飞; 孙红霞; 吴永强; E·圭代蒂
Original assignee: STMicroelectronics Beijing R&D Co Ltd; STMicroelectronics SRL
Current assignee: STMicroelectronics Beijing R&D Co Ltd; STMicroelectronics SRL
Priority date: 2012-07-11
Filing date: 2012-07-11
Publication date: 2013-10-16
Anticipated expiration: 2022-07-11

Abstract

The utility model discloses a device for performing signal processing operations. The device comprises a system memory storage unit, an address generator unit, a register memory array, and a multiply-accumulate (MAC) execution unit. The address generator unit is functionally connected to the system memory storage unit and receives data from it and writes data to it. The register memory array is functionally connected to the address generator, receives data from it and writes values to it, and stores its data using a register file system. The MAC execution unit is functionally connected to the register file system; it reads from and writes to the register memory array, multiplies data values in pairs, accumulates the products, and writes the sum to a location in the register memory array. The register file system is organized hierarchically: individual register memory locations are paired into corresponding paired register units, and paired register units are in turn paired into corresponding grouped register units. The address generator unit places values from the system memory storage unit into the registers.

Description

Device for performing signal processing operations

Related application data

This application is related to [Attorney Docket No. 11-BJ-0647], "Modified Balanced Throughput Data-Path Architecture for Special Correlation Applications", which is incorporated herein by reference in its entirety to the fullest extent permitted by law.

Technical field

The utility model described herein relates to system architectures and devices for implementing digital signal processing (DSP) operations. More specifically, but not exclusively, it relates to systems and devices for implementing DSP operations that involve multiply-accumulate (MAC) calculations, such as finite impulse response (FIR) filtering, finite Fourier transforms, convolution, correlation, and other DSP operations. Other scientific fields, such as numerical simulation in physics, also use MAC operations.

Background

In signal processing, and in digital signal processing in particular, many required operations take the form of a finite impulse response (FIR) filter, also known as a weighted average. In this well-known operation, a finite set of values h(k), k = 0, ..., N-1 (also called filter coefficients or tap weights), is combined with the values x(k) of an input data sequence according to the rule

$$y(n) = \sum_{k=0}^{N-1} h(k)\, x(n-k)$$

to create the output sequence values y(n). Because each increment of n by 1 shifts the selected set of input values by 1, this process is also called a sliding-window sum. To compute each y(n), the paired coefficients and input values are first multiplied and the products are then summed; this process is called multiply-accumulate (MAC).
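
The sliding-window sum can be written as a pair of nested loops with one MAC per inner iteration. The following is a minimal Python sketch, not an implementation taken from the utility model; the function name `fir_filter` and the convention that x(m) = 0 for m < 0 are assumptions made for illustration.

```python
def fir_filter(h, x):
    """Direct-form FIR: y(n) = sum over k of h(k) * x(n - k).

    Assumption for this sketch: x(m) is treated as 0 for m < 0.
    """
    y = []
    for n in range(len(x)):
        acc = 0  # accumulator, cleared for each output sample
        for k in range(len(h)):
            if n - k >= 0:
                acc += h[k] * x[n - k]  # one multiply-accumulate (MAC)
        y.append(acc)
    return y
```

For h = [1, 2, 3] and x = [1, 0, 0, 2], the function returns [1, 2, 3, 2].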

FIR operations are widely used in signal processing to select desired frequencies, remove noise, and detect radar signals, among other applications. As the form of the equation shows, the FIR filtering operation is well suited to implementation on computer hardware. In such an implementation, the filter coefficients are loaded into a dedicated memory array; then, for each value y(n), the corresponding portion of the input is loaded into a second memory array, and the MAC operation is performed pair by pair on the aligned coefficients and inputs.

Although FIR operations can be, and often are, implemented in software on a general-purpose computer, many signal processing applications require very fast computation of FIR operations. These cases often call for a dedicated implementation on special-purpose digital hardware such as a digital signal processor (DSP), on a reconfigurable platform such as a field-programmable gate array (FPGA), or on an application-specific integrated circuit (ASIC). At this level, the details of the hardware implementation (such as how values are represented and stored internally, their data types, and the data bus size) become important for achieving very high-speed FIR operation. The goal of a hardware-efficient implementation is for one MAC operation to occur in every cycle. Achieving an even higher MAC rate is especially valuable.

Figure 1 shows a conventional method and system known in the art for implementing fast FIR operations. Signal data or coefficients are moved from the system's memory through an address generator (AG) and stored in fast-access memory locations of the system called the register file (Reg file). In each cycle, two values are moved from the Reg file into the MAC unit, their product is computed and added to the accumulated value, and the result is written back to the accumulator register location.

For normal operation to continue, the amount of data read into the register file must balance the amount of data consumed by the MAC unit. In addition, the data values entering the MAC must be complete: if access to a data value needed by the MAC is delayed, the MAC must wait one or more cycles until it obtains the complete data values for the multiply-accumulate calculation. Such a pause is called a bubble cycle. It represents an inefficiency in the overall operation of the system. Preventing such inefficiency is one overall objective of the utility model. Another objective is to achieve a rate of more than one MAC operation per cycle.

Summary of the utility model

The embodiments of the utility model disclosed herein implement a new form of balanced-throughput data-path architecture that can overcome the problem of data memory misalignment and can be extended to implementations that perform more than one MAC operation per cycle. Figure 3 shows the new architecture. Data (including the inputs and coefficients for the MAC operations) are stored in the system's large memory storage device, which is often random access memory and is referred to herein as system memory. As various values from system memory are needed for the FIR calculation, the AG moves them from system memory into the architecture's register memory file system, which comprises the memory locations that the MAC execution unit can access quickly.

According to one aspect of the utility model, a device for performing signal processing operations is disclosed, characterized in that it comprises: a system memory storage unit; an address generator unit, functionally connected to the system memory storage unit and operable to receive data from and write data to the system memory storage unit over a data bus having a multiple data width; a register memory array, functionally connected to the address generator and operable to receive data from and write values to the address generator, wherein the data in the register memory array are stored using a register file system; and a multiply-accumulate execution unit, functionally connected to the register file system and operable to read from and write to the register memory array, to multiply data values in pairs and add the products, and to write the sum to a location in the register memory array; wherein the register file system is organized in a hierarchical scheme for the individual register memory locations, in which individual register memory locations are paired into corresponding paired register units and paired register units are paired into corresponding grouped register units; and wherein the address generator unit uses a misaligned address placement system to place values from the system memory storage unit into the registers by aligning any misaligned data address with the midpoint of a grouped register.

One key element of an example embodiment of the utility model is the hierarchical structure used for the register memory file system. This feature, called the grouped register file (GRF) system, organizes the registers into three levels. The first level is the base level of individual register locations. The second level organizes the registers into register pairs. The third level organizes the paired registers into grouped registers, each of which comprises two paired registers and therefore four individual registers.

The hierarchy and referencing scheme of the GRF system is used by the next feature of the embodiment, the misaligned address placement (MAP) system, which is implemented by a modified version of the address generator (AG) unit. The modified AG loads values from system memory into the registers according to a two-phase process described in detail below, so as to fill each grouped register completely. In addition, the specific loading order helps the overall system achieve one or more MACs per cycle.

The third feature of the example embodiment is the use of parallel processing in the MAC execution unit. Because many of the operations to be performed on the data are multiply-accumulate operations, it is advantageous to configure the MAC to receive multiple pairs of data and coefficients and to operate on them simultaneously in each cycle. The term for this form of processing is single instruction, multiple data (SIMD). Whatever degree of parallelism it uses, the MAC execution unit writes the value of the MAC operation back to the register memory system after the multiply-accumulate process.

As described in detail below, the combination of these features allows the system throughput, both to and from the register system, to remain balanced. In addition, bubble cycles caused by memory misalignment can be overcome by using the MAP and the modified AG. Finally, a higher MAC rate can be achieved.

The foregoing and other features, utilities, and advantages of the utility model will be apparent from the following more detailed description of embodiments of the utility model as illustrated in the accompanying drawings.

Brief description of the drawings

The detailed description refers to the accompanying drawings. In the drawings, the digit or digits to the left of the two low-order digits of a reference number identify the figure in which that reference number first appears. The same reference numbers are used throughout the drawings to refer to similar features and components.

Fig. 1 shows a prior-art balanced-throughput data-path architecture.

Fig. 2 shows the internal details of a prior-art address generator (AG) and how it accesses memory.

Fig. 3 shows the utility model's modification of the prior-art balanced-throughput data-path architecture.

Fig. 4 shows the modification made to the AG in the utility model.

Fig. 5 shows the grouped register file organization scheme of the utility model.

Fig. 6 shows the two-phase process of loading data into a grouped register under an aligned memory layout.

Fig. 7 shows one grouped register (four registers in total) and the misaligned address placement of values from memory, with the misaligned address aligned to the middle of the grouped register (between paired registers pr1 and pr0).

Fig. 8 shows the first step of an example consecutive data load operation, in which data values A and B are loaded from memory into grouped register g0 (four registers in total) using the right-hand mode of the grouping and misaligned address placement process.

Fig. 9 shows the second step of the example consecutive data load operation, in which data values C and D are loaded from memory into the same grouped register g0 using the left-hand mode of the grouping and misaligned address placement.

Detailed description of the embodiments

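Commonly used abbreviations are listed here:

AG: address generator
ASIC: application-specific integrated circuit
DSP: digital signal processing (or digital signal processor)
FIR: finite impulse response
FPGA: field-programmable gate array
GR: grouped register
GRF: grouped register file
MAC: multiply-accumulate
MAP: misaligned address placement
PR: paired register
SIMD: single instruction, multiple data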

In this document, the word "exemplary" is used to mean "serving as an example, instance, or illustration" and is not to be construed as limiting. Any embodiment or implementation of the subject matter of the utility model described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.

It is well known that many digital signal processing operations, FIR filters in particular, use sliding-window operations, in which the set of output values is created by summing the products of the input values and the coefficients or tap weights while the set of input values is shifted one position at a time. For example, the FIR filter has the form

$$y(n) = \sum_{k=0}^{N-1} h(k)\, x(n-k)$$

and the finite Fourier transform is

$$X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk}$$

where

$$W_N = e^{-2\pi i / N}$$

For applications that must compute such formulas quickly, it is clear that the multiply and accumulate operations must be performed rapidly. The utility model discloses various embodiments for performing such MAC operations quickly.
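
Both formulas reduce to repeated MAC operations. As a hedged illustration, the Python sketch below evaluates a finite Fourier transform with one complex MAC per inner iteration. It assumes the common convention W_N = exp(-2*pi*i/N), which the text here does not spell out, and the function name `dft` is hypothetical.

```python
import cmath


def dft(x):
    """Finite Fourier transform computed as a series of MAC operations.

    Assumed convention: X(k) = sum over n of x(n) * W**(n*k),
    with W = exp(-2*pi*i/N) and N = len(x).
    """
    n_points = len(x)
    w = cmath.exp(-2j * cmath.pi / n_points)
    out = []
    for k in range(n_points):
        acc = 0j  # complex accumulator for output bin k
        for n in range(n_points):
            acc += x[n] * w ** (n * k)  # one complex multiply-accumulate
        out.append(acc)
    return out
```

A constant input such as [1, 1, 1, 1] yields a first bin close to 4 and remaining bins close to 0, up to floating-point rounding.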

Figure 1 shows a known architecture for implementing FIR filtering in digital circuits, referred to as a balanced-throughput data-path architecture. It can be implemented on a dedicated DSP chip, an FPGA, or an ASIC. It comprises four main elements: a large system memory 101, an address generator (AG) 103, a register file 104 (Reg file), and a MAC execution unit 105. The system memory often comprises random access memory and stores large numbers of input and output data values and, where necessary, the filter coefficients used by the FIR. The Reg file unit comprises an array of memory locations called registers, which generally allow faster access by the processing elements of the system. The AG is an addressing system, often implemented in circuitry, that is responsible for moving the required data between the system memory and the Reg file. The AG receives values from, and writes values to, the system memory over a data bus 102. Finally, the MAC unit 105 comprises the circuitry needed to multiply two values and add the product to an accumulated value. The accumulated value (Accum) stored in the Reg file may be larger in byte size than the data or coefficients (D/C) in order to prevent arithmetic overflow. As is now common in the art, the MAC unit is capable of performing a MAC operation in one cycle.
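
To illustrate why the accumulated value may need more bytes than the data or coefficients, the sketch below gives a conservative (sufficient, not necessarily minimal) accumulator width for summing products of signed fixed-point values. The function name `accumulator_bits` and the choice of signed two's-complement values are assumptions for illustration; the utility model does not specify a number format.

```python
def accumulator_bits(data_bits, num_terms):
    """Sufficient signed accumulator width for num_terms products.

    Each product of two signed data_bits-wide values fits in 2 * data_bits
    bits; summing num_terms of them needs ceil(log2(num_terms)) guard bits.
    """
    if data_bits < 1 or num_terms < 1:
        raise ValueError("data_bits and num_terms must be positive")
    product_bits = 2 * data_bits
    guard_bits = (num_terms - 1).bit_length()  # equals ceil(log2(num_terms))
    return product_bits + guard_bits
```

For example, 64 taps of 16-bit data and coefficients give 32 product bits plus 6 guard bits, so a 38-bit accumulator is sufficient under these assumptions.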

Under ideal operating conditions, in which the goal is one MAC operation per cycle, the system must move the same amount of data from system memory to the Reg file as it moves from the Reg file to the MAC, and must move the result back to the accumulator register location. This data throughput balance is needed to prevent overflow of the Reg file and to ensure that the MAC execution unit is fully utilized.

In this known architecture, the Reg file can have a structure with three read ports and two write ports, so that in each cycle two data and/or coefficient values (D/C in Fig. 1) are moved from the Reg file into the MAC unit together with the current accumulated value for the MAC operation. At the same time, the AG moves two new data or coefficient values from system memory into the Reg file through one write port 106, and when the MAC operation finishes, the MAC execution unit moves the updated accumulated value through the other write port of the Reg file back to the Reg file location from which the accumulated value came.

For this architecture to work ideally, the two new data or coefficient values must be accessible from system memory and moved within one cycle. In addition, the data memory address used by the AG must be aligned with a memory block of the memory so that the two data values can be moved over the data bus in one cycle.

If, however, the memory address of the complete pair of coefficient and/or data values is not aligned with a block of system memory (that is, the address points to a byte between the boundaries of a system memory access block), only part of the required pair can be moved over the bus in one cycle, and the system must wait until the next cycle to finish moving the data. This is called memory misalignment; it forces a bubble cycle in the MAC unit so that the complete pair of values can be moved into the Reg file locations.
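
The cost of misalignment can be made concrete by counting how many memory blocks a pair of values touches. The sketch below is illustrative only: it assumes a memory that delivers one block per cycle, and the 8-byte pair and 8-byte block sizes are assumed defaults, not figures taken from this passage. The name `bus_cycles` is hypothetical.

```python
def bus_cycles(address, pair_bytes=8, block_bytes=8):
    """Number of block transfers needed to fetch pair_bytes bytes at address.

    Assumes the memory returns exactly one block_bytes-wide block per cycle.
    """
    first_block = address // block_bytes
    last_block = (address + pair_bytes - 1) // block_bytes
    return last_block - first_block + 1
```

With these defaults, `bus_cycles(16)` is 1 because the pair sits inside one block, while `bus_cycles(18)` is 2 because the pair straddles a block boundary; the second transfer corresponds to the bubble cycle described above.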

One known way of handling memory misalignment is to double the AG and give the system memory both dual address ports and dual output ports. This is shown in Fig. 2. When a data and/or coefficient value is stored across a memory block boundary, the AG must generate two addresses to access it (a start address 201 and an incremented address generated by an incrementer unit 202). This requires two address ports to be available on the system memory. In addition, the system memory needs two ports through which to output the memory blocks containing the value. In the AG, a selector and combiner unit 203 assembles the data value and moves it to the Reg file. Although functional, this approach requires more circuit area and power when implemented in digital hardware.
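
The prior-art scheme can be modeled in a few lines. This is a behavioral sketch under stated assumptions, not a description of the circuit in Fig. 2: the memory is modeled as a Python `bytes` object, the start address is taken to be the address of the block containing the first requested byte, and the name `dual_port_fetch` is hypothetical.

```python
def dual_port_fetch(memory, address, block_bytes=8):
    """Fetch block_bytes bytes at any address by reading two blocks at once."""
    start = address - address % block_bytes  # start address (first block)
    incremented = start + block_bytes        # address from the incrementer
    # Two address ports and two output ports: both blocks arrive together.
    first = memory[start:start + block_bytes]
    second = memory[incremented:incremented + block_bytes]
    # Selector and combiner: pick the requested bytes out of the two blocks.
    offset = address - start
    return (first + second)[offset:offset + block_bytes]
```

With `memory = bytes(range(32))`, `dual_port_fetch(memory, 10)` returns bytes 10 through 17 in a single step, at the price of the second address port and second output port.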

The embodiment of the utility model shown in Fig. 3 discloses a different architecture for achieving balanced data throughput. This architecture can reduce or eliminate the occurrence of bubble cycles, requires no additional ports on the system memory, and can be extended to perform more than one MAC operation per clock cycle. The embodiment comprises a main memory system 301, which typically comprises RAM, and a modified address generator (AG) 303 that accesses the main memory system to move values (signal data values or coefficients) to and from a register memory array system. The AG can be implemented with only one address adder. The register memory array is organized as a GRF 304 that can be directly accessed and written by a multiply-accumulate processor (MAC) 305. The memory data bus 302 can be double width or, in other cases, quadruple width or any positive power of 2 in width, where width here refers to the size in bytes of each register memory location.

One difference from the prior art is that, in one embodiment, the MAC unit can perform more than one MAC operation in one cycle by using single instruction, multiple data (SIMD) processing.
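
The effect of SIMD processing on cycle count can be sketched as follows. This model assumes that the products formed in one cycle are reduced into a single accumulator; hardware could equally keep one accumulator per lane, and the text does not say which. The names `fir_output_simd` and `lanes` are hypothetical.

```python
def fir_output_simd(h, window, lanes=2):
    """Compute one FIR output with `lanes` MAC operations per modeled cycle.

    window[k] holds the input value paired with coefficient h[k].
    Returns the output value and the number of cycles used.
    """
    if len(h) != len(window):
        raise ValueError("h and window must have the same length")
    acc, cycles = 0, 0
    for start in range(0, len(h), lanes):
        coeffs = h[start:start + lanes]
        data = window[start:start + lanes]
        acc += sum(c * d for c, d in zip(coeffs, data))  # parallel MACs
        cycles += 1
    return acc, cycles
```

For four taps, `fir_output_simd([1, 2, 3, 4], [4, 3, 2, 1], lanes=2)` returns (20, 2): the same result as four sequential MACs, in half the modeled cycles.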

In addition, the GRF used for the register memory array 304 applies a hierarchical organization scheme to the individual register memory locations. In one embodiment, this is a three-level data addressing and access scheme comprising a base level of individual registers, a second level in which pairs of individual register memory locations are combined into units called paired registers (PR), and a third level in which two PRs are combined into a unit called a grouped register (GR). Fig. 5 shows, from left to right, an example in which eight registers have individual addresses (schematically labeled r0 to r7), how pairs of these registers are combined into four PRs (schematically labeled p0 to p3), and finally how these four PRs are grouped, in alternating modes, into two grouped registers (schematically labeled g0 and g1). In the embodiment shown in Fig. 5, each PR has the odd-indexed register on the left and the even-indexed register on the right.

There are two modes for organizing PRs into a GR. In the left-hand mode, the even-indexed PR is placed on the left and the odd-indexed PR on the right. In the right-hand mode, the odd-indexed PR is placed on the left and the even-indexed PR on the right.
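
The three-level naming scheme and the two modes can be sketched with plain lists, read left to right. The function names below are hypothetical; the register orderings follow the pairing and mode rules stated above.

```python
def paired_registers(num_regs=8):
    """Pair registers so that each PR holds [odd-indexed, even-indexed]."""
    return {f"p{i}": [f"r{2 * i + 1}", f"r{2 * i}"]
            for i in range(num_regs // 2)}


def grouped_register(prs, group_index, mode):
    """Return the left-to-right register order of one GR in the given mode."""
    even_pr = prs[f"p{2 * group_index}"]
    odd_pr = prs[f"p{2 * group_index + 1}"]
    if mode == "left":   # even-indexed PR on the left
        return even_pr + odd_pr
    if mode == "right":  # odd-indexed PR on the left
        return odd_pr + even_pr
    raise ValueError("mode must be 'left' or 'right'")
```

With eight registers, g0 is [r1, r0, r3, r2] in the left-hand mode and [r3, r2, r1, r0] in the right-hand mode, which matches the orderings recited in claim 5.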

With this hierarchical register organization scheme, in one embodiment the modified AG 303 can use the misaligned address placement (MAP) process to move values to and from the memory system.

As an example of the modified AG operating with the MAP and the GRF system, suppose that each register memory location is 32 bits (4 bytes) wide. Suppose also that the AG accesses double-width blocks of 64 bits (8 bytes) from system memory over a double-width data bus. A system memory address supplied to the AG is aligned when it is a multiple of 4. Expressed in binary, an aligned address has its two least significant bits both equal to 0.
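
The alignment test in this example amounts to checking the two least significant address bits. A minimal sketch, with the hypothetical name `is_aligned` and the 4-byte granularity of the example as the default:

```python
def is_aligned(address, align_bytes=4):
    """True when the low log2(align_bytes) bits of the address are all 0.

    align_bytes must be a power of 2; the default of 4 follows the example.
    """
    return (address & (align_bytes - 1)) == 0
```

For instance, address 0b10100 (20) is aligned because its two lowest bits are 0, whereas 0b10110 (22) is not.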

If no memory address misalignment is detected in an instruction, the values from system memory can be stored in one PR of a GR. A second data block from system memory can then be stored in the other PR of the GR. This is illustrated in Fig. 6.

If, however, a memory address misalignment is detected (in this example, when the address supplied to the AG is not a multiple of 4), the exemplary embodiment creates an aligned address by forcing the appropriate number of least significant bits of the address to 0. The 8-byte double-width block of values to be loaded is determined from the aligned address. In addition, the AG assigns an alignment point based on the misalignment pattern of the address. For example, if the address is misaligned at byte 2 (of bytes 0 to 7), the alignment point is the point between byte 1 and byte 2 of the data block being moved. The alignment point is aligned with the midpoint of the target grouped register so that, as shown in Fig. 7, bytes 0 and 1 are placed to the right of the midpoint of the GR and bytes 2 to 7 are placed to its left. The data bytes are then loaded into the target GR as shown in the figure. Note that only two of the four available bytes are filled in each of registers r1 and r3. Note also that in Fig. 5 the GR labeled g0 uses the right-hand mode of the GRF system.
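
The byte 2 example can be traced with a small model of one grouped register. This is a sketch under assumptions: the figure is not reproduced here, so the byte order inside each register (byte 0 of the block furthest to the right) is assumed, and the names `map_load`, `RIGHT_HAND`, and `align_point` are hypothetical.

```python
REG_BYTES = 4
RIGHT_HAND = ["r3", "r2", "r1", "r0"]  # g0 in right-hand mode, left to right


def map_load(block, align_point, order, reg_bytes=REG_BYTES):
    """Place one 8-byte block so that align_point sits at the GR midpoint.

    block[b] is byte b of the block read from the aligned address.
    Returns {register: [slots, left to right]}; None marks an empty slot.
    """
    midpoint = 2 * reg_bytes  # boundary between the two PRs
    regs = {name: [None] * reg_bytes for name in order}
    for b, value in enumerate(block):
        pos = midpoint - 1 + align_point - b  # slot counted from the left
        regs[order[pos // reg_bytes]][pos % reg_bytes] = value
    return regs
```

Calling `map_load(list(range(8)), 2, RIGHT_HAND)` puts bytes 2 to 5 in r2, bytes 6 and 7 in the right half of r3, and bytes 0 and 1 in the left half of r1, leaving r0 empty. Registers r1 and r3 are each half filled, as noted above.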

For consecutive load operations, if the exemplary right-hand mode load operation of the preceding paragraph has been used for one load, then, as shown in Figs. 8 and 9, the next 8-byte block is loaded into the same GR in the next iteration using the left-hand mode. Fig. 8 shows the same right-hand mode process shown in Fig. 5. Fig. 9 shows how accessing the registers of g0 in the left-hand mode allows the next 8 bytes, C and D, to be loaded into the remaining portion of g0.
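
The way the two modes complement each other can be checked with a short simulation. As before, this is an illustrative sketch: the slot numbering and the byte order within registers are assumptions, and the names `place` and `load_group` are hypothetical. The internal assertion verifies that the second load never overwrites a slot filled by the first.

```python
REG_BYTES = 4
RIGHT_HAND = ["r3", "r2", "r1", "r0"]  # first load
LEFT_HAND = ["r1", "r0", "r3", "r2"]   # second load, same grouped register


def place(regs, block, align_point, order):
    """Write one 8-byte block with align_point at the GR midpoint."""
    midpoint = 2 * REG_BYTES
    for b, value in enumerate(block):
        pos = midpoint - 1 + align_point - b  # slot counted from the left
        name, slot = order[pos // REG_BYTES], pos % REG_BYTES
        assert regs[name][slot] is None, "slot already occupied"
        regs[name][slot] = value


def load_group(first, second, align_point):
    """Load two consecutive blocks: right-hand mode, then left-hand mode."""
    regs = {name: [None] * REG_BYTES for name in RIGHT_HAND}
    place(regs, first, align_point, RIGHT_HAND)
    place(regs, second, align_point, LEFT_HAND)
    return regs
```

With blocks holding bytes 0 to 7 and 8 to 15 and an alignment point of 2, the second load completes r1 and r3 and fills r0, so all 16 slots of the grouped register are occupied after two loads. In this model, r3 ends up as [9, 8, 7, 6], that is, four consecutive bytes that straddle the block boundary.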

Because the MAP moves the two values to be multiplied into the register array locations, the MAC execution unit can, in one cycle, access the two values and the accumulated value, perform the multiply-accumulate operation, and write back the updated accumulated value.

An embodiment of this architecture can achieve more than one MAC operation per cycle when the data path 302 is double width, so that pairs of both coefficient and/or data values are loaded, and the MAC unit is structured for single instruction, multiple data (SIMD) operation. In one exemplary configuration, the MAC unit is structured for a positive integer K of MAC operations per cycle, the size M of each data value to be multiplied is a positive power of 2, and the data path from the memory storage unit to the register memory array has width 2*M*K.
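
The width rule 2*M*K is simple arithmetic: each MAC consumes two operands of size M, and K MACs run per cycle. The sketch below assumes M is expressed in bytes, which this paragraph does not state explicitly; the name `data_path_bytes` is hypothetical.

```python
def data_path_bytes(value_bytes, macs_per_cycle):
    """Data-path width 2*M*K that keeps K MACs per cycle supplied.

    value_bytes is M (assumed to be in bytes and a power of 2);
    macs_per_cycle is K, a positive integer.
    """
    if value_bytes < 1 or value_bytes & (value_bytes - 1):
        raise ValueError("value_bytes must be a power of 2")
    if macs_per_cycle < 1:
        raise ValueError("macs_per_cycle must be a positive integer")
    return 2 * value_bytes * macs_per_cycle
```

With 4-byte values, K = 1 gives 8 bytes (the double-width bus of the earlier example) and K = 2 gives 16 bytes (quadruple width).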

Presently preferred embodiments of the utility model and many of their improvements have been described with a certain degree of particularity. It should be understood that this description is given by way of example and that the utility model is defined by the scope of the claims. Other embodiments within the scope of the claims will be apparent to those of ordinary skill in the art.

Claims

1. A device for performing signal processing operations, characterized in that it comprises:

a system memory storage unit;

an address generator unit, functionally connected to the system memory storage unit and operable to receive data from and write data to the system memory storage unit over a data bus having a multiple data width;

a register memory array, functionally connected to the address generator and operable to receive data from and write values to the address generator, wherein the data in the register memory array are stored using a register file system;

a multiply-accumulate execution unit, functionally connected to the register file system and operable to read from and write to the register memory array, to multiply data values in pairs and add the products, and to write the sum to a location in the register memory array;

wherein the register file system is organized in a hierarchical scheme for the individual register memory locations, in which individual register memory locations are paired into corresponding paired register units and paired register units are paired into corresponding grouped register units; and

wherein the address generator unit uses a misaligned address placement system to place values from the system memory storage unit into the registers by aligning any misaligned data address with the midpoint of a grouped register.

2. The device according to claim 1, characterized in that the multiple width of the data bus from the system memory to the address generator is a positive power of 2 times the size in bytes of a register memory location.

3. The device according to claim 1, characterized in that the address generator has one address adder.

4. The device according to claim 1, characterized in that the address generator accesses the system memory storage unit through a single port.

5. The device according to claim 1, characterized in that the hierarchical organization scheme organizes eight register locations into two grouped register units according to a left-hand mode or a right-hand mode; wherein the left-hand mode arranges registers r0 to r3 into GR0 in the order [r1, r0, r3, r2] and registers r4 to r7 into GR1 in the order [r5, r4, r7, r6]; and wherein the right-hand mode arranges registers r0 to r3 into GR0 in the order [r3, r2, r1, r0] and registers r4 to r7 into GR1 in the order [r7, r6, r5, r4].

6. The device according to claim 5, characterized in that the address generator is configured to move data into the registers of a grouped register by moving a data block whose size in bytes is twice the size in bytes of a standard register, wherein the address generator is further configured to determine an alignment point among the bytes of the data block moved from system memory, and the address generator is further configured to align the alignment point with the midpoint of the grouped register into which the data are to be moved and to load the data byte by byte into the corresponding positions of the grouped register.

7. The device according to claim 6, characterized in that, for an aligned memory address, the alignment point of the data block moved from system memory is the end of byte 0 opposite the end of byte 0 that is adjacent to byte 1.

8. The device according to claim 6, characterized in that, for a misaligned memory address, the alignment point of the data block moved from system memory is the end of the byte numbered by the misaligned address that is adjacent to the next lower byte number, and wherein the address generator accesses the memory unit by forcing the misaligned address into alignment with the memory block.

9. The device according to claim 6, characterized in that the address generator is further configured to move two data blocks into a grouped register by first loading the first data block, according to the process of claim 6, with the grouped register configured in left-hand order and then loading the second data block with the grouped register configured in right-hand order, the size in bytes of each data block being twice the size in bytes of a standard register.

10. The device according to claim 6, characterized in that the address generator is further configured to move a plurality of pairs of data blocks successively into corresponding grouped registers by applying the process of claim 6 to each pair of data blocks and its associated target grouped register, and by determining the alignment point for each pair of data blocks, so that two data blocks are moved into the registers of one grouped register.

11. The device according to claim 1, characterized in that the multiply-accumulator is configured for single instruction, multiple data operation.

12. The device according to claim 1, characterized in that the multiply-accumulator is configured for a positive integer K of multiply-accumulate operations per cycle; wherein the size M of the data values to be multiplied is a positive power of 2; and wherein the data path from the memory storage unit to the register memory array has width 2*M*K.