CN202084032U

CN202084032U - IP (Intellectual Property) core for a two-dimensional (2D) IDCT (Inverse Discrete Cosine Transform) distributed arithmetic algorithm based on SOPC (System on a Programmable Chip) technology

Info

Publication number: CN202084032U
Application number: CN2011200806185U
Authority: CN
Inventors: 付扬; 邓超; 郭培源
Original assignee: Beijing Technology and Business University
Current assignee: Beijing Technology and Business University
Priority date: 2011-03-24
Filing date: 2011-03-24
Publication date: 2011-12-21
Anticipated expiration: 2021-03-24

Abstract

The utility model discloses an IP (Intellectual Property) core for a two-dimensional (2D) IDCT (Inverse Discrete Cosine Transform) distributed arithmetic algorithm based on SOPC (System on a Programmable Chip) technology. The IP core comprises an Avalon bus read module, a controller module, a 2D IDCT module, an Avalon bus write module, an Avalon bus and a control register. Control data are written into the control register. The controller module directs the Avalon bus read module to read the operation address from the control register and to read the data to be processed into an input buffer over the Avalon bus; after the 2D IDCT module has processed the data, the Avalon bus write module writes the result back to the original address. The hardware module operates at high speed, so the system clock frequency can be lowered while high-quality decoding is still achieved, which reduces power consumption. The software module is flexible and extensible, which gives the decoder good compatibility. Because the computation is concentrated in the hardware accelerator, the computational load on the CPU (Central Processing Unit) is reduced, so the CPU can support more upper-layer applications while balancing decoding speed, functionality, flexibility, cost and development cycle.

Description

IP core for a two-dimensional IDCT distributed arithmetic algorithm based on SOPC technology

Technical field

The utility model relates to the technical field of image decoding, and in particular to an IP core for a two-dimensional IDCT distributed arithmetic algorithm based on SOPC technology.

Background technology

In recent years, with the rapid development of semiconductor technology, the design performance and cost-effectiveness of modern high-density FPGA (Field-Programmable Gate Array) devices have become fully competitive with ASICs. Against this background, the U.S. company Altera proposed a new technology, the System on a Programmable Chip (SOPC), in 2000, and released the corresponding development software, Quartus II, at the same time.

An SOPC system is a special kind of embedded system. SOPC technology implements as large and complete an electronic system as possible in a single FPGA, including an embedded processor system, interface system, DSP system, digital communication system, memory circuits and so on; in essence, it integrates more modules into a PLD. It is a system on a chip (SoC) and at the same time a programmable system: its design is flexible, it can be trimmed, extended and upgraded, and both its software and its hardware are in-system programmable.

SOPC technology mainly comprises hardware/software co-design, IP core reuse, and the integrated analysis and verification of modules and of the interfaces between modules.

Hardware/software co-design emphasizes the concurrency of, and mutual feedback between, software and hardware development. It overcomes a drawback of traditional methods, in which designing software and hardware separately makes the overall system outcome unpredictable. The traditional approach designs the hardware first and then designs the software according to the algorithm. In deep-submicron design the cost of hardware is very high; if an error is found after the design is complete, correcting it costs a great deal of manpower, material resources and time, and the design cycle is lengthened. Compared with the traditional approach, FPGA-based hardware/software co-design is a highly effective design method. Co-design establishes the mutual constraints between system software and hardware: the software design must take the hardware structure of the chip into account, and the chip structure in turn requires coordinated software and hardware design, so that the whole system is optimized during co-design, the design cycle is greatly shortened and design efficiency is improved. In an SOPC system, a function implemented in hardware needs a certain number of FPGA logic elements and some execution time, whereas a function implemented in software needs a certain amount of memory and processor time. A software implementation occupies no hardware resources but takes longer to execute; a hardware implementation executes faster but occupies hardware logic resources. A sound hardware/software partition therefore balances FPGA logic-element usage against execution time. The basic principle is that high-speed, low-power functions are implemented in hardware; functions that come in many variants and small volumes are implemented in software; and the processor is combined with dedicated hardware to raise processing speed and reduce power consumption.

The IP reuse concept of SOPC technology has become widely accepted as a mainstream design approach. An SOPC chip integrates a complex system, which gives it a fairly complex structure; designing the chip from scratch would obviously cost a great deal of manpower and material resources. In addition, the life cycle of electronic products keeps shortening, which requires chip designs to be completed in ever shorter cycles. To speed up chip design, designers instantiate existing IC circuits as modules in the SOPC chip design, which simplifies the design, shortens design time and improves design efficiency. An IP module is a pre-designed, verified integrated circuit, device or component with a high level of integration and a complete function, such as an MPU, DSP, DRAM or Flash module. Building a system is a complex process; with IP modules, the designer can concentrate on the overall system without having to consider the correctness and performance of each module. Besides shortening chip design time, reusing IP modules greatly reduces design and manufacturing costs and improves reliability.

In recent years, research on video coding and decoding has made great progress. In particular, international organizations such as the International Organization for Standardization (ISO) and the International Telecommunication Union (ITU) have issued a series of international standards, which have greatly promoted the development and widespread application of video codec technology.

As the performance of embedded systems continues to improve, implementing image and video decoding with SOPC technology can substantially improve decoding performance, and its hardware/software co-design approach offers distinct advantages for video decoding.

The discrete cosine transform (DCT) and its inverse (IDCT) are widely used in image and video compression and decompression. The DCT removes correlation between data and concentrates the energy of an image, making the data easier to compress; it is the core of most current image and video coding standards, such as JPEG, the H.26x series and the MPEG-x series. In image and video decoding systems, the IDCT is used to reconstruct the data. The IDCT is an important part of decoding: the commonly used 8 × 8 two-dimensional inverse discrete cosine transform (2D IDCT) is computationally intensive, accounting for about 40% of the whole decoding computation, and it directly affects the real-time performance of image and video decoding systems. Studying the implementation of the two-dimensional IDCT is therefore particularly important in an SOPC video decoding system built around the Nios II processor.

Summary of the utility model

The purpose of the utility model is to solve the above problems by providing an IP core for a two-dimensional IDCT distributed arithmetic algorithm based on SOPC technology. 1) The hardware module of the device operates at high speed, so the system clock frequency can be lowered while real-time, high-quality decoding is still achieved, which greatly reduces power consumption; 2) the flexibility and extensibility of the software module give the decoder good compatibility and make it easy to modify and add new functions; 3) the computation is concentrated in the hardware accelerator, which greatly relieves the computational burden on the CPU, so that the CPU can support more upper-layer applications while meeting the requirements of decoding speed, power consumption, flexibility, cost and development cycle.

To achieve the above purpose, the technical scheme adopted by the utility model is an IP core for a two-dimensional IDCT distributed arithmetic algorithm based on SOPC technology. The IP core has an Avalon bus read module, a controller module, a 2D IDCT module, an Avalon bus write module, an Avalon bus and a control register. Control data are written into the control register. The controller module directs the Avalon bus read module to read the operation address from the control register and to read the data to be processed into the input buffer over the Avalon bus; after processing by the 2D IDCT module, the Avalon bus write module writes the result back to the original address.
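The read, transform and write-back sequence described above can be illustrated with a small software model. This is only a sketch under stated assumptions: the names `run_ip_core` and `operation_address`, the flat Python list standing in for memory, and the block length of 64 words (one 8 × 8 block) are hypothetical and are not taken from the utility model; real Avalon bus transfers are not modeled.

```python
# Software model of the controller sequence (all names are hypothetical).

def run_ip_core(memory, control_register, transform_block):
    """Process one 8x8 block in place, following the controller's sequence."""
    # The CPU has previously written the operation address to the control register.
    base = control_register["operation_address"]
    # Read module: fetch one block (assumed 64 words) into the input buffer.
    input_buffer = memory[base:base + 64]
    # 2D IDCT module: represented here by a callable supplied by the caller.
    result = transform_block(input_buffer)
    # Write module: write the result back to the original address.
    memory[base:base + 64] = result
    return memory


# Usage with a placeholder transform (negation) instead of a real IDCT.
memory = list(range(128))
run_ip_core(memory, {"operation_address": 64}, lambda block: [-v for v in block])
```

Writing the result back to the address it was read from means the software side only needs to pass one address per block, which matches the single operation address held in the control register.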

The 2D IDCT module has a 1D IDCT module, a serial-to-parallel converter, a multiplexer, a parallel-to-serial converter, a transpose memory and a controller. After the serial-to-parallel conversion buffer module has received a group of data, it outputs them simultaneously to the multiplexer as one row of data; the 1D IDCT module computes the 8-point inverse transform of each row and outputs it to the transpose memory, which feeds the multiplexer again; the 1D IDCT module then computes the 8-point inverse transform of each row of the transposed data and outputs it to the parallel-to-serial converter. The controller controls the whole process.

The 1D IDCT module has a shift register, 8 shift accumulators and a post-processing module. The shift register takes 13-bit input data and outputs 8 bits of data to the 8 shift accumulators; the 8 shift accumulators output 14-bit data, which the post-processing module extends to 16-bit precision before output.

The shift accumulator module is built from 4-input adders.

The IP core of the utility model first exploits the row-column separability of the 2D IDCT to decompose it into two 1D IDCTs: a 1D IDCT is first applied to all rows and then to all columns, and the final result is the 2D IDCT. This decomposition has several benefits. First, it reduces the number of operations and the implementation complexity. Second, it makes the computation very regular, which favors both software and hardware implementation. In hardware, the same 1D IDCT core can be reused, which saves hardware resources. The key to implementing the 2D IDCT in hardware is therefore the 1D IDCT. Many fast algorithms exist for the 1D DCT/IDCT, such as the Chen, Wang, Lee and Loeffler algorithms. These fast algorithms are mostly used for software implementations of the 1D DCT/IDCT and are not well suited to hardware, mainly because they do not lend themselves to parallel execution in hardware and require large multipliers, which occupy many hardware resources and are slow. In this design, starting from the equations simplified with the Chen algorithm, the 1D IDCT hardware uses distributed arithmetic (DA) to implement the multiplications and uses offset binary coding (OBC) to reduce the size of its lookup table (LUT). The DA algorithm stores in advance all the results that the vector inner product can produce for the input data, so that when a particular inner-product result is needed it can be obtained by looking it up in the LUT. This overcomes the drawbacks of conventional serial algorithms (cumbersome computation, heavy computational load and complex hardware circuits), greatly improves system performance and speeds up operation.

An interface and a control register group conforming to the Avalon bus standard are then designed to form the two-dimensional IDCT IP core, which is added to the SOPC video decoder to perform the two-dimensional IDCT function.

The complete technical scheme is discussed under five aspects: the IDCT algorithm, the one-dimensional IDCT hardware design, the two-dimensional IDCT hardware design, the design of the two-dimensional IDCT IP core with an Avalon bus interface, and the synthesis and testing of the two-dimensional IDCT IP core. The details are as follows:

(1) Decomposition of the two-dimensional IDCT

The 8 × 8 two-dimensional IDCT is defined as follows:

f_{i,j} = \sum_{x=0}^{7} \sum_{y=0}^{7} \frac{C(x) C(y)}{4} F_{x,y} \cos\left(\frac{(2i+1) x \pi}{16}\right) \cos\left(\frac{(2j+1) y \pi}{16}\right) \qquad (1)

where F_{x,y} are the coefficients after the DCT and f_{i,j} are the original data; C(n) = 2^{-1/2} when n = 0, and C(n) = 1 when n ≠ 0.

Computing the 8 × 8 two-dimensional IDCT directly requires a very large number of operations, so the row-column separability of the 2D IDCT is used to decompose it into two 1D IDCTs: a 1D IDCT is first applied to all rows and then to all columns, and the final result is the 2D IDCT.
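The row-column decomposition can be checked numerically. The following is a minimal Python sketch, not part of the utility model; the function names `idct_1d`, `idct_2d_direct` and `idct_2d_separable` are illustrative assumptions, and floating-point arithmetic is used instead of the fixed-point hardware data path.

```python
import math

def c(n):
    """Normalization factor C(n): 2^(-1/2) for n = 0, otherwise 1."""
    return 1 / math.sqrt(2) if n == 0 else 1.0

def idct_1d(F):
    """8-point 1D IDCT of one row of coefficients."""
    return [sum(c(x) / 2 * F[x] * math.cos((2 * i + 1) * x * math.pi / 16)
                for x in range(8)) for i in range(8)]

def idct_2d_direct(F):
    """8x8 2D IDCT evaluated directly from the double sum in equation (1)."""
    return [[sum(c(x) * c(y) / 4 * F[x][y]
                 * math.cos((2 * i + 1) * x * math.pi / 16)
                 * math.cos((2 * j + 1) * y * math.pi / 16)
                 for x in range(8) for y in range(8))
             for j in range(8)] for i in range(8)]

def transpose(m):
    return [list(row) for row in zip(*m)]

def idct_2d_separable(F):
    """1D IDCT on every row, transpose, 1D IDCT on every row, transpose back."""
    pass1 = [idct_1d(row) for row in F]
    pass2 = [idct_1d(row) for row in transpose(pass1)]
    return transpose(pass2)
```

Both functions give the same 8 × 8 result up to rounding error, while the separable form reuses a single 1D routine for both passes, which is the property the hardware exploits with one 1D IDCT core and a transpose memory.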

(2) Distributed arithmetic for the one-dimensional IDCT

The key to implementing the 2D IDCT in hardware is the 1D IDCT. To make the 1D IDCT computation suitable for hardware, this design implements it with distributed arithmetic, starting from the equations simplified with the Chen algorithm.

The 8-point 1D IDCT is defined as follows:

f_{i} = \sum_{x=0}^{7} \frac{C(x)}{2} F_{x} \cos\left(\frac{(2i+1) x \pi}{16}\right) \qquad (2)

where F_{x} are the coefficients after the DCT and f_{i} are the original data; C(x) = 2^{-1/2} when x = 0, and C(x) = 1 when x ≠ 0.

Simplifying with the Chen algorithm and letting C_{i} = \cos(i\pi/16), define:

P = \begin{bmatrix} C_{4} & C_{2} & C_{4} & C_{6} \\ C_{4} & C_{6} & -C_{4} & -C_{2} \\ C_{4} & -C_{6} & -C_{4} & C_{2} \\ C_{4} & -C_{2} & C_{4} & -C_{6} \end{bmatrix} \begin{bmatrix} F(0) \\ F(2) \\ F(4) \\ F(6) \end{bmatrix} \qquad (3)

M = \begin{bmatrix} C_{1} & C_{3} & C_{5} & C_{7} \\ C_{3} & -C_{7} & -C_{1} & -C_{5} \\ C_{5} & -C_{1} & C_{7} & C_{3} \\ C_{7} & -C_{5} & C_{3} & -C_{1} \end{bmatrix} \begin{bmatrix} F(1) \\ F(3) \\ F(5) \\ F(7) \end{bmatrix} \qquad (4)

The 8-point 1D IDCT can then be computed with the following two equations:

\begin{bmatrix} f(0) \\ f(1) \\ f(2) \\ f(3) \end{bmatrix} = \frac{1}{2} (P + M) \qquad (5)

\begin{bmatrix} f(7) \\ f(6) \\ f(5) \\ f(4) \end{bmatrix} = \frac{1}{2} (P - M) \qquad (6)

Computing the 8-point 1D IDCT is thus reduced to computing the two matrix products P and M, each of which consists of 4 vector inner products and requires multiply-accumulate (MAC) operations. How to implement the multiply-accumulate therefore becomes the key issue in implementing the 1D IDCT in hardware.
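Equations (3) to (6) can be checked against the definition in equation (2) with a short Python sketch. It is an illustration only: the names `P_MATRIX`, `M_MATRIX`, `idct_1d_chen` and `idct_1d_reference` are assumptions, and floating-point values are used rather than the fixed-point hardware arithmetic.

```python
import math

C = [math.cos(k * math.pi / 16) for k in range(8)]   # C[k] = cos(k*pi/16)

# Coefficient matrices of equations (3) and (4).
P_MATRIX = [[C[4],  C[2],  C[4],  C[6]],
            [C[4],  C[6], -C[4], -C[2]],
            [C[4], -C[6], -C[4],  C[2]],
            [C[4], -C[2],  C[4], -C[6]]]
M_MATRIX = [[C[1],  C[3],  C[5],  C[7]],
            [C[3], -C[7], -C[1], -C[5]],
            [C[5], -C[1],  C[7],  C[3]],
            [C[7], -C[5],  C[3], -C[1]]]

def matvec(matrix, vector):
    """Four inner products: one per matrix row."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def idct_1d_chen(F):
    """8-point 1D IDCT from the even part P and the odd part M."""
    P = matvec(P_MATRIX, [F[0], F[2], F[4], F[6]])
    M = matvec(M_MATRIX, [F[1], F[3], F[5], F[7]])
    top = [(p + m) / 2 for p, m in zip(P, M)]       # f(0), f(1), f(2), f(3)
    bottom = [(p - m) / 2 for p, m in zip(P, M)]    # f(7), f(6), f(5), f(4)
    return top + bottom[::-1]

def idct_1d_reference(F):
    """Direct evaluation of equation (2)."""
    def c(x):
        return 1 / math.sqrt(2) if x == 0 else 1.0
    return [sum(c(x) / 2 * F[x] * math.cos((2 * i + 1) * x * math.pi / 16)
                for x in range(8)) for i in range(8)]
```

The factor C(0)/2 of equation (2) appears here as C_4/2, since cos(π/4) = 2^{-1/2}; this is why the first column of the even matrix contains only C_4.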

Distributed arithmetic (DA) solves the vector inner-product problem effectively: the partial sums are computed in advance and stored in a lookup table, and the result is obtained with shift accumulators and the lookup table, without multipliers.

Let F_{i} be a B-bit two's complement number, which can be expressed as:

F_{i} = -F_{i}^{B-1} + \sum_{j=1}^{B-1} 2^{-j} F_{i}^{B-1-j}

where the superscript j in F_{i}^{j} denotes bit j of F_{i}, bit B-1 is the most significant bit (MSB), i.e. the sign bit, and each F_{i}^{j} can only be 0 or 1.
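As a small worked check of this representation, the sketch below extracts the bits of a B-bit two's complement integer and evaluates the sum. The helper names `bits_of` and `fractional_value` are assumptions for illustration; the value obtained equals the integer divided by 2^(B-1), i.e. the word is read as a fraction in [-1, 1).

```python
def bits_of(value, B):
    """Bits of a B-bit two's complement integer; index B-1 is the sign bit."""
    return [(value >> j) & 1 for j in range(B)]

def fractional_value(bits):
    """Evaluate -F^(B-1) + sum over j = 1..B-1 of 2^(-j) * F^(B-1-j)."""
    B = len(bits)
    return -bits[B - 1] + sum(2.0 ** -j * bits[B - 1 - j] for j in range(1, B))

# Example with B = 4: the integer -3 has bits 1101 and represents -3/8.
example = fractional_value(bits_of(-3, 4))
```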

Write each element of P as a vector inner product, P_{x} = \sum_{i=0}^{3} C_{i,x} F_{i} (x = 1, 2, 3, 4). Substituting the above expression for F_{i} gives:

P_{x} = \sum_{i=0}^{3} C_{i,x} \left( -F_{i}^{B-1} + \sum_{j=1}^{B-1} 2^{-j} F_{i}^{B-1-j} \right)

After rearrangement this can be written as:

P_{x} = \sum_{j=1}^{B-1} 2^{-j} D_{x}(F^{j}) - D_{x}(F^{0})

where:

D_{x}(F^{j}) = \sum_{i=0}^{3} C_{i,x} F_{i}^{B-1-j}

The partial sum D_{x}(F^{j}) is a function of the bit position j. For the 4 input bits F_{i}^{j} it has only 2^4 = 16 possible values, so these 16 values can be stored in a lookup table and the result computed with additions and shifts alone, without any multiplication. Distributed arithmetic thus computes vector inner products efficiently, runs fast, and has a simple structure that is easy to implement in hardware.
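The lookup-and-shift procedure can be sketched in Python as follows. This is a floating-point model under stated assumptions, not the hardware design: the names `build_lut` and `da_inner_product` are hypothetical, the word length B = 13 is taken from the 13-bit input of the 1D IDCT module, and multiplying by a power of two stands in for the hardware shift.

```python
import math

B = 13  # word length; the 1D IDCT module takes 13-bit input data

def build_lut(coeffs):
    """All 2^4 partial sums: entry addr adds coeffs[i] wherever bit i of addr is 1."""
    return [sum(c for i, c in enumerate(coeffs) if (addr >> i) & 1)
            for addr in range(16)]

def da_inner_product(coeffs, words):
    """Compute sum(coeffs[i] * words[i]) / 2^(B-1) bit-serially from the LUT."""
    lut = build_lut(coeffs)

    def address(bit):
        # Collect the same bit position of all four input words into a 4-bit address.
        return sum(((w >> bit) & 1) << i for i, w in enumerate(words))

    acc = -lut[address(B - 1)]                       # sign-bit term, subtracted
    for j in range(1, B):
        acc += lut[address(B - 1 - j)] * 2.0 ** -j   # shift and accumulate
    return acc
```

For the first row of P, `coeffs` would be `[C4, C2, C4, C6]` and `words` the integer values of F(0), F(2), F(4), F(6); no product of a coefficient with a data word is ever formed.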

(3) Distributed arithmetic with OBC coding

The lookup-table size in distributed arithmetic depends on the vector length of the inner product. For an inner product of vector length N, the lookup table has 2^N entries. As the vector length grows, so does the lookup table; an oversized lookup table slows down the accumulator's access to it and occupies more hardware resources.

Offset binary coding (OBC) halves the size of the lookup table. It maps the input-vector bit values {0, 1} to {-1, +1}, which makes the partial sum D_{x} mirror-symmetric with respect to the sign of the input vector. Write the B-bit two's complement number F_{i} as:

F_{i} = \frac{1}{2} [F_{i} - (-F_{i})] = \frac{1}{2} \left[ \sum_{j=1}^{B-1} 2^{-j} \left( F_{i}^{B-1-j} - \overline{F_{i}^{B-1-j}} \right) - \left( F_{i}^{B-1} - \overline{F_{i}^{B-1}} \right) - 2^{-(B-1)} \right]

Let

d_{i}^{j} = \begin{cases} F_{i}^{j} - \overline{F_{i}^{j}}, & j \neq B-1 \\ \overline{F_{i}^{B-1}} - F_{i}^{B-1}, & j = B-1 \end{cases}

Then F_{i} can be expressed as:

F_{i} = \frac{1}{2} \left[ \sum_{j=0}^{B-1} 2^{-j} d_{i}^{B-1-j} - 2^{-(B-1)} \right]

Substituting this into the inner-product expression for P_{x} and simplifying gives:

P_{x} = \sum_{j=0}^{B-1} 2^{-j} \left( \sum_{i=0}^{3} \frac{1}{2} C_{i,x} d_{i}^{B-1-j} \right) - 2^{-(B-1)} \left( \frac{1}{2} \sum_{i=0}^{3} C_{i,x} \right)

Denote

D_{j} = \sum_{i=0}^{3} \frac{1}{2} C_{i,x} d_{i}^{j},

D_{app} = -\frac{1}{2} \sum_{i=0}^{3} C_{i,x},

Then:

P_{x} = \sum_{j=0}^{B-1} D_{B-1-j} 2^{-j} + D_{app} 2^{-(B-1)}
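The OBC form of the inner product can be verified numerically against the plain inner product. The sketch below follows the definitions of d_{i}^{j}, D_{j} and D_{app} given above; the function names `obc_digits` and `obc_inner_product` are assumptions, and floating-point arithmetic replaces the hardware shift accumulator.

```python
def obc_digits(value, B):
    """Digits d^j in {-1, +1}: bit minus its complement, sign-flipped for j = B-1."""
    digits = []
    for j in range(B):
        bit = (value >> j) & 1
        d = bit - (1 - bit)                 # F^j minus its complement
        digits.append(-d if j == B - 1 else d)
    return digits

def obc_inner_product(coeffs, words, B):
    """Sum of D_(B-1-j) * 2^(-j) over j, plus the constant term D_app * 2^-(B-1)."""
    d = [obc_digits(w, B) for w in words]

    def partial_sum(j):
        # D_j = sum over i of (1/2) * C_i * d_i^j
        return sum(0.5 * c * d[i][j] for i, c in enumerate(coeffs))

    d_app = -0.5 * sum(coeffs)
    total = sum(partial_sum(B - 1 - j) * 2.0 ** -j for j in range(B))
    return total + d_app * 2.0 ** -(B - 1)
```

The result equals sum(coeffs[i] * words[i]) / 2^(B-1), the same value the unencoded distributed-arithmetic form produces, but every partial sum now depends only on digits of ±1, which is what creates the sign symmetry used to halve the table.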

As the above equation shows, if the partial sums D_{j} are computed in advance and stored in a lookup table, the vector inner product P_{x} can likewise be computed by shift-accumulate operations. Each d_{i}^{j} can only be -1 or +1, so the partial sum D_{j} is mirror-symmetric with respect to the sign of the vector d. The computation of P_{1} is taken as an example below to illustrate this symmetry. P_{1} is computed as:

P_{1} = C_{4} F_{0} + C_{2} F_{2} + C_{4} F_{4} + C_{6} F_{6}

By the definition of the partial sum D_{j}:

D_{j} = \frac{1}{2} \left( C_{4} d_{0}^{j} + C_{2} d_{1}^{j} + C_{4} d_{2}^{j} + C_{6} d_{3}^{j} \right)

D_{j} = \frac{1}{2} \left[ C_{4} \left( F_{0}^{j} - \overline{F_{0}^{j}} \right) + C_{2} \left( F_{2}^{j} - \overline{F_{2}^{j}} \right) + C_{4} \left( F_{4}^{j} - \overline{F_{4}^{j}} \right) + C_{6} \left( F_{6}^{j} - \overline{F_{6}^{j}} \right) \right]

The lookup table for computing P_{1} is shown in Table 1.

Table 1. Lookup table for computing P_{1}

Table 1 shows that the partial sums of P_{1} are symmetric with respect to the value of F_{0}^{j}. When F_{0}^{j} equals 0, the other three bits alone are used as the table address. When F_{0}^{j} equals 1, it is XORed with each of the other three bits, the XOR result is used to look up the boxed region of the table, and the value read is then negated to give the correct output. The actual lookup-table size is therefore 2^3 = 8, half of the previous 2^4 = 16.
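The halved table and the XOR-and-negate addressing can be illustrated in Python. The sketch covers the non-sign bit positions of P_{1} only; the names `full_entry`, `HALF_LUT` and `lookup` are hypothetical, and the stored values are computed from the formula for D_{j} above rather than copied from Table 1.

```python
import math

C2, C4, C6 = (math.cos(k * math.pi / 16) for k in (2, 4, 6))
COEFFS = [C4, C2, C4, C6]          # weights of F0, F2, F4, F6 in P1

def full_entry(b0, b2, b4, b6):
    """Reference partial sum D_j for a non-sign bit: each bit b maps to 2b - 1."""
    return 0.5 * sum(c * (2 * b - 1) for c, b in zip(COEFFS, (b0, b2, b4, b6)))

# Half table: only the 8 entries with the F0 bit equal to 0 are stored.
HALF_LUT = {(b2, b4, b6): full_entry(0, b2, b4, b6)
            for b2 in (0, 1) for b4 in (0, 1) for b6 in (0, 1)}

def lookup(b0, b2, b4, b6):
    """XOR the other three bits with the F0 bit, look up, negate if the F0 bit is 1."""
    value = HALF_LUT[(b2 ^ b0, b4 ^ b0, b6 ^ b0)]
    return -value if b0 else value
```

Inverting all four bits flips the sign of every ±1 digit and therefore of the whole partial sum, so the 8 entries with the F0 bit equal to 1 never need to be stored.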

Therefore: 1) the hardware module of the device operates at high speed, so the system clock frequency can be lowered while real-time, high-quality decoding is still achieved, which greatly reduces power consumption; 2) the flexibility and extensibility of the software module give the decoder good compatibility and make it easy to modify and add new functions; 3) the computation is concentrated in the hardware accelerator, which greatly relieves the computational burden on the CPU, so that the CPU can support more upper-layer applications while meeting the requirements of decoding speed, power consumption, flexibility, cost and development cycle.

Description of drawings:

1. Fig. 1 is a structural connection diagram of the utility model;

2. Fig. 2 is a structural connection diagram of the 2D IDCT module of the utility model;

3. Fig. 3 is a connection diagram of the 1D IDCT module of the utility model;

4. Fig. 4 is a connection diagram of the adder of the utility model.

Embodiment:

To make the technical scheme of the utility model easier to understand, the utility model is further described below with reference to an embodiment.

Embodiment 1:

As shown in Figs. 1, 2 and 3, the IP core for a two-dimensional IDCT distributed arithmetic algorithm based on SOPC technology has an Avalon bus read module, a controller module, a 2D IDCT module, an Avalon bus write module, an Avalon bus and a control register. Control data are written into the control register. The controller module directs the Avalon bus read module to read the operation address from the control register and to read the data to be processed into the input buffer over the Avalon bus; after processing by the 2D IDCT module, the Avalon bus write module writes the result back to the original address.

The 1D IDCT module has a shift register, 8 shift accumulators and a post-processing module. The shift register takes 13-bit input data and outputs 8 bits of data to the 8 shift accumulators; the 8 shift accumulators output 14-bit data, which the post-processing module extends to 16-bit precision before output.

The shift accumulator module is built from 4-input adders.

The SOPC development platform is built around a Cyclone II EP2C35F672C8 FPGA chip. The hardware design is written in the Verilog HDL hardware description language and synthesized with Quartus II. The whole 2D IDCT occupies 4336 logic elements, of which the core 1D IDCT module occupies only 632. The 8 lookup-table units directly use the LUTs in the FPGA logic elements, with no registers or built-in RAM. This lookup-table implementation is simple and flexible, and on-chip access is fast. The maximum synthesized operating frequency of the 2D IDCT IP core is 140.39 MHz.

Actual video decoding tests were carried out in an SOPC system with Nios II as the processor. The IDCT IP core was added in SOPC Builder, the encoded video test file was programmed into flash, the decoding program was ported in the Nios II IDE, the original software IDCT function was removed, and a driver function for the 2D IDCT IP core was written in C. After the system decodes the video, it is played on an LCD with a VGA interface.

After the IDCT IP core was added, the LCD picture was clear and the decoding quality of the system was not degraded; with the 2D IDCT IP core, the system decoding time fell by about 11 ms and the frame rate rose by 6 frames.

In this IP core, the two-dimensional IDCT algorithm studied for the hardware design proves effective: the use of distributed arithmetic raises the maximum operating frequency of the chip and, combined with OBC coding, greatly reduces the logic resources occupied. The synthesis results show that the chip uses few resources and has fast access, with a maximum synthesized operating frequency of 140.39 MHz, so the programmable FPGA hardware design of the two-dimensional IDCT has been realized successfully.

The design of the two-dimensional IDCT IP core based on an SOPC system has also been realized. Test results show that decoding with this IP core is faster than software decoding, with video decoding speed improved by more than 20% on average, which verifies the real-time performance and effectiveness of the design.

Thanks to the IP core reuse technique of SOPC, this IP core can be used in related image and video processing systems and is highly practical, versatile and extensible.

The above are only preferred embodiments of the utility model and do not limit the utility model in any way, in form or in substance. Any equivalent embodiment obtained by a person skilled in the art through minor changes, modifications or equivalent variations made using the technical content disclosed above, without departing from the scope of the technical scheme of the utility model, is an equivalent embodiment of the utility model; likewise, any change, modification or equivalent variation made to the above embodiments in accordance with the essential technology of the utility model still falls within the scope of the technical scheme of the utility model.

Claims

1. An IP core for a two-dimensional IDCT distributed arithmetic algorithm based on SOPC technology, characterized in that: the IP core has an Avalon bus read module, a controller module, a 2D IDCT module, an Avalon bus write module, an Avalon bus and a control register; control data are written into said control register; the controller module directs said Avalon bus read module to read the operation address from the control register and to read the data to be processed into an input buffer over said Avalon bus; and after processing by said 2D IDCT module, said Avalon bus write module writes the result back to the original address.

2. The IP core for a two-dimensional IDCT distributed arithmetic algorithm based on SOPC technology according to claim 1, characterized in that: said 2D IDCT module has a 1D IDCT module, a serial-to-parallel converter, a multiplexer, a parallel-to-serial converter, a transpose memory and a controller; after said serial-to-parallel conversion buffer module has received a group of data, it outputs them simultaneously to said multiplexer as one row of data; said 1D IDCT module computes the 8-point inverse transform of each row and outputs it to said transpose memory, which feeds the multiplexer again; said 1D IDCT module then computes the 8-point inverse transform of each row of the transposed data and outputs it to said parallel-to-serial converter; and said controller controls the whole process.

3. The IP core for a two-dimensional IDCT distributed arithmetic algorithm based on SOPC technology according to claim 1, characterized in that: said 1D IDCT module has a shift register, 8 shift accumulators and a post-processing module; said shift register takes 13-bit input data and outputs 8 bits of data to said 8 shift accumulators; and said 8 shift accumulators output 14-bit data, which said post-processing module extends to 16-bit precision before output.

4. The IP core for a two-dimensional IDCT distributed arithmetic algorithm based on SOPC technology according to claim 1, characterized in that: said shift accumulator module is built from 4-input adders.