CN105828071A

CN105828071A - Vectorized implementation method of deblocking filtering for vector processors

Info

Publication number: CN105828071A
Application number: CN201610194300.7A
Authority: CN
Inventors: 陈胜刚; 万江华; 刘胜; 王耀华; 陈小文; 刘仲; 陈海燕
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2016-03-31
Filing date: 2016-03-31
Publication date: 2016-08-03
Anticipated expiration: 2036-03-31
Also published as: CN105828071B

Abstract

The present invention provides a vectorized implementation method of deblocking filtering for vector processors. The method comprises the following steps. S1, data preparation: an n×m block of video data to be filtered is input into the vector memory bank and vectorized. S2: horizontal filtering. S3, result storage: according to the result of step S2, the final results of (p3, p2, p1, p0, q0, q1, q2, q3) are selected for each PE, with p3 and q3 keeping their values, giving (p3, p2', p1', p0', q0', q1', q2', q3), and are stored in the matrix register file. S4: steps S2 and S3 are repeated until all boundaries in the horizontal direction have been filtered. S5: vertical filtering. S6, result storage: according to the result of step S5, the final results of (p3, p2, p1, p0, q0, q1, q2, q3) are selected for each PE, with p3 and q3 keeping their values, giving (p3, p2', p1', p0', q0', q1', q2', q3), and are stored directly in the vector memory bank. S7: steps S5 and S6 are repeated until all boundaries in the vertical direction have been filtered. The method is computationally efficient, makes full use of cooperation among the multiple PEs of the vector processor, and shortens operation time.

Description

Vectorized implementation method of deblocking filtering for vector processors

Technical field

The present invention relates generally to the fields of vector processors and video encoding and decoding, and in particular to a vectorized implementation method of deblocking filtering for vector processors.

Background art

In video encoding and decoding algorithms, block-based prediction, compensation, transform and quantization cause blocking artifacts, which strongly degrade the subjective perceptual quality of the reconstructed image. To eliminate these artifacts, the reconstructed image generally has to be deblock-filtered, and international standards have moved the deblocking filter algorithm into the loop of the codec, where it is known as in-loop deblocking filtering. Because every boundary of a coding block requires a filtering decision, computation and repeated updates to storage, the deblocking filter algorithm accounts for more than one third of the computational complexity of the decoder. Accelerating deblocking filtering is therefore of great significance for real-time high-definition video encoding and decoding.

The usual way to accelerate deblocking filtering is parallelization. Researchers often use dedicated hardware to accelerate the deblocking filter algorithm; the drawbacks of this approach are poor flexibility and high cost when standards are updated frequently. In addition, a dedicated transposition circuit is needed to handle the accesses to row and column data in the deblocking filter algorithm. Programmable approaches are therefore more marketable.

However, a conventional single-core processor can hardly meet the computational demand that deblocking filtering places on a real-time decoder, while multi-core processors are loosely coupled and incur relatively high inter-core data transfer overhead, so they too are ill-suited to parallel acceleration of deblocking filtering. In this situation, the vector processor becomes the first choice. A vector processor generally consists of multiple processing elements (PEs) that are tightly coupled; each PE contains several independent functional units, such as multipliers, adders and shifters. Each PE executes very long instruction word (VLIW) instructions made up of multiple execution packets, and functional units that do not share a pipeline can execute multiple execution packets simultaneously. Each PE contains a set of local registers, and the identically numbered local registers of all PEs logically form a vector register. In Fig. 1, for example, the R0 registers of PE_0 to PE_{M-1} logically form vector register VR0, and the R0 of each PE is called an element of VR0. In addition, vector processors often provide a matrix register file that supports row and column access to a matrix, which effectively meets the storage access requirements of filtering in different directions in deblocking filtering.

However, the deblocking filter algorithm is highly adaptive: the execution path at adjacent boundaries differs depending on the source data, and the same data must be read and written discontinuously and repeatedly. How to vectorize the deblocking filter algorithm on a vector processor and accelerate its computation is therefore a difficult problem.

Summary of the invention

The technical problem to be solved by the present invention is as follows: in view of the technical problems in the prior art, the present invention provides a vectorized implementation method of deblocking filtering for vector processors that is simple in principle, convenient to operate and computationally efficient, makes full use of cooperation among the multiple PEs of the vector processor, and shortens operation time.

To solve the above technical problem, the present invention adopts the following technical solution:

A vectorized implementation method of deblocking filtering for vector processors, comprising the steps of:

S1: data preparation; an n × m block of video data to be filtered is input into the vector memory bank and vectorized;

S2: horizontal filtering; the horizontal boundary currently to be filtered is selected, and each PE reads from the vector memory bank the image data (p3, p2, p1, p0, q0, q1, q2, q3) to be filtered; the judgment conditions are computed from the image data (p3, p2, p1, p0, q0, q1, q2, q3) and constants, and are stored in the vector condition registers; all candidate results of (p3, p2, p1, p0, q0, q1, q2, q3) are computed according to the rules of the deblocking filter algorithm and stored in separate local vector registers;

S3: result storage; according to the result of step S2, the final results of (p3, p2, p1, p0, q0, q1, q2, q3) are selected for each PE, with p3 and q3 keeping their values, giving (p3, p2', p1', p0', q0', q1', q2', q3), and are stored in the matrix register file;

S4: steps S2 and S3 are repeated until all boundaries in the horizontal direction have been filtered;

S5: vertical filtering; the boundary currently to be filtered is selected, and each PE reads from the matrix register file the image data (p3, p2, p1, p0, q0, q1, q2, q3) to be filtered; the data in the matrix register file that have already been horizontally filtered are used as the source data, and the judgment conditions are computed from the image data (p3, p2, p1, p0, q0, q1, q2, q3) in the vertical direction and constants, and are stored in the vector condition registers; all candidate results of (p3, p2, p1, p0, q0, q1, q2, q3) are computed according to the rules of the deblocking filter algorithm and stored in separate local vector registers;

S6: result storage; according to the result of step S5, the final results of (p3, p2, p1, p0, q0, q1, q2, q3) are selected for each PE, with p3 and q3 keeping their values, giving (p3, p2', p1', p0', q0', q1', q2', q3), and are stored directly in the vector memory bank;

S7: steps S5 and S6 are repeated until all boundaries in the vertical direction have been filtered.

As a further improvement of the present invention, the operation of selecting the final result in steps S3 and S6 comprises the following steps:

S100: suppose that the candidate results computed for pi on the boundary handled by each PE are R0 to Rk-1, which form a complete or incomplete binary tree; for any PE, the final result of pi is necessarily one of R0 to Rk-1. Expanded over the PEs, R0 to Rk-1 form a k × M matrix R whose element R_{j,i} is candidate j on PE_i. From the corresponding judgment conditions of the deblocking filter algorithm, a condition matrix C of the same k × M shape is obtained for the conditional operations, where C_{j,i} = 1 if candidate j is the final result on PE_i and C_{j,i} = 0 otherwise;

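S200: according to the condition matrix, the final result Pi of pi is obtained through k vector conditional MOV operations; that is, Pi = Σ_j Rj · Cj, summed over the k candidates j = 0, ..., k-1, where Rj is the j-th candidate vector, Cj is the corresponding condition vector and the product is taken element by element across the PEs;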

S300: steps S100 and S200 are repeated until the results of p2, p1, p0, q0, q1 and q2 have all been selected.
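The selection in S100 and S200 can be sketched as follows. This is a minimal illustration in plain Python, not the patent's instruction sequence: the function names `conditional_mov` and `select_final` and the example values are hypothetical, and the sketch assumes that exactly one condition is set per PE.

```python
def conditional_mov(dest, src, cond):
    """Per-PE conditional move: lane i takes src[i] only where cond[i] == 1."""
    return [s if c == 1 else d for d, s, c in zip(dest, src, cond)]


def select_final(candidates, conditions):
    """Apply one conditional MOV per candidate (k moves in total).

    candidates[j][i] is candidate result j on PE i.
    conditions[j][i] is 1 if candidate j is the final result on PE i.
    Assumes exactly one condition is set per PE, so the outcome equals
    the sum over j of candidates[j][i] * conditions[j][i].
    """
    num_pe = len(candidates[0])
    result = [0] * num_pe
    for cand, cond in zip(candidates, conditions):
        result = conditional_mov(result, cand, cond)
    return result


# Hypothetical example with k = 3 candidates and M = 4 PEs.
R = [[10, 11, 12, 13],
     [20, 21, 22, 23],
     [30, 31, 32, 33]]
C = [[1, 0, 0, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
P = select_final(R, C)
```

Because no PE ever branches, all PEs issue the same k moves regardless of which candidate each one ends up keeping.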

As a further improvement of the present invention, the conditional operations in steps S3 and S6 work as follows: suppose the vector processor is currently executing a vector instruction Inst under a condition register R0 = {R_{0,0}, R_{0,1}, ..., R_{0,M-1}}, whose elements correspond to PE_0 to PE_{M-1} respectively. If R_{0,i} == 1, PE_i executes the instruction Inst; otherwise PE_i performs a no-op.
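A conditionally executed vector instruction of this kind can be modeled in a few lines. The sketch below is only an illustration under stated assumptions: `vector_execute` is a hypothetical name, the instruction is modeled as a Python function of one operand, and a no-op is modeled as leaving the destination element unchanged.

```python
def vector_execute(inst, cond_reg, src, dest):
    """Model one conditionally executed vector instruction.

    PE i applies inst to src[i] when cond_reg[i] == 1; otherwise it
    performs a no-op and its destination element keeps its old value.
    """
    return [inst(s) if c == 1 else d
            for s, d, c in zip(src, dest, cond_reg)]


# Hypothetical example with M = 4 PEs; the instruction doubles its operand.
R0 = [1, 0, 1, 0]
src = [5, 6, 7, 8]
dest = [0, 0, 0, 0]
out = vector_execute(lambda x: x * 2, R0, src, dest)
```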

As a further improvement of the present invention, the vector memory bank comprises M memory blocks, which correspond one-to-one, in order, with the M vector PEs; the M memory blocks are uniformly addressed and interleaved by BANK; that is, the first word is stored in the first BANK, the second word in the second BANK, and so on until the M-th word is stored in the M-th BANK; the (M+1)-th word is then stored in the first BANK again, and so forth; each memory block is divided into an upper and a lower memory block and supports two vector access operations simultaneously.
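The interleaved placement amounts to a modulo mapping from word index to bank. The following sketch shows it with zero-based indices (the text counts from the first word and first BANK); `bank_location` is a hypothetical helper, and the row offset within a bank is an assumption added for illustration.

```python
def bank_location(word_index, num_banks):
    """Map a linear, zero-based word index to (bank, row) under word interleaving."""
    return word_index % num_banks, word_index // num_banks


M = 16  # number of banks, one per PE
locations = [bank_location(w, M) for w in range(18)]
```

With this layout, M consecutive words always fall in M different banks, so one vector access can fetch one word for every PE without bank conflicts.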

As a further improvement of the present invention, the vector matrix register file consists of M × M storage cells, each typically 4, 8, 12, 16 or 32 bits wide; logically, the array consists of M row vector registers VR_0 to VR_{M-1} or M column vector registers CVR_0 to CVR_{M-1}; each row vector register VR_i contains the M elements E_{i,0} to E_{i,M-1}, where i = 0, 1, 2, ..., M-1, and each column vector register CVR_i contains the M elements E_{0,i} to E_{M-1,i}, where i = 0, 1, 2, ..., M-1; the matrix register reads and writes row and column vectors under the control of the read/write enable, read/write address and row/column select signals.

Compared with the prior art, the advantages of the present invention are:

1. The vectorized implementation method of deblocking filtering for vector processors according to the present invention can filter M boundary positions simultaneously, which effectively speeds up deblocking filtering and leaves more time for the other functional modules of real-time high-definition video encoding and decoding. This vectorized parallel method makes full use of the vector computing capability of the vector processor and fully exploits the data parallelism of the deblocking filter algorithm, and can therefore greatly improve its computational performance.

2. The vectorized implementation method of deblocking filtering for vector processors according to the present invention uses conditional operations to let the vector processor handle multi-branch programs whose branching is triggered by the source data. Every PE of the vector processor receives the same instruction from the dispatch unit, so within one execution cycle the PEs generally cannot follow branch instructions whose outcomes differ from PE to PE. Through conditional operations, the present invention allows the vector processor to execute instruction sequences with multiple branch targets smoothly.

Brief description of the drawings

Fig. 1 is a schematic diagram of the general structure of a vector processor.

Fig. 2 is a schematic diagram of the pixels filtered at the vertical and horizontal boundaries in the present invention.

Fig. 3 is a flow chart of the method of the present invention.

Fig. 4 is a schematic diagram of the vector memory bank structure in a specific application example of the present invention.

Fig. 5 is a schematic diagram of one layout of the data to be filtered in the vector memory bank in a specific application example of the present invention.

Fig. 6 is a schematic diagram of the memory cell array structure of the matrix register used in the present invention.

Detailed description of the embodiments

The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.

As shown in Fig. 3, the vectorized implementation method of deblocking filtering for vector processors according to the present invention comprises the following steps:

S1: data preparation;

S101: an n × m block of video data to be filtered is input into the vector memory bank.

S102: the constants required by the algorithm are loaded from the vector memory bank and vectorized.

S2: horizontal filtering;

S201: the horizontal boundary currently to be filtered is selected, and each PE reads from the vector memory bank the image data (p3, p2, p1, p0, q0, q1, q2, q3) to be filtered;

S202: the judgment conditions are computed from the image data (p3, p2, p1, p0, q0, q1, q2, q3) and constants, and are stored in the vector condition registers;

S203: all candidate results of (p3, p2, p1, p0, q0, q1, q2, q3) are computed according to the rules of the deblocking filter algorithm and stored in separate local vector registers.

S3: result storage;

According to the contents of the judgment condition registers of the deblocking filter algorithm, the final results of (p3, p2, p1, p0, q0, q1, q2, q3) are selected for each PE, with p3 and q3 keeping their values, giving (p3, p2', p1', p0', q0', q1', q2', q3), and are stored in the matrix register file.

S4: steps S2 and S3 are repeated until all boundaries in the horizontal direction have been filtered.

S5: vertical filtering；

S501: the boundary currently to be filtered is selected, and each PE reads from the matrix register file the image data (p3, p2, p1, p0, q0, q1, q2, q3) to be filtered.

S502: the data in the matrix register file that have already been horizontally filtered are used as the source data; the judgment conditions are computed from the image data (p3, p2, p1, p0, q0, q1, q2, q3) in the vertical direction and constants, and are stored in the vector condition registers.

S503: all candidate results of (p3, p2, p1, p0, q0, q1, q2, q3) are computed according to the rules of the deblocking filter algorithm and stored in separate local vector registers.

S6: result storage;

According to the contents of the judgment condition registers of the deblocking filter algorithm, the final results of (p3, p2, p1, p0, q0, q1, q2, q3) are selected for each PE, with p3 and q3 keeping their values, giving (p3, p2', p1', p0', q0', q1', q2', q3), and are stored directly in the vector memory bank.

The above method of the present invention supports efficient vectorized computation of deblocking filtering, makes full use of the parallel computing capability of all PEs of the vector processor, effectively improves the execution efficiency of the vector processor, and shortens operation time.
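The overall two-pass flow can be sketched as below. This is a simplified model under several assumptions, not the patented implementation: all function names are hypothetical, `placeholder_filter` is a stand-in that only pulls p0 and q0 toward each other rather than a real deblocking filter, only internal edges are handled, each line of pixels plays the role of one PE, and the row-write and column-read of the matrix register file is modeled as a transpose.

```python
def placeholder_filter(samples):
    """Stand-in edge filter (NOT the H.264 filter): adjusts only p0 and q0."""
    p3, p2, p1, p0, q0, q1, q2, q3 = samples
    d = (q0 - p0) // 4
    return [p3, p2, p1, p0 + d, q0 - d, q1, q2, q3]


def filter_lines(lines, edges, edge_filter):
    """Filter every line at every edge; each line stands for one PE."""
    out = [list(line) for line in lines]
    for e in edges:              # S4 / S7: repeat for each boundary
        for line in out:         # conceptually parallel across the PEs
            line[e - 4:e + 4] = edge_filter(line[e - 4:e + 4])
    return out


def transpose(block):
    """Models writing rows into the matrix register file and reading columns."""
    return [list(col) for col in zip(*block)]


def deblock(block, edges, edge_filter=placeholder_filter):
    rows = filter_lines(block, edges, edge_filter)             # S2 to S4
    cols = filter_lines(transpose(rows), edges, edge_filter)   # S5 to S7
    return transpose(cols)


# Hypothetical 8 x 8 block with a step across the edge at column 4.
block = [[0] * 4 + [40] * 4 for _ in range(8)]
result = deblock(block, edges=[4])
```

The second pass consumes the output of the first, which mirrors the requirement that vertical filtering use the horizontally filtered data as its source.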

In a specific application example, the vector memory bank comprises M memory blocks, which correspond one-to-one, in order, with the M vector PEs. The M memory blocks are uniformly addressed and interleaved by BANK: the first word is stored in the first BANK, the second word in the second BANK, and so on until the M-th word is stored in the M-th BANK; the (M+1)-th word is then stored in the first BANK again, and so forth. Each memory block is divided into an upper and a lower memory block and supports two vector access operations simultaneously.

In a specific application example, the vector matrix register file consists of M × M storage cells, each typically 4, 8, 12, 16 or 32 bits wide. Logically, the array can be viewed as M row vector registers VR_0 to VR_{M-1} or M column vector registers CVR_0 to CVR_{M-1}. Each row vector register VR_i contains the M elements (storage cells) E_{i,0} to E_{i,M-1} (i = 0, 1, 2, ..., M-1), and each column vector register CVR_i contains the M elements E_{0,i} to E_{M-1,i} (i = 0, 1, 2, ..., M-1). The matrix register reads and writes row and column vectors under the control of the read/write enable, read/write address and row/column select signals.

In a specific application example, the conditional operations in steps S3 and S6 work as follows: suppose the vector processor is currently executing a vector instruction Inst under a condition register R0 = {R_{0,0}, R_{0,1}, ..., R_{0,M-1}}, whose elements correspond to PE_0 to PE_{M-1} respectively. If R_{0,i} == 1, PE_i executes the instruction Inst; otherwise PE_i performs a no-op.

In a specific application example, the operation of selecting the final result in steps S3 and S6 comprises the following steps:

S100: suppose that the candidate results computed for pi on the boundary handled by each PE are R0 to Rk-1, which form a complete or incomplete binary tree. For any PE, the final result of pi is necessarily one of R0 to Rk-1. Expanded over the PEs, R0 to Rk-1 form a k × M matrix R whose element R_{j,i} is candidate j on PE_i. From the corresponding judgment conditions of the deblocking filter algorithm, a condition matrix C of the same k × M shape is obtained for the conditional operations, where C_{j,i} = 1 if candidate j is the final result on PE_i and C_{j,i} = 0 otherwise.

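S200: according to the condition matrix, the final result Pi of pi is obtained through k vector conditional MOV operations. That is, Pi = Σ_j Rj · Cj, summed over the k candidates j = 0, ..., k-1, where Rj is the j-th candidate vector, Cj is the corresponding condition vector and the product is taken element by element across the PEs.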

Fig. 2 is a schematic diagram of the vertical and horizontal filtering boundaries in a specific application example of the present invention. The dashed lines labeled a, b, c, d, e, f, g and h are the filtering boundaries of the block. In H.264, the image block in Fig. 2 is 16 × 16 pixels, and each block enclosed by dashed lines is 4 × 4 pixels. Each filtering operation involves the four pixels on either side of the filtering boundary, (p3, p2, p1, p0) and (q0, q1, q2, q3).

With reference to Fig. 3, and taking the in-loop deblocking filter algorithm of the H.264 standard as an example, the present invention comprises the following steps in a specific example:

(1) Input data: in H.264, deblocking filtering is applied to every 16 × 16 macroblock, as shown in Fig. 2. The 16 × 16 macroblock is divided into sixteen 4 × 4 sub-blocks, and the boundaries of each 4 × 4 sub-block are filtering boundaries, namely boundaries a, b, c, d, e, f, g and h in Fig. 2.

As shown in Fig. 4, the vector memory bank consists of M = 16 banks (BANK_0 to BANK_15), which correspond one-to-one with PE_0 to PE_15 of the vector processing unit. The 16 BANKs are uniformly addressed and interleaved by BANK, so that data can be shared and high-bandwidth data access is provided to the 16 PEs. Each BANK supports multiport access through an interleaved organization of two groups of multiple banks (the ports include two vector access ports as well as a DMA port and a scalar access port); that is, each BANK is divided into an upper and a lower memory block and can support two vector access operations simultaneously.

In this embodiment, the layout of the 16 × 16 macroblock data in the vector memory bank is shown in Fig. 5. The 16 data items to be filtered in each row can be loaded into the local registers in a single access to take part in the computation; the final filter results can be stored back into the vector memory one row at a time and then transferred to external memory by DMA.

(2) Compute the filter results of (p3, p2, p1, p0, q0, q1, q2, q3): the 16 PEs can process 16 row-boundary or column-boundary positions at once. This vector mode of operation greatly reduces the time required for filtering: only 4 passes are needed to filter all the row boundaries, or all the column boundaries, of a 16 × 16 macroblock.
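The pass count follows from simple arithmetic, shown here as a small sketch. The helper name `filtering_passes` is hypothetical, and the sketch assumes that each boundary is crossed by one line of pixels per row (or column) of the block and that each PE handles one such line per pass.

```python
def filtering_passes(block_size, sub_block_size, num_pe):
    """Passes needed to filter all boundaries of one direction."""
    boundaries = block_size // sub_block_size        # edges per direction
    lines_per_boundary = block_size                  # pixel lines crossing each edge
    passes_per_boundary = -(-lines_per_boundary // num_pe)  # ceiling division
    return boundaries * passes_per_boundary


passes_16_pe = filtering_passes(16, 4, 16)
passes_8_pe = filtering_passes(16, 4, 8)
```

For a 16 × 16 macroblock with 4 × 4 sub-blocks and 16 PEs this gives 4 passes per direction, which matches the figure stated above.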

In this embodiment, the possible values of the filter results of (p3, p2, p1, p0, q0, q1, q2, q3) are as follows (the results are denoted by the capital letters P2, P1, P0, Q0, Q1, Q2):

P0 = Min(Max(0, p0 + d), 255)

P0 = (p2 + 2*p1 + 2*p0 + 2*q0 + q1 + 4) >> 3

P1 = (p2 + p1 + p0 + q0 + 2) >> 2

P1 = p1 + Min(Max(-C0, (p2 + ((p0 + q0) >> 1) - (p1 << 1)) >> 1), C0)

P2 = (2*p3 + 3*p2 + p1 + p0 + q0 + 4) >> 3

Q0 = Min(Max(0, q0 - d), 255)

Q0 = (q2 + 2*q1 + 2*q0 + 2*p0 + p1 + 4) >> 3

Q1 = q1 + Min(Max(-C0, (q2 + ((p0 + q0) >> 1) - (q1 << 1)) >> 1), C0)

Q1 = (q2 + q1 + q0 + p0 + 2) >> 2

Q2 = (2*q3 + 3*q2 + q1 + q0 + p0 + 4) >> 3

d = Min(Max(-C, (((q0 - p0) << 2) + (p1 - q1) + 4) >> 3), C)

where the variables C and C0 are preset parameters of the in-loop deblocking filter.
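Some of these candidate formulas are sketched below for a single PE. This is an illustrative transcription rather than a reference implementation: the function names are hypothetical, the shift precedence is made explicit with parentheses, and only the d, P0 and Q0 clipping formulas and the three averaging formulas for the p side are shown.

```python
def clip(lo, hi, x):
    """Clamp x to the range [lo, hi], i.e. Min(Max(lo, x), hi)."""
    return min(max(lo, x), hi)


def clipped_p0_q0(p1, p0, q0, q1, c):
    """Candidates P0 = clip(p0 + d) and Q0 = clip(q0 - d)."""
    d = clip(-c, c, (((q0 - p0) << 2) + (p1 - q1) + 4) >> 3)
    return clip(0, 255, p0 + d), clip(0, 255, q0 - d)


def averaged_p(p3, p2, p1, p0, q0, q1):
    """Averaging candidates for P2, P1 and P0 on the p side."""
    big_p0 = (p2 + 2 * p1 + 2 * p0 + 2 * q0 + q1 + 4) >> 3
    big_p1 = (p2 + p1 + p0 + q0 + 2) >> 2
    big_p2 = (2 * p3 + 3 * p2 + p1 + p0 + q0 + 4) >> 3
    return big_p2, big_p1, big_p0


# Hypothetical samples: a step from 60 to 68 across the boundary, with C = 2.
P0, Q0 = clipped_p0_q0(p1=60, p0=60, q0=68, q1=68, c=2)
P2s, P1s, P0s = averaged_p(p3=60, p2=60, p1=60, p0=60, q0=68, q1=68)
```

In the vectorized method every PE computes all of these candidates, and the correct one is chosen afterwards by the judgment conditions.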

In this embodiment, p3 and q3 are never modified by the filtering operation, and (p2, p1, p0, q0, q1, q2) may also be left unmodified.

(3) Compute the judgment conditions:

The filtering judgment conditions are computed from the relationship between the filter control parameters and the source data being filtered. Each judgment condition determines whether the value of a condition flag is true or false.

In this embodiment, the filtering judgment conditions include the filtering strength, the differences between neighboring pixels, and the thresholds on those differences.
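A per-PE judgment condition of this kind can be sketched as follows. The specific test used here is the edge-filtering test of the H.264 standard (nonzero boundary strength and sample differences below the thresholds alpha and beta); the text above does not spell the test out, so the names `bs`, `alpha` and `beta`, the helper names and the example values are all assumptions made for illustration.

```python
def edge_filter_condition(bs, p1, p0, q0, q1, alpha, beta):
    """Return 1 if the samples across the edge should be filtered, else 0."""
    return int(bs != 0
               and abs(p0 - q0) < alpha
               and abs(p1 - p0) < beta
               and abs(q1 - q0) < beta)


def condition_vector(lanes, alpha, beta):
    """One flag per PE; each lane holds (bs, p1, p0, q0, q1)."""
    return [edge_filter_condition(*lane, alpha, beta) for lane in lanes]


# Hypothetical data for three PEs.
lanes = [(2, 60, 60, 68, 68),     # small step across the edge: filter
         (0, 60, 60, 68, 68),     # zero filtering strength: do not filter
         (2, 60, 60, 120, 120)]   # large step, likely a real edge: do not filter
flags = condition_vector(lanes, alpha=20, beta=6)
```

The resulting flags play the role of one vector condition register, with one element per PE.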

(4) Select the final result:

According to the conditions from step (3), one conditional MOV operation is performed for each possible filter result, which finally yields the correct filter result.

(5) Store the filter result:

If this is row filtering, the filter result is stored in the matrix register file; otherwise it is stored directly in the vector memory bank and then saved to user space by an external data transfer engine such as DMA.

(6) If filtering is not yet complete, steps (2) to (5) are repeated until all filtering boundaries have been filtered.

Fig. 6 is a schematic diagram of the memory cell array of the matrix register in a specific application example of the present invention. The array generally consists of N × N storage cells, where N is usually a power of 2. The bit width of each storage cell is W, where W is typically 4, 8, 12, 16 or 32. Logically, the array can be viewed as N row vector registers VR_0 to VR_{N-1} or N column vector registers CVR_0 to CVR_{N-1}. Each row vector register VR_i contains N elements (storage cells) E_{i,0} to E_{i,N-1} (i = 0, 1, 2, ..., N-1); VR_0, for example, comprises storage cells E_{0,0} to E_{0,N-1}. The array is divided by column into N cell columns of N × W bits, each made up of the N elements in the same column. These N cell columns correspond one-to-one with the N column vector registers CVR_0 to CVR_{N-1} and implement access to the corresponding column vector register. CVR_{N-1}, for example, comprises the last element E_{i,N-1} (i = 0, 1, 2, ..., N-1) of every row vector register VR_0 to VR_{N-1}.
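The row and column views of the matrix register can be modeled with a small class. This is a toy software model for illustration only: the class and method names are hypothetical, and the control signals (read/write enable, address, row/column select) are reduced to separate method calls.

```python
class MatrixRegisterFile:
    """Toy model of an N x N matrix register file."""

    def __init__(self, n):
        self.n = n
        self.cells = [[0] * n for _ in range(n)]   # cells[i][j] models E_{i,j}

    def write_row(self, i, values):
        """Write row vector register VR_i (elements E_{i,0} to E_{i,N-1})."""
        assert len(values) == self.n
        self.cells[i] = list(values)

    def read_row(self, i):
        """Read row vector register VR_i."""
        return list(self.cells[i])

    def write_col(self, j, values):
        """Write column vector register CVR_j (elements E_{0,j} to E_{N-1,j})."""
        assert len(values) == self.n
        for i, v in enumerate(values):
            self.cells[i][j] = v

    def read_col(self, j):
        """Read column vector register CVR_j."""
        return [self.cells[i][j] for i in range(self.n)]


# Write four rows, then read the last column (CVR_{N-1} with N = 4).
mrf = MatrixRegisterFile(4)
for i in range(4):
    mrf.write_row(i, [10 * i + j for j in range(4)])
last_col = mrf.read_col(3)
```

Writing rows and reading columns gives the transposed access that vertical filtering needs, without a separate transposition step in memory.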

The above is only a preferred embodiment of the present invention. The protection scope of the present invention is not limited to the above embodiment; all technical solutions falling within the concept of the present invention belong to its protection scope. It should be noted that, for those of ordinary skill in the art, improvements and modifications made without departing from the principle of the present invention should also be regarded as falling within the protection scope of the present invention.

Claims

1. A vectorized implementation method of deblocking filtering for vector processors, characterized in that the steps are:

2. The vectorized implementation method of deblocking filtering for vector processors according to claim 1, characterized in that the operation of selecting the final result in steps S3 and S6 comprises the following steps:

3. The vectorized implementation method of deblocking filtering for vector processors according to claim 1, characterized in that the conditional operations in steps S3 and S6 work as follows: suppose the vector processor is currently executing a vector instruction Inst under a condition register R0 = {R_{0,0}, R_{0,1}, ..., R_{0,M-1}}, whose elements correspond to PE_0 to PE_{M-1} respectively; if R_{0,i} == 1, PE_i executes the instruction Inst; otherwise PE_i performs a no-op.

4. The vectorized implementation method of deblocking filtering for vector processors according to any one of claims 1 to 3, characterized in that the vector memory bank comprises M memory blocks, which correspond one-to-one, in order, with the M vector PEs; the M memory blocks are uniformly addressed and interleaved by BANK; that is, the first word is stored in the first BANK, the second word in the second BANK, and so on until the M-th word is stored in the M-th BANK; the (M+1)-th word is then stored in the first BANK again, and so forth; each memory block is divided into an upper and a lower memory block and supports two vector access operations simultaneously.

5. The vectorized implementation method of deblocking filtering for vector processors according to any one of claims 1 to 3, characterized in that the vector matrix register file consists of M × M storage cells, each typically 4, 8, 12, 16 or 32 bits wide; logically, the array consists of M row vector registers VR_0 to VR_{M-1} or M column vector registers CVR_0 to CVR_{M-1}; each row vector register VR_i contains the M elements E_{i,0} to E_{i,M-1}, where i = 0, 1, 2, ..., M-1, and each column vector register CVR_i contains the M elements E_{0,i} to E_{M-1,i}, where i = 0, 1, 2, ..., M-1; the matrix register reads and writes row and column vectors under the control of the read/write enable, read/write address and row/column select signals.