CN104683817A - AVS-based methods for parallel transformation and inverse transformation - Google Patents


Info

Publication number
CN104683817A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510076289.XA
Other languages
Chinese (zh)
Other versions
CN104683817B (en)
Inventor
叶广明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
GUANGZHOU KUVISION DIGITAL TECHNOLOGY Co Ltd
Original Assignee
GUANGZHOU KUVISION DIGITAL TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by GUANGZHOU KUVISION DIGITAL TECHNOLOGY Co Ltd
Priority to CN201510076289.XA
Publication of CN104683817A
Application granted
Publication of CN104683817B
Legal status: Active

Abstract

The invention discloses AVS (Audio Video coding Standard)-based methods for parallel transformation and inverse transformation. The transformation method comprises the following steps: subtracting eight predicted pixel values from eight current pixel values to obtain eight residual values, and storing the residual values in a register; repeating this step seven more times to obtain eight rows of residual values, 64 in total, stored in eight registers respectively; transposing the residual matrix to obtain a transposed residual matrix; performing the butterfly operation of the horizontal transformation on the transposed residual matrix to obtain a horizontal transformation operation matrix; transposing the horizontal transformation operation matrix to obtain a transposed horizontal transformation operation matrix; performing the butterfly operation of the vertical transformation on the transposed horizontal transformation operation matrix to obtain a vertical transformation operation matrix; and applying $r_{ij} = (h_{ij} + 2^4) \gg 5$ to the vertical transformation operation matrix to obtain the transformation result matrix. The invention uses SIMD (Single Instruction Multiple Data) instructions and places the parameters of the transformation and the inverse transformation into registers for parallel processing, so that the transformation results and the inverse transformation results are obtained efficiently.

Description

AVS-based methods for parallel transformation and inverse transformation
Technical field
The present invention relates to the technical field of digital video encoding and decoding, and specifically to methods, based on the AVS (Audio Video coding Standard) standard, for transformation and inverse transformation optimized with SIMD (Single Instruction Multiple Data) instructions.
Background art
As the AVS standard is industrialized widely in broadcast television, the Internet, set-top boxes, surveillance and other fields, demand for playing and recording AVS files on PCs and embedded devices keeps growing, and real-time operation is expected. Because AVS achieves high coding efficiency at the cost of high algorithmic complexity, the codec must be optimized effectively before it can run in real time with smooth pictures, with playback and recording reaching 25 fps or even 30 fps.
Analysis of AVS encoding and decoding shows that quantization and inverse quantization (QUANT/DEQUANT) account for a large share of the running time; in particular, if the quantized data are all 0, many operations can be skipped, so optimizing these stages well effectively improves codec efficiency. Current PC and embedded processor chips provide SIMD instructions, such as Intel's MMX and SSE, AMD's 3DNow! and ARM's NEON; making good use of these SIMD instructions can effectively improve encoding and decoding speed.
DCT is the forward transformation. The predicted pixels are subtracted from the current pixels; taking the 8x8 transformation as an example, this yields an 8x8 residual coefficient matrix, on which the 8x8 integer forward discrete cosine transform is then performed. The steps are as follows:
1. Subtract the predicted pixel values from the unsigned current pixel values, both with a sample precision of 8 bits, to obtain signed residual values with a precision of 16 bits; these form the 8x8 residual matrix.
2. Apply the horizontal transformation to the residual matrix: $H' = \mathit{ResidualMatrix} \times T_8$, where ResidualMatrix is the residual matrix, $T_8$ is the transformation matrix specified by the AVS standard, and $H'$ is the intermediate result matrix after the horizontal transformation.
$$
T_8 = \begin{bmatrix}
8 & 10 & 10 & 9 & 8 & 6 & 4 & 2 \\
8 & 9 & 4 & -2 & -8 & -10 & -10 & -6 \\
8 & 6 & -4 & -10 & -8 & 2 & 10 & 9 \\
8 & 2 & -10 & -6 & 8 & 9 & -4 & -10 \\
8 & -2 & -10 & 6 & 8 & -9 & -4 & 10 \\
8 & -6 & -4 & 10 & -8 & -2 & 10 & -9 \\
8 & -9 & 4 & 2 & -8 & 10 & -10 & 6 \\
8 & -10 & 10 & -9 & 8 & -6 & 4 & -2
\end{bmatrix}
$$
3. Apply the vertical transformation to $H'$: $H'' = T_8^T \times H'$, where $T_8^T$ is the transpose of $T_8$ and $H''$ is the intermediate result matrix after the vertical transformation.
4. Adjust the intermediate result matrix $H''$ obtained above: $r_{ij} = (h''_{ij} + 2^4) \gg 5$, which gives the final result matrix of the transformation, where $h''_{ij}$ are the data in $H''$, $r_{ij}$ are the data in the final result matrix, and i and j range from 0 to 7.
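The four steps above can be sketched as a direct, unoptimized reference in Python. This is only an illustration of the arithmetic in steps 1 to 4; the function names (`matmul`, `transpose`, `forward_transform`) are hypothetical and do not come from the patent, and no SIMD is involved.

```python
# AVS 8x8 transformation matrix T8, as given in the text.
T8 = [
    [8, 10, 10, 9, 8, 6, 4, 2],
    [8, 9, 4, -2, -8, -10, -10, -6],
    [8, 6, -4, -10, -8, 2, 10, 9],
    [8, 2, -10, -6, 8, 9, -4, -10],
    [8, -2, -10, 6, 8, -9, -4, 10],
    [8, -6, -4, 10, -8, -2, 10, -9],
    [8, -9, 4, 2, -8, 10, -10, 6],
    [8, -10, 10, -9, 8, -6, 4, -2],
]


def matmul(a, b):
    """Plain 8x8 integer matrix product (hypothetical helper)."""
    return [[sum(a[i][k] * b[k][j] for k in range(8)) for j in range(8)]
            for i in range(8)]


def transpose(m):
    """Return the transpose of a square matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def forward_transform(current, predicted):
    # Step 1: residual = current - predicted (signed values).
    residual = [[c - p for c, p in zip(cur_row, pred_row)]
                for cur_row, pred_row in zip(current, predicted)]
    # Step 2: horizontal transformation, H' = ResidualMatrix x T8.
    h1 = matmul(residual, T8)
    # Step 3: vertical transformation, H'' = T8^T x H'.
    h2 = matmul(transpose(T8), h1)
    # Step 4: rounding adjustment, r = (h'' + 2^4) >> 5.
    return [[(v + 2 ** 4) >> 5 for v in row] for row in h2]
```

A block whose residual is 1 everywhere is a convenient sanity check: all the energy ends up in the top-left coefficient.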
IDCT is the inverse transformation. It converts the inverse quantization matrix, obtained through transformation, quantization and inverse quantization, back into a residual matrix and adds the predicted values to obtain the inverse transformation result. The steps are as follows:
1. Apply the horizontal inverse transformation to the inverse quantization matrix: $H' = \mathit{CoeffMatrix} \times T_8^T$, where CoeffMatrix is the inverse quantization matrix, $T_8^T$ is the transpose of $T_8$, and $H'$ is the intermediate result matrix of the horizontal inverse transformation.
2. Compute the following for the data in $H'$: $h''_{ij} = \mathrm{clip3}(-2^{n+7},\ 2^{n+7}-1,\ (h'_{ij}+4)) \gg 3$, which gives the adjusted matrix $H''$ of the horizontal inverse transformation, where $h'_{ij}$ are the data in the intermediate result matrix of the horizontal inverse transformation, $h''_{ij}$ are the data in $H''$, i and j range from 0 to 7, n is the sample precision (generally 8), and clip3 returns the median of its three arguments.
3. Apply the vertical inverse transformation to $H''$: $H = T_8 \times H''$, where $H$ is the intermediate result matrix of the vertical inverse transformation.
4. Adjust the data in $H$ to obtain the residual values of the residual matrix: $r_{ij} = \mathrm{clip3}(-2^{n+7},\ 2^{n+7}-1,\ (h_{ij}+64)) \gg 7$, where $h_{ij}$ are the data in $H$ and $r_{ij}$ are the data in the residual matrix.
5. Add each residual value to the predicted value $p_{ij}$ from before the transformation, compare the sum with 255 and take the smaller value: $c_{ij} = \min(r_{ij} + p_{ij}, 255)$, where $p_{ij}$ are the data in the predicted value matrix and $c_{ij}$ are the data in the inverse transformation result matrix.
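As a rough illustration, the five inverse steps can be written out directly in Python. The sketch follows the formulas exactly as they are written in the text (clip first, then shift) and implements clip3 literally as the median of three values; the function names are hypothetical, and nothing here is taken from the patent's own code.

```python
# AVS 8x8 transformation matrix T8, as given in the text.
T8 = [
    [8, 10, 10, 9, 8, 6, 4, 2],
    [8, 9, 4, -2, -8, -10, -10, -6],
    [8, 6, -4, -10, -8, 2, 10, 9],
    [8, 2, -10, -6, 8, 9, -4, -10],
    [8, -2, -10, 6, 8, -9, -4, 10],
    [8, -6, -4, 10, -8, -2, 10, -9],
    [8, -9, 4, 2, -8, 10, -10, 6],
    [8, -10, 10, -9, 8, -6, 4, -2],
]


def matmul(a, b):
    """Plain 8x8 integer matrix product (hypothetical helper)."""
    return [[sum(a[i][k] * b[k][j] for k in range(8)) for j in range(8)]
            for i in range(8)]


def clip3(lo, hi, v):
    """Median of the three arguments, i.e. v clamped to [lo, hi]."""
    return sorted((lo, hi, v))[1]


def inverse_transform(coeff, predicted, n=8):
    lo, hi = -2 ** (n + 7), 2 ** (n + 7) - 1
    t8_t = [list(col) for col in zip(*T8)]
    # Step 1: horizontal inverse transformation, H' = CoeffMatrix x T8^T.
    h1 = matmul(coeff, t8_t)
    # Step 2: h'' = clip3(lo, hi, h' + 4) >> 3, as written in the text.
    h2 = [[clip3(lo, hi, v + 4) >> 3 for v in row] for row in h1]
    # Step 3: vertical inverse transformation, H = T8 x H''.
    h = matmul(T8, h2)
    # Step 4: r = clip3(lo, hi, h + 64) >> 7.
    r = [[clip3(lo, hi, v + 64) >> 7 for v in row] for row in h]
    # Step 5: c = min(r + p, 255).
    return [[min(rv + pv, 255) for rv, pv in zip(r_row, p_row)]
            for r_row, p_row in zip(r, predicted)]
```

With only the top-left coefficient set, every pixel receives the same residual, which makes the saturation in step 5 easy to observe by raising the predicted values.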
In the transformation and inverse transformation described above, computing each pixel requires 16 multiplications and 14 additions. Even though a butterfly operation can optimize the computation to some extent, the process remains very time-consuming.
Summary of the invention
To overcome the deficiencies of the prior art, the object of the present invention is to provide AVS-based parallel transformation and inverse transformation methods that use SIMD instructions to place the parameters of the transformation and the inverse transformation into registers for parallel processing, so that the transformation results and the inverse transformation results are obtained efficiently.
To solve the above problems, the technical solutions adopted by the present invention are as follows:
Scheme one:
An AVS-based parallel transformation method, comprising the following steps:
Step A: use two registers to read in eight current pixel values and eight predicted pixel values in parallel; use the low-order interleave instruction to interleave the current pixel values with the predicted pixel values, giving a first interleave result, and to interleave the predicted pixel values with themselves, giving a second interleave result; subtract the second interleave result from the first to obtain eight residual values, and store them in one register; the current pixel values and predicted pixel values are 8-bit data, and the residual values are 16-bit data;
Step B: repeat step A seven more times to obtain 64 residual values forming an 8*8 residual matrix, where the eight residual values obtained by each execution of step A form one row of the residual matrix and the eight rows are stored in eight registers respectively;
Step C: transpose the residual matrix using a combination of low-order and high-order interleave instructions to obtain the transposed residual matrix; the data in the eight registers are replaced by the eight rows of the transposed residual matrix respectively;
Step D: perform the butterfly operation of the horizontal transformation on the data in the transposed residual matrix to obtain the horizontal transformation operation matrix; the data in the eight registers are replaced by the eight rows of the horizontal transformation operation matrix respectively;
Step E: transpose the horizontal transformation operation matrix using a combination of low-order and high-order interleave instructions to obtain the transposed horizontal transformation operation matrix; the data in the eight registers are replaced by the eight rows of the transposed horizontal transformation operation matrix respectively;
Step F: perform the butterfly operation of the vertical transformation on the data in the transposed horizontal transformation operation matrix to obtain the vertical transformation operation matrix; the data in the eight registers are replaced by the eight rows of the vertical transformation operation matrix respectively;
Step G: apply the following operation to the vertical transformation operation matrix: $r_{ij} = (h_{ij} + 2^4) \gg 5$, which gives the transformation result matrix, where $h_{ij}$ are the data in the vertical transformation operation matrix, $r_{ij}$ are the data in the transformation result matrix, and i and j range from 0 to 7.
Preferably, step F comprises the following sub-steps:
Step F0: widen the data in the transposed horizontal transformation operation matrix from 16 bits to 32 bits; load the first four values of each row of the transposed horizontal transformation operation matrix into the corresponding registers, and temporarily store the last four values of each row in memory;
Step F1: perform the butterfly operation of the vertical transformation on the data held in the eight registers to obtain the left half of the vertical transformation operation matrix, then temporarily store this left half in memory;
Step F2: load the data of the transposed horizontal transformation operation matrix that were temporarily stored in memory into the eight registers;
Step F3: perform the butterfly operation of the vertical transformation on the data held in the eight registers to obtain the right half of the vertical transformation operation matrix, then combine the left half and the right half to obtain the vertical transformation operation matrix.
Preferably, in the butterfly operations of both the horizontal transformation and the vertical transformation, for two groups of data stored in their respective registers that must be both added and subtracted, the following operation steps are performed: denote the two registers xmm0 and xmm7, the data in xmm0 as x0 and the data in xmm7 as x7; first add xmm0 and xmm7 and store the sum in xmm0; then add xmm7 to itself and store the sum in xmm7; finally subtract xmm7 from xmm0 and store the difference in xmm7. After these steps, xmm0 holds x0+x7 and xmm7 holds x0-x7.
Preferably, in the butterfly operations of both the horizontal transformation and the vertical transformation, denote as y the data stored in a register that must be multiplied by 10; the multiplication y*10 is then performed with the following operation steps: denote the register holding y as xmm1; first copy y into another register, denoted xmm2; then shift the data in xmm2 left by two bits and add them to the data in xmm1, storing the sum in xmm1; finally add xmm1 to itself. After these steps, xmm1 holds y*10.
Scheme two:
An AVS-based parallel inverse transformation method, comprising the following steps:
Step A: use eight registers to hold the eight rows of the inverse quantization matrix respectively; the data in the inverse quantization matrix are 16-bit data;
Step B: transpose the inverse quantization matrix using a combination of low-order and high-order interleave instructions to obtain the transposed inverse quantization matrix; the data in the eight registers are replaced by the eight rows of the transposed inverse quantization matrix respectively;
Step C: perform the butterfly operation of the horizontal inverse transformation on the data in the transposed inverse quantization matrix to obtain the horizontal inverse transformation operation matrix; the data in the eight registers are replaced by the eight rows of the horizontal inverse transformation operation matrix respectively;
Step D: apply the following operation to the data in the horizontal inverse transformation operation matrix: $h''_{ij} = \mathrm{clip3}(-2^{n+7},\ 2^{n+7}-1,\ (h'_{ij}+4)) \gg 3$, which gives the horizontal inverse transformation intermediate matrix, where $h''_{ij}$ are the data in the horizontal inverse transformation intermediate matrix, $h'_{ij}$ are the data in the horizontal inverse transformation operation matrix, n is 8, i and j range from 0 to 7, and clip3 returns the median of its three arguments; the data in the eight registers are replaced by the eight rows of the horizontal inverse transformation intermediate matrix respectively;
Step E: transpose the horizontal inverse transformation intermediate matrix using a combination of low-order and high-order interleave instructions to obtain the transposed horizontal inverse transformation intermediate matrix; the data in the eight registers are replaced by the eight rows of the transposed horizontal inverse transformation intermediate matrix respectively;
Step F: perform the butterfly operation of the vertical inverse transformation on the data in the transposed horizontal inverse transformation intermediate matrix to obtain the vertical inverse transformation operation matrix; the data in the eight registers are replaced by the eight rows of the vertical inverse transformation operation matrix respectively;
Step G: apply the following operation to the vertical inverse transformation operation matrix: $r_{ij} = \mathrm{clip3}(-2^{n+7},\ 2^{n+7}-1,\ (h_{ij}+64)) \gg 7$, which gives the residual operation matrix, where $r_{ij}$ are the data in the residual operation matrix, $h_{ij}$ are the data in the vertical inverse transformation operation matrix, n is 8, i and j range from 0 to 7, and clip3 returns the median of its three arguments;
Step H: apply the following operation to the residual operation matrix: $c_{ij} = \min(r_{ij} + p_{ij}, 255)$, which gives the inverse transformation result matrix, where $c_{ij}$ are the data in the inverse transformation result matrix, $r_{ij}$ are the data in the residual operation matrix, $p_{ij}$ are the data in the predicted pixel matrix, i and j range from 0 to 7, and min takes the minimum value.
Preferably, step F comprises the following sub-steps:
Step F0: widen the data in the transposed horizontal inverse transformation intermediate matrix from 16 bits to 32 bits; load the first four values of each row of the transposed horizontal inverse transformation intermediate matrix into the corresponding registers, and temporarily store the last four values of each row in memory;
Step F1: perform the butterfly operation of the vertical inverse transformation on the data held in the eight registers to obtain the left half of the vertical inverse transformation operation matrix, and temporarily store it in memory;
Step F2: load the data of the transposed horizontal inverse transformation intermediate matrix that were temporarily stored in memory into the eight registers;
Step F3: perform the butterfly operation of the vertical inverse transformation on the data held in the eight registers to obtain the right half of the vertical inverse transformation operation matrix, then combine the left half and the right half to obtain the vertical inverse transformation operation matrix.
Preferably, in the butterfly operations of both the horizontal inverse transformation and the vertical inverse transformation, for two groups of data stored in their respective registers that must be both added and subtracted, the following operation steps are performed: denote the two registers xmm0 and xmm7, the data in xmm0 as x0 and the data in xmm7 as x7; first add xmm0 and xmm7 and store the sum in xmm0; then add xmm7 to itself and store the sum in xmm7; finally subtract xmm7 from xmm0 and store the difference in xmm7. After these steps, xmm0 holds x0+x7 and xmm7 holds x0-x7.
Preferably, in the butterfly operations of both the horizontal inverse transformation and the vertical inverse transformation, denote as y the data stored in a register that must be multiplied by 10; the multiplication y*10 is then performed with the following operation steps: denote the register holding y as xmm1; first copy y into another register, denoted xmm2; then shift the data in xmm2 left by two bits and add them to the data in xmm1, storing the sum in xmm1; finally add xmm1 to itself. After these steps, xmm1 holds y*10.
Compared with the prior art, the beneficial effects of the present invention are as follows. Data are placed into registers in parallel for computation, so several values are processed at once, which greatly improves computational efficiency. In addition, in the butterfly operations of the horizontal and vertical transformation and inverse transformation, for two data that must be both added and subtracted, the operations are arranged so that the addition and the subtraction complete within the two corresponding registers, without moving data to or from memory, which improves computation speed. For data that must be multiplied, the multiplication is converted into additions, which also improves computation speed.
Brief description of the drawings
Fig. 1 is a flow chart of the AVS-based parallel transformation method of the present invention.
Fig. 2 is a flow chart of the AVS-based parallel inverse transformation method of the present invention.
Fig. 3 is a schematic diagram of the matrix transposition process of the present invention.
Embodiments
The present invention is further described below with reference to the accompanying drawings and the embodiments:
Embodiment 1:
With reference to Fig. 1, an AVS-based parallel transformation method comprises the following steps:
Step A: use two registers to read in eight current pixel values and eight predicted pixel values in parallel; use the low-order interleave instruction to interleave the current pixel values with the predicted pixel values, giving a first interleave result, and to interleave the predicted pixel values with themselves, giving a second interleave result; subtract the second interleave result from the first to obtain eight residual values, and store them in one register. The current pixel values and predicted pixel values are 8-bit data, and the residual values are 16-bit data.
Here, the eight current pixel values are one row of eight values in the current sample matrix, and the eight predicted pixel values are one row of eight values in the predicted pixel matrix. Because the current and predicted pixel values are 8 bits wide and a register is 128 bits wide, the eight values occupy the low 64 bits of the register and the high 64 bits are 0. After the low-order interleave instruction and the subtraction instruction, eight 16-bit residual values are obtained. With a register, eight values are processed in parallel in a single operation, which greatly increases efficiency.
The above step can be expressed with the following pseudocode:
movq xmm0, [ecx]; // load the 8 current pixel values into xmm0
movq xmm7, [edx]; // load the 8 predicted pixel values into xmm7
punpcklbw xmm0, xmm7; // interleave the low bytes of the current and predicted pixel values
punpcklbw xmm7, xmm7; // interleave the low bytes of the predicted pixel values with themselves
psubw xmm0, xmm7; // subtract to obtain eight 16-bit residual values
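The reason this works is that each interleaved 16-bit word of xmm0 equals current + 256*predicted while each word of xmm7 equals predicted + 256*predicted, so the high-byte terms cancel in the subtraction. The following Python sketch emulates the two instructions on lists of lanes to check that arithmetic; the helper names mirror the mnemonics but are only hypothetical stand-ins, not real intrinsics.

```python
def punpcklbw(a, b):
    """Emulate a low-byte interleave: word i has a[i] as low byte, b[i] as high byte."""
    return [(hi << 8) | lo for lo, hi in zip(a[:8], b[:8])]


def psubw(a, b):
    """Emulate a lane-wise 16-bit subtraction with wraparound."""
    return [(x - y) & 0xFFFF for x, y in zip(a, b)]


def to_signed16(word):
    """Reinterpret an unsigned 16-bit lane as a signed value."""
    return word - 0x10000 if word & 0x8000 else word


def residual_row(current, predicted):
    """Eight signed 16-bit residuals from eight 8-bit current and predicted values."""
    xmm0 = punpcklbw(current, predicted)    # current + 256 * predicted
    xmm7 = punpcklbw(predicted, predicted)  # predicted + 256 * predicted
    return [to_signed16(w) for w in psubw(xmm0, xmm7)]
```

For any pair of 8-bit inputs the result equals the plain difference current - predicted, including the extreme cases 0 - 255 and 255 - 0.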
Step B: repeat step A seven more times to obtain 64 residual values forming an 8*8 residual matrix, where the eight residual values obtained by each execution of step A form one row of the residual matrix and the eight rows are stored in eight registers respectively.
Step C: transpose the residual matrix using a combination of low-order and high-order interleave instructions to obtain the transposed residual matrix; the data in the eight registers are replaced by the eight rows of the transposed residual matrix respectively.
In step C, the eight registers, each holding one row of residual data, are transposed with a combination of low-order and high-order interleave instructions, so that the columns of the residual matrix become rows held in the registers. The transposition process is shown in Fig. 3: the punpcklwd, punpckhwd, punpckldq, punpckhdq, punpcklqdq and punpckhqdq instructions transpose the residual matrix. Before transposition, a0a1a2a3a4a5a6a7 is one row of residual data; after these instructions, a0b0c0d0e0f0g0h0 is one row of the transposed residual matrix, that is, one column of the residual matrix.
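One common way to arrange these six instructions is in three stages of increasing lane width (16, 32 and 64 bits). Fig. 3 is not reproduced here, so the exact pairing below is an assumption rather than the patent's own ordering; the Python sketch emulates each register as a list of eight 16-bit lanes, and `unpack` is a hypothetical helper standing in for the six instructions.

```python
def unpack(a, b, lane, high):
    """Emulate punpckl*/punpckh*: interleave groups of `lane` words taken from
    the low (high=False) or high (high=True) half of registers a and b."""
    half = len(a) // 2
    start = half if high else 0
    out = []
    for i in range(start, start + half, lane):
        out += a[i:i + lane] + b[i:i + lane]
    return out


def transpose8x8(rows):
    """Transpose eight registers of eight 16-bit lanes in three interleave stages."""
    # Stage 1: punpcklwd / punpckhwd on neighbouring rows (16-bit lanes).
    t = []
    for k in range(0, 8, 2):
        t.append(unpack(rows[k], rows[k + 1], 1, False))
        t.append(unpack(rows[k], rows[k + 1], 1, True))
    # Stage 2: punpckldq / punpckhdq (32-bit lanes, two words each).
    u = []
    for k in (0, 1, 4, 5):
        u.append(unpack(t[k], t[k + 2], 2, False))
        u.append(unpack(t[k], t[k + 2], 2, True))
    # Stage 3: punpcklqdq / punpckhqdq (64-bit lanes, four words each).
    out = []
    for k in range(4):
        out.append(unpack(u[k], u[k + 4], 4, False))
        out.append(unpack(u[k], u[k + 4], 4, True))
    return out
```

After the third stage the first register holds a0 b0 c0 d0 e0 f0 g0 h0, matching the description of the transposed row in the text.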
Step D: perform the butterfly operation of the horizontal transformation on the data in the transposed residual matrix to obtain the horizontal transformation operation matrix; the data in the eight registers are replaced by the eight rows of the horizontal transformation operation matrix respectively.
In this step, the code corresponding to the butterfly operation of the horizontal transformation is as follows:
In the above code, SRC(0) to SRC(7) correspond to the eight rows of the transposed residual matrix, that is, to the eight registers, and DST(0) to DST(7) correspond to the eight rows of the horizontal transformation operation matrix; after the code is executed, the data represented by DST(0) to DST(7) are held in the corresponding registers.
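For orientation, the sketch below shows one possible even/odd butterfly factorization of the product of eight inputs with $T_8$. It is derived here from the symmetry of the columns of $T_8$ and is not the patent's listing; the names `butterfly_row` and `direct_row` are hypothetical. Each scalar in the sketch corresponds to one lane of a register, so a SIMD version would apply the same arithmetic to whole registers at once.

```python
# AVS 8x8 transformation matrix T8, as given in the text.
T8 = [
    [8, 10, 10, 9, 8, 6, 4, 2],
    [8, 9, 4, -2, -8, -10, -10, -6],
    [8, 6, -4, -10, -8, 2, 10, 9],
    [8, 2, -10, -6, 8, 9, -4, -10],
    [8, -2, -10, 6, 8, -9, -4, 10],
    [8, -6, -4, 10, -8, -2, 10, -9],
    [8, -9, 4, 2, -8, 10, -10, 6],
    [8, -10, 10, -9, 8, -6, 4, -2],
]


def butterfly_row(x):
    """Compute x * T8 for eight inputs using sums and differences of mirrored pairs."""
    s = [x[i] + x[7 - i] for i in range(4)]  # s[0] plays the role of s07
    d = [x[i] - x[7 - i] for i in range(4)]  # d[0] plays the role of d07
    y = [0] * 8
    # Even outputs depend only on the sums.
    y[0] = 8 * (s[0] + s[1] + s[2] + s[3])
    y[4] = 8 * (s[0] - s[1] - s[2] + s[3])
    y[2] = 10 * (s[0] - s[3]) + 4 * (s[1] - s[2])
    y[6] = 4 * (s[0] - s[3]) - 10 * (s[1] - s[2])
    # Odd outputs depend only on the differences.
    y[1] = 10 * d[0] + 9 * d[1] + 6 * d[2] + 2 * d[3]
    y[3] = 9 * d[0] - 2 * d[1] - 10 * d[2] - 6 * d[3]
    y[5] = 6 * d[0] - 10 * d[1] + 2 * d[2] + 9 * d[3]
    y[7] = 2 * d[0] - 6 * d[1] + 9 * d[2] - 10 * d[3]
    return y


def direct_row(x):
    """Reference: the same product computed as a plain vector-matrix multiply."""
    return [sum(x[i] * T8[i][k] for i in range(8)) for k in range(8)]
```

The factorization replaces 64 multiplications per row with a handful of shared sums and differences, which is the kind of saving the butterfly operation refers to.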
In the above computation, some pairs of data must be both added and subtracted, for example s07=SRC(0)+SRC(7) and d07=SRC(0)-SRC(7). Computed in the ordinary way, s07 would first have to be written to memory, because if s07 were placed in the register corresponding to SRC(0), d07 could no longer be computed: SRC(0) would already have been overwritten by s07. To avoid frequently exporting data to memory and importing them back into registers, the following pseudocode can be used:
With the above pseudocode, the additions and subtractions all run in registers; no data need to be moved to or from memory, which improves running speed.
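The idea can be checked with a small Python emulation in which two lists stand in for the two registers and no third buffer holds data. The sketch follows the three-step sequence described in the summary of the invention, under the assumption that the second step doubles xmm7, since that is the reading under which the stated results x0+x7 and x0-x7 come out; the function name is hypothetical.

```python
def sum_diff_two_registers(xmm0, xmm7):
    """Return (x0 + x7, x0 - x7) using only the two 'registers' as storage."""
    # Step 1: xmm0 = xmm0 + xmm7, so xmm0 now holds x0 + x7.
    xmm0 = [a + b for a, b in zip(xmm0, xmm7)]
    # Step 2 (assumed reading): xmm7 = xmm7 + xmm7, so xmm7 holds 2 * x7.
    xmm7 = [b + b for b in xmm7]
    # Step 3: xmm7 = xmm0 - xmm7 = (x0 + x7) - 2 * x7 = x0 - x7.
    xmm7 = [a - b for a, b in zip(xmm0, xmm7)]
    return xmm0, xmm7
```

The original x0 is never needed after step 1 because x0 - x7 can be recovered from the sum and the doubled x7.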
In addition, the butterfly code also multiplies some data by 10, for example a2*10. Because multiplication in a register takes much longer than addition, subtraction and shifting, the multiplication can be turned into additions with the following pseudocode:
movdqa xmm0, xmm6; // xmm0 = xmm6 = a2
psllw xmm6, 2; // xmm6 = (a2 << 2) = 4*a2
paddw xmm0, xmm6; // xmm0 = xmm0 + xmm6 = 5*a2
paddw xmm0, xmm0; // xmm0 = xmm0 + xmm0 = 10*a2
With the above pseudocode, the multiplication is replaced by additions and a shift, which improves computation speed.
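The same four instructions can be emulated lane by lane in Python to confirm that 10*a2 = 2*(a2 + 4*a2), including for negative values held as 16-bit two's-complement lanes. The helper names are hypothetical, and the sketch assumes the true product fits in 16 bits.

```python
def to_u16(v):
    """Store a signed value as an unsigned 16-bit lane."""
    return v & 0xFFFF


def to_s16(w):
    """Read an unsigned 16-bit lane back as a signed value."""
    return w - 0x10000 if w & 0x8000 else w


def times10(values):
    """Multiply each lane by 10 using one copy, one shift and two additions."""
    xmm6 = [to_u16(v) for v in values]
    xmm0 = list(xmm6)                                       # movdqa xmm0, xmm6
    xmm6 = [(w << 2) & 0xFFFF for w in xmm6]                # psllw  xmm6, 2
    xmm0 = [(a + b) & 0xFFFF for a, b in zip(xmm0, xmm6)]   # paddw  xmm0, xmm6
    xmm0 = [(a + a) & 0xFFFF for a in xmm0]                 # paddw  xmm0, xmm0
    return [to_s16(w) for w in xmm0]
```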
Step E: transpose the horizontal transformation operation matrix using a combination of low-order and high-order interleave instructions to obtain the transposed horizontal transformation operation matrix; the data in the eight registers are replaced by the eight rows of the transposed horizontal transformation operation matrix respectively. The transposition principle is described in detail in step C and is not repeated here.
Step F: perform the butterfly operation of the vertical transformation on the data in the transposed horizontal transformation operation matrix to obtain the vertical transformation operation matrix; the data in the eight registers are replaced by the eight rows of the vertical transformation operation matrix respectively.
After the butterfly operation of the horizontal transformation, the data produced by the butterfly operation of the vertical transformation in this step may exceed 16 bits. If each register continued to hold 8 values, data errors could occur, so the data in the registers must be widened to 32 bits to avoid such errors. Specifically, this step comprises the following sub-steps:
Step F0: widen the data in the transposed horizontal transformation operation matrix from 16 bits to 32 bits; load the first four values of each row of the transposed horizontal transformation operation matrix into the corresponding registers, and temporarily store the last four values of each row in memory;
Step F1: perform the butterfly operation of the vertical transformation on the data held in the eight registers to obtain the left half of the vertical transformation operation matrix, then temporarily store this left half in memory;
Step F2: load the data of the transposed horizontal transformation operation matrix that were temporarily stored in memory into the eight registers;
Step F3: perform the butterfly operation of the vertical transformation on the data held in the eight registers to obtain the right half of the vertical transformation operation matrix, then combine the left half and the right half to obtain the vertical transformation operation matrix.
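A short Python sketch can illustrate both points: why 16 bits are not enough after the second pass, and why processing four columns at a time gives the same result as processing all eight. It works on plain integers and splits the matrix by columns, which is a simplifying assumption about the data layout; the function names are hypothetical.

```python
# AVS 8x8 transformation matrix T8, as given in the text.
T8 = [
    [8, 10, 10, 9, 8, 6, 4, 2],
    [8, 9, 4, -2, -8, -10, -10, -6],
    [8, 6, -4, -10, -8, 2, 10, 9],
    [8, 2, -10, -6, 8, 9, -4, -10],
    [8, -2, -10, 6, 8, -9, -4, 10],
    [8, -6, -4, 10, -8, -2, 10, -9],
    [8, -9, 4, 2, -8, 10, -10, 6],
    [8, -10, 10, -9, 8, -6, 4, -2],
]


def vertical(block):
    """Vertical pass T8^T x block for a block with 8 rows and any number of columns."""
    cols = len(block[0])
    return [[sum(T8[i][k] * block[i][j] for i in range(8)) for j in range(cols)]
            for k in range(8)]


def vertical_in_halves(h1):
    """Run the vertical pass on the left four columns, then the right four, and join."""
    left = vertical([row[:4] for row in h1])    # corresponds to steps F0 and F1
    right = vertical([row[4:] for row in h1])   # corresponds to steps F2 and F3
    return [l + r for l, r in zip(left, right)]


def worst_case_dc():
    """Top-left value after both passes for a flat residual of 255."""
    # Horizontal pass of a flat row of 255 gives 255 * 64 in column 0, 0 elsewhere.
    h1 = [[255 * 64] + [0] * 7 for _ in range(8)]
    return vertical(h1)[0][0]
```

For a flat residual of 255 the top-left value reaches 255 * 64 * 64 = 1044480, far beyond the signed 16-bit maximum of 32767, which is consistent with the widening to 32 bits described above.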
Step G: apply the following operation to the vertical transformation operation matrix: $r_{ij} = (h_{ij} + 2^4) \gg 5$, which gives the transformation result matrix, where $h_{ij}$ are the data in the vertical transformation operation matrix, $r_{ij}$ are the data in the transformation result matrix, and i and j range from 0 to 7.
The code corresponding to the butterfly operation of the vertical transformation in step F and to the operation in step G is as follows:
Here, once the data in the registers are 32 bits wide, each register can hold only four 32-bit values, so SRC(0) to SRC(7) correspond to the left half or the right half of the transposed horizontal transformation operation matrix, and DST(0) to DST(7) correspond to the left half or the right half of the transformation result matrix. The above code must be run twice: the first run yields the left half of the transformation result matrix and the second run yields the right half. In addition, this computation also contains pairs of data that must be both added and subtracted, as well as multiplications by 10. The same treatment as in the butterfly operation of the horizontal transformation is used: the additions and subtractions all run in registers without moving data to or from memory, and the multiplications are replaced by additions and shifts, both of which improve computation speed. The details are described in step D and are not repeated here.
It should be noted that, while the steps of embodiment 1 are executed, the matrix data produced by each completed step are stored in the eight registers and therefore occupy all of them. Before the computation of the next step, the data in one of the registers must be exported to memory to free a register for the computation.
In embodiment 1, performing the transformation with the above steps has the following advantages. Data are placed into registers in parallel for computation, so several values are processed at once, which greatly improves computational efficiency. In addition, in the butterfly operations of the horizontal and vertical transformation, for two data that must be both added and subtracted, the operations are arranged so that the addition and the subtraction complete within the two corresponding registers, without moving data to or from memory, which improves computation speed. For data that must be multiplied, the multiplication is converted into additions, which also improves computation speed.
Embodiment 2:
Referring to Figure 2, the AVS-based parallel inverse transformation method comprises the following steps:
Step A: Eight registers are used to hold the eight rows of the inverse quantization matrix respectively; the data in the inverse quantization matrix is 16-bit data.
Because the encoding and decoding process consists of transformation, quantization, inverse quantization and inverse transformation, the data read by the inverse transformation flow is the data of the 8*8 inverse quantization matrix.
Step B: The inverse quantization matrix is transposed by combining the low-order interleave instruction and the high-order interleave instruction to obtain the transposed inverse quantization matrix, and the data in the eight registers is replaced with the eight rows of the transposed inverse quantization matrix respectively. The transposition principle is described in detail in step C of Embodiment 1 and is not repeated here.
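Transposition by low-order and high-order interleaving can be illustrated with a small Python model. It is a sketch under assumptions: the helper names `unpack` and `transpose8x8` are hypothetical, each register is modelled as a list of eight 16-bit lanes, and the three-stage schedule (interleaving units of one, two and then four lanes) is the usual way such a transpose is built, which may differ in detail from step C of Embodiment 1.

```python
def unpack(a, b, width, high):
    """Interleave units of `width` lanes taken from the low or high half of a and b."""
    base = 4 if high else 0
    out = []
    for k in range(0, 4, width):
        out += a[base + k: base + k + width]
        out += b[base + k: base + k + width]
    return out


def transpose8x8(r):
    """Transpose eight 8-lane registers using only low/high interleaves."""
    # Stage 1: interleave single lanes of neighbouring rows.
    t = []
    for i in (0, 2, 4, 6):
        t.append(unpack(r[i], r[i + 1], 1, False))
        t.append(unpack(r[i], r[i + 1], 1, True))
    # Stage 2: interleave pairs of lanes.
    u = []
    for i, j in ((0, 2), (1, 3), (4, 6), (5, 7)):
        u.append(unpack(t[i], t[j], 2, False))
        u.append(unpack(t[i], t[j], 2, True))
    # Stage 3: interleave groups of four lanes; each result is one column.
    c = []
    for i in range(4):
        c.append(unpack(u[i], u[i + 4], 4, False))
        c.append(unpack(u[i], u[i + 4], 4, True))
    return c


rows = [[8 * i + j for j in range(8)] for i in range(8)]
cols = transpose8x8(rows)
```

Every stage reads and writes whole registers, so the transposed rows end up in the same eight registers without element-by-element memory access.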
Step C: The butterfly operation of the horizontal inverse transformation part is performed on the data in the transposed inverse quantization matrix to obtain the horizontal inverse transformation operation matrix, and the data in the eight registers is replaced with the eight rows of the horizontal inverse transformation operation matrix respectively.
The code corresponding to the butterfly operation of the horizontal inverse transformation part is as follows:
Here, the results of the horizontal inverse transformation part, DST(0, a0), DST(1, a1), DST(2, a2), DST(3, a3), DST(7, a7), DST(6, a6), DST(5, a5) and DST(4, a4), are saved in the corresponding registers. This operation also contains pairs of data that must be both added and subtracted, as well as multiplications by 10. As before, the additions and subtractions are carried out entirely in registers and each multiplication is replaced by addition and shift operations, which increases the operation speed. The specific processing is described in detail in Embodiment 1 and is not repeated here.
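The replacement of a multiplication by 10 with shifts and additions can be sketched as follows. The sequence (copy, shift left by two, add, then add the register to itself) is the one the document gives for y*10; the Python modelling of a register as a list of lanes and the function name `times_ten` are assumptions made for illustration, and lane overflow is not modelled.

```python
def times_ten(y):
    """Compute 10 * y for every lane using one copy, one shift and two additions."""
    xmm1 = list(y)                               # register holding y
    xmm2 = list(xmm1)                            # copy y into a second register
    xmm2 = [v << 2 for v in xmm2]                # xmm2 = 4y
    xmm1 = [a + b for a, b in zip(xmm1, xmm2)]   # xmm1 = y + 4y = 5y
    xmm1 = [a + a for a in xmm1]                 # xmm1 = 5y + 5y = 10y
    return xmm1


result = times_ten([1, -3, 7, 0])
```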
Step D: The following operation is performed on the data in the horizontal inverse transformation operation matrix: $h''_{ij} = \mathrm{clip3}(-2^{n+7},\ 2^{n+7}-1,\ (h'_{ij}+4)) \gg 3$, to obtain the horizontal inverse transformation intermediate matrix, where $h''_{ij}$ is the data in the horizontal inverse transformation intermediate matrix, $h'_{ij}$ is the data in the horizontal inverse transformation operation matrix, $n$ is 8, $i$ and $j$ range from 0 to 7, and clip3 takes the median of its three values. The data in the eight registers is replaced with the eight rows of the horizontal inverse transformation intermediate matrix respectively.
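A scalar sketch of the step D rounding may help make the formula concrete. It follows the parenthesization given in step D (clip first, then shift right by 3) and uses n = 8, so the clipping range is -32768 to 32767. The function names are hypothetical, and the loop stands in for what the method does on whole registers at once.

```python
def clip3(lo, hi, v):
    """Median of the three values, that is, v limited to the range [lo, hi]."""
    return max(lo, min(hi, v))


def horizontal_round(h1, n=8):
    """Apply clip3(-2^(n+7), 2^(n+7) - 1, h' + 4) >> 3 to every element."""
    lo, hi = -(1 << (n + 7)), (1 << (n + 7)) - 1
    # Python's >> on negative integers is an arithmetic (sign-preserving) shift.
    return [[clip3(lo, hi, v + 4) >> 3 for v in row] for row in h1]


sample = horizontal_round([[0, 4, -4, -12, 32767, 40000, -40000]])
```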
Step E: The horizontal inverse transformation intermediate matrix is transposed by combining the low-order interleave instruction and the high-order interleave instruction to obtain the transposed horizontal inverse transformation intermediate matrix, and the data in the eight registers is replaced with the eight rows of the transposed horizontal inverse transformation intermediate matrix respectively. The transposition principle is described in detail in Embodiment 1 and is not repeated here.
Step F: The butterfly operation of the vertical inverse transformation part is performed on the data in the transposed horizontal inverse transformation intermediate matrix to obtain the vertical inverse transformation operation matrix, and the data in the eight registers is replaced with the eight rows of the vertical inverse transformation operation matrix respectively.
After the butterfly operation of the horizontal inverse transformation part, performing the butterfly operation of the vertical inverse transformation part in this step may produce data wider than 16 bits. If each register continued to hold eight values, data overflow could occur, so the data in the registers must be widened to 32 bits to avoid overflow. This comprises the following sub-steps:
Step F0: The data in the transposed horizontal inverse transformation intermediate matrix is converted from 16 bits to 32 bits; the first four values of each row of the transposed horizontal inverse transformation intermediate matrix are sent to the corresponding registers, and the last four values of each row are temporarily stored in memory;
Step F1: The butterfly operation of the vertical inverse transformation part is performed on the data held in the eight registers to obtain the left half of the vertical inverse transformation operation matrix, which is temporarily stored in memory;
Step F2: The data of the transposed horizontal inverse transformation intermediate matrix that was temporarily stored in memory is sent from memory to the eight registers respectively;
Step F3: The butterfly operation of the vertical inverse transformation part is performed on the data held in the eight registers to obtain the right half of the vertical inverse transformation operation matrix, and the left half and right half are combined to obtain the vertical inverse transformation operation matrix.
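The two-pass structure of sub-steps F0 to F3 can be sketched as below. The sketch only shows how the matrix is split and recombined: `toy_stage` is a hypothetical stand-in for the butterfly operation (it merely forms sums and differences of mirrored rows and is not the AVS butterfly), and `two_pass` is likewise a name invented for illustration. Python integers do not overflow, so the 16-bit to 32-bit widening is implicit here.

```python
def toy_stage(regs):
    """Placeholder lane-wise stage: sums and differences of mirrored rows."""
    out = [None] * 8
    for i in range(4):
        out[i] = [a + b for a, b in zip(regs[i], regs[7 - i])]
        out[7 - i] = [a - b for a, b in zip(regs[i], regs[7 - i])]
    return out


def two_pass(matrix, stage):
    """Run `stage` on the left four columns, then on the right four columns."""
    left = [row[:4] for row in matrix]      # F0: first four values go to the registers
    spilled = [row[4:] for row in matrix]   # F0: last four values wait in memory
    left_out = stage(left)                  # F1: left half of the result
    right_out = stage(spilled)              # F2 + F3: reload, then right half
    return [l + r for l, r in zip(left_out, right_out)]


m = [[8 * i + j for j in range(8)] for i in range(8)]
combined = two_pass(m, toy_stage)
```

Because the stage works lane by lane, processing the two halves separately gives the same result as processing all eight columns at once, at the cost of one extra pass.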
Step G: The following operation is performed on the vertical inverse transformation operation matrix: $r_{ij} = \mathrm{clip3}(-2^{n+7},\ 2^{n+7}-1,\ (h_{ij}+64)) \gg 7$, to obtain the residual operation matrix, where $r_{ij}$ is the data in the residual operation matrix, $h_{ij}$ is the data in the vertical inverse transformation operation matrix, $n$ is 8, $i$ and $j$ range from 0 to 7, and clip3 takes the median of its three values.
The code corresponding to the butterfly operation of the vertical inverse transformation part in step F and to the operation in step G is as follows:
After the data width in the registers becomes 32 bits, each register can hold only four 32-bit values, so the above operation must be executed twice to obtain all the data in the residual operation matrix. In addition, pairs of data must be both added and subtracted, and there are multiplications by 10. As before, the additions and subtractions are carried out entirely in registers and each multiplication is replaced by addition and shift operations, which increases the operation speed. The specific processing is described in detail in Embodiment 1 and is not repeated here.
Step H: The following operation is performed on the residual operation matrix: $c_{ij} = \min(r_{ij} + p_{ij},\ 255)$, to obtain the inverse transformation result matrix, where $c_{ij}$ is the data in the inverse transformation result matrix, $r_{ij}$ is the data in the residual operation matrix, $p_{ij}$ is the data in the predicted pixel matrix, $i$ and $j$ range from 0 to 7, and min takes the minimum value.
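Steps G and H together turn the vertical inverse transformation output into reconstructed pixel values, which the scalar sketch below illustrates. It follows the formulas as stated (clip, shift right by 7, add the prediction, limit to 255) with n = 8. The function names are hypothetical, and only the upper limit named in step H is applied.

```python
def clip3(lo, hi, v):
    """Median of the three values, that is, v limited to the range [lo, hi]."""
    return max(lo, min(hi, v))


def reconstruct(h, pred, n=8):
    """Step G (round and shift) followed by step H (add prediction, cap at 255)."""
    lo, hi = -(1 << (n + 7)), (1 << (n + 7)) - 1
    out = []
    for h_row, p_row in zip(h, pred):
        r_row = [clip3(lo, hi, v + 64) >> 7 for v in h_row]           # step G
        out.append([min(r + p, 255) for r, p in zip(r_row, p_row)])   # step H
    return out


pixels = reconstruct([[0, 128, 12800, -128]], [[10, 10, 200, 10]])
```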
It should be noted that, in carrying out the steps of Embodiment 2, the matrix data obtained when each step finishes is stored in the eight registers. Because this occupies all of the registers, the next step first needs to export the data in one of the registers to memory, freeing a register for the computation.
Performing the inverse transformation with the above steps in Embodiment 2 has the following advantages. Data is loaded into the registers in parallel, so several values are processed at once, which greatly improves operation efficiency. In addition, in the butterfly operations of the horizontal inverse transformation part and the vertical inverse transformation part, two values that must be both added and subtracted are arranged so that the addition and the subtraction complete within the two corresponding registers, with no data moved to or from memory, which increases the operation speed. For data that must be multiplied, the multiplication is converted into addition and shift operations, which also increases the operation speed.
The schemes of Embodiment 1 and Embodiment 2 were each compared with the prior art. The test conditions were an Intel Core i7 CPU with 4 GB of memory, with Intel VTune run for 60 seconds, giving the following table data:
The data in the table shows that the method of the present invention effectively improves encoding and decoding speed. The transformation operation of the present invention takes only about 30% of what the prior-art transformation operation takes, and the inverse transformation operation takes only about 20% of what the prior-art inverse transformation operation takes.
Those skilled in the art can make various other corresponding changes and modifications according to the technical solutions and concepts described above, and all such changes and modifications shall fall within the protection scope of the claims of the present invention.

Claims (8)

1. An AVS-based parallel transformation method, characterized by comprising the following steps:
Step A: Two registers read in, in parallel, eight current pixel values and eight predicted pixel values respectively; the low-order interleave instruction is used to interleave the current pixel values with the predicted pixel values to obtain a first interleave result, and to interleave the predicted pixel values with themselves to obtain a second interleave result; the second interleave result is subtracted from the first interleave result to obtain eight residual values, which are placed in one register, wherein the current pixel values and the predicted pixel values are 8-bit data and the residual values are 16-bit data;
Step B: Step A is repeated seven times to obtain 64 residual values, which form an 8*8 residual matrix, wherein the eight residual values obtained by each execution of step A form one row of the residual matrix, and the eight rows are stored in eight registers respectively;
Step C: The residual matrix is transposed by combining the low-order interleave instruction and the high-order interleave instruction to obtain the transposed residual matrix, and the data in the eight registers is replaced with the eight rows of the transposed residual matrix respectively;
Step D: The butterfly operation of the horizontal transformation part is performed on the data in the transposed residual matrix to obtain the horizontal transformation operation matrix, and the data in the eight registers is replaced with the eight rows of the horizontal transformation operation matrix respectively;
Step E: The horizontal transformation operation matrix is transposed by combining the low-order interleave instruction and the high-order interleave instruction to obtain the transposed horizontal transformation operation matrix, and the data in the eight registers is replaced with the eight rows of the transposed horizontal transformation operation matrix respectively;
Step F: The butterfly operation of the vertical transformation part is performed on the data in the transposed horizontal transformation operation matrix to obtain the vertical transformation operation matrix, and the data in the eight registers is replaced with the eight rows of the vertical transformation operation matrix respectively;
Step G: The following operation is performed on the vertical transformation operation matrix: $r_{ij} = (h_{ij} + 2^{4}) \gg 5$, to obtain the transformation result matrix, where $h_{ij}$ is the data in the vertical transformation operation matrix, $r_{ij}$ is the data in the transformation result matrix, and $i$ and $j$ range from 0 to 7.
2. The AVS-based parallel transformation method according to claim 1, characterized in that step F comprises the following sub-steps:
Step F0: The data in the transposed horizontal transformation operation matrix is converted from 16 bits to 32 bits; the first four values of each row of the transposed horizontal transformation operation matrix are sent to the corresponding registers, and the last four values of each row are temporarily stored in memory;
Step F1: The butterfly operation of the vertical transformation part is performed on the data held in the eight registers to obtain the left half of the vertical transformation operation matrix, which is then temporarily stored in memory;
Step F2: The data of the transposed horizontal transformation operation matrix that was temporarily stored in memory is sent from memory to the eight registers respectively;
Step F3: The butterfly operation of the vertical transformation part is performed on the data held in the eight registers to obtain the right half of the vertical transformation operation matrix, and the left half and right half are combined to obtain the vertical transformation operation matrix.
3. The AVS-based parallel transformation method according to claim 1, characterized in that, in the butterfly operation of the horizontal transformation part and the butterfly operation of the vertical transformation part, for two groups of data stored in respective registers that must be both added and subtracted, the following operation steps are performed: the two registers are denoted xmm0 and xmm7, the data in xmm0 is denoted x0 and the data in xmm7 is denoted x7; first, xmm0 and xmm7 are added and the sum is stored in xmm0; then xmm0 and xmm0 are added and the sum is stored in xmm7; finally, xmm0 is subtracted from xmm7 and the difference is stored in xmm7; after these operation steps, the result of x0+x7 is stored in xmm0 and the result of x0-x7 is stored in xmm7.
4. The AVS-based parallel transformation method according to claim 1, characterized in that, in the butterfly operation of the horizontal transformation part and the butterfly operation of the vertical transformation part, the data stored in a register that must be multiplied by 10 is denoted y, and the following operation steps are performed for the multiplication y*10: the register holding the data y is denoted xmm1; first, the data y is copied into another register, denoted xmm2; the data in xmm2 is then shifted left by two bits and added to the data in xmm1, and the sum is stored in xmm1; xmm1 is then added to itself; after these operation steps, the result of y*10 is stored in xmm1.
5. An AVS-based parallel inverse transformation method, characterized by comprising the following steps:
Step A: Eight registers are used to hold the eight rows of the inverse quantization matrix respectively, the data in the inverse quantization matrix being 16-bit data;
Step B: The inverse quantization matrix is transposed by combining the low-order interleave instruction and the high-order interleave instruction to obtain the transposed inverse quantization matrix, and the data in the eight registers is replaced with the eight rows of the transposed inverse quantization matrix respectively;
Step C: The butterfly operation of the horizontal inverse transformation part is performed on the data in the transposed inverse quantization matrix to obtain the horizontal inverse transformation operation matrix, and the data in the eight registers is replaced with the eight rows of the horizontal inverse transformation operation matrix respectively;
Step D: The following operation is performed on the data in the horizontal inverse transformation operation matrix: $h''_{ij} = \mathrm{clip3}(-2^{n+7},\ 2^{n+7}-1,\ (h'_{ij}+4)) \gg 3$, to obtain the horizontal inverse transformation intermediate matrix, where $h''_{ij}$ is the data in the horizontal inverse transformation intermediate matrix, $h'_{ij}$ is the data in the horizontal inverse transformation operation matrix, $n$ is 8, $i$ and $j$ range from 0 to 7, and clip3 takes the median of its three values; the data in the eight registers is replaced with the eight rows of the horizontal inverse transformation intermediate matrix respectively;
Step E: The horizontal inverse transformation intermediate matrix is transposed by combining the low-order interleave instruction and the high-order interleave instruction to obtain the transposed horizontal inverse transformation intermediate matrix, and the data in the eight registers is replaced with the eight rows of the transposed horizontal inverse transformation intermediate matrix respectively;
Step F: The butterfly operation of the vertical inverse transformation part is performed on the data in the transposed horizontal inverse transformation intermediate matrix to obtain the vertical inverse transformation operation matrix, and the data in the eight registers is replaced with the eight rows of the vertical inverse transformation operation matrix respectively;
Step G: The following operation is performed on the vertical inverse transformation operation matrix: $r_{ij} = \mathrm{clip3}(-2^{n+7},\ 2^{n+7}-1,\ (h_{ij}+64)) \gg 7$, to obtain the residual operation matrix, where $r_{ij}$ is the data in the residual operation matrix, $h_{ij}$ is the data in the vertical inverse transformation operation matrix, $n$ is 8, $i$ and $j$ range from 0 to 7, and clip3 takes the median of its three values;
Step H: The following operation is performed on the residual operation matrix: $c_{ij} = \min(r_{ij} + p_{ij},\ 255)$, to obtain the inverse transformation result matrix, where $c_{ij}$ is the data in the inverse transformation result matrix, $r_{ij}$ is the data in the residual operation matrix, $p_{ij}$ is the data in the predicted pixel matrix, $i$ and $j$ range from 0 to 7, and min takes the minimum value.
6. The AVS-based parallel inverse transformation method according to claim 5, characterized in that step F comprises the following sub-steps:
Step F0: The data in the transposed horizontal inverse transformation intermediate matrix is converted from 16 bits to 32 bits; the first four values of each row of the transposed horizontal inverse transformation intermediate matrix are sent to the corresponding registers, and the last four values of each row are temporarily stored in memory;
Step F1: The butterfly operation of the vertical inverse transformation part is performed on the data held in the eight registers to obtain the left half of the vertical inverse transformation operation matrix, which is temporarily stored in memory;
Step F2: The data of the transposed horizontal inverse transformation intermediate matrix that was temporarily stored in memory is sent from memory to the eight registers respectively;
Step F3: The butterfly operation of the vertical inverse transformation part is performed on the data held in the eight registers to obtain the right half of the vertical inverse transformation operation matrix, and the left half and right half are combined to obtain the vertical inverse transformation operation matrix.
7. The AVS-based parallel inverse transformation method according to claim 5, characterized in that, in the butterfly operation of the horizontal inverse transformation part and the butterfly operation of the vertical inverse transformation part, for two groups of data stored in respective registers that must be both added and subtracted, the following operation steps are performed: the two registers are denoted xmm0 and xmm7, the data in xmm0 is denoted x0 and the data in xmm7 is denoted x7; first, xmm0 and xmm7 are added and the sum is stored in xmm0; then xmm0 and xmm0 are added and the sum is stored in xmm7; finally, xmm0 is subtracted from xmm7 and the difference is stored in xmm7; after these operation steps, the result of x0+x7 is stored in xmm0 and the result of x0-x7 is stored in xmm7.
8. The AVS-based parallel inverse transformation method according to claim 5, characterized in that, in the butterfly operation of the horizontal inverse transformation part and the butterfly operation of the vertical inverse transformation part, the data stored in a register that must be multiplied by 10 is denoted y, and the following operation steps are performed for the multiplication y*10: the register holding the data y is denoted xmm1; first, the data y is copied into another register, denoted xmm2; the data in xmm2 is then shifted left by two bits and added to the data in xmm1, and the sum is stored in xmm1; xmm1 is then added to itself; after these operation steps, the result of y*10 is stored in xmm1.
CN201510076289.XA 2015-02-11 2015-02-11 Parallel transformation and inverse transform method based on AVS Active CN104683817B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510076289.XA CN104683817B (en) 2015-02-11 2015-02-11 Parallel transformation and inverse transform method based on AVS

Publications (2)

Publication Number Publication Date
CN104683817A true CN104683817A (en) 2015-06-03
CN104683817B CN104683817B (en) 2017-12-15

Family

ID=53318297

Country Status (1)

Country Link
CN (1) CN104683817B (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant