CN103716644A - H264 multi-granularity parallel handling method - Google Patents

H264 multi-granularity parallel handling method

Info

Publication number
CN103716644A
Authority
CN
China
Prior art keywords
frame
parallel
data
level
queue
Prior art date
Legal status
Pending
Application number
CN201310645144.8A
Other languages
Chinese (zh)
Inventor
钱荣华
Current Assignee
NANJING COREWISE SMART TECHNOLOGY Inc
Original Assignee
NANJING COREWISE SMART TECHNOLOGY Inc
Priority date
Filing date
Publication date
Application filed by NANJING COREWISE SMART TECHNOLOGY Inc filed Critical NANJING COREWISE SMART TECHNOLOGY Inc
Priority to CN201310645144.8A priority Critical patent/CN103716644A/en
Publication of CN103716644A publication Critical patent/CN103716644A/en
Pending legal-status Critical Current


Landscapes

  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention relates to an H264 multi-granularity parallel processing method. In the H264 coding hierarchy, parallel granularity is divided, from large to small, into frame level, slice level, and data level. The H264 multi-granularity parallelism comprises three kinds: frame-level parallelism, slice-level parallelism, and data-level parallelism. The processing method uses the ARM instruction set to carry out data operations in parallel as vector operations. Multi-granularity parallelism gives the program better locality, which increases the Cache hit rate and the execution efficiency of the CPU.

Description

A multi-granularity parallel processing method for H264
Technical field
The present invention relates to an encoding and decoding method for 3G video transmission, and in particular to a multi-granularity parallel processing method for H264.
Background technology
Video monitoring systems, as an important component of smart security and smart traffic in Internet of Things applications for integrated urban public safety management, have broad application prospects. As cities develop, the bottlenecks of traditional video monitoring systems have become apparent. The core technology for solving this problem is to transform traditional video monitoring systems with video structured description technology, forming a new generation of video monitoring: an intelligent, semantic, informatized semantic video monitoring system. The core of video monitoring technology is video encoding and decoding; for better compatibility with different hardware platforms, software decoding needs to work together with hardware decoding to complete the image encoding and decoding. Current 3G mobile video monitoring systems developed on the Android platform commonly suffer from unclear image quality and long display lag. The basic reason is that most soft codec algorithms were developed for traditional DSP platforms or embedded Linux platforms, the optimization of these algorithms was also carried out on those platforms, and algorithm optimization for the Android system is still incomplete, so optimization for the ARM platform of the Android system is needed.
At present, software decoding on the Android platform can only reach a limit of about 10 frames per second at 640*480 resolution; to achieve a smoother user experience, the encoding and decoding speed needs to be raised above 15 frames per second.
Summary of the invention
Technical problem to be solved: in view of the above problems, the present invention proposes a technical scheme in which soft encoding/decoding and hard encoding/decoding are used in combination; multi-granularity parallelism gives the program better locality, so as to improve the Cache hit rate and the execution efficiency of the CPU.
Technical scheme: in order to overcome the above problems, the present invention provides a multi-granularity parallel processing method for H264. In the H264 coding hierarchy, parallel granularity can be divided, from large to small, into frame level, slice level, and data level. The method is characterized in that the H264 multi-granularity parallelism comprises three kinds, namely frame-level parallelism, slice-level parallelism, and data-level parallelism; the processing method uses the ARM instruction set to carry out data operations in parallel as vector operations, obtaining a considerable speed-up ratio. The details are as follows:
1.1 Data-level parallelism
The data-level parallelism of H264 multi-granularity parallel encoding mainly comprises the following two aspects:
1.1.1 Data-level parallelism based on multimedia extensions
Data-level parallelism based on multimedia extensions uses technologies such as MMX/SSE2 and AltiVec, i.e., multimedia instruction sets embedded in multi-core processors; such a multimedia instruction set performs vector operations on the data and thereby achieves the effect of parallel processing. The multimedia instruction set is a SIMD extension. SIMD means that, when a vector operation is executed, one instruction can operate simultaneously on a vector composed of multiple data elements.
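A minimal sketch of this kind of data-level parallelism is given below using ARM NEON intrinsics, since the method targets the ARM instruction set (MMX/SSE2 and AltiVec are the x86 and PowerPC counterparts). The routine computes a 16*16 residual block (source minus prediction) eight samples per instruction; the function name, block geometry, and strides are illustrative assumptions rather than details from the patent.

    /* Assumed sketch: 16*16 residual, 8 lanes per NEON instruction. */
    #include <arm_neon.h>
    #include <stdint.h>

    void residual_16x16_neon(const uint8_t *src, const uint8_t *pred,
                             int16_t *res, int stride)
    {
        for (int y = 0; y < 16; y++) {
            for (int x = 0; x < 16; x += 8) {
                uint8x8_t s = vld1_u8(src + x);   /* 8 source samples    */
                uint8x8_t p = vld1_u8(pred + x);  /* 8 predicted samples */
                /* widen and subtract: 8 lanes processed in parallel */
                int16x8_t d = vreinterpretq_s16_u16(vsubl_u8(s, p));
                vst1q_s16(res + x, d);
            }
            src  += stride;
            pred += stride;
            res  += 16;
        }
    }
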
1.1.2 Data-level parallelism of the DCT transform on a heterogeneous multi-core platform
This data-level parallelism exploits the characteristics of a heterogeneous multi-core platform by assigning the computation-intensive modules to the slave cores: a large macroblock is divided into sub-blocks, the divided sub-blocks are DCT-transformed on the slave cores, and the DCT results are finally merged back into the result for the large macroblock. A macroblock of size 16*16 is divided into four sub-blocks of size 8*8; a thread is created on a slave core for the DCT of each of the four sub-blocks, and after the DCT of each sub-block is complete, the sub-block results are transferred back so that the main core completes the DCT of the whole 16*16 macroblock.
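The following sketch illustrates the macroblock split described above, assuming POSIX threads as a stand-in for the slave cores: the four 8*8 sub-blocks of a 16*16 macroblock are transformed on separate threads and the results are gathered by the caller, playing the role of the main core. dct8x8() is a placeholder for whatever 8*8 transform the encoder actually uses, and all names are assumptions.

    #include <pthread.h>
    #include <stdint.h>

    void dct8x8(const int16_t *in, int in_stride, int16_t *out); /* assumed helper */

    typedef struct {
        const int16_t *src;   /* top-left sample of the 8*8 sub-block */
        int stride;           /* row stride of the 16*16 macroblock   */
        int16_t coeff[64];    /* transform output of this sub-block   */
    } dct_job;

    static void *dct_worker(void *arg)
    {
        dct_job *job = (dct_job *)arg;
        dct8x8(job->src, job->stride, job->coeff);
        return NULL;
    }

    void transform_mb16x16(const int16_t *mb, int stride, int16_t coeff[4][64])
    {
        pthread_t tid[4];
        dct_job job[4];
        /* sub-block origins: (0,0), (0,8), (8,0), (8,8) inside the macroblock */
        for (int i = 0; i < 4; i++) {
            job[i].src    = mb + (i / 2) * 8 * stride + (i % 2) * 8;
            job[i].stride = stride;
            pthread_create(&tid[i], NULL, dct_worker, &job[i]);
        }
        for (int i = 0; i < 4; i++) {        /* merge on the main core */
            pthread_join(tid[i], NULL);
            for (int k = 0; k < 64; k++)
                coeff[i][k] = job[i].coeff[k];
        }
    }
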
1.2 Slice-level parallelism
Slice-level parallelism divides one frame of image data evenly into a plurality of data blocks (slices); a thread is then created for each slice so that the slices are encoded in parallel, a slice header is added during encoding, and after encoding is finished the slice data are recombined in order into one frame of image data.
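A sketch of this slice-level scheme, again assuming POSIX threads and a fixed slice count: the frame is split into equal bands of macroblock rows, each band is encoded on its own thread, and the slice bitstreams are concatenated in slice order afterwards. The slice_ctx type and encode_slice() helper are illustrative placeholders, not part of the patent.

    #include <pthread.h>

    #define NUM_SLICES 4

    typedef struct {
        const void *frame;          /* illustrative frame handle     */
        int first_mb_row, last_mb_row;
        unsigned char *bitstream;   /* output buffer for this slice  */
        int bytes;                  /* bytes written for this slice  */
    } slice_ctx;

    int encode_slice(slice_ctx *ctx);  /* assumed: writes slice header + data */

    static void *slice_worker(void *arg)
    {
        encode_slice((slice_ctx *)arg);
        return NULL;
    }

    void encode_frame_sliced(const void *frame, int mb_rows,
                             slice_ctx slices[NUM_SLICES])
    {
        pthread_t tid[NUM_SLICES];
        int rows_per_slice = mb_rows / NUM_SLICES;
        for (int i = 0; i < NUM_SLICES; i++) {
            slices[i].frame        = frame;
            slices[i].first_mb_row = i * rows_per_slice;
            slices[i].last_mb_row  = (i == NUM_SLICES - 1)
                                     ? mb_rows - 1
                                     : (i + 1) * rows_per_slice - 1;
            pthread_create(&tid[i], NULL, slice_worker, &slices[i]);
        }
        for (int i = 0; i < NUM_SLICES; i++)  /* wait, then reassemble in slice order */
            pthread_join(tid[i], NULL);
    }
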
1.3 Frame-level parallelism
In H264 coding, an I frame is an intra-coded frame and needs no other frame as reference; a P frame needs a preceding I frame as reference; a B frame is a bidirectionally predicted frame and needs preceding and following I or P frames as its reference frames.
1.3.1 In the frame-level parallel algorithm design, the method for synchronizing the parallel encoding of I/P frames and B frames is to store I/P frames and B frames separately.
Separate storage means that, after the frame type of each image in the input sequence has been determined, an I/P frame is enqueued into the I/P-frame queue in coding order, and otherwise the frame is enqueued into the B-frame queue. Because the boundary of the parallel granularity is an I or P frame, the parallel analysis starts from the I/P-frame encoding thread. Specifically:
If the I/P-frame queue is not empty, one I/P frame is taken from the queue; the B frames that can be encoded in parallel with this I/P frame are then taken from the B-frame queue according to the coding dependency analysis, and multiple threads are created to carry out frame-level parallel processing.
When a P frame is taken from the I/P-frame queue, all B frames that can be encoded in parallel before this P frame, including B frames whose encoding lags behind the current P frame, are taken out according to the frame-number correspondence, which controls the synchronization of the parallel encoding; the correspondence is given by the formula:
f(P) - f(B) >= T + 2
where f(P) is the frame number of the P frame being encoded, f(B) is the frame number of the B frame being encoded, and T is the B-frame parameter.
When an I frame is taken from the I/P-frame queue, the reference frames of all picture frames before this I frame already exist; therefore, before this I frame is encoded, all encodable frames in the B-frame queue are first taken out and encoding threads are created for parallel encoding, and only then is this I frame, the first frame of a new GOP, encoded.
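A sketch of the scheduling rule for the P-frame case follows. It assumes, for illustration only, that T is the number of consecutive B frames and that the B-frame queue is a simple linked list: when a P frame with number f(P) is dequeued, every queued B frame satisfying f(P) - f(B) >= T + 2 is handed to an encoding thread. The queue type and start_encode_thread() are assumptions.

    #include <stdbool.h>

    typedef struct frame_job {
        int frame_num;
        struct frame_job *next;
    } frame_job;

    void start_encode_thread(frame_job *job);  /* assumed helper */

    static bool b_frame_ready(int fp, int fb, int t)
    {
        return fp - fb >= t + 2;   /* the relation f(P) - f(B) >= T + 2 */
    }

    /* Called when a P frame numbered fp is taken from the I/P-frame queue. */
    void schedule_b_frames(int fp, frame_job **b_queue, int t)
    {
        frame_job **cur = b_queue;
        while (*cur) {
            if (b_frame_ready(fp, (*cur)->frame_num, t)) {
                frame_job *job = *cur;
                *cur = job->next;          /* remove from the B-frame queue       */
                start_encode_thread(job);  /* encode in parallel with the P frame */
            } else {
                cur = &(*cur)->next;
            }
        }
    }
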
1.3.2 In the frame-level parallel algorithm design, the method for writing the encoded bitstream back in coding frame-number order is to define a dedicated network abstraction layer (NAL) queue; the algorithm is designed as follows:
When encoding starts, a fixed-length network abstraction layer data queue is created. A data unit in this NAL queue has one of three states: A, the unit contains neither a frame number nor a bitstream; B, the unit contains only a frame number; C, the unit contains a frame number and the complete bitstream.
The initial state of every unit in the NAL queue is A. Each time an image has its frame type determined and is enqueued, an enqueue operation is also performed on the NAL queue; this enqueue operation writes only the frame-number information, which later serves as the basis for writing the actual bitstream data, and the state of that data unit becomes B. When a given frame has been encoded, the data unit whose frame number matches the frame just encoded is located in the NAL queue and the encoded bitstream is written into it; the state of this data unit then becomes C. Because images are read in and their frame types determined in frame coding order, the bitstreams finally written into the NAL queue are also in order; the input/output thread then writes the data to the disk file in that order, after which the state of the data unit changes from C back to A, returning to the initial state, and the cycle repeats. In this way the encoding and decoding speed is improved.
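The fixed-length NAL queue described above can be sketched as a small state machine: A is an empty slot, B has the frame number reserved, and C has the bitstream attached and ready to flush. The queue length, field names, and the absence of locking are illustrative simplifications, not details given in the patent.

    #include <stdint.h>
    #include <stdio.h>

    #define NAL_QUEUE_LEN 16
    enum nal_state { NAL_A = 0, NAL_B, NAL_C };

    typedef struct {
        enum nal_state state;
        int frame_num;
        uint8_t *payload;
        int payload_len;
    } nal_slot;

    static nal_slot nal_queue[NAL_QUEUE_LEN];
    static int nal_tail;   /* next slot to reserve (A -> B) */
    static int nal_head;   /* next slot to flush   (C -> A) */

    /* Called right after the frame-type decision: reserve a slot in coding order. */
    void nal_enqueue_frame_num(int frame_num)
    {
        nal_slot *s = &nal_queue[nal_tail];
        s->state = NAL_B;
        s->frame_num = frame_num;
        nal_tail = (nal_tail + 1) % NAL_QUEUE_LEN;
    }

    /* Called when a frame finishes encoding: attach its bitstream (B -> C). */
    void nal_attach_bitstream(int frame_num, uint8_t *buf, int len)
    {
        for (int i = 0; i < NAL_QUEUE_LEN; i++) {
            nal_slot *s = &nal_queue[i];
            if (s->state == NAL_B && s->frame_num == frame_num) {
                s->payload = buf;
                s->payload_len = len;
                s->state = NAL_C;
                return;
            }
        }
    }

    /* Input/output thread: write ready slots in queue order, then recycle (C -> A). */
    void nal_flush(FILE *out)
    {
        while (nal_queue[nal_head].state == NAL_C) {
            nal_slot *s = &nal_queue[nal_head];
            fwrite(s->payload, 1, s->payload_len, out);
            s->state = NAL_A;
            nal_head = (nal_head + 1) % NAL_QUEUE_LEN;
        }
    }
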
Beneficial effects:
The present invention uses multi-granularity parallel programming technology and performs dependency analysis, giving the program better locality so as to improve the Cache hit rate and the execution efficiency of the CPU, thereby improving the encoding and decoding speed.
Embodiment
The present invention is described in further detail below.
Embodiment:
A multi-granularity parallel processing method for H264: in the H264 coding hierarchy, parallel granularity can be divided, from large to small, into frame level, slice level, and data level. The H264 multi-granularity parallelism comprises three kinds, namely frame-level parallelism, slice-level parallelism, and data-level parallelism; the processing method uses the ARM instruction set to carry out data operations in parallel as vector operations. The details are as follows:
1.1 Data-level parallelism
The data-level parallelism of H264 multi-granularity parallel encoding mainly comprises the following two aspects:
1.1.1 Data-level parallelism based on multimedia extensions
Data-level parallelism based on multimedia extensions uses technologies such as MMX/SSE2 and AltiVec, i.e., multimedia instruction sets embedded in multi-core processors; such a multimedia instruction set performs vector operations on the data and thereby achieves the effect of parallel processing. The multimedia instruction set is a SIMD extension. SIMD means that, when a vector operation is executed, one instruction can operate simultaneously on a vector composed of multiple data elements.
1.1.2 Data-level parallelism of the DCT transform on a heterogeneous multi-core platform
This data-level parallelism exploits the characteristics of a heterogeneous multi-core platform by assigning the computation-intensive modules to the slave cores: a large macroblock is divided into sub-blocks, the divided sub-blocks are DCT-transformed on the slave cores, and the DCT results are finally merged back into the result for the large macroblock. A macroblock of size 16*16 is divided into four sub-blocks of size 8*8; a thread is created on a slave core for the DCT of each of the four sub-blocks, and after the DCT of each sub-block is complete, the sub-block results are transferred back so that the main core completes the DCT of the whole 16*16 macroblock.
1.2 Slice-level parallelism
Slice-level parallelism divides one frame of image data evenly into a plurality of data blocks (slices); a thread is then created for each slice so that the slices are encoded in parallel, a slice header is added during encoding, and after encoding is finished the slice data are recombined in order into one frame of image data.
1.3 Frame-level parallelism
In H264 coding, an I frame is an intra-coded frame and needs no other frame as reference; a P frame needs a preceding I frame as reference; a B frame is a bidirectionally predicted frame and needs preceding and following I or P frames as its reference frames.
1.3.1 In the frame-level parallel algorithm design, the method for synchronizing the parallel encoding of I/P frames and B frames is to store I/P frames and B frames separately.
Separate storage means that, after the frame type of each image in the input sequence has been determined, an I/P frame is enqueued into the I/P-frame queue in coding order, and otherwise the frame is enqueued into the B-frame queue. Because the boundary of the parallel granularity is an I or P frame, the parallel analysis starts from the I/P-frame encoding thread. Specifically:
If the I/P-frame queue is not empty, one I/P frame is taken from the queue; the B frames that can be encoded in parallel with this I/P frame are then taken from the B-frame queue according to the coding dependency analysis, and multiple threads are created to carry out frame-level parallel processing.
When a P frame is taken from the I/P-frame queue, all B frames that can be encoded in parallel before this P frame, including B frames whose encoding lags behind the current P frame, are taken out according to the frame-number correspondence, which controls the synchronization of the parallel encoding; the correspondence is given by the formula:
f(P) - f(B) >= T + 2
where f(P) is the frame number of the P frame being encoded, f(B) is the frame number of the B frame being encoded, and T is the B-frame parameter.
When an I frame is taken from the I/P-frame queue, the reference frames of all picture frames before this I frame already exist; therefore, before this I frame is encoded, all encodable frames in the B-frame queue are first taken out and encoding threads are created for parallel encoding, and only then is this I frame, the first frame of a new GOP, encoded.
1.3.2 In the frame-level parallel algorithm design, the method for writing the encoded bitstream back in coding frame-number order is to define a dedicated network abstraction layer (NAL) queue; the algorithm is designed as follows:
When encoding starts, a fixed-length network abstraction layer data queue is created. A data unit in this NAL queue has one of three states: A, the unit contains neither a frame number nor a bitstream; B, the unit contains only a frame number; C, the unit contains a frame number and the complete bitstream.
The initial state of every unit in the NAL queue is A. Each time an image has its frame type determined and is enqueued, an enqueue operation is also performed on the NAL queue; this enqueue operation writes only the frame-number information, which later serves as the basis for writing the actual bitstream data, and the state of that data unit becomes B. When a given frame has been encoded, the data unit whose frame number matches the frame just encoded is located in the NAL queue and the encoded bitstream is written into it; the state of this data unit then becomes C. Because images are read in and their frame types determined in frame coding order, the bitstreams finally written into the NAL queue are also in order; the input/output thread then writes the data to the disk file in that order, after which the state of the data unit changes from C back to A, returning to the initial state, and the cycle repeats.

Claims (1)

1. A multi-granularity parallel processing method for H264, wherein in the H264 coding hierarchy parallel granularity can be divided, from large to small, into frame level, slice level, and data level, characterized in that: the H264 multi-granularity parallelism comprises three kinds, namely frame-level parallelism, slice-level parallelism, and data-level parallelism; the processing method uses the ARM instruction set to carry out data operations in parallel as vector operations; specifically as follows:
1.1 Data-level parallelism
The data-level parallelism of H264 multi-granularity parallel encoding mainly comprises the following two aspects:
1.1.1 Data-level parallelism based on multimedia extensions
Data-level parallelism based on multimedia extensions uses technologies such as MMX/SSE2 and AltiVec, i.e., multimedia instruction sets embedded in multi-core processors; such a multimedia instruction set performs vector operations on the data and thereby achieves the effect of parallel processing. The multimedia instruction set is a SIMD extension. SIMD means that, when a vector operation is executed, one instruction can operate simultaneously on a vector composed of multiple data elements.
1.1.2 Data-level parallelism of the DCT transform on a heterogeneous multi-core platform
This data-level parallelism exploits the characteristics of a heterogeneous multi-core platform by assigning the computation-intensive modules to the slave cores: a large macroblock is divided into sub-blocks, the divided sub-blocks are DCT-transformed on the slave cores, and the DCT results are finally merged back into the result for the large macroblock. A macroblock of size 16*16 is divided into four sub-blocks of size 8*8; a thread is created on a slave core for the DCT of each of the four sub-blocks, and after the DCT of each sub-block is complete, the sub-block results are transferred back so that the main core completes the DCT of the whole 16*16 macroblock.
1.2 Slice-level parallelism
Slice-level parallelism divides one frame of image data evenly into a plurality of data blocks (slices); a thread is then created for each slice so that the slices are encoded in parallel, a slice header is added during encoding, and after encoding is finished the slice data are recombined in order into one frame of image data;
1.3 Frame-level parallelism
In H264 coding, an I frame is an intra-coded frame and needs no other frame as reference; a P frame needs a preceding I frame as reference; a B frame is a bidirectionally predicted frame and needs preceding and following I or P frames as its reference frames.
1.3.1 In the frame-level parallel algorithm design, the method for synchronizing the parallel encoding of I/P frames and B frames is to store I/P frames and B frames separately.
Separate storage means that, after the frame type of each image in the input sequence has been determined, an I/P frame is enqueued into the I/P-frame queue in coding order, and otherwise the frame is enqueued into the B-frame queue. Because the boundary of the parallel granularity is an I or P frame, the parallel analysis starts from the I/P-frame encoding thread. Specifically:
If the I/P-frame queue is not empty, one I/P frame is taken from the queue; the B frames that can be encoded in parallel with this I/P frame are then taken from the B-frame queue according to the coding dependency analysis, and multiple threads are created to carry out frame-level parallel processing.
When a P frame is taken from the I/P-frame queue, all B frames that can be encoded in parallel before this P frame, including B frames whose encoding lags behind the current P frame, are taken out according to the frame-number correspondence, which controls the synchronization of the parallel encoding; the correspondence is given by the formula:
f(P) - f(B) >= T + 2
where f(P) is the frame number of the P frame being encoded, f(B) is the frame number of the B frame being encoded, and T is the B-frame parameter.
When an I frame is taken from the I/P-frame queue, the reference frames of all picture frames before this I frame already exist; therefore, before this I frame is encoded, all encodable frames in the B-frame queue are first taken out and encoding threads are created for parallel encoding, and only then is this I frame, the first frame of a new GOP, encoded.
1.3.2 In the frame-level parallel algorithm design, the method for writing the encoded bitstream back in coding frame-number order is to define a dedicated network abstraction layer (NAL) queue; the algorithm is designed as follows:
When encoding starts, a fixed-length network abstraction layer data queue is created. A data unit in this NAL queue has one of three states: A, the unit contains neither a frame number nor a bitstream; B, the unit contains only a frame number; C, the unit contains a frame number and the complete bitstream.
The initial state of every unit in the NAL queue is A. Each time an image has its frame type determined and is enqueued, an enqueue operation is also performed on the NAL queue; this enqueue operation writes only the frame-number information, which later serves as the basis for writing the actual bitstream data, and the state of that data unit becomes B. When a given frame has been encoded, the data unit whose frame number matches the frame just encoded is located in the NAL queue and the encoded bitstream is written into it; the state of this data unit then becomes C. Because images are read in and their frame types determined in frame coding order, the bitstreams finally written into the NAL queue are also in order; the input/output thread then writes the data to the disk file in that order, after which the state of the data unit changes from C back to A, returning to the initial state, and the cycle repeats.
CN201310645144.8A 2013-12-05 2013-12-05 H264 multi-granularity parallel handling method Pending CN103716644A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310645144.8A CN103716644A (en) 2013-12-05 2013-12-05 H264 multi-granularity parallel handling method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310645144.8A CN103716644A (en) 2013-12-05 2013-12-05 H264 multi-granularity parallel handling method

Publications (1)

Publication Number Publication Date
CN103716644A true CN103716644A (en) 2014-04-09

Family

ID=50409146

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310645144.8A Pending CN103716644A (en) 2013-12-05 2013-12-05 H264 multi-granularity parallel handling method

Country Status (1)

Country Link
CN (1) CN103716644A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106534922A (en) * 2016-11-29 2017-03-22 努比亚技术有限公司 Video decoding device and method
CN107547896A (en) * 2016-06-27 2018-01-05 杭州当虹科技有限公司 A kind of ProRes VLC codings based on CUDA
CN110832875A (en) * 2018-07-23 2020-02-21 深圳市大疆创新科技有限公司 Video processing method, terminal device and machine-readable storage medium
WO2020078253A1 (en) * 2018-10-15 2020-04-23 华为技术有限公司 Transform and inverse transform methods and devices for image block
CN111131836A (en) * 2019-12-13 2020-05-08 苏州羿景睿图信息科技有限公司 JPEG2000 encoding parallel operation method based on FPGA
CN111541941A (en) * 2020-05-07 2020-08-14 杭州趣维科技有限公司 Method for accelerating coding of multiple encoders at mobile terminal
CN113596556A (en) * 2021-07-02 2021-11-02 咪咕互动娱乐有限公司 Video transmission method, server and storage medium
CN117934532A (en) * 2024-03-22 2024-04-26 西南石油大学 Parallel optimization method and system for image edge detection

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107547896A (en) * 2016-06-27 2018-01-05 杭州当虹科技有限公司 A kind of ProRes VLC codings based on CUDA
CN107547896B (en) * 2016-06-27 2020-10-09 杭州当虹科技股份有限公司 Cura-based Prores VLC coding method
CN106534922A (en) * 2016-11-29 2017-03-22 努比亚技术有限公司 Video decoding device and method
CN110832875A (en) * 2018-07-23 2020-02-21 深圳市大疆创新科技有限公司 Video processing method, terminal device and machine-readable storage medium
WO2020078253A1 (en) * 2018-10-15 2020-04-23 华为技术有限公司 Transform and inverse transform methods and devices for image block
CN111131836A (en) * 2019-12-13 2020-05-08 苏州羿景睿图信息科技有限公司 JPEG2000 encoding parallel operation method based on FPGA
CN111541941A (en) * 2020-05-07 2020-08-14 杭州趣维科技有限公司 Method for accelerating coding of multiple encoders at mobile terminal
CN111541941B (en) * 2020-05-07 2021-10-29 杭州小影创新科技股份有限公司 Method for accelerating coding of multiple encoders at mobile terminal
CN113596556A (en) * 2021-07-02 2021-11-02 咪咕互动娱乐有限公司 Video transmission method, server and storage medium
CN117934532A (en) * 2024-03-22 2024-04-26 西南石油大学 Parallel optimization method and system for image edge detection
CN117934532B (en) * 2024-03-22 2024-06-04 西南石油大学 Parallel optimization method and system for image edge detection

Similar Documents

Publication Publication Date Title
CN103716644A (en) H264 multi-granularity parallel handling method
CN108206937B (en) Method and device for improving intelligent analysis performance
CN101710986B (en) H.264 parallel decoding method and system based on isostructural multicore processor
CN105992008A (en) Multilevel multitask parallel decoding algorithm on multicore processor platform
CN102547289B (en) Fast motion estimation method realized based on GPU (Graphics Processing Unit) parallel
CN102098503A (en) Method and device for decoding image in parallel by multi-core processor
CN105306945A (en) Scalable synopsis coding method and device for monitor video
CN104604235A (en) Transmitting apparatus and method thereof for video processing
CN103188521A (en) Method and device for transcoding distribution, method and device for transcoding
CN104539972A (en) Method and device for controlling video parallel decoding in multi-core processor
CN103297777A (en) Method and device for increasing video encoding speed
MX2021002489A (en) Method and device for bidirectional inter frame prediction.
US20190279330A1 (en) Watermark embedding method and apparatus
Wang et al. A collaborative scheduling-based parallel solution for HEVC encoding on multicore platforms
CN105100803A (en) Video decoding optimization method
Wang et al. Parallel H. 264/AVC motion compensation for GPUs using OpenCL
Ge et al. Efficient multithreading implementation of H. 264 encoder on Intel hyper-threading architectures
CN109413432B (en) Multi-process coding method, system and device based on event and shared memory mechanism
CN101902643B (en) Very large-scale integration (VLSI) structural design method of parallel array-type intraframe prediction decoder
CN101466037A (en) Method for implementing video decoder combining software and hardware
CN103327340A (en) Method and device for searching integer
CN104956677A (en) Combined parallel and pipelined video encoder
Gong et al. Cooperative DVFS for energy-efficient HEVC decoding on embedded CPU-GPU architecture
RU2014119878A (en) VIDEO ENCODING METHOD WITH motion prediction DEVICE WITH VIDEO CODING motion prediction VIDEO ENCODING PROGRAM predictive MOTION VIDEO DECODING METHOD WITH motion prediction VIDEO DECODING DEVICE WITH MOTION PREDICTION AND DECODING VIDEO PROGRAM motion prediction C
Asif et al. Exploiting MB level parallelism in H. 264/AVC encoder for multi-core platform

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20140409