WO2004114666A1 - Constant stream compression processing method - Google Patents

Constant stream compression processing method

Info

Publication number
WO2004114666A1
WO2004114666A1 (PCT/CN2003/000486)
Authority
WO
WIPO (PCT)
Prior art keywords
video
stream
audio
sub
gop
Prior art date
Application number
PCT/CN2003/000486
Other languages
French (fr)
Chinese (zh)
Inventor
Sannan Yuan
Mei Xue
Qin Wang
Original Assignee
Shanghai Dracom Communication Technology Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Dracom Communication Technology Ltd. filed Critical Shanghai Dracom Communication Technology Ltd.
Priority to PCT/CN2003/000486 priority Critical patent/WO2004114666A1/en
Priority to AU2003248219A priority patent/AU2003248219A1/en
Priority to CN200410049156.5A priority patent/CN1638480A/en
Publication of WO2004114666A1 publication Critical patent/WO2004114666A1/en

Links

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/40 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using video transcoding, i.e. partial or full decoding of a coded input stream followed by re-encoding of the decoded output stream
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/60 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding
    • H04N19/61 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding in combination with predictive coding

Definitions

  • The invention relates to a data processing method for the MPEG-2 compression standard, in particular one that processes compressed video and audio data to a constant flow rate, making it suitable for real-time transmission and playback on a streaming-media network and allowing random access to the code stream.
  • MPEG: Motion Pictures Experts Group
  • ISO: International Organization for Standardization
  • DCT: Discrete Cosine Transform
  • Motion compensation reduces the temporal redundancy of the image.
  • Huffman coding reduces the image's redundancy in information (entropy).
  • Entropy: the degree of the image's informational redundancy
  • The MPEG-2 standard is similar to MPEG-1 but more adaptable, applying to all processes and links of broadcast television.
  • MPEG-1 is in fact a subset of MPEG-2, as can be seen in the MPEG-2 profile and level classification table.
  • The MPEG-2 standard is divided into four documents:
  • The systems layer (Systems, ISO/IEC 13818-1) describes the multiplexing of video and audio data and the method of video/audio synchronization.
  • The video compression layer (Video, ISO/IEC 13818-2) describes the digital video encoding method and decoding process.
  • The audio compression layer (Audio, ISO/IEC 13818-3) describes the digital audio encoding method and decoding process.
  • Conformance (ISO/IEC 13818-4) explains the process of testing a coded stream to verify compliance with the first three documents.
  • The MPEG-2 compression algorithm was designed as a universal video and audio compression standard, required to accommodate different application requirements and to allow control of the compressed output bit rate and image quality. To this end, it is divided into profiles and levels. Profiles define the chrominance spatial resolution and output bitstream control, while levels define the image resolution, the luminance sampling frequency, the number of video/audio layers a scalable profile can support, and the maximum bit rate of each profile at that level.
  • MPEG uses its syntax to define a hierarchical structure of six layers, from top to bottom: video sequence, group of pictures, picture, slice, macroblock, and block.
  • A video sequence is composed of groups of pictures, with a sequence header marking its beginning and a sequence end code marking its end. It is a random-access segment.
  • GOP: group of pictures, with two parameters, the length (N) and the reference-frame repetition frequency (M)
  • An image is an independent display unit and a basic coding unit.
  • images can be progressive or interlaced. This is different from MPEG1, which is always progressive.
  • a macroblock strip contains several consecutive macroblocks and is a unit of resynchronization.
  • the purpose of setting the macroblock strip is to prevent the spread of error codes. When an error occurs in a macroblock strip, it does not affect the subsequent decoding of the macroblock strip.
  • The picture's luminance array is divided into 16×16 macroblocks.
  • Macroblocks are the basic unit for motion compensation.
  • A macroblock contains four 8×8 luminance blocks.
  • Depending on the profile, a macroblock also contains two 8×8 chrominance blocks (one each for R-Y and B-Y, with 4:2:0 sampling) or four 8×8 chrominance blocks (two each for R-Y and B-Y, with 4:2:2 sampling).
  • A block is the unit on which the DCT is performed and contains only luminance or only chrominance.
  • MPEG is based on DCT, motion compensation, and Huffman coding, and accordingly uses both intra-frame and inter-frame compression. To achieve the maximum compression ratio in encoding, MPEG uses three types of pictures: I frames, P frames, and B frames.
  • I-frame: Intra-Frame
  • I-frames use intra-frame compression without motion compensation and provide a medium compression ratio. Because I-frames do not depend on other frames, they are the entry points for random access, and they are also the reference frames in decoding.
  • The P-frame (Predicted-Frame) is predicted from the preceding I-frame or P-frame and compressed with motion compensation, so its compression ratio is higher than an I-frame's and its data volume averages about 1/3 of an I-frame.
  • the P frame is a reference frame for decoding the preceding and succeeding B frames and the succeeding P frames. The P frame itself has errors. If the previous reference frame of the P frame is also a P frame, it will cause error propagation.
  • The B frame (Bidirectional-Frame) is reconstructed by interpolation, based on the two surrounding I/P or P/P frames. It uses bidirectional prediction, and its data volume averages about 1/9 of an I frame. A B frame itself is never used as a reference, so it provides a higher compression ratio without propagating errors.
  • a GOP consists of a series of I, B, and P frames, starting with an I frame.
  • The number of frames in a GOP is variable.
  • a large number of frames can provide a high compression ratio, but it will cause random access delay (must wait until the next I frame) and accumulation of errors (error propagation of P frames).
  • The structure of the GOP is not specified in MPEG-2; the frame repetition pattern can be IP, IB, IBP, IBBP, or even all I frames.
  • the repetition frequency of the reference frame is represented by M. Different frame repetition frequencies provide different output bit rates and affect the access delay.
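The bit-rate trade-off described above can be illustrated with the average frame-size ratios given in the text (a P frame about 1/3 of an I frame, a B frame about 1/9). The sketch below is illustrative only; the weights are rough averages, not values from the patent.

```python
# Rough relative frame sizes, in I-frame units, per the averages in the text.
FRAME_WEIGHT = {"I": 1.0, "P": 1.0 / 3.0, "B": 1.0 / 9.0}

def relative_gop_size(pattern: str) -> float:
    """Approximate size of a GOP, in I-frame units, for a frame pattern."""
    return sum(FRAME_WEIGHT[f] for f in pattern)

# A 15-frame GOP with IBBP repetition (M=3): 1 I, 4 P, and 10 B frames.
gop = "IBBPBBPBBPBBPBB"
print(round(relative_gop_size(gop), 2))   # 3.44
# An all-I GOP of the same length would be 15.0 units, roughly 4.4x more data.
```

This also shows why a larger M (more B frames between references) lowers the output bit rate while increasing access delay.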
  • M-JPEG and DV can provide frame-accurate random access.
  • If the compressed data stream of MPEG-2 is based on I/P or I/P/B frames, however, frame-accurate random access is not possible. This limitation comes from the motion-compensation algorithm, and here the pros and cons of the newer technology show themselves.
  • Within a GOP, decoding P frames and B frames depends on the I frame, so a video stream must be entered at an I frame. The consequences of this differ greatly across applications.
  • When a viewer switches channels, the delay while the digital video decoder box waits for the new channel's I frame is not a problem: with at least two I frames per second, the viewer does not notice the small delay.
  • For television-station operations, however, it is difficult to control the start point and length of a commercial insertion, and material search during non-linear editing is slow. Moreover, the bit rate of existing MPEG-2 streams such as DVD varies with picture content, which is ill-suited to real-time playback over a network: the decoder's VBV buffer can overflow or underflow, the picture shows mosaic, blocking, and stutter, and the decoder may even stop working.
  • the invention provides a method for processing video and audio data compressed by the MPEG-2 compression standard, which overcomes the above problems.
  • The technique of the invention makes the MPEG-2 video stream carry a constant number of picture frames per GOP and an absolutely constant code-stream length per GOP, guaranteeing a constant video bit rate.
  • the video stream and the audio stream are multiplexed to obtain a system stream.
  • The audio stream is already constant-rate, and the traffic of sub-pictures and the like is very small compared with the video stream, so adding only a small amount of redundancy to each GOP yields a constant-rate MPEG-2 program stream.
  • The invention provides a method for processing the video/audio data of an MPEG-2 program stream, comprising the following steps: (1) analyze the video object files to determine the parameters used in subsequent steps; (2) preprocess the relevant video object files; (3) re-encode the demultiplexed video stream into constant-rate video data; (4) multiplex the constant-rate video stream with the extracted audio/sub-picture packets, and apply the constant-rate process again to obtain the final data stream.
  • The invention achieves a fixed length between adjacent I frames and a fixed number of frames between adjacent I frames. This is equivalent to a fixed GOP code-stream length and a fixed number of frames in the GOP, which makes searching, positioning, editing, and other operations very easy. A further benefit is suitability for network-based real-time playback: the decoder's VBV buffer neither overflows nor underflows, and the picture shows no mosaic, blocking, or stutter.
  • FIG. 1 is a flowchart of an embodiment of the present invention.
  • FIG. 2 is a schematic diagram of a GOP frame arrangement in the pulldown situation of the present invention.
  • FIG. 3 is a schematic diagram of a GOP frame arrangement in the present invention without pulldown.
  • The three main parameters SCR, PTS, and DTS that appear below are time stamps placed at specific positions in the system stream; each is a small piece of data inserted by the encoder into the data stream.
  • SCR is the system clock reference, inserted at least every 0.7 seconds.
  • the decoder extracts the SCR from the data stream and sends the SCR to the image decoder and audio decoder to synchronize the internal clock with the system clock.
  • An image can be divided into many "display units", and the display unit of an image is a frame.
  • PTS indicates the display time of the display unit.
  • the decoder checks the PTS and compares it with the SCR, and displays the image accordingly to synchronize it with the system time.
  • Step 1 in Figure 1: first analyze the video target files to determine the parameters used in subsequent steps. These include: 1. whether the decoded output repeats one picture every two frames, that is, whether 3:2 pulldown applies; 3:2 pulldown, the repeated display of a picture every two frames after decoding, is an adjustment performed in system conversion because the frame rates of film (24 fps) and NTSC (30 fps) differ; 2. how many packets to cut (video target 1, video target 2, ..., keeping the last video target program; whether to make pure pulldown; cutting off black screens); 3. the video format, that is, PAL or NTSC; 4. the stream IDs of the required audio and subtitle streams; 5. the audio bit rate (kbps); 6. the frame rate (frames/second).
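For illustration, the parameters gathered in this analysis step could be held in a single structure. The field names and example values below are hypothetical, not taken from the patent.

```python
from dataclasses import dataclass

@dataclass
class AnalysisParams:
    pulldown: bool           # 3:2 pulldown needed (film 24 fps vs. NTSC 30 fps)?
    video_format: str        # "PAL" or "NTSC"
    audio_stream_id: int     # stream ID of the required audio stream
    subtitle_stream_id: int  # stream ID of the required subtitle stream
    audio_kbps: int          # audio bit rate in kbps
    frame_rate: float        # frames per second

# Example values for an NTSC source (illustrative only):
params = AnalysisParams(pulldown=True, video_format="NTSC",
                        audio_stream_id=0xBD, subtitle_stream_id=0x20,
                        audio_kbps=448, frame_rate=29.97)
print(params.video_format)  # NTSC
```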
  • Step 2 in Figure 1 to preprocess the relevant video object files, including the following:
  • The invention repeats the I-frame data in the first GOP twice to cover the following two B-frame data, so that no mosaic occurs.
  • the PTS of an audio packet refers to the display time of the audio frame header that first appears in the packet.
  • The original PTS values of the first audio display unit and the first video display unit are compared to obtain their difference; the PTS of the first video display unit is about 0.28 seconds. This difference is used to correct the PTS of the first audio display unit.
  • Since the display time of every audio frame is fixed (for example, the time interval of a Dolby AC-3 audio frame is 32 milliseconds), the PTS of a packet can easily be calculated from the number of the first audio frame header in the packet.
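Because every audio frame has a fixed duration (32 ms for Dolby AC-3, as stated), the PTS of an audio packet follows directly from the index of the first audio frame header it contains. A sketch, assuming PTS is expressed in the standard 90 kHz units of MPEG-2 system streams:

```python
AC3_FRAME_MS = 32        # fixed AC-3 audio frame duration, per the text
PTS_UNITS_HZ = 90_000    # MPEG-2 PTS/SCR resolution (90 kHz)

def audio_packet_pts(first_frame_index: int, base_pts: int = 0) -> int:
    """PTS (in 90 kHz units) of a packet whose first audio frame header is
    audio frame number `first_frame_index` (0-based) of the stream."""
    return base_pts + first_frame_index * AC3_FRAME_MS * PTS_UNITS_HZ // 1000

# A packet starting at audio frame 100: 100 * 32 ms = 3.2 s after the base PTS.
print(audio_packet_pts(100))  # 288000
```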
  • In the original code stream, the PTS of the video packets is not calculated strictly according to the video frame rate (such as 29.97 frames/second), while the re-encoded and multiplexed code stream of the invention calculates SCR and PTS strictly by the frame rate and multiplexes the audio and video streams accordingly. This mismatch would cause audio and video to fall out of sync, which is why the PTS and SCR of the audio packets and sub-picture packets must be corrected here.
  • The solution: multiply the PTS and SCR of the audio packets by a scaling factor.
  • The scaling factor is the ratio of the theoretical display time of the video stream to that of the audio stream in the original video object file, each calculated from its number of display units.
  • Audio is scaled because constant-bit-rate transmission of the code stream over the network is based on transmitting a fixed number of video frames in a fixed time.
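The correction amounts to multiplying each audio PTS/SCR by the ratio of the two streams' theoretical display times, each computed from its display-unit count. The numbers below are illustrative, not from the patent:

```python
def audio_scale_factor(video_frames: int, video_fps: float,
                       audio_frames: int, audio_frame_ms: float = 32.0) -> float:
    """Ratio of theoretical video display time to theoretical audio display
    time, both computed from display-unit counts as described in the text."""
    video_seconds = video_frames / video_fps
    audio_seconds = audio_frames * audio_frame_ms / 1000.0
    return video_seconds / audio_seconds

# One hour of NTSC video (29.97 fps) against a slightly long AC-3 track:
scale = audio_scale_factor(107892, 29.97, 112530)  # 3600 s vs. 3600.96 s
corrected_pts = int(1_000_000 * scale)             # apply to an audio PTS value
print(round(scale, 6))  # 0.999733
```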
  • Sub-image packs appear much less frequently than video and audio, and they do not have the characteristics of constant flow of audio and video streams.
  • the SCR and PTS correction schemes of the present invention for sub-picture packs are also different from audio packs.
  • The invention corrects the SCR and PTS of a sub-picture packet according to the difference between the SCR of the navigation pack of the GOP containing that packet in the original file and the theoretical SCR.
  • Step 3 in Figure 1: re-encode the video data stream to obtain constant-rate video data.
  • The definition of constant-rate video: one GOP is fixed at 12/15 frames, which fixes the playback time of a GOP; the code-stream length of a GOP is fixed (in bytes); and each GOP begins with the video sequence header start code 0x000001B3.
  • The constant-rate process: when the encoded length of a GOP exceeds the specified value, re-encode it (if re-encoding is impossible, truncate to the specified length, which causes mosaic and should be avoided where possible); when the encoded length of a GOP is below the specified value, fill with zero bytes until the length equals the specified value.
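A minimal sketch of this padding/truncation step (the byte-level handling is simplified; a real implementation would re-encode rather than truncate):

```python
SEQ_HEADER = b"\x00\x00\x01\xb3"  # video sequence header start code 0x000001B3

def constant_stream_gop(gop: bytes, target_len: int) -> bytes:
    """Force one GOP's code stream to exactly target_len bytes: pad with
    zero bytes if short, truncate if long (truncation causes mosaic and
    should be avoided; the patent prefers re-encoding)."""
    assert gop.startswith(SEQ_HEADER), "each GOP must start with 0x000001B3"
    if len(gop) >= target_len:
        return gop[:target_len]   # last resort
    return gop + b"\x00" * (target_len - len(gop))

gop = SEQ_HEADER + b"\x55" * 100
fixed = constant_stream_gop(gop, 512)
print(len(fixed))  # 512
```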
  • FIG. 2 shows the frame arrangement of a 12-frame GOP as defined by the invention in the pulldown case.
  • FIG. 3 shows a 15-frame GOP as defined by the invention without pulldown. GOPs of other lengths and frame structures can also be defined to suit different needs; obviously, this too falls within the protection scope of the invention.
  • Step 4 in Figure 1: multiplex the constant-rate video stream with the extracted audio/sub-picture packets, then apply the constant-rate process again to obtain the final data stream. It includes the following:
  • Determine which GOP each audio or sub-picture packet belongs to and its position within the GOP, and insert it at that position. 3. Make the number of packets in each GOP constant (constant rate), counting video packets, audio packets, and sub-picture packets: if the count is below the specified value, fill with all-zero packs (2048 bytes); if it exceeds the specified value, move the extra audio and sub-picture packets to the beginning of the next GOP.
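This packet-count rule can be sketched as follows, with the 2048-byte pack size stated above. Carrying every excess pack to the next GOP is a simplification; the patent moves only the extra audio and sub-picture packs.

```python
PACK_SIZE = 2048
ZERO_PACK = bytes(PACK_SIZE)   # all-zero filler pack

def fix_gop_pack_count(packs: list[bytes],
                       target: int) -> tuple[list[bytes], list[bytes]]:
    """Return (packs kept in this GOP, packs carried to the next GOP)
    so that the GOP holds exactly `target` packs."""
    if len(packs) < target:
        return packs + [ZERO_PACK] * (target - len(packs)), []
    return packs[:target], packs[target:]   # extras move to the next GOP

this_gop, carry = fix_gop_pack_count([b"\x01" * PACK_SIZE] * 3, 5)
print(len(this_gop), len(carry))  # 5 0
```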
  • These rules rest on the fact that packets are sent over the network at a fixed time interval; furthermore, the playback time of each GOP of the invention is fixed, and the number of packets per GOP is fixed.
  • The SCR in a packet gives the time at which the packet's first byte is expected to reach the decoder; the SCRs of successive packets should therefore increase in equal steps.
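With a fixed GOP playback time and a fixed pack count per GOP, evenly increasing SCR values follow directly. A sketch in 90 kHz units; the GOP duration used below (15 frames at 25 fps) and the pack count are examples, not values fixed by the patent:

```python
SCR_UNITS_HZ = 90_000   # MPEG-2 SCR resolution (90 kHz)

def pack_scrs(gop_index: int, packs_per_gop: int,
              gop_duration_s: float) -> list[int]:
    """Evenly spaced SCR values (90 kHz units) for the packs of one GOP."""
    gop_units = round(gop_duration_s * SCR_UNITS_HZ)
    step = gop_units // packs_per_gop   # constant per-pack increment
    base = gop_index * gop_units
    return [base + i * step for i in range(packs_per_gop)]

# A 15-frame GOP at 25 fps = 0.6 s = 54000 units, spread over 30 packs:
scrs = pack_scrs(0, 30, 0.6)
print(scrs[1] - scrs[0])  # 1800
```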
  • The method of the invention achieves a fixed length between adjacent I frames and a fixed number of frames between them, equivalent to a fixed GOP code-stream length and a fixed frame count per GOP, so that search, positioning, editing, and similar operations become very easy.
  • The method of the invention also suits network-based real-time playback: it causes no VBV buffer overflow or underflow at the decoder, and the picture shows no mosaic, blocking, or stutter.

Abstract

A method of processing video/audio data compressed according to the MPEG-2 standard, including the steps of: (1) analyzing the video objects and determining the parameters used in the following steps; (2) preprocessing the associated video objects; (3) re-encoding the demultiplexed video data stream and outputting a constant-rate video data stream; (4) multiplexing the constant-rate video data stream with the sampled audio/sub-picture packets and outputting a constant-rate data stream. Compressing video/audio data to a constant stream enables real-time transmission and playback of compressed video/audio data over streaming networks and allows random access to the bit stream.

Description

Constant stream compression processing method

Technical Field

The invention relates to a data processing method for the MPEG-2 compression standard, in particular one that processes compressed video and audio data to a constant flow rate, making it suitable for real-time transmission and playback on a streaming-media network and allowing random access to the code stream.

Background Art
MPEG (Motion Pictures Experts Group) is an expert group convened by the International Organization for Standardization (ISO) to develop compression standards for digital video and audio. The group first formulated the MPEG-1 standard in 1992, applied to program distribution on laser discs. The broadcasting and television industry saw from the application of MPEG-1 the significance of MPEG technology for television, so in 1994 the group launched the MPEG-2 compression standard, establishing the possibility of worldwide interoperability among video and audio services and applications.

Three key compression techniques are used by the MPEG standards: the discrete cosine transform (DCT), motion compensation, and Huffman coding. DCT reduces the spatial redundancy of the image, motion compensation reduces its temporal redundancy, and Huffman coding reduces its informational (entropy) redundancy. The combined application of these techniques gives MPEG its high compression ratio.

The MPEG-2 standard is similar to MPEG-1 but more adaptable, applying to all processes and links of broadcast television. By definition, MPEG-1 is in fact a subset of MPEG-2, as can be seen in the MPEG-2 profile and level classification table.
The MPEG-2 standard is divided into four documents: the systems layer (Systems, ISO/IEC 13818-1), which describes the multiplexing of video and audio data and the method of video/audio synchronization.

The video compression layer (Video, ISO/IEC 13818-2) describes the digital video encoding method and decoding process. The audio compression layer (Audio, ISO/IEC 13818-3) describes the digital audio encoding method and decoding process. Conformance (ISO/IEC 13818-4) explains the process of testing a coded stream to verify compliance with the first three documents.

The MPEG-2 compression algorithm was designed as a universal video and audio compression standard, required to accommodate different application requirements and to allow control of the compressed output bit rate and image quality. To this end, it is divided into profiles and levels. Profiles define the chrominance spatial resolution and output bitstream control, while levels define the image resolution, the luminance sampling frequency, the number of video/audio layers a scalable profile can support, and the maximum bit rate of each profile at that level.
To better represent the coded data, MPEG uses its syntax to define a hierarchical structure of six layers, from top to bottom:

Video sequence

Group of pictures (GOP)

Picture

Slice

Macroblock

Block
A video sequence is composed of groups of pictures, with a sequence header marking its beginning and a sequence end code marking its end. It is a random-access segment.

The group of pictures (GOP) exists to facilitate random access; its structure and length are variable, and MPEG-2 imposes no hard rules on them. A GOP has two parameters, the length (N) and the reference-frame repetition frequency (M), explained below. The GOP is the video unit of random access.

A picture is an independent display unit and the basic coding unit. In MPEG-2, pictures can be progressive or interlaced; this differs from MPEG-1, which is always progressive.

A slice contains several consecutive macroblocks and is the unit of resynchronization. Slices exist to limit the spread of bit errors: when an error occurs in one slice, decoding of subsequent slices is unaffected.

The picture's luminance array is divided into 16×16 macroblocks; the macroblock is the basic unit of motion compensation. A macroblock contains four 8×8 luminance blocks and, depending on the profile, either two 8×8 chrominance blocks (one each for R-Y and B-Y, with 4:2:0 sampling) or four 8×8 chrominance blocks (two each for R-Y and B-Y, with 4:2:2 sampling). The block is the unit on which the DCT is performed and contains only luminance or only chrominance.
As mentioned above, MPEG is based on DCT, motion compensation, and Huffman coding, and accordingly uses both intra-frame and inter-frame compression. To achieve the maximum compression ratio in encoding, MPEG uses three types of pictures: I frames, P frames, and B frames.

The I frame (Intra-Frame) uses intra-frame compression without motion compensation and provides a medium compression ratio. Because I frames do not depend on other frames, they are the entry points for random access and serve as reference frames during decoding.

The P frame (Predicted-Frame) is predicted from the preceding I or P frame and compressed with motion compensation, so its compression ratio is higher than an I frame's and its data volume averages about 1/3 of an I frame. P frames are reference frames for decoding the surrounding B frames and subsequent P frames. A P frame itself carries error; if its preceding reference frame is also a P frame, the error propagates.

The B frame (Bidirectional-Frame) is reconstructed by interpolation from the two surrounding I/P or P/P frames using bidirectional prediction; its data volume averages about 1/9 of an I frame. A B frame itself is never used as a reference, so it provides a higher compression ratio without propagating errors.

It should be pointed out that although the term "frame" is used here, MPEG-2 itself does not require frames as the unit of digital image compression; for interlaced video images, fields can be used as the unit.
A GOP consists of a series of I, B, and P frames, starting with an I frame. The number of frames in a GOP is variable: more frames give a higher compression ratio but cause random-access delay (one must wait for the next I frame) and error accumulation (P-frame error propagation). Typically there are two I frames per second, serving as random-access entry points.

MPEG-2 likewise does not specify the GOP structure; the frame repetition pattern can be IP, IB, IBP, IBBP, or even all I frames. The repetition frequency of the reference frames is denoted M; different repetition frequencies yield different output bit rates and affect access delay.

When the three compression methods M-JPEG, DV, and MPEG-2 are compared, a thorny issue arises: M-JPEG and DV both provide frame-accurate random access, but an MPEG-2 stream based on I/P or I/P/B frames cannot. This limitation comes from the motion-compensation algorithm, and here the pros and cons of the newer technology show themselves. Within a GOP, decoding P and B frames depends on the I frame, so a video stream must be entered at an I frame. The consequences of this differ greatly across applications. When a television viewer switches channels, the delay while the digital video decoder box waits for the new channel's I frame is no problem: with at least two I frames per second, viewers do not mind the small delay. For television-station operations, however, the problems are serious: for example, it is hard to control the start point and length of a commercial insertion, and material search during non-linear editing is slow. Thus the bit rate of existing MPEG-2 streams such as DVD varies with picture content, which is ill-suited to real-time playback over a network: the decoder's VBV buffer can overflow or underflow, the picture shows mosaic, blocking, and stutter, and the decoder may even stop working.
发明内容  Summary of the Invention
针对现有的 MPEG-2码流不利于搜索定位和编辑,以及因为码流流量的 不均匀, 各个 GOP(Group of Pictures)内帧数及帧的结构不固定, 流量的不 确定性而导致不可能做到随机定位作为随机存取入口的 I帧, 及其对于快进 播放、 快退播放、 剪辑、 定位等造成的无法逾越的困难。 本发明提供一种 克服以上问题的针对 MPEG-2压缩标准压縮的视音频数据的处理方法。  In view of the existing MPEG-2 code stream, it is not conducive to searching, positioning, and editing, and because the stream traffic is not uniform, the number of frames and the frame structure in each GOP (Group of Pictures) are not fixed, and the traffic uncertainty causes It is possible to achieve random positioning of I-frames as random access entries and their insurmountable difficulties caused by fast-forward playback, fast-rewind playback, editing, positioning, and the like. The invention provides a method for processing video and audio data compressed by the MPEG-2 compression standard, which overcomes the above problems.
本发明的技术就是使得 MPEG-2的视频流以 GOP为单位图像帧数恒 定, 而且以 GOP为单位码流长度绝对恒定, 这样保证了视频码流速率的恒 定。 将视频流和音频流等进行复用得到系统流, 音频流原本就是恒流的, 而子图像等的流量和视频流相比较来讲非常小,所以只要给每个 GOP很少 量的冗余, 就可以得到恒流的 MPEG-2节目流。 本发明提供的一种针对 MPEG-2节目流的视音频数据的处理方法, 它包括以下步骤: (1 )对视频 目标文件进行分析, 确定随后步骤中将使用到的参数; (2)将相关视频目 标文件进行预处理; (3) 将解复用得到视频数据流重新编码得到恒流的视 频数据; (4)将恒流后的视频流和抽取的音频 /子图像包进行复用, 并再次 恒流得到最终的数据流。  The technology of the present invention is to make the MPEG-2 video stream have a constant number of image frames in GOP units, and the code stream length in GOP units is absolutely constant, which ensures a constant video stream rate. The video stream and the audio stream are multiplexed to obtain a system stream. The audio stream is originally a constant stream, and the traffic of the sub-image and the like is very small compared to the video stream, so only a small amount of redundancy is required for each GOP , You can get a constant current MPEG-2 program stream. The invention provides a method for processing video and audio data for an MPEG-2 program stream, which includes the following steps: (1) analyzing a video object file to determine parameters to be used in subsequent steps; (2) correlating The video object file is pre-processed; (3) the demultiplexed video stream is re-encoded to obtain constant stream video data; (4) the constant stream video stream and the extracted audio / sub-image packet are multiplexed, and Constant current again to get the final data flow.
The present invention achieves a fixed length between adjacent I-frames and a fixed number of frames between adjacent I-frames, which is equivalent to a fixed GOP stream length and a fixed number of frames per GOP. This makes searching, positioning, editing, and similar operations very easy. A further benefit is suitability for network-based real-time playback: the stream does not cause overflow or underflow of the VBV buffer on the decoder side, so the picture does not exhibit mosaic artifacts, blocking, or stuttering.
BRIEF DESCRIPTION OF THE DRAWINGS

Figure 1 is a flowchart of an embodiment of the present invention.

Figure 2 is a schematic diagram of the frame arrangement of one GOP according to the present invention in the pulldown case.

Figure 3 is a schematic diagram of the frame arrangement of one GOP according to the present invention without pulldown.
BEST EMBODIMENT

The detailed implementation of the various aspects of the present invention will become clearer in the following description of the preferred embodiments with reference to the accompanying drawings.
The three main parameters that appear below, SCR, PTS, and DTS, are time stamps carried at specific positions in the system stream; each is a small piece of data inserted into the stream by the encoder. Specifically:

System Clock Reference (SCR)

The SCR is the system clock reference, inserted at least every 0.7 seconds. The decoder extracts the SCR from the data stream and passes it to the video decoder and the audio decoder, synchronizing their internal clocks with the system clock.

Presentation Time Stamp (PTS)

A picture sequence can be divided into many "presentation units"; for video, the presentation unit is the frame. The PTS indicates the presentation time of a presentation unit; the decoder checks the PTS, compares it with the SCR, and presents the picture accordingly, keeping it synchronized with the system time.

Decoding Time Stamp (DTS)
The DTS indicates the time at which the access unit is expected to be decoded in the system target decoder. In hierarchical coding, the relevant DTS must remain consistent with the corresponding access unit across all hierarchy levels.

Referring to step 1 in Figure 1: first analyze the video object file to determine the parameters to be used in subsequent steps. These parameters include: 1. whether, after decoding, one field is repeated every two frames, i.e. whether 3:2 pulldown is in effect (3:2 pulldown means that after decoding, an extra field is displayed every two frames; it is the adjustment required for standards conversion because film at 24 frames/second and NTSC at 30 frames/second have different frame rates); 2. how many packs to cut away (video object 1, video object 2, ..., keeping the last video object program, so that the stream is purely pulldown or purely not, and cutting off any black screen); 3. the video standard, i.e. PAL or NTSC; 4. the stream IDs of the required audio stream and subtitle stream; 5. the audio bit rate (kbps); 6. the frame rate (frames/second).
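As one way to visualize the outcome of this analysis step, the parameter set can be sketched as a simple record. This is a hypothetical illustration in Python, not a data layout taken from the patent; all field names and example values are assumptions.

```python
from dataclasses import dataclass

@dataclass
class AnalysisParams:
    """Parameters determined by analysing the video object file (step 1)."""
    is_pulldown: bool        # is 3:2 pulldown in effect after decoding?
    packs_to_cut: int        # number of leading packs to cut away
    video_standard: str      # "PAL" or "NTSC"
    audio_stream_id: int     # stream id of the required audio stream
    subtitle_stream_id: int  # stream id of the required subtitle stream
    audio_bitrate_kbps: int  # audio bit rate
    frame_rate: float        # frames per second

# Example: a hypothetical NTSC film title with pulldown and AC-3 audio.
params = AnalysisParams(
    is_pulldown=True, packs_to_cut=2, video_standard="NTSC",
    audio_stream_id=0xBD, subtitle_stream_id=0x20,
    audio_bitrate_kbps=192, frame_rate=29.97,
)
```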
Of course, in other embodiments different parameters may be selected according to different needs; such simple variations in the choice of parameters do not depart from the scope of the present invention.

Referring to step 2 in Figure 1: preprocess the relevant video object file, which includes the following:

(a) extract the required audio and sub-picture packs from the intercepted video object file;

(b) demultiplex the intercepted video object file to obtain the video data stream;
Because the first intercepted GOP is not necessarily self-contained, the two B-frames following the I-frame cannot be decoded correctly. The present invention repeats the I-frame data of the first GOP twice to overwrite the data of the two B-frames that follow it, so that no mosaic artifacts appear.
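The open-GOP repair just described can be sketched as follows. The frame-list layout (a type tag plus a payload) is a simplified assumption rather than real MPEG-2 bitstream parsing.

```python
def patch_open_gop(frames):
    """Overwrite the payloads of the two B-frames that follow the leading
    I-frame with copies of the I-frame data, so the truncated first GOP
    decodes without mosaic artifacts."""
    patched = list(frames)
    frame_type, i_data = patched[0]
    assert frame_type == "I", "first GOP is expected to start with an I-frame"
    for k in (1, 2):  # the two B-frames immediately after the I-frame
        if patched[k][0] == "B":
            patched[k] = ("B", i_data)  # repeat the I-frame data
    return patched

gop = [("I", b"iframe"), ("B", b"junk1"), ("B", b"junk2"), ("P", b"pframe")]
fixed = patch_open_gop(gop)
# fixed[1] and fixed[2] now carry the I-frame bytes instead of broken B data
```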
(c) correct the SCR and PTS of the extracted audio and sub-picture packs.
Audio packs: the PTS of an audio pack is the presentation time of the first audio frame header appearing in that pack. Compare the original PTS of the first intercepted audio presentation unit with that of the first video presentation unit to obtain their difference; after interception, the PTS of the first video presentation unit is about 0.28 seconds, and the PTS of the first audio presentation unit is corrected according to this difference. Since the presentation time of each audio frame is fixed (for example, one Dolby AC-3 audio frame spans 32 milliseconds), the PTS of a pack can easily be computed once it is known which audio frame the pack's first audio frame header belongs to. The SCR of an audio pack, however, is the expected arrival time at the decoder of the pack's first byte, not of the first audio frame header, so the SCR correction scheme of the present invention is: SCR (seconds) = PTS (seconds) - position of the pack's first audio frame header (bytes) / frame size (bytes) × frame time (seconds) - a fixed empirical value (seconds).
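The SCR correction formula for audio packs translates directly into code. The 32 ms AC-3 frame time comes from the text above, while the byte positions and the fixed empirical offset below are placeholder values chosen purely for illustration.

```python
def corrected_audio_scr(pts_s, header_pos_bytes, frame_size_bytes,
                        frame_time_s, empirical_offset_s):
    """SCR(s) = PTS(s) - header_position/frame_size * frame_time - fixed offset."""
    return (pts_s
            - header_pos_bytes / frame_size_bytes * frame_time_s
            - empirical_offset_s)

# Dolby AC-3: one audio frame spans 32 ms.  Suppose (hypothetically) the
# first frame header sits 384 bytes into a 768-byte frame, and the fixed
# empirical offset is 0.05 s.
scr = corrected_audio_scr(pts_s=1.000, header_pos_bytes=384,
                          frame_size_bytes=768, frame_time_s=0.032,
                          empirical_offset_s=0.05)
# scr = 1.000 - 0.5 * 0.032 - 0.05 = 0.934 seconds
```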
In some DVD video object files, the PTS of the video packs is not computed strictly according to the video frame rate of the stream (e.g. 29.97 frames/second), whereas the re-encoded and re-multiplexed stream of the present invention computes SCR and PTS strictly according to that frame rate and multiplexes the audio and video streams accordingly; this would cause audio and video to fall out of sync, which is why the PTS and SCR of the audio packs and sub-picture packs must be corrected here. The solution is to multiply the PTS and SCR of the audio packs by a scaling factor. The scaling factor is the ratio of the theoretical presentation time of the video stream to that of the audio stream in the original video object file (each computed from the number of presentation units). The reason for scaling the audio is that constant-bit-rate transmission over the network is based on transmitting a fixed number of video frames in a fixed time.
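The scaling-factor correction can be sketched as follows; the presentation-unit counts and durations are illustrative numbers, not values taken from the patent.

```python
def scale_factor(video_units, video_frame_time_s, audio_units, audio_frame_time_s):
    """Ratio of the theoretical video presentation time to the theoretical
    audio presentation time, each computed from the number of units."""
    return ((video_units * video_frame_time_s)
            / (audio_units * audio_frame_time_s))

def rescale_pack(pts_s, scr_s, factor):
    """Multiply a pack's PTS and SCR by the scaling factor."""
    return pts_s * factor, scr_s * factor

# Hypothetical 100-second title: 2997 video frames at 29.97 fps versus
# 3125 AC-3 frames of 32 ms each, giving a scale factor of exactly 1.0.
f = scale_factor(2997, 1 / 29.97, 3125, 0.032)
pts, scr = rescale_pack(10.0, 9.8, f)
```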
Sub-picture packs: sub-picture packs appear far less frequently than video and audio packs, and they lack the constant-rate property of the audio and video streams, so the SCR/PTS correction scheme of the present invention for sub-picture packs also differs from that for audio packs. The present invention corrects the SCR and PTS of a sub-picture pack according to the difference between the SCR of the navigation pack of the GOP containing that sub-picture pack in the original file and the theoretical SCR.
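This sub-picture correction amounts to shifting each pack's time stamps by the navigation pack's SCR error. A minimal sketch, assuming all time stamps are already converted to seconds:

```python
def correct_subpicture(pack_scr_s, pack_pts_s,
                       nav_scr_original_s, nav_scr_theoretical_s):
    """Shift a sub-picture pack's SCR and PTS by the difference between the
    original navigation-pack SCR of its GOP and the theoretical SCR."""
    delta = nav_scr_original_s - nav_scr_theoretical_s
    return pack_scr_s - delta, pack_pts_s - delta

# Hypothetical values: the GOP's navigation pack was 0.30 s late, so both
# stamps of the sub-picture pack move back by 0.30 s.
scr, pts = correct_subpicture(12.40, 12.55,
                              nav_scr_original_s=12.30,
                              nav_scr_theoretical_s=12.00)
```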
Implementing these preferred steps helps improve audio/video synchronization; of course, the present invention can equally be realized using other known preprocessing techniques to obtain the required audio and sub-picture packs and video data stream.
Referring to step 3 in Figure 1: re-encode the demultiplexed video data stream to obtain a constant-rate video stream. Here, a constant-rate video stream is defined as follows: each GOP is fixed at 12/15 frames, which is equivalent to fixing the playback time of a GOP; the stream length of each GOP is fixed (in bytes); and each GOP begins with the video sequence header code word 0x000001B3. During rate regulation, when the encoded length of a GOP exceeds the prescribed value, it is re-encoded (if re-encoding is impossible, the stream is truncated to the prescribed length, which causes mosaic artifacts and should be avoided whenever possible); when the encoded length of a GOP is below the prescribed value, zero bytes are appended until its length equals the prescribed value.
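The constant-length rule for a GOP (zero-pad when short, truncate only as a last resort when long) can be sketched as a byte-level helper; the target size here is an arbitrary illustrative value.

```python
SEQUENCE_HEADER = bytes.fromhex("000001b3")  # each GOP begins with this code word

def make_constant_length(gop_bytes, target_len):
    """Force a GOP's encoded length to exactly target_len bytes: pad with
    zero bytes when short; truncate when long (a last resort, since
    truncation causes mosaic artifacts and re-encoding is preferred)."""
    assert gop_bytes.startswith(SEQUENCE_HEADER)
    if len(gop_bytes) < target_len:
        return gop_bytes + b"\x00" * (target_len - len(gop_bytes))
    return gop_bytes[:target_len]

# Hypothetical GOP of 104 bytes forced to a 256-byte constant length.
gop = SEQUENCE_HEADER + b"\x42" * 100
out = make_constant_length(gop, target_len=256)
```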
Figure 2 shows the frame arrangement of a 12-frame GOP as defined by the present invention in the pulldown case.

Figure 3 shows the frame arrangement of a 15-frame GOP as defined by the present invention without pulldown. GOPs of other lengths and frame structures may likewise be defined according to different needs; obviously, this also falls within the protection scope of the present invention.

Referring to step 4 in Figure 1: multiplex the constant-rate video stream with the extracted audio/sub-picture packs, and make the result constant-rate again to obtain the final data stream. This includes the following:
1. Pack and multiplex the constant-rate video stream into a system stream (the key is determining the SCR, PTS, and DTS of the I-frames). First read one video access unit into a temporary file, then packetize the data in this temporary file. Note that a new pack is started at the beginning of each GOP, to allow random access to the GOP, and another new pack is started with the frame of data immediately following the I-frame, to allow access to the I-frame data during fast-forward or fast-rewind.
2. Insert the extracted audio and sub-picture packs into the packetized video packs.

The rule is: from the SCR, compute which GOP the audio or sub-picture pack belongs in and its position within that GOP, and insert it at that position. 3. Make the number of packs in each GOP constant (constant rate), counting video packs, audio packs, and sub-picture packs. If the count is below the prescribed value, pad with all-zero packs (2048 bytes each); if the count exceeds the prescribed value, move the surplus audio and sub-picture packs to the beginning of the next GOP.
4. Correct the SCR of each pack. The rule is: because the stream is transmitted over the network, packs are sent at fixed time intervals; moreover, in the present invention the playback time of each GOP is fixed and the number of packs per GOP is fixed. The SCR carried in a pack is the time at which the pack's first byte is expected to reach the decoder, so the SCRs of successive packs should increase uniformly.
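Items 3 and 4 above, the constant pack count per GOP and the uniformly increasing SCRs, can be sketched together. The pack size follows the 2048-byte figure in the text; the GOP timing values are illustrative assumptions, and the surplus-pack handling is simplified to a plain slice.

```python
PACK_SIZE = 2048  # bytes; padding packs are all zeros

def normalize_gop(packs, packs_per_gop):
    """Force a GOP to a fixed pack count: pad with all-zero packs when
    short; when long, return the surplus packs so the caller can move
    them to the beginning of the next GOP (simplified to a plain slice
    here; the patent moves only surplus audio/sub-picture packs)."""
    if len(packs) < packs_per_gop:
        padding = [b"\x00" * PACK_SIZE] * (packs_per_gop - len(packs))
        return packs + padding, []
    return packs[:packs_per_gop], packs[packs_per_gop:]

def uniform_scrs(gop_index, packs_per_gop, gop_time_s):
    """Each GOP plays for a fixed time and holds a fixed number of packs,
    so pack SCRs increase uniformly at a fixed interval."""
    interval = gop_time_s / packs_per_gop
    base = gop_index * gop_time_s
    return [base + k * interval for k in range(packs_per_gop)]

# Hypothetical GOP of 5 packs padded up to 8; a 12-frame PAL GOP plays
# for 0.48 s, so packs in GOP #2 get SCRs starting at 0.96 s.
packs, spill = normalize_gop([b"\x01" * PACK_SIZE] * 5, packs_per_gop=8)
scrs = uniform_scrs(gop_index=2, packs_per_gop=8, gop_time_s=0.48)
```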
It is worth noting that there are many known methods for multiplexing video streams with audio/sub-picture packs; what is described here is only a preferred implementation, and one of ordinary skill in the art can use other multiplexing methods to carry out this step without departing from the disclosure of the present invention.
Industrial Applicability

The method of the present invention achieves a fixed length between adjacent I-frames and a fixed number of frames between adjacent I-frames, which is equivalent to a fixed GOP stream length and a fixed number of frames per GOP, making searching, positioning, editing, and similar operations very easy. The method of the present invention is also suitable for network-based real-time playback: it does not cause overflow or underflow of the VBV buffer on the decoder side, so the picture does not exhibit mosaic artifacts, blocking, or stuttering.

Claims

1. A method for processing video and audio data under the MPEG-2 compression standard, characterized in that it comprises the following steps:

(1) analyzing the video object file to determine the parameters to be used in subsequent steps;

(2) preprocessing the relevant video object file;

(3) re-encoding the demultiplexed video data stream to obtain a constant-rate video stream;

(4) multiplexing the constant-rate video stream with the extracted audio/sub-picture packs, and making the result constant-rate again to obtain the final data stream.
2. The method according to claim 1, characterized in that the parameters in step 1 include but are not limited to: whether the decoded output is 3:2 pulldown, the number of packs to cut, the video standard, the stream IDs of the required audio stream and subtitle stream, the audio bit rate, and the frame rate.
3. The method according to claim 1, characterized in that step 2 comprises the following steps:

(a) extracting the required audio and sub-picture packs from the intercepted video object file;

(b) demultiplexing the intercepted video object file to obtain the video data stream;

(c) correcting the SCR and PTS of the extracted audio and sub-picture packs.
4. The method according to claim 3, characterized in that the audio pack PTS correction scheme in step (c) is: multiplying the PTS of the audio pack by a scaling factor, the scaling factor being the ratio of the theoretical presentation time of the video stream to that of the audio stream in the original video object file (each computed from the number of presentation units).

5. The method according to claim 3, characterized in that the audio pack SCR correction scheme in step (c) is: SCR (seconds) = PTS (seconds) - position of the pack's first audio frame header (bytes) / frame size (bytes) × frame time (seconds) - fixed empirical value (seconds).

6. The method according to claim 3, characterized in that the SCR/PTS correction scheme for sub-picture packs in step (c) is: correcting the SCR and PTS of a sub-picture pack according to the difference between the SCR of the navigation pack of the GOP containing that sub-picture pack in the original file and the theoretical SCR.

7. The method according to any one of claims 1 to 6, characterized in that the constant-rate video stream in step (3) is defined as: each GOP is fixed at 12 or 15 frames; the stream length of each GOP is fixed; and each GOP begins with the code word 0x000001B3.
8. The method according to any one of claims 1 to 6, characterized in that step (4) comprises the following steps:

(1) packetizing and multiplexing the constant-rate video stream into a system stream;

(2) inserting the extracted audio and sub-picture packs into the packetized video stream;

(3) making the number of packs in each GOP constant: if the count is below the prescribed value, padding with all-zero packs; if the count exceeds the prescribed value, moving the surplus audio and sub-picture packs to the beginning of the next GOP;

(4) correcting the SCR of each pack.
PCT/CN2003/000486 2003-06-23 2003-06-23 Constant stream compression processing method WO2004114666A1 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
PCT/CN2003/000486 WO2004114666A1 (en) 2003-06-23 2003-06-23 Constant stream compression processing method
AU2003248219A AU2003248219A1 (en) 2003-06-23 2003-06-23 Constant stream compression processing method
CN200410049156.5A CN1638480A (en) 2003-06-23 2004-06-22 Video frequency compressing method for motion compensation technology


Publications (1)

Publication Number Publication Date
WO2004114666A1 (en) 2004-12-29

Family

ID=33520368


Country Status (2)

Country Link
AU (1) AU2003248219A1 (en)
WO (1) WO2004114666A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103733633A (en) * 2011-05-12 2014-04-16 索林科集团 Video analytics system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5929916A (en) * 1995-12-26 1999-07-27 Legall; Didier J. Variable bit rate encoding
EP1045589A2 (en) * 1999-04-16 2000-10-18 Sony United Kingdom Limited Apparatus and method for splicing of encoded video bitstreams
US6215820B1 (en) * 1998-10-12 2001-04-10 Stmicroelectronics S.R.L. Constant bit-rate control in a video coder by way of pre-analysis of a slice of the pictures
US20020094031A1 (en) * 1998-05-29 2002-07-18 International Business Machines Corporation Distributed control strategy for dynamically encoding multiple streams of video data in parallel for multiplexing onto a constant bit rate channel



Also Published As

Publication number Publication date
AU2003248219A1 (en) 2005-01-04


Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ OM PH PL PT RO RU SC SD SE SG SK SL TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: JP