WO2011157399A1

WO2011157399A1 - Method and device for mixing video streams at the macroblock level

Info

Publication number: WO2011157399A1
Application number: PCT/EP2011/002920
Authority: WO
Inventors: Peter Amon; Norbert Oertel
Original assignee: Siemens Enterprise Communications Gmbh & Co.Kg
Priority date: 2010-06-16
Filing date: 2011-06-14
Publication date: 2011-12-22
Also published as: DE102010023954A1; TWI511527B; CN102934437A; US9264709B2; US20130223511A1; TW201212657A; EP2583461A1; CN102934437B

Abstract

The invention relates to a method and device for mixing video streams in a video mixer device, by means of which a plurality of input video streams from different subscribers which are encoded with code words for macroblocks and in which the code words have interdependencies are combined into an output video stream. The input video streams are at least entropy-decoded to such a degree that the dependencies among the code words are dissolved, wherein the macroblocks are re-organized and mixed with each other, and the mixed macroblocks are entropy-encoded to obtain a new dedicated video stream.

Description

Method and apparatus for mixing video streams at the macroblock level

The invention describes a method for mixing video streams at the macroblock level according to the preamble of patent claim 1 and an apparatus for carrying out the method according to the preamble of patent claim 9.

In certain applications, it is necessary that the content of several video streams be displayed simultaneously on one device. For example, video conferencing with more than two participants is known, in which video and audio signals are transmitted between two or more locations in real time. For this purpose, the users' terminals or soft clients are equipped with a camera, today usually a USB webcam, and a microphone or headset as input devices, and with a screen and a speaker or headset as output devices. The video and audio signals can be encoded and decoded in hardware via plug-in cards or purely in software. Nowadays, users of a videoconferencing system generally expect not only to see the currently speaking participant, as is the case with "voice activated switching" systems, but to see all or at least several of the participants on the screen at the same time, as is the case with "continuous presence" systems.

Another application example is in the field of video surveillance, where several video streams from different security cameras are simultaneously decoded and displayed live on a screen in a control room. If the system has only one decoder, only one video stream from one surveillance camera can be decoded and displayed at a time.

Because very many of the installed terminals or soft clients of video conferencing systems today have only a single decoder, it is not possible to decode and display multiple video streams simultaneously on these terminals or soft clients. Therefore, a common practice today is to use a video bridge or Multipoint Control Unit (MCU). This is a central unit that receives and processes the encoded video streams from multiple participants and returns a dedicated video stream to all participants. For this, the video streams must be completely or at least mostly decoded, the video data merged and then encoded into a new video stream. FIG. 4 shows a schematic representation of the complete transcoding of two H.264 coded video streams. Because this method is very complex, it is often implemented in hardware, resulting in high device costs. In addition, the transcoding causes delays due to the large number of signal processing steps, and quality degradation due to the re-encoding.

Another known method is the mixing of video streams at the slice level, as described in the co-pending application of the same applicant entitled "Mixing Video Streams" by the inventors Peter Amon and Andreas Hutter.

In the H.264 / AVC standard, the macroblocks are organized in so-called slices, whereby each slice can be decoded independently of other slices. The Flexible Macroblock Ordering (FMO) defined in the H.264 / AVC standard allows flexible assignment of the macroblocks to slice groups. That method uses this option for mixing multiple video streams. Thus, a slice group can be defined for each input video stream, and a video mixer combines them into one stream with two slice groups. FIG. 5 shows, in a schematic representation, the mixing of two H.264 coded video streams at the slice level. However, today there are many decoders that do not support slice groups, which makes mixing video streams at the slice level inapplicable.

For the video encoding standard H.261, a method is known with which several images can be merged into a new image at the macroblock level. The presumption that this process is known is based on the Wainhouse Research 2003 analyst report "Will Your Next Video Bridge Be Software Based?" (http://www.wainhouse.com/files/papers/wr sw-video-bridges.pdf), which reports on mixing H.261 video streams without going into the process. However, the performance measurements suggest that a method as described above and shown schematically in FIG , is used, since a computer of the specified performance class could not carry out that many complete transcoding operations simultaneously.

H.261 uses variable length codes (VLC) for entropy coding. In the variable length codes used in the H.261 standard, a symbol to be coded is permanently assigned to a codeword by means of a single codeword table. As a result, there is no dependence between the symbols and thus between the macroblocks. Multiple video streams can therefore be merged into one video stream simply by re-sorting the macroblocks.

To compress the data to be transmitted even further, e.g. residual errors from the prediction, differences from the estimated motion vectors, etc., these are coded by means of so-called entropy coding. The H.264 / AVC standard provides two options for the entropy coding of video streams: Context-based Adaptive Variable Length Coding (CAVLC) and Context-based Adaptive Binary Arithmetic Coding (CABAC). Both are based on so-called adaptive, context-dependent entropy coding, either with variable code length or with binary arithmetic coding, and thereby achieve performance advantages in the coding process compared to the other standards. With CAVLC, the encoding of a macroblock depends on coding decisions from adjacent, already coded macroblocks. With CABAC, the coding of a symbol affects the selection of the codeword for the following symbol, so that dependencies arise between the codewords and thus between the macroblocks. The method of mixing video streams at the macroblock level shown for H.261 coded streams therefore cannot be applied directly to H.264 / AVC encoded video streams.

It is an object of the present invention to specify a method for mixing video streams that are coded with codewords for macroblocks and in which the codewords have dependencies among each other, and which avoids the disadvantages of the prior art shown. This object is achieved by the method according to claim 1 and by a device according to claim 9.

Embodiments of the invention are indicated in the dependent claims.

The inventive method is based on mixing the video streams encoded according to the H.264 / AVC standard at the macroblock level. First, the video streams received from the participants must be decoded. This is done by resolving the dependencies between the codewords by full or partial entropy decoding. Thereafter, the macroblocks of the input video streams are rearranged and merged into a new macroblock containing all the data of the individual macroblocks. Finally, a new video stream is encoded and sent to all or a certain number of participants, so that the participants can see each other at the same time. This is done by reversing the full or partial entropy decoding, after mixing the video streams, by means of full or partial entropy encoding. This method is illustrated schematically in Figure 2, where two H.264 coded video streams are mixed at the macroblock level.

The inventive method can be used for the two entropy coding methods CAVLC and CABAC defined in the H.264 / AVC standard.

In the CAVLC method, the VLC table used for the element to be coded is switched depending on the data already transmitted. Since the VLC tables were carefully constructed using statistics, a significant performance improvement over the exponential Golomb code is achieved.

In the CAVLC method, the following syntax elements are coded for each 4x4 block:

- coeff_token: number of coefficients not equal to zero (0-16) and the number of ones at the end of the zig-zag scan, the so-called "trailing ones"

- trailing_ones_sign_flag: sign of the "trailing ones"

- level_prefix and level_suffix: magnitude and sign of the coefficients not equal to zero, without the "trailing ones"

- total_zeros: number of coefficients equal to zero in the 4x4 block up to the last non-zero coefficient in the scan order

- run_before: number of coefficients equal to zero up to the next non-zero coefficient

For coding coeff_token, one of four VLC tables is selected for luminance coefficients. The selection depends on the number of non-zero coefficients in the two 4x4 blocks to the left of and above the current 4x4 block, provided they are in the same slice. A slice in H.264 / AVC is a number of macroblocks that are coded together. If these blocks do not exist, e.g. at the top left of the picture or at the beginning of a slice, a default value is set.

When the macroblocks are reordered during mixing, however, this number can change, so that the decoder would use the wrong table for the entropy decoding of the codewords. To prevent this, the corresponding codewords must be exchanged if a different VLC table would result. For this purpose, the codewords do not have to be decoded, i.e. the number of non-zero coefficients and the number of "trailing ones" do not have to be determined; instead, the replacement codeword can be determined directly from the table defined in the H.264 / AVC standard. This VLC table is shown in FIG. 3, where the parameter nC determines the table to be selected.

The syntax element trailing_ones_sign_flag is coded with a fixed word length and not adaptively. The syntax element level_suffix is coded with a variable word length (0 or 1 bit). However, this word length depends only on the syntax element coeff_token of the same 4x4 block. The remaining syntax elements trailing_ones_sign_flag, level_prefix, level_suffix, total_zeros and run_before are adaptively coded according to the CAVLC method, but there are no dependencies beyond the 4x4 block. Thus, the codewords for all syntax elements except coeff_token can be copied directly into the mixed data stream.

Since only those macroblocks or 4x4 blocks at the left and upper edges that are no longer at the left or upper edge of the mixed image after mixing have to be checked, the entropy decoding and subsequent re-encoding can be reduced to a minimum and the mixing can be performed efficiently. If, for example, two video signals are mixed one above the other, then only the macroblocks at the upper edge of the second image have to be checked and, if necessary, adapted in their entropy coding.
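The selection of the blocks to be checked can be sketched as follows. This is only an illustrative sketch under assumptions that are not part of the original text: each input picture is assumed to consist of a single slice, the mixed picture is assumed to be coded as a single slice, and the function name `blocks_to_check` as well as the coordinate convention (positions in units of 4x4 blocks) are hypothetical.

```python
def blocks_to_check(width_blk, height_blk, offset_x, offset_y):
    """Return the 4x4 blocks (x, y), in input-picture coordinates, whose left
    or upper neighbour availability changes when the input picture is placed
    at block offset (offset_x, offset_y) inside the mixed picture.

    Hypothetical helper: assumes one slice per input picture and one slice
    for the mixed picture.
    """
    affected = set()
    if offset_y > 0:
        # The top row was at the picture edge before mixing and now has
        # upper neighbours from another stream.
        affected.update((x, 0) for x in range(width_blk))
    if offset_x > 0:
        # The left column now has left neighbours from another stream.
        affected.update((0, y) for y in range(height_blk))
    return sorted(affected)


# Two QCIF pictures (176x144 samples = 44x36 blocks of 4x4) stacked vertically.
first = blocks_to_check(44, 36, 0, 0)    # stays at the upper left corner
second = blocks_to_check(44, 36, 0, 36)  # placed below the first picture
print(len(first), len(second))
```

In this arrangement no block of the first picture has to be touched, and only the top row of the second picture is a candidate for exchanging its coeff_token codewords.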

The entropy coding with the CABAC method takes place in several steps:

1. Binarization of the symbol to be coded, similar to variable length coding (VLC). The binary symbols are called "bins".

2. Selection of a context model based on the type of symbol to be coded, e.g. motion vector or coefficient, for each bit ("bin") of the binarized symbol to be encoded.

3. Coding of the "bin" on the basis of the chosen context model, i.e. arithmetic coding of the bit on the basis of the probabilities for "0" and "1". The probabilities result from the choice of the context model.

4. Update of the context model used in the encoding, i.e. tracking of the probabilities. For example, if a "1" is encoded, the next encoding of a "bin" with this context model assumes a higher probability of a "1". The same applies analogously when a "0" is encoded.

5. Repetition of steps 3 and 4 until all "bins" of the symbol have been encoded. If the "bins" of a symbol are assigned to different contexts, step 2 must also be repeated.

Due to the properties of arithmetic coding, one bit of the output stream may contain information for several "bins" or input symbols to be coded, and the updating of the context models results in a dependency of the currently coded symbol on the previous symbols within the same slice. At slice boundaries, the context models are set to an initial value. If the macroblocks of multiple video streams are mixed, the contexts no longer match after mixing, and the new video stream can no longer be decoded. In order to allow correct decoding, complete decoding of the CABAC symbols with subsequent re-encoding is necessary. With the re-encoding, the updates of the context models are recalculated. Only for the macroblocks at the beginning of a slice is no re-encoding carried out until the input stream changes, because the contexts are initialized at slice boundaries and are therefore correct. Re-encoding therefore starts after the first change of input video stream.
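The dependency created by steps 3 and 4 can be made visible with a toy model. The sketch below is not the CABAC state machine of the standard: it assumes a simple counting model for a single context and measures the ideal arithmetic-coding cost in bits. The class `AdaptiveBitModel` and the function `ideal_cost` are hypothetical names introduced only for this illustration.

```python
import math


class AdaptiveBitModel:
    """Toy adaptive probability model for one context (NOT real CABAC)."""

    def __init__(self):
        # Initial value, comparable to the reset at a slice boundary.
        self.counts = [1, 1]

    def cost_and_update(self, bit):
        """Return the ideal cost of `bit` in bits, then adapt the model."""
        p = self.counts[bit] / sum(self.counts)   # step 3: use probability
        self.counts[bit] += 1                     # step 4: update the model
        return -math.log2(p)


def ideal_cost(bins, model=None):
    """Sum the ideal coding cost of a sequence of bins under a model."""
    if model is None:
        model = AdaptiveBitModel()
    return sum(model.cost_and_update(b) for b in bins)


macroblock_bins = [1, 1, 1, 1]

# Case 1: the macroblock is the first one after the context reset.
cost_fresh = ideal_cost(macroblock_bins)

# Case 2: the same macroblock follows data from another stream.
warm = AdaptiveBitModel()
ideal_cost([0, 0, 0, 0], warm)
cost_after_other_stream = ideal_cost(macroblock_bins, warm)

print(round(cost_fresh, 3), round(cost_after_other_stream, 3))
```

The same bins cost a different number of bits depending on what was coded before them, which is why bits coded for one context history cannot simply be copied behind macroblocks of another stream.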

Symbols within the data stream that are coded with VLC, for example the macroblock type, etc., or with a fixed word length are taken directly into the new data stream without entropy decoding and re-encoding, since for these there is no dependence on previous macroblocks or, in general, on other coded symbols. In H.264 / AVC, so-called exponential Golomb codes are used as VLC. Copying these symbols directly is possible because the CABAC codeword is terminated before the respective fixed-length or variable-length codeword.
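Why such codewords can be copied unchanged is easiest to see from the code itself. The following minimal sketch implements the unsigned exponential Golomb code on bit strings; the function names `ue_encode` and `ue_decode` are chosen here for illustration and are not taken from the original text.

```python
def ue_encode(code_num):
    """Unsigned Exp-Golomb codeword for a non-negative integer (bit string)."""
    if code_num < 0:
        raise ValueError("code_num must be non-negative")
    binary = bin(code_num + 1)[2:]            # binary form of code_num + 1
    return "0" * (len(binary) - 1) + binary   # prefix of leading zeros


def ue_decode(bits, pos=0):
    """Decode one codeword starting at `pos`; return (value, next position)."""
    zeros = 0
    while bits[pos + zeros] == "0":           # count the leading zeros
        zeros += 1
    end = pos + 2 * zeros + 1
    return int(bits[pos + zeros:end], 2) - 1, end


for n in range(5):
    print(n, ue_encode(n))
```

The codeword depends only on the value being coded and on no previously coded symbol, so moving it to another position in the mixed stream does not change how it is decoded.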

When mixing the video streams, in addition to adapting the contexts of the entropy coding, the references in the intra- and inter-prediction must also be adjusted if necessary, or their correctness must already be ensured during encoding. One way to achieve a correct intra-prediction is described in the prior application of the same applicant entitled "Accurate Intra Prediction for Macroblock-level mixing of video streams" by the inventors Peter Amon and Norbert Oertel. The motion vectors contained in H.264 / AVC carry the direction and magnitude of motion, so that the movements between two images in the video stream can be detected and calculated. To avoid false inter-prediction, the motion vectors should not point outside the picture, as described in the co-pending application entitled "Mixing Video Streams" by the inventors Peter Amon and Andreas Hutter.
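A simplified check of this motion vector constraint might look as follows. The sketch rests on assumptions that go beyond the text: motion vectors are taken in quarter-sample units for a full 16x16 macroblock, the additional samples needed by the sub-sample interpolation filter are ignored, and the function name `mv_stays_inside` is hypothetical.

```python
def mv_stays_inside(mb_x, mb_y, mv_x, mv_y, width, height, block=16):
    """Check that the block referenced by a motion vector lies completely
    inside the picture of the input stream.

    mb_x, mb_y    : macroblock position in macroblock units
    mv_x, mv_y    : motion vector in quarter-sample units (assumption)
    width, height : picture size in luma samples
    Simplification: interpolation filter margins are not considered.
    """
    left = mb_x * block * 4 + mv_x      # referenced area in quarter samples
    top = mb_y * block * 4 + mv_y
    right = left + block * 4
    bottom = top + block * 4
    return left >= 0 and top >= 0 and right <= width * 4 and bottom <= height * 4


# QCIF picture, 176x144 luma samples (11x9 macroblocks).
print(mv_stays_inside(0, 0, -4, 0, 176, 144))   # points one sample to the left
print(mv_stays_inside(5, 4, 8, -8, 176, 144))   # stays inside the picture
```

After mixing, a vector that leaves the original picture would reference samples of a neighbouring participant's picture instead of the padded border the encoder assumed, which is the false inter-prediction the text refers to.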

The inventive method can also be carried out particularly advantageously for video streams in which dependencies exist between the individual codewords in the entropy coding, such as in the case of the H.264 bit streams. In addition, H.264 bitstreams are generated, which can also be processed by H.264 decoders that do not support slice group decoding.

An embodiment of the invention is shown by way of example in the figures.

Fig. 1 shows the mixing of four input video streams by means of an MCU

Fig. 2 shows the mixing of two H.264 video streams at the macroblock level

Fig. 3 shows the VLC table defined in H.264 / AVC for the (de)coding of coeff_token

Fig. 4 shows the complete transcoding of two H.264 video streams

Fig. 5 shows the mixing of two H.264 video streams at the slice level

Fig. 6 shows the mixing of two H.261 video streams at the macroblock level

Fig. 1 shows the mixing, by means of an MCU, of four input video streams coded according to the H.264 / AVC standard. There are four different H.264 input video streams IS1, IS2, IS3 and IS4 from four different participants in a videoconference.

The different video contents A, B, C and D of the four different input video streams IS1, IS2, IS3 and IS4 are mixed together so that all the video contents A, B, C and D are simultaneously contained in the output video stream OS.
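How the macroblocks of the inputs could be rearranged into one output picture can be sketched with a simple address mapping. The sketch assumes, purely for illustration, a 2x2 grid with A at the top left, B at the top right, C at the bottom left and D at the bottom right, equally sized input pictures and macroblock addresses in raster-scan order; the function name `output_mb_address` is hypothetical.

```python
def output_mb_address(stream, mb_addr, in_w_mb, in_h_mb, cols=2):
    """Map a macroblock address of input stream `stream` (0 = A, 1 = B, ...)
    to its raster-scan address in the mixed picture.

    Assumes equally sized inputs arranged in a grid with `cols` columns.
    """
    tile_x, tile_y = stream % cols, stream // cols
    x = tile_x * in_w_mb + mb_addr % in_w_mb    # column in the mixed picture
    y = tile_y * in_h_mb + mb_addr // in_w_mb   # row in the mixed picture
    return y * (cols * in_w_mb) + x


# Four QCIF inputs (11x9 macroblocks each) give a 22x18 macroblock output.
print(output_mb_address(0, 0, 11, 9))    # first macroblock of A
print(output_mb_address(1, 0, 11, 9))    # first macroblock of B
print(output_mb_address(3, 98, 11, 9))   # last macroblock of D
```

Because the output is written in raster-scan order, rows of A and B alternate in the mixed stream, so the input stream changes after every eleven macroblocks in this arrangement.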

By way of example, the video contents A, B, C and D are arranged horizontally and vertically next to one another, so that the participants of a video conference can be seen on the screen at the same time. According to the H.264 / AVC standard, the input video streams IS1, IS2, IS3 and IS4 are coded with an entropy coding method. The input video streams IS1, IS2, IS3 and IS4 are therefore decoded by the respective entropy decoders ED1, ED2, ED3 and ED4, so that the macroblocks of the video streams can be rearranged and mixed together in the multipoint control unit MCU. The mixed macroblocks are encoded according to the H.264 / AVC standard in the entropy encoder EE into a new dedicated H.264 output video stream OS. The video stream OS is then sent to all participants.

Fig. 2 shows the mixing at the macroblock level of two video streams coded according to the H.264 / AVC standard.

First, the video streams IS1 and IS2 received from the subscribers must be decoded in the respective entropy decoders ED1 and ED2. This is done by resolving the dependencies between the codewords by full or partial entropy decoding. Thereafter, the macroblocks MB1 and MB2 of the input video streams IS1 and IS2 are rearranged and merged into a new macroblock MB' containing all the data of the individual macroblocks MB1 and MB2. Finally, a new output video stream OS is encoded in the entropy encoder EE and sent to all participants, so that all participants can see each other. This is done by reversing the full or partial entropy decoding, after mixing the video streams, by means of full or partial entropy encoding.

Fig. 3 shows the VLC table defined in the H.264 / AVC standard for the (de)coding of coeff_token.

In the CAVLC method, there are four choices of VLC table for coding coeff_token. The selection is made by the value nC, which is calculated from the number of coefficients in the block above (nU) and in the block to the left (nL) of the currently coded block. If the upper block and the left block are both present, i.e. the two blocks are in the same coded slice, the parameter nC is calculated as follows: nC = (nU + nL) / 2. If only the upper block is present, nC = nU; if only the left block is present, nC = nL; and if neither block is present, nC = 0.
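The calculation of nC and the resulting table choice can be sketched as follows. Two points in the sketch are assumptions that should be checked against the standard rather than taken from the text: the average is computed with integer rounding as (nU + nL + 1) >> 1, and the table ranges 0 to 1, 2 to 3, 4 to 7 and 8 or more are used for the four tables. The function names `compute_nc` and `coeff_token_table` are hypothetical.

```python
def compute_nc(n_upper=None, n_left=None):
    """Compute nC from the neighbouring coefficient counts.

    None marks a neighbour that is not available (outside the picture or in
    another slice). Assumption: rounded integer average (nU + nL + 1) >> 1.
    """
    if n_upper is not None and n_left is not None:
        return (n_upper + n_left + 1) >> 1
    if n_upper is not None:
        return n_upper
    if n_left is not None:
        return n_left
    return 0


def coeff_token_table(nc):
    """Index (0..3) of the coeff_token table selected by nC.

    Assumed ranges: nC < 2, 2 <= nC < 4, 4 <= nC < 8, nC >= 8.
    """
    if nc < 2:
        return 0
    if nc < 4:
        return 1
    if nc < 8:
        return 2
    return 3


# A block at the upper picture edge before mixing: only the left neighbour.
before = coeff_token_table(compute_nc(n_upper=None, n_left=1))
# After mixing, a block of another stream with 5 coefficients lies above it.
after = coeff_token_table(compute_nc(n_upper=5, n_left=1))
print(before, after)
```

In this example the selected table changes through the mixing, so the coeff_token codeword of that block would have to be exchanged for the corresponding codeword of the other table.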

The parameter nC selects the corresponding VLC table depending on the number of coded coefficients in the adjacent blocks, i.e. context-adaptively.

Fig. 4 shows the complete transcoding of video streams encoded according to the H.264 / AVC standard.

The two H.264 input video streams IS1 and IS2 from two subscribers are each decoded at the frame level by an H.264 video decoder VD1 and VD2, respectively. After the video streams IS1 and IS2 have been decoded into the respective video frames VF1 and VF2, the two video frames VF1 and VF2 are mixed and merged to form a new video frame VF' containing all the data of the individual video frames VF1 and VF2. Finally, a new H.264 output video stream OS is encoded in the H.264 video encoder VE and sent to all participants.

This process is also referred to as pixel domain mixing or full transcoding, in which e.g. a format conversion, mixing of the image data and generation of a conference image are performed.

Figure 5 shows the mixing of two video streams at the slice level encoded according to the H.264 / AVC standard.

In the two H.264 input video streams IS1 and IS2, according to the H.264 standard, the macroblocks are assigned to the slices without additional aids. The mixing of the video streams IS1 and IS2 takes place through a flexible allocation of the macroblocks to slice groups. For example, slice groups SG1 and SG2 are defined for the two input video streams IS1 and IS2, respectively, and are then combined by the video mixer VM into an H.264 output video stream OS containing the data of the two slice groups SG1 and SG2.

Fig. 6 shows the mixing of two video streams at the macroblock level coded according to the H.261 standard.

The two H.261 input video streams IS1 and IS2 of two subscribers are each present as coded macroblocks MB1 and MB2. By reordering the macroblocks MB1 and MB2, the two input video streams IS1 and IS2 are merged at the macroblock level into a new coded macroblock MB' containing all the data of the individual macroblocks MB1 and MB2, which is sent to all subscribers as a dedicated H.261 output video stream OS.

LIST OF REFERENCE NUMBERS

A - D video content

ED1 - ED4 entropy decoder

EE entropy encoder

IS1 - IS4 input video streams

MB1, MB2 coded macroblocks

MB' mix of coded macroblocks

MCU Multipoint Control Unit

OS output video stream

SG1, SG2 slice group

VD1, VD2 video decoder

VE video encoder

VF1, VF2 video frame

VF' mix of video frames

VM video mixer

Claims

1. A method of mixing video streams in a video mixing device, in which a plurality of input video streams from different subscribers, which are encoded with codewords for macroblocks and in which the codewords are interdependent, are combined into an output video stream,

characterized in that

the input video streams are entropy-decoded at least to the extent that the interdependencies between the codewords are resolved, the macroblocks are reordered and mixed together, and the mixed macroblocks are entropy-encoded into a new dedicated video stream.

2. The method according to claim 1, characterized in that

the input video streams and the output video stream are encoded according to the H.264 / AVC standard.

3. The method according to claim 1, characterized in that

for the entropy coding, a Context-based Adaptive Variable Length Coding (CAVLC) method is applied.

4. The method according to claim 1, characterized in that

for the entropy coding, a Context-based Adaptive Binary Arithmetic Coding (CABAC) method is applied.

5. The method according to claim 3, characterized in that

for the entropy decoding of mixed macroblocks, the codewords are determined by selecting a VLC table according to H.264 / AVC.

6. The method according to claim 4, characterized in that

for the entropy decoding of mixed macroblocks the CABAC symbols are completely decoded.

7. The method according to any one of the preceding claims, characterized in that

the motion vectors contained in H.264 / AVC do not point outside the image, in order to avoid false inter-prediction.

8. The method according to any one of the preceding claims, characterized in that

no slice groups are considered for the entropy decoding of the H.264 / AVC video streams.

9. A video mixing unit for transcoding video streams, to which a number of subscriber endpoints are connected, each using an encoder and decoder according to the H.264 / AVC standard, characterized in that

the video mixer unit is equipped on the input side with a number of entropy decoders and on the output side with an entropy encoder and is set up to carry out one of the methods according to claims 1 to 8.