CN102257818B - Sharing of motion vector in 3d video coding - Google Patents
Sharing of motion vector in 3D video coding
- Publication number
- CN102257818B (application CN200980151318.7A)
- Authority
- CN
- China
- Prior art keywords
- depth map
- view
- picture
- motion vector
- texture
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N13/00—Stereoscopic video systems; Multi-view video systems; Details thereof
- H04N13/10—Processing, recording or transmission of stereoscopic or multi-view image signals
- H04N13/106—Processing image signals
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N13/00—Stereoscopic video systems; Multi-view video systems; Details thereof
- H04N13/10—Processing, recording or transmission of stereoscopic or multi-view image signals
- H04N13/106—Processing image signals
- H04N13/128—Adjusting depth or disparity
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N13/00—Stereoscopic video systems; Multi-view video systems; Details thereof
- H04N13/10—Processing, recording or transmission of stereoscopic or multi-view image signals
- H04N13/106—Processing image signals
- H04N13/161—Encoding, multiplexing or demultiplexing different image signal components
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/102—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
- H04N19/103—Selection of coding mode or of prediction mode
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/189—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the adaptation method, adaptation tool or adaptation type used for the adaptive coding
- H04N19/196—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the adaptation method, adaptation tool or adaptation type used for the adaptive coding being specially adapted for the computation of encoding parameters, e.g. by averaging previously computed encoding parameters
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/30—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using hierarchical techniques, e.g. scalability
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/50—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
- H04N19/503—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
- H04N19/51—Motion estimation or motion compensation
- H04N19/513—Processing of motion vectors
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/50—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
- H04N19/503—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
- H04N19/51—Motion estimation or motion compensation
- H04N19/513—Processing of motion vectors
- H04N19/517—Processing of motion vectors by encoding
- H04N19/52—Processing of motion vectors by encoding by predictive encoding
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/50—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
- H04N19/597—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding specially adapted for multi-view video sequence encoding
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/60—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding
- H04N19/61—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding in combination with predictive coding
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/70—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by syntax aspects related to video coding, e.g. related to compression standards
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N13/00—Stereoscopic video systems; Multi-view video systems; Details thereof
- H04N2013/0074—Stereoscopic image analysis
- H04N2013/0081—Depth or disparity estimation from stereoscopic image signals
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N13/00—Stereoscopic video systems; Multi-view video systems; Details thereof
- H04N2013/0074—Stereoscopic image analysis
- H04N2013/0085—Motion estimation from stereoscopic image signals
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N2213/00—Details of stereoscopic systems
- H04N2213/003—Aspects relating to the "2D+depth" image format
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Computing Systems (AREA)
- Theoretical Computer Science (AREA)
- Compression Or Coding Systems Of Tv Signals (AREA)
Abstract
Joint coding of depth map video and texture video is provided, where a motion vector for a texture video is predicted from the respective motion vector of a depth map video, or vice versa. For scalable video coding, the depth map video is coded as a base layer and the texture video is coded as an enhancement layer (or layers). Inter-layer motion prediction predicts motion in the texture video from motion in the depth map video. With more than one view in a bitstream (for multi-view coding), the depth map videos are considered monochromatic camera views and are predicted from each other. If joint multi-view video model coding tools are allowed, inter-view motion skip is used to predict motion vectors of texture images from depth map images. Furthermore, scalable multi-view coding is utilized, where inter-view prediction is applied between views in the same dependency layer, and inter-layer (motion) prediction is applied between layers in the same view.
Description
Technical field
Various embodiments relate generally to video coding of three-dimensional (3D) video content in which depth map images are present. More particularly, various embodiments relate to the joint optimization of multi-view video and depth to support high coding efficiency.
Background
This section is intended to provide a background or context for the invention that is recited in the claims. The description herein may include concepts that could be pursued, but are not necessarily ones that have been previously conceived or pursued. Therefore, unless otherwise indicated herein, what is described in this section is not prior art to the description and claims in this application, and is not admitted to be prior art by inclusion in this section.
Video coding standards include ITU-T H.261, ISO/IEC Moving Picture Experts Group (MPEG)-1 Visual, ITU-T H.262 or ISO/IEC MPEG-2 Video, ITU-T H.263, ISO/IEC MPEG-4 Visual, and ITU-T H.264 (also known as ISO/IEC MPEG-4 Advanced Video Coding (AVC)). In addition, there are efforts underway to develop new video coding standards. One such standard is the Scalable Video Coding (SVC) standard, which is a scalable extension to H.264/AVC. Another such standard, recently completed, is the Multi-view Video Coding (MVC) standard, which is becoming another extension to H.264/AVC.
In multi-view video coding, video sequences output from different cameras, each corresponding to a different view, are encoded into one bitstream. After decoding, to display a certain view, the decoded pictures belonging to that view are reconstructed and displayed. It is also possible for more than one view to be reconstructed and displayed.
Multi-view video coding has a wide variety of applications, including free-viewpoint video/television, 3D TV, and surveillance applications. Currently, the Joint Video Team (JVT) of the ITU-T Video Coding Experts Group and the ISO/IEC Moving Picture Experts Group (MPEG) is working to develop the MVC standard, which is becoming an extension of H.264/AVC. These standards are referred to herein as MVC and AVC, respectively. The latest working draft of MVC is described in JVT-AB204 ("Joint Draft Multi-view Video Coding," 28th JVT Meeting, Hannover, Germany, July 2008), available at ftp3.itu.ch/av-arch/jvt-site/2008_07_Hannover/JVT-AB204.zip.
In addition to the features defined in the working draft of MVC, other potential features, particularly those concerning coding tools, are described in the Joint Multiview Video Model (JMVM). The latest version of the JMVM is described in JVT-AA207 ("Joint Multiview Video Model (JMVM) 8.0," 24th JVT Meeting, Geneva, Switzerland, April 2008), available at ftp3.itu.ch/av-arch/jvt-site/2008_04_Geneva/JVT-AA207.zip.
Fig. 1 shows a representation of the conventional MVC decoding order (i.e., bitstream order). This decoding order arrangement is referred to as time-first coding. Each access unit is defined to contain the coded pictures of all views (e.g., S0, S1, S2...) for one output time instance (e.g., T0, T1, T2...). It should be noted that the decoding order of access units may not be identical to the output or display order. Fig. 2 illustrates a conventional MVC prediction structure for multi-view video coding (including both inter-picture prediction within each view and inter-view prediction). In Fig. 2, predictions are indicated by arrows, where each pointed-to object uses the corresponding point-from object as a prediction reference.
An anchor picture is a coded picture in which all slices reference only slices with the same temporal index, i.e., only slices in other views and not slices in earlier pictures of the current view. An anchor picture is signaled by setting anchor_pic_flag to 1. After decoding an anchor picture, all subsequent coded pictures in display order can be decoded without inter-prediction from any picture decoded prior to the anchor picture. If a picture in one view is an anchor picture, then all pictures with the same temporal index in the other views are also anchor pictures. Consequently, the decoding of any view can be started from a temporal index that corresponds to an anchor picture. Pictures with anchor_pic_flag equal to 0 are called non-anchor pictures.
In the Joint Draft of MVC, view dependencies are specified in the sequence parameter set (SPS) MVC extension. The dependencies for anchor pictures and non-anchor pictures are specified independently. Therefore, anchor pictures and non-anchor pictures can have different view dependencies. However, for the set of pictures that refer to the same SPS, all anchor pictures must have the same view dependency, and all non-anchor pictures must have the same view dependency. In the SPS MVC extension, dependent views can be signaled separately for the views used as reference pictures in RefPicList0 and RefPicList1. Within an access unit, when view component A directly depends on view component B, this means that view component A uses view component B for inter-view prediction. If view component B directly depends on view component C, and view component A does not directly depend on view component C, then view component A indirectly depends on view component C.
The Joint Draft of MVC also includes an inter_view_flag in the network abstraction layer (NAL) unit header, which indicates whether the current picture is used for inter-view prediction of pictures in other views. In this draft, inter-view prediction is supported by texture prediction only, i.e., only the reconstructed sample values can be used for inter-view prediction, and only reconstructed pictures with the same output time instance as the current picture are used for inter-view prediction. After the first byte of a NAL unit (NALU), a NAL unit header extension (3 bytes) follows. The NAL unit header extension includes syntax elements that describe the properties of the NAL unit in the context of MVC.
As a coding tool in JMVM, motion skip predicts the macroblock (MB) modes and motion vectors from inter-view reference pictures, and it is applicable only to non-anchor pictures. During encoding, a global disparity motion vector (GDMV) is estimated when an anchor picture is encoded, and the GDMV for a non-anchor picture is then derived such that it is a weighted average of the GDMVs of the two neighboring anchor pictures. A GDMV has 16-pel precision, i.e., for any MB in the current picture (i.e., the picture being encoded or decoded), the corresponding region shifted in the inter-view reference picture according to the GDMV covers exactly one MB in the inter-view reference picture.
Based on this GDMV, the GDMV is scaled for each non-anchor picture. For each MB, if the MB utilizes motion skip, a local offset of the disparity motion vector is signaled. At the decoder, if motion skip mode is used, the final disparity motion vector is used to locate the motion vector in the inter-view picture, and the motion vectors are copied from the inter-view picture.
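The GDMV derivation for a non-anchor picture described above can be sketched as follows. This is an illustrative sketch only; the weighted-average and 16-pel-precision ideas come from the text, while the function name, the temporal-distance weighting, and the exact rounding are assumptions.

```python
# Hypothetical sketch of the JMVM global disparity motion vector (GDMV)
# derivation for a non-anchor picture: a weighted average of the GDMVs
# of the two neighboring anchor pictures, snapped to 16-pel precision so
# that a shifted region covers exactly one MB in the inter-view reference.

def gdmv_for_non_anchor(t, t_prev, t_next, gdmv_prev, gdmv_next):
    w = (t - t_prev) / (t_next - t_prev)          # assumed temporal weight
    gx = (1 - w) * gdmv_prev[0] + w * gdmv_next[0]
    gy = (1 - w) * gdmv_prev[1] + w * gdmv_next[1]
    # round each component to the nearest multiple of 16 pixels
    return (int(round(gx / 16.0)) * 16, int(round(gy / 16.0)) * 16)

# Non-anchor picture halfway between anchors at t=0 and t=4:
print(gdmv_for_non_anchor(2, 0, 4, (16, 0), (48, 32)))  # -> (32, 16)
```

A per-MB local offset, when signaled, would then be added to this global vector to obtain the final disparity motion vector used at the decoder.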
3D video has attracted significant attention recently. Furthermore, with the progress in acquisition and display technologies, and with a variety of application opportunities, 3D video is becoming a reality in the consumer domain. Given the perceived maturity of capture and display technologies, and with the help of MVC techniques, many different envisioned 3D video applications are becoming more feasible. It should be noted that 3D video applications can typically be grouped into three categories: free-viewpoint video; 3D TV (video); and immersive videoconferencing. The requirements of these applications can be quite different, and realizing each type of 3D video application comes with its own challenges.
When delivering 3D content based on 2D images, limited bandwidth becomes a problem, and a powerful compressor is therefore needed to encode the 3D content using only a reasonable number of views. At the client device, however, a user may desire the experience of watching 3D content from any angle (e.g., with view navigation or auto-stereoscopic video). It is therefore desirable for the decoder to render as many views as possible, and to do so as continuously as possible. View synthesis can address this bandwidth limitation by transmitting a reasonable number of views while interpolating the other views at the renderer. Within the MPEG video subgroup, an exploration experiment in 3D video coding (3DV EE) is being carried out to study similar application scenarios. It is likewise held that having a depth map video for each view is potentially helpful for view synthesis.
In addition, MPEG has also specified the form for adhering to depth map for the conventional video flowing in MPEG-C part 3.At " Text of ISO/IEC FDIS 23002-3Representation of Auxiliary Video and Supplemental the Information " (N8768 of ISO/IEC JTC 1/SC 29/WG 11, Marrakech, Morocco, in January, 2007) in this specification has been described.
In MPEG-C Part 3, the so-called auxiliary video can be a depth map or a disparity map. A texture video is typically composed of three components, namely one luma component Y and two chroma components U and V, whereas a depth map has only one component, representing the distance between an object pixel and the camera. Texture video is typically represented in YUV 4:2:0, 4:2:2, or 4:4:4 format, where one chroma sample (U or V) is coded for every 4, 2, or 1 luma samples, respectively. A depth map is regarded as a luma-only video in YUV 4:0:0 format. A depth map can be inter-coded similarly to an inter-coded luma-only texture picture, and a coded depth map can therefore have motion vectors. When representing a depth map, there is flexibility in the number of bits used to represent each depth value. The resolution of the depth map can also differ from that of the associated picture, e.g., 1/4 of its width and 1/2 of its height.
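The size relationship described above can be illustrated with a minimal sketch. This is our own illustration, not part of the specification; the function name and the 1/4-width, 1/2-height ratios (taken from the example in the text) are assumptions.

```python
# Illustrative sketch of carrying a depth map as a luma-only (YUV 4:0:0)
# plane at reduced resolution relative to the associated texture picture.
# Unlike texture video, there are no chroma planes to allocate.

def make_depth_plane(tex_w, tex_h, w_ratio=4, h_ratio=2):
    w, h = tex_w // w_ratio, tex_h // h_ratio
    return [[0] * w for _ in range(h)]  # one component per pixel

depth = make_depth_plane(1024, 768)
print(len(depth[0]), len(depth))  # -> 256 384
```

Each entry would hold a quantized depth value; the bit depth per value is an application choice, as noted above.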
It should be noted that, although a depth map video can be coded as, e.g., monochrome video (4:0:0), it is an application issue which video codec is used. For example, a depth map can be coded as an H.264/AVC bitstream with only a luma component. Alternatively, a depth map can be coded as an auxiliary video as defined in H.264/AVC. In H.264/AVC, auxiliary pictures are coded independently of the primary pictures, so there is no prediction between the primary coded pictures for the sample values and the auxiliary coded pictures for the depth values.
When depth information for each picture of a view (i.e., a depth map video) is provided, view synthesis for 3D video rendering is improved. Because the depth map videos may consume a large portion of the overall bandwidth of the entire bitstream (particularly when each view is associated with a depth map), the coding of the depth map videos should be efficient enough to save bandwidth.
Conventionally, as discussed above, depth map videos (if present) are coded independently. However, there can be correlation between a texture video and its associated depth map. For example, the motion vectors in the coded depth map and the motion vectors in the coded texture video may be similar. It can be expected that sample prediction between depth map and texture is inefficient and nearly useless, but that motion prediction between depth map images and texture images is beneficial.
For multi-view video content, MVC is the state-of-the-art coding standard. Based on the MVC standard, it is not possible to code a depth map video and a texture video in one MVC bitstream while at the same time supporting motion prediction between depth map images and texture images.
Summary of the invention
Various embodiments provide joint coding of depth map video and texture video. According to various embodiments, a motion vector for the texture video is predicted from the respective motion vector of the depth map video, or vice versa. When only one view is present in the bitstream, an SVC-compliant scenario is considered. In this scenario, the depth map video is coded as the base layer and the texture video is coded as an enhancement layer. Additionally, inter-layer motion prediction can be used to predict the motion in the texture video from the motion in the depth map video. Alternatively, the texture video is coded as the base layer, the depth map video is coded as an enhancement layer, and the motion in the depth map is predicted from the motion in the texture video. When more than one view is present in the bitstream, inter-layer prediction between a texture video and the corresponding depth map video can be applied to each view. In another scenario with multiple views, the depth map videos are regarded as monochrome camera views and can be predicted from each other. If JMVM coding tools are allowed, inter-view motion skip can be used to predict the motion vectors of texture images from depth map images. In yet another scenario, scalable multi-view video coding (SMVC) is utilized, wherein inter-view prediction is applied between views in the same dependency layer, and inter-layer (motion) prediction is applied between layers in the same view.
These and other advantages and features of the invention, together with the organization and manner of operation thereof, will become apparent from the following detailed description when taken in conjunction with the accompanying drawings, wherein like elements have like numerals throughout the several drawings described below.
Brief description of the drawings
Embodiments of the invention are described by referring to the attached drawings, in which:

Fig. 1 shows a conventional MVC decoding order;

Fig. 2 shows an example of a conventional MVC temporal and inter-view prediction structure;

Fig. 3 shows a block diagram of example components and an example process flow for a 3D video system;

Fig. 4 shows a flow chart of an example process performed for encoding a media stream in accordance with various embodiments;

Figs. 5a-5d are representations of different MVC scenarios for 3D content with depth and texture video coding/decoding, in accordance with various embodiments;

Fig. 6 is a representation of a generic multimedia communication system within which various embodiments of the present invention may be implemented;

Fig. 7 is a perspective view of an electronic device that can be used in conjunction with the implementation of various embodiments of the present invention; and

Fig. 8 is a schematic representation of the circuitry which may be included in the electronic device of Fig. 7.
Detailed description of embodiments
A 3D video system 300 is illustrated in Fig. 3. At the capturer 310, 3D video content is captured as multiple video sequences (N views). The capturer 310 may also capture depth for each view or a subset of the views, but alternatively or additionally, depth can be estimated in the pre-processor 320. The pre-processor 320 is responsible for geometry rectification and color calibration. In addition, the pre-processor 320 can perform depth estimation so that depth map images are associated with the video sequences. At the encoder 330, the video sequences are encoded into a bitstream, e.g., by an MVC encoder. If the content is provided together with, or is associated with, depth map images/pictures, they can be coded as well, e.g., as the auxiliary pictures supported in H.264/AVC. The compressed 3D representation, i.e., the bitstream, is transmitted through a particular channel or stored in a storage device 340. If the multi-view content is provided together with depth, the depth needs to be coded.
When a client 350 receives the bitstream from the channel or the storage device 340, a decoder 352 implemented in the client 350 decodes the N views and the depth map images (if present). The decoder 352 may also decode the depth map images and only a subset of the N views, depending on which synthesized views are needed for display. A view synthesizer 354 can generate more views (referred to as novel or virtual views), based on the N views and the depth map images, using view generation algorithms. Additionally, the view synthesizer 354 can interact with the display 356, which provides, e.g., a human interface device such as a remote controller. It should be noted that the view synthesizer 354 can be integrated into the decoder 352, particularly for auto-stereoscopic applications with a small viewing angle.
Various embodiments support the joint coding of depth map video and texture video. Fig. 4 shows a flow chart of an example process, performed in accordance with various embodiments, for encoding a media stream comprising a first view, which includes a first depth picture, a second depth picture, a first sample picture, and a second sample picture. At 400, the second depth picture is predicted from the first depth picture using a first motion vector. At 410, the second sample picture is predicted from the first sample picture using a second motion vector. At 420, the first motion vector and the second motion vector are coded, e.g., jointly coded. It should be noted that joint coding can comprise predicting the second motion vector from the first motion vector, or vice versa, with only the difference of the motion vectors being coded into the bitstream.
When only one view is present in the bitstream, an SVC-compliant scenario is considered in which the depth map video is coded as the base layer and the texture video is coded as an enhancement layer. Additionally, the SVC-compliant scenario uses inter-layer motion prediction to predict the motion in the texture video from the motion in the depth map video. In another embodiment, an SVC-compliant scenario is considered in which the texture video is coded as the base layer, the depth map video is coded as an enhancement layer, and inter-layer motion prediction is used to predict the motion in the depth map. When more than one view is present in the bitstream, inter-layer prediction between a texture video and the corresponding depth map can be applied to each view. In a scenario with multiple views, the depth map videos are regarded as monochrome camera views and can be predicted from each other. If JMVM coding tools are allowed, inter-view motion skip can be used to predict the motion vectors of texture images from depth map images. In another scenario, SMVC is utilized, wherein inter-view prediction is applied between views in the same dependency layer, and inter-layer (motion) prediction is applied between layers in the same view.
The SVC specification is described in ITU-T Recommendation H.264 ("Advanced video coding for generic audiovisual services," November 2007), available from http://www.itu.int/rec/T-REC-H.264/en.
In SVC, an output flag, "output_flag," is present in the NAL unit header SVC extension to specify whether the decoded picture is to be output. For the video coding layer (VCL) NAL units belonging to the AVC-compatible base layer, the output_flag is included in the associated prefix NAL unit.
SVC has also introduced inter-layer prediction for spatial and SNR scalabilities based on texture, residual, and motion. This is a major novelty of SVC compared with other scalable solutions, which utilize only inter-layer texture prediction. The inter-layer prediction provides macroblock (MB)-level adaptation, i.e., each MB can perform rate-distortion optimized (RDO) mode selection between inter-layer prediction and normal intra-layer prediction in the enhancement layer. Spatial scalability in SVC has been generalized to any resolution ratio between two layers, making it possible to support, e.g., scalability from a base layer with SDTV (with a picture aspect ratio of 4:3) to an enhancement layer with HDTV (with a picture aspect ratio of 16:9). SNR scalability is realized by coding an enhancement layer that has the same resolution as its base layer for inter-layer prediction, with a finer quantization parameter (QP) applied to the prediction residual; it is described in more detail below. Currently, coarse granularity scalability (CGS) and medium granularity scalability (MGS) are supported for SNR scalability. The difference between MGS and CGS is that MGS allows switching of the transport and decoding of different MGS layers at any access unit, while CGS layers can only be switched at certain fixed points, where the picture of the layer being switched to is an IDR picture. Additionally, a more flexible reference mechanism can be used for MGS key pictures to provide a trade-off between error drift and enhancement layer coding efficiency.
In SVC, the inter-layer coding dependency hierarchy for spatial scalability and CGS is identified by the syntax element dependency_id, and the MGS dependency hierarchy is identified by the syntax element quality_id. As with temporal_id, these two syntax elements are also signaled in the NAL unit header SVC extension. At any temporal location, a picture with a larger dependency_id value can be inter-layer predicted from a picture with a smaller dependency_id value. At any temporal location and for the same dependency_id value, however, a picture with a quality_id value equal to QL uses only the picture with a quality_id value equal to QL-1 for inter-layer prediction. Quality enhancement layers with quality_id greater than 0 are MGS layers.
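The two dependency rules above can be condensed into a small predicate. This is a hedged paraphrase of the rules as stated in the text, not of the normative SVC constraint set; the function and variable names are invented for illustration.

```python
# Sketch of the SVC inter-layer prediction rules described above, for two
# pictures at the same temporal location. A layer is identified by its
# (dependency_id, quality_id) pair.

def may_inter_layer_predict(cur, ref):
    d_cur, q_cur = cur
    d_ref, q_ref = ref
    if d_ref < d_cur:            # spatial/CGS: any smaller dependency_id
        return True
    if d_ref == d_cur:           # MGS: exactly one quality step below
        return q_ref == q_cur - 1
    return False

print(may_inter_layer_predict((1, 0), (0, 0)))  # True: lower dependency layer
print(may_inter_layer_predict((1, 2), (1, 1)))  # True: QL predicts from QL-1
print(may_inter_layer_predict((1, 2), (1, 0)))  # False: skips a quality layer
```

Layers with quality_id greater than 0 in this scheme are the MGS layers mentioned above.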
If the enhancement layer has the same resolution as the base layer (i.e., it is a CGS or MGS layer), the texture, residual, or motion can be directly used for inter-layer prediction. Otherwise, the base layer is upsampled (for texture or residual) or scaled (for motion) before being used for inter-layer prediction. These inter-layer prediction methods are discussed below.
The coding mode using inter-layer texture prediction is called "IntraBL" mode in SVC. To support single-loop decoding, only those MBs for which the co-located MB in the base layer used for inter-layer prediction is constrainedly intra-coded can use this mode. A constrainedly intra-coded MB is intra-coded without referring to any samples from neighboring inter MBs. For spatial scalability, the texture is upsampled based on the resolution ratio between the two layers. In the enhancement layer, the difference between the original signal and the (possibly upsampled) base layer texture is coded, as if it were a motion compensation residual in an inter MB in single-layer coding.
If an MB is indicated to use residual prediction, the co-located MB in the base layer used for inter-layer prediction must be an inter MB, and its residual may be upsampled according to the resolution ratio. The (possibly upsampled) residual signal of the base layer is then used to predict the residual of the enhancement layer. The difference between the residual of the enhancement layer and that of the base layer is coded.
When inter-layer motion prediction is enabled for an MB or MB partition in the enhancement layer, and at the same time the reference indices of the base layer and the enhancement layer are identical, the co-located base-layer motion vector can be scaled to generate a predictor for the motion vector of the enhancement-layer MB. There is an MB type called base mode, for which one flag is sent per MB. If this flag is true and the corresponding base-layer MB is not intra, the motion vectors, partition modes, and reference indices are all derived from the base layer.
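The scaling of the co-located base-layer motion vector can be illustrated with a small sketch. The truncating division and the function name are assumptions for illustration; SVC specifies its own rounding in the normative derivation.

```python
def scale_base_layer_mv(mv_base, base_res, enh_res):
    """Scale a co-located base-layer motion vector (x, y) by the
    resolution ratio between the layers to form a predictor for the
    enhancement-layer motion vector (illustrative rounding)."""
    scaled_x = mv_base[0] * enh_res[0] // base_res[0]
    scaled_y = mv_base[1] * enh_res[1] // base_res[1]
    return (scaled_x, scaled_y)
```

For CGS or MGS layers, where the resolutions match, the vector passes through unchanged, consistent with the direct reuse of base-layer motion described earlier.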
According to an SVC-compliant depth coding scenario, a view with two videos may be considered. The first video may be a texture video comprising texture images. The second video may be a depth map video comprising depth map images. A depth map image may have the same resolution as the texture image or a lower one. In such a scenario, joint coding of the depth and texture images is supported using a combined scalable approach, which operates as follows.
For the base layer, the texture video is coded as a 4:2:0 view (or one with a higher chroma sampling format for the three colour components) and output_flag is set equal to 1. For the enhancement layer, the depth map video is coded as a 4:0:0 view (luminance component only) and output_flag is set to 0. Additionally, if the texture image has the same resolution as the depth map image, CGS or MGS is combined with the differing chroma sampling formats. Alternatively, if the texture image has a higher resolution than the depth map image, spatial scalability is combined with the differing chroma sampling formats.
Alternatively, for the base layer, the depth map image is coded as a 4:0:0 view (luminance component only) and output_flag is set equal to 0. For the enhancement layer, the texture image is coded as a 4:2:0 view (or one with a higher chroma sampling format for the three colour components) and output_flag is set to 1. Additionally, if the texture image has the same resolution as the depth map image, CGS or MGS is combined with chroma sampling scalability. Alternatively, if the texture image has a higher resolution than the depth map image, spatial scalability is combined with chroma sampling scalability.
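Both layer configurations above apply the same resolution rule, which can be summarized in a short sketch; the function name and return strings are purely illustrative.

```python
def select_scalability(texture_res, depth_res):
    """Choose the scalability type for joint texture+depth SVC coding:
    equal resolutions combine CGS/MGS with the chroma-format difference,
    while a higher-resolution texture combines spatial scalability with it."""
    if texture_res == depth_res:
        return "CGS or MGS"
    if texture_res[0] >= depth_res[0] and texture_res[1] >= depth_res[1]:
        return "spatial scalability"
    raise ValueError("depth map resolution must not exceed texture resolution")
```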
Moreover, in the preceding examples, only inter-layer motion prediction is utilized, while IntraBL and residual prediction are disabled. In addition, a supplemental enhancement information (SEI) message is indicated to signal that the SVC bitstream is coded in the manner described above. In H.264/AVC, SVC and MVC, a bitstream may contain SEI messages. SEI messages are not required for decoding the sample values of output pictures, but they assist related processes such as picture output timing, rendering, error detection, error concealment and resource reservation. Many SEI messages are specified in the H.264/AVC, SVC and MVC standards, and user data SEI messages enable organizations and companies to specify SEI messages for their own use. The H.264/AVC, SVC and MVC standards contain the syntax and semantics of the specified SEI messages, but no process for handling the messages in the decoder is defined. Consequently, encoders are required to follow the H.264/AVC, SVC and MVC standards when they create SEI messages, whereas decoders conforming to the H.264/AVC, SVC and MVC standards are not required to process SEI messages for output-order conformance. One of the reasons the syntax and semantics of SEI messages are included in the H.264/AVC, SVC and MVC standards is to allow system specifications to interpret the side information identically and hence to interoperate. It is intended that system specifications may require the use of particular SEI messages both in the encoding end and in the decoding end, and that the process for handling SEI messages in the receiver may be specified for the application in a system specification.
From the decoder perspective, if such a message is received, three implementations may be considered for obtaining both the depth map video and the texture video. In the first implementation, multi-loop decoding is performed; that is, both the base layer and the enhancement layer are fully reconstructed. According to the second implementation, different subsets of the bitstream are extracted, and two instances of a single-loop decoder are executed. That is, the bitstream subset containing only the base layer is extracted and decoded first (by an H.264/AVC, SVC or MVC decoder) to obtain the depth map video. Subsequently, the entire bitstream is decoded to obtain the texture video. According to the third implementation, a single-loop decoder is modified to reconstruct and output base-layer pictures selectively, depending on whether the base-layer pictures are needed for display or view synthesis. If a base-layer picture is not needed for display or view synthesis, conventional single-loop decoding can be used, and the coded base-layer picture serves only as a prediction reference for the corresponding enhancement-layer picture. It should be noted that according to these three implementations, depth map images are reconstructed only for those views participating in view synthesis.
Different mechanisms/schemes have been proposed for using SVC for each view of MVC content. For example, an MVC scheme has been proposed in which each view is coded using an SVC scheme (implemented as an MVC extension of the SVC standard). Features of these proposed schemes include a codec design that supports coding any view of a multi-view bitstream in a scalable fashion. A reference picture marking design and a reference picture list construction design are provided to support inter-view prediction from any dependency representation of any view earlier than the current view in view order. Additionally, for a dependency representation used for inter-view prediction, the proposed reference picture marking design and reference picture list construction design allow selectively using either the base representation or the enhanced representation of that dependency representation. An enhanced representation of a dependency representation results from decoding an MGS layer representation or a fine granularity scalability (FGS) layer representation. According to this proposed scalable multi-view video coding (SMVC) scheme, the SMVC NAL unit header includes both the fields of the SVC NAL unit header and the multi-view fields. For example, view_id and dependency_id are both present in the SMVC NAL unit header.
Therefore, for multi-view bitstreams, an MVC-compliant depth coding scenario is considered. When multiple views exist, each having a depth map video and a texture video, MVC-compliant depth coding can be applied to support inter-view prediction between the depth map videos and inter-view prediction between the texture videos. The multi-view content is coded in an MVC-compliant manner. According to a first implementation, as shown in Fig. 5a, all depth map images are coded as auxiliary pictures while inter-view prediction (indicated by the arrows) between the auxiliary pictures in different views is supported. For example, Fig. 5a shows that the depth map video of view 1 can use the depth map videos of view 0 and view 2 as prediction references.
According to a second implementation, all depth map images are coded as normal 4:0:0 views, and a new view identifier is assigned to each depth map video. As an example, and as shown in Fig. 5b, consider a scenario with three views, where the texture videos are coded as views 0 to 2 and the depth map videos are coded as views N, N+1 and N+2. In this implementation, inter-view motion prediction between the depth map video and the texture video of each view is applied. It should be noted that the motion skip applied in this implementation (indicated by the diagonal arrows between each depth map and texture video) is not the motion skip in effect in JMVM. The remaining arrows shown in Fig. 5b again denote inter-view prediction. In this example, an SEI message is introduced to enable a renderer to map the view identifier of each depth map to its associated texture video. In Fig. 5b, inter-view motion prediction is performed from the depth map video to the texture video; alternatively, inter-view motion prediction can be performed from the texture video to the depth map.
In a JMVM-compliant depth coding scenario, motion prediction can be supported between the depth map video and the texture video of the same view. This scenario is again illustrated in Fig. 5b. In this JMVM scenario, however, motion skip is in effect. As described above, the depth map videos are coded as normal 4:0:0 views, and a new view identifier is assigned to each depth map video. If a depth map video and a texture video belong to the same view, motion skip from the depth map video to the texture video is supported. In this example, the global disparity is always signalled as 0, and the local disparity samples for an MB (if motion skip mode is used) are likewise always signalled as 0. Encoder complexity can thereby be reduced. Alternatively, motion skip from the texture video to the depth map can also be supported while the global and local disparities are still signalled as 0. In this example, an SEI message is introduced to enable a renderer to map the view identifier of each depth map to its associated texture video. It should be noted that, to support this scenario, a depth map video and a texture video that are associated with each other should have the same resolution. Alternatively, the motion skip process can also be performed from the texture video to the depth map video instead of from the depth map video to the texture video.
Fig. 5c shows an SMVC depth coding scenario in which each view has two dependency layers. According to one implementation, the lower dependency layer corresponds to an MVC-compliant base layer, which is coded using H.264/AVC coding tools and inter-view prediction. The base layer of each view corresponds to the depth map video of the particular view and is coded in monochrome mode. The higher dependency layer of each view is coded using H.264/AVC coding tools, inter-view prediction from other views sharing the same dependency layer, and inter-layer motion prediction from the base layer of the same view (indicated by the arrows from the depth map videos to the texture videos). This layer is an MVC dependency layer and corresponds to the texture video of the particular view. If the texture image has a higher resolution than the depth map image, spatial scalability is combined with the differing chroma sampling formats.
Alternatively, according to another implementation, the lower dependency layer corresponds to an MVC-compliant base layer, which is coded using H.264/AVC coding tools and inter-view prediction. The base layer corresponds to the texture video of the particular view. The higher dependency layer of each view is coded in monochrome mode using H.264/AVC coding tools, inter-view prediction from other views sharing the same dependency layer, and inter-layer motion prediction from the base layer of the same view (indicated by the arrows between the depth map videos and the texture videos). This layer is an MVC dependency layer and corresponds to the depth map video of the particular view.
It should be noted that, for view synthesis, full decoding of the depth map (base layer or enhancement layer) of each view is required. For a camera view that is desired (to be displayed), full decoding of its texture video (top or base layer) is required.
Alternatively, as shown in Fig. 5d, another implementation of depth coding in SMVC can be applied, in which some views are coded with a depth map video and other views are coded without one. In this example, some views, for example view 1, have only one dependency layer (the texture video), while other views may have two dependency layers (the depth map and texture videos). Furthermore, this implementation can also utilize inter-layer prediction from the texture video to the depth map video within a view.
Referring to Fig. 4 and as described above, in a first implementation the media stream may be an SVC bitstream with a base layer comprising the first and second depth pictures, wherein the second motion vector is coded according to the first motion vector (e.g., using inter-layer motion prediction). Alternatively, in a second implementation, the media stream may be an SVC bitstream with a base layer comprising the first and second sample pictures, wherein the first motion vector is coded according to the second motion vector (e.g., using inter-layer motion prediction). Additionally, the base layer (in the first implementation) or the enhancement layer (in the second implementation) is coded as monochrome video, and the enhancement layer is coded as an MGS, CGS or spatial enhancement layer. It should be noted that the base layer (in the first implementation) or the enhancement layer (in the second implementation) is indicated as "not targeted for output", and an SEI message is coded to indicate that the media stream contains a base layer of depth map images (in the first implementation). An SEI message may also be coded to indicate that the media stream contains an enhancement layer of depth map images (in the second implementation).
Similarly, as described above, the media stream may comprise a second view containing depth pictures and sample pictures, wherein a third depth picture in the second view is coded using inter-view prediction between the second depth picture and the third depth picture. In this implementation, the depth map pictures may be coded as auxiliary pictures.
In another implementation, the disparity motion between a first coded view comprising depth map images and a second coded view comprising texture images is indicated as zero, and the inter-view motion skip mode is used for the predictive coding of the second motion vector.
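With the disparity indicated as zero, the predictive coding described above amounts to reusing the co-located motion vector from the other view as the predictor and coding only the difference. The following sketch, with hypothetical function names, illustrates the idea.

```python
def motion_skip_predictor(colocated_mv, global_disparity=(0, 0), local_disparity=(0, 0)):
    """With the global and local disparities signalled as zero, the
    predictor is simply the co-located motion vector from the other view."""
    return (colocated_mv[0] + global_disparity[0] + local_disparity[0],
            colocated_mv[1] + global_disparity[1] + local_disparity[1])

def encode_mv_difference(mv, predictor):
    # Only this difference is written to the bitstream.
    return (mv[0] - predictor[0], mv[1] - predictor[1])

def decode_mv(mv_difference, predictor):
    # The decoder forms the same predictor and adds the coded difference.
    return (mv_difference[0] + predictor[0], mv_difference[1] + predictor[1])
```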
The foregoing describes bitstream formats indicating the different scenarios. For SVC-compliant depth coding, an example of the SEI message syntax indicating that joint depth and texture video is coded as SVC is as follows.
joint_depth_coding_SVC(payloadSize){ | | |
 view_info_pre_flag | 5 | u(1) |
 if(view_info_pre_flag) | | |
  view_id | 5 | ue(v) |
} | | |
When present, this SEI message indicates that the coded SVC bitstream has one or more dependency layers in 4:0:0 format (the depth map video), and that only inter-layer motion prediction is allowed between two dependency layers having different chroma sampling formats. The semantics of the SVC joint depth coding SEI message include "view_info_pre_flag", which, when equal to 1, indicates that the view identifier of the corresponding view of this SVC bitstream is specified. "view_info_pre_flag" equal to 0 indicates that no view identifier is given. Additionally, "view_id" indicates the view identifier of the view to which the decoded video and depth map correspond.
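A minimal sketch of a reader for this SEI payload follows, assuming the standard u(n) and Exp-Golomb ue(v) descriptors of H.264/AVC. The bit reader is simplified and illustrative, not a conformant parser.

```python
class BitReader:
    """Simplified MSB-first bit reader for SEI payload sketches."""
    def __init__(self, data):
        self.bits = ''.join(format(b, '08b') for b in data)
        self.pos = 0

    def u(self, n):
        # Fixed-length unsigned field, u(n).
        value = int(self.bits[self.pos:self.pos + n], 2)
        self.pos += n
        return value

    def ue(self):
        # Exp-Golomb-coded unsigned field, ue(v).
        leading_zeros = 0
        while self.bits[self.pos] == '0':
            leading_zeros += 1
            self.pos += 1
        self.pos += 1  # skip the terminating '1' bit
        suffix = self.u(leading_zeros) if leading_zeros else 0
        return (1 << leading_zeros) - 1 + suffix

def parse_joint_depth_coding_svc(payload):
    """Parse the joint_depth_coding_SVC SEI payload sketched above."""
    reader = BitReader(payload)
    message = {'view_info_pre_flag': reader.u(1)}
    if message['view_info_pre_flag']:
        message['view_id'] = reader.ue()
    return message
```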
For the implementations associated with the MVC-compliant depth coding scenario described above, the syntax of an exemplary MVC depth view identifier mapping SEI message is as follows:
depth_id_map_mvc(payloadSize){ | | |
 num_depth_views_minus1 | 5 | ue(v) |
 for(i=0;i<=num_depth_views_minus1;i++){ | | |
  sample_view_id[i] | 5 | ue(v) |
  depth_view_id[i] | 5 | ue(v) |
 } | | |
} | | |
The semantics of the MVC depth view identifier mapping SEI message include the "num_depth_views_minus1" parameter, which indicates the number of views coded with depth map video. Additionally, the "sample_view_id[i]" parameter indicates the view_id of the texture video of the i-th view coded with depth map video, and "depth_view_id[i]" indicates the view_id of the depth map video of the i-th view coded with depth map video.
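On the renderer side, these semantics amount to a simple pairing of identifiers. The sketch below, with an illustrative name, builds the depth-to-texture view_id mapping that the message conveys.

```python
def depth_to_texture_map(sample_view_ids, depth_view_ids):
    """Pair each depth_view_id[i] with sample_view_id[i], as signalled by
    the MVC depth view identifier mapping SEI message, so a renderer can
    find the texture video associated with a decoded depth map."""
    if len(sample_view_ids) != len(depth_view_ids):
        raise ValueError('both lists must cover num_depth_views_minus1 + 1 views')
    return dict(zip(depth_view_ids, sample_view_ids))
```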
When the JMVM-compliant depth coding scenario is considered, the exemplary SEI message syntax can be identical to that described above for MVC-compliant depth coding. The semantics of the mapping SEI message, like those of the MVC depth view identifier mapping SEI message, include the following: the "num_depth_views_minus1" parameter indicating the number of views coded with depth map video; the "sample_view_id[i]" parameter indicating the view_id of the texture video of the i-th view coded with depth map video; and the "depth_view_id[i]" parameter indicating the view_id of the depth map video of the i-th view coded with depth map video. Additionally, when the message is present, the bitstream supports motion skip between the depth map video and the texture video of a view identifier pair "depth_view_id[i]" and "sample_view_id[i]". The disparity motion indicated from the view whose "view_id" equals "depth_view_id[i]" to the view whose "view_id" equals "sample_view_id[i]" is set to zero.
For SMVC depth coding, when one view has no depth and another has depth, the relevant syntax allows the base layer to have a dependency_id not equal to 0, in order to support inter-view prediction from the texture videos (having the same dependency_id value) in different views.
It should further be understood that, although the text and examples contained herein may specifically describe an encoding process, one skilled in the art would understand that the same concepts and principles also apply to the corresponding decoding process, and vice versa. For example, with regard to Fig. 4, a decoder can decode a coded media stream having the attributes described above, realized by predicting the second depth picture and the second sample picture from the first depth picture and the first sample picture using the first and second motion vectors, respectively, wherein the motion vectors have been, for example, jointly coded.
Fig. 6 is a graphical representation of a generic multimedia communication system within which different implementations may be realized. As shown in Fig. 6, a data source 600 provides a source signal in an analog, uncompressed digital, or compressed digital format, or any combination of these formats. An encoder 610 encodes the source signal into a coded media bitstream. It should be noted that a bitstream to be decoded can be received directly or indirectly from a remote device located within virtually any type of network. Additionally, the bitstream can be received from local hardware or software. The encoder 610 may be capable of encoding more than one media type (such as audio and video), or more than one encoder 610 may be required to encode the different media types of the source signal. The encoder 610 may also receive synthetically produced input, such as graphics and text, or it may be capable of producing coded bitstreams of synthetic media. In the following, only the processing of one coded media bitstream of one media type is considered to simplify the description. It should be noted, however, that typical real-time broadcast services comprise several streams (usually at least one audio, video and text subtitling stream). It should also be noted that the system may include many encoders, but only one encoder 610 is shown in Fig. 6 to simplify the description without a lack of generality. It should further be understood that, although the text and examples contained herein may specifically describe an encoding process, one skilled in the art would understand that the same concepts and principles also apply to the corresponding decoding process, and vice versa.
The coded media bitstream is transferred to a storage 620. The storage 620 may comprise any type of mass memory to store the coded media bitstream. The format of the coded media bitstream in the storage 620 may be an elementary self-contained bitstream format, or one or more coded media bitstreams may be encapsulated into a container file. Some systems operate "live", i.e., omit storage and transfer the coded media bitstream from the encoder 610 directly to a sender 630. The coded media bitstream is then transferred to the sender 630, also referred to as the server, on a need basis. The format used in the transmission may be an elementary self-contained bitstream format, a packet stream format, or one or more coded media bitstreams may be encapsulated into a container file. The encoder 610, the storage 620, and the server 630 may reside in the same physical device, or they may be included in separate devices. The encoder 610 and the server 630 may operate with live real-time content, in which case the coded media bitstream is typically not stored permanently but rather buffered for small periods of time in the content encoder 610 and/or in the server 630 to smooth out variations in processing delay, transfer delay, and coded media bitrate.
The server 630 sends the coded media bitstream using a communication protocol stack. The stack may include, but is not limited to, Real-Time Transport Protocol (RTP), User Datagram Protocol (UDP), and Internet Protocol (IP). When the communication protocol stack is packet-oriented, the server 630 encapsulates the coded media bitstream into packets. For example, when RTP is used, the server 630 encapsulates the coded media bitstream into RTP packets according to an RTP payload format. Typically, each media format has a dedicated RTP payload format. It should again be noted that a system may contain more than one server 630, but for simplicity, the following description considers only one server 630.
The server 630 may or may not be connected to a gateway 640 through a communication network. The gateway 640 may perform different types of functions, such as translation of a packet stream according to one communication protocol stack to another communication protocol stack, merging and forking of data streams, and manipulation of data streams according to the downlink and/or receiver capabilities, such as controlling the bitrate of the forwarded stream according to prevailing downlink network conditions. Examples of gateways 640 include MCUs, gateways between circuit-switched and packet-switched video telephony, Push-to-talk over Cellular (PoC) servers, IP encapsulators in digital video broadcasting-handheld (DVB-H) systems, or set-top boxes that forward broadcast transmissions locally to home wireless networks. When RTP is used, the gateway 640 is called an RTP mixer or an RTP translator and typically acts as an endpoint of an RTP connection.
The system includes one or more receivers 650, typically capable of receiving, demodulating, and de-encapsulating the transmitted signal into a coded media bitstream. The coded media bitstream is transferred to a recording storage 655. The recording storage 655 may comprise any type of mass memory to store the coded media bitstream. The recording storage 655 may alternatively or additionally comprise computation memory, such as random access memory. The format of the coded media bitstream in the recording storage 655 may be an elementary self-contained bitstream format, or one or more coded media bitstreams may be encapsulated into a container file. If there are multiple coded media bitstreams associated with each other, such as an audio stream and a video stream, a container file is typically used, and the receiver 650 comprises or is attached to a container file generator producing the container file from the input streams. Some systems operate "live", i.e., omit the recording storage 655 and transfer the coded media bitstream from the receiver 650 directly to a decoder 660. In some systems, only the most recent part of the recorded stream, e.g., the most recent 10-minute excerpt of the recorded stream, is maintained in the recording storage 655, while any earlier recorded data is discarded from the recording storage 655.
The coded media bitstream is transferred from the recording storage 655 to the decoder 660. If there are many coded media bitstreams, such as an audio stream and a video stream, associated with each other and encapsulated into a container file, a file parser (not shown in the figure) is used to decapsulate each coded media bitstream from the container file. The recording storage 655 or the decoder 660 may comprise the file parser, or the file parser may be attached to either the recording storage 655 or the decoder 660.
The coded media bitstream is typically processed further by the decoder 660, whose output is one or more uncompressed media streams. Finally, a renderer 670 may reproduce the uncompressed media streams with a loudspeaker or a display, for example. The receiver 650, the recording storage 655, the decoder 660, and the renderer 670 may reside in the same physical device, or they may be included in separate devices. It should be noted that a bitstream to be decoded can be received from a remote device located within virtually any type of network. Additionally, the bitstream can be received from local hardware or software.
According to different implementations, the sender 630 may be configured to select the transmitted layers or views for multiple reasons, such as in response to a request from the receiver 650 or to the prevailing conditions of the network over which the bitstream is conveyed. A request from the receiver can be, for example, a request for a change of the layers or views for display, or a request arising from a change to a display device with different capabilities compared to the previous one.
Communication devices of the present invention may communicate using various transmission technologies including, but not limited to, Code Division Multiple Access (CDMA), Global System for Mobile Communications (GSM), Universal Mobile Telecommunications System (UMTS), Time Division Multiple Access (TDMA), Frequency Division Multiple Access (FDMA), Transmission Control Protocol/Internet Protocol (TCP/IP), Short Messaging Service (SMS), Multimedia Messaging Service (MMS), e-mail, Instant Messaging Service (IMS), Bluetooth, IEEE 802.11, etc. A communication device may communicate using various media including, but not limited to, radio, infrared, laser, cable connections, and the like.
Figs. 7 and 8 show one representative electronic device 12 within which different implementations may be realized. It should be understood, however, that the different implementations are not intended to be limited to one particular type of device. The electronic device 12 of Figs. 7 and 8 includes a housing 30, a display 32 in the form of a liquid crystal display, a keypad 34, a microphone 36, an ear-piece 38, a battery 40, an infrared port 42, an antenna 44, a smart card 46 in the form of a UICC according to one implementation, a card reader 48, radio interface circuitry 52, codec circuitry 54, a controller 56 and a memory 58. The individual circuits and elements are all of a type well known in the art.
The different implementations described herein are described in the general context of method steps or processes, which may be implemented in one implementation by a computer program product, embodied in a computer-readable medium, including computer-executable instructions, such as program code, executed by computers in networked environments. A computer-readable medium may include removable and non-removable storage devices including, but not limited to, Read Only Memory (ROM), Random Access Memory (RAM), compact discs (CDs), digital versatile discs (DVDs), etc. Generally, program modules may include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. Computer-executable instructions, associated data structures, and program modules represent examples of program code for executing the steps of the methods disclosed herein. The particular sequence of such executable instructions or associated data structures represents examples of corresponding acts for implementing the functions described in such steps or processes.
Different implementations may be realized in software, hardware, application logic, or a combination of software, hardware, and application logic. The software, application logic and/or hardware may reside, for example, on a chipset, a mobile device, a desktop, a laptop or a server. Software and web implementations of different implementations can be accomplished with standard programming techniques, with rule-based logic and other logic to accomplish various database searching steps or processes, correlation steps or processes, comparison steps or processes and decision steps or processes. Different implementations may also be fully or partially realized within network elements or modules. It should be noted that the words "component" and "module", as used herein and in the following claims, are intended to encompass implementations using one or more lines of software code, and/or hardware implementations, and/or equipment for receiving manual inputs.
Individual and specific structures described in the foregoing examples should be understood as constituting representative structures of means for performing the specific functions described in the following claims, although limitations in the claims should not be interpreted as constituting "means plus function" limitations in the event that the term "means" is not used therein. Additionally, the use of the term "step" in the foregoing description should not be used to construe any specific limitation in the claims as constituting a "step plus function" limitation. To the extent that individual references, including issued patents, patent applications, and non-patent publications, are described or otherwise mentioned herein, such references are not intended, and should not be interpreted, to limit the scope of the following claims.
The foregoing description of implementations has been presented for purposes of illustration and description. The foregoing description is not intended to be exhaustive or to limit implementations to the precise form disclosed, and modifications and variations are possible in light of the above teachings or may be acquired from practice of various implementations. The implementations discussed herein were chosen and described in order to explain the principles and the nature of various implementations and their practical application, so as to enable one skilled in the art to utilize various implementations and with various modifications as are suited to the particular use contemplated. The features of the implementations described herein may be combined in all possible combinations of methods, apparatus, modules, systems, and computer program products.
Claims (12)
1. A method for encoding a media stream comprising a first depth map picture associated with a first texture picture and a second depth map picture associated with a second texture picture, wherein said first depth map picture belongs to a first view and said second depth map picture belongs to a second view, said method comprising:
predicting, using a first motion vector, said second depth map picture belonging to said second view from said first depth map picture belonging to said first view;
predicting, using a second motion vector, said second texture picture from said first texture picture;
encoding said first motion vector into a bitstream;
predicting said first motion vector based on at least said second motion vector acting as a predicted motion vector; and
encoding a difference between said first motion vector and said predicted motion vector into said bitstream.
2. method according to claim 1, wherein said the first texture picture and the second texture picture belong to the first view, and described Media Stream also comprises the second view, described the second view comprises the 3rd depth map picture being associated with texture picture and the 4th depth map picture being associated with the 4th texture picture, and described method also comprises:
Use the 3rd motion vector to predict described the 4th depth map picture according to described the 3rd depth map picture;
Use the 4th motion vector to predict described the 4th texture picture according to described texture picture;
Described the 3rd motion vector and described the 4th motion vector are encoded.
3. method according to claim 2, also comprises:
Between described the second depth map picture and described the 4th depth map picture, between execution view, predicting.
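As an illustration only (not part of the claims), the motion-vector sharing of the encoding method above can be sketched with integer motion vectors held as (x, y) tuples. The function names and the tuple representation are assumptions for this sketch; a real encoder operates on entropy-coded syntax elements rather than Python tuples.

```python
def predict_first_mv(second_mv):
    # Per claim 1: the second (texture) motion vector serves as the
    # predicted motion vector for the first (depth-map) motion vector.
    return second_mv

def mv_difference(first_mv, predicted_mv):
    # Per claim 1: only the component-wise difference between the first
    # motion vector and its predictor is encoded into the bitstream.
    return (first_mv[0] - predicted_mv[0], first_mv[1] - predicted_mv[1])

# Example: texture MV (2, -1) predicts the depth-map MV (3, -2);
# the residual written to the bitstream is (1, -1).
second_mv = (2, -1)
first_mv = (3, -2)
residual = mv_difference(first_mv, predict_first_mv(second_mv))
```

The saving comes from the residual typically being smaller (and cheaper to entropy-code) than the depth-map motion vector itself, since texture and depth of the same scene tend to move together.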
4. An apparatus for encoding a media stream comprising a first depth map picture associated with a first texture picture and a second depth map picture associated with a second texture picture, wherein the first depth map picture belongs to a first view and the second depth map picture belongs to a second view, the apparatus comprising:
means for using a first motion vector to predict the second depth map picture belonging to the second view from the first depth map picture belonging to the first view;
means for using a second motion vector to predict the second texture picture from the first texture picture;
means for encoding the first motion vector into a bitstream;
means for predicting the first motion vector based on at least the second motion vector serving as a predicted motion vector; and
means for encoding a difference between the first motion vector and the predicted motion vector into the bitstream.
5. The apparatus according to claim 4, wherein the first texture picture and the second texture picture belong to the first view, and the media stream further comprises a second view comprising a third depth map picture associated with a third texture picture and a fourth depth map picture associated with a fourth texture picture, the apparatus further comprising:
means for using a third motion vector to predict the fourth depth map picture from the third depth map picture;
means for using a fourth motion vector to predict the fourth texture picture from the third texture picture; and
means for encoding the third motion vector and the fourth motion vector.
6. The apparatus according to claim 5, further comprising means for performing inter-view prediction between the second depth map picture and the fourth depth map picture.
7. A method for decoding a media stream comprising a first depth map picture associated with a first texture picture and a second depth map picture associated with a second texture picture, wherein the first depth map picture belongs to a first view and the second depth map picture belongs to a second view, the method comprising:
decoding a first motion vector from a bitstream;
decoding a second motion vector by summing a prediction based on the first motion vector and a residual value fetched from the bitstream;
decoding the second depth map picture belonging to the second view, wherein the first motion vector is used to predict the second depth map picture from the first depth map picture belonging to the first view; and
decoding the second texture picture, wherein the second motion vector is used to predict the second texture picture from the first texture picture.
8. The method according to claim 7, wherein the first texture picture and the second texture picture belong to the first view, and the media stream further comprises a second view comprising a third depth map picture associated with a third texture picture and a fourth depth map picture associated with a fourth texture picture, the method further comprising:
decoding a third motion vector and a fourth motion vector;
decoding the fourth depth map picture, wherein the third motion vector is used to predict the fourth depth map picture from the third depth map picture; and
decoding the fourth texture picture, wherein the fourth motion vector is used to predict the fourth texture picture from the third texture picture.
9. The method according to claim 8, further comprising:
performing inter-view prediction between the second depth map picture and the fourth depth map picture.
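For illustration, the decoder-side motion-vector reconstruction in the decoding method above can be sketched as follows. The names and the (x, y) tuple representation are assumptions, as is the choice of using the first motion vector directly as the prediction; a real decoder parses these values from entropy-coded bitstream syntax.

```python
def decode_second_mv(first_mv, residual):
    # Per claim 7: the second (texture) motion vector is decoded by
    # summing a prediction based on the first (depth-map) motion vector
    # (here, the vector itself) with the residual from the bitstream.
    prediction = first_mv
    return (prediction[0] + residual[0], prediction[1] + residual[1])

first_mv = (2, -1)   # first motion vector, decoded directly from the bitstream
residual = (1, -1)   # residual value fetched from the bitstream
second_mv = decode_second_mv(first_mv, residual)  # reconstructs (3, -2)
```

This mirrors the encoder side: whichever motion vector is carried explicitly, the other is recovered as predictor plus residual.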
10. An apparatus for decoding a media stream comprising a first depth map picture associated with a first texture picture and a second depth map picture associated with a second texture picture, wherein the first depth map picture belongs to a first view and the second depth map picture belongs to a second view, the apparatus comprising:
means for decoding a first motion vector from a bitstream;
means for decoding a second motion vector by summing a prediction based on the first motion vector and a residual value fetched from the bitstream;
means for decoding the second depth map picture belonging to the second view, wherein the first motion vector is used to predict the second depth map picture from the first depth map picture belonging to the first view; and
means for decoding the second texture picture, wherein the second motion vector is used to predict the second texture picture from the first texture picture.
11. The apparatus according to claim 10, wherein the first texture picture and the second texture picture belong to the first view, and the media stream further comprises a second view comprising a third depth map picture associated with a third texture picture and a fourth depth map picture associated with a fourth texture picture, the apparatus further comprising:
means for decoding a third motion vector and a fourth motion vector;
means for decoding the fourth depth map picture, wherein the third motion vector is used to predict the fourth depth map picture from the third depth map picture; and
means for decoding the fourth texture picture, wherein the fourth motion vector is used to predict the fourth texture picture from the third texture picture.
12. The apparatus according to claim 11, further comprising:
means for performing inter-view prediction between the second depth map picture and the fourth depth map picture.
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US10650308P | 2008-10-17 | 2008-10-17 | |
US61/106,503 | 2008-10-17 | ||
PCT/FI2009/050834 WO2010043773A1 (en) | 2008-10-17 | 2009-10-16 | Sharing of motion vector in 3d video coding |
Publications (2)
Publication Number | Publication Date |
---|---|
CN102257818A CN102257818A (en) | 2011-11-23 |
CN102257818B true CN102257818B (en) | 2014-10-29 |
Family
ID=42106276
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN200980151318.7A Active CN102257818B (en) | 2008-10-17 | 2009-10-16 | Sharing of motion vector in 3d video coding |
Country Status (4)
Country | Link |
---|---|
US (3) | US9973739B2 (en) |
EP (1) | EP2338281A4 (en) |
CN (1) | CN102257818B (en) |
WO (1) | WO2010043773A1 (en) |
Families Citing this family (110)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2009130561A1 (en) | 2008-04-21 | 2009-10-29 | Nokia Corporation | Method and device for video coding and decoding |
CN102106152A (en) * | 2008-07-24 | 2011-06-22 | Koninklijke Philips Electronics N.V. | Versatile 3-D picture format |
CN102792699A (en) * | 2009-11-23 | 2012-11-21 | General Instrument Corporation | Depth coding as an additional channel to video sequence |
CN102986214A (en) * | 2010-07-06 | 2013-03-20 | Koninklijke Philips Electronics N.V. | Generation of high dynamic range images from low dynamic range images |
KR102185765B1 (en) | 2010-08-11 | 2020-12-03 | GE Video Compression, LLC | Multi-view signal codec |
US8994792B2 (en) | 2010-08-27 | 2015-03-31 | Broadcom Corporation | Method and system for creating a 3D video from a monoscopic 2D video and corresponding depth information |
JPWO2012029884A1 (en) | 2010-09-03 | 2013-10-31 | Sony Corporation | Encoding apparatus, encoding method, decoding apparatus, and decoding method |
EP2630799A4 (en) * | 2010-10-20 | 2014-07-02 | Nokia Corp | Method and device for video coding and decoding |
US10057559B2 (en) * | 2010-12-03 | 2018-08-21 | Koninklijke Philips N.V. | Transferring of 3D image data |
RU2480941C2 (en) | 2011-01-20 | 2013-04-27 | Samsung Electronics Co., Ltd. | Method of adaptive frame prediction for multiview video sequence coding |
CN105100822B (en) * | 2011-01-28 | 2018-05-11 | Huawei Technologies Co., Ltd. | Method for loading auxiliary video supplementary information, processing method, device and system |
EP2485495A3 (en) * | 2011-02-03 | 2013-08-28 | Broadcom Corporation | Method and system for creating a 3D video from a monoscopic 2D video and corresponding depth information |
US9565449B2 (en) | 2011-03-10 | 2017-02-07 | Qualcomm Incorporated | Coding multiview video plus depth content |
US20120236934A1 (en) * | 2011-03-18 | 2012-09-20 | Qualcomm Incorporated | Signaling of multiview video plus depth content with a block-level 4-component structure |
US8761990B2 (en) * | 2011-03-30 | 2014-06-24 | Microsoft Corporation | Semi-autonomous mobile device driving with obstacle avoidance |
US9485517B2 (en) | 2011-04-20 | 2016-11-01 | Qualcomm Incorporated | Motion vector prediction with motion vectors from multiple views in multi-view video coding |
WO2012171477A1 (en) | 2011-06-15 | 2012-12-20 | Mediatek Inc. | Method and apparatus of texture image compression in 3d video coding |
KR20240017975A (en) * | 2011-06-16 | 2024-02-08 | GE Video Compression, LLC | Context initialization in entropy coding |
JP5830993B2 (en) * | 2011-07-14 | 2015-12-09 | ソニー株式会社 | Image processing apparatus and image processing method |
US9363535B2 (en) * | 2011-07-22 | 2016-06-07 | Qualcomm Incorporated | Coding motion depth maps with depth range variation |
US9521418B2 (en) | 2011-07-22 | 2016-12-13 | Qualcomm Incorporated | Slice header three-dimensional video extension for slice header prediction |
US20130188013A1 (en) * | 2011-07-22 | 2013-07-25 | Qualcomm Incorporated | MVC based 3DVC codec supporting inside view motion prediction (IVMP) mode |
US11496760B2 (en) | 2011-07-22 | 2022-11-08 | Qualcomm Incorporated | Slice header prediction for depth maps in three-dimensional video codecs |
US9288505B2 (en) | 2011-08-11 | 2016-03-15 | Qualcomm Incorporated | Three-dimensional video with asymmetric spatial resolution |
JP5155462B2 (en) * | 2011-08-17 | 2013-03-06 | Square Enix Holdings Co., Ltd. | Video distribution server, video reproduction device, control method, program, and recording medium |
US10165267B2 (en) * | 2011-08-30 | 2018-12-25 | Intel Corporation | Multiview video coding schemes |
CN103891291A (en) * | 2011-08-30 | 2014-06-25 | Nokia Corporation | An apparatus, a method and a computer program for video coding and decoding |
TWI595770B (en) | 2011-09-29 | 2017-08-11 | Dolby Laboratories Licensing Corporation | Frame-compatible full-resolution stereoscopic 3d video delivery with symmetric picture resolution and quality |
WO2013049179A1 (en) | 2011-09-29 | 2013-04-04 | Dolby Laboratories Licensing Corporation | Dual-layer frame-compatible full-resolution stereoscopic 3d video delivery |
CN102387368B (en) * | 2011-10-11 | 2013-06-19 | Zhejiang University of Technology | Fast selection method of inter-view prediction for multi-view video coding (MVC) |
KR20130046534A (en) * | 2011-10-28 | 2013-05-08 | Samsung Electronics Co., Ltd. | Method and apparatus for encoding image and method and apparatus for decoding image |
WO2013068548A2 (en) | 2011-11-11 | 2013-05-16 | Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. | Efficient multi-view coding using depth-map estimate for a dependent view |
IN2014KN00990A (en) | 2011-11-11 | 2015-10-09 | Fraunhofer Ges Forschung | |
US9485503B2 (en) * | 2011-11-18 | 2016-11-01 | Qualcomm Incorporated | Inside view motion prediction among texture and depth view components |
WO2013072484A1 (en) | 2011-11-18 | 2013-05-23 | Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. | Multi-view coding with efficient residual handling |
US10154276B2 (en) * | 2011-11-30 | 2018-12-11 | Qualcomm Incorporated | Nested SEI messages for multiview video coding (MVC) compatible three-dimensional video coding (3DVC) |
US9288506B2 (en) | 2012-01-05 | 2016-03-15 | Qualcomm Incorporated | Signaling view synthesis prediction support in 3D video coding |
WO2013107931A1 (en) * | 2012-01-19 | 2013-07-25 | Nokia Corporation | An apparatus, a method and a computer program for video coding and decoding |
CN104081780A (en) * | 2012-01-31 | 2014-10-01 | Sony Corporation | Image processing apparatus and image processing method |
KR101669524B1 (en) | 2012-02-01 | 2016-11-09 | Nokia Technologies Oy | Method and apparatus for video coding |
CN103392189B (en) | 2012-02-23 | 2017-10-31 | Square Enix Holdings Co., Ltd. | Moving image distribution server, moving image playback device and control method |
KR20130098122A (en) | 2012-02-27 | 2013-09-04 | Sejong University Industry-Academy Cooperation Foundation | Device and method for encoding/decoding |
WO2013129822A1 (en) * | 2012-02-27 | 2013-09-06 | Sejong University Industry-Academy Cooperation Foundation | Image encoding and decoding apparatus, and image encoding and decoding method |
US10447990B2 (en) | 2012-02-28 | 2019-10-15 | Qualcomm Incorporated | Network abstraction layer (NAL) unit header design for three-dimensional video coding |
US9445076B2 (en) | 2012-03-14 | 2016-09-13 | Qualcomm Incorporated | Disparity vector construction method for 3D-HEVC |
US9525861B2 (en) | 2012-03-14 | 2016-12-20 | Qualcomm Incorporated | Disparity vector prediction in video coding |
US9503720B2 (en) | 2012-03-16 | 2016-11-22 | Qualcomm Incorporated | Motion vector coding and bi-prediction in HEVC and its extensions |
US10200709B2 (en) | 2012-03-16 | 2019-02-05 | Qualcomm Incorporated | High-level syntax extensions for high efficiency video coding |
KR101536501B1 (en) | 2012-04-12 | 2015-07-13 | Shinra Technologies, Inc. | Moving image distribution server, moving image reproduction apparatus, control method, recording medium, and moving image distribution system |
CN104365104B (en) * | 2012-04-16 | 2018-11-06 | Samsung Electronics Co., Ltd. | Method and apparatus for multiview video coding and decoding |
US9549180B2 (en) * | 2012-04-20 | 2017-01-17 | Qualcomm Incorporated | Disparity vector generation for inter-view prediction for video coding |
US20150117514A1 (en) * | 2012-04-23 | 2015-04-30 | Samsung Electronics Co., Ltd. | Three-dimensional video encoding method using slice header and method therefor, and three-dimensional video decoding method and device therefor |
US20130287093A1 (en) * | 2012-04-25 | 2013-10-31 | Nokia Corporation | Method and apparatus for video coding |
WO2013176485A1 (en) * | 2012-05-22 | 2013-11-28 | LG Electronics Inc. | Method and device for processing video signal |
EP2670146A1 (en) * | 2012-06-01 | 2013-12-04 | Alcatel Lucent | Method and apparatus for encoding and decoding a multiview video stream |
US9307252B2 (en) | 2012-06-04 | 2016-04-05 | City University Of Hong Kong | View synthesis distortion model for multiview depth video coding |
US20130336405A1 (en) * | 2012-06-15 | 2013-12-19 | Qualcomm Incorporated | Disparity vector selection in video coding |
US9872041B2 (en) | 2012-08-10 | 2018-01-16 | Lg Electronics Inc. | Method and apparatus for transceiving image component for 3D image |
KR101754999B1 (en) * | 2012-08-29 | 2017-07-06 | Vid Scale, Inc. | Method and apparatus of motion vector prediction for scalable video coding |
KR102160242B1 (en) * | 2012-09-09 | 2020-09-25 | LG Electronics Inc. | Image decoding method and apparatus using same |
WO2014047351A2 (en) | 2012-09-19 | 2014-03-27 | Qualcomm Incorporated | Selection of pictures for disparity vector derivation |
JP6000463B2 (en) | 2012-09-21 | 2016-09-28 | MediaTek Inc. | Method and apparatus for virtual depth value of 3D video encoding |
SG11201502194QA (en) | 2012-09-21 | 2015-04-29 | Nokia Technologies Oy | Method and apparatus for video coding |
US9392268B2 (en) * | 2012-09-28 | 2016-07-12 | Qualcomm Incorporated | Using base layer motion information |
MY186413A (en) * | 2012-09-28 | 2021-07-22 | Sony Corp | Image processing device and method |
KR101835358B1 (en) | 2012-10-01 | 2018-03-08 | GE Video Compression, LLC | Scalable video coding using inter-layer prediction contribution to enhancement layer prediction |
CN104704837A (en) * | 2012-10-03 | 2015-06-10 | MediaTek Inc. | Method and apparatus for inter-component motion prediction in three-dimensional video coding |
WO2014053099A1 (en) * | 2012-10-03 | 2014-04-10 | Mediatek Inc. | Method and apparatus for motion information inheritance in three-dimensional video coding |
US9699450B2 (en) * | 2012-10-04 | 2017-07-04 | Qualcomm Incorporated | Inter-view predicted motion vector for 3D video |
KR20140048783A (en) * | 2012-10-09 | 2014-04-24 | Electronics and Telecommunications Research Institute | Method and apparatus for deriving motion information by sharing depth information value |
CN104782123A (en) * | 2012-10-22 | 2015-07-15 | Humax Holdings Co., Ltd. | Method for predicting inter-view motion and method for determining inter-view merge candidates in 3d video |
EP2919463A4 (en) * | 2012-11-07 | 2016-04-20 | Lg Electronics Inc | Method and apparatus for processing multiview video signal |
US20150304676A1 (en) * | 2012-11-07 | 2015-10-22 | Lg Electronics Inc. | Method and apparatus for processing video signals |
CN104838658B (en) * | 2012-12-14 | 2018-07-20 | Qualcomm Incorporated | Inside view motion prediction among texture and depth view components with asymmetric spatial resolution |
US9247256B2 (en) | 2012-12-19 | 2016-01-26 | Intel Corporation | Prediction method using skip check module |
US9584792B2 (en) | 2013-01-04 | 2017-02-28 | Qualcomm Incorporated | Indication of current view dependency on reference view in multiview coding file format |
US9762905B2 (en) * | 2013-03-22 | 2017-09-12 | Qualcomm Incorporated | Disparity vector refinement in video coding |
JP2014187580A (en) * | 2013-03-25 | 2014-10-02 | Kddi Corp | Video encoder, video decoder, video encoding method, video decoding method, and program |
WO2014163468A1 (en) * | 2013-04-05 | 2014-10-09 | Samsung Electronics Co., Ltd. | Interlayer video encoding method and apparatus for using view synthesis prediction, and video decoding method and apparatus for using same |
WO2014166096A1 (en) * | 2013-04-11 | 2014-10-16 | Mediatek Singapore Pte. Ltd. | Reference view derivation for inter-view motion prediction and inter-view residual prediction |
CN105103543B (en) * | 2013-04-12 | 2017-10-27 | HFI Innovation Inc. | Compatible depth-dependent coding method |
KR20160021222A (en) * | 2013-06-18 | 2016-02-24 | Vid Scale, Inc. | Inter-layer parameter set for HEVC extensions |
WO2015000108A1 (en) * | 2013-07-01 | 2015-01-08 | Mediatek Singapore Pte. Ltd. | An improved texture merging candidate in 3dvc |
WO2015007159A1 (en) * | 2013-07-15 | 2015-01-22 | Mediatek Singapore Pte. Ltd. | Method of disparity derived depth coding in 3d video coding |
WO2015006922A1 (en) * | 2013-07-16 | 2015-01-22 | Mediatek Singapore Pte. Ltd. | Methods for residual prediction |
WO2015006924A1 (en) * | 2013-07-16 | 2015-01-22 | Mediatek Singapore Pte. Ltd. | An additional texture merging candidate |
WO2015006883A1 (en) * | 2013-07-18 | 2015-01-22 | Qualcomm Incorporated | Motion vector inheritance techniques for depth coding |
CN105393539B (en) * | 2013-07-24 | 2019-03-29 | Qualcomm Incorporated | Sub-PU motion prediction for texture and depth coding |
ES2906238T3 (en) | 2013-07-24 | 2022-04-13 | Qualcomm Inc | Simplified Advanced Motion Prediction for 3D-HEVC |
EP3025498B1 (en) * | 2013-08-13 | 2019-01-16 | HFI Innovation Inc. | Method of deriving default disparity vector in 3d and multiview video coding |
US9813736B2 (en) * | 2013-09-27 | 2017-11-07 | Qualcomm Incorporated | Inter-view dependency type in MV-HEVC |
US10382752B2 (en) * | 2013-10-15 | 2019-08-13 | Sony Corporation | Image processing device and method |
US10045048B2 (en) | 2013-10-18 | 2018-08-07 | Lg Electronics Inc. | Method and apparatus for decoding multi-view video |
US9948950B2 (en) * | 2014-01-03 | 2018-04-17 | Qualcomm Incorporated | Disparity vector and/or advanced residual prediction for video coding |
WO2015109598A1 (en) * | 2014-01-27 | 2015-07-30 | Mediatek Singapore Pte. Ltd. | Methods for motion parameter hole filling |
CN106105212A (en) | 2014-03-07 | 2016-11-09 | Qualcomm Incorporated | Simplified sub-prediction unit (sub-PU) motion parameter inheritance (MPI) |
WO2015135137A1 (en) * | 2014-03-11 | 2015-09-17 | Mediatek Singapore Pte. Ltd. | A method of motion information sharing in multi-view and 3d video coding |
CN105187824A (en) | 2014-06-10 | 2015-12-23 | Hangzhou Hikvision Digital Technology Co., Ltd. | Image coding method and device, and image decoding method and device |
CN107409214B (en) * | 2015-01-21 | 2021-02-02 | Samsung Electronics Co., Ltd. | Method and apparatus for decoding inter-layer video and method and apparatus for encoding inter-layer video |
US10455242B2 (en) * | 2015-03-04 | 2019-10-22 | Qualcomm Incorporated | Signaling output indications in codec-hybrid multi-layer video coding |
EP3316575A1 (en) * | 2016-10-31 | 2018-05-02 | Thomson Licensing | Method for providing continuous motion parallax effect using an auto-stereoscopic display, corresponding device, computer program product and computer-readable carrier medium |
US10916041B2 (en) | 2018-03-30 | 2021-02-09 | Samsung Electronics Co., Ltd. | Method for depth image di coding |
US10623791B2 (en) | 2018-06-01 | 2020-04-14 | At&T Intellectual Property I, L.P. | Field of view prediction in live panoramic video streaming |
US10812774B2 (en) | 2018-06-06 | 2020-10-20 | At&T Intellectual Property I, L.P. | Methods and devices for adapting the rate of video content streaming |
US10616621B2 (en) | 2018-06-29 | 2020-04-07 | At&T Intellectual Property I, L.P. | Methods and devices for determining multipath routing for panoramic video content |
US10708494B2 (en) | 2018-08-13 | 2020-07-07 | At&T Intellectual Property I, L.P. | Methods, systems and devices for adjusting panoramic video content |
US11019361B2 (en) | 2018-08-13 | 2021-05-25 | At&T Intellectual Property I, L.P. | Methods, systems and devices for adjusting panoramic view of a camera for capturing video content |
WO2021087800A1 (en) * | 2019-11-06 | 2021-05-14 | Guangdong Oppo Mobile Telecommunications Corp., Ltd. | Information processing method, encoding apparatus, decoding apparatus, system, and storage medium |
CN111885386B (en) * | 2020-07-29 | 2022-07-26 | Beijing SenseTime Technology Development Co., Ltd. | Image compression method, image decompression method, image compression device, image decompression device, electronic equipment and storage medium |
US20220279204A1 (en) * | 2021-02-26 | 2022-09-01 | Qualcomm Incorporated | Efficient video encoder architecture |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2002163678A (en) * | 2000-09-13 | 2002-06-07 | Monolith Co Ltd | Method and device for generating pseudo three-dimensional image |
US7728878B2 (en) * | 2004-12-17 | 2010-06-01 | Mitsubishi Electric Research Laboratories, Inc. | Method and system for processing multiview videos for view synthesis using side information |
MY159176A (en) * | 2005-10-19 | 2016-12-30 | Thomson Licensing | Multi-view video coding using scalable video coding |
KR101101965B1 (en) * | 2006-10-16 | 2012-01-02 | Nokia Corporation | System and method for using parallelly decodable slices for multi-view video coding |
KR101885790B1 (en) * | 2007-04-12 | 2018-08-06 | Dolby International AB | Tiling in video encoding and decoding |
US20100284466A1 (en) * | 2008-01-11 | 2010-11-11 | Thomson Licensing | Video and depth coding |
WO2009130561A1 (en) * | 2008-04-21 | 2009-10-29 | Nokia Corporation | Method and device for video coding and decoding |
2009
- 2009-10-16 CN CN200980151318.7A patent/CN102257818B/en active Active
- 2009-10-16 US US13/124,641 patent/US9973739B2/en active Active
- 2009-10-16 EP EP09820314A patent/EP2338281A4/en not_active Ceased
- 2009-10-16 WO PCT/FI2009/050834 patent/WO2010043773A1/en active Application Filing

2018
- 2018-05-14 US US15/979,121 patent/US10306201B2/en active Active

2019
- 2019-05-23 US US16/420,729 patent/US10715779B2/en active Active
Also Published As
Publication number | Publication date |
---|---|
CN102257818A (en) | 2011-11-23 |
EP2338281A1 (en) | 2011-06-29 |
WO2010043773A1 (en) | 2010-04-22 |
US10306201B2 (en) | 2019-05-28 |
EP2338281A4 (en) | 2012-08-15 |
US20180262742A1 (en) | 2018-09-13 |
US9973739B2 (en) | 2018-05-15 |
US20190281270A1 (en) | 2019-09-12 |
US20110216833A1 (en) | 2011-09-08 |
US10715779B2 (en) | 2020-07-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN102257818B (en) | Sharing of motion vector in 3d video coding | |
CN111543060B (en) | Apparatus, method and computer program for video encoding and decoding | |
CN101627634B (en) | System and method for using parallelly decodable slices for multi-view video coding | |
Chen et al. | Overview of the MVC+D 3D video coding standard | |
CN107852532B (en) | Method and apparatus for video encoding | |
KR101682999B1 (en) | An apparatus, a method and a computer program for video coding and decoding | |
CN101548548B (en) | System and method for providing picture output indications in video coding | |
KR102224703B1 (en) | An apparatus, a method and a computer program for video coding and decoding | |
CN106105220B (en) | Method and apparatus for video coding and decoding | |
CN103907347B (en) | Multi-view video coding and decoding | |
CN104769948B (en) | Decoding method, device and readable storage medium | |
CN102918836B (en) | Frame packing for asymmetric stereo video | |
US9942558B2 (en) | Inter-layer dependency information for 3DV | |
CN101558652B (en) | System and method for implementing low-complexity multi-view video coding | |
CN108702503A (en) | Apparatus, method and computer program for video coding and decoding | |
CN109155861A (en) | Method, apparatus and computer program for encoding media content | |
JP2018534824A (en) | Video encoding / decoding device, method, and computer program | |
EP3018908B1 (en) | Method and apparatus for decoding video including a plurality of layers | |
KR20100074280A (en) | Motion skip and single-loop encoding for multi-view video content | |
CN105027567A (en) | Method and apparatus for video coding and decoding | |
CN105325003A (en) | An apparatus, a method and a computer program for video coding and decoding | |
WO2012039936A1 (en) | Coding stereo video data | |
CN105027569A (en) | An apparatus, a method and a computer program for video coding and decoding | |
Miličević et al. | EXTENSIONS OF HIGH EFFICIENCY VIDEO CODING STANDARD: AN OVERVIEW |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
C41 | Transfer of patent application or patent right or utility model | ||
TR01 | Transfer of patent right |
Effective date of registration: 20160215
Address after: Espoo, Finland
Patentee after: Nokia Technologies Oy
Address before: Espoo, Finland
Patentee before: Nokia Oyj