WO2011111533A1 - Transmission error concealment processing device, transmission error concealment processing method, and program thereof

Info

Publication number: WO2011111533A1
Application number: PCT/JP2011/054064
Authority: WO
Inventors: 和也早瀬; 藤井　寛; 誠之高村; 裕尚如澤
Original assignee: 日本電信電話株式会社
Priority date: 2010-03-11
Filing date: 2011-02-24
Publication date: 2011-09-15
Also published as: JP2013102256A; TW201203167A

Abstract

According to the present disclosure, when hierarchically coded data is decoded and replayed, quality degradation of the replayed video is kept from being conspicuous even if the signal of the desired hierarchical level cannot be decoded, because of a transmission error, at the time point at which replay is required. In a system that receives coded data having at least two hierarchical levels and decodes and replays the received coded data, the decoded signal obtained by decoding the coded data is recorded for each hierarchical level. When the decoded signal of the desired hierarchical level at the time point at which replay is required cannot be obtained because of a transmission error, at least one recorded decoded signal is read, input into a mixing function, and mixed at a set mixing rate to generate a mixed signal. The generated mixed signal is used as an artificially produced interpolated signal for the desired hierarchical level at that time point, and the interpolated signal is output as the signal for replay at that time point.

Description

Transmission error concealment processing apparatus, transmission error concealment processing method and program thereof

The present invention relates to a transmission error concealment technique for a system that receives encoded data hierarchized into two or more layers, decodes the received encoded data, and reproduces video. The technique keeps the image quality of the reproduced video from degrading, as far as possible, even when a transmission error occurs.
This application claims priority to Japanese Patent Application No. 2010-054136 filed in Japan on March 11, 2010, the contents of which are incorporated herein by reference.

When video data is streamed and played back over a network line subject to transmission errors such as packet delay and packet loss, video quality degrades through block noise, playback delay, frame skipping, and the like. To avoid this degradation, transmission error recovery technology and transmission error concealment technology are commonly used.

Transmission error recovery technology is a technology that provides redundant data to the original video data in advance and recovers the information of packets lost due to transmission errors using the redundant data. A typical example is forward error correction (FEC) technology.

Transmission error concealment technology constructs playback video of the highest possible quality using only information that has already been received, for example when packets lost to transmission errors cannot be recovered by FEC, or when the video is not complete at the time it should be reproduced because of transmission delay. For example, in Patent Document 1, when an error occurs in decoded image data, the transmission error is concealed by repeatedly displaying the currently displayed image.

Patent Document 1: Japanese Unexamined Patent Publication No. 2001-119893

With the repeat display described in Patent Document 1, the movement of the subject during the interval in which the error occurs cannot be reproduced adequately.

Meanwhile, scalable video coding is attracting attention as a video coding technology that is highly resistant to transmission errors. A scalable-coded video stream consists of a base layer that holds low-quality video information and an enhancement layer that holds high-quality video information. The enhancement layer data is the difference information, relative to the base layer data, needed to reproduce high-quality video. Because of this hierarchical structure, the layer in use can be switched flexibly in response to transmission errors, so scalable coding is highly compatible with error concealment technology.

The present invention has been made in view of these circumstances. Its object is to establish a design method for a transmission error concealer that operates when a video stream having a hierarchical structure is input and the decoded signal of the desired layer has not been decoded by the time at which it is to be reproduced. The concealer inputs decoded signals that have already been received, decoded, and stored in the frame buffer of the terminal, together with interpolated signals interpolated by a predetermined method, into a predetermined mixing function and mixes them, thereby generating a pseudo signal of the desired layer at that time and outputting it as the final video signal for reproduction.

To solve the above problems, a first aspect of the present invention is a transmission error concealment processing apparatus in a system that receives encoded data hierarchized into two or more layers and decodes and reproduces the received encoded data. The apparatus comprises: a decoded signal storage unit that stores, for each layer, a decoded signal obtained by decoding the encoded data; an interpolation signal generation unit that, when the decoded signal of a desired layer at a time requiring reproduction cannot be obtained due to a transmission error, reads one or more decoded signals stored in the decoded signal storage unit, inputs the read decoded signals to a mixing function, mixes them at a set mixing ratio to generate a mixed signal, and uses the generated mixed signal as an artificially created interpolation signal of the desired layer at the time requiring reproduction; and a reproduction image output unit that outputs the interpolation signal as the signal for reproduction at the time requiring reproduction.

The transmission error concealment processing apparatus may further include an interpolation signal storage unit that stores the interpolation signal. In that case, the interpolation signal generation unit reads one or more interpolation signals stored in the interpolation signal storage unit and inputs them to the mixing function together with the decoded signals read from the decoded signal storage unit, thereby mixing the one or more decoded signals with the one or more interpolation signals to generate the mixed signal that serves as the interpolation signal of the desired layer.

In the transmission error concealment processing apparatus, the mixing rate may be set higher the closer the signal input to the mixing function is to the time requiring reproduction, higher the closer the layer of that signal is to the desired layer, higher the greater the estimated image quality of that signal, or to a value corresponding to the temporal pixel value change of that signal.

In the transmission error concealment processing apparatus, the mixing rate that corresponds to the temporal pixel value change of the signal may be a value set according to the result of determining, by motion amount estimation for each region obtained by dividing the screen, whether that region is a still region or a moving region.

A second aspect of the present invention is a transmission error concealment processing method in a system that receives encoded data hierarchized into two or more layers and decodes and reproduces the received encoded data. The method comprises: a step of storing, for each layer, a decoded signal obtained by decoding the encoded data; an interpolation signal generation step of, when the decoded signal of a desired layer at a time requiring reproduction cannot be obtained due to a transmission error, reading one or more stored decoded signals, inputting the read decoded signals to a mixing function, mixing them at a set mixing ratio to generate a mixed signal, and using the generated mixed signal as an artificially created interpolation signal of the desired layer at the time requiring reproduction; and a reproduction video output step of outputting the interpolation signal as the signal for reproduction at the time requiring reproduction.

The transmission error concealment processing method may further include a step of storing the interpolation signal. In that case, in the interpolation signal generation step, one or more stored interpolation signals are read and input to the mixing function together with the read decoded signals, thereby mixing the one or more decoded signals with the one or more interpolation signals to generate the mixed signal that serves as the interpolation signal of the desired layer.

In the transmission error concealment processing method, the mixing rate may be set higher the closer the signal input to the mixing function is to the time requiring reproduction, higher the closer the layer of that signal is to the desired layer, higher the greater the estimated image quality of that signal, or to a value corresponding to the temporal pixel value change of that signal.

In the transmission error concealment processing method, the mixing rate that corresponds to the temporal pixel value change of the signal may be a value set according to the result of determining, by motion amount estimation for each region obtained by dividing the screen, whether that region is a still region or a moving region.

A third aspect of the present invention is a transmission error concealment processing program for causing a computer to execute the transmission error concealment processing method.

According to the present invention, when the desired frame in the desired layer is not decoded at the timing to be reproduced due to packet loss or transmission delay, the image quality of the video finally reproduced can be improved as compared with the conventional technique.

FIG. 1 is a diagram showing an example of a frame structure for explaining an embodiment of the present invention.
FIG. 2 is a diagram showing a configuration example of a transmission error concealment processing apparatus according to an embodiment of the present invention.
FIG. 3 is a diagram showing the flow of transmission error concealment processing according to an embodiment of the present invention.
FIG. 4 is a diagram showing the flow of interpolation signal generation processing in the transmission error concealment processing.
FIG. 5 is a diagram showing a hardware configuration example for the case where the transmission error concealment processing apparatus is implemented by a software program.

Hereinafter, an embodiment of the present invention will be described with reference to the drawings. First, an outline of the present embodiment will be described.
In this embodiment, when the decoded signal of the desired layer is not available at the time it is to be reproduced, because of packet loss, transmission delay, decoding delay, or the like, an interpolation signal of the desired layer, whose decoded signal is missing at that time, is generated in a pseudo manner from the decoded signals received so far and the interpolation signals generated so far. Here, an interpolated signal refers to a signal obtained by applying image quality enhancement, such as restoration of high-frequency components, to a decoded signal. The interpolation signal is ultimately reproduced via the video renderer as the signal for reproduction at that time. In this embodiment, transmission errors include not only errors such as packet loss in the network but also cases where transmission is delayed or decoding is not completed in time.

In this embodiment, the interpolation signal at that time is created by the following procedure. First, the received decoded signals and the interpolation signals are stored in a memory area such as a frame buffer. The number, type, and time range of the decoded signals and interpolation signals to be stored are set externally in advance. Next, the decoded signals, interpolation signals, and so on stored in the memory area are mixed using a predetermined mixing function, and the resulting mixed signal is treated as the pseudo interpolation signal of the desired layer at that time. The functional form of the mixing function and the coefficients it uses are also set externally in advance.

In this embodiment, to artificially create the interpolation signal of the desired layer whose decoded signal is missing at that time, the surrounding decoded signals are mixed adaptively with reference to information on motion and image quality. For example, in a static region, a past signal of the same layer as the desired layer is considered to have a value closer to the missing signal. In a moving region, on the other hand, the lower-layer signal at that time is considered to have a value closer to the missing signal. Therefore, when the surrounding decoded signals are mixed, the image quality of the video can be improved by raising the mixing ratio of the signals that, judging from motion information and the like, are estimated to have values close to the missing signal.

In addition, without using the interpolation signal generated so far, only the decoded signal stored for each layer can be input to the mixing function to generate the interpolation signal of the desired layer. In this case as well, better results can be obtained than in the prior art.

Next, this embodiment is described in more detail, starting with its preconditions. The input encoded data is hierarchized into two or more layers by scalable video coding. An example of scalable coding is SVC (Scalable Video Coding), the Annex G extension of H.264/AVC. It is further assumed that the signal of the desired layer to be reproduced cannot be obtained at the time in question.

Therefore, the processing of the present embodiment is started upon receipt of a desired layer missing instruction flag indicating that a signal of the desired layer is not obtained. As this desired hierarchy, not the lowest hierarchy but any higher hierarchy is set. It is assumed that a predetermined received decoded signal and a predetermined interpolation signal generated up to the time are stored in the memory area of the terminal that performs decoding. The number, type, and time range of decoded signals and interpolation signals to be stored are given in advance from the outside.

For example, assume that the time in question is T, and that the decoded signals rec from time T−a to time T and the interpolation signals ipl from time T−a to time T−1 are stored in the frame buffer. Each pixel of the interpolation signal ipl(T) at time T is generated as follows.

ipl(T) = f(rec(T−a), rec(T−a+1), ..., rec(T−1), rec(T), ipl(T−a), ipl(T−a+1), ..., ipl(T−1))   ... Equation (1)
Here, f(s) is a mixing function that takes the signal group s as input and generates a mixed signal.

Note that the example described here uses already generated interpolation signals as inputs to the mixing function. When the generated interpolation signals are not used, each pixel of the interpolation signal ipl(T) is generated using the following mixing function.

ipl(T) = f(rec(T−a), rec(T−a+1), ..., rec(T−1), rec(T))   ... (1)
In the above example, the stored decoded signals and interpolation signals are those from time T−a up to time T. However, when H.264/AVC B pictures or the like are used, temporally future decoded signals are received before time T, so the signals used are not necessarily limited to temporally past ones.

Referring to FIG. 1, a specific example of the interpolation signal generation process will be described. The time is T, and the desired hierarchy to be reproduced is L. The interpolation signal generated in this embodiment is ipl (T). Here, it is assumed that spatially scalable encoded data has been received, and data has been received up to layer L at time T-2, layer L-2 at time T-1, and layer L-1 at time T without loss. Further, it is assumed that the decoded signal rec (T-2) is reproduced as it is at time T-2, and the interpolation signal ipl (T-1) is reproduced at time T-1. The present embodiment may be applied to the method for generating the interpolation signal ipl (T-1), or other methods may be used. As an example of the other method, the method described in Patent Document 1 can be cited.

When processing the encoded data of SVC, if the upsampling filter used in texture prediction (IntraBL mode) is used, the implementation cost can be reduced. It is assumed that the past two frames of decoded signals and interpolated signals are stored in the frame buffer. That is, rec (T-2), rec (T-1), rec (T), and ipl (T-1) are stored.

Further, as the mixing function f(s), a function that applies linear weights to the input signals and averages them is used. Because the resolutions of the decoded signals rec(T−1) and rec(T) are lower than that of the desired layer, they are enlarged to the resolution of the desired layer before weighting. Examples of enlargement methods include linear filters such as 4-tap or 6-tap filters and super-resolution processing that reconstructs high-frequency components in a pseudo manner. Here, the enlarged signal is denoted ups(t), and the enlargement process is expressed as the function ups(t) = p(rec(t)).

At this time, each pixel of the interpolation signal ipl(T) at time T is generated as
ipl(T) = w_a · rec(T−2) + w_b · p(rec(T−1)) + w_c · p(rec(T)) + w_d · ipl(T−1)   ... Equation (2)
Here, w denotes the mixing ratio of each signal, and
w_a + w_b + w_c + w_d = 1.
Pixel values at the same spatial position are weighted and mixed.
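The weighted mixing of Equation (2) can be sketched as follows. This is only an illustrative sketch: the function name mix_frames, the representation of frames as nested lists, and the sample pixel values are assumptions made here, and the frames are assumed to have already been enlarged to the desired resolution.

```python
def mix_frames(frames, weights):
    """Per-pixel weighted mix of equally sized 2-D frames (lists of rows)."""
    if len(frames) != len(weights):
        raise ValueError("one mixing ratio per frame is required")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixing ratios must sum to 1")
    height, width = len(frames[0]), len(frames[0][0])
    # Pixels at the same spatial position are weighted and summed.
    return [
        [sum(w * f[y][x] for f, w in zip(frames, weights)) for x in range(width)]
        for y in range(height)
    ]


# Hypothetical 2x2 frames standing in for rec(T-2), p(rec(T-1)), p(rec(T)), ipl(T-1).
rec_t2 = [[100, 100], [100, 100]]
ups_t1 = [[80, 80], [80, 80]]
ups_t = [[60, 60], [60, 60]]
ipl_t1 = [[120, 120], [120, 120]]

# Order of weights: w_a, w_b, w_c, w_d.
ipl_t = mix_frames([rec_t2, ups_t1, ups_t, ipl_t1], [0.25, 0.25, 0.25, 0.25])
```

Because the text allows a separate mixing ratio per region or pixel, a fuller version would take a weight map rather than one weight per frame; the per-frame form is kept here for brevity.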

The mixing ratio w is supplied either from outside, through a settings file at the receiving terminal, or by a derivation module provided inside the application. A single mixing ratio may be given to the whole frame, or a separate mixing ratio may be given to each image region of arbitrary shape or to each pixel.

The following are four examples of how to give the mixing ratio. The mixing ratio may be set by a combination of these four methods.

<Mixing rate setting method 1: Setting according to the time difference of signals>
Mixing rate setting method 1 sets the mixing rate according to the difference between the time in question and the time of the decoded signal or interpolation signal to be mixed. A signal closer in time has a picture structure closer to that of the desired-layer signal at the time in question, so it is desirable to set the mixing ratio of signals close to that time as high as possible.
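One possible way to realize such a time-dependent setting is sketched below. The text does not specify a formula, so the inverse-distance weighting 1 / (1 + |T − t|), the function name time_weights, and the sample times are all assumptions chosen only to illustrate that nearer signals receive larger ratios.

```python
def time_weights(signal_times, target_time):
    """Mixing ratios that grow as a signal's time approaches target_time.

    ASSUMPTION: inverse-distance weighting; the source gives no formula.
    """
    raw = [1.0 / (1 + abs(target_time - t)) for t in signal_times]
    total = sum(raw)
    # Normalize so that the ratios sum to 1, as Equation (2) requires.
    return [r / total for r in raw]


# Signals at times T-2, T-1, and T, with T = 10 (hypothetical values).
weights = time_weights([8, 9, 10], target_time=10)
```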

<Mixing rate setting method 2: Setting according to signal hierarchy difference>
The mixing rate setting method 2 is a setting method corresponding to the difference between the desired layer and the decoded signal or interpolated signal layer to be mixed. Since the signal in the layer closer to the desired layer stores a larger number of higher frequency components, it is desirable to set the mixing ratio of the signal in the layer close to the desired layer as high as possible.

<Mixing ratio setting method 3: setting according to the estimated value of video quality>
A method for setting the mixing ratio according to the estimated value of the video quality is conceivable. If the video quality can be estimated, it is desirable to set the mixing ratio of the highest quality signal as high as possible. A method for estimating the video quality from the quantized value and the picture type can be considered.

<Mixing rate setting method 4: Setting according to temporal pixel value change of signal>
The mixing ratio can also be set according to temporal changes in pixel value. Because pixel values at the same spatial position are mixed, when the pixel value changes over time, for example because an object moves, it is desirable to lower the mixing ratio of pixel values from other times and raise the mixing ratio of the decoded signal at the time in question. Conversely, when the pixel value does not change over time, it is desirable to raise the mixing ratio of decoded or interpolation signals of the desired layer, which contain many high-frequency components, and to lower the mixing ratio the further a signal's layer is from the desired layer. A setting method based on these requirements is conceivable. For pixel position x, it is estimated whether the pixel value changes greatly from time T−2 to time T and whether it changes greatly from time T−1 to time T.

An example of how to assign the mixing ratios under mixing ratio setting method 4 is as follows. For a pixel where the pixel value changes from time T−2 to time T and from time T−1 to time T are both regarded as large, the mixing ratios are set to
w_a = w_b = w_d = 0, w_c = 1
and the interpolation signal can be generated as
ipl(T) = p(rec(T))   ... Equation (3)

For a pixel where the pixel value changes from time T−2 to time T and from time T−1 to time T are both regarded as small, the mixing ratios are set to
w_a = w_b = w_c = w_d = 1/4
and the interpolation signal can be generated as
ipl(T) = (1/4) × {rec(T−2) + p(rec(T−1)) + p(rec(T)) + ipl(T−1)}   ... Equation (4)

When the pixel value change from time T−2 to time T is large but the change from time T−1 to time T is small, the mixing ratios are set to
w_a = 0, w_b = w_c = w_d = 1/3
and the interpolation signal can be generated as
ipl(T) = (1/3) × {p(rec(T−1)) + p(rec(T)) + ipl(T−1)}   ... Equation (5)
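The three cases of Equations (3) to (5) can be collected into a small lookup, sketched below. The function name method4_weights is hypothetical. The text does not cover the remaining combination (small change from T−2, large change from T−1); falling back to Equation (3) for that case is purely an assumption of this sketch.

```python
def method4_weights(large_change_from_t2, large_change_from_t1):
    """Return (w_a, w_b, w_c, w_d) for rec(T-2), p(rec(T-1)), p(rec(T)), ipl(T-1)."""
    if large_change_from_t2 and large_change_from_t1:
        return (0.0, 0.0, 1.0, 0.0)            # Equation (3)
    if not large_change_from_t2 and not large_change_from_t1:
        return (0.25, 0.25, 0.25, 0.25)        # Equation (4)
    if large_change_from_t2 and not large_change_from_t1:
        return (0.0, 1 / 3, 1 / 3, 1 / 3)      # Equation (5)
    # ASSUMPTION: this combination is not described in the text; use only
    # the current lower-layer signal, as in Equation (3).
    return (0.0, 0.0, 1.0, 0.0)
```

For example, method4_weights(True, False) selects the ratios of Equation (5), which drop rec(T−2) and average the three remaining signals.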

One way to estimate the magnitude of the pixel value change is, for example, to estimate the amount of motion from time T−2 to time T, regarding pixels in a region judged to be a moving region as having a large pixel value change and pixels in a region judged to be a still region as having a small pixel value change. A different mixing ratio setting method may be used for each interpolation signal.

The following are five examples of how to estimate the amount of motion from time T−2 to time T, that is, how to determine whether a pixel belongs to a still region or a moving region. The decoded signals, enlarged signals, and interpolation signals in the memory area and the received encoded information such as motion vectors and prediction modes are read, the still region and moving region are determined, and the determination result is output. In these examples, pixels are divided into two classes, still region and moving region, but three or more classes may be used. A different motion amount estimation method may also be used for each interpolation signal.

<Motion amount estimation method 1>
The motion amount estimation method 1 is a method for estimating the motion amount according to the difference value between the reduced signal of the decoded signal at time T-2 and the decoded signal at time T.

Whether the pixel belongs to the still region or the moving region is determined as follows. The decoded signal rec(T−2) at time T−2 is reduced to the resolution of the decoded signal at time T, and a pixel value difference is obtained between pixels at the same spatial position. The reduced signal of the decoded signal rec(T−2) is denoted dws(T−2).

Still region: |rec(T) − dws(T−2)| ≤ E_1   ... Equation (6)
Moving region: E_1 < |rec(T) − dws(T−2)|   ... Equation (7)
Here, E_1 is a threshold value of the difference signal value that separates the still region and the moving region, and is given by an external function. For example, if the resolution of rec(T−2) is 1920 × 1080 and the resolution of rec(T) is 960 × 540, the above determination is performed for each of the 960 × 540 pixels. Each determination result is regarded as the result for the four pixels at the same spatial position in the 1920 × 1080 frame.
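A minimal sketch of Equations (6) and (7) follows, assuming a 2:1 resolution ratio as in the 1920 × 1080 / 960 × 540 example. The reduction filter is not specified in the text; averaging each 2×2 block is an assumption, as are the function names and sample values.

```python
def downscale_2x(frame):
    """Reduce a frame by 2 in each direction.

    ASSUMPTION: 2x2 block averaging; the source does not name the filter.
    """
    return [
        [
            (frame[2 * y][2 * x] + frame[2 * y][2 * x + 1]
             + frame[2 * y + 1][2 * x] + frame[2 * y + 1][2 * x + 1]) / 4
            for x in range(len(frame[0]) // 2)
        ]
        for y in range(len(frame) // 2)
    ]


def moving_map_method1(rec_t, rec_t2, threshold):
    """True marks a moving pixel, at the resolution of rec(T-2)."""
    dws = downscale_2x(rec_t2)
    low = [
        [abs(rec_t[y][x] - dws[y][x]) > threshold for x in range(len(rec_t[0]))]
        for y in range(len(rec_t))
    ]
    # Each low-resolution decision is applied to the 4 co-located pixels.
    return [
        [low[y // 2][x // 2] for x in range(2 * len(low[0]))]
        for y in range(2 * len(low))
    ]


rec_t2 = [[10, 10, 50, 50],
          [10, 10, 50, 50]]   # hypothetical rec(T-2), 4x2 pixels
rec_t = [[12, 90]]            # hypothetical rec(T), 2x1 pixels
moving = moving_map_method1(rec_t, rec_t2, threshold=5)
```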

<Motion amount estimation method 2>
The motion amount estimation method 2 is a method for estimating the motion amount according to the difference value between the decoded signal at time T-2 and the enlarged signal at time T.

Whether the pixel belongs to the still region or the moving region is determined as follows. A pixel value difference is obtained between pixels at the same spatial position. The enlarged signal at time T is denoted ups(T).

Still region: |ups(T) − rec(T−2)| ≤ E_2   ... Equation (8)
Moving region: E_2 < |ups(T) − rec(T−2)|   ... Equation (9)
Here, E_2 is a threshold value of the difference signal value that separates the still region and the moving region, and is given by an external function.
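Equations (8) and (9) can be sketched in the same style. Pixel replication is used here only as a stand-in for the enlargement p(); the text mentions 4-tap or 6-tap linear filters and super-resolution instead, so the upscaling, the function names, and the sample values are all assumptions of this sketch.

```python
def upscale_2x(frame):
    """Enlarge a frame by 2 in each direction.

    ASSUMPTION: pixel replication as a simple stand-in for p().
    """
    return [
        [frame[y // 2][x // 2] for x in range(2 * len(frame[0]))]
        for y in range(2 * len(frame))
    ]


def moving_map_method2(rec_t, rec_t2, threshold):
    """True marks a moving pixel; decided per pixel at full resolution."""
    ups = upscale_2x(rec_t)
    return [
        [abs(ups[y][x] - rec_t2[y][x]) > threshold for x in range(len(rec_t2[0]))]
        for y in range(len(rec_t2))
    ]


rec_t2 = [[10, 10, 50, 95],
          [10, 10, 50, 50]]   # hypothetical rec(T-2)
rec_t = [[12, 90]]            # hypothetical rec(T)
moving = moving_map_method2(rec_t, rec_t2, threshold=5)
```

Unlike method 1, the comparison runs at the resolution of rec(T−2), so neighbouring pixels inside the same 2×2 block can receive different results.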

<Motion amount estimation method 3>
The motion amount estimation method 3 is a method for estimating the motion amount according to the norm of the motion vector used for generating the decoded signal rec (T) at the time T.

Let n be the norm of the motion vector of a macroblock (a 16 × 16 pixel region) used for generating the decoded signal rec(T) at time T. This motion vector is assumed to be a motion vector from time T−2 to time T. Whether the macroblock to which the pixel belongs is in a still region or a moving region is then determined as follows.

Still region: n ≤ N   ... Equation (10)
Moving region: N < n   ... Equation (11)
Here, N is a threshold value of the motion vector norm that separates the still region and the moving region, and is given by an external function. An example of a norm is the Euclidean distance.
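With the Euclidean norm, Equations (10) and (11) reduce to a single comparison per macroblock, as in the sketch below. The function name and the (dx, dy) tuple representation of a motion vector are assumptions made for illustration.

```python
import math


def is_moving_macroblock(motion_vector, norm_threshold):
    """Return True when the macroblock falls in the moving region."""
    dx, dy = motion_vector
    n = math.hypot(dx, dy)      # Euclidean norm of the motion vector
    return n > norm_threshold   # Equation (11); otherwise Equation (10)
```

A vector of (3, 4) has norm 5, so with N = 5 the macroblock is classified as still, because Equation (10) includes equality.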

<Motion amount estimation method 4>
The motion amount estimation method 4 is a method for estimating a motion amount according to the type of prediction mode used to generate the decoded signal rec (T) at the time T.

Suppose that the prediction mode of a macroblock used for generating the decoded signal rec (T) at the time T is m. It is assumed that the encoded data conforms to SVC. At this time, it is determined as follows whether the macroblock to which the pixel belongs belongs to a still area or a moving area.

Still region: m == "skip"   ... Equation (12)
Moving region: m != "skip"   ... Equation (13)
"skip" indicates the skip mode in SVC. The mode used for the determination may be a mode other than the skip mode, and is given by an external function. Note that "==" means that both sides are equal, and "!=" means that both sides are not equal.

<Motion amount estimation method 5>
The motion amount estimation method 5 is a method for estimating the motion amount according to the magnitude of the prediction residual signal used for generating the decoded signal rec (T) at the time T.

The signal value of the prediction residual signal used for generating the decoded signal rec (T) at the time T is set as r. At this time, it is determined as follows whether the pixel belongs to a still region or a moving region.

Still region: |r| ≤ R   ... Equation (14)
Moving region: R < |r|   ... Equation (15)
Here, R is a threshold value of the prediction residual signal that separates the still region and the moving region, and is given by an external function. For r, the signal value of the prediction residual signal at the same spatial position as the pixel may be used as it is, or a statistic such as the variance, mean, maximum, or median within the image region (for example, the macroblock) to which the pixel belongs may be used.

In the case of SVC encoded data, there is a flag called CBP (Coded Block Pattern) indicating whether or not all quantization coefficients are 0. Instead of the signal value of the prediction residual signal, the motion amount class may be divided according to whether the value of this flag is 0 or 1.

Several of these five motion amount estimation methods may be connected in multiple stages. For example, motion amount estimation according to the motion vector norm may first divide the frame into regions of large motion and regions of small motion, and motion amount estimation according to the signal difference value may then subdivide the regions of small motion into still regions and moving regions.

The determination results obtained in this way may also be corrected. For example, if a pixel or a group of one or more image regions is determined to be a moving region while all surrounding pixels or image regions are determined to be still regions, the determination for the target pixel or image region is likely to be erroneous. An erroneous determination makes the pixel or image region appear as an isolated point and degrades image quality. In this case, therefore, the determination for the target pixel or image region is regarded as erroneous and corrected to still region. In other words, when the determinations around the target pixel or image region differ greatly from its own, correcting the determination can improve estimation accuracy.
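The isolated-point correction described above might look like the following sketch for a per-pixel map. Using the 8-neighbourhood as "surrounding pixels" is an assumption, as are the function name and the boolean-map representation (True for moving, False for still).

```python
def correct_isolated_moving(moving):
    """Reset moving pixels whose neighbours are all still.

    ASSUMPTION: "surrounding" means the 8-neighbourhood, clipped at borders.
    """
    height, width = len(moving), len(moving[0])
    corrected = [row[:] for row in moving]
    for y in range(height):
        for x in range(width):
            if not moving[y][x]:
                continue
            neighbours = [
                moving[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            ]
            # All neighbours still: treat the result as a misjudgement.
            if neighbours and not any(neighbours):
                corrected[y][x] = False
    return corrected


noisy = [[False, False, False],
         [False, True, False],
         [False, False, False]]
cleaned = correct_isolated_moving(noisy)
```

The decisions are read from the original map and written to a copy, so correcting one pixel does not influence the test applied to its neighbours.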

[Transmission error concealment processing device]
In the embodiment described below, "mixing rate setting method 4: setting according to temporal pixel value change of signal" is applied as the mixing rate setting method, and the temporal pixel value change is estimated by "motion amount estimation method 1", that is, motion amount estimation according to the difference value between the reduced signal of the decoded signal at time T−2 and the decoded signal at time T. The motion amount is estimated for each image region of one or more pixels into which the frame is divided. As will be apparent from the following description, the embodiment can be implemented in the same way when other mixing ratio setting methods or other motion amount estimation methods are used.

FIG. 2 shows a configuration example of a transmission error concealment processing apparatus according to an embodiment of the present invention. In FIG. 2, 10 denotes the transmission error concealment processing device, 20 denotes a receiving device that receives packets of an encoded stream, and 30 denotes a decoding device that decodes the encoded stream and outputs a reproduced video signal (also simply called a reproduced signal).

The receiving device 20 is a device that receives an encoded stream encoded by scalable video coding. It receives the encoded stream in the same way as a conventional receiving device. However, when a transmission error causes the hierarchically encoded data of the frame being processed to be missing, the receiving device 20 sends a missing instruction signal indicating this to the transmission error concealment processing device 10. The missing instruction signal may instead be generated by the decoding device 30.

The decoding device 30 is likewise the same as a conventional device for scalable video decoding, with two differences. In addition to outputting the reproduced video signal of the decoding result, it outputs the received encoded information decoded by the variable length decoding unit 31 not only to the scalable decoding unit 32 but also to the transmission error concealment processing device 10, and it outputs the decoded signal of each layer to the transmission error concealment processing device 10.

In FIG. 2, the transmission error concealment processing device 10 and the decoding device 30 are shown as separate devices for easy understanding of the description. However, the transmission error concealment processing device 10 may be incorporated in the decoding device 30 as a part of the decoding device 30.

The storage device 15 is provided with, as frame buffers, a decoded signal storage unit 151 that stores decoded signals already decoded by the decoding device 30 and an interpolation signal storage unit 152 that stores previously generated mixed signals. In the memory area of the storage device 15, information such as motion vectors, prediction modes, prediction residual signals, and CBP is input from the decoding device 30 and stored as received encoded information. Further, in the memory area of the storage device 15, a motion amount estimation threshold and frame mixing rates are set in advance from the external setting unit 40 and stored.

When the transmission error concealment processing device 10 receives the missing instruction signal for the processing target frame from the receiving device 20, the still region / moving region determination unit 11 estimates the motion amount for the pixel (or pixel group) to be reproduced and determines whether it belongs to a still region or a moving region. To do so, the still region / moving region determination unit 11 reads the decoded signal, the enlarged signal, and the interpolation signal from the storage device 15, and reads the received encoded information such as motion vectors and prediction modes from the memory area. The determination result is stored in the storage device 15.

The mixing rate setting unit 12 reads the still region / moving region determination result and the mixing rate values for each determination result (that is, for the still region and for the moving region) from the storage device 15, and sets them as the mixing rates of the decoded signal, the enlarged signal, and the interpolation signal for the pixel. When the setting is completed, the process proceeds to the interpolation signal generation unit 13.

The interpolation signal generation unit 13 reads the mixing rates of the decoded signal, the enlarged signal, and the interpolation signal set by the mixing rate setting unit 12, and reads from the storage device 15 the values of the decoded signal, the enlarged signal, and the interpolation signal at the same spatial position as the pixel. It then mixes these values according to the read mixing rates to generate the interpolation signal at that time. The interpolation signal generation unit 13 outputs the interpolation signal generated by the mixing to the interpolation signal storage unit 152 of the storage device 15. When the output is completed, the process proceeds to the reproduction video output unit 14.

The playback video output unit 14 reads the interpolation signal at the time from the interpolation signal storage unit 152 of the storage device 15 and outputs it as a playback video signal to a video renderer (not shown) at the playback timing.

[Process flow]
The flow of processing executed by the transmission error concealment processing device 10 will be described in detail with reference to FIGS. 3 and 4. FIG. 3 shows the overall processing flow, and FIG. 4 shows the specific processing flow of the interpolation signal generation processing S12 in FIG. 3.

In this embodiment, an encoded stream with spatial scalability conforming to SVC is received. It is assumed here that data of the extended layer may be lost due to a transmission error, whereas data of the base layer is not lost. As a specific example, the following describes the flow of processing when the desired interpolation signal ipl(T) is generated using the decoded signal rec(T) of the highest layer among the layers that could be decoded at the desired time T (hereinafter referred to as the highest layer) and the decoded signal rec(T-2) of the frame at the most recent time among the decoded signals that were decoded up to the desired layer.

In addition, a case is described in which “mixing rate setting method 4: setting according to the temporal pixel value change of the signal” is applied as the mixing rate setting method, and the temporal pixel value change is estimated by “motion amount estimation method 1: motion amount estimation according to the difference value between the reduced signal of the decoded signal at time T-2 and the decoded signal at time T”. The motion amount is estimated for each of the image regions of one pixel or more into which a frame is divided. The following description covers the flow of generating the interpolation signal of the frame at the desired time T.

[Step S10: Image Region Division Processing]
In the image region division processing, the frame of the desired layer for which the interpolation signal is to be output is input, and the frame is divided into a plurality of predetermined image regions of one pixel or more. Each image region may be, for example, a macroblock (16 × 16 pixels), but is not limited thereto. The output of this processing is the divided frame and the division information.
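The division of step S10 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `divide_into_regions`, the tuple layout `(x, y, width, height)`, and the clipping of edge regions are all assumptions made for the example.

```python
def divide_into_regions(width, height, block=16):
    """Return (x, y, w, h) tuples covering a width x height frame in raster order.

    Hypothetical helper: the 16 x 16 default follows the macroblock example
    in the text; clipping at the frame edge is an assumption of this sketch.
    """
    regions = []
    for y in range(0, height, block):
        for x in range(0, width, block):
            # Edge regions are clipped when the frame size is not a multiple of block.
            regions.append((x, y, min(block, width - x), min(block, height - y)))
    return regions


# Division information for a 1920 x 1080 frame of the desired layer.
regions = divide_into_regions(1920, 1080)
```

Because 1080 is not a multiple of 16, the last row of regions in this sketch is only 8 pixels high; a real decoder would more likely pad the frame to a whole number of macroblocks.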

[Steps S11-S13: Interpolation Signal Generation Processing Loop in Each Image Region]
The interpolation signal generation process of step S12 is performed for each image region of the frame of the desired hierarchy. This process is repeated until the interpolation signal generation process for all image regions is completed.

[Step S12: Interpolation Signal Generation Processing]
The inputs to the interpolation signal generation processing are the decoded signal rec(T) of the highest layer decoded at the desired time T, the decoded signal rec(T-2) of the frame at the most recent time among the decoded signals decoded up to the desired layer, and data stored in the storage device 15, such as the still region / moving region determination threshold, the mixing rates for the still region and the moving region, the signal mixing formula, and the index of the image region. The output of the interpolation signal generation processing is the interpolation signal ipl(T) for the image region.

Here, for the image region to be processed, the still region / moving region determination unit 11 uses the decoded signal rec(T) of the highest layer decoded at the desired time T and the decoded signal rec(T-2) of the frame at the most recent time among the decoded signals decoded up to the desired layer to determine whether the image region is a still region or a moving region. Based on the determination result, the mixing rate setting unit 12 sets the mixing rate. The interpolation signal generation unit 13 then mixes the signal values of the decoded signal rec(T) and the decoded signal rec(T-2) according to the mixing rate and outputs the mixed signal to the interpolation signal storage unit 152 as the interpolation signal ipl(T) for the image region.

Details of the interpolation signal generation processing in step S12 shown in FIG. 3 will be described with reference to FIG. 4.

[Step S20: Reduced Signal Generation Processing]
The inputs to the reduced signal generation processing, which the still region / moving region determination unit 11 performs first, are the decoded signal rec(T-2) of the frame at the most recent time among the decoded signals decoded up to the desired layer, and the resolution information of each layer. Here, the still region / moving region determination unit 11 reduces the decoded signal rec(T-2) at time T-2 to the resolution of the decoded signal rec(T) of the highest layer that could be decoded at the desired time T. For example, when the spatial scalability is (1920 × 1080) / (960 × 540) and the image region in the extended layer is 16 × 16 pixels, the reduced image region is an 8 × 8 pixel region at the same spatial position. The output of this processing is the reduced signal dws(T-2) of the decoded signal rec(T-2).
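As an illustration of step S20, the sketch below reduces a 16 × 16 region to 8 × 8. The text does not specify the reduction filter, so averaging each 2 × 2 block is an assumption of this example, as is the function name `reduce_signal`.

```python
def reduce_signal(region, factor=2):
    """Downscale a 2-D list of pixel values by averaging factor x factor blocks.

    Assumption: a box (mean) filter. The region height and width are assumed
    to be multiples of factor.
    """
    height, width = len(region), len(region[0])
    reduced = []
    for y in range(0, height, factor):
        row = []
        for x in range(0, width, factor):
            block = [region[y + dy][x + dx]
                     for dy in range(factor) for dx in range(factor)]
            row.append(sum(block) / len(block))
        reduced.append(row)
    return reduced


# A synthetic 16 x 16 region of rec(T-2); dws_region plays the role of dws(T-2).
rec_prev_region = [[(x + y) % 256 for x in range(16)] for y in range(16)]
dws_region = reduce_signal(rec_prev_region)
```

With a scalability ratio of 2 in each dimension, as in the (1920 × 1080) / (960 × 540) example, the 16 × 16 input yields the 8 × 8 reduced image region described above.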

[Step S21: Difference Absolute Value Calculation Processing]
The inputs to the difference absolute value calculation processing, which the still region / moving region determination unit 11 performs next, are the reduced signal dws(T-2) of the decoded signal rec(T-2) and the decoded signal rec(T). Here, the still region / moving region determination unit 11 calculates the pixel value difference between pixels at the same spatial position in the reduced signal dws(T-2) and the decoded signal rec(T), and then calculates the sum of the absolute values of the differences within the reduced image region. In the above example, the still region / moving region determination unit 11 calculates the sum E of the absolute differences over the 8 × 8 pixel reduced image region. The output of the difference absolute value calculation processing is the sum E of the absolute differences in the reduced image region.
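The sum E of step S21 is a sum of absolute differences over co-located pixels. A minimal sketch follows; the function name `sum_abs_diff` and the list-of-lists pixel layout are assumptions for illustration.

```python
def sum_abs_diff(dws_prev, rec_cur):
    """Sum of absolute pixel differences between two equally sized regions.

    dws_prev: reduced region of rec(T-2); rec_cur: co-located region of rec(T).
    """
    return sum(abs(a - b)
               for row_prev, row_cur in zip(dws_prev, rec_cur)
               for a, b in zip(row_prev, row_cur))


# Tiny 2 x 2 example: |1-1| + |2-0| + |3-5| + |4-4| = 4
E = sum_abs_diff([[1, 2], [3, 4]], [[1, 0], [5, 4]])
```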

[Step S22: Still Area / Moving Area Determination Process]
The inputs to the still region / moving region determination processing are the sum E of the absolute differences in the reduced image region and the threshold E₁ for the still region / moving region determination. The threshold E₁ is a difference signal value that separates still regions from moving regions; it is given in advance by the setting unit 40, such as an external function, and stored in the storage device 15. Based on these inputs, the still region / moving region determination unit 11 determines whether the sum E of the absolute differences is larger than the threshold E₁. If the sum E is less than or equal to the threshold, the image region that spatially corresponds to the reduced image region is regarded as a still region. If the sum E is larger than the threshold, the image region is regarded as a moving region. In other words, the following determination is performed.

Still region: E ≦ E₁ …… Equation (16)
Moving region: E₁ < E …… Equation (17)

In the above example, the determination result for the 8 × 8 pixel reduced image region is regarded as the determination result for the 16 × 16 pixel image region. The output of this processing is the still region / moving region determination result for the image region.
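Equations (16) and (17) amount to a single threshold comparison. The sketch below shows it; the function name, the string labels, and the threshold value 64 are hypothetical, since the text only says that E₁ is supplied in advance by the setting unit 40.

```python
def classify_region(E, E1):
    """Apply Equations (16) and (17): still if E <= E1, moving if E1 < E."""
    return "still" if E <= E1 else "moving"


E1 = 64  # hypothetical threshold; the actual value comes from the setting unit 40
labels = [classify_region(E, E1) for E in (0, 64, 65, 900)]
```

Note that the boundary case E = E₁ falls on the still side, as Equation (16) specifies.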

[Steps S23 to S25: Correction processing of determination result]
The still region / moving region determination unit 11 further performs the following processing depending on whether the determination result for the image region is a still region. If the image region is a still region, the still region / moving region determination unit 11 determines whether many of the surrounding image regions are moving regions. If so, the process proceeds to step S27; otherwise, the process proceeds to step S26. If the image region is not a still region, the still region / moving region determination unit 11 determines whether many of the surrounding image regions are still regions. If so, the process proceeds to step S26; otherwise, the process proceeds to step S27. As a result, when the determination result for many of the surrounding image regions differs from that for the image region, the determination result for the image region is corrected to that of the surrounding image regions.
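The correction of steps S23 to S25 can be sketched as a neighbourhood vote. The text does not define "many", so this example makes three assumptions: the surrounding regions are the up to eight adjacent regions, "many" means a strict majority of them, and the vote uses the uncorrected results of the neighbours. The function name `correct_labels` is hypothetical.

```python
def correct_labels(is_still):
    """Flip a region's still/moving result when most neighbours disagree.

    is_still: 2-D list of booleans (True = still region), one per image region.
    Returns a corrected copy; votes are taken on the uncorrected input.
    """
    rows, cols = len(is_still), len(is_still[0])
    corrected = [row[:] for row in is_still]
    for r in range(rows):
        for c in range(cols):
            neighbours = [is_still[rr][cc]
                          for rr in range(max(0, r - 1), min(rows, r + 2))
                          for cc in range(max(0, c - 1), min(cols, c + 2))
                          if (rr, cc) != (r, c)]
            differing = sum(1 for n in neighbours if n != is_still[r][c])
            # Strict majority of the surrounding regions disagrees: adopt their result.
            if differing * 2 > len(neighbours):
                corrected[r][c] = not is_still[r][c]
    return corrected


# An isolated moving region surrounded by still regions is corrected to still.
grid = [[True, True, True],
        [True, False, True],
        [True, True, True]]
fixed = correct_labels(grid)
```

A corrected result of still leads to step S26 and a corrected result of moving leads to step S27, matching the branches described above.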

[Step S26: Setting Mixing Ratio for Still Area]
The mixing rate setting unit 12 reads the mixing rate for the still area from the storage device 15 and sets the value in a register (not shown). Thereafter, the process proceeds to step S28.

[Step S27: Setting Mixing Ratio for Moving Area]
The mixing rate setting unit 12 reads the mixing rate for the moving area from the storage device 15 and sets the value in the register. Thereafter, the process proceeds to step S28.

[Step S28: Interpolation Signal Generation Processing]
The inputs to the interpolation signal generation unit 13 are the decoded signal rec(T) of the highest layer decoded at the desired time T, the decoded signal rec(T-2) of the frame at the most recent time among the decoded signals decoded up to the desired layer, the mixing rates for the still region and the moving region, and the signal mixing formula. From these, it generates the interpolation signal for the image region as follows. First, the interpolation signal generation unit 13 enlarges the decoded signal rec(T) to the resolution of the decoded signal rec(T-2). Next, it mixes the decoded signal rec(T-2) and the enlarged signal p(rec(T)) of the decoded signal rec(T) at the mixing rates set in the register. Here, the signal mixing formula is a linear weighted sum, and the mixing rates are the linear weighting coefficients. When the mixing rates are denoted w_a and w_c, the interpolation signal is generated from the decoded signal rec(T-2) and the enlarged signal p(rec(T)) according to the following equation.

ipl(T) = w_a · rec(T-2) + w_c · p(rec(T)) …… Equation (18)
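The linear weighted sum of Equation (18) can be illustrated as follows. This is a sketch under stated assumptions: the enlargement p(·) is implemented as nearest-neighbour repetition because the text does not name an enlargement filter, and the weights 0.75 and 0.25 are hypothetical values, not ones given in the source. The function names `enlarge` and `mix_linear` are likewise made up for the example.

```python
def enlarge(region, factor=2):
    """Nearest-neighbour enlargement p(.) of a 2-D list (assumed filter)."""
    enlarged = []
    for row in region:
        wide = [v for v in row for _ in range(factor)]
        for _ in range(factor):
            enlarged.append(list(wide))
    return enlarged


def mix_linear(rec_prev, rec_cur_low, w_a, w_c, factor=2):
    """ipl(T) = w_a * rec(T-2) + w_c * p(rec(T)) for one image region."""
    p_cur = enlarge(rec_cur_low, factor)
    return [[w_a * a + w_c * c for a, c in zip(row_a, row_c)]
            for row_a, row_c in zip(rec_prev, p_cur)]


# Hypothetical weights: favour the high-resolution past frame rec(T-2).
rec_prev = [[100, 100], [100, 100]]   # region of rec(T-2), desired layer
rec_cur_low = [[50]]                  # co-located region of rec(T), lower layer
ipl = mix_linear(rec_prev, rec_cur_low, w_a=0.75, w_c=0.25)
```

In this sketch the register value chosen in step S26 or S27 would simply select which pair (w_a, w_c) is passed in; weights that sum to 1 keep the output within the range of the inputs.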

In the present embodiment, the method of generating the interpolation signal may be changed for each time point. For example, a method of setting the mixing rate according to the estimated image quality value may be applied to ipl(T-1), and a method of setting the mixing rate according to the temporal pixel value change may be applied to ipl(T). A conventional technique such as super-resolution and the present embodiment may also be switched for each time point.

In this embodiment, the form of the mixing function may be an arbitrary non-linear function form instead of the linear weighted average as described above. A function format that outputs an intermediate value of a plurality of signal values or an intermediate value of a weighted signal may be used.
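As one possible reading of a mixing function that outputs an intermediate value, the sketch below takes the per-pixel median of several equally sized signals. Interpreting "intermediate value" as the median is an assumption of this example, and the function name `mix_median` is hypothetical.

```python
from statistics import median


def mix_median(*signals):
    """Per-pixel median of several equally sized 2-D signals (assumed reading)."""
    return [[median(values) for values in zip(*rows)]
            for rows in zip(*signals)]


# Three candidate signals for the same region, for example rec(T-2),
# p(rec(T)) and a previously generated interpolation signal.
mixed = mix_median([[10, 20]], [[30, 5]], [[20, 40]])
```

Unlike the linear weighted sum, a median ignores a single outlying candidate at each pixel, which is one reason a non-linear mixing function might be preferred.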

Also, in this processing example, interpolation processing is performed after determining a stationary region and a moving region for each image region as needed. However, the interpolation process may be performed on each pixel after the determination process of the still area and the moving area is performed on all the pixels of one frame.

The above transmission error concealment processing can also be realized by a computer and a software program. The program can be recorded on a computer-readable recording medium or provided through a network.

FIG. 5 shows a hardware configuration example when the transmission error concealment processing device is realized by using a software program.

This system comprises a CPU (Central Processing Unit) 50 that executes a program, a memory 51 such as a RAM (Random Access Memory) in which the program and data accessed by the CPU 50 are stored, an encoded stream receiving unit 52 that receives an encoded stream via a network, a program storage device 53 that stores the program to be executed by the CPU 50, and a video reproducing unit 54 that outputs a reproduced video signal, all of which are connected by a bus.

The program storage device 53 stores a decoding processing program 531 for decoding the encoded stream of the hierarchically encoded data received by the encoded stream receiving unit 52, and a transmission error concealment processing program 532 for causing the CPU 50 to execute the above-described transmission error concealment processing when there is a transmission error in the processing target frame.

The CPU 50 loads the decoding processing program 531 and the transmission error concealment processing program 532 into the memory 51 and executes them. As a result, even when the desired layer of the desired frame is not decoded at the timing to be reproduced due to packet loss or transmission delay, it is possible to make the deterioration of the image quality of the finally reproduced video inconspicuous.

The embodiment of the present invention has been described in detail with reference to the drawings. However, the specific configuration is not limited to the above-described embodiment and also includes design changes and the like (additions, omissions, substitutions, and other changes to the configuration). The present invention is not limited by the above description, but only by the appended claims.

The present invention is used, for example, in a system that receives two or more hierarchized encoded data, decodes the received encoded data, and reproduces a video. According to the present invention, even when a desired frame in a desired layer is not decoded at a timing to be reproduced due to packet loss or transmission delay, the image quality of the video to be finally reproduced can be improved.

DESCRIPTION OF SYMBOLS
10 Transmission error concealment processing device
11 Still region / moving region determination unit
12 Mixing rate setting unit
13 Interpolation signal generation unit
14 Reproduction video output unit
15 Storage device
151 Decoded signal storage unit
152 Interpolation signal storage unit
20 Receiving device
30 Decoding device
31 Variable length decoding unit
32 Scalable decoding unit
40 Setting unit

Claims

A transmission error concealment processing apparatus in a system that receives encoded data hierarchized into two or more layers and decodes and reproduces the received encoded data, the apparatus comprising:
a decoded signal storage unit that stores a decoded signal obtained by decoding the encoded data for each layer;
an interpolation signal generation unit that, when a decoded signal of a desired layer at a time at which reproduction is required cannot be obtained due to a transmission error, reads one or more decoded signals stored in the decoded signal storage unit, inputs the read decoded signals to a mixing function and mixes them at a set mixing rate to generate a mixed signal, and uses the generated mixed signal as an artificially generated interpolation signal of the desired layer at the time at which the reproduction is required; and
a reproduction video output unit that outputs the interpolation signal as a signal for reproduction at the time at which the reproduction is required.
The transmission error concealment processing device according to claim 1, further comprising
an interpolation signal storage unit that stores the interpolation signal,
wherein the interpolation signal generation unit reads one or more interpolation signals stored in the interpolation signal storage unit and inputs the read interpolation signals to the mixing function together with the decoded signals read from the decoded signal storage unit, thereby generating the mixed signal in which the one or more decoded signals and the one or more interpolation signals are mixed, the mixed signal being used as the interpolation signal of the desired layer.
The transmission error concealment processing device according to claim 1 or 2,
wherein the mixing rate is set to a higher value as the signal input to the mixing function is closer to the time at which the reproduction is required, to a higher value as the signal input to the mixing function is closer to the desired layer, to a higher value as the estimated image quality value of the signal input to the mixing function is higher, or to a value corresponding to a temporal pixel value change of the signal input to the mixing function.
The transmission error concealment processing device according to claim 3,
wherein the mixing rate set to a value corresponding to the temporal pixel value change of the signal is set according to a determination result obtained by estimating the amount of motion for each region into which the screen is divided and determining whether each region is a still region or a moving region.
A transmission error concealment processing method in a system that receives encoded data hierarchized into two or more layers and decodes and reproduces the received encoded data, the method comprising:
a step of storing a decoded signal obtained by decoding the encoded data for each layer;
an interpolation signal generation step of, when a decoded signal of a desired layer at a time at which reproduction is required cannot be obtained due to a transmission error, reading one or more of the stored decoded signals, inputting the read decoded signals to a mixing function and mixing them at a set mixing rate to generate a mixed signal, and using the generated mixed signal as an artificially generated interpolation signal of the desired layer at the time at which the reproduction is required; and
a reproduction video output step of outputting the interpolation signal as a signal for reproduction at the time at which the reproduction is required.
The transmission error concealment processing method according to claim 5, further comprising
a step of storing the interpolation signal,
wherein, in the interpolation signal generation step, one or more of the stored interpolation signals are read and input to the mixing function together with the read decoded signals, thereby generating the mixed signal in which the one or more decoded signals and the one or more interpolation signals are mixed, the mixed signal being used as the interpolation signal of the desired layer.
The transmission error concealment processing method according to claim 5 or 6,
wherein the mixing rate is set to a higher value as the signal input to the mixing function is closer to the time at which the reproduction is required, to a higher value as the signal input to the mixing function is closer to the desired layer, to a higher value as the estimated image quality value of the signal input to the mixing function is higher, or to a value corresponding to a temporal pixel value change of the signal input to the mixing function.
The transmission error concealment processing method according to claim 7,
wherein the mixing rate set to a value corresponding to the temporal pixel value change of the signal is set according to a determination result obtained by estimating the amount of motion for each region into which the screen is divided and determining whether each region is a still region or a moving region.
A transmission error concealment processing program for causing a computer to execute the transmission error concealment processing method according to any one of claims 5 to 8.