WO2021007702A1 - Video encoding method, video decoding method, video encoding device, and video decoding device


Info

Publication number
WO2021007702A1
Authority
WIPO (PCT)
Application number
PCT/CN2019/095782
Other languages
French (fr)
Inventor
Sato Kazushi
Original Assignee
Huawei Technologies Co., Ltd.
Application filed by Huawei Technologies Co., Ltd.
Priority to PCT/CN2019/095782
Publication of WO2021007702A1

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/31: Coding using hierarchical techniques, e.g. scalability, in the temporal domain
    • H04N19/40: Coding using video transcoding, i.e. partial or full decoding of a coded input stream followed by re-encoding of the decoded output stream
    • H04N19/70: Coding characterised by syntax aspects related to video coding, e.g. related to compression standards
    • H04N21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/23418: Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics
    • H04N21/234381: Reformatting of video signals for distribution or compliance with end-user requests or end-user device requirements by altering the temporal resolution, e.g. decreasing the frame rate by frame skipping
    • H04N21/8549: Creating video summaries, e.g. movie trailer

Definitions

  • Fig. 1 is a diagram showing the GOP structure of a typical bitstream.
  • Fig. 2 is a diagram showing an example of the GOP structure of a bitstream which is determined in one embodiment of the present invention.
  • Fig. 3 is a diagram showing an example of the functional blocks of a video encoding device according to one embodiment of the present invention.
  • Fig. 4 is a diagram showing an example of the configuration of a video encoder unit according to one embodiment of the present invention.
  • Fig. 5 is a diagram showing an example of the functional blocks of a video decoding device according to one embodiment of the present invention.
  • Fig. 6 is a diagram showing an example of the configuration of a video decoder unit according to one embodiment of the present invention.
  • Fig. 7 is a flowchart illustrating an example of video encoding processing according to one embodiment of the present invention.
  • Fig. 8 is a flowchart illustrating an example of video decoding processing according to one embodiment of the present invention.
  • Fig. 9 is a diagram showing an example of the functional blocks of a cloud server according to one embodiment of the present invention.
  • Fig. 10 is a flowchart illustrating an example of video transcoding processing according to one embodiment of the present invention.
  • a video encoding device determines which pictures are to be dropped to generate a hyper-lapse video before encoding a video into a bitstream, so that a GOP (Group of Picture) structure may be determined dynamically.
  • a video decoding device may drop a reference B picture, a non-reference B picture, or a P picture according to the playback speed of a hyper-lapse video, based on temporal hierarchy of a bitstream.
  • metadata for generating a hyper-lapse video from a bitstream received at a video decoding device may be transmitted in SEI (Supplemental Enhancement Information) or another syntax element of the bitstream from a video encoding device to the video decoding device.
  • The latest video coding standards such as H.264 and H.265 support, in addition to conventional I pictures and P pictures, both reference B pictures and non-reference B pictures, which enable temporal scalability (for example, see Gary J. Sullivan, et al., “Overview of the High Efficiency Video Coding (HEVC) Standard”, IEEE Trans. on Circuits and Systems for Video Technology, Vol. 22, No. 12, Dec. 2012).
  • Fig. 1 shows the GOP structure of a plurality of frames included in a typical bitstream.
  • The plurality of frames are generated by encoding image data.
  • Each of the plurality of scenes represented by the image data corresponds to one of the plurality of frames.
  • In Fig. 1, I0 represents an I picture, P4 represents a P picture, B2 represents a reference B picture, and b1 and b3 represent non-reference B pictures.
  • the GOP structure of the typical bitstream shown in Fig. 1 is fixed, so that the GOP structure periodically and repeatedly appears in a bitstream output from the video encoding device.
  • The bitstream shown in Fig. 1 can be decoded at its full frame rate of 30 Hz.
  • The bitstream can also be decoded at a frame rate of 15 Hz by dropping the non-reference B picture b1, the reference B picture B2 and the non-reference B picture b3.
  • the simplest method for creating a hyper-lapse video is to subsample a plurality of frames or a plurality of scenes of an input video uniformly in a temporal direction.
  • When the motion of the camera at the time of shooting a video includes high-frequency components (for example, when the video shakes violently up and down, left and right, or back and forth), the created hyper-lapse video will be uncomfortable to watch. This may happen, for example, when shooting a video with a so-called action camera.
  • the video encoding device analyzes image data to determine which frame (picture) in a plurality of frames corresponding to a plurality of scenes represented by the image data is to be selected or dropped in order to generate a hyper-lapse video.
  • the video encoding device encodes the image data based on a result of the analysis to generate a plurality of frames corresponding to a plurality of scenes.
  • the video encoding device dynamically determines the GOP (Group of Picture) structure based on the result of the analysis of the image data.
  • Assume that the original frame rate of the bitstream is f0 Hz and that the frame rates of hyper-lapse videos that can be generated from the bitstream are f1 Hz, f2 Hz and f3 Hz (f0 > f1 > f2 > f3).
  • The GOP structure is dynamically determined in such a way that only I pictures, P pictures and non-reference B pictures are decoded when a user wants to view a hyper-lapse video with a frame rate of f1 Hz; only I pictures and P pictures are decoded for a frame rate of f2 Hz; and only I pictures are decoded for a frame rate of f3 Hz.
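As an illustration of this tiering, a minimal sketch (the names and structure are assumptions, not syntax from the patent) can express the relation between a requested playback rate and the picture types that must be decoded as a lookup table:

```python
# Illustrative sketch (assumed names, not syntax from the patent) of the
# frame-rate tiering described above, with f0 > f1 > f2 > f3.
from enum import Enum

class PicType(Enum):
    I = "I"                  # intra picture
    P = "P"                  # predictive picture
    REF_B = "refB"           # reference B picture
    NONREF_B = "nonrefB"     # non-reference B picture

# Picture types that must be decoded for each supported playback rate.
DECODED_TYPES = {
    "f0": {PicType.I, PicType.P, PicType.REF_B, PicType.NONREF_B},  # original video
    "f1": {PicType.I, PicType.P, PicType.NONREF_B},  # reference B pictures dropped
    "f2": {PicType.I, PicType.P},                    # all B pictures dropped
    "f3": {PicType.I},                               # I pictures only
}

def must_decode(pic_type: PicType, rate: str) -> bool:
    """Return True if a picture of this type is needed at the given rate tier."""
    return pic_type in DECODED_TYPES[rate]
```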
  • Fig. 2 shows a GOP structure that is determined dynamically.
  • the GOP structure shown in Fig. 2 need not be periodic, and which frame is to be dropped usually differs for each GOP.
  • The video encoding device may be implemented with a known algorithm for analyzing image data (see, for example, Neel Joshi, et al., “Real-Time Hyperlapse Creation via Optimal Frame Selection,” ACM Transactions on Graphics, 34, July 2015) to dynamically determine a GOP structure.
  • the known algorithm generally includes three steps.
  • (1) In a frame matching step, feature quantity based sparse estimation is used to estimate how well temporally adjacent frames can be aligned, and the calculated costs are stored as a sparse matrix.
  • (2) In a frame selection step, dynamic time warping (DTW) is used to find an optimal frame path that trades off matching a target playback rate against minimizing the motion between selected frames.
  • (3) In a path smoothing and rendering step, a hyper-lapse video is rendered from the selected frames by smoothing the camera path to obtain a stabilized result.
  • Image data analysis including (1) the frame matching step, (2) the frame selection step, and (3) the path smoothing and rendering step makes it possible, for each of the frame rates f1 Hz, f2 Hz and f3 Hz, to find optimal frames that minimize the motion between frames, and thereby to determine a GOP structure. That is, it is possible to determine whether any of a plurality of scenes represented by image data is to be dropped at the time of generating a hyper-lapse video in a video decoding device, and to determine a GOP structure of the plurality of frames corresponding to the plurality of scenes.
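The frame selection step can be sketched as a shortest-path dynamic program in the spirit of Joshi et al. The simplified cost model below (a precomputed match-cost matrix from step (1) plus a velocity penalty toward the target speed-up) is an assumption for illustration, not the paper's full formulation:

```python
# Simplified dynamic-programming sketch of the frame selection step, in the
# spirit of Joshi et al. (2015). match_cost[i][j] (how poorly frame j aligns
# with frame i) is assumed precomputed in the frame matching step; the
# velocity penalty and the `lam` weight are illustrative simplifications.
import math

def select_frames(match_cost, n_frames, target_speed, max_skip=8, lam=0.1):
    """Return frame indices forming a minimum-cost path from frame 0 to n-1."""
    cost = [math.inf] * n_frames     # best path cost ending at each frame
    prev = [-1] * n_frames           # back-pointers for path recovery
    cost[0] = 0.0
    for j in range(1, n_frames):
        for i in range(max(0, j - max_skip), j):
            # alignment cost plus a penalty for deviating from the target speed-up
            c = cost[i] + match_cost[i][j] + lam * (j - i - target_speed) ** 2
            if c < cost[j]:
                cost[j], prev[j] = c, i
    path, k = [], n_frames - 1       # walk the back-pointers from the last frame
    while k != -1:
        path.append(k)
        k = prev[k]
    return path[::-1]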
  • the video encoding device can encode image data based on the result of analysis of the image data and generate a plurality of frames having the determined GOP structure.
  • the plurality of frames generated can be output in such a form as to be included in a bitstream.
  • Complex scene analysis, that is, complex analysis of image data, is necessary only in the video encoding device and is unnecessary in the video decoding device.
  • data provided by a gyro sensor is useful for complex scene analysis. Such data is available when a video is captured by a camera (provided in the video encoding device) , but not available for a display (provided in the video decoding device) . Therefore, it is useful to dynamically determine a GOP structure in the video encoding device for displaying a bitstream as a hyper-lapse video.
  • The video encoding device is configured in such a way that one or more frame rates (for example, f1 Hz, f2 Hz and f3 Hz) of hyper-lapse videos which can be selected by a user in the video decoding device are set in the video encoding device.
  • the video encoding device may be configured in such a way that a user selectively sets one or more frame rates of hyper-lapse videos.
  • Fig. 3 shows the functional blocks of the video encoding device of this embodiment.
  • the video encoding device 10 includes a scene analysis unit 11, a GOP structure setting unit 12, and a video encoder unit 13.
  • Before encoding the image data of an input video, the scene analysis unit 11 analyzes the image data representing a plurality of scenes to find, for each of the hyper-lapse frame rates f1 Hz, f2 Hz and f3 Hz, an optimal frame among the plurality of frames corresponding to the plurality of scenes.
  • The GOP structure setting unit 12 determines, according to the result of the analysis performed by the scene analysis unit 11, which frame (picture) of the plurality of scenes represented by the image data is to be used or dropped in the video decoding device in order to generate a hyper-lapse video, and determines and sets the GOP structure that the plurality of frames corresponding to the plurality of scenes will have when the image data is encoded.
  • the GOP structure setting unit 12 also supplies the video encoder unit 13 with metadata indicating information about which frame (picture) is to be used or dropped in order to generate a hyper-lapse video.
  • the metadata indicates one or more frame rates for a hyper-lapse video supported by a bitstream including a plurality of encoded frames.
  • Each of the one or more frame rates is associated with a frame to be used or a frame to be dropped among the plurality of frames.
  • An optimal frame for generating a hyper-lapse video with a frame rate of f3 Hz is set to be encoded as an I picture.
  • Frames that are optimal for generating the hyper-lapse video with a frame rate of f2 Hz, other than those assigned for f3 Hz, are set to be encoded as P pictures.
  • Frames that are optimal for generating the hyper-lapse video with a frame rate of f1 Hz, other than those assigned for f3 Hz and f2 Hz, are set to be encoded as non-reference B pictures.
  • The remaining frames are set to be encoded as reference B pictures.
  • a GOP structure is dynamically set according to the result of the analysis of the image data.
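A hedged sketch of this assignment follows, assuming (as the tiering implies) that the optimal frame sets are nested, so that every f3-optimal frame is also f2- and f1-optimal; all names are illustrative:

```python
# Hedged sketch of the dynamic GOP assignment described above. sel_f1, sel_f2
# and sel_f3 are the per-rate optimal frame index sets from scene analysis,
# assumed nested (sel_f3 within sel_f2 within sel_f1); names are illustrative.
def assign_picture_types(n_frames, sel_f1, sel_f2, sel_f3):
    """Map every frame index to a picture type consistent with the tiering."""
    types = {}
    for i in range(n_frames):
        if i in sel_f3:
            types[i] = "I"        # survives down to the slowest rate f3
        elif i in sel_f2:
            types[i] = "P"        # needed at f2, f1 and f0; dropped at f3
        elif i in sel_f1:
            types[i] = "nonrefB"  # decoded only at f1 and the original rate
        else:
            types[i] = "refB"     # decoded only at the original rate f0
    return types
```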
  • the video encoder unit 13 encodes the input image data according to the GOP structure set by the GOP structure setting unit 12 to generate a plurality of frames corresponding to a plurality of scenes, and outputs a bitstream including the generated plurality of frames.
  • Fig. 4 shows one example of the configuration of the video encoder unit 13.
  • The video encoder unit 13 includes a re-ordering buffer 311, a subtracter 312, a transform unit 313, a quantizer 314, an entropy coding unit 315, and a buffer 316.
  • the video encoder unit 13 further includes a rate controller 318, an inverse quantizer 319, an inverse transform unit 320, an adder 321, a loop filter 322, a memory 323, an intra prediction unit 324, and an inter prediction unit 325.
  • the re-ordering buffer 311 re-orders input video data (image data) according to the GOP structure set by the GOP structure setting unit 12.
  • the re-ordered image data is output to the subtracter 312.
  • Image data input from the re-ordering buffer 311 and predictive image data from the intra prediction unit 324 or the inter prediction unit 325 are supplied to the subtracter 312.
  • the subtracter 312 calculates prediction error data which is the difference between the image data from the re-ordering buffer 311 and the predictive image data, and outputs the calculated prediction error data to the transform unit 313.
  • the transform unit 313 performs transform on the prediction error data input from the subtracter 312, and generates transform coefficient data which is the result of transform of a pixel region in the image to a frequency region.
  • the generated transform coefficient data is output to the quantizer 314.
  • The transform that is performed by the transform unit 313 may be, for example, a discrete cosine transform (DCT), a Karhunen-Loève transform or the like.
  • the quantizer 314 performs quantization on the transform coefficient data output from the transform unit 313.
  • the quantized transform coefficient data is output to the entropy coding unit 315 and the inverse quantizer 319.
  • the bit rate of the quantized data output from the quantizer 314 is controlled based on a rate control signal from the rate controller 318.
  • The entropy coding unit 315 codes the metadata, which is supplied from the GOP structure setting unit 12 and specifies which frame (picture) is to be used or dropped in order to generate a hyper-lapse video, into a syntax element associated with the image data, such as SEI (Supplemental Enhancement Information), SPS (Sequence Parameter Set), PPS (Picture Parameter Set) or VUI (Video Usability Information).
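The patent does not define a payload layout. As one possibility, such metadata could travel in a user_data_unregistered SEI message; the UUID and field layout in the sketch below are assumptions for illustration only:

```python
# Sketch of one possible packaging of the metadata in a user_data_unregistered
# SEI message. The UUID and field layout are assumptions for illustration; the
# patent only says that the metadata lists the supported hyper-lapse frame rates.
import struct
import uuid

HYPERLAPSE_UUID = uuid.UUID("00000000-0000-0000-0000-000000000000")  # placeholder

def build_hyperlapse_sei_payload(supported_rates_hz):
    """Serialize the supported frame rates (e.g. [f1, f2, f3]) into a payload."""
    payload = HYPERLAPSE_UUID.bytes                        # 16-byte identifier
    payload += struct.pack(">B", len(supported_rates_hz))  # number of rates
    for rate in supported_rates_hz:
        payload += struct.pack(">H", rate)                 # each rate in Hz
    return payload
```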
  • the entropy coding unit 315 performs entropy coding on quantized data to generate a bitstream including the coded plurality of frames. Coding by the entropy coding unit 315 may be, for example, variable length coding, arithmetic coding or the like.
  • The buffer 316 temporarily stores the bitstream output from the entropy coding unit 315.
  • the buffer 316 then outputs the stored bitstream at a rate matching the bandwidth of the transmission path to the video decoding device.
  • the buffer 316 may be constituted by a recording medium such as a semiconductor memory.
  • the rate controller 318 monitors the free area of the buffer 316. Then, the rate controller 318 generates a rate control signal according to the free area of the buffer 316, and outputs the generated rate control signal to the quantizer 314. When the free area of the buffer 316 is small, for example, the rate controller 318 generates the rate control signal to reduce the bit rate for quantized data. When the free area of the buffer 316 is sufficiently large, on the other hand, the rate controller 318 generates the rate control signal to increase the bit rate for quantized data.
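A toy sketch of this feedback loop follows; the thresholds and step size are illustrative assumptions, not values from the patent:

```python
# Toy sketch of the buffer-fullness feedback described above; the thresholds
# and step size are illustrative assumptions, not values from the patent.
def rate_control_signal(free_bytes, buffer_size_bytes, current_bitrate, step=0.1):
    """Raise or lower the target bitrate based on the buffer's free area."""
    fullness = 1.0 - free_bytes / buffer_size_bytes
    if fullness > 0.8:                  # little free area: reduce the bit rate
        return current_bitrate * (1.0 - step)
    if fullness < 0.2:                  # ample free area: increase the bit rate
        return current_bitrate * (1.0 + step)
    return current_bitrate              # otherwise keep the current bit rate
```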
  • the inverse quantizer 319 performs inverse quantization on the quantized data input from the quantizer 314.
  • the inverse quantizer 319 then outputs the transform coefficient data obtained through the inverse quantization to the inverse transform unit 320.
  • The inverse transform unit 320 performs inverse transform on the transform coefficient data input from the inverse quantizer 319 to restore prediction error data.
  • the inverse transform unit 320 then outputs the restored prediction error data to the adder 321.
  • the adder 321 generates decoded image data by adding the restored prediction error data input from the inverse transform unit 320 and the predictive image data input from the intra prediction unit 324 or the inter prediction unit 325 together.
  • the generated decoded image data is output to the loop filter 322 and the memory 323.
  • the loop filter 322 performs filtering to reduce coding distortion which is caused at the time of coding an image.
  • the loop filter 322 eliminates the coding distortion by filtering the decoded image data input from the adder 321, and outputs the filtered decoded image data to the memory 323.
  • the memory 323 stores the decoded image data input from the adder 321 and the filtered decoded image data input from the loop filter 322.
  • the memory 323 may be constituted by, for example, a recording medium such as a semiconductor memory.
  • the memory 323 supplies the decoded image data before filtering, which is used for intra prediction, as reference image data to the intra prediction unit 324, or supplies the filtered decoded image data, which is used for inter prediction, as reference image data to the inter prediction unit 325.
  • the intra prediction unit 324 performs intra prediction in each intra prediction mode based on the image data to be coded, input from the re-ordering buffer 311, and the decoded image data supplied from the memory 323. For example, the intra prediction unit 324 evaluates the result of the prediction in each intra prediction mode by using a predetermined cost function. The intra prediction unit 324 then selects the intra prediction mode that minimizes the cost function value, i.e., the intra prediction mode that maximizes the compression rate, as an optimal intra prediction mode. Further, the intra prediction unit 324 outputs information related to intra prediction, such as the prediction mode information indicating the optimal intra prediction mode, the predictive image data and the cost function value.
  • the inter prediction unit 325 performs inter prediction (interframe prediction) based on the image data to be coded, input from the re-ordering buffer 311, and the decoded image data supplied from the memory 323. For example, the inter prediction unit 325 evaluates the result of the prediction in each inter prediction mode by using a predetermined cost function. The inter prediction unit 325 then selects the inter prediction mode that minimizes the cost function value, i.e., the inter prediction mode that maximizes the compression rate, as an optimal inter prediction mode. Further, the inter prediction unit 325 generates predictive image data according to the optimal inter prediction mode. Then, the inter prediction unit 325 outputs information related to inter prediction, such as the prediction mode information representing the optimal inter prediction mode, the predictive image data, the cost function value, and the motion vector.
  • the cost function value related to intra prediction output from the intra prediction unit 324 is compared with the cost function value related to inter prediction output from the inter prediction unit 325 to select the intra prediction or the inter prediction, whichever provides a smaller cost function value.
  • When the intra prediction is selected, the information related to intra prediction is output to the entropy coding unit 315, and the predictive image data is output to the subtracter 312 and the adder 321.
  • When the inter prediction is selected, on the other hand, the information related to inter prediction is output to the entropy coding unit 315, and the predictive image data is output to the subtracter 312 and the adder 321.
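The patent leaves the "predetermined cost function" open; one common choice is a rate-distortion cost J = D + lambda * R. The sketch below, with illustrative names, shows such a cost and the comparison that picks the prediction mode:

```python
# Sketch of a common choice for the "predetermined cost function" (a
# rate-distortion cost) and of the mode comparison described above.
# All names are illustrative assumptions.
def rd_cost(distortion, rate_bits, lam):
    """Rate-distortion cost J = D + lambda * R; lower J means a better mode."""
    return distortion + lam * rate_bits

def select_prediction(intra_cost, intra_pred, inter_cost, inter_pred):
    """Pick intra or inter prediction, whichever has the smaller cost value."""
    if intra_cost <= inter_cost:
        return "intra", intra_pred   # intra info goes to the entropy coding unit
    return "inter", inter_pred       # inter info (incl. motion vectors) is coded
```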
  • the video decoding device of the present embodiment is configured in such a way that the rate of frames to be displayed is set by a user, and a video with the original frame rate or a hyper-lapse video is output according to the set display frame rate.
  • When a hyper-lapse video is to be output, unnecessary frames are dropped during re-ordering in the video decoding device and will not be decoded. Therefore, redundant calculation can be avoided.
  • Fig. 5 shows the functional blocks of the video decoding device of the present embodiment.
  • the video decoding device 50 includes a bitstream buffer 51, a display frame rate setting unit 52, a frame dropping unit 53, and a video decoder unit 55.
  • The bitstream buffer 51 temporarily stores an input bitstream input from the video encoding device 10 over the transmission path 3.
  • The bitstream buffer 51 may be constituted by, for example, a recording medium such as a semiconductor memory.
  • The bitstream buffer 51 supplies metadata coded into, for example, a syntax element within SEI to the display frame rate setting unit 52.
  • the display frame rate setting unit 52 presents selectable frame rates for a hyper-lapse video to a user based on the supplied metadata.
  • the display frame rate setting unit 52 receives, from the user, selection on whether a video with the original frame rate or a hyper-lapse video is displayed.
  • the display frame rate setting unit 52 receives, from the user, selection of a frame rate from selectable frame rates for a hyper-lapse video.
  • the display frame rate setting unit 52 determines a frame rate for displaying a video.
  • the display frame rate setting unit 52 supplies the rate of the frame to be displayed, which is determined according to the selection made by the user, to the frame dropping unit 53.
  • the frame dropping unit 53 drops a frame from the bitstream stored in the bitstream buffer 51 according to the rate of the frame to be displayed, which is determined according to the selection made by the user, and supplies the resultant bitstream to the video decoder unit 55.
  • the frame to be dropped here is the frame that is associated with the frame rate selected by the user from among one or more frame rates indicated by the metadata. When the user selects display of a video with the original frames, the frame dropping unit 53 does not drop frames.
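Continuing the earlier sketches, the frame dropping unit can be modeled as a filter over buffered access units; the `AccessUnit` container with a `pic_type` attribute and the `DECODED_TYPES` table are assumptions:

```python
# Sketch of the frame dropping unit as a filter over buffered access units,
# reusing the DECODED_TYPES table sketched earlier. AccessUnit objects with a
# pic_type attribute are an assumed container, not a real decoder API.
def drop_frames(access_units, rate_tier, decoded_types):
    """Yield only the access units whose picture type is decoded at this tier."""
    needed = decoded_types[rate_tier]
    for au in access_units:
        if au.pic_type in needed:      # frames outside the set are never decoded
            yield au
```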
  • the video decoder unit 55 restores the original video or a hyper-lapse video from the bitstream supplied from the frame dropping unit 53.
  • Fig. 6 shows an example of the configuration of the video decoder unit 55.
  • the video decoder unit 55 includes an entropy decoding unit 552, an inverse quantizer 553, an inverse transform unit 554, an adder 555, a loop filter 556, a re-ordering buffer 557, a memory 558, an intra prediction unit 559, and an inter prediction unit 560.
  • the video decoder unit 55 basically performs a process inverse to the process performed by the video encoder unit 13 to restore video data.
  • the entropy decoding unit 552 decodes the input bitstream input from the bitstream buffer 51.
  • the entropy decoding unit 552 refers to syntax elements such as SPS and PPS.
  • the entropy decoding unit 552 decodes information multiplexed into the header area of the input bitstream.
  • the inverse quantizer 553 and the inverse transform unit 554 generate prediction error data by performing inverse quantization and inverse transform on quantized data input from the entropy decoding unit 552.
  • the inverse transform unit 554 outputs the generated prediction error data to the adder 555.
  • the inverse quantizer 553 and the inverse transform unit 554 perform processes inverse to the processes performed by the quantizer 314 and the transform unit 313 in the video encoding device 10. That is, the inverse quantizer 553 and the inverse transform unit 554 perform inverse quantization and inverse transform by using the SPS and the PPS corresponding to a sequence or a picture to be processed.
  • the adder 555 generates decoded image data by adding the prediction error data input from the inverse transform unit 554 and the predictive image data input from the intra prediction unit 559 or the inter prediction unit 560.
  • the generated decoded image data is output to the loop filter 556 and the memory 558.
  • the loop filter 556 eliminates the coding distortion by filtering the decoded image data input from the adder 555, and outputs the filtered decoded image data to the re-ordering buffer 557 and the memory 558.
  • the re-ordering buffer 557 re-orders the images input from the loop filter 556 to generate a sequence of time-sequential image data.
  • the image data generated by the re-ordering buffer 557 is output as a video with the original frame rate or a hyper-lapse video.
  • the memory 558 stores the unfiltered, decoded image data input from the adder 555 and the filtered decoded image data input from the loop filter 556.
  • the memory 558 may be constituted by, for example, a recording medium such as a semiconductor memory.
  • Fig. 7 shows one example of the video encoding processing in the video encoding device 10.
  • The video encoding device 10 determines frame rates (for example, f1 Hz, f2 Hz and f3 Hz) of frames to be displayed as a hyper-lapse video (step 21).
  • The frame rates for displaying frames as a hyper-lapse video may be preset in the video encoding device 10 or may be input or selected by a user.
  • The video encoding device 10 (for example, the scene analysis unit 11) performs scene analysis on the input video (step 23).
  • image data representing a plurality of scenes is analyzed.
  • The video encoding device 10 finds, for each of the hyper-lapse frame rates f1 Hz, f2 Hz and f3 Hz, an optimal frame for minimizing the motion between frames from among the plurality of frames corresponding to the plurality of scenes.
  • The video encoding device 10 may perform the above-described (1) frame matching step, (2) frame selection step, and (3) path smoothing and rendering step for each of the frame rates determined at step 21 to find the optimal frame for each of the frame rates.
  • The video encoding device 10 determines, according to the result of the scene analysis performed at step 23 (that is, the result of the analysis of the image data), which frame (picture) of the plurality of scenes represented by the image data is to be used or dropped in the video decoding device 50 to generate a hyper-lapse video, and determines the GOP structure that the plurality of frames corresponding to the plurality of scenes will have when the image data is encoded (step 25).
  • the video encoding device 10 (for example, GOP structure setting unit 12) generates metadata for generating a hyper-lapse video (step 27) .
  • The metadata indicates one or more frame rates supported for the hyper-lapse video, and indicates which frame (picture) among the plurality of frames corresponding to the plurality of scenes is to be used or dropped. Each of the one or more frame rates is associated with a frame to be used or a frame to be dropped among the plurality of frames.
  • The video encoding device 10 (for example, the video encoder unit 13) encodes the input image data according to the GOP structure determined at step 25 to generate a plurality of frames corresponding to the plurality of scenes, encodes the metadata into a syntax element such as SEI, SPS, PPS or VUI, and outputs a bitstream including the generated plurality of frames and the syntax element (step 29).
  • Fig. 8 shows one example of the video decoding processing in the video decoding device 50.
  • the video decoding device 50 receives metadata (step 61) .
  • the metadata may be received in an input bitstream input from the video encoding device 10 over a transmission path.
  • The receiving of the metadata may include decoding the metadata encoded into an SEI syntax element or another syntax element. The metadata indicates which frame (picture) is to be used or dropped in order to generate a hyper-lapse video.
  • the video decoding device 50 reads the GOP structure for the input bitstream (step 63) .
  • the GOP structure may be read from a syntax element such as SPS, PPS or VUI.
  • the video decoding device 50 determines a frame rate for displaying a video (step 65) .
  • the determining the frame rate may include presenting selectable frame rates for a hyper-lapse video to a user based on the metadata, and receiving selection on displaying a video with the original frame rate or a hyper-lapse video from the user.
  • the receiving selection on displaying a hyper-lapse video includes receiving selection of a frame rate from selectable frame rates for a hyper-lapse video from the user.
  • the display frame rate setting unit 52 determines a frame rate for displaying a video.
  • the video decoding device 50 determines whether or not to display a hyper-lapse video based on the determined frame rate (step 67) . When it is determined to display a hyper-lapse video, the processing proceeds to step 69. When it is not determined to display a hyper-lapse video, that is, when it is determined that a video with the original frame rate is to be displayed, the processing proceeds to step 71.
  • the video decoding device 50 drops unnecessary frames from the bitstream stored in the bitstream buffer 51 according to the frame rate determined at step 65 (step 69) .
  • the frame to be dropped here is the frame that is associated with the frame rate selected by the user from among one or more frame rates indicated by the metadata.
  • the remaining bitstream is supplied to the video decoder unit 55.
  • The reference B picture is dropped when the frame rate determined at step 65 is f1 Hz.
  • The reference B picture and the non-reference B picture are dropped when the determined frame rate is f2 Hz.
  • The P picture, the reference B picture and the non-reference B picture are dropped when the determined frame rate is f3 Hz.
  • the video decoding device 50 sequentially decodes the remaining bitstream which has not been dropped at step 69 for each frame (step 71) .
  • the video decoding device 50 decodes the original video or a hyper-lapse video according to the frame rate determined at step 65.
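Putting the decoding steps together, a hedged sketch reusing `drop_frames` and `DECODED_TYPES` from above follows; the parser, decoder and tier-selection callables are assumed interfaces, not a real codec API:

```python
# Hedged sketch of the decoding flow of Fig. 8 (steps 61-71), reusing
# drop_frames and DECODED_TYPES from the earlier sketches. The parser,
# decoder and choose_tier callables are assumed interfaces.
def decode_for_display(bitstream, parser, decoder, choose_tier):
    metadata = parser.read_sei_metadata(bitstream)     # step 61: supported rates
    parser.read_gop_structure(bitstream)               # step 63: GOP structure
    tier = choose_tier(metadata.supported_tiers)       # step 65: e.g. "f0".."f3"
    units = parser.access_units(bitstream)
    if tier != "f0":                                   # steps 67/69: hyper-lapse
        units = drop_frames(units, tier, DECODED_TYPES)
    return [decoder.decode(au) for au in units]        # step 71: decode the rest
```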
  • A cloud server may serve as another implementation form of the present invention; the cloud server includes the scene analysis unit 11, the GOP structure setting unit 12 and the video encoder unit 13, which are included in the video encoding device 10, together with a video decoder unit that decodes a video encoded using a fixed GOP structure and supplies the decoded video to the scene analysis unit 11.
  • Image data encoded using the fixed GOP structure (that is, a plurality of frames having the fixed GOP structure) is transmitted over a network to the cloud server to be transcoded therein.
  • Fig. 9 is a diagram showing an example of the functional blocks of a cloud server according to one embodiment of the present invention.
  • the cloud server 110 includes a video decoder unit 111, a scene analysis unit 11, a GOP structure setting unit 12, and a video encoder unit 13.
  • A bitstream including image data encoded using the fixed GOP structure is input to the cloud server 110 via a network from a camera or a smartphone.
  • A bitstream including the transcoded frames is output and transmitted via the network to a smartphone, a personal computer, or a television.
  • Fig. 10 is a flowchart illustrating an example of video transcoding processing according to one embodiment of the present invention.
  • The cloud server 110 may determine frame rates (for example, f1 Hz, f2 Hz and f3 Hz) of frames to be displayed as a hyper-lapse video (step 21).
  • The frame rates for displaying frames as a hyper-lapse video may be preset in the cloud server 110 or may be input or selected by a user.
  • The video decoder unit 111 included in the cloud server decodes the input bitstream to restore the image data, and the scene analysis unit 11 included in the cloud server analyzes the restored image data (step 23).
  • The GOP structure setting unit 12 included in the cloud server dynamically determines a GOP structure of a plurality of frames for generating a hyper-lapse video according to the result of the analysis of the image data (step 25).
  • The video encoder unit 13 included in the cloud server encodes the restored image data according to the determined GOP structure and generates a transcoded bitstream (step 27). That is, the plurality of frames having the fixed GOP structure is transcoded into a plurality of frames having the dynamically determined GOP structure, and a bitstream including the plurality of transcoded frames and the metadata is transmitted to the video decoding device 50 described above over the network (step 29).
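The transcoding flow of Fig. 10 can then be sketched end to end, reusing the illustrative helpers above; the decoder, encoder and scene-analysis objects are assumed interfaces, not a real codec API:

```python
# End-to-end sketch of the transcoding flow of Fig. 10 (steps 21-29), reusing
# the illustrative helpers above. decoder, encoder and analyze_scenes are
# assumed interfaces, not a real codec API.
def transcode(input_bitstream, rates_hz, decoder, encoder, analyze_scenes):
    frames = decoder.decode_all(input_bitstream)               # restore image data
    sel_f1, sel_f2, sel_f3 = analyze_scenes(frames, rates_hz)  # per-rate optimal frames
    gop = assign_picture_types(len(frames), sel_f1, sel_f2, sel_f3)
    sei = build_hyperlapse_sei_payload(rates_hz)               # supported-rates metadata
    return encoder.encode(frames, gop_structure=gop, sei_payloads=[sei])
```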

Abstract

A method for avoiding redundant calculation when creating a hyper-lapse video from an encoded bitstream is provided. The video encoding method includes: performing scene analysis on an input bitstream to determine a frame to be dropped at a time of generating a hyper-lapse video in a video decoding device; and encoding the bitstream subjected to the scene analysis and outputting the encoded bitstream. The video encoding method further includes dynamically determining a GOP (Group of Picture) structure of a bitstream to be output based on the scene analysis.

Description

VIDEO ENCODING METHOD, VIDEO DECODING METHOD, VIDEO ENCODING DEVICE, AND VIDEO DECODING DEVICE

Technical Field
The present application relates to a video encoding method, a video decoding method, a video encoding device, and a video decoding device.
Background Art
As inexpensive and high-quality cameras become widely available in the market, users tend to shoot fairly long videos. For example, a user shoots a long video while moving a long distance. If there is not enough time to see the entire long video shot, the user can watch a hyper-lapse video created from the original long video.
A video decoding device needs redundant calculation to create a hyper-lapse video after decoding a full bitstream including a plurality of frames encoded by a video encoding device. For example, even if a frame (an I picture or a P picture) encoded in a lower temporal hierarchy within a GOP (Group of Picture) structure, as specified in H.264, is dropped at a time of creating a hyper-lapse video, such a frame is referenced when decoding a frame (for example, a reference B picture or a non-reference B picture) encoded in a higher hierarchy, so that the frame cannot be dropped and needs to be decoded. In addition, although the computational capability of a receiving device (including a video decoding device) that receives and decodes a bitstream is usually limited, scene analysis to create a hyper-lapse video after decoding requires still more calculation.
It is an object of an embodiment of the present invention to avoid the aforementioned redundant calculation when creating a hyper-lapse video from a bitstream including a plurality of frames encoded by a method specified in H.264 or H.265, or a method similar to those methods, which supports both reference B pictures and non-reference B pictures.
Summary of Invention
One embodiment of the present invention is a video encoding method executed by a video encoding device, which includes: analyzing input image data to determine whether any of a plurality of scenes represented by the image data is to be dropped at a time of generating a hyper-lapse video in a video decoding device; and encoding the image data based on a result of the analysis to generate a plurality of frames corresponding to the plurality of scenes, and outputting a bitstream including the generated plurality of frames. The video encoding method according to an embodiment of the present invention allows a video decoding device with a limited computational capability to easily select an optimal frame and display a hyper-lapse video. In an example, the determination is reflected as a picture type (I/P/refB/nonrefB pictures) within a bitstream.
In one aspect, the determination may include dynamically determining a GOP structure of the plurality of frames in the bitstream to be output based on the result of the analysis. The determining of the GOP structure may include assigning a frame to be dropped at a time of generating a hyper-lapse video in the video decoding device to a higher temporal hierarchy within the GOP structure. The video decoding device then does not need redundant computation at the time of dropping some of the frames. In an example, the higher temporal hierarchy corresponds to B pictures such as b1, b2, b3 and b11 shown in Fig. 2.
In another aspect, the bitstream to be output may support a plurality of frame rates for a user to view the bitstream as a hyper-lapse video in the video decoding device. A video encoding method according to this embodiment may further include inserting metadata for decoding the bitstream as a hyper-lapse video into a sequence parameter set (SPS), a picture parameter set (PPS), supplemental enhancement information (SEI), or video usability information (VUI) of the bitstream to be output, so as to be supplied to the video decoding device. The use of an existing syntax allows the metadata to be acquired using the existing functions of a video decoding device. For example, the metadata may indicate that the bitstream supports frame rates of f1, f2 and f3 Hz when it is converted into a hyper-lapse video.
Another embodiment of the present invention is a video transcoding method executed by a server, which includes: decoding a plurality of encoded first frames in an input bitstream, the plurality of encoded first frames having a first GOP (Group of Picture) structure; by analyzing image data representing a plurality of scenes corresponding to the plurality of decoded first frames, determining whether any of the plurality of scenes is to be dropped at a time of generating a hyper-lapse video in a video decoding device; determining a second GOP structure which a plurality of second frames will have, based on a result of the analysis, the plurality of second frames corresponding to the plurality of scenes; transcoding the plurality of decoded first frames having the first GOP structure into the plurality of second frames; and transmitting the plurality of second frames in a bitstream to the video decoding device. The video transcoding method according to this embodiment of the present invention allows a server on a cloud computer to transcode a bitstream including a plurality of frames having a fixed GOP structure into a bitstream including a plurality of frames having a GOP structure that enables a video decoding device with a limited computational capability to display a hyper-lapse video, and to provide the bitstream to the video decoding device. In addition, mobile devices such as cameras and smartphones have limitations in the amount of computation and power consumption. However, with the cloud server, it is possible to perform processing without regard to those limitations. When the video encoding method is implemented by the video encoding device, it is necessary to perform capture, analysis, and encoding of the image in real time. However, when the video transcoding method is implemented by the cloud server, transcoding may be performed by off-line processing after the bitstream is transmitted to the server.
Yet another embodiment of the present invention is a video decoding method executed by a video decoding device, which includes: receiving a bitstream, the bitstream including a plurality of encoded frames and metadata for decoding the plurality of encoded frames as a hyper-lapse video, the metadata indicating one or more frame rates for the hyper-lapse video supported by the bitstream, each of the one or more frame rates being associated with a frame to be used or a frame to be dropped among the plurality of frames; determining a frame rate for displaying a video in the one or more frame rates indicated by the metadata; determining whether or not to display the hyper-lapse video based on the determined frame rate; dropping some of the plurality of encoded frames based on the determined frame rate, on condition that it is determined that the hyper-lapse video is displayed; and decoding remaining frames of the plurality of encoded frames to display the hyper-lapse video. The video decoding method according to this embodiment of the present invention allows a video decoding device with a limited computational capability to easily select an optimal frame and display a hyper-lapse video.
In one aspect, the plurality of encoded frames may be encoded in a GOP structure dynamically determined in a video encoding device. The dropped frame may be encoded in a higher temporal hierarchy within the GOP structure. The dropping of some of the plurality of encoded frames may include dropping a frame encoded in a higher temporal hierarchy within the GOP structure according to the determined frame rate. The video decoding device then does not need redundant computation at the time of dropping some of the frames.
In another aspect, the metadata may be included in a sequence parameter set (SPS) , a picture parameter set (PPS) , supplemental enhancement information (SEI) , or video usability information (VUI) of the bitstream. The use of an existing syntax allows metadata to be acquired using the existing functions of a video decoding device.
In a further aspect, the bitstream may be input from a video encoding device or a server.
Still another embodiment of the present invention is a video encoding device, which includes: a scene analysis unit configured to analyze input image data to determine whether any of a plurality of scenes represented by the image data is to be dropped at a time of generating a hyper-lapse video in a video decoding device; a GOP (Group of Picture) structure setting unit configured to determine a GOP structure of the plurality of frames in a bitstream to be output, based on a result of the analysis; and a video encoder unit configured to, based on the result of the analysis, encode the image data to generate a plurality of frames corresponding to the plurality of scenes, and output a bitstream including the generated plurality of frames, the generated plurality of frames having the determined GOP structure. The video encoding device according to this embodiment of the present invention allows a video decoding  device with a limited computational capability to easily select an optimal frame and display a hyper-lapse video.
Yet still another embodiment of the present invention is a server including: a video decoder unit configured to decode a plurality of encoded first frames in an input bitstream, the plurality of first frames having a first GOP structure; a scene analysis unit configured to analyze image data representing a plurality of scenes corresponding to the plurality of decoded first frames to determine whether any of the plurality of scenes is to be dropped at a time of generating a hyper-lapse video in a video decoding device; a GOP structure setting unit configured to determine a second GOP structure which a plurality of second frames will have, based on a result of the analysis, the plurality of second frames corresponding to the plurality of scenes; and a video encoder unit configured to transcode the plurality of decoded first frames having the first GOP structure into the plurality of second frames, and transmit the plurality of second frames in a bitstream to the video decoding device. The server according to this embodiment of the present invention allows a server on a cloud computer to transcode a bitstream including a plurality of frames having a fixed GOP structure into a bitstream including a plurality of frames having a GOP structure that enables a video decoding device with a limited computational capability to display a hyper-lapse video, and to provide the bitstream to the video decoding device. In addition, mobile devices such as cameras and smartphones are limited in the amount of computation and power consumption, whereas a cloud server can perform the processing without regard to such limitations. When the video encoding method is implemented by the video encoding device, capture, analysis, and encoding of the image must be performed in real time; the cloud server, however, may perform transcoding by off-line processing after the bitstream is transmitted to the server.
Yet still another embodiment of the present invention is a video decoding device including: a bitstream buffer configured to receive a bitstream, the bitstream including a plurality of encoded frames, metadata for decoding the plurality of encoded frames as a hyper-lapse video in the bitstream, and a GOP (Group of Picture) structure which the plurality of encoded frames have, the metadata indicating one or more frame rates for the hyper-lapse video supported by the bitstream, each of the one or more frame rates being associated with a frame to be used or a frame to be dropped among the plurality of frames, the bitstream buffer being also configured to read the GOP structure; a display frame rate setting unit configured to determine a frame rate for displaying a video in the one or more frame rates indicated by the metadata; a frame dropping unit configured to drop some of the plurality of encoded frames based on the determined frame rate; and a video decoder unit configured to decode remaining frames in the plurality of encoded frames to output the hyper-lapse video. The video decoding device according to this embodiment of the present invention can easily select an optimal frame and display  a hyper-lapse video without requiring an additional computational capability.
Brief Description of Drawings
To describe the technical solutions in the embodiments more clearly, the following briefly describes the accompanying drawings required for describing the present embodiments. Apparently, the accompanying drawings in the following description depict merely some of the possible embodiments, and a person of ordinary skill in the art may still derive other drawings, without creative efforts, from these accompanying drawings, in which:
[Fig. 1] Fig. 1 is a diagram showing the GOP structure of a typical bitstream;
[Fig. 2] Fig. 2 is a diagram showing an example of the GOP structure of a bitstream which is determined in one embodiment of the present invention;
[Fig. 3] Fig. 3 is a diagram showing an example of the functional blocks of a video encoding device according to one embodiment of the present invention;
[Fig. 4] Fig. 4 is a diagram showing an example of the configuration of a video encoder unit according to one embodiment of the present invention;
[Fig. 5] Fig. 5 is a diagram showing an example of the functional blocks of a video decoding device according to one embodiment of the present invention;
[Fig. 6] Fig. 6 is a diagram showing an example of the  configuration of a video decoder unit according to one embodiment of the present invention;
[Fig. 7] Fig. 7 is a flowchart illustrating an example of video encoding processing according to one embodiment of the present invention;
[Fig. 8] Fig. 8 is a flowchart illustrating an example of video decoding processing according to one embodiment of the present invention;
[Fig. 9] Fig. 9 is a diagram showing an example of the functional blocks of a cloud server according to one embodiment of the present invention; and
[Fig. 10] Fig. 10 is a flowchart illustrating an example of video transcoding processing according to one embodiment of the present invention.
Description of Embodiments
The following will describe embodiments of the present invention in detail referring to the accompanying drawings. Same or like reference symbols indicate same or like elements to avoid redundant descriptions.
According to one embodiment of the present invention, a video encoding device determines which pictures are to be dropped to generate a hyper-lapse video before encoding a video into a bitstream, so that a GOP (Group of Picture) structure may be determined dynamically.
Also, according to one embodiment of the present invention, in order to generate a hyper-lapse video, a video decoding device may drop a reference B picture, a non-reference B picture, or a P picture according to the playback speed of the hyper-lapse video, based on the temporal hierarchy of a bitstream.
Moreover, according to one embodiment of the present invention, metadata for generating a hyper-lapse video from a bitstream received at a video decoding device may be transmitted in SEI (Supplemental Enhancement Information) or another syntax element of the bitstream from a video encoding device to the video decoding device.
Furthermore, scene analysis and transcoding (modification of the GOP structure) to enable generation of a hyper-lapse video can be performed on a cloud server as well as the video encoding device.
(Video Encoding Standards)
The latest video coding standards such as H.264 and H.265 support both reference B pictures and non-reference B pictures that enable temporal scalability in addition to conventional I pictures and P pictures (for example, see Gary J. Sullivan, et al., “Overview of the High Efficiency Video Coding (HEVC) Standard”, IEEE Trans. on Circuits and Systems for Video Technology, Vol. 22, No. 12, Dec. 2012).
Fig. 1 shows the GOP structure of a plurality of frames included in a typical bitstream. A plurality of frames are generated by encoding image data, and the plurality of scenes represented by the image data correspond to the plurality of frames. In Fig. 1, I0 represents an I picture, P4 represents a P picture, B2 represents a reference B picture, and b1 and b3 represent non-reference B pictures. The GOP structure of the typical bitstream shown in Fig. 1 is fixed, so that the GOP structure appears periodically and repeatedly in a bitstream output from the video encoding device. When the frame rate of the original full bitstream is 60 Hz, the bitstream shown in Fig. 1 can be decoded at a frame rate of 30 Hz by dropping the non-reference B pictures b1 and b3. The bitstream can also be decoded at a frame rate of 15 Hz by dropping the non-reference B picture b1, the reference B picture B2, and the non-reference B picture b3.
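By way of illustration only, the following Python sketch reproduces this fixed-structure dropping. The GOP pattern, the picture-type labels, and the rate-to-type mapping are assumptions taken from the example above, not from any particular codec implementation.

    # Sketch: choosing which frames of the fixed GOP of Fig. 1 to decode.
    GOP = ["I0", "b1", "B2", "b3", "P4", "b5", "B6", "b7"]  # repeats periodically

    def frames_for_rate(gop, target_hz, full_hz=60):
        # 60 Hz: keep all; 30 Hz: drop non-reference B ('b');
        # 15 Hz: drop reference and non-reference B ('B' and 'b').
        if target_hz == full_hz:
            drop = set()
        elif target_hz == full_hz // 2:
            drop = {"b"}
        elif target_hz == full_hz // 4:
            drop = {"b", "B"}
        else:
            raise ValueError("unsupported target rate")
        return [f for f in gop if f[0] not in drop]

    print(frames_for_rate(GOP, 30))  # ['I0', 'B2', 'P4', 'B6']
    print(frames_for_rate(GOP, 15))  # ['I0', 'P4']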
The simplest method for creating a hyper-lapse video is to subsample the frames or scenes of an input video uniformly in the temporal direction. However, when the motion of the camera at the time of shooting includes high-frequency components (for example, when the video shakes violently up and down, left and right, or back and forth), the resulting hyper-lapse video will be uncomfortable to watch. This may happen, for example, when shooting a video with a so-called action camera.
Thus, there is a need for a method for creating a comfortable hyper-lapse video.
(Scene Analysis and Determination of GOP Structure)
The video encoding device according to the present embodiment analyzes image data to determine which frame (picture) in a plurality of frames corresponding to a plurality of scenes represented by the image data is to be selected or dropped in order to generate a hyper-lapse video. The video  encoding device encodes the image data based on a result of the analysis to generate a plurality of frames corresponding to a plurality of scenes. The video encoding device dynamically determines the GOP (Group of Picture) structure based on the result of the analysis of the image data.
It is assumed that the original frame rate of the bitstream is f0 Hz and that the frame rates of hyper-lapse videos that can be generated from the bitstream are f1 Hz, f2 Hz and f3 Hz (f0 > f1 > f2 > f3). The GOP structure is dynamically determined in such a way that only I pictures, P pictures and non-reference B pictures are decoded when a user wants to view a hyper-lapse video with a frame rate of f1 Hz, only I pictures and P pictures are decoded when the user wants to view a hyper-lapse video with a frame rate of f2 Hz, and only I pictures are decoded when the user wants to view a hyper-lapse video with a frame rate of f3 Hz.
Fig. 2 shows a GOP structure that is determined dynamically. The GOP structure shown in Fig. 2 need not be periodic, and which frame is to be dropped usually differs for each GOP.
For example, the video encoding device may analyze image data with a known algorithm (see, for example, Neel Joshi, et al., “Real-Time Hyperlapse Creation via Optimal Frame Selection,” ACM Transactions on Graphics, Vol. 34, July 2015) to dynamically determine a GOP structure.
The known algorithm generally includes three steps. (1) In a frame matching step, feature-quantity-based sparse estimation is used to estimate how well temporally adjacent frames can be aligned, and the calculated costs are stored as a sparse matrix. (2) In a frame selection step, dynamic time warping (DTW) is used to find an optimal frame path that trades off matching a target speed-up rate against minimizing the motion between frames. (3) In a path smoothing and rendering step, a hyper-lapse video is rendered from the selected frames by smoothing the camera path to obtain a stabilized result.
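As a much-simplified Python sketch of the frame selection step (2), the following dynamic-programming search balances a target speed-up against an inter-frame motion cost; the quadratic speed-up penalty and the toy motion cost stand in for the feature-based matching costs of step (1) and are illustrative assumptions only.

    # Simplified DTW-style frame selection: pick a low-motion frame path
    # whose step length stays close to the target speed-up.
    def select_frames(motion_cost, n, target_step, max_step=8, w=1.0):
        INF = float("inf")
        best = [INF] * n        # best[j]: minimal cost of a path ending at j
        prev = [-1] * n
        best[0] = 0.0
        for j in range(1, n):
            for i in range(max(0, j - max_step), j):
                c = best[i] + motion_cost(i, j) + w * (j - i - target_step) ** 2
                if c < best[j]:
                    best[j], prev[j] = c, i
        path, j = [], n - 1     # backtrack from the last frame
        while j != -1:
            path.append(j)
            j = prev[j]
        return path[::-1]

    # Toy motion model: cost grows with the length of the jump.
    print(select_frames(lambda i, j: 0.1 * (j - i), n=20, target_step=4))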
In the video encoding device according to the present embodiment, image data analysis including (1) the frame matching step, (2) the frame selection step, and (3) the path smoothing and rendering step makes it possible to find, for each of the frame rates f1 Hz, f2 Hz and f3 Hz, an optimal frame that minimizes the motion between frames, and thereby to determine a GOP structure. That is, it is possible to determine whether any of a plurality of scenes represented by image data is to be dropped at the time of generating a hyper-lapse video in a video decoding device, and to determine a GOP structure of a plurality of frames corresponding to the plurality of scenes. The video encoding device according to this embodiment can encode the image data based on the result of the analysis of the image data and generate a plurality of frames having the determined GOP structure. The generated plurality of frames can be output in such a form as to be included in a bitstream.
Thus, complex scene analysis, that is, complex analysis of image data, is necessary only in the video encoding device and is unnecessary in the video decoding device. Further, data provided by a gyro sensor is useful for complex scene analysis. Such data is available when a video is captured by a camera (provided in the video encoding device), but is not available at display time (in the video decoding device). Therefore, it is useful to dynamically determine, in the video encoding device, a GOP structure for displaying a bitstream as a hyper-lapse video.
(Configuration of Video Encoding Device)
The video encoding device according to the present embodiment is configured in such a way that one or more frame rates (for example, f1 Hz, f2 Hz and f3 Hz) of hyper-lapse videos which can be selected in the video decoding device by a user are set in the video encoding device. Of course, the video encoding device may be configured in such a way that a user selectively sets one or more frame rates of hyper-lapse videos.
Fig. 3 shows the functional blocks of the video encoding device of this embodiment. As shown in Fig. 3, the video encoding device 10 includes a scene analysis unit 11, a GOP structure setting unit 12, and a video encoder unit 13.
The scene analysis unit 11 analyzes image data representing a plurality of scenes, before the image data of an input video is encoded, to find, for each of the frame rates f1 Hz, f2 Hz and f3 Hz of hyper-lapse videos, an optimal frame among the plurality of frames corresponding to the plurality of scenes.
The GOP structure setting unit 12 determines, according to the result of the analysis of the image data performed by the scene analysis unit 11, which frame (picture) is to be used or dropped in the video decoding device for the plurality of scenes represented by the image data, and thereby determines and sets a GOP structure of the plurality of frames corresponding to the plurality of scenes at the time of encoding the image data into the plurality of frames in order to generate a hyper-lapse video. The GOP structure setting unit 12 also supplies the video encoder unit 13 with metadata indicating which frame (picture) is to be used or dropped in order to generate a hyper-lapse video. The metadata indicates one or more frame rates for a hyper-lapse video supported by a bitstream including a plurality of encoded frames. Each of the one or more frame rates is associated with a frame to be used or a frame to be dropped among the plurality of frames. In this embodiment, an optimal frame for generating a hyper-lapse video with a frame rate of f3 Hz is set to be encoded as an I picture. Of the optimal frames for generating a hyper-lapse video with a frame rate of f2 Hz, frames other than the optimal frame for the frame rate f3 Hz are set to be encoded as P pictures. Of the optimal frames for generating a hyper-lapse video with a frame rate of f1 Hz, frames other than the optimal frames for the frame rates f3 Hz and f2 Hz are set to be encoded as non-reference B pictures. The remaining frames are set to be encoded as reference B pictures. In this manner, a GOP structure is dynamically set according to the result of the analysis of the image data.
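A minimal Python sketch of this picture-type assignment follows; the optimal-frame index sets s1, s2 and s3 (assumed nested, as described) are the frames found by the scene analysis for the rates f1, f2 and f3 Hz respectively.

    def assign_types(num_frames, s1, s2, s3):
        types = []
        for i in range(num_frames):
            if i in s3:
                types.append("I")   # needed even at the lowest rate f3
            elif i in s2:
                types.append("P")   # additionally needed at f2
            elif i in s1:
                types.append("b")   # additionally needed at f1 (non-reference B)
            else:
                types.append("B")   # full-rate video only (reference B)
        return types

    print(assign_types(8, s1={0, 2, 4, 6}, s2={0, 4}, s3={0}))
    # ['I', 'B', 'b', 'B', 'P', 'B', 'b', 'B']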
The video encoder unit 13 encodes the input image data according to the GOP structure set by the GOP structure setting unit 12 to generate a plurality of frames corresponding to a plurality of scenes, and outputs a bitstream including the generated plurality of frames.
(Configuration of Video Encoder Unit)
Fig. 4 shows one example of the configuration of the video encoder unit 13. The video encoder unit 13 includes a re-ordering buffer 311, a subtractor 312, a transform unit 313, a quantizer 314, an entropy coding unit 315, and a buffer 316. The video encoder unit 13 further includes a rate controller 318, an inverse quantizer 319, an inverse transform unit 320, an adder 321, a loop filter 322, a memory 323, an intra prediction unit 324, and an inter prediction unit 325.
The re-ordering buffer 311 re-orders the input video data (image data) according to the GOP structure set by the GOP structure setting unit 12. The re-ordered image data is output to the subtractor 312.
Image data input from the re-ordering buffer 311 and predictive image data from the intra prediction unit 324 or the inter prediction unit 325 are supplied to the subtractor 312. The subtractor 312 calculates prediction error data, which is the difference between the image data from the re-ordering buffer 311 and the predictive image data, and outputs the calculated prediction error data to the transform unit 313.
The transform unit 313 performs a transform on the prediction error data input from the subtractor 312, and generates transform coefficient data, which is the result of transforming a pixel region of the image into a frequency region. The generated transform coefficient data is output to the quantizer 314. The transform performed by the transform unit 313 may be, for example, a discrete cosine transform (DCT), a Karhunen-Loève transform, or the like.
The quantizer 314 performs quantization on the transform coefficient data output from the transform unit 313. The quantized transform coefficient data is output to the entropy coding unit 315 and the inverse quantizer 319. The bit rate of the quantized data output from the quantizer 314 is controlled based on a rate control signal from the rate controller 318.
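As an illustrative sketch of this forward path, the following Python fragment applies a 2-D DCT followed by uniform quantization; the QP-to-step rule is an assumption in the spirit of H.264-style quantizers, not an exact specification.

    import numpy as np
    from scipy.fft import dctn

    def transform_and_quantize(block, qp):
        # transform unit 313: pixel region -> frequency region (here a DCT)
        coeffs = dctn(block.astype(float), norm="ortho")
        # quantizer 314: uniform quantization; step doubles every 6 QP
        qstep = 2.0 ** ((qp - 4) / 6.0)
        return np.round(coeffs / qstep).astype(int)

    block = np.arange(16, dtype=np.uint8).reshape(4, 4)
    print(transform_and_quantize(block, qp=28))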
The entropy coding unit 315 codes the metadata, supplied from the GOP structure setting unit 12 and indicating which frame (picture) is to be used or dropped in order to generate a hyper-lapse video, into a syntax element such as SEI (Supplemental Enhancement Information), SPS (Sequence Parameter Set), PPS (Picture Parameter Set), or VUI (Video Usability Information), which is associated with the image data. The metadata may be the frame rates (f1 Hz, f2 Hz and f3 Hz) of hyper-lapse videos.
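The following Python sketch shows how such metadata might be serialized, for example into the payload of a user-data SEI message; the byte layout is invented for illustration, and a real encoder would follow the SEI payload syntax of H.264/H.265.

    import struct

    def pack_hyperlapse_sei(rates_hz):
        # one byte: number of supported rates; then each rate as 16 bits
        payload = struct.pack(">B", len(rates_hz))
        for r in rates_hz:
            payload += struct.pack(">H", r)
        return payload

    def unpack_hyperlapse_sei(payload):
        (n,) = struct.unpack_from(">B", payload, 0)
        return [struct.unpack_from(">H", payload, 1 + 2 * i)[0] for i in range(n)]

    sei = pack_hyperlapse_sei([30, 15, 5])  # assumed f1, f2, f3
    print(unpack_hyperlapse_sei(sei))       # [30, 15, 5]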
The entropy coding unit 315 performs entropy coding on quantized data to generate a bitstream including the coded plurality of frames. Coding by the entropy coding unit 315 may  be, for example, variable length coding, arithmetic coding or the like.
The buffer 316 temporarily stores the bitstream output from the entropy coding unit 315, and then outputs the stored bitstream at a rate matching the bandwidth of the transmission path to the video decoding device. The buffer 316 may be constituted by a recording medium such as a semiconductor memory.
The rate controller 318 monitors the free area of the buffer 316. Then, the rate controller 318 generates a rate control signal according to the free area of the buffer 316, and outputs the generated rate control signal to the quantizer 314. When the free area of the buffer 316 is small, for example, the rate controller 318 generates the rate control signal to reduce the bit rate for quantized data. When the free area of the buffer 316 is sufficiently large, on the other hand, the rate controller 318 generates the rate control signal to increase the bit rate for quantized data.
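A minimal Python sketch of this buffer-based control follows, modeling the rate control signal as a quantization parameter (QP) adjustment; the thresholds and step size are illustrative assumptions.

    def rate_control_signal(buffer_fill, buffer_size, qp, qp_min=0, qp_max=51):
        fullness = buffer_fill / buffer_size
        if fullness > 0.8:      # little free area: reduce the bit rate
            qp = min(qp + 2, qp_max)
        elif fullness < 0.2:    # ample free area: increase the bit rate
            qp = max(qp - 2, qp_min)
        return qp               # a larger QP means coarser quantization

    print(rate_control_signal(buffer_fill=900, buffer_size=1000, qp=30))  # 32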
The inverse quantizer 319 performs inverse quantization on the quantized data input from the quantizer 314. The inverse quantizer 319 then outputs the transform coefficient data obtained through the inverse quantization to the inverse transform unit 320.
The inverse transform unit 320 performs an inverse transform on the transform coefficient data input from the inverse quantizer 319 to restore the prediction error data. The inverse transform unit 320 then outputs the restored prediction error data to the adder 321.
The adder 321 generates decoded image data by adding the restored prediction error data input from the inverse transform unit 320 and the predictive image data input from the intra prediction unit 324 or the inter prediction unit 325 together. The generated decoded image data is output to the loop filter 322 and the memory 323.
The loop filter 322 performs filtering to reduce coding distortion which is caused at the time of coding an image. The loop filter 322 eliminates the coding distortion by filtering the decoded image data input from the adder 321, and outputs the filtered decoded image data to the memory 323.
The memory 323 stores the decoded image data input from the adder 321 and the filtered decoded image data input from the loop filter 322. Specifically, the memory 323 may be constituted by, for example, a recording medium such as a semiconductor memory. The memory 323 supplies the decoded image data before filtering, which is used for intra prediction, as reference image data to the intra prediction unit 324, or supplies the filtered decoded image data, which is used for inter prediction, as reference image data to the inter prediction unit 325.
The intra prediction unit 324 performs intra prediction in each intra prediction mode based on the image data to be coded, input from the re-ordering buffer 311, and the decoded image data supplied from the memory 323. For example, the intra prediction unit 324 evaluates the result of the prediction in  each intra prediction mode by using a predetermined cost function. The intra prediction unit 324 then selects the intra prediction mode that minimizes the cost function value, i.e., the intra prediction mode that maximizes the compression rate, as an optimal intra prediction mode. Further, the intra prediction unit 324 outputs information related to intra prediction, such as the prediction mode information indicating the optimal intra prediction mode, the predictive image data and the cost function value.
The inter prediction unit 325 performs inter prediction (interframe prediction) based on the image data to be coded, input from the re-ordering buffer 311, and the decoded image data supplied from the memory 323. For example, the inter prediction unit 325 evaluates the result of the prediction in each inter prediction mode by using a predetermined cost function. The inter prediction unit 325 then selects the inter prediction mode that minimizes the cost function value, i.e., the inter prediction mode that maximizes the compression rate, as an optimal inter prediction mode. Further, the inter prediction unit 325 generates predictive image data according to the optimal inter prediction mode. Then, the inter prediction unit 325 outputs information related to inter prediction, such as the prediction mode information representing the optimal inter prediction mode, the predictive image data, the cost function value, and the motion vector.
The cost function value related to intra prediction output from the intra prediction unit 324 is compared with the cost function value related to inter prediction output from the inter prediction unit 325, and the intra prediction or the inter prediction, whichever provides the smaller cost function value, is selected. When the intra prediction is selected, the information related to intra prediction is output to the entropy coding unit 315, and the predictive image data is output to the subtractor 312 and the adder 321. When the inter prediction is selected, on the other hand, the information related to inter prediction is output to the entropy coding unit 315, and the predictive image data is output to the subtractor 312 and the adder 321.
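These mode decisions can be sketched in Python with a generic rate-distortion cost J = D + λR; the λ value and the candidate figures below are illustrative assumptions.

    LAMBDA = 0.85  # assumed Lagrange multiplier

    def best_mode(candidates):
        # candidates: list of (mode_name, distortion, rate_bits)
        return min(candidates, key=lambda c: c[1] + LAMBDA * c[2])

    # per-unit decisions (intra prediction unit 324 / inter prediction unit 325)
    intra = best_mode([("intra_dc", 120.0, 40), ("intra_planar", 100.0, 48)])
    inter = best_mode([("inter_16x16", 60.0, 90), ("inter_skip", 140.0, 2)])
    # final intra-versus-inter selection, as described above
    print(best_mode([intra, inter]))  # ('inter_16x16', 60.0, 90)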
(Configuration of Video Decoding Device)
The video decoding device of the present embodiment is configured in such a way that the rate of frames to be displayed is set by a user, and a video with the original frame rate or a hyper-lapse video is output according to the set display frame rate. When a hyper-lapse video is to be output, unnecessary frames are dropped during re-ordering in the video decoding device and are not decoded. Therefore, redundant calculation can be avoided.
Fig. 5 shows the functional blocks of the video decoding device of the present embodiment. As shown in Fig. 5, the video decoding device 50 includes a bitstream buffer 51, a display frame rate setting unit 52, a frame dropping unit 53, and a video decoder unit 55.
The bitstream buffer 51 temporarily stores an input bitstream input from the video encoding device 10 over the transmission path 3. The bitstream buffer 51 may be constituted by, for example, a recording medium such as a semiconductor memory. The bitstream buffer 51 supplies the metadata, coded into, for example, a syntax element within SEI, to the display frame rate setting unit 52.
The display frame rate setting unit 52 presents selectable frame rates for a hyper-lapse video to the user based on the supplied metadata. The display frame rate setting unit 52 receives, from the user, a selection of whether a video with the original frame rate or a hyper-lapse video is to be displayed. When display of a hyper-lapse video is selected, the display frame rate setting unit 52 receives, from the user, a selection of a frame rate from the selectable frame rates for a hyper-lapse video. In response to the reception of the selection from the user, the display frame rate setting unit 52 determines a frame rate for displaying a video and supplies the determined rate of the frames to be displayed to the frame dropping unit 53.
The frame dropping unit 53 drops frames from the bitstream stored in the bitstream buffer 51 according to the rate of the frames to be displayed, which is determined according to the selection made by the user, and supplies the resulting bitstream to the video decoder unit 55. The frames to be dropped here are the frames associated with the frame rate selected by the user from among the one or more frame rates indicated by the metadata. When the user selects display of a video with the original frame rate, the frame dropping unit 53 does not drop frames.
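A combined Python sketch of the display frame rate setting unit 52 and the frame dropping unit 53 follows; the in-memory metadata representation and the drop rule (which mirrors the picture-type assignment described for the encoder) are assumptions for illustration.

    DROP_BY_RATE = {
        "f1": {"B"},            # drop reference B pictures
        "f2": {"B", "b"},       # also drop non-reference B pictures
        "f3": {"B", "b", "P"},  # keep only I pictures
    }

    def choose_display_rate(metadata_rates, user_choice):
        # user_choice: None for original-rate playback, else a metadata rate
        if user_choice is None:
            return None
        if user_choice not in metadata_rates:
            raise ValueError("rate not supported by this bitstream")
        return user_choice

    def drop_frames(frames, rate):
        # frames: (picture_type, payload) tuples from the bitstream buffer
        drop = DROP_BY_RATE.get(rate, set())
        return [f for f in frames if f[0] not in drop]

    frames = [("I", 0), ("b", 1), ("B", 2), ("P", 3), ("b", 4)]
    print(drop_frames(frames, choose_display_rate(["f1", "f2", "f3"], "f2")))
    # [('I', 0), ('P', 3)]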
The video decoder unit 55 restores the original video or a hyper-lapse video from the bitstream supplied from the frame dropping unit 53.
(Configuration of Video Decoder Unit)
Fig. 6 shows an example of the configuration of the video decoder unit 55. The video decoder unit 55 includes an entropy decoding unit 552, an inverse quantizer 553, an inverse transform unit 554, an adder 555, a loop filter 556, a re-ordering buffer 557, a memory 558, an intra prediction unit 559, and an inter prediction unit 560. The video decoder unit 55 basically performs a process inverse to the process performed by the video encoder unit 13 to restore video data.
The entropy decoding unit 552 decodes the bitstream input from the bitstream buffer 51. The entropy decoding unit 552 refers to syntax elements such as the SPS and the PPS, and decodes information multiplexed into the header area of the input bitstream.
The inverse quantizer 553 and the inverse transform unit 554 generate prediction error data by performing inverse quantization and inverse transform on quantized data input from the entropy decoding unit 552. The inverse transform unit 554 outputs the generated prediction error data to the adder 555. The inverse quantizer 553 and the inverse transform unit 554 perform processes inverse to the processes performed by the quantizer 314 and the transform unit 313 in the video encoding device 10. That is, the inverse quantizer 553 and the inverse  transform unit 554 perform inverse quantization and inverse transform by using the SPS and the PPS corresponding to a sequence or a picture to be processed.
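As an illustrative sketch of this inverse path, mirroring the forward sketch given for the encoder, the following Python fragment assumes a uniform quantizer and a 2-D DCT.

    import numpy as np
    from scipy.fft import idctn

    def dequantize_and_itransform(qcoeffs, qp):
        # inverse quantizer 553: scale back by the assumed quantization step
        qstep = 2.0 ** ((qp - 4) / 6.0)
        coeffs = qcoeffs * qstep
        # inverse transform unit 554: frequency region -> pixel region
        return idctn(coeffs, norm="ortho")

    qcoeffs = np.array([[4, 0], [0, 0]], dtype=float)
    print(dequantize_and_itransform(qcoeffs, qp=28))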
The adder 555 generates decoded image data by adding the prediction error data input from the inverse transform unit 554 and the predictive image data input from the intra prediction unit 559 or the inter prediction unit 560. The generated decoded image data is output to the loop filter 556 and the memory 558.
The loop filter 556 eliminates the coding distortion by filtering the decoded image data input from the adder 555, and outputs the filtered decoded image data to the re-ordering buffer 557 and the memory 558.
The re-ordering buffer 557 re-orders the images input from the loop filter 556 to generate a sequence of time-sequential image data. The image data generated by the re-ordering buffer 557 is output as a video with the original frame rate or a hyper-lapse video.
The memory 558 stores the unfiltered, decoded image data input from the adder 555 and the filtered decoded image data input from the loop filter 556. The memory 558 may be constituted by, for example, a recording medium such as a semiconductor memory.
(Processing Flow of Video Encoding)
Fig. 7 shows one example of the video encoding processing in the video encoding device 10.
The video encoding device 10 determines frame rates (for example, f1 Hz, f2 Hz and f3 Hz) of frames to be displayed as a hyper-lapse video (step 21). The frame rates for displaying frames as a hyper-lapse video may be preset in the video encoding device 10 or may be input or selected by a user.
The video encoding device 10 (for example, scene analysis unit 11) performs scene analysis on the input video (step 23). In this step, image data representing a plurality of scenes is analyzed. The video encoding device 10 finds, for each of the frame rates f1 Hz, f2 Hz and f3 Hz of a hyper-lapse video, an optimal frame for minimizing the motion between frames from among the plurality of frames corresponding to the plurality of scenes. For example, the video encoding device 10 may perform the above-described (1) frame matching step, (2) frame selection step, and (3) path smoothing and rendering step for each of the frame rates determined at step 21 to find the optimal frame for each of the frame rates.
The video encoding device 10 (for example, GOP structure setting unit 12) determines, according to the result of the scene analysis performed at step 23 (that is, the result of the analysis of the image data), which frame (picture) is to be used or dropped in the video decoding device 50 for the plurality of scenes represented by the image data in order to generate a hyper-lapse video, and determines a GOP structure of the plurality of frames corresponding to the plurality of scenes at the time of encoding the image data into the plurality of frames (step 25).
The video encoding device 10 (for example, GOP structure setting unit 12) generates metadata for generating a hyper-lapse video (step 27). The metadata indicates one or more supported frame rates for a hyper-lapse video and indicates which frame (picture) among the plurality of frames corresponding to the plurality of scenes is to be used or dropped. Each of the one or more frame rates is associated with a frame to be used or a frame to be dropped among the plurality of frames.
The video encoding device 10 (for example, video encoder unit 13) encodes the input image data according to the GOP structure determined at step 25 to generate a plurality of frames corresponding to the plurality of scenes, encodes the metadata into a syntax element such as SEI, SPS, PPS or VUI, and outputs a bitstream including the generated plurality of frames and the syntax element (step 29).
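Pulling the steps of Fig. 7 together, a compact end-to-end Python sketch might look as follows; the uniform-subsampling "analysis" and the metadata layout are stubs standing in for steps 23 and 27.

    import struct

    def analyze_scenes(n):                    # step 23 (stub analysis)
        return (set(range(0, n, 2)),          # optimal frames for f1
                set(range(0, n, 4)),          # optimal frames for f2
                set(range(0, n, 8)))          # optimal frames for f3

    def encode_hyperlapse(n, rates_hz=(30, 15, 5)):    # assumed f1, f2, f3
        s1, s2, s3 = analyze_scenes(n)
        types = ["I" if i in s3 else                   # step 25: dynamic GOP
                 "P" if i in s2 else
                 "b" if i in s1 else "B"
                 for i in range(n)]
        sei = struct.pack(">B" + "H" * len(rates_hz),  # step 27: metadata
                          len(rates_hz), *rates_hz)
        return types, sei                              # step 29: bitstream out

    print(encode_hyperlapse(8)[0])  # ['I', 'B', 'b', 'B', 'P', 'B', 'b', 'B']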
(Processing Flow of Video Decoding)
Fig. 8 shows one example of the video decoding processing in the video decoding device 50.
The video decoding device 50 (for example, bitstream buffer 51) receives metadata (step 61). The metadata may be received in an input bitstream input from the video encoding device 10 over a transmission path. Receiving the metadata may include decoding metadata encoded into an SEI syntax element or another syntax element. The metadata indicates which frame (picture) is to be used or dropped in order to generate a hyper-lapse video.
The video decoding device 50 (for example, bitstream buffer 51) reads the GOP structure for the input bitstream (step  63) . The GOP structure may be read from a syntax element such as SPS, PPS or VUI.
The video decoding device 50 (for example, display frame rate setting unit 52) determines a frame rate for displaying a video (step 65). For example, the determining of the frame rate may include presenting selectable frame rates for a hyper-lapse video to a user based on the metadata, and receiving, from the user, a selection of whether a video with the original frame rate or a hyper-lapse video is to be displayed. Receiving a selection to display a hyper-lapse video includes receiving, from the user, a selection of a frame rate from the selectable frame rates for a hyper-lapse video. In response to the reception of the selection from the user, the display frame rate setting unit 52 determines a frame rate for displaying a video.
The video decoding device 50 determines whether or not to display a hyper-lapse video based on the determined frame rate (step 67) . When it is determined to display a hyper-lapse video, the processing proceeds to step 69. When it is not determined to display a hyper-lapse video, that is, when it is determined that a video with the original frame rate is to be displayed, the processing proceeds to step 71.
The video decoding device 50 (for example, frame dropping unit 53) drops unnecessary frames from the bitstream stored in the bitstream buffer 51 according to the frame rate determined at step 65 (step 69). The frames to be dropped here are the frames associated with the frame rate selected by the user from among the one or more frame rates indicated by the metadata. The remaining bitstream is supplied to the video decoder unit 55. Under the above assumption, the reference B picture is dropped when the frame rate determined at step 65 is f1 Hz; the reference B picture and the non-reference B picture are dropped when the determined frame rate is f2 Hz; and the P picture, the reference B picture and the non-reference B picture are dropped when the determined frame rate is f3 Hz.
The video decoding device 50 (for example, video decoder unit 55) sequentially decodes, frame by frame, the remaining bitstream that was not dropped at step 69 (step 71). The video decoding device 50 thus decodes the original video or a hyper-lapse video according to the frame rate determined at step 65.
(Variations)
The foregoing embodiment has been described using an example in which the GOP structure of a plurality of frames included in a bitstream output from the video encoding device is determined dynamically. However, the present invention can also be applied to videos encoded using the fixed GOP structure shown in Fig. 1.
One implementation form of the present invention is a cloud server that includes the scene analysis unit 11, the GOP structure setting unit 12, and the video encoder unit 13 of the video encoding device 10, together with a video decoder unit that decodes a video encoded using a fixed GOP structure and supplies the decoded video to the scene analysis unit 11. Image data encoded using the fixed GOP structure (that is, a plurality of frames having the fixed GOP structure) is transmitted over a network to the cloud server and transcoded therein.
Fig. 9 is a diagram showing an example of the functional blocks of a cloud server according to one embodiment of the present invention. The cloud server 110 includes a video decoder unit 111, a scene analysis unit 11, a GOP structure setting unit 12, and a video encoder unit 13. A bitstream including image data encoded using the fixed GOP structure is input to the cloud server 110 over a network from a camera or a smartphone. A bitstream including the transcoded frames is output and transmitted over a network to a smartphone, personal computer, or television.
Fig. 10 is a flowchart illustrating an example of video transcoding processing according to one embodiment of the present invention.
After the video decoder unit 111 included in the cloud server decodes the image data encoded using the fixed GOP structure (step 20), the cloud server 110 determines frame rates (for example, f1 Hz, f2 Hz and f3 Hz) of frames to be displayed as a hyper-lapse video (step 21). The frame rates for displaying frames as a hyper-lapse video may be preset in the cloud server 110 or may be input or selected by a user. The scene analysis unit 11 included in the cloud server then analyzes the restored image data (step 23). Moreover, the GOP structure setting unit 12 included in the cloud server dynamically determines a GOP structure of a plurality of frames to generate a hyper-lapse video according to the result of the analysis of the image data (step 25). Furthermore, the video encoder unit 13 included in the cloud server encodes the restored image data according to the determined GOP structure and generates a transcoded bitstream (step 27). That is, a plurality of frames having a fixed GOP structure is transcoded into a plurality of frames having the dynamically determined GOP structure, and a bitstream including the plurality of transcoded frames and the metadata is transmitted to the video decoding device 50 described above over the network (step 29).
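To illustrate the off-line nature of this flow, the following Python sketch queues transcoding jobs and processes them without real-time constraints; decode_fixed_gop and encode_dynamic_gop are illustrative stand-ins for the video decoder unit 111 and the video encoder unit 13, not their actual implementations.

    import queue

    def decode_fixed_gop(bitstream):
        return list(bitstream)                    # stub decoding (step 20)

    def encode_dynamic_gop(frames):
        # stub for steps 21 through 27: a periodic stand-in for the
        # dynamically determined GOP structure
        return ["I" if i % 8 == 0 else
                "P" if i % 4 == 0 else
                "b" if i % 2 == 0 else "B"
                for i in range(len(frames))]

    jobs = queue.Queue()
    jobs.put(b"fixed-gop-bitstream-bytes")
    while not jobs.empty():                       # off-line batch processing
        frames = decode_fixed_gop(jobs.get())     # step 20
        print(encode_dynamic_gop(frames)[:8])     # steps 25/27 -> step 29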
The foregoing descriptions are merely specific implementation manners of the present invention, but are not intended to limit the protection scope of the present invention. Any variation or replacement readily figured out by a person skilled in the art within the technical scope disclosed shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (27)

  1. A video encoding method executed by a video encoding device, the method comprising:
    analyzing input image data to determine whether any of a plurality of scenes represented by the image data is to be dropped at a time of generating a hyper-lapse video in a video decoding device; and
    encoding the image data based on a result of the analysis to generate a plurality of frames corresponding to the plurality of scenes, and outputting a bitstream including the generated plurality of frames.
  2. The method according to claim 1, wherein
    the determination includes dynamically determining a GOP (Group of Picture) structure of the plurality of frames in the bitstream to be output based on the result of the analysis.
  3. The method according to claim 2, wherein the determining a GOP structure of the plurality of frames in the bitstream to be output includes:
    assigning a frame to be dropped at a time of generating a hyper-lapse video in the video decoding device to higher temporal hierarchy within the GOP structure.
  4. The method according to claim 3, wherein the bitstream to be output supports a plurality of frame rates for a user to view the bitstream as a hyper-lapse video in the video decoding device.
  5. The method according to claim 2, further comprising:
    including metadata for decoding the bitstream as a hyper-lapse video in a sequence parameter set (SPS) of the bitstream to be output so as to be supplied to the video decoding device.
  6. The method according to claim 2, further comprising:
    including metadata for decoding the bitstream as a hyper-lapse video in a picture parameter set (PPS) of the bitstream to be output so as to be supplied to the video decoding device.
  7. The method according to claim 2, further comprising:
    including metadata for decoding the bitstream as a hyper-lapse video in supplemental enhancement information (SEI) of the bitstream to be output so as to be supplied to the video decoding device.
  8. The method according to claim 2, further comprising:
    including metadata for decoding the bitstream as a hyper-lapse video in video usability information (VUI) of the bitstream to be output so as to be supplied to the video decoding device.
  9. A video transcoding method executed by a server, the method comprising:
    decoding a plurality of encoded first frames in an input bitstream, the plurality of encoded first frames having a first GOP (Group of Picture) structure;
    by analyzing image data representing a plurality of scenes corresponding to the plurality of decoded first frames,  determining whether any of the plurality of scenes is to be dropped at a time of generating a hyper-lapse video in a video decoding device;
    determining a second GOP structure which a plurality of second frames will have, based on a result of the analysis, the plurality of second frames corresponding to the plurality of scenes;
    transcoding the plurality of decoded first frames having the first GOP structure into the plurality of second frames; and
    transmitting the plurality of second frames in a bitstream to the video decoding device.
  10. A video decoding method executed by a video decoding device, the method comprising:
    receiving a bitstream, the bitstream including a plurality of encoded frames and metadata for decoding the plurality of encoded frames as a hyper-lapse video, the metadata indicating one or more frame rates for the hyper-lapse video supported by the bitstream, each of the one or more frame rates being associated with a frame to be used or a frame to be dropped among the plurality of frames;
    determining a frame rate for displaying a video in the one or more frame rates indicated by the metadata;
    determining whether or not to display the hyper-lapse video based on the determined frame rate;
    dropping some of the plurality of encoded frames based on the determined frame rate, on condition that it is determined that the hyper-lapse video is displayed; and
    decoding remaining frames of the plurality of encoded frames to display the hyper-lapse video.
  11. The method according to claim 10, wherein the plurality of encoded frames are encoded in a GOP structure determined in a video encoding device.
  12. The method according to claim 11, wherein the dropped frame is encoded in higher temporal hierarchy within the GOP structure.
  13. The method according to claim 11, wherein
    the dropping some of the plurality of encoded frames includes dropping a frame encoded in higher temporal hierarchy within the GOP structure according to the determined frame rate.
  14. The method according to claim 10, wherein the metadata is included in a sequence parameter set (SPS) of the bitstream.
  15. The method according to claim 10, wherein the metadata is included in a picture parameter set (PPS) of the bitstream.
  16. The method according to claim 10, wherein the metadata is included in supplemental enhancement information (SEI) of the bitstream.
  17. The method according to claim 10, wherein the metadata is included in video usability information (VUI) of the bitstream.
  18. The method according to claim 10, wherein the bitstream is input from a video encoding device.
  19. The method according to claim 10, wherein the bitstream is input from a server.
  20. A video encoding device comprising:
    a scene analysis unit configured to analyze input image data to determine whether any of a plurality of scenes represented by the image data is to be dropped at a time of generating a hyper-lapse video in a video decoding device;
    a GOP (Group of Picture) structure setting unit configured to determine a GOP structure of the plurality of frames in a bitstream to be output, based on a result of the analysis; and
    a video encoder unit configured to, based on the result of the analysis, encode the image data to generate a plurality of frames corresponding to the plurality of scenes, and output a bitstream including the generated plurality of frames, the generated plurality of frames having the determined GOP structure.
  21. A server comprising:
    a video decoder unit configured to decode a plurality of encoded first frames in an input bitstream, the plurality of first frames having a first GOP (Group of Picture) structure;
    a scene analysis unit configured to analyze image data representing the plurality of scenes corresponding to the plurality of decoded first frames to determine whether any of the plurality of scenes is to be dropped at a time of generating a hyper-lapse video in a video decoding device;
    a GOP structure setting unit configured to determine a second GOP structure which a plurality of second frames will have, based on a result of the analysis, the plurality of second  frames corresponding to the plurality of scenes; and
    a video encoder unit configured to transcode the plurality of decoded first frames having the first GOP structure into the plurality of second frames, and transmit the plurality of second frames in a bitstream to the video decoding device.
  22. A video decoding device comprising:
    a bitstream buffer configured to receive a bitstream, the bitstream including a plurality of encoded frames, metadata for decoding the plurality of encoded frames as a hyper-lapse video in the bitstream, and a GOP (Group of Picture) structure which the plurality of encoded frames have, the metadata indicating one or more frame rates for the hyper-lapse video supported by the bitstream, each of the one or more frame rates being associated with a frame to be used or a frame to be dropped among the plurality of frames, the bitstream buffer being also configured to read the GOP structure;
    a display frame rate setting unit configured to determine a frame rate for displaying a video in the one or more frame rates indicated by the metadata;
    a frame dropping unit configured to drop some of the plurality of encoded frames based on the determined frame rate; and
    a video decoder unit configured to decode remaining frames in the plurality of encoded frames to output the hyper-lapse video.
  23. A video encoding device, comprising:
    a memory storage comprising instructions; and
    one or more processors in communication with the memory, wherein the one or more processors execute the instructions to perform the method according to any one of claims 1 to 8.
  24. A server, comprising:
    a memory storage comprising instructions; and
    one or more processors in communication with the memory, wherein the one or more processors execute the instructions to perform the method according to claim 9.
  25. A video decoding device, comprising:
    a memory storage comprising instructions; and
    one or more processors in communication with the memory, wherein the one or more processors execute the instructions to perform the method according to any one of claims 10 to 19.
  26. A computer readable medium storing instructions which, when executed on a processor, cause the processor to perform the method according to any one of claims 1 to 19.
  27. A computer program product comprising program code for performing the method according to any of claims 1 to 19 when executed on a computer or a processor.