WO2021007702A1 - Video encoding method, video decoding method, video encoding device, and video decoding device


Info

Publication number
WO2021007702A1
Authority
WIPO (PCT)
Application number
PCT/CN2019/095782
Other languages
French (fr)
Inventor
Sato Kazushi
Original Assignee
Huawei Technologies Co., Ltd.
Application filed by Huawei Technologies Co., Ltd.
Priority to PCT/CN2019/095782
Publication of WO2021007702A1

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/31: Coding using hierarchical techniques, e.g. scalability, in the temporal domain
    • H04N19/40: Coding using video transcoding, i.e. partial or full decoding of a coded input stream followed by re-encoding of the decoded output stream
    • H04N19/70: Coding characterised by syntax aspects related to video coding, e.g. related to compression standards
    • H04N21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/23418: Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics
    • H04N21/234381: Reformatting of video signals for distribution or compliance with end-user requests or end-user device requirements by altering the temporal resolution, e.g. decreasing the frame rate by frame skipping
    • H04N21/8549: Creating video summaries, e.g. movie trailer

Definitions

  • Fig. 1 is a diagram showing the GOP structure of a typical bitstream.
  • Fig. 2 is a diagram showing an example of the GOP structure of a bitstream which is determined in one embodiment of the present invention.
  • Fig. 3 is a diagram showing an example of the functional blocks of a video encoding device according to one embodiment of the present invention.
  • Fig. 4 is a diagram showing an example of the configuration of a video encoder unit according to one embodiment of the present invention.
  • Fig. 5 is a diagram showing an example of the functional blocks of a video decoding device according to one embodiment of the present invention.
  • Fig. 6 is a diagram showing an example of the configuration of a video decoder unit according to one embodiment of the present invention.
  • Fig. 7 is a flowchart illustrating an example of video encoding processing according to one embodiment of the present invention.
  • Fig. 8 is a flowchart illustrating an example of video decoding processing according to one embodiment of the present invention.
  • Fig. 9 is a diagram showing an example of the functional blocks of a cloud server according to one embodiment of the present invention.
  • Fig. 10 is a flowchart illustrating an example of video transcoding processing according to one embodiment of the present invention.
  • a video encoding device determines which pictures are to be dropped to generate a hyper-lapse video before encoding a video into a bitstream, so that a GOP (Group of Picture) structure may be determined dynamically.
  • a video decoding device may drop a reference B picture, a non-reference B picture, or a P picture according to the playback speed of a hyper-lapse video, based on temporal hierarchy of a bitstream.
  • metadata for generating a hyper-lapse video from a bitstream received at a video decoding device may be transmitted in SEI (Supplemental Enhancement Information) or another syntax element of the bitstream from a video encoding device to the video decoding device.
  • The latest video coding standards such as H.264 and H.265 support, in addition to conventional I pictures and P pictures, both reference B pictures and non-reference B pictures, which enable temporal scalability (for example, see Gary J. Sullivan, et al., “Overview of the High Efficiency Video Coding (HEVC) Standard”, IEEE Trans. on Circuits and Systems for Video Technology, Vol. 22, No. 12, Dec. 2012).
  • Fig. 1 shows the GOP structure of a plurality of frames included in a typical bitstream.
  • The plurality of frames are generated by encoding image data.
  • Each of the plurality of scenes represented by the image data corresponds to one of the plurality of frames.
  • In Fig. 1, I0 represents an I picture, P4 represents a P picture, B2 represents a reference B picture, and b1 and b3 represent non-reference B pictures.
  • the GOP structure of the typical bitstream shown in Fig. 1 is fixed, so that the GOP structure periodically and repeatedly appears in a bitstream output from the video encoding device.
  • The bitstream shown in Fig. 1 can be decoded at its full frame rate of 30 Hz.
  • The bitstream can also be decoded at a frame rate of 15 Hz by dropping the non-reference B picture b1, the reference B picture B2 and the non-reference B picture b3.
  • the simplest method for creating a hyper-lapse video is to subsample a plurality of frames or a plurality of scenes of an input video uniformly in a temporal direction.
  • When the motion of the camera at the time of shooting a video includes high-frequency components (for example, when the video shakes violently up and down, left and right, or back and forth), the created hyper-lapse video will be uncomfortable to watch. This may happen, for example, when shooting a video with a so-called action camera.
  • the video encoding device analyzes image data to determine which frame (picture) in a plurality of frames corresponding to a plurality of scenes represented by the image data is to be selected or dropped in order to generate a hyper-lapse video.
  • the video encoding device encodes the image data based on a result of the analysis to generate a plurality of frames corresponding to a plurality of scenes.
  • the video encoding device dynamically determines the GOP (Group of Picture) structure based on the result of the analysis of the image data.
  • Assume that the original frame rate of the bitstream is f0 Hz and that the frame rates of hyper-lapse videos that can be generated from the bitstream are f1 Hz, f2 Hz and f3 Hz (f0 > f1 > f2 > f3).
  • The GOP structure is dynamically determined in such a way that only I pictures, P pictures and non-reference B pictures are decoded when a user wants to view a hyper-lapse video with a frame rate of f1 Hz; only I pictures and P pictures are decoded for a frame rate of f2 Hz; and only I pictures are decoded for a frame rate of f3 Hz.
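As an illustration of this tiering, a minimal sketch (the names and structure are assumptions, not syntax from the patent) can express the relation between a requested playback rate and the picture types that must be decoded as a lookup table:

```python
# Illustrative sketch (assumed names, not syntax from the patent) of the
# frame-rate tiering described above, with f0 > f1 > f2 > f3.
from enum import Enum

class PicType(Enum):
    I = "I"                  # intra picture
    P = "P"                  # predictive picture
    REF_B = "refB"           # reference B picture
    NONREF_B = "nonrefB"     # non-reference B picture

# Picture types that must be decoded for each supported playback rate.
DECODED_TYPES = {
    "f0": {PicType.I, PicType.P, PicType.REF_B, PicType.NONREF_B},  # original video
    "f1": {PicType.I, PicType.P, PicType.NONREF_B},  # reference B pictures dropped
    "f2": {PicType.I, PicType.P},                    # all B pictures dropped
    "f3": {PicType.I},                               # I pictures only
}

def must_decode(pic_type: PicType, rate: str) -> bool:
    """Return True if a picture of this type is needed at the given rate tier."""
    return pic_type in DECODED_TYPES[rate]
```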
  • Fig. 2 shows a GOP structure that is determined dynamically.
  • the GOP structure shown in Fig. 2 need not be periodic, and which frame is to be dropped usually differs for each GOP.
  • The video encoding device may be implemented with a known algorithm for analyzing image data (see, for example, Neel Joshi, et al., “Real-Time Hyperlapse Creation via Optimal Frame Selection,” ACM Transactions on Graphics, 34, July 2015) to dynamically determine a GOP structure.
  • the known algorithm generally includes three steps.
  • (1) In a frame matching step, feature quantity based sparse estimation is used to estimate how well temporally adjacent frames can be aligned, and the calculated costs are stored as a sparse matrix.
  • (2) In a frame selection step, dynamic time warping (DTW) is used to find an optimal frame path that trades off matching a target playback rate against minimizing the motion between selected frames.
  • (3) In a path smoothing and rendering step, a hyper-lapse video is rendered from the selected frames by smoothing the camera path to obtain a stabilized result.
  • Image data analysis including (1) the frame matching step, (2) the frame selection step, and (3) the path smoothing and rendering step makes it possible, for each of the frame rates f1 Hz, f2 Hz and f3 Hz, to find optimal frames that minimize the motion between frames, and thereby to determine a GOP structure. That is, it is possible to determine whether any of a plurality of scenes represented by image data is to be dropped at the time of generating a hyper-lapse video in a video decoding device, and to determine a GOP structure of the plurality of frames corresponding to the plurality of scenes.
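The frame selection step can be sketched as a shortest-path dynamic program in the spirit of Joshi et al. The simplified cost model below (a precomputed match-cost matrix from step (1) plus a velocity penalty toward the target speed-up) is an assumption for illustration, not the paper's full formulation:

```python
# Simplified dynamic-programming sketch of the frame selection step, in the
# spirit of Joshi et al. (2015). match_cost[i][j] (how poorly frame j aligns
# with frame i) is assumed precomputed in the frame matching step; the
# velocity penalty and the `lam` weight are illustrative simplifications.
import math

def select_frames(match_cost, n_frames, target_speed, max_skip=8, lam=0.1):
    """Return frame indices forming a minimum-cost path from frame 0 to n-1."""
    cost = [math.inf] * n_frames     # best path cost ending at each frame
    prev = [-1] * n_frames           # back-pointers for path recovery
    cost[0] = 0.0
    for j in range(1, n_frames):
        for i in range(max(0, j - max_skip), j):
            # alignment cost plus a penalty for deviating from the target speed-up
            c = cost[i] + match_cost[i][j] + lam * (j - i - target_speed) ** 2
            if c < cost[j]:
                cost[j], prev[j] = c, i
    path, k = [], n_frames - 1       # walk the back-pointers from the last frame
    while k != -1:
        path.append(k)
        k = prev[k]
    return path[::-1]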
  • the video encoding device can encode image data based on the result of analysis of the image data and generate a plurality of frames having the determined GOP structure.
  • the plurality of frames generated can be output in such a form as to be included in a bitstream.
  • Complex scene analysis, that is, complex analysis of image data, is necessary only in the video encoding device and is unnecessary in the video decoding device.
  • data provided by a gyro sensor is useful for complex scene analysis. Such data is available when a video is captured by a camera (provided in the video encoding device) , but not available for a display (provided in the video decoding device) . Therefore, it is useful to dynamically determine a GOP structure in the video encoding device for displaying a bitstream as a hyper-lapse video.
  • The video encoding device is configured in such a way that one or more frame rates (for example, f1 Hz, f2 Hz and f3 Hz) of hyper-lapse videos which can be selected by a user in the video decoding device are set in the video encoding device.
  • the video encoding device may be configured in such a way that a user selectively sets one or more frame rates of hyper-lapse videos.
  • Fig. 3 shows the functional blocks of the video encoding device of this embodiment.
  • the video encoding device 10 includes a scene analysis unit 11, a GOP structure setting unit 12, and a video encoder unit 13.
  • Before encoding the image data of an input video, the scene analysis unit 11 analyzes the image data representing a plurality of scenes to find, for each of the hyper-lapse frame rates f1 Hz, f2 Hz and f3 Hz, an optimal frame among the plurality of frames corresponding to the plurality of scenes.
  • The GOP structure setting unit 12 determines, according to the result of the analysis performed by the scene analysis unit 11, which frame (picture) of the plurality of scenes represented by the image data is to be used or dropped in the video decoding device in order to generate a hyper-lapse video, and determines and sets the GOP structure that the plurality of frames corresponding to the plurality of scenes will have when the image data is encoded.
  • the GOP structure setting unit 12 also supplies the video encoder unit 13 with metadata indicating information about which frame (picture) is to be used or dropped in order to generate a hyper-lapse video.
  • the metadata indicates one or more frame rates for a hyper-lapse video supported by a bitstream including a plurality of encoded frames.
  • Each of the one or more frame rates is associated with a frame to be used or a frame to be dropped among the plurality of frames.
  • An optimal frame for generating a hyper-lapse video with a frame rate of f3 Hz is set to be encoded as an I picture.
  • Frames that are optimal for generating the hyper-lapse video with a frame rate of f2 Hz, other than those assigned for f3 Hz, are set to be encoded as P pictures.
  • Frames that are optimal for generating the hyper-lapse video with a frame rate of f1 Hz, other than those assigned for f3 Hz and f2 Hz, are set to be encoded as non-reference B pictures.
  • The remaining frames are set to be encoded as reference B pictures.
  • a GOP structure is dynamically set according to the result of the analysis of the image data.
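A hedged sketch of this assignment follows, assuming (as the tiering implies) that the optimal frame sets are nested, so that every f3-optimal frame is also f2- and f1-optimal; all names are illustrative:

```python
# Hedged sketch of the dynamic GOP assignment described above. sel_f1, sel_f2
# and sel_f3 are the per-rate optimal frame index sets from scene analysis,
# assumed nested (sel_f3 within sel_f2 within sel_f1); names are illustrative.
def assign_picture_types(n_frames, sel_f1, sel_f2, sel_f3):
    """Map every frame index to a picture type consistent with the tiering."""
    types = {}
    for i in range(n_frames):
        if i in sel_f3:
            types[i] = "I"        # survives down to the slowest rate f3
        elif i in sel_f2:
            types[i] = "P"        # needed at f2, f1 and f0; dropped at f3
        elif i in sel_f1:
            types[i] = "nonrefB"  # decoded only at f1 and the original rate
        else:
            types[i] = "refB"     # decoded only at the original rate f0
    return types
```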
  • the video encoder unit 13 encodes the input image data according to the GOP structure set by the GOP structure setting unit 12 to generate a plurality of frames corresponding to a plurality of scenes, and outputs a bitstream including the generated plurality of frames.
  • Fig. 4 shows one example of the configuration of the video encoder unit 13.
  • The video encoder unit 13 includes a re-ordering buffer 311, a subtracter 312, a transform unit 313, a quantizer 314, an entropy coding unit 315, and a buffer 316.
  • the video encoder unit 13 further includes a rate controller 318, an inverse quantizer 319, an inverse transform unit 320, an adder 321, a loop filter 322, a memory 323, an intra prediction unit 324, and an inter prediction unit 325.
  • the re-ordering buffer 311 re-orders input video data (image data) according to the GOP structure set by the GOP structure setting unit 12.
  • the re-ordered image data is output to the subtracter 312.
  • Image data input from the re-ordering buffer 311 and predictive image data from the intra prediction unit 324 or the inter prediction unit 325 are supplied to the subtracter 312.
  • the subtracter 312 calculates prediction error data which is the difference between the image data from the re-ordering buffer 311 and the predictive image data, and outputs the calculated prediction error data to the transform unit 313.
  • the transform unit 313 performs transform on the prediction error data input from the subtracter 312, and generates transform coefficient data which is the result of transform of a pixel region in the image to a frequency region.
  • the generated transform coefficient data is output to the quantizer 314.
  • The transform that is performed by the transform unit 313 may be, for example, a discrete cosine transform (DCT), a Karhunen-Loève transform or the like.
  • the quantizer 314 performs quantization on the transform coefficient data output from the transform unit 313.
  • the quantized transform coefficient data is output to the entropy coding unit 315 and the inverse quantizer 319.
  • the bit rate of the quantized data output from the quantizer 314 is controlled based on a rate control signal from the rate controller 318.
  • The entropy coding unit 315 codes the metadata, which is supplied from the GOP structure setting unit 12 and specifies which frame (picture) is to be used or dropped in order to generate a hyper-lapse video, into a syntax element associated with the image data, such as SEI (Supplemental Enhancement Information), SPS (Sequence Parameter Set), PPS (Picture Parameter Set) or VUI (Video Usability Information).
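The patent does not define a payload layout. As one possibility, such metadata could travel in a user_data_unregistered SEI message; the UUID and field layout in the sketch below are assumptions for illustration only:

```python
# Sketch of one possible packaging of the metadata in a user_data_unregistered
# SEI message. The UUID and field layout are assumptions for illustration; the
# patent only says that the metadata lists the supported hyper-lapse frame rates.
import struct
import uuid

HYPERLAPSE_UUID = uuid.UUID("00000000-0000-0000-0000-000000000000")  # placeholder

def build_hyperlapse_sei_payload(supported_rates_hz):
    """Serialize the supported frame rates (e.g. [f1, f2, f3]) into a payload."""
    payload = HYPERLAPSE_UUID.bytes                        # 16-byte identifier
    payload += struct.pack(">B", len(supported_rates_hz))  # number of rates
    for rate in supported_rates_hz:
        payload += struct.pack(">H", rate)                 # each rate in Hz
    return payload
```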
  • the entropy coding unit 315 performs entropy coding on quantized data to generate a bitstream including the coded plurality of frames. Coding by the entropy coding unit 315 may be, for example, variable length coding, arithmetic coding or the like.
  • The buffer 316 temporarily stores the bitstream output from the entropy coding unit 315.
  • the buffer 316 then outputs the stored bitstream at a rate matching the bandwidth of the transmission path to the video decoding device.
  • the buffer 316 may be constituted by a recording medium such as a semiconductor memory.
  • the rate controller 318 monitors the free area of the buffer 316. Then, the rate controller 318 generates a rate control signal according to the free area of the buffer 316, and outputs the generated rate control signal to the quantizer 314. When the free area of the buffer 316 is small, for example, the rate controller 318 generates the rate control signal to reduce the bit rate for quantized data. When the free area of the buffer 316 is sufficiently large, on the other hand, the rate controller 318 generates the rate control signal to increase the bit rate for quantized data.
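A toy sketch of this feedback loop follows; the thresholds and step size are illustrative assumptions, not values from the patent:

```python
# Toy sketch of the buffer-fullness feedback described above; the thresholds
# and step size are illustrative assumptions, not values from the patent.
def rate_control_signal(free_bytes, buffer_size_bytes, current_bitrate, step=0.1):
    """Raise or lower the target bitrate based on the buffer's free area."""
    fullness = 1.0 - free_bytes / buffer_size_bytes
    if fullness > 0.8:                  # little free area: reduce the bit rate
        return current_bitrate * (1.0 - step)
    if fullness < 0.2:                  # ample free area: increase the bit rate
        return current_bitrate * (1.0 + step)
    return current_bitrate              # otherwise keep the current bit rate
```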
  • the inverse quantizer 319 performs inverse quantization on the quantized data input from the quantizer 314.
  • the inverse quantizer 319 then outputs the transform coefficient data obtained through the inverse quantization to the inverse transform unit 320.
  • The inverse transform unit 320 performs inverse transform on the transform coefficient data input from the inverse quantizer 319 to restore prediction error data.
  • the inverse transform unit 320 then outputs the restored prediction error data to the adder 321.
  • the adder 321 generates decoded image data by adding the restored prediction error data input from the inverse transform unit 320 and the predictive image data input from the intra prediction unit 324 or the inter prediction unit 325 together.
  • the generated decoded image data is output to the loop filter 322 and the memory 323.
  • the loop filter 322 performs filtering to reduce coding distortion which is caused at the time of coding an image.
  • the loop filter 322 eliminates the coding distortion by filtering the decoded image data input from the adder 321, and outputs the filtered decoded image data to the memory 323.
  • the memory 323 stores the decoded image data input from the adder 321 and the filtered decoded image data input from the loop filter 322.
  • the memory 323 may be constituted by, for example, a recording medium such as a semiconductor memory.
  • the memory 323 supplies the decoded image data before filtering, which is used for intra prediction, as reference image data to the intra prediction unit 324, or supplies the filtered decoded image data, which is used for inter prediction, as reference image data to the inter prediction unit 325.
  • the intra prediction unit 324 performs intra prediction in each intra prediction mode based on the image data to be coded, input from the re-ordering buffer 311, and the decoded image data supplied from the memory 323. For example, the intra prediction unit 324 evaluates the result of the prediction in each intra prediction mode by using a predetermined cost function. The intra prediction unit 324 then selects the intra prediction mode that minimizes the cost function value, i.e., the intra prediction mode that maximizes the compression rate, as an optimal intra prediction mode. Further, the intra prediction unit 324 outputs information related to intra prediction, such as the prediction mode information indicating the optimal intra prediction mode, the predictive image data and the cost function value.
  • the inter prediction unit 325 performs inter prediction (interframe prediction) based on the image data to be coded, input from the re-ordering buffer 311, and the decoded image data supplied from the memory 323. For example, the inter prediction unit 325 evaluates the result of the prediction in each inter prediction mode by using a predetermined cost function. The inter prediction unit 325 then selects the inter prediction mode that minimizes the cost function value, i.e., the inter prediction mode that maximizes the compression rate, as an optimal inter prediction mode. Further, the inter prediction unit 325 generates predictive image data according to the optimal inter prediction mode. Then, the inter prediction unit 325 outputs information related to inter prediction, such as the prediction mode information representing the optimal inter prediction mode, the predictive image data, the cost function value, and the motion vector.
  • the cost function value related to intra prediction output from the intra prediction unit 324 is compared with the cost function value related to inter prediction output from the inter prediction unit 325 to select the intra prediction or the inter prediction, whichever provides a smaller cost function value.
  • When the intra prediction is selected, the information related to intra prediction is output to the entropy coding unit 315, and the predictive image data is output to the subtracter 312 and the adder 321.
  • When the inter prediction is selected, on the other hand, the information related to inter prediction is output to the entropy coding unit 315, and the predictive image data is output to the subtracter 312 and the adder 321.
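The patent leaves the "predetermined cost function" open; one common choice is a rate-distortion cost J = D + lambda * R. The sketch below, with illustrative names, shows such a cost and the comparison that picks the prediction mode:

```python
# Sketch of a common choice for the "predetermined cost function" (a
# rate-distortion cost) and of the mode comparison described above.
# All names are illustrative assumptions.
def rd_cost(distortion, rate_bits, lam):
    """Rate-distortion cost J = D + lambda * R; lower J means a better mode."""
    return distortion + lam * rate_bits

def select_prediction(intra_cost, intra_pred, inter_cost, inter_pred):
    """Pick intra or inter prediction, whichever has the smaller cost value."""
    if intra_cost <= inter_cost:
        return "intra", intra_pred   # intra info goes to the entropy coding unit
    return "inter", inter_pred       # inter info (incl. motion vectors) is coded
```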
  • the video decoding device of the present embodiment is configured in such a way that the rate of frames to be displayed is set by a user, and a video with the original frame rate or a hyper-lapse video is output according to the set display frame rate.
  • When a hyper-lapse video is to be output, unnecessary frames are dropped during re-ordering in the video decoding device and will not be decoded. Therefore, redundant calculation can be avoided.
  • Fig. 5 shows the functional blocks of the video decoding device of the present embodiment.
  • the video decoding device 50 includes a bitstream buffer 51, a display frame rate setting unit 52, a frame dropping unit 53, and a video decoder unit 55.
  • The bitstream buffer 51 temporarily stores an input bitstream input from the video encoding device 10 over the transmission path 3.
  • The bitstream buffer 51 may be constituted by, for example, a recording medium such as a semiconductor memory.
  • The bitstream buffer 51 supplies metadata coded into, for example, a syntax element within SEI to the display frame rate setting unit 52.
  • the display frame rate setting unit 52 presents selectable frame rates for a hyper-lapse video to a user based on the supplied metadata.
  • the display frame rate setting unit 52 receives, from the user, selection on whether a video with the original frame rate or a hyper-lapse video is displayed.
  • the display frame rate setting unit 52 receives, from the user, selection of a frame rate from selectable frame rates for a hyper-lapse video.
  • the display frame rate setting unit 52 determines a frame rate for displaying a video.
  • the display frame rate setting unit 52 supplies the rate of the frame to be displayed, which is determined according to the selection made by the user, to the frame dropping unit 53.
  • the frame dropping unit 53 drops a frame from the bitstream stored in the bitstream buffer 51 according to the rate of the frame to be displayed, which is determined according to the selection made by the user, and supplies the resultant bitstream to the video decoder unit 55.
  • the frame to be dropped here is the frame that is associated with the frame rate selected by the user from among one or more frame rates indicated by the metadata. When the user selects display of a video with the original frames, the frame dropping unit 53 does not drop frames.
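Continuing the earlier sketches, the frame dropping unit can be modeled as a filter over buffered access units; the `AccessUnit` container with a `pic_type` attribute and the `DECODED_TYPES` table are assumptions:

```python
# Sketch of the frame dropping unit as a filter over buffered access units,
# reusing the DECODED_TYPES table sketched earlier. AccessUnit objects with a
# pic_type attribute are an assumed container, not a real decoder API.
def drop_frames(access_units, rate_tier, decoded_types):
    """Yield only the access units whose picture type is decoded at this tier."""
    needed = decoded_types[rate_tier]
    for au in access_units:
        if au.pic_type in needed:      # frames outside the set are never decoded
            yield au
```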
  • the video decoder unit 55 restores the original video or a hyper-lapse video from the bitstream supplied from the frame dropping unit 53.
  • Fig. 6 shows an example of the configuration of the video decoder unit 55.
  • the video decoder unit 55 includes an entropy decoding unit 552, an inverse quantizer 553, an inverse transform unit 554, an adder 555, a loop filter 556, a re-ordering buffer 557, a memory 558, an intra prediction unit 559, and an inter prediction unit 560.
  • the video decoder unit 55 basically performs a process inverse to the process performed by the video encoder unit 13 to restore video data.
  • the entropy decoding unit 552 decodes the input bitstream input from the bitstream buffer 51.
  • the entropy decoding unit 552 refers to syntax elements such as SPS and PPS.
  • the entropy decoding unit 552 decodes information multiplexed into the header area of the input bitstream.
  • the inverse quantizer 553 and the inverse transform unit 554 generate prediction error data by performing inverse quantization and inverse transform on quantized data input from the entropy decoding unit 552.
  • the inverse transform unit 554 outputs the generated prediction error data to the adder 555.
  • the inverse quantizer 553 and the inverse transform unit 554 perform processes inverse to the processes performed by the quantizer 314 and the transform unit 313 in the video encoding device 10. That is, the inverse quantizer 553 and the inverse transform unit 554 perform inverse quantization and inverse transform by using the SPS and the PPS corresponding to a sequence or a picture to be processed.
  • the adder 555 generates decoded image data by adding the prediction error data input from the inverse transform unit 554 and the predictive image data input from the intra prediction unit 559 or the inter prediction unit 560.
  • the generated decoded image data is output to the loop filter 556 and the memory 558.
  • the loop filter 556 eliminates the coding distortion by filtering the decoded image data input from the adder 555, and outputs the filtered decoded image data to the re-ordering buffer 557 and the memory 558.
  • the re-ordering buffer 557 re-orders the images input from the loop filter 556 to generate a sequence of time-sequential image data.
  • the image data generated by the re-ordering buffer 557 is output as a video with the original frame rate or a hyper-lapse video.
  • the memory 558 stores the unfiltered, decoded image data input from the adder 555 and the filtered decoded image data input from the loop filter 556.
  • the memory 558 may be constituted by, for example, a recording medium such as a semiconductor memory.
  • Fig. 7 shows one example of the video encoding processing in the video encoding device 10.
  • The video encoding device 10 determines frame rates (for example, f1 Hz, f2 Hz and f3 Hz) of frames to be displayed as a hyper-lapse video (step 21).
  • The frame rates for displaying frames as a hyper-lapse video may be preset in the video encoding device 10 or may be input or selected by a user.
  • The video encoding device 10 (for example, the scene analysis unit 11) performs scene analysis on the input video (step 23).
  • image data representing a plurality of scenes is analyzed.
  • The video encoding device 10 finds, for each of the hyper-lapse frame rates f1 Hz, f2 Hz and f3 Hz, an optimal frame for minimizing the motion between frames from among the plurality of frames corresponding to the plurality of scenes.
  • The video encoding device 10 may perform the above-described (1) frame matching step, (2) frame selection step, and (3) path smoothing and rendering step for each of the frame rates determined at step 21 to find the optimal frame for each of the frame rates.
  • The video encoding device 10 determines, according to the result of the scene analysis performed at step 23 (that is, the result of the analysis of the image data), which frame (picture) of the plurality of scenes represented by the image data is to be used or dropped in the video decoding device 50 to generate a hyper-lapse video, and determines the GOP structure that the plurality of frames corresponding to the plurality of scenes will have when the image data is encoded (step 25).
  • the video encoding device 10 (for example, GOP structure setting unit 12) generates metadata for generating a hyper-lapse video (step 27) .
  • The metadata indicates one or more frame rates supported for the hyper-lapse video, and indicates which frame (picture) among the plurality of frames corresponding to the plurality of scenes is to be used or dropped. Each of the one or more frame rates is associated with a frame to be used or a frame to be dropped among the plurality of frames.
  • The video encoding device 10 (for example, the video encoder unit 13) encodes the input image data according to the GOP structure determined at step 25 to generate a plurality of frames corresponding to the plurality of scenes, encodes the metadata into a syntax element such as SEI, SPS, PPS or VUI, and outputs a bitstream including the generated plurality of frames and the syntax element (step 29).
  • Fig. 8 shows one example of the video decoding processing in the video decoding device 50.
  • the video decoding device 50 receives metadata (step 61) .
  • the metadata may be received in an input bitstream input from the video encoding device 10 over a transmission path.
  • The receiving of the metadata may include decoding the metadata encoded into an SEI syntax element or another syntax element. The metadata indicates which frame (picture) is to be used or dropped in order to generate a hyper-lapse video.
  • the video decoding device 50 reads the GOP structure for the input bitstream (step 63) .
  • the GOP structure may be read from a syntax element such as SPS, PPS or VUI.
  • the video decoding device 50 determines a frame rate for displaying a video (step 65) .
  • the determining the frame rate may include presenting selectable frame rates for a hyper-lapse video to a user based on the metadata, and receiving selection on displaying a video with the original frame rate or a hyper-lapse video from the user.
  • the receiving selection on displaying a hyper-lapse video includes receiving selection of a frame rate from selectable frame rates for a hyper-lapse video from the user.
  • the display frame rate setting unit 52 determines a frame rate for displaying a video.
  • the video decoding device 50 determines whether or not to display a hyper-lapse video based on the determined frame rate (step 67) . When it is determined to display a hyper-lapse video, the processing proceeds to step 69. When it is not determined to display a hyper-lapse video, that is, when it is determined that a video with the original frame rate is to be displayed, the processing proceeds to step 71.
  • the video decoding device 50 drops unnecessary frames from the bitstream stored in the bitstream buffer 51 according to the frame rate determined at step 65 (step 69) .
  • the frame to be dropped here is the frame that is associated with the frame rate selected by the user from among one or more frame rates indicated by the metadata.
  • the remaining bitstream is supplied to the video decoder unit 55.
  • The reference B picture is dropped when the frame rate determined at step 65 is f1 Hz.
  • The reference B picture and the non-reference B picture are dropped when the determined frame rate is f2 Hz.
  • The P picture, the reference B picture and the non-reference B picture are dropped when the determined frame rate is f3 Hz.
  • the video decoding device 50 sequentially decodes the remaining bitstream which has not been dropped at step 69 for each frame (step 71) .
  • the video decoding device 50 decodes the original video or a hyper-lapse video according to the frame rate determined at step 65.
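Putting the decoding steps together, a hedged sketch reusing `drop_frames` and `DECODED_TYPES` from above follows; the parser, decoder and tier-selection callables are assumed interfaces, not a real codec API:

```python
# Hedged sketch of the decoding flow of Fig. 8 (steps 61-71), reusing
# drop_frames and DECODED_TYPES from the earlier sketches. The parser,
# decoder and choose_tier callables are assumed interfaces.
def decode_for_display(bitstream, parser, decoder, choose_tier):
    metadata = parser.read_sei_metadata(bitstream)     # step 61: supported rates
    parser.read_gop_structure(bitstream)               # step 63: GOP structure
    tier = choose_tier(metadata.supported_tiers)       # step 65: e.g. "f0".."f3"
    units = parser.access_units(bitstream)
    if tier != "f0":                                   # steps 67/69: hyper-lapse
        units = drop_frames(units, tier, DECODED_TYPES)
    return [decoder.decode(au) for au in units]        # step 71: decode the rest
```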
  • A cloud server may serve as another implementation form of the present invention; the cloud server includes the scene analysis unit 11, the GOP structure setting unit 12 and the video encoder unit 13, which are included in the video encoding device 10, together with a video decoder unit that decodes a video encoded using a fixed GOP structure and supplies the decoded video to the scene analysis unit 11.
  • Image data encoded using the fixed GOP structure (that is, a plurality of frames having the fixed GOP structure) is transmitted over a network to the cloud server to be transcoded therein.
  • Fig. 9 is a diagram showing an example of the functional blocks of a cloud server according to one embodiment of the present invention.
  • the cloud server 110 includes a video decoder unit 111, a scene analysis unit 11, a GOP structure setting unit 12, and a video encoder unit 13.
  • A bitstream including image data encoded using the fixed GOP structure is input to the cloud server 110 via a network from a camera or a smartphone.
  • A bitstream including the transcoded frames is output and transmitted via the network to a smartphone, a personal computer, or a television.
  • Fig. 10 is a flowchart illustrating an example of video transcoding processing according to one embodiment of the present invention.
  • The cloud server 110 may determine frame rates (for example, f1 Hz, f2 Hz and f3 Hz) of frames to be displayed as a hyper-lapse video (step 21).
  • The frame rates for displaying frames as a hyper-lapse video may be preset in the cloud server 110 or may be input or selected by a user.
  • The video decoder unit 111 included in the cloud server decodes the input bitstream to restore the image data, and the scene analysis unit 11 included in the cloud server analyzes the restored image data (step 23).
  • The GOP structure setting unit 12 included in the cloud server dynamically determines a GOP structure of a plurality of frames for generating a hyper-lapse video according to the result of the analysis of the image data (step 25).
  • The video encoder unit 13 included in the cloud server encodes the restored image data according to the determined GOP structure and generates a transcoded bitstream (step 27). That is, the plurality of frames having the fixed GOP structure is transcoded into a plurality of frames having the dynamically determined GOP structure, and a bitstream including the plurality of transcoded frames and the metadata is transmitted to the video decoding device 50 described above over the network (step 29).
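The transcoding flow of Fig. 10 can then be sketched end to end, reusing the illustrative helpers above; the decoder, encoder and scene-analysis objects are assumed interfaces, not a real codec API:

```python
# End-to-end sketch of the transcoding flow of Fig. 10 (steps 21-29), reusing
# the illustrative helpers above. decoder, encoder and analyze_scenes are
# assumed interfaces, not a real codec API.
def transcode(input_bitstream, rates_hz, decoder, encoder, analyze_scenes):
    frames = decoder.decode_all(input_bitstream)               # restore image data
    sel_f1, sel_f2, sel_f3 = analyze_scenes(frames, rates_hz)  # per-rate optimal frames
    gop = assign_picture_types(len(frames), sel_f1, sel_f2, sel_f3)
    sei = build_hyperlapse_sei_payload(rates_hz)               # supported-rates metadata
    return encoder.encode(frames, gop_structure=gop, sei_payloads=[sei])
```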

Abstract

A method for avoiding redundant calculation when creating a hyper-lapse video from an encoded bitstream is provided. The video encoding method includes: performing scene analysis on an input bitstream to determine a frame to be dropped at a time of generating a hyper-lapse video in a video decoding device; and encoding the bitstream subjected to the scene analysis and outputting the encoded bitstream. The video encoding method further includes dynamically determining a GOP (Group of Picture) structure of a bitstream to be output based on the scene analysis.

Description

VIDEO ENCODING METHOD, VIDEO DECODING METHOD, VIDEO ENCODING DEVICE, AND VIDEO DECODING DEVICE

Technical Field
The present application relates to a video encoding method, a video decoding method, a video encoding device, and a video decoding device.
Background Art
As inexpensive and high-quality cameras become widely available in the market, users tend to shoot fairly long videos. For example, a user shoots a long video while moving a long distance. If there is not enough time to see the entire long video shot, the user can watch a hyper-lapse video created from the original long video.
A video decoding device needs redundant calculation to create a hyper-lapse video after decoding a full bitstream including a plurality of frames encoded by a video encoding device. For example, even if a frame (an I picture or a P picture) encoded in a lower temporal hierarchy within a GOP (Group of Picture) structure, as specified in H.264, is dropped at a time of creating a hyper-lapse video, such a frame is referenced when decoding a frame (for example, a reference B picture or a non-reference B picture) encoded in a higher hierarchy, so that the frame cannot be dropped and needs to be decoded. In addition, although the computational capability of a receiving device (including a video decoding device) that receives and decodes a bitstream is usually limited, scene analysis to create a hyper-lapse video after decoding requires still more calculation.
It is an object of an embodiment of the present invention to avoid the aforementioned redundant calculation when creating a hyper-lapse video from a bitstream including a plurality of frames encoded by a method specified in H.264 or H.265, or a method similar to those methods, which supports both reference B pictures and non-reference B pictures.
Summary of Invention
One embodiment of the present invention is a video encoding method executed by a video encoding device, which includes: analyzing input image data to determine whether any of a plurality of scenes represented by the image data is to be dropped at a time of generating a hyper-lapse video in a video decoding device; and encoding the image data based on a result of the analysis to generate a plurality of frames corresponding to the plurality of scenes, and outputting a bitstream including the generated plurality of frames. The video encoding method according to an embodiment of the present invention allows a video decoding device with a limited computational capability to easily select an optimal frame and display a hyper-lapse video. In an example, the determination is reflected as a picture type (I/P/refB/nonrefB pictures) within a bitstream.
In one aspect, the determination may include dynamically determining a GOP structure of the plurality of frames in the bitstream to be output based on the result of the analysis. The determining of the GOP structure may include assigning a frame to be dropped at a time of generating a hyper-lapse video in the video decoding device to a higher temporal hierarchy within the GOP structure. The video decoding device then does not need redundant computation at the time of dropping some of the frames. In an example, the higher temporal hierarchy corresponds to B pictures such as b1, b2, b3 and b11 shown in Fig. 2.
In another aspect, the bitstream to be output may support a plurality of frame rates for a user to view the bitstream as a hyper-lapse video in the video decoding device. A video encoding method according to this embodiment may further include inserting metadata for decoding the bitstream as a hyper-lapse video into a sequence parameter set (SPS), a picture parameter set (PPS), supplemental enhancement information (SEI), or video usability information (VUI) of the bitstream to be output, so as to be supplied to the video decoding device. The use of an existing syntax allows the metadata to be acquired using the existing functions of a video decoding device. For example, the metadata may indicate that the bitstream supports frame rates of f1, f2 and f3 Hz when it is converted into a hyper-lapse video.
Another embodiment of the present invention is a video transcoding method executed by a server, which includes: decoding a plurality of encoded first frames in an input bitstream, the plurality of encoded first frames having a first GOP (Group of Picture) structure; by analyzing image data representing a plurality of scenes corresponding to the plurality of decoded first frames, determining whether any of the plurality of scenes is to be dropped at a time of generating a hyper-lapse video in a video decoding device; determining a second GOP structure which a plurality of second frames will have, based on a result of the analysis, the plurality of second frames corresponding to the plurality of scenes; transcoding the plurality of decoded first frames having the first GOP structure into the plurality of second frames; and transmitting the plurality of second frames in a bitstream to the video decoding device. The video transcoding method according to this embodiment of the present invention allows a server on a cloud computer to transcode a bitstream including a plurality of frames having a fixed GOP structure into a bitstream including a plurality of frames having a GOP structure that enables a video decoding device with a limited computational capability to display a hyper-lapse video, and to provide the bitstream to the video decoding device. In addition, mobile devices such as cameras and smartphones have limitations in the amount of computation and power consumption. However, with the cloud server, it is possible to perform processing without regard to those limitations. When the video encoding method is implemented by the video encoding device, it is necessary to perform capture, analysis, and encoding of the image in real time. However, when the video transcoding method is implemented by the cloud server, transcoding may be performed by off-line processing after the bitstream is transmitted to the server.
Yet another embodiment of the present invention is a video decoding method executed by a video decoding device, which includes: receiving a bitstream, the bitstream including a plurality of encoded frames and metadata for decoding the plurality of encoded frames as a hyper-lapse video, the metadata indicating one or more frame rates for the hyper-lapse video supported by the bitstream, each of the one or more frame rates being associated with a frame to be used or a frame to be dropped among the plurality of frames; determining a frame rate for displaying a video in the one or more frame rates indicated by the metadata; determining whether or not to display the hyper-lapse video based on the determined frame rate; dropping some of the plurality of encoded frames based on the determined frame rate, on condition that it is determined that the hyper-lapse video is displayed; and decoding remaining frames of the plurality of encoded frames to display the hyper-lapse video. The video decoding method according to this embodiment of the present invention allows a video decoding device with a limited computational capability to easily select an optimal frame and display a hyper-lapse video.
In one aspect, the plurality of encoded frames may be encoded in a GOP structure dynamically determined in a video encoding device. The dropped frame may be encoded in a higher temporal hierarchy within the GOP structure. The dropping of some of the plurality of encoded frames may include dropping a frame encoded in a higher temporal hierarchy within the GOP structure according to the determined frame rate. The video decoding device then does not need redundant computation at the time of dropping some of the frames.
In another aspect, the metadata may be included in a sequence parameter set (SPS) , a picture parameter set (PPS) , supplemental enhancement information (SEI) , or video usability information (VUI) of the bitstream. The use of an existing syntax allows metadata to be acquired using the existing functions of a video decoding device.
In a further aspect, the bitstream may be input from a video encoding device or a server.
Still another embodiment of the present invention is a video encoding device, which includes: a scene analysis unit configured to analyze input image data to determine whether any of a plurality of scenes represented by the image data is to be dropped at a time of generating a hyper-lapse video in a video decoding device; a GOP (Group of Picture) structure setting unit configured to determine a GOP structure of the plurality of frames in a bitstream to be output, based on a result of the analysis; and a video encoder unit configured to, based on the result of the analysis, encode the image data to generate a plurality of frames corresponding to the plurality of scenes, and output a bitstream including the generated plurality of frames, the generated plurality of frames having the determined GOP structure. The video encoding device according to this embodiment of the present invention allows a video decoding  device with a limited computational capability to easily select an optimal frame and display a hyper-lapse video.
Yet still another embodiment of the present invention is a server including: a video decoder unit configured to decode a plurality of encoded first frames in an input bitstream, the plurality of first frames having a first GOP structure; a scene analysis unit configured to analyze image data representing a plurality of scenes corresponding to the plurality of decoded first frames to determine whether any of the plurality of scenes is to be dropped at a time of generating a hyper-lapse video in a video decoding device; a GOP structure setting unit configured to determine a second GOP structure which a plurality of second frames will have, based on a result of the analysis, the plurality of second frames corresponding to the plurality of scenes; and a video encoder unit configured to transcode the plurality of decoded first frames having the first GOP structure into the plurality of second frames, and transmit the plurality of second frames in a bitstream to the video decoding device. The server according to this embodiment of the present invention allows a server on a cloud computer to transcode a bitstream including a plurality of frames having a fixed GOP structure into a bitstream including a plurality of frames having a GOP structure that enables a video decoding device with a limited computational capability to display a hyper-lapse video, and to provide the bitstream to the video decoding device. In addition, mobile devices such as cameras and smartphones are limited in the amount of computation and power consumption, whereas a cloud server can perform the processing without regard to such limitations. When the video encoding method is implemented by the video encoding device, capture, analysis, and encoding of the image must be performed in real time; the cloud server, however, may perform transcoding by off-line processing after the bitstream is transmitted to the server.
Yet still another embodiment of the present invention is a video decoding device including: a bitstream buffer configured to receive a bitstream, the bitstream including a plurality of encoded frames, metadata for decoding the plurality of encoded frames as a hyper-lapse video in the bitstream, and a GOP (Group of Picture) structure which the plurality of encoded frames have, the metadata indicating one or more frame rates for the hyper-lapse video supported by the bitstream, each of the one or more frame rates being associated with a frame to be used or a frame to be dropped among the plurality of frames, the bitstream buffer being also configured to read the GOP structure; a display frame rate setting unit configured to determine a frame rate for displaying a video in the one or more frame rates indicated by the metadata; a frame dropping unit configured to drop some of the plurality of encoded frames based on the determined frame rate; and a video decoder unit configured to decode remaining frames in the plurality of encoded frames to output the hyper-lapse video. The video decoding device according to this embodiment of the present invention can easily select an optimal frame and display  a hyper-lapse video without requiring an additional computational capability.
Brief Description of Drawings
To describe the technical solutions in the embodiments more clearly, the following briefly describes the accompanying drawings required for describing the present embodiments. Apparently, the accompanying drawings in the following description depict merely some of the possible embodiments, and a person of ordinary skill in the art may still derive other drawings, without creative efforts, from these accompanying drawings, in which:
[Fig. 1] Fig. 1 is a diagram showing the GOP structure of a typical bitstream;
[Fig. 2] Fig. 2 is a diagram showing an example of the GOP structure of a bitstream which is determined in one embodiment of the present invention;
[Fig. 3] Fig. 3 is a diagram showing an example of the functional blocks of a video encoding device according to one embodiment of the present invention;
[Fig. 4] Fig. 4 is a diagram showing an example of the configuration of a video encoder unit according to one embodiment of the present invention;
[Fig. 5] Fig. 5 is a diagram showing an example of the functional blocks of a video decoding device according to one embodiment of the present invention;
[Fig. 6] Fig. 6 is a diagram showing an example of the  configuration of a video decoder unit according to one embodiment of the present invention;
[Fig. 7] Fig. 7 is a flowchart illustrating an example of video encoding processing according to one embodiment of the present invention;
[Fig. 8] Fig. 8 is a flowchart illustrating an example of video decoding processing according to one embodiment of the present invention;
[Fig. 9] Fig. 9 is a diagram showing an example of the functional blocks of a cloud server according to one embodiment of the present invention; and
[Fig. 10] Fig. 10 is a flowchart illustrating an example of video transcoding processing according to one embodiment of the present invention.
Description of Embodiments
The following will describe embodiments of the present invention in detail referring to the accompanying drawings. Same or like reference symbols indicate same or like elements to avoid redundant descriptions.
According to one embodiment of the present invention, a video encoding device determines which pictures are to be dropped to generate a hyper-lapse video before encoding a video into a bitstream, so that a GOP (Group of Picture) structure may be determined dynamically.
Also, according to one embodiment of the present invention, in order to generate a hyper-lapse video, a video decoding device may drop a reference B picture, a non-reference B picture, or a P picture according to the playback speed of the hyper-lapse video, based on the temporal hierarchy of a bitstream.
Moreover, according to one embodiment of the present invention, metadata for generating a hyper-lapse video from a bitstream received at a video decoding device may be transmitted in SEI (Supplemental Enhancement Information) or another syntax element of the bitstream from a video encoding device to the video decoding device.
Furthermore, scene analysis and transcoding (modification of the GOP structure) to enable generation of a hyper-lapse video can be performed on a cloud server as well as the video encoding device.
(Video Encoding Standards)
The latest video coding standards such as H.264 and H.265 support both reference B pictures and non-reference B pictures that enable temporal scalability in addition to conventional I pictures and P pictures (for example, see Gary J. Sullivan, et al., “Overview of the High Efficiency Video Coding (HEVC) Standard”, IEEE Trans. on Circuits and Systems for Video Technology, Vol. 22, No. 12, Dec. 2012).
Fig. 1 shows the GOP structure of a plurality of frames included in a typical bitstream. A plurality of frames are generated by encoding image data, and the plurality of scenes represented by the image data correspond to the plurality of frames. In Fig. 1, I0 represents an I picture, P4 represents a P picture, B2 represents a reference B picture, and b1 and b3 represent non-reference B pictures. The GOP structure of the typical bitstream shown in Fig. 1 is fixed, so that the GOP structure appears periodically and repeatedly in a bitstream output from the video encoding device. When the frame rate of the original full bitstream is 60 Hz, the bitstream shown in Fig. 1 can be decoded at a frame rate of 30 Hz by dropping the non-reference B pictures b1 and b3. The bitstream can also be decoded at a frame rate of 15 Hz by dropping the non-reference B picture b1, the reference B picture B2, and the non-reference B picture b3.
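By way of illustration only, the following Python sketch reproduces this fixed-structure dropping. The GOP pattern, the picture-type labels, and the rate-to-type mapping are assumptions taken from the example above, not from any particular codec implementation.

    # Sketch: choosing which frames of the fixed GOP of Fig. 1 to decode.
    GOP = ["I0", "b1", "B2", "b3", "P4", "b5", "B6", "b7"]  # repeats periodically

    def frames_for_rate(gop, target_hz, full_hz=60):
        # 60 Hz: keep all; 30 Hz: drop non-reference B ('b');
        # 15 Hz: drop reference and non-reference B ('B' and 'b').
        if target_hz == full_hz:
            drop = set()
        elif target_hz == full_hz // 2:
            drop = {"b"}
        elif target_hz == full_hz // 4:
            drop = {"b", "B"}
        else:
            raise ValueError("unsupported target rate")
        return [f for f in gop if f[0] not in drop]

    print(frames_for_rate(GOP, 30))  # ['I0', 'B2', 'P4', 'B6']
    print(frames_for_rate(GOP, 15))  # ['I0', 'P4']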
The simplest method for creating a hyper-lapse video is to subsample the frames or scenes of an input video uniformly in the temporal direction. However, when the motion of the camera at the time of shooting includes high-frequency components (for example, when the video shakes violently up and down, left and right, or back and forth), the resulting hyper-lapse video will be uncomfortable to watch. This may happen, for example, when shooting a video with a so-called action camera.
Thus, there is a need for a method for creating a comfortable hyper-lapse video.
(Scene Analysis and Determination of GOP Structure)
The video encoding device according to the present embodiment analyzes image data to determine which frame (picture) in a plurality of frames corresponding to a plurality of scenes represented by the image data is to be selected or dropped in order to generate a hyper-lapse video. The video  encoding device encodes the image data based on a result of the analysis to generate a plurality of frames corresponding to a plurality of scenes. The video encoding device dynamically determines the GOP (Group of Picture) structure based on the result of the analysis of the image data.
It is assumed that the original frame rate of the bitstream is f0 Hz and that the frame rates of hyper-lapse videos that can be generated from the bitstream are f1 Hz, f2 Hz and f3 Hz (f0 > f1 > f2 > f3). The GOP structure is dynamically determined in such a way that only I pictures, P pictures and non-reference B pictures are decoded when a user wants to view a hyper-lapse video with a frame rate of f1 Hz, only I pictures and P pictures are decoded when the user wants to view a hyper-lapse video with a frame rate of f2 Hz, and only I pictures are decoded when the user wants to view a hyper-lapse video with a frame rate of f3 Hz.
Fig. 2 shows a GOP structure that is determined dynamically. The GOP structure shown in Fig. 2 need not be periodic, and which frame is to be dropped usually differs for each GOP.
For example, the video encoding device may analyze image data with a known algorithm (see, for example, Neel Joshi, et al., “Real-Time Hyperlapse Creation via Optimal Frame Selection,” ACM Transactions on Graphics, Vol. 34, July 2015) to dynamically determine a GOP structure.
The known algorithm generally includes three steps. (1) In a frame matching step, feature-quantity-based sparse estimation is used to estimate how well temporally adjacent frames can be aligned, and the calculated costs are stored as a sparse matrix. (2) In a frame selection step, dynamic time warping (DTW) is used to find an optimal frame path that trades off matching a target speed-up rate against minimizing the motion between frames. (3) In a path smoothing and rendering step, a hyper-lapse video is rendered from the selected frames by smoothing the camera path to obtain a stabilized result.
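As a much-simplified Python sketch of the frame selection step (2), the following dynamic-programming search balances a target speed-up against an inter-frame motion cost; the quadratic speed-up penalty and the toy motion cost stand in for the feature-based matching costs of step (1) and are illustrative assumptions only.

    # Simplified DTW-style frame selection: pick a low-motion frame path
    # whose step length stays close to the target speed-up.
    def select_frames(motion_cost, n, target_step, max_step=8, w=1.0):
        INF = float("inf")
        best = [INF] * n        # best[j]: minimal cost of a path ending at j
        prev = [-1] * n
        best[0] = 0.0
        for j in range(1, n):
            for i in range(max(0, j - max_step), j):
                c = best[i] + motion_cost(i, j) + w * (j - i - target_step) ** 2
                if c < best[j]:
                    best[j], prev[j] = c, i
        path, j = [], n - 1     # backtrack from the last frame
        while j != -1:
            path.append(j)
            j = prev[j]
        return path[::-1]

    # Toy motion model: cost grows with the length of the jump.
    print(select_frames(lambda i, j: 0.1 * (j - i), n=20, target_step=4))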
In the video encoding device according to the present embodiment, image data analysis including (1) the frame matching step, (2) the frame selection step, and (3) the path smoothing and rendering step makes it possible to find, for each of the frame rates f1 Hz, f2 Hz and f3 Hz, an optimal frame that minimizes the motion between frames, and thereby to determine a GOP structure. That is, it is possible to determine whether any of a plurality of scenes represented by image data is to be dropped at the time of generating a hyper-lapse video in a video decoding device, and to determine a GOP structure of a plurality of frames corresponding to the plurality of scenes. The video encoding device according to this embodiment can encode the image data based on the result of the analysis of the image data and generate a plurality of frames having the determined GOP structure. The generated plurality of frames can be output in such a form as to be included in a bitstream.
Thus, complex scene analysis, that is, complex analysis of image data, is necessary only in the video encoding device and is unnecessary in the video decoding device. Further, data provided by a gyro sensor is useful for complex scene analysis. Such data is available when a video is captured by a camera (provided in the video encoding device), but is not available at display time (in the video decoding device). Therefore, it is useful to dynamically determine, in the video encoding device, a GOP structure for displaying a bitstream as a hyper-lapse video.
(Configuration of Video Encoding Device)
The video encoding device according to the present embodiment is configured in such a way that one or more frame rates (for example, f1 Hz, f2 Hz and f3 Hz) of hyper-lapse videos which can be selected in the video decoding device by a user are set in the video encoding device. Of course, the video encoding device may be configured in such a way that a user selectively sets one or more frame rates of hyper-lapse videos.
Fig. 3 shows the functional blocks of the video encoding device of this embodiment. As shown in Fig. 3, the video encoding device 10 includes a scene analysis unit 11, a GOP structure setting unit 12, and a video encoder unit 13.
The scene analysis unit 11 analyzes image data representing a plurality of scenes, before the image data of an input video is encoded, to find, for each of the frame rates f1 Hz, f2 Hz and f3 Hz of hyper-lapse videos, an optimal frame among the plurality of frames corresponding to the plurality of scenes.
The GOP structure setting unit 12 determines, according to the result of the analysis of the image data performed by the scene analysis unit 11, which frame (picture) is to be used or dropped in the video decoding device for the plurality of scenes represented by the image data, and thereby determines and sets a GOP structure of the plurality of frames corresponding to the plurality of scenes at the time of encoding the image data into the plurality of frames in order to generate a hyper-lapse video. The GOP structure setting unit 12 also supplies the video encoder unit 13 with metadata indicating which frame (picture) is to be used or dropped in order to generate a hyper-lapse video. The metadata indicates one or more frame rates for a hyper-lapse video supported by a bitstream including a plurality of encoded frames. Each of the one or more frame rates is associated with a frame to be used or a frame to be dropped among the plurality of frames. In this embodiment, an optimal frame for generating a hyper-lapse video with a frame rate of f3 Hz is set to be encoded as an I picture. Of the optimal frames for generating a hyper-lapse video with a frame rate of f2 Hz, frames other than the optimal frame for the frame rate f3 Hz are set to be encoded as P pictures. Of the optimal frames for generating a hyper-lapse video with a frame rate of f1 Hz, frames other than the optimal frames for the frame rates f3 Hz and f2 Hz are set to be encoded as non-reference B pictures. The remaining frames are set to be encoded as reference B pictures. In this manner, a GOP structure is dynamically set according to the result of the analysis of the image data.
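A minimal Python sketch of this picture-type assignment follows; the optimal-frame index sets s1, s2 and s3 (assumed nested, as described) are the frames found by the scene analysis for the rates f1, f2 and f3 Hz respectively.

    def assign_types(num_frames, s1, s2, s3):
        types = []
        for i in range(num_frames):
            if i in s3:
                types.append("I")   # needed even at the lowest rate f3
            elif i in s2:
                types.append("P")   # additionally needed at f2
            elif i in s1:
                types.append("b")   # additionally needed at f1 (non-reference B)
            else:
                types.append("B")   # full-rate video only (reference B)
        return types

    print(assign_types(8, s1={0, 2, 4, 6}, s2={0, 4}, s3={0}))
    # ['I', 'B', 'b', 'B', 'P', 'B', 'b', 'B']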
The video encoder unit 13 encodes the input image data according to the GOP structure set by the GOP structure setting unit 12 to generate a plurality of frames corresponding to a plurality of scenes, and outputs a bitstream including the generated plurality of frames.
(Configuration of Video Encoder Unit)
Fig. 4 shows one example of the configuration of the video encoder unit 13. The video encoder unit 13 includes a re-ordering buffer 311, a subtractor 312, a transform unit 313, a quantizer 314, an entropy coding unit 315, and a buffer 316. The video encoder unit 13 further includes a rate controller 318, an inverse quantizer 319, an inverse transform unit 320, an adder 321, a loop filter 322, a memory 323, an intra prediction unit 324, and an inter prediction unit 325.
The re-ordering buffer 311 re-orders the input video data (image data) according to the GOP structure set by the GOP structure setting unit 12. The re-ordered image data is output to the subtractor 312.
Image data input from the re-ordering buffer 311 and predictive image data from the intra prediction unit 324 or the inter prediction unit 325 are supplied to the subtractor 312. The subtractor 312 calculates prediction error data, which is the difference between the image data from the re-ordering buffer 311 and the predictive image data, and outputs the calculated prediction error data to the transform unit 313.
The transform unit 313 performs a transform on the prediction error data input from the subtractor 312, and generates transform coefficient data, which is the result of transforming a pixel region of the image into a frequency region. The generated transform coefficient data is output to the quantizer 314. The transform performed by the transform unit 313 may be, for example, a discrete cosine transform (DCT), a Karhunen-Loève transform, or the like.
The quantizer 314 performs quantization on the transform coefficient data output from the transform unit 313. The quantized transform coefficient data is output to the entropy coding unit 315 and the inverse quantizer 319. The bit rate of the quantized data output from the quantizer 314 is controlled based on a rate control signal from the rate controller 318.
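As an illustrative sketch of this forward path, the following Python fragment applies a 2-D DCT followed by uniform quantization; the QP-to-step rule is an assumption in the spirit of H.264-style quantizers, not an exact specification.

    import numpy as np
    from scipy.fft import dctn

    def transform_and_quantize(block, qp):
        # transform unit 313: pixel region -> frequency region (here a DCT)
        coeffs = dctn(block.astype(float), norm="ortho")
        # quantizer 314: uniform quantization; step doubles every 6 QP
        qstep = 2.0 ** ((qp - 4) / 6.0)
        return np.round(coeffs / qstep).astype(int)

    block = np.arange(16, dtype=np.uint8).reshape(4, 4)
    print(transform_and_quantize(block, qp=28))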
The entropy coding unit 315 codes the metadata, supplied from the GOP structure setting unit 12 and indicating which frame (picture) is to be used or dropped in order to generate a hyper-lapse video, into a syntax element such as SEI (Supplemental Enhancement Information), SPS (Sequence Parameter Set), PPS (Picture Parameter Set), or VUI (Video Usability Information), which is associated with the image data. The metadata may be the frame rates (f1 Hz, f2 Hz and f3 Hz) of hyper-lapse videos.
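The following Python sketch shows how such metadata might be serialized, for example into the payload of a user-data SEI message; the byte layout is invented for illustration, and a real encoder would follow the SEI payload syntax of H.264/H.265.

    import struct

    def pack_hyperlapse_sei(rates_hz):
        # one byte: number of supported rates; then each rate as 16 bits
        payload = struct.pack(">B", len(rates_hz))
        for r in rates_hz:
            payload += struct.pack(">H", r)
        return payload

    def unpack_hyperlapse_sei(payload):
        (n,) = struct.unpack_from(">B", payload, 0)
        return [struct.unpack_from(">H", payload, 1 + 2 * i)[0] for i in range(n)]

    sei = pack_hyperlapse_sei([30, 15, 5])  # assumed f1, f2, f3
    print(unpack_hyperlapse_sei(sei))       # [30, 15, 5]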
The entropy coding unit 315 performs entropy coding on quantized data to generate a bitstream including the coded plurality of frames. Coding by the entropy coding unit 315 may  be, for example, variable length coding, arithmetic coding or the like.
The buffer 316 temporarily stores the bitstream output from the entropy coding unit 315, and then outputs the stored bitstream at a rate matching the bandwidth of the transmission path to the video decoding device. The buffer 316 may be constituted by a recording medium such as a semiconductor memory.
The rate controller 318 monitors the free area of the buffer 316. Then, the rate controller 318 generates a rate control signal according to the free area of the buffer 316, and outputs the generated rate control signal to the quantizer 314. When the free area of the buffer 316 is small, for example, the rate controller 318 generates the rate control signal to reduce the bit rate for quantized data. When the free area of the buffer 316 is sufficiently large, on the other hand, the rate controller 318 generates the rate control signal to increase the bit rate for quantized data.
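A minimal Python sketch of this buffer-based control follows, modeling the rate control signal as a quantization parameter (QP) adjustment; the thresholds and step size are illustrative assumptions.

    def rate_control_signal(buffer_fill, buffer_size, qp, qp_min=0, qp_max=51):
        fullness = buffer_fill / buffer_size
        if fullness > 0.8:      # little free area: reduce the bit rate
            qp = min(qp + 2, qp_max)
        elif fullness < 0.2:    # ample free area: increase the bit rate
            qp = max(qp - 2, qp_min)
        return qp               # a larger QP means coarser quantization

    print(rate_control_signal(buffer_fill=900, buffer_size=1000, qp=30))  # 32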
The inverse quantizer 319 performs inverse quantization on the quantized data input from the quantizer 314. The inverse quantizer 319 then outputs the transform coefficient data obtained through the inverse quantization to the inverse transform unit 320.
The inverse transform unit 320 performs an inverse transform on the transform coefficient data input from the inverse quantizer 319 to restore the prediction error data. The inverse transform unit 320 then outputs the restored prediction error data to the adder 321.
The adder 321 generates decoded image data by adding the restored prediction error data input from the inverse transform unit 320 and the predictive image data input from the intra prediction unit 324 or the inter prediction unit 325 together. The generated decoded image data is output to the loop filter 322 and the memory 323.
The loop filter 322 performs filtering to reduce coding distortion which is caused at the time of coding an image. The loop filter 322 eliminates the coding distortion by filtering the decoded image data input from the adder 321, and outputs the filtered decoded image data to the memory 323.
The memory 323 stores the decoded image data input from the adder 321 and the filtered decoded image data input from the loop filter 322. Specifically, the memory 323 may be constituted by, for example, a recording medium such as a semiconductor memory. The memory 323 supplies the decoded image data before filtering, which is used for intra prediction, as reference image data to the intra prediction unit 324, or supplies the filtered decoded image data, which is used for inter prediction, as reference image data to the inter prediction unit 325.
The intra prediction unit 324 performs intra prediction in each intra prediction mode based on the image data to be coded, input from the re-ordering buffer 311, and the decoded image data supplied from the memory 323. For example, the intra prediction unit 324 evaluates the result of the prediction in  each intra prediction mode by using a predetermined cost function. The intra prediction unit 324 then selects the intra prediction mode that minimizes the cost function value, i.e., the intra prediction mode that maximizes the compression rate, as an optimal intra prediction mode. Further, the intra prediction unit 324 outputs information related to intra prediction, such as the prediction mode information indicating the optimal intra prediction mode, the predictive image data and the cost function value.
The inter prediction unit 325 performs inter prediction (interframe prediction) based on the image data to be coded, input from the re-ordering buffer 311, and the decoded image data supplied from the memory 323. For example, the inter prediction unit 325 evaluates the result of the prediction in each inter prediction mode by using a predetermined cost function. The inter prediction unit 325 then selects the inter prediction mode that minimizes the cost function value, i.e., the inter prediction mode that maximizes the compression rate, as an optimal inter prediction mode. Further, the inter prediction unit 325 generates predictive image data according to the optimal inter prediction mode. Then, the inter prediction unit 325 outputs information related to inter prediction, such as the prediction mode information representing the optimal inter prediction mode, the predictive image data, the cost function value, and the motion vector.
The cost function value related to intra prediction output from the intra prediction unit 324 is compared with the cost function value related to inter prediction output from the inter prediction unit 325, and the intra prediction or the inter prediction, whichever provides the smaller cost function value, is selected. When the intra prediction is selected, the information related to intra prediction is output to the entropy coding unit 315, and the predictive image data is output to the subtractor 312 and the adder 321. When the inter prediction is selected, on the other hand, the information related to inter prediction is output to the entropy coding unit 315, and the predictive image data is output to the subtractor 312 and the adder 321.
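These mode decisions can be sketched in Python with a generic rate-distortion cost J = D + λR; the λ value and the candidate figures below are illustrative assumptions.

    LAMBDA = 0.85  # assumed Lagrange multiplier

    def best_mode(candidates):
        # candidates: list of (mode_name, distortion, rate_bits)
        return min(candidates, key=lambda c: c[1] + LAMBDA * c[2])

    # per-unit decisions (intra prediction unit 324 / inter prediction unit 325)
    intra = best_mode([("intra_dc", 120.0, 40), ("intra_planar", 100.0, 48)])
    inter = best_mode([("inter_16x16", 60.0, 90), ("inter_skip", 140.0, 2)])
    # final intra-versus-inter selection, as described above
    print(best_mode([intra, inter]))  # ('inter_16x16', 60.0, 90)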
(Configuration of Video Decoding Device)
The video decoding device of the present embodiment is configured in such a way that the rate of frames to be displayed is set by a user, and a video with the original frame rate or a hyper-lapse video is output according to the set display frame rate. When a hyper-lapse video is to be output, unnecessary frames are dropped during re-ordering in the video decoding device and are not decoded. Therefore, redundant calculation can be avoided.
Fig. 5 shows the functional blocks of the video decoding device of the present embodiment. As shown in Fig. 5, the video decoding device 50 includes a bitstream buffer 51, a display frame rate setting unit 52, a frame dropping unit 53, and a video decoder unit 55.
The bitstream buffer 51 temporarily stores an input bitstream input from the video encoding device 10 over the transmission path 3. The bitstream buffer 51 may be constituted by, for example, a recording medium such as a semiconductor memory. The bitstream buffer 51 supplies the metadata, coded into, for example, a syntax element within SEI, to the display frame rate setting unit 52.
The display frame rate setting unit 52 presents selectable frame rates for a hyper-lapse video to the user based on the supplied metadata. The display frame rate setting unit 52 receives, from the user, a selection of whether a video with the original frame rate or a hyper-lapse video is to be displayed. When display of a hyper-lapse video is selected, the display frame rate setting unit 52 receives, from the user, a selection of a frame rate from the selectable frame rates for a hyper-lapse video. In response to the reception of the selection from the user, the display frame rate setting unit 52 determines a frame rate for displaying a video and supplies the determined rate of the frames to be displayed to the frame dropping unit 53.
The frame dropping unit 53 drops frames from the bitstream stored in the bitstream buffer 51 according to the rate of the frames to be displayed, which is determined according to the selection made by the user, and supplies the resulting bitstream to the video decoder unit 55. The frames to be dropped here are the frames associated with the frame rate selected by the user from among the one or more frame rates indicated by the metadata. When the user selects display of a video with the original frame rate, the frame dropping unit 53 does not drop frames.
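A combined Python sketch of the display frame rate setting unit 52 and the frame dropping unit 53 follows; the in-memory metadata representation and the drop rule (which mirrors the picture-type assignment described for the encoder) are assumptions for illustration.

    DROP_BY_RATE = {
        "f1": {"B"},            # drop reference B pictures
        "f2": {"B", "b"},       # also drop non-reference B pictures
        "f3": {"B", "b", "P"},  # keep only I pictures
    }

    def choose_display_rate(metadata_rates, user_choice):
        # user_choice: None for original-rate playback, else a metadata rate
        if user_choice is None:
            return None
        if user_choice not in metadata_rates:
            raise ValueError("rate not supported by this bitstream")
        return user_choice

    def drop_frames(frames, rate):
        # frames: (picture_type, payload) tuples from the bitstream buffer
        drop = DROP_BY_RATE.get(rate, set())
        return [f for f in frames if f[0] not in drop]

    frames = [("I", 0), ("b", 1), ("B", 2), ("P", 3), ("b", 4)]
    print(drop_frames(frames, choose_display_rate(["f1", "f2", "f3"], "f2")))
    # [('I', 0), ('P', 3)]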
The video decoder unit 55 restores the original video or a hyper-lapse video from the bitstream supplied from the frame dropping unit 53.
(Configuration of Video Decoder Unit)
Fig. 6 shows an example of the configuration of the video decoder unit 55. The video decoder unit 55 includes an entropy decoding unit 552, an inverse quantizer 553, an inverse transform unit 554, an adder 555, a loop filter 556, a re-ordering buffer 557, a memory 558, an intra prediction unit 559, and an inter prediction unit 560. The video decoder unit 55 basically performs a process inverse to the process performed by the video encoder unit 13 to restore video data.
The entropy decoding unit 552 decodes the bitstream input from the bitstream buffer 51. The entropy decoding unit 552 refers to syntax elements such as the SPS and the PPS, and decodes information multiplexed into the header area of the input bitstream.
The inverse quantizer 553 and the inverse transform unit 554 generate prediction error data by performing inverse quantization and inverse transform on quantized data input from the entropy decoding unit 552. The inverse transform unit 554 outputs the generated prediction error data to the adder 555. The inverse quantizer 553 and the inverse transform unit 554 perform processes inverse to the processes performed by the quantizer 314 and the transform unit 313 in the video encoding device 10. That is, the inverse quantizer 553 and the inverse  transform unit 554 perform inverse quantization and inverse transform by using the SPS and the PPS corresponding to a sequence or a picture to be processed.
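As an illustrative sketch of this inverse path, mirroring the forward sketch given for the encoder, the following Python fragment assumes a uniform quantizer and a 2-D DCT.

    import numpy as np
    from scipy.fft import idctn

    def dequantize_and_itransform(qcoeffs, qp):
        # inverse quantizer 553: scale back by the assumed quantization step
        qstep = 2.0 ** ((qp - 4) / 6.0)
        coeffs = qcoeffs * qstep
        # inverse transform unit 554: frequency region -> pixel region
        return idctn(coeffs, norm="ortho")

    qcoeffs = np.array([[4, 0], [0, 0]], dtype=float)
    print(dequantize_and_itransform(qcoeffs, qp=28))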
The adder 555 generates decoded image data by adding the prediction error data input from the inverse transform unit 554 and the predictive image data input from the intra prediction unit 559 or the inter prediction unit 560. The generated decoded image data is output to the loop filter 556 and the memory 558.
The loop filter 556 eliminates the coding distortion by filtering the decoded image data input from the adder 555, and outputs the filtered decoded image data to the re-ordering buffer 557 and the memory 558.
The re-ordering buffer 557 re-orders the images input from the loop filter 556 to generate a sequence of time-sequential image data. The image data generated by the re-ordering buffer 557 is output as a video with the original frame rate or a hyper-lapse video.
The memory 558 stores the unfiltered, decoded image data input from the adder 555 and the filtered decoded image data input from the loop filter 556. The memory 558 may be constituted by, for example, a recording medium such as a semiconductor memory.
(Processing Flow of Video Encoding)
Fig. 7 shows one example of the video encoding processing in the video encoding device 10.
The video encoding device 10 determines frame rates (for example, f1 Hz, f2 Hz and f3 Hz) of frames to be displayed as a hyper-lapse video (step 21). The frame rates for displaying frames as a hyper-lapse video may be preset in the video encoding device 10 or may be input or selected by a user.
The video encoding device 10 (for example, scene analysis unit 11) performs scene analysis on the input video (step 23). In this step, image data representing a plurality of scenes is analyzed. The video encoding device 10 finds, for each of the frame rates f1 Hz, f2 Hz and f3 Hz of a hyper-lapse video, an optimal frame for minimizing the motion between frames from among the plurality of frames corresponding to the plurality of scenes. For example, the video encoding device 10 may perform the above-described (1) frame matching step, (2) frame selection step, and (3) path smoothing and rendering step for each of the frame rates determined at step 21 to find the optimal frame for each of the frame rates.
The video encoding device 10 (for example, GOP structure setting unit 12) determines, according to the result of the scene analysis performed at step 23 (that is, the result of the analysis of the image data), which frame (picture) is to be used or dropped in the video decoding device 50 for the plurality of scenes represented by the image data in order to generate a hyper-lapse video, and determines a GOP structure of the plurality of frames corresponding to the plurality of scenes at the time of encoding the image data into the plurality of frames (step 25).
The video encoding device 10 (for example, GOP structure setting unit 12) generates metadata for generating a hyper-lapse video (step 27). The metadata indicates one or more supported frame rates for a hyper-lapse video and indicates which frame (picture) among the plurality of frames corresponding to the plurality of scenes is to be used or dropped. Each of the one or more frame rates is associated with a frame to be used or a frame to be dropped among the plurality of frames.
The video encoding device 10 (for example, video encoder unit 13) encodes the input image data according to the GOP structure determined at step 25 to generate a plurality of frames corresponding to the plurality of scenes, encodes the metadata into a syntax element such as SEI, SPS, PPS or VUI, and outputs a bitstream including the generated plurality of frames and the syntax element (step 29).
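Pulling the steps of Fig. 7 together, a compact end-to-end Python sketch might look as follows; the uniform-subsampling "analysis" and the metadata layout are stubs standing in for steps 23 and 27.

    import struct

    def analyze_scenes(n):                    # step 23 (stub analysis)
        return (set(range(0, n, 2)),          # optimal frames for f1
                set(range(0, n, 4)),          # optimal frames for f2
                set(range(0, n, 8)))          # optimal frames for f3

    def encode_hyperlapse(n, rates_hz=(30, 15, 5)):    # assumed f1, f2, f3
        s1, s2, s3 = analyze_scenes(n)
        types = ["I" if i in s3 else                   # step 25: dynamic GOP
                 "P" if i in s2 else
                 "b" if i in s1 else "B"
                 for i in range(n)]
        sei = struct.pack(">B" + "H" * len(rates_hz),  # step 27: metadata
                          len(rates_hz), *rates_hz)
        return types, sei                              # step 29: bitstream out

    print(encode_hyperlapse(8)[0])  # ['I', 'B', 'b', 'B', 'P', 'B', 'b', 'B']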
(Processing Flow of Video Decoding)
Fig. 8 shows one example of the video decoding processing in the video decoding device 50.
The video decoding device 50 (for example, bitstream buffer 51) receives metadata (step 61). The metadata may be received in an input bitstream input from the video encoding device 10 over a transmission path. Receiving the metadata may include decoding metadata encoded into an SEI syntax element or another syntax element. The metadata indicates which frame (picture) is to be used or dropped in order to generate a hyper-lapse video.
The video decoding device 50 (for example, bitstream buffer 51) reads the GOP structure for the input bitstream (step  63) . The GOP structure may be read from a syntax element such as SPS, PPS or VUI.
The video decoding device 50 (for example, display frame rate setting unit 52) determines a frame rate for displaying a video (step 65). For example, the determining of the frame rate may include presenting selectable frame rates for a hyper-lapse video to a user based on the metadata, and receiving, from the user, a selection of whether a video with the original frame rate or a hyper-lapse video is to be displayed. Receiving a selection to display a hyper-lapse video includes receiving, from the user, a selection of a frame rate from the selectable frame rates for a hyper-lapse video. In response to the reception of the selection from the user, the display frame rate setting unit 52 determines a frame rate for displaying a video.
The video decoding device 50 determines whether or not to display a hyper-lapse video based on the determined frame rate (step 67) . When it is determined to display a hyper-lapse video, the processing proceeds to step 69. When it is not determined to display a hyper-lapse video, that is, when it is determined that a video with the original frame rate is to be displayed, the processing proceeds to step 71.
The video decoding device 50 (for example, frame dropping unit 53) drops unnecessary frames from the bitstream stored in the bitstream buffer 51 according to the frame rate determined at step 65 (step 69). The frames to be dropped here are the frames associated with the frame rate selected by the user from among the one or more frame rates indicated by the metadata. The remaining bitstream is supplied to the video decoder unit 55. Under the above assumption, the reference B picture is dropped when the frame rate determined at step 65 is f1 Hz; the reference B picture and the non-reference B picture are dropped when the determined frame rate is f2 Hz; and the P picture, the reference B picture and the non-reference B picture are dropped when the determined frame rate is f3 Hz.
The video decoding device 50 (for example, video decoder unit 55) sequentially decodes, frame by frame, the remaining bitstream that was not dropped at step 69 (step 71). The video decoding device 50 thus decodes the original video or a hyper-lapse video according to the frame rate determined at step 65.
(Variations)
The foregoing embodiment has been described using an example in which the GOP structure of a plurality of frames included in a bitstream output from the video encoding device is determined dynamically. However, the present invention can also be applied to videos encoded using the fixed GOP structure shown in Fig. 1.
One implementation form of the present invention is a cloud server that includes the scene analysis unit 11, the GOP structure setting unit 12, and the video encoder unit 13 of the video encoding device 10, together with a video decoder unit that decodes a video encoded using a fixed GOP structure and supplies the decoded video to the scene analysis unit 11. Image data encoded using the fixed GOP structure (that is, a plurality of frames having the fixed GOP structure) is transmitted over a network to the cloud server and transcoded therein.
Fig. 9 is a diagram showing an example of the functional blocks of a cloud server according to one embodiment of the present invention. The cloud server 110 includes a video decoder unit 111, a scene analysis unit 11, a GOP structure setting unit 12, and a video encoder unit 13. A bitstream including image data encoded using the fixed GOP structure is input to the cloud server 110 over a network from a camera or a smartphone. A bitstream including the transcoded frames is output and transmitted over a network to a smartphone, personal computer, or television.
Fig. 10 is a flowchart illustrating an example of video transcoding processing according to one embodiment of the present invention.
After the video decoder unit 111 included in the cloud server decodes the image data encoded using the fixed GOP structure (step 20), the cloud server 110 determines frame rates (for example, f1 Hz, f2 Hz and f3 Hz) of frames to be displayed as a hyper-lapse video (step 21). The frame rates for displaying frames as a hyper-lapse video may be preset in the cloud server 110 or may be input or selected by a user. The scene analysis unit 11 included in the cloud server then analyzes the restored image data (step 23). Moreover, the GOP structure setting unit 12 included in the cloud server dynamically determines a GOP structure of a plurality of frames to generate a hyper-lapse video according to the result of the analysis of the image data (step 25). Furthermore, the video encoder unit 13 included in the cloud server encodes the restored image data according to the determined GOP structure and generates a transcoded bitstream (step 27). That is, a plurality of frames having a fixed GOP structure is transcoded into a plurality of frames having the dynamically determined GOP structure, and a bitstream including the plurality of transcoded frames and the metadata is transmitted to the video decoding device 50 described above over the network (step 29).
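To illustrate the off-line nature of this flow, the following Python sketch queues transcoding jobs and processes them without real-time constraints; decode_fixed_gop and encode_dynamic_gop are illustrative stand-ins for the video decoder unit 111 and the video encoder unit 13, not their actual implementations.

    import queue

    def decode_fixed_gop(bitstream):
        return list(bitstream)                    # stub decoding (step 20)

    def encode_dynamic_gop(frames):
        # stub for steps 21 through 27: a periodic stand-in for the
        # dynamically determined GOP structure
        return ["I" if i % 8 == 0 else
                "P" if i % 4 == 0 else
                "b" if i % 2 == 0 else "B"
                for i in range(len(frames))]

    jobs = queue.Queue()
    jobs.put(b"fixed-gop-bitstream-bytes")
    while not jobs.empty():                       # off-line batch processing
        frames = decode_fixed_gop(jobs.get())     # step 20
        print(encode_dynamic_gop(frames)[:8])     # steps 25/27 -> step 29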
The foregoing descriptions are merely specific implementation manners of the present invention, but are not intended to limit the protection scope of the present invention. Any variation or replacement readily figured out by a person skilled in the art within the technical scope disclosed shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (27)

  1. A video encoding method executed by a video encoding device, the method comprising:
    analyzing input image data to determine whether any of a plurality of scenes represented by the image data is to be dropped at a time of generating a hyper-lapse video in a video decoding device; and
    encoding the image data based on a result of the analysis to generate a plurality of frames corresponding to the plurality of scenes, and outputting a bitstream including the generated plurality of frames.
  2. The method according to claim 1, wherein
    the determination includes dynamically determining a GOP (Group of Picture) structure of the plurality of frames in the bitstream to be output based on the result of the analysis.
  3. The method according to claim 2, wherein the determining a GOP structure of the plurality of frames in the bitstream to be output includes:
    assigning a frame to be dropped at a time of generating a hyper-lapse video in the video decoding device to higher temporal hierarchy within the GOP structure.
  4. The method according to claim 3, wherein the bitstream to be output supports a plurality of frame rates for a user to view the bitstream as a hyper-lapse video in the video decoding device.
  5. The method according to claim 2, further comprising:
    including metadata for decoding the bitstream as a hyper-lapse video in a sequence parameter set (SPS) of the bitstream to be output so as to be supplied to the video decoding device.
  6. The method according to claim 2, further comprising:
    including metadata for decoding the bitstream as a hyper-lapse video in a picture parameter set (PPS) of the bitstream to be output so as to be supplied to the video decoding device.
  7. The method according to claim 2, further comprising:
    including metadata for decoding the bitstream as a hyper-lapse video in supplemental enhancement information (SEI) of the bitstream to be output so as to be supplied to the video decoding device.
  8. The method according to claim 2, further comprising:
    including metadata for decoding the bitstream as a hyper-lapse video in video usability information (VUI) of the bitstream to be output so as to be supplied to the video decoding device.
  9. A video transcoding method executed by a server, the method comprising:
    decoding a plurality of encoded first frames in an input bitstream, the plurality of encoded first frames having a first GOP (Group of Picture) structure;
    by analyzing image data representing a plurality of scenes corresponding to the plurality of decoded first frames,  determining whether any of the plurality of scenes is to be dropped at a time of generating a hyper-lapse video in a video decoding device;
    determining a second GOP structure which a plurality of second frames will have, based on a result of the analysis, the plurality of second frames corresponding to the plurality of scenes;
    transcoding the plurality of decoded first frames having the first GOP structure into the plurality of second frames; and
    transmitting the plurality of second frames in a bitstream to the video decoding device.
  10. A video decoding method executed by a video decoding device, the method comprising:
    receiving a bitstream, the bitstream including a plurality of encoded frames and metadata for decoding the plurality of encoded frames as a hyper-lapse video, the metadata indicating one or more frame rates for the hyper-lapse video supported by the bitstream, each of the one or more frame rates being associated with a frame to be used or a frame to be dropped among the plurality of frames;
    determining a frame rate for displaying a video in the one or more frame rates indicated by the metadata;
    determining whether or not to display the hyper-lapse video based on the determined frame rate;
    dropping some of the plurality of encoded frames based on the determined frame rate, on condition that it is determined that the hyper-lapse video is displayed; and
    decoding remaining frames of the plurality of encoded frames to display the hyper-lapse video.
  11. The method according to claim 10, wherein the plurality of encoded frames are encoded in a GOP structure determined in a video encoding device.
  12. The method according to claim 11, wherein the dropped frame is encoded in higher temporal hierarchy within the GOP structure.
  13. The method according to claim 11, wherein
    the dropping some of the plurality of encoded frames includes dropping a frame encoded in higher temporal hierarchy within the GOP structure according to the determined frame rate.
  14. The method according to claim 10, wherein the metadata is included in a sequence parameter set (SPS) of the bitstream.
  15. The method according to claim 10, wherein the metadata is included in a picture parameter set (PPS) of the bitstream.
  16. The method according to claim 10, wherein the metadata is included in supplemental enhancement information (SEI) of the bitstream.
  17. The method according to claim 10, wherein the metadata is included in video usability information (VUI) of the bitstream.
  18. The method according to claim 10, wherein the bitstream is input from a video encoding device.
  19. The method according to claim 10, wherein the bitstream is input from a server.
  20. A video encoding device comprising:
    a scene analysis unit configured to analyze input image data to determine whether any of a plurality of scenes represented by the image data is to be dropped at a time of generating a hyper-lapse video in a video decoding device;
    a GOP (Group of Picture) structure setting unit configured to determine a GOP structure of the plurality of frames in a bitstream to be output, based on a result of the analysis; and
    a video encoder unit configured to, based on the result of the analysis, encode the image data to generate a plurality of frames corresponding to the plurality of scenes, and output a bitstream including the generated plurality of frames, the generated plurality of frames having the determined GOP structure.
  21. A server comprising:
    a video decoder unit configured to decode a plurality of encoded first frames in an input bitstream, the plurality of first frames having a first GOP (Group of Picture) structure;
    a scene analysis unit configured to analyze image data representing the plurality of scenes corresponding to the plurality of decoded first frames to determine whether any of the plurality of scenes is to be dropped at a time of generating a hyper-lapse video in a video decoding device;
    a GOP structure setting unit configured to determine a second GOP structure which a plurality of second frames will have, based on a result of the analysis, the plurality of second  frames corresponding to the plurality of scenes; and
    a video encoder unit configured to transcode the plurality of decoded first frames having the first GOP structure into the plurality of second frames, and transmit the plurality of second frames in a bitstream to the video decoding device.
  22. A video decoding device comprising:
    a bitstream buffer configured to receive a bitstream, the bitstream including a plurality of encoded frames, metadata for decoding the plurality of encoded frames as a hyper-lapse video in the bitstream, and a GOP (Group of Picture) structure which the plurality of encoded frames have, the metadata indicating one or more frame rates for the hyper-lapse video supported by the bitstream, each of the one or more frame rates being associated with a frame to be used or a frame to be dropped among the plurality of frames, the bitstream buffer being also configured to read the GOP structure;
    a display frame rate setting unit configured to determine a frame rate for displaying a video in the one or more frame rates indicated by the metadata;
    a frame dropping unit configured to drop some of the plurality of encoded frames based on the determined frame rate; and
    a video decoder unit configured to decode remaining frames in the plurality of encoded frames to output the hyper-lapse video.
  23. A video encoding device, comprising:
    a memory storage comprising instructions; and
    one or more processors in communication with the memory, wherein the one or more processors execute the instructions to perform the method according to any one of claims 1 to 8.
  24. A server, comprising:
    a memory storage comprising instructions; and
    one or more processors in communication with the memory, wherein the one or more processors execute the instructions to perform the method according to claim 9.
  25. A video decoding device, comprising:
    a memory storage comprising instructions; and
    one or more processors in communication with the memory, wherein the one or more processors execute the instructions to perform the method according to any one of claims 10 to 19.
  26. A computer readable medium storing instructions which, when executed on a processor, cause the processor to perform the method according to any one of claims 1 to 19.
  27. A computer program product comprising program code for performing the method according to any of claims 1 to 19 when executed on a computer or a processor.