WO2004043065A1

WO2004043065A1 - Data processing device

Info

Publication number: WO2004043065A1
Application number: PCT/JP2003/014041
Authority: WO
Inventors: Yasuyuki Kurosawa; Yoshinori Matsui; Yoji Notoya; Tadamasa Toma
Original assignee: Matsushita Electric Industrial Co., Ltd.
Priority date: 2002-11-05
Filing date: 2003-10-31
Publication date: 2004-05-21
Also published as: AU2003280705A1

Abstract

A data recording device includes reception sections (100, 102) for receiving at least one of a video signal and an audio signal, a stream generation section (101) for encoding the signal in a predetermined encoding format and generating an encoded stream including a plurality of reproduction data, which is picture data for the video signal and frame data for the audio signal, an extension information generation section (104) for generating extension information for specifying each reproduction data, an auxiliary information generation section (103) for generating access data for accessing a group unit consisting of one or more reproduction data and generating auxiliary information including the access data, a multiplexing section for multiplexing the encoded stream and the extension information to generate a data stream, and a recording section (120) for recording the data stream and the auxiliary information on a recording medium (131).

Description

Data processing device

Technical field

The present invention relates to an apparatus and a method for recording a moving image stream of video and audio on a recording medium such as an optical disk, and for reproducing the moving image stream recorded on the recording medium.

Background art

The MP4 file format specified in the MPEG-4 Systems standard (ISO/IEC 14496-1) is widely known as a file format that can handle stream data and is highly compatible with PCs. In the MPEG-4 Systems standard, a system stream including MPEG-2 video or MPEG-4 video, together with its auxiliary information, is specified as an MP4 file. The MP4 file format is defined based on the QuickTime (TM) file format of Apple (registered trademark), and is a promising format in that it has come to be supported by various PC applications in recent years. The QuickTime file format on which it is based is now widely used as a file format for video and audio in PC applications.

FIG. 1 shows the structure of the MP4 file 1. The MP4 file 1 includes auxiliary information 2 and a moving image stream 3. The moving image stream 3 is video data and audio data encoded as, for example, MPEG-2 video or MPEG-4 video, arranged in units of one or more frames. The auxiliary information 2 specifies, for the moving image stream 3, the data size of each video and audio frame (hereinafter referred to as "frame size"), the storage address of the data, the playback time of each frame, and the like. The data reproducing apparatus can identify the storage position of the moving image stream 3 based on the auxiliary information 2, and read out and reproduce the moving image stream 3.

FIG. 2 shows another configuration of the MP4 file. Here, the auxiliary information 2 of the MP4 file and the moving image stream 3 are configured as separate files. In such an MP4 file, the auxiliary information 2 includes link information L for controlling reading of the moving image stream 3. The QuickTime file format can take the same file structures as those of the MP4 format shown in these figures. In the following, unless otherwise specified, the description of the MP4 file applies equally to the QuickTime file and is not limited to the MP4 file.

Hereinafter, a more specific configuration of the MP4 file will be described using the MP4 file 1 shown in FIG. 1 as an example. FIG. 3 shows the specific data structure of the MP4 file 1. First, the moving image stream part will be described. In the MP4 file 1, the data in the moving image stream is managed in units of samples and chunks. A "sample" is the minimum unit of stream management in an MP4 file, and corresponds to, for example, the encoded frame data of a video frame or the encoded frame data of an audio frame. The figure shows a video sample (Video Sample) 4 representing frame data of a video frame and an audio sample (Audio Sample) 5 representing frame data of an audio frame. A "chunk", on the other hand, is a set of one or more samples. Even when a chunk holds only one sample, the data is managed as one chunk containing only that sample.

In the auxiliary information, information on video samples and information on audio samples is managed on a track-by-track basis. FIG. 3 shows an audio track 6 and a video track 7. Tracks 6 and 7 describe information such as the size of each sample and its display time, the start position of each chunk, and the number of samples included in that chunk. The data playback device can read each track of the auxiliary information to access all the samples, and can control reading and the like for each sample and chunk. Note that the storage location information for identifying each sample and each chunk, which is described in the auxiliary information of the MP4 file, is also referred to as "access data".

However, because the access data is described in such detail, the data size of the auxiliary information is very large, reaching about 1 MByte per hour of the moving image stream. By comparison, according to, for example, the DVD standard for rewritable/re-recordable discs, Part 3: Video Recording Standard, Version 1.1, published by the DVD Forum (VR4, pp. 31-35), the data size required for the access data of the DVD Video Recording standard is about 70 kilobytes per hour. That is, the access data of the DVD Video Recording standard is less than one tenth of the size of the access data included in the auxiliary information of the MP4 file.
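As a rough illustration of how a player can turn this kind of track information into byte positions, the following Python sketch derives the offset of every sample from the chunk start positions, the number of samples per chunk, and the sample sizes. The function name `sample_offsets` and the flat list representation are assumptions made for this example; they are not the actual box layout of the MP4 format.

```python
from typing import List, Tuple


def sample_offsets(chunk_offsets: List[int],
                   samples_per_chunk: List[int],
                   sample_sizes: List[int]) -> List[Tuple[int, int]]:
    """Return (byte offset, size) for every sample, in stream order.

    Hypothetical simplified tables: one start offset and one sample
    count per chunk, and one size per sample.
    """
    if len(chunk_offsets) != len(samples_per_chunk):
        raise ValueError("one sample count is needed per chunk")
    if sum(samples_per_chunk) != len(sample_sizes):
        raise ValueError("sample counts do not match the size table")

    result = []
    index = 0  # index into sample_sizes
    for start, count in zip(chunk_offsets, samples_per_chunk):
        position = start  # samples are stored back to back inside a chunk
        for _ in range(count):
            size = sample_sizes[index]
            result.append((position, size))
            position += size
            index += 1
    return result


# Two chunks: the first holds 2 samples, the second holds 1.
table = sample_offsets([100, 1000], [2, 1], [40, 60, 25])
print(table)  # [(100, 40), (140, 60), (1000, 25)]
```

Because one entry is kept per sample, the tables grow with the number of samples, which is why fine-grained access data becomes large.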

FIG. 4 shows a configuration of functional blocks of a conventional reproducing apparatus 400. The playback device 400 reproduces a moving image stream (a video stream, an audio stream, or a stream in which a video stream and an audio stream are multiplexed) recorded on a DVD-RAM disk 131. This moving image stream is, for example, the moving image data constituting the MP4 file described with reference to FIGS. 1 to 3. The process by which the playback device 400 plays back a moving image stream will now be described specifically. First, the stream data of the moving image stream is read from the DVD-RAM disk 131 as a playback signal via a pickup 407 and a playback unit 404. This series of read operations is performed based on the designation of a read position by a control unit (not shown), the generation of the playback signal, and the like. The playback signal is then decoded into a video signal and an audio signal by the moving image stream decoding unit 403, and output to the video signal output unit 401 and the audio signal output unit 402.

In some cases, playlist information is recorded on a DVD-RAM disk. The playback device 400 has a playlist playback function for playing back moving image streams in a predetermined order according to the playlist information. Here, "playlist information" is information that defines a reproduction order for part or all of one or more moving image streams. The playlist information is generated in the recording device when a user designates an arbitrary position, section, and the like.

In the MP4 file format described above, access data for identifying the moving image stream designated by the playlist information can be stored in the auxiliary information. In that case, the playback device 400 reads the auxiliary information of the MP4 file in advance and stores it in the auxiliary information holding memory 406. This allows the playback device 400 to play back the moving image stream continuously according to the playlist information. The playlist playback function can be used when playlist information is recorded on the DVD-RAM disk 131, and can be said to be a function that takes advantage of the random accessibility of the DVD-RAM disk.

Furthermore, in order to store the auxiliary information of all the MP4 files recorded on the DVD-RAM disk 131 in the auxiliary information holding memory 406 as quickly as possible, it is desirable that the auxiliary information be arranged collectively on the DVD-RAM disk 131.

FIG. 5 shows the areas 132 and 133 of the DVD-RAM disk 131 on which the MP4 file 1 is recorded. The recording area of the DVD-RAM disk 131 is managed by being divided into a management information area 132 and an AV data area 133. Usually, the auxiliary information 2 of the MP4 file 1 is recorded in the management information area 132, and the moving image stream 3 is recorded in the AV data area 133. By placing the auxiliary information of all the MP4 files collectively in the management information area 132 of the DVD-RAM disk 131, the playback device 400 can read out all the auxiliary information at high speed and store it in the auxiliary information holding memory 406.

However, since the data size of the access data in the auxiliary information is very large, a playback device that supports MP4 files needs an auxiliary information holding memory 406 with a large capacity. In particular, when a large amount of management information is recorded in the management information area 132 of the DVD-RAM disk 131, the auxiliary information holding memory 406 must have a considerably large capacity.

To reduce the size of the auxiliary information of the MP4 file, a technique described in, for example, Japanese Patent Application Laid-Open No. 2001-94933 is known. FIG. 6 shows a data structure of an MP4 file 11 in which a plurality of frames (for example, a GOP (Group Of Pictures) 14) correspond to one sample. The auxiliary information 12 describes access data for identifying each video sample in the same way as in the previous example, but each video sample is associated with the multiple video frames that make up a GOP 14. With this data structure, the total number of video samples decreases, so the data size of the access data can be reduced. It is therefore possible to reduce the auxiliary information per hour of the moving image stream to about one tenth of that in the previous example.

Still other techniques are known for reducing the data size of the auxiliary information of the MP4 file. In one such technique, one chunk corresponds to a plurality of frames, and only the size of the chunk, not the size of each sample, is stored as access data. As a result, the size of the auxiliary information can be reduced.
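The effect of grouping frames into one sample can be estimated with simple arithmetic. The sketch below counts how many per-sample access data entries one hour of video needs. The frame rate of 30 frames per second and the GOP length of 15 frames are illustrative assumptions, not values taken from this description, and the entry count is only a proxy for the byte size of the access data.

```python
def access_entries_per_hour(frame_rate: int, frames_per_sample: int) -> int:
    """Number of per-sample access data entries needed for one hour of video."""
    frames = frame_rate * 3600
    # Round up: a final, partly filled sample still needs its own entry.
    return -(-frames // frames_per_sample)


per_frame = access_entries_per_hour(30, 1)   # one frame per sample
per_gop = access_entries_per_hour(30, 15)    # one 15-frame GOP per sample (assumed)
print(per_frame, per_gop, per_frame // per_gop)  # 108000 7200 15
```

Under these assumptions the number of entries falls by a factor of 15, which is of the same order as the reduction described above.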

In the above-described technique, the video data is arranged as follows to form a video stream. FIGS. 7(a) to 7(d) show the hierarchical data structure of the MPEG-2 video stream. FIG. 7(a) shows the sequence at the top of the hierarchical structure. A sequence contains at least one GOP. FIG. 7(b) shows the data structure of each GOP, which corresponds to a video sample in the example of FIG. 6. A GOP includes one or more frame data and stores, for example, one I-frame together with the P-frames and B-frames that require that I-frame as a reference frame. FIG. 7(c) shows the data structure of each frame (or "picture"). Each frame contains multiple slices. FIG. 7(d) shows the data structure of each slice. Each slice is a set of macroblocks, which are the coding units of MPEG-2 video, and the MPEG-2 video stream is byte-aligned in slice units.

The sequence, GOP, frame, and slice that form the hierarchical structure of the MPEG-2 stream are each preceded by a 32-bit start code. Specifically, a sequence header code (Sequence Header Code) 1301 is provided at the beginning of the sequence, a GOP start code (GOP Start Code) 1302 at the beginning of the GOP, a picture start code (Picture Start Code) 1303 at the beginning of the frame, and a slice start code (Slice Start Code) 1304 at the beginning of the slice.

The start code consists of a 24-bit start code prefix and an 8-bit start code ID. The start code prefix is common to all start codes, whereas the start code ID has a unique value for each type. The start code is specified to be unique within the MPEG-2 video stream. For example, if a 32-bit value in the stream matches the value of the picture start code 1303, that position can immediately be interpreted as the beginning of a frame. In the following, a code that is unique in the stream is referred to as a "unique code".
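A minimal Python sketch of locating such unique codes in a byte stream is shown below. It relies on the MPEG-2 video convention that the 24-bit prefix is 0x000001 and that the ID byte is 0xB3 for a sequence header, 0xB8 for a GOP, 0x00 for a picture, and 0x01 to 0xAF for a slice. The function names and the short test stream are made up for illustration.

```python
from typing import List, Tuple

START_CODE_PREFIX = b"\x00\x00\x01"  # 24-bit prefix shared by all start codes


def classify(code_id: int) -> str:
    """Map the 8-bit start code ID to the layer it introduces."""
    if code_id == 0xB3:
        return "sequence_header"
    if code_id == 0xB8:
        return "gop"
    if code_id == 0x00:
        return "picture"
    if 0x01 <= code_id <= 0xAF:
        return "slice"
    return "other"


def find_start_codes(stream: bytes) -> List[Tuple[int, str]]:
    """Return (byte offset, layer name) for every start code in the stream."""
    found = []
    pos = stream.find(START_CODE_PREFIX)
    while pos != -1 and pos + 3 < len(stream):
        found.append((pos, classify(stream[pos + 3])))
        # Continue after the whole 32-bit code (prefix plus ID byte).
        pos = stream.find(START_CODE_PREFIX, pos + 4)
    return found


# Hypothetical stream: sequence header, GOP, picture, one slice.
demo = (b"\x00\x00\x01\xb3" + b"\xaa\xaa" +
        b"\x00\x00\x01\xb8" +
        b"\x00\x00\x01\x00" + b"\x55" +
        b"\x00\x00\x01\x01" + b"\x77")
print(find_start_codes(demo))
# [(0, 'sequence_header'), (6, 'gop'), (10, 'picture'), (15, 'slice')]
```

A decoder that loses synchronization can use the same search to find the next start code and resume decoding from there.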

FIGS. 8(a) to 8(d) show how an error propagates when it occurs in the MPEG-2 video stream. Since the MPEG-2 video stream is composed of variable-length codes, if an error occurs, for example, in the slice data that follows the slice start code 1304-1, the stream after that point cannot be decoded. However, by detecting the next slice start code 1304-2 and resynchronizing the stream, the stream from that point on can be decoded. In this way, the unique code is used in the MPEG-2 video stream to limit the range of error propagation when the stream contains an error during decoding.

While the unique code is useful for error recovery, it also adds redundancy from the standpoint of coding efficiency. In the future, streams without a unique code are expected to appear in order to increase coding efficiency. If such a stream is recorded by the conventional method, the following problems will occur.

FIGS. 9 (a) to 9 (d) show how error propagation occurs when a decoding error occurs in a video stream having no unique code. In this video stream, as shown in FIG. 6, access data is provided so that a plurality of frames (for example, GOP) correspond to one sample, and the data size of the attached information is reduced.

Since there is no unique code in any slice, frame, or GOP, if a decoding error occurs in slice 90, the error propagates to slice 91 and the slices that follow, and decoding cannot be performed up to the last slice of the frame. Further, the error propagates to the frame 93 that follows frame 92, and decoding cannot be performed up to the last frame of the GOP 94 that includes frame 92. As this explanation shows, once an error occurs in a stream that has no unique code, it is not possible to return to a decodable state using only the information in the stream. The error propagates up to the beginning of the next GOP 95, whose position can be identified using the access data.

Next, another problem caused by forming one sample from a plurality of frames will be described. The auxiliary information describes not only the access data of each sample but also information such as the decoding time and display time of each sample. However, if one sample is composed of a plurality of frames, the auxiliary information does not describe the decoding time, the display time, the frame-by-frame data, and the like of each frame, which are required for reproducing the moving image stream. The playback device therefore has to obtain such information by calculation. For example, the reproducing apparatus 400 obtains the difference between the decoding time of a certain sample and the decoding time of the next sample, and divides that difference by the number of frames in the sample. The reproducing apparatus 400 adopts the quotient as the difference in decoding time per frame.

However, depending on the moving image stream, the correct decoding time, display time, frame-by-frame data, and the like of each frame may not be obtained by the calculation procedure preset in the playback device 400.

Here, a problem in the case where the decoding time and the display start time are equal in all frames will be described with reference to FIG. 10. FIG. 10(a) schematically shows a sample (GOP) in which a frame skip has occurred. The frame skip occurs after frame 2. The display time length of each frame is 1 second. Originally, this sample should consist of five frames from frame 1 to frame 5 (display time length: 5 seconds), but because of the skip the sample actually contains only four frames, frame 1 to frame 4. On the other hand, the display time difference value of one sample obtained from the auxiliary information is 5 seconds. As a result, the reproducing apparatus 400 would ordinarily calculate the display time length of each frame as 5 seconds / 4 frames = 1.25 seconds. FIG. 10(b) shows the display time length of each frame when the sample display time length is allocated evenly.

However, the correct display time in this case is 2 seconds only for frame 2 to account for the display time of the skipped frame, and 1 second for other frames. Figure 10 (c) shows the correct display time length of each frame with respect to the sample display time length.
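The discrepancy can be reproduced with a few lines of Python. This sketch uses the numbers from the example (a 5-second sample holding 4 frames, with one frame skipped after frame 2). The helper names `uniform_durations` and `recorded_durations` are hypothetical.

```python
from fractions import Fraction
from typing import Dict, List


def uniform_durations(sample_duration: int, frame_count: int) -> List[Fraction]:
    """What a player computes when it divides the sample duration evenly."""
    share = Fraction(sample_duration, frame_count)
    return [share] * frame_count


def recorded_durations(frame_duration: int, frame_count: int,
                       skipped_after: Dict[int, int]) -> List[int]:
    """Correct durations: a frame stays on screen for each skipped successor.

    skipped_after maps a 1-based frame number to the number of skipped
    frames that immediately follow it.
    """
    return [frame_duration * (1 + skipped_after.get(n, 0))
            for n in range(1, frame_count + 1)]


uniform = uniform_durations(5, 4)
correct = recorded_durations(1, 4, {2: 1})
print([float(d) for d in uniform])  # [1.25, 1.25, 1.25, 1.25]
print(correct)                      # [1, 2, 1, 1]
```

Both lists add up to 5 seconds, so the sample-level timing alone cannot tell the two cases apart.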

As described above, when a frame skip or the like occurs in the moving image stream during recording and the difference in display time between one frame and the next is no longer constant, the reproducing apparatus cannot correctly obtain the display time of the frames included in the sample. This problem can be said to result from the processing of the recording device, which records one sample composed of a plurality of frames and consequently does not record the display time length and the like of each frame.

Next, the problem when the frame included in the sample is composed of frames using bidirectional prediction and the decoding time and the display start time of each frame are different will be described. When encoding based on bidirectional prediction is performed, the display start time of each frame cannot be calculated from the sample decoding time or the display start time. As a result, in the method of calculating the display time of the playback device 400 described above, there is a problem that the display time of each frame cannot be obtained.

Furthermore, if the video stream does not include information (a unique code) for identifying the start or end of each frame, there is a problem that data cannot be obtained frame by frame. For example, in MPEG-4 Visual, frame boundaries can be detected by an identifier in the video stream called a start code. However, when MPEG-4 Advanced Video Coding (AVC) data is stored in an MP4 file, the video stream does not include information for identifying frame boundaries. Therefore, if two or more frames encoded by MPEG-4 AVC are stored in one sample, the frame boundaries cannot be detected, and the data of each frame cannot be obtained.

Disclosure of the invention

An object of the present invention is to provide a data structure capable of reducing the data size of access data and of suppressing error propagation even if a decoding error occurs in a moving picture stream. Another object of the present invention is to ensure that, when a plurality of frames of a video stream are handled as one sample, the decoding time, display time, and data of each frame can be reliably obtained.

A data recording apparatus according to the present invention includes: a receiving unit that receives at least one of a video signal and an audio signal; a stream generation unit that encodes the signal in a predetermined encoding format and generates an encoded stream including a plurality of reproduction data, which is picture data for the video signal and frame data for the audio signal; an extended information generation unit that generates extended information for identifying each reproduction data; an additional information generation unit that generates access data for accessing a group unit consisting of one or more reproduction data and generates additional information including the access data; a multiplexing unit that multiplexes the encoded stream and the extended information to generate a data stream; and a recording unit that records the data stream and the additional information on a recording medium.

The additional information generation unit may further generate access data for accessing a group consisting of a plurality of pieces of reproduction data.

When the group unit is taken as a first sample and the extended information is taken as a second sample, the additional information generation unit may generate access data for each first sample and for each second sample.

The multiplexing unit may generate the data stream by multiplexing the coded stream and the extended information for each of the first and second samples.

The additional information generation unit may generate access data for each sample when the group unit and the extended information regarding the one or more reproduction data included in the group unit are taken together as one sample.

The multiplexing unit may generate the data stream by multiplexing the encoded stream and the extension information for each sample.
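To make the arrangement concrete, the following Python sketch multiplexes extended information and frame data into one sample per group and records one access data entry per sample. The byte layout (a big-endian 32-bit frame count followed by one 32-bit size per frame, placed ahead of the frame data) and the names `build_sample` and `multiplex` are assumptions chosen for this example, not the format defined by the invention.

```python
import struct
from typing import List, Tuple


def build_sample(frames: List[bytes]) -> bytes:
    """One sample = extended information (frame sizes) + the frames themselves.

    Hypothetical layout: uint32 frame count, then one uint32 size per frame.
    """
    sizes = [len(frame) for frame in frames]
    extended_info = struct.pack(f">I{len(sizes)}I", len(sizes), *sizes)
    return extended_info + b"".join(frames)


def multiplex(groups: List[List[bytes]]) -> Tuple[bytes, List[Tuple[int, int]]]:
    """Return the data stream and the access data (offset, size) per sample."""
    stream = bytearray()
    access_data = []
    for frames in groups:
        sample = build_sample(frames)
        access_data.append((len(stream), len(sample)))
        stream += sample
    return bytes(stream), access_data


# Two groups of dummy frame data: 2 frames, then 3 frames.
stream, access_data = multiplex([[b"II", b"P"], [b"III", b"BB", b"P"]])
print(access_data)  # [(0, 15), (15, 22)]
```

The access data stays small because it holds one entry per group, while the per-frame detail travels inside the data stream.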

The receiving unit may receive a video signal and an audio signal; the stream generation unit may encode the video signal and the audio signal in respective predetermined encoding formats and generate an encoded stream including picture data of a plurality of videos and frame data of a plurality of audio frames; the extended information generation unit may generate at least extended information for identifying each picture data; and the additional information generation unit may generate access data for accessing the picture data, the frame data of the plurality of audio frames, and each of the extended information in group units consisting of at least two picture data, and generate additional information including the access data.

The extended information generation unit may further generate extended information for specifying each frame data of the plurality of audio frames. The recording unit may record the data stream and the attached information as one data file on the recording medium. The extended information generation unit may generate at least one of information indicating a data size, a display time, and a decoding time of each of the reproduction data as the extended information.

The additional information generation unit may generate the additional information further including a default value of the extension information, and the extension information generation unit may generate the extension information having a value different from the default value.
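One way to read this default-value mechanism is sketched below in Python: the default display duration is stored once, and extended information is emitted only for frames whose value differs from it. The dictionary representation and the function names are hypothetical and serve only to illustrate the idea.

```python
from typing import Dict, List


def encode_durations(durations: List[int], default: int) -> Dict[int, int]:
    """Keep only the entries that differ from the default, keyed by frame index."""
    return {i: d for i, d in enumerate(durations) if d != default}


def decode_durations(overrides: Dict[int, int], default: int,
                     frame_count: int) -> List[int]:
    """Rebuild the full per-frame list from the default and the overrides."""
    return [overrides.get(i, default) for i in range(frame_count)]


overrides = encode_durations([1, 2, 1, 1], default=1)
print(overrides)                          # {1: 2}
print(decode_durations(overrides, 1, 4))  # [1, 2, 1, 1]
```

When most frames share the same value, as in a stream with only occasional frame skips, very little extended information needs to be written.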

The extended information generating unit may generate extended information for specifying reference destination picture data referred to for decoding each picture data of the video signal.

The additional information generation unit may generate the additional information further including link information. The recording unit may record the data stream on the recording medium as a first data file specified by the link information, and record the additional information on the recording medium as a second data file.

The data recording method according to the present invention includes the steps of: receiving at least one of a video signal and an audio signal; encoding the signal in a predetermined encoding format and generating an encoded stream including a plurality of reproduction data, which is picture data for the video signal and frame data for the audio signal; generating extended information for identifying each reproduction data; generating access data for accessing a group unit consisting of one or more reproduction data, and generating additional information including the access data; multiplexing the encoded stream and the extended information to generate a data stream; and recording the data stream and the additional information on a recording medium.

The step of generating the additional information may further generate access data for accessing a group unit consisting of a plurality of reproduction data.

When the group unit is taken as a first sample and the extended information is taken as a second sample, the step of generating the additional information may generate access data for each first sample and for each second sample. The step of generating the data stream may include generating the data stream by multiplexing the encoded stream and the extended information for each of the first samples and for each of the second samples.

The step of generating the additional information may generate access data for each sample when the group unit and the extended information regarding the one or more reproduction data included in the group unit are taken together as one sample.

The step of generating the data stream may include generating the data stream by multiplexing the encoded stream and the extended information for each sample.

The receiving step may receive a video signal and an audio signal; the step of generating the encoded stream may encode the video signal and the audio signal in respective predetermined encoding formats and generate an encoded stream including picture data of a plurality of videos and frame data of a plurality of audio frames; the step of generating the extended information may generate at least extended information for identifying each picture data; and the step of generating the additional information may generate access data for accessing the picture data, the frame data of the plurality of audio frames, and each of the extended information in group units consisting of at least two picture data, and generate additional information including the access data. The step of generating the extended information may further include generating extended information for specifying each frame data of the plurality of audio frames.

The recording step may record the data stream and the additional information on the recording medium as one data file.

The step of generating the extended information may include generating at least one of information indicating a data size, a display time, and a decoding time of each of the reproduction data as the extended information.

The step of generating the additional information may generate the additional information further including a default value of the extended information, and the step of generating the extended information may generate the extended information having a value different from the default value. The step of generating the extended information may include generating extended information for specifying reference destination picture data referred to for decoding each picture data of the video signal.

The step of generating the additional information may generate the additional information further including link information, and the recording step may record the data stream on the recording medium as a first data file specified by the link information, and record the additional information on the recording medium as a second data file.

A data reproducing device according to the present invention reproduces data recorded on a recording medium. The recording medium stores a data stream and additional information. In the data stream, an encoded stream and extended information are multiplexed. The encoded stream includes a plurality of reproduction data in which at least one of a video signal and an audio signal is encoded in a predetermined encoding format, the reproduction data being picture data for the video signal and frame data for the audio signal, and the extended information specifies each reproduction data. The additional information includes access data for accessing a group unit consisting of one or more reproduction data. The data reproducing device includes a reproducing unit that reads the data stream and the additional information from the recording medium and separates the data stream into the encoded stream and the extended information, and a stream decoding unit that decodes the encoded stream. The stream decoding unit analyzes the access data of the additional information to identify the group unit, identifies each reproduction data in the group unit based on the extended information, and decodes each reproduction data.

A data reproducing method according to the present invention reproduces data recorded on a recording medium. The recording medium stores a data stream and additional information. In the data stream, an encoded stream and extended information are multiplexed. The encoded stream includes a plurality of reproduction data in which at least one of a video signal and an audio signal is encoded in a predetermined encoding format, the reproduction data being picture data for the video signal and frame data for the audio signal, and the extended information specifies each reproduction data. The additional information includes access data for accessing a group unit consisting of one or more reproduction data. The data reproducing method includes the steps of: reading the data stream and the additional information from the recording medium and separating the data stream into the encoded stream and the extended information; and decoding the encoded stream. The step of decoding the encoded stream includes a step of analyzing the access data of the additional information to identify the group unit, a step of identifying each reproduction data in the group unit based on the extended information, and a step of decoding the identified reproduction data.
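The reading side can be sketched in Python as follows: the access data locates a group unit (sample) in the data stream, and the extended information inside that sample is then used to separate the individual reproduction data. The byte layout (a big-endian 32-bit frame count followed by one 32-bit size per frame, ahead of the frame data) and the names `split_sample` and `read_frames` are assumptions made for this example only.

```python
import struct
from typing import List, Tuple


def split_sample(sample: bytes) -> List[bytes]:
    """Use the extended information at the head of a sample to cut out each frame."""
    (count,) = struct.unpack_from(">I", sample, 0)
    sizes = struct.unpack_from(f">{count}I", sample, 4)
    position = 4 + 4 * count  # first byte after the extended information
    frames = []
    for size in sizes:
        frames.append(sample[position:position + size])
        position += size
    if position != len(sample):
        raise ValueError("extended information does not match the sample size")
    return frames


def read_frames(stream: bytes, access_data: List[Tuple[int, int]],
                sample_index: int) -> List[bytes]:
    """Locate one sample through the access data, then split it into frames."""
    offset, size = access_data[sample_index]
    return split_sample(stream[offset:offset + size])


# Hand-built stream holding two samples of dummy frame data.
sample0 = struct.pack(">3I", 2, 2, 1) + b"IIP"
sample1 = struct.pack(">4I", 3, 3, 2, 1) + b"IIIBBP"
stream = sample0 + sample1
access_data = [(0, len(sample0)), (len(sample0), len(sample1))]
print(read_frames(stream, access_data, 1))  # [b'III', b'BB', b'P']
```

Because frame boundaries come from the extended information rather than from codes inside the encoded data, this approach also works for streams that carry no unique code.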

The data structure according to the present invention includes a data stream and additional information. In the data stream, an encoded stream and extended information are multiplexed. The encoded stream includes a plurality of reproduction data in which at least one of a video signal and an audio signal is encoded in a predetermined encoding format, the reproduction data being picture data for the video signal and frame data for the audio signal, and the extended information specifies each reproduction data. The additional information includes access data for accessing a group unit consisting of one or more reproduction data.

In the recording medium according to the present invention, a data stream and additional information are recorded. In the data stream, an encoded stream and extended information are multiplexed. The encoded stream includes a plurality of reproduction data in which at least one of a video signal and an audio signal is encoded in a predetermined encoding format, the reproduction data being picture data for the video signal and frame data for the audio signal, and the extended information specifies each reproduction data. The additional information includes access data for accessing a group unit consisting of one or more reproduction data.

BRIEF DESCRIPTION OF THE FIGURES

FIG. 1 is a diagram showing a configuration of the MP4 file 1.

FIG. 2 is a diagram showing another configuration of the MP4 file.

FIG. 3 is a diagram showing the data structure of the MP4 file 1.

FIG. 4 is a block diagram of a conventional reproducing apparatus 400.

FIG. 5 is a diagram showing the areas 132 and 133 of the DVD-RAM disk 131 where the MP4 file 1 is recorded.

FIG. 6 is a diagram showing a data structure of an MP4 file 11 in which GOPs 14 of a plurality of frames correspond to one sample.

FIGS. 7(a) to 7(d) are diagrams showing a hierarchical data structure of the MPEG2 video stream.

FIGS. 8 (a) to 8 (d) are diagrams showing the state of error propagation when an error occurs in the MPEG2 video stream.

FIGS. 9 (a) to 9 (d) are diagrams showing the state of error propagation when a decoding error occurs in a video stream having no unique code.

FIG. 10(a) is a diagram schematically showing a sample (GOP) in which frame skipping has occurred, FIG. 10(b) is a diagram showing the display time length of each frame when the sample display time length is allocated equally, and FIG. 10(c) is a diagram showing the correct display time length of each frame with respect to the sample display time length.

FIG. 11 is a block diagram of the data processing device 10 according to the first embodiment.

FIG. 12 is a diagram showing the configuration of the MP4 file 21 recorded by the data processing device 10.

FIG. 13 is a flowchart showing the procedure of the recording process of the data processing device 10.

FIGS. 14(a) to 14(d) are diagrams showing the state of error propagation when a decoding error occurs in the moving picture stream constituting the MP4 file 21.

FIG. 15 is a diagram showing an example of extended information describing information on the number of slices constituting each video frame and the data size of each slice.

FIGS. 16 (a) to 16 (d) are diagrams showing how error propagation is confined within a slice when a decoding error occurs in a video stream.

FIG. 17 is a diagram showing an example of extended information describing the display duration of each video frame.

FIG. 18 (a) is a diagram schematically showing a group of video frames in which frame skipping has occurred, and FIG. 18 (b) is a diagram schematically showing a group of video frames when the display time is set to be uniform for all frames.

FIG. 19 is a diagram showing a data structure of the box 40.

FIG. 20 (a) is a diagram showing a data structure of the basic unit 50, and FIG. 20 (b) is a diagram showing a data structure of moov52.

FIG. 21 (a) is a diagram showing the data structure of trak 53, and FIG. 21 (b) is a diagram showing the field frame_count set in several entries in the box stsd 56.

FIG. 22 is a diagram showing a data structure of an MP4 file including a basic part 50 and an extended part 60.

FIG. 23 is a diagram showing the data structure of moof 61.

FIG. 24 is a block diagram related to a recording function of the data processing device 170 according to the second embodiment.

FIG. 25 is a flowchart showing a procedure in which the header generation unit 175 determines the difference value of the decoding start time and the number of frames forming the sample.

FIG. 26 is a flowchart showing a procedure of determining a frame forming a sample in the header generation unit 175.

FIG. 27 is a diagram showing the relationship between the determined sample and the frame.

FIG. 28 is a diagram illustrating another example showing the relationship between the determined sample and the frame.

FIG. 29 is a flowchart showing a procedure for multiplexing frames in encoded data into samples.

FIG. 30 is a diagram showing a sample in which a frame in a GOP is stored.

FIG. 31 is a block diagram of a data processing device 200 according to the fourth embodiment.

FIGS. 32 (a) to 32 (c) are diagrams showing examples of the structure of a sample and the difference between the decoding time of the next frame and each frame included in the sample.

FIGS. 33 (a) to 33 (c) show sample structures of the MP4 file.

Fig. 34 (a) is a diagram showing a sample in which the display time and the access unit are stored together, and Fig. 34 (b) is a diagram showing an example of syntax for realizing the sample structure of Fig. 34 (a). FIG. 34 (c) is a diagram showing an example in which a field indicating the size of the access unit is added next to the display time information.

FIG. 35 is a block diagram illustrating a configuration of a data processing device 300 according to the fifth embodiment.

FIG. 36 is a flowchart showing a procedure of a process in which the sample analysis unit 307 acquires picture data from a sample.

FIG. 37 is a diagram showing the data structure of the access unit.

FIG. 38 (a) is a diagram showing the data structure of a sample storing a Multi AU header box and a plurality of access units, and FIG. 38 (b) is a diagram showing the data structure of a sample header.

FIG. 39 is a diagram showing an example in which the sample header is arranged as the last data in the sample.

FIG. 40 (a) is a diagram showing the data structure of mtsz, FIG. 40 (b) is a diagram showing the data structure of mdta, and FIG. 40 (c) is a diagram showing the data structure of mcta.

FIG. 41 (a) is a diagram showing an example of the syntax of the initial value setting part of the sample header when a box is used, and FIG. 41 (b) is a diagram showing an example of the syntax of the initial value setting part of the sample header when no box is used.

FIGS. 42 (a) to (d) show examples of the respective syntaxes of mahd, mtsz, mdta, and mcta.

FIGS. 43 (a) to (c) are diagrams showing a first example for storing data in a sample header.

FIGS. 44 (a) and (b) are diagrams showing a second example for storing data in the sample header.

FIG. 45 (a) is a diagram showing an example in which a sample header is added to each of the access units in one sample, and FIG. 45 (b) is a diagram showing an example in which a sample header is added to every N or fewer access units in one sample.

FIGS. 46 (a) and (b) are diagrams showing a sample structure and a syntax example when the box structure is not used.

FIG. 47 is a flowchart showing the procedure of a process in which the sample analysis unit 307 acquires picture data from a sample.

FIG. 48 is a flowchart showing the operation of acquiring the size of the access unit constituting the sample.

FIG. 49 is a diagram showing a series of pictures and the coding type of each picture.

FIGS. 50 (a) to (c) are diagrams showing a video stream and layers 0 and 1 constituting the video stream.

Fig. 51 (a) to (c) are diagrams showing the syntax of subsequences and layer-related SEIs.

FIGS. 52 (a) to (d) are diagrams showing a video stream and layers 0, 1, and 2 constituting the video stream.

FIG. 53 (a) is a diagram showing table data of sbgp for layers, and FIG. 53 (b) is a diagram showing table data of sbgp for subsequences.

FIGS. 54 (a) to (c) are diagrams showing the structure of AVC-GOP.

FIGS. 55 (a) to 55 (d) are diagrams showing a video stream and layers 0, 1, and 2 constituting the video stream.

FIGS. 56 (a) to 56 (c) are diagrams showing the field values of each SEI of SSL and SS SSI stored in the AVC-GOP shown in FIG. 55 (a).

FIGS. 57 (a) and (b) are diagrams showing a sample structure when the AVC-GOP data of FIG. 55 is stored in MP4 samples.

FIG. 58 (a) is a diagram showing an example of the syntax of the sample, layer, subsequence box stls, and FIG. 58 (b) is a diagram showing the table structure of stls when the AVC-GOP of FIG. 55 is stored in one sample.

FIG. 59 is a flowchart showing the processing procedure of the sample analysis unit 307 and the decoding display unit 308 when only the selected picture is reproduced.

FIG. 60 (a) is a diagram showing an example of the physical format of a flexible disk (FD) as an example of a recording medium, FIG. 60 (b) is a diagram showing the external appearance of the flexible disk viewed from the front, its cross-sectional structure, and the flexible disk itself, and FIG. 60 (c) is a diagram showing a device configuration for writing and reading a program to and from the flexible disk FD.

BEST MODE FOR CARRYING OUT THE INVENTION

Hereinafter, an embodiment of a data processing device according to the present invention will be described with reference to the accompanying drawings.

(Embodiment 1)

FIG. 11 shows a configuration of functional blocks of the data processing device 10 according to the present embodiment. In this specification, the data processing device 10 is described as having both a recording function and a reproducing function for MP4 files. It is assumed that the MP4 file is a file in the format of the MPEG4 system standard (ISO/IEC 14496-1). The data processing device 10 can generate an MP4 file and write it to the recording medium 131, and can play back the MP4 stream written to the recording medium 131. The recording medium 131 is, for example, a DVD-RAM disk (hereinafter referred to as “DVD-RAM disk 131”). The data processing device 10 is realized, for example, as a DVD recorder. The data structure of the MP4 file will be described later. Hereinafter, components and operations related to the recording function of the data processing device 10 will be described, and then components and operations related to the reproduction function will be described.

First, the MP4 file recording function of the data processing device 10 will be described. As the components related to this function, the data processing device 10 includes a video signal receiving unit 100, a video stream generating unit 101, an audio signal receiving unit 102, an auxiliary information generating unit 103, an extension information generating unit 104, a multiplexing unit 105, a recording unit 120, and an optical pickup 130.

The video signal receiving unit 100 is a video signal input terminal, and receives a video signal representing video. The audio signal receiving unit 102 is an audio signal input terminal, and receives an audio signal representing audio. For example, the video signal receiving unit 100 and the audio signal receiving unit 102 are connected to a video output unit and an audio output unit of a tuner unit (not shown) that receives broadcast radio waves, and receive the video signal and the audio signal from them, respectively.

The video stream generation unit 101 receives a video signal and an audio signal, performs encoding based on, for example, the MPEG2 or MPEG4 (MPEG-4 Visual, MPEG-4 AVC (Advanced Video Coding)) standard, and generates a video stream (encoded stream). The additional information generation unit 103 generates additional information conforming to the MP4 file standard. The additional information includes access data for accessing the encoded stream in units of samples. A “sample” is the minimum management unit in the additional information, which records information such as data size, decoding time, and playback time for each sample. One sample is a data unit that can be accessed at random. The details of the additional information will be described later.

The extended information generation unit 104 generates extended information indicating an attribute for specifying each frame data in the sample. The “attribute” here indicates, for example, the data size, decoding time, and display time of each frame data. The sample may be a video sample (Video Sample) representing frame data of a video frame, or an audio sample (Audio Sample) representing frame data of an audio frame.

The multiplexing unit 105 multiplexes the coded stream and the extension information to generate a moving image stream. This stream stores a video stream and / or an audio stream, and the extended information generated by the extended information generating unit 104.

The recording unit 120 controls the pickup 130 and records data at a specific position (address) of the DVD-RAM disk 131. More specifically, the recording unit 120 records the video stream generated by the multiplexing unit 105 in the AV data area 133 as an MP4 file, and records the attached information generated by the auxiliary information generation unit 103 in the management information area 132 as an MP4 file.

The recording unit 120 may record the moving picture stream and the attached information on the DVD-RAM disk 131 as one MP4 file instead of as separate MP4 files. Further, the recording unit 120 may record the attached information in the AV data area 133.

FIG. 12 shows the configuration of the MP4 file 21 recorded by the data processing device 10. The MP4 file 21 includes additional information 22 and a video stream 23.

The auxiliary information 22 is information indicating the size, storage address, playback time, etc. of each sample when a predetermined number of video frame data and / or audio frame data in the video stream 23 is taken as one sample.

The video stream 23 includes a plurality of video samples and a plurality of video size samples.

The video sample is defined by the additional information 22 as a set of a plurality of video frame data. In the present embodiment, for example, one video sample is matched with one group of pictures (Group Of Pictures; GOP) 25, but regardless of the presence or absence of a GOP structure, a predetermined set of frame data may be treated as one video sample.

The video size sample contains extension information for the corresponding video sample. In FIG. 12, video size sample #0 describes the frame sizes of the video frames #0 to #M of video sample #0.

In FIG. 12, the video sample corresponding to a video size sample is recorded after that video size sample, but this is only an example, and other arrangements are possible. Here, one sample is one chunk (FIG. 3), so chunks are not mentioned in particular. However, as in a conventional MP4 file, multiple samples may be treated as one chunk. For example, the samples in a chunk can be stored consecutively in ascending order of decoding time. Even in that case, recording and reproduction of the MP4 file having the data structure according to the present embodiment are not restricted.

One of the main features of the present embodiment is that a plurality of video frame data are managed as one video sample, and attributes (frame size, display duration, etc.) for specifying each frame in the sample are provided in the video stream as a separate sample. In the additional information 22, the access data of the samples describing the extended information and the access data of the video samples are managed individually, so that random access to each is possible. This is explained in detail below.

In FIG. 12, the first video sample (video sample #0) of the video stream stores (M + 1) video frame data. The first video size sample (video size sample #0) of the video stream stores the same number, (M + 1), of frame size entries as the first video sample. Similarly, the (N + 1)th video sample (video sample #N) counted from the beginning of the video stream stores (L + 1) video frame data, and the (N + 1)th video size sample (video size sample #N) counted from the beginning of the video stream stores (L + 1) frame size entries.

As described above, by arranging the same number of video samples and video size samples in the video stream and storing the same number of frame data and frame size information for each video frame, the frame size information and the corresponding Video frames can be easily associated. In addition, access data for each video sample is stored in the video track of the auxiliary information, and access data for each video size sample is stored in the video size track. Therefore, the frame size information of a specific video frame can be obtained from the moving picture stream at the time of reproduction using the additional information and the extended information included in the video size sample.
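The pairing of video samples and video size samples described above can be illustrated with a minimal Python sketch. It is not the encoding defined in this specification: the function name `build_stream`, the 32-bit big-endian size entries, and the representation of access data as (offset, size) pairs are all assumptions made for illustration only.

```python
import struct


def build_stream(gops):
    """Interleave video size samples and video samples (illustrative only).

    gops: list of GOPs, each GOP being a list of encoded frame byte strings.
    Returns the stream bytes plus two hypothetical access tables, one for
    the video track and one for the video size track.
    """
    stream = bytearray()
    video_track = []       # (offset, size) of each video sample
    video_size_track = []  # (offset, size) of each video size sample
    for frames in gops:
        # Video size sample: one size entry per frame in the video sample.
        # The 32-bit big-endian layout is an assumption, not from the text.
        size_sample = b"".join(struct.pack(">I", len(f)) for f in frames)
        video_size_track.append((len(stream), len(size_sample)))
        stream += size_sample
        # Video sample: the frame data of one GOP, stored back to back.
        video_sample = b"".join(frames)
        video_track.append((len(stream), len(video_sample)))
        stream += video_sample
    return bytes(stream), video_track, video_size_track


if __name__ == "__main__":
    data, vtrack, vsize_track = build_stream([[b"aaa", b"bb"], [b"c"]])
    print(vtrack, vsize_track, len(data))
```

Because each video size sample holds exactly one entry per frame of its video sample, the i-th size entry always describes the i-th frame, which is the association the paragraph above relies on.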

For simplicity, the explanation is limited to video, but audio frame data can also be stored in a sample. In that case, similarly, a set of a predetermined number of audio frame data may be handled as one audio sample. The additional information 22 then contains an audio track that describes access data for each audio sample and an audio size track that describes access data for each audio size sample. The audio size sample describes the frame size of each audio frame.

Next, the recording operation of the data processing device 10 will be described with reference to FIG. 13. By the following recording operation, the MP4 file 21 having the above data structure is recorded on the DVD-RAM disk 131.

FIG. 13 shows the procedure of the recording process of the data processing device 10. First, in step S11, when a video signal is received, the moving picture stream generation unit 101 encodes each frame of the video according to a predetermined encoding procedure. In step S12, the extended information generation unit 104 generates extended information indicating the frame size of each encoded frame. Since encoding is performed by the moving picture stream generation unit 101, the extended information generation unit 104 uses the encoding result of the moving picture stream generation unit 101 to generate information such as the size of each frame as extended information. In step S13, the auxiliary information generation unit 103 determines that the frame data constituting one GOP corresponds to one video sample, and based on that determination the multiplexing unit 105 obtains the extended information (video size samples) in units of video samples. Then, in step S14, the multiplexing unit 105 multiplexes each video sample and the corresponding extended information to generate moving image stream data. Next, the additional information generation unit 103 generates additional information including access data for each video size sample and access data for each video sample. Then, the recording unit 120 records the moving image stream data and the additional information on the DVD-RAM disk 131 as the MP4 file.

Next, the reproducing function of the data processing device 10 will be described with reference to FIG. 11 again. It is assumed that the MP4 file 21 described above is recorded on the DVD-RAM disk 131. The data processing device 10 can reproduce and decode the moving picture stream recorded on the DVD-RAM disk 131 according to a user's instruction.

As the components related to the playback function, the data processing device 10 includes a video signal output unit 110, a video stream decoding unit 111, an audio signal output unit 112, a playback unit 113, an auxiliary information holding memory 118, and a pickup 130.

First, the playback unit 113 controls the operation of the pickup 130, reads the attached information 22 from the management information area 132 of the DVD-RAM disk 131, and outputs the acquired attached information 22 to the auxiliary information holding memory 118, which holds it. The playback unit 113 also reads the video stream 23, which includes video samples and extended information (video size samples), from the AV data area 133 of the DVD-RAM disk 131.

It should be noted that the data processing device 10 can also acquire a moving image stream via a network. In this case, the signal line connecting the pickup 130 and the playback unit 113 in FIG. 11 corresponds to a network line. The playback unit 113 can acquire the moving image stream 23 recorded on the recording medium 131 of a remote server via the network line serving as a transmission medium, and the data processing device 10 can reproduce it.

Upon receiving the video stream 23, the video stream decoding unit 111 refers to the video size track of the attached information 22 stored in the auxiliary information holding memory 118 and obtains information such as the access data, data size, decoding time, and playback time of the samples. Then, the video stream decoding unit 111 extracts each video sample and each video size sample from the moving picture stream based on this information. In addition, if an audio track exists in the attached information, the audio data is extracted from the video stream using its access data. The video stream decoding unit 111 then decodes the video data and the audio data.

The video signal output unit 110 is a video signal output terminal, and outputs the decoded video data as a video signal. The audio signal output unit 112 is an audio signal output terminal, and outputs decoded audio data as an audio signal.

The data processing device 10 can reproduce the MP4 file recorded on the DVD-RAM disk 131. Hereinafter, the basic reproduction process of the data processing device 10 will be described. Before starting playback, the playback unit 113 reads the attached information recorded in the management information area 132 of the DVD-RAM disk 131 and stores it in the auxiliary information holding memory 118. Next, the playback unit 113 refers to the attached information stored in the auxiliary information holding memory 118 and reads the video stream from the AV data area 133 of the DVD-RAM disk 131 via the pickup 130. Since the attached information stores access data for accessing each sample of the video stream, the playback unit 113 can access any sample based on the access data. The video stream decoding unit 111 decodes the read video stream into a video signal and / or an audio signal, and outputs them to the video signal output unit 110 and / or the audio signal output unit 112.

Next, a more specific reproduction process of the data processing device 10 will be described with reference to FIG. 14. In order to explain the advantage of adopting the above data structure, it is assumed that a decoding error occurs during reproduction of a moving image stream. FIGS. 14 (a) to 14 (d) show the state of error propagation when a decoding error occurs in the video stream constituting the MP4 file 21. This video stream does not need to have a unique code in the stream. Each video sample in FIG. 12 corresponds to a GOP in the sequence in FIG. 14 (a). The access data for each GOP shown in FIG. 14 (a) is stored in the video track of the additional information 22.

Now, it is assumed that the video stream decoding unit 111 detects an error while playing video frame #Y in video sample #X. Then, the video stream decoding unit 111 notifies the playback unit 113 of the occurrence of the error.

The playback unit 113 refers to the access data in the video size track of the attached information 22 stored in the auxiliary information holding memory 118, and reads video size sample #X in the video stream from the AV data area 133 of the DVD-RAM disk 131. Then, the video stream decoding unit 111 refers to the read video size sample #X, extracts the frame size information from the first frame #0 of video sample #X to the error occurrence frame #Y, identifies the start position of the next video frame #(Y + 1), and resumes decoding from that position. The start position of the next video frame #(Y + 1) can be obtained by calculating the sum of the frame sizes from frame #0 to the error occurrence frame #Y. After that, the playback unit 113 reads the video stream data sequentially from video frame #(Y + 1), and playback continues.

As described above, the size of each frame is stored in the video size sample. By referring to the video size sample, the size of each frame in the video stream can be obtained, so that even if an error occurs in a video stream that does not have a unique code, the start position of the next frame can be easily specified. This confines the extent of error propagation to the frame in which the error exists.
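The calculation of the restart position can be sketched as follows in Python. This is only an illustration: the helper names `parse_size_sample` and `next_frame_offset`, and the assumption that a video size sample is a plain array of 32-bit big-endian frame sizes, are hypothetical and are not defined in this specification.

```python
import struct


def parse_size_sample(size_sample):
    """Decode a video size sample into a list of frame sizes.

    Assumes (hypothetically) one 32-bit big-endian entry per frame.
    """
    count = len(size_sample) // 4
    return list(struct.unpack(">%dI" % count, size_sample))


def next_frame_offset(sample_offset, frame_sizes, error_frame):
    """Return the stream offset of frame #(Y + 1) after an error in frame #Y.

    sample_offset: start of video sample #X, taken from the video track.
    frame_sizes:   sizes of frames #0, #1, ... in that video sample.
    error_frame:   index Y of the frame in which the error was detected.
    """
    # Sum of the sizes of frame #0 through frame #Y, as described above.
    return sample_offset + sum(frame_sizes[: error_frame + 1])


if __name__ == "__main__":
    sizes = parse_size_sample(struct.pack(">4I", 100, 40, 60, 50))
    print(next_frame_offset(1000, sizes, error_frame=1))
```

The same summation, stopped one frame earlier, gives the start position of an arbitrary frame, so it can also serve random access when no error has occurred.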

Note that the reproduction operation according to the present embodiment is applicable regardless of whether an error occurs. For example, even when multiple GOPs are managed as one sample, random access to each frame is possible. The playback unit 113 can specify the storage position of each frame by adding up the frame sizes from the first frame to the frame immediately before the desired frame. In that case, if video samples and the corresponding video size samples are arranged consecutively as shown in FIG. 12, extra seek operations of the pickup 130 can be reduced. The reason is that the playback unit 113 can refer to the access data of both the video track and the video size track and read the consecutively arranged video samples and video size samples from the DVD-RAM disk 131 at one time.

According to the present embodiment, the data size of the attached information can be reduced compared with the case where one frame of the video stream in the moving image stream corresponds to one video sample. This makes it possible to prevent an increase in the memory size for holding the attached information when playing the MP4 file. For example, when a GOP is associated with a video sample and a video stream including video size samples is recorded for one hour, the data size of the attached information (including the video track and the video size track) is about 100 kilobytes. On the other hand, when a video stream is recorded for one hour with one frame associated with one video sample, the data size of the attached information is about 1 megabyte. Therefore, according to the present embodiment, the memory size for holding the attached information can be significantly reduced. In other words, the same memory size has room for about 10 times as much attached information.

Note that the data structure of the MP4 file 21 shown in FIG. 12 is a structure unique to the present embodiment. However, video tracks and video samples can be played as usual even on playback devices that do not support this data structure. The reason is that the MPEG-4 system standard specifies that the data size of each track is described at the beginning of that track in the attached information, so that a video size track that cannot be processed can be skipped.

The processing according to the present embodiment is applicable even if the above-mentioned moving image stream includes a video stream encoded by an encoding method that does not use inter-frame compression. However, it is necessary to keep in mind that the smaller the number of frames corresponding to one video sample, the larger the number of video sample access data entries to be stored in the video track of the attached information, and therefore the larger the data size of the attached information.

In the above description, the extended information is the video size sample storing the frame size of each video frame, but the extended information is not limited to this. FIG. 15 shows an example of extended information describing the number of slices constituting each video frame and the data size of each slice. With this configuration, the data storage position of each slice can be specified by the same operation as described above, so that any slice can be accessed and the effect of error propagation when an error occurs can be made smaller. FIGS. 16 (a) to 16 (d) show how error propagation is confined within a slice when a decoding error occurs in the video stream. By providing extended information that describes the data size of each slice, decoding can resume from the next slice even if an error occurs, so that error propagation can be confined within the slice.
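The slice-level variant can be sketched in the same way. The following Python fragment is a hedged illustration: it assumes the extended information has already been parsed into a nested list `slice_sizes[f][s]`, and that the slices of a frame are stored back to back with no other bytes between them. Neither the representation nor the function names `slice_offset` and `resume_point` come from this specification.

```python
def slice_offset(slice_sizes, frame_index, slice_index):
    """Offset of a slice relative to the start of its video sample.

    slice_sizes[f][s] is the byte size of slice s in frame f
    (hypothetical in-memory form of the extended information).
    """
    before_frames = sum(sum(sizes) for sizes in slice_sizes[:frame_index])
    before_slices = sum(slice_sizes[frame_index][:slice_index])
    return before_frames + before_slices


def resume_point(slice_sizes, frame_index, slice_index):
    """Return (frame, slice) at which decoding resumes after an error.

    Decoding restarts at the slice following the damaged one, or at the
    first slice of the next frame. None means the sample is exhausted.
    """
    if slice_index + 1 < len(slice_sizes[frame_index]):
        return frame_index, slice_index + 1
    if frame_index + 1 < len(slice_sizes):
        return frame_index + 1, 0
    return None


if __name__ == "__main__":
    sizes = [[10, 20], [5, 5, 5]]
    nxt = resume_point(sizes, 0, 1)
    print(nxt, slice_offset(sizes, *nxt))
```

Only the damaged slice is lost in this scheme, whereas with frame sizes alone the whole frame containing the error would be discarded.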

FIG. 17 shows an example of extended information describing the display duration of each video frame. The video stream 33 includes a plurality of video duration samples 36 and video samples 37. The video duration sample 36 stores the display duration information (frame duration) of each frame of the corresponding video sample as extended information. The access data of the video duration sample 36 is managed in the video duration track 34 of the auxiliary information 32. The point that the access data of the video sample 37 is managed in the video track 35 of the auxiliary information 32 is the same as the previous example.

The advantages of adopting the configuration shown in FIG. 17 are as follows. First, when no frame skip occurs during recording of a moving image stream, information representing the same time ΔT is described for all frames in the video duration sample. Here, consider the case where a frame skip of one frame occurs as shown in FIG. 18 (a). FIG. 18 (a) schematically shows a video frame group in which a frame skip has occurred. When a frame skip occurs, display time information that specifies twice the display time for the immediately preceding frame is generated. That is, in the example of FIG. 18 (a), the display time of the immediately preceding frame #4 is set to 2ΔT, the display time of the other frames is set to ΔT, and these values are described as the frame durations in the video duration sample 36. As a result, when this video stream is played, each frame can be displayed at the same timing as when recording, except for frame #4. On the other hand, FIG. 18 (b) schematically shows a video frame group when the display time is set uniformly for all frames. If the display time of each frame is set as shown in FIG. 18 (b), the frames of the video stream will be displayed at timings different from those at recording.
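The effect of per-frame durations on display timing can be checked with a short Python sketch. The tick value chosen for ΔT, the number of frames, and the function name `display_start_times` are assumptions for illustration; only the rule that the frame preceding a skip receives 2ΔT is taken from the description above.

```python
def display_start_times(durations, start=0):
    """Accumulate per-frame display durations into display start times."""
    times = []
    t = start
    for d in durations:
        times.append(t)
        t += d
    return times


if __name__ == "__main__":
    DT = 3003  # hypothetical value of the frame interval, in clock ticks

    # Six stored frames; the frame following #4 was skipped at recording,
    # so frame #4 carries twice the normal duration (FIG. 18 (a) case).
    per_frame = [DT, DT, DT, DT, 2 * DT, DT]
    # Uniform durations for the same six stored frames (FIG. 18 (b) case).
    uniform = [DT] * 6

    print(display_start_times(per_frame))
    print(display_start_times(uniform))
```

With the per-frame durations, the frame stored after frame #4 starts at 6ΔT, matching its capture time; with uniform durations it starts at 5ΔT, one frame interval too early, and every later frame inherits that shift.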

It is also possible to store the display time information and the frame size information of each frame in the same sample and to refer to them from the access data of one track in the attached information. With this configuration, both information can be managed without increasing the access data size.

In the present embodiment, the number of video frames in a video sample and the number of corresponding frame size entries in a video size sample have been described as being the same, but they need not be the same. Even if the numbers differ, it is sufficient that information indicating the correspondence between the video sample and the video size sample is stored in the auxiliary information or the like, so that a specific video frame can ultimately be associated with its frame size information.

In the present embodiment, the MP4 file format has been described as an example, but the present invention is not limited to this. The present invention can be applied to other file formats as long as the file is composed of a video stream and additional information and the additional information stores the access data. One example of another applicable file format is the QuickTime file format, which was the basis for the MP4 file format.

In the present embodiment, the recording medium is described as a DVD-RAM disk, but the present invention is not particularly limited to this. For example, the recording medium may be an optical recording medium such as MO, DVD-R, DVD-RW, DVD + RW, CD-R, or CD-RW, a magnetic recording medium such as a hard disk, or a semiconductor recording medium such as a semiconductor memory.

(Embodiment 2)

In the first embodiment described above, information on the frame size and the like is stored in another sample (video size sample) independent of the video sample.

In the second and subsequent embodiments of the present invention, an example will be described in which information on a frame size and the like is stored in a video sample. Hereinafter, the data structure related to the present embodiment will be described first, and then the configuration and operation of the data processing device according to the present embodiment will be described.

As described as the background art of the present invention, in recent years, moving image distribution services for PCs on the Internet have become widespread due to the increase in the capacity of communication networks and progress in transmission technology. In addition, with regard to video distribution to wireless terminals, TS 26.234 (Transparent end-to-end packet switched streaming service) and the like have been defined as standards by 3GPP (Third Generation Partnership Project), an international standardization organization that defines standards for receiving terminals in wireless networks, and video distribution services for mobile terminals are expected to expand. When storing and distributing media data such as audio, video, still images, and text, it is common to multiplex the header information necessary for playback of the media data with the media data. MP4, which was mentioned in connection with the first embodiment, is a multiplexed file format for achieving this multiplexing, and has been standardized by ISO/IEC JTC1/SC29/WG11 (International Organization for Standardization / International Electrotechnical Commission). Since it has also been adopted in 3GPP TS 26.234, it is expected to spread in the future.

Here, the data structure of the MP4 file will be described. In MP4 files, header information and media data are stored in objects called boxes. FIG. 19 shows the data structure of box 40. Box 40 has a size field 41, a type field 42, a version field 43, a flag (flags) field 44 and a data field 45. The content of the information stored in each field is as follows.

Size (size) field: The overall size of the box, including the size field itself.

Type (type) field: The identifier of the box, usually represented by four alphabetic characters. The field length is 4 bytes, and a box can be searched for in an MP4 file by determining whether or not 4 consecutive bytes of data match the type field identifier.

Version (version) field: The version number of the box.

Flags (flags) field: Flag information set for each box.

Data (data) field: Header information, media data, etc.

Since the version field and flag field are not mandatory, some boxes do not have these fields. In this specification, it is assumed that an identifier of a type field is used to refer to a box. For example, a box whose type is "moov" is referred to as "moov" or "box moov".
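The box layout described above can be illustrated with a minimal Python sketch that walks the top-level boxes of a byte buffer by reading the size and type fields. The function name iter_boxes is hypothetical. The handling of size values 1 (a 64-bit size follows the type field) and 0 (the box extends to the end of the data) follows the ISO base media file format and is not described in this specification; the version and flags fields are left inside the payload because not every box has them.

```python
import struct

def iter_boxes(buf, offset=0, end=None):
    """Yield (type, payload_start, box_end) for each box in buf[offset:end].

    Hypothetical helper: reads the 4-byte big-endian size field and the
    4-byte type field that start every box.
    """
    end = len(buf) if end is None else end
    while offset + 8 <= end:
        size, box_type = struct.unpack_from(">I4s", buf, offset)
        header = 8
        if size == 1:
            # ISO base media convention: a 64-bit size follows the type field.
            (size,) = struct.unpack_from(">Q", buf, offset + 8)
            header = 16
        elif size == 0:
            # ISO base media convention: the box runs to the end of the data.
            size = end - offset
        if size < header or offset + size > end:
            raise ValueError("malformed box at offset %d" % offset)
        yield box_type.decode("ascii"), offset + header, offset + size
        offset += size

# Two toy boxes: a 16-byte "ftyp" and an 11-byte "mdat" holding b"abc".
data = (struct.pack(">I4s", 16, b"ftyp") + b"isom\x00\x00\x00\x00"
        + struct.pack(">I4s", 11, b"mdat") + b"abc")
boxes = list(iter_boxes(data))
```

Because each box carries its own size, a reader can skip boxes it does not understand; nested boxes such as those inside moov could be walked by calling the same function on the payload range.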

MP4 allows the use of extensions in addition to the basic parts that must be included in the file.

Hereinafter, the structure of the basic unit will be described first. FIG. 20 (a) shows the data structure of the basic unit 50. An MP4 file is composed of three basic boxes: ftyp 51 and moov 52, which are basic headers, and mdat 53, which stores media data. ftyp 51 is placed at the beginning of the MP4 file and contains information for identifying an MP4 file. mdat 53 stores media data in units called samples. A sample is the minimum unit for handling media data in MP4, and is equivalent to one or more audio frames or a VOP (Video Object Plane) of MPEG-4 Visual.

The MP4 file format defines the format in which audio or video frame data is stored in the mdat 53. Data is stored according to their format. The data of each medium included in the mdat 53 is called a track, and each track is identified by a track ID.

FIG. 20 (b) shows the data structure of moov 52. In the MP4 file, moov is a required box, and its number is one. Boxes are arranged hierarchically in moov52, and header information of samples included in mdat53 is stored. This header information corresponds to the additional information in the first embodiment. In other words, the moov 52 stores the additional information according to the first embodiment.

Header information common to the entire file is stored in mvhd. In addition, header information for each track, such as video and audio, is stored in a separate trak 53. Note that the track for which a trak 53 contains information is identified by the track ID shown in the box tkhd (not shown) in the trak 53. When an extension is used in an MP4 file, mvex is present. mvex exists only when the extension is used, and indicates that the extension is stored after the basic part 50. mvex contains trex, which sets the default values of header information in the extension for each track.

FIG. 21 (a) shows the data structure of trak 53. trak 53 contains stbl 54, and stbl 54 contains boxes stts 55, stsd 56, and stsc 57. The boxes in stbl 54 store information such as the decoding time, the display start time, and the size of each sample.

First, the decoding time of the sample is stored in stts 55. stts 55 stores the difference value of the decoding time between two consecutive samples. Therefore, by accumulating the difference values, the decoding time of each sample can be obtained. If the decoding time and the display start time are different, information on the difference between the decoding time and the display start time is stored in a box ctts (not shown). For example, since the decoding time and the display start time are different for frames encoded using bidirectional prediction, ctts is used to determine the display start time. The sample size is stored in a box called stsz (not shown).
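The accumulation described above can be sketched as follows. This is a minimal illustration, assuming the stts and ctts contents have already been expanded into one value per sample (the actual boxes may store them in a more compact form); the function names and the sample values are hypothetical.

```python
def decoding_times(stts_deltas, first_dts=0):
    """Accumulate per-sample decoding-time differences into decoding times."""
    times = []
    t = first_dts
    for delta in stts_deltas:
        times.append(t)   # decoding time of the current sample
        t += delta        # advance by the difference to the next sample
    return times

def display_start_times(dts_list, ctts_offsets):
    """Add the decoding-to-display offset of each sample to its decoding time."""
    return [dts + off for dts, off in zip(dts_list, ctts_offsets)]

dts = decoding_times([1, 1, 2, 1])
cts = display_start_times(dts, [3, 0, 0, 3])
```

Here the third sample lasts twice as long as the others, so the fourth sample is decoded at time 4 rather than 3.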

Next, the boxes stsd56 and stsc57 will be described. The stsd 56 stores, as an entry, initialization information necessary for decoding of the track data and the display size of the track. The contents of the entry are referenced when each sample is decoded. A plurality of entries may exist. For example, when the display size is changed in the middle of a track, two entries are prepared before and after the change.

As described in connection with the first embodiment, the size of the access data generally varies according to the number of frames included in one sample, so by handling a plurality of video frames collectively as one sample, the data size of the auxiliary information, that is, the data size of box moov, can be reduced. Therefore, a field frame_count indicating the number of frames included in one sample is introduced in each entry of box stsd 56. FIG. 21 (b) shows the field frame_count set for a number of entries in box stsd 56. As shown in FIG. 21 (b), there are 10 entries, 1 to 10, in stsd. The first entry, entry 1, has a frame_count of 10, indicating that one sample contains 10 frames. Similarly, the second and third entries show that each sample contains 9 and 8 frames, respectively.

In addition, when decoding or playing an MP4 file, stsc 57 is consulted to determine how many frames are included in each sample. The playback device can know the entry number in the stsd 56 by referring to the stsc 57 when decoding each sample. For example, if stsc 57 indicates that the first sample corresponds to entry 1 of stsd 56, the playback device can know that the first sample includes 10 frames.
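The lookup described above amounts to two table references. The sketch below is a simplified model, assuming the stsc contents have been flattened into one stsd entry number per sample and using the frame_count values of entries 1 to 3 from FIG. 21 (b); the variable and function names are hypothetical.

```python
# stsd entry number -> frame_count (values for entries 1 to 3 of FIG. 21 (b))
stsd_frame_count = {1: 10, 2: 9, 3: 8}

# Flattened stsc: stsd entry number referenced by each sample (assumed data)
sample_to_entry = [1, 1, 2, 3]

def frames_in_sample(sample_number, sample_to_entry, stsd_frame_count):
    """Return the number of frames in the given sample (1-based number)."""
    entry = sample_to_entry[sample_number - 1]   # stsc: sample -> entry
    return stsd_frame_count[entry]               # stsd: entry -> frame_count

first = frames_in_sample(1, sample_to_entry, stsd_frame_count)
```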

For a track consisting of only the basic part, header information for the entire track is stored together in a moov. On the other hand, an extension has been standardized as a means for dividing a track and adding header information for each divided unit. In the extension section, header information is added to each unit obtained by dividing the track. However, ftyp and moov are also required when using the extension, and information that is commonly used in all samples in the track, such as information necessary for decoding samples, is always stored in moov.

FIG. 22 shows a data structure of an MP4 file including a basic part 50 and an extended part 60. In the extension unit 60, the header information of the samples included in the divided unit is stored in the extension header 61 (moof), and the data of the samples whose header information is stored in the moof is stored in the sample data 62 (mdat). In addition, since the extension part 60 exists, the box mvex exists in the box moov of the basic part 50.

FIG. 23 shows the data structure of moof 61. As in moov, boxes are arranged hierarchically in moof 61: mfhd, immediately below moof, stores index information of the moof, and one or more traf 62 storing header information in track units are arranged after it. The traf 62 stores the header information of the samples included in each track in order from the sample with the earliest decoding time, like the trak in the moov. Note that header information for each track may be stored using a plurality of traf 62. traf 62 includes tfhd 63 and one or more trun 64. trun 64 stores header information in sample units, such as sample size and playback time length. tfhd 63 stores information such as the track ID (track_ID) of the track for which traf 62 stores information, and the number of the entry in stsd 56 to be referred to when decoding the samples included in traf 62.

In the above description, moov and mdat exist in the same file. However, mdat can be stored as a data file different from moov, and the file of mdat can be referenced from moov.

Hereinafter, the basic operation in which the data processing device according to the present embodiment generates an MP4 file will be described with reference to FIGS. 24 and 25. FIG. 24 shows a configuration of a recording function block of the data processing device 170 according to the present embodiment. The data processing device 170 can receive a video signal, multiplex video data, header data, and the like, and generate an MP4 file.

The data processing device 170 includes an encoding unit 171, memories 172 and 174, an analysis unit 173, a header generation unit 175, a data generation unit 176, and a linking unit 177.

The function of each component is as follows. The encoding unit 171 receives the video signal, encodes the video signal in a predetermined encoding format, and outputs encoded data. For example, video is encoded frame by frame or field by field. In this specification, the term "picture" is used as a concept encompassing both "frame" and "field". The memories 172 and 174 are recording media for storing encoded data, analysis information, and the like, and are, for example, semiconductor memories, optical disks, or hard disks. The two need not be the same type of recording medium.

The analysis unit 173 obtains the encoded data and analyzes it to acquire information on the encoded data, for example, the data size of each sample including one or more pieces of picture data or audio frame data, information indicating the display start time or the decoding time, and initialization information necessary for decoding the encoded data. In addition, the analysis unit 173 outputs a moov generation signal instructing generation of the above-described box moov.

The header generation unit 175 acquires the analysis information of the encoded data based on the moov generation signal, and generates the box moov. Further, the header generator 175 outputs arrangement information regarding the arrangement of the sample data in the mdat.

The data generator 176 creates the mdat in accordance with the format defined by the MP4 standard based on the encoded data and the arrangement information. The linking unit 177 links or multiplexes the box moov and the data mdat specified in the MP4 standard to generate an MP4 file.

Hereinafter, the operation of the data processing device 170 will be described. First, the encoding unit 171 encodes the input video signal, and stores the encoded data d102 in the memory 172. Next, the analysis unit 173 obtains the encoded data d103 from the memory 172, and obtains information in sample units, such as the frame size and the display start time or decoding time, as well as the initialization information necessary for decoding the encoded data. After that, the analysis unit 173 stores the analysis result d105 in the memory 174. When the information necessary for generating the moov has been obtained, the analysis unit 173 sends the moov generation signal d104 to the header generation unit 175.

The header generation unit 175 starts generation of moov using the moov generation signal d104 as a trigger. At this time, the header generator 175 obtains the analysis information d106 from the memory 174 and generates each box in the moov. After the creation of the moov, the header generation unit 175 inputs the arrangement information d107 of the sample data in the mdat to the data generation unit 176, and inputs the created moov data d109 to the linking unit 177. The data generator 176 obtains the encoded video data d108 from the memory 172, and creates mdat in accordance with the arrangement information d107 and the format specified by the MP4 standard. Then, the data generation unit 176 inputs the generated data d107 (mdat) to the linking unit 177. Finally, the linking unit 177 links the data d109 (moov) and the data d107 (mdat), and outputs an MP4 file d110. The output MP4 file is recorded on an optical disk, a hard disk, or the like by a drive device (not shown) or the like, or is recorded on a semiconductor memory card or the like via a PC drive slot.

FIG. 25 shows a procedure in which the header generation unit 175 determines the difference value of the decoding start time and the number of frames forming a sample. This decision is made based on the analysis result d106. The difference value of the decoding start time indicates a difference value between the decoding start time of each sample and the decoding start time of the next sample. It is assumed that the data processing device 170 of the present embodiment also handles a plurality of video frames as one sample.

In the following, the operation of storing one GOP as one sample when MPEG-2 Visual data is input will be described. For this operation, it is assumed that the header generation unit 175 has previously acquired information on the total playback time length of the frames included in one GOP. In the following description and drawings, "sample_dur", "num_frame", and "i" indicate, respectively, the difference between the decoding times of the i-th sample and the (i + 1)-th sample, the number of frames included in the i-th sample, and the sample number. The initial values of sample_dur and num_frame are 0, and the initial value of i is 1.

First, in step S21, the header generation unit 175 inputs an analysis result for one frame, and then in step S22, obtains frame_dur, which is the difference value of the decoding time between the next frame and the current frame. In step S23, it adds frame_dur to sample_dur and adds 1 to num_frame. Next, in step S24, the header generation unit 175 determines whether the processed frame is the last frame of the GOP by comparing sample_dur with the playback time length of the GOP. If it is not the last frame, the process returns to step S21, and the processing of steps S21 to S24 is repeated until the processing of the last frame included in the GOP is completed. When the processing of the frames of one GOP is completed, in step S25, sample_dur, indicating the difference between the decoding times of the i-th sample and the (i + 1)-th sample, and num_frame, indicating the number of frames included in the i-th sample, are stored in the analysis result table. Next, in step S26, the header generation unit 175 determines whether or not the i-th GOP is the last GOP in the input data. If the result of the determination is that it is not the final GOP, the header generation unit 175 adds 1 to i in step S27, sets sample_dur and frame_dur to 0, and then repeats the processing from step S21 to step S26. This process continues until the final GOP is reached. The basic operation when the data processing device 170 generates the MP4 file has been described above.
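The loop of steps S21 to S27 can be sketched in Python as follows. This is a simplified illustration rather than the procedure of FIG. 25 itself: it assumes every GOP has the same playback time length, takes the per-frame decoding-time differences as a ready-made list, and resets both sample_dur and num_frame to their stated initial value of 0 for each new sample. The function name group_frames_by_gop and the flushing of a trailing partial GOP are assumptions added for the example.

```python
def group_frames_by_gop(frame_durs, gop_duration):
    """Return the analysis result table as a list of (sample_dur, num_frame)."""
    table = []
    sample_dur = 0
    num_frame = 0
    for frame_dur in frame_durs:           # S21, S22: next frame and its frame_dur
        sample_dur += frame_dur            # S23: accumulate the duration
        num_frame += 1                     # S23: count the frame
        if sample_dur >= gop_duration:     # S24: last frame of the GOP reached?
            table.append((sample_dur, num_frame))   # S25: store the sample
            sample_dur = 0                 # S27: reset for the next sample
            num_frame = 0                  # (assumed reset to the initial value)
    if num_frame:                          # assumption: flush an incomplete GOP
        table.append((sample_dur, num_frame))
    return table

table = group_frames_by_gop([1] * 10, gop_duration=5)
```

With ten frames of duration 1 and a GOP length of 5, two samples of five frames each are recorded.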

Next, another operation when the data processing device 170 generates the MP4 file will be described, together with the data structure of the generated MP4 file. By using the obtained MP4 file, the decoding time of a plurality of frames included in the sample can be correctly indicated even when the display time length of each frame in the encoded data is not constant. The encoding method of the video signal is assumed to be MPEG-2 Visual; however, MPEG-4 AVC or MPEG-4 Visual, for example, may be used, and the data may include not only video data but also audio and text data.

First, the characteristic operation of the data processing device 170 according to the present embodiment will be described. Compared with the operation shown in FIG. 25, this operation differs in the procedure by which the header generation unit 175 determines the frames forming a sample. Therefore, the differing processing will be described below.

The criterion for the header generation unit 175 to determine a sample is whether or not the difference value of the decoding time is constant between two consecutive frames. If the difference value is constant, the header generation unit 175 determines that those frames belong to the same sample, and if not, determines that those frames belong to different samples.

In the present embodiment, when no frame skip has occurred, all the difference values are constant, and all frames of one GOP are managed as one sample. On the other hand, if the difference value is not constant due to the occurrence of a frame skip or the like, the frames of one GOP are managed as a plurality of samples. However, the target unit for sample generation may be a GOV (Group of VOP (Video Object Plane)) in MPEG-4 Visual or a sub-sequence in MPEG-4 AVC. Alternatively, it may be from an I (intra-coded) frame in MPEG-4 Visual to the frame immediately before the next I frame, or from an IDR (Instantaneous Decoder Refresh) picture in MPEG-4 AVC to the picture immediately before the next IDR picture. The video data included in the sample is not limited to frame-structured data, but may be field-structured data.

FIG. 26 shows the procedure by which the header generation unit 175 determines the frames that make up a sample. First, in step S31, the header generation unit 175 acquires the display start time CTS (1) and the decoding time DTS (1) of the first frame of the encoded data, and sets the variable i to 1. At this time, the value of "Temporal Reference" in the frame data is used to obtain the display start time, and the decoding time is calculated from the display start time according to the type of frame (I, P, or B frame) composing the GOP and the display order.

Next, in step S32, the header generation unit 175 obtains the display start times CTS (i + 1) and CTS (i + 2) of the (i + 1)-th and (i + 2)-th frames. Subsequently, in step S33, the header generation unit 175 calculates the decoding times DTS (i + 1) and DTS (i + 2) of the (i + 1)-th and (i + 2)-th frames from the acquired values of CTS (i + 1) and CTS (i + 2), and in the next step S34, calculates delta (i) and delta (i + 1) from the values of DTS (i), DTS (i + 1), and DTS (i + 2). Here, delta (i) denotes the difference between the decoding times of the i-th frame and the (i + 1)-th frame.

In step S35, the header generation unit 175 determines whether or not delta (i) and delta (i + 1) are equal. If they are not equal, the process proceeds to step S37; if they are equal, the process proceeds to step S36. Here, the initial value of j is 1. In step S37, the header generator 175 determines the frames constituting the j-th sample, and adds 1 to j. At this time, the first frame of the j-th sample is the frame immediately after the last frame in the (j-1)-th sample, and the last frame is the i-th frame. Note that the first sample (j = 1) starts from the first frame of the encoded data. In this way, a new sample is created whenever a discontinuity occurs in delta (i), so that the difference between the decoding times of the next frame and the current frame can be made equal for every frame constituting the sample.

Next, in step S36, the header generation unit 175 determines whether or not the i-th frame is the last frame of the GOP. If it is not the last frame, the process proceeds to step S39. If it is the last frame, the process proceeds to step S37.

In step S39, the header generation unit 175 determines whether or not the (i + 2)-th frame is the last frame of the encoded data. If it is not the last frame, the process proceeds to step S40; if it is the last frame, the j-th sample is determined and the process ends. In step S40, the header generation unit 175 adds 1 to i, and repeats the processing from step S32 to step S36. When the j-th sample is determined in step S39, its first frame is the frame immediately after the last frame in the (j-1)-th sample, and its last frame is the (i + 2)-th frame.

If it is determined in step S36 that the i-th frame is the last frame of the GOP, the process of step S37 is performed. Subsequently, in step S38, the header generation unit 175 determines whether or not the (i + 2)-th frame is the last frame of the encoded data. If it is not the last frame, 1 is added to i in step S40, and the processing from step S32 to step S38 is repeated. This process is repeated until the last frame is determined. When it is determined in step S38 that the frame is the last frame, the first frame of the j-th sample is set to the frame immediately after the last frame in the (j-1)-th sample, the last frame is set to the (i + 2)-th frame, and the process ends.
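The effect of the procedure of FIG. 26 can be approximated by the following Python sketch. It is a simplification, not a step-by-step transcription of the flowchart: it assumes the decoding-time difference delta of every frame (including the last one) is already available as a list, and it closes a sample whenever delta changes, the GOP ends, or the data ends. The function name split_into_samples and the gop_last flags are hypothetical.

```python
def split_into_samples(deltas, gop_last):
    """Return samples as (first, last) frame indices (0-based, inclusive).

    deltas[i]   : decoding-time difference between frame i and frame i + 1
    gop_last[i] : True if frame i is the last frame of its GOP
    """
    samples = []
    start = 0
    for i in range(len(deltas)):
        is_final = (i == len(deltas) - 1)
        # Close the sample at a delta discontinuity (S35), at the end of a
        # GOP (S36), or at the end of the encoded data (S38, S39).
        if is_final or gop_last[i] or deltas[i] != deltas[i + 1]:
            samples.append((start, i))
            start = i + 1
    return samples

# Ten frames in one GOP; only the seventh frame has a delta of 2.
deltas = [1, 1, 1, 1, 1, 1, 2, 1, 1, 1]
gop_last = [False] * 9 + [True]
samples = split_into_samples(deltas, gop_last)
```

In this input the seventh frame ends up alone in its own sample, so every frame within a sample shares the same delta.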

FIG. 27 shows the relationship between the determined samples and the frames. In this GOP, one I frame and nine P (forward prediction) frames are arranged in order of decoding time. In the figure, "I" indicates an I frame and "P" indicates a P frame. The value of "Temporal Reference" is used as the display start time of each frame. Here, an example has been described in which the decoding time is calculated based on the display start time of the frame. However, when information indicating the decoding time of each frame is included in the encoded video data, that information may be used instead.

Since the GOP shown in FIG. 27 is composed only of I and P frames, the display start time and decoding time of all frames are equal. Here, consider a case where the display start time and the decoding time of frame P-8 are changed. For example, suppose that the difference in decoding time between all other consecutive frames is 1, whereas the decoding time difference between frames P-7 and P-8 is 2. Then, the header generation unit 175 generates sample 1 from frame I-1 to frame P-6, sample 2 including only frame P-7, and sample 3 from frame P-8 to frame P-10.

FIG. 28 shows another example of the relationship between the determined samples and the frames. In this GOP, frames are arranged in the order "I B P B B P B B P P" in order of decoding time. B indicates a B (bidirectional prediction) frame. In this example, the decoding time of the I and P frames is calculated by subtracting 3 from the display start time, and the decoding time of the B frames is equal to the display start time. Here, consider a case where a frame skip occurs at frame B-5 during encoding. Then, at frame P-4, the difference value of the decoding time between the next frame and the current frame becomes discontinuous. Therefore, the header generation unit 175 generates sample 1 from frame I-1 to frame B-3, sample 2 including only frame P-4, and sample 3 from frame B-6 to frame P-10.

The following advantage is obtained by determining the frames constituting each sample by the above-described processing and generating samples including those frames. Namely, by dividing the difference between the decoding times of two consecutive samples by the number of frames constituting the sample, the playback device can obtain, by a uniform calculation and without analyzing the sample data, the time from decoding the current frame to decoding the next frame.

In MP4, the number of frames that make up a sample is specified by the frame_count field in the stsd 56 entry shown in FIGS. 21 (a) and (b). Samples that contain different numbers of frames refer to different stsd entries, each with the corresponding frame_count value. For example, if the numbers of frames constituting the samples are 5, 6, and 7, it suffices to generate three entries with frame_count values of 5, 6, and 7 in stsd 56 and refer to them. Note that the frame_count values that may be required may be predicted in advance, and a group of stsd entries covering those frame_count values may be generated in advance.
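As a rough illustration of how samples with different frame counts could be mapped to stsd entries, the following Python sketch creates one entry per distinct frame_count value and records which entry each sample refers to. The function name assign_stsd_entries and the in-memory representation are assumptions for this example; the actual entry contents are not modeled.

```python
def assign_stsd_entries(frame_counts):
    """Map per-sample frame counts to stsd entries.

    Returns (entries, refs): entries[k] is the frame_count of stsd entry
    k + 1, and refs[n] is the 1-based entry number referenced by sample n.
    """
    entries = []
    refs = []
    for count in frame_counts:
        if count not in entries:
            entries.append(count)             # new entry for a new frame_count
        refs.append(entries.index(count) + 1) # entry numbers start at 1
    return entries, refs

entries, refs = assign_stsd_entries([5, 6, 7, 5])
```

The fourth sample reuses entry 1, so only three entries are needed for four samples.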

(Embodiment 3)

The data processing device according to the third embodiment receives video data and encodes the video data according to MPEG-2 Visual. When the difference between the decoding times of two consecutive frames in the coded data becomes discontinuous, the data processing device generates the MP4 file so that the number of frames included in each sample is kept as constant as possible, thereby preventing frequent switching of the stsd entry referenced by the samples. The video data processed by the data processing apparatus may be MPEG-4 AVC or MPEG-4 Visual. The video data included in the sample is not limited to frame-structured data, but may be field-structured data.

The configuration of the data processing device according to the present embodiment is the same as the configuration of the data processing device according to the second embodiment, and the operation of each component is also the same. Therefore, in the following, description of each component of the data processing device will be omitted, and the operation of the header generation unit 175 will be described.

FIG. 29 shows a procedure executed by the header generation unit 175 to group the frames of the encoded data into samples. The sample unit is determined so that one sample includes all frames in one GOP. In FIG. 29, i represents the sample number, sample_dur represents the difference between the decoding times of the i-th and (i + 1)-th samples, and num_frame represents the number of frames contained in the i-th sample. The initial values of sample_dur and num_frame are 0, and the initial value of i is 1.

In step S41, the header generation unit 175 obtains the display times of the current frame and the next frame based on the value of the Temporal Reference, calculates the decoding time of each frame, and then calculates frame_dur, the difference value of the decoding time between the current frame and the next frame.

Next, in step S42, the header generator 175 adds frame_dur to sample_dur, and adds 1 to num_frame. After that, in step S43, the header generation unit 175 compares the frame_dur of the current frame with the display time length calculated from the frame rate at the time of encoding, and determines whether a frame skip has occurred. When it is determined that a frame skip has occurred, in step S44, the header generation unit 175 adds the number of skipped frames to num_frame. For example, if the frame rate of the encoded data is 10 Hz, the difference between the decoding times of consecutive frames is 100 ms. Here, assuming that the frame_dur of the N-th frame is 300 ms, it can be determined that the two frames N + 1 and N + 2 have been skipped, so 2 is added to num_frame. Subsequently, in step S45, the header generation unit 175 determines whether or not the current frame is the last frame of the GOP. If it is not the last frame, the header generation unit 175 returns to step S41, and repeats the processing from step S41 to step S45 until the processing of the last frame included in the GOP is completed. When the processing of the frames for one GOP is completed, in step S46, sample_dur, indicating the difference between the decoding times of the i-th and (i + 1)-th samples, and num_frame, indicating the number of frames, are obtained.

Next, in step S47, the header generator 175 determines whether or not the current frame is the last frame of the encoded data. If the frame is not the last frame, 1 is added to i in step S48, sample_dur and frame_dur are set to 0, and then the processing from step S41 to step S47 is repeated. The process ends when the last frame is reached. When the frequency of frame skipping reaches a certain value or more, the addition processing of num_frame in step S44 may be omitted.
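The skip compensation of steps S41 to S48 can be sketched in Python as follows. This is an illustrative simplification: frame_dur values are given directly in milliseconds, GOP boundaries are supplied as flags, the number of skipped frames is estimated by rounding frame_dur divided by the nominal frame duration, and both sample_dur and num_frame are reset to their stated initial value of 0 for each new sample. The function name and these choices are assumptions made for the example.

```python
def group_with_skip_compensation(frame_durs, gop_last, nominal_dur):
    """Return a list of (sample_dur, num_frame), one tuple per GOP.

    frame_durs  : decoding-time difference of each frame to the next (ms)
    gop_last    : True where the frame is the last frame of its GOP
    nominal_dur : frame duration implied by the encoding frame rate (ms)
    """
    table = []
    sample_dur = 0
    num_frame = 0
    for frame_dur, is_gop_last in zip(frame_durs, gop_last):   # S41
        sample_dur += frame_dur                                # S42
        num_frame += 1                                         # S42
        skipped = round(frame_dur / nominal_dur) - 1           # S43
        if skipped > 0:
            num_frame += skipped                               # S44
        if is_gop_last:                                        # S45
            table.append((sample_dur, num_frame))              # S46
            sample_dur = 0                                     # S48
            num_frame = 0                                      # (assumed reset)
    return table

# GOP 1: five regular frames. GOP 2: four frames, the second lasting 200 ms.
durs = [100] * 5 + [100, 200, 100, 100]
last = [False] * 4 + [True] + [False] * 3 + [True]
table = group_with_skip_compensation(durs, last, nominal_dur=100)
```

Both GOPs yield a frame count of 5 even though the second GOP actually contains only four frames, which is the behavior the embodiment aims for.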

FIG. 30 shows samples in which the frames of GOPs are stored. In this example, the frames forming two GOPs, GOP 1 and GOP 2, are stored in samples 1 and 2. First, the five frames included in GOP 1, from frame 1 to frame 5, are mapped to sample 1. Next, the four frames included in GOP 2 are processed. In GOP 2, a frame skip occurs at frame 7, and the display time length of frame 7 is twice as long as that of the other frames. In the processing of this frame, the header generator 175 determines in step S43 that one frame has been skipped, proceeds to step S44, and adds 1 to num_frame. As a result, at the end of the processing of frame 7, the value of num_frame is 3, which is the same value as when no frame skip has occurred. That is, the frame count corresponding to sample 2 is not 4, but 5 as in sample 1.

In moov 52 of the MP4 file, the stsd 56 entry referenced by each sample (FIG. 21 (b)) is indicated by stsc 57. Each time the stsd entry referenced by the samples is switched, an entry is added to the data table in stsc 57.

According to the present embodiment, even when a frame skip occurs, the frame count is kept constant, so the switching frequency of the referenced stsd entry is reduced, and the playback load of the data processing device can be reduced. Furthermore, when using the MP4 extension, the entry to be referred to is not switched, so there is no need to switch the traf that stores the sample information, and no increase in overhead occurs.

In the present embodiment, the unit of the sample is determined so that one sample includes all frames in one GOP. However, the sample unit may be a GOV in MPEG-4 Visual or a sub-sequence in MPEG-4 AVC. Alternatively, it may be from an I frame in MPEG-4 Visual to the frame immediately before the next I frame, or from an IDR (Instantaneous Decoder Refresh) picture in MPEG-4 AVC to the picture immediately before the next IDR picture.

(Embodiment 4)

The data processing device according to the fourth embodiment receives and analyzes the MP4 file generated according to the procedure shown in FIG. 29, and decodes and displays the encoded data. It is assumed that “video data” in the present embodiment is video data that does not depend on the encoding method. For example, "video data" is MPEG-4AVC, MPEG-4Visual or MPEG-2Visual. However, it is assumed that encoding using bidirectional prediction is not performed.

FIG. 31 shows a functional block configuration of the data processing device 200 according to the present embodiment. The data processing device 200 performs a so-called demultiplexing process to decode the encoded stream. The data processing device 200 includes a reception unit 201, memories 202, 204, and 205, a separation unit 203, an analysis unit 206, and a decoding display unit 207.

Hereinafter, the processing flow of the data processing device 200 will be described together with its components. For example, if an MP4 file is recorded on a CD-ROM, the receiving unit 201 receives the MP4 file data read out via the pickup 130 and inputs it to the memory 202 as data d202. In this case, the receiving unit 201 is an interface unit for securing connection with an optical disk drive (not shown). The separation unit 203 obtains the MP4 file data d203 from the memory 202, separates the MP4 header part composed of moov or moof from the MP4 data part composed of mdat, and inputs the header data d200 to the memory 204 and the media data d205 to the memory 205. Here, the memory 205 may be a semiconductor memory or a drive device having a recording medium such as a hard disk or an optical disk.

The analysis unit 206 obtains the header data d206 from the memory 204 and analyzes it to acquire information such as the size, decoding time, and storage location of each sample or of each frame included in the sample, and then inputs the result of the analysis to the decoding display unit 207 as the data d207. The decoding display unit 207 obtains the sample data from the memory 205 based on the analysis result data d207, extracts the frames included in the sample, and then decodes and displays them. Note that the video data does not need to be composed of frames, but may be composed of fields.

Next, the method by which the analysis unit 206 obtains the decoding time of each frame included in the sample will be described. It is assumed that the input MP4 file has been generated by the processing according to the third embodiment. That is, even when a frame skip occurs in the sample, the frame count value is the same as the frame count value of a sample in which no frame skip occurs. Assume that the sample includes the first to N-th frames, and that the decoding time and display time of each frame are equal.

First, the analysis unit 206 calculates the difference value of the decoding time between two consecutive frames for the first to (N-1)-th frames. Specifically, the analysis unit 206 obtains the difference value between the decoding times of the current sample and the next sample from the stts 55 or the trun 64, and divides that difference by the frame_count to calculate the difference value of the decoding time between two consecutive frames.

Next, the analysis unit 206 calculates the difference value between the decoding times of the N-th frame and the next frame. That is, the analysis unit 206 obtains this decoding time difference value by subtracting the sum of the decoding time difference values of the first to (N-1)-th frames from the decoding time difference value of the sample. With this calculation, when a frame skip has occurred, the decoding time difference values of the last frame and of the frame at which the skip occurred cannot be obtained accurately, but the decoding time difference values of the other frames can be obtained accurately.

Fig. 32 (a) to (c) show the relationship between the frame data constituting the encoded data and the decoding time. Fig. 32 (a) shows the relationship between the frames included in the encoded data and the decoding time. The sample contains four frames: Frame 1, Frame 2, Frame 3, and Frame 4. Since a skip occurred at Frame 2, the decoding time difference value of Frame 2 is 2 seconds, and that of the other frames is 1 second.

Next, the decoding time difference value of each frame obtained by the analysis unit 206 will be described. FIG. 32 (b) shows the decoding times of the frames obtained by the analysis unit 206. The decoding time difference values of frames 1, 2, and 3 are obtained by dividing the difference between the decoding times of the sample and the next sample by the frame count. In the example of Fig. 32 (b), the decoding time difference value of the sample is 5 seconds, and the frame count value is 5. Therefore, the decoding time difference value of each of these three frames is 1 second. On the other hand, the decoding time difference value of frame 4, which is the last frame of the sample, is calculated as 2 seconds by subtracting the sum of the decoding time difference values of frames 1 to 3 from the decoding time difference value of the sample. Alternatively, as shown in FIG. 32 (c), the analysis unit 206 may divide the decoding time difference value of the sample obtained from stts or trun by the frame count to obtain the display time lengths of the second to Nth frames, and may calculate the decoding time difference value of the first frame by subtracting the sum of the decoding time difference values of the second to Nth frames from the decoding time difference value of the sample.
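The calculation of Fig. 32 (b) can be illustrated with a minimal Python sketch. The function name and its parameters are hypothetical and are introduced only for illustration; the sketch assumes that the sample decoding time difference and the frame count have already been read from stts (or trun) and from the stsd entry.

```python
def frame_decode_deltas(sample_delta, frame_count, frames_in_sample):
    """Estimate per-frame decoding time differences for one sample.

    sample_delta     -- decoding time difference between this sample and the next
    frame_count      -- frame count of the sample (unchanged by frame skips)
    frames_in_sample -- number of frames actually stored in the sample (N)
    """
    # Frames 1 to N-1: sample difference divided by the frame count.
    per_frame = sample_delta / frame_count
    deltas = [per_frame] * (frames_in_sample - 1)
    # Frame N: whatever remains of the sample difference.
    deltas.append(sample_delta - sum(deltas))
    return deltas


# Values of Fig. 32 (b): 5 seconds, frame count 5, four stored frames.
fig32b = frame_decode_deltas(5, 5, 4)
print(fig32b)
```

With the values of Fig. 32 (b), the first three frames receive 1 second each and the last frame receives the remaining 2 seconds, which is why the skipped frame itself is not reproduced exactly.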

(Embodiment 5)

The data processing device according to the fifth embodiment receives and analyzes an MP4 file in which one sample includes a plurality of video frames, and decodes and displays the encoded data. It is assumed that the video data processed by the present data processing apparatus is MPEG-4 AVC, but the video data processed may be MPEG-4 Visual or MPEG-2 Visual.

When one sample contains multiple video frames and the decoding time and the display start time differ in a frame, the display start time of each frame cannot be obtained, because moov or moof does not include time information in frame units.

Therefore, an extended sample structure is adopted for the MP4 file input to the data processing device according to the present embodiment, so that the display start time of the frame or field constituting the sample can be obtained.

MP4 specifies a format for storing video or audio frame or field data in a sample, and as long as that format is used, the frame or field data can be stored in the sample. However, depending on the format, the frame data or field data is not always stored as it is in the sample. For example, when storing encoded data of MPEG-4 AVC in an MP4 file, it has been proposed that the size of each NAL (Network Adaptation Layer) unit be added to the NAL unit before storage, so the encoded data is not stored contiguously as it is.

FIGS. 33 (a) to 33 (c) show sample structures of MP4 files to be demultiplexed. Figure 33 (a) shows the structure of a sample when N frames (N: an integer equal to or greater than 2) are included in one sample, and display time information is added before each access unit.

In this example, the “access unit” indicates data obtained by converting one frame of data into the storage format of MP4. More generally, an “access unit” is a unit that stores the data of one picture, where a picture represents one frame or one field.

When MPEG-4 AVC is used, the value of POC (Picture Order Count) is used as the display time information. Here, POC is a parameter indicating the display order of frames. The field length of the display time information is defined by providing a new field in the stsd entry. However, it is assumed that the newly created field exists only when frame_count is greater than 1. Note that a fixed value may be used as the field length of the display time information, or the field length may be specified for each frame.

In the data structure shown in FIG. 33 (a), an access unit indicating data of one or more frames and display time information indicating a display time of the frames are alternately provided in one sample.

On the other hand, FIG. 33 (b) shows a sample in which access units for one frame are stored. When one frame of access unit is stored in the sample, only the access unit is stored as before, as shown in Fig. 33 (b). Figure 33 (c) shows an example of the syntax of the sample structure.

FrameCount, TimeInfo, and access_unit_data indicate the frame_count value in the stsd entry, the display time information, and the access unit data, respectively. LengthSize indicates the field length of the display time information and is given by a newly defined field in the stsd entry.

In the sample structure shown in Fig. 33 (a), the display time information and the access units are stored in pairs, but the display time information and the access units can each be stored collectively. Fig. 34 (a) shows a sample that stores the display times together and the access units together. Fig. 34 (b) shows an example of the syntax that implements the sample structure of Fig. 34 (a).

Further, similarly to the first embodiment, information indicating the data size of the access unit can be included in the data stream together with or instead of the display time information. FIG. 34 (c) shows an example in which a field indicating the size of the access unit is added next to the display time information. Regarding the presence or absence of the size field and the setting method of the field length, the same method as the display time information can be used. The display time and data size information all indicate attributes for specifying each frame.

FIG. 35 shows a configuration of the functional blocks of the data processing device 300 according to the present embodiment. The video stream decoding unit 200 performs a so-called demultiplexing process to decode the encoded stream. The data processing device 300 includes a receiving unit 301, memories 302, 304, and 305, a separating unit 303, an analyzing unit 306, a sample analyzing unit 307, and a decoding display unit 308.

The correspondence between the data processing device 300 and the data processing device 10 shown in FIG. 11 is as follows. That is, the receiving unit 301, the memory 302, and the separating unit 303 correspond to the reproducing unit 113 of the data processing device 10. The memory 304 corresponds to the attached information storage memory 118. The memory 305, the analyzing unit 306, the sample analyzing unit 307, and the decoding display unit 308 correspond to the moving image stream decoding unit 111 of the data processing device 10. The display function of the decoding display unit 308 corresponds to the video signal output unit 110 of the data processing device 10.

Hereinafter, each component of the data processing device 300 will be described. The receiving unit 301 inputs the input MP4 file data to the memory 302 as MP4 file data d302. The separating unit 303 obtains the MP4 file data d303 from the memory 302 and separates the MP4 header portion, composed of moov or moof, from the MP4 data portion, composed of mdat. The header data d304 is input to the memory 304, and the mdat data d305 is input to the memory 305. Here, the memory 305 may be a recording means such as a hard disk or an optical disk.

The analyzing unit 306 acquires the header data d306 from the memory 304, analyzes it to obtain information such as the sample size, decoding time, and storage location, and then inputs the analysis result to the sample analyzing unit 307 as data d307. The sample analyzing unit 307 obtains sample data from the memory 305 based on the analysis result d307, obtains picture data d309 from the sample, and inputs it to the decoding display unit 308. The decoding display unit 308 decodes and displays the input picture data d309.

FIG. 36 shows the procedure by which the sample analyzing unit 307 acquires picture data from a sample. Here, field_length and AU_size indicate the field length of the display time information and the size of the access unit, respectively. The initial values of both the variable i and the data read pointer Ptr are 0. First, in step S51, the entry number of the stsd corresponding to the sample is obtained, and in step S52, the frame_count of the entry having the entry number obtained in step S51 is obtained. Here, if frame_count is greater than 1, field_length is also obtained.

Next, in step S53, it is determined whether or not the acquired frame count value is greater than 1. If it is greater than 1, the display time information is acquired in step S54 and the read pointer is moved forward by field_length bytes. Next, in step S55, the access unit data is acquired, and the read pointer is advanced by AU_size. Subsequently, in step S56, picture data is obtained from the access unit data based on the access unit structure of MP4. In step S57, 1 is added to i. In step S58, it is determined whether or not i is smaller than the frame count, and if it is smaller, the processing from step S53 to step S58 is repeated.
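The loop of steps S53 to S58 might be sketched as follows in Python. This is an illustrative sketch only: the function name is hypothetical, fields are assumed to be big-endian, and each access unit is assumed to be preceded by a size field of size_length bytes as in the layout of Fig. 34 (c), since the procedure itself does not state where AU_size comes from. Step S56 (extracting picture data from the access unit) is omitted.

```python
def read_access_units(sample, frame_count, field_length, size_length=4):
    """Split one sample into (display_time_info, access_unit) pairs."""
    ptr = 0      # data read pointer Ptr, initial value 0
    i = 0        # loop counter i, initial value 0
    units = []
    while True:
        if frame_count > 1:                                   # S53
            # S54: read display time information, advance by field_length.
            time_info = int.from_bytes(sample[ptr:ptr + field_length], "big")
            ptr += field_length
            # Assumed size field (Fig. 34 (c) style layout).
            au_size = int.from_bytes(sample[ptr:ptr + size_length], "big")
            ptr += size_length
        else:
            # A single access unit is stored as-is (Fig. 33 (b)).
            time_info = None
            au_size = len(sample)
        access_unit = sample[ptr:ptr + au_size]               # S55
        ptr += au_size
        units.append((time_info, access_unit))
        i += 1                                                # S57
        if not i < frame_count:                               # S58
            break
    return units


# Two access units with 2-byte display time fields (values 2 and 1).
demo = (bytes([0, 2]) + (3).to_bytes(4, "big") + b"abc"
        + bytes([0, 1]) + (2).to_bytes(4, "big") + b"de")
print(read_access_units(demo, frame_count=2, field_length=2))
```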

In the present embodiment, the description has been made using the POC as the display time information. However, the display start time may be specified directly, or the difference between the display start time and the decoding time, or the difference between the decoding time of the first frame in the sample and the display start time of each frame, may be specified. Further, the data size of each access unit may be specified. Alternatively, the decoding time and the display start time, or the decoding time and the difference between the display start time and the decoding time, may be specified, or only the decoding time of each frame may be specified.

According to the MPEG-4 Visual standard, the display time of a frame can be obtained from the value of "Modulo Time Base" and the value of "VOP Time Increment". Therefore, the display time can be specified directly, or a difference value between the display start time and the decoding time can be used.

(Embodiment 6)

The data processing device according to the sixth embodiment receives and analyzes an MP4 file in which one sample includes a plurality of video frames or fields, decodes and displays encoded data. Since the following description can be applied to both frames and fields, the concept including these is described using the term “picture”. It is assumed that at least one of the display time and the decoding time is different between two different video pictures.

It is assumed that the video data processed by the data processing apparatus is encoded in the MPEG-4 AVC format, but it may be encoded in MPEG-4 Visual, MPEG-2 Visual, or H.263. Also, multiple frames of audio or text data may be stored in one sample.

When one sample includes multiple video pictures, moov and moof only contain information on a sample basis, so the picture data must be analyzed to obtain the decoding time or display start time on a picture basis. When MPEG-4 AVC is stored in MP4, the start code is not used. Therefore, not only is analysis of the picture data necessary to obtain the picture boundaries, but if a bit error occurs in the sample data, the picture data stored after the error position in the sample may not be obtainable.

Therefore, an extended sample structure is adopted for the MP4 file input to the data processing device according to the present embodiment, which performs the demultiplexing process, so that the decoding time, display time, and size of the pictures constituting the sample can be acquired without analyzing the picture data. For example, a randomly accessible unit such as a GOP (Group of Pictures) in MPEG-2 can be used as a sample. In other words, in this MP4 file, header information is stored for each sample and for each access unit constituting the sample. By holding header information hierarchically in this way, header information can be efficiently stored in units smaller than a sample.

MP4 defines a format for storing video and audio frame data as samples. For storing MPEG-4 AVC, it has been proposed to use a format in which the size of each NAL (Network Adaptation Layer) unit in MPEG-4 AVC is added to the NAL unit. Here, the data unit obtained by converting one picture of data into the storage format specified in MP4 is called an access unit.

Figure 37 shows the data structure of the access unit. First, the access unit contains N NAL units. Furthermore, at the end of the access unit, it is possible to store an extended area that can be used freely by the user. Whether or not an extended area exists can be determined based on whether or not the size field of a NAL unit is 0. If the size field of the NAL unit is 0, it indicates that the subsequent area in the sample contains user-defined proprietary data. If the size field of the NAL unit is not 0, data of the access unit is stored in that NAL unit. In Figure 37, the size fields "length" (= L1, L2, Ln) of NAL units 1, 2, and N are not zero.
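As an illustration of the rule that a size field of 0 marks the start of the extended area, the following Python sketch splits an access unit into its NAL units and the extended area. The function name is hypothetical, and the 4-byte big-endian size field is an assumption made for the example; the text does not fix the field width.

```python
def split_access_unit(data, length_size=4):
    """Return (nal_units, extension_area) for one access unit.

    Each NAL unit is assumed to be preceded by a big-endian size field of
    length_size bytes. A size field equal to 0 means that everything after
    it is the user-defined extended area.
    """
    nal_units = []
    pos = 0
    while pos + length_size <= len(data):
        size = int.from_bytes(data[pos:pos + length_size], "big")
        pos += length_size
        if size == 0:
            # Extended area found: the rest of the data is proprietary.
            return nal_units, data[pos:]
        nal_units.append(data[pos:pos + size])
        pos += size
    # No zero size field: the access unit has no extended area.
    return nal_units, b""


au = ((2).to_bytes(4, "big") + b"\x65\x88"
      + (1).to_bytes(4, "big") + b"\x41"
      + (0).to_bytes(4, "big") + b"extension")
nals, ext = split_access_unit(au)
print(len(nals), ext)
```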

Next, with reference to FIGS. 38 (a) and (b), a description will be given of a data structure of an MP4 file input to a data processing device that performs demultiplexing processing according to the present embodiment. This MP4 file specifies a sample that contains two or more access units.

As shown in Fig. 38 (a), the sample includes N (N is an integer of 2 or more) access units, and a header indicating attribute information such as the decoding time, display time, and size of each access unit in the sample (corresponding to the extended information according to the first embodiment) is added to the sample. Hereinafter, this header is referred to as a sample header in this specification. The sample header has a box structure, and in this specification, the highest-order box is referred to as a Multi AU Header Box ('mahd').

In the first embodiment, one or more frame data are defined as one sample, and extended information indicating the size of each frame data is defined as another sample. On the other hand, in the present embodiment, one or more frame data (access units) and their extended information (the information in the sample header) are stored in the same sample. The access data described in the first embodiment is set for each sample.

Figure 38 (b) shows the data structure of the sample header. The decoding time, display time, and size information are stored separately in boxes within mahd. The decoding time is stored in the Multi Decoding Time To AU Box ('mdta'), the display time is stored in the Multi Composition Time To AU Box ('mcta'), and the size is stored in the Multi AU Size Box ('mtsz'). Here, mdta, mcta, and mtsz exist only when the decoding time, display time, and size, respectively, differ from the default values, and the default values are overwritten by the values indicated in the respective boxes. If no default value is set, the necessary information must be set in the sample header.

Here, the default values and the size of the fields included in each box are set in the initial value setting part of the sample header. The initial value setting part of the sample header will be described later. For example, if mtsz does not exist in mahd, it indicates that the size of each access unit in the sample matches the default value.

In addition to the above boxes, a box indicating the access units that can be accessed randomly, or a box storing information necessary for decoding the data included in an access unit, or identifiers of such information, may be added. For example, a box for storing information used during special playback such as double-speed playback can be added, and the following information can be stored.

(1) Store the index numbers of the access units to be decoded or displayed under a specific playback condition such as N-times-speed playback (N is an integer of 1 or more). Here, the index number is an identification number of the access unit, such as its position in decoding time order among the access units in the sample. When multiple pictures are stored in the access unit, only one of them is decoded and displayed. Instead of using the sample header, information on specific playback conditions for the access units of the entire video track referenced by moov or moof can be stored in moov or moof.

(2) Store priority information of each access unit in the sample. For example, when the priority is specified from 1 to N, it indicates that, to decode the access unit with priority i (i is an integer from 1 to N), the access units with priority i or less must be decoded. Instead of using the sample header, a box that stores priority information about the access units of the entire video track referenced by moov or moof can be placed in moov or moof.

(3) Store the sequence of the encoding types of the access units that make up the randomly accessible unit. For example, when a randomly accessible GOP is composed of 15 access units in the order I, B, B, P, B, B, P, B, B, and so on, the encoding types of the 15 access units are stored. Here, when a plurality of picture data are included in the access unit, it is assumed that the encoding types of the pictures in the access unit are all the same. This makes it possible to determine the access units to be referred to when reproducing only I pictures, reproducing only I and P pictures, or reproducing all I, B, and P pictures.
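How a stored sequence of encoding types could drive special playback is sketched below in Python. The function name and the single-character type codes are illustrative assumptions and are not part of the described box syntax; the 15-unit pattern repeats the I, B, B, P, B, B ordering of the example above.

```python
def select_access_units(coding_types, wanted):
    """Return 1-based index numbers (decoding order) of the access units
    whose encoding type is in `wanted`."""
    return [index for index, coding_type in enumerate(coding_types, start=1)
            if coding_type in wanted]


# A 15-unit GOP: I, B, B, P, B, B, P, B, B, P, B, B, P, B, B.
gop = list("IBBPBBPBBPBBPBB")

only_i = select_access_units(gop, {"I"})          # I-only playback
i_and_p = select_access_units(gop, {"I", "P"})    # I and P playback
print(only_i, i_and_p)
```

Because the types are available from the header, a player can pick the access units to fetch without parsing any picture data.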

FIG. 38 (a) shows an example in which the sample header is placed at the head of the sample. However, it may be placed after the last access unit in the sample or as the last data in the sample. Figure 39 shows an example of placing a sample header as the last data in a sample. In this example, the sample consists only of access units, N in total. Here, the last access unit in the sample includes an extension area in addition to the M NAL units, and the sample header is stored in the extension area. In this way, header information is created for each of the sample and the access units included in the sample, and the header information regarding the access units is stored in mdat as a part of the sample. This also has the effect of reducing the size of moov as compared to the case where all header information about the access units is stored in moov.

In addition, instead of collectively storing information on the decoding time, display time, size, etc. for all access units that make up the sample, the sample header may store information about one or more access units. For example, a sample header may be added for each access unit. Note that information on the structure of a randomly accessible unit such as a GOP is effective even when the sample header is not used, and may be indicated in, for example, an entry of stsd.

When the sample header is stored in the extension area of the last access unit in the sample, a size specification box for indicating the size of mahd may be provided in mahd. The start position of mahd can be easily obtained by using the size specification box as follows. That is, the size specification box is searched for from the end of the sample toward the beginning of the sample. Once the size specification box is found, the mahd size can be obtained. By moving back from the end of the sample by the mahd size, the mahd start position can be obtained.

Next, the data structure of each box stored in the multi-AU header box (mahd) will be described. FIG. 40 (a) shows the data structure of mtsz, FIG. 40 (b) shows the data structure of mdta, and FIG. 40 (c) shows the data structure of mcta. Note that box size, type, version, and flag information are not shown in any of the figures because they are provided in common for all boxes. As shown in Fig. 40 (a), mtsz has the same structure as stsz in stbl, and is composed of the following three fields.

AU_DefaultSize field: If all access units whose size information is indicated by mtsz have the same size, this field indicates that size. In other cases, it is set to 0.

AU_count field: Indicates the number of access units whose size information is indicated by mtsz. When mtsz indicates the size of all access units in the sample, this field value is equal to the frame_count value. When mtsz indicates the size of each access unit individually, this field value is 1.

Table field: Present only if AU_DefaultSize is 0. The table contains the number of entries indicated by AU_count, and each entry is an AU_size. In each entry, the size of the access unit is stored in decoding time order. The size of the i-th access unit in the sample is indicated by AU_DefaultSize if AU_DefaultSize is not 0, and when it is 0, it is indicated by the AU_size of the i-th entry.
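The lookup rule for the three mtsz fields can be written as a short Python sketch. The dictionary representation of a parsed mtsz box is an assumption for illustration only; the key names follow the field names used in the text.

```python
def access_unit_size(mtsz, i):
    """Size of the i-th access unit (1-based, decoding time order)."""
    if mtsz["AU_DefaultSize"] != 0:
        # All access units share one size; no table is present.
        return mtsz["AU_DefaultSize"]
    # Otherwise the table holds AU_count entries of AU_size.
    return mtsz["table"][i - 1]


fixed = {"AU_DefaultSize": 512, "AU_count": 3}
varied = {"AU_DefaultSize": 0, "AU_count": 3, "table": [900, 120, 150]}
print(access_unit_size(fixed, 2), access_unit_size(varied, 2))
```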

Although the size of the access unit is stored directly here, the table size may be reduced by setting a default value of the access unit size in units of streams or samples and storing the difference between the default value and the actual size. The table size can also be reduced by encoding so that each access unit size is an integral multiple of a preset constant, and storing, as the size information, how many multiples of the constant the size is. For example, suppose the constant is set to 4. If there are three access units whose sizes are 12 bytes, 16 bytes, and 20 bytes, respectively, the values obtained by dividing each size by 4, namely 3, 4, and 5, are used as the access unit size information.
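The second reduction method, storing sizes as multiples of a preset constant, can be checked with a small Python sketch. The function names are hypothetical, and the constant 4 and the sizes 12, 16, and 20 bytes are taken from the example above.

```python
UNIT = 4  # preset constant from the example


def to_stored_sizes(sizes, unit=UNIT):
    """Convert actual access unit sizes into multiples of `unit`."""
    for size in sizes:
        if size % unit != 0:
            raise ValueError("size %d is not a multiple of %d" % (size, unit))
    return [size // unit for size in sizes]


def to_actual_sizes(stored, unit=UNIT):
    """Recover the actual access unit sizes from the stored multiples."""
    return [value * unit for value in stored]


stored = to_stored_sizes([12, 16, 20])
print(stored, to_actual_sizes(stored))
```

Smaller stored values allow a narrower size field, at the cost of requiring the encoder to keep every access unit size on a multiple of the constant.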

These table size reduction methods are also effective for reducing the size of the table that stores the sample sizes. For example, when using the latter reduction method, encoding may be performed so that the size of the sample as a whole is a multiple of the set value, without regard to the size of each access unit.

As shown in FIG. 40 (b), mdta has the same structure as stts in stbl, and is composed of an entry_count field and a table field.

entry_count field: Indicates the number of entries contained in the table.

Table field: Shows data for each access unit related to decoding time. Each entry in the table consists of an AU_count field and a DecodingTimeDelta field. DecodingTimeDelta indicates the difference between the decoding time of the i-th (i is a positive integer) access unit and the decoding time of the (i + 1)-th access unit. AU_count indicates the number of consecutive access units having the decoding time difference value indicated in the DecodingTimeDelta field. In other words, a new entry is added each time an access unit with a different decoding time difference value appears.
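The run-length style of the mdta table can be expanded into per-access-unit decoding times as in the following Python sketch. The function name, the list-of-tuples representation, and the numeric values are illustrative assumptions; the decoding time of the first access unit is assumed to be known from the sample-level information.

```python
def expand_decoding_times(entries, first_decoding_time=0):
    """Expand (AU_count, DecodingTimeDelta) entries into decoding times.

    Returns one decoding time per access unit, in decoding order.
    """
    times = []
    current = first_decoding_time
    for au_count, delta in entries:
        for _ in range(au_count):
            times.append(current)
            # The delta is the gap to the NEXT access unit.
            current += delta
    return times


# Hypothetical values: two units 100 apart, one unit followed by a
# gap of 200, then one unit with a gap of 100 again.
times = expand_decoding_times([(2, 100), (1, 200), (1, 100)])
print(times)
```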

Here, a method of obtaining the decoding time and the display time will be described. In MPEG-4 AVC, auxiliary information for decoding called SEI (Supplemental Enhancement Information) can be included in the stream of video data. SEI is information that is not directly necessary for decoding but assists it. Decoding time and display time information can also be indicated using an SEI called Picture Timing SEI. When the SEI does not exist in the video data, the time information may be obtained from a parameter called POC (Picture Order Count), which indicates the display order for each picture, or time information provided separately may be used. Instead of directly storing the difference value of the decoding time, a relative value of the difference value may be stored, as in the Temporal Reference of MPEG-2 Visual.

Next, mcta has the same structure as ctts in stbl and consists of the following fields.

entry_count field: Indicates the number of entries included in the table.

Table field: Each entry in the table consists of an AU_count field and a CompositionTimeOffset field. CompositionTimeOffset indicates the difference between the decoding time and the display time of the access unit. That is, the display time can be obtained by adding the value of CompositionTimeOffset to the decoding time. AU_count indicates the number of consecutive access units having the same CompositionTimeOffset. As with mdta, a new entry is added each time an access unit with a different CompositionTimeOffset appears. Here, mcta does not exist when the decoding time and the display time are equal in all the access units in the sample.
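The relation "display time = decoding time + CompositionTimeOffset" can be sketched in Python as follows. The function name and the tuple representation of the mcta table are assumptions for illustration, and the values are hypothetical: an I, B, B, P run with decoding times 100 apart and an offset of 300 for the I and P access units.

```python
def display_times(decoding_times, mcta_entries):
    """Add each access unit's CompositionTimeOffset to its decoding time.

    mcta_entries is a list of (AU_count, CompositionTimeOffset) tuples.
    """
    offsets = [offset
               for au_count, offset in mcta_entries
               for _ in range(au_count)]
    if len(offsets) != len(decoding_times):
        raise ValueError("mcta does not cover every access unit")
    return [dts + offset for dts, offset in zip(decoding_times, offsets)]


# I, B, B, P: offset 300 for I and P, 0 for B (hypothetical values).
shown = display_times([0, 100, 200, 300], [(1, 300), (2, 0), (1, 300)])
print(shown)
```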

Here, the sample header is present only if the sample is composed of two or more access units. The number of access units stored in one sample is obtained as follows. First, referring to stsc, the index number of the stsd entry corresponding to the sample is obtained, and then the frame_count value of the stsd entry corresponding to the obtained index number is obtained. That is, a sample header exists only if the frame_count value corresponding to the sample is greater than 1. Note that, even when the frame count is 1, the sample header may be used if it is effective to indicate the header information as data in the sample.
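The two-step lookup (stsc, then stsd) might look like the following Python sketch. It assumes stsc entries in the usual form (first_chunk, samples_per_chunk, sample_description_index) with 1-based numbering, and represents each stsd entry as a dictionary carrying the frame_count field described in this specification; the function names are hypothetical.

```python
def frame_count_for_chunk(stsc_entries, stsd_entries, chunk_number):
    """Look up frame_count for the samples stored in a given chunk."""
    description_index = None
    for first_chunk, _samples_per_chunk, index in stsc_entries:
        if first_chunk <= chunk_number:
            # The last stsc entry starting at or before the chunk applies.
            description_index = index
        else:
            break
    if description_index is None:
        raise ValueError("chunk %d is not described by stsc" % chunk_number)
    return stsd_entries[description_index - 1]["frame_count"]


def has_sample_header(stsc_entries, stsd_entries, chunk_number):
    """A sample header is expected only when frame_count exceeds 1."""
    return frame_count_for_chunk(stsc_entries, stsd_entries, chunk_number) > 1


stsc = [(1, 1, 1), (4, 1, 2)]                  # chunks 1-3 -> entry 1, 4+ -> entry 2
stsd = [{"frame_count": 15}, {"frame_count": 1}]
print(has_sample_header(stsc, stsd, 2), has_sample_header(stsc, stsd, 5))
```

The optional case in which a sample header is used even when the frame count is 1 is not covered by this sketch.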

Next, the initial value setting portion of the sample header will be described. The initialization part is stored in the stsd entry only if the frame count value in the stsd entry is greater than one. In the initial value setting part, the field length in each box included in mahd and the default value of the sample header are set, and consist of the following fields.

AUSizeLengthMinusOne: Indicates the size of the AU_size field in mtsz.

DecodingTimeCountLengthMinusOne: Indicates the size of the AU_count field in mdta.

DecodingTimeDeltaLengthMinusOne: Indicates the size of the DecodingTimeDelta field in mdta.

CompositionTimeCountLengthMinusOne: Indicates the size of the AU_count field in mcta.

CompositionTimeOffsetLengthMinusOne: Indicates the size of the CompositionTimeOffset field in mcta.

DefaultHeaderBox: Stores mahd to indicate default value of sample header.

The sizes of the above fields may be fixed, in which case the above fields may be omitted. Also, mtsz, mdta, and mcta do not necessarily need to exist, and if they do not exist, it is assumed that no default value is set. When the sample header is used even though the frame count is 1, the initial value setting part may be stored even though the frame count is 1.

When storing a box other than mtsz, mdta, and mcta in mahd, a field indicating the field size of the newly added box may be added. When the information of the access units that make up the samples is the same as the initial value (default value) in all samples, there is no box to be stored in mahd, so mahd need not be stored in each sample. Here, whether or not mahd is included in the sample is indicated by setting flag information in the initial value setting part of the sample header or in the entry of stsd.

Note that the flag information may be set in another part of the moov. When the flag is set, it indicates that the sample header, i.e., mahd, is included in the sample; when the flag is not set, it indicates that mahd is not included. For example, even in MPEG-4 AVC, when a randomly accessible unit such as the GOP of MPEG-2 is defined and one sample is assigned to each such unit, mahd need not be included in the sample. The presence or absence of a box may also be used as the flag information. That is, the initial value setting part is realized by a box, and the use or non-use of mahd can be identified by the presence or absence of the box in the initial value setting part.

Figure 41 (a) shows an example of the syntax of the initial value setting part of the sample header when the box is used. Figure 41 (b) shows an example of the syntax of the initial value setting part of the sample header when the box is not used. FIGS. 42 (a) to (d) show examples of the respective syntaxes of mahd, mtsz, mdta, and mcta.

Next, Figs. 43 (a) to (c) show a first example for storing data in the sample header. In the first example, a sample is composed of 15 consecutive access units arranged as IBBPBBPBB... in order of decoding time. Here, I, P, and B indicate the access units of an I-picture, a P-picture, and a B-picture, respectively, and the decoding time, display time, and size of each access unit are as shown in Fig. 43 (a).

First, mdta and mcta exist in the DefaultHeaderBox in the initial value setting part of the sample header, as shown in Fig. 43 (b). Since the size of each access unit is random and the default value of the size information is not set, mtsz does not exist.

Next, the sample header is set. Figure 43 (c) shows the data stored in the sample header. First, the decoding time information will be described. The difference in decoding time between two consecutive access units in the sample is 100 ms, and the default value can be used as is, so mdta is not required. Here, it is also assumed that the difference value of the decoding time with the next access unit is 100 ms for B-15, which is the last access unit. Next, as for the display time, the difference between the decoding time and the display time is 300 ms for the access units of the I- and P-pictures, and the decoding time and the display time of the B-pictures match. Therefore, since the default value can be used as the display time information, mcta is not required.

When the picture rate is fixed, the default value can be used as the decoding time information of the picture included in the sample. For example, if the rate of all pictures included in the track is constant, only the default value needs to be set as the decoding time information, and there is no need to indicate the decoding time information by mdta in the sample header.

In addition, when a randomly accessible unit such as a GOP in MPEG-2 is treated as a sample, if the picture rate and the GOP structure are fixed, only the default value needs to be set as the display time information about the pictures in the sample. For example, if the encoded video data consists of only three different GOP structures, three stsd entries are prepared and the default value of the display time information corresponding to each GOP structure is set in each. Then, by switching the stsd entry referenced for each sample, it is not necessary to indicate the display time information by mcta in the sample header. Here, the GOP structure indicates the number of pictures constituting the GOP and the coding type of each picture (I, B, or P).

Finally, the size information will be described. Since no default value is set in the size information, the sizes of the 15 access units included in the sample are stored using mtsz. As a result, the sample header contains only mtsz to store the size information, and since the decoding time and display time use the default values, mdta and mcta are not included.

Figures 44 (a) and (b) show a second example for storing data in the sample header. As described below, the sample header contains three boxes: mdta, mcta, and mtsz. It is assumed that the default value of the sample header is the same as the first example shown in Fig. 43 (b).

As shown in Fig. 44 (a), the configuration of the access units in the sample is the same as in the first example (Fig. 43 (a)). However, as shown in FIG. 44 (a), a frame skip occurs at B-12, which is the twelfth access unit. Therefore, since the default value cannot be used for either the decoding time or the display time, it is necessary to set mdta and mcta in the sample header.

First, the decoding time information is set. As shown in (1) of FIG. 44 (b), for each access unit from I-1 to B-11, the difference from the decoding time of the next access unit is 100 ms, whereas the difference value for B-12 alone is 2000 ms. Since the difference returns to 100 ms from P-13 to B-15, mdta requires three entries.

Next, the display time will be described. The display time is the same as the default value except that the difference from the decoding time at P-10 is 400 ms. Therefore, each mcta entry set in the sample header is set as shown in (2) of Fig. 44 (b).

Finally, mtsz will be described. As shown in (3) of FIG. 44 (b), mtsz is the same as in the first example.

In the above description, it is assumed that there is only one sample header in a sample, and that information on all access units constituting the sample is stored in that one sample header. However, there may be more than one sample header in a sample. One reason is that, depending on the recording method of the sample data, it may be more efficient to record the information of each access unit sequentially than to record the information of all access units in the sample collectively. As an example, a case where the sample data is recorded on an optical disk in real time will be described.

When the information of all access units in a sample is stored in one sample header, the sample header cannot be completed until the information on all access units in the sample has been acquired. Assuming that the sample header is placed at the beginning of the sample, the sample data is written to the optical disk after the sample header is completed, so writing cannot start until the information of the last access unit in the sample has been obtained. In addition, memory is required to temporarily store the access unit data in the sample. For example, when a randomly accessible unit such as a GOP is used as one sample, enough memory is needed to hold all the access units included in the randomly accessible unit. On the other hand, if a sample header is stored for each access unit, the sample header is completed as soon as the information for one access unit is obtained, so the sample data can be written sequentially. This gives excellent real-time performance and reduces the memory size required to hold access units.

The sample structure when a sample header is added to each group of one or more access units in the sample will be described below. Here, whether a sample contains one sample header or several is determined from the number of access units whose header information is indicated by each box in the sample header. If a box does not exist in the sample header, the determination is made by referring to the default box defined in the initial value setting part of the sample header. If the number of access units whose header information is indicated by the first sample header in the sample matches the frame_count value, the first sample header indicates the header information of all access units in the sample, so there is only one sample header.

On the other hand, if the number of access units whose header information is indicated by the first sample header is smaller than frame_count, one or more further sample headers indicating the header information of the remaining access units are stored separately in the sample.
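The rule above can be sketched as a simple running total. The following minimal Python sketch assumes that the number of access units covered by each sample header has already been read (for example from AU_count); the function name and the list input are hypothetical and serve only to illustrate the comparison with frame_count.

```python
def count_sample_headers(au_counts_per_header, frame_count):
    """Return how many sample headers are needed to cover frame_count units.

    au_counts_per_header lists, in order, the number of access units whose
    header information each sample header indicates (an assumed input).
    """
    covered = 0
    headers = 0
    for au_count in au_counts_per_header:
        if covered >= frame_count:
            break
        covered += au_count
        headers += 1
    if covered != frame_count:
        raise ValueError("sample headers do not cover exactly frame_count units")
    return headers


# One header covering all 15 access units, then the split 3 + 2 + 10.
print(count_sample_headers([15], 15))
print(count_sample_headers([3, 2, 10], 15))
```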

An example is described with reference to FIGS. 45 (a) and (b). In this example, one sample is composed of N access units (N is an integer of 1 or more). Fig. 45 (a) shows an example in which a sample header is added to each access unit in one sample. The header information of the first access unit is stored in sample header 1, which contains mdta and mtsz. Here, in the mdta, entry_count is set to 1 and AU_count of that entry is set to 1, indicating that this mdta is information about a single access unit. As a result, this mdta shows only information about access unit 1.

Similarly, also in mtsz, since AU-count is 1, mtsz indicates only information on access unit 1. Similarly, in the subsequent sample header, it can be seen that the i-th (i is an integer less than or equal to N) sample header indicates the header information of the i-th access unit.

FIG. 45 (b) shows an example in which one sample header is added to each group of fewer than N access units. Here, sample header 1 indicates the information of the first to third access units, sample header 2 indicates the fourth and fifth access units, and sample header M (M is an integer less than N) indicates the information of the Nth access unit. In sample header 1, setting entry_count of mdta to 1 and AU_count of the entry to 3 indicates that mdta is information on the first to third access units.

Similarly, AU_count of mtsz is set to 3, indicating that mtsz is information on the first to third access units. Next, it can be seen from the mtsz and mdta in sample header 2 that sample header 2 indicates the information of the fourth and fifth access units. If mtsz, mdta, or mcta is not present in the sample header, the default box set in the initial value setting part of the sample header is used. By adding information indicating the storage position of the next sample header to the sample header, the start position of the next sample header may be obtained without summing the sizes of the access units. Also, flag information for determining whether one or more sample headers are present in the sample may be set in the initial value setting part of the sample header, or a sample header may be added for each access unit.

Furthermore, although the sample header has a box structure, the necessary fields may be stored in order without using the box structure. In this case, the sample header includes fields indicating whether the size, decoding time, and display time information are each set, and fields for setting each piece of information. Figures 46 (a) and (b) show the sample structure and an example of the syntax when the box structure is not used. In Fig. 46 (a), the Multi-AU header corresponds to the sample header. This data structure is the same as the data structure shown in Fig. 38 (a) in that one or more frame data (access units) and their extension information (sample header) are defined as one sample.

Fig. 46 (b) shows an example of the syntax of a Multi-AU header. In Fig. 46 (b), the AUSizePresent, DecodingTimeDeltaPresent, and CompositionTimeOffsetPresent fields are flag information indicating whether the AUSize, DecodingTimeDelta, and CompositionTimeOffset fields, respectively, are present. The frame_count field is equal to the frame_count value indicated in the stsd entry referenced by the sample.

Also, the definitions of AUSize, DecodingTimeDelta, and CompositionTimeOffset are the same as in the case of Figs. 40 (a) to (c). In this example, DecodingTimeDelta and CompositionTimeOffset are stored for each access unit even when access units with the same DecodingTimeDelta and CompositionTimeOffset are consecutive. However, as in the case of using the box structure, when the same value continues, the field values may be omitted by indicating the number of consecutive access units. In the initial value setting part of the sample header, information indicating the size of each of the AUSize, DecodingTimeDelta, and CompositionTimeOffset fields is set, and the default values of the sample header are set. Figure 41 (b) shows a syntax example of the initial value setting part. Thus, by using the sample header, the decoding time, display time, and size of an access unit can be obtained without analyzing the data of the access unit. In addition, even with an encoding method such as MPEG-4 AVC, in which the access unit does not include a start code when stored in MP4, the boundary of the access unit can be easily obtained by referring to the sample header.

Hereinafter, the data processing device according to the present embodiment will be described. The configuration of the data processing device according to the present embodiment is the same as the configuration of the data processing device shown in FIG. 35, and its basic operation is also as described above. Therefore, the following describes the components related to the processing according to the present embodiment.

The analysis unit 303 shown in FIG. 35 acquires the header data d306 from the memory 304 and analyzes it to obtain information such as the sample size, decoding time, and storage location. The analysis result is input to the sample analysis unit 307 as data d307. The sample analysis unit 307 obtains sample data d308 from the memory 305 based on the analysis result d307, obtains picture data d309 from the sample, and inputs it to the decoding display unit 308. The decoding display unit 308 decodes the input picture data d309 and displays it.

FIG. 47 shows the procedure by which the sample analysis unit 307 acquires picture data from a sample. The initial value of the variable i is 0. First, in step S61, the frame_count value is obtained from the stsd entry corresponding to the sample. If frame_count is greater than 1, the default values of the size, decoding time, and display time information are obtained from the initial value setting part of the sample header included in the stsd entry, together with the sizes of the fields in the boxes of the sample header. Here, if the frame_count and the initial value information of the sample header are stored in advance for each stsd entry number, the initial value information need not be obtained for each sample in step S61.

Next, in step S62, it is determined whether frame_count is greater than 1. If it is greater than 1, a sample header is present, so the sample header is analyzed in step S63. In the sample header analysis, mahd is searched for the mtsz, mdta, and mcta boxes; if a box exists, its contents are obtained and override the default values, giving the size, decoding time, or display time information of the access unit. When frame_count is 1, the process of step S64 is performed after step S62. In step S64, the data of the access unit is obtained from the sample based on the size of the access unit obtained in step S63, and in step S65, picture data is separated from the access unit.

The decoding time and the display time acquired in step S63 are used when decoding and displaying picture data in the decoding display unit 308 in FIG. 35. Also, when starting playback from the middle of a track, the display time of the access units in the sample may be obtained and used to determine the access unit from which to start playback. In step S66, 1 is added to i. In step S67, it is determined whether i is smaller than frame_count. If i is smaller, the processing from step S63 to step S67 is repeated.
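The loop over access units in steps S64 to S67 amounts to walking through the sample with the sizes obtained in step S63. The following is a minimal Python sketch under two assumptions made only for illustration: the sample header has already been stripped from the payload, and sizes holds one size per access unit. The function name is hypothetical.

```python
def split_access_units(payload: bytes, sizes: list[int]) -> list[bytes]:
    """Cut a sample payload into access units using per-unit sizes."""
    frame_count = len(sizes)
    units = []
    offset = 0
    i = 0
    while i < frame_count:                 # loop condition of step S67
        size = sizes[i]
        units.append(payload[offset:offset + size])  # step S64
        offset += size
        i += 1                             # step S66
    if offset != len(payload):
        raise ValueError("sizes do not match the payload length")
    return units


units = split_access_units(b"aaabbcccc", [3, 2, 4])
print(units)
```

Separating picture data from each access unit (step S65) depends on the coding format and is left out of this sketch.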

Next, an example of a procedure for acquiring the size information of the access unit in step S63 will be described in detail. Figure 48 shows the procedure for obtaining the sizes of the access units that make up the sample. Here, the initial values of both the variable i and the data read pointer data_ptr are 0. It is also assumed that the length in bytes of the AU_size field in mtsz is FieldSize.

First, in step S71, mahd is searched for a box whose box type is 'mtsz'. In step S72, the search result is evaluated. If mtsz exists, in step S73 the data read pointer data_ptr is set to the start position of the AU_DefaultSize field of the mtsz in the sample header. If it is determined in step S72 that mtsz does not exist in mahd, in step S74 the default mtsz included in the initial value setting part of the sample header is obtained, and data_ptr is set to the start of the AU_DefaultSize field in the default mtsz. In step S75, the value of AU_DefaultSize is acquired, and 4 is added to data_ptr. Then, in step S76, it is determined whether AU_DefaultSize is 0. If it is not 0, in step S78 the size indicated by AU_DefaultSize is set as the size of all access units constituting the sample. If the value is 0, the size of each access unit is obtained by reading AU_size from the table entries. First, in step S77, the value of AU_count is obtained, and 4 is added to data_ptr. Here, the value of AU_count indicates the number of access units included in the sample.

Next, in step S79, the size of the (i + 1)th access unit in the sample is obtained and 4 is added to the data read pointer, and in step S80, 1 is added to i. In step S81, i is compared with AU_count, and if i is smaller than AU_count, the processing of steps S79, S80, and S81 is repeated. In this way, the sizes of all the access units constituting the sample can be obtained.
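The procedure of Fig. 48 can be sketched as follows. This is a minimal Python sketch, not the syntax of the figure: it assumes that the mtsz payload starts at AU_DefaultSize, that AU_DefaultSize and AU_count are 4-byte big-endian integers, and that FieldSize is 4 as in the steps above. The dictionary of boxes and both function names are hypothetical.

```python
def select_mtsz(header_boxes: dict, default_mtsz: bytes) -> bytes:
    """Steps S71 to S74: use mtsz from the sample header, else the default."""
    return header_boxes.get("mtsz", default_mtsz)


def read_au_sizes(mtsz_payload: bytes, frame_count: int,
                  field_size: int = 4) -> list[int]:
    """Steps S75 to S81: return the size of every access unit in the sample."""
    data_ptr = 0
    default_size = int.from_bytes(mtsz_payload[data_ptr:data_ptr + 4], "big")
    data_ptr += 4                                   # step S75
    if default_size != 0:                           # step S76
        return [default_size] * frame_count         # step S78
    au_count = int.from_bytes(mtsz_payload[data_ptr:data_ptr + 4], "big")
    data_ptr += 4                                   # step S77
    sizes = []
    i = 0
    while i < au_count:                             # step S81
        field = mtsz_payload[data_ptr:data_ptr + field_size]
        sizes.append(int.from_bytes(field, "big"))  # step S79
        data_ptr += field_size
        i += 1                                      # step S80
    return sizes


table = b"".join(n.to_bytes(4, "big") for n in (0, 3, 1200, 300, 450))
print(read_au_sizes(table, frame_count=3))
```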

(Embodiment 7)

The data processing device according to the seventh embodiment receives and analyzes an MP4 file in which one sample includes a plurality of video pictures, and decodes and displays the encoded data. Here, it is assumed that at least one of the display time and the decoding time differs between two different video pictures. The present embodiment provides an MP4 data structure, and a demultiplexing process for it, for the purpose of efficiently performing special reproduction such as double-speed reproduction. The configuration of the data processing device according to the present embodiment is the same as the configuration of the data processing device according to the sixth embodiment. It is assumed that the video data is encoded in the MPEG-4 AVC format, but it may be MPEG-4 Visual, MPEG-2 Visual, or H.263. Also, multiple frames of audio or text data may be stored in one sample. When a plurality of pictures are included in one sample, the sample header described in the sixth embodiment is used.

MPEG-4 AVC allows the reference relationship between the pictures that make up a video stream to be set flexibly, but as a result, when specific pictures are selectively played back, as in double-speed playback, it is difficult to determine the pictures to be decoded. Here, the reference relationship between pictures will be described with reference to FIG. 49. Figure 49 shows a series of pictures and the coding type of each picture. As shown in Fig. 49, the picture coding types are I-3, B-1, B-2, P-6, B-4, B-5, P-9, B-7, and so on. The number added to the coding type of each picture indicates the display order. For example, to play such a series of pictures at 3x speed, the I and P pictures should be played in the order I-3, P-6, P-9. There is no problem as long as P-6 can be decoded with reference to I-3 only and P-9 with reference to P-6 only. However, in MPEG-4 AVC, P-6 may be decoded with reference to B-2. Therefore, if the reference relationship between pictures is not known in advance, the pictures that need to be decoded during trick play cannot be determined. Which pictures each picture refers to can be obtained by analyzing the slice headers of all slices constituting the picture. However, analyzing all slices during trick play is inefficient.
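Once the reference relationship is known, the pictures that must be decoded for trick play are the pictures to be displayed plus everything they depend on, directly or indirectly. The following minimal Python sketch illustrates this with the picture labels listed for Fig. 49. Both reference maps are hypothetical examples, including the assumption that B-2 refers to the I picture; they are not taken from the figure.

```python
def pictures_to_decode(targets, refs):
    """Return the targets plus every picture they depend on."""
    needed = set()
    stack = list(targets)
    while stack:
        picture = stack.pop()
        if picture in needed:
            continue
        needed.add(picture)
        stack.extend(refs.get(picture, ()))
    return needed


targets = ["I-3", "P-6", "P-9"]

# Case 1: P pictures refer only to the preceding I or P picture.
simple_refs = {"P-6": ["I-3"], "P-9": ["P-6"]}
# Case 2: P-6 additionally refers to B-2.
flexible_refs = {"P-6": ["I-3", "B-2"], "B-2": ["I-3"], "P-9": ["P-6"]}

print(sorted(pictures_to_decode(targets, simple_refs)))
print(sorted(pictures_to_decode(targets, flexible_refs)))
```

In the second case B-2 has to be decoded even though it is not displayed, which is why the player needs the reference relationship in advance.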

In the MP4 file according to the present embodiment, the reference relationship between pictures is indicated either (1) by using the MPEG-4 AVC SEI (Supplemental Enhancement Information), or (2) by describing, in the sample header, the reference relationship between the pictures that make up the sample.

Here, structures called subsequences and layers are used to indicate the reference relationship between pictures. In the conventional MP4 file, these structures are described using the Sample To Group Box (sbgp) in the stbl. However, when the subsequence and layer structures are described by sbgp, the size of the sbgp becomes very large, and as a result the size of moov also increases. This problem is solved by the MP4 file according to the present embodiment described below.

First, a method for indicating the reference relationship between pictures using SEI will be described. SEI is additional information introduced to make decoding pictures more convenient, and is used by adding it to picture data. However, SEI is not directly involved in the decoding operation, and picture data can be decoded without SEI. MPEG-4 AVC introduces the concepts of subsequence and layer to achieve temporal scalability, and SEIs for that purpose are provided.

First, the subsequences and layers will be described with reference to FIGS. 50 (a) to (c) as an example. FIGS. 50 (a) to (c) show a video stream and layers 0 and 1 constituting the video stream. FIG. 50 (a) shows fifteen pictures making up the video stream. Each picture is numbered from 1 to 15 in decoding order. The video stream is divided into two layers, layer 0 and layer 1. FIG. 50 (b) shows layer 0 of the video stream, and FIG. 50 (c) shows layer 1 of the video stream. Layer 0 can be decoded independently, and layer 1 is decoded with reference to pictures of layer 0 or layer 1. Further, each layer is divided into units called subsequences, and layer 0 and layer 1 are each divided into two subsequences as shown in the figure. In general, pictures in a subsequence belonging to the Nth layer (N is an integer of 1 or more) can refer only to pictures belonging to subsequences of the Nth layer or lower.

The subsequence and layer-related SEIs indicate the number of the layer to which a picture belongs, the number of the subsequence within that layer, and the subsequences to be referred to when decoding the subsequence to which the picture belongs. By adding subsequence and layer-related SEIs to each picture in the stream, it is possible to indicate which subsequences to refer to when decoding the data of the subsequence to which the picture belongs. For example, assuming that the frame rate when decoding layer 0 and layer 1 together is 30 Hz and the frame rate when decoding only layer 0 is 15 Hz, only layer 0 is decoded when the bit rate needs to be reduced, and if the bit rate is not restricted, both layer 0 and layer 1 can be decoded and reproduced at 30 Hz. Figures 51 (a) to (c) show the syntax of the subsequence and layer-related SEIs. Three SEIs are defined: Sub-sequence information (SSI) SEI, Sub-sequence layer characteristics (SSL) SEI, and Sub-sequence characteristics (SSC) SEI. The main fields in each SEI are described below.

(1) SSI SEI

Indicates the layer and subsequence to which the picture belongs.

sub_seq_layer_num: Number of the layer to which the subsequence belongs.

sub_seq_id: Index number of the subsequence in the layer.

(2) SSL SEI

Indicates the information of each of a plurality of layers.

num_sub_seq_layers_minus1: The number of layers that make up the stream.

average_bit_rate: The average bit rate of the layer.

average_frame_rate: The average frame rate of the layer.

(3) SSC SEI

Indicates information for each subsequence.

sub_seq_layer_num: Number of the layer to which the subsequence belongs.

sub_seq_id: Index number of the subsequence in the layer.

average_bit_rate: The average bit rate of the subsequence.

average_frame_rate: The average frame rate of the subsequence.

num_referenced_subseqs: The number of referenced subsequences.

ref_sub_seq_layer_num: Number of the layer that includes the reference pictures of the pictures making up the subsequence indicated by sub_seq_layer_num and sub_seq_id.

ref_sub_seq_id: Index number of the subsequence that includes the pictures referenced by the pictures constituting the subsequence indicated by sub_seq_layer_num and sub_seq_id. The layer to which the subsequence indicated by this field belongs is indicated by ref_sub_seq_layer_num.

ref_sub_seq_direction: Flag information indicating whether the referenced subsequence precedes or follows the referencing subsequence in decoding order.

Here, a method of describing the subsequence and layer structures in a conventional general MP4 file will be described. First, in conventional MP4 files, storing SEIs related to subsequences and layers as video track data is prohibited, and the information in these SEIs is stored using boxes in stbl.

Next, the information indicated by the SSL SEI and SSC SEI is stored in the Sample Group Description Box (sgpd). The structure of sgpd is similar to the structure of stsd: just as the decoder initialization information required when decoding video data is stored in entries in stsd, the information of each SEI is stored in entries in sgpd. The bit rate and frame rate information of the SSL SEI is stored in an entry called AVCLayerEntry. Since an AVCLayerEntry contains the information of one layer, setting the information of all layers requires as many AVCLayerEntry entries as there are layers, and each entry is identified by its index number, which is determined by the order of appearance.

This is the same as for the entries in stsd. Similarly, since the contents of the SSC SEI are stored in AVCSubSequenceEntry, two sgpd boxes, one for AVCLayerEntry and one for AVCSubSequenceEntry, are eventually required. In addition, the Sample To Group Box ('sbgp') is used to associate AVCLayerEntry and AVCSubSequenceEntry with samples. sbgp has a field indicating the entry in sgpd that the sample refers to, but only one type of entry can be referenced per sbgp. For this reason, two sbgp boxes are used to associate AVCLayerEntry and AVCSubSequenceEntry with the samples.

With reference to FIGS. 52 and 53, an estimate of the overhead when using the subsequence-related boxes will be described. FIGS. 52 (a) to (d) show a video stream and layers 0, 1, and 2 constituting the video stream. Figure 52 (a) shows the 15 pictures constituting the video stream; the coding types are I, B, B, P, B, B, P, ..., P, B, B in decoding order. The frame rate when decoding all pictures is 30 Hz. Note that the number added after the coding type indicates the order of the display time. The 15 pictures are divided into three layers according to the coding type. FIG. 52 (b) shows layer 0 of the video stream, FIG. 52 (c) shows layer 1 of the video stream, and FIG. 52 (d) shows layer 2 of the video stream. Layer 0 is composed of I pictures, layer 1 is composed of P pictures, and layer 2 is composed of B pictures.

Pictures belonging to layer 0 to layer 2 are stored in subsequence 0-1, subsequence 1-1, and subsequence 2-1, respectively. Here, subsequence 0-1 is decoded independently, subsequence 1-1 is decoded with reference to subsequence 0-1, and subsequence 2-1 is decoded with reference to subsequence 0-1 and subsequence 1-1. It is assumed that the video sequence repeats the same structure as these 15 pictures. Next, the overhead when the video sequence shown in Fig. 52 (a) is stored in an MP4 file is calculated. Information on layers 0 to 2 is stored in the first to third AVCLayerEntry in the sgpd for AVCLayerEntry.

Information on the three subsequences, subsequence 0-1, subsequence 1-1, and subsequence 2-1, is stored in the first to third AVCSubSequenceEntry, respectively, in the sgpd for AVCSubSequenceEntry. FIG. 53 (a) shows the table data of the sbgp for the layer, and FIG. 53 (b) shows the table data of the sbgp for the subsequence. Here, the index of the sbgp for the layer indicates the entry number in the sgpd for AVCLayerEntry, and the index of the sbgp for the subsequence indicates the entry number in the sgpd for AVCSubSequenceEntry. sample_count indicates the number of consecutive samples with the same index value. The layer and subsequence tables grow large because the entry changes frequently. In addition, since entries with the same structure are repeated, the tables are redundant. Since the sample_count and index fields are both 4 bytes, the data size needed to indicate the information for the 15 pictures (0.5 seconds) shown in Figure 52 is 8 * 2 * 10 = 160 bytes for the layer sbgp and the subsequence sbgp combined. For example, when recording one hour of data, the combined size of the two sbgp boxes can be as large as 2 * 160 * (1 / 0.5) * 3600 = 2304000 bytes. Since sbgp is included in moov, the size of moov becomes extremely large as a result, and there is a problem in that the required memory size increases when playing MP4 files. There is also the problem that, in the conventional method using sbgp, information about subsequences and layers can be described only in units of samples.
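The overhead estimate can be retraced numerically. The following minimal Python sketch counts the sbgp runs for the 15-picture pattern of Fig. 52 (a), which gives 10 table entries, 8 * 2 * 10 = 160 bytes for the two sbgp boxes together, and 2304000 bytes for one hour. The hourly figure follows the formula as given in the text, including its leading factor of 2, and the variable names are illustrative assumptions.

```python
from itertools import groupby

# Coding types in decoding order: I, B, B followed by four P, B, B groups.
coding_types = ["I", "B", "B"] + ["P", "B", "B"] * 4
layer_of = {"I": 0, "P": 1, "B": 2}

# One sbgp entry per run of consecutive samples with the same index.
runs = [layer for layer, _ in groupby(layer_of[t] for t in coding_types)]
entries = len(runs)

entry_bytes = 4 + 4                      # sample_count + index, 4 bytes each
bytes_per_gop = entry_bytes * 2 * entries

seconds_per_gop = 0.5
per_hour = int(2 * bytes_per_gop * (1 / seconds_per_gop) * 3600)

print(entries, bytes_per_gop, per_hour)
```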

The MP4 file according to the present embodiment uses the above-described subsequence and layer structures to indicate the reference relationship between pictures required for trick play. In addition, it is assumed that subsequence and layer-related SEIs can be used as video track data, and that sbgp is not used to indicate subsequence and layer-related information. The following describes how to use each SEI. In this MP4 file, the video stream is divided into a plurality of randomly accessible units, and the reference relationship between pictures is indicated on the basis of the randomly accessible unit (hereinafter referred to as AVC-GOP). In the following, one access unit will be described as being composed of the data of one picture, but one access unit may be composed of the data of a plurality of pictures. In that case, the same structure as that described below can be adopted.

Fig. 54 (a) to (c) show the structure of the AVC-GOP. In an AVC-GOP, the first picture can be decoded independently. An AVC-GOP is composed of L layers (L is an integer of 1 or more) and M pictures (M is an integer of 1 or more), each picture is composed of N slices (N is an integer of 1 or more), and it is assumed that one layer has one subsequence in the AVC-GOP. The SSL SEI placed at the top indicates information on the L layers constituting the video stream, the L SSC SEIs placed next indicate information for each layer, and the data of the M pictures follows. An SSI SEI is placed at the head of each picture data, followed by N slice data. Although it is assumed here that each layer in the AVC-GOP is composed of one subsequence, each layer may be composed of a plurality of subsequences. In that case, it is only necessary to place SSC SEIs for the required subsequences.

The order and location of the SSL, SSC, and SSI SEIs are not limited to the structure shown in Fig. 49 (a). Also, the SSC SEI may be omitted by defining the reference relationship between subsequences in advance. For example, when each layer in the AVC-GOP is composed of one subsequence, the SSC SEI can be omitted if it is defined that a picture included in layer N refers to pictures belonging to the Nth or lower layers in the same AVC-GOP. The AVC-GOP structure does not limit the use of SEIs other than the subsequence and layer-related SEIs. For example, a Random access point SEI may be placed at the beginning of the AVC-GOP. Furthermore, the SSL SEI need not be provided in every AVC-GOP; it may be added only to the first GOP in the video stream, or it may be added periodically.

A picture that can be decoded independently may also be placed as a picture other than the first picture of the AVC-GOP. The use of the AVC-GOP structure is not limited to MP4; it may be used in other multiplexing formats such as MPEG-2 TS (Transport Stream) and PS (Program Stream). Also, the SSL SEI and SSC SEI may be added for each picture rather than for each randomly accessible unit. Also, the reference relationship may be indicated using the SSL, SSC, and SSI SEIs at any position without setting the AVC-GOP unit.

Figures 54 (b) and (c) show sample structures defined using the subsequence and layer-related SEIs. A sample header is added to each sample, but is omitted here. When frame_count is 1, one sample contains the data of one picture, so the samples take the structure shown in Figure 54 (b). Sample 1 stores the data of the first picture in the AVC-GOP, and also stores the SSL SEI and SSC SEI, which are the SEIs indicating the subsequence and layer information in the AVC-GOP. The data of the 2nd to Mth pictures in the AVC-GOP are stored in sample 2 to sample M, respectively.

When frame_count is greater than 1, one sample stores the data of all the pictures that make up the AVC-GOP, and the sample takes the same structure as the AVC-GOP, as shown in Fig. 54 (c). Note that even when frame_count is greater than 1, the AVC-GOP data may be divided into a plurality of samples and stored.

The following is a specific example of applying subsequences and layers to an MPEG-4 AVC stream. FIGS. 55 (a) to (d) show a video stream and layers 0, 1, and 2 constituting the video stream, and illustrate how subsequences and layers are used in an AVC-GOP. Fig. 55 (a) shows the 15 pictures in the AVC-GOP constituting the video stream; the coding types are I, B, B, P, B, B, P, ..., P, B, B. The frame rate when all pictures are decoded is 30 Hz. The 15 pictures are divided into three layers according to the coding type. FIG. 55 (b) shows layer 0 of the video stream, FIG. 55 (c) shows layer 1 of the video stream, and FIG. 55 (d) shows layer 2 of the video stream. Layer 0 is composed of I pictures, layer 1 is composed of P pictures, and layer 2 is composed of B pictures.

Pictures belonging to layer 0 to layer 2 in the AVC-GOP are stored in subsequence 0-1, subsequence 1-1, and subsequence 2-1, respectively. Here, subsequence 0-1 is decoded independently, subsequence 1-1 is decoded with reference to subsequence 0-1, and subsequence 2-1 is decoded with reference to subsequence 0-1 and subsequence 1-1.

Since the display time interval of each picture is fixed, the frame rate when decoding only layer 0 is 2 Hz, the frame rate when decoding layer 0 and layer 1 is 10 Hz, and the frame rate when decoding layer 0 to layer 2 is 30 Hz. The bit rate is 64 kbps for layer 0 only, 96 kbps for the sum of layer 0 and layer 1, and 128 kbps for the total from layer 0 to layer 2. FIGS. 56 (a) to (c) show the SSL, SSC, and SSI field values stored in the AVC-GOP shown in FIG. 55 (a). Figure 56 (a) shows the SSL SEI, which stores, for the Nth layer (N is an integer from 0 to 2), the average bit rate and the average frame rate obtained when all layers up to and including the Nth are combined.
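The frame rates above follow from counting the pictures in each set of layers over the 0.5 seconds that 15 pictures occupy at 30 Hz. The following minimal Python sketch reproduces the 2 Hz, 10 Hz, and 30 Hz figures and picks the highest layer that fits a bit rate budget using the cumulative bit rates stated in the text. The function names and the budget-based selection are illustrative assumptions.

```python
coding_types = ["I", "B", "B"] + ["P", "B", "B"] * 4
layer_of = {"I": 0, "P": 1, "B": 2}
full_rate_hz = 30
duration_s = len(coding_types) / full_rate_hz    # 15 pictures take 0.5 s

# Cumulative bit rates for layers 0, 0 to 1, and 0 to 2, as given in the text.
cumulative_kbps = [64, 96, 128]


def frame_rate_up_to(layer):
    """Frame rate when decoding layer 0 through the given layer."""
    pictures = sum(1 for t in coding_types if layer_of[t] <= layer)
    return pictures / duration_s


def highest_layer_within(budget_kbps):
    """Highest layer whose cumulative bit rate fits the budget, or None."""
    fitting = [n for n, kbps in enumerate(cumulative_kbps) if kbps <= budget_kbps]
    return max(fitting) if fitting else None


print([frame_rate_up_to(n) for n in range(3)])
print(highest_layer_within(100))
```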

Next, Figure 56 (b) shows the SSC SEI, which stores the information of the subsequences referenced by the subsequence included in the Nth layer (N is an integer from 0 to 2). For example, it shows that subsequence 1 of layer 2 refers to subsequence 1 of layer 0 and subsequence 1 of layer 1. Further, FIG. 56 (c) shows the SSI SEI added for each picture. The SSI SEI stores information on the layer and the subsequence to which each picture belongs. For example, picture I-3 is included in subsequence 1 of layer 0, and picture B-4 is included in subsequence 1 of layer 2.

The above description is one example of setting layers and subsequences; they can be set freely as long as the definitions of the layer and the subsequence are satisfied. Next, FIGS. 57 (a) and (b) show the sample structure when the AVC-GOP data of FIG. 55 is stored in MP4 samples. First, when frame_count is 1, that is, when one sample contains the data of one picture, the SSL SEI and SSC SEI are stored in the sample containing the data of picture I-3, and the subsequent samples include only picture data. Also, when frame_count is greater than 1, that is, when one sample contains the data of multiple pictures, all the data of the AVC-GOP is stored in one sample. It is assumed that an SSI SEI is included in each picture data.

Next, an example of using a sample header to describe the reference relationship between the pictures constituting a sample will be described. Embodiment 6 described that a new box can be introduced in the sample header to add new information. In this embodiment, a new box, the sample-to-layer-subsequence box (SampleToLayerSubSequenceBox; stls), is used to store the subsequence and layer information. stls indicates the AVCLayerEntry and AVCSubSequenceEntry referenced by the pictures in the sample, and the reference information to both entries is stored in the same box to reduce overhead.

AVCLayerEntry and AVCSubSequenceEntry are stored in sgpd in stbl as in the case of conventional MP4. Fig. 58 (a) shows an example of the syntax of the sample-to-layer-subsequence box stls. In Figure 58 (a), layer_description_index and sub_sequence_description_index indicate the entry numbers of the AVCLayerEntry and AVCSubSequenceEntry referenced by the picture, respectively, and picture_count indicates the number of consecutive pictures that refer to the same AVCLayerEntry and AVCSubSequenceEntry. If the picture_count, layer_description_index, and sub_sequence_description_index fields are collectively referred to as a picture level entry, then when pictures with the same subsequence and layer-related information appear periodically, the same picture level entries are repeated periodically. For this reason, when the picture level entries have a periodic structure, the table size is reduced by indicating the number of consecutive repetitions of the same periodic structure using entry_count.

FIG. 58 (b) shows the table structure of stls when the AVC-GOP of FIG. 55 is stored in one sample. Because the information for the fourth and subsequent pictures is periodic, the table size is significantly reduced by using entry_count. Note that the number of bits in each field and the arrangement of the fields are examples, and a different number of bits or field arrangement may be used.
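The periodic table described above can be illustrated by the following minimal sketch, which expands a run-length style table into one pair of indices per picture. The class names PictureLevelEntry and PeriodicGroup, the function expand_stls, and the example values are hypothetical and are introduced only for illustration; the actual syntax is the one shown in FIG. 58 (a), and the values below are not those of FIG. 58 (b).

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class PictureLevelEntry:
    picture_count: int                   # consecutive pictures sharing the same indices
    layer_description_index: int         # entry number of the AVCLayerEntry
    sub_sequence_description_index: int  # entry number of the AVCSubSequenceEntry


@dataclass(frozen=True)
class PeriodicGroup:
    entry_count: int                     # number of times the pattern is repeated
    pattern: Tuple[PictureLevelEntry, ...]


def expand_stls(groups: List[PeriodicGroup]) -> List[Tuple[int, int]]:
    """Return one (layer index, subsequence index) pair per picture, in order."""
    per_picture: List[Tuple[int, int]] = []
    for group in groups:
        for _ in range(group.entry_count):
            for entry in group.pattern:
                pair = (entry.layer_description_index,
                        entry.sub_sequence_description_index)
                per_picture.extend([pair] * entry.picture_count)
    return per_picture


# Illustrative values only (assumed, not taken from the figures).
table = [
    PeriodicGroup(1, (PictureLevelEntry(1, 1, 1),)),
    PeriodicGroup(3, (PictureLevelEntry(1, 1, 2), PictureLevelEntry(2, 2, 3))),
]
pictures = expand_stls(table)
```

In this sketch, two table rows describe ten pictures, which is the kind of reduction in table size that the periodic structure is intended to provide.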

Note that a box may be defined that can describe the relationship between a picture and a plurality of groups to which the picture belongs, without being limited to subsequences or layers. Each group has an independent stsd, and a picture is associated with an entry in the stsd. In the present specification, such a box is defined as a sample-to-multi-group box (SampleToMultiGroupBox; stmg). FIG. 58 (c) shows an example of the syntax of the sample-to-multi-group box stmg. num_of_related_grouping_type indicates the number of groups associated with the picture. grouping_type is an identifier of a group; for example, an identifier indicating a subsequence or a layer is stored. The definitions of total_entry_count, entry_count, and picture_count are the same as for stls. The picture_index field stores, for all the groups to which the picture belongs, the index number of the entry in the stsd referenced by the picture. For example, when layer and subsequence information is stored using stmg, num_of_related_grouping_type is 2, and grouping_type stores each identifier in order.

In addition, the grouping index field stores, in order, the index numbers of the AVCLayerEntry and AVCSubSequenceEntry referenced by the picture. Like other boxes in the sample header, stls can have its default value set in the initial value setting part of the sample header, and stls need not be set in the sample header. The information of each picture may also be stored sequentially without expressing a periodic structure using the entry_count field or the like. The subsequence information and the layer information may also be stored using separate, independent boxes. Note that stls and stmg are not limited to use in the sample header, and may be used in moov or moof.

Next, the criteria for choosing between (1) using SEI and (2) using stls in the sample header when describing the reference relationship between pictures will be described.

The choice is determined by a newly defined flag in the stsd entry. In this specification, this flag is called subseq_flag. When subseq_flag is set, the reference relationship between pictures is indicated using the subsequence and layer-related SEI set in the video track, and stls in the sample header is not used. When subseq_flag is not set, stls in the sample header is used, and the inclusion of subsequence and layer-related SEI in the video track is prohibited. That is, when acquiring the reference relationship between pictures, the value of subseq_flag is checked first. If subseq_flag is set, the reference relationship is obtained from the SSL, SSC, and SSI SEI added to the picture data in the sample; if subseq_flag is not set, the reference relationship of the pictures is obtained from stls in the sample header. The subseq_flag may be set in a location other than the stsd entry as long as it is within moov, and a box structure may be used instead of the flag.
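The selection rule described above can be sketched as follows. This is a minimal illustration under the assumption that the flag has already been read from the stsd entry as a boolean; the names ReferenceSource, select_reference_source, and track_is_consistent are hypothetical and do not appear in the file format.

```python
from enum import Enum


class ReferenceSource(Enum):
    SEI = "SSL, SSC, and SSI SEI in the picture data"
    STLS = "stls in the sample header"


def select_reference_source(subseq_flag: bool) -> ReferenceSource:
    """Choose where the picture reference relationship is read from."""
    return ReferenceSource.SEI if subseq_flag else ReferenceSource.STLS


def track_is_consistent(subseq_flag: bool, track_has_subseq_sei: bool) -> bool:
    """Check the strict rule: without the flag, the video track must not
    contain subsequence and layer-related SEI."""
    return subseq_flag or not track_has_subseq_sei
```

The second function models only the strict form of the rule, in which subsequence and layer-related SEI is prohibited whenever the flag is not set.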

Even if subseq_flag is not set, the inclusion of subsequence and layer-related SEI in the video track need not be prohibited. As a result, when the MPEG-4 AVC data stored in the video track of the MP4 file is taken out and converted into a different format such as a transport stream (TS), the video track data can be used as it is even when subsequence and layer-related SEI is used in the converted format.

Also, instead of using subseq_flag, the choice may be made according to the value of frame_count. For example, if frame_count is greater than 1, the sample header is used, so the reference relationship is set using stls, and if frame_count is 1, SEI is used. Further, SEI may be used when subseq_flag is set, and the reference relationship may be indicated using sbgp, as before, when subseq_flag is not set. Also, without using subsequence and layer-related SEI, the conventional method using sbgp and the method using stls in the sample header may be switched according to the value of subseq_flag. Furthermore, it may be determined in advance that either subsequence and layer-related SEI or stls is used.

As described above, by using the MP4 file of the present embodiment, the reference relationship of the pictures in the video track can be easily acquired. Therefore, the pictures to be decoded or displayed at the time of special reproduction such as double-speed reproduction can be efficiently determined. In addition, even when the value of frame_count is greater than 1 and one sample includes a plurality of pictures, the reference relationship can be indicated in units of pictures. Furthermore, since the information about the picture reference relationship can be stored in mdat using the sample header or SEI, the size of moov can be reduced.

Hereinafter, the data processing device according to the present embodiment will be described. The configuration of the data processing device according to the present embodiment is the same as that of the data processing device shown in FIG. 35, and its basic operation is also as described above. Therefore, the following describes the operation specific to the present embodiment, namely the operation of the sample analysis unit 307 and the decoding display unit 308 during trick play. The data processing device displays only selected pictures during trick play, but it may be necessary to decode pictures other than the selected pictures in order to decode the pictures to be played. The data processing device must therefore distinguish between pictures that are only decoded and pictures that are both decoded and displayed. The sample analysis unit 307 obtains the reference relationship between the pictures included in the video track and determines the pictures to be decoded and the pictures to be displayed. The sample analysis unit 307 then outputs, to the decoding display unit 308, the picture data d309 together with an identification signal that identifies whether the picture data d309 is to be decoded and displayed or only decoded. The decoding display unit 308 decodes the picture data d309, performs display when display is instructed by the received identification signal, and does not perform display when it is not instructed. FIG. 59 shows the processing procedure of the sample analysis unit 307 and the decoding display unit 308 when only selected pictures are reproduced.

First, in step S91, the value of subseq_flag stored in the stsd entry is obtained. If subseq_flag is set, it is decided that the picture reference relationship is obtained from the subsequence and layer-related SEI included in the video track data; if it is not set, it is decided that the relationship is obtained from stls in the sample header. Subsequently, in step S92, the subsequence to be displayed is determined based on the frame rate and bit rate of the layer or on the reference relationship of the subsequences.

Note that only specific pictures may be displayed instead of displaying all of the pictures included in the subsequence. In step S93, the subsequences to be referred to when decoding the pictures of the subsequence determined to be displayed in step S92 are specified, and in step S94, the pictures included in these referenced subsequences are decoded. Finally, in step S95, the pictures to be displayed are decoded and displayed. Here, the processing from step S91 to step S93 is performed in the sample analysis unit 307, and the subsequent processing is performed in the decoding display unit 308.
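Steps S92 to S95 can be sketched as follows, assuming that the reference relationship between subsequences has already been obtained (from SEI or from stls) as a simple mapping. The function names, the mapping depends_on, and the picture and subsequence labels are hypothetical, and the boolean in each result stands in for the identification signal passed to the decoding display unit 308.

```python
from typing import Dict, List, Set, Tuple


def referenced_subsequences(display: Set[str],
                            depends_on: Dict[str, List[str]]) -> Set[str]:
    """Collect every subsequence referenced, directly or indirectly,
    by the subsequences selected for display (corresponds to step S93)."""
    needed: Set[str] = set()
    stack = list(display)
    while stack:
        current = stack.pop()
        for ref in depends_on.get(current, []):
            if ref not in needed:
                needed.add(ref)
                stack.append(ref)
    return needed


def plan_trick_play(pictures: List[Tuple[str, str]],
                    display: Set[str],
                    depends_on: Dict[str, List[str]]) -> List[Tuple[str, bool]]:
    """pictures holds (picture id, subsequence id) in decoding order.
    Returns (picture id, show) pairs: show=True means decode and display,
    show=False means decode only. Other pictures are skipped."""
    referenced = referenced_subsequences(display, depends_on)
    plan: List[Tuple[str, bool]] = []
    for picture_id, subsequence_id in pictures:
        if subsequence_id in display:
            plan.append((picture_id, True))
        elif subsequence_id in referenced:
            plan.append((picture_id, False))
    return plan


# Illustrative values only: subsequence B refers to A, and C refers to B.
depends_on = {"B": ["A"], "C": ["B"]}
pictures = [("I0", "A"), ("P1", "B"), ("b2", "C"), ("P3", "B")]
plan = plan_trick_play(pictures, {"B"}, depends_on)
```

With these assumed values, picture I0 is decoded only, P1 and P3 are decoded and displayed, and b2 is skipped. The sketch interleaves decode-only and displayed pictures in decoding order, which is a simplification of the separate steps S94 and S95.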

The multiplexing processing and the demultiplexing processing of the data processing device according to each embodiment described in this specification are realized by a computer program that defines the procedure of such processing. The computer program is recorded on a recording medium such as a flexible disk or a CD-ROM and distributed on the market, or transmitted through a telecommunication line such as the Internet. This allows a computer system to operate as the data processing device or, alternatively, as a playback device.

FIG. 60 (a) shows an example of the physical format of a flexible disk (FD) as an example of a recording medium. On the surface of the flexible disk FD, a plurality of tracks Tr are formed concentrically from the outer circumference toward the inner circumference, and each track is divided into 16 sectors Se in the angular direction. The above-described program is recorded in a predetermined area allocated on the flexible disk FD. FIG. 60 (b) shows the appearance of the flexible disk as viewed from the front, its cross-sectional structure, and the flexible disk itself. The flexible disk FD is housed in the case F.

FIG. 60 (c) shows a device configuration for writing and reading the program to and from the flexible disk FD. The program for multiplexing and demultiplexing in the data processing device is transferred from the computer system Cs and written to the flexible disk FD by the flexible disk drive (FDD). When the flexible disk FD on which the program is stored is loaded into the FDD, the computer system Cs reads and executes the program, thereby realizing the multiplexing processing and/or the demultiplexing processing.

The recording medium is not limited to a flexible disk, but may be an optical disk such as a CD-ROM, a semiconductor recording medium such as an IC card, a ROM cassette, or the like.

In the embodiments of the present invention, the description has mainly concerned the MP4 file. However, most of the specifications of the MP4 standard are defined based on the QuickTime (TM) file format of Apple (registered trademark), and although the names differ in some cases, the content of the specifications is almost the same. By replacing "Box" in the field names with "Atom", the above description generally complies with the QuickTime standard.

According to the data processing device of the present invention, the decoding time of each frame can be obtained by dividing the difference between the decoding start times of the current sample and the next sample by the number of frames forming the sample. Even if the decoding time and the display start time of an access unit constituting the sample differ, or the access unit does not include a start code, the data processing device can acquire the time information and size of each access unit at high speed and with low load, without analyzing the access unit.
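The division described above can be written as the following minimal sketch, which assumes that the frames in a sample are equally spaced in decoding time. The function name, the parameter names, and the example values (a 90 kHz timescale and 15 frames per sample) are assumptions introduced only for illustration.

```python
from fractions import Fraction
from typing import List


def frame_decoding_times(sample_dts: int,
                         next_sample_dts: int,
                         frame_count: int) -> List[Fraction]:
    """Decoding time of each frame in a sample, obtained by dividing the
    interval between two sample decoding start times by the frame count."""
    if frame_count < 1:
        raise ValueError("frame_count must be at least 1")
    step = Fraction(next_sample_dts - sample_dts, frame_count)
    return [sample_dts + i * step for i in range(frame_count)]


# Illustrative values only: times expressed in ticks of an assumed timescale.
times = frame_decoding_times(9000, 54000, 15)
```

Fraction is used so that intervals that do not divide evenly are represented exactly instead of being rounded.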

Also, according to the above data structure, since the decoding time information, display time information, or size of each access unit constituting the sample is stored in the sample as information separate from the access unit, information on each access unit can be obtained without analyzing the access unit. Even when the decoding time and the display start time differ for each frame, the correct display time of the frame can be obtained. Also, since the time information in the encoded data can be used as it is as the display time information of the frames constituting the sample, the load on the data processing device for acquiring the time information can be reduced.

Further, according to the data structure described above, when the frame rate of the moving image data is fixed, only a default value is set as the decoding time information of the sample, so that the decoding time information of the frames constituting the sample can be easily obtained and overhead can be reduced. When the structure of a sample is known, for example when one GOP forms one sample, the overhead of the sample can be reduced by acquiring the information on each frame constituting the sample from the encoded data in the sample. Also, by storing information about a frame in a sample only when that information differs from the default value, information about the frames can be stored efficiently.
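The idea of storing per-frame information only where it differs from the default value can be sketched as follows. The dictionary-based representation and the function names are hypothetical and do not reflect the actual layout of the sample header; frame durations are used here only as one example of per-frame information.

```python
from typing import Dict, List


def pack_frame_values(values: List[int], default: int) -> Dict[int, int]:
    """Keep only the frames whose value differs from the default value."""
    return {i: v for i, v in enumerate(values) if v != default}


def unpack_frame_values(frame_count: int,
                        default: int,
                        overrides: Dict[int, int]) -> List[int]:
    """Rebuild the per-frame values from the default value and the overrides."""
    return [overrides.get(i, default) for i in range(frame_count)]


# Illustrative values only: four frames, one of which has a longer duration.
durations = [3000, 3000, 6000, 3000]
overrides = pack_frame_values(durations, default=3000)
restored = unpack_frame_values(len(durations), 3000, overrides)
```

When every frame matches the default value, as with a fixed frame rate, the set of overrides is empty and only the default value needs to be stored.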

Furthermore, according to the above-described data structure, by including in the sample header the information necessary for identifying the reference frame to be referred to when decoding a frame constituting a sample, the frames that need to be decoded can be easily identified at the time of special playback such as double-speed playback. Likewise, when the video stream constituting the sample itself contains the information necessary to specify the reference frame to be referenced at the time of decoding, the frames that need to be decoded during special playback such as double-speed playback can be easily identified. By including identification information indicating the location where the information necessary to identify the reference frame is stored, the storage location of the information about the reference relationship between frames used for trick play can be specified.

Industrial applicability

According to the present invention, there is provided a data structure that, compared to a case where, for example, one frame of a video stream corresponds to one video sample, can reduce the size of the attached information and suppress the propagation of an error even when an error occurs.

Claims

1. A data recording device, comprising:

a receiving unit that receives at least one of a video signal and an audio signal;

a stream generation unit that encodes the signal in a predetermined encoding format to generate an encoded stream including a plurality of reproduction data, which is picture data for the video signal and frame data for the audio signal;

an extended information generating unit that generates extended information for identifying each reproduction data;

an additional information generation unit that generates access data for accessing a group unit of one or more reproduction data, and generates additional information including the access data;

a multiplexing unit that multiplexes the encoded stream and the extended information to generate a data stream; and

a recording unit that records the data stream and the additional information on a recording medium.

2. The data recording device according to claim 1, wherein the additional information generation unit further generates access data for accessing a group unit including a plurality of pieces of reproduction data.

3. The data recording device according to claim 1, wherein the additional information generation unit generates access data for each first sample when the group unit is a first sample, and generates access data for each second sample when the extended information is a second sample.

4. The data recording device according to claim 3, wherein the multiplexing unit generates the data stream by multiplexing the encoded stream and the extended information on a first sample basis and a second sample basis.

5. The data recording device according to claim 1, wherein the additional information generation unit generates access data for each sample when the group unit and the extended information relating to the one or more pieces of reproduction data included in the group are one sample.

6. The data recording device according to claim 5, wherein the multiplexing unit generates the data stream by multiplexing the encoded stream and the extended information for each sample.

7. The data recording device according to claim 1, wherein

the receiving unit receives a video signal and an audio signal,

the stream generation unit encodes the video signal and the audio signal in a predetermined encoding format, respectively, to generate an encoded stream including picture data of a plurality of video pictures and frame data of a plurality of audio frames,

the extended information generating unit generates extended information for identifying at least each picture data, and

the additional information generation unit generates access data for accessing each of the picture data, the frame data of the plurality of audio frames, and the extended information in a group unit including at least two picture data, and generates additional information including the access data.

8. The data recording device according to claim 7, wherein the extended information generating unit further generates extended information for specifying each frame data of the plurality of audio frames.

9. The data recording device according to claim 1, wherein the recording unit records the data stream and the additional information as one data file on the recording medium.

10. The data recording device according to claim 1, wherein the extended information generating unit generates, as the extended information, at least one of information indicating a data size, a display time, and a decoding time of each of the reproduction data.

11. The data recording device according to claim 1, wherein the additional information generation unit generates the additional information further including a default value of the extended information, and the extended information generating unit generates the extended information having a value different from the default value.

12. The data recording device according to claim 1, wherein the extended information generating unit generates extended information for specifying reference destination picture data referred to for decoding each picture data of the video signal.

13. The data recording device according to claim 1, wherein

the additional information generation unit generates the additional information further including link information, and

the recording unit records the data stream on the recording medium as a first data file specified by the link information, and records the additional information on the recording medium as a second data file.

14. A data recording method, comprising:

a step of receiving at least one of a video signal and an audio signal;

a step of encoding the signal in a predetermined encoding format to generate an encoded stream including a plurality of reproduction data, which is picture data for the video signal and frame data for the audio signal;

a step of generating extended information for identifying each piece of reproduction data;

a step of generating access data for accessing a group unit of one or more pieces of reproduction data, and generating additional information including the access data;

a step of multiplexing the encoded stream and the extended information to generate a data stream; and

a step of recording the data stream and the additional information on a recording medium.

15. The data recording method according to claim 14, wherein the step of generating the additional information further generates access data for accessing a group unit including a plurality of reproduction data.

16. The data recording method according to claim 14, wherein the step of generating the additional information generates access data for each first sample when the group unit is a first sample, and generates access data for each second sample when the extended information is a second sample.

17. The step of generating the data stream comprises multiplexing the encoded stream and the extended information for each of the first samples and for each of the second samples to generate the data stream. Data recording method described in 1.

18. The data recording method according to claim 14, wherein the step of generating the additional information generates access data for each sample when the extended information on one or more pieces of reproduction data included in the group is defined as one sample.

19. The data recording method according to claim 18, wherein, in the step of generating the data stream, the data stream is generated by multiplexing the encoded stream and the extended information for each sample.

20. The data recording method according to claim 14, wherein

the receiving step includes receiving a video signal and an audio signal,

the step of generating the encoded stream includes encoding the video signal and the audio signal in a predetermined encoding format, respectively, to generate an encoded stream including picture data of a plurality of video pictures and frame data of a plurality of audio frames,

the step of generating the extended information includes generating extended information for identifying at least each picture data, and

the step of generating the additional information includes generating access data for accessing each of the picture data, the frame data of the plurality of audio frames, and the extended information in a group unit including at least two picture data, and generating additional information including the access data.

21. The data recording method according to claim 20, wherein the step of generating the extended information further comprises generating extended information for specifying each frame data of the plurality of audio frames.

22. The data recording method according to claim 14, wherein, in the recording step, the data stream and the additional information are recorded as one data file on the recording medium.

23. The data recording method according to claim 14, wherein the step of generating the extended information includes generating, as the extended information, at least one of information indicating a data size, a display time, and a decoding time of each of the reproduction data.

24. The data recording method according to claim 14, wherein

the step of generating the additional information includes generating the additional information further including a default value of the extended information, and

the step of generating the extended information generates the extended information having a value different from the default value.

25. The data recording method according to claim 14, wherein the step of generating the extended information includes generating extended information for identifying a reference destination picture to be referred to for decoding each picture data of the video signal.

26. The data recording method according to claim 14, wherein

the step of generating the additional information includes generating the additional information further including link information, and

in the recording step, the data stream is recorded on the recording medium as a first data file specified by the link information, and the additional information is recorded on the recording medium as a second data file.