JP4155990B2

JP4155990B2 - Synchronous reconstruction method and apparatus for acoustic data and moving image data

Info

Publication number: JP4155990B2
Application number: JP2005379488A
Authority: JP
Inventors: 英樹小島
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2005-12-28
Filing date: 2005-12-28
Publication date: 2008-09-24
Anticipated expiration: 2018-11-04
Also published as: JP2006187013A

Description

The present invention relates to a technique for compressing or expanding, and then reproducing, data in which audio and moving images are combined, while keeping the audio and the moving images synchronized.

With the rapid recent development of computer processing technology for moving images and audio, multimedia devices that handle both, such as digital video equipment, have spread rapidly. In particular, in fields such as story-driven animation, where audio and moving images are handled together, computers are actively used for editing, and increasingly advanced audio and moving-image editing functions are demanded. For example, when video data is created by combining moving image data that captures a character's motion with separately edited audio data, one important problem is how to make the resulting video look natural while keeping the two sets of data synchronized. Here, "video" is a general term for multimedia information that includes audio and moving images, and covers story-driven video made by animation or live action.

In addition, when part of video data completed as one scenario is to be reused in another work, it is often necessary to compress or expand the portion being reused. In that case, it is important to compress or expand the video data without breaking the synchronization between the audio data and the moving image data.

In much conventional video data, however, the audio data and the moving image data for each character are generated independently in time series, and their recording lengths may differ. It then becomes necessary to compress or expand one or both sets of data in order to synchronize them.

Various methods have been considered for maintaining synchronization between audio data and moving image data even when video data is compressed or expanded. For example, Japanese Patent Laid-Open No. 5-3561 discloses a method for editing and reproducing a story composed of audio and moving images in which the audio data and the moving image data are edited independently and then reassembled into a single story while being synchronized.

In that method, however, the audio data and the moving image data are compressed or expanded independently before being synchronized, so there is no guarantee that the content of the moving images and the audio are accurately synchronized anywhere other than at the start and end synchronization points. Now that increasingly detailed video is demanded, fine adjustments such as matching the audio to mouth movements are often required. The method can nevertheless produce defects: for example, audio that does not match a character's mouth movements, or a voice that is heard while the character's mouth is not moving.

An object of the present invention is to eliminate these drawbacks and to provide a method and apparatus for synchronous reconstruction of acoustic data and moving image data that achieve fine-grained synchronization by a simple method and that can reproduce natural-looking video even when video data whose acoustic data and moving image data are synchronized is compressed or expanded.

To achieve the above object, a synchronous reconstruction apparatus for acoustic data and moving image data according to the present invention compresses or expands a pair of data consisting of mutually synchronized acoustic data and moving image data, and comprises: a data section dividing unit that divides a series of acoustic data and a series of moving image data into a plurality of sections; a parameter time series instruction unit that specifies a parameter time series determining the degree of compression or expansion of the acoustic data and the moving image data for each section produced by the data section dividing unit; an input frame dividing unit that further divides the acoustic data divided by the data section dividing unit into input frames of equal length; an output frame length determining unit that determines the output frame lengths of the acoustic data based on the parameter time series determined by the parameter time series instruction unit; an acoustic/moving-image synchronization data output unit that, for the acoustic data compressed or expanded based on that parameter time series, outputs a correspondence table of synchronization points between the acoustic data before and after compression or expansion; an acoustic data reconstruction unit that compresses or expands the acoustic data based on the output frame lengths; and a moving image data reconstruction unit that compresses or expands the moving image data based on the correspondence table.

With this configuration, the data sections are divided in advance at the points to be synchronized, so compressing or expanding only the acoustic data is enough for the moving image data to be compressed or expanded in synchronization with it. Video data can therefore be compressed or expanded easily and reliably.

In the synchronous reconstruction apparatus for acoustic data and moving image data according to the present invention, the parameter time series instruction unit preferably extracts the parameter time series based only on the acoustic data. This is because passages where sound is present, such as those in which a character is speaking, are often the scenes where synchronization matters most in the video.

In the synchronous reconstruction apparatus for acoustic data and moving image data according to the present invention, the parameter time series instruction unit preferably extracts the parameter time series based only on the moving image data. This is because synchronization points can also be extracted easily from changes in a character's motion.

In the synchronous reconstruction apparatus for acoustic data and moving image data according to the present invention, the parameter time series instruction unit preferably extracts the parameter time series based on both the acoustic data and the moving image data. This is because combining the two with weighting makes it possible to set finer-grained synchronization points.

In the synchronous reconstruction apparatus for acoustic data and moving image data according to the present invention, the parameter time series is preferably input manually by the user at the parameter time series instruction unit. This allows fine adjustments to be made so that a human viewer notices as little unnaturalness as possible.

Next, to achieve the above object, a synchronous reconstruction method for acoustic data and moving image data according to the present invention compresses or expands a pair of data consisting of mutually synchronized acoustic data and moving image data, and comprises the steps of: dividing the acoustic data and the moving image data into a plurality of sections; specifying, for each divided section, a parameter time series that determines the degree of compression or expansion of the acoustic data and the moving image data; further dividing the divided acoustic data into input frames of equal length; determining the output frame lengths based on the determined parameter time series; outputting, for the acoustic data compressed or expanded based on the determined parameter time series, a correspondence table of synchronization points between the acoustic data before and after compression or expansion; compressing or expanding the acoustic data based on the correspondence table; and compressing or expanding the moving image data based on the correspondence table.

In the synchronous reconstruction method for acoustic data and moving image data according to the present invention, the parameter time series is preferably extracted based only on the acoustic data in the step of specifying the parameters. This is because passages where sound is present, such as those in which a character is speaking, are often the scenes where synchronization matters most in the video.

In the synchronous reconstruction method for acoustic data and moving image data according to the present invention, the parameter time series is preferably extracted based only on the moving image data in the step of specifying the parameters. This is because synchronization points can also be extracted easily from changes in a character's motion.

In the synchronous reconstruction method for acoustic data and moving image data according to the present invention, the parameter time series is preferably extracted based on both the acoustic data and the moving image data in the step of specifying the parameters. This is because combining the two with weighting makes it possible to set finer-grained synchronization points.

In the synchronous reconstruction method for acoustic data and moving image data according to the present invention, the parameter time series is preferably input manually by the user in the step of specifying the parameters. This allows fine adjustments to be made so that a human viewer notices as little unnaturalness as possible.

Next, a computer-readable recording medium according to the present invention records a program to be executed by a computer that compresses or expands a pair of data consisting of mutually synchronized acoustic data and moving image data, the program comprising the steps of: dividing the acoustic data and the moving image data into a plurality of sections; specifying, for each divided section, parameters that determine the degree of compression or expansion of the acoustic data and the moving image data; further dividing the divided acoustic data into input frames of equal length; determining the output frame lengths based on the determined parameter time series; outputting, for the acoustic data compressed or expanded based on the determined parameter time series, a correspondence table of synchronization points between the acoustic data before and after compression or expansion; compressing or expanding the acoustic data based on the correspondence table; and compressing or expanding the moving image data based on the correspondence table.

With this configuration, loading the program onto a computer and executing it realizes a synchronous reconstruction apparatus for acoustic data and moving image data in which the data sections are divided in advance at the points to be synchronized, so that compressing or expanding only the acoustic data is enough for the moving image data to be compressed or expanded in synchronization with it, and video data can be compressed or expanded easily and reliably.

As described above, with the synchronous reconstruction apparatus for acoustic data and moving image data according to the present invention, the data sections are divided in advance at the points to be synchronized, so compressing or expanding only the acoustic data is enough for the moving image data to be compressed or expanded in synchronization with it. Video data can therefore be compressed or expanded easily and reliably.

A synchronous reproduction apparatus for moving image data and acoustic data according to an embodiment of the present invention is described below with reference to the drawings. FIG. 1 is a schematic configuration diagram of the synchronous reproduction apparatus for acoustic data and moving image data according to the embodiment.

In FIG. 1, reference numeral 11 denotes an acoustic data input unit, 12 a moving image data input unit, 13 a data section dividing unit, 14 a parameter time series instruction unit, 15 a speech rate conversion unit, 16 an acoustic data output unit, 17 an acoustic/moving-image synchronization data output unit, 18 a moving image data reconstruction unit, and 19 a moving image data output unit.

When video data combining audio and moving images is supplied through the acoustic data input unit 11 and the moving image data input unit 12, the data section dividing unit 13 divides both the acoustic data and the moving image data according to a given criterion. The simplest approach is to divide at equal intervals, but the division is not limited to this; various criteria can be used, such as scene breaks in the video or the entrance and exit of a particular character.

The parameter time series instruction unit 14 sets the parameter time series that determines how the acoustic data and the moving image data are to be synchronized, either by extracting it from the supplied acoustic data and moving image data or by accepting it as input from the user.

Next, based on the parameter time series set by the parameter time series instruction unit 14, the speech rate conversion unit 15 converts the speech rate for each section produced by the data section dividing unit 13, and the acoustic/moving-image synchronization data output unit 17 outputs synchronization data that reflects the resulting change in rate.

The speech rate conversion unit 15 changes the speech rate but not the pitch of the voice. Moreover, the speech rate need not be constant within a section; the degree of compression or expansion can vary within the same section.

As the acoustic/moving-image synchronization data, a correspondence table is output that shows, for the data at each time in the rate-converted acoustic data, which time in the original acoustic data it came from. Once this correspondence is known, the moving image data can be compressed or expanded so that it stays synchronized with the moving image data corresponding to the times of the original acoustic data.

The moving image data reconstruction unit 18 then performs compression or expansion based on the correspondence table, for example by dropping moving image frames or, conversely, by repeating them, to create moving image data synchronized with the acoustic data.
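
The frame dropping and repetition described above can be sketched as a lookup through the correspondence table. This is a minimal illustration only: the table layout (pairs of original time and converted time), the piecewise-linear interpolation between synchronization points, and the names `source_time` and `remap_frames` are assumptions made for this example and are not taken from the specification.

```python
from bisect import bisect_right


def source_time(table, t_out):
    """Map a converted (output) time back to the original time.

    `table` is an assumed layout: a list of (original_time, converted_time)
    synchronization points, strictly increasing in both columns, with at
    least two rows. Times between points are interpolated linearly.
    """
    out_times = [row[1] for row in table]
    # Index of the table segment that contains t_out, clamped to a valid range.
    i = min(max(bisect_right(out_times, t_out) - 1, 0), len(table) - 2)
    (s0, o0), (s1, o1) = table[i], table[i + 1]
    return s0 + (t_out - o0) * (s1 - s0) / (o1 - o0)


def remap_frames(table, fps, n_source_frames):
    """For each output frame, pick the index of the source frame to display.

    Source frames are repeated where the audio was expanded and skipped
    where it was compressed.
    """
    n_out = round(table[-1][1] * fps)
    indices = []
    for k in range(n_out):
        t_src = source_time(table, k / fps)
        indices.append(min(int(t_src * fps), n_source_frames - 1))
    return indices


# Hypothetical table: the first second is stretched to two seconds,
# the second second is squeezed into half a second.
table = [(0.0, 0.0), (1.0, 2.0), (2.0, 2.5)]
print(remap_frames(table, fps=4, n_source_frames=8))
```

In this hypothetical run the first four source frames are each shown twice and two of the last four are skipped, which corresponds to the repetition and thinning named in the text.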

The acoustic data converted by the speech rate conversion unit 15 is output from the acoustic data output unit 16 as new acoustic data, and the moving image data reconstructed by the moving image data reconstruction unit 18 is output from the moving image data output unit 19 in synchronization with it.

FIG. 2 shows one example of the synchronous reproduction apparatus for acoustic data and moving image data according to the embodiment of the present invention. In FIG. 2, reference numeral 21 denotes an input frame dividing unit, 22 an output frame length determining unit, and 23 an acoustic data reconstruction unit.

In FIG. 2, when video data combining audio and moving images is supplied, the data section dividing unit 13 divides both the acoustic data and the moving image data according to a given criterion. The simplest approach is to divide at equal intervals, but the division is not limited to this; various criteria can be used, such as scene breaks in the video or the entrance and exit of a particular character.

One possible criterion for section division is to detect scene change points and treat the span from one scene change point to the next as one section. A scene change point is a place where the continuity between moving image frames is largely lost. A method of detecting scene change points is described below using an MPEG1 data stream as an example. The moving image data is not limited to MPEG1 data streams, however, and the detection method is not limited to the one described.

To increase compression efficiency, an MPEG1 data stream provides a GOP (Group Of Pictures) layer made up of intra-coded frames (I-Pictures), inter-frame coded frames (P-Pictures), and interpolated frames (B-Pictures). Each frame is compressed in units of 8×8 macroblocks.

There are various types of macroblock. Scene change detection uses the total number within a frame of each of four types: intra-coded macroblocks (hereinafter "I-MB"), forward prediction macroblocks used for forward prediction (hereinafter "P-MB"), backward prediction macroblocks used for backward prediction (hereinafter "B-MB"), and bidirectional prediction macroblocks used for bidirectional prediction (hereinafter "D-MB").

The average value within a frame of the motion vectors obtained from the prediction macroblocks is also used. Here, a "motion vector" means the difference vector from one scene to another.

Specifically, if a scene change point is regarded as a point where the continuity between frames is largely lost, motion vectors can be expected to be difficult to obtain in the frames around it, so a sharp increase in I-MB and a sharp decrease in D-MB will inevitably occur. In addition, the frame immediately before a scene change point is expected to show an increase in P-MB and a decrease in B-MB, while the frame immediately after is expected to show a decrease in P-MB and an increase in B-MB. Scene change points are estimated by defining an evaluation function that takes these characteristics into account.
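
The specification does not give the evaluation function itself. The sketch below is one hypothetical form, shown only to make the idea concrete: it scores each frame by how much the I-MB share rises and the D-MB share falls relative to the previous frame, and ignores the P-MB and B-MB trends for brevity. The per-frame count dictionaries, the function names, and the threshold of 0.5 are all assumptions.

```python
def scene_change_scores(frames):
    """Score each frame by the jump in macroblock-type shares.

    `frames` is an assumed input: one dict per frame holding the counts of
    'I', 'P', 'B' and 'D' macroblocks. The score is the rise in the I-MB
    share plus the fall in the D-MB share relative to the previous frame.
    """
    scores = [0.0]  # the first frame has no predecessor
    for prev, cur in zip(frames, frames[1:]):
        total_prev = sum(prev.values()) or 1
        total_cur = sum(cur.values()) or 1
        i_rise = cur["I"] / total_cur - prev["I"] / total_prev
        d_drop = prev["D"] / total_prev - cur["D"] / total_cur
        scores.append(max(i_rise, 0.0) + max(d_drop, 0.0))
    return scores


def detect_scene_changes(frames, threshold=0.5):
    """Return the indices of frames whose score reaches the threshold."""
    scores = scene_change_scores(frames)
    return [n for n, score in enumerate(scores) if score >= threshold]


# Hypothetical counts: frame 2 is dominated by intra-coded macroblocks.
frames = [
    {"I": 10, "P": 30, "B": 30, "D": 30},
    {"I": 10, "P": 30, "B": 30, "D": 30},
    {"I": 80, "P": 10, "B": 10, "D": 0},
    {"I": 10, "P": 30, "B": 30, "D": 30},
]
print(detect_scene_changes(frames))
```

A fuller evaluation function would presumably also weigh the P-MB and B-MB changes and the average motion vector mentioned in the text; how those terms are combined is left open by the source.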

Next, the parameter time series instruction unit 14 sets the parameter time series that determines how the acoustic data and the moving image data are to be synchronized, either by extracting it from the supplied acoustic data and moving image data or by accepting it as input from the user.

Various methods of extracting the parameter time series are also possible. Three are described here: extraction from the acoustic data only, extraction from the moving image data only, and extraction using both the acoustic data and the moving image data.

First, the parameter time series can be extracted from the acoustic data only on the basis of acoustic power. The parameter time series is determined according to a criterion such as "loud passages contain important information and quiet passages contain less important information". If the acoustic power of the characters' speech is set high in the video, this is an effective way of synchronizing the acoustic data and the moving image data according to the words the characters speak. Similarly, the parameter time series can be extracted according to the pitch of the sound.
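
A power-based extraction of this kind might look like the following sketch. The frame-wise mean squared amplitude, the normalisation to the loudest frame, and the `floor` value are assumptions chosen for illustration; the specification does not say how power is measured or how an importance value is later turned into a degree of compression or expansion.

```python
def frame_power(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame of samples."""
    powers = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        powers.append(sum(x * x for x in frame) / frame_len)
    return powers


def power_to_parameters(powers, floor=0.1):
    """Normalise powers so the loudest frame gets 1.0.

    Quiet frames are clamped to `floor` (an assumed value) so that no frame
    ends up with a parameter of zero.
    """
    peak = max(powers, default=0.0)
    if peak == 0.0:
        return [floor] * len(powers)
    return [max(p / peak, floor) for p in powers]


# Hypothetical signal: silence, a loud frame, then a quieter frame.
samples = [0, 0, 0, 0, 1, -1, 1, -1, 0.5, 0.5, -0.5, -0.5]
print(power_to_parameters(frame_power(samples, frame_len=4)))
```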

The parameter time series can also be extracted from the sound spectrum in order to distinguish sound sources. Possible spectra include the FFT (Fast Fourier Transform) spectrum and the LPC (Linear Predictive Coding) spectrum, but the choice is not limited to these; any representation that may identify the type of sound source will do. With this method the parameter time series can be determined by a criterion such as rating sections where the leading character is speaking as more important and sections where supporting characters are speaking as less important.

Next, the parameter time series can be extracted from the moving image data only on the basis of motion vectors (difference vectors). Scene change points are found from the degree of change in the motion vectors, and the parameter time series is determined by arranging the scene change intervals in time series.

Finally, the parameter time series can be extracted using both the acoustic data and the moving image data by taking a weighted sum of the parameter time series obtained by the acoustic-only method and the one obtained by the moving-image-only method, and thereby deriving an optimal parameter time series. The weights may be set in advance by a program or the like, or may be input by the user according to the characteristics of the video data.
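
The weighted sum can be illustrated as below. The single `audio_weight` in the range 0 to 1, with the moving-image weight taken as its complement, is an assumption for this sketch; the specification only says that weights are applied and may be preset or user-supplied.

```python
def combine_parameters(audio_params, video_params, audio_weight=0.5):
    """Weighted sum of two parameter time series of equal length.

    `audio_weight` is an assumed convention: the moving-image series
    receives the complementary weight (1 - audio_weight).
    """
    if len(audio_params) != len(video_params):
        raise ValueError("both series must cover the same frames")
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie between 0 and 1")
    w = audio_weight
    return [w * a + (1.0 - w) * v for a, v in zip(audio_params, video_params)]


# Hypothetical series in which the audio is trusted three times as much.
print(combine_parameters([1.0, 0.0], [0.0, 1.0], audio_weight=0.75))
```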

The user may also input the parameter time series through an input device such as a keyboard. In this case the parameter time series is entered as a step function, so smoothing is applied to make the parameter changes gradual. As shown in FIG. 3, turning the input step function into a gradual transition can be expected to join the audio data smoothly without creating completely silent regions.
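
The smoothing method is not specified. A centred moving average is one simple possibility, used here purely as an illustration; the window `radius` and the shortened windows at the edges are assumptions.

```python
def smooth(params, radius=1):
    """Centred moving average over `radius` neighbours on each side.

    Windows are shortened at the ends of the series so the output has the
    same length as the input.
    """
    n = len(params)
    smoothed = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        window = params[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed


# A hypothetical step from 1 to 4 becomes a ramp through 2 and 3.
print(smooth([1, 1, 1, 4, 4, 4], radius=1))
```

A larger radius spreads the transition over more frames; other smoothers, such as a raised-cosine ramp, would serve the same purpose.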

The input frame dividing unit 21 then divides the acoustic data into input frames of equal length. The input frame length is shorter than the data sections defined by the data section dividing unit 13.

The output frame length determining unit 22 then determines each output frame length from the values of the parameter time series so that, in each data section, the sum of the output frame lengths equals (original data section length) × (compression/expansion ratio). The simplest approach is to make each output frame length proportional to the parameter time series.
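
The proportional rule and the sum constraint can be written out as follows. The function name and argument layout are hypothetical, and the sketch assumes the data section consists of a whole number of equal-length input frames.

```python
def output_frame_lengths(params, input_frame_len, ratio):
    """Split a section's target length across frames in proportion to params.

    The section is assumed to be len(params) input frames long, so the
    outputs sum to len(params) * input_frame_len * ratio.
    """
    weight_sum = sum(params)
    if weight_sum <= 0:
        raise ValueError("parameters must have a positive sum")
    target = len(params) * input_frame_len * ratio
    return [target * p / weight_sum for p in params]


# Hypothetical section: three 100-sample frames, expanded to 1.5x overall,
# with the third frame given twice the weight of the others.
lengths = output_frame_lengths([1, 1, 2], input_frame_len=100, ratio=1.5)
print(lengths, sum(lengths))
```

In practice the lengths would be rounded to whole samples, with the rounding error distributed so that the section total is preserved.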

FIG. 4 shows a typical example of determining the output frame length in proportion to the parameter time series. In FIG. 4, 41 denotes the input frame length of the acoustic data, 42 the output frame length when the acoustic data is reproduced more slowly, and 43 the output frame length when it is reproduced faster.

In FIG. 4, (a) shows the acoustic data divided into equal-length frames by the input frame dividing unit 21. When the acoustic data is to be reproduced more slowly in proportion to the parameter time series, the output frame length 42 is obtained, as shown in (b), by multiplying the equal frame length 41 by the proportionality constant of the parameter time series. Each frame is thus expanded to its output frame length, and the moving image data is expanded to the same new output frame length. Consequently, as long as the audio and the moving images are synchronized within each frame before expansion, they remain synchronized after expansion.

Similarly, when the acoustic data is to be reproduced faster in proportion to the parameter time series, the output frame length 43 is obtained, as shown in (c), by multiplying the equal frame length 41 by the proportionality constant of the parameter time series. Each frame is thus compressed to its output frame length, and the moving image data is likewise compressed to the same new output frame length. Consequently, as long as the audio and the moving images are synchronized within each frame before compression, they remain synchronized after compression.

The acoustic data reconstruction unit 23 then cuts out acoustic data of the output frame length from each input frame and connects the pieces after shifting their positions so that the joints are smooth. Various methods can be used to find the position that gives a smooth joint, such as finding the position with the strongest cross-correlation. The method is not limited to this, however; any method that makes the sound connect smoothly to the ear may be used.
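
One possible way to find a joint position by cross-correlation is sketched below. It is an assumption-laden illustration, not the patented procedure: the names `best_shift` and `splice`, the normalized correlation score, the linear cross-fade, and the `overlap` and `max_shift` parameters are all hypothetical choices made for the example.

```python
def best_shift(tail, segment, max_shift):
    """Return the shift into `segment` whose first len(tail) samples
    correlate most strongly with `tail` (normalized cross-correlation)."""
    n = len(tail)
    best, best_score = 0, float("-inf")
    tail_energy = sum(a * a for a in tail)
    for shift in range(max_shift + 1):
        window = segment[shift:shift + n]
        if len(window) < n:
            break  # not enough samples left to compare
        dot = sum(a * b for a, b in zip(tail, window))
        energy = (tail_energy * sum(b * b for b in window)) ** 0.5
        score = dot / energy if energy > 0 else 0.0
        if score > best_score:
            best, best_score = shift, score
    return best


def splice(out, segment, overlap, max_shift):
    """Append `segment` to `out`, shifting it to the best-matching position
    and cross-fading linearly over `overlap` samples."""
    if overlap <= 0 or len(out) < overlap or len(segment) < overlap:
        return out + segment  # nothing to align against
    tail = out[-overlap:]
    aligned = segment[best_shift(tail, segment, max_shift):]
    faded = [
        tail[i] * (1 - i / overlap) + aligned[i] * (i / overlap)
        for i in range(overlap)
    ]
    return out[:-overlap] + faded + aligned[overlap:]
```

For a periodic signal, the search picks the shift at which the new piece continues the waveform already written, so the joint introduces no discontinuity. Shifting changes the length of the joined audio slightly, which a real implementation would need to account for when recording synchronization points.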

As the audio-video synchronization data, a correspondence table is output that indicates, for the data at each time in the speech-rate-converted acoustic data, which time in the original acoustic data it came from. Once this correspondence is known, the moving image data can be compressed or expanded so that the moving image data corresponding to each time in the original acoustic data stays synchronized with the converted acoustic data.

FIG. 5 shows an example of the correspondence table output as the audio-video synchronization data. In FIG. 5, the table records the time stamps of the acoustic data before compression/expansion and the time stamps of the acoustic data after compression/expansion. For example, the table shows that the acoustic data in section "1" has been expanded: its section-end time stamp is 15 seconds later. Likewise, the acoustic data in section "2" has been compressed: its section-end time stamp is 20 seconds earlier.

By compressing or expanding the moving image data to match these increases and decreases in the time stamps, the acoustic data and the moving image data can be synchronized section by section. Therefore, by using the times at which the acoustic data and the moving image data must be synchronized as the section boundaries, the video can be compressed or expanded while guaranteeing synchronization at every time where it is required.
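
A correspondence table of this kind could be built and queried roughly as shown below. The row layout, the function names, and the sample values are assumptions for illustration; in particular, the numbers are not those of FIG. 5, and a single ratio per section is a simplification of the per-frame parameter time series.

```python
from bisect import bisect_right


def build_table(section_ends, ratios):
    """Return rows (section, orig_start, orig_end, new_start, new_end),
    all in seconds, one row per section."""
    rows, orig_start, new_start = [], 0.0, 0.0
    for i, (orig_end, ratio) in enumerate(zip(section_ends, ratios), start=1):
        new_end = new_start + (orig_end - orig_start) * ratio
        rows.append((i, orig_start, orig_end, new_start, new_end))
        orig_start, new_start = orig_end, new_end
    return rows


def original_time(rows, t_new):
    """Map a time in the converted audio back to the original audio,
    interpolating linearly inside the section that contains it."""
    new_ends = [row[4] for row in rows]
    idx = min(bisect_right(new_ends, t_new), len(rows) - 1)
    _, o0, o1, n0, n1 = rows[idx]
    if n1 == n0:
        return o0
    return o0 + (t_new - n0) * (o1 - o0) / (n1 - n0)


# Hypothetical input: two 60-second sections, the first stretched by 1.25,
# the second compressed to half its length.
table = build_table([60.0, 120.0], [1.25, 0.5])
for row in table:
    print(row)
```

Section boundaries map exactly onto each other in both timelines, which is why placing boundaries at the required synchronization times guarantees synchronization there.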

That is, based on the correspondence table, the moving image data reconstruction unit 18 creates moving image data synchronized with the acoustic data by thinning out moving image frames or, conversely, duplicating them.

As for how the moving image data is expanded or compressed, the adjustment may be made by duplicating or deleting moving image frames in scenes where the character's movement is relatively slow, or by duplicating or thinning out moving image frames at a constant rate. The method is not limited to these.
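
The constant-rate variant can be sketched as a lookup of which source frame to show at each output frame time. This is only one reading of the text: the function name, the representation of the table as `(new_time, original_time)` synchronization points, and the fixed output frame rate are assumptions for the example. The motion-dependent variant mentioned above is not covered.

```python
def source_frame_indices(sync_points, fps, num_source_frames):
    """Choose a source frame for every output frame.

    sync_points: (new_time, original_time) pairs in seconds, increasing in
                 both columns (hypothetical form of the correspondence table).
    fps: frame rate shared by the source and the output video.
    Returns a list of source frame indices; repeated indices are duplicated
    frames, skipped indices are thinned-out frames.
    """
    if len(sync_points) < 2:
        raise ValueError("need at least two synchronization points")
    indices = []
    n_out = round(sync_points[-1][0] * fps)
    seg = 0
    for k in range(n_out):
        t_new = k / fps
        # Advance to the section that contains this output time.
        while seg + 2 < len(sync_points) and t_new >= sync_points[seg + 1][0]:
            seg += 1
        (n0, o0), (n1, o1) = sync_points[seg], sync_points[seg + 1]
        t_orig = o0 + (t_new - n0) * (o1 - o0) / (n1 - n0)
        # Small epsilon guards against floating-point values just below
        # an integer frame boundary.
        idx = int(t_orig * fps + 1e-9)
        indices.append(min(idx, num_source_frames - 1))
    return indices


# 1 s of video stretched to 2 s, then 2 s compressed to 1 s, at 4 frames/s.
print(source_frame_indices([(0, 0), (2, 1), (3, 3)], 4, 12))
# [0, 0, 1, 1, 2, 2, 3, 3, 4, 6, 8, 10]
```

In the stretched section every source frame appears twice, and in the compressed section every other frame is dropped, while the frame shown at each synchronization point is the one that belongs to that point in the original.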

As described above, according to this embodiment, when video data composed of synchronized audio data and moving image data is compressed or expanded, synchronization can be achieved reliably and easily even where precise synchronization is required, and the result can be played back as video that looks natural to the viewer.

Next, the processing flow of a program that implements the acoustic data / moving image data synchronous reconstruction apparatus according to the embodiment of the present invention is described. FIG. 6 is a flowchart of that processing.

In FIG. 6, the acoustic data and moving image data to be compressed or expanded are first input (step S61). The input data is then divided into sections according to a fixed criterion, for example at each scene change point (step S62).

Next, the parameter time series is either extracted from the input acoustic data and moving image data or entered manually by the user (step S63).

Then, the acoustic data in each section is divided into equally spaced frames (step S64), and the output frame length for each input frame is obtained from the determined parameter time series (step S65). To synchronize the acoustic data and the moving image data in accordance with these output frame lengths, a correspondence table is output as the audio-video synchronization data, indicating which time in the original acoustic data the data at each time of the compressed/expanded acoustic data came from (step S66).

Both the acoustic data and the moving image data are then reconstructed based on the correspondence table (step S67) and output as video data in which the acoustic data and the moving image data are combined (step S68).

As shown in the examples of recording media in FIG. 7, the recording medium storing the program that implements the acoustic data / moving image data synchronous reconstruction apparatus according to the embodiment of the present invention may be a portable recording medium such as a CD-ROM or floppy disk, another storage device at the far end of a communication line, or a recording medium such as a computer's hard disk or RAM. When the program is executed, it is loaded into main memory and run there.

Likewise, as shown in the examples of recording media in FIG. 7, the recording medium on which the audio-video synchronization data and other data generated by the acoustic data / moving image data synchronous reconstruction apparatus according to the embodiment of the present invention are recorded may be a portable recording medium such as a CD-ROM or floppy disk, another storage device at the far end of a communication line, or a recording medium such as a computer's hard disk or RAM. The data is read by a computer, for example when the acoustic data / moving image data synchronous reconstruction apparatus according to the present invention is used.

FIG. 1 is a diagram illustrating the principle of the acoustic data / moving image data synchronous reconstruction apparatus according to the embodiment of the present invention.
FIG. 2 is a schematic configuration diagram of the acoustic data / moving image data synchronous reconstruction apparatus according to the embodiment of the present invention.
FIG. 3 is a diagram illustrating smoothing of a step input.
FIG. 4 is a diagram explaining the method of obtaining the output frame length.
FIG. 5 is a diagram illustrating an example of the correspondence table output as the audio-video synchronization data.
FIG. 6 is a flowchart of the processing in the acoustic data / moving image data synchronous reconstruction apparatus according to the embodiment of the present invention.
FIG. 7 is a diagram illustrating examples of recording media.

Explanation of symbols

11 Acoustic data input unit
12 Moving image data input unit
13 Data section dividing unit
14 Parameter time series instruction unit
15 Speech rate conversion unit
16 Acoustic data output unit
17 Audio-video synchronization data output unit
18 Moving image data reconstruction unit
19 Moving image data output unit
21 Input frame dividing unit
22 Output frame length determination unit
23 Acoustic data reconstruction unit
31 Step input
32 Smoothed step input
41 Input frame
42 Output frame length (long)
43 Output frame length (short)
71 Storage device at the far end of a communication line
72 Portable recording medium such as a CD-ROM or floppy disk
72-1 CD-ROM
72-2 Floppy disk
73 Computer
74 Recording medium such as RAM or a hard disk on the computer

Claims

An acoustic data / moving image data synchronous reconstruction apparatus that compresses and expands a pair of data consisting of mutually synchronized acoustic data and moving image data, comprising:
a data section dividing unit that divides a series of the acoustic data and a series of the moving image data into a plurality of sections;
a parameter time series instruction unit that specifies, for each of the sections divided by the data section dividing unit, a parameter time series that determines the degree of compression or expansion of the acoustic data and the moving image data;
an input frame dividing unit that further divides the acoustic data divided by the data section dividing unit into equally spaced input frames;
an output frame length determination unit that determines the output frame length of the acoustic data by multiplying the length of the input frames divided at equal intervals by the input frame dividing unit by the proportional constant of the parameter time series determined by the parameter time series instruction unit;
an audio-video synchronization data output unit that outputs, for the acoustic data compressed or expanded based on the parameter time series determined by the parameter time series instruction unit, a correspondence table of synchronization points between the acoustic data before compression/expansion and the acoustic data after compression/expansion;
an acoustic data reconstruction unit that cuts out acoustic data of the output frame length from each input frame and connects the cut-out pieces; and
a moving image data reconstruction unit that compresses or expands the moving image data based on the correspondence table.

The acoustic data / moving image data synchronous reconstruction apparatus according to claim 1, wherein the parameter time series instruction unit receives the parameter time series as manual input from a user.

The acoustic data / moving image data synchronous reconstruction apparatus according to claim 1, wherein the parameter time-series instruction unit extracts the parameter time-series by estimating the importance of each section from the acoustic data.

The acoustic data / moving image data synchronous reconstruction apparatus according to claim 1, wherein the parameter time series instruction unit extracts the parameter time series by estimating the importance of each section from the moving image data.

The acoustic data / moving image data synchronous reconstruction apparatus according to claim 1, wherein the data section dividing unit divides the moving image data into sections according to the contents thereof.

An acoustic data / moving image data synchronous reconstruction method that compresses and expands a pair of data consisting of mutually synchronized acoustic data and moving image data, comprising:
a step of dividing the acoustic data and the moving image data into a plurality of sections;
a step of specifying, for each divided section, a parameter time series that determines the degree of compression or expansion of the acoustic data and the moving image data;
a step of dividing the divided acoustic data into equally spaced input frames;
a step of determining the output frame length by multiplying the length of the divided input frames by the proportional constant of the parameter time series;
a step of outputting, for the acoustic data compressed or expanded based on the determined parameter time series, a correspondence table of synchronization points between the acoustic data before compression/expansion and the acoustic data after compression/expansion;
a step of cutting out acoustic data of the output frame length from each input frame and connecting the cut-out pieces; and
a step of compressing or expanding the moving image data based on the correspondence table.

A computer-readable recording medium on which is recorded a program for causing a computer to execute processing that compresses and expands a pair of data consisting of mutually synchronized acoustic data and moving image data,
the program causing the computer to execute:
a step of dividing the acoustic data and the moving image data into a plurality of sections;
a step of specifying, for each divided section, a parameter time series that determines the degree of compression or expansion of the acoustic data and the moving image data;
a step of dividing the divided acoustic data into equally spaced input frames;
a step of determining the output frame length by multiplying the length of the divided input frames by the proportional constant of the parameter time series;
a step of outputting, for the acoustic data compressed or expanded based on the determined parameter time series, a correspondence table of synchronization points between the acoustic data before compression/expansion and the acoustic data after compression/expansion;
a step of cutting out acoustic data of the output frame length from each input frame and connecting the cut-out pieces; and
a step of compressing or expanding the moving image data based on the correspondence table.