JP2017092832A

JP2017092832A - Reproduction method and reproducer

Info

Publication number: JP2017092832A
Application number: JP2015223504A
Authority: JP
Inventors: Takashi Mori (森 隆志); Yu Takahashi (高橋 祐)
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2015-11-13
Filing date: 2015-11-13
Publication date: 2017-05-25

Abstract

PROBLEM TO BE SOLVED: To provide a technique that makes it easy to edit a multi-view video that does not give the viewer a sense of incongruity.
SOLUTION: A sound identification unit 120 receives, from a moving image information acquisition unit 110, moving image information CAV_N. Each CAV_N includes acoustic information A_N representing a sound and video information V_N representing video of the performer of that sound. The sound identification unit 120 analyzes the acoustic information A_N and identifies the type of the sound. A video arrangement table 213 associates sound identification information, which indicates a sound type, with each of the regions into which the display screen of a display unit 30 is divided. A moving image information output unit 130 refers to this table and specifies the region corresponding to the sound type that the sound identification unit 120 identified for the moving image information CAV_N. It then assigns the video represented by the video information V_N to that region and displays it on the display screen of the display unit 30.
SELECTED DRAWING: Figure 1

Description

The present invention relates to a technique for supporting moving image editing, and more particularly to playback control of a moving image to be edited.

In recent years, multi-view moving images have attracted attention. These are composed of videos of a subject captured from a plurality of viewpoints (hereinafter, multi-view video). A multi-view moving image of this type is built from videos in which a group made up of a plurality of subjects is shot subject by subject. One example is a live video of a band giving a live performance. Such a video is built, for example, from videos in which each band member is shot for the part he or she plays (for example, vocals or guitar). When such a live video is played back, the videos of all members are displayed on the screen at the same time, and the viewer can watch the live video while focusing on a favorite member. Techniques have also recently been proposed for displaying one video selected by the user from among the videos that make up a multi-view video. For example, according to Patent Document 1, the display screen is divided into a main screen and a sub screen. Video data is displayed on the main screen, and multi-view video data captured from a different viewpoint is displayed on the sub screen. By performing a predetermined operation with an operation device, the user can switch the multi-view video data shown on the sub screen to other multi-view video data.

Patent Document 1: JP 2005-159592 A
Non-Patent Document 1: P. Herrera, et al., "Automatic classification of drum sounds: a comparison of feature selection methods and classification techniques," Proc. International Conference on Music and Artificial Intelligence, 2002.
Non-Patent Document 2: P. Herrera, et al., "Automatic classification of musical instrument sounds," Journal of New Music Research, vol. 32, 2003.

When creating or editing a live video, the video of each part's performer (the person producing the sound, for example an instrumentalist or a vocalist) must appear at an appropriate position on the screen during playback. If a performer's video is placed inappropriately, its position becomes inconsistent with the sound image localization of that part's sound, which is played back together with the video, and the viewer feels a sense of incongruity. For example, suppose the guitarist appears on the right side of the screen while the guitar sound from the speakers is heard from the left (that is, its sound image is localized on the left). In that case the viewer will feel that something is wrong. However, editing a multi-view video while paying attention to this point is troublesome.

The present invention has been made in view of the circumstances described above. Its object is to provide a technique that makes it easy to edit a multi-view video that does not give the viewer a sense of incongruity.

The present invention provides a playback method comprising the following steps:
an information receiving step of receiving a plurality of pieces of moving image information, each including at least video information;
an identifying step of analyzing the plurality of pieces of moving image information and identifying a sound type for each piece of video information; and
a display step of referring to a video arrangement table, in which a screen region of the display device's screen is associated with each sound type, and of assigning the video represented by the video information corresponding to each identified sound type to the corresponding screen region and displaying it there.

According to the present invention, the video arrangement table is referred to when the video represented by the video information in the moving image information is displayed on the screen. The video is then displayed in the screen region that corresponds to the type of sound associated with it. Suppose that, in the actual live performance, each part's performer plays the sound listed in the video arrangement table from a position that corresponds to the region associated with that sound. Then the sound image localization position of each sound matches the on-screen position of that part's performer. A live video can therefore be played back without giving the viewer a sense of incongruity. To identify the type of sound contained in the moving image information, a classification algorithm such as the k-NN (k-Nearest Neighbors) method described in Non-Patent Document 1 or Non-Patent Document 2 may be used.

FIG. 1 is a block diagram showing the configuration of a playback device 1 according to a first embodiment of the present invention.
FIG. 2 is a diagram showing the display screen of a display unit 30 divided into a plurality of regions in the same embodiment.
FIG. 3 is a diagram showing an example of a class classification table 214 in the same embodiment.
FIG. 4 is a diagram showing an example of a video arrangement table 213 in the same embodiment.
FIG. 5 is a diagram showing the spatial arrangement of instruments recommended for mixdown.
FIG. 6 is a flowchart showing the sound identification processing executed by a sound identification unit 120 and a moving image information output unit 130 in the same embodiment.
FIG. 7 is a block diagram showing the configuration of a playback device 1A according to a second embodiment of the present invention.
FIG. 8 is a diagram showing an example of a video arrangement table 213A_T in the same embodiment.
FIG. 9 is a diagram showing an example of the display screen of the display unit 30 corresponding to the video arrangement table 213A_T in the same embodiment.
FIG. 10 is a diagram showing another example of the display screen of the display unit 30 divided into a plurality of regions in the same embodiment.

Embodiments of the present invention are described below with reference to the drawings.

<First Embodiment>
FIG. 1 is a block diagram showing the configuration of a playback device 1 according to the first embodiment of the present invention. The playback device 1 is a device with a video playback function, such as a PC (Personal Computer) or a tablet terminal. It forms part of a video editing system that edits moving image information acquired from a video capture device (not shown), and it plays back the moving image information being edited. A user of the video editing system can check the video played back by the playback device 1 while editing the moving image information that will finally be distributed to end users.

The moving image information edited in this embodiment records a live performance by a band. It is time-series data containing video information and acoustic information. The video information represents instrumentalists and singers (hereinafter, performers). Specifically, it represents videos in which each member (performer) of the band is shot for the part he or she plays. The acoustic information represents instrument sounds and singing voices (hereinafter, sounds). Specifically, it represents the sound played by the performer of each part. In this embodiment, the band being shot consists of six parts: electric guitar, vocals, bass, keyboard, bass drum, and chorus. The video shot for each part mostly shows that part's performer, but parts of other performers also appear in it. Likewise, each piece of acoustic information mostly contains the sound of that part's performer, but it also contains sounds produced by performers of other parts.
Hereinafter, the following notation is used for the moving image information obtained by shooting with the video capture device. The moving image information is denoted CAV_N (N = 1 to n, where n is a natural number of 2 or more; n = 6 in this embodiment). The video information contained in CAV_N is denoted V_N (N = 1 to n), and the acoustic information contained in CAV_N is denoted A_N (N = 1 to n).

As shown in FIG. 1, the playback device 1 includes a CPU (Central Processing Unit) 10, a storage unit 20, a display unit 30, a memory interface unit 40, and a sound system 50. The memory interface unit 40 mediates access by the CPU 10 to a recording medium such as an SD memory card. In this embodiment, the moving image information CAV_N obtained by shooting with the video capture device is input to the playback device 1 through the memory interface unit 40. More specifically, a recording medium on which the moving image information CAV_N has been written may be connected to the memory interface unit 40. When this happens, the CPU 10 (more precisely, the moving image information acquisition unit 110) reads the moving image information CAV_N from the recording medium through the memory interface unit 40. It then writes that information to the nonvolatile storage unit 210.

The display unit 30 is, for example, a liquid crystal display. Under the control of the CPU 10, it displays the video represented by each piece of video information V_N on its display screen. The display screen of the display unit 30 is virtually divided into a plurality of regions (seven in this embodiment), and one video represented by video information V_N is displayed in each region. FIG. 2 shows an example of the display screen of the display unit 30 divided into regions. As shown in FIG. 2, the screen is divided into Area 1 to Area 7. The sound system 50 includes speakers for the left and right channels and emits sound under the control of the CPU 10. As described in detail later, the CPU 10 distributes each piece of acoustic information A_N among these speakers and outputs it.

The storage unit 20 includes a nonvolatile storage unit 210 and a volatile storage unit 220. The volatile storage unit 220 is, for example, a RAM (Random Access Memory). The nonvolatile storage unit 210 is, for example, an HDD (Hard Disk Drive) or a flash ROM (Read Only Memory). As described above, the nonvolatile storage unit 210 stores the moving image information CAV_N. It also stores the following in advance: a playback program 211, a sound identification program 212, a video arrangement table 213, and a class classification table 214.

The playback program 211 causes the CPU 10 to implement the basic functions of the playback device 1. These include writing the moving image information CAV_N to the nonvolatile storage unit 210, reading it from there, and playing it back. Playback of the moving image information CAV_N includes displaying the video represented by the video information V_N and outputting the sound represented by the acoustic information A_N contained in that moving image information.

The sound identification program 212 causes the CPU 10 to execute the processing that most clearly embodies the features of the present invention. More specifically, it causes the CPU 10 to perform sound identification processing. In this processing, the CPU 10 analyzes the acoustic information A_N contained in the moving image information CAV_N. It then identifies the type (part) of the sound represented by A_N with reference to the contents of the class classification table 214.

The class classification table 214 stores feature vectors that indicate the acoustic characteristics of each sound type, such as a vocal singing voice or an instrument sound. Each feature vector is associated with the sound identification information for its type. In this embodiment, seven feature vectors are stored for each sound type, one for each of the seven scale degrees do, re, mi, fa, sol, la, and si. Each feature vector is an m-dimensional vector. Its components are the ratios of the signal level of the M-th harmonic (2 ≤ M ≤ m+1) to the signal level of the fundamental frequency component of the corresponding sound. Such feature vectors are used to characterize a sound because its harmonic structure reflects its type. They may be generated by a known method, for example by applying a Fourier transform to waveform data representing the sound and extracting the signal level of each frequency component.
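Written out, the feature vector described above takes the following form. Here P(f) denotes the signal level at frequency f and f_0 the fundamental frequency; both symbols are introduced here for illustration only.

```latex
\mathbf{V} = \left( \frac{P(2f_0)}{P(f_0)},\ \frac{P(3f_0)}{P(f_0)},\ \ldots,\ \frac{P\big((m+1)f_0\big)}{P(f_0)} \right) \in \mathbb{R}^{m}
```

For instance, take m = 3 and hypothetical levels P(f_0) = 1.0, P(2f_0) = 0.5, P(3f_0) = 0.25, and P(4f_0) = 0.1. The vector is then (0.5, 0.25, 0.1).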

In this embodiment, the class classification table 214 stores sound identification information and feature vectors for six sound types: electric guitar, vocals, bass, keyboard, bass drum, and chorus. These feature vectors are denoted VI_J (I = 1 to 7, J = 1 to 6) below. This embodiment uses the following character strings as sound identification information: LBeg for electric guitar, LBbo for vocals, LBba for bass, LBkb for keyboard, LBbd for bass drum, and LBch for chorus. In this embodiment, the sound identification information is stored in the class classification table 214 in association with the feature vectors VI_J. As shown in FIG. 3, the waveform data from which each feature vector VI_J was calculated may also be associated with it. This is waveform data of the sound of the type indicated by subscript J at the scale degree indicated by subscript I.
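One way to hold the class classification table in memory is sketched below. The dataclass, its field names, and the English scale-degree strings are hypothetical. They are chosen only to mirror the 6 × 7 layout described above: label J, scale degree I, feature vector VI_J, and the optional source waveform of FIG. 3.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence, Tuple

# Sound identification information used in this embodiment (J = 1..6).
SOUND_LABELS = ("LBeg", "LBbo", "LBba", "LBkb", "LBbd", "LBch")
# Scale degrees (I = 1..7); these spellings are illustrative only.
SCALE_DEGREES = ("do", "re", "mi", "fa", "sol", "la", "si")


@dataclass(frozen=True)
class ClassTableEntry:
    label: str                                    # sound identification information
    scale: str                                    # scale degree of the reference tone
    feature: Tuple[float, ...]                    # feature vector VI_J
    waveform: Optional[Tuple[float, ...]] = None  # optional source waveform (FIG. 3)


def build_class_table(
    vectors: Dict[Tuple[str, str], Sequence[float]],
) -> List[ClassTableEntry]:
    """Build one entry per (sound type, scale degree); raises KeyError if any is missing."""
    return [
        ClassTableEntry(label, scale, tuple(vectors[(label, scale)]))
        for label in SOUND_LABELS
        for scale in SCALE_DEGREES
    ]
```

Building the table from a complete dictionary produces 42 entries. The KeyError on a missing pair catches an incomplete reference set early.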

The video arrangement table 213 specifies where the video of each part's performer, represented by the video information V_N, is displayed on the display screen of the display unit 30. The position can be given, for example, as the coordinates of the video's upper-left corner in a two-dimensional coordinate system whose origin is the upper-left corner of the display screen. FIG. 4 shows an example of the video arrangement table 213. As shown in FIG. 4, the table stores information indicating a display area of the display unit 30, associated with the sound identification information of each performer's part. More specifically, the areas are associated as follows:
Area 1 with LBeg (electric guitar);
Area 2 with LBbo (vocals);
Area 3 with LBba (bass);
Area 4 with LBkb (keyboard);
Area 5 with LBbd (bass drum); and
Areas 6 and 7 with LBch (chorus).
The video of each part's performer represented by the video information V_N is displayed on the display screen of the display unit 30 according to this video arrangement table 213. This is a distinctive feature of the present invention.

This point is described in detail below. FIG. 5 shows the spatial arrangement of instruments recommended for mixdown. In FIG. 5, the x-, y-, and z-axes represent localization (left-right position), frequency, and depth, respectively. FIG. 5 reflects a known mixing practice for adjusting pan during mixdown. For example, a well-balanced sound is obtained by placing the sound image of the vocals in the center and the sound images of the chorus on the left and right.
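This section does not state which pan law the sound system uses. The sketch below assumes the common constant-power (sine/cosine) pan law. It places mono part signals in the stereo field, for example vocals at the center and chorus toward the sides. The function names and pan values are hypothetical and are not taken from the patent.

```python
import numpy as np


def constant_power_gains(pan):
    """Left/right gains for pan in [-1, 1]: -1 = hard left, 0 = center, +1 = hard right.

    Constant-power law (an assumption): gl**2 + gr**2 == 1 for every pan value.
    """
    theta = (pan + 1.0) * np.pi / 4.0
    return np.cos(theta), np.sin(theta)


def mix_to_stereo(sources, pans):
    """Mix equal-length mono signals into a (2, n_samples) stereo array."""
    left = np.zeros_like(sources[0], dtype=float)
    right = np.zeros_like(sources[0], dtype=float)
    for signal, pan in zip(sources, pans):
        gl, gr = constant_power_gains(pan)
        left += gl * signal
        right += gr * signal
    return np.stack([left, right])
```

With this law, a centered source gets a gain of about 0.707 on each channel, so its total power matches that of a hard-panned source.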

In live performances, the performers of a band's parts usually stand in positions that follow FIG. 5 (or FIG. 4), because playing from such positions yields a well-balanced sound. The band members in the moving image information edited in this embodiment also perform from the positions shown in FIG. 5. As described above, placing the videos of the video information V_N at random on the display screen would create a mismatch. The sound image localization of each part's sound emitted from the sound system 50 would no longer match the on-screen position of that part's performer, and the viewer would feel a sense of incongruity. In this embodiment, therefore, the video of each part's performer is displayed on the display screen of the display unit 30 according to the video arrangement table 213. This presents a suitable display position for each part to the user of the video editing system (that is, the user of the playback device 1). It also prevents editing that would introduce the inconsistency described above.

The CPU 10 serves as the control center of the playback device 1 by executing the programs stored in the storage unit 20 (more precisely, the nonvolatile storage unit 210). In this embodiment, the CPU 10 acts when the playback device 1 is powered on (power supply not shown). It loads the playback program 211 and the sound identification program 212 from the nonvolatile storage unit 210 into the volatile storage unit 220 and executes them in parallel. By executing the playback program 211, the CPU 10 functions as the moving image information acquisition unit 110 and the moving image information output unit 130 shown in FIG. 1. By executing the sound identification program 212, it functions as the sound identification unit 120 shown in FIG. 1.

When a recording medium is connected to the memory interface unit 40, the moving image information acquisition unit 110 reads the moving image information CAV_N (information receiving step). It stores that information in the nonvolatile storage unit 210 and passes it to the sound identification unit 120. Alternatively, the moving image information acquisition unit 110 may read the moving image information CAV_N in response to a user operation. That operation is made on an operation means (not shown) through which the user enters various kinds of information.

On receiving the moving image information CAV_N from the moving image information acquisition unit 110, the sound identification unit 120 executes sound identification processing on the acoustic information A_N contained in it (identifying step). To avoid repetition, the details of this processing are given in the operation example. In outline, the sound identification unit 120 proceeds as follows. It analyzes the sound represented by the acoustic information A_N to generate the feature vector described above. It identifies the type of the performance sound from this feature vector and the contents of the class classification table 214. It then attaches sound identification information indicating the result to the moving image information CAV_N and passes that information to the moving image information output unit 130.

The moving image information output unit 130 receives the moving image information CAV_N from the sound identification unit 120 and extracts the acoustic information A_N and the video information V_N. On receiving the moving image information CAV_N, it reads the video arrangement table 213 from the storage unit 20 (more precisely, the nonvolatile storage unit 210). It mixes the acoustic information A_N and outputs the result to the sound system 50. It also refers to the video arrangement table 213 and specifies the region corresponding to the sound identification information attached to each piece of moving image information CAV_N. It composites the video information V_N so that each video is displayed in its region (display step). Finally, it outputs video information representing the composite to the display unit 30.
The above is the configuration of the playback device 1.
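The table lookup performed by the output unit 130 can be sketched as follows. The dictionary mirrors FIG. 4. This section does not say how several streams with the same label are handled, for example two chorus streams sharing Areas 6 and 7. The first-free-area rule and the error raised when no area remains are therefore assumptions, and the function name is hypothetical.

```python
from typing import Dict, List, Mapping, Sequence

# Video arrangement table 213 as described for FIG. 4: label -> candidate areas.
VIDEO_ARRANGEMENT_TABLE: Dict[str, Sequence[int]] = {
    "LBeg": (1,),    # electric guitar
    "LBbo": (2,),    # vocals
    "LBba": (3,),    # bass
    "LBkb": (4,),    # keyboard
    "LBbd": (5,),    # bass drum
    "LBch": (6, 7),  # chorus
}


def assign_areas(
    identified: Mapping[int, str],
    table: Mapping[str, Sequence[int]] = VIDEO_ARRANGEMENT_TABLE,
) -> Dict[int, int]:
    """Map each stream index N (with its identified label) to a screen area.

    Returns {area: N}. Streams are handled in ascending N, and a label with
    several areas hands them out in table order (an illustrative assumption).
    """
    free: Dict[str, List[int]] = {label: list(areas) for label, areas in table.items()}
    layout: Dict[int, int] = {}
    for n in sorted(identified):
        label = identified[n]
        candidates = free.get(label)
        if not candidates:
            raise ValueError(f"no free area for stream {n} with label {label!r}")
        layout[candidates.pop(0)] = n
    return layout
```

The resulting area-to-stream mapping is what a compositor would need to place each V_N in its region.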

Next, the operation of the playback device 1 is described. As mentioned above, the processing executed by the playback device 1 falls into two broad groups. One is the sound identification processing. The other is basic processing such as playing back, writing, and reading moving image information.

A user of the video editing system connects a recording medium to the memory interface unit 40. The medium contains moving image information CAV_N recorded by a video capture device or the like. The moving image information acquisition unit 110 reads and writes the moving image information CAV_N and then passes it to the sound identification unit 120. On acquiring the moving image information CAV_N, the sound identification unit 120 extracts each piece of acoustic information A_N and executes the sound identification processing. FIG. 6 is a flowchart of the sound identification processing executed by the sound identification unit 120 and the moving image information output unit 130.

The sound identification unit 120 performs the following processing for each piece of acoustic information A_N. First, it divides A_N into frames and applies an FFT (Fast Fourier Transform). Next, it works on each of a set of predetermined frames, for example the first frame or a frame a predetermined time after the first frame. For each such frame, it extracts the pitch (fundamental frequency). It then calculates two kinds of signal level frame by frame. The first is the level of the component at the fundamental frequency f [Hz]. The second is the level of each M-th harmonic component (2 ≤ M ≤ m+1), at frequencies 2f, 3f, ..., (m+1)f [Hz]. Any known pitch extraction technique may be used. Finally, the sound identification unit 120 calculates the ratio of each harmonic's signal level to the signal level at the fundamental frequency. It arranges these ratios to form a feature vector U_N.
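The per-frame steps above can be sketched in Python with NumPy. The patent leaves the pitch extractor open ("any known technique"). The sketch therefore makes several illustrative assumptions: a plain time-domain autocorrelation estimate with an integer lag, a Hann window, and a ±1-bin peak search. None of these is stated as the patent's method. The function names and the fmin/fmax defaults are hypothetical.

```python
import numpy as np


def split_frames(signal, frame_len, hop):
    """Split a mono signal into overlapping frames; the incomplete tail is dropped."""
    count = 1 + (len(signal) - frame_len) // hop
    return np.stack([signal[i * hop : i * hop + frame_len] for i in range(count)])


def estimate_f0(frame, sample_rate, fmin=50.0, fmax=1000.0):
    """Autocorrelation pitch estimate: strongest lag between 1/fmax and 1/fmin seconds."""
    x = frame - np.mean(frame)
    ac = np.correlate(x, x, mode="full")[len(x) - 1 :]  # non-negative lags only
    lo, hi = int(sample_rate / fmax), int(sample_rate / fmin)
    lag = lo + int(np.argmax(ac[lo:hi]))
    return sample_rate / lag  # integer lag, so the estimate is coarse


def harmonic_feature_vector(frame, sample_rate, m):
    """Return (f0, U) where U[M-2] = level(M*f0) / level(f0) for M = 2..m+1."""
    f0 = estimate_f0(frame, sample_rate)
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    bin_hz = sample_rate / len(frame)

    def level(freq):
        k = int(round(freq / bin_hz))
        if k >= len(spectrum):
            raise ValueError("harmonic lies above the Nyquist frequency")
        return spectrum[max(k - 1, 0) : k + 2].max()  # peak within +/-1 bin

    base = level(f0)
    return f0, np.array([level(M * f0) / base for M in range(2, m + 2)])
```

A practical implementation would interpolate the autocorrelation peak for a finer f0, and would average several frames as the text suggests.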

Next, the sound identification unit 120 identifies the attribute of the feature vector U_N (that is, the type of sound represented by the acoustic information A_N) using a classification algorithm based on the k-NN (k-Nearest Neighbors) method (step S100). The sound identification unit 120 sets, in the m-dimensional feature space, a sphere of radius r centered on the end point of the feature vector U_N and containing k feature vectors VI_J (for example, k = 5). More specifically, the sound identification unit 120 counts the feature vectors VI_J contained in the sphere and adjusts the radius r so that exactly k of them lie inside it. The sound identification unit 120 then reads the class classification table 214 from the storage unit 20 (more precisely, the nonvolatile storage unit 210) and, by referring to it, identifies the attribute of each of the k feature vectors VI_J inside the sphere. If these attributes are all the same, the sound identification unit 120 adopts that attribute as the attribute of the feature vector U_N. If they span several types, it adopts the attribute determined by majority vote, that is, the attribute identified most often. For example, if the five feature vectors VI_J inside the sphere comprise three with the attribute electric guitar (LBeg) and two with the attribute vocals (LBbo), the majority vote identifies the attribute of U_N as electric guitar. To identify the attribute of U_N more accurately, the process of step S100 may also be executed on feature vectors U_N obtained from other frames, and the attribute identified most often across those frames adopted as the attribute of U_N.
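A compact sketch of step S100 in Python (standard library only) follows. Growing the sphere around U_N until it holds exactly k labelled vectors selects the same points as taking the k nearest reference vectors (ignoring distance ties), so the sketch sorts by Euclidean distance instead of adjusting a radius. The `references` list stands in for the vectors VI_J and the labels looked up in the class classification table 214; the function names and the tie-breaking rule are assumptions, since the source does not say how tied votes are resolved.

```python
import math
from collections import Counter

def knn_label(u, references, k=5):
    """Classify feature vector u by k-nearest-neighbour majority vote.

    references: list of (vector, label) pairs, standing in for the labelled
    vectors VI_J and their attributes from class classification table 214.
    """
    # The k closest reference vectors = the contents of the smallest sphere
    # around u that contains k of them.
    nearest = sorted(references, key=lambda ref: math.dist(u, ref[0]))[:k]
    votes = Counter(label for _, label in nearest)
    # most_common keeps first-encountered order among equal counts, so a tie
    # goes to the label whose member is closest to u (assumed tie-break).
    return votes.most_common(1)[0][0]

def knn_label_over_frames(frame_vectors, references, k=5):
    """Classify several frames and keep the most frequent label overall."""
    per_frame = [knn_label(u, references, k) for u in frame_vectors]
    return Counter(per_frame).most_common(1)[0][0]
```

With three LBeg vectors and two LBbo vectors nearest to U_N and k = 5, `knn_label` returns "LBeg", matching the example in the text.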

Having identified the attribute of each feature vector U_N, the sound identification unit 120 attaches sound identification information indicating that attribute to the moving image information CAV_N and passes the moving image information CAV_N to the moving image information output unit 130. For example, if the attribute of the feature vector U_1 is electric guitar, the label LBeg is attached to the moving image information CAV_1, which is then passed to the moving image information output unit 130.

On receiving from the sound identification unit 120 the moving image information CAV_N, each piece with its sound identification information attached, the moving image information output unit 130 reads the video arrangement table 213 from the storage unit 20 (more precisely, the nonvolatile storage unit 210) and identifies the region associated with the sound identification information attached to each piece of moving image information CAV_N, that is, the region in which the video represented by the video information V_N is to be displayed (step S110). Next, the moving image information output unit 130 composites the video information V_N so that each video is displayed in its identified region, and outputs video information representing the composited result to the display unit 30 (step S120). As a result, the videos represented by the video information V_N appear on the display screen of the display unit 30 in the regions shown in FIG. 2.
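Steps S110 and S120 amount to a table lookup followed by compositing. The sketch below assumes a hypothetical video arrangement table that stores each region as (left, top, width, height) fractions of the screen, with vocals centred, electric guitar on the left and bass on the right; these coordinates are illustrative and are not taken from FIG. 2. Frames are NumPy arrays, scaling uses simple nearest-neighbour indexing, and all names are hypothetical.

```python
import numpy as np

# Hypothetical video arrangement table: sound label -> region given as
# (left, top, width, height) fractions of the display screen.
VIDEO_ARRANGEMENT_TABLE = {
    "LBbo": (0.375, 0.0, 0.25, 0.5),   # vocals, top centre
    "LBeg": (0.0, 0.25, 0.25, 0.5),    # electric guitar, left
    "LBba": (0.75, 0.25, 0.25, 0.5),   # bass, right
}

def assign_regions(labels, table, screen_w, screen_h):
    """Step S110 sketch: map each stream's sound label to a pixel rectangle."""
    rects = {}
    for stream_id, label in labels.items():
        fx, fy, fw, fh = table[label]
        rects[stream_id] = (round(fx * screen_w), round(fy * screen_h),
                            round(fw * screen_w), round(fh * screen_h))
    return rects

def composite(frames, rects, screen_w, screen_h):
    """Step S120 sketch: paste each frame into its rectangle on a black canvas."""
    canvas = np.zeros((screen_h, screen_w, 3), dtype=np.uint8)
    for stream_id, (x, y, w, h) in rects.items():
        frame = frames[stream_id]
        rows = np.arange(h) * frame.shape[0] // h  # source row per target row
        cols = np.arange(w) * frame.shape[1] // w  # source column per target column
        canvas[y:y + h, x:x + w] = frame[rows][:, cols]
    return canvas
```

A real player would repeat the compositing for every video frame; the region lookup only needs to run when the sound labels change.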

As described above, according to the present embodiment, the video of the performer of each sound identified by the sound identification unit 120 is displayed in the region specified by the video arrangement table 213. As noted earlier, the performers in the band that is the subject of this embodiment play at the standing positions shown in FIG. 5, and the localization positions of the sound images in the acoustic information A_N likewise correspond to the positions shown in FIG. 5. Consequently, the localization position of each sound image emitted from the sound system 50 remains consistent with the position at which that sound's performer appears on the display screen. Therefore, as long as each performer's video is not moved from the display position determined by the playback device 1 when editing the moving image information to be distributed to end users, a multi-view video that does not feel unnatural to end users can be edited easily.

When a user of the video playback system performs an edit, via an operating means (not shown), that changes the localization position of the performance sound represented by the acoustic information A_N and emitted from the sound system 50 (pan adjustment), the CPU 10 may be made to change the display position of the video information V_N (or of each of the video information V_1 to V_6) in accordance with the new localization position. Conversely, when the user performs an edit that changes the display position of the video information V_N, the CPU 10 may be made to adjust the pan of the acoustic information A_N (or of each of the acoustic information A_1 to A_6) in accordance with the new display position.
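One way to couple pan and display position is a linear mapping between the pan range and the horizontal screen axis. The source does not specify any mapping; the sketch below assumes pan values in [-1, 1] (hard left to hard right) and regions expressed as (left, top, width, height) screen fractions, and all function names are hypothetical.

```python
def pan_to_center_x(pan):
    """Map pan in [-1.0 (hard left), 1.0 (hard right)] to a horizontal
    screen position in [0.0, 1.0] (assumed linear mapping)."""
    if not -1.0 <= pan <= 1.0:
        raise ValueError("pan must lie in [-1, 1]")
    return (pan + 1.0) / 2.0

def center_x_to_pan(center_x):
    """Inverse mapping: horizontal centre of a video region -> pan value."""
    if not 0.0 <= center_x <= 1.0:
        raise ValueError("center_x must lie in [0, 1]")
    return 2.0 * center_x - 1.0

def move_region_to_pan(region, pan):
    """Shift a (left, top, width, height) region so that its centre follows
    the new pan, clamping it so the whole region stays on screen."""
    left, top, width, height = region
    new_left = pan_to_center_x(pan) - width / 2.0
    new_left = min(max(new_left, 0.0), 1.0 - width)
    return (new_left, top, width, height)
```

The reverse direction (moving a video and re-panning its sound) would call `center_x_to_pan` on the new region centre, `left + width / 2`.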

<Second Embodiment>
FIG. 7 is a diagram showing the configuration of a playback device 1A according to the second embodiment of the present invention. In FIG. 7, components identical to those in FIG. 1 are given the same reference numerals. As a comparison of FIG. 7 with FIG. 1 shows, the playback device 1A of the present embodiment differs from the playback device 1 of the first embodiment in having a storage unit 20A instead of the storage unit 20. The following description therefore focuses on the storage unit 20A.

The storage unit 20A differs from the storage unit 20 in that it includes a nonvolatile storage unit 210A instead of the nonvolatile storage unit 210. The nonvolatile storage unit 210A differs from the nonvolatile storage unit 210 in three respects. First, it stores a plurality of video arrangement tables (in FIG. 8, video arrangement tables 213A_T, where T = 1 to t and t is a natural number of 2 or more). Second, it stores a class classification table 214A instead of the class classification table 214. Third, it stores a reproduction program 211A instead of the reproduction program 211.

As with the video arrangement table 213 of the first embodiment, the contents of each video arrangement table 213A_T (T = 1 to t, where t is a natural number of 2 or more) correspond to the spatial arrangement of instruments recommended for mixdown. Each video arrangement table 213A_T corresponds to a band (ensemble) with a different lineup. For example, the video arrangement table 213A_1 shown in FIG. 8 contains LBeg, LBbo, LBba, LBkb, and LBpi (the sound identification information indicating a piano) as sound identification information. That is, the contents of the video arrangement table 213A_1 shown in FIG. 8 correspond to a band composed of electric guitar, vocals, bass, and piano. The class classification table 214A contains all of the sound identification information stored in the video arrangement tables 213A_T.

By executing the reproduction program 211A, the CPU 10 functions as the moving image information acquisition unit 110 and a moving image information output unit 130A. From among the video arrangement tables 213A_T, the moving image information output unit 130A selects a table that contains all of the sound identification information attached to the moving image information CAV_N (hereinafter, video arrangement table 213A_t0). Based on the video arrangement table 213A_t0, the moving image information output unit 130A then identifies the region associated with the sound identification information attached to each piece of moving image information CAV_N. FIG. 9 shows examples of the display screen of the display unit 30 for different video arrangement tables 213A_t0. As shown in FIG. 9, the video of the performer of each part represented by the video information V_N is displayed on the display screen of the display unit 30 according to the video arrangement table 213A_t0 selected by the moving image information output unit 130A.
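Selecting the table 213A_t0 is a subset test over sound labels. In the sketch below, tables are plain dictionaries keyed by sound label; LBdr and LBvn are hypothetical labels used only for the example, the function name is hypothetical, and preferring the smallest qualifying table is an assumed tie-break that the source does not specify.

```python
def select_arrangement_table(identified, tables):
    """Pick a video arrangement table whose labels include every label in
    `identified` (the table 213A_t0 of the second embodiment).

    tables: dict of table name -> {sound label: region}.
    Returns the qualifying table with the fewest labels (assumed tie-break),
    or None when no table covers all identified labels.
    """
    wanted = set(identified)
    candidates = [name for name, table in tables.items() if wanted.issubset(table)]
    if not candidates:
        return None
    return min(candidates, key=lambda name: len(tables[name]))
```

Preferring the smallest covering table keeps unused regions, such as a piano slot for a band without a piano, to a minimum.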

Band lineups generally vary from band to band; according to the present embodiment, however, each performer's video can be placed at the display position best suited to the band's lineup and displayed on the display unit 30.

<Other Embodiments>
Various embodiments of the present invention have been described above, but other embodiments are also possible.

(1) In each of the above embodiments, the type of sound represented by the acoustic information A_N is identified by the sound identification unit 120, but the sound identification unit 120 (the sound identification process) may be omitted. In that case, the playback device 1 is provided with an input means such as a keyboard, the user enters information indicating the type of sound represented by the acoustic information A_N via that input means, and the moving image information output unit 130 refers to this information. Because the sound identification unit 120 can be omitted in this aspect, the processing load on the CPU 10 can be reduced.

(2) In each of the above embodiments, the type of sound represented by the acoustic information A_N is identified by executing the sound identification process on the acoustic information A_N. Alternatively, the type of sound may be identified by executing image analysis on the video represented by the video information V_N, or the sound identification process and image analysis may be used in combination. Combining them makes it possible in some cases to identify the type of sound even where neither method alone could, and to identify it with higher accuracy.

(3) In each of the above embodiments, the music genre formed by the sounds may be identified from the types of sound of the acoustic information A_N identified by the sound identification unit 120, and the position on the display screen of each video represented by the video information V_N may be determined according to the identified genre. Specifically, for each music genre, a table defining the position of the video corresponding to each type of sound that makes up music of that genre is stored in advance in the storage unit 20. The sound identification unit 120 then identifies, as the genre of the music formed by the sounds, the genre whose video arrangement table contains exactly the types of sound represented by the acoustic information A_N (no more and no fewer), and the moving image information output unit 130 refers to the video arrangement table for the genre identified by the sound identification unit 120. The positions on the display screen may be determined based on the arrangement generally recommended for that music genre.
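Variation (3) identifies the genre whose table contains exactly the identified sound types. A minimal sketch, assuming genre tables are dictionaries keyed by sound label; the genre names, the LBdr label, and the function name are hypothetical.

```python
def identify_genre(identified, genre_tables):
    """Return the genre whose table holds exactly the identified sound labels.

    genre_tables: dict of genre name -> {sound label: region}.
    Returns None when no genre table matches the label set exactly.
    """
    wanted = set(identified)
    for genre, table in genre_tables.items():
        if set(table) == wanted:  # no missing labels and no extra labels
            return genre
    return None
```

The exact-match test is what distinguishes this variation from the superset selection of the second embodiment.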

(4) In the second embodiment, the sound identification units 120 and 120A may perform the sound identification process each time a predetermined time elapses after display of video on the display unit 30 begins. In this aspect, even if a performer starts playing a different type of instrument during a live performance, the video of the performer of each part represented by the video information V_N can each time be displayed on the display screen of the display unit 30 according to the video arrangement table 213A_T that matches the instruments being played. Furthermore, if the performers change their standing positions each time a predetermined time elapses after the start of the live show, a video arrangement table 213A_T reflecting the new standing positions can be prepared for each change time. Then, each time the predetermined time elapses after display begins, the sound identification process is performed using the video arrangement table 213A_T for that change time, so that the display position of the video of each part's performer represented by the video information V_N follows that performer's movement.
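The time-dependent variant of (4) can be modelled as a schedule of change times, each with its own arrangement table. The sketch assumes change times in seconds measured from the start of display, beginning at 0.0; the function names and the table placeholders are hypothetical.

```python
import bisect

def table_for_time(elapsed_s, change_times_s, tables):
    """Return the arrangement table in effect `elapsed_s` seconds after display starts.

    change_times_s: ascending start times in seconds, beginning with 0.0;
    tables[i] applies from change_times_s[i] until the next change time.
    """
    i = bisect.bisect_right(change_times_s, elapsed_s) - 1
    return tables[max(i, 0)]

def reidentification_times(duration_s, interval_s):
    """Times at which the sound identification process is re-run."""
    count = int(duration_s // interval_s)
    return [k * interval_s for k in range(1, count + 1)]
```

At each time returned by `reidentification_times`, a player would look up the current table with `table_for_time` and rerun the identification against it.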

(5) In the first embodiment, the way the display screen of the display unit 30 is divided may be changed as appropriate. FIG. 10 shows another example of the display screen of the display unit 30 divided into a plurality of regions. In the example shown in FIG. 10, the entire display screen of the display unit 30 is divided into a plurality of rectangular regions. This aspect makes effective use of the display screen of the display unit 30.

(6) In the second embodiment, if there is no video arrangement table 213A_T corresponding to the same lineup as the band that is the subject of the moving image being edited, the moving image information output unit 130A may select, from among the video arrangement tables 213A_T, the table containing the largest number of the sound identification information items attached to the moving image information CAV_N, and determine the display position of the video of each part's performer represented by the video information V_N according to that table.
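The fallback in variation (6) replaces the subset test with a count of shared labels. A sketch, with tables represented as dictionaries keyed by sound label; the function name and the LBdr and LBvn labels are hypothetical, and breaking ties in favour of the first table is an assumption the source does not address.

```python
def select_best_overlap_table(identified, tables):
    """Choose the table containing the largest number of identified labels.

    tables: non-empty dict of table name -> {sound label: region}.
    max() returns the first maximal entry, so ties go to the table that
    appears first in iteration order (assumed tie-break).
    """
    wanted = set(identified)
    return max(tables, key=lambda name: len(wanted.intersection(tables[name])))
```

Labels with no region in the chosen table still need a placement policy (for example, hiding those videos), which the source leaves open.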

(7) In each of the above embodiments, the k-NN method is used as the algorithm for determining the attribute of the feature vector U_N, but another algorithm, such as an SVM (Support Vector Machine), may be used instead.

(8) In the above embodiments, the signal levels of the harmonic components of the sound signal represented by the acoustic information A_N are given as an example of the feature quantity extracted by the sound identification unit 120, but another feature quantity, such as the cepstrum, may be used.

(9) In each of the above embodiments, moving image information CAV_N representing the video and performance sound of each individual performer in the band is input to the playback device 1, but moving image information representing the video and performance sound of the whole band may be input instead. In that case, the following processing may be performed by the units of the playback device 1 and by the user.

That is, the playback device 1 generates, from that moving image information, acoustic information A_N representing each performer's performance sound and video information V_N representing each performer's video, and then executes the processing of the above embodiments. Existing sound source separation techniques, such as independent component analysis, may be used to generate the acoustic information A_N for each performer from the moving image information representing the video and performance sound of the whole band. The video information V_N can be generated from the moving image information, and associated with the acoustic information A_N, by having the user specify, for example, the region each performer occupies in the video and the type of that performer's performance sound (that is, its correspondence to the acoustic information A_N).

In the above embodiments, when the moving image information CAV_N is reproduced, the videos represented by the video information V_N may be synchronized based on the sounds represented by the acoustic information A_N (see JP 2001-36867 A). In this case, synchronizing the pieces of acoustic information A_N with one another based on features of the performance sounds they contain makes it possible to synchronize the videos represented by the corresponding video information V_N.
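The method of the cited publication is not reproduced here. As a swapped-in, commonly used technique, the offset between two recordings of the same performance can be estimated from the peak of their cross-correlation and then converted into a video-frame offset. The function names, sample rate, and frame rate below are assumptions for illustration.

```python
import numpy as np

def estimate_lag(reference, other):
    """Estimate how many samples `other` lags behind `reference`.

    np.correlate in "full" mode returns c[k] = sum_n other[n + k] * reference[n]
    for k = -(len(reference) - 1) .. len(other) - 1, so the argmax index is
    shifted back by len(reference) - 1 to recover the lag k.
    """
    corr = np.correlate(other, reference, mode="full")
    return int(np.argmax(corr)) - (len(reference) - 1)

def lag_to_video_frames(lag_samples, sample_rate, fps):
    """Convert an audio lag in samples to the nearest whole number of video frames."""
    return round(lag_samples * fps / sample_rate)
```

For long recordings, an FFT-based correlation would be much faster than `np.correlate`, which is quadratic in the signal length.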

DESCRIPTION OF SYMBOLS 1, 1A ... Playback device, 10 ... CPU, 110 ... Moving image information acquisition unit, 120 ... Sound identification unit, 130, 130A ... Moving image information output unit, 20, 20A ... Storage unit, 210, 210A ... Nonvolatile storage unit, 211, 211A ... Reproduction program, 212 ... Sound identification program, 213, 213A_T ... Video arrangement table, 220 ... Volatile storage unit, 214, 214A ... Class classification table, 30 ... Display unit, 40 ... Memory interface, 50 ... Sound system.

Claims

A display method comprising:
an information receiving step of receiving a plurality of pieces of moving image information each including at least video information;
an analysis step of analyzing the plurality of pieces of moving image information and identifying a type of sound for each piece of video information; and
a display step of referring to a video arrangement table in which screen areas of a display device screen are associated with respective types of sound, assigning the video represented by the video information corresponding to each identified type of sound to the corresponding screen area of the display device screen, and displaying the video.

A playback apparatus comprising a moving image information output unit that receives a plurality of pieces of video information and, for each piece of video information, refers to a video arrangement table in which screen areas of a display device screen are associated with respective types of sound, identifies the screen area corresponding to the type of sound that corresponds to the video represented by that video information, assigns the video to that area, and causes the display device to display the video.

The playback apparatus according to claim 2, further comprising a sound identification unit that receives a plurality of pieces of moving image information, each including the video information and acoustic information representing a sound corresponding to the video represented by the video information, and identifies the type of sound represented by each of the plurality of pieces of acoustic information by analyzing at least one of the acoustic information and the video information corresponding to it,
wherein the moving image information output unit
identifies the screen area of the video represented by each of the plurality of pieces of video information according to the type of sound identified by the sound identification unit.

The playback apparatus according to claim 3, comprising a plurality of the video arrangement tables that differ in the combination of stored types of sound,
wherein the moving image information output unit selects, from the plurality of video arrangement tables, a video arrangement table containing all of the types of sound identified by the sound identification unit, and identifies the screen area of each video by referring to the selected video arrangement table.

The playback apparatus according to claim 3, wherein the sounds represented by the plurality of pieces of acoustic information constitute a piece of music,
the apparatus comprises a plurality of the video arrangement tables corresponding to different music genres,
the sound identification unit identifies the music genre constituted by the sounds based on the identified types of sound, and
the moving image information output unit selects, from the plurality of video arrangement tables, the video arrangement table corresponding to the music genre identified by the sound identification unit, and identifies the screen area of each video by referring to the selected video arrangement table.