JP2008047998A

JP2008047998A - Moving video reproducer, and moving video reproducing method

Info

Publication number: JP2008047998A
Application number: JP2006219227A
Authority: JP
Inventors: Shigeru Kafuku; 滋加福
Original assignee: Casio Computer Co Ltd
Current assignee: Casio Computer Co Ltd
Priority date: 2006-08-11
Filing date: 2006-08-11
Publication date: 2008-02-28
Anticipated expiration: 2026-08-11
Also published as: JP5050445B2

Abstract

PROBLEM TO BE SOLVED: To extract feature parameters from the sound data recorded together with a moving video, and to synchronize two moving videos on the basis of the similarity of those feature parameters.

SOLUTION: The moving video reproducer comprises: a means (2) for inputting a moving video file with sound; a means (3) for extracting the feature parameters of the sound data in the moving video file with sound; a means (3) for labeling the feature parameters extracted by the extraction means and generating a sound label file that consists of the label information and corresponds to the moving video file with sound; means (5, 6) for storing two moving video files with sound inputted by the input means, and for storing the two sound label files generated by the generation means that correspond respectively to the two moving video files with sound; and means (8, 9) for comparing the label information included in the two sound label files and reproducing the frames of the two moving video files with sound in synchronization.

COPYRIGHT: (C)2008, JPO&INPIT

Description

The present invention relates to a moving image playback apparatus and a moving image playback method, for example an apparatus and a method capable of playing back two moving images of a golf swing or the like simultaneously so that they can be compared.

A known conventional apparatus of this type is the "image recording/reproducing apparatus and image recording/reproducing method" described in Patent Document 1 below, hereinafter referred to as Prior Art 1. In Prior Art 1, the playback start point of each of two moving images is specified manually, and the two images are played back simultaneously with their start points aligned. If, for example, one image is a video of an instructor serving as the model and the other is a video of the student being compared, the differences between the two swings can be recognized visually and an effective lesson can be given.

The drawback of Prior Art 1 is that the playback start points of the two moving images are specified manually, which is undeniably time-consuming and tedious. The "moving image playback method and moving image playback system" described in Patent Document 2 below therefore specifies the playback start points automatically, as follows. This technique is hereinafter referred to as Prior Art 2.

Specifically, Prior Art 2 calculates the similarity of each frame of moving image B to a reference frame of moving image A, determines a referenced frame of moving image B on the basis of that similarity, and, taking the reference frame and the referenced frame as the respective playback start points, plays back moving image A and moving image B simultaneously from those points.

The "similarity" in Prior Art 2 is calculated by a similarity calculation unit (see similarity calculation unit 21 in that document). This unit calculates a similarity S from the color features of each frame of the two moving images. The document also states that the similarity S may be obtained from the absolute difference of the sound intensity data of one frame.

Patent Document 1: Japanese Patent Laid-Open No. 10-145724
Patent Document 2: Japanese Patent Laid-Open No. 8-106543

Prior Art 2 is advantageous in that the playback start point is specified automatically, which reduces effort and simplifies operation. However, it leaves the following problems to be solved.

For example, a golf swing passes through several phases, such as address, backswing, downswing, impact, and follow-through, and the two images must be synchronized at each of these phases. Because the color features do not change much across the phases, the scene belonging to each phase cannot be identified. It is therefore not possible to pick out, for example, the moment of impact in each of the two moving images and synchronize them.

Prior Art 2 does state that "the similarity S is obtained from the absolute difference of the sound intensity data", which suggests that frames in which a loud sound is recorded can be synchronized with each other. A "loud sound", however, includes not only the impact sound but also other sounds (noise such as cheering and applause), so sound intensity data alone does not necessarily synchronize the expected scene, such as the moment of impact.

Accordingly, an object of the present invention is to provide a moving image playback apparatus and a moving image playback method that extract feature parameters of the acoustic data recorded together with a moving image and play back two moving images synchronously on the basis of the similarity of those feature parameters.

The invention according to claim 1 is a moving image playback apparatus comprising: input means for inputting a moving image file with sound; extraction means for extracting feature parameters of the acoustic data of the moving image file with sound; generation means for labeling the feature parameters extracted by the extraction means and generating an acoustic label file that consists of the label information and corresponds to the moving image file with sound; storage means for storing at least two moving image files with sound input by the input means, and for storing the two acoustic label files generated by the generation means that correspond respectively to the two moving image files with sound; and synchronous playback means for comparing the label information contained in the two acoustic label files and playing back the frames of the two moving image files with sound in synchronization.
The invention according to claim 2 is a moving image playback method comprising: an input step of inputting a moving image file with sound; an extraction step of extracting feature parameters of the acoustic data of the moving image file with sound; a generation step of labeling the feature parameters extracted in the extraction step and generating an acoustic label file that consists of the label information and corresponds to the moving image file with sound; a storage step of storing at least two moving image files with sound input in the input step, and of storing the two acoustic label files generated in the generation step that correspond respectively to the two moving image files with sound; and a synchronous playback step of comparing the label information contained in the two acoustic label files and playing back the frames of the two moving image files with sound in synchronization.

In the present invention, two moving images are played back synchronously on the basis of the similarity of the acoustic data recorded together with them. When the invention is applied to the playback of a golf lesson video, for example, the video of the instructor and the video of the student can be synchronized and played back simultaneously using cues such as the impact sound and the wind noise of the swing, so that an effective lesson can be given.

Embodiments of the present invention are described below with reference to the drawings. The specific details, examples, numerical values, character strings, and other symbols given in the following description are for reference only, to make the idea of the invention clear, and the idea of the invention is obviously not limited by all or any part of them. Well-known techniques, procedures, architectures, circuit configurations, and the like (hereinafter "well-known matters") are not described in detail. This is only to keep the description concise and is not intended to exclude all or any part of those well-known matters. Such well-known matters were available to those skilled in the art at the time of filing of the present invention and are naturally included in the following description.

FIG. 1 is a configuration diagram of the moving image playback apparatus according to the embodiment. In this figure, the moving image playback apparatus 1 includes a moving-image-with-sound input unit 2, an acoustic label creation unit 3, a distribution unit 4, a model data storage unit 5, a comparison data storage unit 6, a data reading unit 7, a frame synchronization unit 8, a composite moving image playback unit 9, a display unit 10, and an audio output unit 11.

Each unit is described in detail below. The moving-image-with-sound input unit 2 takes in a moving image file with sound, for example one that records a golf swing. Specifically, it is a video camera or a part having an equivalent function. Alternatively, it is a part that takes in a moving image file with sound, shot separately with a video camera or the like, from a storage means such as a hard disk on which the file is recorded, or through a communication means such as a network.

The moving image files with sound input by the moving-image-with-sound input unit 2 are at least the following two moving images. The first is a model moving image with sound that serves as the model, and the second is a comparison moving image with sound that is compared against the model. Taking a golf swing as an example, the first moving image (the model moving image with sound) records the swing of an instructor or the like together with sound, and the second moving image (the comparison moving image with sound) records the swing of the student together with sound.

The moving image files with sound input by the moving-image-with-sound input unit 2 (the model moving image file with sound and the comparison moving image file with sound) are supplied to the acoustic label creation unit 3 and the distribution unit 4.

The acoustic label creation unit 3 holds acoustic samples to which acoustic labels have been attached in advance.
FIG. 2 is a conceptual diagram of a moving image file and acoustic labels. The upper row of the figure shows the frame images of a moving image file arranged in time order from left to right. A moving image file of a golf swing is used as the example, in which case the frame images can be divided into phases such as address, backswing, downswing, impact, and follow-through.

The middle row of the figure schematically shows the waveform of the acoustic data recorded together with the moving image. The waveform consists of silent portions containing almost nothing but background noise and sounded portions containing, for example, the wind noise of the golf club and the hitting sound at the moment of impact.

In the illustrated example, the silent portion from address to the completion of the backswing is labeled "label 1". The wind noise of the following downswing is labeled "label 2", and the hitting sound at impact is labeled "label 3". The wind noise of the follow-through immediately after impact is labeled "label 4" and "label 5", and the final silent portion of the follow-through is labeled "label 6".

These label names need only be unique within one piece of acoustic data. They may be sequential numbers as in the illustrated example ("label 1" to "label 6"), or they may be manually entered explicit names, or equivalent character strings, that express the meaning of each feature parameter. In the illustrated example, the explicit names are "silA" for label 1, "swing" for label 2, "impact" for label 3, "clubA" for label 4, "clubB" for label 5, and "silB" for label 6. Here "silA" and "silB" denote silence, "swing" the wind noise of the downswing, "impact" the impact sound, and "clubA" and "clubB" the wind noise immediately after impact.

The acoustic label creation unit 3 performs acoustic analysis on a large number of acoustic samples labeled in this way to extract the feature parameters corresponding to each label. For input acoustic data that has no labels, it searches for the portions that correspond to these feature parameters and attaches the labels to them.
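The labeling step described above can be sketched as follows. This is only an illustration under stated assumptions: the source does not say how the search is performed, so the sketch assumes one feature vector per video frame, one template vector per label, and a nearest-template (Euclidean) match. The names `label_frames`, `to_spans`, and `templates` are hypothetical. Consecutive frames with the same label are merged into (start, end, name) spans, where the end of one span equals the start of the next.

```python
from itertools import groupby

import numpy as np


def label_frames(features, templates):
    """Assign to each frame the name of the closest template vector.

    features:  array of shape (n_frames, n_dims), one feature vector per frame.
    templates: dict mapping a label name to a template vector of shape (n_dims,).
    Nearest-template matching is an assumption, not the method of the source.
    """
    names = list(templates)
    centroids = np.stack([templates[name] for name in names])
    # Distance from every frame to every template: shape (n_frames, n_labels).
    dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
    return [names[k] for k in dists.argmin(axis=1)]


def to_spans(frame_labels):
    """Merge runs of identical per-frame labels into (start, end, name) spans."""
    spans, start = [], 0
    for name, run in groupby(frame_labels):
        length = len(list(run))
        spans.append((start, start + length, name))
        start += length
    return spans
```

A real labeler would also need to distinguish repeated sounds, such as the two silent portions "silA" and "silB", for example by their position in the sequence; the sketch does not attempt this.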

Various acoustic analysis methods, that is, methods of extracting feature parameters from acoustic data, are known, such as filter bank analysis and linear predictive coding. The method applied in the acoustic label creation unit 3 is not limited to any particular one. If filter bank analysis is used, for example, the outputs of a bank of band-pass filters spaced at equal intervals on the mel scale are extracted from a spectrum obtained by FFT (Fast Fourier Transform), and these outputs are logarithmically transformed and then inverse Fourier transformed. This yields the feature parameters of the acoustic data known as MFCC (Mel Frequency Cepstrum Coefficients).
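The filter bank analysis described above can be sketched for a single audio frame as follows. This is a minimal illustration, not the implementation of the source. The Hamming window, the triangular filter shape, the filter and coefficient counts, and the use of a DCT-II as the final inverse transform (the usual practical choice for cepstra) are all assumptions, and the function names are hypothetical.

```python
import numpy as np


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular band-pass filters whose centres are equally spaced in mel."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                             n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, centre, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, centre):          # rising slope
            fbank[m - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):         # falling slope
            fbank[m - 1, k] = (right - k) / max(right - centre, 1)
    return fbank


def mfcc(frame, sample_rate, n_filters=20, n_coeffs=12):
    """Return n_coeffs cepstral coefficients for one frame of audio samples."""
    n_fft = len(frame)
    # Power spectrum via FFT (window choice is an assumption).
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(n_fft))) ** 2
    # Mel filter bank outputs, then logarithm.
    energies = mel_filterbank(n_filters, n_fft, sample_rate) @ spectrum
    log_energies = np.log(energies + 1e-10)
    # DCT-II standing in for the inverse Fourier transform of the log spectrum.
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(np.arange(n_coeffs), n + 0.5) / n_filters)
    return basis @ log_energies
```

For example, `mfcc(frame, 16000)` on a 512-sample frame returns a vector of 12 coefficients that can serve as the feature parameters of that frame.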

In this way, the acoustic label creation unit 3 extracts the feature parameters of the acoustic data of each supplied moving image and generates the corresponding acoustic label file. One acoustic label file is generated for each of the supplied moving images (the model moving image file with sound 5a and the comparison moving image file with sound 6a). That is, an acoustic label file for the model moving image file with sound 5a (hereinafter, model acoustic label file 5b) and an acoustic label file for the comparison moving image file with sound 6a (hereinafter, comparison acoustic label file 6b) are generated.

The distribution unit 4 distributes the moving image file with sound input by the moving-image-with-sound input unit 2 (model moving image file with sound 5a or comparison moving image file with sound 6a) and the acoustic label file generated by the acoustic label creation unit 3 (model acoustic label file 5b or comparison acoustic label file 6b) to the model data storage unit 5 or the comparison data storage unit 6, according to the user's designation.

That is, when the user designates "model moving image", the moving image file with sound input by the moving-image-with-sound input unit 2 (in this case the model moving image file with sound 5a) and the acoustic label file generated by the acoustic label creation unit 3 (in this case the model acoustic label file 5b) are supplied to the model data storage unit 5. When the user designates "comparison moving image", the input moving image file with sound (in this case the comparison moving image file with sound 6a) and the generated acoustic label file (in this case the comparison acoustic label file 6b) are supplied to the comparison data storage unit 6.

The model data storage unit 5 and the comparison data storage unit 6 are both large-capacity storage devices such as a hard disk, a nonvolatile semiconductor memory, or a magnetic disk. The figure shows the model data storage unit 5 and the comparison data storage unit 6 as separate bodies, but this only indicates conceptually that the storage space for the model data and the storage space for the comparison data should be independent of each other. They need not be physically separate.

The data reading unit 7 reads the model data and the comparison data from the model data storage unit 5 and the comparison data storage unit 6 in response to a playback instruction from the user. The model data and comparison data that have been read are supplied to the frame synchronization unit 8.

The frame synchronization unit 8 synchronizes the frames of the two moving images (the model moving image file with sound 5a and the comparison moving image file with sound 6a) on the basis of the model data and comparison data read from the model data storage unit 5 and the comparison data storage unit 6. This frame synchronization is performed by comparing and collating the feature parameters of the acoustic label files 5b and 6b that correspond to the moving image files 5a and 6a.

FIG. 3 is a conceptual diagram of the acoustic label files, in which (a) shows the model acoustic label file 5b and (b) shows the comparison acoustic label file 6b. Although not limited to this form, the model acoustic label file 5b and the comparison acoustic label file 6b in the figure are each a text file of six lines, in the order of labels 1 to 6 described above. Each line has the format "Fs Fe Lname", where Fs and Fe are frame numbers of the moving image file corresponding to the acoustic label and Lname is the label name (for example, the explicit name described above). Fs is the start frame number of the sound indicated by Lname, and Fe is the end frame number of that sound.

For example, the first line of the model acoustic label file 5b is "0 38 silA", which means that the frames corresponding to the label name "silA" in the model moving image file with sound 5a run from frame 0 to frame 38. Similarly, the second line of the model acoustic label file 5b is "38 52 swing", which means that the frames corresponding to the label name "swing" in the model moving image file with sound 5a run from frame 38 to frame 52.

The same applies to the comparison acoustic label file 6b. Its first line is "0 52 silA", which means that the frames corresponding to the label name "silA" in the comparison moving image file with sound 6a run from frame 0 to frame 52. Similarly, its second line is "52 64 swing", which means that the frames corresponding to the label name "swing" in the comparison moving image file with sound 6a run from frame 52 to frame 64.
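The "Fs Fe Lname" line format can be read with a few lines of code. The following is a minimal sketch, assuming whitespace-separated fields and integer frame numbers; `LabelSpan` and `parse_label_file` are hypothetical names, and the sample text contains only the two lines of the model acoustic label file that are quoted in the description.

```python
from typing import List, NamedTuple


class LabelSpan(NamedTuple):
    start: int  # Fs: start frame number of the sound
    end: int    # Fe: end frame number of the sound
    name: str   # Lname: label name, e.g. "silA"


def parse_label_file(text: str) -> List[LabelSpan]:
    """Parse lines of the form 'Fs Fe Lname' into LabelSpan records.

    Lines that do not have exactly three fields are skipped.
    """
    spans = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue
        spans.append(LabelSpan(int(parts[0]), int(parts[1]), parts[2]))
    return spans


# The first two lines of the model acoustic label file 5b, as quoted above.
model_spans = parse_label_file("0 38 silA\n38 52 swing\n")
```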

Comparing the two illustrated acoustic label files (the model acoustic label file 5b and the comparison acoustic label file 6b), both list the same label names in the same order, while the start frame number (Fs), the end frame number (Fe), or both differ for each label name. The frame synchronization unit 8 collates the label names of the two acoustic label files and synchronizes the frames of the two moving image files (the model moving image file with sound 5a and the comparison moving image file with sound 6a) so that the start frame (Fs) and end frame (Fe) of each line with the same label name coincide during playback. This makes it possible to play back the frames of interest of the two moving image files (for example, the moment of impact) at the same time.
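The collation of label names can be illustrated as follows. This is a sketch under assumptions rather than the procedure of the source: each label file is represented as a list of (Fs, Fe, Lname) tuples, label names are assumed unique within a file, and the hypothetical function `matching_anchors` pairs the start and end frames of spans that share a name.

```python
def matching_anchors(model, comparison):
    """Pair the frame boundaries of spans that share a label name.

    model, comparison: lists of (start_frame, end_frame, label_name) tuples.
    Returns (name, (model_start, comp_start), (model_end, comp_end)) tuples,
    in the order of the model file.
    """
    comp_by_name = {name: (fs, fe) for fs, fe, name in comparison}
    anchors = []
    for fs, fe, name in model:
        if name in comp_by_name:
            cs, ce = comp_by_name[name]
            anchors.append((name, (fs, cs), (fe, ce)))
    return anchors


# Only the lines quoted in the description are used here.
model = [(0, 38, "silA"), (38, 52, "swing")]
comparison = [(0, 52, "silA"), (52, 64, "swing")]
anchors = matching_anchors(model, comparison)
```

With these values, the end of "swing" pairs frame 52 of the model video with frame 64 of the comparison video, which are the frames that should be displayed together.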

FIG. 4 is a conceptual diagram of frame synchronization. In this figure, the vertical axis is the frame number i of the reference moving image, and the horizontal axis is the frame number j of the moving image whose playback speed is adjusted. If, for example, the reference moving image is the model moving image file with sound 5a and the speed-adjusted moving image is the comparison moving image file with sound 6a, the playback speed of the comparison moving image file with sound 6a is adjusted (by frame thinning or frame interpolation) so that the frame of interest of the model moving image file with sound 5a (for example, the moment of impact) coincides with the corresponding frame of interest of the comparison moving image file with sound 6a.

The solid line in the figure shows the case where the playback speed is not adjusted. In this case, the playback frame numbers of the reference moving image and of the speed-adjusted moving image correspond one to one. The dash-dot line shows the case where the playback speed is adjusted. In this case, for example, frame 5 of the speed-adjusted moving image is played back while frame 4 of the reference moving image is being played back, and likewise frame 5 of the reference moving image is paired with frame 6 of the speed-adjusted moving image, frame 6 with frame 7, frame 7 with frame 8, and so on. The frame numbers of the speed-adjusted moving image are thus offset by one during playback.

In this way, the frame synchronization unit 8 specifies the frame number j of the speed-adjusted moving image that is to be played back in synchronization with the frame number i of the reference moving image. The values of i and j are determined solely on the basis of the feature parameters of the acoustic data of the model moving image file with sound 5a and the comparison moving image file with sound 6a (see labels 1 to 6 above). Put simply, i and j are determined so that the frame of interest of the model moving image file with sound 5a (for example, the moment of impact) coincides with the corresponding frame of interest of the comparison moving image file with sound 6a.
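One way to turn label boundaries into a frame mapping i → j is piecewise-linear interpolation between the boundaries. This is an assumption made for illustration: the description only says that the playback speed is adjusted by frame thinning or frame interpolation. The function name `map_frame` is hypothetical, and the boundary values are taken from the two label-file lines quoted earlier (reference boundaries 0, 38, 52; speed-adjusted boundaries 0, 52, 64).

```python
import numpy as np


def map_frame(i, ref_bounds, adj_bounds):
    """Return the frame j of the speed-adjusted video to show with frame i.

    ref_bounds: increasing label boundary frames of the reference video.
    adj_bounds: the corresponding boundary frames of the speed-adjusted video.
    Frames between boundaries are mapped linearly (an assumption).
    """
    j = np.interp(i, ref_bounds, adj_bounds)
    return int(round(float(j)))


ref_bounds = [0, 38, 52]   # model file: silA 0-38, swing 38-52
adj_bounds = [0, 52, 64]   # comparison file: silA 0-52, swing 52-64

j_at_swing_start = map_frame(38, ref_bounds, adj_bounds)
```

Between frames 0 and 38 of the reference video, the comparison video advances 52 frames, so some of its frames are skipped (thinning); between 38 and 52 it advances only 12 frames, so some frames are repeated or interpolated.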

The composite moving image playback unit 9 combines the frames of the model moving image file with sound 5a and the comparison moving image file with sound 6a into a single screen and plays them back, on the basis of i and j determined by the frame synchronization unit 8. The display unit 10 displays the composite image, and the audio output unit 11 outputs the acoustic data of the reference moving image (the model moving image file with sound 5a or the comparison moving image file with sound 6a).

FIG. 5 shows a display example of the composite image, in which the model image is displayed on the left and the comparison image on the right at the same time. As this display example shows, the present embodiment extracts the feature parameters of the acoustic data recorded together with the moving images and synchronizes the two moving images on the basis of the similarity of those feature parameters. As a result, the moment of impact, for example, can be displayed for both videos on one screen at the same time, the swing of the instructor can be compared with the swing of the student, and a more effective lesson can be given.
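The side-by-side layout can be sketched with array concatenation. This is an illustration only: it assumes each decoded frame is a NumPy array of shape (height, width, channels), and it crops both frames to the smaller height instead of scaling them, which a real player would more likely do. The function name `compose_side_by_side` is hypothetical.

```python
import numpy as np


def compose_side_by_side(model_frame, comparison_frame):
    """Place the model frame on the left and the comparison frame on the right.

    Both inputs are arrays of shape (height, width, channels) with the same
    number of channels. Heights are cropped to the smaller of the two.
    """
    height = min(model_frame.shape[0], comparison_frame.shape[0])
    return np.hstack([model_frame[:height], comparison_frame[:height]])
```

For each reference frame i, the player would pass frame i of the model video and the synchronized frame j of the comparison video to this function and show the result on the display unit.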

The above description gives an example of application to golf lessons, but the invention is not limited to this use. In short, it can be applied to the comparison of any moving images that emit characteristic sounds at key points on the time axis.

In the above description, a plurality of comparison moving image files with sound (in the golf lesson example, a plurality of comparison moving image files with sound for each instructor) are collected and analyzed comprehensively to generate one comparison acoustic label file 6b, but the invention is not limited to this. The analysis may be completed externally in advance so that the apparatus holds only the resulting feature parameters. In this case, the analysis may be performed in any external device (for example, a personal computer) that has a function equivalent to the analysis of the acoustic label creation unit 3 in FIG. 1.

FIG. 1 is a configuration diagram of the moving image playback apparatus according to the embodiment.
FIG. 2 is a conceptual diagram of a moving image file and acoustic labels.
FIG. 3 is a conceptual diagram of the acoustic label files.
FIG. 4 is a conceptual diagram of frame synchronization.
FIG. 5 is a diagram showing a display example of the composite image.

Explanation of symbols

1 Moving image playback apparatus
2 Moving-image-with-sound input unit (input means)
3 Acoustic label creation unit (extraction means, generation means)
5 Model data storage unit (storage means)
5a Model moving image file with sound (moving image file with sound)
5b Model acoustic label file (acoustic label file)
6 Comparison data storage unit (storage means)
6a Comparison moving image file with sound (moving image file with sound)
6b Comparison acoustic label file (acoustic label file)
8 Frame synchronization unit (synchronous playback means)
9 Composite image playback unit (synchronous playback means)

Claims

1. A moving image playback apparatus comprising:
input means for inputting a moving image file with sound;
extraction means for extracting feature parameters of the acoustic data of the moving image file with sound;
generation means for labeling the feature parameters extracted by the extraction means and generating an acoustic label file that consists of the label information and corresponds to the moving image file with sound;
storage means for storing at least two moving image files with sound input by the input means, and for storing the two acoustic label files generated by the generation means that correspond respectively to the two moving image files with sound; and
synchronous playback means for comparing the label information contained in the two acoustic label files and playing back the frames of the two moving image files with sound in synchronization.

2. A moving image playback method comprising:
an input step of inputting a moving image file with sound;
an extraction step of extracting feature parameters of the acoustic data of the moving image file with sound;
a generation step of labeling the feature parameters extracted in the extraction step and generating an acoustic label file that consists of the label information and corresponds to the moving image file with sound;
a storage step of storing at least two moving image files with sound input in the input step, and of storing the two acoustic label files generated in the generation step that correspond respectively to the two moving image files with sound; and
a synchronous playback step of comparing the label information contained in the two acoustic label files and playing back the frames of the two moving image files with sound in synchronization.