JP2004187043A

JP2004187043A - Video processor

Info

Publication number: JP2004187043A
Application number: JP2002352164A
Authority: JP
Inventors: Hiroki Yoshimura; 宏樹吉村; Kazuki Hirata; 和貴平田
Original assignee: Fuji Xerox Co Ltd
Current assignee: Fujifilm Business Innovation Corp
Priority date: 2002-12-04
Filing date: 2002-12-04
Publication date: 2004-07-02
Anticipated expiration: 2022-12-04
Also published as: JP4228673B2

Abstract

PROBLEM TO BE SOLVED: To realize video control processing based on the natural movement of a subject in video, without the need to fit the subject with special equipment.

SOLUTION: A body feature extraction unit 2 extracts a characteristic behavior C of the subject M from video input by a video input unit 1 such as a video camera. When a matching unit 4 determines that the extracted behavior C matches a gesture model J set in a model description unit 3, a cut point extraction unit 5 has a video editing unit 6 perform an edit that deletes the corresponding part of the video, using the point of the match as a start point or end point, and has a video storage unit 7 store the edited video.

COPYRIGHT: (C)2004,JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】
本発明は、ビデオカメラなどで撮影される映像に入力制御や編集といった映像制御処理を加える技術に関し、特に、当該制御処理を映像中の被写体の行動態様から直接特定した特徴的な動作に基づいて行う技術に関する。
【０００２】
【従来の技術】
例えば、企業活動において、会議や講演会をビデオカメラで撮影して録画し、後に録画した映像を利用することが行われているが、企業の機密情報を含む会議などについては撮影された映像中に部分的に記録を残せない場面がある。このような場面では、映像の被写体となる話者としては、機密情報を含む場面を記録に残さないようにするために、意図的に自ら映像の録画を止めたい場合もある。
従来は、ビデオカメラを操作するカメラマンに指示して映像の撮影を中止させたり、または、撮影後に編集者が映像中の該当部分を削除する編集作業を行って、上記事情に対処していた。
【０００３】
ここで、上記事情の対処技術として、以下に説明するように、本発明の着想に照らせば、被写体となる話者の身体動作による行動態様（例えば、ジェスチャー）や音響出力動作による行動態様（例えば、「カット」などの所定の音声出力）によって、撮影される映像に入力制御や編集処理と言った映像制御処理を行えるようにするのが、人間の自然な動作による映像制御処理がなされて、会話などの連続性を妨げることなく実用上は極めて有効であると考える。
しかしながら、従来にあっては、このような技術は実現されてはおらず、その着想すらなされていなかった。
【０００４】
従来、ジェスチャーによる機器制御として、以下のような技術が知られている。
撮影者が腕と手に身体の動きを示す筋電信号を検出する装置を装着し、撮影者の動作を筋電信号として検出することによって、カメラをコントロールする電子カメラシステムが知られている。この電子カメラシステムでは、カメラ制御コマンドは、手首から先の手や指の動きおよび二の腕の動きの組み合わせからなる所定のジェスチャーに関連付けられており、撮影者は腕と手を動かすだけでその動きが筋電信号を検出する装置の信号検出部によって検出され、当該ジェスチャーに定義付けられているコマンドによってカメラが制御される（特許文献１参照）。
【０００５】
また、人間の身振りや手振りなどから抽出した固有の特徴パラメータを利用する各種の技術も知られている。
動画キャラクターの動作を自然にするため、動作者のジェスチャーを映像から取得し、動作を再生可能な基準にパラメータ化して、パラメータにタグを付け記憶し、キャラクター動作に利用するパフォーマンス動画ジェスチャーの取得及び動画キャラクター上での再生方法及び装置が知られている（特許文献２参照）。
【０００６】
映像中のオブジェクトに対するモーション情報を効率的に記述するために、モーションヒストグラムを累積した累積モーションヒストグラムを生成し、映像中の被写体に対するモーション情報を効率的に記述したモーションディスクリプタを生成し、ビデオ検索に利用する累積モーションヒストグラムを利用したモーションディスクリプタ生成装置及びその方法が知られている（特許文献３参照）。
【０００７】
また、ジェスチャー認識を行うために、モデルを構築して、映像中に連続するフレーム画像から精度よく被写体の動作と構造を推定する技術も知られている。
動画像を構成する複数のフレーム画像の各々をベクトル空間上の１つの点とみなし、当該点の動作軌跡をジェスチャーの種類毎の特徴パラメータとし、当該抽出された特徴パラメータと基準パターンの特徴パラメータとを比較することにより、ジェスチャー認識を行うジェスチャ動画像認識方法が知られている（特許文献４参照）。
【０００８】
映像中において被写体により行われるジェスチャーに関して、問いかけ（身を乗り出す）または同意（うなづく）など、ジェスチャーの意味的な単位に付与される意味ラベルを構築し、意味ラベルからジェスチャーの意味を抽出して、ジェスチャーの開始時刻と終了時刻を記述したスクリプト生成を行うジェスチャ映像再構成方法および装置およびその方法を記録した記録媒体が知られている（特許文献５参照）。
【０００９】
動画像を構成する複数の画像フレームを入力し、画像フレーム間における少なくとも３つの特徴点の位置の変化から画像フレーム間のアフィン変形を推定して、対象物体の動きおよび構造を検出する動画像処理装置が知られている（特許文献６参照）。
【００１０】
【特許文献１】
特開２０００―１３８８５８号公報
【特許文献２】
特開２００１―２２９３９８号公報
【特許文献３】
特開２０００―２２２５８６号公報
【特許文献４】
特開平９―２４５１７８号公報
【特許文献５】
特開平１１―２３８１４２号公報
【特許文献６】
特開平６―８９３４２号公報
【００１１】
【発明が解決しようとする課題】
しかしながら、従来では以下のような種々の問題があった。
上述した従来技術の共通な問題点として、利用者にジェスチャーを認識するための特別な装置や器具を装着することなしに、あるいは、撮影装置の操作用リモコンを用いることなしに、利用者が撮影中に映像の制御処理操作をすることができないという問題があった。
【００１２】
より具体的には、特許文献１に記載される技術では利用者はカメラを操作するための認識装置を体に装着しなければならず、また、特許文献２に記載される技術ではジェスチャーを識別するためセンサを利用者が体に装着しなければならず、利用者の自然な状態での行動を妨げたり、これら特別な装備を用意しなければならないものである。
また、特許文献３乃至６に記載される技術は、被写体のジェスチャーを認識するための技術を開示するだけで、映像の入力制御や編集制御に応用可能な要素技術でしかない。
【００１３】
本発明は上記従来の事情に鑑みなされたものであり、被写体に特段の装備を施す必要をなくして、映像中の被写体の自然な動作に基づく映像制御処理を実現することを目的としている。
なお、本発明の更なる目的は以下の説明において明らかなところである。
【００１４】
【課題を解決するための手段】
本発明に係る映像処理装置は、ビデオカメラなどの映像入力手段によって入力された映像から、映像制御処理手段が、被写体の身体動作又は音響出力動作による特徴的な行動態様を特定し、所定の特徴的行動態様に基づいて映像に対して対応する制御処理を加えるための制御信号を出力する。
そして、本発明に係る映像処理装置は、上記制御信号応答して、映像編集手段がある場面を削除するなどと言った対応する編集処理を映像入力手段から入力されてメモリなどに蓄積される映像に加える、又は、映像入力制御手段がビデオカメラをフェードアウトさせるなどと言った対応する制御を映像入力手段からの映像入力に加える映像制御処理を行う。
【００１５】
したがって、被写体に制御用の特段の装備を行う必要なく、映像中の被写体の自然な動作による行動態様に基づいて、映像編集制御や映像入力制御などといった映像制御処理を行うことができる。すなわち、このような映像制御処理を、映像中の被写体画像自体から直接抽出した特徴や、映像中の被写体音響自体から直接抽出した特徴に基づいて行うことができる。
【００１６】
より具体的には、本発明に係る映像処理装置は、映像制御処理手段として、入力された映像から被写体の身体動作又は音響出力動作による特徴的な行動態様を特定する特徴量抽出手段と、身体動作又は音響出力動作による行動態様モデルを保持したモデル記憶手段と、特徴量抽出手段により特定された行動態様とモデル記憶手段に保持された行動態様モデルとの整合性を判定する整合手段と、を有し、整合手段による判定結果に基づいて前記制御信号を出力する。
このように本発明は種々な態様の映像処理装置として把握されるが、例えば当該映像処理装置を動作させることにより実施される映像制御方法や、当該映像処理装置をコンピュータにより実現するプログラムとしても把握される。
【００１７】
【発明の実施の形態】
本発明を実施例に基づいて具体的に説明する。
図１には本発明の第１実施例に係る映像処理装置の構成を示してある。
ここで、本例は、映像中の被写体のジェスチャー（身体動作による行動態様）に基づいて入力された映像に場面削除などの編集処理を加えるものであるが、本発明は、映像中の被写体のジェスチャー（身体動作による行動態様）に基づいて入力映像の切替えなどの制御処理を行う、又は、映像中の被写体の音声出力内容（音響出力動作による行動態様）に基づいて上記のような編集処理若しくは入力映像制御を行う、又は、これらを組み合わせるなどと言った種々な実施形態をすることができる。
【００１８】
図１に示すように、本例の映像処理装置は、映像入力部１、身体特徴量抽出部２、ジェスチャー・モデル記述部３、整合部４、カット点抽出部５、映像編集部６、映像蓄積部７を備えている。
なお、上記した種々な実施形態についても同様であるが、例えば本発明の主要な機能である映像制御処理手段（本例では、身体特徴量抽出部２、整合部４、カット点抽出部５）をコンピュータに本発明に係るプログラムを実行させることにより構成してもよく、これによって、入力された映像から被写体の特徴的な行動態様を特定して、所定の特徴的行動態様に基づいて映像に対して対応する制御処理を加えるための制御信号を出力するようにしてもよい。
【００１９】
映像入力部１は、ビデオカメラ、ビデオキャプチャ装置を備えたコンピュータ、又は、他で撮影された映像信号を入力するインタフェースなどによって構成され、被写体である後援者を含む会議の映像（動画像）を撮像又は装置内に取り込む。
撮像された映像は図２に示すように連続する多数の画像フレーム９からなる動画像データであり、撮影時に、各画像フレームには識別子として順次フレーム番号（００１、００２・・）と映像時間情報（ｔ１、ｔ２・・）が付加される。
【００２０】
身体特徴量抽出部２には映像入力部１から画像フレーム単位で映像が入力され、図３に示すように、身体特徴量抽出部２は、被写体Ｍを含む入力された画像フレーム９から（同図（ａ））、身体特徴量（身体動作による特徴的な行動態様を表す線分モデルＣ）を識別し（同図（ｂ））、当該身体特徴量Ｃを抽出し（同図（ｃ））、抽出された身体特徴量Ｃを整合部４に出力する。
【００２１】
なお、後述するように、身体特徴量Ｃは図４に示すように、被写体Ｍのジャスチャーを特徴付けて表す線分Ｌ１〜Ｌ７からなる線分モデルとして処理される。より具体的には、本例では、図３（ａ）に示すように被写体Ｍが両腕を広げて、その両手の二本の指をＶ字型に開いた行動態様（蟹を模したジェスチャーであるので、カニモデルとも称せられる）を編集処理の制御タイミングに利用しているため、線分モデルＣは、図４に示すように、被写体Ｍの二本の指に対応する部位線分データ（Ｌ１及びＬ２とＬ６及びＬ７）、二本の腕に対応する部位線分データ（Ｌ３とＬ５）、頭部及び胴体部に対応する部位線分データ（Ｌ４）である。
【００２２】
身体特徴量抽出部２は、抽出した身体特徴量Ｃとともに、図５に示すように抽出した画像フレームのフレーム番号１１と抽出した身体特徴量Ｃを識別する身体特徴量ＩＤ１１を整合部４へ出力する。
ジェスチャーモデル記述部３は、所定のジェスチャーモデルデータを記憶したメモリ及び当該データに基づいて図６に示すようなジェスチャーモデルＪを記述する機能を有しており、身体特徴量抽出部２から抽出された身体特徴量Ｃが出力されたことに応じて、当該身体特徴量Ｃとマッチング評価するためにジェスチャーモデルＪを記述して当該ジェスチャーモデルＪを整合部４に出力する。
【００２３】
整合部４は、身体特徴量抽出部２から入力した身体特徴量Ｃとジェスチャーモデル記述部３から入力したジェスチャーモデルＪを比較し、身体特徴量ＣとジェスチャーモデルＪがマッチングしているか否かを判定し、図７に示すように、画像フレーム毎の判定結果（整合又は不整合）１２をフレーム番号１０に対応付けてカット点抽出部５に出力する。
【００２４】
カット点抽出部５は、映像中の削除編集（カット）を行う範囲を決定して、当該決定結果に応じた制御信号を映像編集部６に出力する。具体的には、カット点抽出部５は、整合部４から入力した判定結果１２に基づいて、判定結果１２が身体特徴量ＣとジェスチャーモデルＪとが整合しているものであるときには、映像データ中のカット開始時刻及びカット終了時刻に対応するフレーム番号１０を取得し、これらフレーム番号で挟まれた部分を映像データ中から削除させる指示を映像編集部６に出力する。
【００２５】
なお、この指示はフレーム番号に代えて映像中の画像フレーム時刻を直接指定した制御信号で行ってもよく、また、この指示は、カット開始時刻を指示して映像編集部６が開始時刻から所定時間後にカット処理を終了するようにしたり、或いは、カット終了時刻を指示して映像編集部６が終了時刻から所定時間遡った時刻からカット処理を開始するようにしてもよい。
【００２６】
映像編集部６は、カット点抽出部５から入力した制御情報に基づいて、映像入力部１から入力された映像データを編集して、当該映像編集済データを映像蓄積部７に出力し、メモリからなる映像蓄積部７のデータベースに格納させる。
具体的には、カット点抽出部５からの制御情報に応答して、映像データの該当映像部分を削除して映像蓄積部７に格納させる。
【００２７】
次に、本例の身体特徴量抽出に関する処理について詳しく説明する。
図８には身体特徴量抽出部２が行う本例の身体特徴量抽出の処理手順を示してあり、まず、画像入力部１から画像フレームが入力されると当該画像フレームを二値化して、図９に示すように、当該二値化画像データ１６を身体特徴量抽出部２の作業領域として映像処理装置に設けられている内部メモリ１５に保持する（ステップＳ１）。
【００２８】
そして、身体特徴量抽出部２が二値化画像データ１６から、被写体Ｍのいわゆる細線化画像データ１７を抽出し、さらに、細線化画像データ１７から線分分割をして、二本の指、二本の腕、頭部及び胴体部に対応する部位線分データ１８を抽出し、当該部位線分データ１８を内部メモリ１５に保持して、身体特徴量抽出処理を終了する（ステップＳ２）。なお、この処理は画像入力部１から画像フレームが入力される毎に繰り返し行われ、内部メモリ１５に保持された部位線分データ１８は特徴量Ｃとして上記のように整合部４へ順次出力される。
【００２９】
すなわち、身体特徴量抽出処理は、図３（ａ）に示すように被写体Ｍの画像を抽出して二値化し、同図（ｂ）に示すように例えば被写体Ｍの外郭を成す稜線間の中央位置を結ぶことにより骨格線を抽出し、更に、同図（ｃ）、詳しくは図４に示すように二本の指、二本の腕、頭部及び胴体部に該当する部位線分データに分解する処理である。
【００３０】
次に、ジェスチャーモデルＪについて説明する。
図６は、ジェスチャーモデルＪの概念を示しており、本例のジェスチャーモデルＪは、二本の指、二本の腕、頭部及び胴体部を特定する記述で構成されている。右腕はＲ＿ＡＲＭ、右手の第一の指をＲ＿ＦＩＮＧＥＲ１、右手の第二の指をＲ＿ＦＩＮＧＥＲ２とする。左腕はＬ＿ＡＲＭ、左手の第一の指をＬ＿ＦＩＮＧＥＲ１、左手の第二の指をＬ＿ＦＩＮＧＥＲ２とする。胴体を含む頭部をＨＥＡＤとする。
【００３１】
すなわち、体全体をＢＯＤＹとすると、ジェスチャーモデルＪは次のような組み合わせによる記述である。
ＢＯＤＹ（ｔ）：＝（ＨＥＡＤ（ｔ），Ｒ＿ＡＲＭ（ｔ），Ｌ＿ＡＲＭ（ｔ））
Ｒ＿ＡＲＭ（ｔ）：＝（Ｒ＿ＦＩＮＧＥＲ１（ｔ），Ｒ＿ＦＩＮＧＥＲ２（ｔ））
Ｌ＿ＡＲＭ（ｔ）：＝（Ｌ＿ＦＩＮＧＥＲ１（ｔ），Ｌ＿ＦＩＮＧＥＲ２（ｔ））
【００３２】
ここで、本例のジェスチャーモデルＪは、モデルの時間的変化も表す時間パラメータｔを含んで記述されている。これは、ジェスチャーモデルＪの形態に時間変化を与えることによって、時間と共に変化している被写体Ｍの動作の内の或る画像フレームがジェスチャーモデルＪに整合すれば編集処理を行うようにするためであり、これによって、整合性検出の幅をもたせることにより被写体Ｍたる講演者が所期の動作指示を行い易くしている。
【００３３】
図１０には二本の指の開閉動作の概念を示すが、Ｒ＿ＦＩＮＧＥＲ１ならびにＲ＿ＦＩＮＧＥＲ２（Ｌ＿ＦＩＮＧＥＲ１ならびにＬ＿ＦＩＮＧＥＲ２）は、接続部を中心にそれぞれ時間が経つごとに離合する。具体的には、同図（ａ）に示す或る時刻ｔ１において、Ｒ＿ＦＩＮＧＥＲ１（ｔ１）およびＲ＿ＦＩＮＧＥＲ２（ｔ１）の二本の指がＲ＿ＡＲＭとの接合点を中心に或る角度（例えば、約３０度）まで開いており、同図（ｂ）に示すその後の或る時刻ｔ２においては、Ｒ＿ＦＩＮＧＥＲ１（ｔ２）およびＲ＿ＦＩＮＧＥＲ２（ｔ２）の二本の指が、Ｒ＿ＡＲＭの接合点中心に、閉じていることを表している。
したがって、部位線分データがＲ＿ＦＩＮＧＥＲ１などジェスチャーモデルＪの要素に対して、それぞれが対応していれば整合部４によって整合したと判定される。
【００３４】
次に、整合部４による部位線分データとジェスチャーモデルＪとの整合判定処理について説明する。
整合部４は、図４に示すようなそれぞれの部位線分データ（Ｌ１〜Ｌ７）が、図６に示すようなＲ＿ＦＩＮＧＥＲ１などのジェスチャーモデルＪの要素に対して対応していれば、整合したと判定する。
【００３５】
具体的には、図４に示すように、Ｌ１、Ｌ２およびＬ３は一点で接続され、また、Ｌ３、Ｌ４およびＬ５は一点で接続され、さらに、Ｌ５、Ｌ６、およびＬ７は一点で接続されている。
このＬ１、Ｌ２およびＬ３はＲ＿ＦＩＮＧＥＲ２、Ｒ＿ＦＩＮＧＥＲ１およびＲ＿ＡＲＭに対応し、Ｌ３、Ｌ４およびＬ５はＲ＿ＡＲＭ、ＨＥＡＤ、Ｌ＿ＡＲＭに対応し、Ｌ５、Ｌ６およびＬ７はＬ＿ＡＲＭ、Ｌ＿ＦＩＮＧＥＲ１、Ｌ＿ＦＩＮＧＥＲ２に対応する。
これらの部位線分データＬ１〜Ｌ７とジェスチャーモデルＪの各要素の接続関係や位置の対応が取れた場合に、身体特徴量ＣとジェスチャーモデルＪが整合したと判断される。
【００３６】
これらの身体特徴量ＣとジェスチャーモデルＪの整合判定が、連続した画像フレーム９に対して順次行われ、整合が取れた画像フレーム９をカット点（削除編集点）の候補とする。
ここで、ジェスチャーモデルＪに時間幅をもたせた本例では特に、整合の取れた画像フレーム９が複数連続して候補とされる場合が想定されるが、これら複数の画像フレームの中から、先頭のもの、最後のもの、真中のものなどといったように、カット点を規定する画像フレーム画像を特定して、当該画像フレーム番号を用いて映像編集処理を行えばよい。
【００３７】
なお、本発明では、映像中のジェスチャーモデルＪに整合する画像フレーム位置に基づいて、当該位置から所定時間の映像部分を削除する、あるいは、当該位置から所定時間遡った位置から当該位置までの映像部分を削除する、あるいは、整合する画像フレームが連続する映像部分を削除する、あるいは、図１１に示すように、映像中の整合する或る画像フレームＡの位置から削除処理を開始して次にまた整合する画像フレームＢが現れたところで削除処理を終了するなどと言ったように、種々な態様で編集処理範囲を設定すればよい。
【００３８】
上記の例は被写体Ｍのジェスチャーと言う身体動作による特徴的な行動態様に基づいて編集処理を行うようにしたが、本発明は、映像中で被写体が発した音声などの音響的出力動作による特徴的な行動態様に基づいて編集処理を行うようにしてもよい。
例えば、図１２に示すように、映像に映像データ（動画トラック）２０と被写体が発した音声データ（音声トラック）が含まれている場合、各音響トラック２１ａ、２１ｂの開始点や終了点、或いは、音響トラック２１ａ、２１ｂの切換え点を編集処理の開始や終了の位置として利用するようにしてもよい。
【００３９】
具体的には、発言毎に音響トラックを異ならせておき、上記の開始点などを編集処理の制御タイミングの候補としてジェスチャーモデル整合と併せて利用することができる。
例えば、音声トラックの開始時刻をカット点とする場合には、身体特徴量ＣとジェスチャーモデルＪとの整合によって特定されたカット点をカット終了点として映像を削除編集すればよく、また、音声トラックの終了時刻をカット点とする場合には、身体特徴量ＣとジェスチャーモデルＪとの整合によって特定されたカット点をカット開始点として映像を削除編集すればよい。
【００４０】
なお、音声トラックの連続性を検出する方法としては、図１３に示すような無音区間に基づく方法を採用することができる。
音声データの連続性を検出する場合、音量レベルを計測して、音量レベルが既定の閾値以下になったときには、その時間を無音状態時間Δｔとして検出する。例えば、或る映像データにおいて、音声トラックＡｕｄｉｏａの直後に無音状態時間Δｔ、続いて音声トラックＡｕｄｉｏｂが検出された場合、無音状態時間Δｔが所定時間（例えば、５秒）以内であれば、音声トラックＡｕｄｉｏａとＡｕｄｉｏｂは連続した音声トラックとして取り扱い、所定時間を上回る時にはこれらを異なる音声トラックとして取り扱うようにすればよい。
【００４１】
さらに、本発明において、映像データに含まれる音声などの被写体が発した音響データをより積極的に編集処理に利用する場合には、例えば、図１に示した装置構成において、身体特徴量抽出部２を映像中に含まれる音響データをその映像中の時刻情報と共に抽出するものとし、ジェスチャーモデル記述部３を図１４に示すように所定の語句（例えば、「カット」）の音声波形モデルを記述したものとし、整合部４を映像中から抽出された音響データと音声波形モデルとの整合性を判定するものとして、被写体である講演者が所定の音声を発したこと及び発した時点に応じて、カット点抽出部５が編集処理の内容や範囲を決定して、映像編集部６に制御信号を出力して対応する編集処理を映像入力部１から入力された映像に施すようにしてもよい。
【００４２】
また、上記の説明では、映像中の被写体による身体動作や音響出力動作に基づいて、入力されて蓄積される映像に編集処理を施すようにしたが、本発明では、これら映像中の被写体による身体動作や音響出力動作に基づいて、例えば映像入力部１を構成するビデオカメラを他のビデオカメラに切換えると言った映像入力制御を行うようにしてもよい。
【００４３】
この場合には、例えば、図１５に示すような装置構成として、上記と同様にして整合部４で特定された画像フレーム位置を制御点抽出部３５が制御の開始や終了の位置として利用し、これに基づいて映像制御部３６が映像入力部１へ映像入力の態様を変更する制御信号を出力するようにすればよい。
なお、図１５には図１に示した構成と同一部分には同一符号を付してある。また、図１５にはジェスチャーモデルによる整合処理の例を示すが、被写体の発した音響出力による整合処理についても上記と同様に適用できる。
【００４４】
このように映像入力の制御を行うことにより、被写体となる講演者はカメラマンに指示を与えずとも、自らのジェスチャーや発言内容などに基づいて、複数台あるビデオカメラを切換えて異なるアングルの映像を撮影する、ビデオカメラをフェードアウトさせたり入力映像にモザイク掛け処理を行って映像中に機密資料が明瞭に映らないようにする、映像入力部１の入力経路を切換えて代替映像を入力するようにする、などと言った映像入力操作を行うことができる。
【００４５】
なお、上記の説明では、音響出力動作を被写体が発した音声を例にして説明したが、本発明は、音声以外に、被写体が手を叩いて発した音や、被写体がベルやブザーを操作して発した音などと言った種々な態様の音響を編集や映像入力の制御に用いることができる。
また、上記の説明では、編集処理を映像部分の削除を例にとって説明したが、本発明は、これ以外に、映像部分の解像を低下させる、映像部分中にモザイク掛けをする、映像部分中に注釈を加入するなどと言った種々な態様の編集処理を採用することができる。
【００４６】
【発明の効果】
以上説明したように、本発明によると、映像から被写体の特徴的な行動態様を抽出し、これを起点や終点として映像の制御処理を行うようにしたため、被写体が映像に記録されたくない箇所を削除するなどといった制御処理を被写体人物による会話などの連続性を妨げない自然な人間の動作で実現することができる。
【図面の簡単な説明】
【図１】本発明の一実施例に係る映像処理装置の構成を示す図である。
【図２】映像データの構成を説明する図である。
【図３】本発明の一例に係るジェスチャー抽出を説明する図である。
【図４】本発明の一例に係る部位線分データ（特徴量）を説明する図である。
【図５】本発明の一例に係る身体特徴量抽出部が出力するデータ例を示す図である。
【図６】本発明の一例に係るジェスチャーモデルを説明する図である。
【図７】本発明の一例に係るカット点情報のデータ例を示す図である。
【図８】本発明の一例に係る身体特徴量抽出の処理手順を示す図である。
【図９】本発明の一例に係る身体特徴量抽出処理の内部メモリのデータ例を示す図である。
【図１０】本発明の一例に係る二本の指の開閉を説明する図である。
【図１１】本発明の一例に係るカット点による映像データ編集部分を説明する図である。
【図１２】本発明の一例に係るジェスチャーモデルと音声処理の組合せによるカット点を説明する図である。
【図１３】本発明の一例に係る音声情報の連続性検出を説明する図である。
【図１４】本発明の一例に係る音声情報モデルを説明する図である。
【図１５】本発明の他の一実施例に係る映像処理装置の構成を示す図である。
【符号の説明】
１：映像入力部、２：身体特徴量抽出部、
３：ジェスチャーモデル記述部、４：整合部、
５：カット点抽出部、６：映像編集部、
７：映像蓄積部、９：画像フレーム、
３５：制御点抽出部、３６：映像制御部、
Ｃ：特徴量、Ｊ：ジェスチャーモデル、
Ｌ１〜Ｌ７：部位線分データ、Ｍ：被写体、
[0001]
[Technical field of the invention]
The present invention relates to a technique for applying video control processing, such as input control and editing, to video captured by a video camera or the like, and in particular to a technique for performing such control processing based on a characteristic action identified directly from the behavior of a subject in the video.
[0002]
[Prior art]
For example, in corporate activities, meetings and lectures are recorded with a video camera and the recorded video is used later. For meetings that involve confidential corporate information, however, some scenes in the captured video must not be kept on record. In such scenes, the speaker who is the subject of the video may want to stop the recording deliberately so that the scene containing confidential information is not recorded.
Conventionally, this situation has been handled either by instructing the camera operator to stop shooting, or by having an editor delete the relevant portion of the video after shooting.
[0003]
As a way of handling this situation, and in light of the idea behind the present invention described below, it would be extremely effective in practice if video control processing such as input control and editing could be applied to the video being shot by means of a behavior of the speaker who is the subject: either a behavior based on body motion (for example, a gesture) or a behavior based on sound output (for example, a predetermined utterance such as "cut"). The video control processing would then be driven by natural human actions, without interrupting the continuity of the conversation.
However, no such technology has been realized so far, and the idea itself had not even been conceived.
[0004]
Conventionally, the following technologies are known for controlling devices by gesture.
An electronic camera system is known in which the photographer wears, on the arm and hand, a device that detects myoelectric signals indicating body movement, and the camera is controlled by detecting the photographer's motion as myoelectric signals. In this electronic camera system, camera control commands are associated with predetermined gestures, each consisting of a combination of movements of the hand and fingers beyond the wrist and movements of the upper arm. When the photographer simply moves the arm and hand, the movement is detected by the signal detection unit of the myoelectric signal detection device, and the camera is controlled by the command defined for that gesture (see Patent Document 1).
[0005]
Various techniques that use characteristic parameters extracted from human body and hand gestures are also known.
A method and apparatus for acquiring performance animation gestures and reproducing them on an animated character is known. To make the motion of an animated character natural, the performer's gestures are acquired from video, the motion is parameterized into a reproducible form, and the parameters are tagged, stored, and used for character motion (see Patent Document 2).
[0006]
An apparatus and method for generating a motion descriptor using a cumulative motion histogram is known. To describe motion information for objects in video efficiently, a cumulative motion histogram is generated by accumulating motion histograms, and a motion descriptor that efficiently describes motion information for the subject in the video is generated and used for video retrieval (see Patent Document 3).
[0007]
Techniques are also known that build a model for gesture recognition and accurately estimate the motion and structure of a subject from consecutive frame images in a video.
A gesture moving-image recognition method is known in which each of the frame images constituting a moving image is regarded as one point in a vector space, the trajectory of that point is used as a feature parameter for each type of gesture, and gesture recognition is performed by comparing the extracted feature parameter with the feature parameter of a reference pattern (see Patent Document 4).
[0008]
A gesture video reconstruction method and apparatus, and a recording medium on which the method is recorded, are known. For gestures performed by a subject in a video, semantic labels are constructed and assigned to semantic units of gesture, such as asking a question (leaning forward) or agreeing (nodding); the meaning of a gesture is extracted from the semantic labels; and a script describing the start time and end time of the gesture is generated (see Patent Document 5).
[0009]
A moving-image processing apparatus is known that takes as input a plurality of image frames constituting a moving image, estimates the affine deformation between image frames from the changes in position of at least three feature points between those frames, and detects the motion and structure of the target object (see Patent Document 6).
[0010]
[Patent Document 1]
JP 2000-138858 A
[Patent Document 2]
JP 2001-229398 A
[Patent Document 3]
JP 2000-222586 A
[Patent Document 4]
JP H09-245178 A
[Patent Document 5]
JP H11-238142 A
[Patent Document 6]
JP H06-89342 A
[0011]
[Problems to be solved by the invention]
However, the conventional techniques have had the following problems.
A problem common to the prior art described above is that a user cannot perform video control operations during shooting unless the user wears a special device or instrument for gesture recognition, or uses a remote control for operating the camera.
[0012]
More specifically, with the technique described in Patent Document 1 the user must wear a recognition device on the body to operate the camera, and with the technique described in Patent Document 2 the user must wear sensors on the body so that gestures can be identified. These techniques hinder the user from acting naturally and require special equipment to be prepared.
The techniques described in Patent Documents 3 to 6 merely disclose ways of recognizing a subject's gestures; they are no more than elemental technologies that could be applied to video input control or editing control.
[0013]
The present invention has been made in view of the conventional circumstances described above, and its object is to realize video control processing based on the natural motion of a subject in the video, without the need to fit the subject with special equipment.
Further objects of the present invention will become apparent from the following description.
[0014]
[Means for Solving the Problems]
In the video processing device according to the present invention, video control processing means identifies, from video input by video input means such as a video camera, a characteristic behavior of a subject expressed through body motion or sound output, and, based on a predetermined characteristic behavior, outputs a control signal for applying the corresponding control processing to the video.
In response to this control signal, the video processing device according to the present invention performs video control processing in which either video editing means applies the corresponding editing process, such as deleting a scene, to the video that is input from the video input means and stored in a memory or the like, or video input control means applies the corresponding control, such as fading out the video camera, to the video input from the video input means.
[0015]
Video control processing such as video editing control and video input control can therefore be performed based on behavior arising from the natural actions of the subject in the video, without fitting the subject with special control equipment. In other words, such video control processing can be based on features extracted directly from the image of the subject in the video, or on features extracted directly from the sound of the subject in the video.
[0016]
More specifically, the video processing device according to the present invention has, as the video control processing means: feature extraction means that identifies, from the input video, a characteristic behavior of the subject expressed through body motion or sound output; model storage means that holds behavior models for body motion or sound output; and matching means that determines whether the behavior identified by the feature extraction means matches a behavior model held in the model storage means. The control signal is output based on the result of the determination by the matching means.
The present invention can thus be embodied as video processing devices of various forms, and it can also be embodied as, for example, a video control method carried out by operating such a video processing device, or a program that implements the video processing device on a computer.
[0017]
[Embodiments of the invention]
The present invention will be described concretely based on embodiments.
FIG. 1 shows the configuration of a video processing device according to a first embodiment of the present invention.
In this example, an editing process such as scene deletion is applied to the input video based on a gesture of the subject in the video (a behavior based on body motion). The present invention, however, admits various embodiments: control processing such as switching the input video may be performed based on a gesture of the subject in the video; the above editing process or input video control may be performed based on what the subject in the video says (a behavior based on sound output); or these may be combined.
[0018]
As shown in FIG. 1, the video processing device of this example includes a video input unit 1, a body feature extraction unit 2, a gesture model description unit 3, a matching unit 4, a cut point extraction unit 5, a video editing unit 6, and a video storage unit 7.
As with the various embodiments mentioned above, the video control processing means that provides the main function of the present invention (in this example, the body feature extraction unit 2, the matching unit 4, and the cut point extraction unit 5) may be implemented by having a computer execute a program according to the present invention. The computer then identifies a characteristic behavior of the subject from the input video and, based on a predetermined characteristic behavior, outputs a control signal for applying the corresponding control processing to the video.
[0019]
The video input unit 1 consists of a video camera, a computer equipped with a video capture device, an interface for inputting a video signal captured elsewhere, or the like, and it captures, or takes into the device, video (moving images) of a conference that includes the speaker who is the subject.
The captured video is moving-image data made up of many consecutive image frames 9, as shown in FIG. 2. At the time of shooting, a sequential frame number (001, 002, ...) and video time information (t1, t2, ...) are attached to each image frame as identifiers.
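The frame identifiers described above can be pictured with a small data structure. The following is only a sketch: the names `ImageFrame` and `tag_frames`, the 30 fps frame rate, and the use of seconds for the time information are assumptions made for illustration and do not come from the specification.

```python
from dataclasses import dataclass
from typing import Any, List, Sequence


@dataclass(frozen=True)
class ImageFrame:
    frame_number: int  # sequential identifier (001, 002, ... in the text)
    time_s: float      # video time information (t1, t2, ...), assumed to be seconds
    image: Any         # pixel payload, left opaque in this sketch


def tag_frames(images: Sequence[Any], fps: float = 30.0) -> List[ImageFrame]:
    """Attach a frame number and a timestamp to each captured image."""
    return [ImageFrame(i + 1, i / fps, img) for i, img in enumerate(images)]


frames = tag_frames(["img-a", "img-b", "img-c"])
labels = [f"{f.frame_number:03d}" for f in frames]
print(labels)  # ['001', '002', '003']
```

Either identifier can later be used to address a cut point, which is why both are kept on every frame.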
[0020]
Video is input from the video input unit 1 to the body feature extraction unit 2 one image frame at a time. As shown in FIG. 3, the body feature extraction unit 2 takes an input image frame 9 containing the subject M (FIG. 3(a)), identifies the body feature (a line segment model C representing a characteristic behavior based on body motion) (FIG. 3(b)), extracts the body feature C (FIG. 3(c)), and outputs the extracted body feature C to the matching unit 4.
[0021]
As described later, the body feature C is processed as a line segment model made up of line segments L1 to L7 that characterize the gesture of the subject M, as shown in FIG. 4. More specifically, this example uses as the control timing for the editing process a pose in which the subject M spreads both arms and opens two fingers of each hand in a V shape, as shown in FIG. 3(a) (because the gesture imitates a crab, it is also called the crab model). The line segment model C therefore consists, as shown in FIG. 4, of part line segment data corresponding to the two fingers of each hand of the subject M (L1 and L2, and L6 and L7), part line segment data corresponding to the two arms (L3 and L5), and part line segment data corresponding to the head and torso (L4).
[0022]
Together with the extracted body feature C, the body feature extraction unit 2 outputs to the matching unit 4 the frame number 10 of the image frame from which it was extracted and a body feature ID 11 identifying the extracted body feature C, as shown in FIG. 5.
The gesture model description unit 3 has a memory that stores predetermined gesture model data and a function that describes a gesture model J, as shown in FIG. 6, based on that data. When the body feature extraction unit 2 outputs an extracted body feature C, the gesture model description unit 3 describes the gesture model J to be evaluated against that body feature C and outputs it to the matching unit 4.
[0023]
The matching unit 4 compares the body feature C input from the body feature extraction unit 2 with the gesture model J input from the gesture model description unit 3, determines whether the body feature C matches the gesture model J, and, as shown in FIG. 7, outputs the determination result 12 (match or mismatch) for each image frame to the cut point extraction unit 5 in association with the frame number 10.
[0024]
The cut point extraction unit 5 determines the range of the video to be deleted (cut) and outputs a control signal corresponding to that decision to the video editing unit 6. Specifically, when the determination result 12 input from the matching unit 4 indicates that the body feature C matches the gesture model J, the cut point extraction unit 5 obtains the frame numbers 10 corresponding to the cut start time and the cut end time in the video data, and outputs to the video editing unit 6 an instruction to delete the portion between these frame numbers from the video data.
[0025]
This instruction may instead be given by a control signal that directly specifies image frame times in the video rather than frame numbers. The instruction may also specify only the cut start time, with the video editing unit 6 ending the cut a predetermined time after the start time, or specify only the cut end time, with the video editing unit 6 starting the cut at a time a predetermined period before the end time.
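The three ways of specifying a cut described above (start and end, start only, end only) can be sketched as follows. This is a minimal illustration, not the specification's implementation: the function names, the `default_len` value standing in for the "predetermined time", and the choice to treat the range as inclusive of both end frames are all assumptions.

```python
from typing import List, Optional, Tuple


def cut_range(start: Optional[int] = None,
              end: Optional[int] = None,
              default_len: int = 150) -> Tuple[int, int]:
    """Return the inclusive (first, last) frame numbers to delete.

    default_len is a hypothetical "predetermined time" expressed in frames.
    """
    if start is not None and end is not None:
        return (start, end)
    if start is not None:  # cut runs for a fixed length after the start
        return (start, start + default_len - 1)
    if end is not None:    # cut begins a fixed length before the end
        return (max(1, end - default_len + 1), end)
    raise ValueError("at least one of start or end is required")


def apply_cut(frame_numbers: List[int], rng: Tuple[int, int]) -> List[int]:
    """Drop every frame whose number lies inside the cut range."""
    first, last = rng
    return [n for n in frame_numbers if not (first <= n <= last)]


kept = apply_cut(list(range(1, 11)), cut_range(start=3, end=6))
print(kept)  # [1, 2, 7, 8, 9, 10]
```

Working in frame numbers keeps the sketch simple; the same logic applies if frame times are used instead, as the text allows.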
[0026]
The video editing unit 6 edits the video data input from the video input unit 1 based on the control information input from the cut point extraction unit 5, outputs the edited video data to the video storage unit 7, and has it stored in the database of the video storage unit 7, which consists of a memory.
Specifically, in response to the control information from the cut point extraction unit 5, the video editing unit 6 deletes the corresponding portion of the video data and has the result stored in the video storage unit 7.
[0027]
Next, the body feature extraction processing of this example is described in detail.
FIG. 8 shows the body feature extraction procedure performed by the body feature extraction unit 2 in this example. First, when an image frame is input from the video input unit 1, the image frame is binarized and, as shown in FIG. 9, the binarized image data 16 is held in an internal memory 15 provided in the video processing device as a work area for the body feature extraction unit 2 (step S1).
[0028]
The body feature extraction unit 2 then extracts thinned image data 17 of the subject M from the binarized image data 16, divides the thinned image data 17 into line segments to extract part line segment data 18 corresponding to the two fingers, the two arms, and the head and torso, holds the part line segment data 18 in the internal memory 15, and ends the body feature extraction processing (step S2). This processing is repeated each time an image frame is input from the video input unit 1, and the part line segment data 18 held in the internal memory 15 is output sequentially to the matching unit 4 as the feature C, as described above.
[0029]
That is, the body feature extraction processing extracts and binarizes the image of the subject M as shown in FIG. 3(a); extracts a skeleton line as shown in FIG. 3(b), for example by connecting the midpoints between the edges that form the outline of the subject M; and then decomposes the skeleton into part line segment data corresponding to the two fingers, the two arms, and the head and torso, as shown in FIG. 3(c) and in more detail in FIG. 4.
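The first two stages (binarization, then a skeleton taken from the midpoints between outline edges) can be illustrated on a tiny grayscale grid. This is a deliberately crude sketch under stated assumptions: the threshold of 128, the convention that the subject is brighter than the background, and the row-by-row midpoint rule are all illustrative choices, and a real system would use a proper thinning algorithm followed by line segment division.

```python
from typing import List, Tuple


def binarize(gray: List[List[int]], threshold: int = 128) -> List[List[int]]:
    """Step S1: map each pixel to 1 (subject) or 0 (background)."""
    # Assumption: the subject is brighter than the background.
    return [[1 if v >= threshold else 0 for v in row] for row in gray]


def row_midpoints(binary: List[List[int]]) -> List[Tuple[int, float]]:
    """For each row, take the midpoint between the leftmost and rightmost
    subject pixels, a simple stand-in for skeleton (thinned line) extraction."""
    points = []
    for y, row in enumerate(binary):
        cols = [x for x, v in enumerate(row) if v]
        if cols:  # skip rows with no subject pixels
            points.append((y, (cols[0] + cols[-1]) / 2))
    return points


gray = [
    [0, 200, 200, 200, 0],
    [0,   0, 200,   0, 0],
    [0,   0,   0,   0, 0],
]
skeleton = row_midpoints(binarize(gray))
print(skeleton)  # [(0, 2.0), (1, 2.0)]
```

The row-wise rule only handles a single blob per row, so it would not separate two raised arms; it is meant to show the idea of "midpoints between outline edges", not to reproduce FIG. 3(b).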
[0030]
Next, the gesture model J will be described.
FIG. 6 shows the concept of the gesture model J. The gesture model J of this example consists of descriptions that specify the two fingers of each hand, the two arms, and the head and torso. The right arm is R_ARM, the first finger of the right hand is R_FINGER1, and the second finger of the right hand is R_FINGER2. The left arm is L_ARM, the first finger of the left hand is L_FINGER1, and the second finger of the left hand is L_FINGER2. The head, together with the torso, is HEAD.
[0031]
That is, assuming that the entire body is BODY, the gesture model J is a description based on the following combinations.
BODY(t) := (HEAD(t), R_ARM(t), L_ARM(t))
R_ARM(t) := (R_FINGER1(t), R_FINGER2(t))
L_ARM(t) := (L_FINGER1(t), L_FINGER2(t))
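The composition above can be written down as a small lookup table. The sketch below is only one possible encoding: the dictionary `GESTURE_MODEL` and the helper `elements` are hypothetical names, and the time parameter t is left out so that only the part hierarchy is shown.

```python
from typing import Dict, List, Tuple

# Each entry lists the elements that make up the named part.
GESTURE_MODEL: Dict[str, Tuple[str, ...]] = {
    "BODY": ("HEAD", "R_ARM", "L_ARM"),
    "R_ARM": ("R_FINGER1", "R_FINGER2"),
    "L_ARM": ("L_FINGER1", "L_FINGER2"),
}


def elements(name: str, model: Dict[str, Tuple[str, ...]] = GESTURE_MODEL) -> List[str]:
    """List `name` and every element reachable from it, depth first."""
    out = [name]
    for child in model.get(name, ()):
        out.extend(elements(child, model))
    return out


parts = elements("BODY")
print(parts)
# ['BODY', 'HEAD', 'R_ARM', 'R_FINGER1', 'R_FINGER2', 'L_ARM', 'L_FINGER1', 'L_FINGER2']
```

Leaving out BODY itself, the model has seven elements, one for each of the part line segments L1 to L7.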
[0032]
Here, the gesture model J of this example is described with a time parameter t so that it also represents how the model changes over time. Giving the form of the gesture model J a variation over time means that the editing process is performed if any one image frame in the time-varying motion of the subject M matches the gesture model J. This widens the range of matching detection and makes it easier for the speaker, as subject M, to give the intended instruction.
[0033]
FIG. 10 shows the concept of the opening and closing of the two fingers. R_FINGER1 and R_FINGER2 (and likewise L_FINGER1 and L_FINGER2) move apart and together over time about their connection point. Specifically, at a certain time t1 shown in FIG. 10(a), the two fingers R_FINGER1(t1) and R_FINGER2(t1) are open to a certain angle (for example, about 30 degrees) about the junction with R_ARM, and at a later time t2 shown in FIG. 10(b), the two fingers R_FINGER1(t2) and R_FINGER2(t2) are closed about the junction with R_ARM.
Accordingly, the matching unit 4 determines that there is a match if the part line segment data correspond to the respective elements of the gesture model J, such as R_FINGER1.
[0034]
Next, the process by which the matching unit 4 determines whether the part line segment data match the gesture model J is described.
The matching unit 4 determines that there is a match if each item of part line segment data (L1 to L7), as shown in FIG. 4, corresponds to an element of the gesture model J such as R_FINGER1, as shown in FIG. 6.
[0035]
Specifically, as shown in FIG. 4, L1, L2, and L3 are connected at one point; L3, L4, and L5 are connected at one point; and L5, L6, and L7 are connected at one point.
L1, L2 and L3 correspond to R_FINGER2, R_FINGER1 and R_ARM, L3, L4 and L5 correspond to R_ARM, HEAD and L_ARM, and L5, L6 and L7 correspond to L_ARM, L_FINGER1 and L_FINGER2.
When the connection relations and positions of these part line segment data L1 to L7 and the elements of the gesture model J are matched, it is determined that the body feature C and the gesture model J match.
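The connection relations described above can be tested with a short routine. The sketch below checks only the connection relations (three segments meeting at one point for each junction), not the positional agreement with the model. The segment representation as pairs of (x, y) endpoints, the pixel tolerance, and the function names are assumptions made for illustration.

```python
import math

# Correspondence between part line segments and gesture model elements,
# as stated in the description.
CORRESPONDENCE = {
    "L1": "R_FINGER2", "L2": "R_FINGER1", "L3": "R_ARM",
    "L4": "HEAD", "L5": "L_ARM", "L6": "L_FINGER1", "L7": "L_FINGER2",
}
# Segment triples that must be connected at one point.
JUNCTIONS = [("L1", "L2", "L3"), ("L3", "L4", "L5"), ("L5", "L6", "L7")]


def _near(p, q, tol):
    return math.hypot(p[0] - q[0], p[1] - q[1]) <= tol


def share_point(segments, names, tol=1.0):
    """True if every named segment has an endpoint near one common point."""
    for candidate in segments[names[0]]:
        if all(any(_near(candidate, p, tol) for p in segments[n])
               for n in names[1:]):
            return True
    return False


def connection_matches(segments, tol=1.0):
    """Check that L1 to L7 are present and joined as the model requires."""
    if set(segments) != set(CORRESPONDENCE):
        return False
    return all(share_point(segments, j, tol) for j in JUNCTIONS)


# Hypothetical stick figure: two hands, two arms and a head.
FIGURE = {
    "L1": ((1, 6), (2, 5)), "L2": ((1, 4), (2, 5)), "L3": ((2, 5), (5, 5)),
    "L4": ((5, 5), (5, 8)), "L5": ((5, 5), (8, 5)),
    "L6": ((8, 5), (9, 6)), "L7": ((8, 5), (9, 4)),
}
```

A full matcher would additionally compare the positions and angles of the labeled segments with the model elements for the frame's time t.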
[0036]
The determination of the match between the body feature C and the gesture model J is sequentially performed on the continuous image frames 9, and the matched image frames 9 are set as candidates for cut points (deletion edit points).
Here, in this example, in which the gesture model J is given a time width, a plurality of consecutive matched image frames 9 will typically become candidates. In that case it is sufficient to designate one image frame that defines the cut point, such as the first, the last, or the middle frame of the run, and to perform the video editing process using that image frame number.
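Choosing one cut point from a run of consecutive matched frames, as described above, might look like the following. The function names and the `policy` strings are hypothetical; for a run of even length, the "middle" policy here picks the later of the two central frames, which is an arbitrary choice.

```python
def consecutive_runs(frame_numbers):
    """Group matched frame numbers into runs of consecutive frames."""
    runs = []
    for n in sorted(frame_numbers):
        if runs and n == runs[-1][-1] + 1:
            runs[-1].append(n)
        else:
            runs.append([n])
    return runs


def cut_point(run, policy="first"):
    """Pick the frame number that defines the cut point for one run."""
    if policy == "first":
        return run[0]
    if policy == "last":
        return run[-1]
    if policy == "middle":
        return run[len(run) // 2]
    raise ValueError(f"unknown policy: {policy}")
```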
[0037]
Note that, in the present invention, the range of the editing process may be set in various modes based on the position of the image frame that matches the gesture model J in the video: the video portion of a predetermined duration starting from that position may be deleted; the video portion from a position a predetermined time earlier up to that position may be deleted; the video portion in which matching image frames continue may be deleted; or, as shown in FIG. 11, deletion may start at the position of a matching image frame A and end when a matching image frame B appears.
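The deletion modes listed above can be expressed as functions that turn matched frame positions into frame ranges. This is a sketch under assumptions: positions are frame numbers, ranges are half-open `(start, end)` pairs so that the end frame itself is kept, and all function names are hypothetical. The mode that deletes a run of consecutive matching frames would correspond to the range `(run[0], run[-1] + 1)`.

```python
def range_after(pos, length):
    """Delete `length` frames starting at the matched position."""
    return (pos, pos + length)


def range_before(pos, length):
    """Delete the `length` frames leading up to the matched position."""
    return (max(0, pos - length), pos)


def ranges_between(matches):
    """Pair matches as (A, B): deletion starts at A and ends at B.

    An unpaired final match is ignored.
    """
    ordered = sorted(matches)
    return [(ordered[i], ordered[i + 1])
            for i in range(0, len(ordered) - 1, 2)]


def kept_frames(total, ranges):
    """Frame numbers that remain after removing every half-open range."""
    removed = set()
    for start, end in ranges:
        removed.update(range(start, min(end, total)))
    return [n for n in range(total) if n not in removed]
```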
[0038]
In the above example, the editing process is performed based on a characteristic behavior mode of body motion, namely a gesture of the subject M. However, in the present invention, the editing process may also be performed based on a characteristic behavior mode of a sound output operation, such as a voice uttered by the subject in the video.
For example, as shown in FIG. 12, when the video includes video data (video track) 20 and audio data (audio tracks) uttered by the subject, the start and end points of each of the audio tracks 21a and 21b, or the switching points between the audio tracks 21a and 21b, may be used as the start and end positions of the editing process.
[0039]
Specifically, a separate audio track can be assigned to each utterance, and the above start points and the like can be used, together with gesture model matching, as candidates for the control timing of the editing process.
For example, when the start time of an audio track is used as the cut start point, the video may be deleted up to a cut end point specified by the match between the body feature C and the gesture model J. Conversely, when the end time of an audio track is used as the cut end point, the video may be deleted from a cut start point specified by that match.
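The two combinations described above, an audio track boundary on one side and a gesture match on the other, can be sketched as follows. Times are assumed to be seconds from the start of the video, an audio track is assumed to be a `(start, end)` pair, and the function names are hypothetical. Returning `None` when the gesture falls on the wrong side of the boundary is a design choice made here, not something the description specifies.

```python
def cut_from_track_start(track, gesture_time):
    """Cut starts at the audio track start and ends at the gesture match."""
    start, _end = track
    if gesture_time <= start:
        return None  # the gesture precedes the utterance; nothing to cut
    return (start, gesture_time)


def cut_to_track_end(track, gesture_time):
    """Cut starts at the gesture match and ends at the audio track end."""
    _start, end = track
    if gesture_time >= end:
        return None  # the gesture follows the utterance; nothing to cut
    return (gesture_time, end)
```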
[0040]
As a method of detecting the continuity of the audio track, a method based on a silent section as shown in FIG. 13 can be adopted.
To detect the continuity of the audio data, the volume level is measured, and any period during which the volume level stays below a predetermined threshold is detected as a silent state time Δt. For example, if in certain video data a silent state time Δt is detected immediately after audio track Audio a and audio track Audio b is detected after it, then Audio a and Audio b may be handled as one continuous audio track when Δt is within a predetermined time (for example, 5 seconds), and as different audio tracks when Δt exceeds that time.
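The silent-section rule above can be sketched in two steps: find the voiced segments by thresholding the volume level, then merge neighbors whose silent gap Δt is within the limit. The sketch assumes one volume sample per second, so that sample indices are seconds, and the function names are hypothetical.

```python
def voiced_segments(levels, threshold):
    """Return (start, end) index ranges, end exclusive, at or above threshold."""
    segments, start = [], None
    for i, level in enumerate(levels):
        if level >= threshold and start is None:
            start = i
        elif level < threshold and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(levels)))
    return segments


def merge_tracks(segments, max_gap):
    """Treat segments separated by a silence of at most max_gap as one track."""
    merged = []
    for start, end in segments:
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged
```

With a limit of 5 seconds, a 3 second pause keeps two utterances in one audio track, while a 7 second pause splits them into different tracks.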
[0041]
Further, in the present invention, sound data produced by the subject, such as speech included in the video data, may be used more actively for the editing process. For example, in the apparatus configuration described above, the body feature extraction unit 2 extracts the audio data included in the video together with its time information in the video, the gesture model description unit 3 describes an audio waveform model of a predetermined phrase (for example, "cut"), and the matching unit 4 determines the consistency between the audio data extracted from the video and the audio waveform model. Based on the result, the cut point extraction unit 5 may determine the content and range of the editing process and output a control signal to the video editing unit 6, which performs the corresponding editing process on the video input from the video input unit 1.
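One simple way to judge consistency between extracted audio data and an audio waveform model is a sliding normalized cross-correlation, sketched below. This is an assumption made for illustration: the description does not state which matching measure is used, and practical phrase detection would normally compare spectral features rather than raw samples. The function names and the 0.9 threshold are hypothetical.

```python
import math


def normalized_correlation(a, b):
    """Pearson correlation of two equal-length sample windows."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    denom = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    if denom == 0:
        return 0.0  # a flat window carries no shape to compare
    return sum(x * y for x, y in zip(da, db)) / denom


def find_phrase(samples, template, threshold=0.9):
    """Return sample offsets where the window matches the template."""
    n = len(template)
    return [i for i in range(len(samples) - n + 1)
            if normalized_correlation(samples[i:i + n], template) >= threshold]
```

Because the correlation is normalized, a louder or quieter utterance with the same waveform shape still matches; each returned offset can be converted to a time in the video and passed on as a cut point candidate.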
[0042]
Further, in the above description, the editing process is applied to video that has been input and stored, based on the body motion or sound output operation of the subject in the video. However, the body motion or sound output operation may instead be used for video input control, for example switching the video camera that constitutes the video input unit 1 to another video camera.
[0043]
In this case, for example, in a device configuration such as that shown in FIG. 15, the control point extraction unit 35 takes the image frame position specified by the matching unit 4 in the same manner as above as the start or end position of the control, and based on this the video control unit 36 outputs to the video input unit 1 a control signal that changes the mode of video input.
In FIG. 15, components identical to those shown in FIG. 1 are denoted by the same reference numerals. FIG. 15 shows an example of matching based on the gesture model, but matching based on sound output by the subject can be applied in the same manner as described above.
[0044]
By controlling the video input in this way, the speaker who is the subject can, through his or her own gestures and remarks and without instructing a camera operator, switch between multiple video cameras to shoot from different angles, fade out the video camera or apply mosaic processing to the input video so that confidential materials are not clearly visible, or switch the input path of the video input unit 1 to feed in a substitute video.
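The video input control described above can be pictured as a mapping from matched models to control signals that change the state of the video input. The sketch below is purely illustrative: the class `VideoInputState`, the signal names, the model names in `SIGNAL_FOR_MODEL`, and the toggle behavior are all hypothetical and do not describe an interface of the video input unit 1 or the video control unit 36.

```python
class VideoInputState:
    """Stand-in for the state of a video input (hypothetical interface)."""

    def __init__(self, camera_count):
        if camera_count < 1:
            raise ValueError("at least one camera is required")
        self.camera_count = camera_count
        self.active_camera = 0
        self.mosaic = False
        self.substitute = False

    def apply(self, signal):
        """Change the input mode according to one control signal."""
        if signal == "next_camera":
            self.active_camera = (self.active_camera + 1) % self.camera_count
        elif signal == "toggle_mosaic":
            self.mosaic = not self.mosaic
        elif signal == "toggle_substitute":
            self.substitute = not self.substitute
        else:
            raise ValueError(f"unknown control signal: {signal}")


# Hypothetical assignment of matched models to control signals.
SIGNAL_FOR_MODEL = {
    "fingers_closed": "next_camera",
    "phrase_cut": "toggle_mosaic",
}


def control(state, matched_models):
    """Apply the control signal for each matched model, in order."""
    for name in matched_models:
        signal = SIGNAL_FOR_MODEL.get(name)
        if signal is not None:
            state.apply(signal)
    return state
```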
[0045]
Note that, in the above description, the sound output operation has been described using speech uttered by the subject as an example. However, various other kinds of sound, such as sounds generated by the user, can also be used for editing and for controlling video input.
In the above description, the editing process has been described taking deletion of a video portion as an example. However, the present invention can employ various other kinds of editing processing, such as reducing the resolution of the video portion, applying a mosaic to it, or adding an annotation.
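As an illustration of an editing process other than deletion, the following sketch applies a mosaic by replacing each block of pixels with its average value, which also lowers the effective resolution. The frame is assumed to be a list of rows of integer gray levels, and the function name and block-averaging approach are assumptions made here rather than the method of the description.

```python
def mosaic(frame, block):
    """Return a copy of the frame with each block x block tile averaged."""
    if block < 1:
        raise ValueError("block must be at least 1")
    height, width = len(frame), len(frame[0])
    out = [row[:] for row in frame]
    for by in range(0, height, block):
        for bx in range(0, width, block):
            cells = [(y, x)
                     for y in range(by, min(by + block, height))
                     for x in range(bx, min(bx + block, width))]
            average = sum(frame[y][x] for y, x in cells) // len(cells)
            for y, x in cells:
                out[y][x] = average
    return out
```

Applying this to every frame inside a cut range, instead of deleting the range, would keep the timing of the video intact while hiding its detail.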
[0046]
[Effect of the Invention]
As described above, according to the present invention, a characteristic behavior mode of a subject is extracted from a video, and control processing of the video is performed using the extracted behavior as a start point or an end point. Control processing such as deleting part of the video can therefore be realized through natural human actions that do not interrupt the continuity of, for example, the subject's conversation.
[Brief description of the drawings]
FIG. 1 is a diagram showing a configuration of a video processing device according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating a configuration of video data.
FIG. 3 is a diagram illustrating gesture extraction according to an example of the present invention.
FIG. 4 is a diagram illustrating part line segment data (feature amount) according to an example of the present invention.
FIG. 5 is a diagram illustrating an example of data output by a body characteristic amount extraction unit according to an example of the present invention.
FIG. 6 is a diagram illustrating a gesture model according to an example of the present invention.
FIG. 7 is a diagram showing an example of data of cut point information according to an example of the present invention.
FIG. 8 is a diagram showing a processing procedure of body feature extraction according to an example of the present invention.
FIG. 9 is a diagram illustrating an example of data in an internal memory in a body characteristic amount extraction process according to an example of the present invention.
FIG. 10 is a diagram illustrating opening and closing of two fingers according to an example of the present invention.
FIG. 11 is a diagram illustrating a video data editing portion based on a cut point according to an example of the present invention.
FIG. 12 is a diagram illustrating cut points due to a combination of a gesture model and voice processing according to an example of the present invention.
FIG. 13 is a diagram illustrating continuity detection of audio information according to an example of the present invention.
FIG. 14 is a diagram illustrating a voice information model according to an example of the present invention.
FIG. 15 is a diagram showing a configuration of a video processing device according to another embodiment of the present invention.
[Explanation of symbols]
1: video input unit, 2: body feature extraction unit,
3: gesture model description unit, 4: matching unit,
5: cut point extraction unit, 6: video editing unit,
7: video storage unit, 9: image frame,
35: control point extraction unit, 36: video control unit,
C: feature quantity, J: gesture model,
L1 to L7: part line segment data, M: subject

Claims

A video processing device that applies a predetermined control process to a video, comprising:
video input means for inputting video; and
video control processing means for specifying a characteristic behavior mode of a subject from the input video and outputting a control signal for applying a corresponding control process to the video based on the specified predetermined characteristic behavior mode.

The video processing device according to claim 1,
wherein the video control processing means includes: feature amount extraction means for specifying, from the input video, a characteristic behavior mode of the subject's body motion; model storage means holding a behavior mode model of the body motion; and matching means for determining consistency between the behavior mode specified by the feature amount extraction means and the behavior mode model held in the model storage means, and outputs the control signal based on the determination result of the matching means.

The video processing device according to claim 2,
wherein the behavior mode model held in the model storage means has a description format with a time width representing the time change of the body motion.

The video processing device according to claim 1,
wherein the video input from the video input means includes audio information, and
the video control processing means includes: feature amount extraction means for specifying, from the input video, a characteristic behavior mode of a sound output operation by the subject; model storage means holding a behavior mode model of the sound output operation; and matching means for determining consistency between the behavior mode specified by the feature amount extraction means and the behavior mode model held in the model storage means, and outputs the control signal based on the determination result of the matching means.

The video processing device according to any one of claims 1 to 4,
further comprising video editing means for applying, to the video input from the video input means and stored, an editing process corresponding to the control signal output from the video control processing means.

The video processing device according to any one of claims 1 to 4,
further comprising video input control means for applying, to the video input from the video input means, control corresponding to the control signal output from the video control processing means.

A video processing method for applying a predetermined control process to a video, comprising:
identifying a characteristic behavior mode of a subject from an input video; and applying a corresponding control process to the video based on the identified predetermined characteristic behavior mode.

A video processing method for applying a predetermined control process to a video, comprising:
identifying, from an input video, a characteristic behavior mode of a subject's body motion;
determining the consistency between the identified behavior mode and a behavior mode model prepared in advance; and
applying a corresponding control process to the video based on the determination result.

A video processing method for applying a predetermined control process to a video, comprising:
identifying, from an input video that includes audio information, a characteristic behavior mode of a sound output operation by a subject;
determining the consistency between the identified behavior mode and a behavior mode model prepared in advance; and
applying a corresponding control process to the video based on the determination result.

A video processing method for applying a predetermined control process to a video, comprising:
identifying a characteristic behavior mode of a subject from an input video; and
applying an editing process corresponding to the identified predetermined characteristic behavior mode to the input video and storing the edited video.

A video processing method for applying a predetermined control process to a video, comprising:
identifying a characteristic behavior mode of a subject from an input video; and
applying control corresponding to the identified predetermined characteristic behavior mode to the video input.

A program for causing a computer to execute video control that applies a predetermined control process to an input video, the program causing the computer to realize:
a function of identifying a characteristic behavior mode of a subject from the input video;
a function of determining consistency between the identified behavior mode and a behavior mode model prepared in advance; and
a function of outputting a control signal for applying a corresponding control process to the video based on the determination result.